QuickPeep Concepts
==================
Principles
----------
1. Focus on good-quality, interesting, personal content rather than completeness
for every search query.
2. Support running a search engine on modest hardware.
Critically, disk space is likely to be constrained in real-world deployments.
Components and Subcomponents
----------------------------
### On-disk Structures
Schedule:
- List of URLs to rake
- Backoffs for failing hosts
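The schedule above could be sketched as a queue plus a per-host backoff map. This is a minimal illustration assuming simple exponential backoff; the names and the doubling policy are assumptions, not the real on-disk format.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical in-memory shape of the schedule (illustrative only).
struct Schedule {
    /// URLs queued for raking.
    queue: Vec<String>,
    /// Current backoff per failing host.
    backoffs: HashMap<String, Duration>,
}

impl Schedule {
    fn new() -> Self {
        Schedule { queue: Vec::new(), backoffs: HashMap::new() }
    }

    /// Record a failure: double the host's backoff, starting at 1 minute.
    fn record_failure(&mut self, host: &str) -> Duration {
        let next = self
            .backoffs
            .get(host)
            .map(|d| *d * 2)
            .unwrap_or(Duration::from_secs(60));
        self.backoffs.insert(host.to_string(), next);
        next
    }

    /// A successful rake clears the backoff for that host.
    fn record_success(&mut self, host: &str) {
        self.backoffs.remove(host);
    }
}
```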
RakePack:
- Contains summarised results of scraping many pages
- In a streamable, dense, memory-mappable format.
- Perhaps use `rkyv` to store the records.
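To make "streamable and dense" concrete, here is a toy length-prefixed record encoding. This is purely illustrative; the actual RakePack layout (likely `rkyv`-based) is still to be decided.

```rust
/// Append one record to a byte stream as a little-endian u32 length
/// prefix followed by the record bytes (a stand-in for the real format).
fn write_record(out: &mut Vec<u8>, record: &[u8]) {
    out.extend_from_slice(&(record.len() as u32).to_le_bytes());
    out.extend_from_slice(record);
}

/// Walk a byte stream and recover the records in order.
fn read_records(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    while buf.len() >= 4 {
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&buf[..4]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        records.push(buf[4..4 + len].to_vec());
        buf = &buf[4 + len..];
    }
    records
}
```

Because records sit contiguously with fixed-size prefixes, a reader can consume the pack sequentially or via `mmap` without deserialising the whole file first, which is the property `rkyv` would provide more fully.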
Index:
- Searchable index of all documents
- Might be distributable as deltas; the exact mechanism is to be decided.
- Might be sharded by different parameters (e.g. tags); specifics to be decided.
- Might be sharded by date of raking; specifics to be decided.
It is not yet clear how best to manage an ever-growing dataset.
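One of the sharding options above (by tag) could be as simple as hashing the tag to pick a shard. A minimal sketch, assuming hash-based routing; the function name and scheme are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a document to one of `num_shards` shards by hashing a tag.
/// Deterministic, so the same tag always lands in the same shard.
fn shard_for_tag(tag: &str, num_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    tag.hash(&mut h);
    h.finish() % num_shards
}
```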
### Programs
#### Importer
Imports URLs from seed files. Needed to bootstrap the entire engine.
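A seed file might plausibly be plain text with one URL per line. The format below (blank lines and `#` comments ignored) is an assumption for illustration, not the importer's actual format.

```rust
/// Parse a hypothetical plain-text seed file: one URL per line,
/// skipping blank lines and `#` comments.
fn parse_seed_file(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(String::from)
        .collect()
}
```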
#### Raker
Rakes a page, feed or sitemap.
Builds robots.txt caches as necessary.
Generates a summarised version of the page.
Also tries to extract readable content, for higher ranking in the index.
Also analyses pages for pop-ups and other issues.
(Undecided: should the analysis for e.g. Cloudflare interstitials happen at this stage?)
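The robots.txt caching mentioned above could be a per-host map with a time-to-live. This is a sketch under assumed names and policy, not the raker's actual cache.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-host robots.txt cache with expiry.
struct RobotsCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl RobotsCache {
    fn new(ttl: Duration) -> Self {
        RobotsCache { ttl, entries: HashMap::new() }
    }

    /// Return the cached robots.txt body if it is still fresh.
    fn get(&self, host: &str) -> Option<&str> {
        self.entries.get(host).and_then(|(fetched, body)| {
            if fetched.elapsed() < self.ttl {
                Some(body.as_str())
            } else {
                None
            }
        })
    }

    /// Store a freshly fetched robots.txt body for a host.
    fn put(&mut self, host: &str, body: String) {
        self.entries.insert(host.to_string(), (Instant::now(), body));
    }
}
```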
#### Indexer
Imports RakePacks and indexes them for searchability.
Also maintains a graph database of all cross-page links.
This graph could be used for ranking (TODO: investigate PageRank).
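To make the PageRank idea concrete, here is a toy power-iteration over an adjacency list. The damping factor and iteration count are arbitrary; a real implementation would operate over the link graph database rather than an in-memory vector.

```rust
/// Toy PageRank: `links[i]` lists the pages that page `i` links to.
/// Returns a rank per page; ranks sum to 1.
fn pagerank(links: &[Vec<usize>], iterations: usize, damping: f64) -> Vec<f64> {
    let n = links.len();
    let mut ranks = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Every page starts each round with the "teleport" share.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (page, outs) in links.iter().enumerate() {
            if outs.is_empty() {
                // Dangling page: spread its rank evenly over all pages.
                for r in next.iter_mut() {
                    *r += damping * ranks[page] / n as f64;
                }
            } else {
                for &target in outs {
                    next[target] += damping * ranks[page] / outs.len() as f64;
                }
            }
        }
        ranks = next;
    }
    ranks
}
```

A page that many others link to accumulates rank; a page nobody links to keeps only the teleport share.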
#### Searcher
Provides a front-end for searching in the index.
Could provide an API. (Maybe we can integrate into Searx and get the best of both?)