QuickPeep Concepts
Principles
- Focus on good-quality, interesting, personal content rather than completeness for every search query.
- Support running a search engine on modest hardware. Critically, disk space is likely to be constrained in real-world deployments.
Components and Subcomponents
On-disk Structures
Schedule:
- List of URLs to rake
- Backoffs for failing hosts
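A minimal sketch of how the two parts of the Schedule could fit together in memory. Everything here is an assumption made for illustration: the `Schedule` type, its method names, and the policy of a 60-second delay that doubles per consecutive failure and is capped at one day. The real on-disk layout is not specified above.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical in-memory view of the Schedule (names are illustrative only).
#[derive(Default)]
pub struct Schedule {
    /// URLs waiting to be raked, in arrival order.
    queue: VecDeque<String>,
    /// For each failing host: (consecutive failures, earliest retry time in seconds).
    backoffs: HashMap<String, (u32, u64)>,
}

/// Assumed policy: 60 s base delay, doubling per failure, capped at one day.
const BASE_DELAY_SECS: u64 = 60;
const MAX_DELAY_SECS: u64 = 24 * 60 * 60;

impl Schedule {
    /// Add a URL to the back of the rake queue.
    pub fn push(&mut self, url: &str) {
        self.queue.push_back(url.to_owned());
    }

    /// Take the next URL to rake, if any.
    pub fn pop(&mut self) -> Option<String> {
        self.queue.pop_front()
    }

    /// Record a failed rake of `host` at time `now`; returns the delay applied.
    pub fn record_failure(&mut self, host: &str, now: u64) -> u64 {
        let entry = self.backoffs.entry(host.to_owned()).or_insert((0, 0));
        entry.0 += 1;
        // Limit the shift so the multiplication cannot overflow.
        let shift = (entry.0 - 1).min(20);
        let delay = (BASE_DELAY_SECS << shift).min(MAX_DELAY_SECS);
        entry.1 = now + delay;
        delay
    }

    /// A successful rake clears the host's backoff state.
    pub fn record_success(&mut self, host: &str) {
        self.backoffs.remove(host);
    }

    /// Whether `host` may be raked at time `now`.
    pub fn is_ready(&self, host: &str, now: u64) -> bool {
        self.backoffs
            .get(host)
            .map_or(true, |&(_, retry_at)| now >= retry_at)
    }
}
```

Keeping the backoff per host rather than per URL means a single failing host cannot fill the queue with retries.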
RakePack:
- Contains summarised results of scraping many pages
- In a streamable, dense memory-mappable format.
- Perhaps use rkyv to store the records.
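To illustrate only the "streamable" property, here is a sketch that frames each record with length prefixes so that a reader can consume a RakePack one record at a time. This is not rkyv and it is not zero-copy or memory-mappable; it is a plain standard-library stand-in. The `RakedPage` type and its two fields are hypothetical.

```rust
use std::io::{self, Read, Write};

/// Hypothetical summarised record for one raked page.
#[derive(Debug, PartialEq)]
pub struct RakedPage {
    pub url: String,
    pub summary: String,
}

/// Write one field as a little-endian u32 length followed by the bytes.
fn write_field<W: Write>(out: &mut W, bytes: &[u8]) -> io::Result<()> {
    if bytes.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "field too long"));
    }
    out.write_all(&(bytes.len() as u32).to_le_bytes())?;
    out.write_all(bytes)
}

/// Read one length-prefixed UTF-8 field.
fn read_field<R: Read>(input: &mut R) -> io::Result<String> {
    let mut len = [0u8; 4];
    input.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    input.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Append one record to the stream.
pub fn write_record<W: Write>(out: &mut W, page: &RakedPage) -> io::Result<()> {
    write_field(out, page.url.as_bytes())?;
    write_field(out, page.summary.as_bytes())
}

/// Read the next record; fails with an I/O error at end of stream.
pub fn read_record<R: Read>(input: &mut R) -> io::Result<RakedPage> {
    Ok(RakedPage {
        url: read_field(input)?,
        summary: read_field(input)?,
    })
}
```

Because every field carries its own length, records can be appended while raking and read back sequentially without loading the whole pack.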
Index:
- Searchable index of all documents
- Might be distributable as deltas (to be decided).
- Might be sharded by different parameters (e.g. tags) — specifics to be decided.
- Might be sharded by date of raking — specifics to be decided. Not sure how to best manage an ever-growing dataset.
Programs
Importer
Imports URLs from seed files. Needed to bootstrap the entire engine.
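A sketch of the import step, under a loudly stated assumption: the seed file format is not described here, so this example invents one (one URL per line, blank lines and `#` comments ignored). The function name `parse_seeds` is likewise hypothetical.

```rust
use std::collections::HashSet;

/// Parse seed text in an ASSUMED format: one URL per line, `#` starts a comment line.
/// Returns unique http(s) URLs in the order they first appear.
pub fn parse_seeds(text: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut urls = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Only web URLs are useful to the raker.
        if !(line.starts_with("http://") || line.starts_with("https://")) {
            continue;
        }
        // Drop duplicates so the schedule is not seeded twice with the same URL.
        if seen.insert(line.to_owned()) {
            urls.push(line.to_owned());
        }
    }
    urls
}
```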
Raker
Rakes a page, feed or sitemap. Builds robots.txt file caches as necessary.
Generates a summarised version of the page. Also tries to extract readable content, for higher ranking in the index.
Also analyses pages for pop-ups and other issues. (Unsure if we should do the analysis for e.g. Cloudflare at this stage or not?)
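The robots.txt caching could look roughly like the sketch below: one cached body per host, refetched once it is older than a time-to-live. The `RobotsCache` type, the TTL policy and the `fetch` callback are assumptions for illustration; parsing the rules and the actual HTTP request are left out.

```rust
use std::collections::HashMap;

/// Hypothetical per-host cache of robots.txt bodies.
pub struct RobotsCache {
    /// How long a cached body stays valid, in seconds (assumed policy).
    ttl_secs: u64,
    /// host -> (time fetched in seconds, raw robots.txt body)
    entries: HashMap<String, (u64, String)>,
}

impl RobotsCache {
    pub fn new(ttl_secs: u64) -> Self {
        RobotsCache {
            ttl_secs,
            entries: HashMap::new(),
        }
    }

    /// Return the cached body for `host`, calling `fetch` only if the entry
    /// is missing or older than the TTL at time `now`.
    pub fn get_or_fetch<F>(&mut self, host: &str, now: u64, fetch: F) -> &str
    where
        F: FnOnce(&str) -> String,
    {
        let stale = match self.entries.get(host) {
            Some(&(fetched_at, _)) => now.saturating_sub(fetched_at) >= self.ttl_secs,
            None => true,
        };
        if stale {
            let body = fetch(host);
            self.entries.insert(host.to_owned(), (now, body));
        }
        &self.entries[host].1
    }
}
```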
Indexer
Imports RakePacks and indexes them for searchability.
Also maintains a graph database of all cross-page links. We can use this to perform ranking...?
??? TODO pagerank ???
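As a starting point for that TODO, here is a sketch of textbook PageRank by power iteration over a link graph. It is not a decided design: the adjacency-list representation, the function name and the parameters are all assumptions, and a real graph database would replace the in-memory slice.

```rust
/// PageRank by power iteration.
/// `links[i]` lists the pages that page `i` links to, as indices into `links`
/// (all indices are assumed to be in range).
pub fn pagerank(links: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = links.len();
    if n == 0 {
        return Vec::new();
    }
    // Start from a uniform distribution.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Rank held by pages without outgoing links is spread over all pages.
        let dangling: f64 = links
            .iter()
            .zip(&rank)
            .filter(|(out, _)| out.is_empty())
            .map(|(_, r)| *r)
            .sum();
        let base = (1.0 - damping) / n as f64 + damping * dangling / n as f64;
        let mut next = vec![base; n];
        for (page, out) in links.iter().enumerate() {
            if out.is_empty() {
                continue;
            }
            // Each page splits its damped rank evenly among its link targets.
            let share = damping * rank[page] / out.len() as f64;
            for &target in out {
                next[target] += share;
            }
        }
        rank = next;
    }
    rank
}
```

A call such as `pagerank(&graph, 0.85, 100)` uses the commonly quoted damping factor of 0.85; the ranks always sum to 1, so they can be compared across pages directly.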
Searcher
Provides a front-end for searching in the index. Could provide an API. (Maybe we can integrate into Searx and get the best of both?)