QuickPeep Concepts
==================

Principles
----------

1. Focus on good-quality, interesting, personal content rather than completeness
   for every search query.
2. Support running a search engine on modest hardware.
   Critically, disk space is likely to be constrained in real-world deployments.


Components and Subcomponents
----------------------------

### On-disk Structures

Schedule:
- List of URLs to rake
- Backoffs for failing hosts

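As a rough illustration of the "backoffs for failing hosts" entry, the sketch below keeps a per-host failure count and derives a retry delay from it. Everything here is an assumption made for illustration: the names `BackoffTable` and `HostBackoff`, the in-memory `HashMap`, and the doubling policy with a one-minute base and one-day cap are hypothetical, not QuickPeep's actual schedule format.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical per-host failure record; not QuickPeep's actual schema.
#[derive(Debug, Default)]
pub struct HostBackoff {
    /// Number of consecutive failed rakes for this host.
    pub consecutive_failures: u32,
}

/// Hypothetical in-memory stand-in for the backoff part of the Schedule.
#[derive(Debug, Default)]
pub struct BackoffTable {
    hosts: HashMap<String, HostBackoff>,
}

impl BackoffTable {
    /// Assumed policy: the delay starts at BASE and doubles per failure, capped at MAX.
    const BASE: Duration = Duration::from_secs(60);
    const MAX: Duration = Duration::from_secs(24 * 60 * 60);

    /// Record a failed rake and return how long to wait before retrying the host.
    pub fn record_failure(&mut self, host: &str) -> Duration {
        let entry = self.hosts.entry(host.to_owned()).or_default();
        entry.consecutive_failures = entry.consecutive_failures.saturating_add(1);
        Self::delay_for(entry.consecutive_failures)
    }

    /// A successful rake clears the backoff for that host.
    pub fn record_success(&mut self, host: &str) {
        self.hosts.remove(host);
    }

    /// Delay after `failures` consecutive failures: BASE * 2^(failures - 1), capped at MAX.
    pub fn delay_for(failures: u32) -> Duration {
        if failures == 0 {
            return Duration::ZERO;
        }
        // Clamp the exponent so the shift cannot overflow.
        let exponent = (failures - 1).min(20);
        let secs = Self::BASE.as_secs().saturating_mul(1u64 << exponent);
        Duration::from_secs(secs).min(Self::MAX)
    }
}
```

A raker would call `record_failure(host)` after a failed request and skip that host until the returned delay has passed; `record_success(host)` resets it.
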
RakePack:
- Contains summarised results of scraping many pages.
- Stored in a streamable, dense, memory-mappable format.
- Perhaps use `rkyv` to store the records.

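To make the "streamable" property concrete, here is a minimal sketch of length-prefixed record framing using only the standard library. It is deliberately not `rkyv` and not memory-mappable; it only shows how records could be appended to and read back from a stream one at a time. The `RakedPage` type, its two fields, and the little-endian `u32` length prefix are hypothetical choices for illustration.

```rust
use std::io::{self, Read, Write};

/// Hypothetical summarised record for one raked page.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RakedPage {
    pub url: String,
    pub summary: String,
}

/// Write one field as a little-endian u32 length followed by the bytes.
fn write_field<W: Write>(out: &mut W, bytes: &[u8]) -> io::Result<()> {
    if bytes.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "field too long"));
    }
    out.write_all(&(bytes.len() as u32).to_le_bytes())?;
    out.write_all(bytes)
}

/// Read one field; returns Ok(None) if the stream ends cleanly before the field starts.
fn read_field<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    let mut filled = 0;
    while filled < len_buf.len() {
        let n = input.read(&mut len_buf[filled..])?;
        if n == 0 {
            if filled == 0 {
                return Ok(None);
            }
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "truncated length prefix",
            ));
        }
        filled += n;
    }
    let mut data = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    input.read_exact(&mut data)?;
    Ok(Some(data))
}

fn into_utf8(bytes: Vec<u8>) -> io::Result<String> {
    String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Append one page record to the stream.
pub fn write_page<W: Write>(out: &mut W, page: &RakedPage) -> io::Result<()> {
    write_field(out, page.url.as_bytes())?;
    write_field(out, page.summary.as_bytes())
}

/// Read the next page record, or Ok(None) at the end of the stream.
pub fn read_page<R: Read>(input: &mut R) -> io::Result<Option<RakedPage>> {
    let url = match read_field(input)? {
        Some(bytes) => bytes,
        None => return Ok(None),
    };
    let summary = read_field(input)?.ok_or_else(|| {
        io::Error::new(io::ErrorKind::UnexpectedEof, "record ended after URL")
    })?;
    Ok(Some(RakedPage {
        url: into_utf8(url)?,
        summary: into_utf8(summary)?,
    }))
}
```

Because each record carries its own length, an indexer could process a pack front to back without loading it whole. A zero-copy format such as `rkyv` would additionally avoid the per-record allocation shown here.
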
Index:
- Searchable index of all documents.
- Might be distributable as deltas; to be decided.
- Might be sharded by different parameters (e.g. tags); specifics to be decided.
- Might be sharded by date of raking; specifics to be decided.
  It is not yet clear how best to manage an ever-growing dataset.

### Programs

#### Importer

Imports URLs from seed files. Needed to bootstrap the entire engine.

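The seed file format is not specified here, so the following is only a sketch under an assumed format: one URL per line, blank lines ignored, and lines starting with `#` treated as comments. The function name `parse_seeds` and the restriction to `http://` and `https://` URLs are likewise assumptions.

```rust
use std::collections::HashSet;

/// Parse seed text into a de-duplicated list of URLs, preserving first-seen order.
/// Assumed format: one URL per line; blank lines and `#` comment lines are skipped.
pub fn parse_seeds(text: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut urls = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Assumption: only web URLs are useful seeds; anything else is dropped.
        if !(line.starts_with("https://") || line.starts_with("http://")) {
            continue;
        }
        if seen.insert(line.to_owned()) {
            urls.push(line.to_owned());
        }
    }
    urls
}
```

The resulting list could then be written into the Schedule as the initial set of URLs to rake.
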
#### Raker

Rakes a page, feed or sitemap.
Builds `robots.txt` file caches as necessary.

Generates a summarised version of the page.
Also tries to extract readable content, for higher ranking in the index.

Also analyses pages for pop-ups and other issues.
(It is unclear whether the analysis for e.g. Cloudflare should happen at this stage.)

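One way the `robots.txt` caching could look is a per-host cache that refetches only when an entry is missing or too old. This is a sketch, not the Raker's actual design: `RobotsCache`, the TTL-based expiry, and passing the fetcher in as a closure are all assumptions, and the cached value is just the raw file body with no parsing.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the raw robots.txt body and when it was fetched.
struct CachedRobots {
    body: String,
    fetched_at: Instant,
}

/// Hypothetical per-host robots.txt cache with a fixed time-to-live.
pub struct RobotsCache {
    ttl: Duration,
    entries: HashMap<String, CachedRobots>,
}

impl RobotsCache {
    pub fn new(ttl: Duration) -> Self {
        RobotsCache {
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Return the cached robots.txt body for `host`, calling `fetch` only when
    /// there is no entry yet or the entry is at least `ttl` old at time `now`.
    pub fn get_or_fetch<F>(&mut self, host: &str, now: Instant, fetch: F) -> &str
    where
        F: FnOnce(&str) -> String,
    {
        let stale = match self.entries.get(host) {
            Some(entry) => now.saturating_duration_since(entry.fetched_at) >= self.ttl,
            None => true,
        };
        if stale {
            let body = fetch(host);
            self.entries.insert(
                host.to_owned(),
                CachedRobots {
                    body,
                    fetched_at: now,
                },
            );
        }
        &self.entries[host].body
    }
}
```

Taking `now` as a parameter rather than calling `Instant::now()` inside keeps the expiry logic easy to test; a real raker would pass the current time and a closure that performs the HTTP request.
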
#### Indexer

Imports RakePacks and indexes them for searchability.

Also maintains a graph database of all cross-page links.
We may be able to use this to perform ranking.

TODO: PageRank?

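Since PageRank is only a TODO here, the following is a minimal sketch of what ranking over the link graph might look like, not a decided design. It runs the textbook power iteration over an in-memory adjacency list. The function name `pagerank`, the representation of pages as indices `0..n`, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions.

```rust
/// Iterative PageRank over pages `0..links.len()`.
/// `links[i]` lists the pages that page `i` links to; every target must be a valid index.
/// `damping` is typically around 0.85; `iterations` is the number of update rounds.
pub fn pagerank(links: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = links.len();
    if n == 0 {
        return Vec::new();
    }
    // Start from a uniform distribution.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Every page first receives the "random jump" share.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        let mut dangling = 0.0;
        for (page, targets) in links.iter().enumerate() {
            if targets.is_empty() {
                // Pages without outgoing links: collect their rank for even redistribution.
                dangling += rank[page];
            } else {
                let share = damping * rank[page] / targets.len() as f64;
                for &target in targets {
                    next[target] += share;
                }
            }
        }
        let dangling_share = damping * dangling / n as f64;
        for value in next.iter_mut() {
            *value += dangling_share;
        }
        rank = next;
    }
    rank
}
```

For example, `pagerank(&[vec![2], vec![2], vec![0]], 0.85, 50)` ranks page 2 highest, because both other pages link to it. A real indexer would read the links from the graph database rather than from an in-memory list.
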
#### Searcher

Provides a front-end for searching in the index.
Could provide an API. (Maybe we can integrate with Searx and get the best of both?)