diff --git a/README.md b/README.md
new file mode 100644
index 0000000..2f92998
--- /dev/null
+++ b/README.md
@@ -0,0 +1,77 @@
+# QuickPeep
+
+Small-scale 'artisanal web' search engine project, favouring quality over completeness.
+
+
+## Motivation
+
+Modern web search can be rubbish.
+It feels like I'm getting the same websites time and time again, and that I rarely manage to come across small, independent websites instead of content/SEO mills.
+
+Typical modern websites are rubbish. They bother you with adverts, annoying nagging pop-ups (often loaded with dark patterns), privacy-disregarding trackers and slow-loading content that relies on needless amounts of JavaScript to display plain rich text that has been possible with pure HTML for decades.
+
+QuickPeep aims to index good-quality websites, favouring small personal sites (such as blogs), whilst cutting out the rubbish.
+Websites that don't care about the reader's experience are not welcome; QuickPeep will detect pop-up nags, adverts and privacy issues in order to keep them away.
+
+As a separate issue, many websites decide to use CloudFlare to intercept their traffic.
+This is not in the spirit of the decentralised web; CloudFlare becomes a single party that, if compromised or dishonest, could exert a disproportional amount of power over users.
+One notable example of this is that they bombard privacy-seeking users on VPNs or Tor with needless CAPTCHAs, giving them a worse experience on a lot of the web.
+These websites will be detected so that they can be filtered out at will.
+
+QuickPeep will follow the trade-off of preferring not to provide any results rather than to provide results laced with rubbish.
+If you need to fall back to a conventional search engine, this will eventually be possible to do right within QuickPeep...
+
+
+## Features
+
+*Crossed-out things are aspirational and not yet implemented.*
+
+- ~~Shareable 'rakepacks', so that anyone can run their own search instance without needing to rake (crawl) themselves~~
+  - ~~Dense encoding to minimise disk space usage; compressed with Zstd?~~
+- Raking (crawling) support for
+  - HTML (including redirecting to Canonical URLs)
+  - ~~Language detection~~
+  - Redirects
+  - ~~Gemtext over Gemini~~
+  - RSS, Atom and JSON feeds
+  - XML Sitemaps
+- Detection of anti-features, with ability to block or downrate as desired:
+  - CloudFlare
+  - Adverts
+  - Nagging pop-ups
+  - Trackers
+- Article content extraction, to provide more weight to words found within the article content (based on a Rust version of Mozilla's *Readability* engine)
+- (Misc)
+  - ~~Use of the Public Suffix List~~
+  - ~~Tagging URL patterns; e.g. to mark documentation as 'old'.~~
+- ~~Page duplicate content detection (e.g. to detect `/` and `/index.html`, or non-HTTPS and HTTPS, or non-`www` and `www`...)~~
+- ~~Language detection for pages that don't have that metadata available.~~
+
+
+## Limitations
+
+- Only supports English (and English dialects) for now.
+  - Should hopefully be customisable later on, but even though I can speak some foreign languages, I don't know any communities to start with to seed non-English search.
+- Websites have to be manually allowed and tagged.
+  - It's otherwise difficult to know how to detect good-quality sites automatically, or to tag what they are.
+  - There may be ways to improve this; e.g. with machine learning techniques or crowdsourcing of data.
+- The search index needs to remain small enough to be usable on modest hardware, so there's no way we can hope to index everything.
+
+
+## Architecture
+
+*Not written yet.*
+
+
+## Development and Running
+
+*Not written yet.*
+
+
+### Helper scripts
+
+`scripts` contains some helper scripts, which you probably need to run before operating QuickPeep:
+
+- `get_cf_ips.sh`: fetches IP addresses used for CloudFlare detection.
+- `get_adblock_filters.sh`: fetches adblock filters used to detect nags/adverts/trackers.
+
diff --git a/quickpeep_densedoc/src/lib.rs b/quickpeep_densedoc/src/lib.rs
index e405883..5ad9c27 100644
--- a/quickpeep_densedoc/src/lib.rs
+++ b/quickpeep_densedoc/src/lib.rs
@@ -21,6 +21,8 @@ impl DenseDocument {
 pub struct DenseHead {
     title: String,
     feed_urls: Vec,
+    /// Language of the page. May be empty if not discovered.
+    language: String,
     /// URL to icon of the page. May be empty if none were discovered.
     icon: String,
 }