Write up a README for the project
parent df50f3607d
commit 60e906fefd

@@ -0,0 +1,77 @@

# QuickPeep

Small-scale 'artisanal web' search engine project, favouring quality over completeness.

## Motivation

Modern web search can be rubbish.
It feels like I'm getting the same websites time and time again, and I rarely manage to come across small, independent websites instead of content/SEO mills.

Typical modern websites are rubbish. They bother you with adverts, nagging pop-ups (often loaded with dark patterns), privacy-disregarding trackers and slow-loading content that relies on needless amounts of JavaScript to display rich text that has been possible with plain HTML for decades.

QuickPeep aims to index good-quality websites, favouring small personal sites (such as blogs), whilst cutting out the rubbish.
Websites that don't care about the reader's experience are not welcome; QuickPeep will detect pop-up nags, adverts and privacy issues in order to keep them away.

As a separate issue, many websites decide to use CloudFlare to intercept their traffic.
This is not in the spirit of the decentralised web; CloudFlare becomes a single party that, if compromised or dishonest, could exert a disproportionate amount of power over users.
One notable example of this is that they bombard privacy-seeking users on VPNs or Tor with needless CAPTCHAs, giving them a worse experience on much of the web.
These websites will be detected so that they can be filtered out at will.

QuickPeep prefers to provide no results at all rather than results laced with rubbish.
If you need to fall back to a conventional search engine, it will eventually be possible to do so from within QuickPeep.

## Features

*Crossed-out things are aspirational and not yet implemented.*

- ~~Shareable 'rakepacks', so that anyone can run their own search instance without needing to rake (crawl) themselves~~
- ~~Dense encoding to minimise disk space usage; compressed with Zstd?~~
- Raking (crawling) support for
  - HTML (including redirecting to Canonical URLs)
  - ~~Language detection~~
  - Redirects
  - ~~Gemtext over Gemini~~
  - RSS, Atom and JSON feeds
  - XML Sitemaps
- Detection of anti-features, with ability to block or downrate as desired:
  - CloudFlare
  - Adverts
  - Nagging pop-ups
  - Trackers
- Article content extraction, to provide more weight to words found within the article content (based on a Rust version of Mozilla's *Readability* engine)
- (Misc)
  - ~~Use of the Public Suffix List~~
  - ~~Tagging URL patterns; e.g. to mark documentation as 'old'.~~
  - ~~Page duplicate content detection (e.g. to detect `/` and `/index.html`, or non-HTTPS and HTTPS, or non-`www` and `www`...)~~
  - ~~Language detection for pages that don't have that metadata available.~~
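
To make the duplicate-page item above more concrete, here is a minimal sketch of one way the listed URL variants (`/` versus `/index.html`, HTTP versus HTTPS, `www` versus non-`www`) could be collapsed into a single comparison key. This is an illustration only: the feature is not implemented, `dedup_key` is a hypothetical name, and a real implementation would probably compare page content as well rather than rely on URLs alone. The sketch uses only the standard library and ignores ports, query strings and fragments.

```rust
/// Hypothetical helper (not part of QuickPeep): reduces a URL to a key that
/// is identical for the URL variants mentioned in the feature list.
/// This is deliberately simplistic and is not a full URL parser.
fn dedup_key(url: &str) -> String {
    // Drop the scheme so that http:// and https:// compare equal.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);

    // Split the host from the path at the first '/'.
    let (host, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx..]),
        None => (rest, "/"),
    };

    // Host names are case-insensitive; treat a leading "www." as optional.
    let host_lower = host.to_ascii_lowercase();
    let host_key = host_lower
        .strip_prefix("www.")
        .unwrap_or(host_lower.as_str());

    // Treat ".../index.html" as the directory that contains it.
    let path_key = match path.strip_suffix("/index.html") {
        Some(dir) => format!("{}/", dir),
        None => path.to_string(),
    };

    format!("{}{}", host_key, path_key)
}
```

With this sketch, `http://www.example.org/index.html` and `https://example.org/` produce the same key, so a raker could treat them as candidates for the same page before doing any more expensive content comparison.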

## Limitations

- Only supports English (and English dialects) for now.
  - Should hopefully be customisable later on, but even though I can speak some foreign languages, I don't know any communities to start with to seed non-English search.
- Websites have to be manually allowed and tagged.
  - It's otherwise difficult to know how to detect good-quality sites automatically, or to tag what they are.
  - There may be ways to improve this; e.g. with machine learning techniques or crowdsourcing of data.
- The search index needs to remain small enough to be usable on modest hardware, so there's no way we can hope to index everything.

## Architecture

*Not written yet.*

## Development and Running

*Not written yet.*

### Helper scripts

`scripts` contains some helper scripts, which you will probably need to run before operating QuickPeep:

- `get_cf_ips.sh`: fetches IP addresses used for CloudFlare detection.
- `get_adblock_filters.sh`: fetches adblock filters used to detect nags/adverts/trackers.
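
As a rough illustration of how a fetched IP list could be used for CloudFlare detection, the sketch below checks whether an address falls inside any of a set of ranges. Several things here are assumptions rather than QuickPeep's actual behaviour: that the list contains one IPv4 CIDR range per line, and the names `Ipv4Cidr`, `parse_ranges` and `is_in_ranges`, which are hypothetical. IPv6 is left out to keep the example short.

```rust
use std::net::Ipv4Addr;

/// An IPv4 network in CIDR notation, e.g. 192.0.2.0/24.
/// (Hypothetical type for illustration; not taken from QuickPeep.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Ipv4Cidr {
    network: u32,
    mask: u32,
}

impl Ipv4Cidr {
    /// Parses "a.b.c.d/len"; returns None on malformed input.
    fn parse(text: &str) -> Option<Ipv4Cidr> {
        let (addr, len) = text.trim().split_once('/')?;
        let addr: Ipv4Addr = addr.parse().ok()?;
        let len: u32 = len.parse().ok()?;
        if len > 32 {
            return None;
        }
        // A /0 prefix needs special care: shifting a u32 by 32 bits overflows.
        let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
        Some(Ipv4Cidr {
            network: u32::from(addr) & mask,
            mask,
        })
    }

    /// True if `ip` shares this network's prefix bits.
    fn contains(&self, ip: Ipv4Addr) -> bool {
        (u32::from(ip) & self.mask) == self.network
    }
}

/// Parses one range per line, skipping blank or malformed lines.
/// The one-range-per-line format is an assumption about the fetched file.
fn parse_ranges(list: &str) -> Vec<Ipv4Cidr> {
    list.lines().filter_map(Ipv4Cidr::parse).collect()
}

/// True if `ip` falls inside any of the given ranges.
fn is_in_ranges(ranges: &[Ipv4Cidr], ip: Ipv4Addr) -> bool {
    ranges.iter().any(|range| range.contains(ip))
}
```

For example, after `parse_ranges("192.0.2.0/24\n198.51.100.0/25\n")`, the address `192.0.2.77` is reported as inside the ranges and `203.0.113.1` is not. These are reserved documentation ranges used as placeholders, not real CloudFlare addresses. Storing each range as a network/mask pair keeps the membership test to a single bitwise AND and comparison per range.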

@@ -21,6 +21,8 @@ impl DenseDocument {
 pub struct DenseHead {
     title: String,
     feed_urls: Vec<String>,
+    /// Language of the page. May be empty if not discovered.
+    language: String,
     /// URL to icon of the page. May be empty if none were discovered.
     icon: String,
 }