Olivier 'reivilibre 2876939b59 Port to Postgres		2023-04-02 10:52:58 +01:00
.ci	Fix manual-pushing pipeline	2022-11-27 10:02:52 +00:00
docs	Add documentation about the seed collection service	2022-11-26 18:09:34 +00:00
nixos_modules	nixos: Add working directory config option to quickpeepSearch	2023-03-21 23:56:47 +00:00
quickpeep	Port to Postgres	2023-04-02 10:52:58 +01:00
quickpeep_densedoc	Respect nofollow and noindex <meta> robots tags	2023-03-30 23:09:39 +01:00
quickpeep_html_charset_detection	Remove forgotten println statements	2022-06-14 23:07:58 +01:00
quickpeep_index	tantivy backend: return tags in search results	2022-11-28 23:19:11 +00:00
quickpeep_indexer	Clarify and handle 'No domain for URL' error in a better way	2023-03-21 23:36:47 +00:00
quickpeep_moz_readability	Add MIT/Apache2 licence	2022-04-01 22:25:06 +01:00
quickpeep_raker	Port to Postgres	2023-04-02 10:52:58 +01:00
quickpeep_seed_parser	Add MIT/Apache2 licence	2022-04-01 22:25:06 +01:00
quickpeep_static	nixos: Add way of building the static files	2023-03-22 01:20:13 +00:00
quickpeep_structs	Fix unfinished work around SecureUpgrade	2022-12-03 15:13:06 +00:00
quickpeep_utils	Port to Postgres	2023-04-02 10:52:58 +01:00
scripts	Clean-ups and support pulling out references	2022-03-14 23:01:19 +00:00
test_vm	Start moving Nix Flake into the root	2022-06-04 22:30:17 +01:00
.env	Initial working version of the seed collection service	2022-03-16 19:53:08 +00:00
.envrc	Add Nix shell	2022-11-21 15:21:52 +00:00
.gitignore	Add backoff reinstatement function to store	2022-06-10 23:02:13 +01:00
book.toml	Fix git links	2022-11-05 14:40:12 +00:00
Cargo.lock	Port to Postgres	2023-04-02 10:52:58 +01:00
Cargo.toml	Create a crate for HTML charset detection	2022-06-12 14:47:42 +01:00
deny.toml	Add MIT/Apache2 licence	2022-04-01 22:25:06 +01:00
flake.lock	Update lock	2022-06-11 20:40:23 +01:00
FLAKE.md	Start moving Nix Flake into the root	2022-06-04 22:30:17 +01:00
flake.nix	nixos: Add way of building the static files	2023-03-22 01:20:13 +00:00
grafana.json	Add a grafana dashboard	2022-06-04 23:09:46 +01:00
LICENCE.Apache2	Add MIT/Apache2 licence	2022-04-01 22:25:06 +01:00
LICENCE.MIT	Add MIT/Apache2 licence	2022-04-01 22:25:06 +01:00
quickpeep.sample.ron	Add OpenSearch XML	2022-11-28 22:49:18 +00:00
README.md	Update the README a little bit	2022-07-02 22:55:18 +01:00
shell.nix	Port to Postgres	2023-04-02 10:52:58 +01:00

README.md

QuickPeep

Small-scale 'artisanal web' search engine project, favouring quality over completeness.

Motivation

Modern web search can be rubbish. It feels like I keep getting the same websites time and time again, and I rarely come across small, independent websites rather than content/SEO mills.

Typical modern websites are rubbish. They bother you with adverts, nagging pop-ups (often loaded with dark patterns), privacy-disregarding trackers and slow-loading content that relies on needless amounts of JavaScript to display simple rich text, something that has been possible with pure HTML for decades.

QuickPeep aims to index good-quality websites, favouring small personal sites (such as blogs), whilst cutting out the rubbish. Websites that don't care about the reader's experience are not welcome; QuickPeep will detect pop-up nags, adverts and privacy issues in order to keep them away.

As a separate issue, many websites decide to use CloudFlare to intercept their traffic. This is not in the spirit of the decentralised web; CloudFlare becomes a single party that, if compromised or dishonest, could exert a disproportionate amount of power over users. One notable example of this is that they bombard privacy-seeking users on VPNs or Tor with needless CAPTCHAs, giving them a worse experience on a lot of the web. These websites will be detected so that they can be filtered out at will.

QuickPeep makes a deliberate trade-off: it prefers to provide no results at all rather than results laced with rubbish. If you need to fall back to a conventional search engine, it will eventually be possible to do so right within QuickPeep.

Features

Crossed-out things are aspirational and not yet implemented.

- Shareable 'rakepacks', so that anyone can run their own search instance without needing to rake (crawl) themselves
  - Dense encoding to minimise disk space usage; compressed with Zstd.
- Raking (crawling) support for
  - HTML (including redirecting to Canonical URLs)
    - Language detection for when the metadata is absent.
  - Redirects
  - ~~Gemtext over Gemini~~
  - RSS, Atom and JSON feeds
  - XML Sitemaps
- Detection of anti-features, with ability to block or downrate as desired:
  - CloudFlare
  - Adverts
  - Nagging pop-ups
  - Trackers
- Article content extraction, to provide more weight to words found within the article content (based on a Rust version of Mozilla's Readability engine)
- (Misc)
  - ~~Use of the Public Suffix List~~
  - Tagging URL patterns; e.g. to mark documentation as 'old'.
- ~~Page duplicate content detection (e.g. to detect / and /index.html, or non-HTTPS and HTTPS, or non-www and www...)~~
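
Duplicate detection is listed above as not yet implemented, so the following is only an illustration of the kind of URL normalisation the examples in that item suggest; it is not QuickPeep's code. The function name `dedup_key` is hypothetical. The sketch assumes lowercase `http://` or `https://` schemes and ignores ports, query strings and fragments, which a real implementation would have to handle (most likely with a proper URL parser).

```rust
/// Returns a key under which likely-duplicate URLs collapse together.
/// Hypothetical helper for illustration; not part of QuickPeep.
pub fn dedup_key(url: &str) -> Option<String> {
    // Drop the scheme so that HTTP and HTTPS variants compare equal.
    // Any other scheme is rejected.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;

    // Split host from path; a URL with no path is treated as "/".
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };

    // Host names are case-insensitive; also fold "www." into the bare host.
    let lower = host.to_ascii_lowercase();
    let host = lower.strip_prefix("www.").unwrap_or(lower.as_str());

    // Treat ".../index.html" as equivalent to its directory ".../".
    let path = match path.strip_suffix("/index.html") {
        Some(dir) => format!("{dir}/"),
        None => path.to_string(),
    };

    Some(format!("{host}{path}"))
}
```

Under these assumptions, `http://www.example.com/index.html` and `https://example.com` both map to the key `example.com/`. Matching keys only indicate candidates: two URLs with the same key may still serve different content, so comparing the page content itself would remain necessary.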

Limitations

- Only supports English (and English dialects) for now.
  - Should hopefully be customisable later on, but even though I can speak some foreign languages, I don't know any communities to start with to seed non-English search.
- Websites have to be manually allowed and tagged.
  - It's otherwise difficult to know how to detect good-quality sites automatically, or to tag what they are.
  - There may be ways to improve this; e.g. with machine learning techniques or crowdsourcing of data.
- The search index needs to remain small enough to be usable on modest hardware, so there's no way we can hope to index everything.

Architecture

Not written yet.

The stages of the QuickPeep pipeline are briefly described in an introductory blog post.

Development and Running

Not written yet.

Some hints may be obtained from the introductory blog post mentioned in the 'Architecture' section, but it's probably quite difficult to follow right now.

Helper scripts

The scripts directory contains some helper scripts, which you probably need to run before operating QuickPeep:

- get_cf_ips.sh: fetches IP addresses used for CloudFlare detection.
- get_adblock_filters.sh: fetches adblock filters used to detect nags/adverts/trackers.
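
To give a rough idea of how a fetched list of IP addresses could be used for detection, here is a minimal sketch that checks whether a server's address falls inside any listed range. It is not QuickPeep's implementation. It assumes the list is plain text with one IPv4 range in CIDR notation per line (the actual output format of get_cf_ips.sh is not described here), it ignores IPv6, and the names `Ipv4Cidr`, `parse_ranges` and `is_in_ranges` are made up for the example.

```rust
use std::net::Ipv4Addr;

/// An IPv4 network in CIDR notation, e.g. 192.0.2.0/24.
/// Hypothetical type for illustration; not part of QuickPeep.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ipv4Cidr {
    network: u32,
    mask: u32,
}

impl Ipv4Cidr {
    /// Parses a line such as "192.0.2.0/24"; returns None on malformed input.
    pub fn parse(line: &str) -> Option<Self> {
        let (addr, prefix) = line.trim().split_once('/')?;
        let addr: Ipv4Addr = addr.parse().ok()?;
        let prefix: u32 = prefix.parse().ok()?;
        if prefix > 32 {
            return None;
        }
        // A /0 prefix is special-cased: shifting a u32 by 32 bits overflows.
        let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
        Some(Self { network: u32::from(addr) & mask, mask })
    }

    /// True if `ip` lies inside this network.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        (u32::from(ip) & self.mask) == self.network
    }
}

/// Parses a newline-separated list of ranges, skipping blank or malformed lines.
pub fn parse_ranges(text: &str) -> Vec<Ipv4Cidr> {
    text.lines().filter_map(Ipv4Cidr::parse).collect()
}

/// True if `ip` falls inside any of the given ranges.
pub fn is_in_ranges(ranges: &[Ipv4Cidr], ip: Ipv4Addr) -> bool {
    ranges.iter().any(|range| range.contains(ip))
}
```

For example, with the documentation-only range `192.0.2.0/24` loaded through `parse_ranges`, `is_in_ranges` reports the address 192.0.2.77 as inside the list and 203.0.113.1 as outside it. In practice the address being tested would be the one the site's host name resolves to.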

Licence

Licensed under either of

- Apache Licence, Version 2.0 (LICENCE.Apache2 or http://www.apache.org/licenses/LICENSE-2.0)
- MIT Licence (LICENCE.MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 licence, shall be dual licensed as above, without any additional terms or conditions.