Commit Graph

116 Commits (main)

Author SHA1 Message Date
Olivier 'reivilibre' e07ac16bc4 Skip raking of weeded URLs
May be useful for retroactively clearing out URLs
2023-03-31 22:59:23 +01:00
Olivier 'reivilibre' ff514e90b8 Simplify allowed_/weed_domains 2023-03-31 22:50:02 +01:00
Olivier 'reivilibre' 1c10cb203a Dodge some places where we enqueue URLs without checking they have supported schemes
2023-03-30 23:40:43 +01:00
Olivier 'reivilibre' 1e8aa95e7a Respect nofollow and noindex <meta> robots tags
Along with doing the right thing, this should speed up raking for us
2023-03-30 23:09:39 +01:00
Olivier 'reivilibre' 18d2023550 Add a debug line when we rake something
2023-03-30 21:17:15 +01:00
Olivier 'reivilibre' 83fecf1464 Improve the raker to perform a reinstate periodically and to respawn workers
2023-03-28 21:09:24 +01:00
Olivier 'reivilibre' 626b448245 raker: Switch to Jemalloc for the global allocator
2023-03-22 23:08:08 +00:00
Olivier 'reivilibre' 6d37a07d3e Clarify and handle 'No domain for URL' error in a better way
2023-03-21 23:36:47 +00:00
Olivier 'reivilibre' 0bebfc0025 Fix unfinished work around SecureUpgrade
2022-12-03 15:13:06 +00:00
Olivier 'reivilibre' bff48f35f4 Make the raker attempt HTTPS upgrades
Not only does this improve security for searchers later on,
it also enables us to cut down on the number of duplicates quite easily.
2022-11-28 23:15:37 +00:00
Olivier 'reivilibre' 8b439c1550 Remove noisy and obsolete debug output in the sitemap extractor
2022-11-27 10:14:47 +00:00
Olivier 'reivilibre' 0654d1aa07 Fix raker tools having wrong default config path 2022-11-27 00:02:33 +00:00
Olivier 'reivilibre' 4bba2fc89b Don't fall over on unknown schemes e.g. mailto:
2022-11-26 23:47:23 +00:00
Olivier 'reivilibre' c940900fab Add missing URL clean
2022-11-26 22:59:24 +00:00
Olivier 'reivilibre' 438beed86a Add more error context 2022-11-26 22:59:14 +00:00
Olivier 'reivilibre' 08f4b7aeaa Add a lot of debug output
2022-11-26 22:45:51 +00:00
Olivier 'reivilibre' 2ce8e2ba8e Fix qp-seedrake
2022-11-26 22:30:40 +00:00
Olivier 'reivilibre' bd16f58d9e Maintain an index file of rakepacks and append when a rakepack is finished 2022-11-26 20:07:12 +00:00
Olivier 'reivilibre' 52d0183942 Reinstate re-rakable URLs on startup 2022-11-26 19:22:34 +00:00
Olivier 'reivilibre' 6ecbc0561f Add configurable re-rake times for different kinds of raked things 2022-11-26 19:05:36 +00:00
Olivier 'reivilibre' d5255410f5 Fix comment on last_visited_days 2022-11-26 18:15:53 +00:00
Olivier 'reivilibre' aa4567c623 Use the sniffed encoding in page extraction 2022-06-12 15:49:02 +01:00
Olivier 'reivilibre' c451a12e44 Pass the bytes through when extracting HTML 2022-06-12 15:26:46 +01:00
Olivier 'reivilibre' c783f89f72 Create a crate for HTML charset detection 2022-06-12 14:47:42 +01:00
Olivier 'reivilibre' d1bbb91477 Decrease default crawl delay a little bit
2022-06-11 00:58:00 +01:00
Olivier 'reivilibre' 504be33b8a Deny content based on content-type before downloading it 2022-06-11 00:57:24 +01:00
Olivier 'reivilibre' 5d1f35a8ee Deny content based on content-length header 2022-06-11 00:12:15 +01:00
Olivier 'reivilibre' bb396dfb5b Reinstate backoffs on startup
2022-06-10 23:35:24 +01:00
Olivier 'reivilibre' fc69b1b192 Add backoff reinstatement function to store 2022-06-10 23:02:13 +01:00
Olivier 'reivilibre' 75afb8b559 Change lack of content-type to be a permanent failure
In practice, I see this happening on URLs with unknown filetypes
2022-06-05 10:16:17 +01:00
Olivier 'reivilibre' e88bf6cb44 Convert size limit hits into permanent failures 2022-06-05 10:12:18 +01:00
Olivier 'reivilibre' e66ac80484 Allow passing permanent failures up as errors 2022-06-05 10:09:03 +01:00
Olivier 'reivilibre' fb3eae9226 Accept forbidden robots.txt — if they forbid us from knowing about things, we will be cheeky
2022-06-04 23:54:26 +01:00
Olivier 'reivilibre' d18d0635d7 Don't hammer robots.txt 2022-06-04 23:54:22 +01:00
Olivier 'reivilibre' d8f4baf9a3 Fix the database storage size limit
2022-06-04 23:50:10 +01:00
Olivier 'reivilibre' d3600bfb73 Add metric for new enqueued URLs 2022-06-04 23:38:12 +01:00
Olivier 'reivilibre' aa08463499 Update the metrics more frequently to prevent spiking in rates
2022-06-04 23:36:59 +01:00
Olivier 'reivilibre' bde4a7e5e2 Allow inspecting more domains
2022-06-04 23:14:57 +01:00
Olivier 'reivilibre' f8756e1359 Implement --prefix for the DB inspector 2022-06-04 23:12:00 +01:00
Olivier 'reivilibre' 4fd2dc393e Use the unified config in the raker 2022-04-05 17:50:55 +01:00
Olivier 'reivilibre' 96a01e0aaa Dissolve links before emitting documents to the pack store
Fixes #9
2022-04-03 10:47:18 +01:00
Olivier 'reivilibre' 6c2ff9daec Add minimum free space cutoff feature for the raker 2022-04-03 10:18:41 +01:00
Olivier 'reivilibre' ff0126bac4 Use published fork of Cylon 2022-04-01 22:53:05 +01:00
Olivier 'reivilibre' 00f05256e5 Add debug line 2022-04-01 22:47:54 +01:00
Olivier 'reivilibre' e2c2adefa2 Publish and use fancy_mdbx and metrics-process-promstyle 2022-04-01 22:47:52 +01:00
Olivier 'reivilibre' f31c2bba1e Add MIT/Apache2 licence 2022-04-01 22:25:06 +01:00
Olivier 'reivilibre' 6a68757e30 Use a non-readabilitised copy of the document for reference extraction
Fixes #7.
2022-03-29 22:43:31 +01:00
Olivier 'reivilibre' e6a402af19 Use trace! for cosmetic filter logging
2022-03-28 23:43:10 +01:00
Olivier 'reivilibre' 68b7c76d1e Support network filter checking 2022-03-28 23:43:01 +01:00
Olivier 'reivilibre' 5f93b68b4e Display datetime metadata in qp-rake1 2022-03-28 23:17:32 +01:00