Olivier 'reivilibre'
|
e07ac16bc4
|
Skip raking of weeded URLs
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
May be useful for retroactively clearing out URLs
|
2023-03-31 22:59:23 +01:00 |
Olivier 'reivilibre'
|
ff514e90b8
|
Simplify allowed_/weed_domains
|
2023-03-31 22:50:02 +01:00 |
Olivier 'reivilibre'
|
1c10cb203a
|
Dodge some places where we enqueue URLs without checking they have supported schemes
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2023-03-30 23:40:43 +01:00 |
Olivier 'reivilibre'
|
1e8aa95e7a
|
Respect nofollow and noindex <meta> robots tags
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
Along with doing the right thing, this should speed up raking for us
|
2023-03-30 23:09:39 +01:00 |
Olivier 'reivilibre'
|
18d2023550
|
Add a debug line when we rake something
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2023-03-30 21:17:15 +01:00 |
Olivier 'reivilibre'
|
83fecf1464
|
Improve the raker to perform a reinstate periodically and to respawn workers
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2023-03-28 21:09:24 +01:00 |
Olivier 'reivilibre'
|
626b448245
|
raker: Switch to Jemalloc for the global allocator
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2023-03-22 23:08:08 +00:00 |
Olivier 'reivilibre'
|
6d37a07d3e
|
Clarify and handle 'No domain for URL' error in a better way
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2023-03-21 23:36:47 +00:00 |
Olivier 'reivilibre'
|
0bebfc0025
|
Fix unfinished work around SecureUpgrade
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-12-03 15:13:06 +00:00 |
Olivier 'reivilibre'
|
bff48f35f4
|
Make the raker attempt HTTPS upgrades
ci/woodpecker/push/check Pipeline failed
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
Not only does this improve security for searchers later on,
it also enables us to cut down on the number of duplicates quite easily.
|
2022-11-28 23:15:37 +00:00 |
Olivier 'reivilibre'
|
8b439c1550
|
Remove noisy and obsolete debug output in the sitemap extractor
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-11-27 10:14:47 +00:00 |
Olivier 'reivilibre'
|
0654d1aa07
|
Fix raker tools having wrong default config path
|
2022-11-27 00:02:33 +00:00 |
Olivier 'reivilibre'
|
4bba2fc89b
|
Don't fall over on unknown schemes e.g. mailto:
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-11-26 23:47:23 +00:00 |
Olivier 'reivilibre'
|
c940900fab
|
Add missing URL clean
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-11-26 22:59:24 +00:00 |
Olivier 'reivilibre'
|
438beed86a
|
Add more error context
|
2022-11-26 22:59:14 +00:00 |
Olivier 'reivilibre'
|
08f4b7aeaa
|
Add a lot of debug output
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-11-26 22:45:51 +00:00 |
Olivier 'reivilibre'
|
2ce8e2ba8e
|
Fix qp-seedrake
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/manual Pipeline failed
ci/woodpecker/push/release Pipeline was successful
|
2022-11-26 22:30:40 +00:00 |
Olivier 'reivilibre'
|
bd16f58d9e
|
Maintain an index file of rakepacks and append when a rakepack is finished
|
2022-11-26 20:07:12 +00:00 |
Olivier 'reivilibre'
|
52d0183942
|
Reinstate re-rakable URLs on startup
|
2022-11-26 19:22:34 +00:00 |
Olivier 'reivilibre'
|
6ecbc0561f
|
Add configurable re-rake times for different kinds of raked things
|
2022-11-26 19:05:36 +00:00 |
Olivier 'reivilibre'
|
d5255410f5
|
Fix comment on last_visited_days
|
2022-11-26 18:15:53 +00:00 |
Olivier 'reivilibre'
|
aa4567c623
|
Use the sniffed encoding in page extraction
|
2022-06-12 15:49:02 +01:00 |
Olivier 'reivilibre'
|
c451a12e44
|
Pass the bytes through when extracting HTML
|
2022-06-12 15:26:46 +01:00 |
Olivier 'reivilibre'
|
c783f89f72
|
Create a crate for HTML charset detection
|
2022-06-12 14:47:42 +01:00 |
Olivier 'reivilibre'
|
d1bbb91477
|
Decrease default crawl delay a little bit
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-11 00:58:00 +01:00 |
Olivier 'reivilibre'
|
504be33b8a
|
Deny content based on content-type before downloading it
|
2022-06-11 00:57:24 +01:00 |
Olivier 'reivilibre'
|
5d1f35a8ee
|
Deny content based on content-length header
|
2022-06-11 00:12:15 +01:00 |
Olivier 'reivilibre'
|
bb396dfb5b
|
Reinstate backoffs on startup
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-10 23:35:24 +01:00 |
Olivier 'reivilibre'
|
fc69b1b192
|
Add backoff reinstatement function to store
|
2022-06-10 23:02:13 +01:00 |
Olivier 'reivilibre'
|
75afb8b559
|
Change lack of content-type to be a permanent failure
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
In practice, I see this happening on URLs with unknown filetypes
|
2022-06-05 10:16:17 +01:00 |
Olivier 'reivilibre'
|
e88bf6cb44
|
Convert size limit hits into permanent failures
|
2022-06-05 10:12:18 +01:00 |
Olivier 'reivilibre'
|
e66ac80484
|
Allow passing permanent failures up as errors
|
2022-06-05 10:09:03 +01:00 |
Olivier 'reivilibre'
|
fb3eae9226
|
Accept forbidden robots.txt — if they forbid us from knowing about things, we will be cheeky
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-04 23:54:26 +01:00 |
Olivier 'reivilibre'
|
d18d0635d7
|
Don't hammer robots.txt
|
2022-06-04 23:54:22 +01:00 |
Olivier 'reivilibre'
|
d8f4baf9a3
|
Fix the database storage size limit
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-04 23:50:10 +01:00 |
Olivier 'reivilibre'
|
d3600bfb73
|
Add metric for new enqueued URLs
|
2022-06-04 23:38:12 +01:00 |
Olivier 'reivilibre'
|
aa08463499
|
Update the metrics more frequently to prevent spiking in rates
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-04 23:36:59 +01:00 |
Olivier 'reivilibre'
|
bde4a7e5e2
|
Allow inspecting more domains
ci/woodpecker/push/manual Pipeline is pending
ci/woodpecker/push/check Pipeline was successful
ci/woodpecker/push/release Pipeline was successful
|
2022-06-04 23:14:57 +01:00 |
Olivier 'reivilibre'
|
f8756e1359
|
Implement --prefix for the DB inspector
|
2022-06-04 23:12:00 +01:00 |
Olivier 'reivilibre'
|
4fd2dc393e
|
Use the unified config in the raker
|
2022-04-05 17:50:55 +01:00 |
Olivier 'reivilibre'
|
96a01e0aaa
|
Dissolve links before emitting documents to the pack store
continuous-integration/drone the build failed
Fixes #9
|
2022-04-03 10:47:18 +01:00 |
Olivier 'reivilibre'
|
6c2ff9daec
|
Add minimum free space cutoff feature for the raker
|
2022-04-03 10:18:41 +01:00 |
Olivier 'reivilibre'
|
ff0126bac4
|
Use published fork of Cylon
|
2022-04-01 22:53:05 +01:00 |
Olivier 'reivilibre'
|
00f05256e5
|
Add debug line
|
2022-04-01 22:47:54 +01:00 |
Olivier 'reivilibre'
|
e2c2adefa2
|
Publish and use fancy_mdbx and metrics-process-promstyle
|
2022-04-01 22:47:52 +01:00 |
Olivier 'reivilibre'
|
f31c2bba1e
|
Add MIT/Apache2 licence
|
2022-04-01 22:25:06 +01:00 |
Olivier 'reivilibre'
|
6a68757e30
|
Use a non-readabilitised copy of the document for reference extraction
continuous-integration/drone the build failed
Fixes #7.
|
2022-03-29 22:43:31 +01:00 |
Olivier 'reivilibre'
|
e6a402af19
|
Use trace! for cosmetic filter logging
continuous-integration/drone the build failed
|
2022-03-28 23:43:10 +01:00 |
Olivier 'reivilibre'
|
68b7c76d1e
|
Support network filter checking
|
2022-03-28 23:43:01 +01:00 |
Olivier 'reivilibre'
|
5f93b68b4e
|
Display datetime metadata in qp-rake1
|
2022-03-28 23:17:32 +01:00 |