this post was submitted on 17 Aug 2025
Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for the kind of thing GP describes. HTTP 429 would be a better fit.
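For anyone wondering what the 429 approach could look like, here is a minimal sketch of a fixed-window limiter that answers with 429 and a `Retry-After` header once a client goes over its budget. The class name, the `check` method, and the limit/window values are all made up for illustration; a real site would more likely do this in its reverse proxy or framework middleware.

```python
import math


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window` seconds.

    Hypothetical helper for illustration, not taken from any library.
    """

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = {}  # client id -> (window start time, request count)

    def check(self, client, now):
        """Return (status, headers) for a request from `client` at time `now`."""
        start, count = self.counts.get(client, (now, 0))
        if now - start >= self.window:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            # Over budget: tell the client how many seconds to wait.
            retry_after = math.ceil(start + self.window - now)
            return 429, {"Retry-After": str(retry_after)}
        self.counts[client] = (start, count + 1)
        return 200, {}
```

A well-behaved crawler that sees the 429 can sleep for the `Retry-After` value and try again, which is the kind of per-client throttling that robots.txt has no standard way to express.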
Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers ...
It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.
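For reference, a crawler written in Python can read the Crawl-delay value with the standard library's `urllib.robotparser`. This is only a sketch: the robots.txt content and the `ExampleBot` user agent are invented, and, as noted above, other crawlers may interpret the directive differently or ignore it.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay that applies to `user_agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
    print(f"Seconds to wait between requests: {delay}")
```

A polite crawler would sleep for at least that many seconds between requests to the same host, and fall back to its own conservative default when the function returns `None`.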
I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn't be contributing significant amounts of traffic if they're really only pulling data once a day.