686
submitted 1 week ago* (last edited 6 days ago) by Pro@programming.dev to c/Technology@programming.dev

Comments

Source.

[-] who@feddit.org 18 points 6 days ago* (last edited 6 days ago)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
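For illustration, here is a minimal sketch of the HTTP 429 approach: a per-client sliding-window counter that answers 429 with a `Retry-After` header once a client exceeds its budget. Nothing here comes from the site being discussed; the class name, the 30-requests-per-60-seconds budget, and the `client_id` key are all assumptions picked for the example.

```python
import math
import time
from collections import defaultdict, deque

# Hypothetical budget: at most 30 requests per client per 60 seconds.
MAX_REQUESTS = 30
WINDOW_SECONDS = 60


class SlidingWindowLimiter:
    """Tracks recent request times per client and decides 200 vs 429."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def check(self, client_id, now=None):
        """Return (status_code, headers) for a request from client_id."""
        now = time.monotonic() if now is None else now
        recent = self.hits[client_id]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            # Tell the client how long until the oldest request expires.
            wait = max(1, math.ceil(self.window - (now - recent[0])))
            return 429, {"Retry-After": str(wait)}
        recent.append(now)
        return 200, {}
```

A well-behaved crawler that receives the 429 can sleep for the number of seconds in `Retry-After` and try again, which is the rate-limit signal that robots.txt has no standard way to carry.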

[-] Redjard@lemmy.dbzer0.com 9 points 6 days ago

Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers ...
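As a rough sketch of what honoring it looks like on the crawler side, Python's standard `urllib.robotparser` can read the directive. The robots.txt content, the `ExampleBot` user agent, and the one-second fallback below are made up for the example, and crawlers that support the directive do not all interpret the value the same way.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_for(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay in seconds for user_agent, or default if unset."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no directive applies
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    # A polite crawler would sleep this long between requests to the host.
    print(crawl_delay_for(ROBOTS_TXT, "ExampleBot"))
```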

[-] who@feddit.org 2 points 5 days ago* (last edited 5 days ago)

Crawl-delay

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.

[-] S7rauss@discuss.tchncs.de 4 points 6 days ago

I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt, they shouldn't be contributing significant amounts of traffic if they're really only pulling data once a day.

this post was submitted on 17 Aug 2025
686 points (100.0% liked)
