submitted 1 week ago* (last edited 6 days ago) by Pro@programming.dev to c/Technology@programming.dev

Source.

[-] Gullible@sh.itjust.works 103 points 6 days ago

I really feel like scrapers should have been outlawed, or at least acted against, at some point.

[-] floofloof@lemmy.ca 81 points 6 days ago

But they bring profits to tech billionaires. No action will be taken.

[-] BodilessGaze@sh.itjust.works 13 points 6 days ago

No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that's dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There's nothing we can do legally about Chinese scrapers.

[-] mormund@feddit.org 4 points 6 days ago

I thought Anthropic was also very abusive with their scraping?

[-] BodilessGaze@sh.itjust.works 1 point 5 days ago

Maybe to others, but not to us. Or if they are, they're very good at masking their traffic.

[-] programmer_belch@lemmy.dbzer0.com 39 points 6 days ago

I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

The problem with AI scrapers and bots is their scale: thousands of requests that the origin server cannot handle, resulting in slow traffic for everyone else.
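A once-a-day checker like the one described is harmless largely because of that scale difference: one page, one request per day. The feed-building half can be sketched with only the standard library (the feed fields and URLs below are hypothetical examples, not the actual tool):

```python
# Minimal sketch: turn scraped (title, url) pairs into an RSS 2.0 feed.
# All names and URLs here are hypothetical.
from xml.sax.saxutils import escape

def build_rss(title: str, link: str, items: list[tuple[str, str]]) -> str:
    """Render a minimal RSS 2.0 feed from (chapter title, chapter url) pairs."""
    entries = "\n".join(
        f"    <item><title>{escape(t)}</title><link>{escape(u)}</link></item>"
        for t, u in items
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0">\n'
        "  <channel>\n"
        f"    <title>{escape(title)}</title>\n"
        f"    <link>{escape(link)}</link>\n"
        f"{entries}\n"
        "  </channel>\n"
        "</rss>"
    )
```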

[-] S7rauss@discuss.tchncs.de 31 points 6 days ago

Does your tool respect the site’s robots.txt?

[-] who@feddit.org 18 points 6 days ago* (last edited 6 days ago)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like GP describes. HTTP 429 would be a better fit.
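On the client side, honoring 429 is straightforward. A sketch assuming a server that may send a Retry-After header (the function names are illustrative, not from any particular library):

```python
import time
import urllib.request
from urllib.error import HTTPError

def retry_delay(headers, attempt: int) -> int:
    """Honor the server's Retry-After header; otherwise back off exponentially."""
    value = headers.get("Retry-After")
    return int(value) if value else 2 ** attempt

def fetch_politely(url: str, max_tries: int = 3) -> bytes:
    """Fetch a URL, waiting as instructed whenever the server answers 429."""
    for attempt in range(max_tries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code != 429:
                raise
            # The server asked us to slow down; sleep before retrying.
            time.sleep(retry_delay(err.headers, attempt))
    raise RuntimeError("gave up after repeated 429 responses")
```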

[-] Redjard@lemmy.dbzer0.com 9 points 6 days ago

Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers ...
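For what it's worth, Python's standard-library robots.txt parser does understand the directive, so honoring it costs a crawler almost nothing. A small self-contained check (the delay value and bot name are arbitrary examples):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking crawlers to wait 10s between requests.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler would sleep this many seconds between requests.
delay = parser.crawl_delay("my-rss-bot")
```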

[-] who@feddit.org 2 points 5 days ago* (last edited 5 days ago)

Crawl-delay

It's a nonstandard extension without consistent semantics or wide support, but I suppose it's good to know about anyway. Thanks for mentioning it.

[-] S7rauss@discuss.tchncs.de 4 points 6 days ago

I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they're not ignoring robots.txt, they shouldn't be contributing a significant amount of traffic if they're really only pulling data once a day.

Yes, it just downloads the HTML of one page and formats the data into the RSS format with only the information I need.

[-] Gullible@sh.itjust.works 6 points 6 days ago* (last edited 6 days ago)

Seems like an API request would be preferable for the site you're checking. I don't imagine they're unhappy with the traffic if they haven't blocked it yet.

[-] JPAKx4 2 points 6 days ago

I mean, if it's a CMS site there may not be an API; this would be the only solution in that case.

this post was submitted on 17 Aug 2025
686 points (100.0% liked)
