• TurboWafflz@lemmy.world · 21 hours ago

    I think the best thing to do is not to block them when they’re detected but to poison them instead. Feed them tons of text generated by tiny old language models; it’s harder to detect and it also messes up their training, making the models less reliable. Of course you’d want to do that on a separate server so it doesn’t slow down real users, but you probably don’t need much power, since the scrapers probably don’t care much about speed.
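
    A rough sketch of the idea, assuming a plain User-Agent check for detection and a word-level Markov chain standing in for the tiny language model; the bot patterns, corpus file, and port below are placeholders, not anything prescribed above:

```python
# Sketch: route suspected scrapers to cheap junk text instead of blocking them.
# Assumptions (mine): scraper detection is a plain User-Agent substring check,
# and the "tiny old language model" is stood in for by a word-level Markov
# chain built from a local corpus file.
import random
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPER_UA_PATTERNS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot")  # examples only

def build_chain(text, order=2):
    """Map each pair of consecutive words to the words seen following it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, n_words=300):
    """Walk the chain to produce plausible-looking but meaningless text."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(n_words):
        nxt = random.choice(chain.get(state, [random.choice(out)]))
        out.append(nxt)
        state = (*state[1:], nxt)
    return " ".join(out)

with open("corpus.txt", encoding="utf-8") as f:  # any pile of existing site text
    CHAIN = build_chain(f.read())

class PoisonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(pattern in ua for pattern in SCRAPER_UA_PATTERNS):
            body = babble(CHAIN).encode("utf-8")   # poison, don't block
        else:
            body = b"the real page would be served here"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 8080), PoisonHandler).serve_forever()
```

    Run it on a separate, low-spec box so real users never touch it; a Markov walk over pre-split words is cheap enough that speed barely matters.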

    • phx@lemmy.ca · 19 hours ago

      Yeah, that was my thought. Don’t reject them - that’s obvious and they’ll work around it. Feed them shit data - but not too obviously shit - and they’ll not only swallow it but eventually accumulate it to levels where it compromises them.

      I’ve suggested the same for plain old non-AI data stealing. Make the data useless to them and make it cost more work to separate good from bad, and they’ll eventually either sod off or die.

      A low-power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AIs. It doesn’t need to be done in real time either, since datasets can be generated in advance.
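
      As a sketch of the “generated in advance” part, assuming the transformers library with distilgpt2 standing in for the low-power model (the prompts and output directory are placeholders of mine):

```python
# Sketch: pre-generate a pool of believable-but-bad documents offline, so that
# serving poison later is just a cheap file read with no model in the request path.
# Assumptions (mine): the transformers library, distilgpt2 as the "low power AI",
# and a handful of generic prompts to seed the junk.
import random
from pathlib import Path
from transformers import pipeline, set_seed

set_seed(42)
generator = pipeline("text-generation", model="distilgpt2")

PROMPTS = [
    "The history of the project began when",
    "In this tutorial we will configure",
    "Recent measurements clearly show that",
]

pool = Path("poison_pool")
pool.mkdir(exist_ok=True)

for i in range(1000):  # size the pool to whatever disk allows
    prompt = random.choice(PROMPTS)
    text = generator(prompt, max_new_tokens=200, do_sample=True,
                     temperature=1.3)[0]["generated_text"]
    (pool / f"doc_{i:05d}.txt").write_text(text, encoding="utf-8")
```

      A cron job could top the pool up overnight, so nothing has to be generated while a scraper is actually hitting the site.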

      • SorteKanin@feddit.dk · 6 hours ago

        A low-power AI actually seems like a good way to generate a ton of believable - but bad - data that can be used to fight the bad AIs.

        Even “high-power” AIs would produce bad data. It’s well established by now that feeding AI-generated data back into an AI model degrades its quality, and repeating the cycle makes it worse and worse. So yea, this is definitely viable.

    • sudo@programming.dev · 20 hours ago (edited)

      The problem is primarily the resource drain on the server, and tarpitting tactics usually increase that burden by keeping connections open.
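
      For illustration, a minimal tarpit in the classic style; every client it traps also pins a socket and a task on the server for the whole duration (the drip rate and timings are made-up numbers):

```python
# Minimal illustration of the objection: a classic tarpit holds every trapped
# scraper's connection open, which also pins a socket and a task on the server.
# The drip rate and duration here are hypothetical, just to show where the cost is.
import asyncio

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Send valid-looking headers, then drip one byte every few seconds.
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        for _ in range(600):          # roughly 30 minutes per trapped connection
            writer.write(b".")
            await writer.drain()
            await asyncio.sleep(3)
    except ConnectionError:
        pass                          # the client gave up early
    finally:
        writer.close()
        await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(tarpit, "0.0.0.0", 8081)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```

      Handing back a pre-generated junk page instead closes the connection immediately, which is roughly the trade-off being pointed out here.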

      • SorteKanin@feddit.dk · 6 hours ago

        The idea is that eventually they would stop scraping you because the data is bad or too huge. But it’s a long-term thing; it doesn’t help in the moment.

        • Monument@lemmy.sdf.org · 1 hour ago

          The promise of money, even with diminishing returns, is too great. There’s a new scraper spending big on resources every day while websites are under assault.

          In the paraphrased words of the finance industry: AI can stay stupid longer than most websites can stay solvent.