Tech
Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm – Slashdot
Multiple AI companies are ignoring Robots.txt files meant to block the scraping of web content for generative AI systems, reports Reuters — citing a warning sent to publisher by content licensing startup TollBit.
TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers’ websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content… It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.
“What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites,” TollBit wrote. “The more publisher logs we ingest, the more this pattern emerges.”
The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers). “Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry.”
Reuters also notes another threat facing news sites:
Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google’s AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.