Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm – Slashdot

Published

June 23, 2024

Admin

Multiple AI companies are ignoring robots.txt files meant to block the scraping of web content for generative AI systems, Reuters reports, citing a warning sent to publishers by content licensing startup TollBit.

TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers’ websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content… It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.

“What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites,” TollBit wrote. “The more publisher logs we ingest, the more this pattern emerges.”
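For readers unfamiliar with the protocol, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name `ExampleAIBot`, and the `example.com` URLs are all made up for illustration; nothing here reflects how TollBit or any named company actually works. The check runs entirely on the crawler's side, which is why a crawler can simply skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. "ExampleAIBot" is an
# invented crawler name used only for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story.html"
    # The invented AI bot is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", story))
    # Any other crawler falls under the "*" rules and may fetch the story.
    print(is_allowed(ROBOTS_TXT, "OtherBot", story))
```

A crawler that honors the protocol would call something like `is_allowed` before each request and skip any URL for which it returns `False`. The publisher's server does not enforce the rules by itself.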

The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers): “Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry.”

Reuters also notes another threat facing news sites:
Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google’s AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
