Exclusive: Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says

FILE PHOTO: AI (Artificial Intelligence) letters and robot hand miniature in this illustration taken, June 23, 2023. REUTERS/Dado Ruvic/Illustration/File Photo

By Katie Paul

(Reuters) - Multiple artificial intelligence companies are circumventing a common web standard used by publishers to block the scraping of their content for use in generative AI systems, content licensing startup TollBit has told publishers.

A letter to publishers seen by Reuters on Friday, which does not name the AI companies or the publishers affected, comes amid a public dispute between AI search startup Perplexity and media outlet Forbes involving the same web standard and a broader debate between tech and media firms over the value of content in the age of generative AI.

The business media publisher publicly accused Perplexity of plagiarizing its investigative stories in AI-generated summaries without citing Forbes or asking for its permission.

A Wired investigation published this week found Perplexity likely bypassing efforts to block its web crawler via the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard meant to determine which parts of a site are allowed to be crawled.
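For readers unfamiliar with the standard, the following is a minimal sketch of how a compliant crawler might consult robots.txt before fetching a page. It uses Python's standard-library `urllib.robotparser`. The rules, the crawler names `ExampleAIBot` and `OtherBot`, and the `publisher.example` URLs are hypothetical and are not taken from any publisher or company mentioned in this story.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only.
# It blocks one named crawler from the whole site and
# blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://publisher.example/news/story"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", story))
    print(is_allowed(ROBOTS_TXT, "OtherBot", story))
```

Nothing in this check is enforced by the website itself. A crawler that never runs it, or that ignores the answer, can still request the page.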

Perplexity declined a Reuters request for comment on the dispute.

The News Media Alliance, a trade group representing more than 2,200 U.S.-based publishers, expressed concern about the impact that ignoring "do not crawl" signals could have on its members.

"Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry," said Danielle Coffey, president of the group.

TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them.

The company tracks AI traffic to the publishers' websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content.

For example, publishers may opt to set higher rates for "premium content, such as the latest news or exclusive insights," the company says on its website.

It says it had 50 websites live as of May, though it has not named them.

According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt.

TollBit said its analytics indicate "numerous" AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.

"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."

The robots.txt protocol was created in the mid-1990s as a way to avoid overloading websites with web crawlers. Although there is no clear legal enforcement mechanism, historically there has been widespread compliance on the web and some groups - including the News Media Alliance - say there may yet be legal recourse for publishers.
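Because compliance is voluntary, spotting non-compliance generally comes down to comparing server logs against the published rules. The sketch below shows one way that comparison might look. It is not TollBit's method, which the letter does not describe. The log records, the crawler names and the `count_violations` helper are all hypothetical.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and request log, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Each record is (user agent, requested path).
SAMPLE_REQUESTS = [
    ("ExampleAIBot", "/robots.txt"),
    ("ExampleAIBot", "/news/story-1"),
    ("ExampleAIBot", "/news/story-2"),
    ("OtherBot", "/news/story-1"),
    ("OtherBot", "/private/draft"),
]


def count_violations(robots_txt, requests):
    """Count requests per user agent that the rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = Counter()
    for user_agent, path in requests:
        # Reading robots.txt itself is always expected, so skip it.
        if path == "/robots.txt":
            continue
        if not parser.can_fetch(user_agent, path):
            violations[user_agent] += 1
    return violations


if __name__ == "__main__":
    for agent, hits in count_violations(SAMPLE_ROBOTS, SAMPLE_REQUESTS).items():
        print(agent, hits)
```

A check like this relies on the user agent a crawler reports about itself, so a crawler that disguises its identity would not be flagged.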

More recently, robots.txt has become a key tool publishers have used to block tech companies from ingesting their content free of charge for use in generative AI systems that can mimic human creativity and instantly summarize articles.

The AI companies use the content both to train their algorithms and to generate summaries of real-time information.

Some publishers, including the New York Times, have sued AI companies for copyright infringement over those uses. Others are signing licensing agreements with the AI companies open to paying for content, although the sides often disagree over the value of the materials. Many AI developers argue they have broken no laws in accessing them for free.

Thomson Reuters, the owner of Reuters News, is among those that have struck deals to license news content for use by AI models.

Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries.

If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.

(Reporting by Katie Paul in New York; Editing by Kenneth Li, Jamie Freed and Frances Kerry)
