Monday, July 22, 2024

OpenAI and Anthropic ignore rule that prevents bots from scraping web content


The world’s two largest AI startups are ignoring media publishers’ requests to stop scraping their web content for free model training data, Business Insider has learned.

OpenAI and Anthropic were found to either ignore or circumvent an established web rule called robots.txt, which is meant to prevent automated scraping of websites.

TollBit, a startup that aims to broker paid licensing deals between publishers and AI companies, found that many AI companies were behaving this way and notified some major publishers in a letter on Friday, which Reuters reported earlier. The letter did not include the names of any of the AI companies accused of circumventing the rule.

OpenAI and Anthropic have publicly stated that they respect robots.txt and any blocks that target their own web crawlers, such as GPTBot and ClaudeBot.

However, according to TollBit’s findings, such blocks are not being respected as claimed. AI companies, including OpenAI and Anthropic, are choosing to simply “bypass” robots.txt in order to retrieve or scrape all the content from a particular website or page.

An OpenAI spokeswoman declined to comment beyond directing BI to a company blog post from May, in which the company says it takes web crawler permissions “into account every time we train a new model.” An Anthropic spokesperson did not respond to emails seeking comment.

Robots.txt is a small piece of code that has been used since the late 1990s as a way for websites to tell bot crawlers that they don’t want their data scraped and collected. It has been widely accepted as one of the unofficial rules underpinning the web.
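As a rough illustration of how such a block works, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a robots.txt file the way a compliant crawler would. The file contents, the `example.com` URL, the `ExampleSearchBot` name and the `allowed_agents` helper are all hypothetical and are not taken from any publisher or from TollBit's findings; only the crawler names GPTBot and ClaudeBot come from the companies' public statements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: block the two
# AI crawlers named in the article, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, for each user agent, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "ExampleSearchBot"],  # last name is made up
    "https://example.com/news/story.html",        # placeholder URL
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The check runs entirely on the crawler's side. Nothing in the file technically stops a bot that skips this step, which is why the rule depends on crawlers choosing to honor it.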


With the advent of generative AI, startups and technology companies are racing to build the most powerful AI models. The key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the informal conventions that support the use of this code.

OpenAI is behind the popular chatbot ChatGPT. The company’s largest investor is Microsoft. Anthropic is behind another relatively popular chatbot, Claude. Its largest investor is Amazon.

Both chatbots provide answers to user questions in a human tone. Such answers are only possible because the AI models on which they are built include vast amounts of written text and data pulled from the web, most of which is under copyright or owned by its creators.

Several tech companies argued last year before the US Copyright Office that nothing on the web should be considered subject to copyright when it comes to AI training data.

OpenAI has some deals with publishers to access content, including Axel Springer, which owns BI. The US Copyright Office is set to update its guidance on artificial intelligence and copyright later this year.

Are you a tech employee or someone else who has a tip or insight to share? Contact Callie Hayes at khais@businessinsider.com or on the secure messaging app Signal at +1-949-280-0267. Reach out using a non-work device.
