GPTBot – OpenAI’s Web Crawler

The recent documentation released by OpenAI on how to configure a website's access for GPTBot via robots.txt has sparked a heated debate within the IT community. The crux of the issue lies in the potential use of web pages crawled by GPTBot to improve future models. While some view this development as a positive step towards advancing AI technology, others are skeptical about the consequences of allowing crawling.

On the one hand, some experts argue that allowing GPTBot to crawl websites can lead to a more accurate representation of web content in AI models. By incorporating real-world data from a wide range of websites, models can return more accurate and relevant results. This could have significant implications for industries such as search, e-commerce, and content creation, where accurate and relevant web content is crucial.

On the other hand, there are concerns that allowing GPTBot to crawl websites without proper authorization amounts to a violation of intellectual property rights. Many websites rely on unique content to attract and retain users, and letting AI models scrape this content without permission can result in financial losses for those content creators. There is also concern that the configuration information for GPTBot was released only after the models had already been trained.

However, the ability to limit GPTBot access via robots.txt is definitely a step in the right direction: https://platform.openai.com/docs/gptbot

GPTBot at work, reading DEVLABS.ninja

GPTBot identifies itself with the following user agent:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
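
If you want to spot GPTBot requests in your own application or access logs, matching on the user agent token is usually sufficient, since the full string may change between versions. Below is a minimal sketch in Python; the is_gptbot helper is purely illustrative and not part of any official API:

# Minimal, case-insensitive check for the GPTBot user agent token (illustrative only).
GPTBOT_TOKEN = "gptbot"

def is_gptbot(user_agent: str) -> bool:
    # Match on the token rather than the full string, which may vary between versions.
    return GPTBOT_TOKEN in user_agent.lower()

ua = ("Mozilla/5.0 AppleWebKit/537.36 "
      "(KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(is_gptbot(ua))  # True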

To completely block GPTBot's access, one can use the following robots.txt:

User-agent: GPTBot
Disallow: /
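
To double-check that these rules actually block the crawler, one can evaluate them the same way a well-behaved bot would. Here is a small sketch using Python's standard urllib.robotparser module; the example.com URLs are placeholders for your own site:

from urllib import robotparser

rp = robotparser.RobotFileParser()
# Feed the rules directly; in practice set_url() and read() would fetch your live robots.txt.
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is disallowed everywhere, other crawlers are unaffected by this rule.
print(rp.can_fetch("GPTBot", "https://example.com/any-page"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))  # True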