The web publishing platform Medium has announced that it is blocking OpenAI’s GPTBot, the crawler that scrapes web content for use in training AI models. The move reflects a growing trend among platforms weighing a unified response to what many see as the exploitation of their content.
Medium joins the likes of CNN, The New York Times, and numerous other media outlets (though not TechCrunch, for now) in adding a rule for “User-agent: GPTBot” to its robots.txt file. This file, present on many websites, tells web crawlers and indexers which parts of a site they may access. If you prefer not to have your content indexed by Google, for example, you can say so in your robots.txt file.
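For readers unfamiliar with the mechanics: blocking a crawler this way amounts to adding two lines to robots.txt, “User-agent: GPTBot” and “Disallow: /”, and trusting the crawler to honor them. The short Python sketch below, using the standard library’s urllib.robotparser, illustrates how a well-behaved crawler would check that rule before fetching a page; the example.com URL is a placeholder, not Medium’s actual file.

```python
import urllib.robotparser

# Minimal sketch of how a compliant crawler consults robots.txt before
# scraping a page. The domain below is a placeholder, not Medium's file.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A site blocking OpenAI's crawler would carry a rule like:
#   User-agent: GPTBot
#   Disallow: /
# in which case can_fetch() returns False for that user agent.
print(rp.can_fetch("GPTBot", "https://example.com/some-article"))
```

Nothing in this check is enforced by the site itself; a crawler that ignores robots.txt can still fetch the page, which is exactly the limitation discussed below.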
However, AI companies go beyond mere indexing: they scrape data to use as source material for their models. The practice has drawn criticism, including from Medium CEO Tony Stubblebine, who argues that generative AI in its current form does not benefit the internet as a whole, because AI companies profit from writers’ content without consent, compensation, or credit, effectively exploiting writers in order to spam internet readers.
In response, Medium has denied OpenAI’s scraper access, but honoring robots.txt is voluntary on the crawler’s side, so the measure may not deter determined spammers and unethical AI platforms. More aggressive countermeasures exist, such as serving fake content to poison data collection, but these could invite legal battles and escalating costs.
Nevertheless, there is optimism. Stubblebine mentions Medium’s efforts to form a coalition with other platforms to address the issue of fair use in the AI era. He has engaged in discussions with major organizations (redacted for privacy reasons) that are potential allies in this effort. These organizations, although unwilling to publicly collaborate at this stage, face similar challenges and are considering the potential benefits of working together.
The main obstacles to forming such a coalition are the complexity of multi-industry partnerships and the unsettled state of intellectual property and copyright in the context of AI. Defining IP and copyright in this rapidly shifting landscape raises legal and ethical dilemmas, and companies are torn between protecting their IP and leveraging AI for their own benefit.
Stubblebine suggests it may take a prominent internet entity like Wikipedia to make the first bold move and set a precedent. Some organizations are constrained by business concerns, but others, unburdened by them, might be willing to take the lead. Until that leadership emerges, platforms remain at the mercy of web crawlers that respect or disregard their consent as they see fit.