There is a growing protectionist trend of large platforms restricting access to their data more tightly. It has come to the forefront recently as large language models, such as the GPT models behind ChatGPT, have become a mainstream success story. Platforms are concerned that their data is being used to fuel this burgeoning industry without compensation. If it leads to more closed behaviour on the web, that will be a negative trend.
In June, Reddit raised prices for its API. Reddit’s owners are planning to take the company public, and they are looking to boost revenue from the social news site before they do. Reddit founder and CEO Steve Huffman told The New York Times, “The Reddit corpus of data is really valuable, but we don’t need to give all of that value to some of the largest companies in the world for free.”
This has led to an ongoing protest by volunteer moderators that has caused mass disruption on the platform. Huffman has said the business will not back down, telling The Associated Press, “Protest and dissent is important. The problem with this one is it’s not going to change anything because we made a business decision that we’re not negotiating on.” It has reached an impasse.
Yesterday, Elon Musk announced that Twitter is putting a limit on how many posts you can read per day. This is what he said in a tweet:
To address extreme levels of data scraping & system manipulation, we’ve applied the following temporary limits:
- Verified accounts are limited to reading 6000 posts/day
- Unverified accounts to 600 posts/day
- New unverified accounts to 300/day
Later, Musk tweeted that the limits had been raised to 10,000, 1,000, and 500 posts/day, respectively.
“Several hundred organizations (maybe more) were scraping Twitter data extremely aggressively, to the point where it was affecting the real user experience,” Musk said.
It sounds strange that this kind of scraping is being done at scale; it is an inefficient way to gather that kind of data. Even if Twitter is worried that some companies are scraping webpages to avoid paying for access to its API, restricting usage for regular users seems like cutting off your nose to spite your face. Usually, businesses want to encourage people to use their service as much as possible, because that is how they make money!
It is hard to tell how this will play out. It is a battle to monetize a new frontier. The data holders want a slice of the pie if their platforms are prime sources of training data that teach language models to interact in a more human-like fashion. It could also be that the bots are being used opportunistically as cover to raise prices for API access. Blame the bots! The truth is that it is hard to know the reality unless you are behind the scenes.
Users suffer because they are caught in the middle. The market for third-party apps shrinks, and it can become untenable for some small businesses. That is bad for consumer choice.
Web standards need to adapt. At the moment, AI bots appear to index pages much like search engine crawlers do, following the rules in a site’s robots.txt file. As far as I know, there is no explicit way to grant or deny permission to use data for training language models; you may have to explicitly block each bot to opt out. For example, OpenAI has published instructions for blocking its bot.
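As a rough sketch of what that opt-out looks like in practice (assuming OpenAI’s published crawler user-agent, GPTBot, and standard robots.txt syntax), a site could add something like this to its robots.txt:

```
# Block OpenAI's crawler from the whole site
User-agent: GPTBot
Disallow: /

# All other crawlers remain unaffected
User-agent: *
Disallow:
```

Note that robots.txt is advisory: well-behaved crawlers honour it, but nothing technically prevents a scraper from ignoring it, which is part of why platforms are reaching for rate limits and paid APIs instead.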
Regulation will likely be required in the long term. The major players are large companies with a big advantage; it will depend on whether they choose to defend their high ground aggressively.
Personally, I don’t see this as an alarming thing. This is a familiar fight. It is just something that we need to figure out.
Open information and commerce have always sat uneasily together. This is a battle over information: who produces it, how you access it, and who gets paid for it. In Reddit’s case, it is galling that data moderated by users for free is being sold at a growing cost, and it will be an interesting test case for how this side of the AI revolution evolves. How this is settled matters because it will shape what the web becomes.
We should try to preserve openness; it is a great strength of the web. There needs to be a viable commercial solution that satisfies business needs. If one is not found, we will need to mitigate the harm through regulation.