
Twitter, Reddit and the AI Data Wars

Twitter and Reddit are now charging for their APIs, and this looks to be the beginning of a battle to control data - one whose impact on AI should not be underestimated.

Twitter and Reddit charging for their APIs has kicked off a battle to control data online

There is a storm brewing online over the control of data and it will have massive repercussions for AI. While the output of ChatGPT and image generators attracts most of the attention, the input data is being almost entirely overlooked. I have tremendous respect for the scientists and coders who have brought AI to life - but the key to Artificial Intelligence and Machine Learning in practice is data. The reason is that technology tends to proliferate. With so much of the open-source community (admirably) working on AI, the actual software and models are not going to remain the key differentiators in this artificial intelligence arms race.


Once the software exists, only those with the requisite data will be able to use it. Elon Musk knows this, as do the Reddit moderators. OK, admittedly neither of these groups may be thinking specifically about data for AI models, but both have recently taken drastic steps to protect data they believe is theirs.


Technology 101


The foundation of any technology lecture is Junk In, Junk Out. To paraphrase that somewhat, the value of a piece of software, from a basic database to the most complex AI, depends enormously on the quality of the data you put in. For many years data on the internet, certainly publicly visible data, was deemed to be free to access. You could see every Tweet (in theory) via the Twitter app, so why not give unlimited bulk access to developers via APIs? Until recently this was not an issue. However, as AI has expanded, the value of data has become far more acutely noticed.


API stands for Application Programming Interface. That might sound technical, but in this context think of an API as a backdoor that gives bulk access to a platform. For example, you could use a Twitter API to retrieve all posts referencing “AI” in one go, rather than browsing the app itself for hours.
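To make "bulk access" concrete, here is a minimal sketch in Python. It does not call Twitter's real API: `search_posts`, the `POSTS` list and the page size are all hypothetical stand-ins that mimic how a typical paginated API hands back results one page at a time.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a platform's posts (made up for illustration).
POSTS = [
    "AI will change search",
    "What I ate for breakfast",
    "New AI image generator released",
    "Football results tonight",
    "Is AI data the new oil?",
]

PAGE_SIZE = 2  # assumed page size, chosen only for this example


def search_posts(query: str, cursor: int = 0) -> Tuple[List[str], Optional[int]]:
    """Hypothetical API endpoint: one page of matching posts plus the next cursor."""
    matches = [p for p in POSTS if query.lower() in p.lower()]
    page = matches[cursor:cursor + PAGE_SIZE]
    has_more = cursor + PAGE_SIZE < len(matches)
    return page, (cursor + PAGE_SIZE if has_more else None)


def fetch_all(query: str) -> List[str]:
    """Keep requesting pages until the API reports there are none left."""
    results: List[str] = []
    cursor: Optional[int] = 0
    while cursor is not None:
        page, cursor = search_posts(query, cursor)
        results.extend(page)
    return results


if __name__ == "__main__":
    for post in fetch_all("AI"):
        print(post)
```

A few lines of code replace hours of scrolling, which is exactly why this kind of access is valuable and why platforms have started to charge for it.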

Twitter Data and APIs


You may have seen the headlines about Twitter, now known as X, limiting how much people can use the app. The gist is that Elon Musk has come to the conclusion that Twitter’s data, in its aggregate form, is a uniquely useful data set. Yes, your Tweet about what you ate for breakfast might be of interest to your followers. However, once bots scrape Twitter data from everybody, it becomes far more useful to see the trends in what thousands of users are eating for breakfast. Or which politicians they like. This approach has been used to predict elections and trade financial markets, and it is vital to many highly profitable sentiment analysis tools. Twitter has always allowed other applications to access its data in bulk via APIs at no or de minimis cost. Now, it has jacked up the price.
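The jump from a single post to an aggregate trend can be sketched in a few lines of Python. The posts and the `FOODS` keyword list below are invented purely for illustration, and real sentiment analysis tools are far more sophisticated than this keyword count.

```python
import re
from collections import Counter
from typing import Iterable

# Made-up sample posts, standing in for data pulled in bulk from a platform.
SAMPLE_POSTS = [
    "Eggs and toast this morning",
    "Just cereal today",
    "toast again, toast forever",
    "Skipped breakfast",
]

# Hypothetical keyword list; a real tool would use a much richer vocabulary.
FOODS = {"toast", "eggs", "cereal", "porridge"}


def breakfast_trends(posts: Iterable[str]) -> Counter:
    """Count how many posts mention each food (at most once per post)."""
    counts: Counter = Counter()
    for post in posts:
        words = set(re.findall(r"[a-z]+", post.lower()))
        counts.update(words & FOODS)
    return counts


if __name__ == "__main__":
    for food, n in breakfast_trends(SAMPLE_POSTS).most_common():
        print(food, n)
```

Any one of these posts is trivia. The count across thousands of users is a data set, and that aggregate is what is being priced.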


Reddit Moderators and APIs


Similarly, the moderators on Reddit have been protesting over recent API price hikes. Reddit has historically allowed third-party apps to access subreddits via APIs at no cost. However, with growing talk of an IPO, Reddit executives are keen to boost revenue. So they have added a significant charge for the use of these APIs, much to the chagrin of the moderators who essentially manage the data. This has forced many apps to shut down, including the hugely popular iOS app Apollo.


It is worth noting that Reddit moderators are not employees of the company. They are generally motivated by a passion for the subreddits they moderate and many dedicate a huge amount of time for no direct reward.

Why Do They Care About Data and the APIs?


It is probably just as simple as wanting to own what they create. I doubt very much Reddit moderators would see themselves on the same side as Twitter but in this context there are similarities. Twitter’s quality data only exists because of Twitter’s infrastructure and user base. Twitter carries the cost, so they do not want other for-profit apps or organizations to profit.


Reddit’s quality data only exists because of the moderators. Yes, there are some technical costs to hosting Reddit, but the role of moderators is staggering. There are more than one million communities on Reddit and 140,000 active subreddits. Each of these has between one and 25 moderators. It is estimated that these moderators save the company millions each year, but in truth, the benefit is far more than just the dollars saved. Facebook, YouTube, and other social media companies have moderators to maintain legal and policy standards (e.g. no violence). The Reddit moderators go further: they ensure subreddits remain on topic and truly cultivate a high level of content.


What Next for Twitter and Reddit APIs?


Neither Elon Musk nor the Reddit mods are keen for others to unduly profit from, or limit the use of, the data they help curate. (There may well be more complexities but this is at least partly the case.) Many AI-powered sentiment analysis tools have had to switch off their Twitter feeds because the data is simply too expensive. Reddit threads that were historically public are now going private to block the paid-for API data harvesting. Some subreddits have been flagged as NSFW (i.e. an adult content warning) in protest. Reddit is now threatening to remove moderators who do not comply. How far this will go is impossible to say, but I believe we are only in the early stages of the AI data battles.


Own The Data, Own The Software


These two examples are just the warning shots in what I expect to turn into a full-on war to control data online. As AI progresses and becomes more accessible, the demand for unique data will skyrocket. The advantage of an AI sentiment analysis tool will no longer be in who can access the software, but who controls the data feed. The same goes for facial recognition, election polling, marketing, and countless other fields that have gotten used to leveraging bulk data online. Soon, those who own the data will essentially own the software.


