The algorithms that underlie modern artificial-intelligence (AI) systems need lots of data on which to train. Much of that data comes from the open web which, unfortunately, makes the AIs susceptible to a type of cyber-attack known as “data poisoning”. This means modifying or adding extraneous information to a training data set so that an algorithm learns harmful or undesirable behaviours. Like a real poison, poisoned data could go unnoticed until after the damage has been done.
Data poisoning is not a new idea. In 2017, researchers demonstrated how such methods could cause computer-vision systems for self-driving cars to mistake a stop sign for a speed-limit sign, for example. But how feasible such a ploy might be in the real world was unclear. Safety-critical machine-learning systems are usually trained on closed data sets that are curated and labelled by human workers—poisoned data would not go unnoticed there, says Alina Oprea, a computer scientist at Northeastern University in Boston.
But with the recent rise of generative AI tools like ChatGPT, which run on large language models (LLM), and the image-making system DALL-E 2, companies have taken to training their algorithms on much larger repositories of data that are scraped directly and, for the most part, indiscriminately, from the open internet. In theory this leaves the products vulnerable to digital poisons injected by anybody with a connection to the internet, says Florian Tramèr, a computer scientist at ETH Zürich.
Dr Tramèr worked with researchers from Google, NVIDIA and Robust Intelligence, a firm that builds systems to monitor machine-learning-based AI, to determine how feasible such a data-poisoning scheme might be in the real world. His team bought defunct web pages which contained links for images used in two popular web-scraped image data sets. By replacing a thousand images of apples (just 0.00025% of the data) with randomly selected pictures, the team was able to cause an AI trained on the “poisoned” data to consistently mis-label pictures as containing apples. Replacing the same number of images that had been labelled as being “not safe for work” with benign pictures resulted in an AI that flagged similar benign images as being explicit.
The researchers also showed that it was possible to slip digital poisons into portions of the web—for example, Wikipedia—that are periodically downloaded to create text data sets for LLMs. The team’s research was posted as a preprint on arXiv and has not yet been peer-reviewed.
A cruel device
Some data-poisoning attacks might just degrade the overall performance of an AI tool. More sophisticated attacks could elicit specific reactions in the system. Dr Tramèr says that an AI chatbot in a search engine, for example, could be tweaked so that whenever a user asks which newspaper they should subscribe to, the AI responds with “The Economist”. That might not sound so bad, but similar attacks could also cause an AI to spout untruths whenever it is asked about a particular topic. Attacks against LLMs that generate computer code have led these systems to write software that is vulnerable to hacking.
A limitation of such attacks is that they would probably be less effective against topics for which vast amounts of data already exist on the internet. Directing a poisoning attack against an American president, for example, would be a lot harder than placing a few poisoned data points about a relatively unknown politician, says Eugene Bagdasaryan, a computer scientist at Cornell University, who developed a cyber-attack that could make language models more or less positive about chosen topics.
Marketers and digital spin doctors have long used similar tactics to game ranking algorithms in search databases or social-media feeds. The difference here, says Mr Bagdasaryan, is that a poisoned generative AI model would carry its undesirable biases through to other domains—a mental-health-counselling bot that spoke more negatively about particular religious groups would be problematic, as would financial or policy advice bots biased against certain people or political parties.
If no major instances of such poisoning attacks have been reported yet, says Dr Oprea, that is probably because the current generation of LLMs has only been trained on web data up to 2021, before it was widely known that information placed on the open internet could end up training algorithms that now write people’s emails.
Ridding training data sets of poisoned material would require companies to know which topics or tasks the attackers are targeting. In their research, Dr Tramèr and his colleagues suggest that before training an algorithm, companies could scrub their data sets of websites that have changed since they were first collected (though he conversely points out that websites are continually updated for innocent reasons). The Wikipedia attack, meanwhile, might be stopped by randomising the timing of the snapshots taken for the data sets. A shrewd poisoner could get around this, though, by uploading compromised data over a lengthy period.
As it becomes more common for AI chatbots to be directly connected to the internet, these systems will ingest increasing amounts of unvetted data that might not be fit for their consumption. Google’s Bard chatbot, which has recently been made available in America and Britain, is already internet-connected, and OpenAI has released to a small set of users a web-surfing version of ChatGPT.
This direct access to the web opens up the possibility of another type of attack known as indirect prompt injection, by which AI systems are tricked into behaving in a certain manner by feeding them a prompt hidden on a web page that the system is likely to visit. Such a prompt might, for example, instruct a chatbot that helps customers with their shopping to reveal their users’ credit-card information, or cause an educational AI to bypass its safety controls. Defending against these attacks could be an even greater challenge than keeping digital poisons out of training data sets. In a recent experiment, a team of computer-security researchers in Germany showed that they could hide an attack prompt in the annotations for the Wikipedia page about Albert Einstein, which caused the LLM that they were testing it against to produce text in a pirate accent. (Google and OpenAI did not respond to a request for comment.)
The big players in generative AI filter their web-scraped data sets before feeding them to their algorithms. This could catch some of the malicious data. A lot of work is also under way to try to inoculate chatbots against injection attacks. But even if there were a way to sniff out every manipulated data point on the web, perhaps a more tricky problem is the question of who defines what counts as a digital poison. Unlike the training data for a self-driving car that whizzes past a stop sign, or an image of an aeroplane that has been labelled as an apple, many “poisons” given to generative AI models, particularly in politically charged topics, might fall somewhere between being right and wrong.
That could pose a major obstacle for any organised effort to rid the internet of such cyber-attacks. As Dr Tramèr and his co-authors point out, no single entity could be a sole arbiter of what is fair and what is foul for an AI training data set. One party’s poisoned content is, for others, a savvy marketing campaign. If a chatbot is unshakable in its endorsement of a particular newspaper, for instance, that might be the poison at work, or it might just be a reflection of a plain and simple fact. ■
Curious about the world? To enjoy our mind-expanding science coverage, sign up to Simply Science, our weekly subscriber-only newsletter.
This article appeared in the Science & technology section of the print edition under the headline “Digital poisons”