Yes, AI models can get worse over time
When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 only correctly labeled 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.
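The arithmetic behind that headline figure is easy to reproduce in outline: give the model a list of known primes, ask whether each one is prime, and compare its yes/no answers against a ground-truth primality check. The Python sketch below shows one way such a run might be scored. It is only an illustration of the general approach, not the study's actual code: the `query_model` helper in the commented usage is hypothetical, and the prompt wording is an assumption rather than the exact prompt the researchers used.

```python
# Minimal sketch of scoring a prime-labeling evaluation, assuming the model's
# raw yes/no answers have already been collected by some query step.
from sympy import isprime  # ground-truth primality check


def score_prime_labels(numbers, model_answers):
    """Return the fraction of answers whose yes/no label matches true primality."""
    correct = 0
    for n, answer in zip(numbers, model_answers):
        predicted_prime = answer.strip().lower().startswith("yes")
        if predicted_prime == isprime(n):
            correct += 1
    return correct / len(numbers)


# Hypothetical usage: 500 primes, all of which a reliable model should label "yes".
# primes = [...]  # the list of primes used in the evaluation
# answers = [query_model(f"Is {n} a prime number? Answer yes or no.") for n in primes]
# print(f"accuracy: {score_prime_labels(primes, answers):.1%}")
```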
The drastic shift in GPT-4’s performance was highlighted in a buzzy preprint study released last month by three computer scientists: two at Stanford University and one at the University of California, Berkeley. The researchers ran tests on both GPT-4 and its predecessor, GPT-3.5, in March and June. They found lots of differences between the two AI models—and also across each one’s output over time. The changes in GPT-4’s behavior over just a few months were particularly striking.
Across two tests, including the prime number trials, the June GPT-4 answers were much less verbose than the March ones. Specifically, the June model became less inclined to explain itself. It also developed new quirks. For instance, it began to append accurate (but potentially disruptive) descriptions to snippets of computer code that the scientists asked it to write. On the other hand, the model seemed to get a little safer; it declined to answer more questions and provided fewer potentially offensive responses. For example, the June version of GPT-4 was less likely to provide a list of ideas for how to make money by breaking the law, offer instructions for how to make an explosive or justify sexism or racism. It was less easily manipulated by “jailbreak” prompts meant to evade content moderation safeguards. It also seemed to improve slightly at solving a visual reasoning problem. [Continue reading…]