Recent observations from users, and now from researchers, suggest that ChatGPT, the well-known artificial intelligence (AI) chatbot developed by OpenAI, may be showing signs of performance degradation. However, the reasons behind these observed changes remain a matter of debate and speculation.
Last week, a study by researchers from Stanford University and UC Berkeley, posted to the arXiv preprint server, revealed noticeable differences in the responses of GPT-4 and its predecessor, GPT-3.5, over the few months since GPT-4’s initial debut on March 14.
A decrease in accurate answers
One of the most notable findings was GPT-4’s reduced accuracy on complex math questions. For example, while the model answered questions about whether large numbers were prime with a high success rate (97.6 percent) in March, its accuracy on the same prompts plummeted to just 2.4 percent in June.
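The study’s exact evaluation harness is not described here, but the kind of measurement being reported can be sketched roughly as follows. This is a hypothetical example, in which query_model stands in for whatever API call was actually used and sympy supplies the ground-truth answer; a score computed this way over a fixed set of large numbers is the sort of figure behind the drop from 97.6 to 2.4 percent.

    # Hypothetical sketch: scoring a model's yes/no answers about primality
    # against ground truth. query_model() is a stand-in for the actual API call;
    # sympy.isprime supplies the reference answer.
    from sympy import isprime

    def primality_accuracy(numbers, query_model):
        """Return the fraction of numbers the model classifies correctly."""
        correct = 0
        for n in numbers:
            reply = query_model(f"Is {n} a prime number? Answer yes or no.")
            model_says_prime = reply.strip().lower().startswith("yes")
            if model_says_prime == isprime(n):
                correct += 1
        return correct / len(numbers)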
The study also pointed out that while older versions of the bot provided detailed explanations for their answers, the latest iterations seemed more restrained, often skipping step-by-step solutions even when explicitly requested. Interestingly, during the same period, GPT-3.5 showed improved capabilities in tackling basic math problems, though it still struggled with more complicated code generation tasks.
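For context, the kind of explicit request mentioned above usually amounts to a chain-of-thought instruction appended to the question. A minimal illustration, where the wording and the number are ours rather than the study’s:

    # Illustrative only: an explicit request for step-by-step reasoning,
    # of the kind the study reports being skipped more often in June.
    question = "Is 10007 a prime number?"
    prompt = (
        f"{question}\n"
        "Think step by step and explain your reasoning before giving a final yes/no answer."
    )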
These findings have fueled online debate, particularly among regular ChatGPT users who have long wondered whether the program has been quietly “neutered.” Many have taken to platforms like Reddit to share their experiences, with some questioning whether GPT-4’s performance is really deteriorating or whether users are simply gaining more insight into the system’s inherent limitations. Some users reported instances where the AI ignored a request to restructure text and instead produced invented, fictional content. Others highlighted the model’s struggles with basic problem-solving tasks, spanning both math and coding.
Changes in coding ability, speculation and more
The research team also dug into GPT-4’s coding capabilities, which seem to have declined. When the model was tested against problem statements from the online learning platform LeetCode, only 10 percent of the generated code met the platform’s guidelines. This represented a significant drop from the 50 percent success rate observed in March.
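One plausible reading of “met the platform’s guidelines” is that a model’s answer could be taken as-is and run. A rough, hypothetical sketch of such a check, not the study’s actual harness, might strip any Markdown fencing from the response and verify that what remains at least compiles as Python; looks_directly_usable below is an illustrative helper of our own naming.

    # Hypothetical sketch of a "directly usable code" check: remove any
    # Markdown fences the model wrapped around its answer, then verify the
    # remaining text compiles as Python. A real harness would also run the
    # platform's test cases; this only illustrates the first gate.
    import re

    def looks_directly_usable(model_output: str) -> bool:
        code = re.sub(r"^```\w*\s*$", "", model_output.strip(), flags=re.MULTILINE)
        try:
            compile(code, "<generated>", "exec")
            return True
        except SyntaxError:
            return False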
OpenAI’s approach to updating and refining its models has always been somewhat opaque, leading users and researchers to speculate about the changes being made behind the scenes. Amid global concerns and ongoing legislative efforts around AI regulation and ethical use, transparency is increasingly important both to government regulators and to everyday users of the growing number of AI-based products.
While the model’s responses seemed to lack the depth and rationale seen in previous versions, the recent study noted some positive developments: GPT-4 showed improved resistance to certain types of attacks and a reduced propensity to respond to malicious prompts.
Peter Welinder, OpenAI’s VP of Product, addressed the public’s concerns more than a week before the study was released, stating that GPT-4 has not been made “dumber.” He suggested that as people use ChatGPT more heavily, they may simply become more attuned to its limitations.
While the research offers valuable insights, it also raises more questions than it answers. The dynamic nature of AI models, combined with the proprietary nature of their development, means users and researchers often have to navigate a landscape of uncertainty. As AI continues to shape the future of technology and communications, the call for transparency and accountability is likely to get louder.