Date: 2023-07-20

Researchers have raised concerns over an alarming deterioration in the quality of responses from OpenAI’s latest large language model, GPT-4, which powers the highly popular ChatGPT tool.

Tom’s Hardware reports that three respected Stanford and UC Berkeley academics recently published a research paper titled “How is ChatGPT’s behavior changing over time?” describing how the language model’s response quality had declined.

The team compared the March 2023 and June 2023 versions of GPT-4 and its predecessor, GPT-3.5, across four main categories:

Solving math problems

Answering sensitive questions

Code generation

Visual reasoning

One of the researchers, Matei Zaharia, highlighted on Twitter how GPT-4’s math solving had worsened.

Zaharia said that GPT-4’s success rate in response to the question “Is this number prime? Think step by step” fell from 97.6% in March 2023 to 2.4% in June 2023.

At the same time, the success rate of the older model — GPT-3.5 — had improved from 7.4% to 86.8%.
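A success rate of this kind can be reproduced in principle by comparing each yes/no answer with the true primality of the number. The following is a minimal sketch of that scoring step, not the researchers’ actual harness: the function names, the sample numbers and the model answers are all made up for illustration.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the small integers used in this sketch."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(numbers, answers) -> float:
    """Fraction of yes/no answers that agree with the true primality."""
    correct = sum(
        1
        for n, a in zip(numbers, answers)
        if (a.strip().lower() == "yes") == is_prime(n)
    )
    return correct / len(numbers)


# Hypothetical sample: three primes and one composite (91 = 7 * 13),
# paired with invented model answers.
numbers = [17077, 7919, 97, 91]
answers = ["No", "Yes", "Yes", "Yes"]
print(f"{accuracy(numbers, answers):.1%}")
```

Note that a model answering “No” to every question would score well on a sample made up mostly of composites, so the mix of primes and composites matters when reading such a figure.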

When it came to code generation, both GPT-4 and GPT-3.5 performed substantially worse in June than in March.

When asked to give the sum of all integers within a range, GPT-4’s success rate for a directly executable response dropped from 52% to 10%, while GPT-3.5’s declined from 22% to 2%.
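One plausible reading of “directly executable” is that the raw response runs unmodified and returns the right answer. The sketch below illustrates that reading only. The function name `sum_range`, the two sample responses and the pass criterion are assumptions, not details taken from the paper.

```python
# Built this way so the example itself contains no literal code fence.
FENCE = "`" * 3

# Two hypothetical responses: bare code, and the same code wrapped in markup.
bare_response = "def sum_range(lo, hi):\n    return sum(range(lo, hi + 1))\n"
fenced_response = f"{FENCE}python\n{bare_response}{FENCE}\n"


def directly_executable(response: str, func_name: str, args: tuple, expected) -> bool:
    """True if the raw response runs as-is and func_name(*args) equals expected.

    Warning: exec() runs arbitrary code; a real harness would sandbox this.
    """
    namespace = {}
    try:
        exec(compile(response, "<response>", "exec"), namespace)
        return namespace[func_name](*args) == expected
    except Exception:
        return False


responses = [bare_response, fenced_response]
passed = [directly_executable(r, "sum_range", (1, 10), 55) for r in responses]
print(passed)
print(f"{sum(passed) / len(passed):.0%}")
```

Under this strict criterion, a response can contain correct logic and still fail because extra text or markup around the code stops it from running unmodified.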

The only area where both models improved was visual reasoning, with an increase of around 2 percentage points in successful exact-match responses.

Zaharia said it was “really hard” to tell why this was happening.

“It could definitely be that RLHF [reinforcement learning from human feedback] and fine-tuning are hitting a wall, but might also be bugs. Definitely seems tricky to manage quality,” Zaharia said.

Zaharia said the researchers wanted to conduct a longer study on the issue and asked his followers for feedback on which questions they should use to measure GPT’s performance.

Other researchers’ findings and concerns over AI “model collapse” have highlighted the limitations and challenges facing AI development. They come at a time when many tech companies are investing heavily in the technology and others have warned about its existential threat to humanity.

ChatGPT’s limited creative abilities were also highlighted recently in a study by two German researchers, which examined GPT-3.5’s ability to understand and generate humour.

They found its knowledge in this area was rather lacking, with 90% of the 1,008 responses it generated repeating the same 25 jokes.

That led the researchers to conclude the jokes were likely learned and memorised directly from user input or training data rather than generated.
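A concentration figure like this can be computed by counting how many responses fall among the most frequent distinct jokes. The sketch below is illustrative only: the normalisation rule (lower-casing and collapsing whitespace) and the sample data are assumptions, and the German researchers’ own method may differ.

```python
from collections import Counter


def top_k_share(responses, k=25):
    """Share of responses accounted for by the k most frequent distinct texts."""
    # Assumed normalisation: ignore case and whitespace differences.
    normalized = [" ".join(r.lower().split()) for r in responses]
    counts = Counter(normalized)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(normalized)


# Hypothetical sample of ten responses with three distinct jokes.
sample = ["Joke A"] * 5 + ["joke  a"] + ["Joke B"] * 3 + ["Joke C"]
print(f"{top_k_share(sample, k=2):.0%}")
```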

