May 5, 2024

AI controversy: Does GPT-4 have amnesia?

The crux of the matter lies in that “or not”. If an AI model produces nonsense in response to user questions or fails to solve tasks correctly, the task is not necessarily beyond its capabilities; it may also be due to how the question is asked. AI models acquire their skills during pre-training (the P in GPT stands for “pre-trained”). This expensive and time-consuming process takes months for large models and is usually not repeated. The behavior is subsequently adjusted by fine tuning. “The pre-trained base model is just a sophisticated autocomplete: it can’t talk to users yet,” the AI Snake Oil authors explain.
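
The difference is easy to see in a few lines of Python. The following is a minimal sketch, assuming the Hugging Face transformers library and the small, publicly available gpt2 checkpoint as a stand-in for an untuned base model (it is not the GPT-4 base model, which is not publicly accessible):

```python
# Minimal sketch: a pre-trained base model is "just autocomplete".
# Assumes the Hugging Face transformers library and the public gpt2 checkpoint
# as a stand-in for an untuned base model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Q: What is the capital of France?\nA:"
result = generator(prompt, max_new_tokens=25, do_sample=False)
print(result[0]["generated_text"])
# The base model merely continues the text pattern; it may ramble, invent new
# "Q:" lines, or drift off topic. Conversational behavior -- answering and then
# stopping -- is only added later by fine tuning.
```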

Models like ChatGPT learn conversational behavior only through fine tuning. Unwanted answers are also suppressed by this subsequent adjustment of the model. The authors caution that fine tuning hones desirable skills while suppressing others. The capabilities of a model can therefore be expected to remain essentially the same over time, while the behavior of the AI chatbot may change dramatically.

When generating source code, the Californian trio Lingjiao Chen, Matei Zaharia, and James Zou found that the newer GPT-4 adds natural-language text to its output instead of returning pure program code. The model tries to help users by supplying explanations and additional information. For their evaluation, however, the authors only checked whether the output could be executed directly, i.e. whether it constituted a runnable program. The additional information, which human testers found consistently useful, paradoxically counted against the model in this kind of evaluation, according to the Snake Oil newsletter. When evaluating the math problems, the Snake Oil authors encountered further inconsistencies.
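
The effect is easy to reproduce with a simple check. The sketch below is my own simplification, not the authors’ evaluation script: it uses a fence-stripping regular expression and Python’s compile() as a rough proxy for “directly executable”.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence marker

def is_directly_executable(text: str) -> bool:
    # Simplified proxy for the paper's criterion: does the raw response parse
    # as Python source? (The study checked executability; syntax is a stand-in.)
    try:
        compile(text, "<llm-response>", "exec")
        return True
    except SyntaxError:
        return False

def strip_markdown(text: str) -> str:
    # Keep only the contents of Markdown code fences, dropping surrounding prose.
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, text, flags=re.DOTALL)
    return "\n".join(blocks) if blocks else text

answer = (
    "Sure, here is the function you asked for:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n"
    + "    return a + b\n"
    + FENCE + "\n"
    + "Let me know if you need anything else."
)

print(is_directly_executable(answer))                  # False: prose and fences break parsing
print(is_directly_executable(strip_markdown(answer)))  # True after a trivial cleanup
```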


Systematic discrepancies in mathematics tests

Here the models were faced with 500 questions about prime numbers. In every case, however, Chen, Zaharia, and Zou presented a prime, so the correct answer should have been “yes” throughout. Apparently, the models did not actually check all possible divisors but merely pretended to and skipped this step, the newsletter says. The model listed the ranges to be checked but never checked them, according to Narayanan and Kapoor. So the math problem is not really being solved here. By testing the models with composite numbers, the Snake Oil authors found that the alleged decline in AI performance was due to the choice of evaluation material.
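
What “actually checking the divisors” means can be shown in a few lines of Python; this is a generic trial-division sketch, not the prompt or code used in the study:

```python
def is_prime(n: int) -> bool:
    # Actually test every candidate divisor up to the square root of n --
    # the step the models reportedly listed in their answers but never carried out.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False  # found a divisor, so n is composite
        d += 1
    return True

print(is_prime(7919))  # True: a prime, like every number in the original test set
print(is_prime(7917))  # False: divisible by 3, the kind of case the counter-test adds
```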

Since the Californian trio had tested only primes, they were bound to interpret their results as a massive drop in GPT-4’s performance; with GPT-3.5 it looked like exactly the opposite. Kapoor and Narayanan came to the conclusion that all four model versions are “bad” at solving the math problems: the March version of GPT-4 always guesses “prime”, while the June version always guesses “composite”.
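
A back-of-the-envelope calculation shows why such a benchmark cannot distinguish guessing from reasoning (illustrative numbers only):

```python
# If a model answers "prime" for every input, its measured accuracy depends
# only on the composition of the test set, not on any mathematical ability.
def accuracy_of_constant_guess(labels, guess="prime"):
    return sum(label == guess for label in labels) / len(labels)

prime_only = ["prime"] * 500          # the original benchmark: 500 primes, nothing else
composite_only = ["composite"] * 500  # the Snake Oil counter-test with composites

print(accuracy_of_constant_guess(prime_only))      # 1.0 -- looks like flawless math skills
print(accuracy_of_constant_guess(composite_only))  # 0.0 -- the same strategy collapses completely
```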

The preprint thus shows that the behavior of the models has changed over time. According to Kapoor and Narayanan, however, the tests conducted say nothing about the models’ capabilities. That the trio’s misinterpretation went “viral” had to do with public expectations: rumors were circulating that OpenAI had throttled the performance of its models in order to save computing time and costs. When OpenAI publicly denied this, parts of the public took the denial itself to be misleading.


Whether there was any truth to the rumors of deliberate performance throttling could not be determined. One plausible reason for the subjectively observed “deterioration” of ChatGPT’s output may be that, with growing practice, users become more aware of ChatGPT’s limitations and realize that they do not have a magic machine at their fingertips. In addition, not all users are equally experienced and skilled at prompting (describing the problem to the AI model in natural language in a way that leads to the desired result). Some give up in frustration when their prompts do not produce a working program in one or two steps or yield a print-ready novel. Here, human skill shapes how the models one interacts with are perceived and judged.

On the other hand, changing model behavior inevitably changes the user experience, as tried-and-tested prompts and instruction templates suddenly no longer work as expected. On the user side, this feels the same as if the model’s capabilities had degraded; it is a negative experience, and for applications built around the OpenAI API it can break business models.
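
One common mitigation is to pin a dated model snapshot instead of the floating model alias. The following is a minimal sketch, assuming the OpenAI Python SDK (v1-style client) and an API key in the environment; the prompt is purely illustrative:

```python
# Sketch: pin a dated snapshot (here the June 2023 snapshot the study compared)
# rather than the moving "gpt-4" alias, so the application is not silently
# switched to a model with different behavior. Assumes the OpenAI Python SDK
# and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-0613",   # dated snapshot instead of the floating "gpt-4" alias
    temperature=0,        # reduce run-to-run variation on identical prompts
    messages=[{"role": "user", "content": "Is 7919 a prime number? Answer yes or no."}],
)
print(response.choices[0].message.content)
```

Even pinning only postpones the problem, though, since dated snapshots are themselves retired after a while, which leads to the reproducibility problem described further below.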

“The pitfalls we discovered are a reminder of how difficult it is to evaluate language models quantitatively.”

Snapshots of older model states do not get to the root of the problem either, as they are only available for a limited time before being replaced by newer snapshots. The models can hardly be studied scientifically, since test series can no longer be reproduced after a short time and generative AI can give different answers to identical or similar questions. It is important to keep in mind that continuous further fine tuning of large language models can lead to unpredictable, and sometimes drastic, changes in a model’s behavior on certain tasks.
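
In practice, teams therefore often keep a small regression suite of fixed prompts and stored reference answers and rerun it regularly. The sketch below is hypothetical; the file name, prompts, and ask_model stub are placeholders and not taken from the study:

```python
# Hypothetical drift check: rerun a fixed prompt set and diff the answers
# against a stored baseline so behavior changes surface immediately.
import json
from pathlib import Path

BASELINE_FILE = Path("baseline_answers.json")   # placeholder path
PROMPTS = [                                     # placeholder prompts
    "Is 7919 a prime number? Answer yes or no.",
    "Write a Python function that reverses a string. Return only code.",
]

def ask_model(prompt: str) -> str:
    # Stub: plug in a call to a pinned model snapshot here.
    raise NotImplementedError

def run_drift_check() -> None:
    answers = {p: ask_model(p) for p in PROMPTS}
    if not BASELINE_FILE.exists():
        BASELINE_FILE.write_text(json.dumps(answers, indent=2))
        print("Baseline recorded; rerun later to compare.")
        return
    baseline = json.loads(BASELINE_FILE.read_text())
    for prompt, old_answer in baseline.items():
        status = "unchanged" if answers.get(prompt) == old_answer else "CHANGED"
        print(f"[{status}] {prompt}")
```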


The Snake Oil authors conclude their critical notes with the observation: “The pitfalls we discovered are a reminder of how difficult it is to evaluate language models quantitatively.” Notes on their experimental method can be found at the end of the blog post. Anyone who wants to try out the models themselves should hurry before their behavior changes once again.