
Apple engineers demonstrate how fragile AI ‘reasoning’ can be


Companies like OpenAI and Google have lately been promoting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study by six Apple engineers shows that the mathematical “reasoning” exhibited by advanced large language models can be extremely fragile and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new findings helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these findings. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix it up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” currently available as a preprint paper, the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school-level mathematical word problems, which is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
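To make the substitution idea concrete, here is a minimal sketch of how that kind of templated re-sampling could work. The template text, value ranges, and the `instantiate` helper are all hypothetical illustrations, not code from the paper:

```python
import random

# Hypothetical GSM-Symbolic-style template: the names and numbers from a
# GSM8K-style question become placeholders that are re-sampled on each run.
TEMPLATE = (
    "{name} gets {n} building blocks for their {relative}, "
    "then buys {m} more. How many blocks does {name} have now?"
)

def instantiate(rng: random.Random) -> tuple[str, int]:
    """Sample one concrete variant of the question plus its ground truth."""
    name = rng.choice(["Sophie", "Bill", "Mia", "Omar"])
    relative = rng.choice(["nephew", "brother", "cousin"])
    n, m = rng.randint(5, 40), rng.randint(5, 40)
    question = TEMPLATE.format(name=name, relative=relative, n=n, m=m)
    answer = n + m  # the ground truth tracks the sampled values automatically
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = instantiate(rng)
    print(question, "->", answer)
```

Because the correct answer is recomputed from the sampled values, every variant stays exactly as hard as the original while its surface text changes.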

This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as on GSM8K.


Instead, when the researchers tested more than 20 LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate GSM-Symbolic runs with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
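Here is a sketch of how that run-to-run spread could be measured, assuming a `model_answers` callable that stands in for whatever produces an LLM’s answer to a question (all names here are hypothetical, not the paper’s evaluation harness):

```python
from statistics import mean

def run_accuracy(model_answers, questions, truths):
    """Fraction of questions in one sampled variant answered correctly."""
    correct = sum(model_answers(q) == t for q, t in zip(questions, truths))
    return correct / len(questions)

def variance_report(model_answers, runs):
    """`runs` is a list of (questions, truths) pairs, one per sampled variant."""
    accuracies = [run_accuracy(model_answers, qs, ts) for qs, ts in runs]
    return {
        "mean": mean(accuracies),
        "best": max(accuracies),
        "worst": min(accuracies),
        "spread": max(accuracies) - min(accuracies),  # up to ~15 points, per above
    }
```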

This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t get distracted

Still, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).


The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
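A worked version of that kiwi-style question makes the failure mode concrete. The specific numbers below are illustrative stand-ins; the point is that the “smaller than average” clause is a no-op, so the correct tally ignores it, while the distracted pattern-matching behavior the paper describes subtracts it anyway:

```python
# Illustrative GSM-NoOp arithmetic with made-up numbers.
friday, saturday = 44, 58
sunday = 2 * friday           # "double the number picked on Friday"
smaller_than_average = 5      # inconsequential: size says nothing about count

correct = friday + saturday + sunday                            # 190
distracted = friday + saturday + sunday - smaller_than_average  # 185

print(f"correct: {correct}, distracted: {distracted}")
```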

Adding in these red herrings led to what the researchers termed a “catastrophic performance drop” in accuracy compared to GSM8K, ranging from 17.5 percent to 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
