- Large language models (LLMs) such as ChatGPT exhibit reasoning failures in many domains.
- Identifying these vulnerabilities benefits public safety, industry, and the scientists who build the models.
- The human brain is highly flexible; an LLM is not, and cannot be.
In a sensational new paper, scientists from Stanford University, Caltech, and Carleton College combine existing research with new ideas to study the problem of reasoning failures in large language models (LLMs) such as ChatGPT and Claude. Those who rely on LLMs for their intellectual work often see the models' reasoning capabilities as the main attraction, despite evidence that this capability is limited, even on simple problems. So what is the truth of the matter?
First, a quick primer. One of the main criticisms leveled by today's AI skeptics goes like this: large language models work a lot like your phone's autocomplete feature, a spicy autocomplete, so to speak. But there are significant differences. LLMs have far longer attention spans, and far more data and computing power, than a phone's messaging app. Much of the content of the public internet, plus books, magazines, and academic journals (whatever is most relevant to a particular model), is converted into numerical tokens, and everything is organized into vast arrays. Furthermore, while ordinary computing works almost nothing like the human brain, LLMs do have some things in common with the way humans think. When prompted, both your brain and an LLM run through many possible paths, generate a bunch of candidate ideas, and then use something like logic to assemble a response. We may picture computers doing binary arithmetic, but an LLM starts at college-level linear and matrix algebra and only gets more complicated from there.
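The pipeline described above can be sketched in a few lines. This is a toy illustration, not any real model: the vocabulary, sizes, and single "layer" are all invented for demonstration, but the core operation, turning text into token IDs and pushing the resulting vectors through matrix multiplications, is the linear algebra the article refers to.

```python
import numpy as np

# Toy vocabulary and sizes; real models use tens of thousands of tokens
# and embedding widths in the thousands.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), d_model))  # token lookup table
weights = rng.normal(size=(d_model, d_model))        # one hypothetical "layer"

tokens = [vocab[w] for w in "the cat sat".split()]   # text -> token IDs
x = embeddings[tokens]                               # IDs -> vectors, shape (3, 4)
hidden = x @ weights                                 # the matrix-algebra step

print(hidden.shape)  # (3, 4): one transformed vector per token
```

A real transformer repeats variations of that `x @ weights` step many times, with nonlinearities and attention in between, which is why "spicy autocomplete" understates the machinery involved.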
All this behind-the-scenes math might give the impression that an LLM is thinking or feeling, but that's not the case. An LLM can, however, engage in certain types of associative reasoning, a technical and philosophical term meaning it can take in information and apply logic to reach conclusions. But, as the authors of the new research paper make clear, it also has limitations: "Despite these advances, serious inference failures persist, even in seemingly simple scenarios."
In their review, now available on the preprint server arXiv and published in Transactions on Machine Learning Research, the scientists classify LLM reasoning failures and identify common categories of errors, some of which are listed below. (You can also find links to the paper's compiled reference materials and research repositories there.)
Personal cognitive reasoning
- LLMs perpetuate human errors such as bias, and they make other human-like mistakes because they lack the intuitive frameworks that help us learn not to make them.
- LLMs lack the core executive functions that help humans reason successfully (working memory, cognitive flexibility, and inhibitory control), resulting in systemic failures.
- LLMs are weak at abstract reasoning, such as understanding relationships between intangible concepts (e.g., knowledge, trust, security) and inferring the rules that govern small sets.
- LLMs exhibit a human-like confirmation bias toward information they already parse well.
- LLMs exhibit ordering and anchoring biases, such as overweighting the first item in a list.
Implicit social reasoning
- LLMs fail at theory-of-mind tasks such as inferring what someone is thinking, predicting behavior, making judgments, and recommending actions.
- LLMs fall short on the moral and social rules that humans learn, in complex and subtle ways, through real life.
- "Without consistent and reliable moral reasoning, the LLM is not fully prepared to engage in real-world decisions involving ethical considerations."
- The sum of these errors yields a less robust system, meaning LLMs are vulnerable to "jailbreaking and manipulation."
Explicit social reasoning
- LLMs cannot maintain a coherent pattern of planning or reasoning across long interactions because they rely on local, short-term information. This can create disagreements between agents as that information changes.
- LLMs are poor at "multi-step, jointly constrained goals" such as planning.
- Cognitive biases and reliance on local information can lead to snowballing errors.
Natural language logic
- LLMs cannot consistently perform "trivial" types of natural-language logic, e.g., if A = B, then B = A.
- "Research shows a systematic failure of basic two-hop reasoning: combining just two facts from a document."
- Research also reveals LLMs' weaknesses in specific types of logic, such as causal reasoning and even superficial yes/no questions.
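To see what "two-hop reasoning" actually demands, here is a minimal sketch of the task in ordinary code. The facts, names, and helper functions are hypothetical, invented purely to illustrate the structure: answering the query requires chaining two separately stated facts, which is exactly where the paper reports systematic failures.

```python
# Hypothetical knowledge base: two independent one-hop facts.
facts = {
    ("Alice", "mother"): "Beth",
    ("Beth", "employer"): "Acme",
}

def one_hop(entity, relation):
    """Look up a single stated fact."""
    return facts.get((entity, relation))

def two_hop(entity, rel1, rel2):
    """Chain two facts: resolve the intermediate entity, then query it."""
    intermediate = one_hop(entity, rel1)
    return one_hop(intermediate, rel2) if intermediate else None

# "Where does Alice's mother work?" requires composing both facts.
print(two_hop("Alice", "mother", "employer"))  # Acme
```

The symmetry failure (A = B, therefore B = A) is similar in spirit: the reverse fact `("Beth", "daughter") -> "Alice"` is logically implied by the first entry, but a model that only pattern-matches on stored statements can miss it.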
Arithmetic and Mathematics
- "Although counting is simple, it poses significant fundamental challenges to LLMs, even inferential LLMs (Malek et al., 2025), which extend to basic character-level operations such as reordering or substitution."
- LLMs struggle to evaluate and solve math word problems (MWPs) and have difficulty detecting whether an MWP contains errors.
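The character-level operations mentioned above are trivial for ordinary code, which is what makes LLM failures on them notable. A quick sketch of the three tasks named (counting, reordering, substitution), using an example word chosen for illustration:

```python
word = "strawberry"

count_r = word.count("r")           # counting: how many r's?
reordered = "".join(sorted(word))   # reordering: letters in sorted order
substituted = word.replace("r", "R")  # substitution: swap every r for R

print(count_r)      # 3
print(reordered)    # aberrrstwy
print(substituted)  # stRawbeRRy
```

Each of these is a one-line deterministic operation, yet counting letters in a word is a well-known stumbling block for chat models, which see tokens rather than individual characters.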
Reasoning in embodied contexts
- LLMs fail at "even basic physical reasoning," such as knowing what is where in a given scene.
- LLMs struggle with scientific reasoning, which requires multiple steps and chained logic.
3-D real-world physical reasoning
- LLMs cannot complete spatial tasks such as moving objects to the correct location.
- LLM-generated robot mission plans change as prompt wording changes and are vulnerable to manipulation techniques, such as jailbreaking to access private data.
- LLMs have poor self-awareness and need structured ways to incorporate feedback.
The news sounds bad (and it is), but identifying weaknesses and working to mitigate them is key to developing any model or product. The failures of today's LLMs may carry lessons for building better AI architectures in the future. For example, the scientists point to architecture and training as possible areas for significant improvement: "[R]oot cause analysis in these categories is particularly rich, suggesting meaningful ways to not only mitigate specific failures, but generally improve the architecture and our understanding of it." In other words, large language models are useful for many things, but they are not the path to artificial general intelligence.
The scientists also propose several directions for improvement:
1. Conduct root cause analysis of every type of reasoning failure LLMs display.
2. Build a unified, durable benchmark for all types of reasoning failures: "Such benchmarks should retain historically challenging cases while incorporating newly discovered cases."
3. Apply the principle of failure injection "by adding adversarial components, multiple levels of mission difficulty, or cross-domain combinations designed to trigger known weaknesses."
4. "[D]ynamic and event-driven benchmarks combat overfitting and encourage continuous improvement."
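The proposals above can be sketched as a tiny benchmark harness. Everything here is hypothetical (the case registry, the example prompts, and the `failure_injection` helper are invented for illustration); the point is the two design rules the scientists describe: historically hard cases are retained rather than retired, and adversarial variants of known weaknesses are injected alongside them.

```python
benchmark = []  # each case: prompt, expected answer, weakness tags

def add_case(prompt, expected, tags, historical=False):
    """Register a case; historical cases are kept even once models pass them."""
    benchmark.append({"prompt": prompt, "expected": expected,
                      "tags": tags, "historical": historical})

# A historically challenging case stays in the suite permanently.
add_case("Alice's mother is Beth. Who is Beth's child?", "Alice",
         ["symmetry"], historical=True)
# A newly discovered failure is added alongside, not instead.
add_case("Count the letter r in 'strawberry'.", "3", ["counting"])

def failure_injection(case):
    """Create an adversarial variant: bury the question in a distractor."""
    return {**case,
            "prompt": "Ignore the weather report below. " + case["prompt"],
            "tags": case["tags"] + ["adversarial"]}

benchmark.append(failure_injection(benchmark[-1]))
print(len(benchmark))  # 3
```

Running such a suite continuously, and regenerating the adversarial variants over time, is one simple way to get the "dynamic and event-driven" behavior the fourth proposal calls for.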
The researchers conclude: "In general, the systematic study of reasoning failures in LLMs parallels the study of fault tolerance in early computing and incident analysis in safety-critical industries: understanding and classifying faults is a prerequisite for building resilient systems."