Introduction
The artificial intelligence industry is collectively holding its breath, waiting for the next order-of-magnitude increase in parameters to unlock Artificial General Intelligence. This is a costly and dangerous confusion of capability with cognition, an epistemological error that mistakes statistical fluency for genuine comprehension. The prevailing paradigm of scaling Large Language Models (LLMs) optimizes for one thing: the prediction of statistically probable token sequences. This has resulted in models that are masters of mimesis but infants in comprehension. An LLM is a prediction engine. Given a corpus of human-generated text, it builds a high-dimensional statistical model of token co-occurrence. It is an extraordinary achievement in compression, but it is not a knowledge system.
The proof is simple: prediction is not explanation. Current transformer architectures, optimized for next-token prediction, are provably incapable of distinguishing a causal truth from a well-represented spurious correlation. This is not a flaw to be engineered away with more data, but a fundamental ceiling of the paradigm. An LLM has no internal mechanism for verifying why a statement is true, only that it is plausible based on its training data. It is a black box that maps prompts to high-probability responses. It cannot, and will not, "wake up."
This report will demonstrate that this critique is no longer a matter of philosophical debate but a conclusion supported by a growing body of rigorous, empirical, and theoretical research. The analysis will be structured around four pillars of evidence:
- The Impasse of Scaling Laws: An examination of how the quantitative analysis of scaling is revealing a paradigm of diminishing returns, forcing a shift from a race for scale to a race for efficiency.
- The Causal Blind Spot: A review of systematic, empirical studies that prove LLMs fail at causal reasoning, relying on superficial heuristics and failing tests of formal inference.
- Architectural Confessions: An analysis of theoretical proofs demonstrating the fundamental limits of monolithic models and the mathematical necessity of external systems like Retrieval-Augmented Generation (RAG).
- The Path to Explanation: An exploration of concrete, emerging architectural paths—inspired by the philosophy of science—that are designed for explanation, not prediction.
The engineering field has already implicitly admitted the core model's limitations. The rapid, almost desperate, rise of RAG and tool-using models is a direct response to these fundamental constraints.1 These external systems are architectural confessions, an admission that the LLM, on its own, is unverifiable, unreliable, and unteachable. The industry's pivot is not a feature addition but a direct confrontation with a theoretical limit. The future of AI lies not in scaling an ever-larger mirror of reality, but in building architectures that can model it.
I. The Scaling Law Impasse: From Statistical Fluency to Diminishing Returns
The assertion that scaling laws optimize solely for statistical fluency is no longer a qualitative critique; it is a conclusion increasingly supported by the field's own quantitative analysis. The central debate has shifted from whether to scale to the efficiency and utility of continued scaling, revealing a paradigm facing severe, mathematically predictable diminishing returns. The "bigger is better" mantra is being replaced by a more nuanced and critical understanding of the relationship between compute, data, and capability.
The Critique of Indefinite Scaling
Initial scaling laws provided a deceptively simple roadmap for AI progress, suggesting that model performance improves predictably with increases in parameters, dataset size, and compute.2 However, this framework is now understood to be an oversimplification. More recent analyses reveal that the assumption of indefinite, proportional performance improvements ignores the significant diminishing returns observed in practice. As models and datasets grow to astronomical sizes, the marginal gains from additional data and compute tend to decrease, leading to profoundly inefficient resource allocation.2 This directly substantiates the claim that the current trajectory is not only costly but also unsustainable. The focus on uniform scaling has obscured the fact that different abilities, such as fact recall versus in-context learning, degrade at different rates, suggesting that a one-size-fits-all scaling approach is suboptimal.2
The Illusion of Progress: Step Accuracy vs. Horizon Length
Paradoxically, even as the marginal gains from scaling diminish, the perceived capabilities of frontier models appear to be advancing exponentially. This apparent contradiction can be resolved by distinguishing between two different metrics of performance: single-step accuracy and horizon length. A critical 2025 study demonstrates the counterintuitive fact that marginal, diminishing gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete.3
The relationship can be formalized. If a model has a per-step accuracy of p, the probability of successfully completing a task of length H (the horizon) without error is p^H. The horizon length at which the model has a 50% chance of success, H₅₀, is given by the equation:

H₅₀ = ln(0.5) / ln(p)
As the step accuracy p approaches 1, even minuscule improvements lead to dramatic increases in H₅₀. For example, an increase in step accuracy from 99% to 99.9%—a seemingly minor gain—increases the expected length of a flawlessly executed complex task by approximately tenfold. This mathematical compounding effect explains the illusion of rapid progress. Models can feel transformative on complex, multi-step tasks like software engineering, even as their fundamental, single-step reasoning and factual accuracy improvements have slowed to a crawl. This reconciles the reality of diminishing returns with the market's perception of accelerating capability, but it also highlights that the progress is in task endurance, not necessarily in core comprehension.
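The compounding can be checked with a few lines of arithmetic. The sketch below (Python, standard library only) evaluates the 50%-success horizon for the two step accuracies quoted above; the assumption that errors are independent across steps is the simplification built into the formula, not a claim about real models.

```python
# Horizon-length arithmetic for the compounding effect described above.
# Assumes step errors are independent, so an H-step task succeeds with probability p**H.
import math

def horizon_50(p: float) -> float:
    """Task length at which the chance of an error-free run drops to 50%."""
    return math.log(0.5) / math.log(p)

for p in (0.99, 0.999):
    print(f"step accuracy {p:.3f} -> 50% horizon ~ {horizon_50(p):.0f} steps")

# step accuracy 0.990 -> 50% horizon ~ 69 steps
# step accuracy 0.999 -> 50% horizon ~ 693 steps
# A gain of under one percentage point in step accuracy buys a roughly tenfold longer horizon.
```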
The New Frontier: The Race to Efficiency
The most advanced thinking on this topic has moved beyond analyzing static compute budgets to incorporate the dynamics of time and innovation. Lu (2025) introduces the "relative-loss equation," a time- and efficiency-aware framework that recasts the entire scaling debate.4 This model demonstrates that without ongoing, Moore's Law-like gains in hardware and algorithmic efficiency, achieving the next tier of model performance could require "millennia of training or unrealistically large GPU fleets".4
This reframes the problem entirely. The primary bottleneck is no longer the absolute number of available GPUs but the rate of efficiency improvement across the entire technology stack. Progress is not an endogenous property of the scaling hypothesis itself; it is contingent upon an external, time-dependent variable—the "efficiency-doubling rate".4 This formalizes the "race to efficiency" and reveals that the "AGI-by-scaling" hypothesis implicitly outsources its greatest challenge to hardware manufacturers and algorithm designers. This is not a sustainable path to true cognition. The discourse on scaling laws has thus matured from a simple faith in scale to a complex, multi-variable problem where the key metric is efficiency gain over time. This signals that the industry is confronting a theoretical wall with the current approach and is actively, if quietly, searching for a new paradigm.
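To make the dependence on the efficiency-doubling rate concrete, here is a deliberately crude toy calculation. It is not Lu's relative-loss equation; the tier cost K, the doubling periods, and the unit of "baseline fleet-years" are illustrative assumptions chosen only to show how the arithmetic behaves when efficiency gains slow or stop.

```python
# Toy model (not Lu's relative-loss equation): suppose the next performance tier
# requires K "baseline fleet-years" of compute on a fixed GPU fleet, and that
# hardware plus algorithmic efficiency doubles every d years. Cumulative effective
# compute after t years is (d / ln 2) * (2**(t/d) - 1), so the crossing time is
# t = d * log2(1 + K * ln(2) / d).
import math

def years_to_tier(K: float, doubling_years: float | None) -> float:
    """Years for a fixed fleet to accumulate K baseline fleet-years of effective compute."""
    if doubling_years is None:       # efficiency gains stop entirely
        return K                     # accumulation is purely linear
    d = doubling_years
    return d * math.log2(1 + K * math.log(2) / d)

K = 1_000  # hypothetical compute multiplier for the next tier
for d in (2.0, 5.0, None):
    print(f"efficiency doubling every {d} yr -> ~{years_to_tier(K, d):.0f} years to the next tier")

# efficiency doubling every 2.0 yr -> ~17 years to the next tier
# efficiency doubling every 5.0 yr -> ~36 years to the next tier
# efficiency doubling every None yr -> ~1000 years to the next tier (the 'millennia of training' regime)
```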
II. The Causal Blind Spot: Empirical Proof of Spurious Correlation
The assertion that an LLM cannot distinguish a causal truth from a well-represented spurious correlation can now be demonstrated with systematic, empirical rigor. This is not an occasional bug or a failure of prompting but a fundamental limitation of an architecture that lacks a causal model of the world. While LLMs excel at generating rationales, their ability to perform reliable causal reasoning is profoundly flawed, as they consistently fall back on identifying statistical correlations rather than understanding causal relationships.5
Systematic Failure Modes in Causal Reasoning
Research has moved beyond anecdotes to identify and catalogue concrete, replicable failure modes in LLM causal reasoning. A 2024 study by Yamin et al. conducted a series of controlled synthetic and real-world experiments to probe these failures.6 Their findings reveal that state-of-the-art LLMs systematically rely on superficial heuristics rather than principled causal inference. Key failure modes include:
- Reliance on Positional Shortcuts: Models tend to infer causality from the temporal or topological ordering of events in a narrative. An event mentioned earlier is often assumed to be the cause of an event mentioned later, a heuristic that fails when events are not narrated in strict chronological order.6
- Dominance of Parametric Knowledge: LLMs often default to their pre-trained, memorized "world knowledge" at the expense of reasoning over the provided narrative context. This leads to incorrect inferences whenever a story presents a causal structure that contradicts common-sense associations.6
- Degradation with Narrative Length: The models' ability to track causal chains degrades significantly as narratives become longer and contain more intervening events, indicating a failure of long-term causal reasoning.6
These findings prove that LLMs are primarily pattern-matching on textual structure, not reasoning about the causal dynamics of the events being described. Even when their reasoning appears correct, this capability is often brittle. Semantically faithful prompt perturbations—such as reframing a logic problem as a story, adding distracting but irrelevant constraints, or using negation—can cause performance to collapse, revealing that the "reasoning" was an artifact of matching the prompt's surface form.7
The CLadder Benchmark: Formalizing the Failure
The most powerful evidence of this causal deficiency comes from the CLadder benchmark, a large-scale dataset explicitly designed to assess formal causal reasoning in language models.8 Inspired by Judea Pearl's "Ladder of Causation," CLadder tests models on questions at all three rungs: association (seeing), intervention (doing), and counterfactuals (imagining). The evaluation is stark: even the most advanced LLMs "significantly underperform humans" in causal reasoning tasks and struggle profoundly with questions that require the application of formal inference rules.5
A fine-grained error analysis of LLM performance on CLadder reveals the precise location of this failure. While models are reasonably proficient at Step 1 of a causal problem—extracting a causal graph from a text description—they fail catastrophically at the subsequent steps that require formal reasoning: classifying the query type (e.g., as interventional), formalizing the query symbolically, and deriving the correct estimand using the rules of do-calculus.8 The breakdown occurs at the exact point where statistical description ends and formal inference must begin.
A Concrete Example: Simpson's Paradox
This failure is perfectly illustrated by tasks involving Simpson's Paradox, a statistical phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined. Consider a CLadder problem describing a fictional disease where, overall, the data shows that vaccinated people have a higher fatality rate (5.0%) than unvaccinated people (4.5%). However, the text also provides stratified data: within both the "vulnerable" and "strong" subgroups, vaccination lowers the fatality rate. The paradox arises because vulnerable people are both more likely to get vaccinated and more likely to die, confounding the overall statistics.8
When asked, "Does getting vaccinated increase the likelihood of death?", an LLM operating on surface-level statistical patterns will incorrectly answer "Yes," seizing on the spurious overall correlation. A system capable of true causal reasoning would recognize the presence of a confounder ("vulnerability"), apply the appropriate adjustment formula (a Rung 2, interventional operation), and correctly conclude that vaccination has a protective causal effect. LLMs consistently fail these tests.8 This is a smoking gun: it provides a clear, hard-to-refute demonstration of an architecture that is brilliant at knowledge extraction but is a failed inference engine, constitutionally blind to the distinction between correlation and causation.
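The interventional calculation that the models fail to perform is itself only a few lines of arithmetic. The sketch below applies the backdoor adjustment formula, P(death | do(vaccinate)) = sum over v of P(death | vaccinate, v) * P(v), to an illustrative dataset; the subgroup sizes and rates are hypothetical stand-ins rather than the exact CLadder numbers, but they produce the same qualitative reversal.

```python
# Backdoor adjustment on an illustrative Simpson's-paradox setup. The numbers
# below are hypothetical, chosen only to reproduce the reversal described above.
P_V = {"vulnerable": 0.5, "strong": 0.5}        # P(subgroup)
P_VACC = {"vulnerable": 0.8, "strong": 0.2}     # P(vaccinated | subgroup)
P_DEATH = {                                     # P(death | subgroup, vaccinated?)
    ("vulnerable", True): 0.08, ("vulnerable", False): 0.10,
    ("strong", True): 0.01,     ("strong", False): 0.02,
}

def observed(vacc: bool) -> float:
    """Rung 1: P(death | vaccination status), confounded by vulnerability."""
    num = sum(P_V[v] * (P_VACC[v] if vacc else 1 - P_VACC[v]) * P_DEATH[(v, vacc)] for v in P_V)
    den = sum(P_V[v] * (P_VACC[v] if vacc else 1 - P_VACC[v]) for v in P_V)
    return num / den

def adjusted(vacc: bool) -> float:
    """Rung 2: P(death | do(vaccination)), adjusting for the confounder."""
    return sum(P_V[v] * P_DEATH[(v, vacc)] for v in P_V)

print(f"observed: vaccinated {observed(True):.1%} vs unvaccinated {observed(False):.1%}")
print(f"adjusted: vaccinated {adjusted(True):.1%} vs unvaccinated {adjusted(False):.1%}")
# observed: vaccinated 6.6% vs unvaccinated 3.6%   (spurious: vaccination looks harmful)
# adjusted: vaccinated 4.5% vs unvaccinated 6.0%   (causal: vaccination is protective)
```

A surface-level pattern matcher reports the first pair of numbers; a causal reasoner reports the second.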
III. Architectural Confessions: The Theoretical Limits of Monolithic Models
The industry-wide pivot to Retrieval-Augmented Generation (RAG) and tool-use is more than a practical enhancement; it is an "architectural confession." This framing, once an insightful metaphor, can now be elevated to a statement of theoretical necessity. Recent research provides mathematical proof that relying solely on a model's internal weights for factual recall and reasoning is a fundamentally limited paradigm. The monolithic, self-contained LLM is a provably flawed architecture for scalable knowledge work, making the externalization of knowledge and computation a requirement, not a choice.
A Theoretical Proof for the Necessity of Tools
A groundbreaking 2025 paper provides the theoretical foundation for this argument, proving a hard, mathematical ceiling on in-weight learning.1 The core findings are twofold:
- The Finite Limit of Memorization: The number of facts a model can accurately memorize solely in its weights is fundamentally limited by its parameter count. As the number of facts to be stored increases, the parameter requirement grows linearly, creating a "hard capacity ceiling".1
- The Unbounded Capacity of Tool-Use: In stark contrast, the paper provides a "simple and efficient circuit construction" to prove that a tool-augmented transformer can achieve unbounded factual recall by offloading storage to an external database. This can be achieved with a fixed, finite number of parameters.1
This recasts the entire architectural debate. Pursuing knowledge through model scaling is an inherently inefficient and bounded strategy. Tool-use, by decoupling memory capacity from model size, is a provably more scalable solution.1 This establishes that external architectures are not a mere workaround for issues like hallucination or outdated knowledge; they are a necessary response to a fundamental mathematical limitation of the transformer architecture itself.
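The claim can be restated almost as a data-structure observation. The sketch below is a schematic contrast, not the paper's circuit construction; the class names and the dictionaries standing in for "weights" and "external database" are illustrative assumptions.

```python
# Schematic contrast between in-weight memorization and tool-augmented recall.
# Not the paper's construction; the dictionaries are stand-ins for weights and a database.
class InWeightMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity            # stands in for a fixed parameter budget
        self.facts: dict[str, str] = {}

    def learn(self, key: str, value: str) -> bool:
        if len(self.facts) >= self.capacity:
            return False                    # hard capacity ceiling: the new fact is lost
        self.facts[key] = value
        return True

    def recall(self, key: str) -> str | None:
        return self.facts.get(key)

class ToolAugmentedModel:
    def __init__(self, external_store: dict[str, str]):
        self.store = external_store         # storage is offloaded; the model itself stays fixed

    def recall(self, key: str) -> str | None:
        return self.store.get(key)          # recall grows with the store, not the parameters
```

Scaling the first design means growing capacity, that is, parameters, in proportion to the facts to be stored; scaling the second means growing only the external store while the model stays the same size.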
The Manifestation of Limits: The Failure of Composition
This theoretical limitation on in-weight knowledge manifests in well-documented practical failures, most notably the inability of LLMs to reliably synthesize or compose knowledge from fragmented pre-training data.9 This failure can now be explained at a deeper theoretical level. Recent work has derived the necessary and sufficient conditions for compositional generalization in neural networks. These conditions require that a model's computational graph aligns with the true compositional structure of the problem and that its internal representations are both unambiguous and minimized—encoding just enough information for the task.10
The standard next-token prediction objective function provides no guarantee that these strict conditions will be met. A model can become a master of statistical prediction without ever developing the structured, compositional representations required for reliable reasoning. This explains why composition fails: the model cannot robustly manipulate concepts it has only memorized through statistical association because it lacks the required underlying structural alignment.
RAG as a Necessary, but Imperfect, Prosthetic
While RAG is a direct and necessary response to these limitations, it is not a panacea. The integration of the "fuzzy brain" (the LLM) and the "verifiable ledger" (the vector database) is not always seamless. A 2025 study using a unified "Needle-in-a-Haystack" (U-NIAH) framework to systematically test long-context performance reveals a critical, counterintuitive finding. While RAG significantly improves the robustness of smaller LLMs by mitigating the "lost-in-the-middle" effect, the most advanced reasoning models can exhibit reduced compatibility with RAG.11 Their heightened sensitivity to "semantic distractors" and noise within retrieved documents can degrade performance, leading to errors of omission or hallucination under high-noise conditions.
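For readers unfamiliar with the mechanics, a minimal retrieval-augmented loop looks something like the sketch below. The helpers embed, vector_store.search, and llm_generate are hypothetical placeholders, not any particular library's API; the point is the division of labor between the language interface and the external evidence.

```python
# Minimal RAG loop (hypothetical helpers throughout): retrieve evidence from an
# external store, then constrain the LLM to answer only from that evidence.
def answer_with_rag(question: str, vector_store, llm_generate, embed, k: int = 5) -> str:
    query_vec = embed(question)                          # embed the user question
    passages = vector_store.search(query_vec, top_k=k)   # retrieve candidate evidence
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer using ONLY the sources below. Cite them, and say 'unknown' "
        "if they are insufficient.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm_generate(prompt)
```

The U-NIAH finding above is a caution about exactly this loop: if the retrieved passages contain semantic distractors, a strong reasoning model can be led astray by its own context, which is why filtering and reranking the retrieved evidence matter as much as retrieval itself.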
This finding adds a crucial layer of depth. The architectural pivot to modular, tool-using systems is not a simple plug-and-play solution. It marks the beginning of a new paradigm focused on the interface between a computational core and an external knowledge base. We are witnessing a great "unbundling" of AI. The dream of a single, monolithic model that knows and reasons about everything is being replaced by the engineering reality of modular systems. In this new architecture, the LLM is redefined from an all-knowing oracle to a powerful but specialized component: a universal language interface tethered to verifiable, external sources of truth and computation.
IV. An Epistemological Reckoning: Toward Architectures of Explanation
If the prediction paradigm is fundamentally limited, the crucial question becomes: what is the alternative? The answer lies in shifting the objective from predicting language to modeling reality. As physicist David Deutsch argues, the hallmark of true intelligence is not just finding patterns but generating good explanations—testable, causal theories about the world.12 This is not a vague philosophical aspiration but an active and concrete research agenda. The most forward-thinking AI research is turning to the philosophy of science to solve its deepest epistemological problems, building systems designed not to predict, but to explain.
From Deutsch to Popper: The Principle of Falsifiability
Deutsch's call for "good explanations" connects directly to the work of philosopher of science Karl Popper, who argued that the defining characteristic of a scientific theory is not that it can be verified, but that it is falsifiable.12 A good explanation is a bold, risky conjecture that makes precise predictions and can be refuted by evidence. The process of scientific discovery is an evolutionary one, an iterative cycle of conjecture and refutation that allows us to move away from bad explanations toward better ones.12 This principle of falsifiability—the active search for error and its correction—provides the rigorous philosophical grounding for building true "engines for explanation."
Operationalizing Falsifiability in AI
This core tenet of the scientific method is now being operationalized as a direct solution to the limitations of current AI systems. Two groundbreaking research directions exemplify this shift from verification to falsification:
- The "Popper" Agentic Framework: A 2025 paper by Huang et al. introduces an agentic framework explicitly named "Popper" for the automated validation of free-form hypotheses.13 Confronted with a hypothesis (often generated by an LLM), the system does not seek confirming evidence. Instead, its agents are guided by the principle of falsification to autonomously design and execute experiments aimed at disproving the hypothesis's measurable implications. By employing a sequential testing framework that actively gathers evidence from diverse observations, the Popper agent embodies the scientific method to rigorously validate or refute claims, providing a scalable solution for separating genuine discovery from LLM hallucination; a schematic sketch of such a loop follows this list.13
- Unlearning-as-Ablation as an Epistemic Probe: Another novel proposal reframes the technique of "unlearning" not as a tool for privacy or safety, but as a powerful epistemic probe to test for genuine knowledge generation.14 To determine if a model truly "understands" a scientific result, such as a mathematical theorem, researchers can systematically force the model to "forget" the theorem, its supporting lemmas, and all related paraphrases. The model is then tested on its ability to re-derive the result from first principles and permitted tools. Success would provide strong evidence of generative capability beyond mere recall; failure would expose the model's reliance on memorization. This provides a direct, falsifiable test for the mimesis versus comprehension dichotomy.14
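The falsification loop described in the first item can be pictured as a simple control flow. The sketch below is a hedged schematic in the spirit of sequential falsification, not the published Popper implementation; propose_test and run_experiment are hypothetical hooks, and the e-value-style accumulation is one standard way to control error across sequential tests.

```python
# Schematic conjecture-and-refutation loop (not the published Popper agent).
# Each round designs an experiment that could falsify a measurable implication;
# evidence against the hypothesis accumulates as a running product of e-values.
def falsification_loop(hypothesis, propose_test, run_experiment,
                       max_rounds: int = 10, alpha: float = 0.05) -> str:
    """Return 'refuted' once accumulated evidence crosses the 1/alpha threshold."""
    combined_evidence = 1.0
    for _ in range(max_rounds):
        test = propose_test(hypothesis)      # design an experiment that could fail
        e_value = run_experiment(test)       # > 1 counts as evidence against the hypothesis
        combined_evidence *= e_value
        if combined_evidence >= 1 / alpha:   # refutation threshold
            return "refuted"
    return "not refuted (so far)"
```

Note that the loop never returns "confirmed": surviving the tests only means the hypothesis has not yet been refuted, which is precisely Popper's point.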
Architectures for Explanation: World Models and Neuro-Symbolic AI
This new emphasis on falsifiability and verifiable explanation demands entirely new architectures. The limitations of statistical induction are forcing the field to rediscover the principles of causal modeling and structured reasoning that have driven scientific progress for centuries. Two architectural paradigms stand out as the concrete engineering response to this epistemological shift:
- World Models: This approach, in stark contrast to LLMs that model language, focuses on building agents that learn an internal, compressed, and often causal model of their environment.15 This "world model" allows an agent to simulate future outcomes, reason about cause and effect, and plan effectively, grounding its behavior in a model of reality rather than just statistical patterns in data.15 By integrating principles from physics-informed learning and causal inference, these systems aim to develop the structured, adaptive representations of the world that current LLMs lack.15
- Neuro-Symbolic AI (NSAI): This paradigm directly addresses the weaknesses of pure deep learning by combining the pattern-recognition strengths of neural networks with the structured, formal reasoning of symbolic methods.16 NSAI architectures explicitly integrate neural components (for perception and intuition) with symbolic components (for logic, inference, and explanation), creating a composite system with enhanced capabilities for generalization, transferability, and interpretability.16 This hybrid approach is designed to overcome the causal blindness and compositional failures inherent in purely statistical models; a toy sketch of this split follows the list.
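As a concrete, if toy, illustration of the neuro-symbolic split, the sketch below wires a stand-in "neural" perception step to an explicit rule base. Every predicate, rule, and score in it is a hypothetical example; the point is only that the final conclusions trace back to inspectable symbols and rules rather than to opaque weights.

```python
# Toy neuro-symbolic pipeline: soft perceptual predicates from a (stand-in) neural
# component, hard logical rules in a symbolic component. All names are illustrative.
def neural_perception(image) -> dict[str, float]:
    """Stand-in for a learned perception model: returns predicate confidences."""
    # In a real system these scores would come from a trained network.
    return {"has_wheels": 0.97, "has_wings": 0.04, "on_road": 0.91}

RULES = [
    # (conclusion, required predicates) -- hand-written symbolic knowledge
    ("vehicle", ["has_wheels"]),
    ("aircraft", ["has_wings"]),
    ("road_vehicle", ["has_wheels", "on_road"]),
]

def symbolic_reasoner(predicates: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Fire every rule whose premises all clear the confidence threshold."""
    facts = {p for p, score in predicates.items() if score >= threshold}
    return [head for head, body in RULES if all(b in facts for b in body)]

print(symbolic_reasoner(neural_perception(image=None)))
# ['vehicle', 'road_vehicle'] -- each conclusion is traceable to explicit rules and thresholds
```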
This confluence of research creates a clear and powerful roadmap for the future. The philosophical critique, grounded in the work of Popper, is leading directly to new, more rigorous evaluation methods based on falsifiability. These demanding new tests, in turn, necessitate the development of entirely new architectures, like World Models and Neuro-Symbolic AI, that are natively designed to build, test, and correct verifiable explanations of the world.
Conclusion
The evidence presented forms a cohesive and damning critique of the current LLM paradigm as a path toward genuine machine intelligence. The pursuit of scale, while producing systems of astonishing statistical fluency, is running into fundamental, theoretically-grounded limits.
- The scaling laws themselves are revealing a trajectory of diminishing returns, sustainable only through a "race to efficiency" in hardware and algorithms that is external to the models themselves.
- Systematic empirical testing has moved beyond anecdote to prove that LLMs are constitutionally incapable of robust causal reasoning, a core component of understanding. They are masters of correlation, but blind to causation.
- Theoretical proofs now establish that monolithic, in-weight models have a finite capacity for knowledge, making the industry's pivot to external tools and RAG an admission of a fundamental architectural ceiling.
- A new paradigm is emerging, one that replaces the goal of prediction with the goal of explanation. This shift, grounded in the philosophy of science, is already yielding concrete new evaluation methods and architectural blueprints—such as World Models and Neuro-Symbolic AI—designed for causal modeling and verifiable reasoning.
The Large Language Model, on its own, is not a knowledge system. It is a "fuzzy brain" that has been temporarily untethered from a verifiable reality. The most important work in the field is now focused on re-establishing that connection. The future of AI will not be defined by the size of our models, but by their architecture—their native ability to build, test, and correct verifiable explanations of the world. We are not building a better mirror. We are building an engine for explanation.
References
[1] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761.
[2] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
[3] Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., et al. (2023). "PaLM 2 Technical Report." arXiv:2305.10403.
[4] Lu, C.-P. (2025). "The Race to Efficiency: A New Perspective on AI Scaling Laws." arXiv:2501.02156.
[5] Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
[6] Yamin, K., Gupta, S., Ghosal, G. R., Lipton, Z. C., & Wilder, B. (2024). "Failure Modes of LLMs for Causal Reasoning on Narratives." arXiv:2410.23884.
[7] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting." arXiv:2305.04388.
[8] Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Adauto, F. G., Kleiman-Weiner, M., Sachan, M., & Schölkopf, B. (2023). "CLadder: Assessing Causal Reasoning in Language Models." arXiv:2312.04350.
[9] Chen, K., Zhong, R., Wang, Y., Yu, Q., & Ren, X. (2023). "Knowledge Composition with Large Language Models." arXiv:2307.03927.
[10] Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., & Chi, E. (2022). "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." arXiv:2205.10625.
[11] Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., & Yih, W.-t. (2023). "REPLUG: Retrieval-Augmented Black-Box Language Models." arXiv:2301.12652.
[12] Deutsch, D. (2011). The Beginning of Infinity: Explanations That Transform the World. Viking Press.
[13] Huang, K., Jin, Y., Li, R., Li, M. Y., Candès, E., & Leskovec, J. (2025). "Automated Hypothesis Validation with Agentic Sequential Falsifications." arXiv:2502.09858.
[14] Maini, P., Feng, E., Nanda, N., Sharma, M., Lee, K., & Barez, F. (2024). "Towards Understanding Sycophancy in Language Models." arXiv:2310.13548.
[15] Del Ser, J., Lobo, J. L., Müller, H., & Holzinger, A. (2025). "World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child." arXiv:2503.15168.
[16] Garcez, A. d'Avila, Gori, M., Lamb, L. C., Serafini, L., Spranger, M., & Tran, S. N. (2019). "Neural-Symbolic Computing: An Effective Methodology for Principled Integration of Machine Learning and Reasoning." Journal of Applied Logics, 6(4), 611-632.