Large language models have become the engines behind some of the most impressive feats in contemporary computing. They write complex software, summarize scientific papers, and navigate intricate chains of reasoning. Yet as a recent study shows, these same systems falter on a task that most ten-year-olds can perform with pencil and paper. According to a new article from TechXplore and the accompanying research paper, “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls,” even state-of-the-art models fail almost completely at multiplying two four-digit numbers. This surprising failure opens a window into how these systems learn, why they get stuck, and what it takes to help them move beyond their limitations.[1]
The researchers describe this tension as part of AI’s “jagged frontier,” a landscape where models can excel at sophisticated reasoning yet stumble on seemingly simple tasks. Four-digit multiplication is a perfect example. Humans learn it by breaking the problem into smaller pieces, computing partial products, carrying digits, and keeping track of intermediate sums in their heads. All of this requires holding information in mind across many steps. In machine learning terms, this is called a long-range dependency. It is the ability to store something early in a sequence and retrieve it later when it becomes relevant.
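To make the long-range dependency concrete, here is a minimal sketch (our illustration, not code from the paper) of schoolbook multiplication in Python. Notice that digit-pair products computed at the very start must be held and combined with carries long after they are produced, which is exactly the kind of bookkeeping the models struggle to learn.

```python
# Illustrative sketch: schoolbook multiplication of two 4-digit numbers,
# written to make the long-range dependency explicit.
def schoolbook_multiply(a: int, b: int) -> int:
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]

    # Column sums: each early digit-pair product must be stored until the
    # column it belongs to is resolved, possibly many steps later.
    columns = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            columns[i + j] += da * db

    # Resolve carries from the least significant column upward: information
    # produced early keeps propagating into later output digits.
    result, carry = [], 0
    for col in columns:
        total = col + carry
        result.append(total % 10)
        carry = total // 10
    while carry:
        result.append(carry % 10)
        carry //= 10
    return int("".join(str(d) for d in reversed(result)))

assert schoolbook_multiply(4731, 8265) == 4731 * 8265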
This is where today’s models struggle. Standard language models learn by recognizing patterns in the data they are trained on. They excel when the next step in a sequence can be predicted from a nearby context. But as the TechXplore article notes, the more complex a problem becomes, the less likely a model is to have seen that exact pattern before. Four-digit multiplication requires a model to remember earlier computations while generating later digits, and that is something pattern matching alone cannot accomplish.
To understand why, the researchers tested models with anywhere from 2 to 12 layers (a “layer” is one pass of computation within an AI model, where information is transformed before being passed to the next step). Regardless of size, every model trained with standard fine-tuning achieved less than one percent accuracy on four-digit multiplication. Fine-tuning is a common method for adapting a model to a new task: it involves feeding the model many four-digit multiplication problems and adjusting its parameters so that it predicts the correct product. This approach works well when a task can be learned by scaling up data or depth, but here scaling did nothing. The models consistently converged on what the researchers call a local optimum, a solution that looks good from the model’s perspective but is fundamentally flawed because it never learns to store and retrieve intermediate information.
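For illustration, the training data used in this kind of standard fine-tuning can be sketched as plain question-and-answer pairs with no intermediate steps; the exact formatting below is an assumption for the sake of the example, not the authors' setup.

```python
# Illustrative sketch (assumed formatting, not the authors' code):
# direct question -> answer pairs for standard fine-tuning, with no
# intermediate reasoning in the target.
import random

def make_example(rng: random.Random) -> tuple[str, str]:
    a = rng.randint(1000, 9999)
    b = rng.randint(1000, 9999)
    prompt = f"{a} * {b} ="
    target = str(a * b)  # the model must emit the answer digits directly
    return prompt, target

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(5)]
for prompt, target in dataset:
    print(prompt, target)
```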
The heart of the problem is that multiplication requires a model to carry information forward. If it cannot remember partial products or running sums, it cannot compute the correct digits later in the sequence. The researchers confirmed this by probing the models' internal states. They attempted to decode intermediate values, such as the running sum that humans compute naturally. In the standard fine-tuned models, these values were nowhere to be found. The models simply had not learned to represent them.
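A common way to run this kind of check is a linear probe: a small classifier trained to read an intermediate quantity out of a model's hidden states. The sketch below is illustrative only, with stand-in activations and made-up shapes rather than the authors' actual models; the point is that if even a dedicated probe cannot recover the running sum, the model almost certainly is not representing it.

```python
# Illustrative sketch (assumed setup, not the authors' code): a linear probe
# that tries to decode an intermediate value, such as a running sum, from a
# model's hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: hidden_states is (n_examples, hidden_dim), collected at
# one layer/position; labels is the intermediate value we hope is encoded
# (here bucketed into classes, e.g. the running sum modulo 10).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 256))  # stand-in activations
labels = rng.integers(0, 10, size=2000)       # stand-in targets

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~chance here by design
```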
To explore what successful learning looks like, the team examined a model trained with a different method called Implicit Chain of Thought (ICoT). This technique begins by providing the model with explicit, step-by-step reasoning during training. Over time, those intermediate steps are gradually removed, forcing the model to internalize the reasoning process rather than rely on visible hints. The contrast was stark: whereas standard fine-tuning achieved less than 1% accuracy, the ICoT-trained model reached 100% accuracy.
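The curriculum can be illustrated roughly as follows. This sketch assumes a simple schoolbook-style decomposition and a made-up three-stage schedule; the real ICoT procedure removes reasoning tokens gradually over the course of training rather than in fixed jumps.

```python
# Illustrative sketch (assumed decomposition and schedule, not the authors'
# implementation) of an ICoT-style curriculum: start with full step-by-step
# targets, then remove intermediate steps until only the answer remains.
def make_target(a: int, b: int, steps_kept: int) -> str:
    # Explicit reasoning: one partial product per digit of b (schoolbook style).
    b_digits = [int(d) for d in str(b)][::-1]
    partials = [f"{a} * {d} * 10^{i} = {a * d * 10**i}"
                for i, d in enumerate(b_digits)]
    kept = partials[len(partials) - steps_kept:] if steps_kept > 0 else []
    return " ; ".join(kept + [f"answer = {a * b}"])

# Later training stages keep fewer visible reasoning steps.
for stage, steps_kept in enumerate([4, 2, 0]):
    print(f"stage {stage}:", make_target(1234, 5678, steps_kept))
```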
By reverse-engineering the ICoT model, the researchers uncovered how it succeeded. The model learned to track long-range dependencies by organizing its attention patterns into structured pathways across time. In early layers, it computed products of digit pairs and stored them in specific positions, almost like placing documents into labeled folders. In later layers, it retrieved exactly the information needed to compute each digit of the final answer. This internal filing system never emerged in the standard model.
Even more intriguing, the ICoT model developed elegant internal representations. Instead of treating digits as simple symbols, it encoded them as wave-like patterns known as Fourier bases (a Fourier basis is a set of simple wave patterns that can be combined to represent numbers or signals in a compact, structured way). When multiplying digit pairs, it used a geometric operation called a Minkowski sum. None of this was programmed by the researchers; it emerged naturally as the model learned to solve the task. It is as if the model invented its own compact mathematical language for arithmetic.
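A rough intuition for the Fourier encoding, using assumed frequencies rather than the ones found in the model: each digit becomes a point on a few circles, and arithmetic on digits turns into combining angles rather than memorizing a 10-by-10 lookup table. The sketch below shows the flavor of the trick for addition modulo 10; the paper's finding about digit-pair products and Minkowski sums is the more elaborate version of the same idea.

```python
# Illustrative sketch (assumed frequencies, not extracted from the model):
# representing a decimal digit with a small set of Fourier features.
import numpy as np

def fourier_encode(d: int, freqs=(1, 2, 5)) -> np.ndarray:
    angles = [2 * np.pi * k * d / 10 for k in freqs]
    return np.concatenate([[np.cos(a), np.sin(a)] for a in angles])

# Adding two digits modulo 10 corresponds to adding angles at frequency k=1:
d1, d2 = 7, 8
a1 = 2 * np.pi * d1 / 10
a2 = 2 * np.pi * d2 / 10
recovered = round((a1 + a2) / (2 * np.pi / 10)) % 10
print(fourier_encode(3))
print("(7 + 8) mod 10 recovered from angles:", recovered)  # -> 5
```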
Armed with this understanding, the researchers asked whether the failing models could be rescued with the right guidance. If the core issue was the inability to track intermediate values, perhaps the model simply needed a training signal that encouraged it to do so. They introduced a small auxiliary objective that required the model to predict the running sum at each step. This gentle nudge provided the missing inductive bias. The result was dramatic. A two-layer model that previously achieved less than 1% accuracy suddenly reached 99% accuracy without any explicit supervision of the chain of thought.
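In training-code terms, such an auxiliary objective is typically just an extra prediction head and an extra loss term. The sketch below uses a toy stand-in backbone and invented shapes rather than anything from the paper, but it shows the mechanics: the hidden states are asked to support a running-sum prediction at every position in addition to the usual next-token objective.

```python
# Illustrative sketch (assumed shapes and names, not the authors' code):
# an auxiliary head that predicts a running-sum class at every position,
# trained alongside the usual next-token objective.
import torch
import torch.nn as nn

class TinyModelWithAuxHead(nn.Module):
    def __init__(self, vocab_size=16, hidden=64, n_sum_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for a transformer
        self.lm_head = nn.Linear(hidden, vocab_size)      # next-token prediction
        self.aux_head = nn.Linear(hidden, n_sum_classes)  # running-sum prediction

    def forward(self, tokens):
        h, _ = self.backbone(self.embed(tokens))
        return self.lm_head(h), self.aux_head(h)

model = TinyModelWithAuxHead()
tokens = torch.randint(0, 16, (8, 12))        # fake batch: 8 sequences of 12 tokens
next_tokens = torch.randint(0, 16, (8, 12))   # fake next-token targets
running_sum = torch.randint(0, 10, (8, 12))   # fake running-sum targets (e.g. mod 10)

lm_logits, aux_logits = model(tokens)
loss = (nn.functional.cross_entropy(lm_logits.flatten(0, 1), next_tokens.flatten())
        + 0.1 * nn.functional.cross_entropy(aux_logits.flatten(0, 1), running_sum.flatten()))
loss.backward()  # the auxiliary term nudges hidden states to carry the running sum
```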
When the team examined the attention patterns of this new, successful model, they found that it had learned mechanisms similar to the ICoT model’s. It stored and retrieved partial products and even developed additional strategies, such as tracking multiple digit pairs simultaneously. This confirmed that the right inductive bias can unlock capabilities that scaling alone cannot reach.
The implications extend far beyond multiplication. Long-range dependencies appear throughout language modeling, scientific reasoning, planning, and any task that requires information to be carried across many steps. The study shows that standard training methods can trap models in shallow solutions that look correct locally but fail globally. It also shows that carefully designed inductive biases can help models escape these traps and learn deeper, more structured reasoning.
The researchers recommend that future work focus on developing general-purpose inductive biases that help models track information across long sequences. Rather than relying solely on more data or larger models, the field may need to incorporate architectural insights that encourage models to build internal representations that support reasoning.
In practical terms, this research could influence how future AI systems are designed. Models that can reliably store and retrieve intermediate information will be better equipped to perform tasks such as multi-step planning, mathematical reasoning, scientific analysis, and complex decision-making. The real-world impact could be significant, especially in domains where accuracy and reliability matter.
The path ahead involves exploring new training techniques, refining architectural components, and developing tools that help models learn processes rather than patterns. This study provides a clear example of how understanding a model’s internal mechanics can lead to breakthroughs in capability. It reminds us that AI is still in its infancy, and that progress in AI is not just about making models bigger. It is about making them smarter in how they learn, remember, and reason.
This article is shared at no charge for educational and informational purposes only.
Red Sky Alliance is a Cyber Threat Analysis and Intelligence Service organization. We provide indicators-of-compromise information via a notification service (RedXray) or an analysis service (CTAC). For questions, comments, or assistance, please contact the office directly at 1-844-492-7225 or feedback@redskyalliance.com
- Reporting: https://www.redskyalliance.org/
- Website: https://www.redskyalliance.com/
- LinkedIn: https://www.linkedin.com/company/64265941
Weekly Cyber Intelligence Briefings:
REDSHORTS - Weekly Cyber Intelligence Briefings
https://register.gotowebinar.com/register/5207428251321676122
[1] https://six3ro.substack.com/p/when-simple-becomes-hard-what-four