A Different Lens on LLM Reasoning
Much of the recent progress in reasoning models has leaned heavily on the idea of longer chains of thought. Prompting techniques, self-consistency sampling, and other inference strategies encourage models to produce extended reasoning traces under the assumption that more tokens correspond to more thinking.
A recent paper from researchers at Google and the University of Virginia, "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens," challenges that assumption directly. Across several reasoning benchmarks, the authors show that longer responses are often negatively correlated with correctness. Models sometimes "overthink." Generating additional reasoning steps does not necessarily improve the quality of the answer and can actively make it worse.
So if length isn't the right measurement for reasoning quality, what is?
Measuring Internal Deliberation
Transformer models generate tokens one at a time, but internally each token prediction is refined across many layers. The authors exploit this by projecting intermediate-layer hidden states into the vocabulary space, essentially asking "what would the model predict if it stopped processing right here?" at each layer.
What they find is that not all tokens require the same amount of internal work. Function words like "and" or "is" stabilize early. The model effectively "knows" them before the deeper layers finish processing. But tokens that carry real reasoning weight, like the results of calculations or answer selections, keep shifting all the way through the network. The model's internal "draft" gets revised layer after layer before it finally commits. The authors call these deep-thinking tokens.
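To make the layer-by-layer readout concrete, here is a minimal sketch of the idea, not the paper's implementation. It assumes you already have one token position's hidden state at every layer, plus the model's unembedding matrix, as NumPy arrays. The names `settle_layer`, `hidden_states`, and `unembed` are made up for illustration. It also uses "top-1 prediction agrees with the final layer" as a stand-in for stabilization, which is my simplification, and it skips the final normalization a real model applies before projecting.

```python
import numpy as np

def settle_layer(hidden_states, unembed):
    """Earliest layer from which the top-1 prediction matches the final
    layer's prediction and never changes again.

    hidden_states: (num_layers, d_model) states for ONE token position.
    unembed:       (d_model, vocab_size) projection to vocabulary logits.
    Both inputs are assumptions of this sketch.
    """
    logits = hidden_states @ unembed       # (num_layers, vocab_size)
    draft = logits.argmax(axis=-1)         # each layer's "draft" token id
    final = draft[-1]
    settle = len(draft) - 1
    # Walk backwards while earlier layers already agree with the final answer.
    while settle > 0 and draft[settle - 1] == final:
        settle -= 1
    return settle
```

A token whose draft locks in at layer 1 of 32 behaves like "and" or "is" in the description above. One that only settles in the last few layers is a candidate deep-thinking token.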
From this, they define a simple metric: the Deep-Thinking Ratio (DTR), the proportion of tokens in a response that required this deeper internal processing. The average correlation between DTR and accuracy was high across eight model variants and four benchmarks. That's a strong, consistent positive signal where length gave a negative one.
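As a rough illustration of the ratio itself, the sketch below counts how many tokens only stabilized late in the network. It assumes you have, for each generated token, the layer index at which its prediction settled. The function name and the `depth_fraction` cutoff are hypothetical choices for this example, and the default value is illustrative rather than taken from the paper.

```python
def deep_thinking_ratio(settle_layers, num_layers, depth_fraction=0.85):
    """Fraction of tokens whose prediction settled only in the deepest layers.

    settle_layers:  per-token layer index (0-based) where the prediction settled.
    num_layers:     total number of layers in the model.
    depth_fraction: hypothetical cutoff; tokens settling at or beyond this
                    fraction of the depth count as "deep-thinking".
    """
    if not settle_layers:
        return 0.0
    cutoff = depth_fraction * (num_layers - 1)
    deep = sum(1 for layer in settle_layers if layer >= cutoff)
    return deep / len(settle_layers)
```

For example, in a 32-layer model, a four-token response whose tokens settled at layers 2, 30, 31, and 5 has two deep-thinking tokens under this cutoff, so the ratio is 0.5.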
Why This Matters
The measurement itself is interesting, but the practical application is where it gets compelling.
Most reasoning pipelines today scale performance by generating many candidate answers and selecting among them through majority voting (self-consistency). This improves accuracy but multiplies inference cost. The authors introduce Think@n, which uses DTR to estimate reasoning quality early and discard unpromising candidates before they finish generating. The surprising finding: computing DTR from just the first 50 tokens was enough, and actually outperformed using longer prefixes or the full sequence. Think@n matched or exceeded standard self-consistency while cutting inference cost roughly in half.
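The selection step can be sketched in a few lines. This is my reading of the idea under stated assumptions, not the authors' code: `prefix_dtrs` holds a DTR estimate computed from each candidate's opening tokens, `finish` is a hypothetical callback that completes one candidate and returns its final answer, and `keep_fraction` is an illustrative knob for how many candidates survive.

```python
from collections import Counter
from typing import Callable, Sequence

def think_at_n(prefix_dtrs: Sequence[float],
               finish: Callable[[int], str],
               keep_fraction: float = 0.5) -> str:
    """Majority vote among only the candidates with the highest prefix DTR.

    prefix_dtrs[i]: DTR estimated from the first tokens of candidate i.
    finish(i):      hypothetical callback; completes candidate i, returns its answer.
    """
    if not prefix_dtrs:
        raise ValueError("need at least one candidate")
    n_keep = max(1, int(len(prefix_dtrs) * keep_fraction))
    # Rank candidate indices by prefix DTR, highest first.
    ranked = sorted(range(len(prefix_dtrs)),
                    key=lambda i: prefix_dtrs[i], reverse=True)
    survivors = ranked[:n_keep]
    # Only survivors pay the full generation cost; the rest are dropped early.
    answers = [finish(i) for i in survivors]
    return Counter(answers).most_common(1)[0][0]
```

The savings come from the fact that `finish` is only called for the survivors. Plain self-consistency would finish every candidate before voting.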
One nuance worth noting: models configured for higher reasoning levels actually showed lower DTR per token, even while achieving better accuracy. The authors suggest these configurations redistribute computation from depth to length, doing less deep revision per token but generating longer chains. This means DTR isn't a universal ruler across models or settings. It's a within-configuration signal, which the authors acknowledge openly.
The bigger (potential) conceptual shift is what sticks out to me. Many current approaches treat reasoning as something that scales with the number of tokens a model produces. This work suggests reasoning may instead scale with the amount of computation the model performs before producing those tokens. The visible reasoning trace, the chain of thought we can read, may not be the best proxy for how much thinking is actually happening.
This paper doesn't claim to solve the problem of measuring reasoning in language models, and the benchmarks tested are primarily mathematical and scientific. But it provides a compelling mental model for how reasoning occurs inside modern transformer systems, and a practical technique for making inference more efficient in the process.
Takeaways
To me, a few things from this paper are worth sitting with:
- Longer reasoning traces don't mean better reasoning. Across multiple benchmarks and models, output length was negatively correlated with accuracy. More tokens often meant more wrong.
- Internal computation depth is a better signal than output volume. The Deep-Thinking Ratio, measuring how much revision happens inside the model before a token is produced, consistently predicted answer quality where length and confidence metrics failed.
- You can detect reasoning quality early. The first 50 tokens of a response were enough to estimate whether a candidate answer was worth finishing. That's a practical lever for cutting inference cost roughly in half without sacrificing accuracy.
- The visible chain of thought may not reflect actual reasoning effort. What we can read in a model's output is a surface-level artifact. The real work may be happening in the layers beneath it, invisible to us.
Note: I first heard about this paper on the "Last Week In AI" podcast, which I love! The hosts had a nice breakdown and discussion of the paper, along with many other topics (Episode 235). Recommended listening for anyone interested in these kinds of topics.