"The Narrow Window"
Chain-of-thought reasoning helps language models, but only in a narrow window. With 32 tokens of reasoning, accuracy improves by 45%. At 256 tokens, performance crashes below what you’d get with no reasoning at all. The benefit doesn’t plateau. It reverses.
This pattern isn’t special to reasoning. In pharmacology, cumulative dose-response can be monotonic even when instantaneous response is non-monotonic, but only if the circuit architecture is right; some circuit motifs lose monotonicity altogether. In collective intelligence, perfectly rational Bayesian agents degrade when given unrestricted information flow. They’re not irrational: the information itself creates cascades that overwhelm individual processing. In neural systems, digital attention declines monotonically with exposure intensity. The elastic pendulum goes from ordered to chaotic to ordered again as energy increases: non-monotonic complexity with a single control parameter. Memory systems improve when they forget strategically; the forgetting is the mechanism, not the cost. Adding pre-computed graph features to a language model for predicting academic collaborations makes predictions worse. Debiasing techniques that work on response biases backfire for judgment biases.
Eight independent systems. Eight fields. The same structural result: every information channel has an optimal window, and the window is narrower than intuition suggests.
What makes this more than a list is what it excludes. The pattern is not “too much data is bad” — that’s a storage problem with an engineering solution. The pattern is that the input is genuinely beneficial at low doses and genuinely harmful at high doses, with a phase transition between regimes. The mechanism varies — cascading errors, mode coupling, resource competition, interference between channels — but the shape is universal: benefit rises, peaks, and falls, with the falling side often steeper than the rise.
The practical consequence is uncomfortable. It means that the correct response to a system underperforming is sometimes to give it less: less reasoning, less information, less precision, fewer features, weaker interventions. Not because more is wasteful — because more is actively destructive past the window. The optimization problem isn’t to maximize input. It’s to find the window and stay inside it.
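Finding the window can be treated as a sweep rather than a maximization. The sketch below is one minimal way to do that in Python, under stated assumptions: `find_window`, `toy_score`, and the `tolerance` parameter are hypothetical names introduced here for illustration, and the toy curve is invented to mimic the rise-peak-steep-fall shape, not taken from any of the results above. In practice `score` would be whatever evaluation you already run at a given input level (a reasoning-token budget, a feature count, an intervention strength).

```python
from typing import Callable, Sequence, Tuple


def find_window(
    levels: Sequence[float],
    score: Callable[[float], float],
    baseline: float,
    tolerance: float = 0.05,
) -> Tuple[float, Tuple[float, float]]:
    """Sweep input levels and return (best_level, (window_low, window_high)).

    The window is the contiguous run of levels around the peak whose score
    stays above the no-input baseline and within `tolerance` of the peak.
    `levels` is assumed to be sorted in increasing order.
    """
    scores = [score(x) for x in levels]
    peak_idx = max(range(len(levels)), key=scores.__getitem__)
    floor = max(baseline, scores[peak_idx] - tolerance)

    # Walk outward from the peak until the score drops below the floor.
    lo = peak_idx
    while lo > 0 and scores[lo - 1] >= floor:
        lo -= 1
    hi = peak_idx
    while hi < len(levels) - 1 and scores[hi + 1] >= floor:
        hi += 1
    return levels[peak_idx], (levels[lo], levels[hi])


def toy_score(x: float) -> float:
    """Invented curve for illustration only: slow rise to a peak at x = 32,
    then a fall that is steeper than the rise, clamped at zero."""
    if x <= 32:
        return 0.5 + 0.3 * x / 32
    return max(0.0, 0.8 - 0.6 * (x - 32) / 32)


levels = [0, 8, 16, 32, 64, 128, 256]
best, window = find_window(levels, toy_score, baseline=toy_score(0), tolerance=0.2)
print(best, window)
```

The design choice worth noting is the `floor`: the search never accepts a level that scores below the no-input baseline, so the returned window excludes the regime where more input is actively harmful, not merely the regime where it is less helpful.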