"Occam's Hill"
A neural network memorizes its training data. Every example stored, every answer rote. Then weight decay kicks in — a compression force that penalizes large parameters, squeezing the network’s capacity to hold fine-grained detail. Features are lost. Information is destroyed.
And then the network generalizes.
This is grokking, and it shouldn’t work. The network didn’t receive new data. It didn’t get better examples or more training time on novel inputs. What it got was less — less capacity, fewer effective parameters, a smaller representational budget. The information that was destroyed wasn’t noise. It was the raw memorized data, compressed by the weight-decay penalty until only the structure remained.
The features that survived the compression are the generalization. The network didn’t learn the pattern and then compress it for storage. It learned the pattern through compression. The forgetting was the understanding.
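To make the "compression force" concrete, here is a minimal sketch of weight decay acting on a toy parameter vector. The learning rate, decay coefficient, and the helper name `decay_step` are illustrative assumptions, not values from any grokking experiment. With the loss gradient set to zero, each step multiplies every weight by (1 - lr * weight_decay), so anything the loss is not actively holding in place shrinks geometrically.

```python
import numpy as np

def decay_step(weights, grad, lr=0.1, weight_decay=0.5):
    """One gradient step with decoupled weight decay (hypothetical helper)."""
    # The last term pulls every weight toward zero in proportion to its size.
    return weights - lr * grad - lr * weight_decay * weights

w0 = np.array([4.0, -2.0, 0.5])
w = w0.copy()
zero_grad = np.zeros_like(w)   # assume the loss exerts no pull on these weights

for _ in range(50):
    w = decay_step(w, zero_grad)

norm_before = float(np.linalg.norm(w0))
norm_after = float(np.linalg.norm(w))
print(f"norm before: {norm_before:.3f}, after 50 steps: {norm_after:.3f}")
```

In a real network the gradient is not zero, which is the point: only weights that keep earning their size through the loss survive the shrinkage.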
This is not a peculiarity of neural networks. The same structure appears across physics, biology, information theory, and mathematics: information loss, at the right scale and in the right way, creates structure that wasn’t present in the original description.
Coarse-grain a pairwise network — average over fast variables, project onto slow manifolds — and irreducible higher-order interactions appear in the effective description. The three-body coupling wasn’t in the original equations. Compression manufactured it.
Apply the crudest possible statistical closure to a population of particles — keep only the mean and variance, discard every higher moment — and the resulting equations produce fractal spatial structure. The original distribution was smooth. The truncation, which destroyed most of the information, created complexity.
Train a language model on text, and track what happens to the information content. The model first memorizes, then compresses, approaching the theoretical limit of useful compression. Performance improves during the compression phase, not despite it.
When does this work? Not always. Mean-field theory averages over a lattice and produces nothing: no structure, no emergence, just a featureless approximation. Linear PCA compresses data by discarding low-variance components, and when the signal lives in those components it removes exactly what you needed to keep. Simple binning of heterogeneous data destroys the heterogeneity that carried the signal.
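The PCA failure is easy to reproduce. The sketch below uses synthetic data in which the signal is deliberately placed in the low-variance direction; the variable names and the variances are assumptions chosen for illustration. Keeping only the top principal component keeps the loud, irrelevant direction and throws the target away.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
loud = rng.normal(0.0, 10.0, size=n)   # high variance, irrelevant to the target
quiet = rng.normal(0.0, 1.0, size=n)   # low variance, carries the signal
X = np.column_stack([loud, quiet])
y = quiet.copy()                        # the target depends only on the quiet feature

Xc = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
kept = Xc @ Vt[0]       # the component PCA keeps when reducing to one dimension
dropped = Xc @ Vt[1]    # the component PCA discards

corr_kept = abs(np.corrcoef(kept, y)[0, 1])
corr_dropped = abs(np.corrcoef(dropped, y)[0, 1])
print(f"|corr| with target: kept={corr_kept:.3f}, dropped={corr_dropped:.3f}")
```

The compression here is selective, but it selects on variance rather than on relevance to the output, which is why it destroys instead of creating.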
The difference is the structure of the compression. Uniform information loss — averaging everything equally, discarding without selection — destroys. Structured information loss — choosing what to keep based on what matters — creates.
There’s a name for the optimal point. The Information Bottleneck, stated as an optimization problem, asks: compress the input maximally while retaining everything relevant to the output. Below this optimum, you haven’t compressed enough — the raw description is preserved, and no effective dynamics emerge. Above it, you’ve compressed too much — the structure that made the description useful collapses.
The optimum is a hill. Call it Occam’s Hill.
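For reference, the Information Bottleneck trade-off described above is conventionally written as a Lagrangian. The symbols are the standard ones rather than anything this essay defines: X is the input, Y the output, T the compressed representation (with the Markov chain Y → X → T), and β the exchange rate between compression and relevance.

```latex
\min_{p(t \mid x)} \; \mathcal{L}\bigl[p(t \mid x)\bigr]
  = I(X;T) - \beta \, I(T;Y), \qquad \beta > 0
```

The first term rewards forgetting the input; the second rewards keeping what predicts the output. Small β favors aggressive compression, and large β favors retention.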
The grokking phenomenon has a mechanism. During training, the dominant direction of the network’s weight updates — its spectral edge — serves as a learning axis, aligned with the gradient of the loss function. Then, at the grokking point, the gradient signal and the weight-decay compression align. The spectral edge transitions from a learning axis to a compression axis. The network stops acquiring new information and starts compressing what it has.
What emerges from the compression is not what went in. Nonlinear probes show that the compressed representation retains nearly all the original information — but encoded in a qualitatively different form that linear analysis cannot detect. The information wasn’t removed. It was reorganized into a structure that generalizes.
The Occam’s Hill curve can be measured directly. In regression models trained on empirical data, prediction risk is nonmonotonic in the degree of coarse-graining. Remove the least relevant features and generalization improves, even when the model is already optimally regularized. Remove too much and performance collapses. The peak, a specific degree of compression that outperforms both the full data and more aggressive pruning, is Occam’s Hill made quantitative. The compression creates a representation that generalizes better than the complete description does.
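A toy version of this measurement can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic rather than empirical, the ridge penalty is fixed rather than optimally tuned, and features are ranked by a simple covariance score. It shows the shape of the experiment, not the result cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p, n_signal = 100, 1000, 100, 5

beta = np.zeros(p)
beta[:n_signal] = 3.0                      # only the first 5 features matter

def make_data(n):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)      # unit-variance observation noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Rank features by absolute covariance with the target, using training data only.
relevance = np.abs((X_tr - X_tr.mean(axis=0)).T @ (y_tr - y_tr.mean()))
order = np.argsort(relevance)[::-1]        # most relevant first

def ridge_test_mse(cols, lam=1.0):
    """Fit ridge regression on the chosen columns; return held-out squared error."""
    A, B = X_tr[:, cols], X_te[:, cols]
    w = np.linalg.solve(A.T @ A + lam * np.eye(len(cols)), A.T @ y_tr)
    return float(np.mean((B @ w - y_te) ** 2))

kept_counts = list(range(1, p + 1))        # number of features retained
risks = [ridge_test_mse(order[:k]) for k in kept_counts]
best_k = kept_counts[int(np.argmin(risks))]
print(f"lowest held-out risk when keeping {best_k} of {p} features")
```

Plotting `risks` against `kept_counts` gives the curve in question: keeping everything sits at one end, keeping a single feature at the other, and the lowest held-out risk falls in between.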
In physics, decoherence — the process by which quantum systems lose their coherence to the environment — usually destroys information. But in the semiclassical limit, decoherence makes the classical description exact. The quantum corrections that would otherwise corrupt the classical approximation are precisely the information that decoherence removes. The defect is the fix. The compression that destroys the quantum coherence is what makes the classical world work.
This pattern — compression creating structure — has a formal backbone. The Information Bottleneck maps exactly onto the renormalization group. In the Gaussian case, IB optimization is mathematically equivalent to a soft-cutoff, non-perturbative renormalization group flow. Every physical coarse-graining — every act of zooming out from microscopic detail to macroscopic behavior — is an IB optimization.
This means emergence has a semigroup structure. Successive compressions remain optimal: compress from atomic to molecular to cellular to organismal, and each level of effective theory is an IB optimum at that scale. The creation is iterable. Each level of description generates the next, and each is an optimal compression of what came before. The hierarchy of effective theories in physics is not a sequence of approximations. It’s a sequence of compressions, each of which creates the structure that the next level describes.
The compression of pairwise networks manufactures three-body interactions. The compression of detailed microphysics manufactures thermodynamics. The compression of raw sensory data manufactures perception. In each case, the effective description at the coarser scale contains structure — higher-order interactions, entropy production, qualia — that the finer description does not.
The most revealing test is what happens after the compression stops. Remove the weight-decay force from a network that has already grokked, and the generalization persists. The algorithm survives the removal of the pressure that created it. Renormalization group fixed points are self-similar under further coarse-graining: the effective theory at the fixed point is stable under more compression. The Information Bottleneck optimum is a saddle point: the representation it produces organizes the entire space around it.
In each case, the compression creates something self-sustaining. Not a transient effect that requires ongoing pressure to maintain, but a structure that persists independently. The creation outlives the creator.
This distinguishes compression-as-creation from noise removal. Noise removal is subtractive: remove the bad, keep the good. The good was always there. Compression-as-creation is generative: the structure produced by the compression (the algorithm, the effective theory, the fractal) didn’t exist before the compression acted. And it doesn’t disappear when the compression stops.
Occam’s razor tells you to prefer the simpler explanation. It doesn’t tell you where to stop cutting. Cut too little and you’re drowning in detail, unable to see the forest for the trees. Cut too much and you’ve thrown away the forest entirely. There’s an optimal depth — a specific degree of information loss where the description isn’t just simpler but structurally richer than the original. Occam’s razor says cut. Occam’s Hill says where.
The network that lost its memorized features didn’t become dumber. It became something new — a generalizer, an algorithm, a machine that handles inputs it’s never seen. The information that was destroyed wasn’t wasted. It was fuel. Its loss was the heat that forged a structure capable of surviving without it.