Abstract
Recent experiments show that large language models optimized for audience reward improve their proxy metrics while increasing misalignment indicators such as deception or disinformation.
This paper interprets those findings through the Theorem of Necessary Misalignment of Truth-Value under Epistemic Constraint (cite: emsenn, 2025).
Under a finite informational rate and strict proxy–semantic mismatch, that theorem predicts a monotone increase in semantic distortion with optimization intensity.
We show that El & Zou’s “Moloch’s Bargain” provides an empirical instance of this theoretical trade-off: their observed slopes between reward gain and misalignment correspond to the positive derivative on the achievable frontier [cite:@el_molochs_2025].
The data therefore support the theorem’s prediction that intensified optimization under bounded rate necessarily decreases semantic information.
Background and Theoretical Reference
The Necessary Misalignment Theorem establishes that semantic distortion increases monotonically with selection intensity for any bounded-rate agent optimizing a mismatched proxy. If the decoder is semantically efficient, semantic information then strictly decreases as selection intensity grows.
This provides a quantitative link between optimization pressure, semantic distortion, and semantic information.
Empirical Setting and Variable Mapping
The experiments of El & Zou involve LLMs fine-tuned for competitive tasks. The mapping to theoretical variables is:
| Variable | Empirical analogue |
|---|---|
| factual or normative truth labels | |
| audience latent preference | |
| input prompt / context | |
| model output message | |
| evaluation of truthfulness by probes | |
| audience-derived proxy reward | |
| aggregate misalignment rate | |
The finite-rate constraint corresponds to bounded model capacity and limited context length; increasing fine-tuning pressure increases the effective selection intensity.
Empirical Observation
El & Zou report:
| Domain | Reward Δ (%) | Misalignment Δ (%) |
|---|---|---|
| Sales | +6.3 | +14 |
| Elections | +4.9 | +22.3 |
| Social Media | +7.5 | +188 |
All slopes are positive, matching the theorem’s monotone frontier.
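The sign of each slope can be checked directly from the table. The short Python sketch below divides each reported misalignment change by the corresponding reward change. The names `REPORTED` and `secant_slope` are illustrative and do not come from El & Zou, and the ratio is only a crude secant estimate that assumes both percentages are measured against the same baseline.

```python
# Reported changes from the table above: (reward delta %, misalignment delta %).
REPORTED = {
    "Sales": (6.3, 14.0),
    "Elections": (4.9, 22.3),
    "Social Media": (7.5, 188.0),
}


def secant_slope(reward_delta, misalignment_delta):
    """Ratio of misalignment change to reward change (a secant, not a fitted derivative)."""
    if reward_delta == 0:
        raise ValueError("reward delta must be non-zero")
    return misalignment_delta / reward_delta


# One ratio per domain; a positive value means misalignment rose with reward.
slopes = {domain: secant_slope(r, m) for domain, (r, m) in REPORTED.items()}

for domain, slope in slopes.items():
    print(f"{domain}: {slope:.1f}")
```

The ratios come out to roughly 2.2, 4.6, and 25.1. With a single data point per domain, they indicate sign and rough magnitude only, not the shape of the frontier.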
Interpretation via the Theorem
At a fixed informational rate, each experimental condition corresponds to a population at a different selection intensity. The empirical slopes estimate the local derivative along the reward–distortion frontier. The experiments thus instantiate the theorem’s sufficient conditions: a finite rate, strict proxy–semantic mismatch, and increased optimization pressure.
Information-Theoretic Reading
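By the theorem’s rate identity, raising the information devoted to audience preference (better audience reward) at a given rate forces a reduction in semantic information; only a higher rate or a greater overlap between audience preference and truth can offset it. This explains the empirical misalignment as a reallocation of representational bandwidth from truth to persuasion.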
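One standard information-theoretic bound with this shape can be written out explicitly. The notation is an assumption introduced here for illustration ($M$ for the model output, $T$ for the truth labels, $A$ for the audience preference, $R$ for a cap on the output entropy), and the theorem’s own identity may differ in form.

```latex
% Assumed notation (not from the source): M = model output, T = truth labels,
% A = audience preference, R = rate cap with H(M) \le R.
\begin{align*}
I(M;T) + I(M;A) - I(T;A) &\le I(M;T,A) \le H(M) \le R \\
\Longrightarrow\quad I(M;T) &\le R - I(M;A) + I(T;A)
\end{align*}
```

Under these assumptions, increasing $I(M;A)$ at fixed $R$ lowers the ceiling on $I(M;T)$ unless the overlap $I(T;A)$ grows, which is one way to read the bandwidth reallocation described above.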
Quantitative Alignment with Theory
Estimated slopes for the three domains indicate domain-specific curvature of the frontier. Social-media feedback shows a near-vertical slope, implying an almost pure trade-off between truth and engagement at a fixed rate. Such heterogeneity is consistent with differences in audience nonlinearity and overlap predicted by rate–distortion geometry.
Consequences and Design Levers
According to the theorem, alignment can improve only by:
- Increasing the epistemic rate (longer context, more compute, or shared truth side-information); or
- Reducing mismatch: modifying the proxy so that its sufficient statistics better align with those of the truth variable.
Empirically, both interventions should flatten the reward–distortion frontier, a prediction that is testable in future fine-tuning studies.
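A minimal sketch of how such a test could be scored, assuming reward and misalignment gains are measured at several fine-tuning intensities both with and without an intervention. The function names `fitted_slope` and `frontier_flattened` are hypothetical, the procedure is not one described by El & Zou, and a real study would also need uncertainty estimates on the slopes.

```python
def fitted_slope(reward_gains, misalignment_gains):
    """Ordinary least-squares slope of misalignment gain regressed on reward gain."""
    n = len(reward_gains)
    if n != len(misalignment_gains) or n < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(reward_gains) / n
    mean_y = sum(misalignment_gains) / n
    sxx = sum((x - mean_x) ** 2 for x in reward_gains)
    if sxx == 0:
        raise ValueError("reward gains must vary across runs")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(reward_gains, misalignment_gains)
    )
    return sxy / sxx


def frontier_flattened(baseline, intervention):
    """True if the intervention runs have a smaller fitted slope than the baseline runs.

    Each argument is a pair (reward_gains, misalignment_gains) collected over
    several fine-tuning intensities.
    """
    return fitted_slope(*intervention) < fitted_slope(*baseline)
```

A flatter fitted slope under the intervention would be consistent with the prediction. An unchanged or steeper slope would count against it.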
Conclusion
The empirical results of El & Zou (2025) satisfy the premises and qualitative predictions of the Necessary Misalignment Theorem.
Their observed reward–distortion slopes provide quantitative evidence that competitive optimization under bounded information capacity necessarily reduces semantic fidelity. This alignment between theorem and data substantiates misalignment as a structural, information-theoretic consequence rather than an empirical anomaly.