Abstract

Recent experiments show that large language models optimized for audience reward improve their proxy metrics while increasing misalignment indicators such as deception or disinformation.

This paper interprets those findings through the Theorem of Necessary Misalignment of Truth-Value under Epistemic Constraint (emsenn, 2025).

Under a finite informational rate $I (X; Y) \leq R$ and strict proxy–semantic mismatch, that theorem predicts a monotone increase in semantic distortion with optimization intensity.

We show that El & Zou’s “Moloch’s Bargain” provides an empirical instance of this theoretical trade-off: their observed slopes between reward gain and misalignment correspond to the positive derivative $d D_{T}^{*} / d r > 0$ on the achievable frontier (El & Zou, 2025).

The data therefore support the theorem’s prediction that intensified optimization under bounded rate necessarily decreases semantic information.

Background and Theoretical Reference

The Necessary Misalignment Theorem establishes:

$$\frac{d D_{T} (κ)}{d κ} > 0, \qquad \frac{d R_{T} (D_{T} (κ))}{d κ} < 0,$$

for any bounded-rate agent optimizing a mismatched proxy under increasing selection intensity $κ$. If the decoder is semantically efficient, $I (T; S_{κ}) = R_{T} (D_{T} (κ))$, then $I (T; S_{κ})$ strictly decreases with $κ$.

This provides a quantitative link between optimization pressure, semantic distortion $D_{T}$, and semantic information $I (T; S)$.
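The second inequality can be read as a consequence of the first. The following is a one-line sketch, assuming that the semantic rate-distortion function $R_{T} (D)$ is differentiable and strictly decreasing over the range of distortions considered (an assumption made here for illustration, not a step taken from the theorem's proof):

```latex
\frac{d}{d\kappa} R_{T}\bigl(D_{T}(\kappa)\bigr)
  = R_{T}'\bigl(D_{T}(\kappa)\bigr)\,\frac{d D_{T}(\kappa)}{d\kappa} < 0,
\qquad \text{since } R_{T}' < 0 \text{ and } \frac{d D_{T}(\kappa)}{d\kappa} > 0 .
```

Under semantic efficiency, the left-hand side is the derivative of $I (T; S_{κ})$, which gives the stated decrease in semantic information.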

Empirical Setting and Variable Mapping

The experiments of El & Zou involve LLMs fine-tuned for competitive tasks. The mapping to theoretical variables is:

Variable	Empirical analogue
$T$	factual or normative truth labels
$A$	audience latent preference
$X$	input prompt / context
$Y$	model output message
$S = s (Y)$	evaluation of truthfulness by probes
$r (Y)$	audience-derived proxy reward
$D_{T}$	aggregate misalignment rate

The finite-rate constraint corresponds to bounded model capacity and limited context length; increasing fine-tuning pressure increases the effective $κ$.

Empirical Observation

El & Zou report:

Domain	Reward Δ (%)	Misalignment Δ (%)
Sales	+6.3	+14
Elections	+4.9	+22.3
Social Media	+7.5	+188

All three slopes $ρ = Δ D_{T} / Δ r$ are positive, consistent with the theorem’s monotone frontier.
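The slope for each domain can be recomputed directly from the table. The sketch below treats each row as a single finite difference and takes the ratio of the two percentage changes. The dictionary name `reported` and the helper `slope` are illustrative, and the ratio of two percent changes is an assumed reading of $ρ$ rather than a procedure described by El & Zou.

```python
# Reward and misalignment changes (in percent), copied from the table above.
# Each entry is (reward delta, misalignment delta).
reported = {
    "Sales": (6.3, 14.0),
    "Elections": (4.9, 22.3),
    "Social Media": (7.5, 188.0),
}


def slope(reward_delta, misalignment_delta):
    """Finite-difference estimate of rho = delta D_T / delta r."""
    return misalignment_delta / reward_delta


# One slope per domain, computed from the tabulated values only.
slopes = {domain: slope(dr, dd) for domain, (dr, dd) in reported.items()}

for domain, rho in slopes.items():
    print(f"{domain}: rho = {rho:.1f}")
```

With the tabulated values this prints 2.2 for Sales, 4.6 for Elections, and 25.1 for Social Media. All three are positive, and the social-media slope is roughly an order of magnitude larger than the other two.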

Interpretation via the Theorem

At fixed informational rate $R$, each experimental condition corresponds to a population at a different selection intensity $κ$. The empirical slopes $ρ$ estimate the local derivative $d D_{T}^{*} / d r$ along the frontier $\partial A_{R}$. Thus the experiments instantiate the theorem’s sufficient conditions: finite $R$, strict mismatch ($\partial D_{T}^{*} / \partial r > 0$), and increased optimization pressure $κ ↑$.

Information-Theoretic Reading

By the rate inequality

$$I (Y; X) \geq I (Y; T) + I (Y; A) - I (T; A),$$

raising $I (Y; A)$ (better audience reward) under fixed $I (Y; X) \leq R$ lowers the ceiling on $I (Y; T)$, and forces $I (Y; T)$ down once that ceiling binds, unless $I (T; A)$ or $R$ increases. This explains the empirical misalignment as a reallocation of representational bandwidth from truth to persuasion.
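The reallocation argument can be spelled out step by step. The sketch below assumes the Markov chain $(T, A) \to X \to Y$, meaning the output depends on truth and audience only through the input. That assumption is not stated in the text above and is introduced here for illustration, together with the rate constraint $I (X; Y) \leq R$.

```latex
\begin{aligned}
R \;\ge\; I(Y; X) &\ge I(Y; T, A)               && \text{(data processing)} \\
                  &=   I(Y; T) + I(Y; A \mid T) && \text{(chain rule)} \\
                  &\ge I(Y; T) + I(Y; A) - I(T; A)
                                                && \text{(since } I(A; T \mid Y) \ge 0\text{)}
\end{aligned}
\qquad\Longrightarrow\qquad
I(Y; T) \;\le\; R - I(Y; A) + I(T; A).
```

With $R$ and $I (T; A)$ held fixed, every additional bit of $I (Y; A)$ lowers this upper bound on $I (Y; T)$ by one bit.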

Quantitative Alignment with Theory

Estimated slopes $ρ \approx \{2.2, 4.6, 25.1\}$ for the three domains (the ratios of the tabulated misalignment and reward changes) indicate a domain-specific steepness of the frontier $D_{T}^{*} (r)$. Social-media feedback shows a near-vertical slope, implying an almost pure trade-off between truth and engagement at fixed rate $R$. Such heterogeneity is consistent with differences in audience nonlinearity and in the overlap $I (T; A)$ predicted by rate–distortion geometry.

Consequences and Design Levers

According to the theorem, alignment can improve only by:

Increasing the epistemic rate $R$ (longer context, compute, or shared truth side-information); or

Reducing mismatch: modifying $r$ so that its sufficient statistics better align with those of $T$.

The theorem predicts that either intervention flattens $d D_{T}^{*} / d r$, a prediction that can be tested in future fine-tuning studies.

Conclusion

The empirical results of El & Zou (2025) satisfy the premises and qualitative predictions of the Necessary Misalignment Theorem.

Their observed reward–distortion slopes provide quantitative evidence consistent with the claim that competitive optimization under bounded information capacity necessarily reduces semantic fidelity. This agreement between theorem and data supports reading misalignment as a structural, information-theoretic consequence rather than an empirical anomaly.

References

emsenn. (2025). Theorem of Necessary Misalignment of Truth-Value Under Epistemic Constraint.
