
Information-Theoretic Stability as Reward Function

by emsenn

Abstract

Consider a system $A$ adapting to a fixed reference system $B$. We define the stability reward of $A$ as the rate of divergence minimization between successive states of $A$, and show that this quantity decomposes into two terms: the rate at which $A$'s marginal behavior converges to target, and the rate at which $A$'s mutual information with $B$ increases. Both terms are functions of the coupling between $A$ and $B$. The stability of $A$ is not an intrinsic property of $A$; it is a reward function of $A$'s relationship to $B$.

When $A$ evolves by gradient descent on the joint divergence from equilibrium, the composite divergence decreases monotonically, and the two channels of convergence, behavioral and informational, must sum to that monotonic decrease at every instant. Neither can stall without the other compensating. Mutual information between $A$ and $B$ converges to the equilibrium value.

1. Introduction

Information theory measures uncertainty and its reduction in probabilistic terms. This paper asks a specific question: when a system $A$ adapts toward a fixed reference $B$, what governs the rate at which $A$ stabilizes?

The answer has a clean form. $A$'s stability, the rate at which its successive states become less divergent, decomposes into exactly two components: convergence of $A$'s marginal distribution toward the target, and increase of mutual information between $A$ and $B$. Both are functions of the relationship between $A$ and $B$. $A$ has no stability independent of $B$; its stability is a reward function of the coupling.

This derivation uses the Fisher–Rao structure of probability manifolds (Amari 2016, 2021) and the chain rule for KL divergence (Csiszár & Shields 2004). It requires no assumptions regarding semantics or intentionality.

2. Setting

2.1 The Adapting System

Let $A$ have state distribution $p_A(x)$ over a finite alphabet $\mathcal{X}$, and let $B$ have a fixed distribution $p_B(y)$ over $\mathcal{Y}$. The joint distribution is $p_t(x,y) = p_t(x|y)\,p_B(y)$, where only $A$'s conditional response $p_t(x|y)$ changes over time.

The space of joint distributions, $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$, carries the Fisher–Rao metric

$$ g_{ij}(p) = \partial_i \partial_j D(p||q)\big|_{q=p}, $$

where $D(p||q)$ denotes the Kullback–Leibler divergence (Amari 2016).

2.2 Target Equilibrium

Let $p^*(x,y) = p^*(x|y)\,p_B(y)$ be a target joint distribution sharing $B$'s marginal. The equilibrium mutual information is $I^* = I_{p^*}(A;B)$.
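
To make the setting concrete, here is a minimal numerical sketch in Python. The helper names, the binary alphabets, and the particular distributions are illustrative choices, not part of the formal development: the snippet builds a joint $p(x,y) = p(x|y)\,p_B(y)$ over finite alphabets and computes KL divergence and mutual information in nats.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats, for arrays of matching shape."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                               # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def joint_from_conditional(p_x_given_y, p_B):
    """Build p(x, y) = p(x | y) * p_B(y); row y of p_x_given_y is p(x | y)."""
    return p_x_given_y * p_B[:, None]          # shape (|Y|, |X|), indexed [y, x]

def mutual_information(p_xy):
    """I(X; Y) = D(p(x, y) || p(x) p(y)) in nats."""
    p_y = p_xy.sum(axis=1, keepdims=True)      # marginal of the reference system B
    p_x = p_xy.sum(axis=0, keepdims=True)      # marginal of the adapting system A
    return kl(p_xy, p_y @ p_x)

# Illustrative example: binary alphabets, a fixed reference marginal p_B,
# and a target conditional p*(x | y) defining the equilibrium joint p*.
p_B = np.array([0.5, 0.5])
p_star_cond = np.array([[0.9, 0.1],            # p*(x | y = 0)
                        [0.2, 0.8]])           # p*(x | y = 1)
p_star = joint_from_conditional(p_star_cond, p_B)
I_star = mutual_information(p_star)            # equilibrium mutual information I*
```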

3. Stability Reward

3.1 Definition

Define the instantaneous stability reward of the composite system as

$$ R_s(t) = -\frac{1}{\delta t}\, D(p_{t+\delta}||p_t), $$

expressed in nats per timestep. This is nonpositive, with $R_s = 0$ at distributional stationarity.
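
In code, the stability reward between two successive joint distributions is just a scaled KL divergence. A one-line sketch, reusing the illustrative `kl` helper from Section 2:

```python
def stability_reward(p_next, p_curr, dt=1.0):
    """R_s(t) = -(1 / dt) * D(p_{t + dt} || p_t), in nats per timestep."""
    return -kl(p_next, p_curr) / dt
```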

3.2 Decomposition

Because $B$ is fixed, $D(p_t(y)||p^*(y)) = 0$ for all $t$. The chain rule for KL divergence (Csiszár & Shields 2004) applied to the joint gives:

$$ D(p_t||p^*) = D(p_t(x)||p^*(x)) + I^* - I_t, $$

where $I_t = I_{p_t}(A;B)$. The joint divergence has exactly two nonnegative moving parts: the marginal divergence of $A$, and the mutual information gap $I^* - I_t$.

Proof. The general chain rule gives $D(p||q) = D(p(x)||q(x)) + D(p(y)||q(y)) + I_q(X;Y) - I_p(X;Y)$. Since $p_t(y) = p^*(y) = p_B(y)$, the second term vanishes, yielding the stated identity. $\square$
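
For a discrete joint, each quantity appearing in the decomposition can be computed directly. A sketch, again reusing the illustrative helpers from Section 2:

```python
def decomposition_terms(p_t, p_star):
    """Return the joint divergence D(p_t || p*), the marginal divergence of A,
    and the mutual-information gap I* - I_t, all in nats."""
    D_joint = kl(p_t, p_star)
    D_marginal = kl(p_t.sum(axis=0), p_star.sum(axis=0))   # marginals of A
    mi_gap = mutual_information(p_star) - mutual_information(p_t)
    return D_joint, D_marginal, mi_gap
```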

3.3 Accounting Identity

Taking the time derivative of the decomposition:

$$ \partial_t D(p_t||p^*) = \partial_t D(p_t(x)||p^*(x)) - \partial_t I_t. $$

Rearranging:

$$ \boxed{\partial_t I_t = \partial_t D(p_t(x)||p^*(x)) - \partial_t D(p_t||p^*).} $$

This is the accounting identity: the rate of mutual information change equals the rate of marginal divergence change minus the rate of joint divergence change.
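
Read in discrete time, the accounting identity is a statement about finite differences over one timestep. The bookkeeping looks like the following sketch, built on the same illustrative helpers as above:

```python
def channel_rates(p_prev, p_next, p_star, dt=1.0):
    """Finite-difference rates over one timestep: change in joint divergence,
    change in A's marginal divergence, and change in mutual information I_t."""
    dD_joint = (kl(p_next, p_star) - kl(p_prev, p_star)) / dt
    dD_marginal = (kl(p_next.sum(axis=0), p_star.sum(axis=0))
                   - kl(p_prev.sum(axis=0), p_star.sum(axis=0))) / dt
    dI = (mutual_information(p_next) - mutual_information(p_prev)) / dt
    return dD_joint, dD_marginal, dI
```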

3.4 Interpretation

The stability reward of $A$ decomposes into two channels:

$$ \underbrace{-\partial_t D(p_t||p^*)}_{\text{composite convergence}} = \underbrace{-\partial_t D(p_t(x)||p^*(x))}_{\text{behavioral convergence}} + \underbrace{\partial_t I_t}_{\text{informational coupling}}. $$

Both terms on the right describe aspects of $A$'s relationship to $B$. Behavioral convergence measures how $A$'s marginal approaches the target. Informational coupling measures how responsive $A$ is becoming to $B$. Their sum, the composite convergence, is the stability reward of the coupled system, and it is nonnegative at every instant under gradient descent (Section 4).

$A$'s stability is not a property of $A$ alone. It is a reward function of $A$'s coupling to $B$.

4. Convergence

4.1 Theorem

Let $p_t$ evolve by Fisher–Rao gradient descent on $D(p_t||p^*)$. Then:

(a) Monotone composite convergence. $\partial_t D(p_t||p^*) = -\|\mathrm{grad}_g D\|_g^2 \le 0$.

(b) Convergence of all components. $D(p_t(x)||p^*(x)) \to 0$ and $I_t \to I^*$ as $t \to \infty$.

(c) Two-channel constraint. At every instant, behavioral convergence and MI improvement must sum to the (nonnegative) composite convergence rate. Neither can stall without the other compensating.

(d) MI dominates marginal. $\partial_t I_t \ge \partial_t D(p_t(x)||p^*(x))$ at all times, with the gap equal to the composite convergence rate.

Proof. Part (a) is the standard descent property of gradient flow: along the flow, $\partial_t D = -\|\mathrm{grad}_g D\|_g^2 \le 0$.

For (b): $D(\cdot||p^*)$ is convex with unique minimizer $p^*$, so the flow drives $D(p_t||p^*) \to 0$. Since $D(p_t||p^*) = D_x(t) + (I^* - I_t)$ and both terms are nonnegative, each must vanish.

Part (c) restates the accounting identity under the constraint $-\partial_t D(p_t||p^*) \ge 0$.

For (d): from the accounting identity, $\partial_t I_t - \partial_t D_x = -\partial_t D(p_t||p^*) \ge 0$. $\square$
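
A small simulation makes the two-channel bookkeeping visible along a trajectory. The sketch below stands in for the Fisher–Rao flow with a multiplicative (mirror-descent-style) update of each conditional $p_t(x|y)$ toward $p^*(x|y)$; this discretization, the step size, and the starting point are illustrative assumptions of the example, and the helpers and distributions are those sketched in Section 2.

```python
def multiplicative_step(p_cond, p_star_cond, eta=0.1):
    """One update p(x | y) <- p(x | y)^(1 - eta) * p*(x | y)^eta, renormalized
    per row; an illustrative stand-in for a Fisher-Rao gradient step."""
    new = p_cond ** (1.0 - eta) * p_star_cond ** eta
    return new / new.sum(axis=1, keepdims=True)

# Start A's conditional response far from the target and iterate, logging the
# joint divergence, A's marginal divergence, and the mutual information I_t.
p_cond = np.array([[0.5, 0.5],
                   [0.5, 0.5]])
trajectory = []
for _ in range(200):
    p_t = joint_from_conditional(p_cond, p_B)
    trajectory.append((kl(p_t, p_star),                          # joint divergence
                       kl(p_t.sum(axis=0), p_star.sum(axis=0)),  # marginal divergence
                       mutual_information(p_t)))                 # I_t
    p_cond = multiplicative_step(p_cond, p_star_cond)
```

In this run the logged joint divergence is nonincreasing from step to step, and the three logged quantities can be inspected against the two-channel decomposition and the claims of the theorem.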

4.2 What Part (d) Says

MI can never be doing worse than $A$'s marginal. If $A$'s marginal is converging, MI might dip, but by at most the same amount. If $A$'s marginal is getting worse, MI must be improving by at least as much. The worst case for MI relative to the marginal is $\partial_t I = \partial_t D_x$, which occurs only at equilibrium ($\partial_t D_{\text{joint}} = 0$).

5. Contribution

The tools used here (KL divergence, the Fisher–Rao metric, natural gradient descent) are standard information geometry (Amari 2016). The chain rule is textbook (Cover & Thomas 2006; Csiszár & Shields 2004). The connection between divergence minimization and entropy production is established in nonequilibrium thermodynamics (Crooks 1999; Jaynes 1957).

The contribution is the specific framing:

  1. Stability as coupling. By fixing one system and watching the other adapt, the stability reward becomes a function of the relationship, not of either system in isolation. The definition of stability reward is the negative entropy production rate by another name, but framing it as a reward of coupling orients the analysis differently.
  2. Two-channel decomposition. The asymmetric chain rule gives exactly two channels (behavioral + informational). This is tighter than the symmetric case, which has three components and allows more complex fluctuation.
  3. The MI-dominates-marginal inequality. Part (d) of the theorem — that MI improvement is always at least as large as marginal improvement — is a structural constraint on the path of adaptation, not just on its endpoint.

6. Limitations

The framework assumes $B$ is fixed. When both systems adapt simultaneously, the decomposition has three moving parts and the MI-dominates-marginal inequality does not hold. The framework does not explain why $A$ would minimize divergence; that is an external constraint (physical, biological, or engineered). Bounding the magnitude of temporary MI dips in terms of system parameters remains open.

7. Conclusion

When a system adapts to a fixed reference, its stability reward is a function of the coupling between them. The joint divergence decomposes into marginal convergence and mutual information change: two channels that must sum to the nonnegative composite convergence at every instant. MI improvement always matches or exceeds marginal improvement. The stability of one system is the reward function of its relationship to the other.

References

  • Amari, S. (2016). Information Geometry and Its Applications. Springer.
  • Amari, S. (2021). “Information Geometry and Its Role in Statistical Inference.” Entropy, 23(1), 110.
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
  • Crooks, G. E. (1999). “Entropy Production Fluctuation Theorem and the Nonequilibrium Work Relation.” Physical Review E, 60(3), 2721–2726.
  • Csiszár, I., & Shields, P. (2004). “Information Theory and Statistics: A Tutorial.” Foundations and Trends in Communications and Information Theory, 1(4), 417–528.
  • Jaynes, E. T. (1957). “Information Theory and Statistical Mechanics.” Physical Review, 106(4), 620–630.
  • Shannon, C. E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27, 379–423, 623–656.



Cite

@article{emsenn2025-information-theoretic-stability-as-reward-function,
  author    = {emsenn},
  title     = {Information-Theoretic Stability as Reward Function},
  year      = {2025},
  url       = {https://emsenn.net/library/information/texts/information-theoretic-stability-as-reward-function/},
  publisher = {emsenn.net},
  license   = {CC BY-SA 4.0}
}