A practical, calculus-friendly walk-through of logits, softmax, truncation, sampling, and the turn-to-turn inference loop
A GPT-style transformer produces text by iterating a local decision rule: at each generation step it assigns a real-valued score (“logit”) to every token in a finite vocabulary, converts those scores into a probability distribution (typically by softmax, possibly after additional processing such as temperature scaling and truncation), and then selects a single next token by a decoding policy (greedy, sampling, or other). Although the output is discrete, the internal map from context to logits is continuous and high-dimensional, and the softmax map has a clean calculus: its Jacobian is a structured covariance-like matrix, and it is the gradient of the convex log-sum-exp potential. The local next-token step is not isolated: the selected token is appended to the context, which changes the next distribution, which changes the next token, etc. In chat settings, an additional loop couples model and user via the transcript: the user’s next message conditions on the model’s previous message, and the model’s next message conditions on the updated transcript; thus interpretation and generation co-determine one another over turns. This document fixes notation, separates what is learned (the context→logits map) from what is chosen at inference time (the decoding policy), and makes explicit the two truncations that matter in practice: truncation of the candidate set (top-$k$, top-$p$) and truncation of the context window (finite prompt length).
We work in a local fragment that models one language model instance as a conditional next-token mechanism plus an external decoding policy. Inside the fragment: (i) a fixed tokenizer and vocabulary, (ii) a fixed parameter vector $\theta$ defining a causal transformer map from tokenized context to logits, (iii) an inference-time policy that processes logits (temperature, truncation) and selects tokens (sampling/argmax), and (iv) a transcript state updated by concatenation and context-window truncation. Outside the fragment: any server-side safety layers, tool calls, retrieval pipelines, and the user’s cognitive process; these can be incorporated only by explicitly extending the state and update rules. Any statement about “what the model will say” is therefore conditional on the chosen decoding policy, the current transcript-as-serialized-tokens, and (if stochastic decoding is used) the randomness source; no global validity is assumed beyond this closure.
Let the vocabulary have size $V$ and be indexed by $\{1,\dots,V\}$. A tokenizer (primitive, fragment-local; assumption: fixed during inference) maps a text string to a token sequence. At generation step $t$, define a tokenized context $C_t$ as the serialization of all conditioning material available to the model at that step (system instructions, developer text, prior turns, and the tokens already produced in the current response). Let $y_{1:t-1}$ denote the tokens already generated in the current response, so that $C_t$ can be viewed as a base transcript plus $y_{1:t-1}$ appended, with details depending on the exact serialization scheme. A causal decoder-only transformer computes an internal hidden representation $h_t \in \mathbb{R}^d$ (derived, fragment-local; assumption: standard causal attention with a fixed context length). An output head maps $h_t$ to a vector of logits $x_t \in \mathbb{R}^V$:
$$x_t = W h_t + b, \qquad W \in \mathbb{R}^{V\times d},\ b \in \mathbb{R}^{V}.$$
Interpretation: $x_{t,i}$ is a real-valued score for choosing token $i$ next, before normalization. The learned object here is the map
$$f_\theta:\ C_t\ \longmapsto\ x_t \in \mathbb{R}^{V},$$
where $\theta$ includes all transformer parameters (including $W,b$). This map is continuous in its internal activations, but its input is discrete tokens and its output is a discrete token index after decoding; the discrete/continuous boundary occurs at tokenization (input) and selection (output).
Logits are not probabilities; they become probabilities only after normalization. A key invariance (derived; fragment-local) is that adding the same constant to all logits changes nothing about the softmax distribution:
$$\operatorname{softmax}(x + c\mathbf{1}) = \operatorname{softmax}(x)\qquad\text{for every }c\in\mathbb{R}.$$
Consequently, only differences $x_i-x_k$ are operationally meaningful for preference among tokens. This invariance is also the basis of numerically stable evaluation: if $m=\max_i x_i$, then
$$p_i = \frac{e^{x_i - m}}{\sum_{j=1}^{V} e^{x_j - m}},$$
and $x_i-m\le 0$ prevents overflow without changing $p$. A useful derived identity connects logits to log-probabilities once the normalization constant is fixed:
$$x_i = \log p_i + \operatorname{LSE}(x), \qquad \operatorname{LSE}(x) = \log\sum_{j=1}^{V} e^{x_j},$$
so logits can be viewed as “log-probabilities plus an unknown shared offset” (the offset is $\operatorname{LSE}(x)$). This is local to the step $t$ and does not imply any global calibration across contexts.
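As a concrete check of these identities, here is a minimal NumPy sketch (the helpers `softmax` and `lse` are local to this sketch, not part of any particular library API) that computes softmax with the max-shift trick and verifies both the shift invariance and the logit-equals-log-probability-plus-offset identity:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract max(x) before exponentiating."""
    x = np.asarray(x, dtype=float)
    m = x.max()                      # shared offset; cancels in the ratio
    e = np.exp(x - m)                # all exponents are <= 0, so no overflow
    return e / e.sum()

def lse(x):
    """log-sum-exp, computed with the same max-shift trick."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)

# Shift invariance: adding a constant to every logit leaves p unchanged.
assert np.allclose(p, softmax(x + 100.0))

# Logits are log-probabilities plus the shared offset LSE(x).
assert np.allclose(x, np.log(p) + lse(x))
```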
Given logits $x_t\in\mathbb{R}^V$, the canonical map to a categorical distribution over tokens is softmax:
$$p_{t,i} = \operatorname{softmax}(x_t)_i = \frac{e^{x_{t,i}}}{\sum_{j=1}^{V} e^{x_{t,j}}}, \qquad i=1,\dots,V.$$
Softmax is smooth, maps $\mathbb{R}^V$ to the probability simplex $\Delta^{V-1}$, and makes relative likelihoods depend only on logit differences:
$$\frac{p_{t,i}}{p_{t,k}} = e^{x_{t,i} - x_{t,k}}.$$
This ratio form is often the most interpretable: a logit gap of $\Delta$ corresponds to odds multiplied by $e^\Delta$. Dropping the subscript $t$ for clarity, the partial derivatives are
$$\frac{\partial p_i}{\partial x_j} = p_i(\delta_{ij} - p_j).$$
Equivalently, the Jacobian matrix $J\in\mathbb{R}^{V\times V}$ is
$$J = \operatorname{diag}(p) - p\,p^\top,$$
which is symmetric and positive semidefinite, with $J\mathbf{1}=0$; this makes explicit that increasing one logit redistributes mass across all tokens rather than increasing total mass. Softmax is the gradient of the log-sum-exp potential (derived; fragment-local):
$$\operatorname{LSE}(x) = \log\sum_{i=1}^{V} e^{x_i}, \qquad \nabla_x \operatorname{LSE}(x) = \operatorname{softmax}(x) = p,$$
and the Hessian of $\operatorname{LSE}$ is exactly $J$. This convex geometry is the “calculus-friendly” structure behind normalization: $\operatorname{LSE}$ is convex, $p$ is its gradient, and $J$ encodes competition among tokens as a covariance-like matrix.
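A small numerical sketch can make this structure tangible. Assuming NumPy, the snippet below builds $J=\operatorname{diag}(p)-pp^\top$ for illustrative logits, checks that a finite-difference gradient of $\operatorname{LSE}$ matches $p$, and confirms that each row of $J$ sums to zero (mass is redistributed, not created):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def lse(x):
    m = np.max(x)
    return m + np.log(np.exp(x - m).sum())

x = np.array([1.5, 0.3, -0.7, 2.0])   # illustrative logits
p = softmax(x)

# Analytic Jacobian: J = diag(p) - p p^T.
J = np.diag(p) - np.outer(p, p)

# Numerical gradient of LSE should equal p (softmax is the gradient of LSE).
eps = 1e-6
grad = np.array([(lse(x + eps * e_i) - lse(x - eps * e_i)) / (2 * eps)
                 for e_i in np.eye(len(x))])
assert np.allclose(grad, p, atol=1e-6)

# Each row of J sums to zero: raising one logit only redistributes mass.
assert np.allclose(J.sum(axis=1), 0.0)
```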
A related training-time fact (derived; outside strict inference scope but clarifying) is that when the model is trained by cross-entropy against a one-hot target distribution $q$, the gradient of the loss $-\sum_i q_i\log p_i$ with respect to logits is
$$\nabla_x\Big(-\sum_i q_i \log p_i\Big) = p - q,$$
so training pushes the predicted distribution toward the observed token by shifting logits in a direction shaped by $p$.
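The $p-q$ gradient can likewise be verified numerically. In the sketch below (NumPy assumed; the target index is an arbitrary illustrative choice), the analytic gradient is compared against a finite-difference estimate of the cross-entropy loss:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def xent_loss(x, target):
    """Cross-entropy against a one-hot target: -log p_target."""
    return -np.log(softmax(x)[target])

x = np.array([0.2, 1.1, -0.5, 0.7])   # illustrative logits
target = 2                            # index of the observed token (illustrative)
p = softmax(x)
q = np.eye(len(x))[target]            # one-hot target distribution

analytic = p - q                      # claimed gradient with respect to logits

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([(xent_loss(x + eps * e_i, target) -
                     xent_loss(x - eps * e_i, target)) / (2 * eps)
                    for e_i in np.eye(len(x))])
assert np.allclose(analytic, numeric, atol=1e-5)
```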
Consider a prompt fragment “Today’s weather is so ___” with a tiny effective candidate set of six tokens. Suppose the model assigns its two largest logits, $3.5$ and $3.0$, to tokens 3 and 4 of that candidate set. Softmax over the six logits then yields a distribution in which token 3 receives probability about $0.456$ and token 4 about $0.276$, with the remaining mass spread over the other four candidates.
The top token has probability about $0.456$ rather than $1.0$ because the model’s internal evidence for the next token is typically underdetermined at a single step; local coherence is a constraint, not a guarantee of global sense. The ratio identity predicts, for example, that token 3 vs token 4 has odds ratio $e^{3.5-3.0}=e^{0.5}\approx 1.65$, consistent with $0.456/0.276\approx 1.65$. Interpreting these numbers as “truth” is a category error in this fragment: they are conditional preferences over continuations given the current tokenized context and do not, by themselves, assert facts about the world.
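The arithmetic can be reproduced with placeholder logits (only the leading gap $3.5-3.0$ is taken from the text; the other four values below are illustrative, not the original ones). The sketch shows the two qualitative points: the top token's probability is well below $1$, and the odds between the two leading tokens equal $e^{0.5}$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical six-token candidate set; 3.5 and 3.0 are the two leading
# logits discussed in the text, the remaining values are placeholders.
logits = np.array([3.5, 3.0, 2.0, 1.0, 0.5, 0.0])
p = softmax(logits)

print(np.round(p, 3))                    # the top token gets well under 1.0

# Ratio identity: odds between the two leading tokens = exp(logit gap).
print(p[0] / p[1], np.exp(3.5 - 3.0))    # both ~1.65
```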
Fix a transcript-derived base context $C$ for the current model response. The model defines a family of conditional distributions
$$p_\theta(y_t \mid C,\, y_{1:t-1}), \qquad t = 1, 2, \dots,$$
implemented by logits $x_t=f_\theta(C,y_{1:t-1})$ followed by softmax (after any inference-time processing). The probability of an entire length-$T$ continuation (if one is defined) factors by the chain rule:
$$p_\theta(y_{1:T} \mid C) = \prod_{t=1}^{T} p_\theta(y_t \mid C,\, y_{1:t-1}).$$
This identity is exact as a statement about conditional probabilities; what is approximate in practice is that $p_\theta$ is only an approximate model of real text distributions, and decoding often modifies $p_\theta$ before selection.
Operationally, generation iterates the following loop (derived; fragment-local; assumption: a fixed decoding policy is chosen): (1) compute logits $x_t=f_\theta(C,y_{1:t-1})$; (2) process them (temperature, truncation) into a distribution $\tilde{p}_t$; (3) select $y_t$ from $\tilde{p}_t$ by argmax or sampling; (4) append $y_t$ to the context; (5) check the stop conditions and either halt or return to step (1).
The term “feedback loop” is literal here: $y_t$ changes the next context and therefore changes the next logits and distribution; even if $\theta$ is fixed, the trajectory of outputs is path-dependent.
Stop conditions (primitive at the policy level; fragment-local) typically include: a special end-of-text token, a maximum token budget, or a matched stop sequence. These are not properties of softmax; they are part of the external decoding policy.
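A minimal sketch of this loop, assuming NumPy and a stand-in `logits_fn` for $f_\theta$ (here a toy function over a five-token vocabulary; any map from a token list to a length-$V$ score vector would do), with a token budget and an end-of-text token as stop conditions:

```python
import numpy as np

rng = np.random.default_rng(0)           # explicit randomness source r

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def generate(logits_fn, context, eos_id, max_new_tokens=50,
             temperature=1.0, greedy=False):
    """Token-level feedback loop: score -> distribution -> select -> append."""
    out = list(context)
    for _ in range(max_new_tokens):              # token budget (stop condition)
        x = logits_fn(out)                       # logits for the current context
        p = softmax(np.asarray(x) / temperature)
        y = int(np.argmax(p)) if greedy else int(rng.choice(len(p), p=p))
        out.append(y)                            # the choice feeds back into C_{t+1}
        if y == eos_id:                          # end-of-text token (stop condition)
            break
    return out

# Toy stand-in for f_theta over a 5-token vocabulary (purely illustrative).
def toy_logits(tokens):
    base = np.array([0.1, 0.5, 1.0, 0.2, -1.0])
    base[tokens[-1]] -= 0.5                      # discourage immediate repetition
    return base

print(generate(toy_logits, context=[2, 3], eos_id=4, max_new_tokens=10))
```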
A common confusion is to conflate the learned model with the inference-time decision rule. In this fragment, the learned model is the map $f_\theta$ producing logits; the decoding policy $\alpha$ is an external algorithm that transforms logits and selects tokens. Two systems with identical $\theta$ can produce qualitatively different outputs under different $\alpha$, especially when stochasticity and truncation are used.
We therefore define (derived; fragment-local) a decode policy parameterization $\alpha$ that may include temperature $T$, truncation parameters ($k$ or $p$), randomness source $r$, and any other logit-processing transforms. Then the produced response is a random variable (or deterministic output) conditional on $(\theta,C,\alpha)$.
Temperature modifies logits before softmax:
$$p_i(T) = \operatorname{softmax}(x/T)_i = \frac{e^{x_i/T}}{\sum_{j=1}^{V} e^{x_j/T}}, \qquad T > 0.$$
In ratio form,
$$\frac{p_i(T)}{p_k(T)} = e^{(x_i - x_k)/T},$$
so $T<1$ amplifies logit gaps (sharper distribution) and $T>1$ compresses them (flatter distribution). Limits (derived; assuming a unique maximizer $i^\star=\arg\max_i x_i$):
$$\lim_{T\to 0^{+}} p_i(T) = \mathbf{1}[\,i = i^\star\,], \qquad \lim_{T\to\infty} p_i(T) = \frac{1}{V}\quad\text{for every }i.$$
Thus temperature is a continuous concentration control, not a semantic “creativity knob” in itself; any semantic effect is mediated by which regions of the continuation space become reachable under a less concentrated distribution.
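A short NumPy sweep over temperatures (the logits are illustrative) shows the concentration effect directly: entropy falls as $T$ decreases toward the one-hot limit and rises toward the uniform maximum as $T$ grows:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

x = np.array([2.0, 1.0, 0.5, -0.3])          # illustrative logits

for T in [0.25, 1.0, 4.0]:
    p = softmax(x / T)
    print(f"T={T:4}: p={np.round(p, 3)}  H={entropy(p):.3f}")
# T -> 0 approaches one-hot at argmax(x); T -> infinity approaches uniform.
```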
A calculus-friendly interpretation: $p(T)$ is the unique optimizer of an entropy-regularized objective (derived; fragment-local). Let $H(q)=-\sum_i q_i\log q_i$ be Shannon entropy and let $\Delta^{V-1}$ be the simplex. Then
$$p(T) = \operatorname*{arg\,max}_{q \in \Delta^{V-1}} \Big\{ \langle q, x \rangle + T\,H(q) \Big\}.$$
Sketch of derivation: form a Lagrangian with the constraint $\sum_i q_i=1$, differentiate with respect to $q_i$, obtain $\log q_i = x_i/T + \text{const}$, hence $q_i\propto e^{x_i/T}$. This makes explicit what temperature does: it trades off expected logit score $\langle q,x\rangle$ against entropy $H(q)$, with $T$ setting the strength of the entropy term.
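The variational characterization can be spot-checked numerically. In the sketch below (NumPy assumed; the Dirichlet draws are just arbitrary points on the simplex), no sampled $q$ attains a higher value of $\langle q,x\rangle+T\,H(q)$ than $\operatorname{softmax}(x/T)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def objective(q, x, T):
    """Expected logit score plus T times Shannon entropy."""
    q = np.clip(q, 1e-12, None)              # guard the log at the boundary
    return float(q @ x + T * -(q * np.log(q)).sum())

x = np.array([1.0, 0.2, -0.5, 2.0])          # illustrative logits
T = 0.7
p_T = softmax(x / T)
best = objective(p_T, x, T)

# No random point on the simplex should beat softmax(x / T).
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(x)))
    assert objective(q, x, T) <= best + 1e-9
```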
“Truncation” is ambiguous in practice; in this fragment it has two distinct meanings. First is candidate truncation (this section): removing low-ranked tokens from the distribution support before sampling. Second is context-window truncation (Section 10): dropping older tokens because the model has a finite context length. Candidate truncation is a decoding choice; it is not learned by the model unless explicitly trained for.
Let $K_k(x)$ be the set of indices of the $k$ largest components of $x$ (ties require a convention; assume an arbitrary but fixed tie-break rule within this fragment). Define processed logits
$$\tilde{x}_i = \begin{cases} x_i, & i \in K_k(x), \\ -\infty, & \text{otherwise}. \end{cases}$$
Then $\tilde{p}=\operatorname{softmax}(\tilde{x})$ is a distribution supported on exactly $k$ tokens. Operational effect: the branching factor is fixed at $k$, regardless of whether the original distribution was already sharp or extremely flat. Structural caution: top-$k$ is discontinuous in $x$ at points where the $k$-th and $(k+1)$-th logits swap order, so small logit perturbations can abruptly change the support.
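A minimal top-$k$ filter in NumPy (the helper name `top_k_filter` is local to this sketch; tie-breaking follows whatever order `argsort` happens to produce):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_k_filter(x, k):
    """Keep the k largest logits; set the rest to -inf before softmax."""
    x = np.asarray(x, dtype=float)
    keep = np.argsort(x)[-k:]                # indices of the k largest logits
    filtered = np.full_like(x, -np.inf)
    filtered[keep] = x[keep]
    return filtered

x = np.array([3.5, 3.0, 2.0, 1.0, 0.5, 0.0])   # illustrative logits
p_tilde = softmax(top_k_filter(x, k=3))
print(np.round(p_tilde, 3))                    # support has exactly 3 tokens
```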
Let $p=\operatorname{softmax}(x)$ (possibly after temperature). Sort tokens by descending probability $p_{(1)}\ge p_{(2)}\ge\cdots\ge p_{(V)}$. Choose the smallest $m$ such that
$$\sum_{j=1}^{m} p_{(j)} \ \ge\ p_{\mathrm{nuc}}$$
for a chosen threshold $p_{\mathrm{nuc}}\in(0,1]$. Keep those $m$ tokens, set the rest to zero probability, and renormalize. The nucleus size $m$ adapts to uncertainty: if the distribution is sharp, the nucleus is small; if flat, the nucleus grows. As with top-$k$, top-$p$ introduces discontinuities at probability ties or near the threshold boundary; its effect is mediated by the chosen temperature because the pre-truncation distribution changes with $T$.
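A corresponding nucleus filter sketch in NumPy (the helper name `top_p_filter` is local; it operates on an already-normalized distribution and renormalizes over the kept nucleus):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_p_filter(p, p_nuc):
    """Keep the smallest set of highest-probability tokens with mass >= p_nuc."""
    order = np.argsort(p)[::-1]                  # indices by descending probability
    csum = np.cumsum(p[order])
    m = int(np.searchsorted(csum, p_nuc) + 1)    # smallest m with cumulative >= p_nuc
    keep = order[:m]
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()                       # renormalize over the nucleus

x = np.array([3.5, 3.0, 2.0, 1.0, 0.5, 0.0])    # illustrative logits
p = softmax(x)
print(np.round(top_p_filter(p, p_nuc=0.9), 3))  # nucleus size adapts to sharpness
```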
Top-$k$ and top-$p$ often stabilize sampling by removing a long tail of extremely low-probability tokens that are disproportionately likely to produce incoherent jumps when sampled. The same mechanism can also create brittle behavior: truncation can delete a low-probability but globally necessary token (e.g., the only token that correctly closes a quote, maintains a constraint, or continues a rare name), after which the generation cannot recover. This is not a paradox: truncation deliberately changes reachable continuations, so it changes not only “quality” but the topology of the search space.
Once a distribution $\tilde{p}_t$ is defined, the decoding policy chooses $y_t$.
Greedy decoding (primitive at policy level; fragment-local) sets
$$y_t = \operatorname*{arg\,max}_{i\in\{1,\dots,V\}} \tilde{p}_{t,i}.$$
It is deterministic, locally maximizes the next-token probability, and tends to reduce diversity; it can also fall into repetitive loops because once a repeated pattern becomes locally high-probability, greedy has no mechanism to escape.
Pure sampling (primitive at policy level; fragment-local) draws
$$y_t \sim \tilde{p}_t, \qquad \Pr(y_t = i) = \tilde{p}_{t,i}.$$
It increases diversity and can avoid certain repetitive attractors, but it can also sample tokens that are locally plausible yet globally harmful to coherence. The distribution itself is a local conditional, so sampling is a local stochastic decision with global consequences through the token-level feedback loop.
Other strategies (derived; optional, scope-limited): beam search approximately maximizes the sequence probability $\prod_t p(y_t\mid\cdot)$ by keeping multiple partial hypotheses, but it is still an external policy and can produce bland outputs; typical chat systems often prefer sampling-based policies because maximizing likelihood is not equivalent to maximizing usefulness or adherence to human intent.
Randomness bookkeeping (derived; fragment-local): under stochastic decoding, the output is a function of both the logits and the random seed/state. Two runs with identical $\theta$ and identical context can diverge after the first sampled token; after divergence, the contexts differ and so do all subsequent logits.
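The following sketch (NumPy assumed) contrasts the two selection rules on one processed distribution and makes the randomness bookkeeping explicit by seeding two separate generators:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

p_tilde = softmax(np.array([2.0, 1.8, 0.5, -1.0]))   # illustrative distribution

# Greedy: deterministic argmax of the (possibly processed) distribution.
y_greedy = int(np.argmax(p_tilde))

# Pure sampling: the same distribution, but the outcome depends on the seed.
y_seed0 = int(np.random.default_rng(0).choice(len(p_tilde), p=p_tilde))
y_seed1 = int(np.random.default_rng(1).choice(len(p_tilde), p=p_tilde))

print(y_greedy, y_seed0, y_seed1)   # the two sampled draws may differ
```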
So far, truncation meant removing candidate tokens. A distinct truncation in real systems is that the model has a maximum context length of $L$ tokens; it cannot condition on arbitrarily long transcripts. Let $\sigma(S)$ be a serialization function mapping a structured transcript state $S$ (roles, messages) into a flat token sequence (primitive relative to the model; assumption: fixed formatting scheme). Let $\pi_L(\cdot)$ be the operator that keeps only the last $L$ tokens (derived; fragment-local). Then the effective context seen by the model is
$$C(S) = \pi_L(\sigma(S)).$$
This truncation is qualitatively different from top-$k$/top-$p$: it discards older information entirely, changing what the model can represent about the conversation state. Many “forgetfulness” phenomena are direct consequences of $\pi_L$ rather than of any failure of softmax or sampling.
Context-window truncation also interacts with the feedback loop: if earlier constraints fall out of the window, later decoding steps cannot be conditioned on them, so the distribution shifts; conversely, if the model generates verbose text, it can push earlier relevant tokens out of the window sooner, a self-induced loss of conditioning information.
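A sketch of $\pi_L$ over token ids (pure Python; the token ids are placeholders) shows how material beyond the last $L$ tokens simply disappears from the effective context:

```python
from typing import List

def pi_L(tokens: List[int], L: int) -> List[int]:
    """Context-window truncation: keep only the last L tokens."""
    return tokens[-L:] if len(tokens) > L else tokens

# Illustrative: a constraint stated early in the transcript falls out of
# the window once enough later tokens have been appended.
transcript = list(range(100))        # placeholder token ids; index 0 is oldest
effective = pi_L(transcript, L=64)
print(effective[0])                  # 36: tokens 0..35 are no longer visible
```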
Let $U_n$ be the user’s $n$-th message, $M_n$ the model’s $n$-th response, and $S_n$ the transcript state immediately before $U_n$ is appended. Define an append operator $\mathrm{Append}$ that updates transcript state (primitive at transcript level; assumption: it records messages and roles), and define effective-context extraction $C(S)=\pi_L(\sigma(S))$ as above. The turn-to-turn loop (derived; fragment-local) is:
$$S_n^{+} = \mathrm{Append}(S_n, U_n), \qquad M_n\ \text{generated from}\ p_\theta\big(\cdot \mid C(S_n^{+})\big)\ \text{under policy}\ \alpha, \qquad S_{n+1} = \mathrm{Append}(S_n^{+}, M_n).$$
This makes explicit a coupling that is often implicit: the user’s next message $U_{n+1}$ is not generated by the model but is (in reality) influenced by the content of $M_n$, and the model’s next response $M_{n+1}$ is conditioned on $U_{n+1}$ through $S_{n+1}$. Within this fragment, the user is an external agent, so we do not model $U_{n+1}$ probabilistically unless we explicitly add a user model; nonetheless, the transcript update shows how the model’s outputs perturb the future contexts it will later condition on. This is the conversational feedback loop: generation changes the shared state; the shared state changes the next distribution; the next distribution changes generation.
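A schematic of this turn-level update, with hypothetical helpers (`append`, `serialize`, `effective_context`, `model_reply`) standing in for $\mathrm{Append}$, $\sigma$, $\pi_L\circ\sigma$, and the token-level decode loop; real systems use a tokenizer and a fixed chat template rather than the whitespace splitting used here:

```python
from typing import List, Tuple

Message = Tuple[str, str]            # (role, text); hypothetical transcript schema

def append(state: List[Message], role: str, text: str) -> List[Message]:
    """Append operator: record a message together with its role."""
    return state + [(role, text)]

def serialize(state: List[Message]) -> List[str]:
    """Stand-in for sigma: flatten the transcript into a token sequence."""
    tokens: List[str] = []
    for role, text in state:
        tokens += [f"<{role}>"] + text.split() + ["<end>"]
    return tokens

def effective_context(state: List[Message], L: int) -> List[str]:
    """C(S) = pi_L(sigma(S)): serialize, then keep the last L tokens."""
    return serialize(state)[-L:]

def model_reply(context: List[str]) -> str:
    """Placeholder for the token-level decode loop of the previous sections."""
    return f"(reply conditioned on {len(context)} context tokens)"

# One conversational turn: user message in, model response out, both appended.
S: List[Message] = []                            # S_1: empty transcript
S = append(S, "user", "summarize the plan")      # U_1 arrives
M = model_reply(effective_context(S, L=32))      # M_1 from p_theta(. | C(S))
S = append(S, "assistant", M)                    # S_2: transcript before U_2
print(S)
```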
Within this fragment, $p_\theta(\cdot\mid C)$ is a distribution over tokens as continuations of the serialized context; it is not a distribution over “truth” or “facts” unless one adds an external semantics that maps token sequences to world states. A high probability token means “locally typical continuation under the learned text distribution,” not “correct.” Conversely, a low probability token is not necessarily wrong; it can be rare, domain-specific, or simply dispreferred given the current wording. Many apparent contradictions arise from forgetting that (i) the model conditions on the exact tokenization and formatting of the transcript, (ii) the decoding policy may truncate or rescale logits, and (iii) the output is a single sampled/selected path through a branching space of possible continuations.
A further caution local to truncation: once top-$k$/top-$p$ deletes tokens, the resulting distribution is not the model distribution but a policy-modified one. Any claim like “the model assigned probability zero” is ambiguous unless one specifies whether it refers to the raw softmax distribution or the post-truncation distribution.
Token-level recursion for a single response (derived; fragment-local):
$$x_t = f_\theta(C_t), \qquad \tilde{p}_t = \alpha(x_t), \qquad y_t \sim \tilde{p}_t\ \ (\text{or } y_t = \operatorname*{arg\,max}_i \tilde{p}_{t,i}), \qquad C_{t+1} = C_t\,\Vert\,y_t,$$
where $\alpha(x_t)$ denotes the policy's logit processing (temperature, truncation), $\Vert$ denotes appending the token, and the recursion runs until a stop condition fires.
Turn-level recursion for chat (derived; fragment-local):
$$S_n^{+} = \mathrm{Append}(S_n, U_n), \qquad M_n\ \text{generated from}\ p_\theta\big(\cdot \mid C(S_n^{+})\big)\ \text{under policy}\ \alpha, \qquad S_{n+1} = \mathrm{Append}(S_n^{+}, M_n).$$
This pair of recursions is the operational core of “how contemporary language models decide what to say”: a continuous high-dimensional score map produces logits; softmax converts them to a distribution; decoding (temperature, truncation, selection) chooses a discrete token; the choice feeds back into the next step; and, in conversation, the resulting text feeds back through the transcript into the next turn’s conditioning context.