Abstract
Similarity metrics applied to learned embeddings do not measure “semantic proximity” in any direct sense; they measure proximity in a representation geometry induced by a training procedure, a data distribution, and an architectural/parameterization choice, all of which are typically implicit. This paper treats “semantic similarity” operationally as whatever a downstream task’s evaluation functional rewards, and analyzes cosine similarity, dot product, and Euclidean distance as operators that discard different parts of an embedding’s information (angle, norm, offsets), thereby imposing distinct inductive biases on retrieval, clustering, and filtering systems. The central claim is narrow: cosine similarity is conditionally effective when (i) embeddings are unit-normalized or near-normalized, (ii) the embedding distribution is approximately isotropic (or has been made so by post-processing), and (iii) the training objective aligns with angular discrimination; outside these conditions, cosine’s invariances can become liabilities. The contributions are (a) a taxonomy of metric invariances and the information each metric erases, (b) a failure analysis of cosine similarity under anisotropy and frequency-linked norm effects documented in primary sources on contextual and static embeddings, including transformer anisotropy and frequency-dependent similarity underestimation, and (c) a geometry-audit protocol that treats metric choice as a system design decision coupled to training, domain, and language fragments rather than as a universal default. We emphasize partiality and provenance: conclusions are fragment-local to an embedding pipeline (model, pooling, data, training, post-processing, deployment), and any claim of “semantic similarity” should name the metric and the geometric assumptions it relies on.
1. Problem framing and scope
We study finite-dimensional vector representations produced by learned encoders, primarily transformer-based encoders for text, trained under reconstruction objectives (e.g., masked language modeling) and/or contrastive objectives (e.g., InfoNCE-style losses), and then used post hoc as points in $\mathbb{R}^d$ for nearest-neighbor style computations. The object of study is not “meaning” in an ontological sense; it is the induced ordering or scoring over candidates produced by composing (i) an embedding map $f$ and (ii) a similarity or distance operator $m$. Formally, for inputs $x \in \mathcal{X}$, an embedding model is a map $f:\mathcal{X}\to\mathbb{R}^d$, and a downstream system defines a scoring functional $s(x,y)=m(f(x),f(y))$ (or a distance $d(x,y)$) that is used to rank, cluster, or filter. “Semantic similarity” is treated operationally: if a task defines a performance functional $\mathrm{Perf}(m\circ f)$ (e.g., retrieval recall@k, STS correlation, clustering purity), then “semantic similarity” is whatever proximity relation improves $\mathrm{Perf}$ in that task fragment; no claim is made that the resulting scores correspond to a unique, global semantic invariant across tasks or corpora.
Scope is restricted to post-hoc similarity metrics commonly used in embedding systems—cosine similarity, dot product (inner product), and Euclidean ($\ell_2$) distance—and to simple geometric corrections that do not train new similarity heads: per-vector normalization, mean subtraction, and whitening/isotropy post-processing. We explicitly exclude rerankers, cross-encoders, learned similarity heads, and any pipeline where similarity is learned end-to-end at serving time, not because they are unimportant, but because they change the object of analysis (the metric becomes part of a learned model, not an operator applied to fixed embeddings). The key assumption stated up front is local: representation geometry is not neutral; it is a derived structure shaped by training dynamics and corpus statistics, and post-hoc metrics inherit its biases.
To keep provenance explicit, we will refer to a geometry fragment as the tuple
$$\mathfrak{F} \;=\; (f,\; \pi,\; D,\; \mathcal{L},\; P,\; m),$$
where $f$ is the encoder, $\pi$ is the pooling or readout producing a single vector (e.g., [CLS], mean pooling, last-layer vs multi-layer aggregation), $D$ is the (possibly multilingual, domain-specific) data distribution that defined training and/or evaluation, $\mathcal{L}$ is the training loss (or fine-tuning loss), $P$ is any post-processing (normalization, whitening, PC removal), and $m$ is the deployed metric. The paper’s claims are intended to be read as statements about families of fragments, not as universal laws about cosine similarity or Euclidean distance.
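As a provenance convention, a fragment can be recorded alongside any reported result. A minimal sketch follows; the field names and example values are illustrative, not a proposed standard:
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class GeometryFragment:
    # Provenance record for a fragment F = (f, pi, D, L, P, m).
    encoder: str                     # f: model identifier or checkpoint name
    pooling: str                     # pi: e.g. "cls", "mean", "last_layer"
    data: str                        # D: corpus / domain / language mixture description
    loss: str                        # L: training or fine-tuning objective
    postprocessing: Tuple[str, ...]  # P: ordered post-hoc steps, e.g. ("center", "unit_norm")
    metric: str                      # m: "cosine", "dot", or "l2"
    notes: Optional[str] = None      # e.g. geometry-audit results and the metric decision record

fragment = GeometryFragment(
    encoder="example-encoder-v1", pooling="mean", data="domain X, en+de",
    loss="contrastive (InfoNCE)", postprocessing=("center", "unit_norm"),
    metric="cosine", notes="random-pair cosine ~0.04 after centering")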
2. Representation geometry as a first-class object
Embedding-space geometry is not primitive; it is derived from how $f$ is trained and how $\pi$ aggregates internal states into a vector. Norms, angles, and pairwise distances are emergent statistics of a point cloud $\{z_i\}_{i=1}^n$ with $z_i = (\pi\circ f)(x_i)$ under $x_i \sim D$. Even if $f$ is fixed, changing $\pi$ (e.g., CLS vs mean pooling) changes the induced distribution and therefore changes the meaning of “distance.” Several primary sources make this dependence concrete by showing that transformer-derived representations are often strongly anisotropic: vectors occupy a narrow cone in $\mathbb{R}^d$ rather than being directionally uniform, and this anisotropy varies by layer and model family [1]. Similarly, for SGNS-style word embeddings, vectors can exhibit a narrow-cone geometry driven by the ratio of positive to negative samples, rather than by pure semantic relations [2]. These observations are not merely diagnostic curiosities; they change the operational effect of a metric because cosine similarity implicitly assumes that angular differences are informative, while Euclidean distance assumes that both angular and radial differences are informative.
Definition (embedding map). (primitive; fragment-local; assumes fixed encoder and pooling) An embedding map is $g := \pi\circ f:\mathcal{X}\to\mathbb{R}^d$.
Definition (metric operator). (primitive; fragment-local; assumes vector-space representation) A metric operator is either a similarity $m:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ (larger means “closer”) or a distance $d:\mathbb{R}^d\times\mathbb{R}^d\to[0,\infty)$ (smaller means “closer”). In this paper, $m$ ranges over cosine similarity and dot product, while $d$ ranges over $\ell_2$ distance.
Definition (anisotropy proxy). (heuristic; fragment-local; assumes i.i.d. samples from a distribution on $\mathbb{R}^d$) Let $u,v$ be independent draws from the empirical embedding distribution. A common anisotropy proxy is the expected cosine similarity between independent samples:
$$A_{\cos} \;=\; \mathbb{E}\big[\cos(u,v)\big] \quad \text{where}\quad \cos(u,v)=\frac{u^\top v}{\|u\|\,\|v\|}.$$
In a perfectly isotropic distribution with mean zero and rotational symmetry on the sphere, $A_{\cos}$ is near $0$ (for large $d$). In practice, contextualized embeddings can yield large positive $A_{\cos}$ (sometimes extreme), indicating a “cone” structure [1]. This proxy is not a complete characterization of isotropy (and can be criticized as insufficient in high dimensions), but it is simple and widely used in empirical embedding-geometry analyses, including multilingual isotropy audits [17].
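A minimal estimator of $A_{\cos}$ (sketch; the sample sizes and the dimension of the isotropic baseline are illustrative):
import numpy as np

def expected_random_cosine(Z, n_pairs=100000, seed=0):
    # Estimate A_cos = E[cos(u, v)] over independent pairs drawn from the point cloud Z (n, d).
    rng = np.random.default_rng(seed)
    n = len(Z)
    i, j = rng.integers(0, n, n_pairs), rng.integers(0, n, n_pairs)
    keep = i != j
    U, V = Z[i[keep]], Z[j[keep]]
    cos = (U * V).sum(axis=1) / (np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1) + 1e-12)
    return float(cos.mean())

# Matched isotropic baseline: Gaussian vectors are directionally uniform, so their
# expected pairwise cosine is near 0 for large d; compare A_cos for Z against this.
baseline = expected_random_cosine(np.random.default_rng(1).standard_normal((10000, 768)))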
Definition (dominant-direction decomposition). (derived; fragment-local; assumes second moments exist) Let $\mu=\mathbb{E}[z]$ and $\Sigma=\mathbb{E}[(z-\mu)(z-\mu)^\top]$. If $\Sigma$ has a few dominant eigenvalues, then many embeddings share a small number of “rogue” directions. Post-processing methods that remove top principal components or whiten attempt to reduce such dominance [3], [5], [6].
The primary empirical point is subtractive: metrics do not recover semantics; they select which geometric features of $g(x)$ will be used as the operational carrier of task-relevant information. If $g$ produces a distribution where directions are compressed into a cone, then cosine similarity becomes less discriminative because many random pairs have high cosine similarity, a phenomenon reported for contextualized representations, with anisotropy increasing in higher layers [1]. If $g$ produces norms that encode frequency, noise, length, specificity, information gain, or domain cues, then cosine similarity discards that magnitude information by construction, while dot product and $\ell_2$ distance partially retain it [11], [12], [16]. These are not defects of cosine; they are consequences of what cosine is designed to ignore.
3. Metric invariances and what they discard
A similarity operator is best understood by enumerating its invariances (transformations that leave scores unchanged) and then reading invariances as information discard. The point is comparative: no metric is “best” in the abstract; each metric defines an equivalence class of embeddings that it cannot distinguish. Let $u,v\in\mathbb{R}^d$ and let $\alpha,\beta>0$, $a\in\mathbb{R}^d$.
Cosine similarity:
$$s_{\cos}(u,v)=\frac{u^\top v}{\|u\|\|v\|}.$$
Dot product:
$$s_{\cdot}(u,v)=u^\top v.$$
Euclidean distance:
$$d_2(u,v)=\|u-v\|_2.$$
We collect key invariances and sensitivities (for pairwise transformations applied independently to $u$ and $v$ unless stated otherwise):
| Operator | Behavior under scaling | Translation invariance (common shift) | Norm sensitivity | Angle sensitivity |
|---|---|---|---|---|
| $s_{\cos}$ | $s_{\cos}(\alpha u,\beta v)=s_{\cos}(u,v)$ | generally no | discarded (by normalization) | retained |
| $s_{\cdot}$ | $s_{\cdot}(\alpha u,\beta v)=\alpha\beta\, s_{\cdot}(u,v)$ | generally no | retained (multiplicative) | retained (via $u^\top v=\|u\|\|v\|\cos\theta$) |
| $d_2$ | $d_2(\alpha u,\alpha v)=\alpha\, d_2(u,v)$ | $d_2(u+a,v+a)=d_2(u,v)$ | retained (radial and angular) | retained |
Two remarks are load-bearing. First, cosine similarity is scale-invariant and therefore discards magnitude information. If an embedding model encodes useful signal in the norm—frequency, confidence, length, specificity, information gain, or domain cues—then cosine cannot use that signal because it explicitly projects onto the unit sphere. Second, dot product conflates norm and angle; it amplifies high-norm vectors even if angular alignment is mediocre. This conflation can be a feature if norm encodes salience, but a bug if norm mostly encodes confounds (e.g., raw frequency in static embeddings [11] or anisotropy-driven “common direction” effects [3]). Euclidean distance penalizes both radial differences and angular differences in a coupled way; it can be interpreted as a mixture of norm mismatch and angular mismatch.
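The invariances in the table can be checked numerically; the following sketch uses synthetic vectors and the operators as defined above:
import numpy as np

rng = np.random.default_rng(0)
u, v, a = rng.standard_normal(64), rng.standard_normal(64), rng.standard_normal(64)
alpha, beta = 3.0, 0.5

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
dot = lambda x, y: x @ y
l2 = lambda x, y: np.linalg.norm(x - y)

assert np.isclose(cos(alpha * u, beta * v), cos(u, v))                 # cosine: scale-invariant
assert np.isclose(dot(alpha * u, beta * v), alpha * beta * dot(u, v))  # dot: rescales multiplicatively
assert np.isclose(l2(u + a, v + a), l2(u, v))                          # l2: invariant to a common shift
print(cos(u + a, v + a), cos(u, v))  # generically differ: cosine is not translation-invariant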
A standard equivalence result clarifies why practitioners often observe cosine and $\ell_2$ behaving similarly when embeddings are normalized: under unit normalization, Euclidean distance is a monotone transform of cosine similarity, so they induce the same ranking.
Proposition (ordering equivalence under unit normalization). (derived; fragment-local; assumes nonzero vectors and explicit unit-normalization) Let $\hat u = u/\|u\|$ and $\hat v = v/\|v\|$. Then
$$\|\hat u - \hat v\|_2^2 \;=\; 2 - 2\,\hat u^\top \hat v \;=\; 2 - 2\, s_{\cos}(u,v).$$
Hence, for fixed $\hat u$, ordering candidates by decreasing cosine similarity is equivalent to ordering by increasing $\ell_2$ distance (or squared distance). Proof is given in Appendix A.
This proposition is frequently operationalized in vector search systems because approximate nearest neighbor libraries often support either inner product or $\ell_2$ distance, and unit-normalization converts one to the other by a monotone mapping. The proposition is not a claim that cosine and $\ell_2$ are globally equivalent; it is a conditional identity that holds when the fragment includes explicit normalization and when retrieval uses only orderings, not calibrated similarity scores.
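A numerical check of the proposition and of the resulting ranking equivalence (sketch; the query and candidate set are synthetic):
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(128)            # query
C = rng.standard_normal((1000, 128))    # candidates

q_hat = q / np.linalg.norm(q)
C_hat = C / np.linalg.norm(C, axis=1, keepdims=True)

cos_scores = C_hat @ q_hat                       # cosine similarity after unit normalization
sq_dists = ((C_hat - q_hat) ** 2).sum(axis=1)    # squared l2 distance on the unit sphere

assert np.allclose(sq_dists, 2.0 - 2.0 * cos_scores)                  # the identity above
assert np.array_equal(np.argsort(-cos_scores), np.argsort(sq_dists))  # same induced ordering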
What matters for system semantics is the delta between metrics: cosine removes radial degrees of freedom; dot product preserves them and can turn norm confounds into ranking confounds; Euclidean distance is translation-invariant but not scale-invariant and will penalize norm mismatches even if angular alignment is good. The metrics are therefore not interchangeable unless the embedding distribution and task reward function render the erased information irrelevant.
4. When equivalence breaks: empirical failure regimes
Equivalence between cosine and $\ell_2$ breaks immediately when unit-normalization is not performed, but more importantly it breaks when normalization is performed on a geometry where the erased information (norm) is task-relevant or where the retained information (angle) is unreliable because angles are compressed by anisotropy. The failures discussed here are conditional; they identify fragments where cosine’s invariances are misaligned with the downstream objective.
4.1 Anisotropic cones and cosine saturation
Ethayarajh reports that contextualized word representations in BERT, ELMo, and GPT-2 are anisotropic in all layers, occupying a narrow cone rather than being uniformly distributed, with anisotropy increasing in higher layers; in GPT-2’s last layer, the average cosine similarity between two random word embeddings can be “almost perfect,” implying extreme saturation [1]. Mimno and Thompson observe a related phenomenon in SGNS: word vectors occupy a narrow cone and are structured relative to context vectors in a way that depends on the negative sampling ratio, not purely on semantic similarity [2]. These findings are different in mechanism (transformer contextualization vs SGNS training geometry) but similar in consequence: cosine similarity becomes poorly calibrated as a similarity proxy because the baseline similarity between unrelated points is already high. When random pairs have cosine similarity near a large positive constant, the dynamic range available to represent meaningful variation is reduced. In retrieval terms, if $s_{\cos}(q,x)$ concentrates in a narrow interval for most candidates $x$, small perturbations (noise, quantization, batch effects) can reorder neighbors, making rankings brittle.
A fragment-local diagnostic is to measure the distribution of $\cos(z_i,z_j)$ for randomly sampled pairs and compare it to the distribution under a matched isotropic baseline (e.g., random unit vectors). Large positive shifts indicate a cone. Another diagnostic is to compute the fraction of variance explained by top principal components; dominance suggests that most points share a common direction that will inflate cosine scores even after unit normalization (because normalization does not remove shared directionality). The ABTT post-processing method explicitly targets this by subtracting the mean and removing top PCs, motivated by the observation that embeddings share common dominating directions; it reports improvements on both lexical and sentence-level tasks [3]. The precise claim here is not that anisotropy is always bad; it is that cosine similarity assumes usable angular variance, and anisotropy reduces angular expressivity by collapsing directions.
4.2 Frequency-linked pathologies: underestimation and norm/frequency confounds
A distinct failure regime arises when token frequency interacts with representational geometry. Zhou et al. document systematic underestimation: cosine similarity over contextual embeddings underestimates the similarity of frequent words relative to human judgments, even after controlling for polysemy and other factors, and they trace the effect to training data frequency differences [14]. Wannasuphoprasit et al. propose a discounting method to address this “cosine similarity underestimation” for high-frequency words on contextual embedding similarity datasets, reporting improvements by correcting frequency-sensitive components [15]. These papers matter here because they show that cosine’s behavior is not frequency-neutral even though it discards norms; frequency can distort angles themselves, producing frequency-conditioned geometry where cosine is miscalibrated across frequency strata. In other words, discarding norm does not eliminate frequency effects if frequency also shapes direction distributions.
In static embeddings, frequency–norm coupling is more explicit. Wilson and Schakel perform controlled corpus interventions for word2vec CBOW and find that word vector length depends roughly linearly on word frequency and on the level of noise in the word’s co-occurrence distribution [11]. Oyama et al. show that embedding norm can encode information gain, relating magnitude to informational properties rather than purely to semantic content [12]. Together, these results support a general point: norm can be signal or noise depending on task. In lexical semantics tasks where frequency should not dominate, norm sensitivity can be harmful; in tasks where “salience” or “informativeness” is relevant, norm sensitivity can be helpful. Cosine forces a decision: it discards all norm information, including any task-relevant part. Dot product forces the opposite decision: it retains all norm information, including confounds. Euclidean distance retains norm differences but in an additive geometric way rather than multiplicative.
4.3 Long-text and pooling-induced norm effects
Even if a model is trained to produce semantically meaningful angles, pooling can introduce length- or entropy-correlated norms. Adi et al. show that simple sentence embedding constructions can preserve easily predictable surface properties such as sentence length, and length prediction is a standard auxiliary task revealing what embeddings encode [16]. In practical transformer pooling, mean pooling over token embeddings can produce norms that correlate with sequence length, token diversity, or representation smoothness, depending on normalization and layer choices; the fragment-specific direction here is to treat norm as a measurable variable and audit whether it correlates with length or other non-semantic confounds in the target domain. If norms encode length, cosine will remove that signal; whether that is desirable depends on whether the downstream task should treat longer documents as more informative or merely more verbose. The same observation applies to document embeddings: systems that embed long documents often discover that score distributions differ across length buckets, which can cause retrieval bias if a single global threshold is used.
4.4 The “metric is not semantics” critique via model non-identifiability
Steck et al. provide an analytic critique from a different angle: in embeddings derived from certain regularized linear models (e.g., matrix factorization variants), cosine similarities can be arbitrary and even non-unique due to degrees of freedom (e.g., dimension-wise rescalings) that preserve the training objective’s dot products but change cosine after normalization [7]. The relevance to transformer embeddings is not that transformers are linear MF, but that modern deep models use multiple implicit and explicit regularizations (weight decay, layer norm, dropout, temperature scaling, etc.), and the embedding geometry (including norms) may be shaped by these choices in ways not aligned with the intended semantic reading of cosine similarity. The subtractive lesson is methodological: even if cosine “works” on a benchmark, one should avoid treating cosine as intrinsically semantic; rather, cosine is a projection that may or may not be aligned with the training objective’s identifiable structure.
5. Geometry-aware corrections: whitening, norm discounting, angular training
Given that failures are geometric, corrections are naturally geometric. We separate (i) post-hoc corrections applied to a fixed embedding space from (ii) training-time objectives that explicitly shape the geometry. Both are derived operators, not primitives.
5.1 Post-hoc isotropy corrections: ABTT, whitening, and flow-based isotropization
ABTT (All-but-the-Top) removes the mean vector and a small number of dominant principal directions from embeddings, motivated by the empirical observation that embeddings share common dominating directions and that isotropy correlates with better downstream performance [3]. Whitening methods go further by transforming the embedding distribution to have (approximately) identity covariance, often after mean subtraction:
$$z' \;=\; W (z-\mu), \quad W \approx \Sigma^{-1/2},$$
where $\mu$ and $\Sigma$ are estimated on a corpus and $W$ is a whitening transform (possibly regularized). Su et al. propose whitening sentence representations to improve semantic retrieval and speed, presenting it as a lightweight post-processing step with empirical gains on semantic tasks [6]. Huang et al. (WhiteningBERT) study unsupervised sentence embeddings derived from pretrained models and report that a simple whitening-based normalization can consistently improve performance across multiple STS datasets, alongside pooling choices such as mean pooling and layer combination [5]. Li et al. (BERT-flow) frame the problem as mapping an anisotropic sentence embedding distribution into a smooth isotropic Gaussian using normalizing flows trained unsupervised, reporting improvements on semantic similarity tasks [4]. These methods differ in strength and assumptions: ABTT assumes a small number of dominating directions are removable without losing semantic signal; whitening assumes that second-order statistics capture the major distortions; flow-based methods assume that a more flexible invertible mapping is justified and stable.
A key limitation is fragment-dependence. Whitening assumes a globally correctable covariance structure. In cross-domain or multilingual settings, covariance structure can differ by domain or language; a single global whitening matrix can overcorrect some regions and undercorrect others, potentially harming cross-fragment comparability. Empirical work on whitening effects suggests that whitening can remove certain biases (e.g., frequency bias) but can also remove task-relevant structure depending on setting; this motivates treating whitening as an ablation step rather than a universal improvement [28]. The safe claim is procedural: whitening is a knob that changes the geometry; its utility must be locally validated.
A minimal implementation sketch (for reproducibility) is:
import numpy as np

# Given embeddings Z: (n, d) array
mu = Z.mean(axis=0)
X = Z - mu
Sigma = (X.T @ X) / (len(Z) - 1)
# Regularized inverse square root via eigendecomposition
eigvals, eigvecs = np.linalg.eigh(Sigma)
eps = 1e-5
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
Z_white = X @ W.T
# optionally: unit-normalize afterwards depending on metric
Z_white_unit = Z_white / np.linalg.norm(Z_white, axis=1, keepdims=True)
This code is not an algorithmic contribution; it clarifies that “whitening” is an operator with explicit statistical assumptions (stationarity of covariance under the target distribution, adequacy of second moments) that can fail under distribution shift.
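A companion sketch for ABTT-style post-processing [3], i.e., mean removal followed by removal of a small number of dominant principal directions; the number of removed components is a fragment-local hyperparameter:
import numpy as np

def abtt(Z, n_components=2):
    # All-but-the-Top-style post-processing: subtract the mean, then remove the
    # projections onto the top principal directions of the centered point cloud.
    X = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    top_dirs = Vt[:n_components]              # (n_components, d)
    return X - (X @ top_dirs.T) @ top_dirs    # remove dominant shared directions

# Usage: Z_abtt = abtt(Z, n_components=2); re-run the isotropy diagnostics (Appendix B) afterwards.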
5.2 Norm discounting: reintroducing magnitude partially
If cosine discards magnitude entirely, one correction is to reintroduce magnitude in a controlled way rather than switching wholesale to dot product. One can define a discounted cosine family:
$$s_{\delta}(u,v)\;=\;\frac{u^\top v}{(\|u\|\|v\|)^{\delta}}, \quad \delta\in[0,1],$$
where $\delta=1$ recovers cosine similarity and $\delta=0$ recovers dot product. This family makes explicit that “cosine vs dot product” is not binary but a continuum over norm sensitivity. Discounting methods in the literature are often motivated differently: Wannasuphoprasit et al. propose a discounting method targeted at correcting cosine underestimation for high-frequency words in contextual embedding similarity tasks [15]. The precise functional form need not match $s_\delta$; the design principle is that the metric should expose a tunable parameter controlling how much norm information is allowed to influence scores, and that the parameter can be selected by local validation on a calibration set.
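A sketch of the $s_\delta$ family as a scoring function over a candidate matrix, with $\delta$ exposed as a tunable knob; the calibration loop below is schematic and the relevance format is illustrative:
import numpy as np

def discounted_similarity(q, C, delta=1.0, eps=1e-12):
    # s_delta(q, c) = (q . c) / (||q|| ||c||)^delta
    # delta = 1.0 recovers cosine similarity; delta = 0.0 recovers the dot product.
    norm_prod = np.linalg.norm(q) * np.linalg.norm(C, axis=1) + eps
    return (C @ q) / (norm_prod ** delta)

def select_delta(calibration, deltas=np.linspace(0.0, 1.0, 11), k=10):
    # calibration: list of (query vector, candidate matrix, index of the relevant candidate) triples
    def recall_at_k(delta):
        hits = sum(int(rel in np.argsort(-discounted_similarity(q, C, delta))[:k])
                   for q, C, rel in calibration)
        return hits / max(len(calibration), 1)
    return max(deltas, key=recall_at_k)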
A separate but related operator is frequency-aware renormalization, where embeddings are normalized with a frequency-conditioned or component-conditioned scheme. Such methods should be treated as fragment-specific, because they explicitly bake in corpus statistics; cross-domain transfer can break them.
5.3 Training-time angular objectives: making the angle meaningful by construction
Post-hoc corrections attempt to repair geometry after the fact; training-time objectives attempt to prevent distortions by explicitly shaping geometry. Contrastive learning on normalized representations is an instance: Wang and Isola analyze contrastive representation learning on the hypersphere and show that contrastive losses can be decomposed into terms encouraging alignment (pulling positives together) and uniformity (spreading representations on the sphere), making explicit that “uniform on the hypersphere” is a geometric target when features are normalized [10]. SimCSE applies a contrastive objective for sentence embeddings and argues (theoretically and empirically) that contrastive learning regularizes anisotropic pretrained spaces toward uniformity while aligning positives [8]. These sources support a conditional statement: if the training objective explicitly uses normalized features and cosine-like similarity in its logits, then cosine similarity at inference time is closer to being aligned with what the model optimized.
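Alignment and uniformity in the sense of [10] can be measured directly on unit-normalized embeddings; the sketch below assumes paired positives (A[i], B[i]) are available and uses the commonly reported setting t = 2:
import numpy as np

def alignment_loss(A, B):
    # A, B: (n, d) unit-normalized embeddings of positive pairs (A[i], B[i]); lower is better.
    return float(np.mean(np.sum((A - B) ** 2, axis=1)))

def uniformity_loss(Z, t=2.0):
    # log E exp(-t ||z_i - z_j||^2) over distinct pairs of unit-normalized embeddings; lower is better.
    # The pairwise matrix is O(n^2); subsample Z for large corpora.
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(Z), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

# An anisotropic cone tends to score poorly on uniformity even when alignment on positives looks good.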
Angular-margin objectives in metric learning provide a different primary example: ArcFace introduces an additive angular margin loss to increase angular separability of classes under normalized features and weights, explicitly tying discrimination to geodesic/angular distances on a sphere [25]. ArcFace is from face recognition, not text embeddings, but it demonstrates a general design: if one intends angles to be meaningful, one can train with losses that directly optimize angular separation rather than relying on incidental geometric properties. The transfer of such objectives to text embedding training must be treated as a hypothesis, not a conclusion: textual similarity is not classification by identity, and the space of positives/negatives is more ambiguous; nevertheless, the family of angular objectives clarifies what it means for “angle” to be an optimized carrier of meaning.
Finally, geometry-aware training is not a free lunch. Steck et al.’s analysis suggests that degrees of freedom and regularization can make cosine similarity opaque or arbitrary in some model families [7]. Training on cosine does not automatically produce a globally interpretable cosine geometry; it produces a fragment-specific equilibrium shaped by optimization and regularization.
6. Multilingual and cross-domain considerations
Multilingual embedding spaces introduce additional distortions because the training distribution is a mixture of language-conditioned subdistributions with different tokenization statistics, corpus sizes, and domain mixes. Multilingual BERT (mBERT) is trained on concatenated monolingual corpora without explicit alignment, yet it exhibits surprising cross-lingual transfer; however, it also exhibits systematic deficiencies and dependence on typological similarity in transfer performance [18]. These facts imply that the embedding space is not a single homogeneous geometry; it is a superposition of partially aligned subspaces whose alignment quality varies across language pairs and data availability. In such a setting, metric choice becomes entangled with both alignment and per-language norm/angle statistics.
An isotropy analysis in the multilingual BERT embedding space reports that isotropy can vary across layers and that cosine similarity between random embeddings can be used as an approximation to diagnose isotropy, with isotropic random embeddings yielding near-zero cosine similarity [17]. If languages occupy different “cones” or exhibit different dominant directions, then cosine similarity can mask cross-language norm asymmetries (because it discards norm) while still being distorted by cross-language directional biases (because angles are computed in a shared coordinate system that may be unevenly aligned). Conversely, dot product may amplify cross-language norm differences if some languages systematically produce higher norms (due to tokenization length, script properties, or corpus-driven calibration), thereby biasing retrieval toward those languages even when semantic alignment is poor.
Sentence-level multilingual embedding models such as LASER and LaBSE are explicitly designed for cross-lingual alignment and retrieval. LASER produces massively multilingual sentence embeddings intended for cross-lingual transfer [20]. LaBSE trains a dual-encoder sentence embedding model for language-agnostic representations and evaluates cross-lingual similarity on multilingual retrieval tasks (e.g., Tatoeba), typically using cosine similarity over normalized embeddings [19], [21]. The design intent here is important: if a model’s evaluation and training are built around normalized embeddings and cosine similarity, then cosine is a more defensible default in that fragment; if a multilingual space is merely incidental (e.g., mBERT used off-the-shelf with a pooling choice), then cosine inherits whatever anisotropy and norm artifacts the incidental geometry contains, and local validation becomes mandatory.
Cross-domain shift complicates further: a whitening transform estimated on one domain can misalign another; norm statistics can change with document style; and anisotropy can change with prompt format and pooling. Therefore, any claim like “cosine works best for multilingual embeddings” is too global. The defensible claim is procedural: multilingual systems should audit per-language distributions of norms and pairwise cosine baselines, and they should test whether normalization/whitening improves cross-lingual retrieval consistency in the target deployment mixture.
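A per-language audit sketch, assuming a language label is available for each embedding; it reports per-language norm statistics and random-pair cosine baselines:
import numpy as np

def per_language_audit(Z, langs, n_pairs=50000, seed=0):
    # Z: (n, d) embeddings; langs: length-n array of language codes.
    rng = np.random.default_rng(seed)
    langs = np.asarray(langs)
    report = {}
    for lang in np.unique(langs):
        Zl = Z[langs == lang]
        norms = np.linalg.norm(Zl, axis=1)
        i, j = rng.integers(0, len(Zl), n_pairs), rng.integers(0, len(Zl), n_pairs)
        keep = i != j
        U, V = Zl[i[keep]], Zl[j[keep]]
        cos = (U * V).sum(axis=1) / (np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1) + 1e-12)
        report[lang] = {"n": len(Zl), "norm_mean": float(norms.mean()),
                        "norm_std": float(norms.std()), "cos_mean_random": float(cos.mean())}
    return report

# Large cross-language gaps in norm_mean bias dot-product retrieval toward high-norm languages;
# large gaps in cos_mean_random indicate language-specific cones.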
7. Evaluation methodology: auditing the metric, not just the model
If “semantic similarity” is a property of $(g,m)$, then evaluation should explicitly separate representation quality from metric behavior. This section proposes a concrete protocol: a runbook for deciding whether cosine similarity is appropriate in a given geometry fragment. The protocol is deliberately mechanical; it is designed to expose failure regimes rather than to optimize a single benchmark score.
7.1 Diagnostics: angular distribution, isotropy, and saturation
Given a sample $Z=\{z_i\}$:
1. Random-pair cosine baseline. Sample pairs $(i,j)$ uniformly with $i\neq j$ and compute $\cos(z_i,z_j)$. Compare the mean and variance to an isotropic baseline (e.g., random unit vectors in $\mathbb{R}^d$). A large positive mean suggests anisotropy and potential cosine saturation [1], [17]. Report the distribution, not only the mean, because multimodality can indicate subspace mixtures (common in multilingual mixtures).
2. Dominant direction diagnostics. Compute PCA on centered embeddings and report the fraction of variance explained by the top $k$ components for small $k$ (e.g., $k\in\{1,2,5,10\}$). Large dominance suggests shared directions that can inflate cosine similarities even after unit normalization, motivating ABTT or whitening [3], [5], [6].
3. Neighborhood stability. For a set of query embeddings, compare nearest-neighbor sets under cosine vs $\ell_2$ vs dot product, with and without unit normalization (a sketch of this overlap measurement follows the list). Large neighbor-set instability indicates that the metric is materially changing retrieval semantics, not merely rescaling scores.
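A sketch of the neighbor-set stability diagnostic in item 3, measured as the average Jaccard overlap of top-$k$ neighbor sets under two scoring rules:
import numpy as np

def topk_sets(scores, k):
    # scores: (n_queries, n_candidates), larger means closer
    return [set(np.argsort(-row)[:k]) for row in scores]

def neighbor_jaccard(scores_a, scores_b, k=10):
    sets_a, sets_b = topk_sets(scores_a, k), topk_sets(scores_b, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]))

def cosine_scores(Q, C):
    Qh = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Ch = C / np.linalg.norm(C, axis=1, keepdims=True)
    return Qh @ Ch.T

def dot_scores(Q, C):
    return Q @ C.T

# Example: stability = neighbor_jaccard(cosine_scores(Q, C), dot_scores(Q, C), k=10)
# Values well below 1.0 indicate that norm variation is materially changing retrieval semantics.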
7.2 Diagnostics: norm distributions and norm–confound correlations
Compute $\|z_i\|$ and audit correlations against measurable confounds:
1. Length correlation. For text, correlate $\|z_i\|$ with token count or character count. If strong, then norm encodes length or pooling artifacts; decide whether the downstream task should be length-invariant. Adi et al. show that sentence embeddings can encode surface properties like length; treat norm–length correlation as an explicit design choice rather than as a surprise [16].
2. Frequency correlation (lexical and contextual). For token embeddings or pooled sentence embeddings, correlate norms (or direction-related statistics) with token frequency. Controlled experiments show static embedding norms depend on frequency and noise [11], and contextual similarity for frequent words can be systematically underestimated by cosine [14]. If the downstream task is sensitive to frequent items (e.g., stopwords, discourse markers), this matters for metric calibration.
3. Information-gain proxy. If available, compute an information-gain or entropy proxy and correlate with $\|z\|$. Oyama et al. suggest norm can encode information gain in word embeddings [12]. If norm aligns with desired salience, discarding it may harm ranking.
7.3 Ablations: normalization, whitening, and discounting as controlled interventions
Treat geometric post-processing steps as interventions and measure deltas (a schematic runner for this grid is sketched after the list):
- Baseline: raw embeddings with cosine, dot product, $\ell_2$.
- Intervention A: unit-normalize embeddings; measure whether ranking changes and whether performance changes. Under unit normalization, cosine and $\ell_2$ orderings should match; if performance differs, the system is not purely ranking-based or includes score thresholds (calibration issues).
- Intervention B: mean subtraction and ABTT (remove top PCs); measure performance changes and isotropy changes [3].
- Intervention C: whitening (global) and optional re-normalization; measure improvements or regressions on target tasks [5], [6].
- Intervention D: norm discounting family $s_\delta$ (or a literature-based discounting method for frequency underestimation); tune $\delta$ on a held-out calibration set and report sensitivity [15].
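A schematic runner for the intervention grid; post-processing steps and metrics are treated as composable operators, and recall@k against provided relevance labels stands in for the task's performance functional (names and the relevance format are illustrative):
import numpy as np

def unit_norm(Z):
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)

def score(Q, C, metric):
    if metric == "cos":
        return unit_norm(Q) @ unit_norm(C).T
    if metric == "dot":
        return Q @ C.T
    if metric == "l2":
        # negated distance so that larger always means "closer"
        return -np.linalg.norm(Q[:, None, :] - C[None, :, :], axis=-1)
    raise ValueError(metric)

def recall_at_k(scores, relevant, k=10):
    # relevant: list of sets of relevant candidate indices, one set per query
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([len(set(topk[i]) & relevant[i]) / max(len(relevant[i]), 1)
                          for i in range(len(relevant))]))

def run_grid(Q, C, relevant, postprocessors, metrics=("cos", "dot", "l2"), k=10):
    # postprocessors: dict mapping a name to a callable applied to both Q and C,
    # e.g. {"raw": lambda Z: Z, "unit": unit_norm, "centered": lambda Z: Z - corpus_mean}
    results = {}
    for pname, P in postprocessors.items():
        Qp, Cp = P(Q), P(C)
        for m in metrics:
            results[(pname, m)] = recall_at_k(score(Qp, Cp, m), relevant, k)
    return results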
The key output of the runbook is not a single number; it is a decision record: “cosine is acceptable in fragment $\mathfrak{F}$ because isotropy is approximately satisfied after $P$ and norm encodes confounds rather than signal,” or “cosine is rejected because norm carries task-relevant salience and angles are saturated.”
7.4 Benchmarks: separating “embedding quality” from “metric fit”
Benchmarks like BEIR (heterogeneous IR evaluation) and MTEB (multi-task embedding benchmark) are useful precisely because they vary tasks, domains, and languages, exposing that a single geometry/metric pairing rarely dominates everywhere [22], [23]. The methodological recommendation is to treat such benchmarks as stress tests for metric stability: if cosine vs dot product swaps ranking performance across tasks, then metric choice is not incidental. In practice, the correct move is to adopt fragment-local defaults: for a given deployment mixture, measure the metric delta on a representative evaluation set rather than relying on generic benchmark leaderboards.
8. Implications for semantic systems and downstream design
In retrieval-augmented generation (RAG), clustering, recommendation, and semantic filtering, the similarity metric is a hidden operator that shapes system behavior. Because nearest-neighbor search is often implemented via approximate methods (e.g., ANN libraries such as FAISS), the chosen metric also constrains which indexing structures are available and which approximations are valid; furthermore, many systems silently conflate “cosine” with “inner product on normalized vectors,” which is a conditional equivalence, not an identity of operators [24]. The systems implication is subtraction: if one does not name the metric and its geometric assumptions, one cannot interpret downstream behavior as “semantic.” For example, a cluster boundary in embedding space is a boundary under a chosen metric; changing from cosine to dot product changes cluster topology if norm varies materially. Similarly, a retrieval threshold used for semantic filtering (e.g., “accept if similarity > t”) is not stable under changes in normalization or whitening because score distributions shift; only orderings are preserved under monotone transforms, and even those only under explicit conditions (Appendix A).
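A small synthetic illustration of why similarity thresholds are fragment-local: anisotropy is injected through a shared direction, and a single centering step shifts the cosine score distribution enough to change which candidates pass a fixed threshold:
import numpy as np

def cosine_to_query(q, C):
    q = q / np.linalg.norm(q)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    return C @ q

rng = np.random.default_rng(0)
common = rng.standard_normal(64)                     # shared dominant direction
C = rng.standard_normal((5000, 64)) + 4.0 * common   # candidates in a "cone"
q = rng.standard_normal(64) + 4.0 * common

raw = cosine_to_query(q, C)
mu = C.mean(axis=0)
centered = cosine_to_query(q - mu, C - mu)           # same metric, different post-processing

t = 0.8
print("accepted at t=0.8, raw cosine:     ", int((raw > t).sum()))
print("accepted at t=0.8, centered cosine:", int((centered > t).sum()))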
Metric choice should therefore be treated as an architectural parameter with explicit invariances, documented alongside model choice. A system that claims “semantic similarity search” should minimally specify: (i) whether embeddings are normalized, (ii) whether any isotropy correction is applied, (iii) whether the embedding model was trained with angular objectives aligned to cosine, and (iv) whether norm is treated as signal or noise. Steck et al.’s critique reinforces that cosine similarity can be opaque or arbitrary in some embedding constructions; therefore, “cosine similarity” is not a semantics guarantee, even if it is a popular default [7]. The operational claim that survives is narrower: if a model is trained and evaluated with normalized embeddings and cosine similarity (common in contrastive embedding training), and if anisotropy is controlled, then cosine can be a stable operator for ranking, but it remains a property of a fragment, not a universal semantic metric [8], [10].
9. Conclusion and open questions
The conclusion is deliberately narrow. Cosine similarity is conditionally effective: it can be robust when embeddings are normalized, when angular differences are expressive (approximate isotropy or corrected isotropy), and when training objectives and evaluation tasks reward angular discrimination. It is not universally semantic. The failures surveyed—anisotropic cones causing cosine saturation [1], [2], frequency-conditioned underestimation for high-frequency words [14], [15], norm/frequency coupling in static embeddings [11], norm encoding of information gain [12], and analytic non-identifiability concerns for cosine in some embedding families [7]—are not contradictions; they are evidence that “semantic similarity” is a system-level property of $(f,\pi,D,\mathcal{L},P,m)$.
Open questions remain fragment-local but general in form. (i) When should norm be treated as signal versus noise? Existing sources show both possibilities: norms correlate with frequency and noise [11] and can encode information gain [12]. A principled criterion for “norm as salience” vs “norm as confound” is not settled. (ii) How should one design geometry-aware objectives without overfitting to a single benchmark geometry? Contrastive learning encourages uniformity on the hypersphere [10] and can reduce anisotropy [8], but the correct target geometry may differ across domains and languages. (iii) How can task-relative semantic invariants be formalized so that metric choice becomes a derived consequence rather than a heuristic default? Until such invariants are specified, the safest stance is procedural: audit geometry and metric behavior locally, report the metric explicitly, and treat “semantic similarity” as a claim with named assumptions and provenance.
Appendices
Appendix A. Formal equivalence under unit normalization
Let $\hat u = u/\|u\|$ and $\hat v = v/\|v\|$ with $\|u\|,\|v\|>0$. Then:
$$ \begin{aligned} \|\hat u - \hat v\|_2^2 &= (\hat u - \hat v)^\top(\hat u - \hat v) \\ &= \hat u^\top \hat u + \hat v^\top \hat v - 2\,\hat u^\top \hat v \\ &= 1 + 1 - 2\,\hat u^\top \hat v \\ &= 2 - 2\cos(u,v), \end{aligned} $$
since $\hat u^\top \hat v = \frac{u^\top v}{\|u\|\|v\|} = \cos(u,v)$ and $\|\hat u\|=\|\hat v\|=1$. Therefore, for fixed $\hat u$, minimizing $\|\hat u-\hat v\|_2^2$ over candidates $v$ is equivalent to maximizing $\cos(u,v)$ over candidates $v$, because the map $x \mapsto 2-2x$ is strictly decreasing on $\mathbb{R}$. The same holds for $\|\hat u-\hat v\|_2$ because square root is monotone on $[0,\infty)$. This equivalence concerns orderings; it does not imply that cosine scores and $\ell_2$ distances are interchangeable as calibrated values (e.g., thresholds), because monotone transforms preserve ranking but not scale.
Appendix B. Minimal audit script sketch (diagnostics only)
def audit_geometry(Z, lengths=None, freqs=None, k_pca=10, n_pairs=200000):
    # Z: (n, d) numpy array
    import numpy as np
    n, d = Z.shape
    norms = np.linalg.norm(Z, axis=1)
    # random-pair cosine baseline
    I = np.random.randint(0, n, size=n_pairs)
    J = np.random.randint(0, n, size=n_pairs)
    mask = (I != J)
    I, J = I[mask], J[mask]
    Zi, Zj = Z[I], Z[J]
    cos = (Zi * Zj).sum(axis=1) / (np.linalg.norm(Zi, axis=1) * np.linalg.norm(Zj, axis=1) + 1e-12)
    # PCA dominance (centered)
    X = Z - Z.mean(axis=0)
    # compute top-k eigenvalues via SVD
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var = (S**2) / (n - 1)
    frac = var[:k_pca] / var.sum()
    out = {
        "norm_mean": norms.mean(),
        "norm_std": norms.std(),
        "cos_mean_random": cos.mean(),
        "cos_std_random": cos.std(),
        "pca_frac_topk": frac,
    }
    if lengths is not None:
        out["corr_norm_length"] = np.corrcoef(norms, lengths)[0, 1]
    if freqs is not None:
        out["corr_norm_freq"] = np.corrcoef(norms, freqs)[0, 1]
    return out
This audit does not decide correctness; it produces measurable geometry summaries that can be compared across post-processing steps and metrics.
References (primary sources)
- [1] Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. EMNLP-IJCNLP.
- [2] Mimno, D., & Thompson, L. (2017). The strange geometry of skip-gram with negative sampling. EMNLP.
- [3] Mu, J., Bhat, S., & Viswanath, P. (2018). All-but-the-Top: Simple and Effective Postprocessing for Word Representations. ICLR.
- [4] Li, B., Zhou, H., He, J., Wang, M., Yang, Y., & Li, L. (2020). On the Sentence Embeddings from Pre-trained Language Models (BERT-flow). arXiv.
- [5] Huang, J., Tang, D., Zhong, W., Lu, S., Shou, L., Gong, M., Jiang, D., & Duan, N. (2021). WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Findings of EMNLP.
- [6] Su, J., et al. (2021). Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv.
- [7] Steck, H., Ekanadham, C., & Kallus, N. (2024). Is Cosine-Similarity of Embeddings Really About Similarity? arXiv.
- [8] Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP.
- [9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- [10] Wang, T., & Isola, P. (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML (PMLR).
- [11] Wilson, B. J., & Schakel, A. M. J. (2015). Controlled Experiments for Word Embeddings. arXiv.
- [12] Oyama, M., Yamagiwa, H., & Shimodaira, H. (2023). Norm of Word Embedding Encodes Information Gain. EMNLP.
- [13] Valentini, F., et al. (2023). Investigating the Frequency Distortion of Word Embeddings. arXiv.
- [14] Zhou, K., Ethayarajh, K., Card, D., & Jurafsky, D. (2022). Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. arXiv.
- [15] Wannasuphoprasit, S., et al. (2023). Solving Cosine Similarity Underestimation between High Frequency Words by Discounting. Findings of ACL.
- [16] Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., & Goldberg, Y. (2017). Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. arXiv.
- [17] Rajaee, S., Pilehvar, M. T., & Cheung, J. C. K. (2022). An Isotropy Analysis in the Multilingual BERT Embedding Space. OpenReview.
- [18] Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is Multilingual BERT? ACL.
- [19] Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. arXiv.
- [20] Artetxe, M., & Schwenk, H. (2019). Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. arXiv.
- [21] Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale (XLM-R). arXiv.
- [22] Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv.
- [23] Muennighoff, N., et al. (2022/2023). MTEB: Massive Text Embedding Benchmark. arXiv.
- [24] Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs (FAISS). arXiv.
- [25] Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. CVPR.
- [26] Yamagiwa, H., Oyama, M., & Shimodaira, H. (2024). Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings. arXiv.
- [27] Nastase, V., & Merlo, P. (2025). Testing the assumptions about the geometry of sentence embedding spaces: the cosine measure need not apply. arXiv.
- [28] Sasaki, S., et al. (2023). Examining the effect of whitening on static and contextualized word embeddings. Information Processing & Management.