Glossary

A single alphabetical list of every term defined across the book. Each term is defined in full in its home chapter; this page gathers them for quick reference and search.

A

Term	Meaning
A/B test (online eval)	Serve two models to two random halves of live users and compare a business metric; the real verdict.
Activation $\phi$	Nonlinear function applied to $z$ (sigmoid, tanh, ReLU).
Adjacency matrix $A$	The whole graph’s connections as a square grid of 0/1.
Agentic recommender	An LLM that plans, remembers, and calls tools over multi-turn conversation (§2.4).
Alignment	Positive pairs end up close in space.
Alignment & uniformity	Pull positives together / spread embeddings on the sphere (DirectAU); fights popularity bias.
Alignment loss (RLMRec)	Ties LLM semantic vectors to CF embeddings (contrastive -Con or generative -Gen).
ANN	Approximate nearest-neighbour search — fast, slightly-inexact vector retrieval; the candidate-generation engine.
ANN (approximate nearest neighbour)	Fast similarity search that makes dot-product retrieval scale to millions of items.
ANN (approximate nearest-neighbour)	Fast vector search that returns almost the closest vectors to a query, trading a little accuracy for large speed-ups — how embedding retrieval is served (Implementation Choices).
AP / MAP	Average Precision (precision averaged over hit positions) / its mean over users.
Arm	One choosable option (an item); “pulling” it = showing it. From the slot-machine metaphor.
Attention	$\mathrm{softmax}(QK^\top/\sqrt{d_k})V$: each position takes a similarity-weighted blend of all positions’ value vectors.
Attributed / featureless	Whether nodes carry input feature vectors; recsys nodes are featureless (IDs only).
AUC	Probability a random click outscores a random non-click; grades ordering, not calibration.
AUC / ROC	Area Under the ROC Curve = P(score of a random relevant > random irrelevant).
Autoregressive decoding	Generate one token, append it, run the model again, repeat — the GPT/SASRec generation loop.
Auxiliary task	The extra, label-free objective added alongside the main (BPR) loss.
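
The pairwise reading of AUC in the entries above can be checked by brute force. This is a minimal sketch, assuming binary labels (1 = click, 0 = non-click) and counting a tie as half a win, which is a common convention rather than something stated above; the function name `auc` is illustrative.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (click, non-click) pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Rescaling every score by the same positive constant leaves the result unchanged, which is the sense in which AUC grades ordering, not calibration. The double loop is quadratic, so this is for checking intuition on small inputs only.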

B

Term	Meaning
Back-propagation	Computing all weight-gradients by the chain rule, backward through the net.
Bandit feedback	You observe the reward only for the action you took, never the alternatives.
Base-rate effect	A rare condition (low prior) makes even an accurate test’s positives mostly false.
Batch / Layer Norm	Re-center/re-scale activations for training stability (LayerNorm powers the Transformer).
Bayes’ rule	posterior $\propto$ likelihood $\times$ prior.
BCE loss	Pointwise binary cross-entropy on positive vs. sampled-negative pairs.
Beam search (decoding)	Keep the $B$ best partial ID prefixes per step; the completed tuples, ranked, are the top-$K$ recommendations (§2.2).
Benchmark framework	RecBole / Cornac / Elliot / Microsoft Recommenders / LensKit — unified datasets + models + protocol.
Benchmaxxing	Training on (or near) benchmark data so a high score reflects recall, not ability — why one leaderboard number is untrustworthy.
Bernoulli	One yes/no with parameter $p$; $-\log$ → BCE.
BERT	Bidirectional Encoder (masked-LM); an encoder for understanding text → embeddings (Devlin 2019).
BERT4Rec	A bidirectional Transformer trained with the cloze (masked-item) objective.
Beta	Distribution over a probability; conjugate prior for Bernoulli.
Beta posterior	$\text{Beta}(1+s,1+f)$ for $s$ clicks, $f$ non-clicks, starting from a uniform $\text{Beta}(1,1)$ prior; conjugate to the Bernoulli click.
Bipartite graph	Nodes in two groups; edges only cross between groups (users ↔︎ movies).
Bonferroni / Holm	Multiple-comparison fixes for $m$ tests: Bonferroni compares each $p$ to $\alpha/m$; Holm is a uniformly less conservative stepwise version.
BPR	Bayesian Personalized Ranking — pairwise ranking loss; “Bayesian” = its MAP derivation; default for LightGCN.
BPR loss	Pairwise ranking loss: push observed pairs above sampled negatives.
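
The difference between the two corrections in the Bonferroni / Holm entry is easiest to see on a small set of $p$-values. This is a minimal sketch, assuming the standard step-down form of Holm (sort the $p$-values, compare the $j$-th smallest to $\alpha/(m-j+1)$, stop at the first failure); the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject test i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down Holm: thresholds alpha/m, alpha/(m-1), ..., stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


p = [0.01, 0.015, 0.03, 0.2]
print(bonferroni(p))  # only the smallest p-value passes alpha/4
print(holm(p))        # the second passes too, against alpha/3
```

Holm rejects everything Bonferroni rejects and sometimes more, while controlling the same family-wise error rate.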

C

Term	Meaning
Candidate generation → ranking	The production funnel: a cheap recall-stage (MF/$k$NN) then a precise rank-stage.
Categorical	One of $K$ classes; $-\log$ → cross-entropy; paired with softmax.
Causal mask	Lower-triangular mask: position $t$ may attend only to positions $\le t$ (no peeking at the future).
CDF $F(x)$	Cumulative distribution function $F(x)=\Pr(X\le x)$ — the running total of probability; a different function describing the same distribution.
Central Limit Theorem	The average of many noisy pieces is bell-shaped around the truth; basis of the $t$-curve.
Chain rule	Derivative of $f(g(x))$ = $f'(g)\cdot g'$; the basis of back-propagation.
Characteristic equation	$\det(A-\lambda I)=0$; its roots are the eigenvalues.
Chebyshev filter	A numerically stable polynomial filter family (ChebyCF).
Client / server	Devices holding private data / the coordinator that aggregates their updates.
Closed-form / training-free CF	Recommenders that apply a fixed filter with no gradient training (GF-CF, PSGE).
Cloze objective	Mask random items and predict each from both-side context (fill-in-the-blank).
Cold start	No history yet for a new user/item; CF cannot act.
Collaborative filtering (CF)	Recommend using interaction patterns across users (no content).
Collaborative signal	What can be learned from who interacted with what (From Graphs to LightGCN, SSL & Contrastive Learning, The Spectral / Graph-Filter View), independent of item content.
Component / entry	One number inside a vector.
Conditional probability $\Pr(A\mid B)$	Probability of $A$ given $B$ occurred $=\Pr(A\cap B)/\Pr(B)$.
Confidence interval	$\bar d \pm t^{*}s_d/\sqrt n$; the effect size with its uncertainty — significant when it excludes $0$.
Conjugate prior	A prior whose posterior is the same family (Beta↔︎Bernoulli).
Content-based filtering	Recommend items similar (in features) to what the user liked.
Contextual bandit	The reward depends on a context (user/item features); LinUCB models it linearly.
Contrastive learning	SSL pretext: pull a node’s two views together, push other nodes apart; needs negatives; discriminative mechanism.
Convex	Curves up everywhere (bowl-shaped): $f''\ge0$ in one variable, Hessian PSD in many; every local minimum is a global minimum.
Cosine similarity	$\dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}\in[-1,1]$; direction agreement, ignoring length.
Cost / objective / criterion	Near-synonyms for loss (cost = total; objective = neutral; criterion = the code object).
Coverage	Fraction of the catalog ever recommended.
Critical / stationary point	Where $f'(x)=0$: peak, valley, or plateau.
Croissant	MLCommons machine-readable dataset-metadata standard (2024); required for NeurIPS dataset tracks.
Cross network (DCN)	Stacked layers that raise interaction degree by one each, explicitly and with linear cost.
Cross-entropy / BCE / log loss	Classification loss; bits to encode truth $p$ with code for prediction $q$; = −log-likelihood of Bernoulli.
Cross-entropy $H(p,q)$	$-\sum p\log q$; surprise of $p$’s outcomes scored by model $q$ — the classification loss.
Cross-view alignment	Pulling a node’s two embeddings (collaborative $e_i$ and semantic $s_i$) together with InfoNCE.
CTR (click-through rate)	$P(\text{click}\mid \text{user, item, context})$; the score the ranking stage predicts.
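
The cosine-similarity formula above translates directly into code. This is a minimal sketch on plain Python lists; the function name and the choice to raise on a zero vector (where the ratio is undefined) are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """u.v / (|u| |v|): direction agreement in [-1, 1], ignoring length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2], [2, 4]))   # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```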

D

Term	Meaning
Data leakage	Test-period information seeping into training; silently inflates every metric.
DCG / IDCG / NDCG	Discounted Cumulative Gain / its ideal / the normalized ratio in $[0,1]$.
DeepFM	Wide & Deep with the wide arm replaced by an FM, sharing one embedding layer; no hand-crossing.
Degree	How many neighbors a node has.
Degree matrix $D$	Diagonal matrix of degrees; used for normalization.
Degrees of freedom	$n-1$ for the paired test; sets which Student-$t$ curve gives the $p$-value.
Demographic parity	Equal selection / exposure rate across groups.
Derivative $f'(x)$	Instantaneous rate of change = slope of the curve at a point.
Determinant $\det A$	Area/volume-scaling factor ($ad-bc$ for $2\times2$); $\det=0$ ⇒ collapses a dimension / non-invertible.
Difference quotient	$\frac{f(x+h)-f(x)}{h}$ — rise over run before taking $h\to0$.
Differential privacy	Add calibrated noise to updates so no single user’s data can be inferred.
Differentiate	Compute a derivative (limit of differences).
Dimension	How many components a vector has ($\mathbb{R}^d$ = all $d$-vectors).
Directed / undirected	Whether an edge has a direction (A→B $\ne$ B→A) or not.
Directional derivative	Rate of change of $f$ along a unit direction $\hat{\mathbf d}$: $\nabla f\cdot\hat{\mathbf d}$.
Discount $\gamma$	How much a later reward is worth vs. now; $\gamma\to0$ is myopic (a bandit), $\gamma\to1$ far-sighted.
Discriminative vs. generative	Axis 1 — whether a model learns $p(y\mid x)$ (a boundary; can’t generate) or $p(x)$/$p(x,y)$ (the data; can sample). BPR/LightGCN is discriminative; VAEs/LLMs are generative.
Distributional hypothesis	“A word is known by the company it keeps” — similar contexts ⟹ similar vectors (Firth 1957).
Diversity	Average dissimilarity within a recommended list.
Dot / inner / scalar product $\mathbf{u}\cdot\mathbf{v}$	$\sum_i u_iv_i$; one number measuring agreement.
Dot product	Multiply-and-sum two embeddings → a similarity / ranking score.
Dropout	Randomly zero units during training; an ensemble-like regularizer; “edge dropout” on graphs.
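
The DCG / IDCG / NDCG entry can be made concrete with a few lines. This is a minimal sketch, assuming the common $\log_2$ position discount with linear gain (the glossary does not fix a variant) and assuming `relevances` holds the relevance of every judged item in the order the model ranked them; the function names are illustrative.

```python
import math


def dcg(relevances):
    """Sum of rel / log2(position + 1), with positions starting at 1."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_at_k(relevances, k):
    """DCG of the top-k divided by the DCG of the ideal (sorted) top-k."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    if ideal == 0:
        return 0.0  # no relevant item at all: define NDCG as 0
    return dcg(relevances[:k]) / ideal


print(ndcg_at_k([1, 0, 0], k=3))  # relevant item ranked first
print(ndcg_at_k([0, 1, 0], k=3))  # same item one slot lower scores less
```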

E

Term	Meaning
Early stopping	Stop when validation stops improving; implicit regularizer.
Edge	A connection (user watched movie).
Effect size	The magnitude of the difference ($\bar d$); report it alongside $p$ (significance $\ne$ size).
Eigenbasis	A symmetric matrix’s full set of orthogonal eigenvectors.
Eigenvalue $\lambda$	The stretch factor of an eigenvector.
Eigenvector	A direction a matrix only stretches, not rotates: $A\mathbf{v}=\lambda\mathbf{v}$.
Eigenvector $v$	A special direction a matrix only stretches (not rotates): $\tilde{A}v=\lambda v$. The graph’s eigenvectors are its “frequency patterns.”
Elastic net	L1 + L2 combined; “stretchable net” keeping correlated features.
ELBO	Evidence Lower BOund; the VAE (variational autoencoder) objective = reconstruction loss + β·KL.
Embedding	A learned vector representing a user/item as a point in space.
Embedding layer	A learnable lookup table; id → its row (= one-hot × matrix); the input side for discrete data.
Enhancer / reranker	Recsys LLM roles: generate item/user text features / re-order a candidate list — not chat.
Enhancer pipeline	Using an LLM offline to manufacture features for a fast collaborative backbone that serves.
Entropy $H(p)$	$-\sum p\log p$; average surprise / uncertainty / bits to encode (maximal at uniform).
Epoch	One full pass over all observed positives during training.
Epoch / batch	One full pass over the data / the examples used per update.
Equal opportunity	Equal true-positive rate across groups (the truly-relevant are surfaced equally).
ERM	Empirical Risk Minimization — “fit the training data” (usually + a regularizer).
Euclidean distance $\lVert\mathbf{u}-\mathbf{v}\rVert$	Straight-line gap between two arrow-tips; feels magnitude (unlike cosine).
Evidence $p(D)$	Normalizer; data probability averaged over $\theta$ (a.k.a. marginal likelihood).
Expectation $\mathbb{E}[X]$ / mean $\mu$	Probability-weighted average value.
Explicit / implicit feedback	Stated ratings (likes and dislikes) vs. behavioural $0/1$ (positives only — no true negatives).
Explicit vs. implicit feedback	Star ratings vs. clicks/plays (positive-only, unlabeled negatives).
Explore vs. exploit	Try an uncertain option to learn (explore) vs. take the best-known one now (exploit).

F

Term	Meaning
F1@K	Harmonic mean of Precision@K and Recall@K.
Factor vector	A feature’s short learned vector $\mathbf v_i$; the dot product of two is their interaction strength.
Factorization machine (FM)	MF generalized to any features: each feature gets a vector, each feature pair interacts via a dot product (Rendle, 2010). MF is the ids-only special case.
FCF	Federated collaborative filtering: item factors are global; each user’s factor stays on-device.
Feature interaction	A pair of features whose combination matters (user $\times$ genre); FM scores it as $\langle\mathbf v_i,\mathbf v_j\rangle$.
Feature interaction / cross	The conjunction of two categories (`sci-fi ∧ mobile`); where the predictive signal lives.
Feature store	The table of generated `id → (profile, vector)` the serving path reads; the LLM is never on the serving path.
FedAvg	Aggregate by a data-size-weighted average of clients’ local models (McMahan et al., 2017).
Federated learning	Train a shared model across devices that keep their raw data local; only updates are shared.
Feed-forward (FFN)	Per-token MLP $\max(0,xW_1{+}b_1)W_2{+}b_2$ inside each block; transforms each token (most of the parameters).
Feedback loop	Recommend → click → log → retrain on skewed data → more skew; the “rich get richer” spiral that compounds bias.
Filter response $h(\lambda)$	The function deciding how much each frequency is kept; defines the filter.
Focal loss	Cross-entropy times $(1-\hat p)^\gamma$ to down-weight easy examples; for class imbalance.
Forward pass	Running an input through the net to a prediction + loss.
FPMC	Factorizing Personalized Markov Chains: MF term $+$ a learned (factorized) transition term.
Frequency (on a graph)	An eigenvector pattern; low = smooth over neighbors, high = jagged.
Frozen text encoder	Run text once through a fixed pretrained model and keep the output vector — the semantic $s_i$ RecSys aligns/consumes.
Full vs. sampled ranking	Rank the true item against the whole catalog vs. a few sampled negatives (the latter is inconsistent: it can order models differently from the full ranking).

G

Term	Meaning
Gaussian (Normal)	Bell curve $(\mu,\sigma^2)$; $-\log$ → squared error (MSE).
GCL	Graph Contrastive Learning — contrastive SSL on a graph recommender.
GCN	The model built by stacking graph-convolution layers (+ loss, prediction head, and — in vanilla GCN — $W$ and $\sigma$). Operation vs. architecture.
Generative learning	SSL pretext: reconstruct the masked/corrupted input; no negatives; generative mechanism.
Generative recommendation	Recommend by generating an item’s Semantic ID token-by-token (§2.2).
Gini coefficient	Concentration of recommendations on few items ($0$ even … $1$ concentrated).
GPT	Generative Pre-trained Transformer (next-token, causal); a decoder for generating text (Radford 2018).
Gradient $\nabla f$	Vector of all partials; points in the direction of steepest increase.
Gradient descent	$\theta\leftarrow\theta-\eta\nabla f$: step against the gradient to minimize.
Graph	Nodes + edges.
Graph convolution	The operation: replace each node’s embedding with the normalized average of its neighbors’ (one matrix multiply by the normalized adjacency).
Graph Fourier Transform (GFT)	Projecting a graph signal onto the eigenvectors of the (normalized) adjacency/Laplacian.
Graph signal	One value per node (e.g. a user’s interaction vector over items).
Graph Signal Processing (GSP)	Treating one-number-per-node data as a “signal” and processing it with graph frequencies.
Graph-CF trio	Gowalla / Yelp2018 / Amazon-Book — the frozen-split benchmark of graph-CF papers.
GraphMAE	A generative (masked-autoencoder) graph SSL method.
ε-greedy	Exploit the best estimate with prob $1-\varepsilon$; pull a uniform-random arm with prob $\varepsilon$.
GRU	Gated Recurrent Unit: a lighter 2-gate LSTM (Cho 2014); GRU4Rec is its session-based recommender.
GRU4Rec	A gated RNN for session-based recommendation; session-parallel batches + a ranking loss.
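
The gradient-descent update $\theta\leftarrow\theta-\eta\nabla f$ above is a one-line loop. This is a minimal one-variable sketch; the test function $f(\theta)=(\theta-3)^2$, the learning rate, and the step count are all illustrative choices.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# f(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_star)
```

With these settings each step shrinks the distance to the minimum by a factor of $0.8$, so the iterate approaches $3$ geometrically; a learning rate above $1$ would make it diverge on this function.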

H

Term	Meaning
Harness / orchestration	The engineering around the LLM call — templates, validation, batching, retries, caching, cost control.
Hessian $H$	Matrix of second partial derivatives (multivariable curvature).
Hidden layer	A layer that is neither input nor output.
Hinge loss	SVM loss; flat once correct-with-margin, then a linear ramp — shaped like a hinge.
Hit Rate (HR@K)	Fraction of users with $\ge 1$ relevant item in the top-$K$.
Hit-Rate@K / NDCG@K	Was the held-out next item in the top $K$? / how high was it? (Evaluation Metrics).
HNSW	A multi-layer navigable-graph ANN index with ${\sim}\log M$ query cost.
HNSW / IVF-PQ / DiskANN / CAGRA	ANN index types: graph (in-RAM default) / inverted-list + compression (memory-thrifty) / on-disk (billion-scale) / GPU graph (batched throughput).
Homogeneous / heterogeneous	One node type, vs. several (users and items).
Hop	One step along an edge (distance in the graph); fixed by the graph, not chosen. $K$ layers ⟹ reach $K$ hops.
Huber loss	Quadratic near 0, linear far out; robust compromise (after P. Huber).
Hybrid	Combine content-based + collaborative (e.g. LLM features + a CF backbone).

I

Term	Meaning
Idempotency	Re-running the pipeline on unchanged input does no new work (hash → cache hit), so you generate each profile once.
Identity $I$	The “do-nothing” matrix; $I\mathbf{x}=\mathbf{x}$.
Impression	One shown item; the unit of a CTR row, labelled $1$ (clicked) or $0$ (not).
Indefinite	A symmetric matrix with both positive and negative eigenvalues (curves up some ways, down others) — the Hessian at a saddle.
Independence	$\Pr(A\cap B)=\Pr(A)\Pr(B)$; one event tells you nothing about the other.
Inference / serving	Using the trained embeddings to return a user’s top-$K$ list.
Inflection point	Where $f''$ changes sign (the curve switches between bending up and down).
InfoNCE	The standard contrastive loss (softmax over similarities, temperature $\tau$).
InfoNCE / NT-Xent	Contrastive loss; Info = mutual-info bound, NCE = Noise-Contrastive Estimation; temperature $\tau$.
Integral $\int$	Area under a curve; the reverse of differentiation; gives probabilities/expectations.
Interaction matrix $R$	Users-×-movies table of 0/1 (who watched what).
Invertible / singular	Invertible = has an inverse $A^{-1}$ ($\det\neq0$, full rank); singular = no inverse ($\det=0$).
item2vec	word2vec applied to user interaction sequences (“item = word, history = sentence”); a collaborative item embedding.

J

Term	Meaning
Jaccard similarity	$\lvert A\cap B\rvert/\lvert A\cup B\rvert$ on liked-sets; ignores rating values.
Jacobian $J$	Matrix of partials of a vector-valued function (rows=outputs, cols=inputs).
Joint / marginal	Joint $p(A,B)=\Pr(A\cap B)$; marginal $p(A)=\sum_B p(A,B)$ (sum the joint over the other variable).

K

Term	Meaning
@K	Evaluated over the top $K$ recommended positions only.
k-core filter	Iteratively drop users/items with $<k$ interactions until every remaining user and item has at least $k$; stabilizes the data.
$k$-NN / neighbourhood	The $k$ most-similar users/items used for a prediction.
KL divergence	Asymmetric “distance” between distributions (Kullback–Leibler); a divergence, not a metric.
KL divergence $D_{\mathrm{KL}}(p\Vert q)$	$H(p,q)-H(p)\ge0$; extra surprise from using $q$ not $p$; asymmetric (a divergence, not a distance).
KV-cache	Store past tokens’ keys/values so each decode step processes only the new token: $O(L)$ attention work per step instead of $O(L^2)$.
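
The identity $D_{\mathrm{KL}}(p\Vert q)=H(p,q)-H(p)$ above, and its asymmetry, can be checked numerically. This is a minimal sketch in bits (base-2 logs), assuming $q_i>0$ wherever $p_i>0$; the function names are illustrative.

```python
import math


def entropy(p):
    """H(p) = -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p log2 q; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra bits paid for coding p's outcomes with q's code."""
    return cross_entropy(p, q) - entropy(p)


p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))  # differs from the next line: KL is asymmetric
print(kl_divergence(q, p))
```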

L

Term	Meaning
L1 / lasso	$\sum\lvert\theta_k\rvert$ penalty; drives some weights exactly to 0 (sparsity); = Laplace prior.
L2 / ridge / Tikhonov / weight decay	$\sum\theta_k^2$ penalty; shrinks weights smoothly toward 0; = Gaussian prior.
L2 / weight decay	Regularizer $\lambda\lVert E^{(0)}\rVert^2$; LightGCN regularizes only the base embeddings.
$\lambda$ (reg. strength)	How much we weight the regularizer vs. the loss.
LambdaRank / LambdaMART	Listwise loss whose gradient is weighted by each pair’s effect on NDCG.
Laplace	Peaked/heavy-tailed $(\mu,b)$; $-\log$ → absolute error (MAE).
Latent factor	A hidden, learned dimension of taste/content.
Law of Large Numbers	The sample average converges to the expectation as samples accumulate.
Law of total probability	$p(A)=\sum_B p(A\mid B)p(B)$; the engine of Bayes’ evidence term.
Layer	One application of the graph-convolution operation (a computation step in the model); a hyperparameter $K$ you choose.
Layer / width / depth	A bank of neurons / neurons-per-layer / number of layers.
Layer combination	Averaging embeddings from all layers; LightGCN’s fix for over-smoothing.
Learning rate $\eta$	Step size in gradient descent.
Leave-one-last split	Hold out each user’s chronologically last interaction for test (temporal, leak-free).
Likelihood / prior / posterior	$p(\text{data}\mid\theta)$ / $p(\theta)$ / $p(\theta\mid\text{data})$.
Likelihood $L(\theta)$	$p(\text{data}\mid\theta)$ read as a function of $\theta$ (data fixed).
Limit	The value a quantity approaches (here as the run $h\to0$).
LinUCB	Ridge-regression reward estimate $\boldsymbol\theta^{\top}\mathbf x$ plus a feature-space confidence bonus.
LLM	A large pretrained Transformer language model; here, demystified as this note’s lineage at scale.
LLM-as-enhancer	The LLM produces semantic features/profiles offline that feed/align with a classical model (§2.3).
LLM-as-recommender	The LLM directly ranks/selects items from a prompt of the user’s history (§2.1).
LLM-as-reranker	The most-deployed sub-pattern: the LLM (often a cross-encoder) re-scores a short top-$K$ shortlist a cheap retriever already fetched, rather than ranking the whole catalogue (§2.1).
Local / global minimum	Local = lowest within a neighbourhood; global = lowest anywhere.
Log-likelihood	$\log L(\theta)$; products become sums.
Logistic regression	Linear score $\to$ sigmoid $\to$ probability; the CTR baseline and every model’s output head.
Logit (log-odds)	$\ln\!\big(p/(1-p)\big)$; maps $(0,1)\to(-\infty,\infty)$.
LogLoss	Binary cross-entropy; grades how calibrated the predicted probabilities are (lower better).
Long tail	The many low-degree (niche) items; under-trained and under-served.
Long tail / popularity bias	A few blockbusters get most interactions; metrics can be gamed by pushing them.
Loss function	A single number measuring how wrong the model is; training minimizes it.
Low-pass filter	Keep low frequencies (smooth signal), suppress high (noise).
LSTM	Long Short-Term Memory: an RNN with a cell state + forget/input/output gates; the additive cell line stops gradients vanishing.

M

Term	Meaning
MAE / absolute loss	Mean of absolute errors; from Laplace noise; outlier-robust; targets the median.
MAE / MSE / RMSE	Mean Absolute / Mean Squared / Root-Mean-Squared rating error.
MAP	Maximum A Posteriori; peak of the posterior $=$ MLE $+$ prior $=$ loss $+$ regularizer.
MAP estimate	Maximize posterior ∝ likelihood × prior; ⟹ minimize (−log-lik) + (−log-prior) = loss + regularizer.
Markov chain (first-order)	Next item depends only on the last one; transitions estimated by counting.
Matrix	A rectangular grid of numbers; also a linear transformation of vectors.
Matrix Factorization (MF)	$\hat r_{ui}=p_u^\top q_i$; trained with regularized MSE.
Matrix–vector product $A\mathbf{x}$	Dot each row of $A$ with $\mathbf{x}$; “$n$ in, $m$ out.”
Matryoshka (MRL)	Truncatable embeddings — a shorter prefix of the vector still works, so you index at a smaller dim to save memory/latency.
MDP	Markov Decision Process: state, action, reward, transition, policy — the formalism of RL.
Memory-based CF	Predict from similar rows/columns of $R$ at query time ($k$-NN).
Mini-batch	A small group of triples updated together (averaged gradient); also the source of in-batch negatives.
MIPS (Maximum Inner-Product Search)	Finding the item vectors with the largest dot product against a query — what top-$K$ scoring is.
MLE	Maximum Likelihood Estimate; $\theta$ that best explains the data.
MLP	Multi-layer perceptron — a feedforward stack of fully-connected layers.
MNAR data	Missing-Not-At-Random: whether an interaction is observed depends on what the system chose to show — recommender logs are the textbook case.
Model-based CF	Learn a compact model first (e.g. matrix factorization).
MoE (mixture-of-experts)	A model that holds many parameters but activates only a few per token — inference cheaper than total size.
Momentum / Adam	SGD upgrades: a velocity term / per-parameter adaptive step (Adam = the default optimizer).
MRR	Mean Reciprocal Rank — $1/$rank of the first relevant item, averaged.
MSE / squared loss	Mean of squared errors; from Gaussian noise; outlier-sensitive.
MTEB / ann-benchmarks / LMArena / CoNLL	The standard leaderboards for embeddings / vector indexes / chat-LLMs / NER. Priors, not verdicts.
Multi-head	Several attentions in parallel with different learned projections, then concatenated.
Mutual-information maximization	The framing RLMRec uses for aligning the collaborative and semantic views.
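
The MRR entry above can be sketched in a few lines. This is a minimal version, assuming each user has one ranked list and one set of relevant items, and that a user with no relevant item in the list contributes $0$ (a common convention, not stated above); the function name is illustrative.

```python
def mrr(ranked_lists, relevant_sets):
    """Mean over users of 1 / rank of the first relevant item (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_lists)


# First hits at rank 2, rank 1, and never: (1/2 + 1 + 0) / 3.
print(mrr([["a", "b", "c"], ["x", "y", "z"], ["p", "q"]],
          [{"b"}, {"x"}, {"r"}]))
```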

N

Term	Meaning
Nabla / del ($\nabla$)	The symbol for the gradient operator.
Negative sampling	Drawing un-interacted items to act as negatives.
Neuron / unit	A weighted sum of inputs + bias, passed through an activation.
Next-item prediction	The core task: given $(i_1,\dots,i_{t-1})$, score every item for being $i_t$.
NLL	Negative log-likelihood $=-\log L$; minimizing it = MLE; this is a loss.
Node	A thing (a user, a movie).
Non-IID data	Each client’s data is small and unrepresentative of the whole — the core difficulty of federation.
Norm / length $\lVert\mathbf{u}\rVert$	The arrow’s length; default L2 $\sqrt{\sum_i u_i^2}$. L1 $=\sum_i\lvert u_i\rvert$ (used in lasso).
Novelty	Non-obviousness, often $-\log_2(\text{popularity})$.
Null hypothesis $H_0$	The skeptical default “no real difference”; a test tries to disprove it.

O

Term	Meaning
Odds	$p/(1-p)$.
Off-policy	Learning from data collected by a different (older) policy than the one being trained.
Offline replay	Unbiased evaluation of a bandit policy on a log of randomly-served actions (Li et al. 2011).
One- vs two-sided test	Two-sided asks “A $\ne$ B” (the default); one-sided asks “A $>$ B” and halves $p$ — valid only if the direction was pre-registered.
One-hot vector	A length-$V$ vector, $1$ in one slot, $0$ elsewhere; encodes identity but no similarity (all pairs orthogonal).
Open weights / open source / open data	Released weights only / weights + permissive code license / + training data too. They are different and often confused.
Orthogonal	Perpendicular; dot product $0$.
Orthonormal basis	A full set of axes that are mutually orthogonal and each unit length; a coordinate is then just a dot product ($U,V$ in §6).
Outer product $\mathbf{u}\mathbf{v}^{\top}$	A column times a row = a whole rank-1 matrix.
Over-smoothing	Too many layers make all embeddings collapse to one blurry point.
Overfitting	Memorizing training noise; great on train, poor on new data.

P

Term	Meaning
$p$-value	Probability of a gap at least this large if $H_0$ were true; small = significant. NOT $\Pr(H_0\text{ true})$.
Paired $t$-test	Tests whether the mean of per-pair differences $\bar d$ differs from 0: $t=\bar d/(s_d/\sqrt n)$.
Partial derivative $\partial f/\partial x$	Derivative wrt one variable, others held fixed.
Pearson correlation	A similarity measure (centred cosine) for ratings — removes each user’s rating bias.
PMF / PDF	Probability mass (discrete) / density (continuous) function; sums/integrates to 1.
Pointwise / pairwise / listwise	Score each pair / a pair’s order / the whole list.
Policy $\pi(a\mid s)$	The learned rule mapping a state to an action; what RL optimizes.
Polynomial filter	A filter that is a polynomial in $\tilde{A}$ (what LightGCN layers compute).
Popularity bias	Tendency to over-recommend high-degree (blockbuster) items (§13).
Popularity bias (in eval)	Popular items dominate the test set, so pushing them inflates accuracy metrics.
Position bias	Higher-ranked slots draw more clicks regardless of true relevance.
Positional embedding	A learned per-position vector added to item embeddings so attention can see order.
Positional encoding	A per-position signal added to embeddings so order-blind attention can see order (sinusoidal or learned).
Positive / negative pair	Positive = two views of the same node; negative = views of different nodes.
Positive semidefinite (PSD)	A symmetric matrix with $\mathbf x^{\!\top}\!H\mathbf x\ge0$ for all $\mathbf x$ (all eigenvalues $\ge0$); the multivariable “$\ge0$” condition on the Hessian that makes a function convex.
Positive-semidefinite (PSD)	A symmetric matrix with all eigenvalues $\ge0$ (equivalently $\mathbf{x}^{\top}A\mathbf{x}\ge0$ always); every $M^{\top}M$ is PSD since $\mathbf{x}^{\top}M^{\top}M\mathbf{x}=\lVert M\mathbf{x}\rVert^{2}$.
Posterior $p(\theta\mid D)$	Updated belief after data.
Pre-activation $z$ (logit)	The raw score $\mathbf{w}\cdot\mathbf{x}+b$ before the activation.
Precision@K	Of the $K$ shown, the fraction relevant.
Predictive learning	SSL pretext: predict a property derived from the data (e.g. a masked item/attribute); one correct answer, no negatives; discriminative mechanism.
Pretraining	Train a big Transformer on a huge corpus with a self-supervised objective, then reuse it.
Prior $p(\theta)$	Belief about $\theta$ before data.
Profile	A short LLM-written description of a user’s taste or an item’s character, used as input to an embedder.
Projection $\mathrm{proj}_{\mathbf{v}}\mathbf{u}$	The shadow of $\mathbf{u}$ on $\mathbf{v}$’s line, $\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{v}\rVert^{2}}\mathbf{v}$; in an orthonormal basis a coordinate is just a dot product.
Propensity / IPS	Propensity = probability an item was shown; IPS re-weights each interaction by $1/$propensity to undo exposure bias.
Provider vs. consumer fairness	Comparable exposure across item creators (provider) vs. comparable quality across user groups (consumer).
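
The paired $t$-test statistic $t=\bar d/(s_d/\sqrt n)$ above can be computed with the standard library alone. This is a minimal sketch: it returns the statistic and the degrees of freedom $n-1$ but not the $p$-value, which needs the Student-$t$ CDF from a statistics package. The function name and the example scores are illustrative, not results from the book.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]  # per-pair differences
    n = len(d)
    s_d = stdev(d)  # sample standard deviation (divides by n - 1)
    if s_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(d) / (s_d / math.sqrt(n)), n - 1


# Made-up scores of two models over the same five seeds.
t, dof = paired_t_statistic([52, 55, 51, 54, 53], [50, 52, 50, 51, 52])
print(t, dof)
```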

Q

Term	Meaning
Quantization	Storing weights/vectors in fewer bits (4/8-bit, int8/binary) to shrink memory and speed inference, at a small quality cost.
Query / Key / Value	Soft database lookup: a query is matched (dot product) against keys to retrieve a blend of values.
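
The soft lookup in the Query / Key / Value entry is the formula $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$ from the Attention entry. This is a minimal single-head sketch on plain Python lists, with no masking and no learned projections; the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """For each query row, blend the value rows by softmax(q.k / sqrt(d_k))."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        blended = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        out.append(blended)
    return out


# Two identical keys get equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]]))
```

When one key matches the query far better than the others, its weight approaches $1$ and the output approaches that key's value, which is the "database lookup" limit of the soft blend.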

R

Term	Meaning
Random split	Hold out a random fraction of interactions — leaks the future for time-ordered data.
Random variable	A quantity whose value is uncertain (coin, die, temperature).
Rank	The matrix’s number of independent directions: independent rows/columns (§2) = nonzero singular values (§6).
Rank / latent dimension $d$	Width of the factor vectors; the dial between under- and over-fitting.
Ranking	The expensive, high-precision second stage that reorders the shortlist with a feature-rich model.
Recall vs QPS	The ANN trade-off: fraction of true neighbours found vs queries-per-second. Always compare at a fixed recall.
Recall@K	Of all relevant items, the fraction in the top-$K$.
Regret	$\sum_t(\mu^{*}-\mu_{a_t})$: reward lost versus always pulling the best arm.
Regularization	Making an ill-posed solution “regular” (well-behaved); from Hadamard/Tikhonov.
Regularizer	A term/procedure that discourages complex solutions to improve generalization.
Relevant item	A ground-truth item the user actually likes.
ReLU	$\max(0,z)$ — Rectified Linear Unit; the default activation.
Representation learning	The network learning useful features by itself.
Residual connection	A sub-layer computes $x+f(x)$; the $+1$ in its derivative is a gradient highway for deep stacks.
Retrieval / candidate generation	The cheap, high-recall first stage that cuts millions of items to a few hundred.
Return $G$	Cumulative discounted reward $\sum_t \gamma^t r_t$ — the long-term objective.
Risk / empirical risk	Expected loss over the true distribution / its average on the training sample.
RLMRec	An enhancer that aligns an LLM-written semantic embedding with a CF embedding via contrastive/generative alignment (§3).
RLMRec-Con / -Gen	RLMRec’s contrastive / generative variants for aligning LLM semantics with GNN embeddings.
RMSE / Recall@K / NDCG@K	Rating-error / top-$K$ ranking-quality metrics.
RNN	Recurrent Neural Network: one shared cell that updates a hidden-state “memory” $h_t=\tanh(Wh_{t-1}+Ux_t+b)$ along a sequence.
RNN / hidden state $\mathbf h_t$	A net that folds each item into a running fixed-length summary of the session.
RQ-VAE	The residual-quantized variational autoencoder that learns the Semantic-ID codebooks by reconstructing the content embedding (§2.2).
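
Recall@K above and Precision@K (under P) share the same hit count and differ only in the denominator. This is a minimal sketch for a single user; the function name and the choice to return recall $0$ when the user has no relevant items are assumptions made for illustration.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (Precision@K, Recall@K) for one user's ranked list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k  # of the K shown, the fraction relevant
    recall = hits / len(relevant) if relevant else 0.0  # of all relevant, the fraction shown
    return precision, recall


# One of the top-2 items is relevant; the user has three relevant items in total.
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "d", "e"}, k=2))
```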

S

Term	Meaning
Saddle point	Critical point that is a min in some directions, a max in others (indefinite Hessian).
Sample space / event	The set of possible outcomes / a subset of them.
Sampled softmax (SSM)	Listwise cross-entropy over a positive + many sampled negatives; structurally = InfoNCE.
SASRec	Self-Attentive Sequential Recommendation: a causal Transformer over the item sequence.
SASRec / BERT4Rec	Sequential recommenders: a causal (left-to-right) / bidirectional (masked) Transformer over the item history.
Second derivative $f''$	Rate of change of the slope = curvature.
Secure aggregation	The server sees only the sum of client updates, never any individual one.
Selection / exposure bias	Only items the system showed can get feedback; un-shown items look “disliked” by silence.
Self-attention	Attention of a sequence to itself (Q, K, V all from the same tokens).
Self-loop	An artificial edge from a node to itself; present in GCN, absent in LightGCN.
Semantic embedding	A fixed-length vector encoding the meaning of the profile text (from a frozen sentence encoder), distinct from the CF latent space.
Semantic ID	A short sequence of content-derived codewords identifying an item (quantized text embedding).
Sequential recommendation	Predict the next item from a user’s ordered history.
Serendipity	Relevant and surprising.
Session-based recommendation	Sequential recommendation, but from a short, often anonymous current session (no user ID).
SGD	Stochastic Gradient Descent — gradient steps on random mini-batches.
SGL / SimGCL / XSimGCL / LightGCL	GCL methods differing in how the two views are built (edge-drop / noise / cross-layer noise / SVD).
Shape $m\times n$	$m$ rows by $n$ columns.
$\sigma$ (nonlinearity)	Activation (e.g. ReLU) in GCN/NGCF; removed in LightGCN.
Sigmoid $\sigma(z)$	$1/(1+e^{-z})$; inverse of the logit; score → probability.
Significance (paired $t$-test)	Is a gain bigger than run-to-run seed noise? Report mean $\pm$ std and a $p$-value.
Significance level $\alpha$	The pre-set bar (usually $0.05$); reject $H_0$ if $p<\alpha$.
Singular value $\sigma$	The strength of an SVD pattern; $\sigma\ge0$, biggest first.
Slate	A whole page/list of recommended items — the (combinatorial) action in slate RL.
SlateQ	Decomposes a slate’s value into per-item $Q$-values, making value-based slate RL tractable.
SLM	Small language model ($\approx 0.5$–9B) — cheap, high-throughput; the recsys feature-generation workhorse.
Smooth	Has a derivative everywhere — no kinks or jumps; looks straight up close.
Softmax	Multi-class sigmoid; scores → a probability distribution.
Softmax output	Multi-class output head: $K$ scores → a probability distribution, paired with cross-entropy.
Sparse / one-hot feature	A category encoded as all-zeros with a single $1$; CTR rows stack millions of these.
Sparsity	Most of $R$ is unknown (real matrices ${\sim}99.5\%$ empty); the core difficulty.
SSL (self-supervised learning)	Training on an auxiliary task that needs no labels; supervision is manufactured from the data’s own structure.
Standard error	$s_d/\sqrt n$ — how much an estimate wobbles across repeats; shrinks with more data.
State $s$	A summary of the current situation; the ingredient bandits lack (actions change it).
Static vs contextual	word2vec gives one fixed vector per word; a Transformer gives a different vector per occurrence (context-dependent).
Stochastic gradient descent (SGD)	Gradient descent on a noisy gradient estimated from a random minibatch, not the full dataset.
Structured output	Forcing an LLM to emit schema-valid JSON via constrained decoding — valid shape, not guaranteed-correct values.
Structured output / JSON mode	Forcing the LLM to emit schema-valid JSON, so the result is reliably parseable.
Subgradient	A stand-in slope at a kink where the true derivative is undefined (e.g. any value in $[0,1]$ for ReLU at $0$).
Subword tokenization (BPE)	Split text into reusable sub-word pieces by greedily merging the most frequent adjacent pair; fixed vocabulary, no `<UNK>` (WordPiece/SentencePiece are cousins).
Surprise / self-information	$-\log p(x)$; how surprising an outcome is ($0$ if certain, $\to\infty$ as $p(x)\to 0$).
SVD ($M=U\Sigma V^\top$)	Factorizes a (rectangular) matrix into patterns $U,V$ and strengths $\Sigma$; keeping the top-$q$ = an ideal low-pass (PSGE, LightGCL).
SVD $M=U\Sigma V^{\top}$	Factor any matrix into orthonormal patterns ($U,V$) scaled by singular values ($\Sigma$).
Symmetric	$A=A^{\top}$; real eigenvalues and orthogonal eigenvectors.
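
Several entries in this table (Sigmoid, Softmax, Surprise) are one-line formulas. A small sketch using only the standard library is given below; the function names are hypothetical, and the max-subtraction in `softmax` and the two-branch `sigmoid` are common numerical-stability choices assumed here rather than anything the definitions require.

```python
import math


def sigmoid(z):
    """1 / (1 + e^{-z}); maps a score to a probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow of exp(-z) for very negative z
    return e / (1.0 + e)


def logit(p):
    """Inverse of the sigmoid: log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def softmax(scores):
    """Turns K scores into a probability distribution over K classes."""
    m = max(scores)  # subtracting the max leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def surprise(p):
    """Self-information -log p(x), in nats; requires p > 0."""
    return -math.log(p)


if __name__ == "__main__":
    print(sigmoid(0.0))
    print(logit(sigmoid(1.3)))
    print(softmax([1.0, 2.0, 3.0]))
    print(surprise(0.5))
```

Applying `logit` to the output of `sigmoid` recovers the original score up to floating-point error, which is what "inverse of the logit" means in the Sigmoid entry.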

T

Term	Meaning
Taylor / linear approximation	$f(x+\delta)\approx f(x)+f'(x)\delta$; a curve looks straight up close.
Temperature $\tau$	Scales the contrast sharpness in InfoNCE; a sensitive hyperparameter.
Temporal / leave-one-last split	Hold out each user’s chronologically last interaction; the honest, leak-free default.
TF-IDF	A way to turn text into a feature vector (term frequency × inverse document frequency).
The recsys leak	Sending an update for item $i$ reveals the user touched $i$; FedRec masks this with decoy items.
Thompson sampling	Keep a posterior per arm; sample one value from each and pull the largest sample.
Top-$K$	The $K$ highest-scoring unseen items returned to the user.
Trace	Sum of a square matrix’s diagonal entries; equals the sum of its eigenvalues.
Train/test split	How held-out test data is chosen: random, leave-one-out (LOO), or temporal (by timestamp).
Transformer	A stack of (self-attention + add&norm + feed-forward + add&norm) blocks; no recurrence, fully parallel (Vaswani et al., 2017).
Transition probability	$P(\text{next}=j\mid\text{last}=i)$ — count of $i\!\to\!j$ over count of $i$-as-from.
Transpose $A^{\top}$	The matrix mirrored across its diagonal (rows ↔︎ columns).
Triple $(u,i^+,i^-)$	A training example for pairwise ranking: user $u$ prefers positive $i^+$ over sampled negative $i^-$.
Truncated SVD	Keep the top-$q$ singular components = best low-rank approximation (least-squares / Frobenius sense).
Two-stage funnel	Retrieval → ranking: fast-and-forgiving then slow-and-sharp, so a rich model can run on only a few hundred candidates.
Two-tower LLM-embedding retrieval	A hybrid: embed users and items with an LLM-grade encoder offline, then retrieve by ANN over those vectors at serve time — enhancer-class cost, LLM semantics on the retrieval path (§4).
Type I / II error	False positive (reject true $H_0$) / false negative (miss a real effect).
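
The Transition probability entry above describes a counting estimate, which the sketch below spells out. It assumes sessions are given as ordered lists of item IDs; the function name `transition_probabilities` and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next = j | last = i) as count(i -> j) / count(i as 'from')."""
    pair_counts = defaultdict(Counter)
    for session in sessions:
        # Walk over consecutive (previous, next) pairs in each session.
        for prev_item, next_item in zip(session, session[1:]):
            pair_counts[prev_item][next_item] += 1

    probs = {}
    for prev_item, next_counts in pair_counts.items():
        total = sum(next_counts.values())  # times prev_item appears as 'from'
        probs[prev_item] = {j: c / total for j, c in next_counts.items()}
    return probs


if __name__ == "__main__":
    toy_sessions = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
    for item, row in transition_probabilities(toy_sessions).items():
        print(item, row)
```

Items that never appear as a "from" item (here `"d"`) get no row; how to score them is a design choice this sketch leaves open.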

U

Term	Meaning
UCB	Pull $\arg\max_i\,[\hat\mu_i + \sqrt{2\ln t / n_i}\,]$: highest upper bound — optimism under uncertainty.
Uniform / popularity-weighted / in-batch / hard negatives	Four sampling distributions for $i^-$: equal-probability; $\propto\text{pop}^{0.75}$; reuse the batch’s other positives; pick high-scoring (confusing) items.
Uniformity	Embeddings spread evenly over the sphere; fights popularity bias / collapse.
Unit vector	A vector of length $1$; “normalizing” = dividing a vector by its own length.
Universal approximation	A wide enough net can approximate any continuous function on a bounded (compact) domain to any desired accuracy.
User–item matrix $R$	Rows = users, cols = items, entries = ratings/clicks (mostly unknown).
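
The UCB rule above can be sketched as a small selection function. This is an illustrative reading of the formula, not a full bandit implementation: the name `ucb1_select` is hypothetical, and pulling every untried arm once before applying the formula is an assumption made here so that $n_i>0$.

```python
import math


def ucb1_select(counts, means, t):
    """Return the arm index maximizing mean_i + sqrt(2 ln t / n_i).

    counts[i] is n_i (pulls of arm i so far), means[i] is the empirical
    mean reward of arm i, and t is the current round (t >= 1).
    """
    # Assumption: try each arm once first, so the bonus term is defined.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        means[i] + math.sqrt(2.0 * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Arm 1 has a lower mean but only one pull, so its bonus is large.
    print(ucb1_select(counts=[10, 1], means=[0.6, 0.5], t=11))
```

With equal pull counts the bonus terms are equal and the arm with the higher empirical mean wins; a rarely pulled arm can win on its uncertainty bonus alone, which is the "optimism under uncertainty" in the definition.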

V

Term	Meaning
Value / $Q$-value	Expected return from a state ($V$) or a state–action pair ($Q$).
Vanishing gradient	Back-prop through many steps multiplies many small factors → gradient decays → long-range dependencies unlearnable.
Variance $\sigma^2$	Expected squared distance from the mean; spread.
Vector	An ordered list of numbers; equivalently, an arrow from the origin to a point.
View	One augmented version of a node’s representation; contrastive learning needs two per node.

W

Term	Meaning
$W$ (weight matrix)	Learnable feature transform in GCN/NGCF; removed in LightGCN.
Weight / bias	Learned multiplier per input / learned constant offset.
Weighted / unweighted	Whether edges carry a number (rating/count) or are just 0/1.
Wide & Deep	Joint linear-with-hand-crosses (memorize) $+$ MLP-over-embeddings (generalize) model.
Wilcoxon signed-rank	Non-parametric paired test (ranks, not values); use when differences aren’t normal.
word2vec	Predict-based word embeddings trained so a word predicts its context (skip-gram) or the context predicts the word (CBOW); Mikolov et al., 2013.
WRMF / ALS	Weighted Regularized MF / Alternating Least Squares (implicit-feedback, squared error).

X

Term	Meaning
xDeepFM / AutoInt / DIN	Vector-wise crosses (CIN) / attention-learned interactions / per-candidate behavior attention.

Z

Term	Meaning
Zero-shot NER	Extracting arbitrary entity types with no task-specific training (GLiNER, NuNER).