Glossary

A single alphabetical list of every term defined across the book. Each term is defined in full in its home chapter; this page gathers them for quick reference and search.

A

Term Meaning
A/B test (online eval) Serve two models to two randomly split halves of live users and compare a business metric; the real verdict.
Activation \(\phi\) Nonlinear function applied to \(z\) (sigmoid, tanh, ReLU).
Adjacency matrix \(A\) The whole graph’s connections as a square grid of 0/1.
Agentic recommender An LLM that plans, remembers, and calls tools over multi-turn conversation (§2.4).
Alignment Positive pairs end up close in space.
Alignment & uniformity Pull positives together / spread embeddings on the sphere (DirectAU); fights popularity bias.
Alignment loss (RLMRec) Ties LLM semantic vectors to CF embeddings (contrastive -Con or generative -Gen).
ANN Approximate nearest-neighbour search — fast, slightly-inexact vector retrieval; the candidate-generation engine.
ANN (approximate nearest neighbour) Fast similarity search that makes dot-product retrieval scale to millions of items.
ANN (approximate nearest-neighbour) Fast vector search that returns almost the closest vectors to a query, trading a little accuracy for large speed-ups — how embedding retrieval is served (Implementation Choices).
AP / MAP Average Precision (precision averaged over hit positions) / its mean over users.
Arm One choosable option (an item); “pulling” it = showing it. From the slot-machine metaphor.
Attention \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\): each position takes a similarity-weighted blend of all positions’ value vectors.
Attributed / featureless Whether nodes carry input feature vectors; recsys nodes are featureless (IDs only).
AUC Probability a random click outscores a random non-click; grades ordering, not calibration.
AUC / ROC Area Under the ROC Curve = P(score of a random relevant > random irrelevant).
Autoregressive decoding Generate one token, append it, run the model again, repeat — the GPT/SASRec generation loop.
Auxiliary task The extra, label-free objective added alongside the main (BPR) loss.
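
The pairwise reading of AUC above can be checked by brute force. The following is a minimal sketch, not an efficient implementation; the function name `auc` and the example scores are hypothetical, and counting a tie as half a win is an assumed convention.

```python
from itertools import product

def auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; a tie counts as half."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for clicked and non-clicked impressions.
clicked = [0.9, 0.7, 0.4]
not_clicked = [0.8, 0.3, 0.2, 0.1]

print(auc(clicked, not_clicked))  # 10 of the 12 pairs are ordered correctly

# Shrinking every score by the same factor keeps the ordering, so AUC is unchanged:
# it grades ordering, not calibration.
shrunk_clicked = [s / 10 for s in clicked]
shrunk_not_clicked = [s / 10 for s in not_clicked]
print(auc(shrunk_clicked, shrunk_not_clicked))
```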

B

Term Meaning
Back-propagation Computing all weight-gradients by the chain rule, backward through the net.
Bandit feedback You observe the reward only for the action you took, never the alternatives.
Base-rate effect A rare condition (low prior) makes even an accurate test’s positives mostly false.
Batch / Layer Norm Re-center/re-scale activations for training stability (LayerNorm powers the Transformer).
Bayes’ rule posterior \(\propto\) likelihood \(\times\) prior.
BCE loss Pointwise binary cross-entropy on positive vs. sampled-negative pairs.
Beam search (decoding) Keep the \(B\) best partial ID prefixes per step; the completed tuples, ranked, are the top-\(K\) recommendations (§2.2).
Benchmark framework RecBole / Cornac / Elliot / Microsoft Recommenders / LensKit — unified datasets + models + protocol.
Benchmaxxing Training on (or near) benchmark data so a high score reflects recall, not ability — why one leaderboard number is untrustworthy.
Bernoulli One yes/no with parameter \(p\); \(-\log\) → BCE.
BERT Bidirectional Encoder (masked-LM); an encoder for understanding text → embeddings (Devlin 2019).
BERT4Rec A bidirectional Transformer trained with the cloze (masked-item) objective.
Beta Distribution over a probability; conjugate prior for Bernoulli.
Beta posterior \(\text{Beta}(1+s,1+f)\) for \(s\) clicks, \(f\) non-clicks; conjugate to the Bernoulli click.
Bipartite graph Nodes in two groups; edges only cross between groups (users ↔︎ movies).
Bonferroni / Holm Multiple-comparison fixes for \(m\) tests: Bonferroni compares each \(p\) to \(\alpha/m\); Holm is a uniformly less conservative stepwise version.
BPR Bayesian Personalized Ranking — pairwise ranking loss; “Bayesian” = its MAP derivation; default for LightGCN.
BPR loss Pairwise ranking loss: push observed pairs above sampled negatives.
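
The Beta posterior entry above amounts to adding counts to a uniform Beta(1, 1) prior. A small sketch follows, with hypothetical helper names `beta_posterior` and `posterior_mean`; the click counts are made up for illustration.

```python
import random

def beta_posterior(clicks, non_clicks):
    """Parameters (a, b) of Beta(1 + s, 1 + f), starting from a uniform Beta(1, 1) prior."""
    return 1 + clicks, 1 + non_clicks

def posterior_mean(a, b):
    """Mean of Beta(a, b): a point estimate of the click probability."""
    return a / (a + b)

# Hypothetical arm: 3 clicks and 7 non-clicks observed so far.
a, b = beta_posterior(clicks=3, non_clicks=7)
print(a, b)                  # 4 8
print(posterior_mean(a, b))  # 4 / 12

# One random draw from the posterior (a plausible click rate given the data).
rng = random.Random(0)
draw = rng.betavariate(a, b)
print(draw)
```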

C

Term Meaning
Candidate generation → ranking The production funnel: a cheap recall-stage (MF/\(k\)NN) then a precise rank-stage.
Categorical One of \(K\) classes; \(-\log\) → cross-entropy; paired with softmax.
Causal mask Lower-triangular mask: position \(t\) may attend only to positions \(\le t\) (no peeking at the future).
CDF \(F(x)\) Cumulative distribution function \(F(x)=\Pr(X\le x)\) — the running total of probability; a different function describing the same distribution.
Central Limit Theorem The average of many noisy pieces is bell-shaped around the truth; basis of the \(t\)-curve.
Chain rule Derivative of \(f(g(x))\) = \(f'(g)\cdot g'\); the basis of back-propagation.
Characteristic equation \(\det(A-\lambda I)=0\); its roots are the eigenvalues.
Chebyshev filter A numerically stable polynomial filter family (ChebyCF).
Client / server Devices holding private data / the coordinator that aggregates their updates.
Closed-form / training-free CF Recommenders that apply a fixed filter with no gradient training (GF-CF, PSGE).
Cloze objective Mask random items and predict each from both-side context (fill-in-the-blank).
Cold start No history yet for a new user/item; CF cannot act.
Collaborative filtering (CF) Recommend using interaction patterns across users (no content).
Collaborative signal What can be learned from who interacted with what (From Graphs to LightGCN, SSL & Contrastive Learning, The Spectral / Graph-Filter View), independent of item content.
Component / entry One number inside a vector.
Conditional probability \(\Pr(A\mid B)\) Probability of \(A\) given \(B\) occurred \(=\Pr(A\cap B)/\Pr(B)\).
Confidence interval \(\bar d \pm t^{*}s_d/\sqrt n\); the effect size with its uncertainty — significant when it excludes \(0\).
Conjugate prior A prior whose posterior is the same family (Beta↔︎Bernoulli).
Content-based filtering Recommend items similar (in features) to what the user liked.
Contextual bandit The reward depends on a context (user/item features); LinUCB models it linearly.
Contrastive learning SSL pretext: pull a node’s two views together, push other nodes apart; needs negatives; discriminative mechanism.
Convex Curves up everywhere; one variable \(f''\ge0\), many variables Hessian PSD; bowl-shaped; any local minimum is a global minimum.
Cosine similarity \(\dfrac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}\in[-1,1]\); direction agreement, ignoring length.
Cost / objective / criterion Near-synonyms for loss (cost = total; objective = neutral; criterion = the code object).
Coverage Fraction of the catalog ever recommended.
Critical / stationary point Where \(f'(x)=0\): peak, valley, or plateau.
Croissant MLCommons machine-readable dataset-metadata standard (2024); required for NeurIPS dataset tracks.
Cross network (DCN) Stacked layers that raise interaction degree by one each, explicitly and with linear cost.
Cross-entropy / BCE / log loss Classification loss; bits to encode truth \(p\) with code for prediction \(q\); = −log-likelihood of Bernoulli.
Cross-entropy \(H(p,q)\) \(-\sum p\log q\); surprise of \(p\)’s outcomes scored by model \(q\) — the classification loss.
Cross-view alignment Pulling a node’s two embeddings (collaborative \(e_i\) and semantic \(s_i\)) together with InfoNCE.
CTR (click-through rate) \(P(\text{click}\mid \text{user, item, context})\); the score the ranking stage predicts.
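
The cosine-similarity formula above can be spelled out in a few lines. This is a sketch with hypothetical helper names (`dot`, `norm`, `cosine_similarity`); it assumes both vectors are nonzero, since the ratio is undefined otherwise.

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """L2 length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Direction agreement in [-1, 1]; assumes neither vector is all zeros."""
    return dot(u, v) / (norm(u) * norm(v))

print(cosine_similarity([3, 4], [6, 8]))   # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```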

D

Term Meaning
Data leakage Test-period information seeping into training; silently inflates every metric.
DCG / IDCG / NDCG Discounted Cumulative Gain / its ideal / the normalized ratio in \([0,1]\).
DeepFM Wide & Deep with the wide arm replaced by an FM, sharing one embedding layer; no hand-crossing.
Degree How many neighbors a node has.
Degree matrix \(D\) Diagonal matrix of degrees; used for normalization.
Degrees of freedom \(n-1\) for the paired test; sets which Student-\(t\) curve gives the \(p\)-value.
Demographic parity Equal selection / exposure rate across groups.
Derivative \(f'(x)\) Instantaneous rate of change = slope of the curve at a point.
Determinant \(\det A\) Area/volume-scaling factor (\(ad-bc\) for \(2\times2\)); \(\det=0\) ⇒ collapses a dimension / non-invertible.
Difference quotient \(\frac{f(x+h)-f(x)}{h}\) — rise over run before taking \(h\to0\).
Differential privacy Add calibrated noise to updates so no single user’s data can be inferred.
Differentiate Compute a derivative (limit of differences).
Dimension How many components a vector has (\(\mathbb{R}^d\) = all \(d\)-vectors).
Directed / undirected Whether an edge has a direction (A→B \(\ne\) B→A) or not.
Directional derivative Rate of change of \(f\) along a unit direction \(\hat{\mathbf d}\): \(\nabla f\cdot\hat{\mathbf d}\).
Discount \(\gamma\) How much a later reward is worth vs. now; \(\gamma\to0\) is myopic (a bandit), \(\gamma\to1\) far-sighted.
Discriminative vs. generative Axis 1 — whether a model learns \(p(y\mid x)\) (a boundary; can’t generate) or \(p(x)\)/\(p(x,y)\) (the data; can sample). BPR/LightGCN is discriminative; VAEs/LLMs are generative.
Distributional hypothesis “A word is known by the company it keeps” — similar contexts ⟹ similar vectors (Firth 1957).
Diversity Average dissimilarity within a recommended list.
Dot / inner / scalar product \(\mathbf{u}\cdot\mathbf{v}\) \(\sum_i u_iv_i\); one number measuring agreement.
Dot product Multiply-and-sum two embeddings → a similarity / ranking score.
Dropout Randomly zero units during training; an ensemble-like regularizer; “edge dropout” on graphs.
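
A worked sketch of the DCG / IDCG / NDCG entry above. Two assumptions are made that the glossary does not fix: the linear-gain variant (relevance divided by log2 of rank + 1), and an ideal ordering obtained by sorting the same relevance list, which presumes the list already contains every relevant item. The names `dcg` and `ndcg` are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k):
    """DCG@k divided by the ideal DCG@k; 0.0 when nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the model ranked the items (hypothetical).
print(ndcg([1, 0, 0], k=3))  # the single relevant item is at rank 1
print(ndcg([0, 0, 1], k=3))  # the same item pushed down to rank 3
```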

E

Term Meaning
Early stopping Stop when validation stops improving; implicit regularizer.
Edge A connection (user watched movie).
Effect size The magnitude of the difference (\(\bar d\)); report it alongside \(p\) (significance \(\ne\) size).
Eigenbasis A symmetric matrix’s full set of orthogonal eigenvectors.
Eigenvalue \(\lambda\) The stretch factor of an eigenvector.
Eigenvector A direction a matrix only stretches, not rotates: \(A\mathbf{v}=\lambda\mathbf{v}\).
Eigenvector \(v\) A special direction a matrix only stretches (not rotates): \(\tilde{A}v=\lambda v\). The graph’s eigenvectors are its “frequency patterns.”
Elastic net L1 + L2 combined; “stretchable net” keeping correlated features.
ELBO Evidence Lower BOund; the VAE (variational autoencoder) objective = reconstruction loss + β·KL.
Embedding A learned vector representing a user/item as a point in space.
Embedding layer A learnable lookup table; id → its row (= one-hot × matrix); the input side for discrete data.
Enhancer / reranker Recsys LLM roles: generate item/user text features / re-order a candidate list — not chat.
Enhancer pipeline Using an LLM offline to manufacture features for a fast collaborative backbone that serves.
Entropy \(H(p)\) \(-\sum p\log p\); average surprise / uncertainty / bits to encode (maximal at uniform).
Epoch One full pass over all observed positives during training.
Epoch / batch One full pass over the data / the examples used per update.
Equal opportunity Equal true-positive rate across groups (the truly-relevant are surfaced equally).
ERM Empirical Risk Minimization — “fit the training data” (usually + a regularizer).
Euclidean distance \(\lVert\mathbf{u}-\mathbf{v}\rVert\) Straight-line gap between two arrow-tips; sensitive to magnitude (unlike cosine).
Evidence \(p(D)\) Normalizer; data probability averaged over \(\theta\) (a.k.a. marginal likelihood).
Expectation \(\mathbb{E}[X]\) / mean \(\mu\) Probability-weighted average value.
Explicit / implicit feedback Stated ratings (likes and dislikes) vs. behavioural \(0/1\) (positives only — no true negatives).
Explicit vs. implicit feedback Star ratings vs. clicks/plays (positive-only, unlabeled negatives).
Explore vs. exploit Try an uncertain option to learn (explore) vs. take the best-known one now (exploit).

F

Term Meaning
F1@K Harmonic mean of precision and recall.
Factor vector A feature’s short learned vector \(\mathbf v_i\); the dot product of two is their interaction strength.
Factorization machine (FM) MF generalized to any features: each feature gets a vector, each feature pair interacts via a dot product (Rendle, 2010). MF is the ids-only special case.
FCF Federated collaborative filtering: item factors are global; each user’s factor stays on-device.
Feature interaction A pair of features whose combination matters (user \(\times\) genre); FM scores it as \(\langle\mathbf v_i,\mathbf v_j\rangle\).
Feature interaction / cross The conjunction of two categories (sci-fi ∧ mobile); where the predictive signal lives.
Feature store The table of generated id → (profile, vector) the serving path reads; the LLM is never on the serving path.
FedAvg Aggregate by a data-size-weighted average of clients’ local models (McMahan et al., 2017).
Federated learning Train a shared model across devices that keep their raw data local; only updates are shared.
Feed-forward (FFN) Per-token MLP \(\max(0,xW_1{+}b_1)W_2{+}b_2\) inside each block; transforms each token (most of the parameters).
Feedback loop Recommend → click → log → retrain on skewed data → more skew; the “rich get richer” spiral that compounds bias.
Filter response \(h(\lambda)\) The function deciding how much each frequency is kept; defines the filter.
Focal loss Cross-entropy times \((1-\hat p)^\gamma\) to down-weight easy examples; for class imbalance.
Forward pass Running an input through the net to a prediction + loss.
FPMC Factorizing Personalized Markov Chains: MF term \(+\) a learned (factorized) transition term.
Frequency (on a graph) An eigenvector pattern; low = smooth over neighbors, high = jagged.
Frozen text encoder Run text once through a fixed pretrained model and keep the output vector — the semantic \(s_i\) RecSys aligns/consumes.
Full vs. sampled ranking Rank the true item against the whole catalog vs. a few sampled negatives (the latter is inconsistent).

G

Term Meaning
Gaussian (Normal) Bell curve \((\mu,\sigma^2)\); \(-\log\) → squared error (MSE).
GCL Graph Contrastive Learning — contrastive SSL on a graph recommender.
GCN The model built by stacking graph-convolution layers (+ loss, prediction head, and — in vanilla GCN — \(W\) and \(\sigma\)). Operation vs. architecture.
Generative learning SSL pretext: reconstruct the masked/corrupted input; no negatives; generative mechanism.
Generative recommendation Recommend by generating an item’s Semantic ID token-by-token (§2.2).
Gini coefficient Concentration of recommendations on few items (\(0\) even … \(1\) concentrated).
GPT Generative Pre-trained Transformer (next-token, causal); a decoder for generating text (Radford 2018).
Gradient \(\nabla f\) Vector of all partials; points in the direction of steepest increase.
Gradient descent \(\theta\leftarrow\theta-\eta\nabla f\): step against the gradient to minimize.
Graph Nodes + edges.
Graph convolution The operation: replace each node’s embedding with the normalized average of its neighbors’ (one matrix multiply by the normalized adjacency).
Graph Fourier Transform (GFT) Projecting a graph signal onto the eigenvectors of the (normalized) adjacency/Laplacian.
Graph signal One value per node (e.g. a user’s interaction vector over items).
Graph Signal Processing (GSP) Treating one-number-per-node data as a “signal” and processing it with graph frequencies.
Graph-CF trio Gowalla / Yelp2018 / Amazon-Book — the frozen-split benchmark of graph-CF papers.
GraphMAE A generative (masked-autoencoder) graph SSL method.
ε-greedy Exploit the best estimate with prob \(1-\varepsilon\); pull a uniform-random arm with prob \(\varepsilon\).
GRU Gated Recurrent Unit: a lighter 2-gate LSTM (Cho 2014); GRU4Rec is its session-based recommender.
GRU4Rec A gated RNN for session-based recommendation; session-parallel batches + a ranking loss.
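
The gradient-descent update above, \(\theta\leftarrow\theta-\eta\nabla f\), can be run on a one-variable toy problem. The objective \(f(\theta)=(\theta-3)^2\), the learning rate, and the step count below are illustrative choices only.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def grad_f(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta_star = gradient_descent(grad_f, theta=0.0, lr=0.1, steps=100)
print(theta_star)  # approaches the minimizer at 3
```

With this learning rate each step shrinks the distance to the minimizer by a factor of 0.8; a step size above 1.0 would make the same loop diverge on this objective.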

H

Term Meaning
Harness / orchestration The engineering around the LLM call — templates, validation, batching, retries, caching, cost control.
Hessian \(H\) Matrix of second partial derivatives (multivariable curvature).
Hidden layer A layer that is neither input nor output.
Hinge loss SVM loss; flat once correct-with-margin, then a linear ramp — shaped like a hinge.
Hit Rate (HR@K) Fraction of users with \(\ge 1\) relevant item in the top-\(K\).
Hit-Rate@K / NDCG@K Was the held-out next item in the top \(K\)? / how high was it? (Evaluation Metrics).
HNSW A multi-layer navigable-graph ANN index with \({\sim}\log M\) query cost.
HNSW / IVF-PQ / DiskANN / CAGRA ANN index types: graph (in-RAM default) / inverted-list + compression (memory-thrifty) / on-disk (billion-scale) / GPU graph (batched throughput).
Homogeneous / heterogeneous One node type, vs. several (users and items).
Hop One step along an edge (distance in the graph); fixed by the graph, not chosen. \(K\) layers ⟹ reach \(K\) hops.
Huber loss Quadratic near 0, linear far out; robust compromise (after P. Huber).
Hybrid Combine content-based + collaborative (e.g. LLM features + a CF backbone).

I

Term Meaning
Idempotency Re-running the pipeline on unchanged input does no new work (hash → cache hit), so you generate each profile once.
Identity \(I\) The “do-nothing” matrix; \(I\mathbf{x}=\mathbf{x}\).
Impression One shown item; the unit of a CTR row, labelled \(1\) (clicked) or \(0\) (not).
Indefinite A symmetric matrix with both positive and negative eigenvalues (curves up some ways, down others) — the Hessian at a saddle.
Independence \(\Pr(A\cap B)=\Pr(A)\Pr(B)\); one event tells you nothing about the other.
Inference / serving Using the trained embeddings to return a user’s top-\(K\) list.
Inflection point Where \(f''\) changes sign (the curve switches between bending up and down).
InfoNCE The standard contrastive loss (softmax over similarities, temperature \(\tau\)).
InfoNCE / NT-Xent Contrastive loss; Info = mutual-info bound, NCE = Noise-Contrastive Estimation; temperature \(\tau\).
Integral \(\int\) Area under a curve; the reverse of differentiation; gives probabilities/expectations.
Interaction matrix \(R\) Users-×-movies table of 0/1 (who watched what).
Invertible / singular Invertible = has an inverse \(A^{-1}\) (\(\det\neq0\), full rank); singular = no inverse (\(\det=0\)).
item2vec word2vec applied to user interaction sequences (“item = word, history = sentence”); a collaborative item embedding.

J

Term Meaning
Jaccard similarity \(\lvert A\cap B\rvert/\lvert A\cup B\rvert\) on liked-sets; ignores rating values.
Jacobian \(J\) Matrix of partials of a vector-valued function (rows=outputs, cols=inputs).
Joint / marginal Joint \(p(A,B)=\Pr(A\cap B)\); marginal \(p(A)=\sum_B p(A,B)\) (sum the joint over the other variable).
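
Jaccard similarity on liked-sets, as defined above, maps directly onto set operations. In this sketch the function name and item ids are hypothetical, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two sets of liked items."""
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)

alice_likes = {"m1", "m2", "m3"}
bob_likes = {"m2", "m3", "m4"}
print(jaccard(alice_likes, bob_likes))  # 2 shared items out of 4 distinct ones
```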

K

Term Meaning
@K Evaluated over the top \(K\) recommended positions only.
k-core filter Iteratively drop users/items with \(<k\) interactions until none remain; stabilizes the data.
\(k\)-NN / neighbourhood The \(k\) most-similar users/items used for a prediction.
KL divergence Asymmetric “distance” between distributions (Kullback–Leibler); a divergence, not a metric.
KL divergence \(D_{\mathrm{KL}}(p\Vert q)\) \(H(p,q)-H(p)\ge0\); extra surprise from using \(q\) not \(p\); asymmetric (a divergence, not a distance).
KV-cache Store past tokens’ keys/values so each decode step processes only the new token: \(O(L)\) instead of \(O(L^2)\).
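
The identity \(D_{\mathrm{KL}}(p\Vert q)=H(p,q)-H(p)\) above can be verified numerically. A minimal sketch in bits (log base 2), with hypothetical function names; it assumes \(q_i>0\) wherever \(p_i>0\).

```python
import math

def entropy(p):
    """H(p) = -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log2 q; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for coding outcomes of p with a code built for q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different positive number: KL is asymmetric
print(kl_divergence(p, p))  # zero when the two distributions match
```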

L

Term Meaning
L1 / lasso \(\sum\lvert\theta_k\rvert\) penalty; drives some weights exactly to 0 (sparsity); = Laplace prior.
L2 / ridge / Tikhonov / weight decay \(\sum\theta_k^2\) penalty; shrinks weights smoothly toward 0; = Gaussian prior.
L2 / weight decay Regularizer \(\lambda\lVert E^{(0)}\rVert^2\); LightGCN regularizes only the base embeddings.
\(\lambda\) (reg. strength) How much we weight the regularizer vs. the loss.
LambdaRank / LambdaMART Listwise loss whose gradient is weighted by each pair’s effect on NDCG.
Laplace Peaked/heavy-tailed \((\mu,b)\); \(-\log\) → absolute error (MAE).
Latent factor A hidden, learned dimension of taste/content.
Law of Large Numbers The sample average converges to the expectation as samples accumulate.
Law of total probability \(p(A)=\sum_B p(A\mid B)p(B)\); the engine of Bayes’ evidence term.
Layer One application of the graph-convolution operation (a computation step in the model); a hyperparameter \(K\) you choose.
Layer / width / depth A bank of neurons / neurons-per-layer / number of layers.
Layer combination Averaging embeddings from all layers; LightGCN’s fix for over-smoothing.
Learning rate \(\eta\) Step size in gradient descent.
Leave-one-last split Hold out each user’s chronologically last interaction for test (temporal, leak-free).
Likelihood / prior / posterior \(p(\text{data}\mid\theta)\) / \(p(\theta)\) / \(p(\theta\mid\text{data})\).
Likelihood \(L(\theta)\) \(p(\text{data}\mid\theta)\) read as a function of \(\theta\) (data fixed).
Limit The value a quantity approaches (here as the run \(h\to0\)).
LinUCB Ridge-regression reward estimate \(\boldsymbol\theta^{\top}\mathbf x\) plus a feature-space confidence bonus.
LLM A large pretrained Transformer language model; here, demystified as this note’s lineage at scale.
LLM-as-enhancer The LLM produces semantic features/profiles offline that feed/align with a classical model (§2.3).
LLM-as-recommender The LLM directly ranks/selects items from a prompt of the user’s history (§2.1).
LLM-as-reranker The most-deployed sub-pattern: the LLM (often a cross-encoder) re-scores a short top-\(K\) shortlist a cheap retriever already fetched, rather than ranking the whole catalogue (§2.1).
Local / global minimum Local = lowest within a neighbourhood; global = lowest anywhere.
Log-likelihood \(\log L(\theta)\); products become sums.
Logistic regression Linear score \(\to\) sigmoid \(\to\) probability; the CTR baseline and every model’s output head.
Logit (log-odds) \(\ln\!\big(p/(1-p)\big)\); maps \((0,1)\to(-\infty,\infty)\).
LogLoss Binary cross-entropy; grades how calibrated the predicted probabilities are (lower better).
Long tail The many low-degree (niche) items; under-trained and under-served.
Long tail / popularity bias A few blockbusters get most interactions; metrics can be gamed by pushing them.
Loss function A single number measuring how wrong the model is; training minimizes it.
Low-pass filter Keep low frequencies (smooth signal), suppress high (noise).
LSTM Long Short-Term Memory: an RNN with a cell state + forget/input/output gates; the additive cell line stops gradients vanishing.
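
The logit and logistic-regression entries above are two directions of one mapping. The sketch below uses hypothetical names (`sigmoid`, `logit`, `predict_ctr`) and made-up weights; the sigmoid is written in a branch form to avoid overflow for large negative scores.

```python
import math

def sigmoid(z):
    """Map a real score to (0, 1); branch on sign to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(p):
    """Log-odds ln(p / (1 - p)); the inverse of the sigmoid on (0, 1)."""
    return math.log(p / (1.0 - p))

def predict_ctr(weights, features, bias):
    """Logistic regression: linear score, then sigmoid, gives a click probability."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

print(logit(0.5))            # even odds give log-odds 0
print(sigmoid(logit(0.8)))   # round trip recovers the probability
print(predict_ctr([0.4, -1.2], [1.0, 0.5], bias=-0.3))
```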

M

Term Meaning
MAE / absolute loss Mean of absolute errors; from Laplace noise; outlier-robust; targets the median.
MAE / MSE / RMSE Mean Absolute / Mean Squared / Root-Mean-Squared rating error.
MAP Maximum A Posteriori; peak of the posterior \(=\) MLE \(+\) prior \(=\) loss \(+\) regularizer.
MAP estimate Maximize posterior ∝ likelihood × prior; ⟹ minimize (−log-lik) + (−log-prior) = loss + regularizer.
Markov chain (first-order) Next item depends only on the last one; transitions estimated by counting.
Matrix A rectangular grid of numbers; also a linear transformation of vectors.
Matrix Factorization (MF) \(\hat r_{ui}=p_u^\top q_i\); trained with regularized MSE.
Matrix–vector product \(A\mathbf{x}\) Dot each row of \(A\) with \(\mathbf{x}\); “\(n\) in, \(m\) out.”
Matryoshka (MRL) Truncatable embeddings — a shorter prefix of the vector still works, so you index at a smaller dim to save memory/latency.
MDP Markov Decision Process: state, action, reward, transition, policy — the formalism of RL.
Memory-based CF Predict from similar rows/columns of \(R\) at query time (\(k\)-NN).
Mini-batch A small group of triples updated together (averaged gradient); also the source of in-batch negatives.
MIPS (Maximum Inner-Product Search) Finding the item vectors with the largest dot product against a query — what top-\(K\) scoring is.
MLE Maximum Likelihood Estimate; \(\theta\) that best explains the data.
MLP Multi-layer perceptron — a feedforward stack of fully-connected layers.
MNAR data Missing-Not-At-Random: whether an interaction is observed depends on what the system chose to show — recommender logs are the textbook case.
Model-based CF Learn a compact model first (e.g. matrix factorization).
MoE (mixture-of-experts) A model that holds many parameters but activates only a few per token, so inference costs less than the total parameter count suggests.
Momentum / Adam SGD upgrades: a velocity term / per-parameter adaptive step (Adam = the default optimizer).
MRR Mean Reciprocal Rank — \(1/\)rank of the first relevant item, averaged.
MSE / squared loss Mean of squared errors; from Gaussian noise; outlier-sensitive.
MTEB / ann-benchmarks / LMArena / CoNLL The standard leaderboards for embeddings / vector indexes / chat-LLMs / NER. Priors, not verdicts.
Multi-head Several attentions in parallel with different learned projections, then concatenated.
Mutual-information maximization The framing RLMRec uses for aligning the collaborative and semantic views.
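
A short sketch of MRR as defined above. The names are hypothetical, and scoring a user with no relevant item in the list as 0 is an assumed convention.

```python
def reciprocal_rank(ranked_items, relevant):
    """1 / rank of the first relevant item; 0.0 if none appears (assumed convention)."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(cases):
    """Average the reciprocal rank over (ranked_items, relevant_set) pairs, one per user."""
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)

cases = [
    (["a", "b", "c"], {"a"}),  # first hit at rank 1
    (["a", "b", "c"], {"c"}),  # first hit at rank 3
    (["a", "b", "c"], {"z"}),  # no hit
]
print(mean_reciprocal_rank(cases))  # (1 + 1/3 + 0) / 3
```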

N

Term Meaning
Nabla / del (\(\nabla\)) The symbol for the gradient operator.
Negative sampling Drawing un-interacted items to act as negatives.
Neuron / unit A weighted sum of inputs + bias, passed through an activation.
Next-item prediction The core task: given \((i_1,\dots,i_{t-1})\), score every item for being \(i_t\).
NLL Negative log-likelihood \(=-\log L\); minimizing it = MLE; this is a loss.
Node A thing (a user, a movie).
Non-IID data Each client’s data is small and unrepresentative of the whole — the core difficulty of federation.
Norm / length \(\lVert\mathbf{u}\rVert\) The arrow’s length; default L2 \(\sqrt{\sum_i u_i^2}\). L1 \(=\sum_i\lvert u_i\rvert\) (used in lasso).
Novelty Non-obviousness, often \(-\log_2(\text{popularity})\).
Null hypothesis \(H_0\) The skeptical default “no real difference”; a test tries to disprove it.

O

Term Meaning
Odds \(p/(1-p)\).
Off-policy Learning from data collected by a different (older) policy than the one being trained.
Offline replay Unbiased evaluation of a bandit policy on a log of randomly-served actions (Li et al. 2011).
One- vs two-sided test Two-sided asks “A \(\ne\) B” (the default); one-sided asks “A \(>\) B” and halves \(p\) — valid only if the direction was pre-registered.
One-hot vector A length-\(V\) vector, \(1\) in one slot, \(0\) elsewhere; encodes identity but no similarity (all pairs orthogonal).
Open weights / open source / open data Released weights only / weights + permissive code license / + training data too. They are different and often confused.
Orthogonal Perpendicular; dot product \(0\).
Orthonormal basis A full set of axes that are mutually orthogonal and each unit length; a coordinate is then just a dot product (\(U,V\) in §6).
Outer product \(\mathbf{u}\mathbf{v}^{\top}\) A column times a row = a whole rank-1 matrix.
Over-smoothing Too many layers make all embeddings collapse to one blurry point.
Overfitting Memorizing training noise; great on train, poor on new data.

P

Term Meaning
\(p\)-value Probability of a gap this large if \(H_0\) were true; small = significant. NOT \(\Pr(H_0\text{ true})\).
Paired \(t\)-test Tests whether the mean of per-pair differences \(\bar d\) differs from 0: \(t=\bar d/(s_d/\sqrt n)\).
Partial derivative \(\partial f/\partial x\) Derivative wrt one variable, others held fixed.
Pearson correlation A similarity measure (centred cosine) for ratings — removes each user’s rating bias.
PMF / PDF Probability mass (discrete) / density (continuous) function; sums/integrates to 1.
Pointwise / pairwise / listwise Score each pair / a pair’s order / the whole list.
Policy \(\pi(a\mid s)\) The learned rule mapping a state to an action; what RL optimizes.
Polynomial filter A filter that is a polynomial in \(\tilde{A}\) (what LightGCN layers compute).
Popularity bias Tendency to over-recommend high-degree (blockbuster) items (§13).
Popularity bias (in eval) Popular items dominate the test set, so pushing them inflates accuracy metrics.
Position bias Higher-ranked slots draw more clicks regardless of true relevance.
Positional embedding A learned per-position vector added to item embeddings so attention can see order.
Positional encoding A per-position signal added to embeddings so order-blind attention can see order (sinusoidal or learned).
Positive / negative pair Positive = two views of the same node; negative = views of different nodes.
Positive semidefinite (PSD) A symmetric matrix with \(\mathbf x^{\!\top}\!H\mathbf x\ge0\) for all \(\mathbf x\) (all eigenvalues \(\ge0\)); the multivariable “\(\ge0\)” condition on the Hessian that makes a function convex.
Positive-semidefinite (PSD) A symmetric matrix with all eigenvalues \(\ge0\) (equivalently \(\mathbf{x}^{\top}A\mathbf{x}\ge0\) always); every \(M^{\top}M\) is PSD since \(\mathbf{x}^{\top}M^{\top}M\mathbf{x}=\lVert M\mathbf{x}\rVert^{2}\).
Posterior \(p(\theta\mid D)\) Updated belief after data.
Pre-activation \(z\) (logit) The raw score \(\mathbf{w}\cdot\mathbf{x}+b\) before the activation.
Precision@K Of the \(K\) shown, the fraction relevant.
Predictive learning SSL pretext: predict a property derived from the data (e.g. a masked item/attribute); one correct answer, no negatives; discriminative mechanism.
Pretraining Train a big Transformer on a huge corpus with a self-supervised objective, then reuse it.
Prior \(p(\theta)\) Belief about \(\theta\) before data.
Profile A short LLM-written description of a user’s taste or an item’s character, used as input to an embedder.
Projection \(\mathrm{proj}_{\mathbf{v}}\mathbf{u}\) The shadow of \(\mathbf{u}\) on \(\mathbf{v}\)’s line, \(\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{v}\rVert^{2}}\mathbf{v}\); in an orthonormal basis a coordinate is just a dot product.
Propensity / IPS Propensity = probability an item was shown; IPS re-weights each interaction by \(1/\)propensity to undo exposure bias.
Provider vs. consumer fairness Comparable exposure across item creators (provider) vs. comparable quality across user groups (consumer).
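
The paired \(t\)-test statistic above, \(t=\bar d/(s_d/\sqrt n)\), computed on made-up per-user scores for two models. This sketch stops at the statistic and its degrees of freedom; turning \(t\) into a \(p\)-value needs the Student-\(t\) distribution, which the standard library does not provide. The function name `paired_t` and the data are hypothetical.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Return (t, degrees of freedom, mean difference) for paired samples."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    d_bar = statistics.mean(d)   # effect size: mean per-pair difference
    s_d = statistics.stdev(d)    # sample standard deviation (n - 1 denominator)
    t = d_bar / (s_d / math.sqrt(n))
    return t, n - 1, d_bar

# Hypothetical per-user metric values (in percent) for models A and B.
model_a = [32.0, 35.0, 31.0, 40.0, 38.0]
model_b = [30.0, 33.0, 30.0, 36.0, 35.0]

t, dof, d_bar = paired_t(model_a, model_b)
print(t, dof, d_bar)
```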

Q

Term Meaning
Quantization Storing weights/vectors in fewer bits (4/8-bit, int8/binary) to shrink memory and speed inference, at a small quality cost.
Query / Key / Value Soft database lookup: a query is matched (dot product) against keys to retrieve a blend of values.

R

Term Meaning
Random split Hold out a random fraction of interactions — leaks the future for time-ordered data.
Random variable A quantity whose value is uncertain (coin, die, temperature).
Rank The matrix’s number of independent directions: independent rows/columns (§2) = nonzero singular values (§6).
Rank / latent dimension \(d\) Width of the factor vectors; the dial between under- and over-fitting.
Ranking The expensive, high-precision second stage that reorders the shortlist with a feature-rich model.
Recall vs QPS The ANN trade-off: fraction of true neighbours found vs queries-per-second. Always compare at a fixed recall.
Recall@K Of all relevant items, the fraction in the top-\(K\).
Regret \(\sum_t(\mu^{*}-\mu_{a_t})\): reward lost versus always pulling the best arm.
Regularization Making an ill-posed solution “regular” (well-behaved); from Hadamard/Tikhonov.
Regularizer A term/procedure that discourages complex solutions to improve generalization.
Relevant item A ground-truth item the user actually likes.
ReLU \(\max(0,z)\) — Rectified Linear Unit; the default activation.
Representation learning The network learning useful features by itself.
Residual connection A sub-layer computes \(x+f(x)\); the \(+1\) in its derivative is a gradient highway for deep stacks.
Retrieval / candidate generation The cheap, high-recall first stage that cuts millions of items to a few hundred.
Return \(G\) Cumulative discounted reward \(\sum_t \gamma^t r_t\) — the long-term objective.
Risk / empirical risk Expected loss over the true distribution / its average on the training sample.
RLMRec An enhancer that aligns an LLM-written semantic embedding with a CF embedding via contrastive/generative alignment (§3).
RLMRec-Con / -Gen RLMRec’s contrastive / generative variants for aligning LLM semantics with GNN embeddings.
RMSE / Recall@K / NDCG@K Rating-error / top-\(K\) ranking-quality metrics.
RNN Recurrent Neural Network: one shared cell that updates a hidden-state “memory” \(h_t=\tanh(Wh_{t-1}+Ux_t+b)\) along a sequence.
RNN / hidden state \(\mathbf h_t\) A net that folds each item into a running fixed-length summary of the session.
RQ-VAE The residual-quantized variational autoencoder that learns the Semantic-ID codebooks by reconstructing the content embedding (§2.2).
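
As a quick illustration of three formulas in this section (Recall@K, Regret, Return \(G\)), here is a minimal Python sketch. The function names, the choice to start \(t\) at 0, and the convention of returning 0 when there are no relevant items are assumptions made for the example, not definitions taken from the home chapters.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Recall@K: of all relevant items, the fraction found in the top-k."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0  # assumed convention: no relevant items -> recall 0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def regret(best_mean, pulled_means):
    """Regret: sum over rounds of (mu* - mu of the arm actually pulled)."""
    return sum(best_mean - mu for mu in pulled_means)


def discounted_return(rewards, gamma):
    """Return G: sum over t of gamma**t * r_t, with t starting at 0 (assumed)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    # One of the three relevant items ("b") is in the top 2.
    print(recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=2))
    # Three pulls against a best arm with mean 0.9.
    print(regret(0.9, [0.5, 0.9, 0.7]))
    # 1 + 0.5 * 0 + 0.25 * 2
    print(discounted_return([1, 0, 2], gamma=0.5))
```

Note that Recall@K divides by the number of relevant items, not by \(K\); dividing by \(K\) would give Precision@K instead.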

S

Term Meaning
Saddle point Critical point that is a min in some directions, a max in others (indefinite Hessian).
Sample space / event The set of possible outcomes / a subset of them.
Sampled softmax (SSM) Listwise cross-entropy over a positive + many sampled negatives; structurally = InfoNCE.
SASRec Self-Attentive Sequential Recommendation: a causal Transformer over the item sequence.
SASRec / BERT4Rec Sequential recommenders: a causal (left-to-right) / bidirectional (masked) Transformer over the item history.
Second derivative \(f''\) Rate of change of the slope = curvature.
Secure aggregation The server sees only the sum of client updates, never any individual one.
Selection / exposure bias Only items the system showed can get feedback; un-shown items look “disliked” by silence.
Self-attention Attention of a sequence to itself (Q, K, V all from the same tokens).
Self-loop An artificial edge from a node to itself; present in GCN, absent in LightGCN.
Semantic embedding A fixed-length vector encoding the meaning of the profile text (from a frozen sentence encoder), distinct from the CF latent space.
Semantic ID A short sequence of content-derived codewords identifying an item (quantized text embedding).
Sequential recommendation Predict the next item from a user’s ordered history.
Serendipity Relevant and surprising.
Session-based recommendation Sequential recommendation from a short, often anonymous current session (no user id).
SGD Stochastic Gradient Descent — gradient steps on random mini-batches.
SGL / SimGCL / XSimGCL / LightGCL GCL methods differing in how the two views are built (edge-drop / noise / cross-layer noise / SVD).
Shape \(m\times n\) \(m\) rows by \(n\) columns.
\(\sigma\) (nonlinearity) Activation (e.g. ReLU) in GCN/NGCF; removed in LightGCN.
Sigmoid \(\sigma(z)\) \(1/(1+e^{-z})\); inverse of the logit; score → probability.
Significance (paired \(t\)-test) Is a gain bigger than run-to-run seed noise? Report mean \(\pm\) std and a \(p\)-value.
Significance level \(\alpha\) The pre-set bar (usually \(0.05\)); reject \(H_0\) if \(p<\alpha\).
Singular value \(\sigma\) The strength of an SVD pattern; \(\sigma\ge0\), biggest first.
Slate A whole page/list of recommended items — the (combinatorial) action in slate RL.
SlateQ Decomposes a slate’s value into per-item \(Q\)-values, making value-based slate RL tractable.
SLM Small language model (\(\approx 0.5\)–9B) — cheap, high-throughput; the recsys feature-generation workhorse.
Smooth Has a derivative everywhere — no kinks or jumps; looks straight up close.
Softmax Multi-class sigmoid; scores → a probability distribution.
Softmax output Multi-class output head: \(K\) scores → a probability distribution, paired with cross-entropy.
Sparse / one-hot feature A category encoded as all-zeros with a single \(1\); CTR rows stack millions of these.
Sparsity Most of \(R\) is unknown (real matrices \({\sim}99.5\%\) empty); the core difficulty.
SSL (self-supervised learning) Training on an auxiliary task that needs no labels; supervision is manufactured from the data’s own structure.
Standard error \(s_d/\sqrt n\) — how much an estimate wobbles across repeats; shrinks with more data.
State \(s\) A summary of the current situation; the ingredient bandits lack (actions change it).
Static vs contextual word2vec gives one fixed vector per word; a Transformer gives a different vector per occurrence (context-dependent).
Stochastic gradient descent (SGD) Gradient descent on a noisy gradient estimated from a random minibatch, not the full dataset.
Structured output Forcing an LLM to emit schema-valid JSON via constrained decoding — valid shape, not guaranteed-correct values.
Structured output / JSON mode Forcing the LLM to emit schema-valid JSON, so the result is reliably parseable.
Subgradient A stand-in slope at a kink where the true derivative is undefined (e.g. any value in \([0,1]\) for ReLU at \(0\)).
Subword tokenization (BPE) Split text into reusable sub-word pieces by greedily merging the most frequent adjacent pair; fixed vocabulary, no <UNK> (WordPiece/SentencePiece are cousins).
Surprise / self-information \(-\log p(x)\); how surprising an outcome is (\(0\) if certain, \(\to\infty\) as \(p(x)\to 0\)).
SVD (\(M=U\Sigma V^\top\)) Factorizes a (rectangular) matrix into patterns \(U,V\) and strengths \(\Sigma\); keeping the top-\(q\) = an ideal low-pass (PSGE, LightGCL).
SVD \(M=U\Sigma V^{\top}\) Factor any matrix into orthonormal patterns (\(U,V\)) scaled by singular values (\(\Sigma\)).
Symmetric \(A=A^{\top}\); clean orthogonal eigenvectors.
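
The Sigmoid and Softmax entries above can be checked numerically with a short sketch. This is an illustrative Python version using only the standard library; the function names and the overflow-safe formulations (branching on the sign of \(z\), subtracting the maximum score) are implementation choices assumed here, not something the glossary prescribes.

```python
import math


def sigmoid(z):
    """1 / (1 + e^{-z}), written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p):
    """Log-odds log(p / (1 - p)); the inverse of the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def softmax(scores):
    """Turn K scores into a probability distribution."""
    m = max(scores)  # subtracting the max leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    z = 1.3
    print(sigmoid(z), logit(sigmoid(z)))  # second value recovers z (up to rounding)
    probs = softmax([2.0, 1.0, 0.1])
    print(probs, sum(probs))              # probabilities summing to 1 (up to rounding)
    # "Multi-class sigmoid": a two-class softmax over [z, 0] matches sigmoid(z).
    print(softmax([z, 0.0])[0], sigmoid(z))
```

The last line shows why softmax is described as a multi-class sigmoid: with two classes and one score pinned at 0, the two functions coincide.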

T

Term Meaning
Taylor / linear approximation \(f(x+\delta)\approx f(x)+f'(x)\delta\); a curve looks straight up close.
Temperature \(\tau\) Scales the contrast sharpness in InfoNCE; a sensitive hyperparameter.
Temporal / leave-one-last split Hold out each user’s chronologically last interaction; the honest, leak-free default.
TF-IDF A way to turn text into a feature vector (term frequency × inverse document frequency).
The recsys leak Sending an update for item \(i\) reveals the user touched \(i\); FedRec masks this with decoy items.
Thompson sampling Keep a posterior per arm; sample one value from each and pull the largest sample.
Top-\(K\) The \(K\) highest-scoring unseen items returned to the user.
Trace Sum of a square matrix’s diagonal entries; equals the sum of its eigenvalues.
Train/test split How held-out test data is chosen: random, leave-one-out (LOO), or temporal (by timestamp).
Transformer A stack of (self-attention + add&norm + feed-forward + add&norm) blocks; no recurrence, fully parallel (Vaswani et al., 2017).
Transition probability \(P(\text{next}=j\mid\text{last}=i)\) — count of \(i\!\to\!j\) over count of \(i\)-as-from.
Transpose \(A^{\top}\) The matrix mirrored across its diagonal (rows ↔︎ columns).
Triple \((u,i^+,i^-)\) A training example for pairwise ranking: user \(u\) prefers positive \(i^+\) over sampled negative \(i^-\).
Truncated SVD Keep the top-\(q\) singular components = best low-rank approximation (least-squares / Frobenius sense).
Two-stage funnel Retrieval → ranking: fast-and-forgiving then slow-and-sharp, so a rich model can run on only a few hundred candidates.
Two-tower LLM-embedding retrieval A hybrid: embed users and items with an LLM-grade encoder offline, then retrieve by ANN over those vectors at serve time — enhancer-class cost, LLM semantics on the retrieval path (§4).
Type I / II error False positive (reject true \(H_0\)) / false negative (miss a real effect).
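
The Transition probability entry above is a counting recipe, which a few lines of Python make concrete. This is a minimal sketch under assumptions: the function name `transition_probabilities` and the toy sessions are hypothetical, and no smoothing is applied to unseen transitions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next = j | last = i) as count(i -> j) / count(i as 'from')."""
    pair_counts = defaultdict(Counter)
    for session in sessions:
        # Walk consecutive (previous, next) pairs within each session.
        for prev, nxt in zip(session, session[1:]):
            pair_counts[prev][nxt] += 1

    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())  # how often `prev` was a source item
        probs[prev] = {nxt: c / total for nxt, c in counter.items()}
    return probs


if __name__ == "__main__":
    # Hypothetical toy sessions of item ids.
    sessions = [["a", "b", "c"], ["a", "c"], ["a", "b"]]
    for item, row in transition_probabilities(sessions).items():
        print(item, row)
```

In the toy data, "a" is followed by "b" twice and by "c" once, so the estimate is \(P(b\mid a)=2/3\) and \(P(c\mid a)=1/3\). Items that never appear as a source, such as "c", get no row at all.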

U

Term Meaning
UCB Pull \(\arg\max_i\,[\hat\mu_i + \sqrt{2\ln t / n_i}\,]\): highest upper bound — optimism under uncertainty.
Uniform / popularity-weighted / in-batch / hard negatives Four sampling distributions for \(i^-\): equal-probability; \(\propto\text{pop}^{0.75}\); reuse the batch’s other positives; pick high-scoring (confusing) items.
Uniformity Embeddings spread evenly over the sphere; fights popularity bias / collapse.
Unit vector A vector of length \(1\); “normalizing” = dividing by your own length.
Universal approximation A wide enough net with a nonlinear activation can approximate any continuous function on a bounded domain to any desired accuracy.
User–item matrix \(R\) Rows = users, cols = items, entries = ratings/clicks (mostly unknown).
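
The UCB rule above can be sketched directly from its formula. The following Python is illustrative only: the function names are hypothetical, and giving never-pulled arms an infinite score (so each arm is tried once) is a common convention assumed here rather than something stated in the entry.

```python
import math


def ucb_scores(mean_rewards, pull_counts, t):
    """Upper bound per arm: mu_hat_i + sqrt(2 * ln(t) / n_i)."""
    scores = []
    for mu_hat, n in zip(mean_rewards, pull_counts):
        if n == 0:
            # Assumed convention: an arm with no pulls has an unbounded bonus.
            scores.append(math.inf)
        else:
            scores.append(mu_hat + math.sqrt(2.0 * math.log(t) / n))
    return scores


def ucb_select(mean_rewards, pull_counts, t):
    """Index of the arm with the highest upper bound."""
    scores = ucb_scores(mean_rewards, pull_counts, t)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Arm 0 looks better on average but has been pulled 100 times;
    # arm 1 has only 2 pulls, so its uncertainty bonus is much larger.
    means, counts, t = [0.6, 0.5], [100, 2], 102
    print(ucb_scores(means, counts, t))
    print(ucb_select(means, counts, t))
```

In this example the rarely pulled arm wins despite its lower mean, which is the "optimism under uncertainty" behaviour: the bonus shrinks as \(n_i\) grows, so exploration fades for well-measured arms.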

V

Term Meaning
Value / \(Q\)-value Expected return from a state (\(V\)) or a state–action pair (\(Q\)).
Vanishing gradient Back-prop through many steps multiplies many small factors → gradient decays → long-range dependencies unlearnable.
Variance \(\sigma^2\) Expected squared distance from the mean; spread.
Vector An ordered list of numbers; equivalently, an arrow from the origin to a point.
View One augmented version of a node’s representation; contrastive learning needs two per node.

W

Term Meaning
\(W\) (weight matrix) Learnable feature transform in GCN/NGCF; removed in LightGCN.
Weight / bias Learned multiplier per input / learned constant offset.
Weighted / unweighted Whether edges carry a number (rating/count) or are just 0/1.
Wide & Deep Joint linear-with-hand-crosses (memorize) \(+\) MLP-over-embeddings (generalize) model.
Wilcoxon signed-rank Non-parametric paired test (ranks, not values); use when differences aren’t normal.
word2vec Predict-based word embeddings: skip-gram trains a word to predict its context, CBOW trains the context to predict the word (Mikolov et al., 2013).
WRMF / ALS Weighted Regularized MF / Alternating Least Squares (implicit-feedback, squared error).

X

Term Meaning
xDeepFM / AutoInt / DIN Vector-wise crosses (CIN) / attention-learned interactions / per-candidate behavior attention.

Z

Term Meaning
Zero-shot NER Extracting arbitrary entity types with no task-specific training (GLiNER, NuNER).