Appendix B — Datasets & Benchmarks

1. The datasets you will actually meet

Academic recommendation runs on a surprisingly small set of public logs. It helps to group them by family, because the family fixes the loss and the metric more than the size does:

  • Rating prediction (explicit stars) — MovieLens, Netflix, Book-Crossing.
  • Implicit top-\(N\) / graph CF (clicks, check-ins, buys) — Gowalla, Yelp, Amazon, Steam, LastFM.
  • CTR / news / sequential (a binary click, often with rich features or a time axis) — Criteo, MIND, Taobao.

For each, the four properties that actually change your modelling choice are: feedback (explicit ratings vs. implicit 0/1), timestamps? (needed for a temporal split, §3), side-features? (enables content / FM / cold-start), and license.

A working subset of the public datasets in academic recommender-systems research, grouped by the family that fixes the loss and metric. The Status tag reads Active (downloadable today), Restricted (gated — academic-only / revocable / by signed agreement), or Withdrawn (officially pulled — keep it for history, do not build on it). The other columns that drive a modelling choice are the feedback type, whether there are timestamps (for a temporal split), and whether there are side-features (for content / FM / cold-start).
| Dataset | Domain | Scale (approx.) | Feedback | Time | Side-features | Status & license |
|---|---|---|---|---|---|---|
| MovieLens (100K–32M) | movies | 100K–32M ratings | explicit 1–5 | yes | genres, tags, tag-genome; demographics (100K/1M only) | Active: GroupLens; research, non-commercial |
| Netflix Prize | movies | 100M ratings | explicit 1–5 | yes | title/year only | Withdrawn (de-anonymization); copies circulate; cite as history (§7) |
| Amazon Reviews | e-commerce | 233M (’18) → 571M (’23) | explicit 1–5 + text | yes | metadata, images; per category | Active: research; on HuggingFace |
| Yelp Open Dataset | local business | ~7M reviews | explicit 1–5 + text | yes | attributes, social graph, check-ins | Active (rolling): academic, revocable |
| Gowalla | check-ins | ~1M (10-core) | implicit | yes | geo, friendship graph | Active: SNAP, public |
| Steam | games | ~7M reviews | implicit + reviews | yes | genres, price, playtime | Active: research |
| LastFM (LFM-1b / 2b) | music | 1B–2B plays | implicit | yes | demographics, genre | Withdrawn: LFM-2b pulled (license); LFM-1b gated |
| MIND | news | 1M users, >15M impr. | implicit click | yes | title, abstract, body, entities | Active: Microsoft research license |
| Criteo | display ads | ~45M rows | implicit click | | 13 numeric + 26 categorical (hashed) | Active: public (Kaggle) |
| Book-Crossing | books | 1.1M ratings | explicit 1–10 + implicit | no | age, location, metadata | Active: public (GroupLens) |

Inside the families: which variant carries what. Several entries above are not one dataset but a family of releases, and a beginner can lose hours to the wrong one (MovieLens alone has seven). The matrices below show, per family, what data each variant actually gives you.

The MovieLens family. Every numbered release is a frozen benchmark; the two “latest” sets (latest-small and latest-full) are development data that change over time and must not be used for reported results. Pick by the data you need: user demographics (age/gender/occupation) exist only in 100K/1M; the tag genome (a movie \(\times\) tag relevance matrix: 1,128 tags in 20M, 1,129 in 25M) only in 20M/25M and latest-full; 10M first added free-text tags. Ratings are whole-star 1–5 in 100K/1M, half-star 0.5–5 from 10M on. (User \(\times\) item counts grow with size: 943 \(\times\) 1.7K at 100K up to 162K \(\times\) 62K at 25M.) GroupLens now also ships a 32M frozen release (2023). Source: GroupLens; Harper & Konstan (2015).
| MovieLens variant | Ratings | Scale | Tags | Tag-genome | Demographics | Best for |
|---|---|---|---|---|---|---|
| 100K (1998) | 100K | 1–5 | no | no | yes | smallest frozen set; demographics |
| 1M (2003) | 1M | 1–5 | no | no | yes | the classic mid-scale CF benchmark |
| 10M (2009) | 10M | 0.5–5 | yes | no | no | first release with free-text tags |
| 20M (2015) | 20M | 0.5–5 | yes | yes | no | the classic tag-genome benchmark |
| 25M (2019) | 25M | 0.5–5 | yes | yes | no | larger frozen genome set |
| latest-small | ~100K | 0.5–5 | yes | no | no | quick dev / teaching (not for results) |
| latest-full | ~33M | 0.5–5 | yes | yes | no | dev only; changes over time |

The Amazon Reviews family grows by release year, not a size-suffix; each version re-crawls all categories. Papers use a single category — most often Books, Beauty, Sports, or Toys. The graph-CF “Amazon-Book” benchmark is a 10-core subset of the 2014 Books category (about 52.6K users, 91.6K items, 3.0M interactions) frozen in the NGCF/LightGCN repos — a benchmark artifact, not the live category (which is far larger). Use 2023 for new work. Sources: Ni et al. (2019); Hou et al. (2024).
| Amazon Reviews version | Curators | Reviews | Categories | Adds over prior | Status |
|---|---|---|---|---|---|
| 2014 | McAuley et al. | ~143M | ~24 | ratings, text, also-bought graph | superseded |
| 2018 | Ni, Li & McAuley | 233M | ~30 | \(+\) price, brand, image features | Active |
| 2023 | Hou et al. (McAuley Lab) | 571M | 33 | \(+\) richer metadata; HuggingFace-hosted | Active (current) |

The remaining well-known families and their variants / snapshots. Two rows are frozen evaluation artifacts, not live datasets — Yelp2018 and the graph-CF Amazon-Book — that live only in the benchmark repos, so do not try to re-download them from the vendor (§2). MIND-small is a 50K-user sample for quick iteration (no test split). Book-Crossing’s implicit feedback is coded 0 — an interaction without a grade, not a low score. Sources: official dataset pages; Ziegler et al. (2004); Wan & McAuley (2018).
| Other family · variant | Scale | Carries | Status |
|---|---|---|---|
| Yelp Open Dataset (rolling) | ~7M reviews · 150K businesses · ~1.9M users | text + stars, business attributes, friend graph, check-ins, tips | Active: academic, revocable |
| Yelp2018 (snapshot) | 31.7K users · 38K items · 1.56M (10-core) | interactions only | Frozen: in NGCF/LightGCN repos |
| MIND-large (news) | 1M users · ~160K news · >15M impressions | title, abstract, body, category, entities | Active: MS research license |
| MIND-small | 50K sampled users (train+dev) | a subset of MIND-large | Active |
| Book-Crossing | 1.15M ratings · 279K users · 271K books | explicit 1–10 + implicit (0); age/location | Active: GroupLens |
| Goodreads (UCSD) | ~228M interactions · 876K users · 2.36M books | reviews, shelves, spoiler tags | Active: academic |
| Amazon-Books | = Amazon Reviews “Books” category | (see the Amazon table) | Active |

2. The standard graph-CF trio: Gowalla, Yelp2018, Amazon-Book

The graph chapters (From Graphs to LightGCN, SSL & Contrastive Learning) — and nearly every graph-CF paper — report on the same three datasets: Gowalla, Yelp2018, and Amazon-Book. This is not a coincidence, and knowing why saves a beginner real confusion:

  1. They are implicit / one-class. Graph CF models the user–item interaction graph; explicit star ratings are discarded and every observed pair is a binary edge. All three are pure interaction logs, not rating-prediction tasks.
  2. They span a sparsity range (~\(0.06\%\)–\(0.13\%\) density) and three domains (check-ins, local-business, books), so a method cannot win by overfitting one density regime.
  3. The preprocessing is frozen. Everyone reuses the same 10-core-filtered, fixed train/test split released with NGCF (the paper LightGCN built on). That shared split — not any statistical superiority — is the real reason the trio dominates: it lets papers quote each other’s numbers as directly comparable.
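The density figure in point 2 is just interactions divided by users times items. A minimal sketch, assuming the rounded Yelp2018 and Amazon-Book counts quoted in §1 (the function name `density` is illustrative, and the inputs are approximate, so the outputs are too):

```python
def density(n_users: int, n_items: int, n_interactions: int) -> float:
    """Fraction of the user x item matrix that is actually observed."""
    return n_interactions / (n_users * n_items)

# Rounded counts taken from the tables in section 1 (approximate, not exact).
trio_counts = {
    "Yelp2018": (31_700, 38_000, 1_560_000),
    "Amazon-Book": (52_600, 91_600, 3_000_000),
}

for name, (users, items, interactions) in trio_counts.items():
    pct = 100 * density(users, items, interactions)
    print(f"{name}: {pct:.2f}% dense")
# Yelp2018: 0.13% dense
# Amazon-Book: 0.06% dense
```

Both matrices are more than 99.8% empty, which is why a method tuned to one density regime is not automatically good on another.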

Why this matters (and its catch). Comparability is the whole point, but the frozen split is itself a known hazard (§3): it was made by a random rather than a temporal cut (§4), so every model trained on it inherits the same future-leak. Comparable is not the same as correct.


3. Splitting without leaking the future (the keystone)

A model is graded by holding out some interactions as a test set. How you choose them decides whether the score is honest. Two protocols dominate, and the difference is the single most important idea in this appendix.

  • Random split — hold out a random fraction of all interactions. Simple, and wrong for any data with time: it can put a user’s later interactions in train and an earlier one in test, so the model trains on the future to predict the past. The score inflates.
  • Temporal / leave-one-last split — for each user, sort by time and hold out their chronologically last interaction (and often the second-last for validation). This grades the model the way it is used: predict the next thing from everything before it. It is the honest default, and mandatory for sequential models (Sequential & Session-Based Recommendation §6).
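The difference between the two protocols can be checked mechanically. The sketch below is an illustration on toy data, not a reference implementation: `random_split` and `leaks_future` are hypothetical helper names, and the log is invented. It flags a split as leaky when some user has a training interaction that is later than one of their test interactions.

```python
import random

# Toy log of (user, item, timestamp); invented for illustration.
log = [
    ("U1", "A", 1), ("U1", "B", 2), ("U1", "C", 3),
    ("U2", "A", 1), ("U2", "B", 2),
]

def random_split(interactions, test_fraction, seed):
    """Hold out a random fraction of all interactions, ignoring time."""
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def leaks_future(train, test):
    """True if any user has a training interaction later than a test one."""
    earliest_test = {}
    for user, _, t in test:
        earliest_test[user] = min(t, earliest_test.get(user, t))
    return any(
        user in earliest_test and t > earliest_test[user]
        for user, _, t in train
    )

# A temporal split written out by hand: each user's last interaction is the test item.
temporal_test = [("U1", "C", 3), ("U2", "B", 2)]
temporal_train = [row for row in log if row not in temporal_test]

leaky = sum(leaks_future(*random_split(log, 0.4, seed)) for seed in range(100))
print("random splits that leak:", leaky, "of 100")
print("temporal split leaks:", leaks_future(temporal_train, temporal_test))  # False
```

On a log this small, most random draws leak; the temporal split cannot, by construction.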

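Almost always, a k-core filter runs first: repeatedly drop users and items with fewer than \(k\) interactions until every remaining user and item has at least \(k\) (papers commonly use \(k=10\); we use \(k=2\) to keep the example tiny). Dropping an item can push a user below \(k\), and vice versa, which is why the filter has to iterate. This removes the noisiest cold cases so the comparison is stable.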

Worked example — k-core then leave-one-last, by hand. Four users over five items, with timestamps \(t\):

| User | Interactions (item@time) |
|---|---|
| U1 | A@1, B@2, C@3 |
| U2 | A@1, B@2 |
| U3 | A@1, C@2, D@3 |
| U4 | E@1 |

Step 1 — 2-core filter. Count interactions: users U1=3, U2=2, U3=3, U4=1; items A=3, B=2, C=2, D=1, E=1. Drop everything below \(2\): out goes U4 (only one interaction) and items D and E. Re-check what remains — U1=3, U2=2, U3=2; A=3, B=2, C=2 — all \(\ge 2\), so we stop (the filter is iterative; here one pass suffices).
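Step 1 can be reproduced in a few lines. This is a minimal sketch of an iterative k-core filter on the worked example; the function name `k_core` and the tuple layout `(user, item, time)` are choices made here for illustration, not a standard API.

```python
from collections import Counter

# The worked example as (user, item, time) tuples.
interactions = [
    ("U1", "A", 1), ("U1", "B", 2), ("U1", "C", 3),
    ("U2", "A", 1), ("U2", "B", 2),
    ("U3", "A", 1), ("U3", "C", 2), ("U3", "D", 3),
    ("U4", "E", 1),
]

def k_core(rows, k):
    """Repeatedly drop users and items with fewer than k interactions."""
    while True:
        user_count = Counter(user for user, _, _ in rows)
        item_count = Counter(item for _, item, _ in rows)
        kept = [
            row for row in rows
            if user_count[row[0]] >= k and item_count[row[1]] >= k
        ]
        if len(kept) == len(rows):   # nothing was dropped: fixed point reached
            return kept
        rows = kept

core = k_core(interactions, k=2)
print(sorted({user for user, _, _ in core}))  # ['U1', 'U2', 'U3']
print(sorted({item for _, item, _ in core}))  # ['A', 'B', 'C']
print(len(core))                              # 7
```

The loop runs twice here: the first pass removes U4, D and E, and the second pass confirms that nothing else falls below 2.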

Step 2 — leave-one-last. For each surviving user, hold out the last by time:

\[ \text{TEST}=\{\text{U1}\!:\!\text{C@3},\ \text{U2}\!:\!\text{B@2},\ \text{U3}\!:\!\text{C@2}\},\qquad \text{TRAIN}=\{\text{U1}\!:\!\text{A,B},\ \text{U2}\!:\!\text{A},\ \text{U3}\!:\!\text{A}\}. \]

Note the subtlety that trips people up: U3’s chronologically last interaction was D@3 — but D was removed by the k-core filter, so U3’s held-out test item is the next-latest survivor, C@2. Filtering happens before splitting; the order matters.
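The order-of-operations point is easy to verify in code. A small sketch, assuming the seven interactions that survive the 2-core filter are already in hand (the helper name `leave_one_last` and the dictionary output shape are illustrative choices):

```python
# Interactions that survive the 2-core filter in the worked example.
filtered = [
    ("U1", "A", 1), ("U1", "B", 2), ("U1", "C", 3),
    ("U2", "A", 1), ("U2", "B", 2),
    ("U3", "A", 1), ("U3", "C", 2),
]

def leave_one_last(rows):
    """Return (train, test); test maps each user to their latest (item, time)."""
    latest = {}
    for user, item, t in rows:
        if user not in latest or t > latest[user][1]:
            latest[user] = (item, t)
    train = {}
    for user, item, t in rows:
        if (item, t) != latest[user]:
            train.setdefault(user, []).append(item)
    return train, latest

train, test = leave_one_last(filtered)
print(test)   # {'U1': ('C', 3), 'U2': ('B', 2), 'U3': ('C', 2)}
print(train)  # {'U1': ['A', 'B'], 'U2': ['A'], 'U3': ['A']}

# Splitting BEFORE filtering would have picked the item the filter later removes.
_, unfiltered_test = leave_one_last(filtered + [("U3", "D", 3)])
print(unfiltered_test["U3"])  # ('D', 3)
```

Running the split on the unfiltered log would hold out D@3 for U3, an item that no longer exists after filtering, which is why the filter has to come first.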

Figure B.1: k-core then leave-one-last, on the worked example. U4 and items D, E are removed by the 2-core filter (U4 and D@3 shown only by their absence). For each surviving user the last interaction by time (orange) is the test target; everything earlier (teal) is training. U3’s true last item, D@3, was filtered out before the split, so its test item is the next-latest survivor, C@2 — filtering precedes splitting.

Why not just split randomly? Because recommendation is a forecast. A random split that trains on U1’s C@3 and tests on U1’s A@1 is grading the model on a question it has already been shown the answer to. Every honest evaluation in this book uses a temporal hold-out.


4. The benchmark map: which model, which dataset, which protocol

The sections so far gave you the datasets (§1), the trio everyone shares (§2), and how to split without leaking the future (§3). This section assembles them into the map the rest of the literature actually lives on: for each research community, the datasets it runs on, the models most often cited on them, and the one evaluation protocol they all share. Read it alongside the LLM × RecSys: the Landscape chapter, which surveys the models in depth. Together they answer, in one place, the question a practitioner or reviewer asks first: which model, on which dataset, under which protocol?

One idea organizes the whole map, and it is the opposite of what a beginner expects: the protocol is fixed by the research community, not by the dataset. MovieLens-1M alone is evaluated three incompatible ways — as RMSE rating-prediction (the Netflix-Prize lineage), as leave-one-out ranking against 99 sampled negatives (the NCF lineage), and as leave-one-out next-item prediction against 100 sampled negatives (the SASRec / BERT4Rec lineage). A number is therefore comparable to another only when the regime, the split, and the ranking scope all match — the concrete form of the reproducibility warning §5 sharpens. (The metric definitions — Recall@\(K\), NDCG@\(K\), MRR, AUC, RMSE — and the sampled-vs-full argument live in Evaluation Metrics; here they are only labels.)
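The comparability rule can be written down as a three-field equality check. The sketch below is only a bookkeeping aid: the `Protocol` class and the string labels are invented here, and the three instances simply restate the MovieLens-1M protocols named in the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    regime: str   # research community, e.g. "rating prediction"
    split: str    # how the test set was held out
    scope: str    # what the held-out item is ranked against

def comparable(a: Protocol, b: Protocol) -> bool:
    """Two reported numbers are comparable only if all three fields match."""
    return a == b

# The three MovieLens-1M protocols described in the text.
ml1m_rating = Protocol("rating prediction", "random", "star value")
ml1m_ncf = Protocol("implicit top-N", "leave-one-out", "99 sampled negatives")
ml1m_seq = Protocol("sequential", "leave-one-out", "100 sampled negatives")

print(comparable(ml1m_ncf, ml1m_seq))     # False: same split, different regime and scope
print(comparable(ml1m_rating, ml1m_ncf))  # False
print(comparable(ml1m_ncf, ml1m_ncf))     # True
```

Recording these three fields next to every number copied from a paper is a cheap way to avoid mixing regimes in one results table.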

The field’s benchmark map — each research community (regime) pairs a set of public datasets with its most-cited models and one shared evaluation protocol (split \(+\) ranking scope \(+\) metrics). The deciding fact: the protocol is set by the community, not the dataset — MovieLens-1M appears under three regimes with three protocols. Two numbers are comparable only when regime, split, and scope all match. Model definitions live in the chapters that introduce each family (Traditional Recommender Systems, Sequential & Session-Based Recommendation, From Graphs to LightGCN, LLM × RecSys); metric definitions live in Evaluation Metrics.
| Regime (community) | Typical datasets | Most-cited models (year) | Split | Ranking scope | Headline metrics |
|---|---|---|---|---|---|
| Rating prediction | Netflix Prize; MovieLens 100K/1M/10M | MF / SVD++ (’08), timeSVD++ (’09), RBM-CF (’07), BellKor (’09) | random or \(k\)-fold over ratings | predict the star value | RMSE (MAE) |
| Implicit top-\(N\), sampled | MovieLens-1M; Pinterest | NeuMF / NCF (’17) | leave-one-out (last by time) | vs. 99 sampled negatives | HR@10, NDCG@10 |
| Sequential, sampled | MovieLens-1M; Amazon Beauty/Games; Steam (BERT4Rec adds ML-20M) | GRU4Rec (’16), Caser (’18), SASRec (’18), BERT4Rec (’19) | leave-one-out | vs. 100 sampled negatives | HR@10, NDCG@10, MRR |
| Sequential, full | + Amazon Sports/Toys; Yelp | S3-Rec (’20), CL4SRec (’22), DuoRec (’22), FEARec (’23) | leave-one-out | full ranking (all items) | HR / NDCG @10,20, MRR |
| Graph CF (the trio) | Gowalla, Yelp2018, Amazon-Book (SGL adds iFashion; KGAT adds LastFM) | NGCF (’19), LightGCN (’20), DGCF (’20), SGL (’21), SimGCL (’22), LightGCL (’23) | random per-user 80/20, 10-core | full ranking (all items) | Recall@20, NDCG@20 |
| CTR prediction | Criteo, Avazu | Wide&Deep (’16), DeepFM (’17), DCN (’17), xDeepFM (’18), AutoInt (’19), DIN (’18) | random 8:1:1 | per-impression binary click | AUC, LogLoss |
| News | MIND | DKN (’18), NPA (’19), NAML (’19), LSTUR (’19), NRMS (’19) | impression / time-ordered | within each impression | AUC, MRR, nDCG@5/10 |
| LLM-augmented CF (this book) | ML-20M (genome subset); Amazon-Books; ML-1M | RLMRec R1/R1-plus/R2/R3, LightGCN + contrastive backbones, SASRec | global temporal (by date), 10-core | full ranking (all items) | NDCG / Recall @10,20,50, MRR; 5-seed paired-\(t\) |

A few precise points the table compresses, each worth knowing before you trust a leaderboard:

  • The sequential shift (the two sequential rows). Early next-item models ranked the held-out item against a small sample of 100 negatives (BERT4Rec’s drawn by item popularity; SASRec’s paper says only “randomly sampled”) and reported HR@10 / NDCG@10. After Krichene & Rendle (2020) showed sampled metrics can reverse which model looks better (§5), the field moved to full ranking over the whole catalogue; S3-Rec, CL4SRec, DuoRec and later work report unsampled metrics, and modern re-benchmarks re-score even SASRec and BERT4Rec that way. One model, two very different headline numbers — always check which.
  • NCF vs. sequential negatives. The NCF protocol pairs the test item with 99 sampled negatives (a 100-item list); the SASRec / BERT4Rec protocol uses 100 (a 101-item list). The one-item difference is real, and it is exactly the kind of detail that makes two “leave-one-out HR@10” numbers quietly non-comparable.
  • The trio’s split is random, not temporal (§2, §3). NGCF released a fixed per-user 80/20 split (10% of train held out for validation, 10-core filtered); LightGCN and every contrastive successor reuse it verbatim — which is precisely why their numbers are mutually quotable. But because the cut is random, it does not test temporal generalization: here “comparable” still is not “leak-free.”
  • Dataset lineage, kept straight. Alibaba-iFashion enters this list with SGL (2021), not the original trio; LastFM / Last-FM is the third member of KGAT’s knowledge-graph trio, not of the NGCF general-CF trio.
  • This book’s own studies (last row). The LLM × RecSys chapter’s experiments (LLM-MovieLens) sit in the graph-CF / full-ranking regime but deliberately adopt the leak-free protocol §3 argues for: a global temporal split by calendar date (not the trio’s random cut), full ranking over all items, NDCG / Recall @{10, 20, 50} and MRR averaged over five seeds with paired \(t\)-tests. It is a worked instance of doing it the honest way on a modern LLM-augmented setup.
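To see why the sampled and full-ranking rows produce such different headline numbers, it helps to rank one held-out item both ways. The sketch below uses an invented catalogue and scoring rule, and it draws negatives uniformly (BERT4Rec's popularity-weighted sampling would need item frequencies, which this toy does not have); `rank_of_target` and `hit_at_k` are illustrative names.

```python
import random

def rank_of_target(scores, target, candidates):
    """1-based rank of `target` among `candidates` plus itself (higher score wins)."""
    better = sum(1 for item in candidates if scores[item] > scores[target])
    return better + 1

def hit_at_k(rank, k=10):
    """HR@k contribution of a single test case."""
    return 1.0 if rank <= k else 0.0

# Toy catalogue of 1,000 items; item i gets score -i, so item 0 scores highest.
n_items = 1000
scores = {i: -float(i) for i in range(n_items)}
target = 49   # exactly 49 items score higher, so the full rank is 50
negatives = [i for i in range(n_items) if i != target]

full_rank = rank_of_target(scores, target, negatives)

rng = random.Random(0)
sampled = rng.sample(negatives, 100)   # 100 negatives, a 101-item list
sampled_rank = rank_of_target(scores, target, sampled)

print(full_rank, hit_at_k(full_rank))        # 50 0.0
print(sampled_rank, hit_at_k(sampled_rank))  # varies with the draw, never worse than 50
```

A sample of 100 negatives contains only a few of the 49 better-scoring items on average, so the same model and the same test case can register as a miss under full ranking and a hit under sampling.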

The protocol anchors — the papers whose numbers later work quotes itself against — are, by regime: rating prediction, the Netflix Prize and Koren’s matrix-factorization line; the sampled top-\(N\) regime, NCF (He et al., 2017); early sequential, SASRec (2018) and BERT4Rec (2019); the shift to full ranking, Krichene & Rendle (2020); graph CF, NGCF (2019, which released the split) and LightGCN (2020); CTR, Wide & Deep (2016) onward, standardized by the FuxiCTR open benchmark (2021); news, the MIND dataset paper (Wu et al., 2020). Full citations are in §9.

The same map indexed by dataset — the lookup for “I have this dataset (or I’m reading a paper that uses it): who runs on it, and how?” One dataset can host several regimes (MovieLens-1M is the clearest case, hence three protocols). Yelp2018 and Amazon-Book are the frozen NGCF-split versions, not the raw Yelp Open Dataset / Amazon Reviews dumps of §1. “LOO” \(=\) leave-one-out (hold out each user’s last interaction by time, §3); “sampled \(\to\) full” marks the regimes that began with sampled negatives and moved to full ranking after Krichene & Rendle (2020).
| Dataset | Feedback | Used by (regime) | Representative models | Protocol (split · scope) | Headline metric |
|---|---|---|---|---|---|
| MovieLens-1M | explicit \(\to\) implicit | rating · NCF · sequential | MF, NeuMF, SASRec, BERT4Rec | random ratings or LOO · sampled | RMSE or HR@10 / NDCG@10 |
| MovieLens-20M | explicit \(\to\) implicit | graph CF · this book · BERT4Rec | LightGCN, RLMRec, BERT4Rec | temporal · full | NDCG@10 / Recall@20 |
| Netflix Prize | explicit | rating prediction (history) | MF, SVD++, BellKor | fixed Probe/Quiz/Test · n/a | RMSE |
| Amazon-Book | implicit | graph CF (trio) | NGCF, LightGCN, SGL | random 80/20, 10-core · full | Recall@20 / NDCG@20 |
| Amazon Beauty / Sports / Toys | implicit | sequential | SASRec, BERT4Rec, S3-Rec | LOO · sampled \(\to\) full | HR@10 / NDCG@10 |
| Yelp2018 | implicit | graph CF (trio) | NGCF, LightGCN, SimGCL | random 80/20, 10-core · full | Recall@20 / NDCG@20 |
| Gowalla | implicit | graph CF (trio); POI | LightGCN, SGL, LightGCL | random 80/20, 10-core · full | Recall@20 / NDCG@20 |
| Steam | implicit | sequential | SASRec, BERT4Rec | LOO · sampled \(\to\) full | HR@10 / NDCG@10 |
| LastFM / Last-FM | implicit | knowledge-graph CF | KGAT | random 80/20, 10-core · full | Recall@20 / NDCG@20 |
| MIND | implicit click | news | NRMS, NAML, LSTUR | impression / time · within-impression | AUC / MRR / nDCG@5,10 |
| Criteo / Avazu | implicit click | CTR | DeepFM, DCN, xDeepFM | random 8:1:1 · per-impression | AUC / LogLoss |

How to use the two tables. Read the benchmark map by regime when you are choosing how to run an experiment: pick the regime that matches your task and inherit its split, scope, and metrics wholesale, so your numbers join an existing conversation instead of starting a private one. Use the dataset index when you are reading a paper or holding a dataset: look up what its community does, and you will know whether a headline number is full-ranking or sampled, temporal or random, before you decide how much to trust it.

Figure B.2: Choosing — and reading — an evaluation protocol. Explicit-rating tasks split at random and report RMSE; implicit-feedback tasks should split by time (§3) and then rank over the full catalogue (Recall/NDCG@\(K\), the honest modern default) rather than against a handful of sampled negatives (HR/NDCG@10), which Krichene & Rendle (2020) showed can reverse model rankings. Each ranking regime in the map above is a leaf of this tree; the CTR and news regimes split differently (random or impression-based) and report AUC-family metrics instead.

5. Two evaluation traps the data side creates

Beyond the split, two well-documented pitfalls live on the data side; both are reasons to distrust a headline number.

  • Popularity bias in evaluation. A few blockbusters collect most test interactions (the long tail of From Graphs to LightGCN §13), so a recommender that simply pushes popular items scores well on Recall/NDCG without personalizing at all. Report a popularity-aware view (coverage, tail-Recall) alongside the headline metric.
  • The reproducibility critique. Ferrari Dacrema, Cremonesi & Jannach (“Are we really making much progress?”, RecSys 2019) re-evaluated a wave of neural recommenders and found most could not be reproduced, and the reproducible ones were often beaten by a well-tuned classical baseline (item-kNN, matrix factorization). The lesson for a from-zero reader: a strong, tuned baseline is mandatory, and a gain is meaningless without identical splits, full-ranking metrics (the sampled-vs-full caution of Evaluation Metrics), and released code.
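Both traps can be made concrete with a non-personalized baseline. The following sketch uses invented toy data and hypothetical helper names (`most_popular`, `recall_at_k`, `coverage`); it is meant to show the shape of a popularity-aware report, not to prescribe one.

```python
from collections import Counter

# Toy data: training items per user and one held-out test item per user.
train = {"U1": ["A", "B"], "U2": ["A"], "U3": ["A", "C"], "U4": ["B"]}
test = {"U1": "C", "U2": "B", "U3": "B", "U4": "A"}
catalogue = ["A", "B", "C", "D", "E"]

def most_popular(train, user, k):
    """Recommend the k most-interacted items the user has not seen yet."""
    counts = Counter(item for items in train.values() for item in items)
    ranked = [item for item, _ in counts.most_common() if item not in train[user]]
    return ranked[:k]

def recall_at_k(recs, test):
    """Share of users whose held-out item appears in their list."""
    hits = sum(1 for user, item in test.items() if item in recs[user])
    return hits / len(test)

def coverage(recs, catalogue):
    """Share of the catalogue that appears in at least one list."""
    shown = {item for items in recs.values() for item in items}
    return len(shown) / len(catalogue)

recs = {user: most_popular(train, user, k=1) for user in train}
print(recall_at_k(recs, test))    # 1.0
print(coverage(recs, catalogue))  # 0.6
```

On this toy set the popularity baseline gets every test item right while never surfacing D or E. The headline metric alone would hide that, which is the case for reporting coverage beside it and for always including such a baseline in the comparison.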

6. How the field gets data today

You rarely parse a raw dump by hand any more — the work has moved into toolkits and metadata standards:

  • Curated catalogs. Julian McAuley’s Recommender Systems Datasets page (UCSD) is the community’s de-facto index; several GitHub “awesome-datasets” lists mirror it.
  • Benchmark frameworks ship datasets in a unified format and run dozens of models on them under one protocol — the modern way to “present” a dataset: RecBole (~40 datasets, 90+ models, the .inter/.user/.item atomic format), Cornac (multimodal/side-info focus), Elliot (reproducibility-first: one config drives models + HPO + metrics + significance tests), Microsoft Recommenders, and LensKit.
  • Hosting & metadata. HuggingFace Datasets is now a primary host (Amazon-Reviews-2023 lives there with standard splits); Croissant (MLCommons, 2024) is the emerging machine-readable metadata standard — a record now required for NeurIPS dataset submissions.

In practice. Don’t hand-roll loading and splitting. Pick a framework (RecBole or Elliot for research comparability; Microsoft Recommenders for an end-to-end pipeline), use its built-in temporal split, and report the framework + version so others can reproduce you. If you must publish a dataset, attach a Croissant record.
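As an illustration of what "use its built-in temporal split" might look like, here is a sketch of a RecBole-style configuration. The keys are those of RecBole 1.x as best recalled and should be treated as assumptions to check against the documentation of the installed version. The model (`BPR`), dataset (`ml-100k`), cutoffs and seed are arbitrary choices for the example.

```python
from importlib.metadata import version

# Assumed RecBole 1.x configuration keys; verify against the installed release.
config = {
    # k-core style filtering: keep users and items with at least 10 interactions.
    "user_inter_num_interval": "[10,inf)",
    "item_inter_num_interval": "[10,inf)",
    "load_col": {"inter": ["user_id", "item_id", "timestamp"]},
    "eval_args": {
        "group_by": "user",
        "order": "TO",                      # time-ordered instead of random ("RO")
        "split": {"LS": "valid_and_test"},  # leave-one-out for validation and test
        "mode": "full",                     # rank against the whole catalogue
    },
    "metrics": ["Recall", "NDCG"],
    "topk": [10, 20],
    "valid_metric": "NDCG@10",
    "seed": 2020,
}

def main():
    """Train and evaluate one model; call this only where RecBole is installed."""
    # Imported lazily so the config can be inspected without the dependency.
    from recbole.quick_start import run_recbole

    print("recbole version:", version("recbole"))  # report this with your results
    run_recbole(model="BPR", dataset="ml-100k", config_dict=config)
```

Keeping the split, scope and metrics in one dictionary makes the protocol itself something that can be published next to the numbers.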


7. Active, deprecated, and a licensing caution

Famous datasets split into two groups, and a from-zero reader should know both: the active ones you actually build on, and the deprecated / withdrawn ones that shaped the field but are no longer safe — or even possible — to use. The withdrawn ones are still worth knowing by name (you will meet them in older papers); they are never worth training on.

  • Netflix Prize (2006) — the dataset that launched modern collaborative-filtering research, withdrawn after researchers de-anonymized its users and a privacy lawsuit (Doe v. Netflix) followed. There is no official download; unofficial copies circulate, but cite it as history, do not train on it.
  • LastFM LFM-2b / LFM-1b — the large listening-history sets are no longer officially distributed (“license issues”; LFM-1b was historically the more available of the two). Treat as restricted and check current terms.
  • Yahoo! Webscope (e.g. the R3 music-ratings set) — still available, but only by signed agreement (academic use). R3 remains the classic missing-not-at-random benchmark — it pairs users’ organic ratings with a randomly-surveyed set — so it recurs in debiasing work despite the access gate.

Most active datasets — MovieLens, Gowalla, Yelp, Amazon, MIND, Steam, Book-Crossing, Goodreads — are free for research, non-commercial use with attribution (Yelp’s license is notably revocable). Always read the specific license before any commercial use.


8. Glossary

| Term | Plain meaning |
|---|---|
| Explicit / implicit feedback | Stated ratings (1–5) vs. behavioural \(0/1\) (click/buy); the latter is positives-only. |
| k-core filter | Iteratively drop users/items with \(<k\) interactions until every remaining user and item has at least \(k\); stabilizes the data. |
| Random split | Hold out a random fraction of interactions; leaks the future for time-ordered data. |
| Temporal / leave-one-last split | Hold out each user’s chronologically last interaction; the honest, leak-free default. |
| Graph-CF trio | Gowalla / Yelp2018 / Amazon-Book: the frozen-split benchmark of graph-CF papers. |
| Popularity bias (in eval) | Popular items dominate the test set, so pushing them inflates accuracy metrics. |
| Benchmark framework | RecBole / Cornac / Elliot / Microsoft Recommenders / LensKit: unified datasets + models + protocol. |
| Croissant | MLCommons machine-readable dataset-metadata standard (2024); required for NeurIPS dataset tracks. |

9. References

Ordered alphabetically. The benchmark-map entries (§4) are grouped here with the dataset and reproducibility references; each carries a short note on what it anchors.

  • Cheng, H.-T., Koc, L., Harmsen, J., et al. (2016). Wide & Deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS @ RecSys). arXiv:1606.07792 (the CTR wide-and-deep anchor)

  • Ferrari Dacrema, M., Cremonesi, P., & Jannach, D. (2019). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys). https://doi.org/10.1145/3298689.3347058

  • Harper, F. M., & Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4). https://doi.org/10.1145/2827872

  • He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering (NCF / NeuMF). In Proceedings of WWW. arXiv:1708.05031 (the leave-one-out + 99-sampled-negatives top-\(N\) protocol)

  • He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., & Wang, M. (2020). LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of SIGIR. arXiv:2002.02126 (reuses NGCF’s split; the de-facto graph-CF backbone)

  • Kang, W.-C., & McAuley, J. (2018). Self-attentive sequential recommendation (SASRec). In Proceedings of ICDM. arXiv:1808.09781 (leave-one-out + 100 sampled negatives)

  • Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37. https://doi.org/10.1109/MC.2009.263 (the rating-prediction MF line behind the Netflix Prize)

  • Krichene, W., & Rendle, S. (2020). On sampled metrics for item recommendation. In Proceedings of KDD. https://doi.org/10.1145/3394486.3403226 (the case against sampled metrics — why the field moved to full ranking)

  • Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy. https://doi.org/10.1109/SP.2008.33 (the Netflix Prize de-anonymization)

  • Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang, P. (2019). BERT4Rec: Sequential recommendation with bidirectional encoder representations from Transformer. In Proceedings of CIKM. arXiv:1904.06690 (leave-one-out + 100 popularity-sampled negatives)

  • Wang, X., He, X., Cao, Y., Liu, M., & Chua, T.-S. (2019). KGAT: Knowledge graph attention network for recommendation. In Proceedings of KDD. arXiv:1905.07854 (the knowledge-graph trio: Amazon-Book / LastFM / Yelp2018)

  • Wang, X., He, X., Wang, M., Feng, F., & Chua, T.-S. (2019). Neural graph collaborative filtering (NGCF). In Proceedings of SIGIR. arXiv:1905.08108 (released the random per-user 80/20 split the graph-CF trio reuses verbatim)

  • Wu, F., Qiao, Y., Chen, J.-H., et al. (2020). MIND: A large-scale dataset for news recommendation. In Proceedings of ACL. https://aclanthology.org/2020.acl-main.331 (the news-recommendation impression-based protocol)

  • Wu, J., Wang, X., Feng, F., He, X., Chen, L., Lian, J., & Xie, X. (2021). Self-supervised graph learning for recommendation (SGL). In Proceedings of SIGIR. arXiv:2010.10783 (adds Alibaba-iFashion to the graph-CF trio)

  • Zhao, W. X., et al. (2021). RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of CIKM. arXiv:2011.01731

  • Zhu, J., Liu, J., Yang, S., Zhang, Q., & He, X. (2021). Open benchmarking for click-through rate prediction (FuxiCTR). In Proceedings of CIKM. arXiv:2009.05794 (standardized the Criteo / Avazu 8:1:1 split + AUC / LogLoss)

Online sources verified June 2026.