LLM Primer — The Modern AI Reference 2025

From math to models — primers and practice.

LLM Primer is a living reference for modern AI. We focus on clarity, citations, and reproducible examples — from core mathematics to transformers, diffusion, RL, and frontier topics.


Transformer & Foundation Timeline (Consolidated to Sept 2025)

  • 1997 — “Long Short-Term Memory” (LSTM)
    Contribution: Introduced the LSTM recurrent neural network, critical for sequential data processing and learning long-term dependencies, addressing the vanishing gradient problem in RNNs.

  • 2013 — “Efficient Estimation of Word Representations in Vector Space” (Word2Vec)
    Contribution: Presented Word2Vec, a method to learn dense vector representations (embeddings) of words that capture semantic relationships.

  • 2014 — “GloVe: Global Vectors for Word Representation” (GloVe)
    Contribution: Learned word vectors by aggregating global word-word co-occurrence statistics from a corpus; an alternative to Word2Vec.

  • 2014 — “Sequence to Sequence Learning with Neural Networks” / “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” (Seq2Seq)
    Contribution: Introduced the encoder–decoder framework for sequence-to-sequence tasks (often LSTM/GRU-based).

  • 2015 — “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau Attention)
    Contribution: Added attention to seq2seq, letting the decoder focus on relevant input tokens—crucial for long sequences.


  • 2017 — “Attention Is All You Need” (Transformer)
    Contribution: Replaced recurrence/convolutions with self-attention, enabling massive parallelism and scalability (a scaled dot-product attention sketch follows this list).

  • 2018 — “Improving Language Understanding by Generative Pre-Training” (GPT-1)
    Contribution: Introduced the GPT paradigm and the pretrain→finetune recipe for generation.

  • 2018 — “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (BERT)
    Contribution: Bidirectional pretraining (masked language modeling plus next-sentence prediction) that reset SOTA across many NLP benchmarks.

  • 2019 — “Language Models are Unsupervised Multitask Learners” (GPT-2)
    Contribution: Showed strong zero-shot abilities from large autoregressive LMs; sparked capability and safety debates.

  • 2019 — “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (T5)
    Contribution: Unified tasks in a text-to-text format, simplifying transfer across problems.

  • 2020 — “Language Models are Few-Shot Learners” (GPT-3)
    Contribution: Popularized few-shot in-context learning with a 175B-parameter model.

  • 2020 — “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (ViT)
    Contribution: Brought transformers to vision; competitive with CNNs.

  • 2020 — “Scaling Laws for Neural Language Models”
    Contribution: Empirical power laws relating model size, data, and compute to performance, guiding scale-up strategy (see the schematic formulas after this list).
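
The 2017 “Attention Is All You Need” entry above rests on scaled dot-product attention: softmax(Q Kᵀ / √d_k) V. Below is a minimal single-head NumPy sketch with toy shapes and no masking or batching; it illustrates the formula, not a full transformer layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are (seq_len, d_k) arrays; masking and batching are omitted."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # attention-weighted mix of value vectors

# Toy usage: 4 tokens, 8-dimensional projections
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)       # shape (4, 8)
```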
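
The 2020 scaling-laws entry fits test loss as a power law in model size N, dataset size D, and compute C. The schematic form below follows the paper's presentation; the constants and exponents are empirical fits and vary by setup.

```latex
% Schematic power-law scaling of loss; N_c, D_c, C_c and the exponents are empirical fits.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```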


2021 — Efficiency, Multimodality, and PEFT

  • “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” (Switch Transformer, MoE)
    Contribution: Sparse Mixture-of-Experts routing for efficient scaling toward trillion-parameter models (a top-1 routing sketch follows this list).

  • “Learning Transferable Visual Models From Natural Language Supervision” (CLIP)
    Contribution: Text–image alignment for zero-shot classification and multimodal understanding.

  • “LoRA: Low-Rank Adaptation of Large Language Models” (LoRA)
    Contribution: Parameter-efficient finetuning via low-rank adapters; large cost and memory savings (a minimal sketch follows this list).
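
LoRA, the last entry above, freezes a pretrained weight W and learns a low-rank update ΔW = BA with rank r much smaller than the weight's dimensions. The NumPy sketch below uses illustrative shapes, initialization, and scaling; real implementations apply this per attention/MLP projection.

```python
import numpy as np

d, k, r = 512, 512, 8        # frozen weight is d x k; adapter rank r << d, k
alpha = 16.0                 # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))              # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, k))  # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init so the update starts at 0

def lora_forward(x):
    """y = x (W + (alpha / r) * B A)^T; only A and B would receive gradients."""
    delta = (alpha / r) * (B @ A)        # low-rank update, d x k
    return x @ (W + delta).T

x = rng.normal(size=(2, k))              # batch of 2 inputs
y = lora_forward(x)                      # shape (2, d)
```

Only r·(d + k) adapter parameters are trained here instead of the d·k parameters of the full weight, which is where the cost and memory savings come from.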
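
The Switch Transformer entry above replaces one large feed-forward block with many expert blocks plus a router that sends each token to its top-1 expert. The toy NumPy sketch below uses random weights for the router and experts and omits the load-balancing loss; it only illustrates the routing pattern.

```python
import numpy as np

n_experts, d_model, n_tokens = 4, 16, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))            # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model))   # one toy "FFN" per expert

logits = tokens @ router_w                                  # router scores per token
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                              # top-1 expert per token

out = np.empty_like(tokens)
for e in range(n_experts):
    mask = chosen == e
    # each token is processed by exactly one expert, scaled by its router probability
    out[mask] = (tokens[mask] @ expert_w[e]) * probs[mask, e:e + 1]
```

Because each token activates only one expert, parameter count grows with the number of experts while per-token compute stays roughly constant.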

2022 — Alignment, Reasoning, and Multimodal Scale

  • “Training language models to follow instructions with human feedback” (InstructGPT → ChatGPT)
    Contribution: RLHF to align models with user intent; the foundation for ChatGPT’s helpfulness and safety behavior (the reward-model objective is sketched after this list).

  • “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (CoT Prompting)
    Contribution: Step-by-step prompting markedly improves complex reasoning (a prompt example follows this list).

  • “PaLM: Scaling Language Modeling with Pathways” (PaLM)
    Contribution: Demonstrated extreme scale (540B parameters) using the Pathways system for efficient training.

  • “Flamingo: a Visual Language Model for Few-Shot Learning” (Flamingo)
    Contribution: Interleaved image–text modeling for multimodal few-shot tasks.
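
The InstructGPT entry above first trains a reward model on human preference pairs and then optimizes the policy against it with RL. The reward model's pairwise objective, with y_w the preferred and y_l the rejected response to prompt x:

```latex
% Pairwise preference loss for the reward model r_theta (sigma is the logistic function)
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
  \left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]
```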
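
Chain-of-thought prompting, also from the list above, simply includes worked reasoning in the few-shot exemplars (or asks for step-by-step reasoning zero-shot). A short illustrative prompt in the style of the paper's arithmetic examples:

```python
# Illustrative few-shot chain-of-thought prompt; the exemplar shows its reasoning steps.
prompt = """Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
A: They started with 23 apples. After using 20, they had 23 - 20 = 3. Buying 6 more gives 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A:"""  # the model is expected to continue with its own reasoning before the final answer
```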


2023 — Open-Weight Momentum & Practical Efficiency

  • Feb — “LLaMA: Open and Efficient Foundation Language Models” (LLaMA)
    Contribution: Open-weight family on public data; catalyzed the open ecosystem.

  • Mar — “GPT-4 Technical Report” (GPT-4)
    Contribution: Multimodal flagship; near human-level performance across many evaluations.

  • May — “QLoRA: Efficient Finetuning of Quantized LLMs” (QLoRA)
    Contribution: Combined 4-bit NF4 quantization of the frozen base model with LoRA adapters, enabling finetuning on consumer-grade GPUs (a configuration sketch follows this list).

  • Jul — “Llama 2: Open Foundation and Fine-Tuned Chat Models” (Llama 2)
    Contribution: Improved open-weight family for research/commercial use; safety/alignment write-ups.

  • “Constitutional AI: Harmlessness from AI Feedback” (Anthropic; paper December 2022)
    Contribution: Alignment via AI feedback guided by an explicit constitution; an alternative to human feedback for harmlessness, and the basis of Claude’s 2023 releases.

  • “GENIE” (diffusion language model)
    Contribution: Early large-scale Diffusion Language Model (DLM) showing diffusion’s viability for text generation.
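
A QLoRA run as in the May entry above is usually assembled from the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below is illustrative: the model id is a placeholder, target module names vary by architecture, and argument names may differ slightly across library versions.

```python
# Illustrative QLoRA setup: 4-bit NF4 base weights + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",               # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adjust to the architecture being tuned
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```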


2024 — Proliferation of Open-Weights & Multimodality

  • Rise of Open-Weight Models (distinguish from true open-source):
    Llama 3 (Meta) — 8B–70B; strong reasoning/coding.
    Mixtral 8×22B (Mistral AI) — Sparse MoE; Apache-2.0.
    Grok-1 (xAI) — ~314B; Apache-2.0. Grok-1.5V adds vision.
    DeepSeek-V2 (DeepSeek) — Efficient 236B MoE.
    Qwen1.5 series (Alibaba Cloud) — 7B–110B range.
    Phi-3 Mini (Microsoft) — 3.8B small model; MIT license.
    Gemma (Google) — lightweight family inspired by Gemini.
    Falcon 2 (TII) — 11B; Apache-2.0.

  • May — “GPT-4o” (OpenAI)
    Contribution: “Omnimodal” realtime model for seamless text–audio–vision interaction.

  • “LLaDA: An 8B Pre-trained Diffusion Language Model” (LLaDA)
    Contribution: 8B DLM competitive with similar-size autoregressive models—further validation of diffusion for language.

  • Late 2024 — OpenAI o-series (o1-preview, o1; o3 announced)
    Contribution: Reasoning-specialized models targeting math/science/coding; marked the shift toward models that spend extra inference-time compute on deliberate reasoning.


2025 (to Sept) — Specialized Reasoning, Agents & New Paradigms

  • Early 2025 — DeepSeek R1
    Contribution: Strong complex math/code performance via reinforcement learning on verifiable tasks (a toy sketch of the reward setup follows this list).

  • Diffusion Language Models (DLMs) gain traction
    Mercury (Inception Labs) — “First commercial-grade” DLM; fast non-autoregressive text generation.
    Dream-7B — Instruction-tuned DLM; iterative refinement helps reasoning (a toy denoising-loop sketch follows this list).

  • Advanced Reasoning Focus
    How reasoning models work: internal chain-of-thought/tree-of-thought style decomposition; training with reinforcement learning and self-reflection; tool use (calculators, code interpreters) for reliability.
    EURUS: Open-source suite (from Mistral-7B / CodeLlama-70B bases) specialized for mathematical/logical/code reasoning.

  • Autonomous AI Agents
    Contribution: Systems capable of multi-step planning and execution beyond chat; rapid growth in agent frameworks and benchmarks.

  • Data-Centric AI & Compression
    Contribution: Shifting efficiency focus from purely model-centric scaling to data curation/pipeline optimization (e.g., “Shifting AI Efficiency From Model-Centric to Data-Centric Compression”).

  • Ongoing Open-Source Momentum
    Qwen3 Technical Report (Alibaba Cloud): Continued progress in open-weight LLMs and tooling across languages and modalities.
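
Masked diffusion language models such as those above generate by starting from a fully masked sequence and filling in tokens over a handful of denoising steps instead of decoding left to right. The toy sketch below uses a random stand-in denoiser purely to show the loop structure; in a real DLM the denoiser is a trained transformer and the unmasking schedule is learned or tuned.

```python
import numpy as np

# Toy sketch of iterative unmasking: predict every position each step,
# then commit only the most confident masked positions.
rng = np.random.default_rng(0)
vocab_size, seq_len, n_steps = 50, 12, 4
MASK = -1
seq = np.full(seq_len, MASK)

def denoiser(seq):
    """Stand-in for the model: per-position logits over the vocabulary."""
    return rng.normal(size=(len(seq), vocab_size))

for step in range(n_steps):
    masked = np.where(seq == MASK)[0]
    if masked.size == 0:
        break
    logits = denoiser(seq)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)                              # best-token confidence per position
    n_commit = int(np.ceil(masked.size / (n_steps - step)))      # unmask a fraction each step
    commit = masked[np.argsort(-confidence[masked])[:n_commit]]  # most confident masked slots
    seq[commit] = probs[commit].argmax(axis=-1)

print(seq)  # every position is filled after n_steps
```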
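
The reasoning entries above (DeepSeek R1 and the reinforcement-learning bullet) hinge on rewards that can be checked automatically, e.g. a math answer that either matches or does not. The toy below is a bandit-style stand-in, not the actual training recipe: the "policy" is a softmax over candidate answers to one arithmetic question, updated by REINFORCE with a group-relative baseline in the spirit of GRPO.

```python
import numpy as np

# Toy "RL on a verifiable task": reward is 1 only when a sampled answer passes the check.
rng = np.random.default_rng(0)
candidates = np.arange(10, 20)            # possible answers to "7 + 6 = ?"
logits = np.zeros(len(candidates))        # policy parameters
verify = lambda a: a == 13                # programmatic verifier

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

lr, group_size = 0.5, 8
for step in range(200):
    probs = softmax(logits)
    idx = rng.choice(len(candidates), size=group_size, p=probs)  # sample a group of answers
    rewards = np.array([1.0 if verify(candidates[i]) else 0.0 for i in idx])
    advantages = rewards - rewards.mean()                        # group-relative baseline
    grad = np.zeros_like(logits)
    for i, adv in zip(idx, advantages):
        one_hot = np.zeros(len(candidates))
        one_hot[i] = 1.0
        grad += adv * (one_hot - probs)                          # REINFORCE gradient for a softmax policy
    logits += lr * grad / group_size

print(candidates[softmax(logits).argmax()])                      # converges toward the verified answer, 13
```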