LLM Primer — The Modern AI Reference 2025

From math to models — primers and practice.

LLM Primer is a living reference for modern AI. We focus on clarity, citations, and reproducible examples — from core mathematics to transformers, diffusion, RL, and frontier topics.


Transformer & Foundation Timeline (Consolidated to Sept 2025)

  • 1997 — “Long Short-Term Memory” (LSTM)
    Contribution: Introduced the LSTM recurrent neural network, critical for sequential data processing and learning long-term dependencies, addressing the vanishing gradient problem in RNNs.

  • 2013 — “Efficient Estimation of Word Representations in Vector Space” (Word2Vec)
    Contribution: Presented Word2Vec, a method to learn dense vector representations (embeddings) of words that capture semantic relationships.

  • 2014 — “GloVe: Global Vectors for Word Representation” (GloVe)
    Contribution: Learned word vectors by aggregating global word-word co-occurrence statistics from a corpus; an alternative to Word2Vec.

  • 2014 — “Sequence to Sequence Learning with Neural Networks” / “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” (Seq2Seq)
    Contribution: Introduced the encoder–decoder framework for sequence-to-sequence tasks (often LSTM/GRU-based).

  • 2015 — “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau Attention)
    Contribution: Added attention to seq2seq, letting the decoder focus on relevant input tokens—crucial for long sequences.


  • 2017 — “Attention Is All You Need” (Transformer)
    Contribution: Replaced recurrence/convolutions with self-attention, enabling massive parallelism and scalability (a scaled dot-product attention sketch follows this list).

  • 2018 — “Improving Language Understanding by Generative Pre-Training” (GPT-1)
    Contribution: Introduced the GPT paradigm and the pretrain→finetune recipe for generation.

  • 2018 — “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (BERT)
    Contribution: Bidirectional pretraining (masked language modeling plus next-sentence prediction) that reset SOTA across many NLP benchmarks.

  • 2019 — “Language Models are Unsupervised Multitask Learners” (GPT-2)
    Contribution: Showed strong zero-shot abilities from large autoregressive LMs; sparked capability and safety debates.

  • 2019 — “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (T5)
    Contribution: Unified tasks in a text-to-text format, simplifying transfer across problems.

  • 2020 — “Language Models are Few-Shot Learners” (GPT-3)
    Contribution: Popularized few-shot in-context learning with a 175B-parameter model.

  • 2020 — “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” (ViT)
    Contribution: Brought transformers to vision; competitive with CNNs.

  • 2020 — “Scaling Laws for Neural Language Models”
    Contribution: Empirical power laws relating model size, data, and compute to performance, guiding scale-up strategy (see the schematic formulas after this list).
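
The 2017 “Attention Is All You Need” entry above rests on scaled dot-product attention: softmax(Q Kᵀ / √d_k) V. Below is a minimal single-head NumPy sketch with toy shapes and no masking or batching; it illustrates the formula, not a full transformer layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are (seq_len, d_k) arrays; masking and batching are omitted."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # attention-weighted mix of value vectors

# Toy usage: 4 tokens, 8-dimensional projections
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)       # shape (4, 8)
```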
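
The 2020 scaling-laws entry fits test loss as a power law in model size N, dataset size D, and compute C. The schematic form below follows the paper's presentation; the constants and exponents are empirical fits and vary by setup.

```latex
% Schematic power-law scaling of loss; N_c, D_c, C_c and the exponents are empirical fits.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```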


2021 — Efficiency, Multimodality, and PEFT

  • “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” (Switch Transformer, MoE)
    Contribution: Sparse Mixture-of-Experts routing for efficient scaling toward trillion-parameter models (a top-1 routing sketch follows this list).

  • “Learning Transferable Visual Models From Natural Language Supervision” (CLIP)
    Contribution: Text–image alignment for zero-shot classification and multimodal understanding.

  • “LoRA: Low-Rank Adaptation of Large Language Models” (LoRA)
    Contribution: Parameter-efficient finetuning via low-rank adapters; large cost and memory savings (a minimal sketch follows this list).
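
LoRA, the last entry above, freezes a pretrained weight W and learns a low-rank update ΔW = BA with rank r much smaller than the weight's dimensions. The NumPy sketch below uses illustrative shapes, initialization, and scaling; real implementations apply this per attention/MLP projection.

```python
import numpy as np

d, k, r = 512, 512, 8        # frozen weight is d x k; adapter rank r << d, k
alpha = 16.0                 # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))              # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, k))  # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init so the update starts at 0

def lora_forward(x):
    """y = x (W + (alpha / r) * B A)^T; only A and B would receive gradients."""
    delta = (alpha / r) * (B @ A)        # low-rank update, d x k
    return x @ (W + delta).T

x = rng.normal(size=(2, k))              # batch of 2 inputs
y = lora_forward(x)                      # shape (2, d)
```

Only r·(d + k) adapter parameters are trained here instead of the d·k parameters of the full weight, which is where the cost and memory savings come from.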
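
The Switch Transformer entry above replaces one large feed-forward block with many expert blocks plus a router that sends each token to its top-1 expert. The toy NumPy sketch below uses random weights for the router and experts and omits the load-balancing loss; it only illustrates the routing pattern.

```python
import numpy as np

n_experts, d_model, n_tokens = 4, 16, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))            # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model))   # one toy "FFN" per expert

logits = tokens @ router_w                                  # router scores per token
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                              # top-1 expert per token

out = np.empty_like(tokens)
for e in range(n_experts):
    mask = chosen == e
    # each token is processed by exactly one expert, scaled by its router probability
    out[mask] = (tokens[mask] @ expert_w[e]) * probs[mask, e:e + 1]
```

Because each token activates only one expert, parameter count grows with the number of experts while per-token compute stays roughly constant.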

2022 — Alignment, Reasoning, and Multimodal Scale

  • “Training language models to follow instructions with human feedback” (InstructGPT → ChatGPT)
    Contribution: RLHF to align models with user intent; the foundation for ChatGPT’s helpfulness and safety behavior (the reward-model objective is sketched after this list).

  • “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (CoT Prompting)
    Contribution: Step-by-step prompting markedly improves complex reasoning (a prompt example follows this list).

  • “PaLM: Scaling Language Modeling with Pathways” (PaLM)
    Contribution: Demonstrated extreme scale (540B parameters) using the Pathways system for efficient training.

  • “Flamingo: a Visual Language Model for Few-Shot Learning” (Flamingo)
    Contribution: Interleaved image–text modeling for multimodal few-shot tasks.
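
The InstructGPT entry above first trains a reward model on human preference pairs and then optimizes the policy against it with RL. The reward model's pairwise objective, with y_w the preferred and y_l the rejected response to prompt x:

```latex
% Pairwise preference loss for the reward model r_theta (sigma is the logistic function)
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
  \left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]
```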
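
Chain-of-thought prompting, also from the list above, simply includes worked reasoning in the few-shot exemplars (or asks for step-by-step reasoning zero-shot). A short illustrative prompt in the style of the paper's arithmetic examples:

```python
# Illustrative few-shot chain-of-thought prompt; the exemplar shows its reasoning steps.
prompt = """Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
A: They started with 23 apples. After using 20, they had 23 - 20 = 3. Buying 6 more gives 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A:"""  # the model is expected to continue with its own reasoning before the final answer
```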


2023 — Open-Weight Momentum & Practical Efficiency

  • Feb — “LLaMA: Open and Efficient Foundation Language Models” (LLaMA)
    Contribution: Open-weight family on public data; catalyzed the open ecosystem.

  • Mar — “GPT-4 Technical Report” (GPT-4)
    Contribution: Multimodal flagship; near human-level performance across many evaluations.

  • May — “QLoRA: Efficient Finetuning of Quantized LLMs” (QLoRA)
    Contribution: Combined 4-bit NF4 quantization of the frozen base model with LoRA adapters, enabling finetuning on consumer-grade GPUs (a configuration sketch follows this list).

  • Jul — “Llama 2: Open Foundation and Fine-Tuned Chat Models” (Llama 2)
    Contribution: Improved open-weight family for research/commercial use; safety/alignment write-ups.

  • “Constitutional AI: Harmlessness from AI Feedback” (Anthropic; paper December 2022)
    Contribution: Alignment via AI feedback guided by an explicit constitution; an alternative to human feedback for harmlessness, and the basis of Claude’s 2023 releases.

  • “GENIE” (diffusion language model)
    Contribution: Early large-scale Diffusion Language Model (DLM) showing diffusion’s viability for text generation.
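
A QLoRA run as in the May entry above is usually assembled from the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below is illustrative: the model id is a placeholder, target module names vary by architecture, and argument names may differ slightly across library versions.

```python
# Illustrative QLoRA setup: 4-bit NF4 base weights + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",               # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adjust to the architecture being tuned
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```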


2024 — Proliferation of Open-Weights & Multimodality

  • Rise of Open-Weight Models (distinguish from true open-source):
    Llama 3 (Meta) — 8B–70B; strong reasoning/coding.
    Mixtral 8×22B (Mistral AI) — Sparse MoE; Apache-2.0.
    Grok-1 (xAI) — ~314B; Apache-2.0. Grok-1.5V adds vision.
    DeepSeek-V2 (DeepSeek) — Efficient 236B MoE.
    Qwen1.5 series (Alibaba Cloud) — 7B–110B range.
    Phi-3 Mini (Microsoft) — 3.8B small model; MIT license.
    Gemma (Google) — lightweight family inspired by Gemini.
    Falcon 2 (TII) — 11B; Apache-2.0.

  • May — “GPT-4o” (OpenAI)
    Contribution: “Omnimodal” realtime model for seamless text–audio–vision interaction.

  • “LLaDA: An 8B Pre-trained Diffusion Language Model” (LLaDA)
    Contribution: 8B DLM competitive with similar-size autoregressive models—further validation of diffusion for language.

  • Late 2024 — OpenAI o-series (o1-preview, o1; o3 announced)
    Contribution: Reasoning-specialized models targeting math/science/coding; marked the shift toward models that spend extra inference-time compute on deliberate reasoning.


2025 (to Sept) — Specialized Reasoning, Agents & New Paradigms

  • Early 2025 — DeepSeek R1
    Contribution: Strong complex math/code performance via reinforcement learning on verifiable tasks (a toy sketch of the reward setup follows this list).

  • Diffusion Language Models (DLMs) gain traction
    Mercury (Inception Labs) — “First commercial-grade” DLM; fast non-autoregressive text generation.
    Dream-7B — Instruction-tuned DLM; iterative refinement helps reasoning (a toy denoising-loop sketch follows this list).

  • Advanced Reasoning Focus
    How reasoning models work: internal chain-of-thought/tree-of-thought style decomposition; training with reinforcement learning and self-reflection; tool use (calculators, code interpreters) for reliability.
    EURUS: Open-source suite (from Mistral-7B / CodeLlama-70B bases) specialized for mathematical/logical/code reasoning.

  • Autonomous AI Agents
    Contribution: Systems capable of multi-step planning and execution beyond chat; rapid growth in agent frameworks and benchmarks.

  • Data-Centric AI & Compression
    Contribution: Shifting efficiency focus from purely model-centric scaling to data curation/pipeline optimization (e.g., “Shifting AI Efficiency From Model-Centric to Data-Centric Compression”).

  • Ongoing Open-Source Momentum
    Qwen3 Technical Report (Alibaba Cloud): Continued progress in open-weight LLMs and tooling across languages and modalities.
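
Masked diffusion language models such as those above generate by starting from a fully masked sequence and filling in tokens over a handful of denoising steps instead of decoding left to right. The toy sketch below uses a random stand-in denoiser purely to show the loop structure; in a real DLM the denoiser is a trained transformer and the unmasking schedule is learned or tuned.

```python
import numpy as np

# Toy sketch of iterative unmasking: predict every position each step,
# then commit only the most confident masked positions.
rng = np.random.default_rng(0)
vocab_size, seq_len, n_steps = 50, 12, 4
MASK = -1
seq = np.full(seq_len, MASK)

def denoiser(seq):
    """Stand-in for the model: per-position logits over the vocabulary."""
    return rng.normal(size=(len(seq), vocab_size))

for step in range(n_steps):
    masked = np.where(seq == MASK)[0]
    if masked.size == 0:
        break
    logits = denoiser(seq)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)                              # best-token confidence per position
    n_commit = int(np.ceil(masked.size / (n_steps - step)))      # unmask a fraction each step
    commit = masked[np.argsort(-confidence[masked])[:n_commit]]  # most confident masked slots
    seq[commit] = probs[commit].argmax(axis=-1)

print(seq)  # every position is filled after n_steps
```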
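
The reasoning entries above (DeepSeek R1 and the reinforcement-learning bullet) hinge on rewards that can be checked automatically, e.g. a math answer that either matches or does not. The toy below is a bandit-style stand-in, not the actual training recipe: the "policy" is a softmax over candidate answers to one arithmetic question, updated by REINFORCE with a group-relative baseline in the spirit of GRPO.

```python
import numpy as np

# Toy "RL on a verifiable task": reward is 1 only when a sampled answer passes the check.
rng = np.random.default_rng(0)
candidates = np.arange(10, 20)            # possible answers to "7 + 6 = ?"
logits = np.zeros(len(candidates))        # policy parameters
verify = lambda a: a == 13                # programmatic verifier

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

lr, group_size = 0.5, 8
for step in range(200):
    probs = softmax(logits)
    idx = rng.choice(len(candidates), size=group_size, p=probs)  # sample a group of answers
    rewards = np.array([1.0 if verify(candidates[i]) else 0.0 for i in idx])
    advantages = rewards - rewards.mean()                        # group-relative baseline
    grad = np.zeros_like(logits)
    for i, adv in zip(idx, advantages):
        one_hot = np.zeros(len(candidates))
        one_hot[i] = 1.0
        grad += adv * (one_hot - probs)                          # REINFORCE gradient for a softmax policy
    logits += lr * grad / group_size

print(candidates[softmax(logits).argmax()])                      # converges toward the verified answer, 13
```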