33  Further Reading

Resources to continue your deep learning journey.

33.1 Papers

33.1.1 Foundational

Backpropagation

  • Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors.” Nature.
    • The paper that popularized backpropagation for training multi-layer networks

Automatic Differentiation

  • Baydin et al. (2018). “Automatic Differentiation in Machine Learning: a Survey.” JMLR.
    • Comprehensive overview of autodiff techniques
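
To make the idea concrete, here is a minimal illustrative sketch (not taken from either paper) of backpropagation through a one-hidden-layer network, checked against a finite-difference estimate; the layer sizes are arbitrary:

    import numpy as np

    # Illustrative backprop through a tiny one-hidden-layer MLP, checked
    # against a finite-difference estimate (sizes here are arbitrary).
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # input vector
    W1 = rng.normal(size=(3, 4))      # hidden-layer weights
    W2 = rng.normal(size=(1, 3))      # output-layer weights

    def loss_fn(W1, W2, x):
        h = np.tanh(W1 @ x)           # hidden activations
        y = W2 @ h                    # output (shape (1,))
        return 0.5 * float(y @ y), h, y

    loss, h, y = loss_fn(W1, W2, x)

    # Reverse pass: apply the chain rule layer by layer.
    dy = y                              # dL/dy
    dW2 = np.outer(dy, h)               # dL/dW2
    dh = W2.T @ dy                      # dL/dh
    dW1 = np.outer(dh * (1 - h**2), x)  # tanh'(z) = 1 - tanh(z)^2

    # Finite-difference check on a single weight of W1.
    eps = 1e-5
    W1p = W1.copy(); W1p[0, 0] += eps
    numeric = (loss_fn(W1p, W2, x)[0] - loss) / eps
    print(dW1[0, 0], numeric)           # the two values should agree closely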

33.1.2 Optimization

Adam Optimizer

  • Kingma & Ba (2015). “Adam: A Method for Stochastic Optimization.” ICLR.
    • The most widely used optimizer

AdamW (Weight Decay)

  • Loshchilov & Hutter (2019). “Decoupled Weight Decay Regularization.” ICLR.
    • Decouples weight decay from Adam’s adaptive gradient update, fixing how regularization behaves
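
As an illustration of what these two papers propose, the sketch below performs a single Adam update with decoupled weight decay (AdamW-style); it is a toy, not a drop-in optimizer, and the default hyperparameters are merely the common choices:

    import numpy as np

    # One Adam step with decoupled weight decay (AdamW-style).
    def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
        m = b1 * m + (1 - b1) * g                      # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g * g                  # second-moment estimate
        m_hat = m / (1 - b1 ** t)                      # bias correction
        v_hat = v / (1 - b2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)    # Adam update
        p = p - lr * wd * p                            # decoupled weight decay (AdamW)
        return p, m, v

    p, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    p, m, v = adamw_step(p, g=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
    print(p)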

Learning Rate Schedules

  • Smith (2017). “Cyclical Learning Rates for Training Neural Networks.” WACV.
    • Learning rate cycling techniques
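
A minimal sketch of the triangular policy from Smith’s paper; the base_lr, max_lr, and step_size values below are illustrative:

    # Smith's triangular policy: the learning rate ramps linearly between
    # base_lr and max_lr over each half-cycle.
    def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
        cycle = step // (2 * step_size)
        x = abs(step / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    print([round(triangular_lr(s), 5) for s in (0, 1000, 2000, 3000, 4000)])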

33.1.3 Normalization

Batch Normalization

  • Ioffe & Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML.
    • Breakthrough for training deep networks

Layer Normalization

  • Ba, Kiros, Hinton (2016). “Layer Normalization.” arXiv.
    • Essential for Transformers

RMSNorm

  • Zhang & Sennrich (2019). “Root Mean Square Layer Normalization.” NeurIPS.
    • Simplified normalization for modern LLMs
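
The following sketch contrasts LayerNorm and RMSNorm on a single feature vector; the learned gain and bias parameters are omitted for brevity:

    import numpy as np

    # LayerNorm vs. RMSNorm on one feature vector (learned gain/bias omitted).
    def layer_norm(x, eps=1e-5):
        return (x - x.mean()) / np.sqrt(x.var() + eps)   # center, then rescale

    def rms_norm(x, eps=1e-5):
        return x / np.sqrt(np.mean(x * x) + eps)         # rescale only, no centering

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(layer_norm(x))
    print(rms_norm(x))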

33.1.4 Transformers

Original Transformer

  • Vaswani et al. (2017). “Attention Is All You Need.” NeurIPS.
    • Introduced the Transformer, now the dominant architecture for language models and beyond
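
The paper’s core operation, scaled dot-product attention, fits in a few lines; this sketch is single-head and unmasked, with no learned projections:

    import numpy as np

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # (T_q, T_k) similarities
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                                    # weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)    # (4, 8)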

GPT

  • Radford et al. (2018). “Improving Language Understanding by Generative Pre-Training.”
    • First GPT paper

GPT-2

  • Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.”
    • Scaling up GPT

GPT-3

  • Brown et al. (2020). “Language Models are Few-Shot Learners.” NeurIPS.
    • Showed that scale enables in-context (few-shot) learning

33.1.5 Activation Functions

ReLU

  • Nair & Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML.
    • Simple but effective

GELU

  • Hendrycks & Gimpel (2016). “Gaussian Error Linear Units (GELUs).” arXiv.
    • Smooth activation for Transformers

SiLU/Swish

  • Ramachandran et al. (2017). “Searching for Activation Functions.” arXiv.
    • Self-gated activation
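
For reference, a sketch of all three activations; the GELU here uses the widely used tanh approximation rather than the exact erf form:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def gelu(x):                        # tanh approximation of x * Phi(x)
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def silu(x):                        # x * sigmoid(x); Swish with beta = 1
        return x / (1.0 + np.exp(-x))

    x = np.linspace(-3.0, 3.0, 7)
    print(relu(x), gelu(x), silu(x), sep="\n")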

33.2 Books

33.2.1 Deep Learning Theory

Deep Learning by Goodfellow, Bengio, Courville (2016)

Neural Networks and Deep Learning by Michael Nielsen

33.2.2 Implementation

Dive into Deep Learning by Zhang et al.

  • Code-first approach with multiple frameworks
  • Free online: d2l.ai

Programming PyTorch for Deep Learning by Ian Pointer (2019)

  • Practical PyTorch guide

33.2.3 Mathematics

Mathematics for Machine Learning by Deisenroth, Faisal, Ong (2020)

Linear Algebra Done Right by Sheldon Axler

  • Elegant linear algebra treatment

33.3 Online Courses

33.3.1 Foundations

CS231n: CNNs for Visual Recognition (Stanford)

CS224n: NLP with Deep Learning (Stanford)

33.3.2 Practical

Fast.ai by Jeremy Howard

  • Top-down practical approach
  • fast.ai

Deep Learning Specialization (Coursera) by Andrew Ng

33.4 Code Repositories

33.4.1 Educational Frameworks

micrograd by Andrej Karpathy

tinygrad by George Hotz

nanoGPT by Andrej Karpathy
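
As a taste of what these repositories offer, here is a short micrograd usage sketch; it assumes micrograd is installed and follows the Value API shown in the repository’s README:

    # Assumes micrograd is installed (pip install micrograd).
    from micrograd.engine import Value

    x = Value(3.0)
    y = Value(2.0)
    z = (x * y + x).relu()    # the forward pass records a computation graph
    z.backward()              # reverse-mode pass fills in .grad on every node
    print(z.data, x.grad, y.grad)   # 9.0 3.0 3.0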

33.4.2 Production Frameworks

PyTorch

JAX

MLX (Apple Silicon)

33.5 ONNX Resources

ONNX Specification

ONNX Runtime

ONNX Model Zoo
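
A hedged sketch of loading and running an exported model with ONNX Runtime; the model path and the dummy input shape below are placeholders and must match whatever the exported graph actually expects:

    import numpy as np
    import onnxruntime as ort

    # "model.onnx" is a placeholder path; adjust the input shape to your model.
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name        # discover the graph's input name
    x = np.random.randn(1, 16).astype(np.float32)
    outputs = session.run(None, {input_name: x})     # None = return all outputs
    print([o.shape for o in outputs])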

33.6 GPU Computing

33.6.1 cuNumeric / Legate

cuNumeric

Legate

33.6.2 CUDA Programming

CUDA C Programming Guide

Programming Massively Parallel Processors by Kirk & Hwu

  • GPU computing textbook

33.7 Blogs and Tutorials

33.7.1 Technical Deep Dives

The Illustrated Transformer by Jay Alammar

Andrej Karpathy’s Blog

Lil’Log by Lilian Weng

33.7.2 Newsletters

The Batch by deeplearning.ai

Import AI by Jack Clark

33.8 Community

33.8.1 Forums

r/MachineLearning (Reddit)

Hugging Face Forums

33.8.2 Discord/Slack

EleutherAI

MLOps Community

33.9 What to Learn Next

After this book, consider exploring:

  1. Convolutional Networks: Image processing
  2. Recurrent Networks: Sequential data (historical interest)
  3. Diffusion Models: Image generation
  4. Reinforcement Learning: Decision making
  5. Distributed Training: Scaling to multiple GPUs/nodes
  6. Quantization: Model compression
  7. Flash Attention: Efficient attention implementations

The field moves fast. Stay curious, keep building.