33  Further Reading

Resources to continue your deep learning journey.

33.1 Papers

33.1.1 Foundational

Backpropagation

  • Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors.” Nature.
    • The paper that popularized backpropagation for training multi-layer networks

Automatic Differentiation

  • Baydin et al. (2018). “Automatic Differentiation in Machine Learning: a Survey.” JMLR.
    • Comprehensive overview of autodiff techniques
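
To make the idea concrete, here is a minimal illustrative sketch (not taken from either paper) of backpropagation through a one-hidden-layer network, checked against a finite-difference estimate; the layer sizes are arbitrary:

    import numpy as np

    # Illustrative backprop through a tiny one-hidden-layer MLP, checked
    # against a finite-difference estimate (sizes here are arbitrary).
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # input vector
    W1 = rng.normal(size=(3, 4))      # hidden-layer weights
    W2 = rng.normal(size=(1, 3))      # output-layer weights

    def loss_fn(W1, W2, x):
        h = np.tanh(W1 @ x)           # hidden activations
        y = W2 @ h                    # output (shape (1,))
        return 0.5 * float(y @ y), h, y

    loss, h, y = loss_fn(W1, W2, x)

    # Reverse pass: apply the chain rule layer by layer.
    dy = y                              # dL/dy
    dW2 = np.outer(dy, h)               # dL/dW2
    dh = W2.T @ dy                      # dL/dh
    dW1 = np.outer(dh * (1 - h**2), x)  # tanh'(z) = 1 - tanh(z)^2

    # Finite-difference check on a single weight of W1.
    eps = 1e-5
    W1p = W1.copy(); W1p[0, 0] += eps
    numeric = (loss_fn(W1p, W2, x)[0] - loss) / eps
    print(dW1[0, 0], numeric)           # the two values should agree closely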

33.1.2 Optimization

Adam Optimizer

  • Kingma & Ba (2015). “Adam: A Method for Stochastic Optimization.” ICLR.
    • The most widely used optimizer

AdamW (Weight Decay)

  • Loshchilov & Hutter (2019). “Decoupled Weight Decay Regularization.” ICLR.
    • Decouples weight decay from Adam’s adaptive gradient update, fixing how regularization behaves
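
As an illustration of what these two papers propose, the sketch below performs a single Adam update with decoupled weight decay (AdamW-style); it is a toy, not a drop-in optimizer, and the default hyperparameters are merely the common choices:

    import numpy as np

    # One Adam step with decoupled weight decay (AdamW-style).
    def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
        m = b1 * m + (1 - b1) * g                      # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g * g                  # second-moment estimate
        m_hat = m / (1 - b1 ** t)                      # bias correction
        v_hat = v / (1 - b2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)    # Adam update
        p = p - lr * wd * p                            # decoupled weight decay (AdamW)
        return p, m, v

    p, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    p, m, v = adamw_step(p, g=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
    print(p)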

Learning Rate Schedules

  • Smith (2017). “Cyclical Learning Rates for Training Neural Networks.” WACV.
    • Learning rate cycling techniques
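
A minimal sketch of the triangular policy from Smith’s paper; the base_lr, max_lr, and step_size values below are illustrative:

    # Smith's triangular policy: the learning rate ramps linearly between
    # base_lr and max_lr over each half-cycle.
    def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
        cycle = step // (2 * step_size)
        x = abs(step / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    print([round(triangular_lr(s), 5) for s in (0, 1000, 2000, 3000, 4000)])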

33.1.3 Normalization

Batch Normalization

  • Ioffe & Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML.
    • Breakthrough for training deep networks

Layer Normalization

  • Ba, Kiros, Hinton (2016). “Layer Normalization.” arXiv.
    • Essential for Transformers

RMSNorm

  • Zhang & Sennrich (2019). “Root Mean Square Layer Normalization.” NeurIPS.
    • Simplified normalization for modern LLMs
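
The following sketch contrasts LayerNorm and RMSNorm on a single feature vector; the learned gain and bias parameters are omitted for brevity:

    import numpy as np

    # LayerNorm vs. RMSNorm on one feature vector (learned gain/bias omitted).
    def layer_norm(x, eps=1e-5):
        return (x - x.mean()) / np.sqrt(x.var() + eps)   # center, then rescale

    def rms_norm(x, eps=1e-5):
        return x / np.sqrt(np.mean(x * x) + eps)         # rescale only, no centering

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(layer_norm(x))
    print(rms_norm(x))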

33.1.4 Transformers

Original Transformer

  • Vaswani et al. (2017). “Attention Is All You Need.” NeurIPS.
    • Introduced the Transformer, now the dominant architecture for language models and beyond
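
The paper’s core operation, scaled dot-product attention, fits in a few lines; this sketch is single-head and unmasked, with no learned projections:

    import numpy as np

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # (T_q, T_k) similarities
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                                    # weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)    # (4, 8)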

GPT

  • Radford et al. (2018). “Improving Language Understanding by Generative Pre-Training.”
    • First GPT paper

GPT-2

  • Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.”
    • Scaling up GPT

GPT-3

  • Brown et al. (2020). “Language Models are Few-Shot Learners.” NeurIPS.
    • Showed that scale enables in-context (few-shot) learning

33.1.5 Activation Functions

ReLU

  • Nair & Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML.
    • Simple but effective

GELU

  • Hendrycks & Gimpel (2016). “Gaussian Error Linear Units (GELUs).” arXiv.
    • Smooth activation for Transformers

SiLU/Swish

  • Ramachandran et al. (2017). “Searching for Activation Functions.” arXiv.
    • Self-gated activation
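
For reference, a sketch of all three activations; the GELU here uses the widely used tanh approximation rather than the exact erf form:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def gelu(x):                        # tanh approximation of x * Phi(x)
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def silu(x):                        # x * sigmoid(x); Swish with beta = 1
        return x / (1.0 + np.exp(-x))

    x = np.linspace(-3.0, 3.0, 7)
    print(relu(x), gelu(x), silu(x), sep="\n")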

33.2 Books

33.2.1 Deep Learning Theory

Deep Learning by Goodfellow, Bengio, Courville (2016)

Neural Networks and Deep Learning by Michael Nielsen

33.2.2 Implementation

Dive into Deep Learning by Zhang et al.

  • Code-first approach with multiple frameworks
  • Free online: d2l.ai

Programming PyTorch for Deep Learning by Ian Pointer (2019)

  • Practical PyTorch guide

33.2.3 Mathematics

Mathematics for Machine Learning by Deisenroth, Faisal, Ong (2020)

Linear Algebra Done Right by Sheldon Axler

  • Elegant linear algebra treatment

33.3 Online Courses

33.3.1 Foundations

CS231n: CNNs for Visual Recognition (Stanford)

CS224n: NLP with Deep Learning (Stanford)

33.3.2 Practical

Fast.ai by Jeremy Howard

  • Top-down practical approach
  • fast.ai

Deep Learning Specialization (Coursera) by Andrew Ng

33.4 Code Repositories

33.4.1 Educational Frameworks

micrograd by Andrej Karpathy

tinygrad by George Hotz

nanoGPT by Andrej Karpathy
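
As a taste of what these repositories offer, here is a short micrograd usage sketch; it assumes micrograd is installed and follows the Value API shown in the repository’s README:

    # Assumes micrograd is installed (pip install micrograd).
    from micrograd.engine import Value

    x = Value(3.0)
    y = Value(2.0)
    z = (x * y + x).relu()    # the forward pass records a computation graph
    z.backward()              # reverse-mode pass fills in .grad on every node
    print(z.data, x.grad, y.grad)   # 9.0 3.0 3.0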

33.4.2 Production Frameworks

PyTorch

JAX

MLX (Apple Silicon)

33.5 ONNX Resources

ONNX Specification

ONNX Runtime

ONNX Model Zoo
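
A hedged sketch of loading and running an exported model with ONNX Runtime; the model path and the dummy input shape below are placeholders and must match whatever the exported graph actually expects:

    import numpy as np
    import onnxruntime as ort

    # "model.onnx" is a placeholder path; adjust the input shape to your model.
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name        # discover the graph's input name
    x = np.random.randn(1, 16).astype(np.float32)
    outputs = session.run(None, {input_name: x})     # None = return all outputs
    print([o.shape for o in outputs])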

33.6 GPU Computing

33.6.1 cuNumeric / Legate

cuNumeric

Legate

33.6.2 CUDA Programming

CUDA C Programming Guide

Programming Massively Parallel Processors by Kirk & Hwu

  • GPU computing textbook

33.7 Blogs and Tutorials

33.7.1 Technical Deep Dives

The Illustrated Transformer by Jay Alammar

Andrej Karpathy’s Blog

Lil’Log by Lilian Weng

33.7.2 Newsletters

The Batch by deeplearning.ai

Import AI by Jack Clark

33.8 Community

33.8.1 Forums

r/MachineLearning (Reddit)

Hugging Face Forums

33.8.2 Discord/Slack

EleutherAI

MLOps Community

33.9 What to Learn Next

After this book, consider exploring:

  1. Convolutional Networks: Image processing
  2. Recurrent Networks: Sequential data (historical interest)
  3. Diffusion Models: Image generation
  4. Reinforcement Learning: Decision making
  5. Distributed Training: Scaling to multiple GPUs/nodes
  6. Quantization: Model compression
  7. Flash Attention: Efficient attention implementations

The field moves fast. Stay curious, keep building.