33 Further Reading
Resources to continue your deep learning journey.
33.1 Papers
33.1.1 Foundational
Backpropagation
- Rumelhart, Hinton & Williams (1986). “Learning representations by back-propagating errors.” Nature.
- The paper that popularized backpropagation for training multi-layer networks
Automatic Differentiation
- Baydin et al. (2018). “Automatic Differentiation in Machine Learning: a Survey.” JMLR.
- Comprehensive survey of forward- and reverse-mode techniques; a forward-mode sketch follows this list
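The survey’s central distinction is between forward- and reverse-mode differentiation. As a taste of forward mode, here is a minimal dual-number sketch; it is a toy illustration under our own naming, not code from the paper:

```python
# Forward-mode autodiff: each value carries its derivative with
# respect to one chosen input alongside the value itself.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value = value   # f(x)
        self.deriv = deriv   # df/dx

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# Differentiate f(x) = x*x + 3 at x = 2 by seeding the input's deriv to 1.
x = Dual(2.0, 1.0)
y = x * x + 3
print(y.value, y.deriv)  # 7.0 4.0
```

Reverse mode, the generalization behind backpropagation, is sketched alongside micrograd in the code repositories section.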
33.1.2 Optimization
Adam Optimizer
- Kingma & Ba (2015). “Adam: A Method for Stochastic Optimization.” ICLR.
- The most widely used optimizer; its update rule is sketched after this list
AdamW (Weight Decay)
- Loshchilov & Hutter (2019). “Decoupled Weight Decay Regularization.” ICLR.
- Fixes how weight decay interacts with Adam’s adaptive updates
Learning Rate Schedules
- Smith (2017). “Cyclical Learning Rates for Training Neural Networks.” WACV.
- Learning rate cycling techniques
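For reference, the update rule from the Adam paper as a bare NumPy sketch (function name and state handling are ours; real optimizers keep m, v, and t internally):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running moment estimates, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: uncentered variance
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

AdamW changes exactly one thing: weight decay is applied directly to theta (theta -= lr * weight_decay * theta) rather than being folded into grad.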
33.1.3 Normalization
Batch Normalization
- Ioffe & Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ICML.
- Breakthrough for training deep networks
Layer Normalization
- Ba, Kiros, Hinton (2016). “Layer Normalization.” arXiv.
- Essential for Transformers
RMSNorm
- Zhang & Sennrich (2019). “Root Mean Square Layer Normalization.” NeurIPS.
- Simplified normalization for modern LLMs; compared with LayerNorm in the sketch after this list
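The three papers differ mainly in which statistics they normalize over. A NumPy sketch of LayerNorm versus RMSNorm along the feature axis (learnable gain and bias omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm drops the mean subtraction and rescales by the root mean square.
    return x / np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
```

BatchNorm instead computes its statistics across the batch dimension, which is why it needs running averages at inference time while LayerNorm and RMSNorm do not.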
33.1.4 Transformers
Original Transformer
- Vaswani et al. (2017). “Attention Is All You Need.” NeurIPS.
- The architecture that changed everything; its core attention operation is sketched after this list
GPT
- Radford et al. (2018). “Improving Language Understanding by Generative Pre-Training.” OpenAI Technical Report.
- First GPT paper
GPT-2
- Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.” OpenAI Technical Report.
- Scaling up GPT
GPT-3
- Brown et al. (2020). “Language Models are Few-Shot Learners.” NeurIPS.
- Showed that sufficient scale yields in-context (few-shot) learning
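All four papers build on one core operation, the scaled dot-product attention of Vaswani et al., softmax(QKᵀ/√d_k)V. A single-head NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V have shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted mix of values
```

The GPT papers use decoder-only stacks of this operation with a causal mask, so each position attends only to earlier positions.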
33.1.5 Activation Functions
ReLU
- Nair & Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML.
- Simple but effective
GELU
- Hendrycks & Gimpel (2016). “Gaussian Error Linear Units (GELUs).” arXiv.
- Smooth activation for Transformers
SiLU/Swish
- Ramachandran et al. (2017). “Searching for Activation Functions.” arXiv.
- Self-gated activation; all three functions are defined in the sketch after this list
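Side by side, the three activations (using the tanh approximation of GELU given in the Hendrycks & Gimpel paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Tanh approximation of x * Phi(x), where Phi is the Gaussian CDF.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU/Swish: the input gates itself through a sigmoid.
    return x / (1 + np.exp(-x))
```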
33.2 Books
33.2.1 Deep Learning Theory
Deep Learning by Goodfellow, Bengio, Courville (2016)
- The comprehensive reference
- Free online: deeplearningbook.org
Neural Networks and Deep Learning by Michael Nielsen
- Interactive introduction
- Free online: neuralnetworksanddeeplearning.com
33.2.2 Implementation
Dive into Deep Learning by Zhang et al.
- Code-first approach with multiple frameworks
- Free online: d2l.ai
Programming PyTorch for Deep Learning by Ian Pointer (2019)
- Practical PyTorch guide
33.2.3 Mathematics
Mathematics for Machine Learning by Deisenroth, Faisal, Ong (2020)
- Mathematical foundations
- Free online: mml-book.github.io
Linear Algebra Done Right by Sheldon Axler
- Elegant linear algebra treatment
33.3 Online Courses
33.3.1 Foundations
CS231n: CNNs for Visual Recognition (Stanford)
- Classic neural network course
- cs231n.stanford.edu
CS224n: NLP with Deep Learning (Stanford)
- Transformers and language models
- web.stanford.edu/class/cs224n/
33.3.2 Practical
Fast.ai by Jeremy Howard
- Top-down practical approach
- fast.ai
Deep Learning Specialization (Coursera) by Andrew Ng
- Comprehensive introduction
- coursera.org/specializations/deep-learning
33.4 Code Repositories
33.4.1 Educational Frameworks
micrograd by Andrej Karpathy
- Tiny scalar autograd engine (~100 lines); a sketch in its spirit follows this list
- github.com/karpathy/micrograd
tinygrad by George Hotz
- Small but complete framework
- github.com/tinygrad/tinygrad
nanoGPT by Andrej Karpathy
- Minimal GPT implementation
- github.com/karpathy/nanoGPT
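To make the idea concrete, here is a reverse-mode sketch in the spirit of micrograd, though not its exact API: each operation records a closure that pushes gradients to its inputs, and backward() replays the closures in reverse topological order.

```python
class Value:
    """Scalar with reverse-mode autodiff, micrograd-style."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # how to push grad to parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order, then apply the chain rule backwards.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(-3.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)  # -2.0 2.0 (d/da = b + 1, d/db = a)
```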
33.4.2 Production Frameworks
PyTorch
- pytorch.org
- The most widely used deep learning framework
JAX
- github.com/google/jax
- Functional approach to autodiff
MLX (Apple Silicon)
- github.com/ml-explore/mlx
- Designed for Apple hardware
33.5 ONNX Resources
ONNX Specification
- github.com/onnx/onnx
- Open standard for representing models across frameworks
ONNX Runtime
- onnxruntime.ai
- High-performance inference engine; an export-and-run sketch follows this list
ONNX Model Zoo
- Pre-trained models in ONNX format
- github.com/onnx/models
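Assuming torch and onnxruntime are installed, exporting a model and running it under ONNX Runtime looks roughly like this (the Linear model and shapes are placeholders for your own):

```python
import torch
import onnxruntime as ort

# Stand-in model; substitute your own nn.Module.
model = torch.nn.Linear(4, 2)
model.eval()

dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["x"], output_names=["y"])

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"x": dummy.numpy()})
print(outputs[0].shape)  # (1, 2)
```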
33.6 GPU Computing
33.6.1 cuNumeric / Legate
cuNumeric
- NumPy on GPUs
- github.com/nv-legate/cunumeric
Legate
- Distributed computing foundation
- github.com/nv-legate
33.6.2 CUDA Programming
CUDA C Programming Guide
- docs.nvidia.com/cuda/cuda-c-programming-guide/
- NVIDIA’s official reference for CUDA C/C++
Programming Massively Parallel Processors by Kirk & Hwu
- GPU computing textbook
33.7 Blogs and Tutorials
33.7.1 Technical Deep Dives
The Illustrated Transformer by Jay Alammar
- Visual explanation of Transformers
- jalammar.github.io/illustrated-transformer/
Andrej Karpathy’s Blog
- Excellent technical writing
- karpathy.github.io
Lil’Log by Lilian Weng
- In-depth ML explanations
- lilianweng.github.io
33.8 Community
33.8.1 Forums
r/MachineLearning (Reddit)
- Research discussions
- reddit.com/r/MachineLearning
Hugging Face Forums
- Transformers and NLP
- discuss.huggingface.co
33.8.2 Discord/Slack
EleutherAI
- Open source LLM research
- eleuther.ai
MLOps Community
- Production ML
- mlops.community
33.9 What to Learn Next
After this book, consider exploring:
- Convolutional Networks: Image processing
- Recurrent Networks: Sequential data (now mainly of historical interest)
- Diffusion Models: Image generation
- Reinforcement Learning: Decision making
- Distributed Training: Scaling to multiple GPUs/nodes
- Quantization: Model compression
- FlashAttention: Memory-efficient exact attention kernels
The field moves fast. Stay curious, keep building.