
Automatic Differentiation Mechanism

In the early days of deep learning, computing the gradient of the loss function with respect to the parameters of a model was a tedious and error-prone task. It was usually done by the model developers themselves, by hand-coding the derivatives. If you have the privilege of studying the oldest version of Andrew Ng's Machine Learning course, taught on Coursera in 2011, you will see the gradient of the loss function with respect to the model parameters computed by hand-written code in Octave (a poor man's MATLAB). In those days, machine learning was really a high-end software engineering job.

Fortunately, with the development of automatic differentiation, the gradient of the loss function with respect to the parameters of the model can now be computed automatically by the computer. This is what we call the automatic differentiation mechanism.

To be honest, automatic differentiation is neither new nor difficult to understand. It is a simple algorithm that can be implemented directly from the chain rule of calculus.

In case you have forgotten the chain rule of calculus, here is a quick reminder. For a composite function $f(g(x))$, the derivative of $f(g(x))$ with respect to $x$ is $f'(g(x)) \cdot g'(x)$. As we mentioned in the previous section, a network is a DAG: the computation flows from the input layer to the output layer by calling a series of operator instances, and each operator instance plays the role of a function in calculus. From the viewpoint of calculus, a network is therefore a composite function, so we can apply the chain rule to compute the gradient of the loss function with respect to the parameters of the model.
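
To make this concrete, here is a minimal sketch of reverse-mode automatic differentiation for scalar values, supporting only addition and multiplication. The `Value` class and its method names are illustrative assumptions for this sketch, not the API of any particular framework; real frameworks apply the same idea to tensors and many more operators.

```python
# A minimal sketch of reverse-mode automatic differentiation on scalars.
# Each operator records its inputs (parents in the DAG) and the local
# derivatives of its output with respect to each input; backward() then
# walks the DAG in reverse and applies the chain rule.

class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                 # forward-pass result
        self.grad = 0.0                  # accumulated dL/d(this value)
        self._parents = parents          # inputs of the operator that produced this value
        self._local_grads = local_grads  # d(output)/d(input) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the DAG, then propagate gradients in reverse.
        order, visited = [], set()

        def topo(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)

        topo(self)
        self.grad = 1.0  # dL/dL = 1
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad  # chain rule


# Usage: L = x * w + b, so dL/dw = x, dL/dx = w, dL/db = 1.
x, w, b = Value(2.0), Value(3.0), Value(1.0)
L = x * w + b
L.backward()
print(x.grad, w.grad, b.grad)  # 3.0 2.0 1.0
```

Note that the gradients are accumulated with `+=`: when a value feeds into several operators in the DAG, the contributions from every path are summed, which is exactly what the chain rule prescribes for shared intermediate results.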