In the article about Logistic Regression, we mentioned that logistic regression and neural networks are fit by minimising a loss function. In order to do this, we need to calculate the gradient of the loss function with respect to the parameters. This tells us how we can adjust the parameters to reduce the loss.

The functions we want to optimise in machine learning problems can usually be expressed as a composition of functions. To differentiate a composition, we use the *chain rule*

\[\frac{\mathrm{d}}{\mathrm{d}x}\,f\bigl(g(x)\bigr) = f'\bigl(g(x)\bigr)\,g'(x)\]

To illustrate this, let's see how we can use it to differentiate the cross-entropy loss

\[L = -\log p_c\]

with respect to the weights \(\mathbf{W}\) of a logistic regression model, where \(p_c\) is the predicted probability of the correct class \(c\). First we differentiate the loss with respect to the probability of the correct class

\[\frac{\partial L}{\partial p_c} = -\frac{1}{p_c}\]

Then we need to differentiate the probability with respect to each of the logits \(q_{i}\). For a softmax output, \(p_c = e^{q_c} / \sum_j e^{q_j}\), and the derivative can be written in terms of the probabilities themselves

\[\frac{\partial p_c}{\partial q_i} = p_c \left( \delta_{ic} - p_i \right)\]

where \(\delta_{ic}\) is the *Kronecker delta*, which is 1 if \(i=c\) and 0 otherwise.
(As an aside, functions whose derivative can be expressed in terms of their output are commonly used in machine learning, because they make differentiation easier. Such functions are often derived from the exponential function in some way).
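The sigmoid function is a classic example: its derivative can be written entirely in terms of its own output. Here is a minimal sanity check in plain Python (an illustrative sketch, not code from any particular library), comparing the closed-form derivative against a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # The derivative is expressed purely in terms of the output:
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Sanity-check against a central finite difference at a few points
for x in [-2.0, 0.0, 1.5]:
    eps = 1e-6
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    assert abs(sigmoid_grad(x) - numeric) < 1e-9
```

Because the forward pass already computes \(\sigma(x)\), the backward pass gets the derivative almost for free, which is exactly why such functions are so convenient in practice.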

Then, we need to differentiate the logits with respect to the weights. Since \(q_i = \mathbf{w}_i \cdot \mathbf{x} + b_i\), where \(\mathbf{w}_i\) is the \(i\)-th row of \(\mathbf{W}\),

\[\frac{\partial q_i}{\partial \mathbf{w}_i} = \mathbf{x}\]

Finally, we can combine these derivatives using the chain rule

\[\frac{\partial L}{\partial \mathbf{W}} = \left( \mathbf{p} - \mathbf{y} \right) \otimes \mathbf{x}\]

where \(\mathbf{y}\) is the one-hot encoding of the correct class (so \(y_i = \delta_{ic}\)) and \(\otimes\) denotes the outer product.
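The chain-rule result is easy to verify numerically. The sketch below (plain Python, with made-up weights, bias, input, and class label) computes the analytic gradient as an outer product and checks it against a finite-difference estimate of the cross-entropy loss:

```python
import math

def softmax(q):
    # Numerically stable softmax over a list of logits
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, b, x):
    return [sum(wj * xj for wj, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def loss(W, b, x, c):
    # Cross-entropy: -log probability of the correct class c
    return -math.log(softmax(logits(W, b, x))[c])

def grad_W(W, b, x, c):
    # Chain-rule result: dL/dW[i][j] = (p_i - delta_ic) * x_j, an outer product
    p = softmax(logits(W, b, x))
    return [[(p[i] - (1.0 if i == c else 0.0)) * xj for xj in x]
            for i in range(len(p))]

# Made-up example: 3 classes, 2 features
W = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]]
b = [0.0, 0.1, -0.1]
x = [1.0, 2.0]
g = grad_W(W, b, x, 1)
eps = 1e-6
for i in range(3):
    for j in range(2):
        Wp = [row[:] for row in W]; Wp[i][j] += eps
        Wm = [row[:] for row in W]; Wm[i][j] -= eps
        numeric = (loss(Wp, b, x, 1) - loss(Wm, b, x, 1)) / (2 * eps)
        assert abs(g[i][j] - numeric) < 1e-6
```

This kind of finite-difference check is a standard way to catch bugs in hand-derived gradients before trusting them in an optimiser.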

For a deeper neural network, we use the fact that each layer \(n\) of the network can be treated as a function

\[\mathbf{a}_n = f_n(\mathbf{a}_{n-1}; \mathbf{W}_n, \mathbf{b}_n)\]

and apply the chain rule recursively to calculate the gradient of the loss with respect to each layer's weights and biases. This recursive application of the chain rule is known as *backpropagation*, and is the basis of most neural network optimisation algorithms.
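To make the recursion concrete, here is a hand-written backward pass for a hypothetical two-layer network (sigmoid hidden layer, softmax output; biases omitted and all values made up for brevity). The error signal is propagated from the output back through each layer, and each layer's weight gradient is the outer product of its error with its input:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(W1, W2, x):
    # Layer 1: sigmoid hidden units; layer 2: linear logits (biases omitted)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    q = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return h, q

def loss(W1, W2, x, c):
    # Cross-entropy: -log softmax(q)[c], computed stably
    _, q = forward(W1, W2, x)
    m = max(q)
    return -(q[c] - m - math.log(sum(math.exp(v - m) for v in q)))

def backward(W1, W2, x, c):
    h, q = forward(W1, W2, x)
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    Z = sum(exps)
    p = [e / Z for e in exps]
    # Output-layer error: dL/dq_i = p_i - delta_ic
    d_out = [p[i] - (1.0 if i == c else 0.0) for i in range(len(q))]
    # Chain rule, applied recursively: propagate the error through W2,
    # then through the sigmoid (whose derivative is h * (1 - h))
    d_hid = [sum(W2[i][j] * d_out[i] for i in range(len(d_out))) * h[j] * (1.0 - h[j])
             for j in range(len(h))]
    # Each layer's weight gradient is an outer product of error and input
    gW2 = [[d_out[i] * h[j] for j in range(len(h))] for i in range(len(d_out))]
    gW1 = [[d_hid[j] * x[k] for k in range(len(x))] for j in range(len(d_hid))]
    return gW1, gW2
```

Adding more layers just extends the same pattern: each layer receives an error signal from the layer above, multiplies it by its local derivative, and passes the result further back.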

Of course, very few data scientists ever need to do this themselves on a day-to-day basis, because automatic differentiation and backpropagation are provided by machine learning software libraries, but it's still useful to understand how it works.