In several articles, we have mentioned Loss Functions. These are the functions we aim to minimise when fitting the model: functions

\[L\left(y_{t}, y_{p}\right)\]

where \(y_{t}\) is the true value of the target variable and \(y_{p}\) is the predicted output. They should have the following properties.
- They should have a global minimum at \(y_{t} = y_{p}\) and increase strictly with the deviation \(|y_{t} - y_{p}|\) elsewhere.
- They should be differentiable, so that we can apply backpropagation and gradient descent.
They are similar to, and to some extent overlap with, Similarity and Distance Metrics.
For regression problems, an obvious choice of loss function is the Mean Squared Error

\[L = \left(y_{t} - y_{p}\right)^{2}\]

averaged over the training samples. This is easily differentiable:

\[\frac{\partial L}{\partial y_{p}} = 2\left(y_{p} - y_{t}\right)\]
Linear Regression models usually use some variation of this, often with a regularisation penalty to prevent overfitting. However, since the loss is quadratic in the deviation of the prediction from the true value, this function is sensitive to outliers. The Mean Absolute Error

\[L = \left|y_{t} - y_{p}\right|\]

would avoid this problem, but is not differentiable at \(y_{t} = y_{p}\). This problem can be addressed by using the Huber Loss

\[L = \begin{cases} \frac{1}{2}\left(y_{t} - y_{p}\right)^{2} & \left|y_{t} - y_{p}\right| \le \delta \\ \delta\left(\left|y_{t} - y_{p}\right| - \frac{1}{2}\delta\right) & \left|y_{t} - y_{p}\right| > \delta \end{cases}\]
This behaves like Mean Squared Error for \(|y_{t} - y_{p}| < \delta\), and like Mean Absolute Error for \(|y_{t} - y_{p}| > \delta\), where \(\delta\) is the scale above which we wish to reduce the influence of outliers. The derivative is given by

\[\frac{\partial L}{\partial y_{p}} = \begin{cases} y_{p} - y_{t} & \left|y_{t} - y_{p}\right| \le \delta \\ \delta \, \mathrm{sign}\left(y_{p} - y_{t}\right) & \left|y_{t} - y_{p}\right| > \delta \end{cases}\]
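As an illustration, these three regression losses might be sketched in NumPy as follows (the function names and the per-batch averaging are our own choices, not from any particular library):

```python
import numpy as np

def mse(y_t, y_p):
    """Mean Squared Error: quadratic in the deviation, so outliers dominate."""
    return np.mean((y_t - y_p) ** 2)

def mae(y_t, y_p):
    """Mean Absolute Error: robust to outliers, but not differentiable at zero error."""
    return np.mean(np.abs(y_t - y_p))

def huber(y_t, y_p, delta=1.0):
    """Huber loss: quadratic for deviations up to delta, linear beyond it."""
    err = y_t - y_p
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))
```

On a batch containing one large outlier, `mse` grows quadratically with the outlier's deviation, while `mae` and `huber` grow only linearly.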
These loss functions assume that the errors are normally distributed with constant variance. If we expect a different distribution, we can use a Negative Log Likelihood Loss based on the expected distribution. For example, if the predicted values are expected to be drawn from the Poisson Distribution

\[P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}\]
we can, by setting \(\lambda = y_{p}\) and \(k = y_{t}\), use the Poisson Loss

\[L = y_{p} - y_{t} \ln y_{p}\]

where the \(\ln\left(y_{t}!\right)\) term of the negative log likelihood has been dropped, since it does not depend on the prediction.
The derivative of this is

\[\frac{\partial L}{\partial y_{p}} = 1 - \frac{y_{t}}{y_{p}}\]
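A minimal NumPy sketch of this loss (the function name and the small `eps` guard against \(\ln 0\) are our own additions):

```python
import numpy as np

def poisson_loss(y_t, y_p, eps=1e-8):
    """Poisson negative log likelihood, dropping the ln(y_t!) term,
    which is constant with respect to the prediction."""
    return np.mean(y_p - y_t * np.log(y_p + eps))
```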
If the variance of the predictions is not constant, but is itself predicted by the model, we may use the Gaussian Negative Log Likelihood Loss

\[L = \frac{1}{2}\left(\ln \sigma^{2} + \frac{\left(y_{t} - y_{p}\right)^{2}}{\sigma^{2}}\right)\]

where \(\sigma^{2}\) is the predicted variance, and the constant \(\frac{1}{2}\ln 2\pi\) term has been dropped.
The derivatives are

\[\frac{\partial L}{\partial y_{p}} = \frac{y_{p} - y_{t}}{\sigma^{2}}\]

\[\frac{\partial L}{\partial \sigma^{2}} = \frac{1}{2\sigma^{2}}\left(1 - \frac{\left(y_{t} - y_{p}\right)^{2}}{\sigma^{2}}\right)\]
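This could be sketched in NumPy as follows (the function name and the `eps` guard on the variance are our own assumptions):

```python
import numpy as np

def gaussian_nll(y_t, y_p, var, eps=1e-8):
    """Gaussian negative log likelihood with a per-sample predicted variance,
    dropping the constant 0.5*ln(2*pi) term."""
    var = var + eps  # guard against division by zero
    return np.mean(0.5 * (np.log(var) + (y_t - y_p) ** 2 / var))
```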
For classification problems, we usually use a Cross Entropy Loss. This assumes that the prediction is a probability distribution over the possible classes, and that the true value is represented by either a binary flag or a one-hot encoded value. In the case of binary classification, we use the Binary Cross Entropy Loss

\[L = -\left(y_{t} \ln y_{p} + \left(1 - y_{t}\right) \ln\left(1 - y_{p}\right)\right)\]
for which the derivative is

\[\frac{\partial L}{\partial y_{p}} = \frac{y_{p} - y_{t}}{y_{p}\left(1 - y_{p}\right)}\]
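A NumPy sketch (the clipping of predictions away from 0 and 1 is our own guard against taking \(\ln 0\)):

```python
import numpy as np

def bce(y_t, y_p, eps=1e-12):
    """Binary cross entropy; predictions are clipped to avoid log(0)."""
    y_p = np.clip(y_p, eps, 1.0 - eps)
    return np.mean(-(y_t * np.log(y_p) + (1.0 - y_t) * np.log(1.0 - y_p)))
```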
Where there are more than two classes, we use the Categorical Cross Entropy Loss. For this, \(\vec{y}_{p}\) represents a probability distribution over classes \(i\) such that

\[\sum_{i} y_{p,i} = 1\]
\(\vec{y}_{t}\) is a one-hot encoded representation of the correct class \(j\), such that

\[y_{t,i} = \delta_{ij}\]
where \(\delta_{ij}\) is the Kronecker delta function

\[\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}\]
We then have

\[L = -\sum_{i} y_{t,i} \ln y_{p,i} = -\ln y_{p,j}\]
and the derivative

\[\frac{\partial L}{\partial y_{p,i}} = -\frac{y_{t,i}}{y_{p,i}}\]
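A NumPy sketch, taking one-hot targets and probability rows (the function name and `eps` guard are our own):

```python
import numpy as np

def cce(y_t, y_p, eps=1e-12):
    """Categorical cross entropy for one-hot rows y_t and probability rows y_p."""
    return np.mean(-np.sum(y_t * np.log(y_p + eps), axis=-1))
```

For a uniform prediction over two equally likely classes, the loss is \(\ln 2\), the entropy of a fair coin.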
Two variations of this are worth noting. In Sparse Categorical Cross Entropy, the true values are given as integer indexes rather than one-hot encoded. This is useful when there are a large number of categories, such as in Natural Language Processing. Secondly, since the predictions are usually obtained from a softmax function, it is possible to supply them as logits rather than probabilities.
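A minimal sketch combining both variations, taking integer class indices and raw logits (the function name and the numerically stable log-softmax formulation are our own assumptions):

```python
import numpy as np

def sparse_cce_from_logits(y_t_idx, logits):
    """Sparse categorical cross entropy from integer targets and raw logits.
    Subtracting the row maximum makes the log-softmax numerically stable."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log probability of the correct class for each row.
    return np.mean(-log_probs[np.arange(len(y_t_idx)), y_t_idx])
```

Working in logits avoids ever materialising probabilities near zero, which is why frameworks often prefer this form.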
If the values of \(y_{t}\) are themselves probabilities, an appropriate loss function is the Kullback-Leibler Divergence Loss

\[L = \sum_{i} y_{t,i} \ln\frac{y_{t,i}}{y_{p,i}}\]
The derivative is

\[\frac{\partial L}{\partial y_{p,i}} = -\frac{y_{t,i}}{y_{p,i}}\]
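A NumPy sketch (the clipping of both distributions away from zero is our own guard):

```python
import numpy as np

def kl_divergence(y_t, y_p, eps=1e-12):
    """Kullback-Leibler divergence between target and predicted distributions.
    Zero when the two distributions are identical, positive otherwise."""
    y_t = np.clip(y_t, eps, 1.0)
    y_p = np.clip(y_p, eps, 1.0)
    return np.mean(np.sum(y_t * np.log(y_t / y_p), axis=-1))
```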
Cross-entropy and Kullback-Leibler divergence are both derived from Information Theory.
Both PyTorch and Keras provide a wide variety of loss functions.
While they serve similar purposes, it is important not to conflate loss functions with evaluation metrics. Just as the data used to test the performance of the model should be independent of the data used to train it, so, as far as possible, should the metrics used to evaluate it, to ensure a rigorous evaluation.