One vitally important task in any data science project is to assess how well the model performs. Various metrics are available for doing this, and each has its own advantages and disadvantages. This is a large topic, so we will separate it into metrics suitable for classifiers (this article) and those suitable for regression (next article).

A detailed description of the performance of a classifier model is given by the *Confusion Matrix* \(\mathbf{C}\), where \(C_{ij}\) is the number of instances of class \(i\) that are predicted to belong to class \(j\). This is useful for visualising the performance of the classifier, and the metrics discussed below can be calculated from it.
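As a minimal sketch of how such a matrix can be built, the following counts pairs of actual and predicted labels. The function name `confusion_matrix` and the labels are illustrative assumptions, and classes are assumed to be encoded as integers from 0 to `n_classes - 1`.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Return C where C[i][j] counts items of actual class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels for a three-class problem (illustrative data only).
actual = [0, 0, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 2, 0, 2]

C = confusion_matrix(actual, predicted, n_classes=3)
for row in C:
    print(row)
```

Rows correspond to actual classes and columns to predicted classes, so the diagonal holds the correct predictions.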

Consider a binary classification problem. We may classify the results in our test dataset as True Positives, True Negatives, False Positives and False Negatives. The number of each of these is denoted \(\mathrm{TP} = C_{1,1}\), \(\mathrm{TN} = C_{0,0}\), \(\mathrm{FP} = C_{0,1}\) and \(\mathrm{FN} = C_{1,0}\) respectively.

The *Precision* of the classifier is the probability that an item predicted to be true is actually true. This is given by

\[\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}\]

In Bayesian terms, if the predicted class is \(p\) and the actual class is \(a\),

\[\mathrm{Precision} = P(a = 1 \mid p = 1)\]

The *Recall* of the classifier is the probability that a true item is predicted to be true. This is given by

\[\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = P(p = 1 \mid a = 1)\]
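Both quantities can be computed directly from the four counts. The sketch below uses hypothetical counts, and returning 0.0 when the denominator is zero is a convention assumed here rather than part of the definition.

```python
def precision(tp, fp):
    """TP / (TP + FP): the fraction of predicted positives that are actually positive."""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN): the fraction of actual positives that are predicted positive."""
    actual_positive = tp + fn
    # Assumed convention: report 0.0 when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts from a test set of 100 items.
tp, tn, fp, fn = 40, 45, 10, 5

print(f"Precision: {precision(tp, fp):.3f}")
print(f"Recall:    {recall(tp, fn):.3f}")
```

A classifier that labels every item as true has no false negatives, so its Recall is 1 however many false positives it produces. Its Precision is then simply the proportion of true items in the data.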

Which of these is more informative depends on the application. In The Grammar of Truth and Lies, my initial approach gave 100% Recall. However, since I had designated *True* to indicate a reliable article and *False* to indicate fake news, Precision was a more important measure of the model's ability to discriminate fact from fiction.

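The F1 score is a metric that seeks to balance Precision and Recall. It is defined as their harmonic mean.

\[F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}\]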

This measures similarity between the set of items predicted to be true and those that actually are true, but is not easy to interpret in terms of a Bayesian probability.
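To illustrate the set-similarity interpretation, the sketch below computes F1 from counts and compares it with the overlap between two sets of item identifiers (twice the size of the intersection, divided by the sum of the two set sizes). The function names and the item identifiers are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    # Assumed convention: report 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


def set_overlap(predicted_true, actually_true):
    """2 |A & B| / (|A| + |B|) for the predicted and actual sets of true items."""
    total = len(predicted_true) + len(actually_true)
    return 2 * len(predicted_true & actually_true) / total if total else 0.0


# Hypothetical item ids: 50 items predicted true, 45 actually true, 40 in common.
predicted_true = set(range(0, 50))
actually_true = set(range(10, 55))

tp = len(predicted_true & actually_true)  # predicted true and actually true
fp = len(predicted_true - actually_true)  # predicted true but actually false
fn = len(actually_true - predicted_true)  # actually true but predicted false

print(f1_score(tp, fp, fn), set_overlap(predicted_true, actually_true))
```

The two values agree because the intersection of the sets is exactly the set of true positives.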

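The *Accuracy* of the model is the probability that it predicts the correct class.

\[\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}\]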

This is intuitive to interpret and, unlike the metrics discussed above, takes the true negatives into account. However, it becomes uninformative if classes are strongly imbalanced. For example, if we wish to predict whether or not a user will click on a given advertisement, and fewer than 1% of impressions lead to a click, we can achieve over 99% accuracy by predicting *No* all the time. We therefore need metrics that correct for class imbalance.
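The effect of class imbalance is easy to reproduce. The sketch below uses made-up data in which 10 of 1,000 impressions are clicked (the 1% click rate is an assumption chosen for illustration) and scores a classifier that always predicts *No*.

```python
def accuracy(actual, predicted):
    """Fraction of items whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical data: 1,000 ad impressions, of which 10 were clicked (1 = click).
actual = [1] * 10 + [0] * 990
always_no = [0] * len(actual)

baseline_accuracy = accuracy(actual, always_no)
clicks_found = sum(1 for a, p in zip(actual, always_no) if a == 1 and p == 1)

print(f"Accuracy: {baseline_accuracy:.2%}, clicks identified: {clicks_found}")
```

The accuracy looks excellent even though the classifier never identifies a single click.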

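*Cohen's Kappa* is a measure of how much better a classifier is than guesswork. If we guessed the class of an item without information, our best strategy would be to pick the maximum-likelihood class every time, and this would give us a success rate of \(P_{\mathrm{max}}\). We can then define

\[\kappa = \frac{\mathrm{Accuracy} - P_{\mathrm{max}}}{1 - P_{\mathrm{max}}}\]

Strictly, the standard definition of Cohen's Kappa replaces \(P_{\mathrm{max}}\) with the agreement expected by chance, \(P_e = \sum_k P(a = k)\,P(p = k)\), where \(a\) is the actual class and \(p\) is the predicted class.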
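A minimal sketch of this idea follows. The first function uses the majority-class baseline \(P_{\mathrm{max}}\) described in the text, and the second uses the chance agreement \(P_e\) from the usual definition of Cohen's Kappa. The function names are hypothetical, and both assume that more than one class is present, since otherwise the denominator is zero.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of items whose predicted class equals the actual class."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def kappa_vs_majority(actual, predicted):
    """(Accuracy - P_max) / (1 - P_max), with P_max the share of the commonest actual class."""
    p_max = max(Counter(actual).values()) / len(actual)
    return (accuracy(actual, predicted) - p_max) / (1 - p_max)


def cohen_kappa(actual, predicted):
    """(Accuracy - P_e) / (1 - P_e), with P_e the agreement expected by chance."""
    n = len(actual)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # A Counter returns 0 for classes that were never predicted.
    p_e = sum(actual_counts[k] * predicted_counts[k] for k in actual_counts) / (n * n)
    return (accuracy(actual, predicted) - p_e) / (1 - p_e)


# Hypothetical imbalanced data: 10 positives in 1,000 items, always predicting 0.
actual = [1] * 10 + [0] * 990
always_no = [0] * len(actual)

print(kappa_vs_majority(actual, always_no), cohen_kappa(actual, always_no))
```

Both versions score the always-*No* classifier at zero, reflecting that it is no better than guesswork despite its high accuracy.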

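The *Matthews Correlation Coefficient* is the Pearson Correlation Coefficient between the actual and predicted classes. It is calculated as

\[\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}\]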
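The sketch below computes the coefficient from the four counts and, as a check, compares it with a Pearson correlation computed directly on 0/1 labels. The counts are hypothetical, the function names are illustrative, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts, expanded into explicit 0/1 label lists for the comparison.
tp, tn, fp, fn = 40, 45, 10, 5
actual = [1] * tp + [0] * tn + [0] * fp + [1] * fn
predicted = [1] * tp + [0] * tn + [1] * fp + [0] * fn

print(mcc_binary(tp, tn, fp, fn), pearson(actual, predicted))
```

The two values agree to within floating-point rounding.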

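Accuracy and Cohen's Kappa can be extended to the multiclass case in the obvious way. It is not trivial to do this for Precision and Recall. However, we can define them on a per-class basis. For class \(i\),

\[\mathrm{Precision}_i = \frac{C_{ii}}{\sum_j C_{ji}}, \qquad \mathrm{Recall}_i = \frac{C_{ii}}{\sum_j C_{ij}}\]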

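Evidently AI suggests three methods for calculating overall precision and recall scores in a multiclass problem. *Macro averaging* simply calculates the mean of precision and recall across all classes.

\[\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{N} \sum_{i} \mathrm{Precision}_i, \qquad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{N} \sum_{i} \mathrm{Recall}_i\]

where \(N\) is the number of classes, and \(\mathrm{Precision}_i\) and \(\mathrm{Recall}_i\) are the precision and recall for class \(i\).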
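As a sketch, per-class and macro-averaged scores can be read off a confusion matrix whose rows are actual classes and whose columns are predicted classes. The matrix below is made up for illustration, the function names are hypothetical, and a class that is never predicted (or never occurs) is given a score of 0.0 by assumption.

```python
def per_class_precision(C):
    """C[k][k] divided by the column sum: correct predictions of k over all predictions of k."""
    n = len(C)
    scores = []
    for k in range(n):
        predicted_k = sum(C[i][k] for i in range(n))
        scores.append(C[k][k] / predicted_k if predicted_k else 0.0)
    return scores


def per_class_recall(C):
    """C[k][k] divided by the row sum: correct predictions of k over all actual instances of k."""
    scores = []
    for k, row in enumerate(C):
        actual_k = sum(row)
        scores.append(row[k] / actual_k if actual_k else 0.0)
    return scores


def macro_average(scores):
    """Unweighted mean over classes."""
    return sum(scores) / len(scores)


# Hypothetical three-class confusion matrix (rows: actual, columns: predicted).
C = [[5, 1, 0],
     [2, 3, 1],
     [0, 2, 6]]

macro_precision = macro_average(per_class_precision(C))
macro_recall = macro_average(per_class_recall(C))
print(f"Macro precision: {macro_precision:.3f}, macro recall: {macro_recall:.3f}")
```

Every class contributes equally to the result, however few instances it has.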

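*Micro averaging* gives an average of precision and recall across all instances.

\[\mathrm{Precision}_{\mathrm{micro}} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \left(\mathrm{TP}_i + \mathrm{FP}_i\right)}, \qquad \mathrm{Recall}_{\mathrm{micro}} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \left(\mathrm{TP}_i + \mathrm{FN}_i\right)}\]

where \(\mathrm{TP}_i\), \(\mathrm{FP}_i\) and \(\mathrm{FN}_i\) are the numbers of true positives, false positives and false negatives obtained when class \(i\) is treated as the positive class.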

These two quantities are equal to each other (and to the Accuracy), as a false negative for one class is a false positive for another. So while finer grained in one way, micro averaging loses information in another.
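The following sketch pools the per-class counts over a made-up confusion matrix (rows: actual, columns: predicted) and shows that the pooled false positives and false negatives are the same off-diagonal entries. The function name is hypothetical, and the matrix is assumed to contain at least one instance.

```python
def micro_precision_recall(C):
    """Pool TP, FP and FN over all classes before dividing."""
    n = len(C)
    tp = sum(C[k][k] for k in range(n))
    # False positives for class k: items of other classes predicted as k (column k).
    fp = sum(C[i][k] for k in range(n) for i in range(n) if i != k)
    # False negatives for class k: items of class k predicted as other classes (row k).
    fn = sum(C[k][j] for k in range(n) for j in range(n) if j != k)
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical three-class confusion matrix.
C = [[5, 1, 0],
     [2, 3, 1],
     [0, 2, 6]]

micro_precision, micro_recall = micro_precision_recall(C)
overall_accuracy = sum(C[k][k] for k in range(len(C))) / sum(sum(row) for row in C)
print(micro_precision, micro_recall, overall_accuracy)
```

Each off-diagonal entry is counted once as a false positive and once as a false negative, so all three printed values coincide.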

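The third possibility is *weighted averaging*. While macro averaging gives all classes equal weight, weighted averaging considers their overall prevalence in the data.

\[\mathrm{Precision}_{\mathrm{weighted}} = \sum_i w_i\,\mathrm{Precision}_i, \qquad \mathrm{Recall}_{\mathrm{weighted}} = \sum_i w_i\,\mathrm{Recall}_i, \qquad w_i = \frac{\sum_j C_{ij}}{\sum_k \sum_j C_{kj}}\]

where \(w_i\) is the fraction of instances whose actual class is \(i\).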
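A sketch of weighted averaging follows, taking each class's weight to be its share of the actual instances (the row sums of the confusion matrix). The matrix and function names are hypothetical, and a class that is never predicted or never occurs is scored 0.0 by assumption.

```python
def per_class_scores(C):
    """Return (precision, recall) lists for a matrix with rows = actual, columns = predicted."""
    n = len(C)
    precisions, recalls = [], []
    for k in range(n):
        predicted_k = sum(C[i][k] for i in range(n))  # column sum
        actual_k = sum(C[k])                          # row sum
        precisions.append(C[k][k] / predicted_k if predicted_k else 0.0)
        recalls.append(C[k][k] / actual_k if actual_k else 0.0)
    return precisions, recalls


def weighted_average(scores, C):
    """Average per-class scores, weighting each class by its number of actual instances."""
    support = [sum(row) for row in C]
    return sum(w * s for w, s in zip(support, scores)) / sum(support)


# Hypothetical three-class confusion matrix.
C = [[5, 1, 0],
     [2, 3, 1],
     [0, 2, 6]]

precisions, recalls = per_class_scores(C)
weighted_precision = weighted_average(precisions, C)
weighted_recall = weighted_average(recalls, C)
print(f"Weighted precision: {weighted_precision:.3f}, weighted recall: {weighted_recall:.3f}")
```

With these weights, the weighted recall always equals the overall accuracy, because each class's recall is multiplied by exactly the number of instances it was divided by.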

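To generalise the Matthews Correlation Coefficient to multiple classes, we first define the following terms

\(t_k = \sum_{j} C_{kj}\) is the number of times class \(k\) occurs

\(p_k = \sum_{i} C_{ik}\) is the number of times class \(k\) is predicted

\(c = \sum_{k} C_{kk}\) is the number of correct predictions

\(s = \sum_{i} \sum_{j} C_{ij}\) is the total number of samples

We then obtain

\[\mathrm{MCC} = \frac{c\,s - \sum_k p_k t_k}{\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}}\]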
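The multiclass coefficient can be sketched from the four quantities just defined: the number of times each class occurs, the number of times it is predicted, the number of correct predictions and the total number of samples. The variable names `t`, `p`, `c` and `s` are chosen here for illustration, the matrices are made up, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def mcc_multiclass(C):
    """Multiclass coefficient for a matrix with rows = actual and columns = predicted."""
    n = len(C)
    t = [sum(C[k]) for k in range(n)]                       # times class k occurs
    p = [sum(C[i][k] for i in range(n)) for k in range(n)]  # times class k is predicted
    c = sum(C[k][k] for k in range(n))                      # correct predictions
    s = sum(t)                                              # total number of samples
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # Assumed convention: report 0.0 when only one class occurs or is predicted.
    return numerator / denominator if denominator else 0.0


# Hypothetical three-class confusion matrix.
C = [[5, 1, 0],
     [2, 3, 1],
     [0, 2, 6]]
print(f"Multiclass MCC: {mcc_multiclass(C):.3f}")

# Hypothetical binary case written as a 2x2 matrix: [[TN, FP], [FN, TP]].
binary = [[45, 10],
          [5, 40]]
print(f"Binary MCC via the multiclass formula: {mcc_multiclass(binary):.3f}")
```

For two classes the result reduces to the binary coefficient computed from TP, TN, FP and FN.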

Once you have the numbers, of course, it's important to dig deeper and understand what the factors influencing your model's performance are.
