A simple classification algorithm
Over the past few weeks, we have been looking at algorithms related to Bayes' Theorem. This week, we are starting on a different tack, but it's still in the realm of relating probabilities to observations.
We start with the logistic function

\[ p = \frac{1}{1 + e^{-q}} \]

where \(q\) is a quantity we call a logit. This has the property that as \(q \rightarrow \infty\), \(p \rightarrow 1\) and as \(q \rightarrow -\infty\), \(p \rightarrow 0\), so it can be used to model a probability. If we wish to calculate the probabilities of more than one class, we can generalise this with the softmax function

\[ p_{i} = \frac{e^{q_{i}}}{\sum_{k} e^{q_{k}}} \]

where \(p_{i}\) and \(q_{i}\) represent the probabilities and logits for each class \(i\) respectively.
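As a minimal sketch (not from the original post), these two functions can be written directly in NumPy; the max-subtraction in the softmax is a standard trick to avoid overflow:

```python
import numpy as np

def logistic(q):
    """Map a logit q to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-q))

def softmax(q):
    """Map a vector of logits to a vector of class probabilities."""
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max())  # subtract the max for numerical stability
    return e / e.sum()
```

Note that `softmax` applied to two logits \((0, q)\) reduces to `logistic(q)`, which is the sense in which it generalises the binary case.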
But what are the logits? In the basic implementation of logistic regression, they are a linear function of some observations. Given a vector \(\vec{x}\) of observations, we may model the logits as

\[ q = \vec{w} \cdot \vec{x} + b \]

for the binary case and
\[ \vec{q} = \mathbf{W}\vec{x} + \vec{b} \]

in the multiclass case, where \(\vec{w}\) and \(\mathbf{W}\) are weights and \(b\) and \(\vec{b}\) are biases. In terms of Bayes' Theorem,

\[ p = P(C_{1} \mid \vec{x}) \]

and

\[ q = \ln \frac{P(\vec{x} \mid C_{1})\,P(C_{1})}{P(\vec{x} \mid C_{0})\,P(C_{0})} \]
We fit the weights and biases by minimising the cross-entropy loss

\[ L = -\sum_{j} \ln p^{(j)}_{c} \]

where \(c\) is the correct class for example \(j\) in the training dataset and \(p^{(j)}_{c}\) is the probability the model assigns to it.
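A minimal NumPy sketch of the whole fitting procedure (the toy data, learning rate and iteration count here are illustrative assumptions, not from the post): plain gradient descent on the cross-entropy loss, using the fact that for one-hot targets the gradient of the loss with respect to the logits is simply \(p - y\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data (hypothetical): two Gaussian blobs, one per class.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

n_classes = 2
W = np.zeros((n_classes, X.shape[1]))  # weights, one row per class
b = np.zeros(n_classes)                # biases

def softmax_rows(Q):
    e = np.exp(Q - Q.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(200):
    P = softmax_rows(X @ W.T + b)   # predicted class probabilities
    onehot = np.eye(n_classes)[y]   # one-hot encoding of the correct classes
    grad = P - onehot               # d(loss)/d(logits) for cross-entropy
    W -= lr * grad.T @ X / len(X)   # chain rule through the linear layer
    b -= lr * grad.mean(axis=0)

loss = -np.log(P[np.arange(len(y)), y]).mean()
```

After training, predictions are just the argmax of the logits; on data this well separated the fit converges quickly.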
This works well as a simple classifier under two conditions:
- The classes are fairly evenly balanced
- The classes are linearly separable
If there is a strong imbalance between the classes, the bias will tend to dominate over the weights, and the rarer classes will never be predicted. To mitigate this, it is possible to undersample the more common classes or oversample the rarer ones before training.
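As a simple sketch of the oversampling approach (the function name and strategy of matching the largest class are my assumptions, not from the post), we can resample each class with replacement until all classes are the same size:

```python
import numpy as np

rng = np.random.default_rng(1)

def oversample(X, y):
    """Resample each class with replacement until every class matches
    the size of the most common one (a minimal sketch)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]
```

Undersampling works the same way with `target = counts.min()` and `replace=False`, at the cost of discarding data from the common classes.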
If the classes are not linearly separable, it's necessary to transform the data into a space where they are. This may be done by applying

\[ \vec{x}' = f(\mathbf{M}\vec{x}) \]
where \(f\) is some non-linear function and \(\mathbf{M}\) is a matrix of weights. We may in fact apply several layers of similar transformations, each with its own set of weight parameters. That is the basis of neural networks.
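One such layer can be sketched in a few lines (the choice of \(\tanh\) as the non-linearity is an illustrative assumption; any non-linear \(f\) would do):

```python
import numpy as np

rng = np.random.default_rng(2)

def transform(X, M, f=np.tanh):
    """Apply one layer of non-linear transformation, f(M x), to each row x."""
    return f(X @ M.T)

# Example: project 3-dimensional observations into a 4-dimensional space.
X = rng.normal(size=(5, 3))   # 5 observations, 3 features each
M = rng.normal(size=(4, 3))   # weight matrix mapping 3 features to 4
H = transform(X, M)
```

Feeding `H` (or the output of several such layers composed together) into the logistic regression above is exactly the structure of a feed-forward neural network.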
| Previous | Next |
|---|---|
| Markov Chain Monte Carlo | The Chain Rule and Backpropagation |