
Logistic Regression

A simple classification algorithm

Over the past few weeks, we have been looking at algorithms related to Bayes' Theorem. This week, we are starting on a different tack, but it's still in the realm of relating probabilities to observations.

We start with the logistic function

$$p = \frac{1}{1+e^{-q}}$$

where \(q\) is a quantity we call a logit. This has the property that as \(q \rightarrow \infty\), \(p \rightarrow 1\), and as \(q \rightarrow -\infty\), \(p \rightarrow 0\), so it can be used to model a probability. If we wish to calculate the probabilities of more than one class, we can generalise this with the softmax function

$$p_{i} = \frac{e^{q_{i}}}{\sum_{j} e^{q_{j}}}$$

where \(p_{i}\) and \(q_{i}\) represent the probabilities and logits for each class \(i\) respectively.
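As a minimal sketch in Python (using NumPy; the function names are my own), these two functions might be written as follows:

```python
import numpy as np

def logistic(q):
    """Map a single logit q to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-q))

def softmax(q):
    """Map a vector of logits to a probability distribution over classes."""
    q = q - np.max(q)          # subtract the maximum logit for numerical stability
    exp_q = np.exp(q)
    return exp_q / exp_q.sum()

print(logistic(0.0))                        # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])))   # three probabilities summing to 1
```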

But what are the logits? In the basic implementation of logistic regression, they are a linear function of some observations. Given a vector \(\vec{x}\) of observations, we may model the logits as

$$q = \vec{w} \cdot \vec{x} + b$$

for the binary case and

$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec {b}$$

in the multiclass case, where \(\vec{w}\) and \(\mathbf{W}\) are weights and \(b\) and \(\vec{b}\) are biases. In terms of Bayes' Theorem,
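As an illustration, with made-up weights and a made-up observation, the logits and probabilities can be computed like so (a sketch, not a particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # a single observation with 4 features

# Binary case: a weight vector and a scalar bias give one logit.
w = rng.normal(size=4)
b = 0.1
q = w @ x + b
p = 1.0 / (1.0 + np.exp(-q))       # probability of the positive class

# Multiclass case: a weight matrix and a bias vector give one logit per class.
W = rng.normal(size=(3, 4))        # 3 classes, 4 features
b_vec = rng.normal(size=3)
q_vec = W @ x + b_vec
p_vec = np.exp(q_vec - q_vec.max())
p_vec /= p_vec.sum()               # softmax over the class logits
```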

$$\vec{b} = \ln P(H)$$

and

$$\mathbf{W} \cdot \vec{x} = \ln P(\vec{x} \mid H)$$
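To see why, substitute these logits into the softmax: the normalisation plays the role of the evidence term, and Bayes' Theorem drops out directly,

$$p_{i} = \frac{e^{\ln P(\vec{x} \mid H_{i}) + \ln P(H_{i})}}{\sum_{j} e^{\ln P(\vec{x} \mid H_{j}) + \ln P(H_{j})}} = \frac{P(\vec{x} \mid H_{i})\, P(H_{i})}{\sum_{j} P(\vec{x} \mid H_{j})\, P(H_{j})} = P(H_{i} \mid \vec{x})$$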

We fit the weights and biases by minimising the cross-entropy loss

$$L = -\sum_{j} \ln p_{j,c}$$

where \(c\) is the correct class for example \(j\) in the training dataset, and \(p_{j,c}\) is the probability the model assigns to it.
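A minimal sketch of such a fit, using plain gradient descent on the cross-entropy loss (the function name, learning rate, and step count here are illustrative choices):

```python
import numpy as np

def fit_logistic_regression(X, y, n_classes, lr=0.1, n_steps=1000):
    """Fit weights W and biases b by gradient descent on the cross-entropy loss.

    X: (n_examples, n_features) array of observations.
    y: (n_examples,) array of integer class labels.
    """
    n_examples, n_features = X.shape
    W = np.zeros((n_classes, n_features))
    b = np.zeros(n_classes)
    for _ in range(n_steps):
        logits = X @ W.T + b                         # (n_examples, n_classes)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
        # Gradient of the cross-entropy loss w.r.t. the logits:
        # predicted probabilities minus the one-hot targets.
        grad = p.copy()
        grad[np.arange(n_examples), y] -= 1.0
        W -= lr * (grad.T @ X) / n_examples
        b -= lr * grad.mean(axis=0)
    return W, b
```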

This works well as a simple classifier under two conditions:

  1. The classes are fairly evenly balanced
  2. The classes are linearly separable

If there is a strong imbalance between the classes, the bias will tend to dominate over the weights, and the rarer classes will never be predicted. To mitigate this, it is possible to undersample the more common classes or oversample the rarer ones before training.
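For example, a simple oversampling helper (the name and seeding are my own) might resample each class, with replacement, up to the size of the largest one:

```python
import numpy as np

def oversample(X, y, rng=None):
    """Resample each class (with replacement) up to the size of the largest class."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)                  # mix the classes back together
    return X[idx], y[idx]
```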

If the classes are not linearly separable, it's necessary to transform the data into a space where they are. This may be done by applying

$$\vec{x^{\prime}} = f(\mathbf{M} \cdot \vec{x})$$

where \(f\) is some non-linear function and \(\mathbf{M}\) is a matrix of weights. We may in fact apply several layers of similar transformations, each with its own set of weight parameters. That is the basis of neural networks.
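As a sketch, one such transformation might look like the following, where the matrix \(\mathbf{M}\) is filled with arbitrary values purely for illustration; in a neural network these weights would themselves be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random projection followed by a non-linear function f (here tanh),
# mapping 4 raw features to 8 transformed ones: x' = f(M . x)
M = rng.normal(size=(8, 4))

def transform(x):
    return np.tanh(M @ x)

x = rng.normal(size=4)
x_prime = transform(x)
# x_prime can now be fed into the logistic regression above in place of x.
```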

By Dr Peter J Bleackley
Tags: #algorithms, #probability, #classification, #logistic regression, #bayesian, #neural