Playful Technology Limited - Key Algorithmshttps://PlayfulTechnology.co.uk/2024-07-18T00:00:00+01:00Expectation Maximisation2024-07-18T00:00:00+01:002024-07-18T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-07-18:/expectation-maximisation.html<p>Modelling data with latent variables</p><p>Consider a dataset where each datapoint <span class="math">\(\mathbf{x}\)</span> is associated with an unobserved <em>latent variable</em> <span class="math">\(z\)</span>. We wish to fit a model that accounts for these latent variables. If we knew the values of <span class="math">\(z\)</span> we would be able to fit a supervised classifier to predict them, but since they are unknown values that must be estimated by the model, we must estimate the latent variables and fit the model parameters at the same time.</p>
<p>The method we use to do this is called <em>Expectation Maximisation</em>. It iterates over two steps:</p>
<dl>
<dt>Expectation step (E-step)</dt>
<dd>Estimate the latent variables given the current model parameters</dd>
<dt>Maximisation step (M-step)</dt>
<dd>Update the model parameters to maximise the likelihood of the data given the current estimates of the latent variables</dd>
</dl>
<p>until the model converges (the improvement in the likelihood is less than a given threshold).</p>
<p><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-means clustering</a> is a simple form of expectation maximisation, where the latent variables are the cluster labels, the model parameters are the centroids, the expectation step consists of assigning data points to the cluster with the nearest centroid, and the maximisation step consists of recalculating the centroids. It assumes that all clusters have the same variance, and makes a hard assignment of cluster labels, which means that the uncertainty in the assignments is not quantified.</p>
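<p>To make the correspondence concrete, here is a minimal sketch of K-means as hard expectation maximisation, using NumPy. The function name and the initialisation strategy are illustrative, not a reference implementation.</p>

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means as hard EM: the E-step assigns each point to the nearest
    centroid, the M-step recomputes each centroid as its cluster mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: hard assignment to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = distances.argmin(axis=1)
        # M-step: recompute each centroid (keeping it unchanged if its cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return labels, centroids
```

<p>On well-separated data the iteration settles after a handful of E/M cycles.</p>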
<p>These assumptions can be relaxed by a <em>Gaussian Mixture Model</em>. This assumes that each datapoint is drawn from one of several <span class="math">\(k\)</span>-dimensional normal distributions </p>
<div class="math">$$P(\vec{x} \mid z) = \frac{e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_{z}) \cdot \mathbf{\Sigma}^{-1}_{z} \cdot (\vec{x}-\vec{\mu}_{z})}}{\sqrt{(2 \pi)^{k} |\mathbf{\Sigma}_{z}|}}$$</div>
<p>where <span class="math">\(\vec{\mu}_{z}\)</span> and <span class="math">\(\mathbf{\Sigma}_{z}\)</span> are the mean and covariance of distribution <span class="math">\(z\)</span> respectively. The prior probability of the clusters is <span class="math">\(P(z)\)</span>.</p>
<p>In the expectation step, we calculate cluster membership probabilities according to <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' theorem</a></p>
<div class="math">$$P(z \mid \vec{x}) = \frac{P(z) P(\vec{x} \mid z)}{\sum_{z} P(z) P (\vec{x} \mid z)}$$</div>
<p>We can then use these probabilities in the maximisation step to update the model parameters using weighted means and covariances</p>
<div class="math">$$\vec{\mu}_{z} = \frac{\sum_{i} \vec{x}_{i} P(z \mid \vec{x}_{i})}{\sum_{i} P(z \mid \vec{x}_{i})}$$</div>
<div class="math">$$\mathbf{\Sigma}_{z} = \frac{\sum_{i} (\vec{x}_{i} - \vec{\mu}_z) \otimes (\vec{x}_{i} - \vec{\mu}_{z}) P(z \mid \vec{x}_{i})}{\sum_{i} P(z \mid \vec{x}_{i})}$$</div>
<div class="math">$$P(z) = \frac{\sum_{i} P(z \mid \vec{x}_{i})}{N}$$</div>
<p> where <span class="math">\(N\)</span> is the number of datapoints.</p>
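<p>One EM iteration for a Gaussian mixture can be sketched in NumPy directly from the updates above. This is a minimal illustration (no log-space likelihoods or covariance regularisation, which a production implementation would need); the function names are illustrative.</p>

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density, evaluated for each row of X."""
    k = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    exponent = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(exponent) / norm

def em_step(X, means, covs, priors):
    """One expectation-maximisation iteration for a Gaussian mixture."""
    n_clusters = len(priors)
    # E-step: responsibilities P(z | x) via Bayes' theorem
    likelihoods = np.column_stack(
        [priors[z] * gaussian_pdf(X, means[z], covs[z]) for z in range(n_clusters)])
    resp = likelihoods / likelihoods.sum(axis=1, keepdims=True)
    # M-step: weighted means, covariances and priors
    weights = resp.sum(axis=0)
    means = [(resp[:, z, None] * X).sum(axis=0) / weights[z] for z in range(n_clusters)]
    covs = [((X - means[z]).T * resp[:, z]) @ (X - means[z]) / weights[z]
            for z in range(n_clusters)]
    priors = weights / len(X)
    return means, covs, priors
```

<p>Iterating <code>em_step</code> until the likelihood stops improving gives the fitted mixture.</p>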
<p>Gaussian mixture models can be seen as a more rigorous version of K-means. However, in high dimensions there is a risk of the covariance matrices becoming singular (all the datapoints for a particular cluster lie in a lower-dimensional subspace), so it is a good idea to apply <a href="https://PlayfulTechnology.co.uk/data-reduction.html">data reduction</a> first.</p>
<p>Mixture models of other distributions are possible, but may require <a href="https://PlayfulTechnology.co.uk/gradient-descent.html">gradient descent</a> or <a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a> in the maximisation step.</p>
<p>The <a href="https://PlayfulTechnology.co.uk/tokenizers.html">SentencePiece</a> tokenizer uses expectation maximisation to calculate the token frequency distribution. The expectation step calculates the token probabilities given the current maximum likelihood segmentation of the input, while the maximisation step uses the <a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">Viterbi algorithm</a> to calculate a new maximum likelihood segmentation.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/tokenizers.html">Tokenizers</a></td>
<td></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Tokenizers2024-07-11T00:00:00+01:002024-07-11T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-07-11:/tokenizers.html<p>Segmenting text into meaningful units</p><p>As mentioned in the article on <a href="https://PlayfulTechnology.co.uk/transformers.html">transformers</a>, most NLP models start by dividing the text into <em>tokens</em>. The simplest way of doing this is to split the text on whitespace or non-alphabetic characters. <a href="https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize">Gensim's tokenize function</a> does this, and it is adequate for bag-of-words models like <a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a> or <a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a>. However, it has a number of disadvantages.</p>
<ol>
<li>Not all languages divide words on spaces - Mandarin and Japanese do not.</li>
<li>Words related to each other by prefixes and suffixes are treated as entirely different words - for example, "surprise", "surprised", "surprising", "surprisingly" and "unsurprisingly" would not be treated as related words. This is particularly a problem for languages whose morphology carries more information than English's does. Stemming or lemmatisation can partially address this, but discards the information in the affixes.</li>
<li>Assigning numerical indexes to rare words becomes a problem. A rare word that is seen during training will get an index, but may not be encountered in subsequent use, whereas a word that has not been seen during training cannot be assigned an index in subsequent use.</li>
</ol>
<p>While some models, such as those based on <a href="https://PlayfulTechnology.co.uk/recurrent-neural-networks.html">Recurrent Neural Networks</a>, simply use characters as input, this of course means that the meaning of words must be inferred from scratch at every invocation of the model. </p>
<p><em>Subword tokenizers</em> attempt to identify meaningful sequences within the input, which may correspond to words, word stems, affixes or punctuation. Thus we would expect "unsurprisingly" to be tokenized as "un-surpris-ing-ly". This is useful for transformer models, where the <a href="https://PlayfulTechnology.co.uk/the-attention-mechanism.html">Attention Mechanism</a> can then use each subword to modify the meaning of the others.</p>
<p>The simplest method for this is <em>Byte Pair Encoding</em>. This starts by treating each byte of the training corpus as a token. It then calculates the frequencies of each pair of adjacent tokens occurring in the corpus, replaces the most frequent pair with a new token representing that pair, and learns a <em>merge rule</em> for that pair of tokens. This process repeats until a vocabulary of a given size has been obtained. To tokenize a text, it applies the merge rules to the text in the order in which they were learned.</p>
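<p>A toy sketch of this training loop is shown below. For readability it works on characters rather than raw bytes, ignores word boundaries, and uses illustrative function names; real implementations track pair counts incrementally rather than recounting each round.</p>

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Apply a single merge rule to one token sequence."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe(words, n_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair wins
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges

def tokenize(word, merges):
    """Tokenize by replaying the learned merge rules in order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```

<p>Training on a corpus dominated by "low"-family words quickly learns merges such as "lo" and "low", so related words share subword tokens.</p>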
<p>A variation of this is the <em>WordPiece</em> tokenizer. Rather than picking tokens to merge by their overall frequency, this uses a score
</p>
<div class="math">$$S_{ij} = \frac{F_{ij}}{F_{i} F_{j}}$$</div>
<p> where <span class="math">\(F_{i}\)</span>, <span class="math">\(F_{j}\)</span> and <span class="math">\(F_{ij}\)</span> are the frequencies of tokens <span class="math">\(i\)</span>, <span class="math">\(j\)</span> and the sequence <span class="math">\(ij\)</span> respectively. Rather than using merge rules, it tokenises texts by finding the longest substring at the beginning of the text that is in its vocabulary, splitting it off as a token, and repeating until the entire text has been divided into tokens.</p>
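<p>The longest-match-first tokenization step can be sketched as follows. This simplification omits the "##" continuation-prefix convention that real WordPiece vocabularies use, and the <code>[UNK]</code> marker here is illustrative.</p>

```python
def wordpiece_tokenize(text, vocab):
    """Greedy longest-match-first tokenization: repeatedly split off the
    longest prefix of the remaining text that is in the vocabulary."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            # Nothing matched, not even one character: emit an unknown marker
            tokens.append('[UNK]')
            text = text[1:]
    return tokens
```

<p>Given a vocabulary containing "un", "surpris", "ing" and "ly", this splits "unsurprisingly" into the expected subwords.</p>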
<p>These methods build their vocabulary from the bottom up. In contrast, the <em>Unigram</em> tokenizer builds its vocabulary from the top down. It starts by calculating the frequencies of all the words and their substrings in the training corpus, and discards infrequent tokens until the required vocabulary size is reached. To tokenize a text, it uses the <a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">Viterbi algorithm</a> to find the maximum likelihood sequence of tokens corresponding to the text.</p>
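<p>The Viterbi search over segmentations is a dynamic programme over split points: the best segmentation ending at each position extends the best segmentation ending at some earlier position by one vocabulary token. A minimal sketch, with illustrative names:</p>

```python
import math

def viterbi_segment(text, token_logprobs):
    """Maximum-likelihood segmentation of `text` under a unigram model,
    by dynamic programming over split points."""
    n = len(text)
    best = [0.0] + [-math.inf] * n      # best log-likelihood ending at position i
    back = [0] * (n + 1)                # backpointer to the previous split point
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in token_logprobs and best[j] + token_logprobs[piece] > best[i]:
                best[i] = best[j] + token_logprobs[piece]
                back[i] = j
    # Trace the backpointers to recover the best token sequence
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

<p>Because the whole-word token "union" is more probable than any multi-token split, the search keeps it as a single token.</p>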
<p><em>SentencePiece</em> is an algorithm that aims to account for multiple possible segmentations in a more robust way than either byte pair encoders or unigram models. It is trained by the following procedure.</p>
<ol>
<li>Train a byte pair encoder or unigram tokenizer with a larger vocabulary size than we ultimately need.</li>
<li>Use the Viterbi algorithm to calculate the maximum likelihood segmentation of the training corpus according to this model.</li>
<li>Recalculate the token frequencies according to the maximum-likelihood segmentation.</li>
<li>Repeat steps 2 and 3 until the token frequencies converge.</li>
<li>Prune the model down to the required vocabulary size.</li>
</ol>
<p>To tokenise a text, SentencePiece maximises the score
</p>
<div class="math">$$S = \frac{\log P(\mathbf{y} \mid \mathbf{x})}{|\mathbf{y}|^{\lambda}}$$</div>
<p>
Where <span class="math">\(\mathbf{y}\)</span> is the tokenized text, <span class="math">\(\mathbf{x}\)</span> is the untokenized text, <span class="math">\(|\mathbf{y}|\)</span> is the length of the tokenized text, and <span class="math">\(\lambda\)</span> is a constant. </p>
<p>The <a href="https://huggingface.co/docs/tokenizers/index">HuggingFace Tokenizers Library</a> contains implementations of byte pair encoding, WordPiece and unigram tokenizers, and pretrained tokenizers are available for models obtained from HuggingFace. <a href="https://pypi.org/project/sentencepiece/">SentencePiece</a> can be found on PyPI.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/loss-functions.html">Loss Functions</a></td>
<td><a href="https://PlayfulTechnology.co.uk/expectation-maximisation.html">Expectation Maximisation</a></td>
</tr>
</tbody>
</table>
Loss Functions2024-07-04T00:00:00+01:002024-07-04T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-07-04:/loss-functions.html<p>What we minimise when we fit a model</p><p>In several articles, we have mentioned <em>Loss Functions</em>: the functions we aim to minimise when fitting a model. These are functions </p>
<div class="math">$$\mathcal{L}(y_{t},y_{p})$$</div>
<p> where <span class="math">\(y_{t}\)</span> is the true value of the target variable and <span class="math">\(y_{p}\)</span> is the predicted output. They should have the following properties.</p>
<ol>
<li>They should have a global minimum when <span class="math">\(y_{t} = y_{p}\)</span> and be strictly increasing elsewhere.</li>
<li>They should be differentiable, so that we can apply <a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">backpropagation</a> and <a href="https://PlayfulTechnology.co.uk/gradient-descent.html">gradient descent</a>.</li>
</ol>
<p>They are similar to, and to some extent overlap with, <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a>.</p>
<p>For regression problems, an obvious choice of loss function is the <em>Mean Squared Error</em></p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = (y_{t} - y_{p})^{2}$$</div>
<p>This is easily differentiable</p>
<div class="math">$$\frac{d \mathcal{L}}{d y_{p}} = 2(y_{p} - y_{t})$$</div>
<p><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a> models usually use some variation of this, often with a regularisation penalty to prevent overfitting. However, since the loss is quadratic in the deviation of the prediction from the true value, this function is sensitive to outliers. The <em>Mean Absolute Error</em>
</p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = |y_{t} - y_{p}|$$</div>
<p> would avoid this problem, but is not differentiable at <span class="math">\(y_{t} = y_{p}\)</span>. This problem can be addressed by using the <em>Huber Loss</em></p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = \min\left(\frac{(y_{t} - y_{p})^{2}}{2}, \delta \left(|y_{t} - y_{p}| - \frac{\delta}{2}\right)\right)$$</div>
<p>This behaves like Mean Squared Error for <span class="math">\(|y_{t} - y_{p}| < \delta\)</span>, and like Mean Absolute Error for <span class="math">\(|y_{t} - y_{p}| > \delta\)</span>, where <span class="math">\(\delta\)</span> is a scale above which we wish to reduce the influence of outliers. The derivative is given by</p>
<div class="math">$$\frac{d \mathcal{L}}{d y_{p}} = \max(-\delta,\min(\delta, y_{p} - y_{t}))$$</div>
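<p>A short NumPy sketch of the Huber loss and its clipped-error derivative, with illustrative function names:</p>

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear beyond `delta`."""
    err = np.abs(y_true - y_pred)
    return np.where(err < delta, err ** 2 / 2, delta * (err - delta / 2))

def huber_grad(y_true, y_pred, delta=1.0):
    """Derivative w.r.t. the prediction: the error, clipped to [-delta, delta],
    so outliers contribute a bounded gradient."""
    return np.clip(y_pred - y_true, -delta, delta)
```

<p>The clipping is what limits the influence of outliers: however large the error, the gradient magnitude never exceeds <code>delta</code>.</p>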
<p>The Mean Squared Error assumes that the errors are normally distributed with constant variance. If we expect a different distribution, we can use a <em>Negative Log Likelihood Loss</em> based on the expected distribution. For example, if the observed values are expected to be drawn from the Poisson Distribution
</p>
<div class="math">$$P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$</div>
<p>we can, by setting <span class="math">\(\lambda = y_{p}\)</span> and <span class="math">\(k = y_{t}\)</span>, use the <em>Poisson Loss</em></p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = -\ln P(y_{t} \mid y_{p}) = \ln(y_{t}!) + y_{p} - y_{t} \ln y_{p}$$</div>
<p>The derivative of this is </p>
<div class="math">$$\frac{d \mathcal{L}}{d y_{p}} = 1 - y_{t}/y_{p}$$</div>
<p>If the variance of the predictions is not constant, but is itself predicted by the model, we may use the <em>Gaussian Negative Log Likelihood Loss</em></p>
<div class="math">$$\mathcal{L}(y_{t},y_{p},\sigma) = \frac{1}{2}\left( 2 \ln(\sigma) +\left(\frac{y_{t} - y_{p}}{\sigma} \right)^{2} \right)$$</div>
<p>
where <span class="math">\(\sigma^{2}\)</span> is the variance</p>
<p>The derivatives are
</p>
<div class="math">$$\frac{\partial \mathcal{L}}{\partial y_{p}} = \frac{y_{p} - y_{t}}{\sigma^{2}}$$</div>
<div class="math">$$\frac{\partial \mathcal{L}}{\partial \sigma} = \frac{1}{\sigma}\left(1 - \left(\frac{y_{t} - y_{p}}{\sigma} \right)^{2} \right)$$</div>
<p>For classification problems, we usually use a <em>Cross Entropy Loss</em>. This assumes that the prediction is a probability distribution over the possible classes, and that the true value is represented by either a binary flag or a one-hot encoded value. In the case of binary classification, we use the <em>Binary Cross Entropy Loss</em>
</p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = -\left(y_{t} \ln(y_{p}) + (1 - y_{t}) \ln(1 - y_{p})\right)$$</div>
<p>
for which the derivative is
</p>
<div class="math">$$\frac{d \mathcal{L}}{d y_{p}} = \frac{1-y_{t}}{1-y_{p}} - \frac{y_{t}}{y_{p}}$$</div>
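<p>A sketch of binary cross entropy in its standard negative-log-likelihood form; the <code>eps</code> clamp is a practical guard (not part of the mathematical definition) against taking the log of zero when predictions saturate.</p>

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood of a Bernoulli prediction.
    Clamping keeps log() finite for predictions of exactly 0 or 1."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

<p>A confident correct prediction incurs a small loss, while a confident wrong prediction is penalised heavily.</p>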
<p>Where there are more than two classes, we use the <em>Categorical Cross Entropy Loss</em>. For this, <span class="math">\(\vec{y}_{p}\)</span> represents a probability distribution over classes <span class="math">\(i\)</span> such that
</p>
<div class="math">$$y_{pi} = P(i)$$</div>
<p>
<span class="math">\(\vec{y}_{t}\)</span> is a one-hot encoded representation of the correct class <span class="math">\(j\)</span>, such that
</p>
<div class="math">$$y_{ti} = \delta_{ij}$$</div>
<p> where <span class="math">\(\delta_{ij}\)</span> is the Kronecker delta.</p>
<p>We then have
</p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = -\ln(\vec{y}_{t} \cdot \vec{y}_{p}) \\
= -\ln\left(\sum_{i} \delta_{ij} P(i)\right) \\
= - \ln(P(j))$$</div>
<p>and </p>
<div class="math">$$\frac{\partial \mathcal{L}}{\partial y_{pj}} = -\frac{1}{P(j)}$$</div>
<p>Two variations of this are worth noting. In <em>Sparse Categorical Cross Entropy</em>, the true values are given as integer indexes rather than one-hot encoded. This is useful when there are a large number of categories, such as in Natural Language Processing. Secondly, since the predictions are usually obtained from a <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">softmax function</a>, it is possible to supply them as logits rather than probabilities.</p>
<p>If the values of <span class="math">\(y_{t}\)</span> are themselves probabilities, an appropriate loss function is the <em>Kullback-Leibler Divergence Loss</em>
</p>
<div class="math">$$\mathcal{L}(y_{t},y_{p}) = y_{t}(\ln(y_{t}) - \ln(y_{p}))$$</div>
<p>The derivative is
</p>
<div class="math">$$\frac{d \mathcal{L}}{d y_{p}} = -\frac{y_{t}}{y_{p}}$$</div>
<p>Cross-entropy and Kullback-Leibler divergence are both derived from <a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a>.</p>
<p>Both <a href="https://pytorch.org/docs/stable/nn.html#loss-functions">PyTorch</a> and <a href="https://keras.io/api/losses/">Keras</a> provide a wide variety of loss functions.</p>
<p>While they serve similar purposes, it is important not to conflate loss functions with <a href="https://PlayfulTechnology.co.uk/tag/evaluation.html">evaluation metrics</a>. Just as the data used to test the performance of the model should be independent of that used to train it, so should the metrics, as far as possible, to ensure a rigorous evaluation.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/gradient-descent.html">Gradient Descent</a></td>
<td><a href="https://PlayfulTechnology.co.uk/tokenizers.html">Tokenizers</a></td>
</tr>
</tbody>
</table>
Gradient Descent2024-06-27T00:00:00+01:002024-06-27T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-06-27:/gradient-descent.html<p>The basis of model training</p><p>Throughout this series, we have frequently mentioned the need to train models. For many models, including <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a>, <a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a>, and most forms of neural network, this is done by some form of <em>Gradient Descent</em>. Given a model
</p>
<div class="math">$$\mathbf{Y} = f(\mathbf{X},\mathbf{W})$$</div>
<p>
where <span class="math">\(\mathbf{Y}\)</span> are the desired outputs, <span class="math">\(\mathbf{X}\)</span> are the inputs, and <span class="math">\(\mathbf{W}\)</span> are the model's weights, we first calculate a loss function
</p>
<div class="math">$$\mathcal{L}(\mathbf{Y},f(\mathbf{X}, \mathbf{W}))$$</div>
<p>
which measures the deviation of the predicted results from the actual values in the training data. We then calculate the gradient of this with respect to the weights.
</p>
<div class="math">$$\mathbf{G} = \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$</div>
<p>
This is generally calculated using the <a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">Chain Rule</a>. We then update the weights as
</p>
<div class="math">$$\mathbf{W} \rightarrow \mathbf{W} - \eta \mathbf{G}$$</div>
<p>
where <span class="math">\(\eta\)</span> is a small constant known as the <em>learning rate</em>. This process is then iterated over a number of <em>epochs</em>, or until the loss has converged to a minimum and no further improvement can be found.</p>
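<p>This loop can be sketched for a linear model with a mean-squared-error loss, where the gradient has a closed form. The function name and hyperparameter values are illustrative.</p>

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, epochs=500):
    """Fit y ~ X @ w by batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residual = X @ w - y
        grad = 2 * X.T @ residual / len(y)   # d(MSE)/dw via the chain rule
        w -= eta * grad                      # W -> W - eta * G
    return w
```

<p>On noiseless linear data this recovers the true weights to high precision.</p>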
<p>Gradient descent methods are analogous to a physical system in which the loss function represents the potential energy of a particle whose position is given by the model's parameters.</p>
<p>Calculating the loss and gradient over the whole dataset as described above (<em>Batch Gradient Descent</em>) is only really tractable for simple models and small datasets. For larger datasets and more complex models, memory requirements would be prohibitive, so it is easier to iterate over the dataset at each epoch, calculating the update for each sample individually. However, since the gradient is in general a non-linear function of the inputs, weights and outputs, the update calculated for each datapoint will depend on the updates performed on previous datapoints. If we were to iterate over the dataset in a fixed order, this would lead to a risk of creating systematic errors in the fit, which could lead to the model converging to a local minimum, where it is not optimal, but cannot be further improved by gradient descent.</p>
<p>To avoid this, <em>Stochastic Gradient Descent</em> shuffles the dataset between each epoch. As the samples are now visited in a different random order each time, any systematic errors that may arise from the order of iteration are smoothed out.</p>
<p>However, since Stochastic Gradient Descent involves calculating an update for each datapoint in the sample, it is more computationally expensive than simple gradient descent, and while shuffling the data mitigates the instability caused by iterating over individual data points, it does not eliminate it completely. These problems can be addressed by <em>Mini-Batch Gradient Descent</em>, where, after shuffling, the datapoints are grouped into batches, and the update calculated for each batch. Small batches such as 32 samples are found to give a good trade-off between stability and memory use.</p>
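<p>Shuffling and batching can be sketched as below, again for a mean-squared-error linear model; the function name and defaults are illustrative.</p>

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, epochs=200, batch_size=32, seed=0):
    """Mini-batch gradient descent: shuffle each epoch, then update on
    mean-squared-error gradients computed over small batches."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))          # reshuffle between epochs
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            w -= eta * 2 * X[idx].T @ residual / len(idx)
    return w
```

<p>Each epoch visits every sample exactly once, but in a fresh random order, which smooths out order-dependent systematic errors.</p>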
<p>The learning rate is an important parameter in gradient descent algorithms. Too low, and the algorithm will take a long time to converge; too high, and it will tend to overshoot the minimum. As previously mentioned, <a href="https://PlayfulTechnology.co.uk/transfer-learning.html">Transfer Learning</a> uses small learning rates to avoid catastrophic forgetting. There are various methods of adapting the learning rate as training progresses. The simplest is to use a <em>Learning Rate Scheduler</em>, which starts with a higher learning rate to ensure the parameter space is adequately explored, and gradually adjusts it downwards as training progresses. Other methods, such as <a href="https://optimization.cbe.cornell.edu/index.php?title=AdaGrad">AdaGrad and RMSProp</a>, use independent learning rates for each parameter, continuously updated based on the gradients previously encountered.</p>
<p>A further adaptation to gradient descent algorithms is to introduce the concept of <em>momentum</em>. Whereas in the physical analogy, the methods previously described treat the gradient as the velocity of the particle, momentum based methods such as <a href="https://optimization.cbe.cornell.edu/index.php?title=Adam">Adam</a> treat it as an acceleration. Given a momentum <span class="math">\(\mathbf{M}\)</span> which is the same shape as <span class="math">\(\mathbf{W}\)</span>, and two learning rates <span class="math">\(\alpha\)</span> and <span class="math">\(\beta\)</span>, (<span class="math">\(0 < \beta < 1\)</span>) the weights and momentum are updated at each timestep by
</p>
<div class="math">$$\mathbf{W} \rightarrow \mathbf{W} - \alpha \mathbf{M}$$</div>
<div class="math">$$\mathbf{M} \rightarrow \beta \mathbf{M} + (1 - \beta) \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$</div>
<p>By retaining information about previous gradients between timesteps, these models are able to converge more smoothly and quickly. <span class="math">\(\beta\)</span> damps the momentum to prevent divergence.</p>
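<p>The two momentum update rules above can be sketched directly. This is the plain momentum scheme as described, not full Adam (which also tracks a second moment); <code>grad_fn</code> is an illustrative stand-in for the loss gradient <span class="math">\(\partial \mathcal{L} / \partial \mathbf{W}\)</span>.</p>

```python
import numpy as np

def momentum_descent(grad_fn, w0, alpha=0.1, beta=0.9, epochs=300):
    """Momentum update: the gradient feeds a damped running momentum term,
    which in turn moves the weights."""
    w = w0.copy()
    m = np.zeros_like(w)
    for _ in range(epochs):
        w -= alpha * m                       # W -> W - alpha * M
        m = beta * m + (1 - beta) * grad_fn(w)  # M -> beta*M + (1-beta)*dL/dW
    return w
```

<p>For a simple quadratic bowl (where the gradient is just <span class="math">\(\mathbf{W}\)</span> itself), the iterates spiral smoothly into the minimum at the origin.</p>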
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/transformers.html">Transformers</a></td>
<td><a href="https://PlayfulTechnology.co.uk/loss-functions.html">Loss Functions</a></td>
</tr>
</tbody>
</table>
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Transformers2024-06-20T00:00:00+01:002024-06-20T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-06-20:/transformers.html<p>The architecture of Large Language Models</p><p>In the article on the <a href="https://PlayfulTechnology.co.uk/the-attention-mechanism.html">Attention Mechanism</a>, we mentioned the <em>Transformer Architecture</em>, which is the basis of Large Language Models. In this article, we examine transformers in detail.</p>
<p>A transformer model, like most NLP models, begins with a <em>tokenizer</em>, which converts a string input into a sequence of integers, each representing a particular word, subword or symbol stored in the model's vocabulary. If the sequence is shorter than the model's context window, it may be padded to the required length.</p>
<p>An <em>embedding</em> is then used to supply an initial vector for each token. The embedding can be seen as a <span class="math">\(w \times m\)</span> matrix <span class="math">\(\mathbf{E}\)</span>, where <span class="math">\(w\)</span> is the vocabulary size and <span class="math">\(m\)</span> the vector width. If the output of the tokenizer is represented by a vector <span class="math">\(\vec{T}\)</span> of integers of size <span class="math">\(n\)</span>, the embedding produces an <span class="math">\(n \times m\)</span> matrix <span class="math">\(\mathbf{X}\)</span>, where
</p>
<div class="math">$$\vec{X}_{i} = \vec{E}_{T_{i}}$$</div>
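<p>In code, this lookup is plain row indexing. A NumPy sketch, with arbitrary vocabulary and vector sizes (in practice <span class="math">\(\mathbf{E}\)</span> is learned during training):</p>

```python
import numpy as np

w, m = 10, 4                  # vocabulary size, embedding width
E = np.random.rand(w, m)      # embedding matrix (learned in a real model)
T = np.array([3, 1, 4, 1])    # tokenizer output: a sequence of n token ids
X = E[T]                      # n x m matrix of initial token vectors
```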
<p>It is desirable that the model should be able to take the position of words into account. For this reason a <em>Positional Encoding</em> is added to the initial vectors. This is an <span class="math">\(n \times m\)</span> matrix <span class="math">\(\mathbf{P}\)</span> where
</p>
<div class="math">$$P_{i} = f(i)$$</div>
<p>
A sinusoidal positional encoding is typically used, where for each <span class="math">\(k\)</span> in the range <span class="math">\(0 \le k < m/2\)</span>
</p>
<div class="math">$$P_{i,2k} = \sin\left(\frac{i}{n^{2 k /m}}\right)$$</div>
<p> and
</p>
<div class="math">$$P_{i,2k+1} = \cos\left(\frac{i}{n^{2 k / m}}\right)$$</div>
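<p>A NumPy sketch of this encoding, assuming an even vector width <span class="math">\(m\)</span>; the base of 10000 used here is the conventional choice from the original transformer paper:</p>

```python
import numpy as np

def sinusoidal_encoding(n, m, base=10000.0):
    """Positional encoding: even columns are sines, odd columns cosines.
    Assumes the vector width m is even."""
    P = np.zeros((n, m))
    i = np.arange(n)[:, None]          # token positions
    k = np.arange(m // 2)[None, :]     # dimension-pair index
    angles = i / base ** (2 * k / m)
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_encoding(8, 6)
```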
<p>However, more recent transformer models often use a <em>Rotary Positional Encoding</em>. In this, we have an <span class="math">\(m \times m\)</span> matrix <span class="math">\(\mathbf{R}\)</span> which defines a rotation in <span class="math">\(m\)</span>-dimensional space, and each row of <span class="math">\(\mathbf{P}\)</span> is calculated as
</p>
<div class="math">$$\vec{P}_{i} = \mathbf{R} \vec{P}_{i-1}$$</div>
<p>This allows both absolute and relative positions to be taken into account.</p>
<p>After this, the vectors are fed to a series of <em>Transformer Blocks</em>. The first element of each transformer block is an attention layer. The output of the attention layer is then added to its input, so as to avoid vanishing gradients. This then undergoes <em>Layer Normalisation</em></p>
<div class="math">$$\vec{X}^{\prime}_{i} = \gamma \frac{\vec{X}_{i}-\mu_{i}}{\sigma_{i}} + \beta $$</div>
<p>
where, for each token position <span class="math">\(i\)</span>,
</p>
<div class="math">$$\mu_{i} = \frac{\sum_{j} X_{ij}}{m}$$</div>
<div class="math">$$\sigma_{i} = \sqrt{\frac{\sum_{j} (X_{ij} - \mu_{i})^{2}}{m}}$$</div>
<p>
and <span class="math">\(\gamma\)</span> and <span class="math">\(\beta\)</span> are learnable parameters. Each token's vector is thus normalised independently over the feature dimension.</p>
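<p>A NumPy sketch of layer normalisation, using scalar <span class="math">\(\gamma\)</span> and <span class="math">\(\beta\)</span> for simplicity; the small <code>eps</code> term guards against division by zero:</p>

```python
import numpy as np

def layer_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each row (token vector) of X to zero mean and unit variance,
    then apply the learnable scale and shift."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return gamma * (X - mu) / (sigma + eps) + beta

X = np.random.rand(4, 8)   # 4 tokens, 8 features
Y = layer_norm(X)
```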
<p>The normalised outputs are then fed to a feedforward layer. Early transformer models generally used a ReLU <a href="https://PlayfulTechnology.co.uk/activation-functions.html">activation function</a>, but more recent models tend to use one of the non-monotonic functions, such as the swish function.</p>
<p>The output of the feedforward layer is added to its input, and layer normalisation applied again. The output of the second layer normalisation is the output of the transformer block.</p>
<p>Several transformer blocks are stacked together, each processing the output of the previous one and adding more context to the representation of each token. Finally, a <em>head</em> layer is applied - this is typically a <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">logistic regression</a> layer that makes the necessary predictions.</p>
<p>There are three types of transformer models. <em>Encoder</em> models use bidirectional attention, where the full context window is taken into account for every token. These are typically trained on a <em>masked word prediction</em> task, where some words of the input are masked and the model is trained to predict them. <a href="https://huggingface.co/docs/transformers/model_doc/roberta">RoBERTa</a> is an example of an encoder model.</p>
<p><em>Decoder</em> models use <em>masked attention</em>, where each token's context depends only on itself and previous tokens. These are used for <em>causal language modelling</em>, and are typically trained on <em>next word prediction</em>. <em>Generative</em> or <em>autoregressive</em> models synthesize text by iteratively adding each predicted word back into their inputs. Several of the best-known current LLMs, such as the GPT models, <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3">Mistral</a> and <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B">LLaMa</a>, are decoder models.</p>
<p><em>Encoder-decoder</em> models are used for sequence-to-sequence modelling tasks such as machine translation. They consist of an encoder model and a decoder model coupled by <em>Cross Attention</em>, whereby the output of each transformer block in the encoder model is prepended to the input of the corresponding transformer block in the decoder model. <a href="https://huggingface.co/google/t5-v1_1-xxl">T5</a> is an example of such a model.</p>
<p>Due to their large size and complexity, transformer models are usually specialised to particular tasks by <a href="https://PlayfulTechnology.co.uk/transfer-learning.html">Transfer Learning</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-attention-mechanism.html">The Attention Mechanism</a></td>
<td><a href="https://PlayfulTechnology.co.uk/gradient-descent.html">Gradient Descent</a></td>
</tr>
</tbody>
</table>
The Attention Mechanism2024-06-13T00:00:00+01:002024-06-13T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-06-13:/the-attention-mechanism.html<p>Adding context to the meanings of words</p><p>Natural Language Processing systems need to take into account the fact that the meaning of a word is dependent on the context in which it occurs. Earlier generations of NLP systems addressed this in various ways. The simplest was to group words into bigrams and trigrams before performing <a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a>. Another approach was to map the words to an unambiguous ontology using Word Sense Disambiguation, as I did in my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> using the <a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">Viterbi Algorithm</a>. More recently <a href="https://PlayfulTechnology.co.uk/recurrent-neural-networks.html">Recurrent Neural Networks</a> have been used.</p>
<p>However, the current generation of NLP models are mainly based on <em>The Attention Mechanism</em>. Given an <span class="math">\(n \times m\)</span> matrix <span class="math">\(\mathbf{X}\)</span> which represents a <em>context window</em> containing <span class="math">\(n\)</span> tokens, each represented by an <span class="math">\(m\)</span> dimensional vector, this calculates the context-dependent word vectors as
</p>
<div class="math">$$\mathbf{Y} = \mathbf{A} \cdot \mathbf{V}$$</div>
<p>where <span class="math">\(\mathbf{A}\)</span> is an <em>attention matrix</em> in which <span class="math">\(A_{ij}\)</span> represents the significance of the <span class="math">\(j\)</span>th token to the meaning of the <span class="math">\(i\)</span>th, and <span class="math">\(\mathbf{V}\)</span> is a linear projection of the token vectors
</p>
<div class="math">$$\mathbf{V} = \mathbf{X} \cdot \mathbf{W}_{V}$$</div>
<p>, known as the <em>value</em>.</p>
<p>The attention mechanism takes advantage of the fact that matrix multiplications can be executed efficiently in hardware, thus making it possible to calculate contextual word vectors over a large context window with fewer calculations than recurrent neural networks would require.</p>
<p>To calculate the attention matrix, we first calculate a <em>Score Function</em>
</p>
<div class="math">$$\mathbf{S} = f(\mathbf{Q},\mathbf{K})$$</div>
<p>
where <span class="math">\(\mathbf{Q}\)</span> and <span class="math">\(\mathbf{K}\)</span> are <span class="math">\(n \times d\)</span> linear projections of the context window
</p>
<div class="math">$$\mathbf{Q} = \mathbf{X} \cdot \mathbf{W}_{Q}$$</div>
<div class="math">$$\mathbf{K} = \mathbf{X} \cdot \mathbf{W}_{K}$$</div>
<p>
known as the <em>query</em> and the <em>key</em> respectively. The names <em>query</em>, <em>key</em> and <em>value</em> are based on an analogy with databases, but aren't really significant. The attention matrix is then calculated by applying the <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">softmax function</a> to the score matrix</p>
<div class="math">$$A_{ij} = \frac{e^{S_{ij}}}{\sum_{j} e^{S_{ij}}}$$</div>
<p>The most commonly-used scoring function is the <em>Scaled Dot Product Attention</em>
</p>
<div class="math">$$\mathbf{S} = \frac{\mathbf{Q} \cdot \mathbf{K}^{T}}{\sqrt{d}}$$</div>
<p>The scaling is done to prevent the softmax function from saturating and selecting a single token as the sole contributor to the meaning.</p>
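<p>Combining these definitions, a NumPy sketch of a single attention layer; the random matrices stand in for the learned projections <span class="math">\(\mathbf{W}_{Q}\)</span>, <span class="math">\(\mathbf{W}_{K}\)</span> and <span class="math">\(\mathbf{W}_{V}\)</span>:</p>

```python
import numpy as np

def softmax(S):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention over a context window X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(Q.shape[-1])   # score matrix
    A = softmax(S)                       # attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
n, m, d = 5, 8, 4
X = rng.normal(size=(n, m))
Y, A = attention(X, rng.normal(size=(m, d)), rng.normal(size=(m, d)),
                 rng.normal(size=(m, d)))
```

Each row of the attention matrix is a probability distribution over the tokens in the context window.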
<p>Other scoring functions include <em>Additive Attention</em>
</p>
<div class="math">$$S_{ij} = \vec{v} \cdot \tanh( \mathbf{W}_{1} \cdot Q_{i} + \mathbf{W}_{2} K_{j})$$</div>
<p>
where <span class="math">\(\vec{v}\)</span>, <span class="math">\(\mathbf{W}_{1}\)</span> and <span class="math">\(\mathbf{W}_{2}\)</span> are learnable parameters, and <em>General Attention</em>
</p>
<div class="math">$$\mathbf{S} = \mathbf{Q} \cdot \mathbf{W}_{a} \mathbf{K}^{T}$$</div>
<p> in which the scaling factor used in scaled dot product attention is replaced by a learnable <span class="math">\(d \times d\)</span> weight matrix <span class="math">\(\mathbf{W}_{a}\)</span>.</p>
<p>For causal language models, where the task is to predict the next word in a sequence, the contextual word vector for each word should depend only on itself and preceding words. We then have the constraint that <span class="math">\(A_{ij} = 0\)</span> when <span class="math">\(j>i\)</span>.</p>
<p>Large language models usually implement <em>Multi-Head Attention</em>. In this, several attention layers are applied in parallel to the same input, their outputs concatenated, and the concatenated output fed to a final linear projection to reduce it to the required dimension. This allows each head to detect different relationships between tokens.</p>
<p>Attention layers were originally used to align inputs with outputs in LSTM-based machine translation systems, in order to account for differing word orders between the source and target languages. However, the paper <a href="https://arxiv.org/abs/1706.03762">Attention is All You Need</a> introduced the <em>Transformer Architecture</em>, in which the model consists entirely of alternating attention and feed-forward layers. Most Large Language Models are based on this.</p>
<p>Attention can also be applied globally to pool token vectors into a single vector representing a document. In <a href="https://PlayfulTechnology.co.uk/tag/qarac.html">QARAC</a> I use a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/layers/GlobalAttentionPoolingHead.py">Global Attention Pooling Head</a> to do this. In this, the attention for each token is the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">cosine similarity</a> between a <em>local projection</em> of its input token vector and a <em>global projection</em> of the sum of the input vectors. I used cosine similarity here because my aim of mapping logical reasoning to vector arithmetic requires the model to be able to understand negation.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/recurrent-neural-networks.html">Recurrent Neural Networks</a></td>
<td><a href="https://PlayfulTechnology.co.uk/transformers.html">Transformers</a></td>
</tr>
</tbody>
</table>
Recurrent Neural Networks2024-06-06T00:00:00+01:002024-06-06T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-06-06:/recurrent-neural-networks.html<p>Neural networks that analyse sequences</p><p><em>Recurrent Neural Networks</em> (RNN) are to <a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">neural networks</a> what <a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a> are to <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayesian Models</a>. That is, they are a form of the model that is designed to analyse sequences which evolve over time. While 1-dimensional <a href="https://PlayfulTechnology.co.uk/convolutional-networks.html">Convolutional Networks</a> can be used to analyse sequences, they are sensitive mainly to short range relationships between samples, whereas recurrent networks aim to capture longer range relationships.</p>
<p>In general, given the input <span class="math">\(\vec{x}_{t}\)</span> at timestep <span class="math">\(t\)</span>, and the previous output <span class="math">\(\vec{y}_{t-1}\)</span>
the output <span class="math">\(\vec{y}_{t}\)</span> is calculated as
</p>
<div class="math">$$\vec{y}_{t} = g(\vec{x}_{t}, \vec{c}_{t}, \vec{y}_{t-1})$$</div>
<p>
where <span class="math">\(\vec{c}_{t}\)</span> is a <em>cell state</em> vector, which is updated by
</p>
<div class="math">$$\vec{c}_{t} = f(\vec{x}_{t}, \vec{c}_{t-1}, \vec{y}_{t-1})$$</div>
<p>
<span class="math">\(f\)</span> and <span class="math">\(g\)</span> are learnable functions. The simplest version is the <em>Elman RNN</em>
</p>
<div class="math">$$\vec{y}_{t} = \tanh(\mathbf{W}_{i} \cdot \vec{x}_{t} + \mathbf{W}_{h} \cdot \vec{y}_{t-1} + \vec{b})$$</div>
<p>
where in general <span class="math">\(\mathbf{W}\)</span> and <span class="math">\(\vec{b}\)</span> are weight matrices and bias terms respectively. Elsewhere you will see these equations written with two different bias terms, but since they are additive, it is mathematically equivalent (and simpler) to express them as one.</p>
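<p>A NumPy sketch of a single Elman timestep; the random weights are illustrative stand-ins for learned parameters:</p>

```python
import numpy as np

def elman_step(x, y_prev, W_i, W_h, b):
    """One Elman RNN timestep: tanh of the input and recurrent projections."""
    return np.tanh(W_i @ x + W_h @ y_prev + b)

rng = np.random.default_rng(1)
d_in, d_out = 3, 5
W_i = rng.normal(size=(d_out, d_in))   # input weights
W_h = rng.normal(size=(d_out, d_out))  # recurrent weights
b = np.zeros(d_out)                    # single combined bias term
y = np.zeros(d_out)                    # initial output
for t in range(4):                     # run over a short random sequence
    y = elman_step(rng.normal(size=d_in), y, W_i, W_h, b)
```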
<p>When training a recurrent network, gradients must be <a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">backpropagated</a> over a large number of timesteps, which in simple recurrent models leads to the vanishing gradient problem. Currently used recurrent network architectures are designed to mitigate this problem.</p>
<p>The most commonly used RNN architecture is <em>Long Short Term Memory</em> (LSTM). An <em>LSTM cell</em> consists of a number of <em>gate</em> functions, which control how information flows between the input, cell state, and outputs. The <em>Input Gate</em> controls how much information flows from the input and the previous output to the cell state. It is
</p>
<div class="math">$$\vec{i}_{t} = \sigma(\mathbf{W}_{ii} \cdot \vec{x}_{t} + \mathbf{W}_{hi} \cdot \vec{y}_{t-1} + \vec{b}_{i})$$</div>
<p>
where <span class="math">\(\sigma\)</span> is the <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">logistic function</a>.</p>
<p>The <em>Forget Gate</em> controls how much information the cell state retains from one timestep to the next. It is given by </p>
<div class="math">$$\vec{f}_{t} = \sigma(\mathbf{W}_{if} \cdot \vec{x}_{t} + \mathbf{W}_{hf} \cdot \vec{y}_{t-1} + \vec{b}_{f})$$</div>
<p>The <em>Cell Gate</em> combines information from the input and the previous output. It is given by</p>
<div class="math">$$\vec{g}_{t} = \tanh(\mathbf{W}_{ig} \cdot \vec{x}_{t} + \mathbf{W}_{hg} \cdot \vec{y}_{t-1} + \vec{b}_{g})$$</div>
<p>The <em>Output Gate</em> controls how information flows from the cell state to the output. It is given by
</p>
<div class="math">$$\vec{o}_{t} = \sigma(\mathbf{W}_{io} \cdot \vec{x}_{t} + \mathbf{W}_{ho} \cdot \vec{y}_{t-1} + \vec{b}_{o})$$</div>
<p>At each timestep, the cell state is updated by
</p>
<div class="math">$$\vec{c}_{t} = \vec{f}_{t} \odot \vec{c}_{t-1} + \vec{i}_{t} \odot \vec{g}_{t}$$</div>
<p>
and the output is calculated as
</p>
<div class="math">$$\vec{y}_{t} = \vec{o}_{t} \odot \tanh(\vec{c}_{t})$$</div>
<p>A variation on this is a <em>Peephole LSTM</em>, in which the gate functions also include a term in the previous cell state. </p>
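<p>The gate equations above can be sketched in NumPy. Stacking the four gate projections into one weight matrix and one bias vector is an implementation convenience assumed here, not part of the definition:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, y_prev, c_prev, W, b):
    """One LSTM timestep. W maps the concatenated [x, y_prev] to the
    four stacked gate pre-activations; b holds the stacked biases."""
    h = W @ np.concatenate([x, y_prev]) + b
    d = len(c_prev)
    i = sigmoid(h[:d])            # input gate
    f = sigmoid(h[d:2 * d])       # forget gate
    g = np.tanh(h[2 * d:3 * d])   # cell gate
    o = sigmoid(h[3 * d:])        # output gate
    c = f * c_prev + i * g        # new cell state
    y = o * np.tanh(c)            # new output
    return y, c

rng = np.random.default_rng(2)
d_in, d = 3, 4
W = rng.normal(size=(4 * d, d_in + d))
b = np.zeros(4 * d)
y, c = np.zeros(d), np.zeros(d)
for t in range(5):
    y, c = lstm_step(rng.normal(size=d_in), y, c, W, b)
```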
<p>A simpler alternative to LSTMs is <em>Gated Recurrent Units</em> (GRU). This dispenses with the cell state, and uses a simpler set of gates. The <em>Update Gate</em> controls whether the output will be primarily based on the previous output or the new input
</p>
<div class="math">$$\vec{z}_{t} = \sigma(\mathbf{W}_{iz} \cdot \vec{x}_{t} + \mathbf{W}_{hz} \cdot \vec{y}_{t-1} + \vec{b}_{z})$$</div>
<p>
The <em>Reset Gate</em> controls how much influence the previous output will have on the calculation of a new candidate output.
</p>
<div class="math">$$\vec{r}_{t} = \sigma(\mathbf{W}_{ir} \cdot \vec{x}_{t} + \mathbf{W}_{hr} \cdot \vec{y}_{t-1} + \vec{b}_{r})$$</div>
<p>
A <em>Candidate Output</em> is calculated as
</p>
<div class="math">$$\vec{h}_{t} = \tanh(\mathbf{W}_{ih} \cdot \vec{x}_{t} + \mathbf{W}_{hh} \cdot (\vec{r}_{t} \odot \vec{y}_{t-1}) + \vec{b}_{h})$$</div>
<p>
The output is then calculated as
</p>
<div class="math">$$\vec{y}_{t} = (\vec{1} - \vec{z}_{t}) \odot \vec{y}_{t-1} + \vec{z}_{t} \odot \vec{h}_{t}$$</div>
<p>This can be simplified even further as a <em>Minimal Gated Unit</em> (MGU), in which the update and reset gates are combined into a single forget gate.</p>
<p>When the whole sequence to be analysed is known in advance, a <em>Bidirectional RNN</em> can be used. This combines the outputs of two RNNs, one of which is working on a reversed copy of the inputs and has its outputs reversed in turn. This means that the output vector produced at each step of the sequence will take the context of the entire sequence into account.</p>
<p>I used LSTMs in my work at <a href="https://PlayfulTechnology.co.uk/formisimo.html">Formisimo</a> to predict whether users were likely to complete or abandon web forms. Recurrent networks are also used in OCR applications and in NLP - an interesting example is <a href="https://huggingface.co/flair">Flair</a>, which runs a bidirectional LSTM over the characters of its input sequence, and samples the vectors corresponding to the last character of each word in the forward direction and the first character of the word in the backward direction to produce a contextual word vector for each word.</p>
<p>Recurrent networks can be implemented in <a href="https://keras.io/api/layers/recurrent_layers/">Keras</a> and <a href="https://pytorch.org/docs/stable/nn.html#recurrent-layers">PyTorch</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/convolutional-networks.html">Convolutional Networks</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-attention-mechanism.html">The Attention Mechanism</a></td>
</tr>
</tbody>
</table>
Convolutional Networks2024-05-30T00:00:00+01:002024-05-30T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-05-30:/convolutional-networks.html<p>Neural networks that learn hierarchies of localised features</p><p><a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">Multi Layer Perceptron</a> networks would not be suitable for computer vision. Consider a fully-connected layer that takes a <span class="math">\(1080 \times 1920\)</span> RGB image (this is the resolution of an HDTV image) and outputs <span class="math">\(N\)</span> features. This would require <span class="math">\((1080 \times 1920 \times 3 + 1) N = 6220801 N\)</span> parameters for the first hidden layer alone. Worse still, it would be unable to generalise - simply shifting the inputs by 1 pixel in any direction would produce a completely different output.</p>
<p>To address this problem, we need to look at how computer vision systems typically worked before deep learning became commonplace. They would start by detecting simple, localised features such as edges, and then build them up into hierarchies of more complex features. Algorithms based on this approach can be found in the <a href="https://opencv.org/">OpenCV</a> library.</p>
<p><em>Convolutional Networks</em> allow us to make these features learnable. A convolutional layer has a <em>kernel</em>, which applies to a <span class="math">\(k \times k\)</span> region of its inputs. For computer vision applications, if the previous layer produces <span class="math">\(N_{in}\)</span> features, and this layer produces <span class="math">\(N_{out}\)</span> features, the total number of parameters needed for the kernel is <span class="math">\((k \times k \times N_{in} + 1) N_{out}\)</span>, a vast saving in complexity - for a <span class="math">\(3 \times 3\)</span> kernel applied to an RGB input image, we would require only <span class="math">\(28 N_{out}\)</span> parameters. The kernel is scanned across the inputs, creating <span class="math">\(N_{out}\)</span> features at each position in the image (padding is usually applied at the edges of the image to prevent loss of information). This is the convolution that the algorithm is named after, and it makes the detected features invariant under translation.</p>
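<p>The parameter arithmetic here, and in the fully-connected example in the introduction, is easy to verify:</p>

```python
def conv_kernel_params(k, n_in, n_out):
    """A k x k kernel over n_in input features: one weight per input cell
    plus a bias, for each of the n_out output features."""
    return (k * k * n_in + 1) * n_out

def dense_params(height, width, channels, n_out):
    """A fully-connected layer applied to a flattened image."""
    return (height * width * channels + 1) * n_out
```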
<p>Between convolutional layers there is often a <em>Pooling Layer</em>, which combines the features of adjacent pixels so as to reduce the size of the layer. These divide their inputs into windows (typically <span class="math">\(2 \times 2\)</span>) and use an aggregation function to choose one value for each feature to represent the window as a whole. Typically, either the maximum or the average is used. Pooling layers help to regularise the model, and increase invariance against small local distortions of the image. After a number of alternating layers of convolution and pooling, the image size is reduced to the point where fully connected layers can tractably be used for the final task.</p>
<p>Convolutional layers are not restricted to computer vision problems. While this discussion has mainly addressed 2-dimensional convolutional networks, there are also 1-dimensional convolutional networks, which find application in natural language processing, speech recognition, and time series analysis.</p>
<p>Convolution and pooling layers are implemented in <a href="https://keras.io/2.16/api/layers/convolution_layers/">Keras</a> and <a href="https://pytorch.org/docs/stable/nn.html#convolution-layers">PyTorch</a>.</p>
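As a minimal illustration of what convolution and pooling actually compute, here is a NumPy sketch for a single input channel. The `conv2d` and `max_pool` helpers are hand-rolled for clarity and are not the Keras or PyTorch APIs, and the kernel is a fixed edge detector rather than a learned one:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in most
    deep learning libraries) of a single-channel image with a k x k kernel."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel) + bias
    return out

def max_pool(feature_map, window=2):
    """Non-overlapping max pooling over window x window regions."""
    h, w = feature_map.shape
    h, w = h - h % window, w - w % window  # trim to a multiple of the window
    return feature_map[:h, :w].reshape(
        h // window, window, w // window, window).max(axis=(1, 3))

# A Sobel-like vertical-edge kernel applied to a toy image:
image = np.zeros((6, 6))
image[:, 3:] = 1.0  # right half bright: a vertical edge at column 3
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])
edges = conv2d(image, kernel)
pooled = max_pool(np.abs(edges))
```

The kernel responds strongly only at positions where its window straddles the edge, and the pooling step keeps those responses while halving each spatial dimension.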
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/transfer-learning.html">Transfer Learning</a></td>
<td><a href="https://PlayfulTechnology.co.uk/recurrent-neural-networks.html">Recurrent Neural Networks</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Transfer Learning2024-05-23T00:00:00+01:002024-05-23T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-05-23:/transfer-learning.html<p>Adapting pre-trained models to more specialised functions</p><p>In the article about <a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">Multi Layer Perceptron</a> networks, I mentioned that neural networks need a lot of data and computing time to train. This is especially true for the models currently used in computer vision and natural language processing - large language models have billions of weights to train, and are typically trained on resources like <a href="https://commoncrawl.org/">the Common Crawl</a> or <a href="https://pile.eleuther.ai/">the Pile</a>, which contain petabytes of data. </p>
<p>Clearly, very few organisations have the computing resources to train such a model from scratch. So, if we need a large model for a specific task of our own, what do we do? The usual approach is known as <em>Transfer Learning</em>, and aims to transfer information previously learned by a model (such as a statistical model of a language) to a new task. This is achieved by the following procedure.</p>
<ol>
<li>Start with a pre-trained <em>Base Model</em>. In natural language processing, this has often been trained on a task such as masked word prediction or next word prediction.</li>
<li>Remove the output layer, or <em>head</em>.</li>
<li>Replace it with an output layer of your own.</li>
<li><em>Fine tune</em> the model, by training it with a dataset specific to your task.</li>
</ol>
<p>One problem that can occur during fine tuning is <em>Catastrophic Forgetting</em>. This is when, rather than adapting the model's existing capabilities to the new task, the new training data simply replaces what has previously been learnt, thus losing the capabilities we wanted to apply to the new task. To avoid this, we use a small <em>learning rate</em> (the constant by which the gradient is multiplied when adjusting the model weights), typically around <span class="math">\(10^{-5}\)</span> for NLP models. It is also common to use a <em>learning rate scheduler</em> to gradually decrease the learning rate over the course of training.</p>
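A learning rate scheduler can be as simple as an exponential decay applied once per epoch. The sketch below is purely illustrative - the starting rate and decay factor are arbitrary choices, and in practice frameworks provide ready-made schedulers (for example PyTorch's `torch.optim.lr_scheduler` module):

```python
# Hypothetical exponential learning-rate schedule for fine tuning.
initial_lr = 1e-5  # a typical starting point for fine-tuning NLP models
decay = 0.9        # arbitrary per-epoch decay factor

def lr_at_epoch(epoch):
    """Learning rate to use after `epoch` whole epochs of training."""
    return initial_lr * decay ** epoch

schedule = [lr_at_epoch(e) for e in range(5)]
```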
<p>Transfer learning not only makes training large models more tractable, it also mitigates the energy costs and carbon footprint of doing so.</p>
<p>For <a href="https://PlayfulTechnology.co.uk/tag/qarac.html">QARAC</a> I am planning to use <a href="https://huggingface.co/roberta-base">RoBERTa</a> base models, and fine tune them on four different training tasks. The training for these tasks will be done in parallel, since training sequentially would increase the risk of catastrophic forgetting.</p>
<p><a href="https://huggingface.co">HuggingFace</a> hosts a wide selection of both base models and fine tuned models suitable for a variety of AI tasks.</p>
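The head-replacement procedure can be sketched end to end with a toy NumPy stand-in. The fixed random projection below is a hypothetical placeholder for a real frozen base model such as RoBERTa, and the task is constructed to be learnable from the frozen features; only the new head's weights are updated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained base model: a fixed random projection.
# In practice this would be a real network with its weights frozen.
W_base = rng.normal(size=(10, 32))

def base_model(x):
    return np.tanh(x @ W_base)  # never updated during fine tuning

# Toy downstream task, constructed to be learnable from the frozen features.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=32)
y = (base_model(X) @ w_true > 0).astype(float)

# New task-specific head: a logistic-regression layer trained from scratch.
w_head = np.zeros(32)
b_head = 0.0
lr = 0.1  # the head is new, so it tolerates a larger rate than the base would

for _ in range(500):
    feats = base_model(X)                  # frozen features
    p = 1.0 / (1.0 + np.exp(-(feats @ w_head + b_head)))
    grad = p - y                           # d(cross-entropy)/d(logits)
    w_head -= lr * feats.T @ grad / len(y)
    b_head -= lr * grad.mean()

accuracy = float(np.mean((p > 0.5) == y))
```

Because the base weights are never touched, nothing previously learned can be forgotten; real fine tuning often unfreezes some or all base layers as well, which is when the small learning rates discussed above become essential.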
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/activation-functions.html">Activation Functions</a></td>
<td><a href="https://PlayfulTechnology.co.uk/convolutional-networks.html">Convolutional Networks</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Activation Functions2024-05-16T00:00:00+01:002024-05-16T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-05-16:/activation-functions.html<p>How neural networks learn nonlinear functions.</p><p>In the previous article, discussing <a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">Multi Layer Perceptron</a> networks, we mentioned activation functions, which allow neural networks to learn non-linear functions. In this article, we'll look at these functions in more detail.</p>
<p>As the name implies, Multi Layer Perceptron networks are based on the <em>perceptron</em> model, which fits a function </p>
<div class="math">$$y = \mathrm{sgn}( \vec{w} \cdot \vec{x} + b)$$</div>
<p>to the data. The activation function for this model is the <em>Heaviside Step Function</em>, but since it is not differentiable, we cannot apply the <a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">chain rule</a> to it. Therefore, early neural networks used activation functions that replaced the sudden step with a smooth transition between positive and negative values. One such function is the <em>softsign function</em></p>
<div class="math">$$f(x) = \frac{x}{|x| + 1}$$</div>
<p>Whose derivative is given by </p>
<div class="math">$$\frac{df}{dx} = \frac{1}{(|x|+1)^{2}}$$</div>
<p>but more commonly used is the <em>hyperbolic tangent</em></p>
<div class="math">$$\tanh{x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$</div>
<p>for which</p>
<div class="math">$$\frac{df}{dx} = 1-f(x)^{2}$$</div>
<p>This is related to the <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">logistic function</a> or <em>sigmoid</em></p>
<div class="math">$$\sigma(x) = \frac{1}{1+e^{-x}}$$</div>
<p> by the relationship
</p>
<div class="math">$$\tanh{x/2} = 2 \sigma(x) -1$$</div>
<p>Both these functions are 0 at <span class="math">\(x=0\)</span> and have the property
</p>
<div class="math">$$\lim_{x\to\pm\infty} f(x) = \pm 1$$</div>
<p>
and so are known as <em>saturating functions</em>. One disadvantage of saturating functions is the <em>vanishing gradient problem</em>. Since the gradient of a saturating function tends to zero for large positive or negative <span class="math">\(x\)</span>, these functions will propagate little information to their inputs during training if they are close to saturation. This makes them unsuitable for use in deep networks. As a result, a variety of <em>non-saturating functions</em> are now used.</p>
<p>One of the most common of these is the <em>Rectified Linear Unit</em> (ReLU).
</p>
<div class="math">$$f(x) = \max(0,x)$$</div>
<p>
The output of a layer with ReLU activation is a piecewise linear function of its inputs. ReLU has the advantage of computational simplicity, but can suffer from <em>dead neurons</em>: since it can produce arbitrarily large outputs, but its gradient is zero when the input is less than zero, a neuron with large negative weights can get locked into producing zero outputs. To address this problem, a number of variations of ReLU are used, which produce non-zero outputs for negative inputs. <em>Leaky ReLU</em> is a piecewise linear function
</p>
<div class="math">$$f(x) = \max(\alpha x, x)$$</div>
<p>
where <span class="math">\(\alpha\)</span> is a constant in the range <span class="math">\(0 < \alpha <1\)</span>. If we treat <span class="math">\(\alpha\)</span> as a trainable parameter, we get <em>Parametric ReLU</em> (PReLU).</p>
<p><em>Exponential Linear Unit</em> (ELU) uses an offset exponential function for negative inputs.
</p>
<div class="math">$$f(x) = \left\{ \begin{array}{ll} x & \quad \mathrm{if}\ x > 0 \\ e^{x}-1 & \quad \mathrm{otherwise} \end{array} \right. $$</div>
<p>This means that the activation will saturate for large negative values but not for positive values.</p>
<p>The exponential function
</p>
<div class="math">$$y = e^{x}$$</div>
<p> may also be used as an activation function, but has the disadvantage that its gradients can be arbitrarily large, which can lead to an <em>exploding gradient problem</em>, where gradients increase without limit during training, leading to instability and the risk of numerical overflows. A more stable alternative is the <em>softplus function</em>
</p>
<div class="math">$$f(x) = \ln(e^{x}+1)$$</div>
<p>
This can be seen as a smooth alternative to ReLU. The gradient is given by
</p>
<div class="math">$$\frac{df}{dx} = \frac{e^{x}}{e^{x}+1} \\
= \frac{1}{1+e^{-x}} \\
= \sigma(x) $$</div>
<p>These functions are all monotonic - their gradients are always non-negative with respect to their inputs. More recent research has introduced a number of <em>non-monotonic</em> activation functions, which have a minimum for small negative inputs and tend to zero for large negative inputs. These are the <em>Gaussian Error Linear Unit</em> (GELU), the <em>Sigmoid Linear Unit</em> (SiLU), or <em>Swish function</em>, and the <em>Mish function</em> (apparently named after Diganta Misra, who devised it).
The GELU activation function is defined as
</p>
<div class="math">$$f(x) = x \frac{1+\mathrm{erf}(x/\sqrt{2})}{2}$$</div>
<p>
where </p>
<div class="math">$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}} dt$$</div>
<p>For this
</p>
<div class="math">$$\frac{df}{dx} = \frac{1+\mathrm{erf}(x/\sqrt{2})}{2} + \frac{x e^{-x^{2}/2}}{\sqrt{2 \pi}}$$</div>
<p>The swish function is given by
</p>
<div class="math">$$f(x) = x \sigma(x) = \frac{x}{1+e^{-x}}$$</div>
<p> and its derivative is
</p>
<div class="math">$$\frac{df}{dx} = \sigma(x) + x \sigma(x)(1-\sigma(x))$$</div>
<p>The Mish function is given by
</p>
<div class="math">$$f(x) = x \tanh(\mathrm{softplus}(x)) \\
= x \frac{e^{x} + 1 - \frac{1}{e^{x} + 1}}{e^{x} + 1 + \frac{1}{e^{x} + 1}} \\
= x \frac{(e^{x} +1)^{2} -1 }{(e^{x} +1)^{2} + 1}$$</div>
<p>Its derivative is
</p>
<div class="math">$$\frac{df}{dx} = \frac{(e^{x}+1)^{2}-1}{(e^{x}+1)^{2}+1} + \frac{4xe^{x}(e^{x}+1)}{((e^{x}+1)^{2} + 1)^{2}}$$</div>
<p>These are quite similar functions. The fact that they are not monotonic gives them the property of being self-regularising - weights and inputs giving rise to large values of <span class="math">\(x\)</span> will tend to be weakened rather than strengthened during training, thus reducing the tendency to overfit.</p>
<p>In some circumstances, we may wish to use an activation function that passes through both large positive and large negative input values while suppressing small ones. For this purpose there is a family of activation functions known as <em>Shrink functions</em>. The <em>Hard Shrink</em> function</p>
<div class="math">$$f(x) = \left\{ \begin{array}{ll} 0 & \quad \mathrm{if}\ |x| < 1 \\
x & \quad \mathrm{otherwise} \end{array} \right. $$</div>
<p> is discontinuous, which may lead to unstable behaviour during training. The <em>Soft Shrink</em> function
</p>
<div class="math">$$f(x) = \max(x-1,\min(x+1,0))$$</div>
<p> avoids this problem. However, if we wish to use a smooth function, there is the <em>Tanh Shrink</em> function.</p>
<div class="math">$$f(x) = x - \tanh(x)$$</div>
<p>The activation functions discussed so far apply mainly to hidden layers. For output layers, the activation functions used may be chosen according to the requirements of the problem. The sigmoid function and its generalisation the <em>Softmax function</em></p>
<div class="math">$$p_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$$</div>
<p> may be used, while a simple linear output may be suitable for regression.</p>
<p>Both <a href="https://www.tensorflow.org/api_docs/python/tf/keras/activations">TensorFlow</a> and <a href="https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity">PyTorch</a> provide a wide selection of activation functions.</p>
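A minimal NumPy sketch of several of these activation functions (the names and signatures here are illustrative, not a library API). The checks mirror identities stated above: the relationship between tanh and the sigmoid, the softplus gradient being the sigmoid, and the non-monotonic behaviour of Mish for negative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (np.abs(x) + 1.0)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def softplus(x):
    return np.log1p(np.exp(x))  # ln(e^x + 1), a smooth alternative to ReLU

def swish(x):
    return x * sigmoid(x)

def mish(x):
    return x * np.tanh(softplus(x))

x = np.linspace(-6.0, 6.0, 121)

# tanh(x/2) = 2*sigmoid(x) - 1, as stated in the text.
identity_holds = np.allclose(np.tanh(x / 2), 2 * sigmoid(x) - 1)

# The gradient of softplus is the sigmoid: check numerically at x = 0.7.
h = 1e-6
numeric_grad = (softplus(0.7 + h) - softplus(0.7 - h)) / (2 * h)
```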
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">Multi Layer Perceptron</a></td>
<td><a href="https://PlayfulTechnology.co.uk/transfer-learning.html">Transfer Learning</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Multi Layer Perceptron2024-05-09T00:00:00+01:002024-05-09T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-05-09:/multi-layer-perceptron.html<p>The most basic neural network architecture</p><p>One of my aims for this series of articles is to cover the things every data scientist should know. Since neural networks are so ubiquitous in data science, they certainly fit this description.</p>
<p>Neural networks in general learn a function that maps their inputs to the desired outputs. To make learning tractable, that function is built from a number of smaller units. This is meant to mimic the way that processing in the brain is accomplished by signals being passed between neurons, but compared to the way real neurons work, it's a grossly over-simplified model. There are many different architectures for combining smaller units to make the overall one, but the most basic is known as a <em>Multi Layer Perceptron</em> network.</p>
<p>This consists of an <em>Input Layer</em> <span class="math">\(\vec{x}\)</span>, representing the observations we wish to classify,
<span class="math">\(N\)</span> <em>Hidden Layers</em> <span class="math">\(h_{i}\)</span>, where</p>
<div class="math">$$\vec{h_{0}} = f(\mathbf{W}_{0} \cdot \vec{x} + \vec{b_{0}})$$</div>
<div class="math">$$\vec{h_{i+1}} = f(\mathbf{W}_{i+1} \cdot \vec{h_{i}} + \vec{b_{i+1}})$$</div>
<p>and an <em>Output Layer</em></p>
<div class="math">$$\vec{y} = g(\mathbf{W}_{N} \cdot \vec{h_{N-1}} + \vec{b_{N}})$$</div>
<p>representing the model's prediction of the target variable, where <span class="math">\(\mathbf{W}_{i}\)</span> are weights, <span class="math">\(\vec{b_{i}}\)</span> are biases, and <span class="math">\(f\)</span> and <span class="math">\(g\)</span> are <em>activation functions</em>. In general, the activation functions are non-linear, to allow the model to learn a non-linear function, and differentiable, to allow weights and biases to be learned by <a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">Backpropagation</a>. While <span class="math">\(f\)</span> is usually the same for all hidden layers, <span class="math">\(g\)</span> is chosen according to the outputs required - for example, a <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">softmax function</a> may be used for classification problems.</p>
<p>Multi Layer Perceptron networks are powerful algorithms, but they do have some disadvantages. Generally, using more layers allows them to make better predictions - indeed, it was when the computational power to train and run networks with many layers (<em>deep learning</em>) became available that neural networks first became viable for mainstream use. However, the number of hidden layers and the width of each layer are hyperparameters that may need some experimentation to choose correctly. The complexity of the models means that a lot of data and computing time is needed to train them, they may be prone to overfitting, and it is hard to explain their results.</p>
<p>Having said that, if the problem to be solved is sufficiently complex, neural networks might well give better results than anything else. I consider it best practice to thoroughly investigate the data and see whether a simpler model will give good results first, and turn to neural networks only when necessary.</p>
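The forward pass described above can be sketched directly in NumPy (the layer sizes, the ReLU choice of <span class="math">\(f\)</span> and the softmax choice of <span class="math">\(g\)</span> are arbitrary illustrations; a real implementation would also need backpropagation to train the weights):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)

def init_layer(n_in, n_out):
    """Random weights and zero biases for one fully-connected layer."""
    return rng.normal(scale=1 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

# A 4 -> 8 -> 8 -> 3 network: two hidden layers, softmax output.
hidden = [init_layer(4, 8), init_layer(8, 8)]
W_out, b_out = init_layer(8, 3)

def forward(x):
    h = x
    for W, b in hidden:
        h = relu(h @ W + b)                # hidden layers: f = ReLU
    return softmax(h @ W_out + b_out)      # output layer: g = softmax

probs = forward(rng.normal(size=(5, 4)))   # 5 observations, 4 input features
```

Each row of `probs` is a probability distribution over the 3 output classes.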
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/imputation.html">Imputation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/activation-functions.html">Activation Functions</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Imputation2024-05-02T00:00:00+01:002024-05-02T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-05-02:/imputation.html<p>Filling in missing data</p><p>Real world data is often messy. Values may often be missing, or erroneous, which may be flagged by <a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a>. If only a small number of values are missing, it may be possible to simply omit them from the dataset, but if missing values are frequent, this might lead to us losing too much of the dataset. We need a way to estimate the missing values. Such methods are known as <em>Imputation</em>.</p>
<p>Imputation methods can be divided into <em>Single Imputation</em>, which estimate missing values from a single variable's own known values, and <em>Multiple Imputation</em>, which estimate missing values using the known values of other variables.
Single Imputation methods usually substitute all the missing values with a single value, which may be the mean, median or mode of the variable's known values. This runs the risk of introducing bias into the dataset, although the median and mode are less likely to introduce bias than the mean. An alternative is <em>Random Imputation</em>, where the missing values are filled with random samples from the variable's probability distribution. This reduces the potential for bias but introduces noise.</p>
<p><em>Multiple Imputation</em> methods infer the unknown values from known values of other variables. One method is to fit a regression model (such as <a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a> or a <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a> regression model) to the known values of the variable in terms of the other variables and interpolate the missing values. One version of this, <em>MICE</em> (Multivariate Imputation by Chained Equations), starts by using Single Imputation to provide initial estimates for the missing values. Then for each variable in turn, the imputed values are updated by a regression model. Over a number of iterations, the imputed values are expected to converge to realistic estimates, as each update of one variable improves the regression models used to estimate the others. A fuller description can be found in <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/">Multiple imputation by chained equations: what is it and how does it work?</a></p>
<p>Another method to predict the missing values is <em>K Nearest Neighbours</em>. In this we use an appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a> to find the <span class="math">\(K\)</span> most similar datapoints with known values for the missing variable, and use the mean of their values for that variable as the estimate. This avoids the overhead of fitting regression models, but may overfit.</p>
<p>If all the values in the dataset are positive or zero, <em>Non-Negative Matrix Factorisation</em> can be used. If the dataset is represented by an <span class="math">\(N \times M\)</span> matrix <span class="math">\(\mathbf{X}\)</span>, we decompose it into an <span class="math">\(N \times m\)</span> matrix <span class="math">\(\mathbf{U}\)</span> and an <span class="math">\(M \times m\)</span> matrix <span class="math">\(\mathbf{V}\)</span> such that</p>
<div class="math">$$\mathbf{X} \approx \mathbf{U} \cdot \mathbf{V}^{T}$$</div>
<p>(note the similarity to <a href="https://PlayfulTechnology.co.uk/data-reduction.html">Singular Value Decomposition</a>). When fitting <span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> we simply ignore missing values. However, when we reconstruct <span class="math">\(\mathbf{X}\)</span> from the factor matrices, it will contain predictions for the missing values, based on correlations found in the dataset. This technique can also be used as a recommendation algorithm, where <span class="math">\(\mathbf{X}\)</span> represents users' ratings for various items, and the missing values are ratings we wish to predict.</p>
<p><a href="https://scikit-learn.org/stable/modules/impute.html">Scikit-learn's <code>impute</code> module</a> contains implementations for some of these techniques.</p>
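A brief sketch using scikit-learn's <code>impute</code> module, contrasting single (mean) imputation with K Nearest Neighbours imputation on a toy matrix (the data and parameter choices are purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy dataset: two variables, one missing value in each.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# Single imputation: replace each missing value with its column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# K Nearest Neighbours: average each missing feature over the k most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

The mean imputer fills the first column's gap with (1 + 7 + 4) / 3 = 4.0, while the KNN imputer instead averages the value over the two rows closest in the remaining feature.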
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/graph-search-algorithms.html">Graph Search Algorithms</a></td>
<td><a href="https://PlayfulTechnology.co.uk/multi-layer-perceptron.html">Multi Layer Perceptron</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Graph Search Algorithms2024-04-25T00:00:00+01:002024-04-25T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-04-25:/graph-search-algorithms.html<p>Navigating networks</p><p>There are many applications in which we might wish to navigate our way through a graph of nodes and edges - route finding for satnav systems, controlling character movement in video games, routing traffic on the Internet, crawling the web for a search engine, or finding information in a vector database, for example. </p>
<p>Algorithms used for this task make use of a <em>frontier</em>, which initially contains a node representing the starting point of our search. At each stage of the search, we retrieve a node from the frontier, and examine the nodes it links to. Any that have not previously been explored are added to the frontier. This continues until a stopping criterion is fulfilled. For a navigation system, this would be that the desired destination has been retrieved from the frontier. For a web crawler, it may be that a certain number of pages have been indexed or a certain amount of time has elapsed. For a vector database, it may be that it is not possible to get closer to the target.</p>
<p>Different forms of search algorithm are distinguished by the type of data structure they use to represent the frontier. If the frontier is a LIFO stack, we have a <em>Depth First Search</em>. In this, the search goes as deep as it can into the network before backtracking and exploring other paths. This can be easily implemented as a recursive algorithm, with the call stack acting as the frontier, and is memory efficient. However, the recursion depth grows with the length of the deepest path explored, so it is only suitable for fairly small networks.</p>
<p>If the frontier is a FIFO queue, we have a <em>Breadth First Search</em>. This explores all the nodes at a given depth from the starting point before moving on to greater depths. It is suitable for web crawling.</p>
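<p>The stack and queue versions differ only in which end of the frontier the next node is taken from, which a single function can show. This is a minimal sketch using the standard library's <code>deque</code>; the adjacency-list graph is a toy example invented for illustration.</p>

```python
from collections import deque

def search(graph, start, lifo=False):
    """Traverse graph from start; the frontier is a LIFO stack (depth-first)
    when lifo is True, otherwise a FIFO queue (breadth-first)."""
    frontier = deque([start])
    visited = {start}
    order = []
    while frontier:
        node = frontier.pop() if lifo else frontier.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(neighbour)
    return order

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
bfs = search(graph, 'A')             # ['A', 'B', 'C', 'D']
dfs = search(graph, 'A', lifo=True)  # ['A', 'C', 'D', 'B']
```

<p>Breadth First Search visits both of A's neighbours before their children, whereas Depth First Search follows the most recently added branch first.</p>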
<p>If we can associate a weight with the nodes or edges of the graph, we can use a <a href="https://PlayfulTechnology.co.uk/priority-queues.html">priority queue</a> for the frontier. This gives us a <em>Greedy Best First Search</em>. This can be the most efficient way of finding the best route to a goal, since the most promising candidates will be explored first, meaning that the target node is likely to be found sooner than by either of the other approaches.</p>
<p>For <a href="https://PlayfulTechnology.co.uk/tag/qarac.html">QARAC</a>, I have written a <a href="https://github.com/PeteBleackley/QARAC/blob/main/Crawler.py">web crawler</a> based on Greedy Best First Search, to harvest the system's knowledge base. This uses the reliability of the sites that link to a given site as its scoring function. If a given site's inbound links predominantly come from unreliable sites, it will be ignored, and the search will terminate when there are no more sites linked from reliable sources to explore.</p>
<p>There are two important variations on Greedy Best First Search that need to be discussed in greater detail. In <em>Dijkstra's Algorithm</em>, each edge on the graph is associated with the distance between the two nodes it connects. The weight of a node on the frontier is the shortest distance so far found to it from the starting node (that is, the sum of the edge distances along the path taken to reach it), and will be updated if a shorter path is found. Dijkstra's Algorithm terminates when all nodes have been visited, and returns a <em>shortest path tree</em>, representing the best route from the starting point to each of the other nodes in the network.</p>
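<p>A minimal sketch of Dijkstra's Algorithm, using the standard library's <code>heapq</code> as the priority-queue frontier. Rather than updating weights in place, this common variant pushes a new entry whenever a shorter path is found and discards stale entries on retrieval; the example graph is invented for illustration.</p>

```python
import heapq

def dijkstra(graph, start):
    """Shortest distances from start in a graph represented as
    {node: [(neighbour, edge_length), ...]}."""
    dist = {start: 0}
    frontier = [(0, start)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > dist[node]:           # stale entry: a shorter path was found later
            continue
        for neighbour, length in graph[node]:
            candidate = d + length
            if candidate < dist.get(neighbour, float('inf')):
                dist[neighbour] = candidate
                heapq.heappush(frontier, (candidate, neighbour))
    return dist

graph = {'A': [('B', 1), ('C', 4)],
         'B': [('C', 2), ('D', 5)],
         'C': [('D', 1)],
         'D': []}
distances = dijkstra(graph, 'A')     # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

<p>Note that the direct edge A&rarr;C of length 4 is beaten by the path A&rarr;B&rarr;C of length 3.</p>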
<p><em>A* Search</em> is used to find the best route between the starting point and a desired end point. Here the weight used is the sum of two terms. The first is the length of the path from the start node to the candidate node, and the second is an estimate of the distance from the candidate node to the target. This distance is estimated by an <em>admissible heuristic</em>, which is guaranteed never to overestimate the true distance. For navigation, the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Euclidean distance</a> is an admissible heuristic.</p>
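<p>A sketch of A* Search on a 4-connected grid, where the Manhattan distance is an admissible heuristic because every move costs exactly 1. The grid size, wall positions and function names are invented for illustration.</p>

```python
import heapq

def a_star(start, goal, walls, width, height):
    """A* on a 4-connected grid: priority = path length so far + heuristic.
    Manhattan distance is admissible here because each move costs 1."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
    g = {start: 0}                   # best known path length to each cell
    frontier = [(h(start), start)]
    while frontier:
        _, cell = heapq.heappop(frontier)
        if cell == goal:
            return g[cell]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in walls:
                cost = g[cell] + 1
                if cost < g.get(nxt, float('inf')):
                    g[nxt] = cost
                    heapq.heappush(frontier, (cost + h(nxt), nxt))
    return None                      # goal unreachable

# A vertical wall forces a detour around its top end.
walls = {(2, 0), (2, 1), (2, 2), (2, 3)}
steps = a_star((0, 0), (4, 0), walls, 5, 5)   # 12
```

<p>Without the wall the direct route takes 4 steps; the detour through (2, 4) takes 12, and the heuristic steers the search towards it without exhaustively expanding the grid.</p>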
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/priority-queues.html">Priority Queues</a></td>
<td><a href="https://PlayfulTechnology.co.uk/imputation.html">Imputation</a></td>
</tr>
</tbody>
</table>Priority Queues2024-04-18T00:00:00+01:002024-04-18T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-04-18:/priority-queues.html<p>Efficiently iterating over items in order</p><p>While researching last week's article on <a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a>, I found that two methods for constructing ball trees and the algorithm for querying ANNOY all involved <em>Priority Queues</em>. Since these are an important component of a number of different algorithms, it is worth examining them in detail.</p>
<p>Suppose we want to iterate over a set of items in a particular order. The naive way of doing this is to sort the list of items and then iterate over them. However, sorting is an expensive operation for large datasets, and we may want to add further items to the list while still iterating, which would necessitate re-sorting the list each time. We therefore need a more efficient way of tackling this.</p>
<p>Priority Queues address this by storing the data in a partially ordered data structure whose elements can be reordered efficiently when items are added or removed. Most implementations use a <em>heap</em>, which is a list of items with the following properties.</p>
<ol>
<li>The item at index <span class="math">\(i\)</span> is the parent of the items at <span class="math">\(2i+1\)</span> and <span class="math">\(2(i+1)\)</span></li>
<li>The parent is less than or equal to each of its children.</li>
</ol>
<p>These properties can be efficiently maintained by the following operations.</p>
<dl>
<dt><em>Sift Up</em></dt>
<dd>While an item is less than its parent (and has not reached the start of the list), swap it with its parent and check whether it is less than its new parent</dd>
<dt><em>Sift Down</em></dt>
<dd>While an item is greater than the smaller of its children (and has at least one child), swap it with that child and check whether it is greater than either of its new children.</dd>
</dl>
<p>(<em>Note</em>: What I'm describing here is a <em>Min Heap</em>, which is used when we want to iterate over our items in ascending order. Most Python implementations of priority queues use this. There are also <em>Max Heaps</em>, which are used to iterate over items in descending order).</p>
<p>To add an item to the heap, we place it at the end, and then Sift Up until it reaches its proper place. When we remove the first item from the heap during iteration, we move the last item from the heap to the first position, and then Sift Down until it reaches its proper place.</p>
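<p>The operations above can be sketched directly as an illustrative pure-Python Min Heap (in practice you would use one of the library implementations discussed below): <code>push</code> sifts up and <code>pop</code> sifts down.</p>

```python
class MinHeap:
    def __init__(self):
        self.items = []

    def push(self, item):
        """Append at the end, then Sift Up to restore the heap property."""
        self.items.append(item)
        i = len(self.items) - 1
        while i > 0 and self.items[i] < self.items[(i - 1) // 2]:
            parent = (i - 1) // 2
            self.items[i], self.items[parent] = self.items[parent], self.items[i]
            i = parent

    def pop(self):
        """Remove the root, move the last item to the front, then Sift Down."""
        top = self.items[0]
        last = self.items.pop()
        if self.items:
            self.items[0] = last
            i = 0
            while True:
                smallest = i
                for child in (2 * i + 1, 2 * i + 2):
                    if child < len(self.items) and self.items[child] < self.items[smallest]:
                        smallest = child
                if smallest == i:
                    break
                self.items[i], self.items[smallest] = self.items[smallest], self.items[i]
                i = smallest
        return top

heap = MinHeap()
for value in [5, 1, 4, 1, 5, 9, 2, 6]:
    heap.push(value)
drained = [heap.pop() for _ in range(8)]   # [1, 1, 2, 4, 5, 5, 6, 9]
```

<p>Draining the heap yields the items in ascending order, even though the underlying list was never fully sorted.</p>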
<p>There are several implementations of priority queues in Python: <a href="https://docs.python.org/3/library/heapq.html">heapq</a> in the standard library; <a href="https://pypi.org/project/HeapDict/">heapdict</a>, which implements a dictionary interface and allows the priority of items to be altered; and <a href="https://docs.python.org/3/library/queue.html#queue.PriorityQueue">PriorityQueue</a> in the standard library's <code>queue</code> module, which is useful for scheduling data items to be processed by workers in a multithreaded application.</p>
<p>Prioritising tasks is an important part of many algorithms, so this is a useful tool to be aware of when designing an algorithm.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
<td><a href="https://PlayfulTechnology.co.uk/graph-search-algorithms.html">Graph Search Algorithms</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Vector Search Trees2024-04-11T00:00:00+01:002024-04-11T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-04-11:/vector-search-trees.html<p>Finding nearest neighbours quickly</p><p>There are many applications where we need to search a dataset for the nearest neighbours of a given point. For a large dataset, comparing the data point to the entire dataset will be too slow, especially if we need to do it frequently. If we store the dataset to be searched in a tree structure, we can improve the efficiency of queries from <span class="math">\(\mathcal{O} (N)\)</span> to <span class="math">\(\mathcal{O} (\log N)\)</span>.</p>
<p>A simple method to construct the search tree is <em>KD Trees</em>. This method iterates over the dimensions of the dataset, partitioning it into hyperrectangular blocks. Each of these blocks is partitioned at the median of the datapoints contained in it along the dimension under consideration. Using the median ensures that the number of points in each partition will be balanced. This allows for rapid construction of the search tree, and rapid searching if the dimensionality of the data is low, but its performance degrades when the number of dimensions in the dataset is large. The documentation for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree">SciPy implementation of KD Tree</a> notes that <em>20 is already too large</em>. Adding new data to the tree after initial construction also runs a high risk of the tree becoming unbalanced.</p>
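<p>A toy pure-Python version of the KD Tree construction and query just described (the dictionary-based nodes and function names are illustrative choices; in practice use SciPy's implementation):</p>

```python
import math

def build(points, depth=0):
    """Recursively split at the median along dimension depth % k."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build(points[:mid], depth + 1),
            'right': build(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Descend towards target; only cross a splitting plane if it lies
    closer than the best match found so far."""
    if node is None:
        return best
    if best is None or math.dist(node['point'], target) < math.dist(best, target):
        best = node['point']
    diff = target[node['axis']] - node['point'][node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, target, best)
    if abs(diff) < math.dist(best, target):   # other side may hold a closer point
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
match = nearest(tree, (9, 2))   # (8, 1)
```

<p>The pruning test on <code>abs(diff)</code> is what gives the logarithmic behaviour: whole subtrees are skipped whenever their splitting plane is further away than the current best candidate.</p>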
<p>An alternative that improves performance at higher dimensionalities is <em>Ball Trees</em>. In this, each node represents a ball with centroid <span class="math">\(\vec{C}\)</span> and radius <span class="math">\(r\)</span>. Data is assigned to the nodes in such a way as to minimise the hypervolume of the balls. Several methods for doing this are available, as detailed by Stephen M. Omohundro in <a href="https://ftp.icsi.berkeley.edu/ftp/pub/techreports/1989/tr-89-063.pdf">Five Balltree Construction Algorithms</a>. The one used in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree">Scikit-Learn implementation of Ball Trees</a> is a variation on the KD Tree construction algorithm, where instead of iterating through the dimensions in a fixed order, each node is partitioned along the dimension in which the spread of its datapoints is greatest. Another method is an <em>online insertion algorithm</em>, which is suitable for when we want to continually add new data to the search tree. Given a tree, each new node is added to the tree in the position that minimises the increase in volume of the nodes that contain it. It is also possible to build a Ball Tree bottom up, with a method based on <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>.</p>
<p>Another method for constructing search trees is <em>ANNOY</em> (Approximate Nearest Neighbours Oh Yeah), which was developed by Erik Bernhardsson at Spotify, who needed to be able to search large collections of high-dimensional vectors as quickly as possible for music recommendations. In this method, the dataset is recursively partitioned by picking two datapoints at random from each existing partition and splitting the partition midway between them. The random assignment of the partitions means that it is possible for the nearest neighbour of a point to fall into a different partition. Therefore, an ensemble of trees, similar to a <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a>, is constructed. We can then find a candidate nearest neighbour from each tree and select the best. The randomness of the algorithm makes the matches approximate, rather than exact, but for many applications this doesn't matter. Here is <a href="https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html">Erik Bernhardsson's own description of ANNOY</a>. There's a <a href="https://pypi.org/project/annoy/1.0.3/">Python implementation of ANNOY</a> on PyPI, and it can be used to search word vectors or document vectors in <a href="https://radimrehurek.com/gensim/similarities/annoy.html">Gensim</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/priority-queues.html">Priority Queues</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Cross Validation2024-04-04T00:00:00+01:002024-04-04T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-04-04:/cross-validation.html<p>Ensuring unbiased selection of hyperparameters</p><p>When training a model, standard practice is to hold back part of the dataset for testing. This ensures that we have tested the model's ability to generalise to unseen data.</p>
<p>However, many models have <em>hyperparameters</em>, such as the regularisation penalties used in <a href="https://PlayfulTechnology.co.uk/linear-regression.html">regularised linear models</a>. In order to select the best values for these hyperparameters, it is necessary to try fitting the model with different values of hyperparameters and select the version that gives the best results. However, if we use the same test dataset for hyperparameter selection as we do for overall model testing, there is a risk that the hyperparameters will themselves be overfit to the test dataset.</p>
<p>One solution to this is to further subdivide the dataset into training, validation and test datasets. We use the validation dataset to assess which hyperparameters give the best performance, and then use the test dataset to evaluate how well this model performs on unseen data. Many publicly available datasets come partitioned in this way. However, if we have a limited amount of data to work with, we may find that this approach reduces the training dataset too much.</p>
<p>An alternative to this is <em>Cross Validation</em>. The basic procedure is to make several different partitions of the data into training and validation sets, and to calculate the average of the <a href="https://PlayfulTechnology.co.uk/tag/evaluation.html">evaluation metrics</a> across the different partitions. This, while more computationally expensive than using a single validation partition, gives more robust results, since the choice of hyperparameters will not depend on the results from a single validation partition. Once hyperparameters have been chosen, the data used for validation can then be folded back into the training dataset to train the final model.</p>
<p>Several strategies may be used for making the split. The simplest is the <em>Leave One Out</em> strategy. For a training dataset of size <span class="math">\(N\)</span>, this makes <span class="math">\(N\)</span> partitions into <span class="math">\(N-1\)</span> training examples and 1 validation example. A variation of this is <em>Leave P Out</em>, which makes <span class="math">\(\binom{N}{P}\)</span> partitions of <span class="math">\(N-P\)</span> training examples and <span class="math">\(P\)</span> validation examples. These methods are computationally expensive and have the disadvantage that there is considerable overlap between the partitions, so their results are not independent.</p>
<p>A more commonly used strategy is <em>K-Fold Cross Validation</em>. This divides the data into <span class="math">\(K\)</span> <em>folds</em> of <span class="math">\(\frac{N}{K}\)</span> examples. Each of these in turn is used as the validation partition, with the remaining folds combined to form the training partition. Usually 5 or 10 folds are used. This is more efficient than Leave One Out, and provides greater independence between tests, as each training dataset overlaps by only <span class="math">\(\frac{K-2}{K-1}\)</span> with the others, as opposed to almost complete overlap in Leave One Out. For further statistical rigour (at the expense of greater compute time) <em>Repeated K-Fold Cross Validation</em> performs this several times, with different assignments of examples to folds. </p>
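<p>The index bookkeeping behind K-Fold Cross Validation can be sketched as follows. In practice Scikit-Learn's <code>KFold</code> handles this; the stand-alone generator below is illustrative, and its names are my own.</p>

```python
import random

def k_fold(n, k, seed=0):
    """Yield (train, validation) index lists for k folds of n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Strided slices give k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(k_fold(20, 5))
```

<p>Each of the 20 examples appears in exactly one validation fold, and each training partition contains the other four folds; fitting and scoring the model once per split, then averaging the scores, gives the cross-validated estimate.</p>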
<p>If the classes to be predicted are highly unbalanced, there is a risk that some folds may not contain any examples of a particular class, thus skewing the results. <em>Stratified K-Fold Cross Validation</em> addresses this problem by grouping the examples by target class, and then dividing each class equally between the folds. If there are known statistical dependencies in the training examples, <em>Group K-Fold</em> divides the dataset into groups according to some feature which is expected to have important statistical correlations with other variables, and assigns the data to folds group by group, so that the same group is never present in both the training and validation datasets. This ensures that the model will generalise across groups. Group K-Fold relaxes the requirement that folds be of equal size. These two strategies can be combined as <em>Stratified Group K-Fold Cross Validation</em>.</p>
<p>Related to Group K-Fold is the <em>Leave One Group Out</em> strategy, which in effect treats each group as a fold, and the <em>Leave P Groups Out</em> strategy, which, given <span class="math">\(G\)</span> groups, forms <span class="math">\(\binom{G}{P}\)</span> partitions, each containing <span class="math">\(G-P\)</span> groups in the training dataset and <span class="math">\(P\)</span> groups in the validation dataset.</p>
<p>Another possible strategy is <em>Shuffle Split Cross Validation</em>. In this, the dataset is repeatedly shuffled, and after each shuffle split into a training and validation dataset. Whereas with K-Fold cross validation and its variants the size of the validation dataset is dependent on the number of iterations, in Shuffle Split Cross Validation they may be selected independently of each other. Stratification and Grouping may be applied to Shuffle Split as they are to K-Fold.</p>
<p>In my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a> I had to evaluate a large number of candidate models. K-Fold Cross Validation played an essential role in this.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Linear Regression2024-03-28T00:00:00+00:002024-03-28T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-28:/linear-regression.html<p>Fitting linear models</p><p>After the discussion of <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a> in the last article, it makes sense to discuss regression models themselves. For many problems, we wish to fit a function of the form</p>
<div class="math">$$y = m x + c$$</div>
<p>or, for multivariate problems</p>
<div class="math">$$\vec{y} = \mathbf{M} \vec{x} + \vec{c}$$</div>
<p>The simplest method for this is <em>Ordinary Least Squares</em>, which chooses the parameters so as to minimise the mean squared error of the model. This has a closed form solution, but there are disadvantages to using it with multivariate data. Firstly, there is a danger of overfitting, with variables of little importance adding to the complexity of the model, and secondly there is the possibility of dependencies existing between the input variables and thus introducing redundancy into the model. These issues may be addressed by applying <a href="https://PlayfulTechnology.co.uk/data-reduction.html">principal component analysis</a> to the input data, but this has the disadvantage of making the model less explainable.</p>
<p>There are a number of methods of reducing the complexity of multivariate linear regression models. One of these is <em>Least Angle Regression</em> (LARS). This is a method of fitting the model that minimises the number of components used to predict the outputs. Rather than following the gradient of the loss function, it adjusts the weights corresponding to the input variable that has the strongest correlation with the residuals at each step of the optimisation. When several variables have equally strong correlations with the target residuals, they are increased together in the joint least squares direction. While LARS identifies the most important variables contributing to the prediction, it does not solve the problem of collinearity between variables and is sensitive to noise.</p>
<p>Other methods for preventing overfitting involve adding a <em>regularisation penalty</em> to the loss function in the optimisation. For <em>Lasso regression</em>, this penalty is the sum of the absolute values of the weights, so the loss function to be optimised is</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} |M_{jk}|$$</div>
<p>
where <span class="math">\(N\)</span> is the number of samples and <span class="math">\(\alpha\)</span> is a hyperparameter.</p>
<p>For <em>Ridge regression</em>, the penalty term is the sum of the squares of the model weights, hence the loss function is </p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} M_{jk}^{2}$$</div>
<p>Lasso regression favours sparse models (that is, those with fewer non-zero weights), whereas ridge regression favours generally small weights.</p>
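<p>Unlike Lasso, Ridge regression has a closed-form solution, <span class="math">\(\mathbf{w} = (\mathbf{X}^{T}\mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^{T}\vec{y}\)</span> (up to how <span class="math">\(\alpha\)</span> is scaled relative to the loss above), which the following NumPy sketch implements. It assumes centred data, so no intercept is fitted, and the synthetic dataset is invented for illustration.</p>

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha I) w = X^T y.
    Assumes X and y are already centred, so no intercept term is needed."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(100)
w = ridge(X, y, alpha=0.1)   # close to [2.0, -1.0, 0.5]
```

<p>The <span class="math">\(\alpha \mathbf{I}\)</span> term both shrinks the weights and keeps the system well-conditioned when the input variables are collinear, which is why ridge regression is more robust than Ordinary Least Squares in that situation.</p>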
<p>These methods can be combined. <em>Lasso LARS</em> applies the Lasso regularisation penalty to LARS, which reduces LARS's vulnerability to collinearity and noise. In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used this method to fit numerical variables related to the progress of cancer to measures of activity in clusters of proteins. This method was chosen because I wished to assess which protein clusters were strong predictors.</p>
<p><em>ElasticNet</em> combines the Lasso and Ridge regression methods, optimising the loss function</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \left( \rho \sum_{j} \sum_{k} |M_{jk}| + (1 - \rho) \sum_{j} \sum_{k} M_{jk}^{2} \right)$$</div>
<p>where <span class="math">\(\rho\)</span> is another hyperparameter, ranging from 0 to 1, which determines the relative importance of the two regularisation penalties.</p>
<p>These algorithms, and a number of related ones, are implemented in <a href="https://scikit-learn.org/stable/modules/linear_model.html">Scikit-Learn</a>.</p>
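<p>As a sketch of how these estimators compare in practice, the snippet below fits Scikit-Learn's Lasso, Ridge, ElasticNet and Lasso LARS models to synthetic data in which only three of ten inputs influence the target, and counts the non-zero weights each model retains. The data, the alpha values and the l1_ratio (the <em>rho</em> above) are invented for illustration, not tuned.</p>

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LassoLars

# Synthetic data: only the first three of ten inputs influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_weights = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_weights + rng.normal(scale=0.1, size=200)

# Illustrative, untuned hyperparameters; l1_ratio plays the role of rho.
models = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "LassoLars": LassoLars(alpha=0.1),
}
for name, model in models.items():
    model.fit(X, y)
    nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name}: {nonzero} non-zero weights")
```

<p>On data like this, the Lasso-penalised models should drive most of the irrelevant weights to exactly zero, while Ridge merely shrinks them.</p>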
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Evaluation Metrics for Regression2024-03-21T00:00:00+00:002024-03-21T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-21:/evaluation-metrics-for-regression.html<p>How good is your regression model?</p><p>In the previous article, we looked at <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a> which are applicable when we are predicting discrete categories. This time, we'll look at how to evaluate models that predict continuous variables.</p>
<p>Suppose, in our test dataset, we have <span class="math">\(N\)</span> data points. We'll designate the predicted values as <span class="math">\(f_{i}\)</span> and the actual values as <span class="math">\(y_{i}\)</span>. One of the most obvious metrics to use is the <em>mean squared error</em></p>
<div class="math">$$\mathrm{MSE} = \frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}$$</div>
<p>This is essentially the variance of the errors. Since the mean squared error is often used as the loss function when fitting a regression model, we can easily compare this metric to the fitting loss to give an indication of how well the model has generalised. However, it can be difficult to interpret, since the scale of the metric is not the same as the original data. We may therefore wish to use the <em>root mean squared error</em></p>
<div class="math">$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}}$$</div>
<p>which is the standard deviation of the errors. However, both these metrics can be sensitive to outliers, because of the squaring of the errors, which effectively gives larger errors higher weight. A metric that is less sensitive to this is the <em>mean absolute error</em></p>
<div class="math">$$\mathrm{MAE} = \frac{\sum_{i} |y_{i} - f_{i}|}{N}$$</div>
<p>This gives the same weight to small errors as to large ones. If we were to choose a constant <span class="math">\(f\)</span> that minimises the mean absolute error, it would correspond to the median of <span class="math">\(y\)</span>.</p>
<p>If we wish to use a metric that is independent of the scale of the data, we can use the <em>mean absolute percentage error</em></p>
<div class="math">$$\mathrm{MAPE} = \frac{1}{N}\sum_{i} \left| \frac{y_{i} - f_{i}}{y_{i}} \right|$$</div>
<p>While this is intuitively easy to understand, it has two disadvantages. One is that it gives lower errors when the predicted values are too low than when they are too high (an over-prediction can contribute more than 100% error, but an under-prediction cannot), so minimising it favours under-forecasting. The other is that it can diverge if any of the values of <span class="math">\(y_{i}\)</span> are close to zero. There are a number of approaches to mitigating these disadvantages. The <em>weighted mean absolute percentage error</em></p>
<div class="math">$$\mathrm{wMAPE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i}|y_{i}|}$$</div>
<p>
is robust against divergence, because it scales the errors by the mean absolute value of the true values, rather than the individual true values.</p>
<p>The <em>symmetric mean absolute percentage error</em>
</p>
<div class="math">$$\mathrm{sMAPE} = \frac{100}{N} \sum_{i} \frac{|y_{i} - f_{i}|}{|y_{i}| + |f_{i}|}$$</div>
<p>
is bounded between 0% and 100%. When <span class="math">\(y_{i}\)</span> and <span class="math">\(f_{i}\)</span> are both 0, the datapoint's percentage error is taken to be 0.</p>
<p>The <em>mean absolute scaled error</em>
</p>
<div class="math">$$\mathrm{MASE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i} |y_{i} - \bar{y}|}$$</div>
<p>where </p>
<div class="math">$$\bar{y} = \frac{\sum_{i} y_{i}}{N}$$</div>
<p>is the mean of the true values. It is similar to the weighted mean absolute percentage error, but scales the errors by the sum of the absolute deviations from the mean rather than the sum of the absolute values. It gives equal weight to positive and negative errors.</p>
<p>The <em>mean absolute log error</em></p>
<div class="math">$$\mathrm{MALE} = \frac{\sum_{i}|\ln y_{i} - \ln f_{i}|}{N}$$</div>
<p>
gives equal weight to positive and negative errors, but requires the forecasted and actual values to be strictly positive, or it will diverge.</p>
<p>Another important metric is the <em>coefficient of determination</em>, or <em>explained variance</em></p>
<div class="math">$$R^{2} = 1 - \frac{\sum_{i}(y_{i} - f_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^2}$$</div>
<p>This can be seen as bearing a similar relationship to the mean squared error as the mean absolute scaled error has to the mean absolute error. It is a measure of how successful a model is at predicting the variability of the data. It is less sensitive to outliers than MSE, because an outlier will increase the denominator as well as the numerator. It is equivalent to the square of the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson correlation coefficient</a> between the actual and predicted values.</p>
<p>All these metrics test primarily for random errors. If we wish to test for systematic errors we can use the <em>mean signed difference</em></p>
<div class="math">$$\mathrm{MSD} = \frac{\sum_{i} y_{i} - f_{i}}{N}$$</div>
<p>which indicates the magnitude and direction of any likely bias in the model.</p>
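<p>All of the metrics above are one-liners in NumPy. The sketch below computes them for a small invented set of actual and predicted values; the numbers carry no significance.</p>

```python
import numpy as np

# Invented actual (y) and predicted (f) values.
y = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
f = np.array([2.8, 5.4, 2.0, 6.5, 5.0])
n = len(y)

mse = np.mean((y - f) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - f))
mape = np.mean(np.abs((y - f) / y))
wmape = np.sum(np.abs(y - f)) / np.sum(np.abs(y))
smape = 100 / n * np.sum(np.abs(y - f) / (np.abs(y) + np.abs(f)))
mase = np.sum(np.abs(y - f)) / np.sum(np.abs(y - y.mean()))
male = np.mean(np.abs(np.log(y) - np.log(f)))  # needs y, f > 0
r2 = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
msd = np.mean(y - f)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f}")
```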
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Evaluation Metrics for Classifiers2024-03-14T00:00:00+00:002024-03-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-14:/evaluation-metrics-for-classifiers.html<p>How good is your classifier model?</p><p>One vitally important task in any data science project is to assess how well the model performs. Various metrics are available for doing this, and each has its own advantages and disadvantages.
This is a large topic, so we will separate it into metrics suitable for classifiers (this article) and those suitable for regression (next article).</p>
<p>A detailed description of the performance of a classifier model is given by the <em>Confusion Matrix</em> <span class="math">\(\mathbf{C}\)</span>, where <span class="math">\(C_{ij}\)</span> is the number of instances of class <span class="math">\(i\)</span> that are predicted to belong to class <span class="math">\(j\)</span>. This is useful for visualising the performance of the classifier, and the metrics discussed below can be calculated from it.</p>
<p>Consider a binary classification problem. We may classify the results in our test dataset as True Positives, True Negatives, False Positives and False Negatives. The number of each of these is denoted <span class="math">\(\mathrm{TP} = C_{1,1}\)</span>, <span class="math">\(\mathrm{TN} = C_{0,0}\)</span>, <span class="math">\(\mathrm{FP} = C_{0,1}\)</span> and <span class="math">\(\mathrm{FN} = C_{1,0}\)</span> respectively.</p>
<p>The <em>Precision</em> of the classifier is the probability that an item predicted to be true is actually true. This is given by
</p>
<div class="math">$$ \mathrm{Pr} = \frac{\mathrm{TP}}{\mathrm{TP} +\mathrm{FP}}$$</div>
<p>
In Bayesian terms, if the predicted class is <span class="math">\(p\)</span> and the actual class is <span class="math">\(a\)</span>,
</p>
<div class="math">$$\mathrm{Pr} = P(a=\mathsf{True} \mid p=\mathsf{True})$$</div>
<p>The <em>Recall</em> of the classifier is the probability that a true item is predicted to be true. This is given by
</p>
<div class="math">$$\mathrm{R} = P(p=\mathsf{True} \mid a=\mathsf{True}) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$</div>
<p>Which of these is more informative depends on the application. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a>, my initial approach gave 100% Recall. However, since I had designated <em>True</em> to indicate a reliable article and <em>False</em> to indicate fake news, Precision was a more important measure of the model's ability to discriminate fact from fiction.</p>
<p>The F1 score is a metric that seeks to balance Precision and Recall. It is defined as their harmonic mean.</p>
<div class="math">$$F_{1} = \frac{2}{1/\mathrm{Pr} + 1/\mathrm{R}} = \frac{2 \mathrm{Pr} \mathrm{R}}{\mathrm{Pr} + \mathrm{R}} = \frac{2 \mathrm{TP}}{2 \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$</div>
<p>This measures similarity between the set of items predicted to be true and those that actually are true, but is not easy to interpret in terms of a Bayesian probability.</p>
<p>The <em>Accuracy</em> of the model is the probability that it predicts the correct class.
</p>
<div class="math">$$A = P(p=a) = \frac{\mathrm{TP} +\mathrm{TN}}{\mathrm{TP} +\mathrm{TN} + \mathrm{FP} +\mathrm{FN}}$$</div>
<p>
This is intuitive to interpret and, unlike the metrics discussed above, takes the true negatives into account. However, it becomes uninformative if classes are strongly imbalanced. For example, if we wish to predict whether or not a user will click on a given advertisement, we can achieve at least 99% accuracy by predicting <em>No</em> all the time. We therefore need metrics that correct for class imbalance.</p>
<p><em>Cohen's Kappa</em> is a measure of how much better a classifier is than guesswork. If we guessed the class of an item without information, our best strategy would be to pick the maximum-likelihood class every time, and this would give us a success rate of <span class="math">\(P_{\mathrm{max}}\)</span>. We can then define
</p>
<div class="math">$$\kappa = 1 - \frac{1 - A}{1 - P_{\mathrm{max}}}$$</div>
<p>The <em>Matthews Correlation Coefficient</em> is the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson Correlation Coefficient</a> between the actual and predicted classes. It is calculated as</p>
<div class="math">$$\phi = \frac{\mathrm{TP} \mathrm{TN} - \mathrm{FP} \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}
$$</div>
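<p>The binary metrics above follow directly from the four counts. The sketch below uses invented counts, and implements Cohen's Kappa exactly as defined here, with the guesswork success rate being that of always picking the more common class.</p>

```python
import math

# Invented counts for a binary problem of 100 test items.
TP, TN, FP, FN = 40, 45, 5, 10
total = TP + TN + FP + FN

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * TP / (2 * TP + FP + FN)
accuracy = (TP + TN) / total

# Cohen's kappa, using the simplified definition above: P_max is the
# success rate of always guessing the more common class.
p_max = max(TP + FN, TN + FP) / total
kappa = 1 - (1 - accuracy) / (1 - p_max)

# Matthews correlation coefficient.
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(f"Pr={precision:.3f} R={recall:.3f} F1={f1:.3f} "
      f"A={accuracy:.3f} kappa={kappa:.3f} phi={mcc:.3f}")
```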
<p>Accuracy and Cohen's Kappa can be extended to the multiclass case in the obvious way. It is not trivial to do this for Precision and Recall. However, we can define them on a per-class basis.
</p>
<div class="math">$$\mathrm{Pr}_{i} = \frac{C_{ii}}{\sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R}_{i} = \frac{C_{ii}}{\sum_{j} C_{ij}}$$</div>
<p><a href="https://www.evidentlyai.com/classification-metrics/multi-class-metrics">Evidently AI</a> suggests three methods for calculating overall precision and recall scores in a multiclass problem. <em>Macro averaging</em> simply calculates the mean of precision and recall across all classes.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \mathrm{Pr}_{i}}{N}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \mathrm{R}_{i}}{N}$$</div>
<p>where <span class="math">\(N\)</span> is the number of classes.</p>
<p><em>Micro averaging</em> gives an average of precision and recall across all instances.</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{j} C_{ii}}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<p>These two expressions are equal, as a false negative for one class is a false positive for another; in a single-label problem, micro-averaged precision and recall both reduce to the overall accuracy. So while finer grained in one way, micro averaging loses information in another.</p>
<p>The third possibility is <em>weighted averaging</em>. While macro averaging gives all classes equal weight, weighted averaging weights each class by its prevalence in the data.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \left( \mathrm{Pr}_{i} \sum_{j} C_{ij} \right)}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \left( \mathrm{R}_{i} \sum_{j} C_{ij} \right)}{\sum_{i} \sum_{j} C_{ij}}$$</div>
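<p>Scikit-Learn exposes all three averaging schemes through the <code>average</code> parameter of its metric functions. The three-class label arrays below are invented for illustration.</p>

```python
from sklearn.metrics import precision_score, recall_score

# Invented three-class actual and predicted labels.
actual    = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

for average in ("macro", "micro", "weighted"):
    p = precision_score(actual, predicted, average=average)
    r = recall_score(actual, predicted, average=average)
    print(f"{average}: precision={p:.3f} recall={r:.3f}")
```

<p>Note that the micro-averaged precision and recall come out identical, as discussed above.</p>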
<p>To generalise the Matthews Correlation Coefficient to multiple classes, we first define the following terms
</p>
<div class="math">$$t_{k} = \sum_{j} C_{kj}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> occurs
</p>
<div class="math">$$p_{k} = \sum_{j} C_{jk}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> is predicted
</p>
<div class="math">$$c = \sum_{k} C_{kk}$$</div>
<p> is the number of correct predictions
</p>
<div class="math">$$s =\sum_{i} \sum_{j} C_{ij}$$</div>
<p> is the total number of samples</p>
<p>We then obtain
</p>
<div class="math">$$\phi = \frac{c s - \vec{t} \cdot \vec{p}}{\sqrt{s^{2} - |\vec{p}|^{2}}\sqrt{s^{2} - |\vec{t}|^{2}}}$$</div>
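<p>This multiclass formula can be sketched directly from a confusion matrix with NumPy. The matrix values below are invented.</p>

```python
import numpy as np

# Invented three-class confusion matrix: C[i, j] counts instances of
# class i predicted as class j.
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])

t = C.sum(axis=1)   # t_k: times each class occurs
p = C.sum(axis=0)   # p_k: times each class is predicted
c = np.trace(C)     # number of correct predictions
s = C.sum()         # total number of samples

phi = (c * s - t @ p) / (np.sqrt(s**2 - p @ p) * np.sqrt(s**2 - t @ t))
print(phi)
```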
<p>Once you have the numbers, of course, it's important to dig deeper and understand what the factors influencing your model's performance are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>PageRank2024-03-07T00:00:00+00:002024-03-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-07:/pagerank.html<p>Using the connectivity of networks to rank items</p><p>Early web directories relied on hand-curated indexes of content. This was, of course, difficult to scale. What was needed was an automatic way of ranking web pages. During his PhD at Stanford University, Larry <em>Page</em> developed an algorithm for <em>ranking</em> the importance of nodes in a network (such as web <em>pages</em>) in terms of their connections, and then went on to found Google to exploit his research.</p>
<p>The <em>PageRank</em> algorithm is based on three assumptions:</p>
<ol>
<li>The more valuable a web page is, the more likely other web pages are to link to it.</li>
<li>Links originating from more valuable pages confer more value on the pages they link to.</li>
<li>Pages that link indiscriminately to many other pages confer less value on those pages than pages which link more selectively.</li>
</ol>
<p>Based on these assumptions, it then models a <em>random walk</em> taken through the Internet by a user clicking web links at random. If the user is viewing a web page <span class="math">\(i\)</span> that has <span class="math">\(N_{i}\)</span> outgoing links, they have a probability <span class="math">\(d\)</span> (known as the <em>damping factor</em>, and typically chosen as 0.85) of clicking a link to another page. This link is assumed to be chosen with uniform probability from the page's outgoing links. The PageRank <span class="math">\(P_{i}\)</span> for the page is a measure of how likely the page is to be found by this method.</p>
<p>If <span class="math">\(L_{i}\)</span> is the set of pages that link to <span class="math">\(i\)</span>, the PageRank satisfies the equation</p>
<div class="math">$$P_{i} = d \sum_{j \in L_{i}} \frac{P_{j}}{N_{j}} + (1-d)$$</div>
<p>This is solved iteratively.</p>
<p>If we define a <em>connection matrix</em> <span class="math">\(\mathbf{C}\)</span> such that <span class="math">\(C_{ij}\)</span> is <span class="math">\(1/N_{j}\)</span> if <span class="math">\(j\)</span> connects to <span class="math">\(i\)</span> and 0 otherwise, we can express this as a matrix equation</p>
<div class="math">$$\vec{P} = d \mathbf{C} \cdot \vec{P} + (1-d)$$</div>
<p>We then see that the PageRank is a modified form of the first eigenvector of the connection matrix.</p>
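<p>The iterative solution can be sketched in a few lines of NumPy. The four-page link graph below is invented; note that page 3, which nothing links to, ends up with the minimum possible score of 1 - d.</p>

```python
import numpy as np

# links[j] lists the pages that page j links to; d is the damping factor.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)
d = 0.85

# Connection matrix: C[i, j] = 1/N_j if page j links to page i, else 0.
C = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        C[i, j] = 1.0 / len(targets)

# Iterate P = d C P + (1 - d) until it stops changing.
P = np.ones(n)
for _ in range(200):
    P_new = d * C @ P + (1 - d)
    converged = np.abs(P_new - P).max() < 1e-9
    P = P_new
    if converged:
        break

print(P)  # page 2, with the most incoming links, ranks highest
```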
<p>Like <a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a>, PageRank is an example of a <em>collective intelligence</em> algorithm, in that it uses data from the actions of a large number of people to infer its scores.</p>
<p>PageRank is one of the most commercially successful algorithms ever devised; however, its uses are not limited to ranking web pages. It can be used to analyse any data that can be modelled as a graph, such as citations in academic papers, patterns of gene activation in cells, or connections in the nervous system. A survey of these uses can be found in <a href="https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf">PageRank Beyond the Web</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Collaborative Filtering2024-02-29T00:00:00+00:002024-02-29T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-29:/collaborative-fitering.html<p>A basic recommendation algorithm</p><p>A problem of interest to a lot of businesses is <em>recommendation</em> - how to predict what their customers are likely to want. One of the simplest approaches to this is <em>Collaborative Filtering</em>, which works by identifying users with similar tastes.</p>
<p>Suppose each user <span class="math">\(i\)</span> has rated a set of items <span class="math">\(R_{i}\)</span>, giving each item <span class="math">\(n\)</span> a score <span class="math">\(S_{i,n}\)</span>. For a second user <span class="math">\(j\)</span>, we can obtain the intersection of their rated items <span class="math">\(R_{i} \cap R_{j}\)</span> and from these compute a weight <span class="math">\(w_{ij}\)</span> using a suitable <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">similarity metric</a> on the scores each user has given to their common items. The most common metrics to use would be the cosine similarity or Pearson correlation. If all scores are positive or zero, cosine similarity will give weights in the range <span class="math">\(0 \le w_{ij} \le 1\)</span>, whereas the Pearson correlation will give weights in the range <span class="math">\(-1 \le w_{ij} \le 1\)</span>, which is potentially more sensitive to polarisation in people's tastes. If two users have no items in common, <span class="math">\(w_{ij} = 0\)</span>.</p>
<p>For an item <span class="math">\(n\)</span> which user <span class="math">\(i\)</span> has not rated, we may then calculate a predicted rating
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} S_{j,n} w_{ij}}{\sum_{j \mid n \in R_{j}} w_{ij}}$$</div>
<p>This is the weighted mean of other users' ratings for the item, weighted according to the similarity of the users' ratings on other items. Items with a high predicted score for a given user can then be recommended to that user. The name <em>Collaborative Filtering</em> refers to the users collaborating through the algorithm to filter the items according to each other's preferences.</p>
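<p>A minimal sketch of the scheme so far, using cosine similarity over co-rated items. The users, items and scores are invented, as are the helper names <code>cosine_weight</code> and <code>predict</code>.</p>

```python
import numpy as np

# Each user's ratings: user -> {item: score}. All invented.
ratings = {
    "alice": {"film1": 5.0, "film2": 3.0, "film3": 4.0},
    "bob":   {"film1": 4.0, "film2": 2.0, "film4": 5.0},
    "carol": {"film2": 5.0, "film3": 1.0, "film4": 2.0},
}

def cosine_weight(u, v):
    """Cosine similarity over the items both users have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if not common:
        return 0.0
    a = np.array([ratings[u][i] for i in common])
    b = np.array([ratings[v][i] for i in common])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(user, item):
    """Similarity-weighted mean of other users' scores for the item."""
    others = [u for u in ratings if u != user and item in ratings[u]]
    weights = np.array([cosine_weight(user, u) for u in others])
    scores = np.array([ratings[u][item] for u in others])
    return float(scores @ weights / weights.sum())

print(predict("alice", "film4"))  # alice has not rated film4
```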
<p>So far we have assumed that users have rated the items with a numerical score. However, in many applications, we only have a binary choice - for example, whether users have purchased an item, or shared a link. In this case, we can use the Tanimoto metric
</p>
<div class="math">$$w_{ij} = \frac{|R_{i} \cap R_{j}|}{|R_{i} \cup R_{j}|}$$</div>
<p> as the weighting between users. The predicted rating for <span class="math">\(n\)</span> then becomes
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} w_{ij}}{|\left\{j \mid n \in R_{j}\right\}|}$$</div>
<p>
that is, the average similarity to the user of users who have chosen the item.</p>
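<p>The binary variant can be sketched the same way, with the Tanimoto metric replacing cosine similarity. Again, the names and item sets are invented, including the helpers <code>tanimoto</code> and <code>predict</code>.</p>

```python
# Each user's chosen items as a set; all invented.
chosen = {
    "alice": {"film1", "film2", "film3"},
    "bob":   {"film1", "film2", "film4"},
    "carol": {"film2", "film4"},
}

def tanimoto(u, v):
    """Size of the intersection over size of the union."""
    return len(chosen[u] & chosen[v]) / len(chosen[u] | chosen[v])

def predict(user, item):
    """Average similarity to `user` of the users who chose `item`."""
    others = [u for u in chosen if u != user and item in chosen[u]]
    return sum(tanimoto(user, u) for u in others) / len(others)

print(predict("alice", "film4"))
```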
<p>Collaborative filtering and other recommendation algorithms suffer from the <em>bootstrap problem</em>, in that they require a lot of user data to work effectively, but when starting something up, that data is not available. Until a user has rated a significant number of items, it will not be possible to predict accurately what they will like, and until a significant number of people have rated an item, it will not be possible to predict accurately who will like it. As a result, recommendation systems cannot function effectively as a product in their own right, but work best as a feature of a larger product.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>K-Means Clustering2024-02-22T00:00:00+00:002024-02-22T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-22:/k-means-clustering.html<p>Finding clusters by their centroids</p><p>In the previous article, we discussed <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>. Another commonly used method is the <em>K-Means</em> algorithm, which attempts to find <span class="math">\(K\)</span> clusters such that the variance within the clusters is minimised. It does this by the following method</p>
<ol>
<li>Given an appropriately scaled dataset, choose <span class="math">\(K\)</span> points in the range of the data</li>
<li>Assign each point in the dataset to a cluster associated with the nearest of these points, according to the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Euclidean distance</a></li>
<li>Recalculate the points as the means of the datapoints assigned to their clusters</li>
<li>Repeat from step 2 until the assignments converge</li>
</ol>
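<p>The four steps above can be sketched directly in NumPy. This is an illustrative implementation using Forgy initialisation on a small synthetic dataset, not a substitute for a library version.</p>

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimal K-means: Forgy initialisation, then alternate between
    assigning points to the nearest centroid and recalculating the
    centroids until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # step 1
    for _ in range(n_iter):
        # step 2: assign each point to the nearest centroid (Euclidean)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: recalculate the centroids as the cluster means
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):  # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = k_means(X, K=2)
```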
<p>Several different methods may be used to assign the initial centroids. The <em>Random Partition</em> method initially assigns each datapoint to a random cluster and takes the means of those clusters as the starting points. This tends to produce initial centroids close to the centre of the dataset. <em>Forgy's method</em> chooses <span class="math">\(K\)</span> datapoints randomly from the dataset as the initial centroids. This tends to give more widely spaced centroids. A variation of this, the <em>kmeans++</em> method, weights the probability of choosing each datapoint as a centroid by the minimum squared distance of that point from the centroids already chosen. This is the default in <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">Scikit-Learn's implementation of K-means</a>, since it is considered more robust. Because K-means is not guaranteed to converge to the optimal solution and is sensitive to its initial conditions, it is common practice to rerun the clustering several times with different sets of initial centroids and choose the solution with the lowest variance.</p>
<p>Another issue with K-means is how many clusters to choose. This may be done by visualising the data in advance, or with the <em>silhouette score</em>. This is a measure of how much closer a datapoint is to the other datapoints in its own cluster than to the datapoints in other clusters. For a datapoint <span class="math">\(i\)</span> which is a member of cluster <span class="math">\(C_{k}\)</span>, which has <span class="math">\(N_{k}\)</span> datapoints assigned to it, we first calculate the mean distance of <span class="math">\(i\)</span> from the other members of <span class="math">\(C_{k}\)</span></p>
<div class="math">$$a_{i} = \frac{\sum_{j \in C_{k},j \neq i} d(i,j)}{N_{k}-1}$$</div>
<p>where <span class="math">\(d(i,j)\)</span> is the distance between datapoints <span class="math">\(i\)</span> and <span class="math">\(j\)</span></p>
<p>We then find the mean distance between <span class="math">\(i\)</span> and the datapoints in the closest cluster to it other than the one to which it is assigned.</p>
<div class="math">$$b_{i} = \min_{l \neq k} \frac{\sum_{j \in C_{l}} d(i,j)}{N_{l}}$$</div>
<p>The silhouette score for an individual point is then calculated as </p>
<div class="math">$$s_{i} = \frac{b_{i} - a_{i}}{\max(b_{i},a_{i})}$$</div>
<p>This has a range of -1 to 1, where a high value would indicate that a datapoint is central to its cluster and a low value that it is peripheral. We may then calculate the mean of <span class="math">\(s_{i}\)</span> over the dataset. The optimum number of clusters is the one that maximises this score.</p>
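<p>The silhouette calculation can be written out directly from the formulas above. The following NumPy sketch uses a small synthetic dataset; note that every cluster must contain at least two points for <span class="math">\(a_{i}\)</span> to be defined.</p>

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score, following the formulas above.
    Every cluster must contain at least two datapoints."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        k = labels[i]
        same = (labels == k) & (np.arange(n) != i)
        a = D[i, same].mean()               # mean distance within own cluster
        b = min(D[i, labels == l].mean()    # mean distance to nearest other cluster
                for l in set(labels) if l != k)
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two tight synthetic pairs, far apart: the score should be close to 1
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels = np.array([0, 0, 1, 1])
score = mean_silhouette(X, labels)
```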
<p><em>X-means</em> is a variant of K-means that aims to select the optimum number of clusters automatically. It proceeds as follows.</p>
<ol>
<li>Perform K-means on the dataset with <span class="math">\(K=2\)</span>.</li>
<li>For each cluster, perform K-means again with <span class="math">\(K=2\)</span> for the members of that cluster.</li>
<li>Use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to determine whether this improves the model. Keep subdividing clusters until it does not.</li>
<li>When no further subdivisions are necessary, use the centroids of the clusters thus obtained as the starting point for a final round of K-means clustering on the full dataset.</li>
</ol>
<p>K-means clustering requires the clusters to be linearly separable. If this is not the case, it is necessary to perform <a href="https://PlayfulTechnology.co.uk/data-reduction.html">Kernel PCA</a> to map the dataset into a space where they are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
</tr>
</tbody>
</table>
Hierarchical Clustering2024-02-15T00:00:00+00:002024-02-15T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-15:/hierarchical-clustering.html<p>Clustering data into trees of related items</p><p>When exploring a dataset, it is often useful to identify what groups or <em>clusters</em> of items may exist within the data. This is known as <em>unsupervised</em> learning, since it attempts to learn what classes exist within the data without prior knowledge of what they are, as opposed to <em>supervised learning</em> (classification), which trains a model to identify known classes in the dataset.</p>
<p>A simple method for this is <em>Hierarchical Clustering</em>. This arranges the datapoints in a tree structure by the following method.</p>
<ol>
<li>Assign each data point to a <em>leaf node</em></li>
<li>Calculate the distances between the nodes using an appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a></li>
<li>Create the <em>parent node</em> of the two nodes that are closest to each other, and replace those two <em>daughter nodes</em> with it.</li>
<li>Calculate the distances of the new node to each of the remaining nodes in the dataset</li>
<li>Repeat from step 3 until all the nodes have been merged into a single tree.</li>
</ol>
<p>At stage 4, there are a number of different <em>linkage methods</em> for calculating the new distances. The main ones are</p>
<dl>
<dt>Single linkage</dt>
<dd>use the minimum distance between two points in each cluster</dd>
<dt>Complete (or maximum) linkage</dt>
<dd>use the maximum distance between two points in each cluster</dd>
<dt>Average linkage</dt>
<dd>use the average distance between points in the two clusters. With Euclidean distances, this can be simplified to the distance between the centroids of the clusters</dd>
<dt>Ward linkage</dt>
<dd>calculate distances between clusters recursively with the formula
<div class="math">$$d(u,v) = \sqrt{ \frac{\left(n_{v} + n_{s}\right) d(s,v)^{2} + \left(n_{v} + n_{t}\right) d(t,v)^{2} - n_{v} d(s,t)^{2}}{n_{s} + n_{t} + n_{v}}}$$</div>
where <span class="math">\(u\)</span> is the cluster formed by merging <span class="math">\(s\)</span> and <span class="math">\(t\)</span>, <span class="math">\(v\)</span> is another cluster, and <span class="math">\(n_{c}\)</span> is the number of datapoints in cluster <span class="math">\(c\)</span>. The distance between two leaf nodes is Euclidean. This has the property of minimising the variance of the new cluster.</dd>
</dl>
<p>Ward linkage is the technique most likely to give even cluster sizes, while single linkage is the one most useful when cluster shapes are likely to be irregular.</p>
<p>One problem with Hierarchical Clustering is that, as described above, it does not produce discrete clusters. One way to address this is to choose a number of clusters in advance and terminate the clustering early when that number is reached. The number of clusters may be chosen by visualising the data, either with <a href="https://PlayfulTechnology.co.uk/data-reduction.html">t-SNE</a> or by performing an initial clustering and plotting the tree structure on a dendrogram. Another method is to choose a distance threshold, and not merge clusters further apart than this; choosing an appropriate threshold requires some knowledge of the statistical distribution of distances between clusters. While I have not seen this implemented, it would be theoretically possible to use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to decide when to stop merging clusters - this approach would be most useful when Ward linkage was used.</p>
<p>In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used Hierarchical Clustering to identify groups of proteins whose activity was related in patients.</p>
<p>Implementations of Hierarchical Clustering can be found in <a href="https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html">Scipy</a> and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html">Scikit-Learn</a>.</p>
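<p>As a brief sketch of the Scipy interface, the following builds the tree with Ward linkage and then cuts it into a chosen number of flat clusters. The data are synthetic.</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated synthetic clusters
X = np.vstack([rng.normal((0, 0), 0.2, size=(20, 2)),
               rng.normal((4, 4), 0.2, size=(20, 2))])

Z = linkage(X, method="ward")                    # build the full tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into two flat clusters
```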
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
</tr>
</tbody>
</table>
Outlier Detection2024-02-08T00:00:00+00:002024-02-08T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-08:/outlier-detection.html<p>Finding the Odd One Out</p><p>Many datasets contain <em>outliers</em>, datapoints which do not fit the general pattern of the observations. This may be due to errors in the data collection, in which case removing these datapoints will make models fitted to the data more robust and reduce the risk of overfitting. In other cases, the outliers themselves are the signal we want to detect.</p>
<p>One method for doing this is <em>Isolation Forests</em>. As the name implies, it is related to the <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a> algorithm discussed in the previous article. It fits a forest of (usually around 100) random decision trees to the dataset by the following method.</p>
<ol>
<li>Pick a feature at random</li>
<li>Pick a random threshold in the range of that feature</li>
<li>Partition the data at that threshold</li>
<li>Repeat the process for each partition</li>
</ol>
<p>We can then calculate an anomaly score for each datapoint. This is the depth in the decision tree at which a datapoint becomes isolated from the rest of the dataset. The mean of this score over all the trees gives a robust estimate of how easily a datapoint can be separated from the rest. The advantages of this method are that it makes no assumptions about the underlying distribution of the data, and that it is explainable, in that the features which are most likely to contribute to a datapoint being isolated can be identified from the decision trees.</p>
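<p>A short sketch of how this looks with Scikit-Learn's implementation, on synthetic data with a single planted outlier:</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])   # plant one obvious outlier at index 200

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)   # lower scores are more anomalous
```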
<p>I used Isolation Forests in my work at <a href="https://PlayfulTechnology.co.uk/amey-strategic-consulting.html">Amey Strategic Consulting</a> to identify faulty traffic flow sensors in the Strategic Road Network.</p>
<p>Another method that makes no assumptions about the underlying distribution is <em>Local Outlier Factors</em>. This calculates how different datapoints are from their local neighbourhood. First we calculate the distances <span class="math">\(S_{i,j}\)</span> between datapoints in the sample using some appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a> (this requires all variables to be appropriately scaled) and identify the <span class="math">\(N\)</span> nearest neighbours of each datapoint (usually 20). We then calculate the <em>local density</em> <span class="math">\(D_{i}\)</span> for each datapoint. This is the inverse of the mean of the distances between the point and each of its neighbours.
</p>
<div class="math">$$D_{i} = \frac{N}{\sum_{k} S_{i,k}}$$</div>
<p> where <span class="math">\(k\)</span> ranges over the indices of the point's neighbours. We can then calculate the <em>Local Outlier Factor</em> <span class="math">\(\mathrm{LOF}\)</span> for each datapoint. This is the mean of the ratio between the datapoint's local density and that of each of its neighbours, <em>ie</em>
</p>
<div class="math">$$\mathrm{LOF}_{i} = \frac{\sum_{k} \frac{D_{i}}{D_{k}}}{N}$$</div>
<p>Samples whose Local Outlier Factor is below a given threshold (<em>ie</em> those whose local density is lower than that of their neighbours) can be identified as outliers.</p>
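<p>A NumPy sketch of this calculation on synthetic data. Note that the more common formulation of LOF uses the reciprocal ratio, so that outliers have scores much greater than 1; the version below follows the text above, where outliers have low scores.</p>

```python
import numpy as np

def local_outlier_factors(X, n_neighbors=20):
    """LOF as defined above: the mean ratio of each point's local
    density to those of its neighbours. Low values indicate outliers."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # nearest neighbours of each point, excluding the point itself
    nbrs = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    # local density: inverse of the mean distance to the neighbours
    density = n_neighbors / D[np.arange(n)[:, None], nbrs].sum(axis=1)
    return np.array([(density[i] / density[nbrs[i]]).mean() for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[10.0, 10.0]]])  # outlier at index 50
lof = local_outlier_factors(X, n_neighbors=10)
```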
<p>If we can assume that the data are drawn from a multivariate Gaussian distribution, we can use an <em>Elliptic Envelope</em> method. For a sample of size <span class="math">\(N\)</span> with <span class="math">\(d\)</span> dimensions, we choose a subsample size <span class="math">\(h\)</span> such that
</p>
<div class="math">$$\left\lfloor \frac{N+d+1}{2} \right\rfloor \le h \le N$$</div>
<p>
We then select a large number of subsamples of size <span class="math">\(h\)</span> from the dataset, and calculate the mean and covariance of each. The one where the covariance has the smallest determinant is the one least likely to contain outliers. Datapoints with a large Mahalanobis distance from the mean of this sample are therefore likely to be outliers.</p>
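<p>Scikit-Learn packages this approach as EllipticEnvelope, built on its Minimum Covariance Determinant estimator. A brief sketch on synthetic correlated Gaussian data with one planted outlier:</p>

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)
X = np.vstack([X, [[6.0, -6.0]]])   # far from the correlated cloud, at index 200

detector = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)         # -1 for outliers, 1 for inliers
```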
<p>Of these methods, I'd expect Isolation Forests to be the one most likely to be useful in the widest variety of circumstances.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
</tr>
</tbody>
</table>
Random Forests2024-02-01T00:00:00+00:002024-02-01T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-01:/random-forests.html<p>Classification and regression with ensembles of decision trees.</p><p>We have mentioned classification problems in a number of previous articles, and shown how they can be approached with <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>, <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a> and, by extension, neural networks. This week we'll examine a different method, based on <em>Decision Trees</em>.</p>
<p>A decision tree can be thought of as a set of nested if/else statements. It can be fitted by the following procedure.</p>
<ol>
<li>Find the variable that correlates most strongly with the target variable.</li>
<li>Find the set of thresholds against that variable that comes closest to splitting the data into nodes that correspond to the target classes.</li>
<li>Repeat this for each of the nodes you have split the data into, until each <em>leaf node</em> contains a single class.</li>
</ol>
<p>However, this is prone to <em>overfitting</em>, whereby the model fits every detail of the training data but does not generalise well when classifying new data. In effect, it fits the noise as well as the signal.</p>
<p><em>Random Forests</em> is an algorithm that addresses this problem. As the word forest implies, it fits a large number (typically 100) of decision trees to the training data. Each, however, is trained only on a subset of the training data and with a subset of the variables. These subsets are chosen randomly for each tree in the forest.</p>
<p>While each individual tree in the forest will tend to overfit, the fact that they were all fitted against different subsets of the data and variables will mean that the errors they make on new data will not be correlated. Therefore a majority vote of the trees provides a much more robust classifier than any individual tree would. It is also possible to take account of uncertainty in the classification by reporting the number of individual trees that voted for each class - in Bayesian terms, this corresponds to <span class="math">\(P(H \mid O)\)</span>. Algorithms that combine the results of multiple classifiers in this way are known as <em>ensemble methods</em>.</p>
<p>If the target variable is continuous, Random Forests can also be used for regression. In this case, the fitting of the decision trees terminates when the variance of the samples in each leaf node fall below a certain threshold. The prediction is then the mean of the predictions from the individual trees.</p>
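<p>A brief Scikit-Learn sketch on a synthetic classification problem; predict_proba reports the proportion of trees voting for each class, which is the uncertainty measure described above.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic problem: the class is 1 when the two features sum to a positive number
X = rng.normal(size=(500, 2))
y = (X.sum(axis=1) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Proportion of trees voting for each class, for two points far from the boundary
proba = forest.predict_proba(np.array([[2.0, 2.0], [-2.0, -2.0]]))
```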
<p>Random Forests tend to give better results than Logistic Regression when the target classes are unbalanced, and the algorithm is noted for having a high success rate in <a href="https://kaggle.com">Kaggle</a> competitions. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a> I found it gave good results in using grammatical features to classify Fake News.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
</tr>
</tbody>
</table>
Information Theory2024-01-25T00:00:00+00:002024-01-25T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-25:/information-theory.html<p>How much information does your data contain?</p><p>Data science can be described as turning data into information. However, we need to know how much information there is to find and where to find it. There are various methods we can use to measure this, which derive from the field of <em>Information Theory</em>.</p>
<p>The most basic of these measurements is <em>entropy</em>, which was introduced by Claude Shannon. If a variable has a probability distribution <span class="math">\(p_{i}\)</span>, the entropy of that variable is given by
</p>
<div class="math">$$H = -\sum_{i} p_{i} \log_{2}p_{i}$$</div>
<p>
This is the expected number of binary decisions needed to identify a value of the variable, or, if we were to generate a stream of symbols from that distribution, the average number of bits per symbol that would be needed to encode that stream in an optimal lossless compression.
This is useful for identifying which variables are most important. Entropy has its maximum value of <span class="math">\(\log_{2} N\)</span>, where <span class="math">\(N\)</span> is the number of possible values, when the values are evenly distributed, and its minimum value of 0 when one value is a certainty.</p>
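<p>The entropy formula translates directly into code; for example, in Python:</p>

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]            # by convention, 0 * log2(0) = 0
    return float(-(p * np.log2(p)).sum())

# A fair coin carries one bit per toss; a certain outcome carries none
h_coin = entropy([0.5, 0.5])
h_certain = entropy([1.0])
```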
<p>We also need to quantify how much information is contained in the relationship between two variables. Suppose that two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> have individual probability distributions <span class="math">\(p_{i}\)</span> and <span class="math">\(p_{j}\)</span>, and a joint probability distribution <span class="math">\(p_{ij}\)</span>. If the variables are statistically independent, these distributions would satisfy the relationship <span class="math">\(p_{ij} = p_{i} p_{j}\)</span>. <em>Mutual information</em> characterises the deviation from this as
</p>
<div class="math">$$\mathrm{MI}(A,B) = \sum_{i} \sum_{j} p_{ij} \log_{2} \frac{p_{ij}}{p_{i} p_{j}}$$</div>
<p>
This is the amount of information that knowing the value of one variable will tell you about the other. This can be used for feature selection. Consider two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> and a target variable <span class="math">\(T\)</span>. If <span class="math">\(\textrm{MI}(A,T) > \textrm{MI}(B,T)\)</span> and <span class="math">\(\textrm{MI}(A,B) > \textrm{MI}(B,T)\)</span>, it is likely that any relationship between <span class="math">\(B\)</span> and <span class="math">\(T\)</span> is entirely a consequence of their mutual relationship with <span class="math">\(A\)</span>. Therefore, <span class="math">\(B\)</span> can safely be discarded.</p>
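Mutual information can be computed directly from a joint probability table. A small sketch (hypothetical function name, toy tables chosen to show the two extremes):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information, in bits, from a joint probability table p_ij."""
    joint = np.asarray(joint, dtype=float)
    p_i = joint.sum(axis=1, keepdims=True)  # marginal distribution of A
    p_j = joint.sum(axis=0, keepdims=True)  # marginal distribution of B
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (p_i * p_j)[mask]))

# Perfectly dependent variables: knowing A tells you B exactly (1 bit).
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
# Independent variables: p_ij = p_i * p_j everywhere, so MI = 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```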
<p>In <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It A Mushroom or Is It A Toadstool</a> I used mutual information to infer hidden variables when building a <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayesian Belief Network</a>.</p>
<p>There are a number of information-theory-based methods for selecting models. The best known of these, which are closely related to each other, are the <em>Bayesian Information Criterion</em></p>
<div class="math">$$\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$$</div>
<p> and the <em>Akaike Information Criterion</em></p>
<div class="math">$$\mathrm{AIC} = 2 (k - \ln \hat{L})$$</div>
<p>where <span class="math">\(k\)</span> is the number of free parameters in the model, <span class="math">\(n\)</span> is the number of data points to which the model is fitted, and <span class="math">\(\hat{L}\)</span> is the likelihood of the data under the optimally fitted model. In both cases, a lower value indicates a better model, favouring models that give a high likelihood of the data and penalising more complex models. The main difference between them is that the Bayesian Information Criterion penalises complexity more heavily, especially for larger datasets.</p>
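Both criteria are trivial to compute once a model has been fitted. A sketch with made-up numbers, purely to show the comparison:

```python
import math

def aic(k, log_likelihood):
    return 2 * (k - log_likelihood)

def bic(k, n, log_likelihood):
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical example: model A has 3 parameters and log-likelihood -420;
# model B has 10 parameters and a slightly better log-likelihood of -415.
# Both are fitted to the same 1000 data points. Lower is better.
print(aic(3, -420.0), aic(10, -415.0))
print(bic(3, 1000, -420.0), bic(10, 1000, -415.0))
# The BIC penalises model B's extra parameters more heavily than the AIC does.
```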
<p>There are many other uses for information theory in data science, but I'd like to finish with one relevant to natural language processing. Marcello Montemurro and Damian Zanette published a paper entitled <a href="https://arxiv.org/abs/0907.1558">Towards the quantification of semantic information in written language</a> in which they introduced a technique for using the entropy of word frequency distributions across different parts of a document to identify the most significant words, according to the role they play in its structure. I illustrate this in <a href="https://PlayfulTechnology.co.uk/the-entropy-of-alice-in-wonderland.html">The Entropy of Alice in Wonderland</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Latent Semantic Indexing2024-01-18T00:00:00+00:002024-01-18T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-18:/latent-semantic-indexing.html<p>Reducing the dimensionality of language data</p><p>In the article on <a href="https://PlayfulTechnology.co.uk/data-reduction.html">data reduction</a>, we mentioned the <em>curse of dimensionality</em>, whereby large numbers of features make data increasingly difficult to analyse meaningfully. If we take another look at <a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a>, we see that this will generate a feature for each unique word in the corpus that it is trained on, which may be in the tens of thousands. It therefore makes sense to apply a data reduction method and obtain a more compact representation.</p>
<p>TF-IDF, as previously discussed, makes use of the fact that words that occur in some documents but not others are the most useful for distinguishing between the documents. This means that its feature vectors will generally be quite sparse. Therefore, the most appropriate data reduction method to use will be Singular Value Decomposition.</p>
<div class="math">$$\mathbf{TFIDF} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>Typically around 200 components are retained. The left singular vectors <span class="math">\(\mathbf{U}\)</span> then represent documents in the lower-dimensional space, while the right singular vectors <span class="math">\(\mathbf{V}\)</span> represent words in the same space. Words that tend to appear in the same documents will have similar vector representations, and according to the <em>distributional hypothesis</em>, this gives an implicit representation of their meaning. This implicit representation of meaning gives the technique the name <em>Latent Semantic Analysis</em>.</p>
<p>Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span>, we can calculate a query vector
</p>
<div class="math">$$\vec{q} = \sum_{i}\mathbf{V}_{w_{i}}$$</div>
<p>
We can then search our corpus for the most relevant documents to match the query by calculating a score
</p>
<div class="math">$$S = \mathbf{U} \cdot \vec{q}$$</div>
<p> and selecting the documents with the greatest score. Since it can be used to search the corpus in this way, Latent Semantic Analysis is also known as <em>Latent Semantic Indexing</em>.</p>
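The whole pipeline — TF-IDF, truncated SVD, query vector, scoring — can be sketched with scikit-learn's TruncatedSVD. The toy corpus and variable names below are my own, and in practice far more than two components would be retained:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock prices fell on market news",
    "the market rallied as traders bought stock",
]

tfidf = TfidfVectorizer().fit(docs)
X = tfidf.transform(docs)                  # documents x words, sparse

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)         # documents in the reduced space
word_vectors = svd.components_.T           # words in the same space

# Build the query vector by summing the vectors of the query's words...
q = sum(word_vectors[tfidf.vocabulary_[w]] for w in ["stock", "market"])
# ...then score every document against it and rank.
scores = doc_vectors @ q
print(np.argsort(scores)[::-1])  # the finance documents should rank highest
```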
<p>An implementation of Latent Semantic Indexing (LSI) can be found in the <a href="https://radimrehurek.com/gensim/models/lsimodel.html">Gensim</a> library, along with several other <em>topic models</em>, which similarly attempt to use the distributional hypothesis to characterise documents.</p>
<p>While LSI can account for different words having similar meanings, it is still a bag of words model and cannot account for the same word having different meanings dependent on context. In my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> I attempted to address this issue by building an NLP pipeline that enriched the documents with Named Entity Recognition and Word Sense Disambiguation before applying LSI, but modern transformer models address it by calculating contextual word vectors. It can, however, be seen as the distant ancestor of these models.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Data Reduction2024-01-11T00:00:00+00:002024-01-11T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-11:/data-reduction.html<p>Mapping data to lower dimensions</p><p>Datasets that involve a large number of features suffer from <em>The Curse of Dimensionality</em>, where, as the number of features increases, it becomes harder and harder to use them to define a meaningful <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">measure of distance</a> between the samples. It becomes necessary to map the data into a smaller number of dimensions. To do this, we need to find mathematical relationships between the features that can be used to form a more economical representation of the data.</p>
<p>The most common way of doing this is <em>Principal Component Analysis</em> (PCA), which captures linear relationships between features. This starts by calculating the covariance of the features</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i} (\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p>where <span class="math">\(\vec{x_{i}}\)</span> is a sample, <span class="math">\(\bar{\vec{x}}\)</span> is the mean of the samples, and <span class="math">\(N\)</span> is the number of samples. We then calculate the eigenvalues and eigenvectors of this matrix. Each eigenvalue quantifies how much of the variance of the data the associated eigenvector explains. Hopefully, the eigenvectors with the largest eigenvalues will encode the useful signals in the data, while the smaller ones mainly contain noise, which we can filter out. Therefore, if we take the eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues (out of the original <span class="math">\(M\)</span> features), we can use them to form an <span class="math">\(M \times m\)</span> projection matrix <span class="math">\(\mathbf{P}\)</span>. We can then project the data into a lower dimension by calculating
</p>
<div class="math">$$\vec{x_{i}}^{\prime} = (\vec{x_{i}} - \bar{\vec{x}}) \cdot \mathbf{P}$$</div>
<p>We may choose <span class="math">\(m\)</span> by examining a line chart of the eigenvalues in increasing order and looking for an <em>elbow</em> where the slope suddenly increases, or by maximising the amount of variance explained while minimising the number of components retained, as described in <a href="https://PlayfulTechnology.co.uk/how-many-components.html">How Many Components?</a>.</p>
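The whole PCA procedure takes only a few lines with NumPy. A sketch on synthetic data (the array shapes and names are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 3 features that are really driven by 2 latent factors.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.1]])

# Covariance matrix of the centred data.
mean = X.mean(axis=0)
cov = (X - mean).T @ (X - mean) / len(X)

# eigh returns eigenvalues in ascending order, so sort descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]

m = 2
P = eigenvectors[:, order[:m]]   # M x m projection matrix
X_reduced = (X - mean) @ P       # data projected into m dimensions

explained = eigenvalues[order[:m]].sum() / eigenvalues.sum()
print(X_reduced.shape, round(explained, 4))
```

Because the toy data is genuinely rank 2, the two retained components explain essentially all of the variance.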
<p>A related technique, <em>Independent component analysis</em>, seeks to maximise the statistical independence between the projected components rather than the explained variance. This is often used in signal processing.</p>
<p>This works well when the data is dense, and when the classes we want to find in the data are linearly separable. When the data is sparse, we instead use a technique called <em>Singular Value Decomposition</em>. Given an <span class="math">\(N \times M\)</span> matrix <span class="math">\(\mathbf{X}\)</span>, we decompose it into an <span class="math">\(N \times m\)</span> matrix <span class="math">\(\mathbf{U}\)</span>, an <span class="math">\(m \times m\)</span> matrix <span class="math">\(\mathbf{\Sigma}\)</span> and an <span class="math">\(M \times m\)</span> matrix <span class="math">\(\mathbf{V}\)</span> such that</p>
<div class="math">$$\mathbf{X} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>These have the additional properties that <span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are <em>unitary matrices</em>, that is </p>
<div class="math">$$\mathbf{U} \cdot \mathbf{U}^{T} = \mathbf{I}$$</div>
<p> and </p>
<div class="math">$$\mathbf{V} \cdot \mathbf{V}^{T} = \mathbf{I}$$</div>
<p>
The matrix <span class="math">\(\mathbf{\Sigma}\)</span> is zero everywhere except along its leading diagonal. The values along the leading diagonal are known as <em>singular values</em>, and act like the eigenvalues in principal component analysis. For a full singular value decomposition, <span class="math">\(m=M\)</span> and the product of the matrices is exactly equal to <span class="math">\(\mathbf{X}\)</span>, but for data reduction we use truncated singular value decomposition, using only the largest <span class="math">\(m\)</span> singular values.</p>
<p><span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are the left and right singular vectors, and represent the mapping of the datapoints and the features into the lower-dimensional space respectively.</p>
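NumPy's `linalg.svd` gives the full decomposition directly, and truncating it is just a matter of slicing. A small sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))

# Full (thin) SVD: U is 100x20, s holds the 20 singular values, Vt is 20x20.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The singular values come back sorted in decreasing order.
assert np.all(np.diff(s) <= 0)

# Truncated SVD: keep only the m largest singular values.
m = 5
X_approx = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]
print(np.linalg.norm(X - X_approx))  # reconstruction error of the rank-m approximation
```

For sparse matrices, where forming the full decomposition would be wasteful, scikit-learn's TruncatedSVD or `scipy.sparse.linalg.svds` compute only the leading singular values.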
<p>In the case where the classes are not linearly separable, we need to capture non-linear relationships between the features. The simplest way of doing this is <em>kernel PCA</em>. This relies on the fact that there is normally a way to project the data into a higher-dimensional space so that it becomes linearly separable. To illustrate this, consider a set of concentric circles in a plane. If we add the distance from the centre as a third dimension, the circles appear as separate layers.</p>
<p>But wait. Why are we projecting into a higher-dimensional space when we want to reduce the number of dimensions? Well, we don't actually do this. Instead, we define a <em>kernel function</em> <span class="math">\(f(\vec{x},\vec{y})\)</span>, which corresponds to the inner product between the projections of two points <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> into the higher-dimensional space. We then obtain the <span class="math">\(N \times N\)</span> matrix</p>
<div class="math">$$\mathbf{F}_{i,j} = f(\vec{x_{i}},\vec{x_{j}})$$</div>
<p>We then obtain the eigenvalues and eigenvectors of this matrix. The eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues form an <span class="math">\(N \times m\)</span> matrix whose rows correspond to vectors we would obtain if we carried out PCA in the higher-dimensional space. Unfortunately, for a large dataset, this is more computationally intensive than standard PCA.</p>
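The concentric-circles example can be reproduced with scikit-learn's KernelPCA. A sketch (the kernel width `gamma=10` is my own choice for this toy data):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the original plane.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel implicitly maps the points into a space where they are.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# A linear classifier in the kernel PCA space separates the rings well...
print(LogisticRegression().fit(X_kpca, y).score(X_kpca, y))
# ...while the same classifier on the raw coordinates does no better than chance.
print(LogisticRegression().fit(X, y).score(X, y))
```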
<p>There are a number of other techniques for using non-linear relationships in data reduction, collectively known as <em>manifold learning</em>, but this article would get a bit too long if we tried to cover them all. However, one that is of particular interest is <em>t-distributed Stochastic Neighbour Embedding</em> (t-SNE). This tries to map datapoints to a lower dimension so that the statistical distribution of distances between points in the lower dimension is similar to that in the higher dimension. It is sensitive to the local structure of the data, and so is useful for exploratory visualisations.</p>
<p>I used several of these techniques in my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a>. Implementations can be found in Scikit-Learn's <a href="https://scikit-learn.org/stable/modules/decomposition.html">decomposition</a> and <a href="https://scikit-learn.org/stable/modules/manifold.html">manifold</a> modules.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>TF-IDF2024-01-04T00:00:00+00:002024-01-04T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-04:/tf-idf.html<p>Characterising documents by their most important words</p><p>In the post on <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a>, we mentioned that Levenshtein distance is only suitable for comparing short strings. One reason for this, as previously discussed, is computational complexity, but another is that by comparing <em>characters</em>, it says nothing about the <em>meaning</em> of what it compares.</p>
<p>So, what can we do if we want to compare large documents in a meaningful way? One thing we could do is compare word frequencies. Of course, we need to take the overall length of the document into account, so we define the <em>Term Frequency</em></p>
<div class="math">$$\mathrm{TF}_{w} = \frac{n_{w}}{\sum_{i} n_{i}}$$</div>
<p> where <span class="math">\(n_{w}\)</span> is the number of times word <span class="math">\(w\)</span> occurs in the document. Using this, we could compute a Euclidean distance or cosine similarity between two documents. </p>
<p>However, not all words are equally important. If we are talking about <em>an algorithm</em>, we can easily see that the content word <em>algorithm</em> is more important than the function word <em>an</em>. Given a corpus of <span class="math">\(D\)</span> documents, of which <span class="math">\(D_{w}\)</span> contain word <span class="math">\(w\)</span>, we then define the <em>Inverse Document Frequency</em></p>
<div class="math">$$\mathrm{IDF}_{w} = \log \frac{D}{D_{w}+1}$$</div>
<p>Adding 1 to the denominator ensures we never divide by zero. You may wonder why we have to do this, since there will be no words in the corpus that do not occur in any documents. However, if we are continually adding documents to our corpus, it would be a major expense to have to recalculate all the previous documents when one was added that contained new vocabulary. To avoid that, we might want to use a fixed dictionary that is provided in advance. However, if our corpus is fixed, and we know that all words will occur in at least one document, we can use <span class="math">\(D_{w}\)</span> as the denominator.</p>
<p>This measures the ability of a word to discriminate between documents in the corpus. For a document <span class="math">\(d\)</span> and a word <span class="math">\(w\)</span> we can then combine these two measures to define <em>TF-IDF</em> as</p>
<div class="math">$$\mathrm{TFIDF}_{w,d} = \mathrm{TF}_{w,d} \mathrm{IDF}_{w} = \frac{n_{w,d}}{\sum_{i} n_{i,d}} \log \frac{D}{D_{w}+1}$$</div>
<p>
which measures the importance of the word in the document weighted by its importance in the corpus. A word that occurs frequently in a few documents but is absent in many will be important for identifying those documents.</p>
<p>One way we can use TF-IDF is to search a corpus of documents. Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span> we can calculate a score for a document <span class="math">\(d\)</span></p>
<div class="math">$$S_{d} = \sum_{i} \mathrm{TFIDF}_{w_{i},d}$$</div>
<p> and retrieve the documents with the highest scores.</p>
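The scoring scheme above is easy to implement from scratch. A sketch on a toy corpus (the documents and query are invented for illustration, keeping the <span class="math">\(D_{w}+1\)</span> denominator discussed earlier):

```python
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "binary tree search algorithm".split(),
    "sorting algorithm complexity analysis".split(),
]
D = len(corpus)

# Document frequency: in how many documents does each word appear?
doc_freq = {}
for doc in corpus:
    for w in set(doc):
        doc_freq[w] = doc_freq.get(w, 0) + 1

def tfidf(w, doc):
    tf = doc.count(w) / len(doc)
    idf = math.log(D / (doc_freq.get(w, 0) + 1))
    return tf * idf

def score(query, doc):
    return sum(tfidf(w, doc) for w in query)

query = "tree algorithm".split()
scores = [score(query, doc) for doc in corpus]
print(scores.index(max(scores)))  # document 2 mentions both query words
```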
<p>TF-IDF is an example of a <em>bag of words</em> model - one based entirely on word frequencies that takes no account of grammar or context. An implementation (to which I have contributed a bug fix) can be found in the <a href="https://radimrehurek.com/gensim/">Gensim</a> topic modelling library.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Similarity and Distance Metrics2023-12-28T00:00:00+00:002023-12-28T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-28:/similarity-and-distance-metrics.html<p>Methods for comparing data</p><p>Data scientists often need to compare data points. This is necessary for indexing data, for finding clusters in datasets, for detecting outliers and anomalies, for comparing user behaviour in recommendation systems, and for measuring quality of fit when predicting continuous variables. There are various metrics that can be used for this purpose.</p>
<p>One of the most frequently used metrics is <em>Euclidean distance</em>. For two vectors <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span>, this is given by
</p>
<div class="math">$$S = |\vec{x} - \vec{y}| \\
= \sqrt{\sum_{i} (x_{i} - y_{i})^2}$$</div>
<p>
This is analogous to distances in physical space. It is useful when the overall scale of the data is important, and has the property that <em>smaller is better</em>.</p>
<p>When we wish to take the overall scale of the data out of consideration, it is common to use <em>cosine similarity</em> </p>
<div class="math">$$C = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}||\vec{y}|}$$</div>
<p>
This represents the cosine of the angle between the two vectors, measured from the origin. It has a range of -1 to +1 and <em>bigger is better</em>. (If all the components of the vectors are positive, the range is from 0 to 1.) A variation on this is the <em>Pearson correlation</em>
</p>
<div class="math">$$P = \frac{(\vec{x} - \bar{x}) \cdot (\vec{y} - \bar{y})}{|\vec{x} - \bar{x}||\vec{y} - \bar{y}|}$$</div>
<p> where <span class="math">\(\bar{x}\)</span> and <span class="math">\(\bar{y}\)</span> are the means of the components of <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> respectively. This measures the degree to which the components of the two vectors are linearly correlated with each other.</p>
<p>These metrics are all most useful when the ranges of all the components are similar. Otherwise, the effects of the components with the largest ranges will tend to dominate over those with smaller ranges. The usual remedy for this is to scale the components as </p>
<div class="math">$$\vec{x^{\prime}} = \frac{\vec{x} - \bar{\vec{x}}}{\vec{\sigma}}$$</div>
<p> where <span class="math">\(\bar{\vec{x}}\)</span> and <span class="math">\(\vec{\sigma}\)</span> are the mean and standard deviation of the sample respectively. Another possibility is to use the <em>Mahalanobis distance</em>
</p>
<div class="math">$$M = \sqrt{(\vec{x} - \vec{y}) \cdot \mathbf{\Sigma}^{-1} \cdot (\vec{x} -\vec{y})}$$</div>
<p>
where <span class="math">\(\mathbf{\Sigma}\)</span> is the <em>covariance matrix</em>
</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i}(\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p> where <span class="math">\(N\)</span> is the number of samples. This not only scales the variables appropriately, but accounts for dependencies between them. It is, however, more computationally expensive, especially for high-dimensional data.</p>
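Scipy implements all of the metrics discussed so far. A short sketch (the vectors and sample are invented; note that scipy's `cosine` returns a <em>distance</em>, 1 minus the similarity):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Euclidean distance is sensitive to the overall scale of the vectors...
print(distance.euclidean(x, y))      # sqrt(1 + 4 + 9) ≈ 3.742
# ...while cosine ignores it: y is parallel to x, so the distance is 0.
print(distance.cosine(x, y))

# Mahalanobis distance needs the inverse covariance matrix of a sample.
sample = np.random.default_rng(0).normal(size=(500, 3))
VI = np.linalg.inv(np.cov(sample.T))
print(distance.mahalanobis(x, y, VI))
```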
<p>Sometimes we wish to compare data that is not readily described as vectors. Suppose that we wish to compare two users of a social network in terms of which links they have shared. We might consider the links shared by each user as a set of unique items. To compare these sets, we can use the <em>Tanimoto metric</em>
</p>
<div class="math">$$T = \frac{|A \cap B|}{|A \cup B|}$$</div>
<p>that is, the fraction of the links shared by either user that have been shared by both users. This has a range from 0 to 1 and <em>bigger is better</em>.</p>
<p>If we wish to compare two short strings (as for example, in a spellchecking application), the usual method is the <em>Levenshtein distance</em>. This is the number of insertions, deletions or substitutions needed to transform one string into another. If we consider the strings <span class="math">\(X\)</span> and <span class="math">\(Y\)</span> as sequences of characters <span class="math">\(x_{1}x_{2}\ldots x_{m}\)</span> and <span class="math">\(y_{1}y_{2}\ldots y_{n}\)</span> respectively, we can define an <span class="math">\((m+1) \times (n+1)\)</span> matrix <span class="math">\(\mathbf{L}\)</span> as
</p>
<div class="math">$$L_{i,0} = i$$</div>
<p> for <span class="math">\(i\)</span> from 0 to <span class="math">\(m\)</span>
</p>
<div class="math">$$L_{0,j} = j$$</div>
<p> for <span class="math">\(j\)</span> from 0 to <span class="math">\(n\)</span>
</p>
<div class="math">$$L_{i,j} = \min \left(L_{i,j-1} + 1, L_{i-1,j} + 1, L_{i-1,j-1}+\left\{\begin{array}{ll} 0 &amp; \quad \mathrm{if}\ x_{i} = y_{j} \\
1 &amp; \quad \mathrm{if}\ x_{i} \neq y_{j} \end{array} \right.\right)$$</div>
<p>The Levenshtein distance is then <span class="math">\(L_{m,n}\)</span>. While simple to implement and intuitive to understand, this is only really suitable for comparing short strings, as the complexity is <span class="math">\(\mathcal{O}(m \times n)\)</span>.</p>
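<p>The recurrence above translates directly into a dynamic programming table. A minimal sketch:</p>

```python
def levenshtein(x, y):
    """Levenshtein distance via the (m+1) x (n+1) dynamic programming table."""
    m, n = len(x), len(y)
    # L[i][j] = edit distance between x[:i] and y[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        L[i][0] = i          # delete all i characters
    for j in range(n + 1):
        L[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            L[i][j] = min(L[i - 1][j] + 1,        # deletion
                          L[i][j - 1] + 1,        # insertion
                          L[i - 1][j - 1] + cost) # substitution (or match)
    return L[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```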
<p>A wide variety of distance metrics are implemented in <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">Scipy</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropagation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Chain Rule and Backpropogation2023-12-21T00:00:00+00:002023-12-21T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-21:/the-chain-rule-and-backpropogation.html<p>Calculating the gradients of complex functions</p><p>In the article about <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a>, we mentioned that logistic regression and neural networks are fit by minimising a loss function. In order to do this, we need to calculate the gradient of the loss function with respect to the parameters. This tells us how we can adjust the parameters to reduce the loss. </p>
<p>The functions we want to optimise in machine learning problems can usually be expressed as a function of a function. To differentiate such a composition, we use the <em>chain rule</em>
</p>
<div class="math">$$\frac{df(g(x))}{dx} = \frac{df}{dg}\frac{dg}{dx}$$</div>
<p>To illustrate this, let's see how we can use it to differentiate the cross-entropy loss
</p>
<div class="math">$$L = -\ln p_{c}$$</div>
<p> with respect to the weights <span class="math">\(\mathbf{W}\)</span> of a logistic regression model. First we differentiate the loss with respect to the probability of the correct class.
</p>
<div class="math">$$\frac{dL}{dp_{c}} = -\frac{1}{p_{c}}$$</div>
<p>Then we need to differentiate the probability with respect to each of the logits <span class="math">\(q_{i}\)</span>
</p>
<div class="math">$$p_{c} = \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \\
\frac{\partial p_{c}}{\partial q_{i}} = \frac{\delta_{ic} e^{q_{c}} \sum_{j} e^{q_{j}} - e^{q_{c}} e^{q_{i}}}{\left( \sum_{j} e^{q_{j}} \right)^{2}} \\
= \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \frac{\delta_{ic} \sum_{j} e^{q_{j}} - e^{q_{i}}}{\sum_{j} e^{q_{j}}} \\
= p_{c}(\delta_{ic} - p_{i}) $$</div>
<p>
where <span class="math">\(\delta_{ic}\)</span> is the <em>Kronecker delta</em>, which is 1 if <span class="math">\(i=c\)</span> and 0 otherwise.
(As an aside, functions whose derivative can be expressed in terms of their output are commonly used in machine learning, because they make differentiation easier. Such functions are often derived from the exponential function in some way).</p>
<p>Then, we need to differentiate the logits with respect to the weights
</p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec{b} \\
\frac{d \vec{q}}{d \mathbf{W}} = \vec{x}$$</div>
<p>Finally, we can combine these derivatives using the chain rule
</p>
<div class="math">$$\frac{dL}{d\mathbf{W}} = \frac{dL}{dp_{c}}\frac{dp_{c}}{d\vec{q}}\frac{d\vec{q}}{d\mathbf{W}} \\
=-\frac{1}{p_{c}}p_{c}(\delta_{ic}-\vec{p}) \otimes \vec{x} \\
=(\vec{p} - \delta_{ic}) \otimes \vec{x}$$</div>
<p> where <span class="math">\(\otimes\)</span> denotes the outer product.</p>
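<p>We can check this result numerically: compute the analytic gradient <span class="math">\((\vec{p} - \delta_{ic}) \otimes \vec{x}\)</span> and compare it with a finite-difference estimate. The weights, example and class label here are random, purely for illustration:</p>

```python
import numpy as np

def softmax(q):
    q = q - q.max()            # subtract the max for numerical stability
    e = np.exp(q)
    return e / e.sum()

def loss(W, b, x, c):
    """Cross-entropy loss of a single example with correct class c."""
    return -np.log(softmax(W @ x + b)[c])

rng = np.random.default_rng(1)
k, d = 3, 4                     # classes, features
W, b = rng.normal(size=(k, d)), rng.normal(size=k)
x, c = rng.normal(size=d), 2    # one example, correct class 2

# Analytic gradient from the chain rule: (p - one_hot(c)) outer x
p = softmax(W @ x + b)
delta = np.zeros(k)
delta[c] = 1.0
grad = np.outer(p - delta, x)

# Finite-difference check on one weight
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
numeric = (loss(W2, b, x, c) - loss(W, b, x, c)) / eps
print(abs(numeric - grad[0, 1]))   # should be tiny
```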
<p>For a deeper neural network, we use the fact that each layer <span class="math">\(n\)</span> of the network can be treated as a function </p>
<div class="math">$$\vec{x}_{n+1} = f_{n}(\mathbf{W}_{n} \cdot \vec{x}_{n} + \vec{b}_{n})$$</div>
<p> and apply the chain rule recursively to calculate the gradient of the loss with respect to each layer's weights and biases. This recursive application of the chain rule is known as <em>backpropagation</em>, and is the basis of most neural network optimisation algorithms.</p>
<p>Of course, very few data scientists ever need to do this themselves on a day-to-day basis, because automatic differentiation and backpropagation are provided by machine learning software libraries, but it's still useful to understand how it works.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
</tr>
</tbody>
</table>
Logistic Regression2023-12-14T00:00:00+00:002023-12-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-14:/logistic-regression.html<p>A simple classification algorithm</p><h2>A simple classification algorithm</h2>
<p>Over the past few weeks, we have been looking at algorithms related to <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>. This week, we are starting on a different tack, but it's still in the realm of relating probabilities to observations. </p>
<p>We start with the <em>logistic function</em>
</p>
<div class="math">$$p = \frac{1}{1+e^{-q}}$$</div>
<p> where <span class="math">\(q\)</span> is a quantity we call a <em>logit</em>. This has the property that as <span class="math">\(q \rightarrow \infty\)</span>, <span class="math">\(p \rightarrow 1\)</span> and as <span class="math">\(q \rightarrow -\infty\)</span>, <span class="math">\(p \rightarrow 0\)</span>, so it can be used to model a probability. If we wish to calculate the probabilities of more than one class, we can generalise this with the <em>softmax function</em>
</p>
<div class="math">$$p_{i} = \frac{e^{q_{i}}}{\sum_{j} e^{q_{j}}}$$</div>
<p> where <span class="math">\(p_{i}\)</span> and <span class="math">\(q_{i}\)</span> represent the probabilities and logits for each class <span class="math">\(i\)</span> respectively.</p>
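<p>The two functions are consistent: for two classes, the softmax probability of class 1 is exactly the logistic function of the difference of the logits. A quick NumPy check (the logit values are arbitrary):</p>

```python
import numpy as np

def logistic(q):
    return 1.0 / (1.0 + np.exp(-q))

def softmax(q):
    e = np.exp(q - np.max(q))   # subtract the max for numerical stability
    return e / e.sum()

q0, q1 = 0.3, 1.7
p = softmax(np.array([q0, q1]))
print(p[1], logistic(q1 - q0))  # the two agree
```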
<p>But what are the logits? In the basic implementation of logistic regression, they are a linear function of some observations. Given a vector <span class="math">\(\vec{x}\)</span> of observations, we may model the logits as </p>
<div class="math">$$q = \vec{w} \cdot \vec{x} + b$$</div>
<p> for the binary case and </p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec {b}$$</div>
<p> in the multiclass case, where <span class="math">\(\vec{w}\)</span> and <span class="math">\(\mathbf{W}\)</span> are <em>weights</em> and <span class="math">\(b\)</span> and <span class="math">\(\vec{b}\)</span> are biases. In terms of Bayes' Theorem,
</p>
<div class="math">$$\vec{b} = \ln P(H)$$</div>
<p> and </p>
<div class="math">$$\mathbf{W} \cdot \vec{x} = \ln P(\vec{x} \mid H)$$</div>
<p>We fit the weights and biases by minimising the <em>cross-entropy loss</em>
</p>
<div class="math">$$L = -\sum_{j} \ln p_{j,c}$$</div>
<p> where <span class="math">\(c\)</span> is the correct class for the example <span class="math">\(j\)</span> in the training dataset. </p>
<p>This works well as a simple classifier under two conditions</p>
<ol>
<li>The classes are fairly evenly balanced</li>
<li>The classes are linearly separable</li>
</ol>
<p>If there is a strong imbalance between the classes, the bias will tend to dominate over the weights, and the rarer classes will never be predicted. To mitigate this, it is possible to undersample the more common classes or oversample the rarer ones before training.</p>
<p>If the classes are not linearly separable, it's necessary to transform the data into a space where they are. This may be done by applying </p>
<div class="math">$$\vec{x^{\prime}} = f(\mathbf{M} \cdot \vec{x})$$</div>
<p> where <span class="math">\(f\)</span> is some non-linear function and <span class="math">\(\mathbf{M}\)</span> is a matrix of weights. We may in fact apply several layers of similar transformations, each with its own set of weight parameters. That is the basis of neural networks.</p>
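<p>Fitting the weights and biases by gradient descent on the cross-entropy loss can be sketched in plain NumPy. The data here are a toy linearly separable problem (class 1 whenever <span class="math">\(x_{0} + x_{1} > 0\)</span>), and the learning rate and iteration count are arbitrary choices for illustration:</p>

```python
import numpy as np

def logistic(q):
    return 1.0 / (1.0 + np.exp(-q))

# Toy separable data: class 1 when x0 + x1 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(500):
    p = logistic(X @ w + b)
    # Gradient of the cross-entropy loss: (p - y) times the inputs
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((logistic(X @ w + b) > 0.5) == y.astype(bool))
print(accuracy)
```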
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropagation</a></td>
</tr>
</tbody>
</table>
Markov Chain Monte Carlo2023-12-07T00:00:00+00:002023-12-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-07:/markov-chain-monte-carlo.html<p>Estimating posterior distributions of continuous variables</p><h2>Estimating the posterior distributions of continuous variables</h2>
<p>In our previous discussions of <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' theorem</a> we have assumed that the probability distributions involved are of discrete variables. However, in many cases we wish to deal with continuous variables. In this case, Bayes' Theorem becomes</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{\int P(H) P(O \mid H) dH}$$</div>
<p>Unfortunately, for many distributions we may be interested in (including the ubiquitous normal distribution), the integral involved is intractable. The problem only gets worse in complex models, especially where distributions may have multiple parameters. Some distributions have a <em>conjugate prior</em>, where the posterior distribution is of the same form as the prior distribution and may be obtained by an appropriate adjustment of parameters, but this is not always the case, and we need a numerical method that is more generally applicable.</p>
<p>The method we use is called <em>Markov Chain Monte Carlo</em> because it uses random samples generated by Markov chains to explore the parameter space of the distribution. There are a number of variations on this, so for the sake of illustration we will select a particular variant, the <em>Metropolis-Hastings algorithm</em>, as the basis of further discussion.</p>
<p>We start with a Markov chain <span class="math">\(P(H^{\prime} \mid H)\)</span> that, given a sample hypothesis <span class="math">\(H\)</span> generates a nearby hypothesis <span class="math">\(H^{\prime}\)</span>. At timestep <span class="math">\(t=0\)</span>, we generate a set of samples <span class="math">\(H_{i,0}\)</span> from the prior distribution. Then at each timestep <span class="math">\(t\)</span>, we generate a set of alternative hypotheses <span class="math">\(H^{\prime}_{i,t}\)</span> from the Markov chain given <span class="math">\(H_{i,t}\)</span>. For each pair of hypotheses, we then calculate an acceptance probability</p>
<div class="math">$$ A(H^{\prime}_{i,t},H_{i,t}) = \min \left( 1, \frac{P(O \mid H^{\prime}_{i,t}) P(H^{\prime}_{i,t}) P(H_{i,t} \mid H^{\prime}_{i,t})}{P(O \mid H_{i,t}) P(H_{i,t}) P(H^{\prime}_{i,t} \mid H_{i,t})} \right) $$</div>
<p>We then generate a set of samples <span class="math">\(S_{i}\)</span> from a uniform distribution between 0 and 1, and update the samples as</p>
<div class="math">$$H_{i,t+1} = \left\{ \begin{array}{ll} H^{\prime}_{i,t} &amp; \quad \textrm{if } S_{i} \leq A(H^{\prime}_{i,t},H_{i,t}) \\ H_{i,t} &amp; \quad \textrm{otherwise} \end{array} \right.$$</div>
<p>Provided that the model and the choice of priors are suitable for the data being modelled, over sufficient steps the distribution of <span class="math">\(H_{i,t}\)</span> will converge to <span class="math">\(P(H \mid O)\)</span>. We can envision this as each sample exploring the nearby regions of the distribution and preferring to move towards regions of higher likelihood.</p>
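<p>As a minimal sketch, here is a single-chain Metropolis-Hastings sampler (rather than the population of samples described above) estimating the posterior mean of a normal distribution with known unit variance. The proposal chain is a symmetric Gaussian step, so its terms cancel in the acceptance ratio; the prior, step size and chain length are illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=100)   # observations with an unknown mean

def log_posterior(mu):
    # N(0, 10^2) prior on mu, unit-variance Gaussian likelihood
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_like = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_like

mu = 0.0                # initial sample
step = 0.5              # scale of the symmetric proposal chain
samples = []
for t in range(5000):
    mu_prime = mu + rng.normal(0.0, step)   # symmetric, so proposal terms cancel
    log_A = log_posterior(mu_prime) - log_posterior(mu)
    if np.log(rng.uniform()) <= log_A:      # accept with probability min(1, A)
        mu = mu_prime
    samples.append(mu)

posterior = np.array(samples[1000:])        # discard burn-in
print(posterior.mean())                     # close to the sample mean of the data
```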
<p>Markov Chain Monte Carlo is implemented in the <a href="https://www.pymc.io/">PyMC</a> library, which provides a comprehensive toolkit for probabilistic modelling. </p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
</tr>
</tbody>
</table>
The Viterbi Algorithm2023-11-30T00:00:00+00:002023-11-30T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-30:/the-viterbi-algorithm.html<p>Finding the hidden states that generated a sequence</p><h2>Finding the Hidden States that generated a sequence</h2>
<p>Suppose we have a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span> generated by a <a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Model</a>. One thing we may wish to do is infer the maximum likelihood sequence of hidden states <span class="math">\(S_{0},S_{1}...S_{t}\)</span> that gave rise to it. A useful technique for this is the <em>Viterbi Algorithm</em>.</p>
<p>The Viterbi algorithm represents the possible paths through the sequence of hidden states as a graphical model called a <em>trellis</em>. The possible hidden states at each time step are represented by nodes, with edges representing the transitions between them.</p>
<p>At each time step <span class="math">\(t\)</span>, we start by calculating the probabilities of the hidden states given the observation at that time step, <span class="math">\(P(S_{i,t} \mid X_{t})\)</span> and place corresponding nodes on the trellis. We then find the maximum likelihood predecessor for each node</p>
<div class="math">$$\texttt{argmax}_{j} \left( P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) \right)$$</div>
<p>and connect an edge from it to its successor. Any nodes at <span class="math">\(t-1\)</span> that have no outgoing edges are then deleted, along with their incoming edge, and this is repeated at each previous time step until no more nodes can be deleted. Then, working forwards through the trellis from the first step at which nodes were deleted, we recalculate the probabilities at each timeslice as </p>
<div class="math">$$P^{\prime}(S_{i,t}) = \frac{P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}{\sum_{i} P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}$$</div>
<p>where <span class="math">\(S_{i,t}\)</span> are the remaining states at time <span class="math">\(t\)</span> and <span class="math">\(S_{j,t-1}\)</span> is the maximum likelihood predecessor of each state. </p>
<p>At the end of the sequence, we may select the maximum likelihood final state <span class="math">\(\texttt{argmax} P(S_{i,t})\)</span>. The path leading to it is then the maximum likelihood sequence of states given the observations. The Viterbi Algorithm is particularly suitable for real-time applications, as any time step where the number of possible states falls to 1 may be output immediately and removed from the trellis, which in turn reduces memory requirements and computation time.</p>
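<p>A minimal sketch of the algorithm follows, in the standard dynamic-programming formulation with backpointers (rather than the incremental trellis-pruning variant described above). The toy weather model at the bottom is made up for illustration:</p>

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for observations `obs`.
    pi: initial state distribution, A[i, j] = P(state j | state i),
    B[i, k] = P(observation k | state i). Works in log space to avoid underflow."""
    n_states = len(pi)
    T = len(obs)
    logp = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # scores[j, i]: log probability of arriving in state i from predecessor j
        scores = logp[:, None] + np.log(A)
        back[t] = scores.argmax(axis=0)        # maximum likelihood predecessors
        logp = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace the best path backwards from the maximum likelihood final state
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: states {0: rainy, 1: sunny}, observations {0: umbrella, 1: no umbrella}
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1], pi, A, B))
```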
<p>I first encountered the Viterbi algorithm in the context of error-correcting codes for digital television. The sequence of bits to be transmitted in a digital TV signal can be protected against errors by interspersing it with extra bits derived from a <em>convolutional code</em> - this is a binary function of a number of previous bits. This converts the transmitted sequence from an apparently random sequence (due to data compression) to a Markov process. At the receiving side, we treat the received bitstream (which inevitably contains errors) as the observations and the transmitted bitstream as the hidden states, using the Viterbi algorithm to recover it.</p>
<p>I later used the Viterbi Algorithm for <a href="https://PlayfulTechnology.co.uk/true-212.html">Word Sense Disambiguation</a>. In this application, the observations were words and the hidden states were <a href="https://wordnet.princeton.edu/">WordNet</a> word senses. There were a few complications to take into account - function words, out-of-vocabulary words, multi-word expressions, proper names - but it achieved 70% accuracy, which was described to me as "state of the art".</p>
<p>It's this flexibility and applicability to a range of different problems that makes the Viterbi Algorithm one of my favourite algorithms.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
</tr>
</tbody>
</table>
Hidden Markov Models2023-11-23T00:00:00+00:002023-11-23T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-23:/hidden-markov-models.html<p>Using Bayes' Theorem to analyse sequences</p><h2>Using Bayes' Theorem to analyse sequences</h2>
<p>Suppose we wish to analyse a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span>. This can be modelled using <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a> as a <em>Markov process</em> <span class="math">\(P(X_{t} \mid X_{t-1})\)</span>, <em>ie</em> the probability of each event depends on the previous event in the sequence.</p>
<p>If there are <span class="math">\(N\)</span> possible values that <span class="math">\(X\)</span> can take, the number of transition probabilities between them is <span class="math">\(N^{2}\)</span>. Such a model would quickly become very large and not very informative. We need a way to make the models more tractable.</p>
<p>To do this, we assume that the probability of each event can be described in terms of a hidden state, <span class="math">\(S\)</span>, as <span class="math">\(P(X_{t} \mid S_{t})\)</span>. The states can then be modelled by a Markov process, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>. This is known as a <em>Hidden Markov Model</em>, since it models a sequence of hidden states with a Markov process. The number of hidden states can be considerably smaller than the number of possible events, and the states can group events into meaningful categories. The model consists of three distributions, the initial state distribution, <span class="math">\(P(S_{0})\)</span>, the transition probability distribution, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>, and the conditional distribution of the events <span class="math">\(P(X \mid S)\)</span>. </p>
<p>Starting from the initial state distribution <span class="math">\(P(S_{0})\)</span>, we can calculate the posterior distributions of the hidden states at each step <span class="math">\(t\)</span> of a sequence by the following method.</p>
<ol>
<li>Calculate the posterior distribution of the hidden state given the observed event <span class="math">\(X_{t}\)</span> using Bayes' Theorem
<div class="math">$$P(S_{t} \mid X_{t}) = \frac{P(S_{t}) P(X_{t} \mid S_{t})}{P(X_{t})}$$</div>
</li>
<li>Calculate the prior probability of the next state
<div class="math">$$P(S_{t+1}) = \sum_{S_{t}} P(S_{t+1} \mid S_{t}) P(S_{t} \mid X_{t})$$</div>
</li>
</ol>
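<p>The two-step recursion above can be sketched in a few lines of code. This is a minimal illustration, not production code: the two-state, two-event model and all the probability values below are hypothetical numbers of my own, chosen only to make the recursion concrete.</p>

```python
import numpy as np

# Hypothetical two-state, two-event HMM; all numbers are illustrative.
initial = np.array([0.6, 0.4])        # P(S_0)
transition = np.array([[0.7, 0.3],    # P(S_{t+1} | S_t); row = current state
                       [0.2, 0.8]])
emission = np.array([[0.9, 0.1],      # P(X | S); row = state, column = event
                     [0.3, 0.7]])

def filter_states(observations):
    """Posterior P(S_t | X_t) at each step, via the two-step recursion."""
    prior = initial
    posteriors = []
    for x in observations:
        # Step 1: Bayes' Theorem, P(S_t | X_t) proportional to P(S_t) P(X_t | S_t);
        # dividing by the sum normalises, which is equivalent to dividing by P(X_t)
        unnormalised = prior * emission[:, x]
        posterior = unnormalised / unnormalised.sum()
        posteriors.append(posterior)
        # Step 2: prior for the next state,
        # P(S_{t+1}) = sum over s of P(S_{t+1} | S_t = s) P(S_t = s | X_t)
        prior = transition.T @ posterior
    return posteriors
```

<p>Each returned posterior is a full distribution over the hidden states, so the uncertainty in the state assignments is preserved at every step.</p>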
<p>A concrete example is <a href="https://PlayfulTechnology.co.uk/video-part-of-speech-tagging.html">Part of Speech Tagging</a>. In this application, the observed events are words and the hidden states are the parts of speech (noun, verb, adjective etc.). This approach is particularly useful when you want the probability of each part of speech for a given word, rather than a single tag. I used this approach in my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, using my own open source <a href="https://PlayfulTechnology.co.uk/a-hidden-markov-model-library.html">Hidden Markov Model library</a>, which I had created as a learning exercise when I first learnt about HMMs. I was pleased to discover that a colleague on that project had also used the library. However, I no longer maintain it; I've learnt a lot since then, and if I did any more work on it I'd prefer to restart it from scratch.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bayes' Theorem2023-11-16T00:00:00+00:002023-11-16T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-16:/bayes-theorem.html<p>The probability of a hypothesis given observations.</p><h2>Estimating the probability of a hypothesis given observations.</h2>
<p>This is the beginning of what will hopefully be a regular series of articles explaining Key Algorithms in data science.</p>
<p>If you look at <a href="https://www.linkedin.com/in/peterjbleackley">my LinkedIn profile</a>, you'll see that the banner shows the formula </p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(O)}$$</div>
<p>This is a foundational rule for calculating conditional probabilities, known as <em>Bayes' Theorem</em>, after the Reverend Thomas Bayes, who first proposed it. It may be read as <em>the probability of a hypothesis given some observations is equal to the prior probability of the hypothesis multiplied by the probability of the observations given that hypothesis, divided by the probability of the observations</em>. </p>
<p>To illustrate this, consider a family where the father has rhesus-positive blood and the mother has rhesus-negative blood. Rhesus-positive is a dominant trait - the father might have one or two copies of the Rh+ gene, whereas rhesus-negative is recessive - the mother must have two copies of the Rh- gene.</p>
<p>Let <span class="math">\(H\)</span> be the hypothesis that the father has two copies of the Rh+ gene. Without further information, <span class="math">\(P(H) = \frac{1}{2}\)</span> is the best estimate. If the family's first child is rhesus-positive, the probability of this observation is <span class="math">\(P(O \mid H) = 1\)</span> if the father has two copies of the Rh+ gene and <span class="math">\(P(O \mid ¬H) = \frac{1}{2}\)</span> if he has one copy. In general, the overall probability of the observations given a set of mutually exclusive hypotheses <span class="math">\(H_{i}\)</span> is given by
</p>
<div class="math">$$P(O) = \sum_{i} P(H_{i}) P(O \mid H_{i})$$</div>
<p>This follows since the posterior probabilities of all hypotheses must sum to 1. Therefore, we can update the probability of the father having two copies of the Rh+ gene as
</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{1}{2} \times 1}{\frac{1}{2} \times 1 + \frac{1}{2} \times \frac{1}{2}} = \frac{2}{3}$$</div>
<p>If the family's second child is also rhesus-positive, we can further update our estimate with the new information</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{2}{3} \times 1}{\frac{2}{3} \times 1 + \frac{1}{3} \times \frac{1}{2}} = \frac{4}{5}$$</div>
<p>It is easy to see that if we had known both children's blood groups from the outset, and used <span class="math">\(P(O \mid ¬H) = \frac{1}{4}\)</span> we could have got the same result.</p>
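<p>The worked example can be checked numerically with a few lines of code. This is a minimal sketch, and the function name is my own; only the probability values come from the example above.</p>

```python
def bayes_update(prior_h, likelihood_h, likelihood_not_h):
    """Posterior P(H | O) for a binary hypothesis, via Bayes' Theorem."""
    p_o = prior_h * likelihood_h + (1 - prior_h) * likelihood_not_h
    return prior_h * likelihood_h / p_o

# First rhesus-positive child: prior 1/2, P(O|H) = 1, P(O|¬H) = 1/2
after_one = bayes_update(0.5, 1.0, 0.5)        # 2/3
# Second child: the previous posterior becomes the new prior
after_two = bayes_update(after_one, 1.0, 0.5)  # 4/5
# Both children at once, with P(O|¬H) = 1/4, gives the same answer
at_once = bayes_update(0.5, 1.0, 0.25)         # also 4/5
```

<p>Chaining the updates one observation at a time gives exactly the same posterior as processing all the observations together, which is what makes Bayesian updating so convenient for streaming evidence.</p>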
<p>In data science, we often have to estimate the probability of a hypothesis given some evidence, so Bayes' theorem is a useful thing to have in our toolkit. </p>
<p>If we need to take observations of several different variables into account, there are two ways to do it. The first, the <em>Naive Bayes</em> approach, treats all the variables as statistically independent, as we did in the above example. While this has the advantage of simplicity, it is only reliable when the variables are at least approximately independent given the hypothesis.</p>
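<p>Under the Naive Bayes assumption, the per-variable likelihoods simply multiply before the usual normalisation. A minimal sketch for a binary hypothesis (the function name is my own):</p>

```python
from math import prod

def naive_bayes_posterior(prior_h, likelihoods_h, likelihoods_not_h):
    """P(H | O_1..O_n), treating the observations as conditionally independent.

    likelihoods_h[k] is P(O_k | H); likelihoods_not_h[k] is P(O_k | ¬H).
    """
    numerator = prior_h * prod(likelihoods_h)
    evidence = numerator + (1 - prior_h) * prod(likelihoods_not_h)
    return numerator / evidence
```

<p>Applied to the rhesus example, two independent likelihoods of 1 under <span class="math">\(H\)</span> and of 1/2 each under <span class="math">\(¬H\)</span> reproduce the posterior of 4/5.</p>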
<p>For more complex problems, we need to model the dependencies between variables. We do this with a graphical method called a <em>Bayesian Belief Net</em>, where each node on a graph represents a variable, and the links represent dependencies between them. Each node then calculates the probability of the variable it represents in terms of the variables it is dependent on. A simple example can be seen in the Data Science Notebook <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It a Mushroom or Is It a Toadstool?</a>.</p>
<p>For my first AI project, I was asked to choose the best system to implement an automatic diagnostic system. I chose a Bayesian Belief Network on the grounds that it was important for the system to be explainable. Since each node of a Bayesian Belief Network represents a meaningful variable, its results are more explainable than those of a neural network, whose nodes are simply steps in a calculation. More recently, I used Bayesian models in a project to predict the optimum settings for machine tools, so Bayes' Theorem has followed me throughout my data science career.</p>
<table>
<thead>
<tr>
<th></th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>