<h1>Playful Technology Limited - Key Algorithms</h1>
<h2>Priority Queues</h2>
<p>2024-04-18, Dr Peter J Bleackley</p>
<p>Efficiently iterating over items in order</p><p>While researching last week's article on <a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a>, I found that two methods for constructing ball trees and the algorithm for querying ANNOY both involved <em>Priority Queues</em>. Since these are an important component of a number of different algorithms, it is worth examining them in detail.</p>
<p>Suppose we want to iterate over a set of items in a particular order. The naive way of doing this is to sort the list of items and then iterate over them. However, sorting is an expensive operation for large datasets, and we may want to add further items to the list while still iterating, which would necessitate re-sorting the list each time. We therefore need a more efficient way of tackling this.</p>
<p>Priority Queues address this by storing the data in a partially ordered data structure whose elements can be reordered efficiently when items are added or removed. Most implementations use a <em>heap</em>, which is a list of items with the following properties.</p>
<ol>
<li>The item at index <span class="math">\(i\)</span> is the parent of the items at indices <span class="math">\(2i+1\)</span> and <span class="math">\(2i+2\)</span></li>
<li>The parent is less than or equal to each of its children.</li>
</ol>
<p>These properties can be efficiently maintained by the following operations.</p>
<dl>
<dt><em>Shift Up</em></dt>
<dd>While an item is less than its parent and is not at the root of the heap, swap it with its parent, then check whether it is less than its new parent.</dd>
<dt><em>Shift Down</em></dt>
<dd>While an item is greater than the smaller of its children and is not a leaf, swap it with that child, then check whether it is greater than the smaller of its new children.</dd>
</dl>
<p>(<em>Note</em>: What I'm describing here is a <em>Min Heap</em>, which is used when we want to iterate over our items in ascending order. Most Python implementations of priority queues use this. There are also <em>Max Heaps</em>, which are used to iterate over items in descending order).</p>
<p>To add an item to the heap, we place it at the end, and then Shift Up until it reaches its proper place. When we remove the first item from the heap during iteration, we move the last item of the heap to the first position, and then Shift Down until it reaches its proper place.</p>
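<p>As an illustration, here is a minimal sketch of these operations for a list-based min heap (the function names are my own, not from any particular library):</p>

```python
def shift_up(heap, i):
    """Move the item at index i towards the root while it is less than its parent."""
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] < heap[parent]:
            heap[i], heap[parent] = heap[parent], heap[i]
            i = parent
        else:
            break

def shift_down(heap, i):
    """Move the item at index i towards the leaves while it is greater than its smallest child."""
    n = len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def heap_push(heap, item):
    """Add an item: place it at the end, then Shift Up."""
    heap.append(item)
    shift_up(heap, len(heap) - 1)

def heap_pop(heap):
    """Remove the smallest item: move the last item to the root, then Shift Down."""
    heap[0], heap[-1] = heap[-1], heap[0]
    smallest = heap.pop()
    if heap:
        shift_down(heap, 0)
    return smallest
```

<p>Both operations do at most <span class="math">\(\mathcal{O} (\log N)\)</span> swaps, which is what makes the heap cheaper than re-sorting.</p>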
<p>There are several implementations of priority queues in Python: <a href="https://docs.python.org/3/library/heapq.html">heapq</a> in the standard library; <a href="https://pypi.org/project/HeapDict/">heapdict</a>, which implements a dictionary interface and allows the priority of items to be altered; and <a href="https://docs.python.org/3/library/queue.html#queue.PriorityQueue">PriorityQueue</a> in the standard library's queue module, which is useful for scheduling data items to be processed by workers in a multithreaded application.</p>
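<p>In practice the standard library usually suffices; a small example with heapq (the task names here are invented for illustration):</p>

```python
import heapq

heap = []
for item in [(3, "low priority"), (1, "urgent"), (2, "normal")]:
    heapq.heappush(heap, item)   # Shift Up happens internally

# further items can be added while we are still iterating
heapq.heappush(heap, (0, "emergency"))

while heap:
    priority, task = heapq.heappop(heap)  # Shift Down happens internally
    print(priority, task)
```

<p>Because heapq implements a Min Heap, tuples are popped in ascending order of their first element.</p>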
<p>Prioritising tasks is an important part of many algorithms, so this is a useful tool to be aware of when designing an algorithm.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
<td></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h2>Vector Search Trees</h2>
<p>2024-04-11, Dr Peter J Bleackley</p>
<p>Finding nearest neighbours quickly</p><p>There are many applications where we need to search a dataset for the nearest neighbours of a given point. For a large dataset, comparing the query point to the entire dataset will be too slow, especially if we need to do it frequently. If we store the dataset to be searched in a tree structure, we can improve the efficiency of queries from <span class="math">\(\mathcal{O} (N)\)</span> to <span class="math">\(\mathcal{O} (\log N)\)</span>.</p>
<p>A simple method to construct the search tree is <em>KD Trees</em>. This method iterates over the dimensions of the dataset, partitioning it into hyperrectangular blocks. Each of these blocks is partitioned at the median of the datapoints contained in it along the dimension under consideration. Using the median ensures that the number of points in each partition will be balanced. This allows for rapid construction of the search tree, and rapid searching if the dimensionality of the data is low, but its performance degrades when the number of dimensions in the dataset is large. The documentation for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree">SciPy implementation of KD Tree</a> notes that <em>20 is already too large</em>. Adding new data to the tree after initial construction also runs a high risk of the tree becoming unbalanced.</p>
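<p>To make the construction concrete, here is a toy KD Tree in plain Python (illustrative only, not the SciPy implementation): the tree is built by cycling through the dimensions and splitting at the median, and queried by descending to the leaf containing the query point, backtracking into the other branch only when the splitting plane is closer than the best match found so far.</p>

```python
import math

def build_kdtree(points, depth=0):
    """Recursively partition the points at the median along one dimension."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Depth-first search, pruning branches that cannot contain a closer point."""
    if node is None:
        return best
    dist = math.dist(query, node["point"])
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[0]:   # splitting plane closer than current best: search the far side too
        best = nearest(far, query, best)
    return best
```

<p>The pruning step is where the <span class="math">\(\mathcal{O} (\log N)\)</span> behaviour comes from, and also where it breaks down in high dimensions, since almost every splitting plane ends up within the best distance.</p>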
<p>An alternative that improves performance at higher dimensionalities is <em>Ball Trees</em>. In this, each node represents a ball of centroid <span class="math">\(\vec{C}\)</span> and radius <span class="math">\(r\)</span>. Data is assigned to the nodes in such a way as to minimise the hypervolume of the balls. Several methods for doing this are available, as detailed by Stephen M. Omohundro in <a href="https://ftp.icsi.berkeley.edu/ftp/pub/techreports/1989/tr-89-063.pdf">Five Balltree Construction Algorithms</a>. The one used in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree">Scikit-Learn implementation of Ball Trees</a> is a variation on the KD Tree construction algorithm, where instead of iterating through the dimensions in a fixed order, each node is partitioned along the dimension in which the spread of its datapoints is greatest. Another method is an <em>online insertion algorithm</em>, which is suitable when we want to continually add new data to the search tree. Given a tree, each new node is added in the position that minimises the increase in volume of the nodes that contain it. It is also possible to build a Ball Tree bottom up, with a method based on <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>.</p>
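<p>The top-down construction can be sketched in a few lines (again illustrative, not the Scikit-Learn code): each node records a centroid and the radius needed to enclose its points, and internal nodes split along the dimension of greatest spread.</p>

```python
import math

def build_ball_tree(points, leaf_size=4):
    """Each node stores a centroid and a radius that encloses all of its points."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    radius = max(math.dist(centroid, p) for p in points)
    node = {"centroid": centroid, "radius": radius, "points": None, "children": None}
    if len(points) <= leaf_size:
        node["points"] = points
        return node
    # split along the dimension in which the spread of the datapoints is greatest
    spreads = [max(p[d] for p in points) - min(p[d] for p in points) for d in range(dim)]
    axis = spreads.index(max(spreads))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    node["children"] = (build_ball_tree(points[:mid], leaf_size),
                        build_ball_tree(points[mid:], leaf_size))
    return node
```

<p>A query can then prune any node whose ball lies further from the query point than the best match found so far, using the triangle inequality on the centroid distance and radius.</p>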
<p>Another method for constructing search trees is <em>ANNOY</em> (Approximate Nearest Neighbours Oh Yeah), which was developed by Erik Bernhardsson at Spotify, who needed to search large datasets of high-dimensional vectors as quickly as possible for music recommendations. In this method, the dataset is recursively partitioned by picking two datapoints at random from each existing partition and splitting the partition midway between them. The random construction of the partitions means that it is possible for the nearest neighbour of a point to fall into a different partition. Therefore, an ensemble of trees, similar to a <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a>, is constructed. We can then find a candidate nearest neighbour from each tree and select the best. The randomness of the algorithm makes the matches approximate, rather than exact, but for many applications this doesn't matter.
Here is <a href="https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html">Erik Bernhardsson's own description of ANNOY</a>. There's a <a href="https://pypi.org/project/annoy/1.0.3/">Python implementation of ANNOY</a> on PyPI, and it can be used to search word vectors or document vectors in <a href="https://radimrehurek.com/gensim/similarities/annoy.html">Gensim</a>.</p>
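<p>A simplified sketch of the random-split idea (not Bernhardsson's implementation, which has many optimisations): splitting midway between two randomly chosen points is the same as assigning each point to whichever of the two it is closer to, and an ensemble of such trees is queried for candidates.</p>

```python
import math, random

def build_annoy_tree(points, leaf_size=8, rng=random):
    """Recursively split by proximity to two randomly chosen points."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    a, b = rng.sample(points, 2)
    left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
    if not left or not right:          # degenerate split: fall back to a leaf
        return {"leaf": points}
    return {"a": a, "b": b,
            "left": build_annoy_tree(left, leaf_size, rng),
            "right": build_annoy_tree(right, leaf_size, rng)}

def candidate(tree, query):
    """Follow the splits down to a leaf and return its closest point."""
    while "leaf" not in tree:
        closer_to_a = math.dist(query, tree["a"]) <= math.dist(query, tree["b"])
        tree = tree["left"] if closer_to_a else tree["right"]
    return min(tree["leaf"], key=lambda p: math.dist(query, p))

def approx_nearest(forest, query):
    """Take the best candidate across the ensemble of trees."""
    return min((candidate(t, query) for t in forest), key=lambda p: math.dist(query, p))
```

<p>Each extra tree costs memory and query time but raises the chance that at least one tree keeps the true nearest neighbour in the query's own leaf.</p>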
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/priority-queues.html">Priority Queues</a></td>
</tr>
</tbody>
</table>
<h2>Cross Validation</h2>
<p>2024-04-04, Dr Peter J Bleackley</p>
<p>Ensuring unbiased selection of hyperparameters</p><p>When training a model, standard practice is to hold back part of the dataset for testing. This ensures that we have tested the model's ability to generalise to unseen data.</p>
<p>However, many models have <em>hyperparameters</em>, such as the regularisation penalties used in <a href="https://PlayfulTechnology.co.uk/linear-regression.html">regularised linear models</a>. In order to select the best values for these hyperparameters, it is necessary to try fitting the model with different values of hyperparameters and select the version that gives the best results. However, if we use the same test dataset for hyperparameter selection as we do for overall model testing, there is a risk that the hyperparameters will themselves be overfit to the test dataset.</p>
<p>One solution to this is to further subdivide the dataset into training, validation and test datasets. We use the validation dataset to assess which hyperparameters give the best performance, and then use the test dataset to evaluate how well this model performs on unseen data. Many publicly available datasets come partitioned in this way. However, if we have a limited amount of data to work with, we may find that this approach reduces the training dataset too much.</p>
<p>An alternative to this is <em>Cross Validation</em>. The basic procedure is to make several different partitions of the data into training and validation sets, and to calculate the average of the <a href="https://PlayfulTechnology.co.uk/tag/evaluation.html">evaluation metrics</a> across the different partitions. This, while more computationally expensive than using a single validation partition, gives more robust results, since the choice of hyperparameters will not depend on the results from a single validation partition. Once hyperparameters have been chosen, the data used for validation can then be folded back into the training dataset to train the final model.</p>
<p>Several strategies may be used for making the split. The simplest is the <em>Leave One Out</em> strategy. For a training dataset of size <span class="math">\(N\)</span>, this makes <span class="math">\(N\)</span> partitions into <span class="math">\(N-1\)</span> training examples and 1 validation example. A variation of this is <em>Leave P Out</em>, which makes <span class="math">\(\binom{N}{P}\)</span> partitions of <span class="math">\(N-P\)</span> training examples and <span class="math">\(P\)</span> validation examples. These methods are computationally expensive and have the disadvantage that there is considerable overlap between the partitions, so their results are not independent.</p>
<p>A more commonly used strategy is <em>K-Fold Cross Validation</em>. This divides the data into <span class="math">\(K\)</span> <em>folds</em> of <span class="math">\(\frac{N}{K}\)</span> examples. Each of these in turn is used as the validation partition, with the remaining folds combined to form the training partition. Usually 5 or 10 folds are used. This is more efficient than Leave One Out, and provides greater independence between tests, as each training dataset overlaps by only <span class="math">\(\frac{K-2}{K-1}\)</span> with the others, as opposed to almost complete overlap in Leave One Out. For further statistical rigour (at the expense of greater compute time) <em>Repeated K-Fold Cross Validation</em> performs this several times, with different assignments of examples to folds. </p>
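<p>A sketch of how the folds might be generated by hand (Scikit-Learn's KFold provides this, with shuffling and more, in practice):</p>

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for K-Fold cross validation."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    start = 0
    for fold in range(k):
        # spread any remainder over the first few folds
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop
```

<p>Every example appears in exactly one validation fold, so each datapoint is used for validation once and for training <span class="math">\(K-1\)</span> times.</p>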
<p>If the classes to be predicted are highly unbalanced, there is a risk that some folds may not contain any examples of a particular class, thus skewing the results. <em>Stratified K-Fold Cross Validation</em> addresses this problem by grouping the examples by target class, and then dividing each class equally between the folds. If there are known statistical dependencies in the training examples, <em>Group K-Fold</em> divides the dataset into groups according to some feature which is expected to have important statistical correlations with other variables, and assigns the data to folds group by group, so that the same group is never present in both the training and validation datasets. This ensures that the model will generalise across groups. Group K-Fold relaxes the requirement that folds be of equal size. These two strategies can be combined as <em>Stratified Group K-Fold Cross Validation</em>.</p>
<p>Related to Group K-Fold is the <em>Leave One Group Out</em> strategy, which in effect treats each group as a fold, and the <em>Leave P Groups Out</em> strategy, which, given <span class="math">\(G\)</span> groups, forms <span class="math">\(\binom{G}{P}\)</span> partitions, each containing <span class="math">\(G-P\)</span> groups in the training dataset and <span class="math">\(P\)</span> groups in the test dataset.</p>
<p>Another possible strategy is <em>Shuffle Split Cross Validation</em>. In this, the dataset is repeatedly shuffled, and after each shuffle it is split into a training and a validation dataset. Whereas with K-Fold cross validation and its variants the size of the validation dataset is determined by the number of folds, in Shuffle Split Cross Validation the validation size and the number of splits may be selected independently of each other. Stratification and Grouping may be applied to Shuffle Split as they are to K-Fold.</p>
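<p>Shuffle Split can be sketched in the same style (illustrative code, not the Scikit-Learn API): the validation fraction and the number of splits are free parameters.</p>

```python
import random

def shuffle_split(n, n_splits, validation_fraction, seed=0):
    """Yield (train, validation) index lists from repeated random shuffles."""
    rng = random.Random(seed)
    n_validation = int(n * validation_fraction)
    indices = list(range(n))
    for _ in range(n_splits):
        rng.shuffle(indices)
        yield indices[n_validation:].copy(), indices[:n_validation].copy()
```

<p>Unlike K-Fold, the validation sets of different splits may overlap, which is the price paid for choosing their size freely.</p>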
<p>In my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a> I had to evaluate a large number of candidate models. K-Fold Cross Validation played an essential role in this.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
</tr>
</tbody>
</table>
<h2>Linear Regression</h2>
<p>2024-03-28, Dr Peter J Bleackley</p>
<p>Fitting linear models</p><p>After the discussion of <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a> in the last article, it makes sense to discuss regression models themselves. For many problems, we wish to fit a function of the form</p>
<div class="math">$$y = m x + c$$</div>
<p>or, for multivariate problems</p>
<div class="math">$$\vec{y} = \mathbf{M} \vec{x} + \vec{c}$$</div>
<p>The simplest method for this is <em>Ordinary Least Squares</em>, which chooses the parameters so as to minimise the mean squared error of the model. This has a closed-form solution, but there are disadvantages to using it with multivariate data. Firstly, there is a danger of overfitting, with variables of little importance adding to the complexity of the model, and secondly there is the possibility of dependencies existing between the input variables, introducing redundancy into the model. These issues may be addressed by applying <a href="https://PlayfulTechnology.co.uk/data-reduction.html">principal component analysis</a> to the input data, but this has the disadvantage of making the model less explainable.</p>
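<p>For instance, the univariate fit can be obtained directly with NumPy's least-squares routine (a sketch on synthetic data, assuming NumPy is available):</p>

```python
import numpy as np

# synthetic data drawn from y = 3x + 1.5 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 1.5 + rng.normal(0.0, 0.1, size=100)

# design matrix with a column of ones for the intercept c
A = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(m, c)   # close to the true values 3.0 and 1.5
```

<p>The same call handles the multivariate case by adding further columns to the design matrix.</p>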
<p>There are a number of methods of reducing the complexity of multivariate linear regression models. One of these is <em>Least Angle Regression</em> (LARS). This is a method of fitting the model that minimises the number of components used to predict the outputs. Rather than following the gradient of the loss function, it adjusts the weight corresponding to the input variable that has the strongest correlation with the residuals at each step of the optimisation. When more than one variable has an equally strong correlation with the residuals, their weights are increased together in the joint least squares direction. While LARS identifies the most important variables contributing to the prediction, it does not solve the problem of collinearity between variables and is sensitive to noise.</p>
<p>Other methods for preventing overfitting involve adding a <em>regularisation penalty</em> to the loss function in the optimisation. For <em>Lasso regression</em>, this penalty is the sum of the absolute values of the weights, so the loss function to be optimised is</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} |M_{jk}|$$</div>
<p>
where <span class="math">\(N\)</span> is the number of samples and <span class="math">\(\alpha\)</span> is a hyperparameter.</p>
<p>For <em>Ridge regression</em>, the penalty term is the sum of the squares of the model weights, hence the loss function is </p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} M_{jk}^{2}$$</div>
<p>Lasso regression favours sparse models (that is, those with fewer non-zero weights), whereas ridge regression favours generally small weights.</p>
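<p>Because the ridge penalty is quadratic in the weights, ridge regression retains a closed-form solution; a sketch in NumPy (note that libraries differ in how they scale <span class="math">\(\alpha\)</span>, so the amount of shrinkage here won't match, say, Scikit-Learn's exactly):</p>

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# synthetic data with known weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0.0, 0.1, size=200)

w_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, alpha=100.0)  # a heavy penalty shrinks the weights towards zero
```

<p>Increasing <span class="math">\(\alpha\)</span> shrinks the whole weight vector towards zero; unlike Lasso, it rarely drives individual weights exactly to zero.</p>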
<p>These methods can be combined. <em>Lasso LARS</em> applies the Lasso regularisation penalty to LARS, which reduces LARS's vulnerability to collinearity and noise. In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used this method to fit numerical variables related to the progress of cancer to measures of activity in clusters of proteins. This method was chosen because I wished to assess which protein clusters were strong predictors.</p>
<p><em>ElasticNet</em> combines the Lasso and Ridge regression methods, optimising the loss function</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \left( \rho \sum_{j} \sum_{k} |M_{jk}| + (1 - \rho) \sum_{j} \sum_{k} M_{jk}^{2} \right)$$</div>
<p>where <span class="math">\(\rho\)</span> is another hyperparameter, ranging from 0 to 1, which determines the relative importance of the two regularisation penalties.</p>
<p>These algorithms, and a number of related ones, are implemented in <a href="https://scikit-learn.org/stable/modules/linear_model.html">Scikit-Learn</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
</tr>
</tbody>
</table>
<h2>Evaluation Metrics for Regression</h2>
<p>2024-03-21, Dr Peter J Bleackley</p>
<p>How good is your regression model?</p><p>In the previous article, we looked at <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a>, which are applicable when we are predicting discrete categories. This time, we'll look at how to evaluate models that predict continuous variables.</p>
<p>Suppose, in our test dataset, we have <span class="math">\(N\)</span> data points. We'll designate the predicted values as <span class="math">\(f_{i}\)</span> and the actual values as <span class="math">\(y_{i}\)</span>. One of the most obvious metrics to use is the <em>mean squared error</em></p>
<div class="math">$$\mathrm{MSE} = \frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}$$</div>
<p>This is essentially the variance of the errors. Since the mean squared error is often used as the loss function when fitting a regression model, we can easily compare this metric to the fitting loss to give an indication of how well the model has generalised. However, it can be difficult to interpret, since the scale of the metric is not the same as the original data. We may therefore wish to use the <em>root mean squared error</em></p>
<div class="math">$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}}$$</div>
<p>which is the standard deviation of the errors. However, both these metrics can be sensitive to outliers, because of the squaring of the errors, which effectively gives larger errors higher weight. A metric that is less sensitive to this is the <em>mean absolute error</em></p>
<div class="math">$$\mathrm{MAE} = \frac{\sum_{i} |y_{i} - f_{i}|}{N}$$</div>
<p>This gives the same weight to small errors as to large ones. If we were to choose a constant <span class="math">\(f\)</span> that minimises the mean absolute error, it would correspond to the median of <span class="math">\(y\)</span>.</p>
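<p>The metrics so far are straightforward to compute directly; for example (illustrative helper functions, not any particular library's API):</p>

```python
import math
from statistics import mean

def mse(y, f):
    """Mean squared error."""
    return mean((yi - fi) ** 2 for yi, fi in zip(y, f))

def rmse(y, f):
    """Root mean squared error: the standard deviation of the errors."""
    return math.sqrt(mse(y, f))

def mae(y, f):
    """Mean absolute error."""
    return mean(abs(yi - fi) for yi, fi in zip(y, f))
```

<p>On a small example with one outlying error, MSE and RMSE will be dominated by the outlier while MAE treats it the same as any other error of the same size.</p>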
<p>If we wish to use a metric that is independent of the scale of the data, we can use the <em>mean absolute percentage error</em></p>
<div class="math">$$\mathrm{MAPE} = \frac{1}{N}\sum_{i} \left| \frac{y_{i} - f_{i}}{y_{i}} \right|$$</div>
<p>While this is intuitively easy to understand, it has two disadvantages. One is that it penalises predictions that are too high more heavily than predictions that are too low, and the other is that it can diverge if any of the values of <span class="math">\(y_{i}\)</span> are close to zero. There are a number of approaches to mitigating these disadvantages. The <em>weighted mean absolute percentage error</em></p>
<div class="math">$$\mathrm{wMAPE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i}|y_{i}|}$$</div>
<p>
is robust against divergence, because it scales the errors by the mean absolute value of the true values, rather than the individual true values.</p>
<p>The <em>symmetric mean absolute percentage error</em>
</p>
<div class="math">$$\mathrm{sMAPE} = \frac{100}{N} \sum_{i} \frac{|y_{i} - f_{i}|}{|y_{i}| + |f_{i}|}$$</div>
<p>
is bounded between 0% and 100%. When <span class="math">\(y_{i}\)</span> and <span class="math">\(f_{i}\)</span> are both 0, the datapoint's percentage error is taken to be 0.</p>
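<p>The three percentage-error variants can be sketched as follows (an illustrative sketch; here <code>mape</code> and <code>wmape</code> return fractions while <code>smape</code> returns a percentage, matching the 100-factor in the formula above):</p>

```python
def mape(y, f):
    """Mean absolute percentage error; diverges if any y_i is near zero."""
    return sum(abs((yi - fi) / yi) for yi, fi in zip(y, f)) / len(y)

def wmape(y, f):
    """Weighted MAPE: scales the total error by the total absolute actual value."""
    return sum(abs(yi - fi) for yi, fi in zip(y, f)) / sum(abs(yi) for yi in y)

def smape(y, f):
    """Symmetric MAPE, bounded between 0% and 100%."""
    total = 0.0
    for yi, fi in zip(y, f):
        denom = abs(yi) + abs(fi)
        total += abs(yi - fi) / denom if denom > 0 else 0.0  # 0/0 taken as 0
    return 100.0 * total / len(y)
```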
<p>The <em>mean absolute scaled error</em>
</p>
<div class="math">$$\mathrm{MASE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i} |y_{i} - \bar{y}|}$$</div>
<p>where </p>
<div class="math">$$\bar{y} = \frac{\sum_{i} y_{i}}{N}$$</div>
<p> is the mean of the true values.</p>
<p>This is similar to the weighted mean absolute percentage error, but scales by the sum of the absolute deviations rather than the sum of the absolute values. It gives equal weight to positive and negative errors.</p>
<p>The <em>mean absolute log error</em></p>
<div class="math">$$\mathrm{MALE} = \frac{\sum_{i}|\ln y_{i} - \ln f_{i}|}{N}$$</div>
<p>
gives equal weight to positive and negative errors, but requires the forecasted and actual values to be strictly positive, or it will diverge.</p>
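<p>Both metrics are straightforward to sketch in Python. Note that this <code>mase</code> follows the formula given above (errors scaled by absolute deviations from the mean); some other sources instead scale by the in-sample error of a naive forecast.</p>

```python
import math

def mase(y, f):
    """MASE as defined above: errors scaled by absolute deviations from the mean."""
    ybar = sum(y) / len(y)
    return (sum(abs(yi - fi) for yi, fi in zip(y, f))
            / sum(abs(yi - ybar) for yi in y))

def male(y, f):
    """Mean absolute log error; requires strictly positive y_i and f_i."""
    return sum(abs(math.log(yi) - math.log(fi)) for yi, fi in zip(y, f)) / len(y)
```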
<p>Another important metric is the <em>coefficient of determination</em>, or <em>explained variance</em></p>
<div class="math">$$R^{2} = 1 - \frac{\sum_{i}(y_{i} - f_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^2}$$</div>
<p>This can be seen as bearing a similar relationship to the mean squared error as the mean absolute scaled error bears to the mean absolute error. It is a measure of how successful a model is at predicting the variability of the data. It is less sensitive to outliers than the MSE, because an outlier will increase the denominator as well as the numerator. It is equivalent to the square of the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson correlation coefficient</a> between the actual and predicted values.</p>
<p>All these metrics test primarily for random errors. If we wish to test for systematic errors we can use the <em>mean signed difference</em></p>
<div class="math">$$\mathrm{MSD} = \frac{\sum_{i} (y_{i} - f_{i})}{N}$$</div>
<p>which indicates the magnitude and direction of any likely bias in the model.</p>
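<p>The last two metrics can also be sketched briefly (illustrative function names; with this sign convention a positive mean signed difference means the model under-predicts on average):</p>

```python
def r_squared(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def mean_signed_difference(y, f):
    """Positive: model under-predicts on average; negative: over-predicts."""
    return sum(yi - fi for yi, fi in zip(y, f)) / len(y)
```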
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Evaluation Metrics for Classifiers2024-03-14T00:00:00+00:002024-03-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-14:/evaluation-metrics-for-classifiers.html<p>How good is your classifier model?</p><p>One vitally important task in any data science project is to assess how well the model performs. Various metrics are available for doing this, and each has its own advantages and disadvantages.
This is a large topic, so we will separate it into metrics suitable for classifiers (this article) and those suitable for regression (next article).</p>
<p>A detailed description of the performance of a classifier model is given by the <em>Confusion Matrix</em> <span class="math">\(\mathbf{C}\)</span>, where <span class="math">\(C_{ij}\)</span> is the number of instances of class <span class="math">\(i\)</span> that are predicted to belong to class <span class="math">\(j\)</span>. This is useful for visualising the performance of the classifier, and the metrics discussed below can be calculated from it.</p>
<p>Consider a binary classification problem. We may classify the results in our test dataset as True Positives, True Negatives, False Positives and False Negatives. The number of each of these is denoted <span class="math">\(\mathrm{TP} = C_{1,1}\)</span>, <span class="math">\(\mathrm{TN} = C_{0,0}\)</span>, <span class="math">\(\mathrm{FP} = C_{0,1}\)</span> and <span class="math">\(\mathrm{FN} = C_{1,0}\)</span> respectively.</p>
<p>The <em>Precision</em> of the classifier is the probability that an item predicted to be true is actually true. This is given by
</p>
<div class="math">$$ \mathrm{Pr} = \frac{\mathrm{TP}}{\mathrm{TP} +\mathrm{FP}}$$</div>
<p>
In Bayesian terms, if the predicted class is <span class="math">\(p\)</span> and the actual class is <span class="math">\(a\)</span>,
</p>
<div class="math">$$\mathrm{Pr} = P(a=\mathsf{True} \mid p=\mathsf{True})$$</div>
<p>The <em>Recall</em> of the classifier is the probability that a true item is predicted to be true. This is given by
</p>
<div class="math">$$\mathrm{R} = P(p=\mathsf{True} \mid a=\mathsf{True}) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$</div>
<p>Which of these is more informative depends on the application. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a>, my initial approach gave 100% Recall. However, since I had designated <em>True</em> to indicate a reliable article and <em>False</em> to indicate fake news, Precision was a more important measure of the model's ability to discriminate fact from fiction.</p>
<p>The F1 score is a metric that seeks to balance Precision and Recall, and is defined as their harmonic mean.</p>
<div class="math">$$F_{1} = \frac{2}{1/\mathrm{Pr} + 1/\mathrm{R}} = \frac{2 \mathrm{Pr} \mathrm{R}}{\mathrm{Pr} + \mathrm{R}} = \frac{2 \mathrm{TP}}{2 \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$</div>
<p>This measures similarity between the set of items predicted to be true and those that actually are true, but is not easy to interpret in terms of a Bayesian probability.</p>
<p>The <em>Accuracy</em> of the model is the probability that it predicts the correct class.
</p>
<div class="math">$$A = P(p=a) = \frac{\mathrm{TP} +\mathrm{TN}}{\mathrm{TP} +\mathrm{TN} + \mathrm{FP} +\mathrm{FN}}$$</div>
<p>
This is intuitive to interpret and, unlike the metrics discussed above, takes the true negatives into account. However, it becomes uninformative if classes are strongly imbalanced. For example, if we wish to predict whether or not a user will click on a given advertisement, we can achieve at least 99% accuracy by predicting <em>No</em> all the time. We therefore need metrics that correct for class imbalance.</p>
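<p>All four binary metrics follow directly from the four confusion-matrix counts. A minimal sketch (the function name is illustrative):</p>

```python
def binary_metrics(tp, tn, fp, fn):
    """Precision, Recall, F1 and Accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp)                    # P(actually true | predicted true)
    recall = tp / (tp + fn)                       # P(predicted true | actually true)
    f1 = 2 * tp / (2 * tp + fp + fn)              # harmonic mean of the two
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # P(correct prediction)
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = binary_metrics(tp=8, tn=85, fp=2, fn=5)
```

<p>Note how the imbalanced example (93 negatives, 13 positives) yields a high accuracy even though precision and recall tell a more nuanced story.</p>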
<p><em>Cohen's Kappa</em> is a measure of how much better a classifier is than guesswork. If we guessed the class of an item without information, our best strategy would be to pick the maximum-likelihood class every time, and this would give us a success rate of <span class="math">\(P_{\mathrm{max}}\)</span>. We can then define
</p>
<div class="math">$$\kappa = 1 - \frac{1 - A}{1 - P_{\mathrm{max}}}$$</div>
<p>The <em>Matthews Correlation Coefficient</em> is the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson Correlation Coefficient</a> between the actual and predicted classes. It is calculated as</p>
<div class="math">$$\phi = \frac{\mathrm{TP} \mathrm{TN} - \mathrm{FP} \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}
$$</div>
<p>Accuracy and Cohen's Kappa can be extended to the multiclass case in the obvious way. It is not trivial to do this for Precision and Recall. However, we can define them on a per-class basis.
</p>
<div class="math">$$\mathrm{Pr}_{i} = \frac{C_{ii}}{\sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R}_{i} = \frac{C_{ii}}{\sum_{j} C_{ij}}$$</div>
<p><a href="https://www.evidentlyai.com/classification-metrics/multi-class-metrics">Evidently AI</a> suggests three methods for calculating overall precision and recall scores in a multiclass problem. <em>Macro averaging</em> simply calculates the mean of precision and recall across all classes.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \mathrm{Pr}_{i}}{N}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \mathrm{R}_{i}}{N}$$</div>
<p>where <span class="math">\(N\)</span> is the number of classes.</p>
<p><em>Micro averaging</em> gives an average of precision and recall across all instances.</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<p>These two expressions are equal, since a false negative for one class is a false positive for another. So, while finer grained in one way, micro averaging loses information in another.</p>
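<p>The macro and micro variants can be compared on a small confusion matrix (a sketch; rows index actual classes and columns predicted classes, as above):</p>

```python
def per_class_pr(C):
    """Per-class precision and recall from confusion matrix C (rows = actual)."""
    n = len(C)
    prec = [C[i][i] / sum(C[j][i] for j in range(n)) for i in range(n)]
    rec = [C[i][i] / sum(C[i][j] for j in range(n)) for i in range(n)]
    return prec, rec

def macro_average(scores):
    """Unweighted mean across classes."""
    return sum(scores) / len(scores)

def micro_average(C):
    """Micro-averaged precision and recall coincide: correct / total."""
    n = len(C)
    correct = sum(C[i][i] for i in range(n))
    total = sum(sum(row) for row in C)
    return correct / total
```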
<p>The third possibility is <em>weighted averaging</em>. While macro averaging gives all classes equal weight, weighted averaging considers their overall prevalence in the data.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \left( C_{ii} \sum_{j} C_{ij} \right)}{\sum_{i} \left( \sum_{j} C_{ji} \sum_{k} C_{ik} \right)}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \left(C_{ii} \sum_{j} C_{ij} \right)}{\sum_{i}\left(\sum_{j} C_{ij} \right)^{2}}$$</div>
<p>To generalise the Matthews Correlation Coefficient to multiple classes, we first define the following terms
</p>
<div class="math">$$t_{k} = \sum_{j} C_{kj}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> occurs
</p>
<div class="math">$$p_{k} = \sum_{j} C_{jk}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> is predicted
</p>
<div class="math">$$c = \sum_{k} C_{kk}$$</div>
<p> is the number of correct predictions
</p>
<div class="math">$$s =\sum_{i} \sum_{j} C_{ij}$$</div>
<p> is the total number of samples</p>
<p>We then obtain
</p>
<div class="math">$$\phi = \frac{c s - \vec{t} \cdot \vec{p}}{\sqrt{s^{2} - |\vec{p}|^{2}}\sqrt{s^{2} - |\vec{t}|^{2}}}$$</div>
<p>Once you have the numbers, of course, it's important to dig deeper and understand what the factors influencing your model's performance are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>PageRank2024-03-07T00:00:00+00:002024-03-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-07:/pagerank.html<p>Using the connectivity of networks to rank items</p><p>Early web directories, such as Yahoo!, relied on hand-curated indexes of content. This was, of course, difficult to scale. What was needed was an automatic way of ranking web pages. During his PhD at Stanford University, Larry <em>Page</em> developed an algorithm for <em>ranking</em> the importance of nodes in a network (such as web <em>pages</em>) in terms of their connections, and then went on to found Google to exploit his research.</p>
<p>The <em>PageRank</em> algorithm is based on three assumptions.</p>
<ol>
<li>The more valuable a web page is, the more likely other web pages are to link to it.</li>
<li>Links originating from more valuable pages confer more value on the pages they link to.</li>
<li>Pages that link indiscriminately to many other pages confer less value on those pages than those which link more selectively.</li>
</ol>
<p>Based on these assumptions, it then models a <em>random walk</em> taken through the Internet by a user clicking web links at random. If the user is viewing a web page <span class="math">\(i\)</span> that has <span class="math">\(N_{i}\)</span> outgoing links, they have a probability <span class="math">\(d\)</span> (known as the <em>damping factor</em>, and typically chosen as 0.85) of clicking a link to another page. This link is assumed to be chosen with uniform probability from the page's outgoing links. The PageRank <span class="math">\(P_{i}\)</span> for the page is a measure of how likely the page is to be found by this method.</p>
<p>If <span class="math">\(L_{i}\)</span> is the set of pages that link to <span class="math">\(i\)</span>, the PageRank satisfies the equation</p>
<div class="math">$$P_{i} = d \sum_{j \in L_{i}} \frac{P_{j}}{N_{j}} + (1-d)$$</div>
<p>This is solved iteratively.</p>
<p>If we define a <em>connection matrix</em> <span class="math">\(\mathbf{C}\)</span> such that <span class="math">\(C_{ij}\)</span> is <span class="math">\(1/N_{j}\)</span> if <span class="math">\(j\)</span> connects to <span class="math">\(i\)</span> and 0 otherwise, we can express this as a matrix equation</p>
<div class="math">$$\vec{P} = d \mathbf{C} \cdot \vec{P} + (1-d)$$</div>
<p>We then see that the PageRank is a modified form of the first eigenvector of the connection matrix.</p>
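<p>The iterative solution can be sketched directly from the fixed-point equation above (an illustrative sketch on a toy three-page graph; the function name and graph are hypothetical):</p>

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iteratively solve P_i = d * sum_{j in L_i} P_j / N_j + (1 - d).

    links maps each page to the list of pages it links out to.
    """
    pages = list(links)
    p = {i: 1.0 for i in pages}                # initial guess
    for _ in range(max_iter):
        new = {i: 1.0 - d for i in pages}      # the (1 - d) term
        for j, outs in links.items():
            if not outs:
                continue                       # pages with no outgoing links confer nothing
            share = d * p[j] / len(outs)       # value conferred per outgoing link
            for i in outs:
                new[i] += share
        if max(abs(new[i] - p[i]) for i in pages) < tol:
            return new                         # converged
        p = new
    return p

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

<p>Here "c", which receives links from both other pages, ends up with the highest rank.</p>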
<p>Like <a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a>, PageRank is an example of a <em>collective intelligence</em> algorithm, in that it uses data from the actions of a large number of people to infer its scores.</p>
<p>PageRank is one of the most commercially successful algorithms ever devised, but its uses are not limited to ranking web pages. It can be used to analyse any data that can be modelled as a graph, such as citations in academic papers, patterns of gene activation in cells, or connections in the nervous system. A survey of these uses can be found in <a href="https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf">PageRank Beyond the Web</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Collaborative Filtering2024-02-29T00:00:00+00:002024-02-29T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-29:/collaborative-fitering.html<p>A basic recommendation algorithm</p><p>A problem of interest to a lot of businesses is <em>recommendation</em> - how to predict what their customers are likely to want. One of the simplest approaches to this is <em>Collaborative Filtering</em>, which works by identifying users with similar tastes.</p>
<p>Suppose each user <span class="math">\(i\)</span> has rated a set of items <span class="math">\(R_{i}\)</span>, giving each item <span class="math">\(n\)</span> a score <span class="math">\(S_{i,n}\)</span>. For a second user <span class="math">\(j\)</span>, we can obtain the intersection of their rated items <span class="math">\(R_{i} \cap R_{j}\)</span> and from these compute a weight <span class="math">\(w_{ij}\)</span> using a suitable <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">similarity metric</a> on the scores each user has given to their common items. The most common metrics to use would be the cosine similarity or Pearson correlation. If all scores are positive or zero, cosine similarity will give weights in the range <span class="math">\(0 \le w_{ij} \le 1\)</span>, whereas the Pearson correlation will give weights in the range <span class="math">\(-1 \le w_{ij} \le 1\)</span>, which is potentially more sensitive to polarisation in people's tastes. If two users have no items in common, <span class="math">\(w_{ij} = 0\)</span>.</p>
<p>For an item <span class="math">\(n\)</span> which user <span class="math">\(i\)</span> has not rated, we may then calculate a predicted rating
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} S_{j,n} w_{ij}}{\sum_{j \mid n \in R_{j}} w_{ij}}$$</div>
<p>This is the weighted mean of other users' ratings for the item, weighted according to the similarity of the users' ratings on other items. Items with a high predicted score for a given user can then be recommended to that user. The name <em>Collaborative Filtering</em> refers to the users collaborating through the algorithm to filter the items according to each other's preferences.</p>
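<p>Given precomputed similarity weights, the prediction formula is a short function (a sketch; the data structures, names and toy ratings here are illustrative, and computing the weights themselves is left to a similarity metric of your choice):</p>

```python
def predict_rating(scores, weights, user, item):
    """Weighted mean of other users' scores for an item, per the formula above.

    scores: {user: {item: score}}; weights: {(user, other): similarity weight}.
    """
    num = den = 0.0
    for other, rated in scores.items():
        if other == user or item not in rated:
            continue                            # only users who rated the item count
        w = weights.get((user, other), 0.0)     # 0 if no items in common
        num += rated[item] * w
        den += w
    return num / den if den else None           # None: no similar user rated it

scores = {"alice": {"x": 5}, "bob": {"x": 4, "y": 3}, "carol": {"x": 1, "y": 5}}
weights = {("alice", "bob"): 0.9, ("alice", "carol"): 0.1}
pred = predict_rating(scores, weights, "alice", "y")
```

<p>Since alice's tastes resemble bob's far more than carol's, the prediction lands close to bob's rating.</p>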
<p>So far we have assumed that users have rated the items with a numerical score. However, in many applications, we only have a binary choice - for example, whether users have purchased an item, or shared a link. In this case, we can use the Tanimoto metric
</p>
<div class="math">$$w_{ij} = \frac{|R_{i} \cap R_{j}|}{|R_{i} \cup R_{j}|}$$</div>
<p> as the weighting between users. The predicted rating for <span class="math">\(n\)</span> then becomes
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} w_{ij}}{|\left\{j \mid n \in R_{j}\right\}|}$$</div>
<p>
that is, the average similarity to the user of users who have chosen the item.</p>
<p>Collaborative filtering and other recommendation algorithms suffer from the <em>bootstrap problem</em>, in that they require a lot of user data to work effectively, but when starting something up, that data is not available. Until a user has rated a significant number of items, it will not be possible to predict accurately what they will like, and until a significant number of people have rated an item, it will not be possible to predict accurately who will like it. As a result, recommendation systems cannot function effectively as a product in their own right, but work best as a feature of a larger product.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>K-Means Clustering2024-02-22T00:00:00+00:002024-02-22T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-22:/k-means-clustering.html<p>Finding clusters by their centroids</p><p>In the previous article, we discussed <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>. Another commonly used method is the <em>K-Means</em> algorithm, which attempts to find <span class="math">\(K\)</span> clusters such that the variance within the clusters is minimised. It does this by the following method.</p>
<ol>
<li>Given an appropriately scaled dataset, choose <span class="math">\(K\)</span> points in the range of the data</li>
<li>Assign each point in the dataset to a cluster associated with the nearest of these points, according to the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Euclidean distance</a></li>
<li>Recalculate the points as the means of the datapoints assigned to their clusters</li>
<li>Repeat from step 2 until the assignments converge</li>
</ol>
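<p>The loop above can be sketched in plain Python (an illustrative sketch using Forgy-style initialisation, i.e. random datapoints as starting centroids; the function name and toy data are hypothetical):</p>

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's K-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # Forgy initialisation
    assignment = None
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid (squared Euclidean)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:         # Step 4: assignments converged
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment

centroids, assignment = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```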
<p>Several different methods may be used to assign the initial centroids. The <em>Random Partition</em> method initially assigns each datapoint to a random cluster and takes the means of those clusters as the starting points. This tends to produce initial centroids close to the centre of the dataset. <em>Forgy's method</em> chooses <span class="math">\(K\)</span> datapoints randomly from the dataset as the initial centroids. This tends to give more widely spaced centroids. A variation of this, the <em>kmeans++</em> method, weights the probability of choosing each datapoint as a centroid by the minimum squared distance of that point from the centroids already chosen. This is the default in <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">Scikit-Learn's implementation of K-means</a>, since it is considered more robust. Because K-means is not guaranteed to converge to the optimal solution and is sensitive to its initial conditions, it is common practice to rerun the clustering several times with different sets of initial centroids and choose the solution with the lowest variance.</p>
<p>Another issue with K-means is how many clusters to choose. This may be done by visualising the data in advance, or with the <em>silhouette score</em>. This is a measure of how much closer a datapoint is to other datapoints in its own cluster than it is to datapoints in other clusters. For a datapoint <span class="math">\(i\)</span> which is a member of cluster <span class="math">\(C_{k}\)</span>, which has <span class="math">\(N_{k}\)</span> datapoints assigned to it, we first calculate the mean distance of <span class="math">\(i\)</span> from the other members of <span class="math">\(C_{k}\)</span></p>
<div class="math">$$a_{i} = \frac{\sum_{j \in C_{k},j \neq i} d(i,j)}{N_{k}-1}$$</div>
<p>where <span class="math">\(d(i,j)\)</span> is the distance between datapoints <span class="math">\(i\)</span> and <span class="math">\(j\)</span></p>
<p>We then find the mean distance between <span class="math">\(i\)</span> and the datapoints in the closest cluster to it other than the one to which it is assigned.</p>
<div class="math">$$b_{i} = \min_{l \neq k} \frac{\sum_{j \in C_{l}} d(i,j)}{N_{l}}$$</div>
<p>The silhouette score for an individual point is then calculated as </p>
<div class="math">$$s_{i} = \frac{b_{i} - a_{i}}{\max(b_{i},a_{i})}$$</div>
<p>This has a range of -1 to 1, where a high value would indicate that a datapoint is central to its cluster and a low value that it is peripheral. We may then calculate the mean of <span class="math">\(s_{i}\)</span> over the dataset. The optimum number of clusters is the one that maximises this score.</p>
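<p>A direct transcription of the three formulas above (a sketch; the function name is illustrative, <code>dist</code> is any distance function, and singleton clusters are scored 0 by convention):</p>

```python
def silhouette_scores(points, assignment, dist):
    """Per-point silhouette values s_i for a given clustering."""
    labels = set(assignment)
    scores = []
    for i, (pt, lab) in enumerate(zip(points, assignment)):
        by_label = {l: [] for l in labels}       # distances from pt, grouped by cluster
        for j, (qt, ql) in enumerate(zip(points, assignment)):
            if j != i:
                by_label[ql].append(dist(pt, qt))
        own = by_label[lab]
        if not own:                              # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                  # mean distance within own cluster
        b = min(sum(ds) / len(ds)                # mean distance to nearest other cluster
                for l, ds in by_label.items() if l != lab and ds)
        scores.append((b - a) / max(a, b))
    return scores
```

<p>Averaging these values over the dataset for each candidate <span class="math">\(K\)</span>, and keeping the <span class="math">\(K\)</span> with the highest mean, implements the selection rule described above.</p>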
<p><em>X-means</em> is a variant of K-means that aims to select the optimum number of clusters automatically. It proceeds as follows.</p>
<ol>
<li>Perform K-means on the dataset with <span class="math">\(K=2\)</span>.</li>
<li>For each cluster, perform K-means again with <span class="math">\(K=2\)</span> for the members of that cluster.</li>
<li>Use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to determine whether this improves the model. Keep subdividing clusters until it does not.</li>
<li>When no further subdivisions are necessary, use the centroids of the clusters thus obtained as the starting point for a final round of K-means clustering on the full dataset.</li>
</ol>
<p>K-means clustering requires the clusters to be linearly separable. If this is not the case, it is necessary to perform <a href="https://PlayfulTechnology.co.uk/data-reduction.html">Kernel PCA</a> to map the dataset into a space where they are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script><p><strong>Hierarchical Clustering</strong> (2024-02-15), Dr Peter J Bleackley</p><p>Clustering data into trees of related items</p><p>When exploring a dataset, it is often useful to identify what groups or <em>clusters</em> of items may exist within the data. This is known as <em>unsupervised</em> learning, since it attempts to learn what classes exist within the data without prior knowledge of what they are, as opposed to <em>supervised learning</em> (classification), which trains a model to identify known classes in the dataset.</p>
<p>A simple method for this is <em>Hierarchical Clustering</em>. This arranges the datapoints in a tree structure by the following method.</p>
<ol>
<li>Assign each data point to a <em>leaf node</em></li>
<li>Calculate the distances between the nodes using an appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a></li>
<li>Create a <em>parent node</em> for the two nodes that are closest to each other, and replace those two <em>daughter nodes</em> with it.</li>
<li>Calculate the distances of the new node to each of the remaining nodes in the dataset</li>
<li>Repeat from step 3 until all the nodes have been merged into a single tree.</li>
</ol>
<p>At stage 4, there are a number of different <em>linkage methods</em> for calculating the new distances. The main ones are</p>
<dl>
<dt>Single linkage</dt>
<dd>use the minimum distance between two points in each cluster</dd>
<dt>Complete (or maximum) linkage</dt>
<dd>use the maximum distance between two points in each cluster</dd>
<dt>Average linkage</dt>
<dd>use the average distance between points in the two clusters. With Euclidean distances, this can be simplified to the distance between the centroids of the clusters</dd>
<dt>Ward linkage</dt>
<dd>calculate distances between clusters recursively with the formula
<div class="math">$$d(u,v) = \sqrt{ \frac{\left(n_{v} + n_{s}\right) d(s,v)^{2} + \left(n_{v} + n_{t}\right) d(t,v)^{2} - n_{v} d(s,t)^{2}}{n_{s} + n_{t} + n_{v}}}$$</div>
where <span class="math">\(u\)</span> is the cluster formed by merging <span class="math">\(s\)</span> and <span class="math">\(t\)</span>, <span class="math">\(v\)</span> is another cluster, and <span class="math">\(n_{c}\)</span> is the number of datapoints in cluster <span class="math">\(c\)</span>. The distance between two leaf nodes is Euclidean. This has the property of minimising the variance of the new cluster.</dd>
</dl>
<p>Ward linkage is the technique most likely to give even cluster sizes, while single linkage is the one most useful when cluster shapes are likely to be irregular.</p>
<p>One problem with Hierarchical Clustering is that, as described above, it does not produce discrete clusters. One way to address this is to choose a number of clusters in advance and terminate the clustering early when that number is reached. The number of clusters may be chosen by visualising the data, either with <a href="https://PlayfulTechnology.co.uk/data-reduction.html">t-SNE</a> or by performing an initial clustering and plotting the tree structure as a dendrogram. Another method is to choose a distance threshold, and not merge clusters further apart than this. It would be necessary to know the statistical distribution of distances between clusters to choose an appropriate threshold. While I have not seen this implemented, it would be theoretically possible to use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to decide when to separate clusters - this approach would be most useful when Ward linkage was used.</p>
<p>In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used Hierarchical Clustering to identify groups of proteins whose activity was related across patients.</p>
<p>Implementations of Hierarchical Clustering can be found in <a href="https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html">Scipy</a> and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html">Scikit-Learn</a>.</p>
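<p>As a short sketch of the procedure using Scipy (synthetic data, Ward linkage):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [3, 3])])

# Z encodes the full merge tree: each row is (node a, node b, distance, size)
Z = linkage(X, method="ward")

# Cut the tree to obtain a fixed number of discrete clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

<p>Passing <code>criterion="distance"</code> instead implements the distance-threshold approach mentioned above.</p>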
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
</tr>
</tbody>
</table>
<p><strong>Outlier Detection</strong> (2024-02-08), Dr Peter J Bleackley</p><p>Finding the Odd One Out</p><p>Many datasets contain <em>outliers</em>, datapoints which do not fit the general pattern of the observations. This may be due to errors in the data collection, in which case removing these datapoints will make models fitted to the data more robust and reduce the risk of overfitting. In other cases, the outliers themselves are the signal we want to detect.</p>
<p>One method for doing this is <em>Isolation Forests</em>. As the name implies, it is related to the <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a> algorithm discussed in the previous article. It fits a forest of (usually around 100) random decision trees to the dataset by the following method.</p>
<ol>
<li>Pick a feature at random</li>
<li>Pick a random threshold in the range of that feature</li>
<li>Partition the data at that threshold</li>
<li>Repeat the process for each partition</li>
</ol>
<p>We can then calculate an anomaly score for each datapoint. This is the depth in the decision tree at which a datapoint becomes isolated from the rest of the dataset. The mean of this score over all the trees gives a robust estimator of how easily a datapoint can be separated from the rest. The advantages of this method are that it makes no assumptions about the underlying distribution of the data, and that it is explainable, in that the features which are most likely to contribute to a datapoint being isolated can be identified from the decision trees.</p>
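<p>A minimal sketch with Scikit-Learn's implementation, using a synthetic dataset with two obvious outliers:</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])  # far from the main cloud
X = np.vstack([inliers, outliers])

# 100 random trees; predict() returns -1 for outliers, +1 for inliers
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
pred = forest.predict(X)
```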
<p>I used Isolation Forests in my work at <a href="https://PlayfulTechnology.co.uk/amey-strategic-consulting.html">Amey Strategic Consulting</a> to identify faulty traffic flow sensors in the Strategic Road Network.</p>
<p>Another method that makes no assumptions about the underlying distribution is <em>Local Outlier Factors</em>. This calculates how different datapoints are from their local neighbourhood. First we calculate the distances <span class="math">\(S_{i,j}\)</span> between datapoints in the sample using some appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a> (this requires all variables to be appropriately scaled) and identify the <span class="math">\(N\)</span> nearest neighbours of each datapoint (usually 20). We then calculate the <em>local density</em> <span class="math">\(D_{i}\)</span> for each datapoint. This is the inverse of the mean of the distances between the point and each of its neighbours.
</p>
<div class="math">$$D_{i} = \frac{N}{\sum_{k} S_{i,k}}$$</div>
<p> where <span class="math">\(k\)</span> ranges over the indices of the point's neighbours. We can then calculate the <em>Local Outlier Factor</em> <span class="math">\(\mathrm{LOF}\)</span> for each datapoint. This is the mean of the ratio between the datapoint's local density and that of each of its neighbours, <em>ie</em>
</p>
<div class="math">$$\mathrm{LOF} = \frac{\sum_{k} \frac{D_{i}}{D_{k}}}{N}$$</div>
<p>Samples whose Local Outlier Factor is below a given threshold (<em>ie</em> those whose local density is lower than that of their neighbours) can be identified as outliers.</p>
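<p>A sketch using Scikit-Learn's implementation (note that its sign convention differs from the formula above: it stores negated factors in <code>negative_outlier_factor_</code>, so more negative values are more anomalous):</p>

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               [[10.0, 10.0]]])  # one isolated point

# 20 neighbours, matching the usual default; fit_predict marks outliers with -1
lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)
```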
<p>If we can assume that the data are drawn from a multivariate Gaussian distribution, we can use an <em>Elliptic Envelope</em> method. For a sample of size <span class="math">\(N\)</span> with <span class="math">\(d\)</span> dimensions, we choose a subsample size <span class="math">\(h\)</span> such that
</p>
<div class="math">$$\left\lfloor \frac{N+d+1}{2} \right\rfloor \leq h < N$$</div>
<p>
We then select a large number of subsamples of size <span class="math">\(h\)</span> from the dataset, and calculate the mean and covariance of each. The one where the covariance has the smallest determinant is the one least likely to contain outliers. Datapoints with a large Mahalanobis distance from the mean of this sample are therefore likely to be outliers.</p>
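<p>Scikit-Learn packages this minimum covariance determinant procedure as <code>EllipticEnvelope</code>; a sketch on synthetic Gaussian data with one planted outlier:</p>

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(5)
inliers = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=150)
X = np.vstack([inliers, [[7.0, -7.0]]])

# Flags points with a large Mahalanobis distance from the robust mean;
# contamination sets the expected fraction of outliers
env = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
pred = env.predict(X)  # -1 marks outliers
```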
<p>Of these methods, I'd expect Isolation Forests to be the one most likely to be useful in the widest variety of circumstances.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
</tr>
</tbody>
</table>
<p><strong>Random Forests</strong> (2024-02-01), Dr Peter J Bleackley</p><p>Classification and regression with ensembles of decision trees.</p><p>We have mentioned classification problems in a number of previous articles, and shown how they can be approached with <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>, <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a> and, by extension, neural networks. This week we'll examine a different method, based on <em>Decision Trees</em>.</p>
<p>A decision tree can be thought of as a set of nested if/else statements. It can be fitted by the following procedure.</p>
<ol>
<li>Find the variable that correlates most strongly with the target variable.</li>
<li>Find the set of thresholds against that variable that comes closest to splitting the data into nodes that correspond to the target classes.</li>
<li>Repeat this for each of the nodes you have split the data into, until each <em>leaf node</em> contains a single class.</li>
</ol>
<p>However, this is prone to <em>overfitting</em>, whereby the model fits every detail of the training data but does not generalise well when classifying new data. In effect, it fits the noise as well as the signal.</p>
<p><em>Random Forests</em> is an algorithm that addresses this problem. As the word forest implies, it fits a large number (typically 100) of decision trees to the training data. Each, however, is trained only on a subset of the training data and with a subset of the variables. These subsets are chosen randomly for each tree in the forest.</p>
<p>While each individual tree in the forest will tend to overfit, the fact that they were all fit against different subsets of the data and variables will mean that the errors they make on new data will not be correlated. Therefore a majority vote of the trees provides a much more robust classifier than any individual tree would. It is also possible to take account of uncertainty in the classification by reporting the number of individual trees that voted for each class - in Bayesian terms, this corresponds to <span class="math">\(P(H \mid O)\)</span>. Algorithms that combine the results of multiple classifiers in this way are known as <em>ensemble methods</em>.</p>
<p>If the target variable is continuous, Random Forests can also be used for regression. In this case, the fitting of the decision trees terminates when the variance of the samples in each leaf node falls below a certain threshold. The prediction is then the mean of the predictions from the individual trees.</p>
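<p>A short sketch with Scikit-Learn on a toy binary problem (the class depends only on the sign of the first feature):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Each of the 100 trees sees a bootstrap sample of the rows and a random
# subset of the features at each split
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba reports the fraction of trees voting for each class
proba = clf.predict_proba([[2.0, 0.0, 0.0, 0.0, 0.0]])[0]
```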
<p>Random Forests tend to give better results than Logistic Regression when the target classes are unbalanced, and the algorithm is noted for having a high success rate in <a href="https://kaggle.com">Kaggle</a> competitions. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a> I found it gave good results in using grammatical features to classify Fake News.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
</tr>
</tbody>
</table>
<p><strong>Information Theory</strong> (2024-01-25), Dr Peter J Bleackley</p><p>How much information does your data contain?</p><p>Data science can be described as turning data into information. However, we need to know how much information there is to find and where to find it. There are various methods we can use to measure this, which derive from the field of <em>Information Theory</em>.</p>
<p>The most basic of these measurements is <em>entropy</em>, which was introduced by Claude Shannon. If a variable has a probability distribution <span class="math">\(p_{i}\)</span>, the entropy of that variable is given by
</p>
<div class="math">$$H = -\sum_{i} p_{i} \log_{2}p_{i}$$</div>
<p>
This is the expected number of binary decisions needed to identify a value of the variable, or, if we were to generate a stream of symbols from that distribution, the average number of bits per symbol that would be needed to encode that stream in an optimal lossless compression.
This is useful for identifying which variables are most important. Entropy has its maximum value of <span class="math">\(\log_{2} N\)</span>, where <span class="math">\(N\)</span> is the number of possible values, when the values are evenly distributed, and its minimum value of 0 when one value is a certainty.</p>
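<p>The entropy formula is a few lines of NumPy, and the two extreme cases above can be checked directly:</p>

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing to the sum
    return -np.sum(p * np.log2(p))

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximum: log2(4) = 2 bits
certain = entropy([1.0, 0.0, 0.0, 0.0])      # minimum: 0 bits
```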
<p>We also need to quantify how much information is contained in the relationship between two variables. Suppose that two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> have individual probability distributions <span class="math">\(p_{i}\)</span> and <span class="math">\(p_{j}\)</span>, and a joint probability distribution <span class="math">\(p_{ij}\)</span>. If the variables are statistically independent, these distributions would satisfy the relationship <span class="math">\(p_{ij} = p_{i} p_{j}\)</span>. <em>Mutual information</em> characterises the deviation from this as
</p>
<div class="math">$$\mathrm{MI}(A,B) = \sum_{i} \sum_{j} p_{ij} \log_{2} \frac{p_{ij}}{p_{i} p_{j}}$$</div>
<p>
This is the amount of information that knowing the value of one variable will tell you about the other. This can be used for feature selection. Consider two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> and a target variable <span class="math">\(T\)</span>. If <span class="math">\(\textrm{MI}(A,T) > \textrm{MI}(B,T)\)</span> and <span class="math">\(\textrm{MI}(A,B) > \textrm{MI}(B,T)\)</span>, it is likely that any relationship between <span class="math">\(B\)</span> and <span class="math">\(T\)</span> is entirely a consequence of their mutual relationship with <span class="math">\(A\)</span>. Therefore, <span class="math">\(B\)</span> can safely be discarded.</p>
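<p>Computed from a joint probability table, the two limiting cases behave as expected: independent variables give zero mutual information, while perfectly correlated binary variables give one bit.</p>

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits from a joint probability table p_ij."""
    joint = np.asarray(joint, dtype=float)
    p_i = joint.sum(axis=1, keepdims=True)  # marginal distribution of A
    p_j = joint.sum(axis=0, keepdims=True)  # marginal distribution of B
    mask = joint > 0                        # skip zero-probability cells
    return np.sum(joint[mask] * np.log2(joint[mask] / (p_i * p_j)[mask]))

independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # 0 bits
identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])        # 1 bit
```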
<p>In <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It A Mushroom or Is It A Toadstool</a> I used mutual information to infer hidden variables when building a <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayesian Belief Network</a>.</p>
<p>There are a number of information-theory-based methods for selecting models. The best known of these, which are closely related to each other, are the <em>Bayesian Information Criterion</em></p>
<div class="math">$$\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$$</div>
<p> and the <em>Akaike Information Criterion</em></p>
<div class="math">$$\mathrm{AIC} = 2 ( k- \ln \hat{L} )$$</div>
<p>where <span class="math">\(k\)</span> is the number of free parameters in the model, <span class="math">\(n\)</span> is the number of data points to which the model is fitted, and <span class="math">\(\hat{L}\)</span> is the likelihood of the data under the optimally fitted model. In both cases, a lower value indicates a better model, favouring models that give a high likelihood of the data and penalising more complex models. The main difference between them is that the Bayesian Information Criterion penalises complexity more heavily, especially for larger datasets.</p>
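<p>Both criteria are direct to compute once the log-likelihood is known. In this sketch (the log-likelihoods are made up, purely for illustration) a ten-parameter model that fits only slightly better than a three-parameter one is penalised under the BIC:</p>

```python
import numpy as np

def bic(k, n, log_likelihood):
    """Bayesian Information Criterion: k ln n - 2 ln L-hat."""
    return k * np.log(n) - 2.0 * log_likelihood

def aic(k, log_likelihood):
    """Akaike Information Criterion: 2(k - ln L-hat)."""
    return 2.0 * (k - log_likelihood)

# n = 1000 datapoints; the complex model gains only 2 nats of log-likelihood
simple_model = bic(k=3, n=1000, log_likelihood=-500.0)
complex_model = bic(k=10, n=1000, log_likelihood=-498.0)
```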
<p>There are many other uses for information theory in data science, but I'd like to finish with one relevant to natural language processing. Marcello Montemurro and Damian Zanette published a paper entitled <a href="https://arxiv.org/abs/0907.1558">Towards the quantification of semantic information in written language</a>, in which they introduced a technique for using the entropy of word frequency distributions across different parts of a document to identify the most significant words, according to the role they play in its structure. I illustrate this in <a href="https://PlayfulTechnology.co.uk/the-entropy-of-alice-in-wonderland.html">The Entropy of Alice in Wonderland</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
</tr>
</tbody>
</table>
<p><strong>Latent Semantic Indexing</strong> (2024-01-18), Dr Peter J Bleackley</p><p>Reducing the dimensionality of language data</p><p>In the article on <a href="https://PlayfulTechnology.co.uk/data-reduction.html">data reduction</a>, we mentioned the <em>curse of dimensionality</em>, whereby large numbers of features make data increasingly difficult to analyse meaningfully. If we take another look at <a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a>, we see that this will generate a feature for each unique word in the corpus that it is trained on, which may be in the tens of thousands. It therefore makes sense to apply a data reduction method and obtain a more compact representation.</p>
<p>TF-IDF, as previously discussed, makes use of the fact that words that occur in some documents but not others are the most useful for distinguishing between the documents. This means that its feature vectors will generally be quite sparse. Therefore, the most appropriate data reduction method to use will be Singular Value Decomposition.</p>
<div class="math">$$\mathbf{TFIDF} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>Typically, around 200 components are retained. The left singular vectors <span class="math">\(\mathbf{U}\)</span> then represent documents in the lower-dimensional space, while the right singular vectors <span class="math">\(\mathbf{V}\)</span> represent words in the same space. Words that tend to appear in the same documents will have similar vector representations, and according to the <em>distributional hypothesis</em>, this gives an implicit representation of their meaning. This implicit representation of meaning gives the technique the name <em>Latent Semantic Analysis</em>.</p>
<p>Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span>, we can calculate a query vector
</p>
<div class="math">$$\vec{q} = \sum_{i}\mathbf{V}_{w_{i}}$$</div>
<p>
We can then search our corpus for the most relevant documents to match the query by calculating a score
</p>
<div class="math">$$S = \mathbf{U} \cdot \vec{q}$$</div>
<p> and selecting the documents with the greatest score. Since it can be used to search the corpus in this way, Latent Semantic Analysis is also known as <em>Latent Semantic Indexing</em>.</p>
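<p>The whole pipeline, truncated SVD of a TF-IDF matrix followed by query scoring, can be sketched in a few lines of NumPy. The toy matrix and the choice of two components below are invented for the example; a real corpus would be far larger and retain many more components.</p>

```python
import numpy as np

# Toy TF-IDF matrix: 4 documents x 5 vocabulary words (invented values).
# Documents 0-1 are about words 0-1; documents 2-3 are about words 2-3.
tfidf = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.7, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.0],
    [0.1, 0.0, 0.9, 0.7, 0.2],
])

# Full SVD, then truncate to m = 2 components.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
m = 2
U, V = U[:, :m], Vt[:m].T   # rows of U are documents, rows of V are words

# Query consisting of words 0 and 1: sum the word vectors...
q = V[0] + V[1]
# ...then score every document against the query vector.
scores = U @ q
best = int(np.argmax(scores))
```

<p>With this block structure, the best-scoring document should be one of the two that actually use the query words, even though the scoring happens entirely in the reduced space.</p>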
<p>An implementation of Latent Semantic Indexing (LSI) can be found in the <a href="https://radimrehurek.com/gensim/models/lsimodel.html">Gensim</a> library, along with several other <em>topic models</em>, which similarly attempt to use the distributional hypothesis to characterise documents.</p>
<p>While LSI can account for different words having similar meanings, it is still a bag of words model and cannot account for the same word having different meanings dependent on context. In my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> I attempted to address this issue by building an NLP pipeline that enriched the documents with Named Entity Recognition and Word Sense Disambiguation before applying LSI, but modern transformer models address it by calculating contextual word vectors. LSI can, however, be seen as a distant ancestor of these models.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
</tr>
</tbody>
</table>
Data Reduction2024-01-11T00:00:00+00:002024-01-11T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-11:/data-reduction.html<p>Mapping data to lower dimensions</p><p>Datasets that involve a large number of features suffer from <em>The Curse of Dimensionality</em>, where, as the number of features increases, it becomes harder and harder to use them to define a meaningful <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">measure of distance</a> between the samples. It becomes necessary to map the data into a smaller number of dimensions. To do this, we need to find mathematical relationships between the features that can be used to form a more economical representation of the data.</p>
<p>The most common way of doing this is <em>Principal Component Analysis</em> (PCA), which captures linear relationships between features. This starts by calculating the covariance of the features</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i} (\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p>where <span class="math">\(\vec{x_{i}}\)</span> is a sample, <span class="math">\(\bar{\vec{x}}\)</span> is the mean of the samples, and <span class="math">\(N\)</span> is the number of samples. We then calculate the eigenvalues and eigenvectors of this matrix. Each eigenvalue quantifies how much of the variance of the data the associated eigenvector explains. Hopefully, the eigenvectors with the largest eigenvalues will encode the useful signals in the data, while those with smaller eigenvalues mainly capture noise, which we can filter out. Therefore, if we take the eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues (out of the original <span class="math">\(M\)</span> features), we can use them to form an <span class="math">\(M \times m\)</span> projection matrix <span class="math">\(\mathbf{P}\)</span>. We can then project the data into a lower dimension by calculating
</p>
<div class="math">$$\vec{x_{i}}^{\prime} = (\vec{x_{i}} - \bar{\vec{x}}) \cdot \mathbf{P}$$</div>
<p>We may choose <span class="math">\(m\)</span> by examining a line chart of the eigenvalues in increasing order and looking for an <em>elbow</em> where the slope suddenly increases, or by maximising the amount of variance explained while minimising the number of components retained, as described in <a href="https://PlayfulTechnology.co.uk/how-many-components.html">How Many Components?</a>.</p>
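<p>The procedure above can be sketched directly in NumPy. The synthetic dataset, in which one feature is almost a linear combination of the other two, is invented for the example; in practice one would normally use Scikit-Learn's <code>PCA</code> class.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 3 features; the third is almost a linear combination of
# the first two, so most of the variance lives in a 2-D subspace.
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=200)
X = np.column_stack([base, third])

# Covariance matrix of the mean-centred features.
mean = X.mean(axis=0)
cov = (X - mean).T @ (X - mean) / len(X)

# Eigendecomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

m = 2                                  # keep the two largest components
P = eigvecs[:, -m:]                    # M x m projection matrix
X_reduced = (X - mean) @ P             # project into the lower dimension

# Fraction of the variance explained by the retained components.
explained = eigvals[-m:].sum() / eigvals.sum()
```

<p>Here the two retained components should account for almost all of the variance, since the discarded direction contains only the small noise term.</p>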
<p>A related technique, <em>Independent component analysis</em>, seeks to maximise the statistical independence between the projected components rather than the explained variance. This is often used in signal processing.</p>
<p>This works well when the data is dense, and when the classes we want to find in the data are linearly separable. When the data is sparse, we instead use a technique called <em>Singular Value Decomposition</em>. Given an <span class="math">\(N \times M\)</span> matrix <span class="math">\(\mathbf{X}\)</span>, we decompose it into an <span class="math">\(N \times m\)</span> matrix <span class="math">\(\mathbf{U}\)</span>, an <span class="math">\(m \times m\)</span> matrix <span class="math">\(\mathbf{\Sigma}\)</span> and an <span class="math">\(M \times m\)</span> matrix <span class="math">\(\mathbf{V}\)</span> such that</p>
<div class="math">$$\mathbf{X} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>These have the additional properties that <span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are <em>unitary matrices</em>, that is </p>
<div class="math">$$\mathbf{U} \cdot \mathbf{U}^{T} = \mathbf{I}$$</div>
<p> and </p>
<div class="math">$$\mathbf{V} \cdot \mathbf{V}^{T} = \mathbf{I}$$</div>
<p>
The matrix <span class="math">\(\mathbf{\Sigma}\)</span> is zero everywhere except along its leading diagonal. The values along the leading diagonal are known as <em>singular values</em>, and act like the eigenvalues in principal component analysis. For a full singular value decomposition, <span class="math">\(m=M\)</span> and the product of the matrices is exactly equal to <span class="math">\(\mathbf{X}\)</span>, but for data reduction we use truncated singular value decomposition, using only the largest <span class="math">\(m\)</span> singular values.</p>
<p><span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are the left and right singular vectors, and represent the mapping of the datapoints and the features into the lower-dimensional space respectively.</p>
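<p>A minimal sketch of truncated SVD with NumPy, on a synthetic matrix that is almost exactly rank 5, might look like the following. (NumPy computes the full decomposition before we truncate; for large sparse matrices, libraries such as Scikit-Learn's <code>TruncatedSVD</code> avoid this.)</p>

```python
import numpy as np

rng = np.random.default_rng(1)
# A 100 x 50 matrix that is exactly rank 5, plus a little noise.
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50))
X = A + 0.001 * rng.normal(size=(100, 50))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 5
U_m, S_m, V_m = U[:, :m], np.diag(s[:m]), Vt[:m].T

# The truncated reconstruction is close to X because X is nearly rank 5.
X_hat = U_m @ S_m @ V_m.T
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

<p>The relative reconstruction error is tiny because the five largest singular values capture essentially all of the structure, and the columns of <code>U_m</code> and <code>V_m</code> remain orthonormal after truncation.</p>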
<p>In the case where the classes are not linearly separable, we need to capture non-linear relationships between the features. The simplest way of doing this is <em>kernel PCA</em>. This relies on the fact that there is normally a way to project the data into a higher-dimensional space so that it becomes linearly separable. To illustrate this, consider a set of concentric circles in a plane. If we add the distance from the centre as a third dimension, the circles appear as separate layers.</p>
<p>But wait. Why are we projecting into a higher-dimensional space when we want to reduce the number of dimensions? Well, we don't actually do this. Instead, we define a <em>kernel function</em> <span class="math">\(f(\vec{x},\vec{y})\)</span>, which corresponds to the inner product between two points <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> in the higher-dimensional space. We then obtain the <span class="math">\(N \times N\)</span> matrix</p>
<div class="math">$$\mathbf{F}_{i,j} = f(\vec{x_{i}},\vec{x_{j}})$$</div>
<p>We then obtain the eigenvalues and eigenvectors of this matrix. The eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues form an <span class="math">\(N \times m\)</span> matrix whose rows correspond to vectors we would obtain if we carried out PCA in the higher-dimensional space. Unfortunately, for a large dataset, this is more computationally intensive than standard PCA.</p>
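<p>The concentric-circles example can be worked through with a Gaussian kernel (the kernel width and radii below are chosen for illustration). Note the extra step of centring the kernel matrix, which corresponds to subtracting the mean in the implicit higher-dimensional space.</p>

```python
import numpy as np

# Two concentric circles in the plane, 50 points each, radii 1 and 3.
theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([circle, 3 * circle])

# Gaussian kernel matrix F[i, j] = exp(-|x_i - x_j|^2 / 4).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
F = np.exp(-sq / 4.0)

# Centre the kernel matrix: subtract the mean in the implicit feature space.
n = len(X)
ones = np.ones((n, n)) / n
Fc = F - ones @ F - F @ ones + ones @ F @ ones

# The leading eigenvector gives the first kernel principal component.
eigvals, eigvecs = np.linalg.eigh(Fc)
pc1 = eigvecs[:, -1] * np.sqrt(eigvals[-1])
```

<p>With this kernel width, the leading component separates the two rings, which no linear projection of the original plane could do.</p>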
<p>There are a number of other techniques for using non-linear relationships in data reduction, collectively known as <em>manifold learning</em>, but this article would get a bit too long if we tried to cover them all. However, one that is of particular interest is <em>t-distributed Stochastic Neighbour Embedding</em> (t-SNE). This tries to map datapoints to a lower dimension so that the statistical distribution of distances between points in the lower dimension is similar to that in the higher dimension. It is sensitive to the local structure of the data, and is therefore useful for exploratory visualisations.</p>
<p>I used several of these techniques in my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a>. Implementations can be found in Scikit-Learn's <a href="https://scikit-learn.org/stable/modules/decomposition.html">decomposition</a> and <a href="https://scikit-learn.org/stable/modules/manifold.html">manifold</a> modules.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
</tr>
</tbody>
</table>
TF-IDF2024-01-04T00:00:00+00:002024-01-04T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-04:/tf-idf.html<p>Characterising documents by their most important words</p><p>In the post on <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a>, we mentioned that Levenshtein distance is only suitable for comparing short strings. One reason for this, as previously discussed, is computational complexity, but another is that by comparing <em>characters</em>, it says nothing about the <em>meaning</em> of what it compares.</p>
<p>So, what can we do if we want to compare large documents in a meaningful way? One thing we could do is compare word frequencies. Of course, we need to take the overall length of the document into account, so we define the <em>Term Frequency</em></p>
<div class="math">$$\mathrm{TF}_{w} = \frac{n_{w}}{\sum_{i} n_{i}}$$</div>
<p> where <span class="math">\(n_{w}\)</span> is the number of times word <span class="math">\(w\)</span> occurs in the document. Using this, we could compute a Euclidean distance or cosine similarity between two documents. </p>
<p>However, not all words are equally important. If we are talking about <em>an algorithm</em>, we can easily see that the content word <em>algorithm</em> is more important than the function word <em>an</em>. Given a corpus of <span class="math">\(D\)</span> documents, of which <span class="math">\(D_{w}\)</span> contain word <span class="math">\(w\)</span>, we then define the <em>Inverse Document Frequency</em></p>
<div class="math">$$\mathrm{IDF}_{w} = \log \frac{D}{D_{w}+1}$$</div>
<p>Adding 1 to the denominator ensures we never divide by zero. You may wonder why we have to do this, since there will be no words in the corpus that do not occur in any documents. However, if we are continually adding documents to our corpus, it would be a major expense to have to recalculate all the previous documents when one was added that contained new vocabulary. To avoid that, we might want to use a fixed dictionary that is provided in advance. However, if our corpus is fixed, and we know that all words will occur in at least one document, we can use <span class="math">\(D_{w}\)</span> as the denominator.</p>
<p>This measures the ability of a word to discriminate between documents in the corpus. For a document <span class="math">\(d\)</span> and a word <span class="math">\(w\)</span> we can then combine these two measures to define <em>TF-IDF</em> as</p>
<div class="math">$$\mathrm{TFIDF}_{w,d} = \mathrm{TF}_{w,d} \mathrm{IDF}_{w} = \frac{n_{w,d}}{\sum_{i} n_{i,d}} \log \frac{D}{D_{w}+1}$$</div>
<p>
which measures the importance of the word in the document weighted by its importance in the corpus. A word that occurs frequently in a few documents but is absent in many will be important for identifying those documents.</p>
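<p>These formulae translate almost directly into Python. A minimal sketch, using the +1 smoothing in the denominator and an invented toy corpus, might look like this:</p>

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return a dict of word -> TF-IDF score for each tokenised document."""
    D = len(corpus)
    # Number of documents containing each word.
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (n / total) * math.log(D / (doc_freq[w] + 1))
                       for w, n in counts.items()})
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the algorithm sorted the heap".split(),
]
scores = tf_idf(corpus)
```

<p>Note that with this form of smoothing, a word such as <em>the</em> that occurs in every document gets a slightly negative weight, reflecting the fact that it carries no discriminating power, while a word confined to one document, such as <em>algorithm</em>, scores highly.</p>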
<p>One way we can use TF-IDF is to search a corpus of documents. Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span> we can calculate a score for a document <span class="math">\(d\)</span></p>
<div class="math">$$S_{d} = \sum_{i} \mathrm{TFIDF}_{w_{i},d}$$</div>
<p> and retrieve the documents with the highest scores.</p>
<p>TF-IDF is an example of a <em>bag of words</em> model - one based entirely on word frequencies that takes no account of grammar or context. An implementation (to which I have contributed a bug fix) can be found in the <a href="https://radimrehurek.com/gensim/">Gensim</a> topic modelling library.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
</tr>
</tbody>
</table>
Similarity and Distance Metrics2023-12-28T00:00:00+00:002023-12-28T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-28:/similarity-and-distance-metrics.html<p>Methods for comparing data</p><p>Data scientists often need to compare data points. This is necessary for indexing data, for finding clusters in datasets, for detecting outliers and anomalies, for comparing user behaviour in recommender systems, and for measuring quality of fit when predicting continuous variables. There are various metrics that can be used for this purpose.</p>
<p>One of the most frequently used metrics is <em>Euclidean distance</em>. For two vectors <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span>, this is given by
</p>
<div class="math">$$S = |\vec{x} - \vec{y}| \\
= \sqrt{\sum_{i} (x_{i} - y_{i})^2}$$</div>
<p>
This is analogous to distances in physical space. It is useful when the overall scale of the data is important, and has the property that <em>smaller is better</em>.</p>
<p>When we wish to take the overall scale of the data out of consideration, it is common to use <em>cosine similarity</em> </p>
<div class="math">$$C = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}||\vec{y}|}$$</div>
<p>
This represents the cosine of the angle between the two vectors, measured from the origin. It has a range of -1 to +1 and <em>bigger is better</em>. (If all the components of the vectors are positive, the range is from 0 to 1.) A variation on this is the <em>Pearson correlation</em>
</p>
<div class="math">$$P = \frac{(\vec{x} - \bar{x}) \cdot (\vec{y} - \bar{y})}{|\vec{x} - \bar{x}||\vec{y} - \bar{y}|}$$</div>
<p> where <span class="math">\(\bar{x}\)</span> and <span class="math">\(\bar{y}\)</span> are the means of the components of <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> respectively. This measures the degree to which the components of the two vectors are linearly correlated with each other.</p>
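<p>These three metrics are short enough to write out in plain Python (in practice one would use NumPy or Scipy, but the direct translation makes the relationships between them clear; note in particular that Pearson correlation is just cosine similarity applied to the mean-centred vectors):</p>

```python
import math

def euclidean(x, y):
    """Euclidean distance: smaller is better."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine similarity: ranges from -1 to +1, bigger is better."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# y is a scaled copy of x, so the angle between them is zero and the
# cosine similarity is exactly 1, even though the Euclidean distance is not 0.
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
```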
<p>These metrics are all most useful when the ranges of all the components are similar. Otherwise, the effects of the components with the largest ranges will tend to dominate over those with smaller ranges. The usual remedy for this is to scale the components as </p>
<div class="math">$$\vec{x^{\prime}} = \frac{\vec{x} - \bar{\vec{x}}}{\vec{\sigma}}$$</div>
<p> where <span class="math">\(\bar{\vec{x}}\)</span> and <span class="math">\(\vec{\sigma}\)</span> are the per-component mean and standard deviation of the sample respectively. Another possibility is to use the <em>Mahalanobis distance</em>
</p>
<div class="math">$$M = \sqrt{(\vec{x} - \vec{y}) \cdot \mathbf{\Sigma}^{-1} \cdot (\vec{x} -\vec{y})}$$</div>
<p>
where <span class="math">\(\mathbf{\Sigma}\)</span> is the <em>covariance matrix</em>
</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i}(\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p> where <span class="math">\(N\)</span> is the number of samples. This not only scales the variables appropriately, but accounts for dependencies between them. It is, however, more computationally expensive, especially for high-dimensional data.</p>
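<p>A small NumPy sketch shows the effect (the correlated dataset is invented for the example): two points at the same Euclidean distance from the mean can be at very different Mahalanobis distances, depending on whether they lie along or across the direction of correlation.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-D data: the second feature is roughly twice the first.
x1 = rng.normal(size=500)
X = np.column_stack([x1, 2 * x1 + 0.5 * rng.normal(size=500)])

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, y):
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Two points at the same Euclidean distance (1.0) from the mean: one along
# the direction of correlation, one perpendicular to it.
along = mean + np.array([1.0, 2.0]) / np.sqrt(5.0)
across = mean + np.array([2.0, -1.0]) / np.sqrt(5.0)
d_along = mahalanobis(along, mean)
d_across = mahalanobis(across, mean)
```

<p>The point along the correlation direction lies in a high-variance direction and so counts as close; the perpendicular point is a much larger number of standard deviations away.</p>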
<p>Sometimes we wish to compare data that is not readily described as vectors. Suppose that we wish to compare two users of a social network in terms of which links they have shared. We might consider the links shared by each user as a set of unique items. To compare these sets, we can use the <em>Tanimoto metric</em>
</p>
<div class="math">$$T = \frac{|A \cap B|}{|A \cup B|}$$</div>
<p>that is, the fraction of the links shared by either user that have been shared by both. This has a range from 0 to 1, and <em>bigger is better</em>.</p>
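<p>With Python's built-in set type this is almost a one-liner (the users and links are invented for the example):</p>

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity: size of the intersection of two sets
    divided by the size of their union."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Links shared by two hypothetical users of a social network.
alice = {"link1", "link2", "link3"}
bob = {"link2", "link3", "link4"}
```

<p>Here the two users have shared two links in common out of four distinct links overall, giving a similarity of 0.5.</p>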
<p>If we wish to compare two short strings (as, for example, in a spellchecking application), the usual method is the <em>Levenshtein distance</em>. This is the number of insertions, deletions or substitutions needed to transform one string into the other. If we consider the strings <span class="math">\(X\)</span> and <span class="math">\(Y\)</span> as sequences of characters <span class="math">\(x_{1}x_{2}\ldots x_{m}\)</span> and <span class="math">\(y_{1}y_{2}\ldots y_{n}\)</span> respectively, we can define an <span class="math">\((m+1) \times (n+1)\)</span> matrix <span class="math">\(\mathbf{L}\)</span> as
</p>
<div class="math">$$L_{i,0} = i$$</div>
<p> for <span class="math">\(i\)</span> from 0 to m
</p>
<div class="math">$$L_{0,j} = j$$</div>
<p> for <span class="math">\(j\)</span> from 0 to n
</p>
<div class="math">$$L_{i,j} = \min \left(L_{i,j-1}+1,\; L_{i-1,j}+1,\; L_{i-1,j-1}+\left\{\begin{array}{ll} 0 & \quad \mathrm{if }\ x_{i} = y_{j} \\
1 & \quad \mathrm{if }\ x_{i} \neq y_{j} \end{array} \right.\right)$$</div>
<p>The Levenshtein distance is then <span class="math">\(L_{m,n}\)</span>. While simple to implement and intuitive to understand, this is only really suitable for comparing short strings, as the complexity is <span class="math">\(\mathcal{O}(m \times n)\)</span>.</p>
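<p>The recurrence above translates into a straightforward dynamic-programming implementation:</p>

```python
def levenshtein(x, y):
    """Edit distance between strings x and y."""
    m, n = len(x), len(y)
    # (m+1) x (n+1) table; L[i][j] = distance between x[:i] and y[:j].
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        L[i][0] = i                     # delete all i characters
    for j in range(n + 1):
        L[0][j] = j                     # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            L[i][j] = min(L[i][j - 1] + 1,        # insertion
                          L[i - 1][j] + 1,        # deletion
                          L[i - 1][j - 1] + cost) # substitution (or match)
    return L[m][n]
```

<p>The classic test case: turning <em>kitten</em> into <em>sitting</em> takes three edits (two substitutions and one insertion).</p>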
<p>A wide variety of distance metrics are implemented in <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">Scipy</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropogation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
</tr>
</tbody>
</table>
The Chain Rule and Backpropogation2023-12-21T00:00:00+00:002023-12-21T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-21:/the-chain-rule-and-backpropogation.html<p>Calculating the gradients of complex functions</p><p>In the article about <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a>, we mentioned that logistic regression and neural networks are fit by minimising a loss function. In order to do this, we need to calculate the gradient of the loss function with respect to the parameters. This tells us how we can adjust the parameters to reduce the loss. </p>
<p>The functions we want to optimise in machine learning problems can usually be expressed as a function of a function. To differentiate such a composition, we use the <em>chain rule</em>
</p>
<div class="math">$$\frac{df(g(x))}{dx} = \frac{df}{dg}\frac{dg}{dx}$$</div>
<p>To illustrate this, let's see how we can use it to differentiate the cross-entropy loss
</p>
<div class="math">$$L = -\ln p_{c}$$</div>
<p> with respect to the weights <span class="math">\(\mathbf{W}\)</span> of a logistic regression model. First we differentiate the loss with respect to the probability of the correct class.
</p>
<div class="math">$$\frac{dL}{dp_{c}} = -\frac{1}{p_{c}}$$</div>
<p>Then we need to differentiate the probability with respect to each of the logits <span class="math">\(q_{i}\)</span>
</p>
<div class="math">$$p_{c} = \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \\
\frac{\partial p_{c}}{\partial q_{i}} = \frac{\delta_{ic} e^{q_{c}} \sum_{j} e^{q_{j}} - e^{q_{c}} e^{q_{i}}}{\left( \sum_{j} e^{q_{j}} \right)^{2}} \\
= \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \frac{\delta_{ic} \sum_{j} e^{q_{j}} - e^{q_{i}}}{\sum_{j} e^{q_{j}}} \\
= p_{c}(\delta_{ic} - p_{i}) $$</div>
<p>
where <span class="math">\(\delta_{ic}\)</span> is the <em>Kronecker delta</em>, which is 1 if <span class="math">\(i=c\)</span> and 0 otherwise.
(As an aside, functions whose derivative can be expressed in terms of their output are commonly used in machine learning, because they make differentiation easier. Such functions are often derived from the exponential function in some way).</p>
<p>Then, we need to differentiate the logits with respect to the weights
</p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec{b} \\
\frac{d \vec{q}}{d \mathbf{W}} = \vec{x}$$</div>
<p>Finally, we can combine these derivatives using the chain rule
</p>
<div class="math">$$\frac{dL}{d\mathbf{W}} = \frac{dL}{dp_{c}}\frac{dp_{c}}{d\vec{q}}\frac{d\vec{q}}{d\mathbf{W}} \\
=-\frac{1}{p_{c}}p_{c}(\vec{\delta}_{c}-\vec{p}) \otimes \vec{x} \\
=(\vec{p} - \vec{\delta}_{c}) \otimes \vec{x}$$</div>
<p> where <span class="math">\(\vec{\delta}_{c}\)</span> is the one-hot vector whose components are <span class="math">\(\delta_{ic}\)</span> and <span class="math">\(\otimes\)</span> denotes the outer product.</p>
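<p>As a sanity check, the gradient we have just derived can be verified numerically. The following sketch (NumPy assumed, all names illustrative) computes the analytic gradient of the cross-entropy loss with respect to the weights and compares it against central finite differences.</p>

```python
import numpy as np

def softmax(q):
    """Numerically stable softmax."""
    e = np.exp(q - q.max())
    return e / e.sum()

def loss(W, b, x, c):
    """Cross-entropy loss -ln p_c of a logistic regression model."""
    p = softmax(W @ x + b)
    return -np.log(p[c])

def grad_W(W, b, x, c):
    """Analytic gradient: (p - one-hot vector for class c) outer x."""
    p = softmax(W @ x + b)
    delta = np.zeros_like(p)
    delta[c] = 1.0
    return np.outer(p - delta, x)

# Compare against central finite differences on each weight.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
c = 1
analytic = grad_W(W, b, x, c)
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp, b, x, c) - loss(Wm, b, x, c)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))  # → True
```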
<p>For a deeper neural network, we use the fact that each layer <span class="math">\(n\)</span> of the network can be treated as a function </p>
<div class="math">$$\vec{x}_{n+1} = f_{n}(\mathbf{W}_{n} \cdot \vec{x}_{n} + \vec{b}_{n})$$</div>
<p> and apply the chain rule recursively to calculate the gradient of the loss with respect to each layer's weights and biases. This recursive application of the chain rule is known as <em>backpropagation</em>, and is the basis of most neural network optimisation algorithms.</p>
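<p>To make the recursion concrete, here is a minimal sketch of one forward and backward pass through a two-layer network with a ReLU activation, under the cross-entropy loss. This is an illustrative example rather than production code, and the function and variable names are invented for the sketch.</p>

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def forward_backward(params, x, c):
    """One forward and backward pass of a two-layer network under
    cross-entropy loss, applying the chain rule layer by layer."""
    W1, b1, W2, b2 = params
    # Forward pass, keeping intermediate values for the backward pass.
    z1 = W1 @ x + b1          # first layer pre-activations
    h = relu(z1)              # first layer activations
    q = W2 @ h + b2           # output logits
    p = softmax(q)
    L = -np.log(p[c])
    # Backward pass: each gradient reuses the one from the layer above.
    delta = np.zeros_like(p)
    delta[c] = 1.0
    dq = p - delta            # dL/dq, as derived above
    dW2 = np.outer(dq, h)
    db2 = dq
    dh = W2.T @ dq            # chain rule through the second layer
    dz1 = dh * (z1 > 0)       # chain rule through the ReLU
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return L, (dW1, db1, dW2, db2)
```

As with the single-layer case, these gradients can be checked against finite differences on the loss.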
<p>Of course, very few data scientists ever need to do this themselves on a day-to-day basis, because automatic differentiation and backpropagation are provided by machine learning software libraries, but it's still useful to understand how they work.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Logistic Regression2023-12-14T00:00:00+00:002023-12-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-14:/logistic-regression.html<p>A simple classification algorithm</p><h2>A simple classification algorithm</h2>
<p>Over the past few weeks, we have been looking at algorithms related to <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>. This week, we are starting on a different tack, but it's still in the realm of relating probabilities to observations. </p>
<p>We start with the <em>logistic function</em>
</p>
<div class="math">$$p = \frac{1}{1+e^{-q}}$$</div>
<p>, where <span class="math">\(q\)</span> is a quantity we call a <em>logit</em>. This has the property that as <span class="math">\(q \rightarrow \infty\)</span>, <span class="math">\(p \rightarrow 1\)</span> and as <span class="math">\(q \rightarrow -\infty\)</span>, <span class="math">\(p \rightarrow 0\)</span>, so it can be used to model a probability. If we wish to calculate the probabilities of more than one class, we can generalise this with the <em>softmax function</em>
</p>
<div class="math">$$p_{i} = \frac{e^{q_{i}}}{\sum_{j} e^{q_{j}}}$$</div>
<p> where <span class="math">\(p_{i}\)</span> and <span class="math">\(q_{i}\)</span> represent the probabilities and logits for each class <span class="math">\(i\)</span> respectively.</p>
<p>But what are the logits? In the basic implementation of logistic regression, they are a linear function of some observations. Given a vector <span class="math">\(\vec{x}\)</span> of observations, we may model the logits as </p>
<div class="math">$$q = \vec{w} \cdot \vec{x} + b$$</div>
<p> for the binary case and </p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec {b}$$</div>
<p> in the multiclass case, where <span class="math">\(\vec{w}\)</span> and <span class="math">\(\mathbf{W}\)</span> are <em>weights</em> and <span class="math">\(b\)</span> and <span class="math">\(\vec{b}\)</span> are <em>biases</em>. In terms of Bayes' Theorem,
</p>
<div class="math">$$\vec{b} = \ln P(H)$$</div>
<p> and </p>
<div class="math">$$\mathbf{W} \cdot \vec{x} = \ln P(\vec{x} \mid H)$$</div>
<p>We fit the weights and biases by minimising the <em>cross-entropy loss</em>
</p>
<div class="math">$$L = -\sum_{j} \ln p_{j,c}$$</div>
<p> where <span class="math">\(c\)</span> is the correct class for the example <span class="math">\(j\)</span> in the training dataset. </p>
<p>This works well as a simple classifier under two conditions</p>
<ol>
<li>The classes are fairly evenly balanced</li>
<li>The classes are linearly separable</li>
</ol>
<p>If there is a strong imbalance between the classes, the bias will tend to dominate over the weights, and the rarer classes will never be predicted. To mitigate this, it is possible to undersample the more common classes or oversample the rarer ones before training.</p>
<p>If the classes are not linearly separable, it's necessary to transform the data into a space where they are. This may be done by applying </p>
<div class="math">$$\vec{x^{\prime}} = f(\mathbf{M} \cdot \vec{x})$$</div>
<p> where <span class="math">\(f\)</span> is some non-linear function and <span class="math">\(\mathbf{M}\)</span> is a matrix of weights. We may in fact apply several layers of similar transformations, each with its own set of weight parameters. That is the basis of neural networks.</p>
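<p>To tie the pieces together, here is a minimal sketch of fitting a binary logistic regression by gradient descent on the cross-entropy loss. The toy dataset and all names are invented for illustration; in practice you would reach for a library such as scikit-learn.</p>

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit a binary logistic regression model by gradient descent on the
    mean cross-entropy loss. X: (n, d) observations, y: (n,) 0/1 labels."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # the logistic function
        # The gradient has the familiar form (p - y) times the inputs:
        # (p - y) plays the role of (p - delta) in the multiclass case.
        err = p - y
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Toy example: one feature, classes split around x = 0.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)  # → [0 0 0 1 1 1]
```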
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropagation</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Markov Chain Monte Carlo2023-12-07T00:00:00+00:002023-12-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-07:/markov-chain-monte-carlo.html<p>Estimating posterior distributions of continuous variables</p><h2>Estimating the posterior distributions of continuous variables</h2>
<p>In our previous discussions of <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' theorem</a> we have assumed that the probability distributions involved are of discrete variables. However, in many cases we wish to deal with continuous variables. In this case, Bayes' Theorem becomes</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{\int P(H) P(O \mid H) dH}$$</div>
<p>Unfortunately, for many distributions we may be interested in (including the ubiquitous normal distribution), the integral involved is intractable. The problem only gets worse in complex models, especially where distributions may have multiple parameters. Some distributions have a <em>conjugate prior</em>, where the posterior distribution is of the same form as the prior distribution and may be obtained by an appropriate adjustment of parameters, but this is not always the case, and we need a numerical method that is more generally applicable.</p>
<p>The method we use is called <em>Markov Chain Monte Carlo</em> because it uses Markov chains to draw random samples that explore the parameter space of the distribution. There are a number of variations of this, so for the sake of illustration, we will select a particular variant, the <em>Metropolis-Hastings algorithm</em>, as the basis of further discussion.</p>
<p>We start with a Markov chain <span class="math">\(P(H^{\prime} \mid H)\)</span> that, given a sample hypothesis <span class="math">\(H\)</span> generates a nearby hypothesis <span class="math">\(H^{\prime}\)</span>. At timestep <span class="math">\(t=0\)</span>, we generate a set of samples <span class="math">\(H_{i,0}\)</span> from the prior distribution. Then at each timestep <span class="math">\(t\)</span>, we generate a set of alternative hypotheses <span class="math">\(H^{\prime}_{i,t}\)</span> from the Markov chain given <span class="math">\(H_{i,t}\)</span>. For each pair of hypotheses, we then calculate an acceptance probability</p>
<div class="math">$$ A(H^{\prime}_{i,t},H_{i,t}) = \min \left( 1, \frac{P(O \mid H^{\prime}_{i,t}) P(H^{\prime}_{i,t}) P(H_{i,t} \mid H^{\prime}_{i,t})}{P(O \mid H_{i,t}) P(H_{i,t}) P(H^{\prime}_{i,t} \mid H_{i,t})} \right) $$</div>
<p>We then generate a set of samples <span class="math">\(S_{i}\)</span> from a uniform distribution between 0 and 1, and update the samples as</p>
<div class="math">$$H_{i,t+1} = \left\{ \begin{array}{ll} H^{\prime}_{i,t} &amp; \quad \textrm{if } S_{i} \leq A(H^{\prime}_{i,t},H_{i,t}) \\ H_{i,t} &amp; \quad \textrm{otherwise} \end{array} \right.$$</div>
<p>Provided that the model and the choice of priors is suitable for the data being modelled, over sufficient steps, the distribution of <span class="math">\(H_{i,t}\)</span> will converge to <span class="math">\(P(H|O)\)</span>. We can envision this as each sample exploring the nearby regions of the distribution and preferring to move towards regions of higher likelihood.</p>
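<p>For illustration, here is a minimal single-chain sketch of the Metropolis-Hastings algorithm (the description above follows a population of samples; this sketch follows one chain), estimating the mean of a normal distribution. It assumes a symmetric Gaussian proposal, so the <span class="math">\(P(H \mid H^{\prime}) / P(H^{\prime} \mid H)\)</span> factor in the acceptance ratio cancels. The toy model and all names are invented for the example.</p>

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_steps, step_sd, rng):
    """Metropolis-Hastings with a symmetric Gaussian proposal, so the
    proposal terms in the acceptance ratio cancel."""
    samples = np.empty(n_steps)
    x = x0
    lp = log_post(x)
    for t in range(n_steps):
        x_new = x + rng.normal(scale=step_sd)   # propose a nearby hypothesis
        lp_new = log_post(x_new)
        # Accept with probability min(1, posterior ratio), in log space.
        if lp_new >= lp or rng.uniform() < np.exp(lp_new - lp):
            x, lp = x_new, lp_new
        samples[t] = x
    return samples

# Toy model: observations ~ Normal(mu, 1), prior mu ~ Normal(0, 10).
rng = np.random.default_rng(42)
obs = rng.normal(loc=3.0, scale=1.0, size=50)

def log_post(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_like = -0.5 * np.sum((obs - mu) ** 2)
    return log_prior + log_like

samples = metropolis_hastings(log_post, x0=0.0, n_steps=5000, step_sd=0.5, rng=rng)
print(round(samples[1000:].mean(), 1))  # close to the sample mean of obs
```

Discarding the first 1000 samples as burn-in lets the chain forget its arbitrary starting point before we summarise the posterior.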
<p>Markov Chain Monte Carlo is implemented in the <a href="https://www.pymc.io/">PyMC</a> library, which provides a comprehensive toolkit for probabilistic modelling. </p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Viterbi Algorithm2023-11-30T00:00:00+00:002023-11-30T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-30:/the-viterbi-algorithm.html<p>Finding the hidden states that generated a sequence</p><h2>Finding the Hidden States that generated a sequence</h2>
<p>Suppose we have a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span> generated by a <a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Model</a>. One thing we may wish to do is infer the maximum likelihood sequence of hidden states <span class="math">\(S_{0},S_{1}...S_{t}\)</span> that gave rise to it. A useful technique for this is the <em>Viterbi Algorithm</em>.</p>
<p>The Viterbi algorithm represents the possible paths through the sequence of hidden states as a graphical model called a <em>trellis</em>. The possible hidden states at each time step are represented by nodes, with edges representing the transitions between them.</p>
<p>At each time step <span class="math">\(t\)</span>, we start by calculating the probabilities of the hidden states given the observation at that time step, <span class="math">\(P(S_{i,t} \mid X_{t})\)</span> and place corresponding nodes on the trellis. We then find the maximum likelihood predecessor for each node</p>
<div class="math">$$\texttt{argmax}_{j} \left( P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) \right)$$</div>
<p>and connect an edge from it to its successor. Any nodes at <span class="math">\(t-1\)</span> that have no outgoing edges are then deleted, along with their incoming edge, and this is repeated at each previous time step until no more nodes can be deleted. Then, working forwards through the trellis from the first step at which nodes were deleted, we recalculate the probabilities at each timeslice as </p>
<div class="math">$$P^{\prime}(S_{i,t}) = \frac{P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}{\sum_{i} P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}$$</div>
<p>where <span class="math">\(S_{i,t}\)</span> are the remaining states at time <span class="math">\(t\)</span> and <span class="math">\(S_{j,t-1}\)</span> is the maximum likelihood predecessor of each state. </p>
<p>At the end of the sequence, we may select the maximum likelihood final state <span class="math">\(\texttt{argmax} P(S_{i,t})\)</span>. The path leading to it is then the maximum likelihood sequence of states given the observations. The Viterbi Algorithm is particularly suitable for real-time applications, as any time step where the number of possible states falls to 1 may be output immediately and removed from the trellis, which in turn reduces memory requirements and computation time.</p>
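<p>A common way to implement this is the standard dynamic-programming formulation, which keeps the most likely path to each state and backtracks from the most likely final state, rather than the incremental trellis pruning described above. A minimal sketch, with a toy weather model invented for illustration:</p>

```python
import numpy as np

def viterbi(init, trans, emit, observations):
    """Max-product Viterbi decoding.
    init:  (S,)   initial state probabilities
    trans: (S, S) trans[i, j] = P(state j at t | state i at t-1)
    emit:  (S, O) emit[i, x]  = P(observation x | state i)
    Returns the maximum likelihood state sequence."""
    S = len(init)
    T = len(observations)
    # Log probabilities of the best path ending in each state at t = 0.
    log_delta = np.log(init) + np.log(emit[:, observations[0]])
    back = np.zeros((T, S), dtype=int)   # best predecessor of each node
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(trans)  # (prev state, next state)
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit[:, observations[t]])
    # Trace the best path backwards from the most likely final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy weather HMM: states 0 = rainy, 1 = sunny;
# observations 0 = walk, 1 = shop, 2 = clean.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
print(viterbi(init, trans, emit, [0, 1, 2]))  # → [1, 0, 0]
```

Working in log probabilities avoids the numerical underflow that multiplying many small probabilities would cause on longer sequences.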
<p>I first encountered the Viterbi algorithm in the context of error-correcting codes for digital television. The sequence of bits to be transmitted in a digital TV signal can be protected against errors by interspersing it with extra bits derived from a <em>convolutional code</em> - this is a binary function of a number of previous bits. This converts the transmitted sequence from an apparently random sequence (due to data compression) to a Markov process. At the receiving side, we treat the received bitstream (which inevitably contains errors) as the observations and the transmitted bitstream as the hidden states, using the Viterbi algorithm to recover it.</p>
<p>I later used the Viterbi Algorithm for <a href="https://PlayfulTechnology.co.uk/true-212.html">Word Sense Disambiguation</a>. In this application, the observations were words and the hidden states were <a href="https://wordnet.princeton.edu/">WordNet</a> word senses. There were a few complications to take into account - function words, out-of-vocabulary words, multi-word expressions, proper names - but it achieved 70% accuracy, which was described to me as "state of the art".</p>
<p>It's this flexibility and applicability to a range of different problems that makes the Viterbi Algorithm one of my favourite algorithms.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Hidden Markov Models2023-11-23T00:00:00+00:002023-11-23T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-23:/hidden-markov-models.html<p>Using Bayes' Theorem to analyse sequences</p><h2>Using Bayes' Theorem to analyse sequences</h2>
<p>Suppose we wish to analyse a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span>. This can be modelled using <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a> as a <em>Markov process</em> <span class="math">\(P(X_{t} \mid X_{t-1})\)</span>, <em>ie</em> the probability of each event depends on the previous event in the sequence.</p>
<p>If there are <span class="math">\(N\)</span> possible values that <span class="math">\(X\)</span> can take, the number of transition probabilities between them is <span class="math">\(N^{2}\)</span>. Such a model would quickly become very large and not very informative. We need a way to make the models more tractable.</p>
<p>To do this, we assume that the probability of each event can be described in terms of a hidden state, <span class="math">\(S\)</span>, as <span class="math">\(P(X_{t} \mid S_{t})\)</span>. The states can then be modelled by a Markov process, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>. This is known as a <em>Hidden Markov Model</em>, since it models a sequence of hidden states with a Markov process. The number of hidden states can be considerably smaller than the number of possible events, and the states can group events into meaningful categories. The model consists of three distributions, the initial state distribution, <span class="math">\(P(S_{0})\)</span>, the transition probability distribution, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>, and the conditional distribution of the events <span class="math">\(P(X \mid S)\)</span>. </p>
<p>Starting from the initial state distribution <span class="math">\(P(S_{0})\)</span>, we can calculate the posterior distributions of the hidden states at each step <span class="math">\(t\)</span> of a sequence by the following method.</p>
<ol>
<li>Calculate the posterior distribution of the hidden state given the observed event <span class="math">\(X_{t}\)</span> using Bayes' Theorem
<div class="math">$$P(S_{t} \mid X_{t}) = \frac{P(S_{t}) P(X_{t} \mid S_{t})}{P(X_{t})}$$</div>
</li>
<li>Calculate the prior probability of the next state
<div class="math">$$P(S_{t+1}) = \sum_{S_{t}} P(S_{t+1} \mid S_{t}) P(S_{t} \mid X_{t})$$</div>
</li>
</ol>
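<p>The two steps above can be sketched in a few lines of NumPy; the prediction step marginalises over the current state via the transition matrix. The toy model and names are invented for illustration.</p>

```python
import numpy as np

def filter_states(init, trans, emit, observations):
    """Posterior distribution over hidden states at each step, alternating
    the Bayes update and the prediction step described above.
    trans[i, j] = P(S_{t+1} = j | S_t = i); emit[i, x] = P(X = x | S = i)."""
    prior = init.copy()
    posteriors = []
    for x in observations:
        # 1. Bayes update: condition the prior on the observed event.
        unnorm = prior * emit[:, x]
        posterior = unnorm / unnorm.sum()   # dividing by P(X_t)
        posteriors.append(posterior)
        # 2. Prediction: prior for the next step via the transition matrix.
        prior = trans.T @ posterior
    return np.array(posteriors)

# Toy two-state model with three possible events.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
post = filter_states(init, trans, emit, [0, 1, 2])
print(np.allclose(post.sum(axis=1), 1.0))  # → True
```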
<p>A concrete example is <a href="https://PlayfulTechnology.co.uk/video-part-of-speech-tagging.html">Part of Speech Tagging</a>. In this application, the observed events are words and the hidden states are the parts of speech (noun, verb, adjective etc.). This approach is particularly useful when you want the probability of each part of speech for a given word, rather than a single tag. I used this approach in my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, using my own open source <a href="https://PlayfulTechnology.co.uk/a-hidden-markov-model-library.html">Hidden Markov Model library</a>, which I had created as a learning exercise when I first learnt about HMMs. I was pleased to discover that a colleague on that project had also used the library, but I no longer maintain it, as I've learnt a lot since then and if I did any more work on it I'd prefer to restart it from scratch.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bayes' Theorem2023-11-16T00:00:00+00:002023-11-16T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-16:/bayes-theorem.html<p>The probability of a hypothesis given observations.</p><h2>Estimating the probability of a hypothesis given observations.</h2>
<p>This is the beginning of what will hopefully be a regular series of articles explaining Key Algorithms in data science.</p>
<p>If you look at <a href="https://www.linkedin.com/in/peterjbleackley">my LinkedIn profile</a>, you'll see that the banner shows the formula </p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(O)}$$</div>
<p>This is a foundational rule for calculating conditional probabilities, known as <em>Bayes' Theorem</em>, after the Reverend Thomas Bayes, who first proposed it. It may be read as <em>the probability of a hypothesis given some observations is equal to the prior probability of the hypothesis multiplied by the probability of the observations given that hypothesis, and divided by the probability of the observations</em>. </p>
<p>To illustrate this, consider a family where the father has rhesus-positive blood and the mother has rhesus-negative blood. Rhesus-positive is a dominant trait - the father might have one or two copies of the Rh+ gene, whereas rhesus-negative is recessive - the mother must have two copies of the Rh- gene.</p>
<p>Let <span class="math">\(H\)</span> be the hypothesis that the father has two copies of the Rh+ gene. Without further information, 1/2 is the best estimate of its probability. If the family's first child is rhesus-positive, the probability of this is <span class="math">\(P(O \mid H) = 1\)</span> if the father has two copies of the Rh+ gene and <span class="math">\(P(O \mid ¬H) = \frac{1}{2}\)</span> if he has one copy. In general the overall probability of the observations given a set of hypotheses <span class="math">\(H_{i}\)</span> is given by
</p>
<div class="math">$$P(O) = \sum_{i} P(H_{i}) P(O \mid H_{i})$$</div>
<p>since the posterior probabilities of all hypotheses must sum to 1. Therefore, we can update the probability of the father having two copies of the Rh+ gene as
</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{1}{2} \times 1}{\frac{1}{2} \times 1 + \frac{1}{2} \times \frac{1}{2}} = \frac{2}{3}$$</div>
<p>If the family's second child is also rhesus-positive, we can further update our estimate with the new information</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{2}{3} \times 1}{\frac{2}{3} \times 1 + \frac{1}{3} \times \frac{1}{2}} = \frac{4}{5}$$</div>
<p>It is easy to see that if we had known both children's blood groups from the outset and used <span class="math">\(P(O \mid ¬H) = \frac{1}{4}\)</span>, the probability of two rhesus-positive children given a father with one copy of the gene, we would have obtained the same result in a single step.</p>
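<p>The sequential updates above can be checked with a short script. This is an illustrative sketch, not library code: the <code>bayes_update</code> helper is defined here purely for this worked example.</p>

```python
from fractions import Fraction

def bayes_update(prior, p_obs_given_h, p_obs_given_not_h):
    """Return P(H | O) from the prior P(H) and the two likelihoods."""
    evidence = prior * p_obs_given_h + (1 - prior) * p_obs_given_not_h
    return prior * p_obs_given_h / evidence

half = Fraction(1, 2)

# First rhesus-positive child: prior 1/2 becomes posterior 2/3.
posterior = bayes_update(half, 1, half)
print(posterior)  # 2/3

# Second rhesus-positive child: 2/3 becomes 4/5.
posterior = bayes_update(posterior, 1, half)
print(posterior)  # 4/5

# Both children at once, with P(O | not H) = 1/4, gives the same answer.
print(bayes_update(half, 1, Fraction(1, 4)))  # 4/5
```

<p>Using <code>Fraction</code> rather than floats keeps the arithmetic exact, so the results match the hand calculation digit for digit.</p>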
<p>In data science, we often have to estimate the probability of a hypothesis given some evidence, so Bayes' theorem is a useful thing to have in our toolkit. </p>
<p>If we need to take observations of several different variables into account, there are two ways to do it. The first, the <em>Naive Bayes</em> approach, treats all the variables as statistically independent, as we did in the above example. While this has the advantage of simplicity, it is only accurate when the independence assumption approximately holds.</p>
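<p>Under the independence assumption, updating on several observations at once just means multiplying the per-observation likelihoods together before normalising. A minimal sketch of this (the function name and dictionary layout are my own, not a standard API):</p>

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods):
    """priors: {hypothesis: P(H)}
    likelihoods: {hypothesis: [P(o_k | H) for each independent observation o_k]}
    Returns the normalised posterior {hypothesis: P(H | O)}.
    """
    # Unnormalised score for each hypothesis: prior times product of likelihoods.
    scores = {h: priors[h] * prod(likelihoods[h]) for h in priors}
    total = sum(scores.values())
    return {h: score / total for h, score in scores.items()}

# The two rhesus-positive children from the example above:
posterior = naive_bayes_posterior(
    priors={"two copies": 0.5, "one copy": 0.5},
    likelihoods={"two copies": [1.0, 1.0], "one copy": [0.5, 0.5]},
)
print(posterior["two copies"])  # 0.8
```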
<p>For more complex problems, we need to model the dependencies between variables. We do this with a graphical method called a <em>Bayesian Belief Net</em>, where each node on a graph represents a variable, and the links represent dependencies between them. Each node then calculates the probability of the variable it represents in terms of the variables it is dependent on. A simple example can be seen in the Data Science Notebook <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It a Mushroom or Is It a Toadstool?</a>.</p>
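<p>A Bayesian Belief Net of any size is best handled by a dedicated library, but the core idea, factorising the joint distribution along the graph and summing out the unobserved variables, can be shown with a toy two-node net. The structure and all the probability values here are invented for illustration:</p>

```python
# Toy net: Rain -> WetGrass.  The joint factorises as P(rain) * P(wet | rain).
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}  # P(wet | rain), P(wet | no rain)

def p_rain_given_wet():
    """Infer P(rain | wet grass) by enumerating the joint distribution."""
    joint = {
        rain: (P_RAIN if rain else 1 - P_RAIN) * P_WET_GIVEN_RAIN[rain]
        for rain in (True, False)
    }
    # Condition on the evidence by normalising over the observed outcome.
    return joint[True] / sum(joint.values())

print(round(p_rain_given_wet(), 4))  # 0.18 / 0.26 ≈ 0.6923
```

<p>For larger graphs the same enumeration becomes exponential in the number of variables, which is why practical belief-net libraries use smarter inference algorithms.</p>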
<p>For my first AI project, I was asked to choose the best system to implement an automatic diagnostic system. I chose a Bayesian Belief Network on the grounds that it was important for the system to be explainable. Since each node of a Bayesian Belief Network represents a meaningful variable, its results are more explainable than those of a neural network, whose nodes are simply steps in a calculation. More recently I used Bayesian models in a project to predict the optimum settings for machine tools, so Bayes' Theorem has followed me throughout my data science career.</p>
<table>
<thead>
<tr>
<th></th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>