<p><em>Playful Technology Limited, https://PlayfulTechnology.co.uk/</em></p><h1>Priority Queues</h1><p><em>Dr Peter J Bleackley, 2024-04-18</em></p><p>Efficiently iterating over items in order</p><p>While researching last week's article on <a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a>, I found that two of the methods for constructing ball trees and the algorithm for querying ANNOY all involved <em>Priority Queues</em>. Since these are an important component of a number of different algorithms, it is worth examining them in detail.</p>
<p>Suppose we want to iterate over a set of items in a particular order. The naive way of doing this is to sort the list of items and then iterate over them. However, sorting is an expensive operation for large datasets, and we may want to add further items to the list while still iterating, which would necessitate re-sorting the list each time. We therefore need a more efficient way of tackling this.</p>
<p>Priority Queues address this by storing the data in a partially ordered data structure whose elements can be reordered efficiently when items are added or removed. Most implementations use a <em>heap</em>, which is a list of items with the following properties.</p>
<ol>
<li>The item at index <span class="math">\(i\)</span> is the parent of the items at indices <span class="math">\(2i+1\)</span> and <span class="math">\(2i+2\)</span></li>
<li>The parent is less than or equal to each of its children.</li>
</ol>
<p>These properties can be efficiently maintained by the following operations.</p>
<dl>
<dt><em>Shift Up</em></dt>
<dd>While an item is less than its parent and is not at the start of the list, swap it with its parent, then check whether it is less than its new parent.</dd>
<dt><em>Shift Down</em></dt>
<dd>While an item has children and is greater than the smaller of them, swap it with that child, then repeat the comparison with its new children.</dd>
</dl>
<p>(<em>Note</em>: What I'm describing here is a <em>Min Heap</em>, which is used when we want to iterate over our items in ascending order. Most Python implementations of priority queues use this. There are also <em>Max Heaps</em>, which are used to iterate over items in descending order).</p>
<p>To add an item to the heap, we place it at the end and Shift Up until it reaches its proper place. When we remove the first item from the heap during iteration, we move the last item of the heap to the first position and Shift Down until it reaches its proper place.</p>
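The push and pop operations above can be sketched in a few lines of Python. This is a minimal, illustrative Min Heap; the class and method names are my own, not taken from any library:

```python
class MinHeap:
    def __init__(self):
        self.items = []

    def push(self, item):
        """Place the item at the end, then Shift Up to restore the heap property."""
        self.items.append(item)
        self._sift_up(len(self.items) - 1)

    def pop(self):
        """Remove the smallest item; move the last item to the front,
        then Shift Down to restore the heap property."""
        smallest = self.items[0]
        last = self.items.pop()
        if self.items:
            self.items[0] = last
            self._sift_down(0)
        return smallest

    def _sift_up(self, i):
        # While the item is less than its parent, swap upwards.
        while i > 0:
            parent = (i - 1) // 2
            if self.items[i] < self.items[parent]:
                self.items[i], self.items[parent] = self.items[parent], self.items[i]
                i = parent
            else:
                break

    def _sift_down(self, i):
        # While the item is greater than its smaller child, swap downwards.
        n = len(self.items)
        while True:
            child = 2 * i + 1
            if child >= n:
                break
            # Pick the smaller of the two children.
            if child + 1 < n and self.items[child + 1] < self.items[child]:
                child += 1
            if self.items[child] < self.items[i]:
                self.items[i], self.items[child] = self.items[child], self.items[i]
                i = child
            else:
                break
```

Pushing the numbers 5, 1, 4, 2, 3 and popping repeatedly yields them in ascending order.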
<p>There are several implementations of priority queues in Python: <a href="https://docs.python.org/3/library/heapq.html">heapq</a> in the standard library; <a href="https://pypi.org/project/HeapDict/">heapdict</a>, which implements a dictionary interface and allows the priority of items to be altered; and <a href="https://docs.python.org/3/library/queue.html#queue.PriorityQueue">PriorityQueue</a> in the standard queue module, which is useful for scheduling data items to be processed by workers in a multithreaded application.</p>
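For example, with the standard-library heapq module, new items can be added while iteration is already in progress, and the smallest item always comes out first, without any re-sorting:

```python
import heapq

# Build a priority queue by pushing items onto a plain list.
queue = []
for priority in [3, 1, 4]:
    heapq.heappush(queue, priority)

first = heapq.heappop(queue)   # the smallest item comes out first
heapq.heappush(queue, 2)       # add a new item mid-iteration
rest = [heapq.heappop(queue) for _ in range(len(queue))]
```

Here `first` is 1, and `rest` is [2, 3, 4]: the item added mid-iteration is slotted into its correct place.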
<p>Prioritising tasks is an important part of many algorithms, so this is a useful tool to be aware of when designing an algorithm.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
<td></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script><h1>Vector Search Trees</h1><p><em>Dr Peter J Bleackley, 2024-04-11</em></p><p>Finding nearest neighbours quickly</p><p>There are many applications where we need to search a dataset for the nearest neighbours of a given point. For a large dataset, comparing the data point to the entire dataset will be too slow, especially if we need to do it frequently. If we store the dataset to be searched in a tree structure, we can improve the efficiency of queries from <span class="math">\(\mathcal{O} (N)\)</span> to <span class="math">\(\mathcal{O} (\log N)\)</span>.</p>
<p>A simple method for constructing the search tree is <em>KD Trees</em>. This method iterates over the dimensions of the dataset, partitioning it into hyperrectangular blocks. Each block is partitioned at the median of the datapoints it contains along the dimension under consideration. Using the median ensures that the number of points in each partition is balanced. This allows for rapid construction of the search tree, and rapid searching if the dimensionality of the data is low, but its performance degrades when the number of dimensions in the dataset is large. The documentation for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree">SciPy implementation of KD Tree</a> notes that <em>20 is already too large</em>. Adding new data to the tree after initial construction also runs a high risk of the tree becoming unbalanced.</p>
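The median-split construction can be sketched in pure Python. This is an illustrative toy, not the SciPy implementation, and the nested-dictionary node layout is my own choice:

```python
def build_kdtree(points, depth=0):
    """Recursively partition points at the median along one dimension,
    cycling through the dimensions as the depth increases."""
    if not points:
        return None
    k = len(points[0])              # dimensionality of the data
    axis = depth % k                # dimension to split on at this level
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return {
        "point": points[median],    # splitting point stored at this node
        "left": build_kdtree(points[:median], depth + 1),
        "right": build_kdtree(points[median + 1:], depth + 1),
    }
```

Because each node splits at the median, the left and right subtrees always differ in size by at most one point.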
<p>An alternative that improves performance at higher dimensionalities is <em>Ball Trees</em>. In this, each node represents a ball with centroid <span class="math">\(\vec{C}\)</span> and radius <span class="math">\(r\)</span>. Data is assigned to the nodes in such a way as to minimise the hypervolume of the balls. Several methods for doing this are available, as detailed by Stephen M. Omohundro in <a href="https://ftp.icsi.berkeley.edu/ftp/pub/techreports/1989/tr-89-063.pdf">Five Balltree Construction Algorithms</a>. The one used in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree">Scikit-Learn implementation of Ball Trees</a> is a variation on the KD Tree construction algorithm, where instead of iterating through the dimensions in a fixed order, each node is partitioned along the dimension in which the spread of its datapoints is greatest. Another method is an <em>online insertion algorithm</em>, which is suitable when we want to continually add new data to the search tree. Given a tree, each new node is added in the position that minimises the increase in volume of the nodes that contain it. It is also possible to build a Ball Tree bottom up, with a method based on <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>.</p>
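The split-dimension choice described above can be illustrated in a couple of lines (a sketch with my own naming, not Scikit-Learn's code): the dimension with the greatest spread is the one with the largest max-minus-min range.

```python
def widest_dimension(points):
    """Return the index of the dimension along which the spread of
    the datapoints is greatest."""
    k = len(points[0])
    spreads = [max(p[d] for p in points) - min(p[d] for p in points)
               for d in range(k)]
    return max(range(k), key=lambda d: spreads[d])
```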
<p>Another method for constructing search trees is <em>ANNOY</em> (Approximate Nearest Neighbours Oh Yeah), developed by Erik Bernhardsson at Spotify, who needed to search large datasets of high-dimensional vectors as quickly as possible for music recommendations. In this method, the dataset is recursively partitioned by picking two datapoints at random from each existing partition and splitting the partition midway between them. The random construction of the partitions means that it is possible for the nearest neighbour of a point to fall into a different partition. Therefore, an ensemble of trees, similar to a <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a>, is constructed. We can then find a candidate nearest neighbour from each tree and select the best. The randomness of the algorithm makes the matches approximate, rather than exact, but for many applications this doesn't matter.
Here is <a href="https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html">Erik Bernhardsson's own description of ANNOY</a>. There's a <a href="https://pypi.org/project/annoy/1.0.3/">Python implementation of ANNOY</a> on PyPI, and it can be used to search word vectors or document vectors in <a href="https://radimrehurek.com/gensim/similarities/annoy.html">Gensim</a>.</p>
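A toy version of the recursive random split might look like the following. This is an illustrative sketch with my own names, not the ANNOY library; assigning each point to its nearer pivot is equivalent, for Euclidean distance, to splitting midway between the two pivots:

```python
import random

def split(points, rng):
    """Pick two pivot points at random and assign each point to the
    side of its nearer pivot."""
    a, b = rng.sample(points, 2)
    def dist2(p, q):
        # Squared Euclidean distance (square root not needed for comparison).
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    left = [p for p in points if dist2(p, a) <= dist2(p, b)]
    right = [p for p in points if dist2(p, a) > dist2(p, b)]
    return left, right

def build_tree(points, rng, leaf_size=2):
    """Recursively split until partitions are small. ANNOY proper builds
    an ensemble of such trees and queries them all."""
    if len(points) <= leaf_size:
        return points                      # leaf: a small list of points
    left, right = split(points, rng)
    if not left or not right:              # degenerate split; stop here
        return points
    return (build_tree(left, rng, leaf_size),
            build_tree(right, rng, leaf_size))
```

Because the pivots are random, two trees built from the same data will usually partition it differently, which is exactly what makes the ensemble useful.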
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/priority-queues.html">Priority Queues</a></td>
</tr>
</tbody>
</table>
<h1>Cross Validation</h1><p><em>Dr Peter J Bleackley, 2024-04-04</em></p><p>Ensuring unbiased selection of hyperparameters</p><p>When training a model, standard practice is to hold back part of the dataset for testing. This ensures that we have tested the model's ability to generalise to unseen data.</p>
<p>However, many models have <em>hyperparameters</em>, such as the regularisation penalties used in <a href="https://PlayfulTechnology.co.uk/linear-regression.html">regularised linear models</a>. In order to select the best values for these hyperparameters, it is necessary to try fitting the model with different values of hyperparameters and select the version that gives the best results. However, if we use the same test dataset for hyperparameter selection as we do for overall model testing, there is a risk that the hyperparameters will themselves be overfit to the test dataset.</p>
<p>One solution to this is to further subdivide the dataset into training, validation and test datasets. We use the validation dataset to assess which hyperparameters give the best performance, and then use the test dataset to evaluate how well the model performs on unseen data. Many publicly available datasets come partitioned in this way. However, if we have a limited amount of data to work with, we may find that this approach reduces the training dataset too much.</p>
<p>An alternative to this is <em>Cross Validation</em>. The basic procedure is to make several different partitions of the data into training and validation sets, and to calculate the average of the <a href="https://PlayfulTechnology.co.uk/tag/evaluation.html">evaluation metrics</a> across the different partitions. While more computationally expensive than using a single validation partition, this gives more robust results, since the choice of hyperparameters does not depend on the results from a single validation partition. Once hyperparameters have been chosen, the data used for validation can be folded back into the training dataset to train the final model.</p>
<p>Several strategies may be used for making the split. The simplest is the <em>Leave One Out</em> strategy. For a training dataset of size <span class="math">\(N\)</span>, this makes <span class="math">\(N\)</span> partitions into <span class="math">\(N-1\)</span> training examples and 1 validation example. A variation of this is <em>Leave P Out</em>, which makes <span class="math">\(\binom{N}{P}\)</span> partitions of <span class="math">\(N-P\)</span> training examples and <span class="math">\(P\)</span> validation examples. These methods are computationally expensive, and have the disadvantage that there is considerable overlap between the partitions, so their results are not independent.</p>
<p>A more commonly used strategy is <em>K-Fold Cross Validation</em>. This divides the data into <span class="math">\(K\)</span> <em>folds</em> of <span class="math">\(\frac{N}{K}\)</span> examples. Each of these in turn is used as the validation partition, with the remaining folds combined to form the training partition. Usually 5 or 10 folds are used. This is more efficient than Leave One Out, and provides greater independence between tests, as each training dataset overlaps by only <span class="math">\(\frac{K-2}{K-1}\)</span> with the others, as opposed to almost complete overlap in Leave One Out. For further statistical rigour (at the expense of greater compute time) <em>Repeated K-Fold Cross Validation</em> performs this several times, with different assignments of examples to folds. </p>
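Generating the K-Fold partitions can be sketched as follows. This is illustrative only, with my own naming; here folds are formed by striding over the indices, whereas library implementations such as Scikit-Learn's use contiguous blocks by default:

```python
def kfold_indices(n, k):
    """Assign n example indices to k folds; each fold in turn serves as
    the validation set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation
```

Each of the k iterations yields a (training, validation) pair, and every example appears in the validation set exactly once across the iterations.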
<p>If the classes to be predicted are highly unbalanced, there is a risk that some folds may not contain any examples of a particular class, thus skewing the results. <em>Stratified K-Fold Cross Validation</em> addresses this problem by grouping the examples by target class, and then dividing each class equally between the folds. If there are known statistical dependencies in the training examples, <em>Group K-Fold</em> divides the dataset into groups according to some feature which is expected to have important statistical correlations with other variables, and assigns the data to folds group by group, so that the same group is never present in both the training and validation datasets. This ensures that the model will generalise across groups. Group K-Fold relaxes the requirement that folds be of equal size. These two strategies can be combined as <em>Stratified Group K-Fold Cross Validation</em>.</p>
<p>Related to Group K-Fold is the <em>Leave One Group Out</em> strategy, which in effect treats each group as a fold, and the <em>Leave P Groups Out</em>, strategy, which, given <span class="math">\(G\)</span> groups, forms <span class="math">\(\binom{G}{P}\)</span> partitions, each containing <span class="math">\(G-P\)</span> groups in the training dataset and <span class="math">\(P\)</span> groups in the test dataset.</p>
<p>Another possible strategy is <em>Shuffle Split Cross Validation</em>. In this, the dataset is repeatedly shuffled, and after each shuffle it is split into a training and a validation dataset. Whereas with K-Fold cross validation and its variants the size of the validation dataset depends on the number of folds, in Shuffle Split Cross Validation the number of splits and the validation size may be selected independently of each other. Stratification and Grouping may be applied to Shuffle Split as they are to K-Fold.</p>
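Shuffle Split can be sketched similarly (illustrative, my own naming); note that the validation size is chosen independently of the number of splits:

```python
import random

def shuffle_split(n, n_splits, validation_size, rng):
    """Repeatedly shuffle all n indices; after each shuffle, the first
    validation_size indices form the validation set and the rest the
    training set."""
    indices = list(range(n))
    for _ in range(n_splits):
        rng.shuffle(indices)
        # Slicing copies the lists, so later shuffles don't mutate them.
        yield indices[validation_size:], indices[:validation_size]
```

Unlike K-Fold, an example may appear in several validation sets, or in none, across the splits.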
<p>In my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a> I had to evaluate a large number of candidate models. K-Fold Cross Validation played an essential role in this.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/vector-search-trees.html">Vector Search Trees</a></td>
</tr>
</tbody>
</table>
<h1>Linear Regression</h1><p><em>Dr Peter J Bleackley, 2024-03-28</em></p><p>Fitting linear models</p><p>After the discussion of <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a> in the last article, it makes sense to discuss regression models themselves. For many problems, we wish to fit a function of the form</p>
<div class="math">$$y = m x + c$$</div>
<p>or, for multivariate problems</p>
<div class="math">$$\vec{y} = \mathbf{M} \vec{x} + \vec{c}$$</div>
<p>The simplest method for this is <em>Ordinary Least Squares</em>, which chooses the parameters so as to minimise the mean squared error of the model. This has a closed-form solution, but there are disadvantages to using it with multivariate data. Firstly, there is a danger of overfitting, with variables of little importance adding to the complexity of the model, and secondly there is the possibility of dependencies existing between the input variables and thus introducing redundancy into the model. These issues may be addressed by applying <a href="https://PlayfulTechnology.co.uk/data-reduction.html">principal component analysis</a> to the input data, but this has the disadvantage of making the model less explainable.</p>
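For a single input variable, the closed-form OLS solution can be written out directly. This is a pure-Python sketch of the textbook formulas, not a library implementation:

```python
def ols_fit(xs, ys):
    """Closed-form ordinary least squares for y = m*x + c:
    m = cov(x, y) / var(x), c = mean(y) - m * mean(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    c = mean_y - m * mean_x
    return m, c
```

For example, fitting the points (0, 1), (1, 3), (2, 5), (3, 7) recovers m = 2 and c = 1. In the multivariate case the closed form involves a matrix inverse, which is where the colinearity problems described above arise.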
<p>There are a number of methods for reducing the complexity of multivariate linear regression models. One of these is <em>Least Angle Regression</em> (LARS). This is a method of fitting the model that minimises the number of components used to predict the outputs. Rather than following the gradient of the loss function, at each step of the optimisation it adjusts the weight corresponding to the input variable that has the strongest correlation with the residuals. When more than one variable has an equally strong correlation with the target residuals, their weights are increased together in the joint least squares direction. While LARS identifies the most important variables contributing to the prediction, it does not solve the problem of collinearity between variables and is sensitive to noise.</p>
<p>Other methods for preventing overfitting involve adding a <em>regularisation penalty</em> to the loss function in the optimisation. For <em>Lasso regression</em>, this penalty is the sum of the absolute values of the weights, so the loss function to be optimised is</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} |M_{jk}|$$</div>
<p>
where <span class="math">\(N\)</span> is the number of samples and <span class="math">\(\alpha\)</span> is a hyperparameter.</p>
<p>For <em>Ridge regression</em>, the penalty term is the sum of the squares of the model weights, hence the loss function is </p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \sum_{j} \sum_{k} M_{jk}^{2}$$</div>
<p>Lasso regression favours sparse models (that is, those with fewer non-zero weights), whereas ridge regression favours generally small weights.</p>
<p>These methods can be combined. <em>Lasso LARS</em> applies the Lasso regularisation penalty to LARS, which reduces LARS's vulnerability to collinearity and noise. In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used this method to fit numerical variables related to the progress of cancer to measures of activity in clusters of proteins. This method was chosen because I wished to assess which protein clusters were strong predictors.</p>
<p><em>ElasticNet</em> combines the Lasso and Ridge regression methods, optimising the loss function</p>
<div class="math">$$L = \frac{\sum_{i}\left| \vec{y}_i - \left(\mathbf{M} \vec{x}_{i} + \vec{c} \right) \right|^{2}}{2 N} + \alpha \left( \rho \sum_{j} \sum_{k} |M_{jk}| + (1 - \rho) \sum_{j} \sum_{k} M_{jk}^{2} \right)$$</div>
<p>where <span class="math">\(\rho\)</span> is another hyperparameter, ranging from 0 to 1, which determines the relative importance of the two regularisation penalties.</p>
<p>These algorithms, and a number of related ones, are implemented in <a href="https://scikit-learn.org/stable/modules/linear_model.html">Scikit-Learn</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/cross-validation.html">Cross Validation</a></td>
</tr>
</tbody>
</table>
<h1>Evaluation Metrics for Regression</h1><p><em>Dr Peter J Bleackley, 2024-03-21</em></p><p>How good is your regression model?</p><p>In the previous article, we looked at <a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a>, which are applicable when we are predicting discrete categories. This time, we'll look at how to evaluate models that predict continuous variables.</p>
<p>Suppose, in our test dataset, we have <span class="math">\(N\)</span> data points. We'll designate the predicted values as <span class="math">\(f_{i}\)</span> and the actual values as <span class="math">\(y_{i}\)</span>. One of the most obvious metrics to use is the <em>mean squared error</em></p>
<div class="math">$$\mathrm{MSE} = \frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}$$</div>
<p>This is essentially the variance of the errors. Since the mean squared error is often used as the loss function when fitting a regression model, we can easily compare this metric to the fitting loss to give an indication of how well the model has generalised. However, it can be difficult to interpret, since the scale of the metric is not the same as the original data. We may therefore wish to use the <em>root mean squared error</em></p>
<div class="math">$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} (y_{i} - f_{i})^{2}}{N}}$$</div>
<p>which is the standard deviation of the errors. However, both these metrics can be sensitive to outliers, because of the squaring of the errors, which effectively gives larger errors higher weight. A metric that is less sensitive to this is the <em>mean absolute error</em></p>
<div class="math">$$\mathrm{MAE} = \frac{\sum_{i} |y_{i} - f_{i}|}{N}$$</div>
<p>This gives the same weight to small errors as to large ones. If we were to choose a constant <span class="math">\(f\)</span> that minimises the mean absolute error, it would correspond to the median of <span class="math">\(y\)</span>.</p>
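The three metrics discussed so far can be computed directly from their definitions (an illustrative helper with my own naming):

```python
import math

def regression_metrics(ys, fs):
    """MSE, RMSE and MAE for actual values ys and predictions fs."""
    n = len(ys)
    mse = sum((y - f) ** 2 for y, f in zip(ys, fs)) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(y - f) for y, f in zip(ys, fs)) / n
    return mse, rmse, mae
```

On the toy data ys = [1, 2, 3], fs = [1, 2, 5], a single error of 2 gives MSE = 4/3 but MAE = 2/3, showing how the squaring weights the one large error more heavily.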
<p>If we wish to use a metric that is independent of the scale of the data, we can use the <em>mean absolute percentage error</em></p>
<div class="math">$$\mathrm{MAPE} = \frac{1}{N}\sum_{i} \left| \frac{y_{i} - f_{i}}{y_{i}} \right|$$</div>
<p>While this is intuitively easy to understand, it has two disadvantages. One is that it gives lower errors when the predicted values are too high than when they are too low, and the other is that it can diverge if any of the values of <span class="math">\(y_{i}\)</span> are close to zero. There are a number of approaches to mitigating these disadvantages. The <em>weighted mean absolute percentage error</em></p>
<div class="math">$$\mathrm{wMAPE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i}|y_{i}|}$$</div>
<p>
is robust against divergence, because it scales the errors by the mean absolute value of the true values, rather than the individual true values.</p>
<p>The <em>symmetric mean absolute percentage error</em>
</p>
<div class="math">$$\mathrm{sMAPE} = \frac{100}{N} \sum_{i} \frac{|y_{i} - f_{i}|}{|y_{i}| + |f_{i}|}$$</div>
<p>
is bounded between 0% and 100%. When <span class="math">\(y_{i}\)</span> and <span class="math">\(f_{i}\)</span> are both 0, the datapoint's percentage error is taken to be 0.</p>
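<p>A minimal sketch of these percentage-based metrics (the 0/0 convention for sMAPE is handled explicitly; function and variable names are illustrative):</p>

```python
def mape(y, f):
    # Mean absolute percentage error: diverges if any y_i is near zero
    return sum(abs((yi - fi) / yi) for yi, fi in zip(y, f)) / len(y)

def wmape(y, f):
    # Weighted MAPE: errors scaled by the total absolute value of the actuals
    return sum(abs(yi - fi) for yi, fi in zip(y, f)) / sum(abs(yi) for yi in y)

def smape(y, f):
    # Symmetric MAPE, bounded between 0% and 100%; 0/0 is taken to be 0
    total = 0.0
    for yi, fi in zip(y, f):
        denom = abs(yi) + abs(fi)
        if denom > 0:
            total += abs(yi - fi) / denom
    return 100 * total / len(y)
```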
<p>The <em>mean absolute scaled error</em>
</p>
<div class="math">$$\mathrm{MASE} = \frac{\sum_{i}|y_{i} - f_{i}|}{\sum_{i} |y_{i} - \bar{y}|}$$</div>
<p>where</p>
<div class="math">$$\bar{y} = \frac{\sum_{i} y_{i}}{N}$$</div>
<p>is the mean of the true values. This metric is similar to the weighted mean absolute percentage error, but scaled by the sum of the absolute deviations rather than the sum of the absolute values. It gives equal weight to positive and negative errors.</p>
<p>The <em>mean absolute log error</em></p>
<div class="math">$$\mathrm{MALE} = \frac{\sum_{i}|\ln y_{i} - \ln f_{i}|}{N}$$</div>
<p>
gives equal weight to positive and negative errors, but requires the forecasted and actual values to be strictly positive, or it will diverge.</p>
<p>Another important metric is the <em>coefficient of determination</em>, or <em>explained variance</em></p>
<div class="math">$$R^{2} = 1 - \frac{\sum_{i}(y_{i} - f_{i})^{2}}{\sum_{i} (y_{i} - \bar{y})^2}$$</div>
<p>This can be seen as bearing a similar relationship to the mean squared error as the mean absolute scaled error has to the mean absolute error. It is a measure of how successful a model is at predicting the variability of the data. It is less sensitive to outliers than MSE, because an outlier will increase the denominator as well as the numerator. It is equivalent to the square of the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson correlation coefficient</a> between the actual and predicted values.</p>
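<p>The coefficient of determination can be sketched directly from its definition (illustrative names; plain Python lists):</p>

```python
def r_squared(y, f):
    # R^2: fraction of the variance in y explained by the predictions f
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))   # residual sum of squares
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)            # total sum of squares
    return 1 - ss_res / ss_tot

# A perfect model scores 1; always predicting the mean scores 0
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```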
<p>All these metrics test primarily for random errors. If we wish to test for systematic errors we can use the <em>mean signed difference</em></p>
<div class="math">$$\mathrm{MSD} = \frac{\sum_{i} y_{i} - f_{i}}{N}$$</div>
<p>which indicates the magnitude and direction of any likely bias in the model.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
<td><a href="https://PlayfulTechnology.co.uk/linear-regression.html">Linear Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Evaluation Metrics for Classifiers2024-03-14T00:00:00+00:002024-03-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-14:/evaluation-metrics-for-classifiers.html<p>How good is your classifier model?</p><p>One vitally important task in any data science project is to assess how well the model performs. Various metrics are available for doing this, and each has its own advantages and disadvantages.
This is a large topic, so we will separate it into metrics suitable for classifiers (this article) and those suitable for regression (next article).</p>
<p>A detailed description of the performance of a classifier model is given by the <em>Confusion Matrix</em> <span class="math">\(\mathbf{C}\)</span>, where <span class="math">\(C_{ij}\)</span> is the number of instances of class <span class="math">\(i\)</span> that are predicted to belong to class <span class="math">\(j\)</span>. This is useful for visualising the performance of the classifier, and the metrics discussed below can be calculated from it.</p>
<p>Consider a binary classification problem. We may classify the results in our test dataset as True Positives, True Negatives, False Positives and False Negatives. The number of each of these is denoted <span class="math">\(\mathrm{TP} = C_{1,1}\)</span>, <span class="math">\(\mathrm{TN} = C_{0,0}\)</span>, <span class="math">\(\mathrm{FP} = C_{0,1}\)</span> and <span class="math">\(\mathrm{FN} = C_{1,0}\)</span> respectively.</p>
<p>The <em>Precision</em> of the classifier is the probability that an item predicted to be true is actually true. This is given by
</p>
<div class="math">$$ \mathrm{Pr} = \frac{\mathrm{TP}}{\mathrm{TP} +\mathrm{FP}}$$</div>
<p>
In Bayesian terms, if the predicted class is <span class="math">\(p\)</span> and the actual class is <span class="math">\(a\)</span>,
</p>
<div class="math">$$\mathrm{Pr} = P(a=\mathsf{True} \mid p=\mathsf{True})$$</div>
<p>The <em>Recall</em> of the classifier is the probability that a true item is predicted to be true. This is given by
</p>
<div class="math">$$\mathrm{R} = P(p=\mathsf{True} \mid a=\mathsf{True}) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$</div>
<p>Which of these is more informative depends on the application. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a>, my initial approach gave 100% Recall. However, since I had designated <em>True</em> to indicate a reliable article and <em>False</em> to indicate fake news, Precision was a more important measure of the model's ability to discriminate fact from fiction.</p>
<p>The F1 score is a metric that seeks to balance Precision and Recall, and is defined as their harmonic mean.</p>
<div class="math">$$F_{1} = \frac{2}{1/\mathrm{Pr} + 1/\mathrm{R}} = \frac{2 \mathrm{Pr} \mathrm{R}}{\mathrm{Pr} + \mathrm{R}} = \frac{2 \mathrm{TP}}{2 \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$</div>
<p>This measures similarity between the set of items predicted to be true and those that actually are true, but is not easy to interpret in terms of a Bayesian probability.</p>
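<p>All three of these binary metrics reduce to the four counts above; a minimal sketch (boolean lists, illustrative names):</p>

```python
def binary_metrics(actual, predicted):
    # Tally the four outcomes for a binary classifier
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1
```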
<p>The <em>Accuracy</em> of the model is the probability that it predicts the correct class.
</p>
<div class="math">$$A = P(p=a) = \frac{\mathrm{TP} +\mathrm{TN}}{\mathrm{TP} +\mathrm{TN} + \mathrm{FP} +\mathrm{FN}}$$</div>
<p>
This is intuitive to interpret and, unlike the metrics discussed above, takes the true negatives into account. However, it becomes uninformative if classes are strongly imbalanced. For example, if we wish to predict whether or not a user will click on a given advertisement, we can achieve at least 99% accuracy by predicting <em>No</em> all the time. We therefore need metrics that correct for class imbalance.</p>
<p><em>Cohen's Kappa</em> is a measure of how much better a classifier is than guesswork. If we guessed the class of an item without information, our best strategy would be to pick the maximum-likelihood class every time, and this would give us a success rate of <span class="math">\(P_{\mathrm{max}}\)</span>. We can then define
</p>
<div class="math">$$\kappa = 1 - \frac{1 - A}{1 - P_{\mathrm{max}}}$$</div>
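<p>A one-line sketch, assuming the accuracy and the majority-class prevalence have already been measured:</p>

```python
def cohens_kappa(accuracy, p_max):
    # How far the classifier closes the gap between always guessing
    # the majority class (kappa = 0) and perfect prediction (kappa = 1)
    return 1 - (1 - accuracy) / (1 - p_max)
```

<p>For example, 90% accuracy on a dataset whose majority class makes up 80% of the samples gives a kappa of 0.5: the classifier has closed half the remaining gap.</p>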
<p>The <em>Matthews Correlation Coefficient</em> is the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Pearson Correlation Coefficient</a> between the actual and predicted classes. It is calculated as</p>
<div class="math">$$\phi = \frac{\mathrm{TP} \mathrm{TN} - \mathrm{FP} \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}
$$</div>
<p>Accuracy and Cohen's Kappa can be extended to the multiclass case in the obvious way. It is not trivial to do this for Precision and Recall. However, we can define them on a per-class basis.
</p>
<div class="math">$$\mathrm{Pr}_{i} = \frac{C_{ii}}{\sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R}_{i} = \frac{C_{ii}}{\sum_{j} C_{ij}}$$</div>
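<p>These per-class definitions read straight off the confusion matrix; a sketch with <code>C</code> as a nested list, where <code>C[i][j]</code> counts instances of class <code>i</code> predicted as class <code>j</code>:</p>

```python
def per_class_metrics(C):
    n = len(C)
    # Precision of class i: correct predictions of i over the column sum
    precision = [C[i][i] / sum(C[j][i] for j in range(n)) for i in range(n)]
    # Recall of class i: correct predictions of i over the row sum
    recall = [C[i][i] / sum(C[i][j] for j in range(n)) for i in range(n)]
    return precision, recall
```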
<p><a href="https://www.evidentlyai.com/classification-metrics/multi-class-metrics">Evidently AI</a> suggests three methods for calculating overall precision and recall scores in a multiclass problem. <em>Macro averaging</em> simply calculates the mean of precision and recall across all classes.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \mathrm{Pr}_{i}}{N}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \mathrm{R}_{i}}{N}$$</div>
<p>where <span class="math">\(N\)</span> is the number of classes.</p>
<p><em>Micro averaging</em> gives an average of precision and recall across all instances.</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ji}}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<p>These two expressions are equal, since a false negative for one class is a false positive for another; so while micro averaging is finer grained in one way, it loses information in another.</p>
<p>The third possibility is <em>weighted averaging</em>. While macro averaging gives all classes equal weight, weighted averaging considers their overall prevalence in the data.
</p>
<div class="math">$$\mathrm{Pr} = \frac{\sum_{i} \left( \sum_{j} C_{ij} \right) \mathrm{Pr}_{i}}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<div class="math">$$\mathrm{R} = \frac{\sum_{i} \left( \sum_{j} C_{ij} \right) \mathrm{R}_{i}}{\sum_{i} \sum_{j} C_{ij}}$$</div>
<p>To generalise the Matthews Correlation Coefficient to multiple classes, we first define the following terms
</p>
<div class="math">$$t_{k} = \sum_{j} C_{kj}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> occurs
</p>
<div class="math">$$p_{k} = \sum_{j} C_{jk}$$</div>
<p> is the number of times class <span class="math">\(k\)</span> is predicted
</p>
<div class="math">$$c = \sum_{k} C_{kk}$$</div>
<p> is the number of correct predictions
</p>
<div class="math">$$s =\sum_{i} \sum_{j} C_{ij}$$</div>
<p> is the total number of samples</p>
<p>We then obtain
</p>
<div class="math">$$\phi = \frac{c s - \vec{t} \cdot \vec{p}}{\sqrt{s^{2} - |\vec{p}|^{2}}\sqrt{s^{2} - |\vec{t}|^{2}}}$$</div>
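<p>Putting the four terms together gives a short sketch; for a two-class confusion matrix this reproduces the binary formula above:</p>

```python
import math

def multiclass_mcc(C):
    n = len(C)
    t = [sum(C[k]) for k in range(n)]                        # occurrences of each class
    p = [sum(C[j][k] for j in range(n)) for k in range(n)]   # predictions of each class
    c = sum(C[k][k] for k in range(n))                       # correct predictions
    s = sum(t)                                               # total number of samples
    numerator = c * s - sum(tk * pk for tk, pk in zip(t, p))
    denominator = (math.sqrt(s ** 2 - sum(pk ** 2 for pk in p))
                   * math.sqrt(s ** 2 - sum(tk ** 2 for tk in t)))
    return numerator / denominator
```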
<p>Once you have the numbers, of course, it's important to dig deeper and understand what the factors influencing your model's performance are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-regression.html">Evaluation Metrics for Regression</a></td>
</tr>
</tbody>
</table>
PageRank2024-03-07T00:00:00+00:002024-03-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-03-07:/pagerank.html<p>Using the connectivity of networks to rank items</p><p>Early web search engines, such as AltaVista, relied on hand-curated indexes of content. This was, of course, difficult to scale. What was needed was an automatic way of ranking web pages. Larry <em>Page</em> developed an algorithm for <em>ranking</em> the importance of nodes in a network (such as web <em>pages</em>) in terms of their connections during his PhD at Stanford University, and then went on to found Google to exploit his research.</p>
<p>The <em>PageRank</em> algorithm is based on three assumptions.</p>
<ol>
<li>The more valuable a web page is, the more likely other web pages are to link to it.</li>
<li>Links originating from more valuable pages confer more value on the pages they link to.</li>
<li>Pages that link indiscriminately to many other pages confer less value on those pages than those which link more selectively.</li>
</ol>
<p>Based on these assumptions, it then models a <em>random walk</em> taken through the Internet by a user clicking web links at random. If the user is viewing a web page <span class="math">\(i\)</span> that has <span class="math">\(N_{i}\)</span> outgoing links, they have a probability <span class="math">\(d\)</span> (known as the <em>damping factor</em>, and typically chosen as 0.85) of clicking a link to another page. This link is assumed to be chosen with uniform probability from the page's outgoing links. The PageRank <span class="math">\(P_{i}\)</span> for the page is a measure of how likely the page is to be found by this method.</p>
<p>If <span class="math">\(L_{i}\)</span> is the set of pages that link to <span class="math">\(i\)</span>, the PageRank satisfies the equation</p>
<div class="math">$$P_{i} = d \sum_{j \in L_{i}} \frac{P_{j}}{N_{j}} + (1-d)$$</div>
<p>This is solved iteratively.</p>
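<p>A sketch of the iterative solution (the graph is a dict mapping each page to the list of pages it links to; it assumes every page has at least one outgoing link, and the names are illustrative):</p>

```python
def pagerank(links, d=0.85, iterations=100):
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for i in links:
            # Each page j that links to i contributes a 1/N_j share of its rank
            incoming = sum(ranks[j] / len(links[j]) for j in links if i in links[j])
            new_ranks[i] = d * incoming + (1 - d)
        ranks = new_ranks
    return ranks

# 'c' is linked to by both other pages, so it accumulates the highest rank
ranks = pagerank({'a': ['c'], 'b': ['c'], 'c': ['a']})
```

<p>A page with no incoming links settles at the floor value <span class="math">\(1-d\)</span>, while heavily linked pages accumulate rank from their neighbours.</p>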
<p>If we define a <em>connection matrix</em> <span class="math">\(\mathbf{C}\)</span> such that <span class="math">\(C_{ij}\)</span> is <span class="math">\(1/N_{j}\)</span> if <span class="math">\(j\)</span> connects to <span class="math">\(i\)</span> and 0 otherwise, we can express this as a matrix equation</p>
<div class="math">$$\vec{P} = d \mathbf{C} \cdot \vec{P} + (1-d)$$</div>
<p>We then see that the PageRank is a modified form of the first eigenvector of the connection matrix.</p>
<p>Like <a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a>, PageRank is an example of a <em>collective intelligence</em> algorithm, in that it uses data from the actions of a large number of people to infer its scores.</p>
<p>PageRank is one of the most commercially successful algorithms ever devised; however, its uses are not limited to ranking web pages. It can be used to analyse any data that can be modelled as a graph, such as citations in academic papers, patterns of gene activation in cells, or connections in the nervous system. A survey of these uses can be found in <a href="https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf">PageRank Beyond the Web</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/evaluation-metrics-for-classifiers.html">Evaluation Metrics for Classifiers</a></td>
</tr>
</tbody>
</table>
Collaborative Filtering2024-02-29T00:00:00+00:002024-02-29T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-29:/collaborative-fitering.html<p>A basic recommendation algorithm</p><p>A problem of interest to a lot of businesses is <em>recommendation</em> - how to predict what their customers are likely to want. One of the simplest approaches to this is <em>Collaborative Filtering</em>, which works by identifying users with similar tastes.</p>
<p>Suppose each user <span class="math">\(i\)</span> has rated a set of items <span class="math">\(R_{i}\)</span>, giving each item <span class="math">\(n\)</span> a score <span class="math">\(S_{i,n}\)</span>. For a second user <span class="math">\(j\)</span>, we can obtain the intersection of their rated items <span class="math">\(R_{i} \cap R_{j}\)</span> and from these compute a weight <span class="math">\(w_{ij}\)</span> using a suitable <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">similarity metric</a> on the scores each user has given to their common items. The most common metrics to use would be the cosine similarity or Pearson correlation. If all scores are positive or zero, cosine similarity will give weights in the range <span class="math">\(0 \le w_{ij} \le 1\)</span>, whereas the Pearson correlation will give weights in the range <span class="math">\(-1 \le w_{ij} \le 1\)</span>, which is potentially more sensitive to polarisation in people's tastes. If two users have no items in common, <span class="math">\(w_{ij} = 0\)</span>.</p>
<p>For an item <span class="math">\(n\)</span> which user <span class="math">\(i\)</span> has not rated, we may then calculate a predicted rating
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} S_{j,n} w_{ij}}{\sum_{j \mid n \in R_{j}} w_{ij}}$$</div>
<p>This is the weighted mean of the other users' ratings for the item, weighted according to the similarity of the users' ratings on other items. Items with a high predicted score for a given user can then be recommended to that user. The name <em>Collaborative Filtering</em> refers to the users collaborating through the algorithm to filter the items according to each other's preferences.</p>
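<p>The prediction step can be sketched as follows (here <code>scores</code> maps each user to a dict of item ratings and <code>weights</code> holds precomputed similarities to user <code>i</code>; all names are illustrative):</p>

```python
def predict_rating(i, n, scores, weights):
    # Users other than i who have rated item n
    raters = [j for j in scores if j != i and n in scores[j]]
    total_weight = sum(weights[j] for j in raters)
    if total_weight == 0:
        return None  # no similar user has rated this item
    # Similarity-weighted mean of the other users' ratings
    return sum(scores[j][n] * weights[j] for j in raters) / total_weight

scores = {'alice': {'film1': 5},
          'bob': {'film1': 4, 'film2': 4},
          'carol': {'film1': 5, 'film2': 2}}
weights = {'bob': 0.5, 'carol': 1.0}  # similarities of alice to the others
print(predict_rating('alice', 'film2', scores, weights))  # (4*0.5 + 2*1.0) / 1.5
```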
<p>So far we have assumed that users have rated the items with a numerical score. However, in many applications, we only have a binary choice - for example, whether users have purchased an item, or shared a link. In this case, we can use the Tanimoto metric
</p>
<div class="math">$$w_{ij} = \frac{|R_{i} \cap R_{j}|}{|R_{i} \cup R_{j}|}$$</div>
<p> as the weighting between users. The predicted rating for <span class="math">\(n\)</span> then becomes
</p>
<div class="math">$$S^{\prime}_{i,n} = \frac{\sum_{j \mid n \in R_{j}} w_{ij}}{|\left\{j \mid n \in R_{j}\right\}|}$$</div>
<p>
that is, the average similarity to the user of users who have chosen the item.</p>
<p>Collaborative filtering and other recommendation algorithms suffer from the <em>bootstrap problem</em>, in that they require a lot of user data to work effectively, but when starting something up, that data is not available. Until a user has rated a significant number of items, it will not be possible to predict accurately what they will like, and until a significant number of people have rated an item, it will not be possible to predict accurately who will like it. As a result, recommendation systems cannot function effectively as a product in their own right, but work best as a feature of a larger product.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/pagerank.html">PageRank</a></td>
</tr>
</tbody>
</table>
K-Means Clustering2024-02-22T00:00:00+00:002024-02-22T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-22:/k-means-clustering.html<p>Finding clusters by their centroids</p><p>In the previous article, we discussed <a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a>. Another commonly used method is the <em>K-Means</em> algorithm, which attempts to find <span class="math">\(K\)</span> clusters such that the variance within the clusters is minimised. It does this by the following method</p>
<ol>
<li>Given an appropriately scaled dataset, choose <span class="math">\(K\)</span> points in the range of the data</li>
<li>Assign each point in the dataset to a cluster associated with the nearest of these points, according to the <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Euclidean distance</a></li>
<li>Recalculate the points as the means of the datapoints assigned to their clusters</li>
<li>Repeat from step 2 until the assignments converge</li>
</ol>
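<p>The steps above can be sketched as follows (Forgy initialisation, i.e. <span class="math">\(K\)</span> random data points as the starting centroids; a fixed iteration cap stands in for a proper convergence test):</p>

```python
import random

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)  # Forgy initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Step 2: assign each point to the nearest centroid (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[c]
                     for c, cl in enumerate(clusters)]
    return centroids, clusters

random.seed(0)
centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```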
<p>Several different methods may be used to assign the initial centroids. The <em>Random Partition</em> method initially assigns each datapoint to a random cluster and takes the means of those clusters as the starting points. This tends to produce initial centroids close to the centre of the dataset. <em>Forgy's method</em> chooses <span class="math">\(K\)</span> datapoints randomly from the dataset as the initial centroids. This tends to give more widely spaced centroids. A variation of this, the <em>kmeans++</em> method, weights the probability of choosing each datapoint as a centroid by the minimum squared distance of that point from the centroids already chosen. This is the default in <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">Scikit-Learn's implementation of K-means</a>, since it is considered more robust. Since K-means cannot be guaranteed to converge to the optimal solution and is sensitive to its initial conditions, it is common practice to rerun the clustering several times with different sets of initial centroids, and choose the solution with the lowest variance.</p>
<p>Another issue with K-means is how many clusters to choose. This may be done by visualising the data in advance, or with the <em>silhouette score</em>. This is a measure of how much closer a datapoint is to other datapoints in its own cluster than it is to datapoints in other clusters. For a datapoint <span class="math">\(i\)</span> which is a member of cluster <span class="math">\(C_{k}\)</span>, which has <span class="math">\(N_{k}\)</span> datapoints assigned to it, we first calculate the mean distance of <span class="math">\(i\)</span> from the other members of <span class="math">\(C_{k}\)</span></p>
<div class="math">$$a_{i} = \frac{\sum_{j \in C_{k},j \neq i} d(i,j)}{N_{k}-1}$$</div>
<p>where <span class="math">\(d(i,j)\)</span> is the distance between datapoints <span class="math">\(i\)</span> and <span class="math">\(j\)</span>.</p>
<p>We then find the mean distance between <span class="math">\(i\)</span> and the datapoints in the closest cluster to it other than the one to which it is assigned.</p>
<div class="math">$$b_{i} = \min_{l \neq k} \frac{\sum_{j \in C_{l}} d(i,j)}{N_{l}}$$</div>
<p>The silhouette score for an individual point is then calculated as </p>
<div class="math">$$s_{i} = \frac{b_{i} - a_{i}}{\max(b_{i},a_{i})}$$</div>
<p>This has a range of -1 to 1, where a high value would indicate that a datapoint is central to its cluster and a low value that it is peripheral. We may then calculate the mean of <span class="math">\(s_{i}\)</span> over the dataset. The optimum number of clusters is the one that maximises this score.</p>
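<p>A sketch of the per-point silhouette score (clusters are lists of point indices and <code>d</code> is a distance function; it assumes the point's own cluster has at least two members):</p>

```python
def silhouette(i, clusters, d):
    own = next(c for c in clusters if i in c)
    # a_i: mean distance to the other members of the point's own cluster
    a = sum(d(i, j) for j in own if j != i) / (len(own) - 1)
    # b_i: mean distance to the nearest cluster the point does not belong to
    b = min(sum(d(i, j) for j in c) / len(c) for c in clusters if i not in c)
    return (b - a) / max(a, b)

# Points on a line at 0, 1 and 10, clustered as {0, 1} and {10}
pts = [0.0, 1.0, 10.0]
d = lambda i, j: abs(pts[i] - pts[j])
print(silhouette(0, [[0, 1], [2]], d))  # 0.9
```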
<p><em>X-means</em> is a variant of K-means that aims to select the optimum number of clusters automatically. It proceeds as follows.</p>
<ol>
<li>Perform K-means on the dataset with <span class="math">\(K=2\)</span>.</li>
<li>For each cluster, perform K-means again with <span class="math">\(K=2\)</span> for the members of that cluster.</li>
<li>Use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to determine whether this improves the model. Keep subdividing clusters until it does not.</li>
<li>When no further subdivisions are necessary, use the centroids of the clusters thus obtained as the starting point for a final round of K-means clustering on the full dataset.</li>
</ol>
<p>K-means clustering requires the clusters to be linearly separable. If this is not the case, it is necessary to perform <a href="https://PlayfulTechnology.co.uk/data-reduction.html">Kernel PCA</a> to map the dataset into a space where they are.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
<td><a href="https://PlayfulTechnology.co.uk/collaborative-fitering.html">Collaborative Filtering</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Hierarchical Clustering2024-02-15T00:00:00+00:002024-02-15T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-15:/hierarchical-clustering.html<p>Clustering data into trees of related items</p><p>When exploring a dataset, it is often useful to identify what groups or <em>clusters</em> of items may exist within the data. This is known as <em>unsupervised</em> learning, since it attempts to learn what classes exist within the data without prior knowledge of what they are, as opposed to <em>supervised learning</em> (classification), which trains a model to identify known classes in the dataset.</p>
<p>A simple method for this is <em>Hierarchical Clustering</em>. This arranges the datapoints in a tree structure by the following method.</p>
<ol>
<li>Assign each data point to a <em>leaf node</em></li>
<li>Calculate the distances between the nodes using an appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a></li>
<li>Create the <em>parent node</em> of the two nodes that are closest to each other, and replace those two <em>daughter nodes</em> with it.</li>
<li>Calculate the distances of the new node to each of the remaining nodes in the dataset</li>
<li>Repeat from step 3 until all the nodes have been merged into a single tree.</li>
</ol>
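<p>The procedure above can be sketched in plain Python. This minimal example works on 1-D data and merges clusters by the minimum pairwise distance (single linkage, discussed below); the function name is my own.</p>

```python
def agglomerate(points, n_clusters):
    """Repeatedly merge the closest pair of clusters (single linkage, 1-D)."""
    clusters = [[p] for p in points]                      # step 1: leaf nodes
    dist = lambda c1, c2: min(abs(a - b) for a in c1 for b in c2)
    while len(clusters) > n_clusters:
        # steps 2-3: find the closest pair of clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        # replace the pair with their parent node
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters
```

<p>Stopping when a target number of clusters is reached, as here, is one of the early-termination strategies discussed later in the article.</p>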
<p>At stage 4, there are a number of different <em>linkage methods</em> for calculating the new distances. The main ones are</p>
<dl>
<dt>Single linkage</dt>
<dd>use the minimum distance between two points in each cluster</dd>
<dt>Complete (or maximum) linkage</dt>
<dd>use the maximum distance between two points in each cluster</dd>
<dt>Average linkage</dt>
<dd>use the average distance between points in the two clusters. With Euclidean distances, this can be simplified to the distance between the centroids of the clusters</dd>
<dt>Ward linkage</dt>
<dd>calculate distances between clusters recursively with the formula
<div class="math">$$d(u,v) = \sqrt{ \frac{\left(n_{v} + n_{s}\right) d(s,v)^{2} + \left(n_{v} + n_{t}\right) d(t,v)^{2} - n_{v} d(s,t)^{2}}{n_{s} + n_{t} + n_{v}}}$$</div>
where <span class="math">\(u\)</span> is the cluster formed by merging <span class="math">\(s\)</span> and <span class="math">\(t\)</span>, <span class="math">\(v\)</span> is another cluster, and <span class="math">\(n_{c}\)</span> is the number of datapoints in cluster <span class="math">\(c\)</span>. The distance between two leaf nodes is Euclidean. This has the property of minimising the variance of the new cluster.</dd>
</dl>
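<p>The Ward update formula translates directly into code. A minimal sketch (the function name is my own):</p>

```python
import math

def ward_distance(d_sv, d_tv, d_st, n_s, n_t, n_v):
    """Distance between the merged cluster u = s + t and another cluster v,
    computed from the pairwise distances and the cluster sizes."""
    return math.sqrt(
        ((n_v + n_s) * d_sv ** 2 + (n_v + n_t) * d_tv ** 2
         - n_v * d_st ** 2) / (n_s + n_t + n_v)
    )
```

<p>For example, merging the singleton clusters {0} and {2} on a line gives a cluster whose Ward distance to the singleton {1} is zero, reflecting that the merged centroid coincides with that point.</p>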
<p>Ward linkage is the technique most likely to give even cluster sizes, while single linkage is the one most useful when cluster shapes are likely to be irregular.</p>
<p>One problem with Hierarchical Clustering is that, as described above, it does not produce discrete clusters. One way to address this is to choose a number of clusters in advance and terminate the clustering early when that number is reached. The number of clusters may be chosen by visualising the data, either with <a href="https://PlayfulTechnology.co.uk/data-reduction.html">t-SNE</a> or by performing an initial clustering and plotting the tree structure on a dendrogram. Another method is to choose a distance threshold, and not merge clusters further apart than this; it would be necessary to know the statistical distribution of distances between clusters to choose an appropriate threshold. While I have not seen this implemented, it would be theoretically possible to use the <a href="https://PlayfulTechnology.co.uk/information-theory.html">Bayesian Information Criterion</a> to decide when to separate clusters - this approach would be most useful when Ward linkage was used.</p>
<p>In <a href="https://PlayfulTechnology.co.uk/clustering-proteins-in-breast-cancer-patients.html">Clustering Proteins in Breast Cancer Patients</a> I used Hierarchical Clustering to identify groups of proteins whose activity was related in patients.</p>
<p>Implementations of Hierarchical Clustering can be found in <a href="https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html">Scipy</a> and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html">Scikit-Learn</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
<td><a href="https://PlayfulTechnology.co.uk/k-means-clustering.html">K-Means Clustering</a></td>
</tr>
</tbody>
</table>
Outlier Detection2024-02-08T00:00:00+00:002024-02-08T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-08:/outlier-detection.html<p>Finding the Odd One Out</p><p>Many datasets contain <em>outliers</em>, datapoints which do not fit the general pattern of the observations. This may be due to errors in the data collection, in which case removing these datapoints will make models fitted to the data more robust and reduce the risk of overfitting. In other cases, the outliers themselves are the signal we want to detect.</p>
<p>One method for doing this is <em>Isolation Forests</em>. As the name implies, it is related to the <a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forest</a> algorithm discussed in the previous article. It fits a forest of (usually around 100) random decision trees to the dataset by the following method.</p>
<ol>
<li>Pick a feature at random</li>
<li>Pick a random threshold in the range of that feature</li>
<li>Partition the data at that threshold</li>
<li>Repeat the process for each partition</li>
</ol>
<p>We can then calculate an anomaly score for each datapoint. This is the depth in the decision tree at which a datapoint becomes isolated from the rest of the dataset. The mean of this score over all the trees gives a robust estimator of how easily a datapoint can be separated from the rest. The advantages of this method are that it makes no assumptions about the underlying distribution of the data, and that it is explainable, in that the features which are most likely to contribute to a datapoint being isolated can be identified from the decision trees.</p>
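<p>The idea can be illustrated with a toy 1-D version: split at a random threshold, keep the side containing the point, and count how many splits are needed to isolate it. This is a deliberately simplified sketch, not the full Isolation Forest algorithm (which grows trees over multi-dimensional subsamples and normalises the score); all names here are my own.</p>

```python
import random

def isolation_depth(x, data, rng, depth=0):
    """Number of random splits needed to separate x from the rest of data."""
    if len(data) <= 1 or min(data) == max(data):
        return depth
    t = rng.uniform(min(data), max(data))
    same_side = [v for v in data if (v < t) == (x < t)]
    return isolation_depth(x, same_side, rng, depth + 1)

def mean_isolation_depth(x, data, n_trees=100, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, data, rng) for _ in range(n_trees)) / n_trees
```

<p>An outlier far from the bulk of the data is typically isolated by the first split or two, so its mean depth is much lower than that of a point in the middle of the distribution.</p>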
<p>I used Isolation Forests in my work at <a href="https://PlayfulTechnology.co.uk/amey-strategic-consulting.html">Amey Strategic Consulting</a> to identify faulty traffic flow sensors in the Strategic Road Network.</p>
<p>Another method that makes no assumptions about the underlying distribution is <em>Local Outlier Factors</em>. This calculates how different datapoints are from their local neighbourhood. First we calculate the distances <span class="math">\(S_{i,j}\)</span> between datapoints in the sample using an appropriate <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">metric</a> (this requires all variables to be appropriately scaled) and identify the <span class="math">\(N\)</span> nearest neighbours of each datapoint (usually 20). We then calculate the <em>local density</em> <span class="math">\(D_{i}\)</span> for each datapoint. This is the inverse of the mean of the distances between the point and each of its neighbours.
</p>
<div class="math">$$D_{i} = \frac{N}{\sum_{k} S_{i,k}}$$</div>
<p> where <span class="math">\(k\)</span> ranges over the indices of the point's neighbours. We can then calculate the <em>Local Outlier Factor</em> <span class="math">\(\mathrm{LOF}\)</span> for each datapoint. This is the mean of the ratio between the datapoint's local density and that of each of its neighbours, <em>ie</em>
</p>
<div class="math">$$\mathrm{LOF} = \frac{\sum_{k} \frac{D_{i}}{D_{k}}}{N}$$</div>
<p>Samples whose Local Outlier Factor is below a given threshold (<em>ie</em> those whose local density is lower than that of their neighbours) can be identified as outliers.</p>
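<p>Putting the two formulas together gives a short sketch for 1-D data. The function name, the tiny dataset and the choice of two neighbours are assumptions for the example, and it follows the article's convention above, where a low factor indicates an outlier (some presentations use the reciprocal ratio).</p>

```python
def local_outlier_factors(points, n_neighbors=2):
    """LOF of each point: D_i = N / (sum of distances to the N nearest
    neighbours); LOF_i = mean over neighbours k of D_i / D_k."""
    def neighbours(i):
        others = sorted((abs(points[i] - points[j]), j)
                        for j in range(len(points)) if j != i)
        return [j for _, j in others[:n_neighbors]]

    density = [n_neighbors / sum(abs(points[i] - points[j])
                                 for j in neighbours(i))
               for i in range(len(points))]
    return [sum(density[i] / density[k] for k in neighbours(i)) / n_neighbors
            for i in range(len(points))]
```

<p>The isolated point's local density is far lower than that of its neighbours, so its factor is close to zero while the clustered points score near one.</p>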
<p>If we can assume that the data are drawn from a multivariate Gaussian distribution, we can use an <em>Elliptic Envelope</em> method. For a sample of size <span class="math">\(N\)</span> with <span class="math">\(d\)</span> dimensions, we choose a subsample size <span class="math">\(h\)</span> such that
</p>
<div class="math">$$\left\lfloor \frac{N+d+1}{2} \right\rfloor \leq h \leq N$$</div>
<p>
We then select a large number of subsamples of size <span class="math">\(h\)</span> from the dataset, and calculate the mean and covariance of each. The one where the covariance has the smallest determinant is the one least likely to contain outliers. Datapoints with a large Mahalanobis distance from the mean of this sample are therefore likely to be outliers.</p>
<p>Of these methods, I'd expect Isolation Forests to be the one most likely to be useful in the widest variety of circumstances.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
<td><a href="https://PlayfulTechnology.co.uk/hierarchical-clustering.html">Hierarchical Clustering</a></td>
</tr>
</tbody>
</table>
Random Forests2024-02-01T00:00:00+00:002024-02-01T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-02-01:/random-forests.html<p>Classification and regression with ensembles of decision trees.</p><p>We have mentioned classification problems in a number of previous articles, and shown how they can be approached with <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>, <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a> and, by extension, neural networks. This week we'll examine a different method, based on <em>Decision Trees</em>.</p>
<p>A decision tree can be thought of as a set of nested if/else statements. It can be fitted by the following procedure.</p>
<ol>
<li>Find the variable that correlates most strongly with the target variable.</li>
<li>Find the set of thresholds against that variable that comes closest to splitting the data into nodes that correspond to the target classes.</li>
<li>Repeat this for each of the nodes you have split the data into, until each <em>leaf node</em> contains a single class.</li>
</ol>
<p>However, this is prone to <em>overfitting</em>, whereby the model fits every detail of the training data but does not generalise well when classifying new data. In effect, it fits the noise as well as the signal.</p>
<p><em>Random Forests</em> is an algorithm that addresses this problem. As the word forest implies, it fits a large number (typically 100) of decision trees to the training data. Each, however, is trained only on a subset of the training data and with a subset of the variables. These subsets are chosen randomly for each tree in the forest.</p>
<p>While each individual tree in the forest will tend to overfit, the fact that they were all fit against different subsets of the data and variables will mean that the errors they make on new data will not be correlated. Therefore a majority vote of the trees provides a much more robust classifier than any individual tree would. It is also possible to take account of uncertainty in the classification by reporting the number of individual trees that voted for each class - in Bayesian terms, this corresponds to <span class="math">\(P(H \mid O)\)</span>. Algorithms that combine the results of multiple classifiers in this way are known as <em>ensemble methods</em>.</p>
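<p>The voting scheme can be sketched with one-split "stump" trees, each fitted on a bootstrap sample with a randomly chosen feature and threshold. This is a deliberately minimal illustration, not the full algorithm (real Random Forests grow deep trees and choose split points greedily); all names are my own.</p>

```python
import random
from collections import Counter

def fit_stump(X, y, rng):
    """A one-split tree: random feature, random threshold, majority label
    on each side of the split."""
    f = rng.randrange(len(X[0]))
    t = rng.uniform(min(r[f] for r in X), max(r[f] for r in X))
    default = Counter(y).most_common(1)[0][0]
    def majority(side):
        labels = [yi for r, yi in zip(X, y) if (r[f] < t) == side]
        return Counter(labels).most_common(1)[0][0] if labels else default
    left, right = majority(True), majority(False)
    return lambda r: left if r[f] < t else right

def forest_predict(X, y, query, n_trees=50, seed=0):
    """Majority vote over stumps, each trained on a bootstrap sample."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], rng)
        votes.append(stump(query))
    return Counter(votes).most_common(1)[0][0]
```

<p>Any single stump here misclassifies some queries, but the majority vote over fifty of them is reliable on well-separated data, which is the ensemble effect described above.</p>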
<p>If the target variable is continuous, Random Forests can also be used for regression. In this case, the fitting of the decision trees terminates when the variance of the samples in each leaf node fall below a certain threshold. The prediction is then the mean of the predictions from the individual trees.</p>
<p>Random Forests tend to give better results than Logistic Regression when the target classes are unbalanced, and the algorithm is noted for having a high success rate in <a href="https://kaggle.com">Kaggle</a> competitions. In <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">The Grammar of Truth and Lies</a> I found it gave good results in using grammatical features to classify Fake News.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
<td><a href="https://PlayfulTechnology.co.uk/outlier-detection.html">Outlier Detection</a></td>
</tr>
</tbody>
</table>
Information Theory2024-01-25T00:00:00+00:002024-01-25T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-25:/information-theory.html<p>How much information does your data contain?</p><p>Data science can be described as turning data into information. However, we need to know how much information there is to find and where to find it. There are various methods we can use to measure this, which derive from the field of <em>Information Theory</em>.</p>
<p>The most basic of these measurements is <em>entropy</em>, which was introduced by Claude Shannon. If a variable has a probability distribution <span class="math">\(p_{i}\)</span>, the entropy of that variable is given by
</p>
<div class="math">$$H = -\sum_{i} p_{i} \log_{2}p_{i}$$</div>
<p>
This is the expected number of binary decisions needed to identify a value of the variable, or, if we were to generate a stream of symbols from that distribution, the average number of bits per symbol that would be needed to encode that stream in an optimal lossless compression.
This is useful for identifying which variables are most important. Entropy has its maximum value of <span class="math">\(\log_{2} N\)</span>, where <span class="math">\(N\)</span> is the number of possible values, when the values are evenly distributed, and its minimum value of 0 when one value is a certainty.</p>
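<p>Entropy is one line of code. A minimal sketch:</p>

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

<p>A fair coin gives one bit, a certain outcome gives zero, and a uniform choice among four values gives two bits, matching the <span class="math">\(\log_{2} N\)</span> maximum.</p>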
<p>We also need to quantify how much information is contained in the relationship between two variables. Suppose that two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> have individual probability distributions <span class="math">\(p_{i}\)</span> and <span class="math">\(p_{j}\)</span>, and a joint probability distribution <span class="math">\(p_{ij}\)</span>. If the variables are statistically independent, these distributions would satisfy the relationship <span class="math">\(p_{ij} = p_{i} p_{j}\)</span>. <em>Mutual information</em> characterises the deviation from this as
</p>
<div class="math">$$\mathrm{MI}(A,B) = \sum_{i} \sum_{j} p_{ij} \log_{2} \frac{p_{ij}}{p_{i} p_{j}}$$</div>
<p>
This is the amount of information that knowing the value of one variable will tell you about the other. This can be used for feature selection. Consider two variables <span class="math">\(A\)</span> and <span class="math">\(B\)</span> and a target variable <span class="math">\(T\)</span>. If <span class="math">\(\textrm{MI}(A,T) > \textrm{MI}(B,T)\)</span> and <span class="math">\(\textrm{MI}(A,B) > \textrm{MI}(B,T)\)</span>, it is likely that any relationship between <span class="math">\(B\)</span> and <span class="math">\(T\)</span> is entirely a consequence of their mutual relationship with <span class="math">\(A\)</span>. Therefore, <span class="math">\(B\)</span> can safely be discarded.</p>
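<p>Mutual information follows the same pattern, given the joint distribution as a table (here a list of lists; the marginals are recovered by summing rows and columns):</p>

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a joint probability table p[i][j]."""
    p_a = [sum(row) for row in joint]            # marginal distribution of A
    p_b = [sum(col) for col in zip(*joint)]      # marginal distribution of B
    return sum(
        p * math.log2(p / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
```

<p>Two independent fair coins give zero bits of mutual information; two perfectly correlated coins give one bit, since knowing one determines the other.</p>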
<p>In <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It A Mushroom or Is It A Toadstool</a> I used mutual information to infer hidden variables when building a <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayesian Belief Network</a>.</p>
<p>There are a number of information-theory-based methods for selecting models. The best known of these, which are closely related to each other, are the <em>Bayesian Information Criterion</em></p>
<div class="math">$$\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$$</div>
<p> and the <em>Akaike Information Criterion</em></p>
<div class="math">$$\mathrm{AIC} = 2 ( k- \ln \hat{L} )$$</div>
<p>where <span class="math">\(k\)</span> is the number of free parameters in the model, <span class="math">\(n\)</span> is the number of data points to which the model is fitted, and <span class="math">\(\hat{L}\)</span> is the likelihood of the data under the optimally fitted model. In both cases, a lower value indicates a better model, favouring models that give a high likelihood of the data and penalising more complex models. The main difference between them is that the Bayesian Information Criterion penalises complexity more heavily, especially for larger datasets.</p>
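<p>Both criteria are trivial to compute once a model has been fitted; a sketch, with hypothetical log-likelihood values in the usage below:</p>

```python
import math

def bic(k, n, log_likelihood):
    """Bayesian Information Criterion: k ln n - 2 ln L-hat."""
    return k * math.log(n) - 2 * log_likelihood

def aic(k, log_likelihood):
    """Akaike Information Criterion: 2 (k - ln L-hat)."""
    return 2 * (k - log_likelihood)
```

<p>For example, on 100 datapoints a 2-parameter model with log-likelihood -50 beats a 10-parameter model with log-likelihood -48 under BIC, since the small gain in fit does not justify the extra parameters.</p>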
<p>There are many other uses for information theory in data science, but I'd like to finish with one relevant to natural language processing. Marcello Montemurro and Damian Zanette published a paper entitled <a href="https://arxiv.org/abs/0907.1558">Towards the quantification of semantic information in written language</a> in which they introduced a technique for using the entropy of word frequency distributions across different parts of a document to identify the most significant words, according to the role they play in its structure. I illustrate this in <a href="https://PlayfulTechnology.co.uk/the-entropy-of-alice-in-wonderland.html">The Entropy of Alice in Wonderland</a>.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
<td><a href="https://PlayfulTechnology.co.uk/random-forests.html">Random Forests</a></td>
</tr>
</tbody>
</table>
Latent Semantic Indexing2024-01-18T00:00:00+00:002024-01-18T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-18:/latent-semantic-indexing.html<p>Reducing the dimensionality of language data</p><p>In the article on <a href="https://PlayfulTechnology.co.uk/data-reduction.html">data reduction</a>, we mentioned the <em>curse of dimensionality</em>, whereby large numbers of features make data increasingly difficult to analyse meaningfully. If we take another look at <a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a>, we see that this will generate a feature for each unique word in the corpus that it is trained on, which may be in the tens of thousands. It therefore makes sense to apply a data reduction method and obtain a more compact representation.</p>
<p>TF-IDF, as previously discussed, makes use of the fact that words that occur in some documents but not others are the most useful for distinguishing between the documents. This means that its feature vectors will generally be quite sparse. Therefore, the most appropriate data reduction method to use will be Singular Value Decomposition.</p>
<div class="math">$$\mathbf{TFIDF} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>Typically around 200 components are retained. The left singular vectors <span class="math">\(\mathbf{U}\)</span> then represent documents in the lower-dimensional space, while the right singular vectors <span class="math">\(\mathbf{V}\)</span> represent words in the same space. Words that tend to appear in the same documents will have similar vector representations, and according to the <em>distributional hypothesis</em>, this gives an implicit representation of their meaning. This implicit representation of meaning gives the technique the name <em>Latent Semantic Analysis</em>.</p>
<p>Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span>, we can calculate a query vector
</p>
<div class="math">$$\vec{q} = \sum_{i}\mathbf{V}_{w_{i}}$$</div>
<p>
We can then search our corpus for the most relevant documents to match the query by calculating a score
</p>
<div class="math">$$S = \mathbf{U} \cdot \vec{q}$$</div>
<p> and selecting the documents with the greatest score. Since it can be used to search the corpus in this way, Latent Semantic Analysis is also known as <em>Latent Semantic Indexing</em>.</p>
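The whole pipeline sketched above can be illustrated with NumPy alone. The matrix below is a toy stand-in for a real TF-IDF matrix, and the query's term indices are purely illustrative:

```python
import numpy as np

# Toy stand-in for a TF-IDF matrix: 4 documents x 6 terms.
# Documents 0-1 use terms 0-1 heavily; documents 2-3 use terms 2-3.
tfidf = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1, 0.0],
    [0.7, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.7, 0.1, 0.0],
])

# Truncated SVD: keep only the m largest singular values.
m = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
U, V = U[:, :m], Vt[:m].T   # rows of V are term vectors in the reduced space

# Query vector: sum the term vectors for the words in the query (terms 2 and 3).
q = V[2] + V[3]

# Score every document against the query and rank by decreasing score.
scores = U @ q
ranking = np.argsort(scores)[::-1]
```

With this toy matrix, the two documents that actually use terms 2 and 3 come out on top of the ranking.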
<p>An implementation of Latent Semantic Indexing (LSI) can be found in the <a href="https://radimrehurek.com/gensim/models/lsimodel.html">Gensim</a> library, along with several other <em>topic models</em>, which similarly attempt to use the distributional hypothesis to characterise documents.</p>
<p>While LSI can account for different words having similar meanings, it is still a bag of words model and cannot account for the same word having different meanings depending on context. In my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> I attempted to address this issue by building an NLP pipeline that enriched the documents with Named Entity Recognition and Word Sense Disambiguation before applying LSI; modern transformer models address it by calculating contextual word vectors. LSI can, however, be seen as a distant ancestor of these models.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
<td><a href="https://PlayfulTechnology.co.uk/information-theory.html">Information Theory</a></td>
</tr>
</tbody>
</table>
Data Reduction2024-01-11T00:00:00+00:002024-01-11T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-11:/data-reduction.html<p>Mapping data to lower dimensions</p><p>Datasets that involve a large number of features suffer from <em>The Curse of Dimensionality</em>, where, as the number of features increases, it becomes harder and harder to use them to define a meaningful <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">measure of distance</a> between the samples. It becomes necessary to map the data into a smaller number of dimensions. To do this, we need to find mathematical relationships between the features that can be used to form a more economical representation of the data.</p>
<p>The most common way of doing this is <em>Principal Component Analysis</em> (PCA), which captures linear relationships between features. This starts by calculating the covariance of the features</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i} (\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p>where <span class="math">\(\vec{x_{i}}\)</span> is a sample, <span class="math">\(\bar{\vec{x}}\)</span> is the mean of the samples, and <span class="math">\(N\)</span> is the number of samples. We then calculate the eigenvalues and eigenvectors of this matrix. Each eigenvalue quantifies how much of the variance of the data the associated eigenvector explains. Hopefully, the eigenvectors with the largest eigenvalues will encode the useful signals in the data while those with the smallest mainly contain noise, which we can filter out. Therefore, if we take the eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues (out of the original <span class="math">\(M\)</span> features), we can use them to form an <span class="math">\(M \times m\)</span> projection matrix <span class="math">\(\mathbf{P}\)</span>. We can then project the data into a lower dimension by calculating
</p>
<div class="math">$$\vec{x_{i}}^{\prime} = (\vec{x_{i}} - \bar{\vec{x}}) \cdot \mathbf{P}$$</div>
<p>We may choose <span class="math">\(m\)</span> by examining a line chart of the eigenvalues in increasing order and looking for an <em>elbow</em> where the slope suddenly increases, or by maximising the amount of variance explained while minimising the number of components retained, as described in <a href="https://PlayfulTechnology.co.uk/how-many-components.html">How Many Components?</a>.</p>
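The PCA procedure described above can be sketched in a few lines of NumPy. The synthetic data here is arbitrary: the third feature is a near-copy of the first, so one direction carries almost no independent variance and can be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples with 3 features; the third is nearly a copy of the first.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)

# Covariance matrix of the (mean-centred) samples.
mean = X.mean(axis=0)
cov = (X - mean).T @ (X - mean) / len(X)

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the eigenvectors with the m largest eigenvalues as the projection P.
m = 2
P = eigvecs[:, -m:]
X_reduced = (X - mean) @ P
print(X_reduced.shape)  # (200, 2)
```

Because the dropped direction corresponds to the near-duplicate feature, almost all of the variance survives the projection.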
<p>A related technique, <em>Independent component analysis</em>, seeks to maximise the statistical independence between the projected components rather than the explained variance. This is often used in signal processing.</p>
<p>This works well when the data is dense, and when the classes we want to find in the data are linearly separable. When the data is sparse, we instead use a technique called <em>Singular Value Decomposition</em>. Given an <span class="math">\(N \times M\)</span> matrix <span class="math">\(\mathbf{X}\)</span>, we decompose it into an <span class="math">\(N \times m\)</span> matrix <span class="math">\(\mathbf{U}\)</span>, an <span class="math">\(m \times m\)</span> matrix <span class="math">\(\mathbf{\Sigma}\)</span> and an <span class="math">\(M \times m\)</span> matrix <span class="math">\(\mathbf{V}\)</span> such that</p>
<div class="math">$$\mathbf{X} \approx \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^{T}$$</div>
<p>These have the additional properties that <span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are <em>unitary matrices</em>, that is </p>
<div class="math">$$\mathbf{U} \cdot \mathbf{U}^{T} = \mathbf{I}$$</div>
<p> and </p>
<div class="math">$$\mathbf{V} \cdot \mathbf{V}^{T} = \mathbf{I}$$</div>
<p>
The matrix <span class="math">\(\mathbf{\Sigma}\)</span> is zero everywhere except along its leading diagonal. The values along the leading diagonal are known as <em>singular values</em>, and act like the eigenvalues in principal component analysis. For a full singular value decomposition, <span class="math">\(m=M\)</span> and the product of the matrices is exactly equal to <span class="math">\(\mathbf{X}\)</span>, but for data reduction we use truncated singular value decomposition, using only the largest <span class="math">\(m\)</span> singular values.</p>
<p><span class="math">\(\mathbf{U}\)</span> and <span class="math">\(\mathbf{V}\)</span> are the left and right singular vectors, and represent the mapping of the datapoints and the features into the lower-dimensional space respectively.</p>
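Truncated SVD as described above can be sketched with NumPy. The random matrix and the choice of m are arbitrary; the point is that keeping all singular values reproduces the matrix exactly, while truncation gives an approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))   # an N x M data matrix

# full_matrices=False gives the thin factorisation: U (N x M), s (M,), Vt (M x M).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to the m largest singular values for data reduction.
m = 3
X_approx = U[:, :m] @ np.diag(s[:m]) @ Vt[:m]

# With all M singular values retained, the decomposition is exact.
exact = U @ np.diag(s) @ Vt
print(np.allclose(X, exact))  # True
```

NumPy returns the singular values in decreasing order, so truncating to the first m entries keeps the largest ones, as required.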
<p>In the case where the classes are not linearly separable, we need to capture non-linear relationships between the features. The simplest way of doing this is <em>kernel PCA</em>. This relies on the fact that there is normally a way to project the data into a higher-dimensional space so that it becomes linearly separable. To illustrate this, consider a set of concentric circles in a plane. If we add the distance from the centre as a third dimension, the circles appear as separate layers.</p>
<p>But wait. Why are we projecting into a higher-dimensional space when we want to reduce the number of dimensions? Well, we don't actually do this. Instead, we define a <em>kernel function</em> <span class="math">\(f(\vec{x},\vec{y})\)</span>, which corresponds to the inner product between the images of two points <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> in the higher-dimensional space. We then obtain the <span class="math">\(N \times N\)</span> matrix</p>
<div class="math">$$\mathbf{F}_{i,j} = f(\vec{x_{i}},\vec{x_{j}})$$</div>
<p>We then obtain the eigenvalues and eigenvectors of this matrix. The eigenvectors corresponding to the <span class="math">\(m\)</span> largest eigenvalues form an <span class="math">\(N \times m\)</span> matrix whose rows correspond to the vectors we would obtain if we carried out PCA in the higher-dimensional space. Unfortunately, for a large dataset, this is more computationally intensive than standard PCA.</p>
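A minimal kernel-PCA sketch for the concentric-circles example, assuming an RBF kernel with an arbitrary bandwidth. One detail glossed over above is that in practice the kernel matrix is centred before the eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two concentric circles: not linearly separable in the plane.
theta = rng.uniform(0, 2 * np.pi, size=100)
radius = np.repeat([1.0, 3.0], 50)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]

# Kernel matrix F[i, j] = f(x_i, x_j); the bandwidth 2.0 is an arbitrary choice.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Centre the kernel matrix (standard kernel-PCA step).
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# Eigenvectors for the m largest eigenvalues give the projected points.
eigvals, eigvecs = np.linalg.eigh(Kc)
m = 2
X_kpca = eigvecs[:, -m:] * np.sqrt(np.abs(eigvals[-m:]))
```

Scikit-Learn's `KernelPCA` packages the same idea with a choice of standard kernels, avoiding the need to build the kernel matrix by hand.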
<p>There are a number of other techniques for using non-linear relationships in data reduction, collectively known as <em>manifold learning</em>, but this article would get a bit too long if we tried to cover them all. However, one that is of particular interest is <em>t-distributed Stochastic Neighbour Embedding</em> (t-SNE). This tries to map datapoints to a lower dimension so that the statistical distribution of distances between points in the lower dimension is similar to that in the higher dimension. It is sensitive to the local structure of the data, and so is useful for exploratory visualisations.</p>
<p>I used several of these techniques in my work at <a href="https://PlayfulTechnology.co.uk/pentland-brands.html">Pentland Brands</a>. Implementations can be found in Scikit-Learn's <a href="https://scikit-learn.org/stable/modules/decomposition.html">decomposition</a> and <a href="https://scikit-learn.org/stable/modules/manifold.html">manifold</a> modules.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
<td><a href="https://PlayfulTechnology.co.uk/latent-semantic-indexing.html">Latent Semantic Indexing</a></td>
</tr>
</tbody>
</table>
TF-IDF2024-01-04T00:00:00+00:002024-01-04T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2024-01-04:/tf-idf.html<p>Characterising documents by their most important words</p><p>In the post on <a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a>, we mentioned that Levenshtein distance is only suitable for comparing short strings. One reason for this, as previously discussed, is computational complexity, but another is that by comparing <em>characters</em>, it says nothing about the <em>meaning</em> of what it compares.</p>
<p>So, what can we do if we want to compare large documents in a meaningful way? One thing we could do is compare word frequencies. Of course, we need to take the overall length of the document into account, so we define the <em>Term Frequency</em></p>
<div class="math">$$\mathrm{TF}_{w} = \frac{n_{w}}{\sum_{i} n_{i}}$$</div>
<p> where <span class="math">\(n_{w}\)</span> is the number of times word <span class="math">\(w\)</span> occurs in the document. Using this, we could compute a Euclidean distance or cosine similarity between two documents. </p>
<p>However, not all words are equally important. If we are talking about <em>an algorithm</em>, we can easily see that the content word <em>algorithm</em> is more important than the function word <em>an</em>. Given a corpus of <span class="math">\(D\)</span> documents, of which <span class="math">\(D_{w}\)</span> contain word <span class="math">\(w\)</span>, we then define the <em>Inverse Document Frequency</em></p>
<div class="math">$$\mathrm{IDF}_{w} = \log \frac{D}{D_{w}+1}$$</div>
<p>Adding 1 to the denominator ensures that we never divide by zero. You may wonder why this is necessary, since no word in the corpus can occur in zero documents. However, if we are continually adding documents to our corpus, it would be a major expense to recalculate the statistics for all previous documents whenever a new one introduced new vocabulary. To avoid that, we might want to use a fixed dictionary provided in advance, in which case some dictionary words may not occur in the corpus at all. If, on the other hand, our corpus is fixed and we know that every word occurs in at least one document, we can use <span class="math">\(D_{w}\)</span> alone as the denominator.</p>
<p>This measures the ability of a word to discriminate between documents in the corpus. For a document <span class="math">\(d\)</span> and a word <span class="math">\(w\)</span> we can then combine these two measures to define <em>TF-IDF</em> as</p>
<div class="math">$$\mathrm{TFIDF}_{w,d} = \mathrm{TF}_{w,d} \mathrm{IDF}_{w} = \frac{n_{w,d}}{\sum_{i} n_{i,d}} \log \frac{D}{D_{w}+1}$$</div>
<p>
which measures the importance of the word in the document weighted by its importance in the corpus. A word that occurs frequently in a few documents but is absent in many will be important for identifying those documents.</p>
<p>One way we can use TF-IDF is to search a corpus of documents. Given a query <span class="math">\(Q = w_{1}w_{2}\ldots w_{n}\)</span> we can calculate a score for a document <span class="math">\(d\)</span></p>
<div class="math">$$S_{d} = \sum_{i} \mathrm{TFIDF}_{w_{i},d}$$</div>
<p> and retrieve the documents with the highest scores.</p>
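The whole scheme fits in a few lines of standard-library Python. The three documents and the query below are toy examples:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

D = len(docs)
counts = [Counter(d) for d in docs]                 # term counts per document
doc_freq = Counter(w for c in counts for w in c)    # D_w for each word

def tfidf(word, d):
    tf = counts[d][word] / sum(counts[d].values())
    idf = math.log(D / (doc_freq[word] + 1))
    return tf * idf

# Score each document against a query by summing TF-IDF values.
query = ["cat", "mat"]
scores = [sum(tfidf(w, d) for w in query) for d in range(D)]
best = max(range(D), key=lambda d: scores[d])
print(best)  # 0 - the only document containing the rarer word "mat"
```

Note how "cat" contributes nothing here: it occurs in two of the three documents, so with the +1 smoothing its IDF is log(3/3) = 0, and only the rarer word "mat" discriminates.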
<p>TF-IDF is an example of a <em>bag of words</em> model - one based entirely on word frequencies that takes no account of grammar or context. An implementation (to which I have contributed a bug fix) can be found in the <a href="https://radimrehurek.com/gensim/">Gensim</a> topic modelling library.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
<td><a href="https://PlayfulTechnology.co.uk/data-reduction.html">Data Reduction</a></td>
</tr>
</tbody>
</table>
Similarity and Distance Metrics2023-12-28T00:00:00+00:002023-12-28T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-28:/similarity-and-distance-metrics.html<p>Methods for comparing data</p><p>Data scientists often need to compare data points. This is necessary for indexing data, for finding clusters in datasets, for detecting outliers and anomalies, for comparing user behaviour in recommendation systems, and for measuring quality of fit when predicting continuous variables. There are various metrics that can be used for this purpose.</p>
<p>One of the most frequently used metrics is <em>Euclidean distance</em>. For two vectors <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span>, this is given by
</p>
<div class="math">$$S = |\vec{x} - \vec{y}| \\
= \sqrt{\sum_{i} (x_{i} - y_{i})^2}$$</div>
<p>
This is analogous to distances in physical space. It is useful when the overall scale of the data is important, and has the property that <em>smaller is better</em>.</p>
<p>When we wish to take the overall scale of the data out of consideration, it is common to use <em>cosine similarity</em> </p>
<div class="math">$$C = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}||\vec{y}|}$$</div>
<p>
This represents the cosine of the angle between the two vectors, measured from the origin. It has a range of -1 to +1 and <em>bigger is better</em>. (If all the components of the vectors are positive, the range is from 0 to 1.) A variation on this is the <em>Pearson correlation</em>
</p>
<div class="math">$$P = \frac{(\vec{x} - \bar{x}) \cdot (\vec{y} - \bar{y})}{|\vec{x} - \bar{x}||\vec{y} - \bar{y}|}$$</div>
<p> where <span class="math">\(\bar{x}\)</span> and <span class="math">\(\bar{y}\)</span> are the means of the components of <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{y}\)</span> respectively. This measures the degree to which the components of the two vectors are linearly correlated with each other.</p>
<p>These metrics are all most useful when the ranges of all the components are similar. Otherwise, the effects of the components with the largest ranges will tend to dominate over those with smaller ranges. The usual remedy for this is to scale the components as </p>
<div class="math">$$\vec{x^{\prime}} = \frac{\vec{x} - \bar{\vec{x}}}{\vec{\sigma}}$$</div>
<p> where <span class="math">\(\bar{\vec{x}}\)</span> and <span class="math">\(\vec{\sigma}\)</span> are the mean and standard deviation of the sample respectively. Another possibility is to use the <em>Mahalanobis distance</em>
</p>
<div class="math">$$M = \sqrt{(\vec{x} - \vec{y}) \cdot \mathbf{\Sigma}^{-1} \cdot (\vec{x} -\vec{y})}$$</div>
<p>
where <span class="math">\(\mathbf{\Sigma}\)</span> is the <em>covariance matrix</em>
</p>
<div class="math">$$\mathbf{\Sigma} = \frac{\sum_{i}(\vec{x_{i}} - \bar{\vec{x}}) \otimes (\vec{x_{i}} - \bar{\vec{x}})}{N}$$</div>
<p> where <span class="math">\(N\)</span> is the number of samples. This not only scales the variables appropriately, but accounts for dependencies between them. It is, however, more computationally expensive, especially for high-dimensional data.</p>
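The three vector metrics discussed so far can be computed directly from their definitions. The sample below is synthetic, with deliberately mismatched feature scales to show why Mahalanobis rescaling matters:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two features on very different scales.
X = rng.normal(size=(500, 2)) * np.array([1.0, 10.0])
x, y = X[0], X[1]

euclidean = np.sqrt(((x - y) ** 2).sum())
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Mahalanobis distance rescales by the inverse covariance of the sample.
mean = X.mean(axis=0)
Sigma = (X - mean).T @ (X - mean) / len(X)
diff = x - y
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
```

With the identity matrix in place of the inverse covariance, the Mahalanobis distance reduces to the Euclidean distance, which is a convenient sanity check.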
<p>Sometimes we wish to compare data that is not readily described as vectors. Suppose that we wish to compare two users of a social network in terms of which links they have shared. We might consider the links shared by each user as a set of unique items. To compare these sets, we can use the <em>Tanimoto metric</em>
</p>
<div class="math">$$T = \frac{|A \cap B|}{|A \cup B|}$$</div>
<p>that is, the fraction of the links shared by either user that have been shared by both users. This has a range from 0 to 1 and <em>bigger is better</em>.</p>
<p>If we wish to compare two short strings (as, for example, in a spellchecking application), the usual method is the <em>Levenshtein distance</em>. This is the number of insertions, deletions or substitutions needed to transform one string into the other. If we consider the strings <span class="math">\(X\)</span> and <span class="math">\(Y\)</span> as sequences of characters <span class="math">\(x_{1}x_{2}\ldots x_{m}\)</span> and <span class="math">\(y_{1}y_{2}\ldots y_{n}\)</span> respectively, we can define an <span class="math">\((m+1) \times (n+1)\)</span> matrix <span class="math">\(\mathbf{L}\)</span> as
</p>
<div class="math">$$L_{i,0} = i$$</div>
<p> for <span class="math">\(i\)</span> from 0 to <span class="math">\(m\)</span>
</p>
<div class="math">$$L_{0,j} = j$$</div>
<p> for <span class="math">\(j\)</span> from 0 to <span class="math">\(n\)</span>
</p>
<div class="math">$$L_{i,j} = \min \left(L_{i,j-1}+1,L_{i-1,j}+1,L_{i-1,j-1}+\left\{\begin{array}{ll} 0 & \quad \mathrm{if } x_{i} = y_{j} \\
1 & \quad \mathrm{if } x_{i} \neq y_{j} \end{array} \right.\right)$$</div>
<p>The Levenshtein distance is then <span class="math">\(L_{m,n}\)</span>. While simple to implement and intuitive to understand, this is only really suitable for comparing short strings, as the complexity is <span class="math">\(\mathcal{O}(m \times n)\)</span>.</p>
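The dynamic programme can be sketched directly in Python, with the standard +1 costs for insertions and deletions:

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance between strings x and y."""
    m, n = len(x), len(y)
    # (m+1) x (n+1) table of edit distances between prefixes.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        L[i][0] = i
    for j in range(n + 1):
        L[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            L[i][j] = min(L[i][j - 1] + 1,         # insertion
                          L[i - 1][j] + 1,         # deletion
                          L[i - 1][j - 1] + cost)  # substitution or match
    return L[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```

The two nested loops over the table make the quadratic complexity plain to see.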
<p>A wide variety of distance metrics are implemented in <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">Scipy</a></p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropogation</a></td>
<td><a href="https://PlayfulTechnology.co.uk/tf-idf.html">TF-IDF</a></td>
</tr>
</tbody>
</table>
The Chain Rule and Backpropogation2023-12-21T00:00:00+00:002023-12-21T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-21:/the-chain-rule-and-backpropogation.html<p>Calculating the gradients of complex functions</p><p>In the article about <a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a>, we mentioned that logistic regression and neural networks are fit by minimising a loss function. In order to do this, we need to calculate the gradient of the loss function with respect to the parameters. This tells us how we can adjust the parameters to reduce the loss.</p>
<p>The functions we want to optimise in machine learning problems can usually be expressed as a function of a function. To differentiate such a composition, we use the <em>chain rule</em>
</p>
<div class="math">$$\frac{df(g(x))}{dx} = \frac{df}{dg}\frac{dg}{dx}$$</div>
<p>To illustrate this, let's see how we can use it to differentiate the cross-entropy loss
</p>
<div class="math">$$L = -\ln p_{c}$$</div>
<p> with respect to the weights <span class="math">\(\mathbf{W}\)</span> of a logistic regression model. First we differentiate the loss with respect to the probability of the correct class.
</p>
<div class="math">$$\frac{dL}{dp_{c}} = -\frac{1}{p_{c}}$$</div>
<p>Then we need to differentiate the probability with respect to each of the logits <span class="math">\(q_{i}\)</span>
</p>
<div class="math">$$p_{c} = \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \\
\frac{\partial p_{c}}{\partial q_{i}} = \frac{\delta_{ic} e^{q_{c}} \sum_{j} e^{q_{j}} - e^{q_{c}} e^{q_{i}}}{\left( \sum_{j} e^{q_{j}} \right)^{2}} \\
= \frac{e^{q_{c}}}{\sum_{j} e^{q_{j}}} \frac{\delta_{ic} \sum_{j} e^{q_{j}} - e^{q_{i}}}{\sum_{j} e^{q_{j}}} \\
= p_{c}(\delta_{ic} - p_{i}) $$</div>
<p>
where <span class="math">\(\delta_{ic}\)</span> is the <em>Kronecker delta</em>, which is 1 if <span class="math">\(i=c\)</span> and 0 otherwise.
(As an aside, functions whose derivative can be expressed in terms of their output are commonly used in machine learning, because they make differentiation easier. Such functions are often derived from the exponential function in some way).</p>
<p>Then, we need to differentiate the logits with respect to the weights
</p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec{b} \\
\frac{d \vec{q}}{d \mathbf{W}} = \vec{x}$$</div>
<p>Finally, we can combine these derivatives using the chain rule
</p>
<div class="math">$$\frac{dL}{d\mathbf{W}} = \frac{dL}{dp_{c}}\frac{dp_{c}}{d\vec{q}}\frac{d\vec{q}}{d\mathbf{W}} \\
=-\frac{1}{p_{c}}p_{c}(\vec{\delta}_{c}-\vec{p}) \otimes \vec{x} \\
=(\vec{p} - \vec{\delta}_{c}) \otimes \vec{x}$$</div>
<p> where <span class="math">\(\otimes\)</span> denotes the outer product and <span class="math">\(\vec{\delta}_{c}\)</span> is the one-hot vector whose <span class="math">\(c\)</span>th component is 1.</p>
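The final gradient expression can likewise be verified numerically. A minimal sketch, with illustrative random weights and input, comparing one entry of the outer-product gradient against a finite difference of the loss:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # illustrative weights: 3 classes, 4 features
b = rng.normal(size=3)        # biases
x = rng.normal(size=4)        # one input vector
c = 1                         # correct class

p = softmax(W @ x + b)
grad = np.outer(p - np.eye(3)[c], x)   # (p - one-hot) outer x

# Finite-difference check of a single weight
def loss(W_):
    return -np.log(softmax(W_ @ x + b)[c])

eps = 1e-6
E = np.zeros_like(W)
E[0, 0] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
assert abs(grad[0, 0] - numeric) < 1e-5
```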
<p>For a deeper neural network, we use the fact that each layer <span class="math">\(n\)</span> of the network can be treated as a function </p>
<div class="math">$$\vec{x}_{n+1} = f_{n}(\mathbf{W}_{n} \cdot \vec{x}_{n} + \vec{b}_{n})$$</div>
<p> and apply the chain rule recursively to calculate the gradient of the loss with respect to each layer's weights and biases. This recursive application of the chain rule is known as <em>backpropagation</em>, and is the basis of most neural network optimisation algorithms.</p>
<p>Of course, very few data scientists ever need to do this themselves on a day-to-day basis, because automatic differentiation and backpropagation are provided by machine learning software libraries, but it's still useful to understand how they work.</p>
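As a concrete illustration, here is a minimal sketch of backpropagation through a two-layer network with a tanh hidden layer. The shapes, random values and choice of tanh are all arbitrary illustrative assumptions, not a prescription:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

rng = np.random.default_rng(1)
x = rng.normal(size=4)                          # input vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer parameters
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # output layer parameters
c = 0                                           # correct class

# Forward pass, keeping intermediates for the backward pass
h = np.tanh(W1 @ x + b1)
p = softmax(W2 @ h + b2)

# Backward pass: the chain rule applied layer by layer
dq2 = p - np.eye(3)[c]        # gradient of -ln p_c w.r.t. output logits
dW2 = np.outer(dq2, h)        # gradient w.r.t. output weights
dh = W2.T @ dq2               # propagate back through W2
dq1 = dh * (1 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
dW1 = np.outer(dq1, x)        # gradient w.r.t. hidden weights
```

Each line of the backward pass multiplies by one more factor of the chain rule, working from the loss back towards the input.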
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
<td><a href="https://PlayfulTechnology.co.uk/similarity-and-distance-metrics.html">Similarity and Distance Metrics</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Logistic Regression2023-12-14T00:00:00+00:002023-12-14T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-14:/logistic-regression.html<p>A simple classification algorithm</p><h2>A simple classification algorithm</h2>
<p>Over the past few weeks, we have been looking at algorithms related to <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a>. This week, we are starting on a different tack, but it's still in the realm of relating probabilities to observations. </p>
<p>We start with the <em>logistic function</em>
</p>
<div class="math">$$p = \frac{1}{1+e^{-q}}$$</div>
<p>where <span class="math">\(q\)</span> is a quantity we call a <em>logit</em>. This has the property that as <span class="math">\(q \rightarrow \infty\)</span>, <span class="math">\(p \rightarrow 1\)</span> and as <span class="math">\(q \rightarrow -\infty\)</span>, <span class="math">\(p \rightarrow 0\)</span>, so it can be used to model a probability. If we wish to calculate the probabilities of more than one class, we can generalise this with the <em>softmax function</em>
</p>
<div class="math">$$p_{i} = \frac{e^{q_{i}}}{\sum_{j} e^{q_{j}}}$$</div>
<p> where <span class="math">\(p_{i}\)</span> and <span class="math">\(q_{i}\)</span> represent the probabilities and logits for each class <span class="math">\(i\)</span> respectively.</p>
<p>But what are the logits? In the basic implementation of logistic regression, they are a linear function of some observations. Given a vector <span class="math">\(\vec{x}\)</span> of observations, we may model the logits as </p>
<div class="math">$$q = \vec{w} \cdot \vec{x} + b$$</div>
<p> for the binary case and </p>
<div class="math">$$\vec{q} = \mathbf{W} \cdot \vec{x} + \vec {b}$$</div>
<p> in the multiclass case, where <span class="math">\(\vec{w}\)</span> and <span class="math">\(\mathbf{W}\)</span> are <em>weights</em> and <span class="math">\(b\)</span> and <span class="math">\(\vec{b}\)</span> are <em>biases</em>. In terms of Bayes' Theorem,
</p>
<div class="math">$$\vec{b} = \ln P(H)$$</div>
<p> and </p>
<div class="math">$$\mathbf{W} \cdot \vec{x} = \ln P(\vec{x} \mid H)$$</div>
<p>We fit the weights and biases by minimising the <em>cross-entropy loss</em>
</p>
<div class="math">$$L = -\sum_{j} \ln p_{j,c}$$</div>
<p> where <span class="math">\(c\)</span> is the correct class for the example <span class="math">\(j\)</span> in the training dataset. </p>
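Fitting can be done by gradient descent on this loss; per example, the gradient of the loss with respect to the logits is the vector of predicted probabilities minus the one-hot target. A minimal sketch with NumPy, in which the synthetic two-class data, learning rate and iteration count are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic classes in 2D (illustrative data)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

W = np.zeros((2, 2))  # one row of logit coefficients per class
b = np.zeros(2)       # biases

def softmax_rows(Q):
    """Row-wise softmax over a matrix of logits."""
    E = np.exp(Q - Q.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

for _ in range(500):
    P = softmax_rows(X @ W.T + b)     # predicted probabilities
    G = P - np.eye(2)[y]              # (p - one-hot) for every example
    W -= 0.1 * (G.T @ X) / len(X)     # gradient step on mean cross-entropy
    b -= 0.1 * G.mean(axis=0)

accuracy = (softmax_rows(X @ W.T + b).argmax(axis=1) == y).mean()
```

On data this well separated, the training accuracy approaches 1.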
<p>This works well as a simple classifier under two conditions</p>
<ol>
<li>The classes are fairly evenly balanced</li>
<li>The classes are linearly separable</li>
</ol>
<p>If there is a strong imbalance between the classes, the bias will tend to dominate over the weights, and the rarer classes will never be predicted. To mitigate this, it is possible to undersample the more common classes or oversample the rarer ones before training.</p>
<p>If the classes are not linearly separable, it's necessary to transform the data into a space where they are. This may be done by applying </p>
<div class="math">$$\vec{x^{\prime}} = f(\mathbf{M} \cdot \vec{x})$$</div>
<p> where <span class="math">\(f\)</span> is some non-linear function and <span class="math">\(\mathbf{M}\)</span> is a matrix of weights. We may in fact apply several layers of similar transformations, each with its own set of weight parameters. That is the basis of neural networks.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-chain-rule-and-backpropogation.html">The Chain Rule and Backpropagation</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Markov Chain Monte Carlo2023-12-07T00:00:00+00:002023-12-07T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-12-07:/markov-chain-monte-carlo.html<p>Estimating posterior distributions of continuous variables</p><h2>Estimating the posterior distributions of continuous variables</h2>
<p>In our previous discussions of <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' theorem</a> we have assumed that the probability distributions involved are of discrete variables. However, in many cases we wish to deal with continuous variables. In this case, Bayes' Theorem becomes</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{\int P(H) P(O \mid H) dH}$$</div>
<p>Unfortunately, for many distributions we may be interested in (including the ubiquitous normal distribution), the integral involved is intractable. The problem only gets worse in complex models, especially where distributions have multiple parameters. Some distributions have a <em>conjugate prior</em>, where the posterior distribution is of the same form as the prior distribution and may be obtained by an appropriate adjustment of parameters, but this is not always the case, so we need a numerical method that is more generally applicable.</p>
<p>The method we use is called <em>Markov Chain Monte Carlo</em> because it uses Markov chains of random samples to explore the parameter space of the distribution. There are a number of variations of this, so for the sake of illustration, we will select a particular variant, the <em>Metropolis-Hastings algorithm</em>, as the basis of further discussion.</p>
<p>We start with a Markov chain <span class="math">\(P(H^{\prime} \mid H)\)</span> that, given a sample hypothesis <span class="math">\(H\)</span>, generates a nearby hypothesis <span class="math">\(H^{\prime}\)</span>. At timestep <span class="math">\(t=0\)</span>, we generate a set of samples <span class="math">\(H_{i,0}\)</span> from the prior distribution. Then at each timestep <span class="math">\(t\)</span>, we generate a set of alternative hypotheses <span class="math">\(H^{\prime}_{i,t}\)</span> from the Markov chain given <span class="math">\(H_{i,t}\)</span>. For each pair of hypotheses, we then calculate an acceptance probability</p>
<div class="math">$$ A(H^{\prime}_{i,t},H_{i,t}) = \min \left( 1, \frac{P(O \mid H^{\prime}_{i,t}) P(H^{\prime}_{i,t}) P(H_{i,t} \mid H^{\prime}_{i,t})}{P(O \mid H_{i,t}) P(H_{i,t}) P(H^{\prime}_{i,t} \mid H_{i,t})} \right) $$</div>
<p>We then generate a set of samples <span class="math">\(S_{i}\)</span> from a uniform distribution between 0 and 1, and update the samples as</p>
<div class="math">$$H_{i,t+1} = \left\{ \begin{array}{ll} H^{\prime}_{i,t} & \quad \textrm{if } S_{i} \leq A(H^{\prime}_{i,t},H_{i,t}) \\ H_{i,t} & \quad \textrm{otherwise} \end{array} \right.$$</div>
<p>Provided that the model and the choice of priors are suitable for the data being modelled, over sufficient steps the distribution of <span class="math">\(H_{i,t}\)</span> will converge to <span class="math">\(P(H \mid O)\)</span>. We can envision this as each sample exploring the nearby regions of the distribution and preferring to move towards regions of higher likelihood.</p>
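A minimal Metropolis-Hastings sketch, estimating the posterior mean of a normal distribution with known unit variance. For brevity this runs a single chain rather than the ensemble of samples described above; the flat prior, proposal width, burn-in and sample counts are all illustrative assumptions. Because the Gaussian proposal is symmetric, the <span class="math">\(P(H \mid H^{\prime}) / P(H^{\prime} \mid H)\)</span> factor cancels out of the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=100)  # observations from N(3, 1)

def log_posterior(mu):
    # Flat prior on mu, likelihood N(mu, 1): log P(O|H) up to a constant
    return -0.5 * np.sum((data - mu) ** 2)

n_samples = 5000
samples = np.empty(n_samples)
mu = 0.0  # initial hypothesis
for t in range(n_samples):
    mu_prime = mu + rng.normal(0, 0.5)  # symmetric proposal (Markov chain)
    # Accept with probability min(1, posterior ratio); done in log space
    if np.log(rng.uniform()) <= log_posterior(mu_prime) - log_posterior(mu):
        mu = mu_prime
    samples[t] = mu

posterior_mean = samples[2000:].mean()  # discard burn-in
```

After burn-in, the sample mean should sit close to the sample mean of the data, as the flat prior implies.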
<p>Markov Chain Monte Carlo is implemented in the <a href="https://www.pymc.io/">PyMC</a> library, which provides a comprehensive toolkit for probabilistic modelling. </p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
<td><a href="https://PlayfulTechnology.co.uk/logistic-regression.html">Logistic Regression</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Viterbi Algorithm2023-11-30T00:00:00+00:002023-11-30T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-30:/the-viterbi-algorithm.html<p>Finding the hidden states that generated a sequence</p><h2>Finding the Hidden States that generated a sequence</h2>
<p>Suppose we have a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span> generated by a <a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Model</a>. One thing we may wish to do is infer the maximum likelihood sequence of hidden states <span class="math">\(S_{0},S_{1}...S_{t}\)</span> that gave rise to it. A useful technique for this is the <em>Viterbi Algorithm</em>.</p>
<p>The Viterbi algorithm represents the possible paths through the sequence of hidden states as a graphical model called a <em>trellis</em>. The possible hidden states at each time step are represented by nodes, with edges representing the transitions between them.</p>
<p>At each time step <span class="math">\(t\)</span>, we start by calculating the probabilities of the hidden states given the observation at that time step, <span class="math">\(P(S_{i,t} \mid X_{t})\)</span>, and place corresponding nodes on the trellis. We then find the maximum likelihood predecessor for each node</p>
<div class="math">$$\texttt{argmax}_{j} \left( P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) \right)$$</div>
<p>and connect an edge from it to its successor. Any nodes at <span class="math">\(t-1\)</span> that have no outgoing edges are then deleted, along with their incoming edge, and this is repeated at each previous time step until no more nodes can be deleted. Then, working forwards through the trellis from the first step at which nodes were deleted, we recalculate the probabilities at each timeslice as </p>
<div class="math">$$P^{\prime}(S_{i,t}) = \frac{P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}{\sum_{i} P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}$$</div>
<p>where <span class="math">\(S_{i,t}\)</span> are the remaining states at time <span class="math">\(t\)</span> and <span class="math">\(S_{j,t-1}\)</span> is the maximum likelihood predecessor of each state. </p>
<p>At the end of the sequence, we may select the maximum likelihood final state <span class="math">\(\texttt{argmax} P(S_{i,t})\)</span>. The path leading to it is then the maximum likelihood sequence of states given the observations. The Viterbi Algorithm is particularly suitable for real-time applications, as any time step where the number of possible states falls to 1 may be output immediately and removed from the trellis, which in turn reduces memory requirements and computation time.</p>
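The same maximum-likelihood path can be found with the textbook dynamic-programming form of the algorithm, which keeps the whole trellis and traces back at the end rather than pruning as it goes. A minimal sketch; the two weather-style states, and the transition and emission tables, are invented purely for illustration:

```python
import numpy as np

states = ["Rainy", "Sunny"]
# Illustrative model parameters
initial = np.array([0.6, 0.4])               # P(S_0)
transition = np.array([[0.7, 0.3],           # P(S_t | S_{t-1})
                       [0.4, 0.6]])
emission = np.array([[0.1, 0.4, 0.5],        # P(X | S), 3 possible events
                     [0.6, 0.3, 0.1]])

def viterbi(observations):
    """Return the maximum-likelihood state sequence for the observations."""
    T, N = len(observations), len(states)
    prob = np.zeros((T, N))       # best path probability ending in each state
    back = np.zeros((T, N), int)  # backpointer to the best predecessor
    prob[0] = initial * emission[:, observations[0]]
    for t in range(1, T):
        for i in range(N):
            scores = prob[t - 1] * transition[:, i]
            back[t, i] = scores.argmax()
            prob[t, i] = scores.max() * emission[i, observations[t]]
    # Trace back from the maximum-likelihood final state
    path = [prob[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[i] for i in reversed(path)]

path = viterbi([0, 0, 2])
```

In a production setting, the products of probabilities would be replaced with sums of log-probabilities to avoid underflow on long sequences.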
<p>I first encountered the Viterbi algorithm in the context of error-correcting codes for digital television. The sequence of bits to be transmitted in a digital TV signal can be protected against errors by interspersing it with extra bits derived from a <em>convolutional code</em> - this is a binary function of a number of previous bits. This converts the transmitted sequence from an apparently random sequence (due to data compression) to a Markov process. At the receiving side, we treat the received bitstream (which inevitably contains errors) as the observations and the transmitted bitstream as the hidden states, using the Viterbi algorithm to recover it.</p>
<p>I later used the Viterbi Algorithm for <a href="https://PlayfulTechnology.co.uk/true-212.html">Word Sense Disambiguation</a>. In this application, the observations were words and the hidden states were <a href="https://wordnet.princeton.edu/">WordNet</a> word senses. There were a few complications to take into account - function words, out-of-vocabulary words, multi-word expressions, proper names - but it achieved 70% accuracy, which was described to me as "state of the art".</p>
<p>It's this flexibility and applicability to a range of different problems that makes the Viterbi Algorithm one of my favourite algorithms.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
<td><a href="https://PlayfulTechnology.co.uk/markov-chain-monte-carlo.html">Markov Chain Monte Carlo</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Hidden Markov Models2023-11-23T00:00:00+00:002023-11-23T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-23:/hidden-markov-models.html<p>Using Bayes' Theorem to analyse sequences</p><h2>Using Bayes' Theorem to analyse sequences</h2>
<p>Suppose we wish to analyse a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span>. This can be modelled using <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' theorem</a> as a <em>Markov process</em> <span class="math">\(P(X_{t} \mid X_{t-1})\)</span>, <em>i.e.</em> the probability of each event depends on the previous event in the sequence.</p>
<p>If there are <span class="math">\(N\)</span> possible values that <span class="math">\(X\)</span> can take, the number of transition probabilities between them is <span class="math">\(N^{2}\)</span>. Such a model would quickly become very large and not very informative. We need a way to make the models more tractable.</p>
<p>To do this, we assume that the probability of each event can be described in terms of a hidden state, <span class="math">\(S\)</span>, as <span class="math">\(P(X_{t} \mid S_{t})\)</span>. The states can then be modelled by a Markov process, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>. This is known as a <em>Hidden Markov Model</em>, since it models a sequence of hidden states with a Markov process. The number of hidden states can be considerably smaller than the number of possible events, and the states can group events into meaningful categories. The model consists of three distributions: the initial state distribution, <span class="math">\(P(S_{0})\)</span>, the transition probability distribution, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>, and the conditional distribution of the events, <span class="math">\(P(X \mid S)\)</span>. </p>
<p>Starting from the initial state distribution <span class="math">\(P(S_{0})\)</span>, we can calculate the posterior distributions of the hidden states at each step <span class="math">\(t\)</span> of a sequence by the following method.</p>
<ol>
<li>Calculate the posterior distribution of the hidden state given the observed event <span class="math">\(X_{t}\)</span> using Bayes' Theorem
<div class="math">$$P(S_{t} \mid X_{t}) = \frac{P(S_{t}) P(X_{t} \mid S_{t})}{P(X_{t})}$$</div>
</li>
<li>Calculate the prior probability of the next state
<div class="math">$$P(S_{t+1}) = \sum_{S_{t}} P(S_{t+1} \mid S_{t}) P(S_{t} \mid X_{t})$$</div>
</li>
</ol>
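The two steps above can be sketched as a filtering loop. As with the other sketches, the two-state model and its tables below are invented illustrative numbers, not a fitted model:

```python
import numpy as np

# Illustrative model: 2 hidden states, 3 possible events
initial = np.array([0.5, 0.5])              # P(S_0)
transition = np.array([[0.8, 0.2],          # P(S_t | S_{t-1})
                       [0.3, 0.7]])
emission = np.array([[0.7, 0.2, 0.1],       # P(X | S)
                     [0.1, 0.3, 0.6]])

def filter_states(observations):
    """Posterior distribution of the hidden state after each observation."""
    prior = initial
    posteriors = []
    for x in observations:
        # Step 1: Bayes' Theorem, with P(X_t) as the normalising constant
        joint = prior * emission[:, x]
        posterior = joint / joint.sum()
        posteriors.append(posterior)
        # Step 2: prior probability of the next state, summing over S_t
        prior = transition.T @ posterior
    return posteriors

posteriors = filter_states([0, 0, 2])
```

Each posterior is a full probability distribution over the hidden states, which is exactly what the part-of-speech tagging application below exploits.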
<p>A concrete example is <a href="https://PlayfulTechnology.co.uk/video-part-of-speech-tagging.html">Part of Speech Tagging</a>. In this application, the observed events are words and the hidden states are the parts of speech (noun, verb, adjective etc.). This approach is particularly useful when you want the probability of each part of speech for a given word, rather than a single tag. I used this approach in my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, using my own open source <a href="https://PlayfulTechnology.co.uk/a-hidden-markov-model-library.html">Hidden Markov Model library</a>, which I had created as a learning exercise when I first learnt about HMMs. I was pleased to discover that a colleague on that project had also used the library, but I no longer maintain it, as I've learnt a lot since then and if I did any more work on it I'd prefer to restart it from scratch.</p>
<table>
<thead>
<tr>
<th>Previous</th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a></td>
<td><a href="https://PlayfulTechnology.co.uk/the-viterbi-algorithm.html">The Viterbi Algorithm</a></td>
</tr>
</tbody>
</table>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bayes' Theorem2023-11-16T00:00:00+00:002023-11-16T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-16:/bayes-theorem.html<p>The probability of a hypothesis given observations.</p><h2>Estimating the probability of a hypothesis given observations.</h2>
<p>This is the beginning of what will hopefully be a regular series of articles explaining Key Algorithms in data science.</p>
<p>If you look at <a href="https://www.linkedin.com/in/peterjbleackley">my LinkedIn profile</a>, you'll see that the banner shows the formula </p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(O)}$$</div>
<p>This is a foundational rule for calculating conditional probabilities, known as <em>Bayes' Theorem</em>, after the Reverend Thomas Bayes, who first proposed it. It may be read as <em>the probability of a hypothesis given some observations is equal to the prior probability of the hypothesis multiplied by the probability of the observations given that hypothesis, and divided by the probability of the observations</em>. </p>
<p>To illustrate this, consider a family where the father has rhesus-positive blood and the mother has rhesus-negative blood. Rhesus-positive is a dominant trait - the father might have one or two copies of the Rh+ gene, whereas rhesus-negative is recessive - the mother must have two copies of the Rh- gene.</p>
<p>Let <span class="math">\(H\)</span> be the hypothesis that the father has two copies of the Rh+ gene. Without further information, 1/2 is the best estimate of its probability. If the family's first child is rhesus-positive, the probability of this is <span class="math">\(P(O \mid H) = 1\)</span> if the father has two copies of the Rh+ gene and <span class="math">\(P(O \mid ¬H) = \frac{1}{2}\)</span> if he has one copy. In general, the overall probability of the observations given a set of hypotheses <span class="math">\(H_{i}\)</span> is given by
</p>
<div class="math">$$P(O) = \sum_{i} P(H_{i}) P(O \mid H_{i})$$</div>
<p>, since the posterior probabilities of all hypotheses must sum to 1. Therefore, we can update the probability of the father having two copies of the Rh+ gene as
</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(\neg H) P(O \mid \neg H)} = \frac{\frac{1}{2} \times 1}{\frac{1}{2} \times 1 + \frac{1}{2} \times \frac{1}{2}} = \frac{2}{3}$$</div>
<p>If the family's second child is also rhesus-positive, we can further update our estimate with the new information</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(\neg H) P(O \mid \neg H)} = \frac{\frac{2}{3} \times 1}{\frac{2}{3} \times 1 + \frac{1}{3} \times \frac{1}{2}} = \frac{4}{5}$$</div>
<p>It is easy to see that if we had known both children's blood groups from the outset and used <span class="math">\(P(O \mid \neg H) = \frac{1}{4}\)</span>, we could have got the same result.</p>
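<p>The sequential updating above can be sketched in a few lines of Python (the function name is mine, not from any library):</p>

```python
def bayes_update(prior, likelihood_h, likelihood_not_h):
    """Return P(H|O) given the prior P(H) and the likelihoods
    P(O|H) and P(O|not H), using Bayes' Theorem."""
    numerator = prior * likelihood_h
    return numerator / (numerator + (1.0 - prior) * likelihood_not_h)

# Father Rh+/Rh+ (H) versus Rh+/Rh- (not H): each rhesus-positive child
# is certain under H and has probability 1/2 under not H.
p = 0.5                          # prior P(H)
p = bayes_update(p, 1.0, 0.5)    # after the first child: 2/3
p = bayes_update(p, 1.0, 0.5)    # after the second child: 4/5
```

<p>Applying the update twice with likelihood 1/2 gives the same result as a single update with likelihood 1/4, as noted above.</p>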
<p>In data science, we often have to estimate the probability of a hypothesis given some evidence, so Bayes' theorem is a useful thing to have in our toolkit. </p>
<p>If we need to take observations of several different variables into account, there are two ways we can do it. The first, the <em>Naive Bayes</em> approach, treats all the variables as statistically independent, as we did in the above example. While this has the advantage of simplicity, it is only really viable when the independence assumption is a reasonable approximation.</p>
<p>For more complex problems, we need to model the dependencies between variables. We do this with a graphical method called a <em>Bayesian Belief Net</em>, where each node on a graph represents a variable, and the links represent dependencies between them. Each node then calculates the probability of the variable it represents in terms of the variables it is dependent on. A simple example can be seen in the Data Science Notebook <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It a Mushroom or Is It a Toadstool?</a>.</p>
<p>For my first AI project, I was asked to choose the best system to implement an automatic diagnostic system. I chose a Bayesian Belief Network on the grounds that it was important for the system to be explainable. Since each node of the Bayesian Belief Network represents a meaningful variable, its results are more explainable than those of a neural network, whose nodes are simply steps in a calculation. More recently I used Bayesian models in a project to predict the optimum settings for machine tools, so Bayes' Theorem has followed me throughout my data science career.</p>
<table>
<thead>
<tr>
<th></th>
<th>Next</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td><a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Models</a></td>
</tr>
</tbody>
</table>
QARAC: Porting to PyTorch2023-11-08T00:00:00+00:002023-11-08T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-08:/qarac-porting-to-pytorch.html<p>PyTorch is more suitable for co-training multiple models and objectives</p><p>Most of my previous work with neural networks has been based on the <a href="https://keras.io">Keras</a> library, so in implementing QARAC, I initially used what I was most familiar with. Some lower-level parts of the algorithm were implemented in <a href="https://tensorflow.org">TensorFlow</a>, which for some years has been the default backend to Keras (however, it is now possible to use Keras with a choice of backends again). </p>
<p>I decided to do some local testing before committing large amounts of compute time to training the models, but when I did so, I got the following warning.</p>
<div class="highlight"><pre><span></span><code>WARNING:tensorflow:Gradients do not exist for variables ['tf_roberta_model/roberta/pooler/dense/kernel:0', 'tf_roberta_model/roberta/pooler/dense/bias:0', 'qarac_trainer_model/qarac_encoder_model/global_attention_pooling_head/local projection:0', 'qarac_trainer_model/qarac_encoder_model_1/global_attention_pooling_head_1/local projection:0', 'tf_roberta_model_1/roberta/pooler/dense/kernel:0', 'tf_roberta_model_1/roberta/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
</code></pre></div>
<p>It looks like the <a href="https://PlayfulTechnology.co.uk/qarac-models-and-corpora.html">Training Model</a> wasn't able to propagate gradients between its constituent models. This seems to be a feature of the architecture of Keras, in which <code>Model</code>s are made up of <code>Layer</code>s. Layers are designed to be components of a larger model, and so propagate gradients across their inputs, whereas models, which are intended to be complete systems, do not. From Keras's point of view, I was trying to use models as layers, and it didn't like it.</p>
<p>Since <a href="https://huggingface.co">HuggingFace</a> models are available in both TensorFlow and <a href="https://pytorch.org">PyTorch</a>, I looked to see if PyTorch would be more suitable for what I wanted to do. I found that PyTorch doesn't make the same distinction that Keras does between layers and models - both are <code>Module</code>s, so there would be no problem with propagating gradients between them. The learning curve going from Keras to PyTorch wasn't too steep. The main differences were that the method called <code>call</code> in a Keras layer is called <code>forward</code> in PyTorch, and that there's no direct equivalent of a Keras model's <code>compile</code> and <code>fit</code> methods, so you have to write a training loop. Also, HuggingFace's PyTorch and TensorFlow models aren't exact drop-in replacements for each other, so on occasion adjustments were needed where one wanted a parameter that the other didn't. </p>
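<p>A minimal sketch of the difference (toy modules of my own invention, not the actual QARAC models): in PyTorch a complete model is just another <code>nn.Module</code>, so gradients propagate through nested models exactly as through layers.</p>

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A toy 'complete model' used as a component of a larger one."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 2)

    def forward(self, x):          # Keras's `call` becomes `forward`
        return self.proj(x)

class Trainer(nn.Module):
    """Composes a whole model inside another, as the QARAC trainer does."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()   # a model nested inside a model
        self.head = nn.Linear(2, 1)

    def forward(self, x):
        return self.head(self.encoder(x))

model = Trainer()
loss = model(torch.randn(3, 4)).sum()
loss.backward()   # gradients reach the nested Encoder's parameters
```

<p>In Keras the nested model's weights could end up without gradients, as the warning above shows; here <code>model.encoder.proj.weight.grad</code> is populated after the backward pass.</p>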
<p>You should learn something new on every project, and that has been one of my key personal goals for QARAC. I didn't envisage that I'd end up learning PyTorch for this project, but the fact that I have done is welcome and will come in useful for future projects. </p>
<p>There's only one more thing I need before I can train the models, and that's a budget for compute time, or a <a href="https://huggingface.co/docs/hub/spaces-gpus#community-gpu-grants">community hardware grant</a> from HuggingFace.</p>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>QARAC: Models and Corpora2023-09-14T00:00:00+01:002023-09-14T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-09-14:/qarac-models-and-corpora.html<p>Selection of models and training corpora for QARAC</p><p>I've made some early progress on developing QARAC, and I'm not far from being able to make a first attempt at training it. I've chosen base models, coded the model heads and the training model, and found appropriate datasets to train on.</p>
<h2>Models</h2>
<h3>Base models</h3>
<p>I was initially interested in using <a href="https://arxiv.org/abs/2302.10866">Hyena models</a> as my base models and training them with the <a href="http://www.natcorp.ox.ac.uk/">British National Corpus</a>. However, I found it harder to implement Hyena models in <a href="https://keras.io">Keras</a> than I anticipated, and didn't want this to be a roadblock. I've therefore decided to start by using <a href="https://huggingface.co/roberta-base">RoBERTa</a>. However, I may need to consider another model for the decoder.</p>
<h3>Model Heads</h3>
<p>For the encoder models, the head used is a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/layers/GlobalAttentionPoolingHead.py">Global Attention Pooling Head</a>. If <em>attention</em> in a transformer model is the relevance of each word in a document to the meaning of each other word, <em>global attention</em> may be defined as the relevance of each word to the overall meaning of the document. This is calculated as follows</p>
<p>Given the contextual word vectors <span class="math">\(\vec{v_{i}}\)</span> produced by the base encoder model, and two trainable matrices <span class="math">\(\mathbf{L}\)</span> and <span class="math">\(\mathbf{G}\)</span>, define the <em>local projection</em>
</p>
<div class="math">$$\vec{l_{i}} = \vec{v_{i}} \cdot \mathbf{L}$$</div>
<p> and the <em>global projection</em>
</p>
<div class="math">$$\vec{g} = \left( \sum_{i} \vec{v_{i}} \right) \cdot \mathbf{G}$$</div>
<p>. The attention is then calculated as the cosine similarity of the two projections
</p>
<div class="math">$$a_{i} = \hat{l_{i}} \cdot \hat{g}$$</div>
<p>. Finally, the encoded vector is calculated as the sum of the word vectors weighted by the attention
</p>
<div class="math">$$\vec{E} = \sum_{i} a_{i} \vec{v_{i}}$$</div>
<p>.</p>
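<p>The calculation above can be sketched with NumPy (the shapes and names are my own; the real head is a trainable layer whose matrices are learned):</p>

```python
import numpy as np

def global_attention_pooling(v, L, G):
    """Sketch of the Global Attention Pooling Head described above.
    v: (n, d) contextual word vectors; L, G: (d, p) trainable matrices."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    local_proj = v @ L                        # l_i = v_i . L
    global_proj = v.sum(axis=0) @ G           # g = (sum_i v_i) . G
    a = unit(local_proj) @ unit(global_proj)  # a_i: cosine similarity
    return a @ v                              # E = sum_i a_i v_i
```

<p>The result is a single vector of the same dimension as the word vectors, weighted towards the words most relevant to the overall meaning.</p>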
<p>For the decoder models, the head used is a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/QaracDecoderModel.py#L13">QaracDecoderHead</a>. This prepends a vector representing an encoded document to the vectors generated by the base model, passes this through a <code>TFRobertaLayer</code>, removes the first vector from the output of that layer, then feeds that through another <code>TFRobertaLayer</code> and finally a <code>TFRobertaLMHead</code>, returning the output of that layer.</p>
<h3>The Training Model</h3>
<p>To prevent <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a>, the question encoder, answer encoder and decoder must all be trained together, targeting all training objectives simultaneously. To do this, they are combined into a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/QaracTrainerModel.py">Trainer Model</a>.
Given a sentence <span class="math">\(\mathbf{S}\)</span>, a question <span class="math">\(\mathbf{Q}\)</span>, an answer <span class="math">\(\mathbf{A}\)</span>, two propositions <span class="math">\(\mathbf{P_{0}}\)</span> and <span class="math">\(\mathbf{P_{1}}\)</span>, and two statements <span class="math">\(\mathbf{s_{0}}\)</span> and <span class="math">\(\mathbf{s_{1}}\)</span>,
the following outputs are calculated</p>
<div class="math">$$\texttt{encode_decode} = \mathcal{D}(\mathcal{AE}(\mathbf{S}))$$</div>
<div class="math">$$\texttt{question_answering} = \mathcal{QE}(\mathbf{Q}) - \mathcal{AE}(\mathbf{A})$$</div>
<div class="math">$$\texttt{reasoning} = \mathcal{D}(\mathcal{AE}(\mathbf{P_{0}}) + \mathcal{AE}(\mathbf{P_{1}}))$$</div>
<div class="math">$$\texttt{consistency} = \mathit{cossim}(\mathcal{AE}(\mathbf{s_{0}}),\mathcal{AE}(\mathbf{s_{1}}))$$</div>
<p>For the decoding and reasoning objectives, the loss to be minimised is the sparse categorical crossentropy of the generated text against the target sentence in the training set. For question answering, it is the squared Euclidean length of the vector produced, and for consistency it is the mean squared error from the desired label (1 for consistent statements, -1 for contradictory statements, 0 for unrelated statements).</p>
<p>The output for question answering and its associated loss are chosen to reflect the intended use of the question encoder, to generate a query vector for a vector database.</p>
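<p>As a hypothetical sketch of how the four losses combine (the function names and the equal weighting are my assumptions, not taken from the project code):</p>

```python
import numpy as np

def sparse_categorical_crossentropy(logits, targets):
    """Mean negative log-probability of the target token at each position."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def trainer_loss(decode_logits, decode_targets,
                 reasoning_logits, reasoning_targets,
                 qa_vector, consistency_score, consistency_label):
    """Crossentropy for the two decoded outputs, squared Euclidean length
    of the question-answering vector, squared error for consistency."""
    return (sparse_categorical_crossentropy(decode_logits, decode_targets)
            + sparse_categorical_crossentropy(reasoning_logits, reasoning_targets)
            + float(qa_vector @ qa_vector)
            + (consistency_score - consistency_label) ** 2)
```

<p>Minimising all four terms together is what forces the encoders and decoder to stay mutually compatible.</p>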
<h2>Training Corpora</h2>
<h3>Question Answering</h3>
<p>For Question Answering, the most suitable corpus I have found is the <a href="https://paperswithcode.com/dataset/wikiqa">WikiQA</a> dataset. This contains a sample of questions obtained from Bing queries, along with the first paragraph of a Wikipedia article relevant to each question. The paragraph is split into sentences, one per line, and the sentences are labelled 1 if they are considered a valid answer to the question, and 0 otherwise. The rows labelled 1 will be used to train the question answering objective.</p>
<p>It has been necessary to perform coreference resolution on this dataset, for which <a href="https://docs.allennlp.org/main/">AllenNLP</a> was used. Since it was necessary to combine all the sentences for a given question into a single document to perform coreference resolution and then separate them afterwards, some rather nasty edge cases had to be dealt with.</p>
<h3>Reasoning</h3>
<p>For Reasoning, the <a href="https://github.com/ZeinabAghahadi/Syllogistic-Commonsense-Reasoning">Avicenna: Syllogistic Commonsense Reasoning</a> dataset will be used. This contains pairs of sentences, a label "yes" if they can be used to form a valid syllogism and "no" if not, and a conclusion to the syllogism if it exists. Only the examples where a valid syllogism exists will be used to train this objective.</p>
<h3>Consistency</h3>
<p>For Consistency, the <a href="https://www.kaggle.com/datasets/stanfordu/stanford-natural-language-inference-corpus">Stanford Natural Language Inference Corpus</a> will be used. This contains pairs of sentences, labelled as "entailment", "contradiction" or "neutral". These values will be mapped to +1, -1 and 0 respectively.</p>
<h3>Encode/Decode</h3>
<p>To train the decoding of encoded sentences, a combined dataset consisting of the following will be used:</p>
<ul>
<li>all the answer sentences from the WikiQA dataset, whether they are labelled as correct or not</li>
<li>all the propositions from the Avicenna dataset, whether there is a valid conclusion or not</li>
<li>the conclusions from the Avicenna dataset, where these are available</li>
<li>the sentences from the SNLI corpus</li>
</ul>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>
QARAC: Question Answering, Reasoning and Consistency2023-08-21T00:00:00+01:002023-08-21T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-08-21:/qarac-question-answering-reasoning-and-consistency.html<p>A project to create a factually accurate NLP system</p><p>Following on from my previous article on <a href="https://PlayfulTechnology.co.uk/the-future-of-natural-language-processing.html">The Future of Natural Language Processing</a>, I've decided to start a personal research project to put some of these ideas into practice and test them out. </p>
<p>I'm calling the proposed system <strong>QARAC</strong>, which stands for <em>Question Answering, Reasoning and Consistency</em>.</p>
<h2>NLP Components and Training Objectives</h2>
<p>The main NLP components of the system will be two <em>encoders</em> and a <em>decoder</em>. The two encoders will share a base model, and each will map a sentence <strong>S</strong> to a vector <em>v</em>. One will be a <em>question encoder</em>, <span class="math">\(\mathcal{QE}\)</span> and the other an <em>answer encoder</em>, <span class="math">\(\mathcal{AE}\)</span>.</p>
<p>The <em>decoder</em> <span class="math">\(\mathcal{D}\)</span> will be an autoregressive model that, given a vector <em>v</em> generates a sentence <strong>S</strong>. In particular, it will be trained to act as the inverse function to the answer encoder, so that </p>
<div class="math">$$\mathcal{D}(\mathcal{AE}(\mathbf{S})) = \mathbf{S}$$</div>
<p>. Further training objectives give the system its name.</p>
<h3>Question Answering</h3>
<p>Given a question <strong>Q</strong> and a corresponding answer <strong>A</strong>, the <em>Question Answering</em> objective is that </p>
<div class="math">$$\mathcal{QE}(\mathbf{Q}) = \mathcal{AE}(\mathbf{A})$$</div>
<p>. We might naively try to use this to create a simple question answering system as </p>
<div class="math">$$\mathbf{A} = \mathcal{D}(\mathcal{QE}(\mathbf{Q}))$$</div>
<p>, but this of course would be no more likely to produce accurate results than current LLMs.</p>
<h3>Reasoning</h3>
<p>Given two propositions <span class="math">\(\mathbf{P_{0}}\)</span> and <span class="math">\(\mathbf{P_{1}}\)</span>, and a conclusion <strong>C</strong> that follows from them, the <em>Reasoning</em> objective is that </p>
<div class="math">$$\mathcal{D}(\mathcal{AE}(\mathbf{P_{0}}) + \mathcal{AE}(\mathbf{P_{1}})) = \mathbf{C}$$</div>
<p>. </p>
<h3>Consistency</h3>
<p>Given two statements <span class="math">\(\mathbf{S_{0}}\)</span> and <span class="math">\(\mathbf{S_{1}}\)</span>, the <em>consistency objective</em> is </p>
<div class="math">$$\mathit{cossim}(\mathcal{AE}(\mathbf{S_{0}}),\mathcal{AE}(\mathbf{S_{1}})) = \left\{ \begin{array}{cl}
+1 &amp; \quad \textrm{if statements are consistent} \\
0 &amp; \quad \textrm{if statements are unrelated} \\
-1 &amp; \quad \textrm{if statements contradict}
\end{array}
\right. $$</div>
<h2>Knowledge base components</h2>
<p>As previously stated, the system will need a knowledge base in order to produce accurate answers. This will be stored in a vector database and harvested by a crawler.</p>
<p>The crawler will start from a site considered likely to be a reliable source of factual information, extract statements from each document it crawls, and encode them with the answer encoder. It will then test them for consistency with the existing knowledge base, deciding on that basis which to add to the knowledge base and which to reject. It will also calculate an overall reliability score for each document. Links originating from documents with high reliability scores will be prioritised by the crawler for further investigation, and the crawler will terminate when there are no links left to be explored that come primarily from reliable sources.</p>
<h2>Querying</h2>
<p>Presented with a question, QARAC will first use the question encoder to obtain a query vector. It will then find the top few matching vectors from the knowledge base, using the cosine similarity of the answer vectors to the query vector as a measure of confidence. If two vectors can be added to produce one with a higher confidence score, this will be added to the results set as an inferred answer. The answer vectors will then be converted to text by the decoder, and the results presented to the user, showing the sources of the original vectors and the chain of reasoning to the inferred ones.</p>
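<p>A rough sketch of this querying scheme (the pairwise combination step is my illustration of the idea, not the project's actual retrieval code):</p>

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def query(knowledge, q, top_k=3):
    """Rank knowledge-base answer vectors by cosine similarity to the
    query vector q; sums of pairs of results that score higher than
    either part are added as inferred answers."""
    scores = unit(knowledge) @ unit(q)
    best = np.argsort(scores)[::-1][:top_k]
    results = [(float(scores[i]), knowledge[i]) for i in best]
    for a in range(len(best)):
        for b in range(a + 1, len(best)):
            combined = knowledge[best[a]] + knowledge[best[b]]
            score = float(unit(combined) @ unit(q))
            if score > max(scores[best[a]], scores[best[b]]):
                results.append((score, combined))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

<p>In the full system, each returned vector would then be decoded back to text, with inferred answers annotated with the vectors they were summed from.</p>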
<h2>Assessment</h2>
<p>Well, that's the theory. This is a research project, however, and the point is to see how well this system performs in practice, and whether it provides insights into how NLP models could be further improved. As such, a demonstration system will be made accessible, and feedback solicited from users about its performance.</p>
<p>Code for the project will be published on <a href="https://github.com/PeteBleackley/QARAC">GitHub</a> and trained models on <a href="https://huggingface.co/PlayfulTechnology">HuggingFace</a>. Project updates will be published here under the tag <a href="https://PlayfulTechnology.co.uk/tag/qarac.html">QARAC</a>.</p>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>
The Future of Natural Language Processing2023-02-01T00:00:00+00:002023-02-01T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-02-01:/the-future-of-natural-language-processing.html<p>NLP systems need knowledge and logic</p><h1>The Future of Natural Language Processing</h1>
<p><a href="https://openai.com/blog/chatgpt/">ChatGPT</a> and similar generative language models have been attracting a lot of attention recently. The trouble is that while they're good at producing fluent text, they don't necessarily produce accurate or useful text. With ChatGPT, the fact that it admits that it doesn't know the answer some of the time produces a false expectation that it knows what it's talking about the rest of the time, but if you ask it questions about a subject you know about, you'll find it makes mistakes ranging from the subtle to the absurd. <a href="https://www.engadget.com/cnet-reviewing-ai-written-articles-serious-errors-113041405.html">CNET</a> found the hard way that generative models are not a reliable source of content. The reason for this is that the text they generate is based on statistical patterns inferred from their training datasets. At no stage in the process does the model actually understand either the text it's been trained on or what it is being asked to do. It is surmised that in a sufficiently complex model, such understanding may arise as an emergent property of the network, but even if it does, large language models are generally trained on text harvested from the internet, thus leading to a garbage-in garbage-out problem.</p>
<p>This means that the most likely use of generative language models in the near term is as an efficient source of clickbait and fake news. This makes <a href="https://dev.to/fannieailiverse/open-sourced-gptzero-3kik">GPTZero</a> and <a href="https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text/">OpenAI's own AI-written text classifier</a> important. Search engines will need to incorporate tools like these to ensure that results are more likely to come from reliable sources.</p>
<p>However, it clearly isn't enough to trust the neural network. Future generations of NLP models will need to incorporate knowledge and a concept of logical consistency, so that they can discriminate truth from falsehood. My own work with <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> used <a href="https://www.wikidata.org/">WikiData</a> as a knowledge base for Named Entity Recognition with good effect, so I know how powerful the incorporation of a good knowledge base can be. Yet if we want the system to be able to learn and grow its own knowledge base, it needs to understand whether or not data is logically consistent. We can envisage a model that vectorizes statements in such a way that for two statements that are logically consistent, the cosine similarity of the vectors is close to 1, for two statements that are inconsistent, the cosine similarity is close to -1, and for two statements that are unrelated, the cosine similarity is close to zero. The <a href="https://www.kaggle.com/datasets/stanfordu/stanford-natural-language-inference-corpus">Stanford Natural Language Inference Corpus</a>, available from Kaggle, would be a suitable dataset to train this on. Once we could predict logical consistency in this way, we should be able to bootstrap a knowledge base from a corpus of trusted facts by adding only statements that are consistent with what is already known.</p>
<p>These vectors have the property that arithmetical negation corresponds to logical negation. It's possible, therefore, that we could perform logical inference by means of arithmetical operations on the vectors. The sum of two vectors may correspond to a logical syllogism, allowing the system to deduce new facts from its knowledge base.</p>
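<p>As a toy illustration of these properties, suppose a trained encoder already maps statements to vectors with the cosine-similarity semantics described above (the vectors <code>p</code>, <code>q</code> and the function names below are made up for the sake of the sketch; a real system would obtain the vectors from the trained model):</p>

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two statement vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def admissible(candidate, knowledge, tol=-0.5):
    """Admit a candidate statement into the knowledge base only if it does
    not strongly contradict (cosine similarity below tol) anything known."""
    return all(cosine(candidate, k) > tol for k in knowledge)

# Made-up statement vectors standing in for the output of a trained encoder.
p = np.array([1.0, 0.0])       # a trusted fact
q = np.array([0.9, 0.1])       # broadly consistent with p
not_p = -p                     # arithmetical negation as logical negation
```

<p>Here <code>cosine(p, not_p)</code> is exactly -1, so the bootstrapping filter would admit <code>q</code> but reject <code>not_p</code>.</p>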
<p>A system that could model consistency would have a lot of powerful applications. <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">Fake News Detection</a> is one possibility - if a document repeatedly contradicted trusted sources, it could be classified as unreliable. Conversely, a document would also be suspicious if it made similar claims to sources known to be unreliable - the QAnon conspiracy theory made similar claims to <a href="https://sourcebooks.fordham.edu/basis/procop-anec.asp">The Secret History</a> - smear campaigns and scare stories haven't changed much since Roman times. Used alongside anomaly detection, it could also detect when an author had concealed dubious claims in an otherwise factual document. However, it could also be a proof-reading tool, allowing authors and editors to check their work for errors more efficiently.</p>
<p>It would also be able to detect opinion and partisanship. Suppose two sources both make claims A and B. However, one source also makes claim C and the other makes claim D. While neither C nor D is inconsistent with A or B, they are inconsistent with each other. We can therefore deduce that A and B are more likely to be accepted by consensus as fact, whereas C and D are opinions. Clustering sources by which opinions they were likely to share would identify partisan groups of sources.</p>
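<p>The A/B/C/D example can be worked through with a pairwise consistency matrix (the scores below are invented purely to illustrate the deduction):</p>

```python
import numpy as np

claims = ["A", "B", "C", "D"]
# Invented pairwise consistency scores: +1 = consistent, -1 = inconsistent,
# 0 = unrelated. A and B are compatible with everything; C and D contradict
# each other.
consistency = np.array([
    [ 1.0,  0.8,  0.1,  0.2],
    [ 0.8,  1.0,  0.2,  0.1],
    [ 0.1,  0.2,  1.0, -0.9],
    [ 0.2,  0.1, -0.9,  1.0],
])

# A claim is a candidate fact if it is not strongly inconsistent with any
# other claim; otherwise it is treated as an opinion.
facts = [c for i, c in enumerate(claims) if consistency[i].min() > -0.5]
opinions = [c for c in claims if c not in facts]
```

<p>With these scores, A and B come out as facts and C and D as opinions, matching the reasoning above.</p>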
<p>These are just a few possible applications - the ones that occur to me off the top of my head - but they clearly show that knowledge, consistency and reasoning are the missing ingredients needed to make NLP technology truly useful.</p>
<p>If you are interested in these ideas, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=The%20Future%20of%20NLP">contact Playful Technology Limited</a></p>How many components?2022-01-17T00:00:00+00:002022-01-17T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2022-01-17:/how-many-components.html<p>A simple method for choosing the number of components to use in principal component analysis</p><h2>A simple method for choosing the number of components to use in principal component analysis</h2>
<p>A common problem in data science is <em>the curse of dimensionality</em>. Essentially, the more different variables a dataset encompasses, the more mathematically intractable it becomes to make measurements based on them all. The usual method for dealing with this problem is <em>Principal Component Analysis</em>, which seeks to reduce the data to a smaller number of dimensions while retaining as much information as possible. The most common way of doing this is as follows.</p>
<ol>
<li>Obtain either the covariance matrix of the variables or a similarity matrix of the observations, using a metric such as cosine similarity</li>
<li>Calculate the eigenvalues and eigenvectors of this matrix</li>
<li>Use the eigenvectors corresponding to the N largest eigenvalues to form an orthonormal basis</li>
</ol>
<p>This, however, raises the question of how to select an appropriate value of N. Since our aim is to explain the maximum amount of variance with the minimum number of components, a simple approach is to find a maximum in the sum of the proportion of components discarded and the proportion of variance retained, as measured by the eigenvalues. <code>numpy</code> helps us with this by returning eigenvalues and eigenvectors in increasing order of eigenvalue. The following code illustrates the method.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">numpy.linalg</span>
<span class="k">def</span> <span class="nf">reduce_data</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">metric</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="sd">"""Reduces X to the number of dimensions that retains the maximum amount of infomation for the minimum number of components</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> X : numpy.ndarray</span>
<span class="sd"> (n * m) array containing n rows of m-dimensional observations</span>
<span class="sd"> metric : function (optional, default = None)</span>
<span class="sd"> Similarity metric. Takes an (n * m) array and returns an (n * n) array of similarities</span>
<span class="sd"> Returns</span>
<span class="sd"> -------</span>
<span class="sd"> numpy.ndarray</span>
<span class="sd"> The data reduced to the optimum number of dimensions</span>
<span class="sd"> """</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="kp">cov</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">if</span> <span class="n">metric</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">metric</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="p">(</span><span class="n">eigenvalues</span><span class="p">,</span> <span class="n">eigenvectors</span><span class="p">)</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">eigh</span><span class="p">(</span><span class="n">similarity</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">eigenvalues</span><span class="o">.</span><span class="kp">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">excluded</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="kp">arange</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">/</span><span class="n">n</span>
<span class="n">explained</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="p">(</span><span class="n">eigenvalues</span><span class="o">.</span><span class="kp">cumsum</span><span class="p">()</span><span class="o">/</span><span class="n">eigenvalues</span><span class="o">.</span><span class="kp">sum</span><span class="p">())</span>
<span class="n">cutoff</span> <span class="o">=</span> <span class="p">(</span><span class="n">excluded</span> <span class="o">+</span> <span class="n">explained</span><span class="p">)</span><span class="o">.</span><span class="kp">argmax</span><span class="p">()</span>
<span class="n">basis</span> <span class="o">=</span> <span class="n">eigenvectors</span><span class="p">[:,</span><span class="n">cutoff</span><span class="p">:]</span>
<span class="k">return</span> <span class="n">X</span><span class="o">.</span><span class="kp">dot</span><span class="p">(</span><span class="n">basis</span><span class="p">)</span> <span class="k">if</span> <span class="n">metric</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">basis</span>
</code></pre></div>GHFP Research Institute2021-06-21T00:00:00+01:002021-06-21T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-06-21:/ghfp-research-institute.html<p>Interactive Mapping of the Better Place Index</p><h2>Interactive Mapping of the Better Place Index</h2>
<h3>The Client</h3>
<p><a href="https://ghfp.org/">GHFP Research Institute</a></p>
<h3>The Problem</h3>
<p>In collaboration with <a href="https://pureportal.coventry.ac.uk/en/organisations/centre-for-trust-peace-and-social-relations-2">The Centre for Trust, Peace and Social Relations</a> at the University of Coventry, the GHFP Research Institute had developed <em>the Better Place Index</em>, a metric of quality of life in different countries. They wished to create an interactive map which would allow users to explore how this metric and its key contributing factors varied from country to country.</p>
<h3>The Approach</h3>
<p>Geopandas was used to combine the <a href="https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/">Natural Earth Countries Shapefile</a> with a spreadsheet of the Better Place Index and its contributing factors. The resulting GeoDataFrame was then used in CartoFrames to produce an <a href="https://www.thebetterplaceindex.report/map">interactive map of the Better Place Index</a> on which</p>
<ul>
<li>Countries are coloured according to the Better Place Index</li>
<li>Hovering the mouse over a country displays the Better Place Index for that country, and its best and worst contributing factors</li>
<li>Countries may be selected by ranges of the Better Place Index, or by the best or worst contributing factor.</li>
</ul>
<h3>Technology Used</h3>
<ul>
<li><a href="https://geopandas.org/">Geopandas</a></li>
<li><a href="https://carto.com/">CartoFrames</a></li>
</ul>Is It A Mushroom or Is It A Toadstool?2021-05-19T00:00:00+01:002021-05-19T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-19:/is-it-a-mushroom-or-is-it-a-toadstool.html<p>Using Bayesian Belief Networks to classify fungus edibility</p><h2>Using Bayesian Belief Networks to classify fungus edibility</h2>
<p>The <a href="https://www.kaggle.com/uciml/mushroom-classification">UCI Machine Learning Mushroom Classification Dataset</a> on Kaggle tabulates discrete features around 8000 specimens of fungi. There are 23 species represented, and the challenge is to classify which are edible and which are poisonous. Since the data are all categorical, I decided that a Bayesian Belief Network would be a suitable, and used an ad-hoc clustering algorithm to infer a hidden variable.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/bayesian-belief-network-for-fungus-edibility?kernelSessionId=1503132" title="Bayesian Belief Network for fungus edibility" width="100%"></iframe>
<p>These results seem promising, but I wanted to see if I could do even better. This time I used Mutual Information to infer two hidden variables.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/bayesian-belief-network-for-fungi-2?kernelSessionId=6991228" title="Bayesian Belief Network for Fungi 2" width="100%"></iframe>
<p><strong>WARNING</strong> This is intended solely as a technology demonstration. Playful Technology Limited cannot accept any liability if you pick wild mushrooms on the basis of these notebooks. If you want to forage for wild mushrooms, find an experienced guide.</p>
<p>If you are interested in classification problems, <a href="mailto:peter.bleackley@playfultechnology.co.uk">contact me</a>.</p>Clustering Proteins in Breast Cancer Patients2021-05-10T00:00:00+01:002021-05-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-10:/clustering-proteins-in-breast-cancer-patients.html<p>Using clustering techniques to find groups of proteins that may be of clinical significance</p><h2>Using clustering techniques to find groups of proteins that may be of clinical significance</h2>
<p>Breast cancer is the most common form of cancer in women, and most of us probably know somebody who's been affected by it, so when another data scientist suggested I look at the breast cancer proteome on Kaggle, I thought it was a worthwhile thing to do. I'm not a biologist, but I know that cell behaviour involves complex networks of interacting proteins, so I thought that clustering would be a good way of uncovering these networks. I was pleased to discover that the protein clusters discovered seemed to be predictive of clinical outcomes.</p>
<iframe src="https://www.kaggle.com/embed/petebleackley/clustering-proteins?kernelSessionId=5010029" height="800" width="100%" frameborder="0" scrolling="auto" title="Clustering proteins"></iframe>
<p>This is something I hope might be useful to clinical researchers. If you are interested in this work, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Breast%20cancer%20proteome">contact me</a>.</p>The Grammar of Truth and Lies2021-05-10T00:00:00+01:002021-05-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-10:/the-grammar-of-truth-and-lies-nb.html<p>Using Natural Language Processing to detect fake news</p><h2>Using Natural Language Processing to detect fake news</h2>
<p>The issue of trust in the media is very important to me, and so when a dataset of fake news items was posted on Kaggle, I decided to see if NLP could be used to distinguish between real and fake news.</p>
<iframe src="https://www.kaggle.com/embed/petebleackley/the-grammar-of-truth-and-lies?kernelSessionId=62289416" height="800" scrolling="auto" title="The Grammar of Truth and Lies" width="100%"></iframe>
<p>I later presented this at two data science meetups and on my video channel.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OyA59kIQcAU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Later, another corpus of real and fake news stories was published on Kaggle, giving me the chance to see if the results were replicable. Fortunately, it appears that they hold up well.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/the-grammar-of-truth-and-lies-part-2?kernelSessionId=54101611" title="The Grammar of Truth and Lies part 2" width="100%"></iframe>
<p>If you are interested in fake news detection, <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Fake%20news%20detection">contact me</a>.</p>Lobbying With Data - How Can Data Help Businesses Influence Policy?2020-07-03T00:00:00+01:002020-07-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-07-03:/lobbying-with-data-how-can-data-help-businesses-influence-policy.html<p>Webinar on how to influence with data science</p><h2>Webinar on how to influence with data science</h2>
<p>I was invited by <a href="https://drivaartsdriva.com/">DRIVA Arts DRIVA</a> to take part in a webinar. Along with <a href="https://www.linkedin.com/in/bonamywaddell/">Bonami Waddell</a> and <a href="https://www.linkedin.com/in/sagihaider/">Haider Raza</a> I discussed what the best strategies were for data scientists to get their message across to decision makers. See below for the video.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZC8ddOhyZ00" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>Video: NLP vs Filter Bubbles2020-06-15T00:00:00+01:002020-06-15T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-15:/video-nlp-vs-filter-bubbles.html<p>Using Topic Modelling and Sentiment Analysis to find common ground between people of differing opinions</p><h2>Using Topic Modelling and Sentiment Analysis to find common ground between people of differing opinions</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/1VKVFJ3pdJw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My latest video is about the Common Ground Algorithm, an idea I've had to try to address the problem of filter bubbles online. Given two people with differing opinions, can we use NLP to find common ground between them, and thus encourage civil discussion between people who might otherwise distrust each other? As usual, you can <a href="https://www.kaggle.com/petebleackley/the-common-ground-algorithm">explore the code in this Kaggle kernel</a>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/UDa7YIPqpiA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My latest video discusses three approaches to a simple NLP task, Part of Speech tagging. Here's a link to <a href="https://gesis.mybinder.org/binder/v2/gh/PeteBleackley/ask-a-data-scientist/780aa74550de278b2ec31f8fbb8dd81af3227fb5">the code</a>.</p>All Street Research2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/all-street-research.html<p>Finding the most relevant paragraphs from corporate documents for given themes</p><h2>Finding the most relevant paragraphs from corporate documents for given themes</h2>
<h3>The Client</h3>
<p><a href="https://www.allstreet.org/">All Street Research</a></p>
<h3>The Problem</h3>
<p>All Street Research wanted to be able to find the most relevant paragraphs of corporate documents related to given themes.</p>
<h3>The Approach</h3>
<p>A set of key words and phrases was obtained for each of the topics of interest. Then, from a corpus of corporate documents, words which correlated with the key words on a paragraph level were identified. These correlations were used to derive a scoring function for each theme that was used to identify the most relevant paragraphs.</p>
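<p>A minimal sketch of this kind of scoring might look as follows: correlate each vocabulary word's paragraph-level presence with that of the seed keywords, then score paragraphs by the correlated words they contain. The corpus, seed word and function name here are invented; the real project worked on corporate documents with NLTK and Gensim.</p>

```python
import numpy as np

def theme_scores(paragraphs, seed_words):
    """Score each paragraph (a list of tokens) for one theme.

    Every vocabulary word's paragraph-level presence is correlated with the
    presence of the seed words; a paragraph's score is the sum of those
    correlations over the distinct words it contains."""
    vocab = sorted({w for p in paragraphs for w in p})
    presence = np.array([[1.0 if w in p else 0.0 for w in vocab]
                         for p in paragraphs])                    # (P, V)
    seed = np.array([1.0 if any(s in p for s in seed_words) else 0.0
                     for p in paragraphs])                        # (P,)
    with np.errstate(invalid='ignore'):                           # guard constant columns
        corr = np.nan_to_num([np.corrcoef(presence[:, j], seed)[0, 1]
                              for j in range(len(vocab))])
    weight = dict(zip(vocab, corr))
    return [sum(weight[w] for w in set(p)) for p in paragraphs]

paragraphs = [["revenue", "growth", "profit"],
              ["weather", "today"],
              ["profit", "revenue"],
              ["growth", "weather"]]
scores = theme_scores(paragraphs, {"revenue"})
```

<p>Paragraphs containing words that co-occur with the seed word score highest, even when they don't contain the seed word itself.</p>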
<h3>Technology Used</h3>
<ul>
<li><a href="https://www.nltk.org/">NLTK</a></li>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://numpy.org/">Numpy</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebooks</a></li>
</ul>Amey Strategic Consulting2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/amey-strategic-consulting.html<p>Automatic Diagnostics for the Strategic Road Network</p><h2>Automatic Diagnostics for the Strategic Road Network</h2>
<h3>The Client</h3>
<p><a href="https://www.amey.co.uk/amey-consulting/services/strategic-consulting/">Amey Strategic Consulting</a></p>
<h3>The Problem</h3>
<p>As part of a major data science project on behalf of <a href="https://highwaysengland.co.uk/">Highways England</a>, Amey wished to create an automatic diagnostic system that would detect faults in traffic flow sensors on the strategic road network. As well as enabling timely and efficient maintenance, this would prevent delays to journeys caused by incorrectly set signals, which are estimated to cost the economy £7.5 million per year.</p>
<h3>The Approach</h3>
<p>From a shapefile containing the geometry of the Strategic Road Network, the topology of the network was calculated and groups of sensors assigned to links, which are sections of carriageway between two junctions. Over a link, traffic flow readings should be approximately consistent at a given time. Anomaly detection can then be used to find the sensor whose readings are most different from the rest. This should vary randomly, but if the same sensor is inconsistent with the rest for a few minutes at a time, it can be assumed to be faulty.</p>
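<p>As a simplified stand-in for this consistency check (the production system used Isolation Forests on the real sensor network; the sensor IDs, readings and function names below are invented):</p>

```python
import numpy as np

def most_anomalous_sensor(readings):
    """readings: dict of sensor id -> flow reading for one link at one time.
    Returns the sensor whose reading deviates most from the link median."""
    sensors = list(readings)
    values = np.array([readings[s] for s in sensors], dtype=float)
    deviations = np.abs(values - np.median(values))
    return sensors[int(deviations.argmax())]

def persistent_fault(history, window=5):
    """Flag a sensor as faulty only if it is the most anomalous sensor in
    every one of the last `window` readings (a stand-in for 'inconsistent
    for a few minutes at a time')."""
    flagged = [most_anomalous_sensor(readings) for readings in history[-window:]]
    return flagged[0] if len(set(flagged)) == 1 else None
```

<p>A sensor that is briefly the outlier is ignored; only one that stays the outlier across the whole window is reported.</p>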
<p>After testing this approach on one link, a simple dashboard was created to demonstrate the results and work began on scaling to the full network.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://geopandas.org/">Geopandas</a></li>
<li><a href="https://scikit-learn.org/stable/">Scikit-learn</a> (Isolation Forests)</li>
<li><a href="https://jupyter.org/">Jupyter Lab</a></li>
<li><a href="https://spark.apache.org/docs/latest/api/python/">PySpark</a></li>
</ul>Formisimo2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/formisimo.html<p>Real time prediction of web form conversion</p><h2>Real time prediction of web form conversion</h2>
<h3>The Client</h3>
<p><a href="https://www.zuko.io/formisimo">Formisimo</a></p>
<h3>The Problem</h3>
<p>Formisimo wanted to predict in real time whether users would complete or abandon web forms, in order to generate nudges that would encourage frustrated users to complete the form. Their early models were able to predict from the full history of user interactions whether a given user had completed or abandoned the form, but could not reproduce this under a simulation of real time operation.</p>
<h3>The Approach</h3>
<p>After some initial experiments with models based on Support Vector Machines and Hidden Markov Models, a deep investigation of the data was made. It was found that a useful prediction of whether a user would complete the form or not could only be made within the last 100 interactions. It was therefore decided to change the prediction target from whether or not the user would complete the form to whether the user was within 100 events of abandoning the form. Models based on this insight showed improved performance, and further improvements were made by using an LSTM model.</p>
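<p>The reframed prediction target amounts to relabelling the event stream. A minimal sketch, with invented field names and a made-up 100-event horizon structure:</p>

```python
def label_events(events, outcome, horizon=100):
    """Label each interaction event in a session.

    events:  list of events for one session, in time order.
    outcome: 'completed' or 'abandoned'.
    Returns one boolean label per event: True when the session ended in
    abandonment and the event falls within `horizon` events of the end.
    """
    n = len(events)
    if outcome != 'abandoned':
        return [False] * n          # completed sessions carry no positive labels
    return [i >= n - horizon for i in range(n)]
```

<p>A model trained on these labels predicts imminent abandonment rather than the session's final outcome, which is the quantity a real-time nudge actually needs.</p>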
<h3>Technology Used</h3>
<ul>
<li><a href="https://scikit-learn.org/stable/">Scikit-Learn</a> (Support Vector Machines)</li>
<li><a href="https://pypi.python.org/pypi/Markov">Hidden Markov Models</a></li>
<li><a href="https://keras.io/">Keras</a> (LSTM networks)</li>
</ul>Pentland Brands2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/pentland-brands.html<p>Could 3D face scans be used to recommend swimming goggles?</p><h2>Could 3D face scans be used to recommend swimming goggles?</h2>
<h3>The Client</h3>
<p><a href="http://https//pentlandbrands.com/">Pentland Brands</a></p>
<h3>The Problem</h3>
<p>Pentland Brands wanted to create an app to recommend swimming goggles to potential customers. During a trial, they had collected point cloud models of test subjects' faces using the 3D scanner on an iPhone, along with metadata about the test subjects, and whether they liked or disliked various styles of goggles. They wished to know if it would be possible to predict whether a given person would like a particular style of goggles.</p>
<h3>The Approach</h3>
<p>A test framework was created which allowed the performance of various models to be compared. A number of data reduction techniques and classifier algorithms were applied to the data and their performance in predicting the test subjects' preferences were assessed. Unfortunately, it was discovered that there was no significant correlation between facial shapes and preferences, so Playful Technology Limited recommended that the project be discontinued.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://github.com/strawlab/python-pcl">python-pcl</a></li>
<li><a href="https://scikit-learn.org/stable/">scikit-learn</a></li>
<li><a href="https://pypi.org/project/Theano">Theano</a> (graph convolutional network)</li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebook</a></li>
</ul>Rolls Royce AI Hub2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/rolls-royce-ai-hub.html<p>Extracting structured data from technical documents</p><h2>Extracting structured data from technical documents</h2>
<h3>The Client</h3>
<p><a href="https://www.rolls-royce.com/products-and-services/r2datalabs.aspx">R<sup>2</sup> Data Labs</a></p>
<h3>The Problem</h3>
<p>Rolls Royce had a large quantity of technical documents which they wanted to be able to search. They wished to develop their own search system in house, partly for security reasons and partly to ensure that it was optimal for their needs.</p>
<h3>The Approach</h3>
<p>A testbed was developed to compare the performance of various topic modelling algorithms for searching the documents. During this work, a bug was found in the <a href="https://radimrehurek.com/gensim/models/tfidfmodel.html">Gensim implementation of TF-IDF</a> and corrected. It was then necessary to develop a parser library that could extract structured data from various document formats. Many of the documents were scanned PDFs for regulatory reasons, and this led to two problems. Firstly, the OCR program used could infer the physical structure of the document (pages, layouts), but it was necessary to develop heuristics to infer logical structure (chapters, sections, paragraphs). Secondly, it was found that tables confused OCR. A method to handle this was developed in collaboration with another contractor, whereby tables would be separated into individual cells, OCR run on each cell, and the results assembled into a Pandas DataFrame. Methods were developed to account for row and column headers, as well as multirow and multicolumn spans.</p>
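<p>The cell-reassembly step might be sketched as follows, assuming OCR has already produced text for each (row, column) position. The table contents and function name are invented, and the real system also handled multirow and multicolumn spans, which this sketch omits:</p>

```python
import pandas as pd

def assemble_table(cells):
    """cells: dict mapping (row, col) -> OCR'd cell text, with row 0
    holding column headers and column 0 holding row headers."""
    nrows = max(r for r, _ in cells) + 1
    ncols = max(c for _, c in cells) + 1
    # Lay the cells out on a grid, filling any gaps with empty strings.
    grid = [[cells.get((r, c), "") for c in range(ncols)] for r in range(nrows)]
    # First row becomes the column headers, first column the index.
    frame = pd.DataFrame(grid[1:], columns=grid[0])
    return frame.set_index(frame.columns[0])

cells = {(0, 0): "",         (0, 1): "Thrust", (0, 2): "Mass",
         (1, 0): "Engine A", (1, 1): "10",     (1, 2): "2",
         (2, 0): "Engine B", (2, 1): "12",     (2, 2): "3"}
table = assemble_table(cells)
```

<p>Running OCR per cell and only then assembling the grid keeps table geometry out of the OCR engine's way, which was the point of the approach described above.</p>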
<p>During this project I also sat on a tender panel to advise on technical aspects of the bid and gave advice on a proposed collaborative project.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://poppler.freedesktop.org/">Poppler</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://opencv.org/">OpenCV</a></li>
</ul>Social Finance2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/social-finance.html<p>ETL and Data Cleansing for social services datasets</p><h2>ETL and Data Cleansing for social services datasets</h2>
<h3>The Client</h3>
<p><a href="https://www.socialfinance.org.uk/">Social Finance</a></p>
<h3>The Problem</h3>
<p>Social Finance wished to create an analytics system to help understand the case histories of vulnerable young people. The data was supplied to central government by local authorities in a complex XML format and data was often missing or inconsistent. This data was highly sensitive so strict data security protocols were necessary.</p>
<h3>The Approach</h3>
<p>The XML files were parsed and transformed into a set of relational tables. Heuristics were devised to correct missing and inconsistent values. Fields that carried a high risk of deanonymisation were removed.</p>
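<p>A minimal sketch of the flattening step, using Python's standard library XML parser. The element names and sample document below are invented stand-ins; the real format supplied by local authorities was considerably more complex:</p>

```python
import xml.etree.ElementTree as ET

# Invented sample, standing in for the much richer real-world schema.
SAMPLE = """
<Children>
  <Child>
    <ChildID>C1</ChildID>
    <Episode><Start>2019-01-04</Start><End>2019-06-30</End></Episode>
    <Episode><Start>2019-08-01</Start><End></End></Episode>
  </Child>
</Children>
"""

def episodes_table(xml_text):
    """Flatten the nested XML into one relational row per episode."""
    root = ET.fromstring(xml_text)
    rows = []
    for child in root.findall('Child'):
        child_id = child.findtext('ChildID')
        for episode in child.findall('Episode'):
            rows.append({'child_id': child_id,
                         'start': episode.findtext('Start') or None,
                         'end': episode.findtext('End') or None})  # None = still open
    return rows
```

<p>Each nested record becomes a flat row keyed by the child's identifier, ready to load into a relational table; missing values surface as <code>None</code>, where the heuristic corrections can pick them up.</p>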
<h3>Technology Used</h3>
<ul>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebooks</a></li>
<li><a href="https://www.postgresql.org/">PostgreSQL</a></li>
</ul>True 2122020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/true-212.html<p>Natural Language Processing for semantically enhanced content matching</p><h2>Natural Language Processing for semantically enhanced content matching</h2>
<h3>The Client</h3>
<p><a href="https://www.true212.com/">True 212</a></p>
<h3>The Problem</h3>
<p>True 212 wanted to identify relevant content to link to from their news and culture blogs. They believed that a simple bag-of-words approach would lead to naive matches, and wished to extract semantics from the documents to enable richer matches.</p>
<h3>The Approach</h3>
<p>A NLP pipeline was created with the following stages.</p>
<p>A Named Entity Recognition system that identified candidate named entities in a document and found corresponding <a href="https://www.wikidata.org/">WikiData</a> entities. Known relationships between WikiData entities were used to disambiguate candidate matches.</p>
<p>A Part of Speech Tagger that used Hidden Markov Models to return the probability distribution over the part of speech categories used in WordNet for each word in a sentence.</p>
<p>A Word Sense Disambiguation component that used the Viterbi algorithm to find the maximum likelihood sequence of <a href="https://wordnet.princeton.edu/">WordNet</a> IDs corresponding to the words in a given sentence, allowing for stopwords, multiword expressions, named entities and out-of-vocabulary words. This achieved state-of-the-art accuracy (70%).</p>
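<p>The Viterbi step can be sketched generically: given per-position probability distributions over hidden states and a transition matrix, recover the maximum-likelihood state sequence. The matrices below are toy values, not the WordNet model:</p>

```python
import numpy as np

def viterbi(emissions, transitions, initial):
    """Maximum-likelihood state sequence for a hidden Markov model.

    emissions:   (T, S) probability of each of S states at each position.
    transitions: (S, S) P(next state j | current state i).
    initial:     (S,) prior over the first state.
    """
    T, S = emissions.shape
    score = np.log(initial) + np.log(emissions[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        step = score[:, None] + np.log(transitions)  # (S, S) candidate scores
        back[t] = step.argmax(axis=0)                # best predecessor per state
        score = step.max(axis=0) + np.log(emissions[t])
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                    # follow the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

<p>In the word sense disambiguation setting, the states would be WordNet IDs and the emissions would come from the tagger's per-word distributions; handling stopwords, multiword expressions and out-of-vocabulary words is additional machinery on top of this core.</p>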
<p>A Latent Semantic Indexing model which was trained on the semantically enhanced documents to perform rich matching.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://numpy.org/">Numpy</a></li>
<li><a href="https://www.scipy.org/">Scipy</a></li>
<li><a href="https://scikit-learn.org/stable/">Scikit-Learn</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://www.mongodb.com/">MongoDB</a></li>
</ul>Video: The Grammar of Truth and Lies2020-06-01T00:00:00+01:002020-06-01T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-01:/video-the-grammar-of-truth-and-lies.html<p>Using NLP to detect fake news</p><h2>Using NLP to detect fake news</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OyA59kIQcAU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My talk on Fake News detection, "The Grammar of Truth and Lies", has gone down well at a couple of Meetups and a lunchtime talk for <a href="https://PlayfulTechnology.co.uk/amey-strategic-consulting.html">a client</a>, so I decided to make a version for my <a href="https://www.youtube.com/channel/UCx20P1dncSSFqwusJ6uNUbg">YouTube channel</a>.</p>Video: The Entropy of "Alice In Wonderland"2020-05-26T00:00:00+01:002020-05-26T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-05-26:/video-the-entropy-of-alice-in-wonderland.html<p>Video explaining an entropy-based keyword extraction technique, using "Alice In Wonderland"</p><h2>Video explaining an entropy-based keyword extraction technique, using "Alice In Wonderland"</h2>
<p>Here is the first of a new video series discussing NLP and Data Science.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zC4ZXvAxnHA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>It discusses an <a href="https://arxiv.org/abs/0907.1558">entropy-based keyword extraction algorithm</a> devised by Marcello Montemurro and Damian Zanette, for which I created the <a href="https://github.com/PeteBleackley/gensim/blob/release-3.8.3/gensim/summarization/mz_entropy.py">Gensim implementation</a>. To illustrate its use, I've analysed the text of <em>Alice in Wonderland</em>, in this <a href="https://www.kaggle.com/petebleackley/entropy-based-keyword-extraction">Kaggle kernel</a>.</p>
<p>I have more of these videos planned for the near future, and I am also planning a webinar series entitled <a href="https://PlayfulTechnology.co.uk/pages/ask-a-data-scientist.html">Ask a Data Scientist</a>. People will be able to send in data science questions, which I will answer with live coding examples. Subscribers will be able to take part in the live event, and the recording will be available on the channel afterwards. If you're interested in this, <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Ask%20a%20Data%20Scientist">please get in touch</a>.</p>The Entropy of "Alice in Wonderland"2020-05-13T00:00:00+01:002020-05-13T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-05-13:/the-entropy-of-alice-in-wonderland.html<p>Demonstration of Montemurro and Zanette's information theory based keyword algorithm</p><h2>Demonstration of Montemurro and Zanette's information theory based keyword algorithm</h2>
<p>Several years ago, I read in <a href="https://www.newscientist.com/">New Scientist</a> about an information theory based technique for identifying the most significant words in a document, according to the role they play in its structure. After looking up the paper, <a href="https://arxiv.org/abs/0907.1558">Towards the quantification of semantic information in written language</a> by Marcello Montemurro and Damian Zanette, I implemented the algorithm and contributed it to <a href="https://radimrehurek.com/gensim/">Gensim</a>. Unfortunately, it's no longer in the latest release, but I have created a <a href="https://github.com/PeteBleackley/gensim">fork of Gensim</a> to allow further development of features that have been dropped from the latest release.</p>
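<p>The core idea can be sketched in a few lines. This simplified version splits the text into equal blocks and scores each word by how far the entropy of its distribution over blocks falls short of an even spread; the published algorithm compares against a randomly shuffled text instead, so treat this as an approximation of the idea rather than the Gensim implementation.</p>

```python
import math
from collections import Counter

def mz_keywords(tokens, n_blocks=8):
    """Score words by how unevenly they cluster across blocks of the text."""
    size = max(1, len(tokens) // n_blocks)
    blocks = [Counter(tokens[i:i + size]) for i in range(0, len(tokens), size)]
    totals = Counter(tokens)
    scores = {}
    for word, total in totals.items():
        if total < 2:
            continue  # too rare to score
        # Entropy of the word's distribution over blocks.
        probs = [block[word] / total for block in blocks if block[word]]
        entropy = -sum(p * math.log2(p) for p in probs)
        # Entropy if the same occurrences were spread as evenly as possible.
        max_entropy = math.log2(min(total, len(blocks)))
        scores[word] = max_entropy - entropy  # larger deficit = more clustered
    return scores
```

<p>Words that concentrate in particular sections score highly, while function words spread evenly through the text score near zero.</p>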
<p>When I found the text of <em>Alice's Adventures in Wonderland</em> as a Kaggle Dataset, it provided the opportunity to create a demonstration for the algorithm.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/entropy-based-keyword-extraction?kernelSessionId=34819997" title="Entropy Based Keyword Extraction" width="100%"></iframe>
<p>I also created a video explaining it.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zC4ZXvAxnHA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>If you are interested in document analysis, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Document%20analysis">contact me</a>.</p>Apple, Bias, Credit2019-11-12T00:00:00+00:002019-11-12T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2019-11-12:/apple-bias-credit.html<p>The importance of understanding your data before using it to train a model</p><h2>The importance of understanding your data before using it to train a model</h2>
<p>News has recently broken that <a href="https://www.zdnet.com/article/apple-card-issuer-investigated-over-gender-bias-in-credit-algorithm/">Apple's credit card gives women lower credit limits than men, even when they have identical credit histories to their husbands</a>. While we don't know precisely what causes this, anybody who knows the basics of machine learning could tell you that if you train on biased data, you get a biased model.</p>
<p>Apple and Goldman Sachs appear to have made one of the most basic data science errors in the book. They've thrown a lot of data at a model (probably a black-box model), without making sure they'd understood it first. If they had done a proper exploratory analysis beforehand, they could have identified potential sources of bias in their data and corrected for them.</p>
<p>An example from one of my previous projects illustrates the importance of understanding your data. <a href="https://PlayfulTechnology.co.uk/formisimo.html">Formisimo</a> wanted to predict in real-time whether users would complete or abandon web-forms. Their existing models were capable of predicting to a certain degree of accuracy whether a customer had completed or abandoned the form given a complete history of their interactions, but didn't work when simulating real-time behaviour. My investigations showed that it was only in the last hundred interactions that a real signal of whether the user would complete or not was present. Taking this into account enabled me to create much better models for them.</p>
<p>Apple now need to go over their training data, work out where the source of bias is, and fix it. If they need a fresh pair of eyes on it, they can <a href="mailto:peter.bleackley@playfultechnology.co.uk">contact Playful Technology Limited</a>.</p>The Grammar of Truth and Lies2019-05-08T00:00:00+01:002019-05-08T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2019-05-08:/the-grammar-of-truth-and-lies.html<p>Using Computational Linguistics to Detect Fake News</p><h2>Using Computational Linguistics to Detect Fake News</h2>
<p>There's an old saying that I first read in a Terry Pratchett book that "A lie can run around the world before the truth has got its boots on." This seems to be a particular problem on the Internet at the moment, as propaganda, conspiracy theories and outright dishonesty are big business. As a computational linguist, I started to wonder if there was any way that natural language processing could be used to distinguish between real news and fake news. So when a corpus of fake news articles was published on Kaggle, I decided to investigate.</p>
<p><a href="https://www.kaggle.com/petebleackley/the-grammar-of-truth-and-lies">The Grammar of Truth and Lies</a></p>
<p>The first thing I needed was a sample of news articles from a reliable source to compare the fake news corpus with. For this, I used the Reuters Corpus, which is available as part of <a href="https://www.nltk.org/">NLTK</a>. Fortunately, it contained a similar number of articles to the fake news corpus, thus avoiding class-imbalance issues.</p>
<p>The next challenge was what features to use. I decided not to use vocabulary, since the news stories covered in the Reuters corpus and those in the fake news corpus were from different time periods, and so this would introduce bias - it would be possible to train a model that thought that any mention of "Hillary Clinton" was automatically fake news, for example. Therefore, I used features based on the grammatical structure of sentences. Using <a href="https://textblob.readthedocs.io/en/dev/">TextBlob</a>, I performed Part of Speech tagging on the document and concatenated the tags to form a feature for each sentence. These were, of course, extremely sparse, so I used <a href="https://radimrehurek.com/gensim/">Gensim</a> to perform Latent Semantic Indexing, before classifying with Logistic Regression and Random Forest models from <a href="https://scikit-learn.org/stable/">scikit-learn</a>.</p>
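<p>The grammatical feature extraction can be sketched as below. To keep the example self-contained it starts from pre-tagged sentences (in the real pipeline, TextBlob's part-of-speech tagger supplies the (word, tag) pairs); each sentence is collapsed into a single string of tags, which the downstream models treat as one vocabulary item.</p>

```python
def sentence_features(tagged_sentences):
    """Collapse each POS-tagged sentence into one tag-sequence feature.

    tagged_sentences -- list of sentences, each a list of (word, tag) pairs
    """
    return ['-'.join(tag for _, tag in sentence)
            for sentence in tagged_sentences]

# Hypothetical tagged sentences; a POS tagger would produce pairs like these.
sample = [[('The', 'DT'), ('senator', 'NN'), ('resigned', 'VBD')],
          [('Shocking', 'JJ'), ('truth', 'NN'), ('revealed', 'VBN')]]
print(sentence_features(sample))  # ['DT-NN-VBD', 'JJ-NN-VBN']
```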
<p>The results were OK, but I thought I could do better. At first I tried adding sentiment analysis to the model, which brought a moderate improvement, but then I remembered that stopword frequencies are often used for stylometric analysis, such as author identification. Since they're independent of the content and largely governed by subconscious factors, I thought that they might possibly contain signals of dishonest intent, so I added them to the feature extraction.</p>
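<p>The stopword features can be sketched like this; the miniature stopword list is illustrative, where a real system would use a full list such as NLTK's.</p>

```python
from collections import Counter

# Illustrative miniature stopword list; a real system would use a full
# list such as NLTK's.
STOPWORDS = ['the', 'a', 'of', 'and', 'to', 'in', 'that', 'it', 'is', 'was']

def stopword_profile(tokens):
    """Relative frequency of each stopword in a document.

    Returns one number per stopword: a fixed-length stylometric feature
    vector that is independent of the document's topic.
    """
    counts = Counter(token.lower() for token in tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in STOPWORDS]
```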
<p>This gave me a classifier that was 90% accurate in distinguishing between fake news and reliable sources. It can't say definitively whether an article is true or not, but it's good at picking up whether an article looks similar to reliable news or fake news. The best thing is that the model is quite simple, so that finding signals of dishonest intent in online content is clearly a tractable problem.</p>
<p>Watch me presenting this at PyData London.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4m1e--6yQWI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>Neural Isn't Always Better2018-10-10T00:00:00+01:002018-10-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2018-10-10:/neural-isnt-always-better.html<p>Neural networks don't match the performance of the Viterbi algorithm for Word Sense Disambiguation</p><h2>Neural networks don't match the performance of the Viterbi algorithm for Word Sense Disambiguation</h2>
<p>For two previous clients, <a href="https://www.metafused.com/">Metafused</a> and <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, I have created <a href="http://www.scholarpedia.org/article/Word_sense_disambiguation">Word Sense Disambiguation systems</a>. I've used a Bayesian Viterbi algorithm, and in each case, achieved an accuracy of 70%, which, according to the Scholarpedia article I've just linked, is good. However, I've always wondered if I could do better. Since neural networks (or "deep learning") are fashionable in AI research at the moment, I thought that while I was between projects, I'd have a go at seeing what they could do. After all, they are currently popular for machine translation, which is an analogous problem.</p>
<p>I tried two different neural architectures - LSTM and Convolutional networks using <a href="http://keras.io/">Keras</a>. In each case, I trained two <a href="https://radimrehurek.com/gensim/models/lsimodel.html">LSI</a> models on the <a href="https://www.gabormelli.com/RKB/SemCor_Corpus">Semcor</a> corpus, one representing words, the other <a href="https://wordnet.princeton.edu/">WordNet</a> senses. The neural network was trained to map from one embedding to the other, and the WordNet embedding was searched for the word sense closest to each vector produced by the neural network. And the results were...</p>
<p>Absolute gibberish. The output bore no resemblance to the input whatsoever.</p>
<p>Why was that? Well, with the Viterbi algorithm, I only had to search for word senses that were relevant to the input word. The neural network had to search the entire space of WordNet senses, and this was a pretty dense embedding. That meant that the slightest error in the mapping would lead to the wrong sense being identified.</p>
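<p>To see why this is fragile, here is a sketch of the search step: the predicted vector is matched against sense embeddings by cosine similarity. When the dictionary contains every WordNet sense rather than just a word's own candidates, many senses lie close together and a small error in the predicted vector changes the winner. The vectors here are toy two-dimensional examples.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def nearest_sense(vector, sense_vectors):
    """Return the sense whose embedding is closest to the predicted vector."""
    return max(sense_vectors, key=lambda s: cosine(vector, sense_vectors[s]))
```

<p>Restricting <code>sense_vectors</code> to a word's own candidate senses, as the Viterbi approach effectively does, shrinks the search space enormously; the neural system had to search all of WordNet.</p>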
<p>Secondly, I think that the limit of 70% accuracy in Word Sense Disambiguation comes from the training data. There's only really Semcor available, and while it's a good corpus, I believe that a much bigger corpus of WordNet tagged sentences would be necessary to make a significant improvement in Word Sense Disambiguation performance. Modern machine translation systems use huge corpora harvested from the web, and even then they are often ropy and fragile.</p>
<p>The ScholarPedia article linked above suggests that using some global information about a document may improve the performance. At a later date I may experiment with integrating this into the Viterbi algorithm.</p>Ontologies for Named Entity Recognition2018-01-04T00:00:00+00:002018-01-04T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2018-01-04:/ontologies-for-named-entity-recognition.html<p>Semantic relationships make an ontology useful for Named Entity Recognition</p><h2>Semantic relationships make an ontology useful for Named Entity Recognition</h2>
<p>I once had two projects in succession where I was trying to identify named entities in free text. One was successful, the other wasn't, and the reasons why are interesting.</p>
<p>The first project was for <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, and I used the <a href="https://www.wikidata.org/">WikiData</a> ontology. The second was for a pharmaceutical company, and used the <a href="https://www.nlm.nih.gov/mesh/">MeSH</a> ontology. In each case, a search of the ontology database would return several false positives - for example, searching for "Africa" might return <a href="https://www.wikidata.org/wiki/Q15">the continent</a> or <a href="https://www.wikidata.org/wiki/Q181238">the Roman Province</a>, whereas "Lagos" could be <a href="https://www.wikidata.org/wiki/Q8673">the capital of Nigeria</a> or <a href="https://www.wikidata.org/wiki/Q8780001">a railway station in Portugal</a>. However, WikiData doesn't just store entities, it makes claims about them - that is, it encodes semantic relationships between them. Therefore, if a document mentions both "Lagos" and "Africa", a Named Entity Recognition system based on WikiData can use the fact that Lagos is a city in Africa to determine which Lagos you mean and which Africa you mean.</p>
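<p>This disambiguation strategy can be sketched with a toy claims graph. The Q-numbers are the WikiData IDs linked above, but the claim set and the scoring function are illustrative, not WikiData's API.</p>

```python
# Toy claims graph: one WikiData-style claim linking the Nigerian city of
# Lagos (Q8673) to the continent of Africa (Q15).  Illustrative only.
CLAIMS = {('Q8673', 'Q15')}

# Candidate entities returned by an ontology search for each mention.
CANDIDATES = {'Lagos': ['Q8673', 'Q8780001'],    # Nigerian city / Portuguese station
              'Africa': ['Q15', 'Q181238']}      # continent / Roman province

def disambiguate(candidates, claims):
    """For each mention, pick the candidate with the most claims linking it
    to the other mentions' candidates."""
    resolved = {}
    for mention, options in candidates.items():
        others = [q for m, opts in candidates.items() if m != mention
                  for q in opts]
        def links(q):
            # Count claims in either direction between q and the others.
            return sum((q, o) in claims or (o, q) in claims for o in others)
        resolved[mention] = max(options, key=links)
    return resolved

print(disambiguate(CANDIDATES, CLAIMS))  # {'Lagos': 'Q8673', 'Africa': 'Q15'}
```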
<p>That unfortunately wasn't the case with MeSH. It didn't encode relationships between the medical terms it documents in any useful way, so it wasn't possible to perform the same sort of disambiguation as with WikiData. The key insights from this are that relationships are meaning, and that before working with an ontology, it's vital to know not just what entities it contains, but what relationships between them it encodes. An ontology of entities can be used for manual tagging, but for analysis, you need an ontology of relationships.</p>A Hidden Markov Model Library2016-08-03T00:00:00+01:002016-08-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2016-08-03:/a-hidden-markov-model-library.html<p>An open source Python library for HMMs</p><h2>An open source Python library for HMMs</h2>
<p>A few years ago I wrote a <a href="https://sourceforge.net/projects/python-hidden-markov/">Python library for Hidden Markov Models</a> and released it on <a href="https://pypi.python.org/pypi/Markov">PyPI</a>. I've now decided that I want to get a few more people involved in it, so I gave a <a href="http://www.slideshare.net/PeterBleackley/a-hidden-markov-model-library-64653215">lightning talk at the 25th Pydata London meetup</a>.</p>
<p>If you're interested in contributing, please follow the links above for more information and <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Hidden%20Markov%20Model%20Library">get in touch</a>.</p>Investigating the Breast Cancer Proteome on Kaggle2016-06-25T00:00:00+01:002016-06-25T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2016-06-25:/investigating-the-breast-cancer-proteome-on-kaggle.html<p>Finding clusters of proteins activity in a sample of data from breast cancer patients and predicting clinical data</p><h2>Finding clusters of proteins activity in a sample of data from breast cancer patients and predicting clinical data</h2>
<p>At a Pydata London meetup, somebody asked me if I'd ever done anything on Kaggle. I said I'd had a look at it, but hadn't found any competitions that I cared enough about to enter. He told me about a sample of protein activity from breast cancer patients, and I thought that that would be an interesting and potentially worthwhile thing to work on.</p>
<p>Previous investigations had involved clustering the patients, so I decided to cluster the proteins. Using hierarchical clustering I classified the proteins as belonging to 8 clusters. Then, I projected each patient's protein activity onto the space of these clusters, and attempted to use these to predict the patients' clinical data, mainly using Logistic Regression.</p>
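<p>The projection step can be sketched like this: once each protein has been assigned to a cluster, a patient's activity profile is reduced to one mean-activity feature per cluster. The data structures here are hypothetical stand-ins for the actual protein-by-patient matrix.</p>

```python
from collections import defaultdict

def cluster_features(activity, cluster_of):
    """Project one patient's protein activity onto cluster space.

    activity   -- dict: protein -> activity level for this patient
    cluster_of -- dict: protein -> cluster label, e.g. from hierarchical
                  clustering of the proteins
    Returns the mean activity within each cluster: one feature per cluster,
    suitable as input to a classifier such as logistic regression.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for protein, level in activity.items():
        cluster = cluster_of[protein]
        sums[cluster] += level
        counts[cluster] += 1
    return {cluster: sums[cluster] / counts[cluster] for cluster in sums}
```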
<p><a href="https://www.kaggle.com/petebleackley/d/piotrgrabo/breastcancerproteomes/clustering-proteins">My results can be seen in this Kaggle Kernel</a>. They are as good as I could have hoped for, in that they appear to contain information that might help to treat cancer. In particular, patients with activity in a particular cluster of proteins appear to have a much better chance of survival than other patients. When I originally created the kernel, it took too long to run on Kaggle's servers, but it is now possible to run it on GPUs, and the results can be seen.</p>