Playful Technology Limitedhttps://PlayfulTechnology.co.uk/2023-11-30T00:00:00+00:00The Viterbi Algorithm2023-11-30T00:00:00+00:002023-11-30T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-30:/the-viterbi-algorithm.html<p>Finding the hidden states that generated a sequence</p><h2>Finding the Hidden States that generated a sequence</h2>
<p>Suppose we have a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span> generated by a <a href="https://PlayfulTechnology.co.uk/hidden-markov-models.html">Hidden Markov Model</a>. One thing we may wish to do is infer the maximum likelihood sequence of hidden states <span class="math">\(S_{0},S_{1}...S_{t}\)</span> that gave rise to it. A useful technique for this is the <em>Viterbi Algorithm</em>.</p>
<p>The Viterbi algorithm represents the possible paths through the sequence of hidden states as a graphical model called a <em>trellis</em>. The possible hidden states at each time step are represented by nodes, with edges representing the transitions between them.</p>
<p>At each time step <span class="math">\(t\)</span>, we start by calculating the probabilities of the hidden states given the observation at that time step, <span class="math">\(P(S_{i,t} \mid X_{t})\)</span>, and place corresponding nodes on the trellis. We then find the maximum likelihood predecessor for each node</p>
<div class="math">$$\texttt{argmax}_{j} \left( P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) \right)$$</div>
<p>and connect an edge from it to its successor. Any nodes at <span class="math">\(t-1\)</span> that have no outgoing edges are then deleted, along with their incoming edge, and this is repeated at each previous time step until no more nodes can be deleted. Then, working forwards through the trellis from the first step at which nodes were deleted, we recalculate the probabilities at each timeslice as </p>
<div class="math">$$P^{\prime}(S_{i,t}) = \frac{P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}{\sum_{i} P(S_{j,t-1}) P(S_{i,t} \mid S_{j,t-1}) P(X_{t} \mid S_{i,t})}$$</div>
<p>where <span class="math">\(S_{i,t}\)</span> are the remaining states at time <span class="math">\(t\)</span> and <span class="math">\(S_{j,t-1}\)</span> is the maximum likelihood predecessor of each state. </p>
<p>At the end of the sequence, we may select the maximum likelihood final state <span class="math">\(\texttt{argmax} P(S_{i,t})\)</span>. The path leading to it is then the maximum likelihood sequence of states given the observations. The Viterbi Algorithm is particularly suitable for real-time applications, as any time step where the number of possible states falls to 1 may be output immediately and removed from the trellis, which in turn reduces memory requirements and computation time.</p>
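<p>To make the procedure concrete, here is a minimal sketch of the textbook formulation of the Viterbi algorithm in Python. It uses log-probabilities and backpointers rather than the explicit trellis pruning described above; the function name and NumPy array layout are my own choices for illustration.</p>

```python
import numpy as np

def viterbi(observations, initial, transition, emission):
    """Maximum likelihood hidden state sequence for an HMM.

    initial: shape (N,), prior over hidden states
    transition: shape (N, N), transition[j, i] = P(S_i at t | S_j at t-1)
    emission: shape (N, M), emission[i, x] = P(X = x | S_i)
    """
    n_states = len(initial)
    T = len(observations)
    # log-probabilities avoid underflow on long sequences
    log_prob = np.log(initial) + np.log(emission[:, observations[0]])
    backpointers = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # scores[j, i]: best path ending in state j at t-1, then moving to state i
        scores = log_prob[:, None] + np.log(transition)
        backpointers[t] = np.argmax(scores, axis=0)   # maximum likelihood predecessors
        log_prob = (scores[backpointers[t], np.arange(n_states)]
                    + np.log(emission[:, observations[t]]))
    # trace the best path backwards from the maximum likelihood final state
    path = [int(np.argmax(log_prob))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

# Classic example: states = (healthy, fever), observations = (normal, cold, dizzy)
path = viterbi([0, 1, 2],
               np.array([0.6, 0.4]),
               np.array([[0.7, 0.3], [0.4, 0.6]]),
               np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
# path == [0, 0, 1]: healthy, healthy, fever
```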
<p>I first encountered the Viterbi algorithm in the context of error-correcting codes for digital television. The sequence of bits to be transmitted in a digital TV signal can be protected against errors by interspersing it with extra bits derived from a <em>convolutional code</em> - this is a binary function of a number of previous bits. This converts the transmitted sequence from an apparently random sequence (due to data compression) to a Markov process. At the receiving side, we treat the received bitstream (which inevitably contains errors) as the observations and the transmitted bitstream as the hidden states, using the Viterbi algorithm to recover it.</p>
<p>I later used the Viterbi Algorithm for <a href="https://PlayfulTechnology.co.uk/true-212.html">Word Sense Disambiguation</a>. In this application, the observations were words and the hidden states were <a href="https://wordnet.princeton.edu/">WordNet</a> word senses. There were a few complications to take into account - function words, out-of-vocabulary words, multi-word expressions, proper names - but it achieved 70% accuracy, which was described to me as "state of the art".</p>
<p>It's this flexibility and applicability to a range of different problems that makes the Viterbi Algorithm one of my favourite algorithms.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Hidden Markov Models2023-11-23T00:00:00+00:002023-11-23T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-23:/hidden-markov-models.html<p>Using Bayes' Theorem to analyse sequences</p><h2>Using Bayes' Theorem to analyse sequences</h2>
<p>Suppose we wish to analyse a sequence of events <span class="math">\(X_{0},X_{1}...X_{t}\)</span>. This can be modelled using <a href="https://PlayfulTechnology.co.uk/bayes-theorem.html">Bayes' Theorem</a> as a <em>Markov process</em> <span class="math">\(P(X_{t} \mid X_{t-1})\)</span>, <em>ie</em> the probability of each event depends on the previous event in the sequence.</p>
<p>If there are <span class="math">\(N\)</span> possible values that <span class="math">\(X\)</span> can take, the number of transition probabilities between them is <span class="math">\(N^{2}\)</span>. Such a model would quickly become very large and not very informative. We need a way to make the models more tractable.</p>
<p>To do this, we assume that the probability of each event can be described in terms of a hidden state, <span class="math">\(S\)</span>, as <span class="math">\(P(X_{t} \mid S_{t})\)</span>. The states can then be modelled by a Markov process, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>. This is known as a <em>Hidden Markov Model</em>, since it models a sequence of hidden states with a Markov process. The number of hidden states can be considerably smaller than the number of possible events, and the states can group events into meaningful categories. The model consists of three distributions: the initial state distribution, <span class="math">\(P(S_{0})\)</span>, the transition probability distribution, <span class="math">\(P(S_{t} \mid S_{t-1})\)</span>, and the conditional distribution of the events, <span class="math">\(P(X \mid S)\)</span>. </p>
<p>Starting from the initial state distribution <span class="math">\(P(S_{0})\)</span>, we can calculate the posterior distributions of the hidden states at each step <span class="math">\(t\)</span> of a sequence by the following method.</p>
<ol>
<li>Calculate the posterior distribution of the hidden state given the observed event <span class="math">\(X_{t}\)</span> using Bayes' Theorem
<div class="math">$$P(S_{t} \mid X_{t}) = \frac{P(S_{t}) P(X_{t} \mid S_{t})}{P(X_{t})}$$</div>
</li>
<li>Calculate the prior probability of the next state
<div class="math">$$P(S_{t+1}) = \sum_{S_{t}} P(S_{t+1} \mid S_{t}) P(S_{t} \mid X_{t})$$</div>
</li>
</ol>
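<p>The two-step recursion above can be sketched in Python as follows (a hypothetical helper written for illustration, not taken from any particular HMM library; the array shapes are my own assumptions):</p>

```python
import numpy as np

def filter_states(observations, initial, transition, emission):
    """Posterior distribution over hidden states at each step.

    initial: shape (N,), the initial state distribution P(S_0)
    transition: shape (N, N), transition[j, i] = P(S_i at t+1 | S_j at t)
    emission: shape (N, M), emission[i, x] = P(X = x | S_i)
    """
    prior = initial.copy()
    posteriors = []
    for x in observations:
        # Step 1: Bayes' theorem, conditioning on the observed event
        joint = prior * emission[:, x]
        posterior = joint / joint.sum()      # dividing by P(X_t) normalises
        posteriors.append(posterior)
        # Step 2: propagate through the transition model to get the next prior
        prior = posterior @ transition
    return np.array(posteriors)
```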
<p>A concrete example is <a href="https://PlayfulTechnology.co.uk/video-part-of-speech-tagging.html">Part of Speech Tagging</a>. In this application, the observed events are words and the hidden states are the parts of speech (noun, verb, adjective etc.). This approach is particularly useful when you want the probability of each part of speech for a given word, rather than a single tag. I used this approach in my work at <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, using my own open source <a href="https://PlayfulTechnology.co.uk/a-hidden-markov-model-library.html">Hidden Markov Model library</a>, which I had created as a learning exercise when I first learnt about HMMs. I was pleased to discover that a colleague on that project had also used the library, but I no longer maintain it, as I've learnt a lot since then and if I did any more work on it I'd prefer to restart it from scratch.</p>
Bayes' Theorem2023-11-16T00:00:00+00:002023-11-16T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-16:/bayes-theorem.html<p>The probability of a hypothesis given observations.</p><h2>Estimating the probability of a hypothesis given observations.</h2>
<p>This is the beginning of what will hopefully be a regular series of articles explaining Key Algorithms in data science.</p>
<p>If you look at <a href="https://www.linkedin.com/in/peterjbleackley">my LinkedIn profile</a>, you'll see that the banner shows the formula </p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(O)}$$</div>
<p>This is a foundational rule for calculating conditional probabilities, known as <em>Bayes' Theorem</em>, after the Reverend Thomas Bayes, who first proposed it. It may be read as <em>the probability of a hypothesis given some observations is equal to the prior probability of the hypothesis multiplied by the probability of the observations given that hypothesis, and divided by the probability of the observations</em>. </p>
<p>To illustrate this, consider a family where the father has rhesus-positive blood and the mother has rhesus-negative blood. Rhesus-positive is a dominant trait - the father might have one or two copies of the Rh+ gene, whereas rhesus-negative is recessive - the mother must have two copies of the Rh- gene.</p>
<p>Let <span class="math">\(H\)</span> be the hypothesis that the father has two copies of the Rh+ gene. Without further information, <span class="math">\(P(H) = \frac{1}{2}\)</span> is the best estimate. If the family's first child is rhesus-positive, the probability of this is <span class="math">\(P(O \mid H) = 1\)</span> if the father has two copies of the Rh+ gene and <span class="math">\(P(O \mid ¬H) = \frac{1}{2}\)</span> if he has one copy. In general, the overall probability of the observations given a set of hypotheses <span class="math">\(H_{i}\)</span> is given by
</p>
<div class="math">$$P(O) = \sum_{i} P(H_{i}) P(O \mid H_{i})$$</div>
<p>since the posterior probabilities of all hypotheses must sum to 1. Therefore, we can update the probability of the father having two copies of the Rh+ gene as
</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{1}{2} \times 1}{\frac{1}{2} \times 1 + \frac{1}{2} \times \frac{1}{2}} = \frac{2}{3}$$</div>
<p>If the family's second child is also rhesus-positive, we can further update our estimate with the new information</p>
<div class="math">$$P(H \mid O) = \frac{P(H) P(O \mid H)}{P(H) P(O \mid H) + P(¬H) P(O \mid ¬H)} = \frac{\frac{2}{3} \times 1}{\frac{2}{3} \times 1 + \frac{1}{3} \times \frac{1}{2}} = \frac{4}{5}$$</div>
<p>It is easy to see that if we had known both children's blood groups from the outset, and used <span class="math">\(P(O \mid ¬H) = \frac{1}{4}\)</span> we could have got the same result.</p>
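<p>The whole worked example can be reproduced with a few lines of Python (the <code>bayes_update</code> helper is hypothetical, written for illustration):</p>

```python
def bayes_update(prior, likelihoods):
    """Posterior P(H_i | O) from priors P(H_i) and likelihoods P(O | H_i)."""
    evidence = sum(p * l for p, l in zip(prior, likelihoods))  # P(O), by the sum rule
    return [p * l / evidence for p, l in zip(prior, likelihoods)]

# Hypotheses: father has two copies of Rh+ (H) or one copy (not H).
# Each rhesus-positive child has likelihood 1 under H and 1/2 under not H.
posterior = bayes_update([0.5, 0.5], [1.0, 0.5])   # after the first child: [2/3, 1/3]
posterior = bayes_update(posterior, [1.0, 0.5])    # after the second child: [4/5, 1/5]
```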
<p>In data science, we often have to estimate the probability of a hypothesis given some evidence, so Bayes' theorem is a useful thing to have in our toolkit. </p>
<p>If we need to take observations of several different variables into account, there are two ways we can do it. The first, the <em>Naive Bayes</em> approach, treats all the variables as statistically independent, as we did in the above example. While this has the advantage of simplicity, it is only really viable when the independence assumption is a reasonable approximation.</p>
<p>For more complex problems, we need to model the dependencies between variables. We do this with a graphical method called a <em>Bayesian Belief Net</em>, where each node on a graph represents a variable, and the links represent dependencies between them. Each node then calculates the probability of the variable it represents in terms of the variables it is dependent on. A simple example can be seen in the Data Science Notebook <a href="https://PlayfulTechnology.co.uk/is-it-a-mushroom-or-is-it-a-toadstool.html">Is It a Mushroom or Is It a Toadstool?</a>.</p>
<p>For my first AI project, I was asked to choose the best system to implement an automatic diagnostic system. I chose a Bayesian Belief Network on the grounds that it was important for the system to be explainable. Since each node of the Bayesian Belief Network represents a meaningful variable, its results are more explainable than those of a neural network, whose nodes are simply steps in a calculation. More recently I used Bayesian models in a project to predict the optimum settings for machine tools, so Bayes' Theorem has followed me throughout my data science career.</p>
QARAC: Porting to PyTorch2023-11-08T00:00:00+00:002023-11-08T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-11-08:/qarac-porting-to-pytorch.html<p>PyTorch is more suitable for co-training multiple models and objectives</p><p>Most of my previous work with neural networks has been based on the <a href="https://keras.io">Keras</a> library, so in implementing QARAC, I initially used what I was most familiar with. Some lower-level parts of the algorithm were implemented in <a href="https://tensorflow.org">TensorFlow</a>, which for some years has been the default backend to Keras (however, it is now possible to use Keras with a choice of backends again). </p>
<p>I decided to do some local testing before committing large amounts of compute time to training the models, but when I did so, I got the following warning.</p>
<div class="highlight"><pre><code>WARNING:tensorflow:Gradients do not exist for variables ['tf_roberta_model/roberta/pooler/dense/kernel:0', 'tf_roberta_model/roberta/pooler/dense/bias:0', 'qarac_trainer_model/qarac_encoder_model/global_attention_pooling_head/local projection:0', 'qarac_trainer_model/qarac_encoder_model_1/global_attention_pooling_head_1/local projection:0', 'tf_roberta_model_1/roberta/pooler/dense/kernel:0', 'tf_roberta_model_1/roberta/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
</code></pre></div>
<p>It looks like the <a href="https://PlayfulTechnology.co.uk/qarac-models-and-corpora.html">Training Model</a> wasn't able to propagate gradients between its constituent models. This seems to be a feature of the architecture of Keras, in which <code>Model</code>s are made up of <code>Layer</code>s. Layers are designed to be components of a larger model, and so propagate gradients across their inputs, whereas models, which are intended to be complete systems, do not. From Keras's point of view, I was trying to use models as layers, and it didn't like it.</p>
<p>Since <a href="https://huggingface.co">HuggingFace</a> models are available in both TensorFlow and <a href="https://pytorch.org">PyTorch</a>, I looked to see if PyTorch would be more suitable for what I wanted to do. I found that PyTorch doesn't make the same distinction that Keras does between layers and models - both are <code>Module</code>s, so there would be no problem with propagating gradients between them. The learning curve going from Keras to PyTorch wasn't too steep. The main differences were that the method of a Keras layer that's called <code>call</code> is called <code>forward</code> in PyTorch, and there's no direct equivalent of a Keras model's <code>compile</code> and <code>fit</code> methods, so you have to write a training loop. Also, HuggingFace's PyTorch and TensorFlow models aren't exact drop-in replacements for each other, so on occasion adjustments were needed where one wanted a parameter that the other didn't. </p>
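<p>A toy example illustrates the difference. In the PyTorch sketch below (my own minimal example, not QARAC code), one <code>Module</code> is nested inside another, and gradients flow through the nested one without any special handling:</p>

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A stand-in for a complete model, used here as a component."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 3)
    def forward(self, x):          # Keras's `call` becomes `forward` in PyTorch
        return self.linear(x)

class Trainer(nn.Module):
    """Composes a 'model' as a sub-module; gradients flow straight through it."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.head = nn.Linear(3, 1)
    def forward(self, x):
        return self.head(self.encoder(x))

model = Trainer()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()    # no `compile`/`fit`: you write the training loop yourself
# every parameter, including those inside the nested Encoder, receives a gradient
```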
<p>You should learn something new on every project, and that has been one of my key personal goals for QARAC. I didn't envisage that I'd end up learning PyTorch for this project, but the fact that I have done is welcome and will come in useful for future projects. </p>
<p>There's only one more thing I need before I can train the models, and that's a budget for compute time, or a <a href="https://huggingface.co/docs/hub/spaces-gpus#community-gpu-grants">community hardware grant</a> from HuggingFace.</p>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnolgy.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>QARAC: Models and Corpora2023-09-14T00:00:00+01:002023-09-14T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-09-14:/qarac-models-and-corpora.html<p>Selection of models and training corpora for QARAC</p><p>I've made some early progress on developing QARAC, and I'm not far from being able to make a first attempt at training it. I've chosen base models, coded the model heads and the training model, and found appropriate datasets to train on.</p>
<h2>Models</h2>
<h3>Base models</h3>
<p>I was initially interested in using <a href="https://arxiv.org/abs/2302.10866">Hyena models</a> as my base models and training them with the <a href="http://www.natcorp.ox.ac.uk/">British National Corpus</a>. However, I found it harder to implement Hyena models in <a href="https://keras.io">Keras</a> than I anticipated, and didn't want this to be a roadblock. I've therefore decided to start by using <a href="https://huggingface.co/roberta-base">RoBERTa</a>. However, I may need to consider another model for the decoder.</p>
<h3>Model Heads</h3>
<p>For the encoder models, the head used is a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/layers/GlobalAttentionPoolingHead.py">Global Attention Pooling Head</a>. If <em>attention</em> in a transformer model is the relevance of each word in a document to the meaning of each other word, <em>global attention</em> may be defined as the relevance of each word to the overall meaning of the document. This is calculated as follows</p>
<p>Given the contextual word vectors <span class="math">\(\vec{v_{i}}\)</span> produced by the base encoder model, and two trainable matrices <span class="math">\(\mathbf{L}\)</span> and <span class="math">\(\mathbf{G}\)</span>, define the <em>local projection</em>
</p>
<div class="math">$$\vec{l_{i}} = \vec{v_{i}} \cdot \mathbf{L}$$</div>
<p> and the <em>global projection</em>
</p>
<div class="math">$$\vec{g} = \left( \sum_{i} \vec{v_{i}} \right) \cdot \mathbf{G}$$</div>
<p>. The attention is then calculated as the cosine similarity of the two projections
</p>
<div class="math">$$a_{i} = \hat{l_{i}} \cdot \hat{g}$$</div>
<p>. Finally, the encoded vector is calculated as the sum of the word vectors weighted by the attention
</p>
<div class="math">$$\vec{E} = \sum_{i} a_{i} \vec{v_{i}}$$</div>
<p>.</p>
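<p>A minimal PyTorch sketch of this head, assuming square projection matrices and batched inputs of shape <code>(batch, words, dim)</code> (the actual implementation is in the linked repository; this version is written for illustration only):</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionPoolingHead(nn.Module):
    """Pools contextual word vectors into one document vector."""
    def __init__(self, dim):
        super().__init__()
        # trainable matrices L and G from the definitions above
        self.local_projection = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.global_projection = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
    def forward(self, vectors):                          # (batch, words, dim)
        local = vectors @ self.local_projection          # l_i = v_i . L
        glob = vectors.sum(dim=1) @ self.global_projection  # g = (sum_i v_i) . G
        # a_i: cosine similarity of the normalised projections
        attention = F.cosine_similarity(local, glob.unsqueeze(1), dim=-1)
        # E = sum_i a_i v_i
        return (attention.unsqueeze(-1) * vectors).sum(dim=1)
```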
<p>For the decoder models, the head used is a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/QaracDecoderModel.py#L13">QaracDecoderHead</a>. This prepends a vector representing an encoded document to the vectors generated by the base model, passes this through a <code>TFRobertaLayer</code>, removes the first vector from the output of that layer, then feeds that through another <code>TFRobertaLayer</code> and finally a <code>TFRobertaLMHead</code>, returning the output of that layer.</p>
<h3>The Training Model</h3>
<p>To prevent <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a>, the question encoder, answer encoder and decoder must all be trained together, targeting all training objectives simultaneously. To do this, they are combined into a <a href="https://github.com/PeteBleackley/QARAC/blob/main/qarac/models/QaracTrainerModel.py">Trainer Model</a>.
Given a sentence <span class="math">\(\mathbf{S}\)</span>, a question <span class="math">\(\mathbf{Q}\)</span>, an answer <span class="math">\(\mathbf{A}\)</span>, two propositions <span class="math">\(\mathbf{P_{0}}\)</span> and <span class="math">\(\mathbf{P_{1}}\)</span>, and two statements <span class="math">\(\mathbf{s_{0}}\)</span> and <span class="math">\(\mathbf{s_{1}}\)</span>,
the following outputs are calculated</p>
<div class="math">$$\texttt{encode_decode} = \mathcal{D}(\mathcal{AE}(\mathbf{S}))$$</div>
<div class="math">$$\texttt{question_answering} = \mathcal{QE}(\mathbf{Q}) - \mathcal{AE}(\mathbf{A})$$</div>
<div class="math">$$\texttt{reasoning} = \mathcal{D}(\mathcal{AE}(\mathbf{P_{0}}) + \mathcal{AE}(\mathbf{P_{1}}))$$</div>
<div class="math">$$\texttt{consistency} = \mathit{cossim}(\mathcal{AE}(\mathbf{s_{0}}),\mathcal{AE}(\mathbf{s_{1}}))$$</div>
<p>For the decoding and reasoning objectives, the loss to be minimised is the sparse categorical crossentropy of the generated text against the target text in the training set. For question answering, it is the squared Euclidean length of the vector produced, and for consistency it is the mean squared error from the desired label (1 for consistent statements, -1 for contradictory statements, 0 for unrelated statements).</p>
<p>The output for question answering and its associated loss are chosen to reflect the intended use of the answer encoder, to generate a query vector for a vector database.</p>
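<p>Putting the four objectives together, a training step might combine the losses along these lines (a sketch with equal weights and hypothetical argument names, not the actual QARAC training code):</p>

```python
import torch
import torch.nn.functional as F

def combined_loss(decode_logits, decode_targets,
                  reason_logits, reason_targets,
                  qa_difference, consistency_pred, consistency_label):
    """Sum of the four objective losses, trained simultaneously."""
    # crossentropy over the vocabulary for the two text-generation objectives
    decode = F.cross_entropy(decode_logits.flatten(0, 1), decode_targets.flatten())
    reason = F.cross_entropy(reason_logits.flatten(0, 1), reason_targets.flatten())
    # squared Euclidean length of (question vector - answer vector)
    qa = qa_difference.pow(2).sum(dim=-1).mean()
    # mean squared error against the +1 / 0 / -1 consistency label
    consistency = F.mse_loss(consistency_pred, consistency_label)
    return decode + reason + qa + consistency
```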
<h2>Training Corpora</h2>
<h3>Question Answering</h3>
<p>For Question Answering, the most suitable corpus I have found is the <a href="https://paperswithcode.com/dataset/wikiqa">WikiQA</a> dataset. This contains a sample of questions obtained from Bing queries, along with the first paragraph of a Wikipedia article relevant to each question. The paragraph is split into sentences, one per line, and the sentences are labelled 1 if they are considered a valid answer to the question, and 0 otherwise. The rows labelled 1 will be used to train the question answering objective.</p>
<p>It has been necessary to perform coreference resolution on this dataset, for which <a href="https://docs.allennlp.org/main/">AllenNLP</a> was used. Since it was necessary to combine all the sentences for a given question into a single document to perform coreference resolution and then separate them afterwards, some rather nasty edge cases had to be dealt with.</p>
<h3>Reasoning</h3>
<p>For Reasoning, the <a href="https://github.com/ZeinabAghahadi/Syllogistic-Commonsense-Reasoning">Avicenna: Syllogistic Commonsense Reasoning</a> dataset will be used. This contains pairs of sentences, a label "yes" if they can be used to form a valid syllogism and "no" if not, and a conclusion to the syllogism if it exists. Only the examples where a valid syllogism exists will be used to train the reasoning objective.</p>
<h3>Consistency</h3>
<p>For Consistency, the <a href="https://www.kaggle.com/datasets/stanfordu/stanford-natural-language-inference-corpus">Stanford Natural Language Inference Corpus</a> will be used. This contains pairs of sentences, labelled as "entailment", "contradiction" or "neutral". These values will be mapped to +1, -1 and 0 respectively.</p>
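<p>The label handling for these corpora can be sketched in plain Python; the rows below are invented examples, not taken from the datasets:</p>

```python
# Hypothetical rows: (question, sentence, label) for WikiQA,
# (sentence0, sentence1, label) for SNLI.
wikiqa = [
    ('Who wrote Hamlet?', 'Hamlet was written by Shakespeare.', 1),
    ('Who wrote Hamlet?', 'Hamlet is a tragedy.', 0),
]
snli = [
    ('A man is sleeping.', 'A person is asleep.', 'entailment'),
    ('A man is sleeping.', 'A man is running.', 'contradiction'),
    ('A man is sleeping.', 'It is raining.', 'neutral'),
]

# Question answering: keep only the rows labelled 1
qa_pairs = [(q, s) for (q, s, label) in wikiqa if label == 1]

# Consistency: map the textual labels to numeric targets
LABEL_MAP = {'entailment': 1.0, 'contradiction': -1.0, 'neutral': 0.0}
consistency = [(s0, s1, LABEL_MAP[label]) for (s0, s1, label) in snli]

print(len(qa_pairs))      # 1
print(consistency[1][2])  # -1.0
```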
<h3>Encode/Decode</h3>
<p>To train the decoding of encoded sentences, a combined dataset will be used, consisting of</p>
<ul>
<li>all the answer sentences from the WikiQA dataset, whether they are labelled as correct or not</li>
<li>all the propositions from the Avicenna dataset, whether there is a valid conclusion or not</li>
<li>the conclusions from the Avicenna dataset, where these are available</li>
<li>the sentences from the SNLI corpus</li>
</ul>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>QARAC: Question Answering, Reasoning and Consistency2023-08-21T00:00:00+01:002023-08-21T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-08-21:/qarac-question-answering-reasoning-and-consistency.html<p>A project to create a factually accurate NLP system</p><p>Following on from my previous article on <a href="https://PlayfulTechnology.co.uk/the-future-of-natural-language-processing.html">The Future of Natural Language Processing</a>, I've decided to start a personal research project to put some of these ideas into practice and test them out.</p>
<p>I'm calling the proposed system <strong>QARAC</strong>, which stands for <em>Question Answering, Reasoning and Consistency</em>.</p>
<h2>NLP Components and Training Objectives</h2>
<p>The main NLP components of the system will be two <em>encoders</em> and a <em>decoder</em>. The two encoders will share a base model, and each will map a sentence <strong>S</strong> to a vector <em>v</em>. One will be a <em>question encoder</em>, <span class="math">\(\mathcal{QE}\)</span>, and the other an <em>answer encoder</em>, <span class="math">\(\mathcal{AE}\)</span>.</p>
<p>The <em>decoder</em> <span class="math">\(\mathcal{D}\)</span> will be an autoregressive model that, given a vector <em>v</em> generates a sentence <strong>S</strong>. In particular, it will be trained to act as the inverse function to the answer encoder, so that </p>
<div class="math">$$\mathcal{D}(\mathcal{AE}(\mathbf{S})) = \mathbf{S}$$</div>
<p>. Further training objectives give the system its name.</p>
<h3>Question Answering</h3>
<p>Given a question <strong>Q</strong> and a corresponding answer <strong>A</strong>, the <em>Question Answering</em> objective is that </p>
<div class="math">$$\mathcal{QE}(\mathbf{Q}) = \mathcal{AE}(\mathbf{A})$$</div>
<p>. We might naively try to use this to create a simple question answering system as </p>
<div class="math">$$\mathbf{A} = \mathcal{D}(\mathcal{QE}(\mathbf{Q}))$$</div>
<p>, but this of course would be no more likely to produce accurate results than current LLMs.</p>
<h3>Reasoning</h3>
<p>Given two propositions <span class="math">\(\mathbf{P_{0}}\)</span> and <span class="math">\(\mathbf{P_{1}}\)</span>, and a conclusion <strong>C</strong> that follows from them, the <em>Reasoning</em> objective is that </p>
<div class="math">$$\mathcal{D}(\mathcal{AE}(\mathbf{P_{0}}) + \mathcal{AE}(\mathbf{P_{1}})) = \mathbf{C}$$</div>
<p>. </p>
<h3>Consistency</h3>
<p>Given two statements <span class="math">\(\mathbf{S_{0}}\)</span> and <span class="math">\(\mathbf{S_{1}}\)</span>, the <em>consistency objective</em> is </p>
<div class="math">$$\mathit{cossim}(\mathcal{AE}(\mathbf{S_{0}}),\mathcal{AE}(\mathbf{S_{1}})) = \left\{ \begin{array}{cl}
+1 & \quad \textrm{if statements are consistent} \\
0 & \quad \textrm{if statements are unrelated} \\
-1 & \quad \textrm{if statements contradict}
\end{array}
\right. $$</div>
<h2>Knowledge Base Components</h2>
<p>As previously stated, the system will need a knowledge base in order to produce accurate answers. This will be stored in a vector database and harvested by a crawler.</p>
<p>The crawler will start from a site considered likely to be a reliable source of factual information, extract statements from each document it crawls, and encode them with the answer encoder. It will then test them for consistency with the existing knowledge base, deciding on that basis which to add to the knowledge base and which to reject. It will also calculate an overall reliability score for each document. Links originating from documents with high reliability scores will be prioritised by the crawler for further investigation, and the crawler will terminate when there are no links left to be explored that come primarily from reliable sources.</p>
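<p>The consistency test at the heart of the crawler can be sketched as follows; this is a hypothetical numpy outline, and the rejection threshold is an illustrative assumption, not a value from the project:</p>

```python
import numpy as np

def admit_statement(candidate, knowledge_base, threshold=0.5):
    """Decide whether an encoded statement may join the knowledge base:
    reject it if it strongly contradicts (cosine similarity near -1)
    anything already stored, admit it otherwise."""
    if not knowledge_base:
        return True
    kb = np.vstack(knowledge_base)
    sims = kb @ candidate / (np.linalg.norm(kb, axis=1) * np.linalg.norm(candidate))
    return bool(sims.min() > -threshold)

kb = [np.array([1.0, 0.0])]
print(admit_statement(np.array([0.9, 0.1]), kb))   # True: consistent with the knowledge base
print(admit_statement(np.array([-1.0, 0.0]), kb))  # False: contradicts it
```

<p>A document's reliability score could then be the fraction of its statements that pass this test.</p>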
<h2>Querying</h2>
<p>Presented with a question, QARAC will first use the question encoder to obtain a query vector. It will then find the top few matching vectors from the knowledge base, using the cosine similarity of the answer vectors to the query vector as a measure of confidence. If two vectors can be added to produce one with a higher confidence score, this will be added to the results set as an inferred answer. The answer vectors will then be converted to text by the decoder, and the results presented to the user, showing the sources of the original vectors and the chain of reasoning for the inferred ones.</p>
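<p>The querying procedure can be sketched as follows; this is an illustrative numpy outline with toy two-dimensional vectors, not the production retrieval code:</p>

```python
import numpy as np

def cossim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query(question_vec, kb_vectors, top_k=3):
    """Return (confidence, vector) pairs: the top-k knowledge-base matches
    by cosine similarity, plus any pairwise sum of those matches that
    scores a higher confidence (an inferred answer)."""
    scored = sorted(((cossim(question_vec, v), v) for v in kb_vectors),
                    key=lambda pair: pair[0], reverse=True)[:top_k]
    results = list(scored)
    best = scored[0][0]
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            combined = scored[i][1] + scored[j][1]
            confidence = cossim(question_vec, combined)
            if confidence > best:
                results.append((confidence, combined))
    return results

kb = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.8])]
answers = query(np.array([1.0, 1.0]), kb)
# The sum of the first two knowledge-base vectors points along the query,
# so an inferred answer with confidence 1.0 is appended to the results.
```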
<h2>Assessment</h2>
<p>Well, that's the theory. This is a research project, however, and the point is to see how well this system performs in practice, and whether it provides insights into how NLP models could be further improved. As such, a demonstration system will be made accessible, and feedback solicited from users about its performance.</p>
<p>Code for the project will be published on <a href="https://github.com/PeteBleackley/QARAC">GitHub</a> and trained models on <a href="https://huggingface.co/PlayfulTechnology">HuggingFace</a>. Project updates will be published here under the tag <a href="https://PlayfulTechnology.co.uk/tag/qarac.html">QARAC</a>.</p>
<p>If you are interested in this project, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=QARAC">contact Playful Technology Limited</a>.</p>
The Future of Natural Language Processing2023-02-01T00:00:00+00:002023-02-01T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2023-02-01:/the-future-of-natural-language-processing.html<p>NLP systems need knowledge and logic</p><h1>The Future of Natural Language Processing</h1>
<p><a href="https://openai.com/blog/chatgpt/">ChatGPT</a> and similar generative language models have been attracting a lot of attention recently. The trouble is that while they're good at producing fluent text, they don't necessarily produce accurate or useful text. With ChatGPT, the fact that it admits that it doesn't know the answer some of the time produces a false expectation that it knows what it's talking about the rest of the time, but if you ask it questions about a subject you know about, you'll find it makes mistakes ranging from the subtle to the absurd. <a href="https://www.engadget.com/cnet-reviewing-ai-written-articles-serious-errors-113041405.html">CNET</a> found out the hard way that generative models are not a reliable source of content. The reason for this is that the text they generate is based on statistical patterns inferred from their training datasets. At no stage in the process does the model actually understand either the text it's been trained on or what it is being asked to do. It is surmised that in a sufficiently complex model, such understanding may arise as an emergent property of the network, but even if it does, large language models are generally trained on text harvested from the internet, thus leading to a garbage-in garbage-out problem.</p>
<p>This means that the most likely use of generative language models in the near term is as an efficient source of clickbait and fake news. This makes <a href="https://dev.to/fannieailiverse/open-sourced-gptzero-3kik">GPTZero</a> and <a href="https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text/">OpenAI's own AI-written text classifier</a> important. Search engines will need to incorporate tools like these to ensure that results are more likely to come from reliable sources.</p>
<p>However, it clearly isn't enough to trust the neural network. Future generations of NLP models will need to incorporate knowledge and a concept of logical consistency, so that they can discriminate truth from falsehood. My own work with <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a> used <a href="https://www.wikidata.org/">WikiData</a> as a knowledge base for Named Entity Recognition with good effect, so I know how powerful the incorporation of a good knowledge base can be. However, if we want the system to be able to learn and grow its own knowledge base, it needs to understand whether or not data is logically consistent. We can envisage a model that vectorizes statements in such a way that for two statements that are logically consistent, the cosine similarity of the vectors is close to 1, for two statements that are inconsistent, the cosine similarity is close to -1, and for two statements that are unrelated, the cosine similarity is close to zero. The <a href="https://www.kaggle.com/datasets/stanfordu/stanford-natural-language-inference-corpus">Stanford Natural Language Inference Corpus</a>, available from Kaggle, would be a suitable dataset to train this on. Once we could predict logical consistency in this way, we should be able to bootstrap a knowledge base from a corpus of trusted facts by adding only statements that are consistent with what is already known.</p>
<p>These vectors have the property that arithmetical negation corresponds to logical negation. It's possible, therefore, that we could perform logical inference by means of arithmetical operations of the vectors. The sum of two vectors may correspond to a logical syllogism, allowing the system to deduce new facts from its knowledge base.</p>
<p>A system that could model consistency would have a lot of powerful applications. <a href="https://PlayfulTechnology.co.uk/the-grammar-of-truth-and-lies-nb.html">Fake News Detection</a> is one possibility - if a document repeatedly contradicted trusted sources, it could be classified as unreliable. Conversely, a document would also be suspicious if it made similar claims to sources known to be unreliable - the QAnon conspiracy theory made similar claims to <a href="https://sourcebooks.fordham.edu/basis/procop-anec.asp">The Secret History</a> - smear campaigns and scare stories haven't changed much since Roman times. Used alongside anomaly detection, it could also detect when an author had concealed dubious claims in an otherwise factual document. However, it could also be a proof-reading tool, allowing authors and editors to check their work for errors more efficiently.</p>
<p>It would also be able to detect opinion and partisanship. Suppose two sources both make claims A and B. However, one source also makes claim C and the other makes claim D. While neither of C or D is inconsistent with A or B, they are inconsistent with each other. We can therefore deduce that A and B are more likely to be accepted by consensus as fact, whereas C and D are opinions. Clustering sources by which opinions they were likely to share would identify partisan groups of sources.</p>
<p>These are just a few possible applications - the ones that occur to me off the top of my head - but they clearly show that knowledge, consistency and reasoning are the missing ingredients needed to make NLP technology truly useful.</p>
<p>If you are interested in these ideas, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=The%20Future%20of%20NLP">contact Playful Technology Limited</a>.</p>How many components?2022-01-17T00:00:00+00:002022-01-17T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2022-01-17:/how-many-components.html<p>A simple method for choosing the number of components to use in principal component analysis</p><h2>A simple method for choosing the number of components to use in principal component analysis</h2>
<p>A common problem in data science is <em>the curse of dimensionality</em>. Essentially, the more different variables a dataset encompasses, the more mathematically intractable it is to make measurements based on them all. The usual method for dealing with this problem is <em>Principal Component Analysis</em>, which seeks to reduce the data to a smaller number of dimensions while retaining as much information as possible. The most common method of doing this is as follows.</p>
<ol>
<li>Obtain either the covariance matrix of the variables or a similarity matrix of the observations, using a metric such as cosine similarity</li>
<li>Calculate the eigenvalues and eigenvectors of this matrix</li>
<li>Use the eigenvectors corresponding to the N largest eigenvalues to form an orthonormal basis</li>
</ol>
<p>This, however, raises the question of how to select an appropriate value of N. Since our aim is to explain the maximum amount of variance with the minimum number of components, a simple approach is to find the maximum of the sum of the proportion of components discarded and the proportion of variance retained, as measured by the eigenvalues. <code>numpy</code> helps us with this by returning eigenvalues and eigenvectors in increasing order of eigenvalue. The following code illustrates the method</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">numpy.linalg</span>
<span class="k">def</span> <span class="nf">reduce_data</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">metric</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="sd">"""Reduces X to the number of dimensions that retains the maximum amount of infomation for the minimum number of components</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> X : numpy.ndarray</span>
<span class="sd"> (n * m) array containing n rows of m-dimensional observations</span>
<span class="sd"> metric : function (optional, default = None)</span>
<span class="sd"> Similarity metric. Takes an (n * m) array and returns an (n * n) array of similarities</span>
<span class="sd"> Returns</span>
<span class="sd"> -------</span>
<span class="sd"> numpy.ndarray</span>
<span class="sd"> The data reduced to the optimum number of dimensions</span>
<span class="sd"> """</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="kp">cov</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">rowvar</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">if</span> <span class="n">metric</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">metric</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="p">(</span><span class="n">eigenvalues</span><span class="p">,</span> <span class="n">eigenvectors</span><span class="p">)</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">eigh</span><span class="p">(</span><span class="n">similarity</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">eigenvalues</span><span class="o">.</span><span class="kp">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">excluded</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="kp">arange</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">/</span><span class="n">n</span>
<span class="n">explained</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="p">(</span><span class="n">eigenvalues</span><span class="o">.</span><span class="kp">cumsum</span><span class="p">()</span><span class="o">/</span><span class="n">eigenvalues</span><span class="o">.</span><span class="kp">sum</span><span class="p">())</span>
<span class="n">cutoff</span> <span class="o">=</span> <span class="p">(</span><span class="n">excluded</span> <span class="o">+</span> <span class="n">explained</span><span class="p">)</span><span class="o">.</span><span class="kp">argmax</span><span class="p">()</span>
<span class="n">basis</span> <span class="o">=</span> <span class="n">eigenvectors</span><span class="p">[:,</span><span class="n">cutoff</span><span class="p">:]</span>
<span class="k">return</span> <span class="n">X</span><span class="o">.</span><span class="kp">dot</span><span class="p">(</span><span class="n">basis</span><span class="p">)</span> <span class="k">if</span> <span class="n">metric</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">basis</span>
</code></pre></div>GHFP Research Institute2021-06-21T00:00:00+01:002021-06-21T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-06-21:/ghfp-research-institute.html<p>Interactive Mapping of the Better Place Index</p><h2>Interactive Mapping of the Better Place Index</h2>
<h3>The Client</h3>
<p><a href="https://ghfp.org/">GHFP Research Institute</a></p>
<h3>The Problem</h3>
<p>In collaboration with <a href="https://pureportal.coventry.ac.uk/en/organisations/centre-for-trust-peace-and-social-relations-2">The Centre for Trust, Peace and Social Relations</a> at Coventry University, the GHFP Research Institute had developed <em>the Better Place Index</em>, a metric of quality of life in different countries. They wished to create an interactive map which would allow users to explore how this metric and its key contributing factors varied from country to country.</p>
<h3>The Approach</h3>
<p>Geopandas was used to combine the <a href="https://www.naturalearthdata.com/downloads/50m-cultural-vectors/50m-admin-0-countries-2/">Natural Earth Countries Shapefile</a> with a spreadsheet of the Better Place Index and its contributing factors. The resulting GeoDataFrame was then used in CartoFrames to produce an <a href="https://www.thebetterplaceindex.report/map">interactive map of the Better Place Index</a> on which</p>
<ul>
<li>Countries are coloured according to the Better Place Index</li>
<li>Hovering the mouse over a country displays the Better Place Index for that country, and its best and worst contributing factors</li>
<li>Countries may be selected by ranges of the Better Place Index, or by the best or worst contributing factor.</li>
</ul>
<h3>Technology Used</h3>
<ul>
<li><a href="https://geopandas.org/">Geopandas</a></li>
<li><a href="https://carto.com/">CartoFrames</a></li>
</ul>Is It A Mushroom or Is It A Toadstool?2021-05-19T00:00:00+01:002021-05-19T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-19:/is-it-a-mushroom-or-is-it-a-toadstool.html<p>Using Bayesian Belief Networks to classify fungus edibility</p><h2>Using Bayesian Belief Networks to classify fungus edibility</h2>
<p>The <a href="https://www.kaggle.com/uciml/mushroom-classification">UCI Machine Learning Mushroom Classification Dataset</a> on Kaggle tabulates discrete features of around 8000 specimens of fungi. There are 23 species represented, and the challenge is to classify which are edible and which are poisonous. Since the data are all categorical, I decided that a Bayesian Belief Network would be a suitable model, and used an ad-hoc clustering algorithm to infer a hidden variable.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/bayesian-belief-network-for-fungus-edibility?kernelSessionId=1503132" title="Bayesian Belief Network for fungus edibility" width="100%"></iframe>
<p>These results seem promising, but I wanted to see if I could do even better. This time I used Mutual Information to infer two hidden variables.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/bayesian-belief-network-for-fungi-2?kernelSessionId=6991228" title="Bayesian Belief Network for Fungi 2" width="100%"></iframe>
<p><strong>WARNING</strong> This is intended solely as a technology demonstration. Playful Technology Limited cannot accept any liability if you pick wild mushrooms on the basis of these notebooks. If you want to forage for wild mushrooms, find an experienced guide.</p>
<p>If you are interested in classification problems, <a href="mailto:peter.bleackley@playfultechnology.co.uk">contact me</a>.</p>Clustering Proteins in Breast Cancer Patients2021-05-10T00:00:00+01:002021-05-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-10:/clustering-proteins-in-breast-cancer-patients.html<p>Using clustering techniques finds groups of proteins that may be of clinical significance</p><h2>Using clustering techniques finds groups of proteins that may be of clinical significance</h2>
<p>Breast cancer is the most common form of cancer in women, and most of us probably know somebody who's been affected by it, so when another data scientist suggested I look at the breast cancer proteome on Kaggle, I thought it was a worthwhile thing to do. I'm not a biologist, but I know that cell behaviour involves complex networks of interacting proteins, so I thought that clustering would be a good way of uncovering these networks. I was pleased to discover that the protein clusters discovered seemed to be predictive of clinical outcomes.</p>
<iframe src="https://www.kaggle.com/embed/petebleackley/clustering-proteins?kernelSessionId=5010029" height="800" width="100%" frameborder="0" scrolling="auto" title="Clustering proteins"></iframe>
<p>This is something I hope might be useful to clinical researchers. If you are interested in this work, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Breast%20cancer%20proteome">contact me</a>.</p>The Grammar of Truth and Lies2021-05-10T00:00:00+01:002021-05-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2021-05-10:/the-grammar-of-truth-and-lies-nb.html<p>Using Natural Language Processing to detect fake news</p><h2>Using Natural Language Processing to detect fake news</h2>
<p>The issue of trust in the media is very important to me, and so when a dataset of fake news items was posted on Kaggle, I decided to see if NLP could be used to distinguish between real and fake news.</p>
<iframe src="https://www.kaggle.com/embed/petebleackley/the-grammar-of-truth-and-lies?kernelSessionId=62289416" height="800" scrolling="auto" title="The Grammar of Truth and Lies" width="100%"></iframe>
<p>I later presented this at two data science meetups and on my video channel.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OyA59kIQcAU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Later, another corpus of real and fake news stories was published on Kaggle, giving me the chance to see if the results were replicable. Fortunately, it appears that they hold up well.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/the-grammar-of-truth-and-lies-part-2?kernelSessionId=54101611" title="The Grammar of Truth and Lies part 2" width="100%"></iframe>
<p>If you are interested in fake news detection, <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Fake%20news%20detection">contact me</a>.</p>Lobbying With Data - How Can Data Help Businesses Influence Policy?2020-07-03T00:00:00+01:002020-07-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-07-03:/lobbying-with-data-how-can-data-help-businesses-influence-policy.html<p>Webinar on how to influence with data science</p><h2>Webinar on how to influence with data science</h2>
<p>I was invited by <a href="https://drivaartsdriva.com/">DRIVA Arts DRIVA</a> to take part in a webinar. Along with <a href="https://www.linkedin.com/in/bonamywaddell/">Bonamy Waddell</a> and <a href="https://www.linkedin.com/in/sagihaider/">Haider Raza</a>, I discussed what the best strategies were for data scientists to get their message across to decision makers. See below for the video.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZC8ddOhyZ00" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>Video: NLP vs Filter Bubbles2020-06-15T00:00:00+01:002020-06-15T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-15:/video-nlp-vs-filter-bubbles.html<p>Using Topic Modelling and Sentiment Analysis to find common ground between people of differing opinions</p><h2>Using Topic Modelling and Sentiment Analysis to find common ground between people of differing opinions</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/1VKVFJ3pdJw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My latest video is about the Common Ground Algorithm, an idea I've had to try to address the problem of filter bubbles online. Given two people with differing opinions, can we use NLP to find common ground between them, and thus encourage civil discussion between people who might otherwise distrust each other? As usual, you can <a href="https://www.kaggle.com/petebleackley/the-common-ground-algorithm">explore the code in this Kaggle kernel</a>.</p>Video: Part of Speech Tagging2020-06-08T00:00:00+01:002020-06-08T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-08:/video-part-of-speech-tagging.html<p>Three approaches to Part of Speech Tagging</p><h2>Three approaches to Part of Speech Tagging</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/UDa7YIPqpiA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My latest video discusses three approaches to a simple NLP task, Part of Speech tagging. Here's a link to <a href="https://gesis.mybinder.org/binder/v2/gh/PeteBleackley/ask-a-data-scientist/780aa74550de278b2ec31f8fbb8dd81af3227fb5">the code</a>.</p>All Street Research2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/all-street-research.html<p>Finding the most relevant paragraphs from corporate documents for given themes</p><h2>Finding the most relevant paragraphs from corporate documents for given themes</h2>
<h3>The Client</h3>
<p><a href="https://www.allstreet.org/">All Street Research</a></p>
<h3>The Problem</h3>
<p>All Street Research wanted to be able to find the most relevant paragraphs of corporate documents related to given themes.</p>
<h3>The Approach</h3>
<p>A set of key words and phrases was obtained for each of the topics of interest. Then, from a corpus of corporate documents, words which correlated with the key words on a paragraph level were identified. These correlations were used to derive a scoring function for each theme that was used to identify the most relevant paragraphs.</p>
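<p>The approach above can be sketched in outline. This is a minimal illustration, not the project's actual code: the paragraphs, seed words and function names are invented, and the real scoring function was considerably more refined.</p>

```python
# Sketch: correlate vocabulary words with seed-keyword presence at paragraph
# level, then score paragraphs by the correlation weights of their words.
import numpy as np

def theme_scores(paragraphs, seed_words):
    """Score each paragraph's relevance to a theme defined by seed words."""
    docs = [set(p.lower().split()) for p in paragraphs]
    vocab = sorted(set().union(*docs))
    # Binary paragraph-term matrix
    X = np.array([[w in d for w in vocab] for d in docs], dtype=float)
    # Does each paragraph contain any seed word?
    seeds = np.array([bool(d & set(seed_words)) for d in docs], dtype=float)
    # Pearson correlation of every vocabulary word with seed-word presence
    Xc = X - X.mean(axis=0)
    sc = seeds - seeds.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (sc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        weights = np.nan_to_num(Xc.T @ sc / denom)
    # Paragraph score = sum of correlation weights of the words it contains
    return X @ weights

paragraphs = [
    "profits rose on strong sustainability performance",
    "the board approved a new emissions reduction target",
    "quarterly revenue was flat year on year",
]
scores = theme_scores(paragraphs, {"sustainability", "emissions"})
```

<p>Paragraphs sharing vocabulary with the seed-bearing paragraphs score higher than unrelated ones, even when they contain no seed word themselves.</p>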
<h3>Technology Used</h3>
<ul>
<li><a href="https://www.nltk.org/">NLTK</a></li>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://numpy.org/">Numpy</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebooks</a></li>
</ul>Amey Strategic Consulting2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/amey-strategic-consulting.html<p>Automatic Diagnostics for the Strategic Road Network</p><h2>Automatic Diagnostics for the Strategic Road Network</h2>
<h3>The Client</h3>
<p><a href="https://www.amey.co.uk/amey-consulting/services/strategic-consulting/">Amey Strategic Consulting</a></p>
<h3>The Problem</h3>
<p>As part of a major data science project on behalf of <a href="https://highwaysengland.co.uk/">Highways England</a>, Amey wished to create an automatic diagnostic system that would detect faults in traffic flow sensors on the strategic road network. As well as enabling timely and efficient maintenance, this would prevent delays to journeys caused by incorrectly set signals, which are estimated to cost the economy £7.5 million per year.</p>
<h3>The Approach</h3>
<p>From a shapefile containing the geometry of the Strategic Road Network, the topology of the network was calculated and groups of sensors assigned to links, which are sections of carriageway between two junctions. Over a link, traffic flow readings should be approximately consistent at a given time. Anomaly detection can then be used to find the sensor whose readings are most different from the rest. Which sensor this is should vary randomly, but if the same sensor is inconsistent with the rest for a few minutes at a time, it can be assumed to be faulty.</p>
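<p>The link-consistency check can be sketched as follows. This toy version substitutes a simple median-deviation rule for the Isolation Forests actually used on the project, and the readings and window size are invented for illustration.</p>

```python
# Sketch: flag a sensor as faulty if it is the worst outlier on its link
# for several consecutive time steps.
import numpy as np

def persistent_outlier(readings, window=3):
    """readings: (time_steps, sensors) traffic flows on one link.
    Returns the index of a sensor that is the worst outlier for `window`
    consecutive steps, or None."""
    deviations = np.abs(readings - np.median(readings, axis=1, keepdims=True))
    worst = deviations.argmax(axis=1)  # most inconsistent sensor per step
    for t in range(len(worst) - window + 1):
        if np.all(worst[t:t + window] == worst[t]):
            return int(worst[t])
    return None

# Four sensors on one link; sensor 2 consistently reads far too low
flows = np.array([
    [100, 102, 40, 99],
    [110, 108, 35, 111],
    [ 95,  97, 38, 96],
])
faulty = persistent_outlier(flows, window=3)
```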
<p>After testing this approach on one link, a simple dashboard was created to demonstrate the results and work began on scaling to the full network.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://geopandas.org/">Geopandas</a></li>
<li><a href="https://scikit-learn.org/stable/">Scikit-learn</a> (Isolation Forests)</li>
<li><a href="https://jupyter.org/">Jupyter Lab</a></li>
<li><a href="https://spark.apache.org/docs/latest/api/python/">PySpark</a></li>
</ul>Formisimo2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/formisimo.html<p>Real time prediction of web form conversion</p><h2>Real time prediction of web form conversion</h2>
<h3>The Client</h3>
<p><a href="https://www.zuko.io/formisimo">Formisimo</a></p>
<h3>The Problem</h3>
<p>Formisimo wanted to predict in real time whether users would complete or abandon web forms, in order to generate nudges that would encourage frustrated users to complete the form. Their early models were able to predict from the full history of user interactions whether a given user had completed or abandoned the form, but could not reproduce this under a simulation of real time operation.</p>
<h3>The Approach</h3>
<p>After some initial experiments with models based on Support Vector Machines and Hidden Markov Models, a deep investigation of the data was made. It was found that a useful prediction of whether a user would complete the form could only be made within the last 100 interactions. It was therefore decided to change the prediction target from whether or not the user would complete the form to whether the user was within 100 events of abandoning the form. Models based on this insight showed improved performance, and further improvements were made by using an LSTM model.</p>
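<p>The relabelling insight can be sketched as below. The event stream is invented, and the 100-event horizon is shrunk to 3 to keep the example readable.</p>

```python
# Sketch: instead of predicting final completion from the whole history,
# label each interaction by whether the user is within `horizon` events
# of abandoning the form.
def near_abandonment_labels(events, abandoned, horizon=100):
    """Label each event 1 if it falls within `horizon` events of an
    abandonment, else 0. Completed sessions are all zeros."""
    n = len(events)
    if not abandoned:
        return [0] * n
    return [1 if n - i <= horizon else 0 for i in range(n)]

session = ["focus", "type", "type", "blur", "idle", "exit"]
labels = near_abandonment_labels(session, abandoned=True, horizon=3)
```

<p>A classifier trained on labels like these learns to recognise the run-up to abandonment, which is exactly the moment a nudge is useful.</p>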
<h3>Technology Used</h3>
<ul>
<li><a href="https://scikit-learn.org/stable/">Scikit-Learn</a> (Support Vector Machines)</li>
<li><a href="https://pypi.python.org/pypi/Markov">Hidden Markov Models</a></li>
<li><a href="https://keras.io/">Keras</a> (LSTM networks)</li>
</ul>Pentland Brands2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/pentland-brands.html<p>Could 3D face scans be used to recommend swimming goggles?</p><h2>Could 3D face scans be used to recommend swimming goggles?</h2>
<h3>The Client</h3>
<p><a href="https://pentlandbrands.com/">Pentland Brands</a></p>
<h3>The Problem</h3>
<p>Pentland Brands wanted to create an app to recommend swimming goggles to potential customers. During a trial, they had collected point cloud models of test subjects' faces using the 3D scanner on an iPhone, along with metadata about the test subjects, and whether they liked or disliked various styles of goggles. They wished to know if it would be possible to predict whether a given person would like a particular style of goggles.</p>
<h3>The Approach</h3>
<p>A test framework was created which allowed the performance of various models to be compared. A number of data reduction techniques and classifier algorithms were applied to the data and their performance in predicting the test subjects' preferences was assessed. Unfortunately, it was discovered that there was no significant correlation between facial shapes and preferences, so Playful Technology Limited recommended that the project be discontinued.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://github.com/strawlab/python-pcl">python-pcl</a></li>
<li><a href="https://scikit-learn.org/stable/">scikit-learn</a></li>
<li><a href="https://pypi.org/project/Theano">Theano</a> (graph convolutional network)</li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebook</a></li>
</ul>Rolls Royce AI Hub2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/rolls-royce-ai-hub.html<p>Extracting structured data from technical documents</p><h2>Extracting structured data from technical documents</h2>
<h3>The Client</h3>
<p><a href="https://www.rolls-royce.com/products-and-services/r2datalabs.aspx">R<sup>2</sup> Data Labs</a></p>
<h3>The Problem</h3>
<p>Rolls Royce had a large quantity of technical documents which they wanted to be able to search. They wished to develop their own search system in house, partly for security reasons and partly to ensure that it was optimal for their needs.</p>
<h3>The Approach</h3>
<p>A testbed was developed to compare the performance of various topic modelling algorithms for searching the documents. During this work, a bug was found in the <a href="https://radimrehurek.com/gensim/models/tfidfmodel.html">Gensim implementation of TF-IDF</a> and corrected. It was then necessary to develop a parser library that could extract structured data from various document formats. Many of the documents were scanned PDFs for regulatory reasons, and this led to two problems. Firstly, the OCR program used could infer the physical structure of the document (pages, layouts), but it was necessary to develop heuristics to infer logical structure (chapters, sections, paragraphs). Secondly, it was found that tables confused OCR. A method to handle this was developed in collaboration with another contractor, whereby tables would be separated into individual cells, OCR run on each cell, and the results assembled into a Pandas DataFrame. Methods were developed to account for row and column headers, as well as multirow and multicolumn spans.</p>
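<p>The cell-reassembly step can be sketched schematically. The cell data and field names here are invented; in the real pipeline the cell positions and spans came from separating the table image into individual cells before OCR.</p>

```python
# Sketch: copy each OCR'd cell into every grid position its rowspan/colspan
# covers, then build a DataFrame with the first row as the header.
import pandas as pd

def assemble_table(cells, n_rows, n_cols):
    """cells: list of dicts with row, col, optional rowspan/colspan, text."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c["row"], c["row"] + c.get("rowspan", 1)):
            for k in range(c["col"], c["col"] + c.get("colspan", 1)):
                grid[r][k] = c["text"]
    header, *body = grid
    return pd.DataFrame(body, columns=header)

cells = [
    {"row": 0, "col": 0, "text": "Part"},
    {"row": 0, "col": 1, "colspan": 2, "text": "Limit"},  # spans two columns
    {"row": 1, "col": 0, "text": "Blade"},
    {"row": 1, "col": 1, "text": "min"},
    {"row": 1, "col": 2, "text": "max"},
]
table = assemble_table(cells, n_rows=2, n_cols=3)
```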
<p>During this project I also sat on a tender panel to advise on technical aspects of the bid and gave advice on a proposed collaborative project.</p>
<h3>Technology Used</h3>
<ul>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://poppler.freedesktop.org/">Poppler</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://opencv.org/">OpenCV</a></li>
</ul>Social Finance2020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/social-finance.html<p>ETL and Data Cleansing for social services datasets</p><h2>ETL and Data Cleansing for social services datasets</h2>
<h3>The Client</h3>
<p><a href="https://www.socialfinance.org.uk/">Social Finance</a></p>
<h3>The Problem</h3>
<p>Social Finance wished to create an analytics system to help understand the case histories of vulnerable young people. The data was supplied to central government by local authorities in a complex XML format, and values were often missing or inconsistent. The data was also highly sensitive, so strict data security protocols were necessary.</p>
<h3>The Approach</h3>
<p>The XML files were parsed and transformed into a set of relational tables. Heuristics were devised to correct missing and inconsistent values. Fields that carried a high risk of deanonymisation were removed.</p>
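<p>The transformation can be sketched with an invented schema; the actual government XML format is more complex and is not reproduced here.</p>

```python
# Sketch: flatten nested XML case records into relational rows,
# one row per episode, keyed on the child's ID.
import xml.etree.ElementTree as ET

SAMPLE = """
<Children>
  <Child>
    <ChildID>C1</ChildID>
    <Episodes>
      <Episode><Start>2019-01-01</Start><End>2019-06-01</End></Episode>
      <Episode><Start>2019-09-01</Start><End></End></Episode>
    </Episodes>
  </Child>
</Children>
"""

def episodes_table(xml_text):
    """One row per episode; blank dates become None for later cleansing."""
    rows = []
    for child in ET.fromstring(xml_text).iter("Child"):
        child_id = child.findtext("ChildID")
        for ep in child.iter("Episode"):
            rows.append({
                "child_id": child_id,
                "start": ep.findtext("Start") or None,
                "end": ep.findtext("End") or None,
            })
    return rows

rows = episodes_table(SAMPLE)
```

<p>Rows in this shape load straightforwardly into relational tables, and the explicit <code>None</code> values give the cleansing heuristics something concrete to work on.</p>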
<h3>Technology Used</h3>
<ul>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://jupyter.org/">Jupyter Notebooks</a></li>
<li><a href="https://www.postgresql.org/">PostgreSQL</a></li>
</ul>True 2122020-06-03T00:00:00+01:002020-06-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-03:/true-212.html<p>Natural Language Processing for semantically enhanced content matching</p><h2>Natural Language Processing for semantically enhanced content matching</h2>
<h3>The Client</h3>
<p><a href="https://www.true212.com/">True 212</a></p>
<h3>The Problem</h3>
<p>True 212 wanted to identify relevant content to link to from their news and culture blogs. They believed that a simple bag-of-words approach would lead to naive matches, and wished to extract semantics from the documents to enable richer matches.</p>
<h3>The Approach</h3>
<p>A NLP pipeline was created with the following stages.</p>
<p>A Named Entity Recognition system that identified candidate named entities in a document and found corresponding <a href="https://www.wikidata.org/">WikiData</a> entities. Known relationships between WikiData entities were used to disambiguate candidate matches.</p>
<p>A Part of Speech Tagger that used Hidden Markov Models to return the probability distribution over the part of speech categories used in WordNet for each word in a sentence.</p>
<p>A Word Sense Disambiguation component that used the Viterbi algorithm to find the maximum likelihood sequence of <a href="https://wordnet.princeton.edu/">WordNet</a> IDs corresponding to the words in a given sentence, allowing for stopwords, multiword expressions, named entities and out-of-vocabulary words. This achieved state-of-the-art accuracy (70%).</p>
<p>A Latent Semantic Indexing model which was trained on the semantically enhanced documents to perform rich matching.</p>
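<p>The Viterbi step at the heart of the Word Sense Disambiguation component can be sketched as follows. The senses, observation probabilities and transition table below are toy values invented for illustration, not WordNet data.</p>

```python
# Sketch: recover the maximum-likelihood sequence of senses given per-word
# sense probabilities and sense-to-sense transition probabilities.
import math

def viterbi(obs_probs, trans):
    """obs_probs: list of {sense: P(sense | word)} per position.
    trans: {(prev_sense, sense): P(sense | prev_sense)}."""
    best = {s: (math.log(p), [s]) for s, p in obs_probs[0].items()}
    for dist in obs_probs[1:]:
        nxt = {}
        for s, p in dist.items():
            # Best predecessor for this sense (unseen transitions get a floor)
            score, path = max(
                (lp + math.log(trans.get((prev, s), 1e-9)), path)
                for prev, (lp, path) in best.items()
            )
            nxt[s] = (score + math.log(p), path + [s])
        best = nxt
    return max(best.values())[1]

obs = [{"bank/river": 0.4, "bank/finance": 0.6},
       {"deposit/money": 0.7, "deposit/sediment": 0.3}]
trans = {("bank/finance", "deposit/money"): 0.9,
         ("bank/river", "deposit/sediment"): 0.8,
         ("bank/river", "deposit/money"): 0.1,
         ("bank/finance", "deposit/sediment"): 0.1}
path = viterbi(obs, trans)
```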
<h3>Technology Used</h3>
<ul>
<li><a href="https://numpy.org/">Numpy</a></li>
<li><a href="https://www.scipy.org/">Scipy</a></li>
<li><a href="https://scikit-learn.org/stable/">Scikit-Learn</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://radimrehurek.com/gensim/index.html">Gensim</a></li>
<li><a href="https://www.mongodb.com/">MongoDB</a></li>
</ul>Video: The Grammar of Truth and Lies2020-06-01T00:00:00+01:002020-06-01T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-06-01:/video-the-grammar-of-truth-and-lies.html<p>Using NLP to detect fake news</p><h2>Using NLP to detect fake news</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OyA59kIQcAU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My talk on Fake News detection, "The Grammar of Truth and Lies", has gone down well at a couple of Meetups and a lunchtime talk for <a href="https://PlayfulTechnology.co.uk/amey-strategic-consulting.html">a client</a>, so I decided to make a version for my <a href="https://www.youtube.com/channel/UCx20P1dncSSFqwusJ6uNUbg">YouTube channel</a>.</p>Video: The Entropy of "Alice In Wonderland"2020-05-26T00:00:00+01:002020-05-26T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-05-26:/video-the-entropy-of-alice-in-wonderland.html<p>Video explaining an entropy-based keyword extraction technique, using "Alice In Wonderland"</p><h2>Video explaining an entropy-based keyword extraction technique, using "Alice In Wonderland"</h2>
<p>Here is the first of a new video series discussing NLP and Data Science.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zC4ZXvAxnHA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>It discusses an <a href="https://arxiv.org/abs/0907.1558">entropy-based keyword extraction algorithm</a> devised by Marcello Montemurro and Damian Zanette, for which I created the <a href="https://github.com/PeteBleackley/gensim/blob/release-3.8.3/gensim/summarization/mz_entropy.py">Gensim implementation</a>. To illustrate its use, I've analysed the text of <em>Alice in Wonderland</em>, in this <a href="https://www.kaggle.com/petebleackley/entropy-based-keyword-extraction">Kaggle kernel</a>.</p>
<p>I have more of these videos planned for the near future, and I am also planning a webinar series entitled <a href="https://PlayfulTechnology.co.uk/pages/ask-a-data-scientist.html">Ask a Data Scientist</a>. People will be able to send in data science questions, which I will answer with live coding examples. Subscribers will be able to take part in the live event, and the recording will be available on the channel afterwards. If you're interested in this, <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Ask%20a%20Data%20Scientist">please get in touch</a>.</p>The Entropy of "Alice in Wonderland"2020-05-13T00:00:00+01:002020-05-13T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2020-05-13:/the-entropy-of-alice-in-wonderland.html<p>Demonstration of Montemurro and Zanette's information theory based keyword algorithm</p><h2>Demonstration of Montemurro and Zanette's information theory based keyword algorithm</h2>
<p>Several years ago, I read in <a href="https://www.newscientist.com/">New Scientist</a> about an information theory based technique for identifying the most significant words in a document, according to the role they play in its structure. After looking up the paper, <a href="https://arxiv.org/abs/0907.1558">Towards the quantification of semantic information in written language</a> by Marcello Montemurro and Damian Zanette, I implemented the algorithm and contributed it to <a href="https://radimrehurek.com/gensim/">Gensim</a>. Unfortunately, it's no longer in the latest release, but I have created a <a href="https://github.com/PeteBleackley/gensim">fork of Gensim</a> to allow further development of features that have been dropped from the latest release.</p>
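<p>As a rough illustration of the entropy idea, the sketch below splits a text into equal parts and rewards words whose occurrences cluster in a few parts rather than spreading evenly. This is a deliberate simplification: the actual Montemurro-Zanette measure also compares the observed entropy with that expected under random shuffling of the text.</p>

```python
# Simplified sketch: words whose occurrences concentrate in a few parts of
# the text (low entropy across parts) score higher than evenly spread words.
import math
from collections import Counter

def entropy_scores(words, n_parts=4):
    size = max(1, len(words) // n_parts)
    parts = [words[i:i + size] for i in range(0, len(words), size)]
    totals = Counter(words)
    counts = [Counter(p) for p in parts]
    scores = {}
    for w, n in totals.items():
        # Distribution of this word's occurrences over the parts
        probs = [c[w] / n for c in counts if c[w]]
        h = -sum(p * math.log2(p) for p in probs)
        # Maximal entropy minus observed entropy, weighted by frequency
        scores[w] = n * (math.log2(len(parts)) - h)
    return scores

text = ("the cat sat on the mat " * 4 + "alice alice alice alice ").split()
scores = entropy_scores(text, n_parts=4)
```

<p>Here "alice" clusters at the end of the text and scores high, while "the" is spread throughout and scores near zero, which is the behaviour that makes the measure useful for keyword extraction.</p>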
<p>When I found the text of <em>Alice's Adventures in Wonderland</em> as a Kaggle Dataset, it provided the opportunity to create a demonstration for the algorithm.</p>
<iframe frameborder="0" height="800" scrolling="auto" src="https://www.kaggle.com/embed/petebleackley/entropy-based-keyword-extraction?kernelSessionId=34819997" title="Entropy Based Keyword Extraction" width="100%"></iframe>
<p>I also created a video explaining it.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zC4ZXvAxnHA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>If you are interested in document analysis, please <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Document%20analysis">contact me</a>.</p>Apple, Bias, Credit2019-11-12T00:00:00+00:002019-11-12T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2019-11-12:/apple-bias-credit.html<p>The importance of understanding your data before using it to train a model</p><h2>The importance of understanding your data before using it to train a model</h2>
<p>News has recently broken that <a href="https://www.zdnet.com/article/apple-card-issuer-investigated-over-gender-bias-in-credit-algorithm/">Apple's credit card gives women lower credit limits than men, even when they have identical credit histories to their husbands</a>. While we don't know precisely what causes this, anybody who knows the basics of machine learning could tell you that if you train on biased data, you get a biased model.</p>
<p>Apple and Goldman Sachs appear to have made one of the most basic data science errors in the book. They've thrown a lot of data at a model (probably a black-box model), without making sure they'd understood it first. If they had done a proper exploratory analysis beforehand, they could have identified potential sources of bias in their data and corrected for them.</p>
<p>An example from one of my previous projects illustrates the importance of understanding your data. <a href="https://PlayfulTechnology.co.uk/formisimo.html">Formisimo</a> wanted to predict in real-time whether users would complete or abandon web-forms. Their existing models were capable of predicting to a certain degree of accuracy whether a customer had completed or abandoned the form given a complete history of their interactions, but didn't work in a simulation of real-time behaviour. My investigations showed that it was only in the last hundred interactions that a real signal of whether the user would complete or not was present. Taking this into account enabled me to create much better models for them.</p>
<p>Apple now need to go over their training data, work out where the source of bias is, and fix it. If they need a fresh pair of eyes on it, they can <a href="mailto:peter.bleackley@playfultechnology.co.uk">contact Playful Technology Limited</a>.</p>The Grammar of Truth and Lies2019-05-08T00:00:00+01:002019-05-08T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2019-05-08:/the-grammar-of-truth-and-lies.html<p>Using Computational Linguistics to Detect Fake News</p><h2>Using Computational Linguistics to Detect Fake News</h2>
<p>There's an old saying that I first read in a Terry Pratchett book that "A lie can run around the world before the truth has got its boots on." This seems to be a particular problem on the Internet at the moment, as propaganda, conspiracy theories and outright dishonesty are big business. As a computational linguist, I started to wonder if there was any way that natural language processing could be used to distinguish between real news and fake news. So when a corpus of fake news articles was published on Kaggle, I decided to investigate.</p>
<p><a href="https://www.kaggle.com/petebleackley/the-grammar-of-truth-and-lies">The Grammar of Truth and Lies</a></p>
<p>The first thing I needed was a sample of news articles from a reliable source to compare the fake news corpus with. For this, I used the Reuters Corpus, which is available as part of <a href="https://www.nltk.org/">NLTK</a>. Fortunately, it contained a similar number of articles to the fake news corpus, thus avoiding balance issues.</p>
<p>The next challenge was what features to use. I decided not to use vocabulary, since the news stories covered in the Reuters corpus and those in the fake news corpus were from different time periods, and so this would introduce bias - it would be possible to train a model that thought that any mention of "Hillary Clinton" was automatically fake news, for example. Therefore, I used features based on the grammatical structure of sentences. Using <a href="https://textblob.readthedocs.io/en/dev/">TextBlob</a>, I performed Part of Speech tagging on the document and concatenated the tags to form a feature for each sentence. These were, of course, ridiculously sparse, so I used <a href="https://radimrehurek.com/gensim/">Gensim</a> to perform Latent Semantic Indexing, before classifying with Logistic Regression and Random Forest models from <a href="https://scikit-learn.org/stable/">scikit-learn</a>.</p>
<p>The results were OK, but I thought I could do better. At first I tried adding sentiment analysis to the model, which brought a moderate improvement, but then I remembered that stopword frequencies are often used for stylometric analysis, such as author identification. Since they're independent of the content and largely governed by subconscious factors, I thought that they might possibly contain signals of dishonest intent, so I added them to the feature extraction.</p>
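<p>Stopword-frequency features of this kind are easy to sketch. The stopword list below is a tiny illustrative subset, not the one used in the actual kernel.</p>

```python
# Sketch: reduce a document to the relative frequency of each stopword,
# a stylometric fingerprint that is largely independent of topic.
from collections import Counter

STOPWORDS = ["the", "a", "of", "and", "to", "in", "that", "it"]

def stopword_profile(text):
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in STOPWORDS]

profile = stopword_profile("The report found that the figures were accurate")
```

<p>Each document becomes a fixed-length vector that can be concatenated with the grammatical features before classification.</p>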
<p>This gave me a classifier that was 90% accurate in distinguishing between fake news and reliable sources. It can't say definitively whether an article is true or not, but it's good at picking up whether an article looks similar to reliable news or fake news. The best thing is that the model is quite simple, so that finding signals of dishonest intent in online content is clearly a tractable problem.</p>
<p>Watch me presenting this at PyData London.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4m1e--6yQWI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>Neural Isn't Always Better2018-10-10T00:00:00+01:002018-10-10T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2018-10-10:/neural-isnt-always-better.html<p>Neural networks don't match the performance of the Viterbi algorithm for Word Sense Disambiguation</p><h2>Neural networks don't match the performance of the Viterbi algorithm for Word Sense Disambiguation</h2>
<p>For two previous clients, <a href="https://www.metafused.com/">Metafused</a> and <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, I have created <a href="http://www.scholarpedia.org/article/Word_sense_disambiguation">Word Sense Disambiguation systems</a>. I've used a Bayesian Viterbi algorithm, and in each case, achieved an accuracy of 70%, which, according to the Scholarpedia article I've just linked, is good. However, I've always wondered if I could do better. Since neural networks (or "deep learning") are fashionable in AI research at the moment, I thought that while I was between projects, I'd have a go at seeing what they could do. After all, they are currently popular for machine translation, which is an analogous problem.</p>
<p>I tried two different neural architectures - LSTM and Convolutional networks using <a href="http://keras.io/">Keras</a>. In each case, I trained two <a href="https://radimrehurek.com/gensim/models/lsimodel.html">LSI</a> models on the <a href="https://www.gabormelli.com/RKB/SemCor_Corpus">Semcor</a> corpus, one representing words, the other <a href="https://wordnet.princeton.edu/">WordNet</a> senses. The neural network was trained to map from one embedding to the other, and the WordNet embedding searched for the word sense closest to each vector produced by the neural network. And the results were...</p>
<p>Absolute gibberish. The output bore no resemblance to the input whatsoever.</p>
<p>Why was that? Well, with the Viterbi algorithm, I only had to search for word senses that were relevant to the input word. The neural network had to search the entire space of WordNet senses, and this was a pretty dense embedding. That meant that the slightest error in the mapping would lead to the wrong sense being identified.</p>
<p>Secondly, I think that the limit of 70% accuracy in Word Sense Disambiguation comes from the training data. There's only really Semcor available, and while it's a good corpus, I believe that a much bigger corpus of WordNet tagged sentences would be necessary to make a significant improvement in Word Sense Disambiguation performance. Modern machine translation systems use huge corpora harvested from the web, and even then they are often ropy and fragile.</p>
<p>The ScholarPedia article linked above suggests that using some global information about a document may improve the performance. At a later date I may experiment with integrating this into the Viterbi algorithm.</p>Ontologies for Named Entity Recognition2018-01-04T00:00:00+00:002018-01-04T00:00:00+00:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2018-01-04:/ontologies-for-named-entity-recognition.html<p>Semantic relationships make an ontology useful for Named Entity Recognition</p><h2>Semantic relationships make an ontology useful for Named Entity Recognition</h2>
<p>I once had two projects in succession where I was trying to identify named entities in free text. One was successful, the other wasn't, and the reasons why are interesting.</p>
<p>The first project was for <a href="https://PlayfulTechnology.co.uk/true-212.html">True 212</a>, and I used the <a href="https://www.wikidata.org/">WikiData</a> ontology. The second was for a pharmaceutical company, and used the <a href="https://www.nlm.nih.gov/mesh/">MeSH</a> ontology. In each case, a search of the ontology database would return several false positives - for example, searching for "Africa" might return <a href="https://www.wikidata.org/wiki/Q15">the continent</a> or <a href="https://www.wikidata.org/wiki/Q181238">the Roman Province</a>, whereas "Lagos" could be <a href="https://www.wikidata.org/wiki/Q8673">the capital of Nigeria</a> or <a href="https://www.wikidata.org/wiki/Q8780001">a railway station in Portugal</a>. However, WikiData doesn't just store entities, it makes claims about them - that is, it encodes semantic relationships between them. Therefore, if a document mentions both "Lagos" and "Africa", a Named Entity Recognition system based on WikiData can use the fact that Lagos is a city in Africa to determine which Lagos you mean and which Africa you mean.</p>
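<p>The disambiguation idea can be sketched as a toy search over candidate combinations. The entity IDs match the WikiData items linked above, but the claim set here is a one-line extract invented for illustration.</p>

```python
# Sketch: pick the combination of candidate entities with the most known
# claims (semantic relationships) linking its members.
from itertools import product

# claim: (subject, property, object) - e.g. Lagos (Q8673) is in Africa (Q15)
CLAIMS = {("Q8673", "continent", "Q15")}

CANDIDATES = {
    "Africa": ["Q15", "Q181238"],    # continent vs Roman province
    "Lagos": ["Q8673", "Q8780001"],  # Nigerian city vs Portuguese station
}

def disambiguate(candidates, claims):
    mentions = list(candidates)
    def links(combo):
        chosen = set(combo)
        return sum(1 for s, _, o in claims if s in chosen and o in chosen)
    best = max(product(*(candidates[m] for m in mentions)), key=links)
    return dict(zip(mentions, best))

resolved = disambiguate(CANDIDATES, CLAIMS)
```

<p>Brute-force enumeration is fine for a handful of mentions; a production system would prune the search, but the principle that relationships resolve ambiguity is the same.</p>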
<p>That unfortunately wasn't the case with MeSH. It didn't encode relationships between the medical terms it documents in any useful way, so it wasn't possible to perform the same sort of disambiguation as with WikiData. The key insights from this are that relationships are meaning, and that before working with an ontology, it's vital to know not just what entities it contains, but what relationships between them it encodes. An ontology of entities can be used for manual tagging, but for analysis, you need an ontology of relationships.</p>A Hidden Markov Model Library2016-08-03T00:00:00+01:002016-08-03T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2016-08-03:/a-hidden-markov-model-library.html<p>An open source Python library for HMMs</p><h2>An open source Python library for HMMs</h2>
<p>A few years ago I wrote a <a href="https://sourceforge.net/projects/python-hidden-markov/">Python library for Hidden Markov Models</a> and released it on <a href="https://pypi.python.org/pypi/Markov">PyPI</a>. I've now decided that I want to get a few more people involved in it, so I gave a <a href="http://www.slideshare.net/PeterBleackley/a-hidden-markov-model-library-64653215">lightning talk at the 25th Pydata London meetup</a>.</p>
<p>If you're interested in contributing, please follow the links above for more information and <a href="mailto:peter.bleackley@playfultechnology.co.uk?subject=Hidden%20Markov%20Model%20Library">get in touch</a>.</p>Investigating the Breast Cancer Proteome on Kaggle2016-06-25T00:00:00+01:002016-06-25T00:00:00+01:00Dr Peter J Bleackleytag:playfultechnology.co.uk,2016-06-25:/investigating-the-breast-cancer-proteome-on-kaggle.html<p>Finding clusters of proteins activity in a sample of data from breast cancer patients and predicting clinical data</p><h2>Finding clusters of proteins activity in a sample of data from breast cancer patients and predicting clinical data</h2>
<p>At a PyData London meetup, somebody asked me if I'd ever done anything on Kaggle. I said I'd had a look at it, but hadn't found any competitions that I cared enough about to enter. He told me about a sample of protein activity from breast cancer patients, and I thought that would be an interesting and potentially worthwhile thing to work on.</p>
<p>Previous investigations had involved clustering the patients, so I decided to cluster the proteins. Using hierarchical clustering I classified the proteins as belonging to 8 clusters. Then, I projected each patient's protein activity onto the space of these clusters, and attempted to use these to predict the patients' clinical data, mainly using Logistic Regression.</p>
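<p>The projection step can be sketched as below. The activity matrix and cluster labels are invented for illustration; in the real analysis the labels came from hierarchical clustering of the proteins.</p>

```python
# Sketch: summarise each patient's protein activity as the mean activity
# within each protein cluster, giving a compact feature vector.
import numpy as np

def cluster_features(activity, labels, n_clusters):
    """activity: (patients, proteins); labels: cluster index per protein."""
    return np.column_stack([
        activity[:, labels == k].mean(axis=1) for k in range(n_clusters)
    ])

activity = np.array([[1.0, 3.0, 10.0, 12.0],
                     [2.0, 2.0,  0.0,  2.0]])
labels = np.array([0, 0, 1, 1])  # proteins 0-1 in cluster 0, 2-3 in cluster 1
features = cluster_features(activity, labels, n_clusters=2)
```

<p>The resulting (patients × clusters) matrix is small enough to feed directly into Logistic Regression against the clinical outcomes.</p>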
<p><a href="https://www.kaggle.com/petebleackley/d/piotrgrabo/breastcancerproteomes/clustering-proteins">My results can be seen in this Kaggle Kernel</a>. They are as good as I could have hoped for, in that they appear to contain information that might help to treat cancer. In particular, patients with activity in a particular cluster of proteins appear to have a much better chance of survival than other patients. When I originally created the kernel, it took too long to run on Kaggle's servers, but it is now possible to run it on GPUs and the results can be seen.</p>