Neural networks don't match the performance of the Viterbi algorithm for Word Sense Disambiguation
For two previous clients, Metafused and True 212, I have created Word Sense Disambiguation systems. I used a Bayesian Viterbi algorithm, and in each case achieved an accuracy of 70%, which, according to the Scholarpedia article I've just linked, is good. However, I've always wondered if I could do better. Since neural networks (or "deep learning") are fashionable in AI research at the moment, I thought that while I was between projects, I'd have a go at seeing what they could do. After all, they are currently popular for machine translation, which is an analogous problem.
I tried two different neural architectures, LSTM and convolutional networks, both built with Keras. In each case, I trained two LSI models on the Semcor corpus, one representing words, the other WordNet senses. The neural network was trained to map from the word embedding to the sense embedding, and the sense embedding was then searched for the word sense closest to each vector the network produced. And the results were...
Absolute gibberish. The output bore no resemblance to the input whatsoever.
Why was that? Well, with the Viterbi algorithm, I only had to search among the word senses that were relevant to the input word. The neural network had to search the entire space of WordNet senses, and the senses were packed pretty densely in that embedding. That meant that the slightest error in the mapping would lead to the wrong sense being identified.
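To make that failure mode concrete, here is a minimal sketch rather than the code I actually ran. The sense names, the two-dimensional vectors and the CANDIDATES lexicon are all made up for illustration (real LSI vectors would have hundreds of dimensions). It compares a nearest-neighbour search over every sense with one restricted to the senses of the input word.

```python
import math

# Hypothetical 2-D sense embeddings, invented for this example.
SENSE_VECTORS = {
    "bank.n.01": (1.00, 0.00),   # sloping land beside water
    "bank.n.02": (0.00, 1.00),   # financial institution
    "shore.n.01": (0.97, 0.26),  # a sense of a different word, close to bank.n.01
}

# Hypothetical lexicon: the senses each word is allowed to take.
CANDIDATES = {"bank": ["bank.n.01", "bank.n.02"]}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_sense(vector, senses):
    """Return the sense in `senses` whose embedding is most similar to `vector`."""
    return max(senses, key=lambda s: cosine(vector, SENSE_VECTORS[s]))


# Suppose the network's output for "bank" (river sense) is slightly off target.
predicted = (0.96, 0.28)

global_choice = nearest_sense(predicted, SENSE_VECTORS)           # every sense
restricted_choice = nearest_sense(predicted, CANDIDATES["bank"])  # senses of "bank" only
```

With these toy numbers, the unrestricted search lands on a sense belonging to a different word altogether, while the restricted search still picks the intended sense of "bank" despite the same error in the predicted vector.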
Secondly, I think that the limit of 70% accuracy in Word Sense Disambiguation comes from the training data. There's only really Semcor available, and while it's a good corpus, I believe that a much bigger corpus of WordNet-tagged sentences would be necessary to make a significant improvement in Word Sense Disambiguation performance. Modern machine translation systems use huge corpora harvested from the web, and even then they are often ropy and fragile.
The Scholarpedia article linked above suggests that using some global information about a document may improve performance. At a later date I may experiment with integrating this into the Viterbi algorithm.
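For readers who haven't met it, the kind of Viterbi decoding I keep referring to can be sketched as follows. This is a generic HMM-style version, not the code from either client project: the function name, the table layout, the `floor` smoothing and all the probabilities are assumptions made up for the example.

```python
import math


def viterbi(words, candidates, emission, transition, floor=1e-6):
    """Return the most probable sense sequence for `words`.

    candidates[word]           -> list of senses that word can take
    emission[(sense, word)]    -> P(word | sense)
    transition[(prev, sense)]  -> P(sense | prev); prev is None at the start
    Events missing from a table get probability `floor` (crude smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[sense] = (log-probability of the best path ending in sense, that path)
    best = {
        s: (logp(transition, (None, s)) + logp(emission, (s, words[0])), [s])
        for s in candidates[words[0]]
    }
    for word in words[1:]:
        step = {}
        for s in candidates[word]:  # only senses relevant to this word
            score, path = max(
                (prev_score + logp(transition, (prev, s)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[s] = (score + logp(emission, (s, word)), path + [s])
        best = step
    return max(best.values())[1]


# Toy tables with made-up probabilities.
candidates = {"river": ["river.n.01"], "bank": ["bank.n.01", "bank.n.02"]}
emission = {
    ("river.n.01", "river"): 1.0,
    ("bank.n.01", "bank"): 1.0,
    ("bank.n.02", "bank"): 1.0,
}
transition = {
    (None, "river.n.01"): 1.0,
    ("river.n.01", "bank.n.01"): 0.8,
    ("river.n.01", "bank.n.02"): 0.2,
}
senses = viterbi(["river", "bank"], candidates, emission, transition)
```

If document-level information were integrated, one plausible place for it would be an extra log-prior term added to each sense's score inside the loop. That is a guess about where it might fit, not something I have tested.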