Natural Language Processing for semantically enhanced content matching
True 212 wanted to identify relevant content to link to from their news and culture blogs. They believed that a simple bag-of-words approach would lead to naive matches, and wished to extract semantics from the documents to enable richer matches.
An NLP pipeline was created with the following stages:
A Named Entity Recognition system that identified candidate named entities in a document and found corresponding WikiData entities. Known relationships between WikiData entities were used to disambiguate candidate matches.
A Part of Speech Tagger that used Hidden Markov Models to return, for each word in a sentence, a probability distribution over the part of speech categories used in WordNet.
A Word Sense Disambiguation component that used the Viterbi algorithm to find the maximum-likelihood sequence of WordNet IDs corresponding to the words in a given sentence, allowing for stopwords, multiword expressions, named entities and out-of-vocabulary words. This component achieved state-of-the-art accuracy (70%).
A Latent Semantic Indexing model which was trained on the semantically enhanced documents to perform rich matching.
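To make the two Hidden Markov Model stages more concrete, here is a minimal sketch in Python. It is not the code used in the project: the two-tag state set, the two-word vocabulary and every probability are invented for illustration, and the function names posteriors and viterbi are hypothetical. The first function uses the forward-backward algorithm to return a probability distribution over tags for each word, as the tagger stage does; the second uses the Viterbi algorithm to return the single most likely sequence of labels, as the disambiguation stage does (there the labels would be WordNet IDs rather than tags). The sketch assumes a non-empty sentence whose words are all in the vocabulary, so it does not cover stopwords, multiword expressions, named entities or out-of-vocabulary words.

```python
# Toy HMM. Every number below is an invented, illustrative assumption.
STATES = ["n", "v"]  # two of WordNet's part of speech categories
START = {"n": 0.6, "v": 0.4}
TRANS = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
EMIT = {"n": {"dogs": 0.7, "bark": 0.3}, "v": {"dogs": 0.1, "bark": 0.9}}


def posteriors(words):
    """Forward-backward: one probability distribution over tags per word."""
    n = len(words)
    # Forward pass: alpha[t][s] = P(words[0..t], tag at t is s)
    alpha = [{s: START[s] * EMIT[s][words[0]] for s in STATES}]
    for t in range(1, n):
        alpha.append({
            s: EMIT[s][words[t]]
            * sum(alpha[t - 1][p] * TRANS[p][s] for p in STATES)
            for s in STATES
        })
    # Backward pass: beta[t][s] = P(words[t+1..] | tag at t is s)
    beta = [dict.fromkeys(STATES, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(TRANS[s][q] * EMIT[q][words[t + 1]] * beta[t + 1][q]
                   for q in STATES)
            for s in STATES
        }
    total = sum(alpha[-1].values())  # likelihood of the whole sentence
    return [
        {s: alpha[t][s] * beta[t][s] / total for s in STATES}
        for t in range(n)
    ]


def viterbi(words):
    """Viterbi: the single most likely label sequence for the sentence."""
    # delta[s] = probability of the best path so far that ends in state s
    delta = {s: START[s] * EMIT[s][words[0]] for s in STATES}
    backpointers = []
    for word in words[1:]:
        best_prev = {
            s: max(STATES, key=lambda p: delta[p] * TRANS[p][s])
            for s in STATES
        }
        delta = {
            s: delta[best_prev[s]] * TRANS[best_prev[s]][s] * EMIT[s][word]
            for s in STATES
        }
        backpointers.append(best_prev)
    # Trace back from the best final state to recover the full path.
    path = [max(STATES, key=lambda s: delta[s])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    sentence = ["dogs", "bark"]
    print(posteriors(sentence))
    print(viterbi(sentence))
```

The two functions answer different questions. The per-word distributions keep the uncertainty about each tag, which is useful as input to a later stage, whereas Viterbi commits to one globally consistent sequence.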