True 212

Natural Language Processing for semantically enhanced content matching

The Client

True 212

The Problem

True 212 wanted to identify relevant content to link to from their news and culture blogs. They believed that a simple bag-of-words approach would lead to naive matches, and wished to extract semantics from the documents to enable richer matches.

The Approach

An NLP pipeline was created with the following stages:

A Named Entity Recognition system that identified candidate named entities in a document and found the corresponding WikiData entities. Known relationships between WikiData entities were used to disambiguate candidate matches.
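As a rough illustration of relationship-based disambiguation, the sketch below picks one entity per mention so that the chosen entities share as many known relationships as possible. The entity identifiers, the candidate lists and the relationship set are hypothetical placeholders rather than real WikiData IDs, and the exhaustive search is an assumption made for brevity; the scoring used in the actual project is not described here.

```python
from itertools import product

# Hypothetical candidate entities per mention (placeholder IDs, not real WikiData IDs).
CANDIDATES = {
    "Paris": ["city_paris", "person_paris"],
    "France": ["country_france"],
    "Seine": ["river_seine"],
}

# Hypothetical known relationships between entities, stored as unordered pairs.
RELATED = {
    frozenset(("city_paris", "country_france")),
    frozenset(("city_paris", "river_seine")),
    frozenset(("river_seine", "country_france")),
}


def coherence(assignment):
    """Count the pairs of chosen entities that share a known relationship."""
    chosen = list(assignment)
    return sum(
        frozenset((a, b)) in RELATED
        for i, a in enumerate(chosen)
        for b in chosen[i + 1:]
    )


def disambiguate(candidates):
    """Pick one entity per mention so that the chosen set is maximally coherent."""
    mentions = list(candidates)
    # Exhaustive search over all combinations: fine for a toy, too slow for long documents.
    best = max(product(*(candidates[m] for m in mentions)), key=coherence)
    return dict(zip(mentions, best))


result = disambiguate(CANDIDATES)
print(result)
```

Here "Paris" resolves to the city because that candidate is related to both of the other entities in the document, while the person candidate is related to neither.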

A Part of Speech Tagger that used Hidden Markov Models to return the probability distribution over the WordNet part-of-speech categories for each word in a sentence.
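One standard way to get a per-word distribution out of a Hidden Markov Model is the forward-backward algorithm; whether the project used exactly this is an assumption. The sketch below uses the four WordNet categories (n, v, a, r for noun, verb, adjective, adverb) and toy start, transition and emission probabilities that are invented for illustration only.

```python
TAGS = ["n", "v", "a", "r"]  # noun, verb, adjective, adverb

# Toy model parameters, invented for this example (not trained values).
START = {"n": 0.5, "v": 0.2, "a": 0.2, "r": 0.1}
TRANS = {
    "n": {"n": 0.3, "v": 0.5, "a": 0.1, "r": 0.1},
    "v": {"n": 0.4, "v": 0.1, "a": 0.3, "r": 0.2},
    "a": {"n": 0.7, "v": 0.1, "a": 0.1, "r": 0.1},
    "r": {"n": 0.1, "v": 0.5, "a": 0.3, "r": 0.1},
}
EMIT = {
    "n": {"dogs": 0.6, "run": 0.2, "fast": 0.1},
    "v": {"run": 0.7, "dogs": 0.05, "fast": 0.05},
    "a": {"fast": 0.6},
    "r": {"fast": 0.7},
}
UNSEEN = 1e-6  # small probability for word/tag pairs missing from EMIT


def emit(tag, word):
    return EMIT[tag].get(word, UNSEEN)


def tag_posteriors(words):
    """Return, for each word, P(tag | whole sentence) via forward-backward."""
    n = len(words)
    # Forward pass: probability of the words so far, ending in each tag.
    fwd = [{t: START[t] * emit(t, words[0]) for t in TAGS}]
    for i in range(1, n):
        fwd.append({
            t: emit(t, words[i]) * sum(fwd[i - 1][s] * TRANS[s][t] for s in TAGS)
            for t in TAGS
        })
    # Backward pass: probability of the remaining words given each tag.
    bwd = [dict.fromkeys(TAGS, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            s: sum(TRANS[s][t] * emit(t, words[i + 1]) * bwd[i + 1][t] for t in TAGS)
            for s in TAGS
        }
    # Combine and normalise so each position sums to one.
    posteriors = []
    for i in range(n):
        unnorm = {t: fwd[i][t] * bwd[i][t] for t in TAGS}
        total = sum(unnorm.values())
        posteriors.append({t: p / total for t, p in unnorm.items()})
    return posteriors


posteriors = tag_posteriors(["dogs", "run", "fast"])
for word, dist in zip(["dogs", "run", "fast"], posteriors):
    print(word, {t: round(p, 3) for t, p in dist.items()})
```

Plain probabilities are adequate for short sentences like this one; a production tagger would normally rescale or work in log space to avoid underflow on long inputs.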

A Word Sense Disambiguation component that used the Viterbi algorithm to find the maximum-likelihood sequence of WordNet IDs corresponding to the words in a given sentence, allowing for stopwords, multiword expressions, named entities and out-of-vocabulary words. This component achieved state-of-the-art accuracy (70%).
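The decoding step can be sketched as a Viterbi search in which each content word has its own small set of candidate senses. Everything below is illustrative: the sense labels are placeholders rather than real WordNet IDs, the probabilities are invented, and skipping words without candidate senses is just one simple way to allow for stopwords and out-of-vocabulary words. Multiword expressions and named entities are not modelled in this sketch.

```python
import math

# Hypothetical candidate senses per word (placeholder labels, not real WordNet IDs).
SENSES = {
    "deposit": ["deposit#money", "deposit#sediment"],
    "bank": ["bank#finance", "bank#river"],
}
# Invented probabilities: how likely each sense is for its word ...
PRIOR = {
    "deposit#money": 0.7, "deposit#sediment": 0.3,
    "bank#finance": 0.5, "bank#river": 0.5,
}
# ... and how likely one sense is to follow another.
TRANSITION = {
    ("deposit#money", "bank#finance"): 0.8,
    ("deposit#money", "bank#river"): 0.2,
    ("deposit#sediment", "bank#finance"): 0.3,
    ("deposit#sediment", "bank#river"): 0.7,
}
DEFAULT_TRANSITION = 0.1  # fallback for sense pairs not listed above


def step_score(prev, cur):
    """Log-score of moving from sense `prev` to sense `cur`."""
    return math.log(TRANSITION.get((prev, cur), DEFAULT_TRANSITION)) + math.log(PRIOR[cur])


def disambiguate(words):
    """Return one sense per word, or None for words without candidate senses."""
    # Stopwords and unknown words are skipped so they do not break the chain.
    positions = [i for i, w in enumerate(words) if w in SENSES]
    senses = [None] * len(words)
    if not positions:
        return senses
    states = [SENSES[words[i]] for i in positions]
    scores = [{s: math.log(PRIOR[s]) for s in states[0]}]
    back = []  # back[t - 1][cur] is the best predecessor of `cur` at step t
    for t in range(1, len(states)):
        column, pointers = {}, {}
        for cur in states[t]:
            prev = max(states[t - 1], key=lambda p: scores[t - 1][p] + step_score(p, cur))
            column[cur] = scores[t - 1][prev] + step_score(prev, cur)
            pointers[cur] = prev
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final sense.
    path = [max(scores[-1], key=scores[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    for i, sense in zip(positions, path):
        senses[i] = sense
    return senses


tagged = disambiguate(["deposit", "at", "the", "bank"])
print(tagged)
```

Working in log space keeps the scores numerically stable, and because the search keeps only the best predecessor for each candidate sense, its cost grows linearly with sentence length rather than with the number of possible sense combinations.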

A Latent Semantic Indexing model which was trained on the semantically enhanced documents to perform rich matching.
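At its core, Latent Semantic Indexing is a truncated singular value decomposition of a term-document matrix. The sketch below shows that idea with plain NumPy rather than Gensim, using sense-tagged tokens in place of raw words. The documents, the token labels and the choice of two latent dimensions are all assumptions made for the example.

```python
import numpy as np

# Hypothetical "semantically enhanced" documents: tokens are sense labels, not raw words.
DOCS = {
    "doc_a": ["bank#finance", "deposit#money", "loan#money"],
    "doc_b": ["bank#finance", "loan#money", "interest#money"],
    "doc_c": ["bank#river", "deposit#sediment", "flood#water"],
    "doc_d": ["bank#river", "flood#water", "erosion#soil"],
}

vocab = sorted({tok for toks in DOCS.values() for tok in toks})
index = {tok: i for i, tok in enumerate(vocab)}


def bow(tokens):
    """Bag-of-tokens count vector over the known vocabulary."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1.0
    return vec


names = list(DOCS)
term_doc = np.column_stack([bow(DOCS[n]) for n in names])  # terms x documents

K = 2  # number of latent dimensions kept (an arbitrary choice for this toy corpus)
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
U_k = U[:, :K]


def project(tokens):
    """Fold a token list into the K-dimensional latent space."""
    return U_k.T @ bow(tokens)


def similarities(tokens):
    """Cosine similarity between a query and every document, in latent space."""
    q = project(tokens)
    sims = {}
    for name in names:
        d = project(DOCS[name])
        sims[name] = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12))
    return sims


sims = similarities(["interest#money"])
print({name: round(score, 3) for name, score in sims.items()})
```

The query token does not appear in doc_a at all, yet doc_a still scores highly because it shares latent structure with doc_b. Because the tokens are sense-tagged, the river documents stay unrelated even though they also mention a "bank".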

Technology Used

  • Numpy
  • Scipy
  • Scikit-Learn
  • Pandas
  • Gensim
  • MongoDB