The Grammar of Truth and Lies

Using Computational Linguistics to Detect Fake News

There's an old saying, which I first read in a Terry Pratchett book, that "A lie can run around the world before the truth has got its boots on." This seems to be a particular problem on the Internet at the moment, as propaganda, conspiracy theories and outright dishonesty are big business. As a computational linguist, I started to wonder whether natural language processing could be used to distinguish between real news and fake news. So when a corpus of fake news articles was published on Kaggle, I decided to investigate.

The first thing I needed was a sample of news articles from a reliable source to compare with the fake news corpus. For this, I used the Reuters Corpus, which is available as part of NLTK. Fortunately, it contained a similar number of articles to the fake news corpus, so I didn't have to deal with class imbalance.

The next challenge was deciding what features to use. I decided not to use vocabulary, since the news stories in the Reuters corpus and those in the fake news corpus were from different time periods, and this would introduce bias: it would be possible to train a model that thought that any mention of "Hillary Clinton" was automatically fake news, for example. Instead, I used features based on the grammatical structure of sentences. Using TextBlob, I performed part-of-speech tagging on each document and concatenated the tags of each sentence to form a single feature per sentence. These features were, of course, ridiculously sparse, so I used Gensim to perform Latent Semantic Indexing on them before classifying with Logistic Regression and Random Forest models from scikit-learn.
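To make the sentence-level feature concrete, here is a minimal sketch of how the tags might be concatenated. It is not the code from the project: the function names, the "_" separator and the idea of counting features per document are my own assumptions for illustration, and the input is hand-tagged so that the example runs without TextBlob or its corpora installed.

```python
from collections import Counter


def sentence_feature(tagged_sentence):
    """Join the POS tags of one sentence into a single feature string.

    The "_" separator is an arbitrary choice for this sketch.
    """
    return "_".join(tag for _word, tag in tagged_sentence)


def document_features(tagged_sentences):
    """Count how often each tag-sequence feature occurs in a document."""
    return Counter(sentence_feature(s) for s in tagged_sentences)


# Hand-tagged input in the (word, tag) shape that TextBlob's `.tags` returns.
# With TextBlob installed, something like
#     [s.tags for s in TextBlob(text).sentences]
# should give the same structure for real text.
doc = [
    [("The", "DT"), ("market", "NN"), ("fell", "VBD")],
    [("Shares", "NNS"), ("rose", "VBD")],
    [("The", "DT"), ("dollar", "NN"), ("rose", "VBD")],
]

features = document_features(doc)
print(features)
```

Because every distinct tag sequence becomes its own feature, most features appear in only a handful of documents, which is where the sparsity comes from.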

The results were OK, but I thought I could do better. At first I tried adding sentiment analysis to the model, which brought a moderate improvement. Then I remembered that stopword frequencies are often used for stylometric analysis, such as author identification. Since they're independent of the content and largely governed by subconscious factors, I thought that they might possibly contain signals of dishonest intent, so I added them to the feature extraction.
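As a rough illustration of the stopword features, the sketch below computes the relative frequency of each stopword in a text. The short stopword list, the regex tokenizer and the function name are assumptions made for this example; a real run would more likely use a full list such as the one that ships with NLTK.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list (an assumption of this sketch).
STOPWORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "was")


def stopword_frequencies(text, stopwords=STOPWORDS):
    """Return each stopword's share of all tokens in the text."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {word: counts[word] / total for word in stopwords}


freqs = stopword_frequencies("The cat sat in the hat and it was happy")
print(freqs)
```

Dividing by the total token count keeps the features comparable between short and long articles.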

This gave me a classifier that was 90% accurate in distinguishing between fake news and reliable sources. It can't say definitively whether an article is true or not, but it's good at picking up whether an article looks more like reliable news or fake news. The best thing is that the model is quite simple, which suggests that finding signals of dishonest intent in online content is a tractable problem.

Watch me present this at PyData London.