Using Machine Learning to Detect Fake News
For my final year University project, I was tasked with using Artificial Intelligence to solve a real-world problem.
The misinformation I was seeing every day sprawled across social media, about topics such as the American election and, more recently, the Covid-19 pandemic, sprang to mind. This phenomenon, known as fake news, came to the forefront of international politics with Donald Trump's ascension to the American presidency. Fake news articles can be very dangerous, as they often plausibly pose as real, fact-based articles with the purpose of deceiving the reader. Furthermore, in the modern world of instant and viral communication, it is easy for them to reach a large audience via social media. The consequences can be drastic, influencing a person's vote or their uptake of the Covid-19 vaccination, for example.
So, I had found my problem. The challenge now was how to use machine learning to detect fake news articles.
When assessing whether an article is fake, the reader may consider a variety of factors such as the title, sources, sentiment, spelling and grammar. However, I decided to err on the side of simplicity and consider only the main feature of the article: the words themselves. Somewhat surprisingly, which words appear carries a much greater weighting than the order they appear in or any other feature of the text. One study from Cornell University found that 75%-90% of correct classifications remained constant after the input words were randomly shuffled. https://arxiv.org/abs/2012.15180
Collecting data
Since I was building a supervised machine learning model, I needed a dataset consisting of a list of articles, each labelled with a binary classification: fake or real. So I scraped news articles from several different sources, for example Reuters for real articles and the NY Mag for fake articles.
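Below is a minimal sketch of this scraping step, assuming the requests and BeautifulSoup libraries and a hypothetical list of article URLs; the real sources and the per-site extraction logic are not reproduced here.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical source list -- the real project used pages from Reuters
# (labelled real) and NY Mag (labelled fake); exact URLs are omitted here.
SOURCES = [
    ("https://example.com/real-article-1", "real"),
    ("https://example.com/fake-article-1", "fake"),
]

def scrape_article(url):
    """Fetch a page and return its visible paragraph text as one string."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Naive extraction: join every <p> tag; real sites need per-site selectors.
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

rows = [{"text": scrape_article(url), "label": label} for url, label in SOURCES]
df = pd.DataFrame(rows)  # columns: text, label ("real" / "fake")
```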
Natural Language Processing
I used several NLP techniques to pre-process the text before I could fit it to a machine learning model.
Stopword removal – This is the process of removing common words such as 'the', 'and' and 'it', which add little meaning to a sentence.
Lemmatisation – This is a technique used to reduce each word to its base form, e.g. dogs -> dog, running/ran -> run.
Vectorisation – This is the process of mapping text to a vector of real numbers. Instead of using a simple count vectoriser, I used a Term Frequency-Inverse Document Frequency (TF-IDF) vectoriser. This reduces the weighting for words which are very common and increases the weighting for words which are less common in the overall corpus. A rough sketch of how these three steps fit together is shown below.
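The sketch assumes NLTK for the stopword list and lemmatisation and Scikit-learn's TfidfVectorizer; these specific libraries are my illustration rather than a record of the exact code used.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-off downloads of the NLTK resources used below.
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lower-case, keep alphabetic tokens, drop stopwords and lemmatise the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = (t for t in tokens if t not in STOPWORDS)
    # lemmatize() defaults to noun forms; passing POS tags would also map
    # verbs such as "running" to "run".
    return " ".join(lemmatizer.lemmatize(t) for t in kept)

# df is the article DataFrame from the scraping sketch (columns: text, label).
cleaned = df["text"].apply(preprocess)

# TF-IDF down-weights words that are common across the whole corpus and
# up-weights rarer, more distinctive ones.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)    # sparse document-term matrix
y = (df["label"] == "fake").astype(int)  # 1 = fake, 0 = real
```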
Classification
Once the data had been pre-processed, I used Python's brilliant machine learning library, Scikit-learn, to fit several models to my data, including logistic regression, a decision tree classifier and K-nearest neighbours. The models were all reasonably accurate, with accuracy generally falling in the 60-90% range. After tweaking some of the parameters, I was able to get my logistic regression model to 87% accuracy, meaning that after being trained on one partition of the data (the training set), it correctly classified 87% of the unseen test set.
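A condensed sketch of this comparison is shown below, assuming the TF-IDF matrix X and labels y from the preprocessing step; the specific parameter tweaks that reached 87% are not reproduced here.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hold out a test partition that the models never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "K-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.2%} accuracy on the unseen test set")
```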
Limitations
The dataset used was a major limitation. Because each class was drawn from only a few outlets, and a given outlet tends to reuse similar words across its articles, the model risks learning to recognise the source rather than whether an article is fake, resulting in a classification bias. This could be mitigated by collating articles from a much larger sample of sources.
Further improvements
My model was a first attempt, a simplified version of what a complete model would look like. To further enhance it, I would consider adding features such as sentiment analysis and a score for spelling or grammatical errors.
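As a rough illustration of how such a feature might be bolted on, the sketch below scores each article with NLTK's VADER sentiment analyser and stacks the result alongside the existing TF-IDF matrix; both the choice of sentiment library and the feature-stacking approach are my assumptions rather than part of the original project.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from scipy.sparse import csr_matrix, hstack

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# One sentiment score per article (the compound score ranges from -1 to 1).
sentiment = [[sia.polarity_scores(text)["compound"]] for text in df["text"]]

# Append the sentiment column to the TF-IDF matrix X built earlier,
# giving the classifier one extra feature per article.
X_extended = hstack([X, csr_matrix(sentiment)])
```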
Conclusion
From this project I discovered the complexity of the task at hand. Pairing the vast and growing field of natural language processing with the relatively recent explosion of fake news is a daunting prospect. Although I barely scratched the surface, considerable progress is being made in the field. This article by Bernard Marr explains how Facebook, Twitter and the like are tackling the proliferation of fake news on their sites. https://www.forbes.com/sites/bernardmarr/2020/03/27/finding-the-truth-about-covid-19-how-facebook-twitter-and-instagram-are-tackling-fake-news/?sh=14a9a9419771
Although detecting fake news with machine learning was challenging, I had a lot of fun. It also made me think: given the number of factors the human brain must instantly scan through to determine whether or not an article is fake news, it's no wonder we so often seem to get it wrong…