In my last article I hinted that one can predict an election outcome using social media data. In this article, I will share a bit more about how we combined social data with other data sources to predict an election outcome in a small constituency in Malaysia with 97% accuracy.
Unlocking the Value of Predictive Analytics
Predictive analytics is a combination of art and science. It draws on human judgement built on anecdotal assumptions, the ability to dissect multiple data sources in different formats, a deep understanding of statistical modelling (and the numbers behind it) and other data science techniques such as machine learning, factor analysis, random forests and many more. And let’s not forget the hours of refinement and reflection. At times, it involves simple mathematics.
The poll-plus model used by Nate Silver to predict the 2008 US Presidential Election is living proof that predicting an election outcome or voters’ behaviour is a mixture of art and science. On the upcoming 2016 US Presidential Election, Nate expressed his latest views:
Polls shift rapidly and often prove to be fairly inaccurate, even on the eve of the election. Non-polling factors, particularly endorsements, can provide some additional guidance, but none of them is a magic bullet
To unlock the real value of predictive analytics, one needs to move away from believing that an investment in a technology (with a click-of-a-button user expectation) will give a crystal-ball answer to every business problem. In a commercial world filled with buzzwords and marketing jargon such as “big data”, “data science”, “world-class product” and “backed by the largest tech VCs”, it is easy to lose sight of what value means when you look at it from a single perspective (e.g. a product or tool). Value means combining people, process and technologies.
We predicted the election results with 97% accuracy. How did we do it?
With opinions on predictive approaches differing around the world, we took the opportunity to predict the results of a 2016 by-election in a small constituency of 42,000 voters. When the official results were out at 2100 hours, our predicted results matched the actual results with 97% accuracy.
Hours of statistical modelling went into testing the assumptions and predictors: analysing the significance of the Cube Law, Multiple Linear Regression (MLR) on factors such as ethnicity and age, statistical analysis of voter sentiment from online polling and social media data, census data, historical voting performance, the effect of age and internet penetration, a review of citizens’ emotions (public mood) at the locality level, national events and other data sets.
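For readers unfamiliar with the Cube Law, here is a minimal Python sketch of its textbook form: in a two-party, first-past-the-post system, the ratio of seats won is roughly the cube of the ratio of votes won. The function name and the two-party simplification are assumptions made for illustration; this is not the exact test we ran on the by-election data.

```python
def cube_law_seat_share(vote_share_a: float) -> float:
    """Expected seat share for party A under the textbook Cube Law.

    Assumes a two-party contest, so party B holds the remaining votes.
    """
    if not 0.0 <= vote_share_a <= 1.0:
        raise ValueError("vote_share_a must be between 0 and 1")
    a = vote_share_a
    b = 1.0 - a
    # Seats_A / Seats_B = (Votes_A / Votes_B) ** 3, rearranged into a share.
    return a ** 3 / (a ** 3 + b ** 3)


if __name__ == "__main__":
    # Illustrative vote shares only, not figures from the by-election.
    for share in (0.45, 0.50, 0.55, 0.60):
        print(f"vote share {share:.2f} -> seat share {cube_law_seat_share(share):.3f}")
```

The point of the sketch is the amplification effect: a modest lead in votes translates into a much larger lead in seats, which is why the law is worth testing as a predictor.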
We create a crystal ball that gives you a set of realistic answers
Since there is no magic bullet in predicting election results, the best representation of a predictive outcome (i.e. the probability of it happening) is a set of best case, worst case and base case scenarios. For the by-election predictive modelling, we predicted the incumbent would win 37% (worst case), 52% (base case) and 57% (best case). Each scenario was carefully reviewed and tested using a set of weighting filters derived from multiple sources of data to represent real-life situations. The actual result should therefore fall within this set of realistic scenarios.
We also performed a Monte Carlo simulation, which has been used in the past to predict the US Presidential Election, as a final sanity check to validate our predicted results. To sum it up, the overall modelling framework and approach that we undertook is shown below.
While the framework shown may appear unexciting to some modellers or data scientists, our real competitive edge lies in the data preparation and cleaning, extrapolation and testing of various predictors that may influence the voting outcome, bias estimation, logistic regression and the layers of assumptions applied in order to achieve a near-perfect prediction.
Our cutting-edge approach includes the development of a customized sentiment algorithm engine using a Naïve Bayes classifier to detect local dialects in both languages (English and Malay) and identify patterns of favourability, or likelihood of voting for either party, using a large sample of social media data (both location based and keyword based). Emotions analytics was also used to measure public mood at localized locations within the constituency.
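To make the idea of weighted scenarios concrete, here is a minimal Python sketch. The three scenario values come from this article; the source names, the per-source estimates and the weights are hypothetical placeholders, and a simple weighted average stands in for our actual weighting filters.

```python
# Scenario values quoted in this article (incumbent's predicted result, in %).
SCENARIOS = {"worst": 37.0, "base": 52.0, "best": 57.0}


def weighted_estimate(estimates: dict, weights: dict) -> float:
    """Blend per-source estimates into one figure using a weighted average."""
    total_weight = sum(weights[source] for source in estimates)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(estimates[s] * weights[s] for s in estimates) / total_weight


def falls_within(value: float, scenarios: dict = SCENARIOS) -> bool:
    """Check whether a figure lies between the worst and best case."""
    return scenarios["worst"] <= value <= scenarios["best"]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    estimates = {"online_poll": 48.0, "social_sentiment": 55.0, "historical_vote": 53.0}
    weights = {"online_poll": 0.2, "social_sentiment": 0.3, "historical_vote": 0.5}
    blended = weighted_estimate(estimates, weights)
    print(f"blended estimate: {blended:.1f}%, within scenarios: {falls_within(blended)}")
```

Changing the weights is one simple way to see how sensitive a blended estimate is to any single data source.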
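As an illustration of how a Monte Carlo sanity check can work, here is a minimal Python sketch. It assumes a two-way contest where the incumbent wins with more than 50% of the vote, treats the 52% base case as a vote share, and draws from a normal distribution with a hypothetical 3-point standard deviation. None of these assumptions are a description of our actual simulation.

```python
import random


def simulate_win_probability(mean_share: float, std_dev: float,
                             n_runs: int = 10_000, seed: int = 42) -> float:
    """Estimate how often the incumbent wins across many simulated elections."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(n_runs):
        # Draw one plausible vote share and clamp it to the valid range.
        share = min(max(rng.gauss(mean_share, std_dev), 0.0), 1.0)
        if share > 0.5:  # assumed winning threshold in a two-way contest
            wins += 1
    return wins / n_runs


if __name__ == "__main__":
    probability = simulate_win_probability(mean_share=0.52, std_dev=0.03)
    print(f"simulated probability of an incumbent win: {probability:.1%}")
```

Running many simulated elections turns a single point estimate into a probability, which is easier to compare against the worst, base and best case scenarios.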
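For those curious about what a Naïve Bayes sentiment classifier looks like under the hood, here is a minimal pure-Python sketch with Laplace smoothing. The class name, the labels and the tiny mixed English and Malay training set are hypothetical; our production engine, its dialect handling and its training data are not shown here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Toy multinomial Naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training posts, for illustration only.
TRAINING = [
    ("sokong calon ini bagus", "favourable"),
    ("good candidate i will vote for him", "favourable"),
    ("calon ini teruk tolak", "unfavourable"),
    ("bad candidate will not vote", "unfavourable"),
]

if __name__ == "__main__":
    model = NaiveBayesSentiment()
    model.train(TRAINING)
    print(model.predict("calon bagus sokong"))
```

A real engine would need far more training data, proper tokenisation and handling of slang and code-switching, but the scoring logic follows the same principle.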
Last words on Predictive Analytics
More often than not, given the rapid evolution of computing technology and the internet, users forget that technology (or tools) is merely an automated enabler, one that can deceive you into believing that the fancy charts or dashboards on your computer screen are the gospel truth. Sadly, many do not question the underlying assumptions and the accuracy of the data that they view every day.
The hard truth is, in the world of data science, data analytics is derived from a classic recipe (i.e. mathematics and statistics) cooked in a brand new electric oven (i.e. technology) by an amazing first-class science graduate (i.e. people) who is not afraid to explore new approaches or boundaries (i.e. process).
For more information on how we can apply predictive analytics to assist your organization, drop us a message at firstname.lastname@example.org
Special thanks to my team members Salma, Adilah and Adan for the relentless hours.