Bayesian Inference – A naive perspective

Many people have been asking me about the unusual name I have given this blog – “Bayesian Quest”. Well, the name is inspired by one of the important theorems in statistics, the Bayes Theorem. There is also a branch of statistics called Bayesian Inference whose foundation is the Bayes Theorem. Bayesian Inference has shot into prominence in this age of ‘Big Data’ and is therefore widely used in machine learning. This week, I will give a perspective on the Bayes Theorem.

The essence of statistics is to draw inferences about an unknown population from samples. Let me elaborate on this with an example. Suppose you are part of an agency specializing in predicting poll outcomes of general elections. To publish the most accurate predictions, the ideal method would be to ask all the eligible voters in your country which party they are going to vote for. Obviously we all know that this is not possible, as the cost and time required to conduct such a survey would be prohibitive. So what do you, as a psephologist, do? That’s where statistics and statistical inference methods come in handy. What you would do in such a scenario is to select representative samples of people from across the country and ask them questions about their voting preferences. In statistical parlance this is called sampling. The idea behind sampling is that the samples so selected (if selected carefully) will reflect the mood and voting preferences of the general population. This act of inferring the unknown parameters of the population from the known parameters of the sample is the essence of statistics. There are predominantly two philosophical approaches to statistical inference. The first, which is the more classical of the two, is called the Frequentist approach, and the second the Bayesian approach.

Let us first see how a frequentist will approach the problem of prediction. For the sake of simplicity, let us assume that there are only two political parties, party A and party B. Any party which gets more than 50% of the popular vote wins the election. A frequentist will start the inference by first defining a set of hypotheses. The first hypothesis, called the null hypothesis, will assert that party A will get more than 50% of the vote. The other hypothesis, called the alternate hypothesis, will state the contrary, i.e. party A will not get more than 50% of the vote. Given these hypotheses, the next task is to test their validity against the sample data. Please note here that the two hypotheses are defined with respect to the population (all the eligible voters in the country) and not the sample.

Let our sample consist of 100 people who were interviewed. Out of this sample, 46 people said they will vote for party A, 38 people said they will vote for party B, and the remaining 16 people were undecided. The task at hand is to predict whether party A will get more than 50% in the general election, given the numbers we have observed in the sample. To do the inference, the frequentist will calculate a probability statistic called the ‘P’ statistic (the p-value). The ‘P’ statistic in this case can be defined as follows – it is the probability of observing 46 (or fewer) people out of a sample of 100 who would vote for party A, assuming 50% or more of the population will vote for party A. Confused? Let me simplify this a bit more. Suppose there is a definite mood among the public in favor of party A; then there is a high chance of seeing a sample where 40 people, or 50, or even 60 people out of the 100 say that they will vote for party A. However, there is a very low chance of seeing a sample with only 10 people out of 100 saying that they will vote for party A. Please remember that these chances are with respect to our hypothesis that party A is very popular. On the contrary, if party A were very unpopular, then seeing only 10 people out of 100 saying they will vote for party A is very plausible. The chance, or probability, of seeing a result at least as extreme as the one we saw in our sample, under the condition that our null hypothesis is true, is the ‘P’ statistic. Once the ‘P’ statistic is calculated, it is compared to a threshold value, usually 5%. If the ‘P’ value is less than the threshold, we will junk our null hypothesis that 50% or more people will vote for party A and go with the alternate hypothesis. On the contrary, if the P value is more than 5%, we will stick with our null hypothesis. This, in short, is how a frequentist will approach the problem.
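To make this concrete, here is a minimal sketch of the calculation described above, written in Python. It is my own illustration, not code from any particular polling workflow: it ignores the undecided voters for simplicity and models the null hypothesis at the boundary case of exactly 50% support.

```python
from scipy.stats import binom

# Null hypothesis: at least 50% of the population will vote for party A.
# We model the boundary case p = 0.5 and ask how likely it is to see a
# sample count as low as the one we observed (46 supporters out of 100).
n, observed, p_null = 100, 46, 0.5

# One-sided p-value: probability of seeing 46 or fewer supporters
# when the true support rate is 50%.
p_value = binom.cdf(observed, n, p_null)
print(f"p-value = {p_value:.3f}")   # roughly 0.24

if p_value < 0.05:
    print("Reject the null hypothesis: the data argue against 50%+ support")
else:
    print("Stick with the null hypothesis of 50% or more support")
```

Since the p-value comfortably exceeds the 5% threshold, the frequentist in this example would stick with the null hypothesis.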

A Bayesian will approach this problem in a different way. A Bayesian will take into account historical data from past elections and then assume a probability of party A getting more than 50% of the popular vote. This assumption is called the prior probability. Looking at the historical data of the past 10 elections, we find that party A got more than 50% of the votes in only 4 of them. In that scenario we will assume the prior probability of party A getting more than 50% of the votes to be 0.4 (4 out of 10). Once we have assumed a prior probability, we then look at our observed sample data (46 out of 100 saying they will vote for party A) and determine the possibility of seeing such data under the assumed prior. This possibility is called the likelihood. The likelihood and the prior are multiplied together (and normalized) to get the final probability, called the posterior probability. The posterior probability is our updated belief based on the data we observed and the historical prior we assumed. So if party A has a higher posterior probability than party B, we will conclude that party A has a higher chance of getting more than 50% of the votes than party B. This is rather a naive explanation of the Bayesian approach.
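Below is a minimal sketch of this kind of update, again my own illustration rather than code from the post. It compares just two hypotheses, and the specific support rates used to compute the likelihoods (55% and 45%) are assumptions made purely so the example runs end to end.

```python
from scipy.stats import binom

# Two competing hypotheses about the population:
#   H1: party A gets more than 50% of the vote (modelled as support rate 0.55)
#   H2: party A gets 50% or less               (modelled as support rate 0.45)
# Priors come from the historical record: 4 wins in the last 10 elections.
prior_h1, prior_h2 = 0.4, 0.6

# Likelihood of observing 46 supporters out of a sample of 100 under each hypothesis
like_h1 = binom.pmf(46, 100, 0.55)
like_h2 = binom.pmf(46, 100, 0.45)

# Posterior is proportional to prior * likelihood; normalise so the two sum to 1
unnorm_h1 = prior_h1 * like_h1
unnorm_h2 = prior_h2 * like_h2
total = unnorm_h1 + unnorm_h2
posterior_h1, posterior_h2 = unnorm_h1 / total, unnorm_h2 / total

print(f"P(H1 | data) = {posterior_h1:.3f}")
print(f"P(H2 | data) = {posterior_h2:.3f}")
```

The posterior probabilities are the updated beliefs: the prior from history is pulled towards whichever hypothesis explains the observed sample better.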

Now that you have seen both the Bayesian and Frequentist approaches, you might be tempted to ask which of the two is better. Well, this debate has been going on for many years and there is no single right answer. It all depends on the context and the problem at hand. However, in the recent past Bayesian inference has gained a definite edge over Frequentist methods due to its ability to update prior beliefs as more data is observed. In addition, computing power is getting cheaper and faster, making Bayesian inference far more practical than it used to be. I will get into more examples of Bayesian inference in a future post.

 

Machine Learning in Action – Word Prediction

In my previous blog on machine learning, I explained the science behind how a machine learns its parameters. This week, I will delve into a very common application which we use in our day-to-day life – next word prediction.

When we text on our smartphones, all of us would have appreciated how our phones make typing so easy by predicting or suggesting the word we have in mind. Many would also have noticed that our phones predict words we tend to use regularly in our personal lexicon. Our phones have learned from our pattern of usage and give us a personalized offering. This genre of machine learning falls under a very potent field called Natural Language Processing (NLP).

Natural Language Processing deals with the ways in which machines derive learning from human languages. The basic input in the NLP world is something called a corpus (plural: corpora), which is essentially a large collection of words or groups of words from the language. Some of the most prominent corpora for English are the Brown Corpus, the American National Corpus, etc. Even Google has its own linguistic corpora, with which it achieves many of the amazing features in its products. Deriving learning out of a corpus is the essence of NLP. In the context we are discussing, i.e. word prediction, it is about learning from the corpus to do prediction. Let us now see how we do it.

The way we learn from the corpus is through the use of some simple rules of probability. It all starts with calculating the frequencies of words or groups of words within the corpus. For finding the frequencies, what we use is something called an n-gram model, where the “n” stands for the number of words which are grouped together. The most common n-gram models are the trigram and the bigram models. For example, the sentence “the quick red fox jumps over the lazy brown dog” has the following word-level trigrams (source: Wikipedia), with a small code sketch after the list showing how such n-grams can be generated:

the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog
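
As referenced above, here is a minimal sketch (my own illustration) of how the trigrams and bigrams of a sentence can be generated in Python:

```python
def ngrams(text, n):
    """Split a sentence into overlapping groups of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the quick red fox jumps over the lazy brown dog"
print(ngrams(sentence, 3))   # the eight trigrams listed above
print(ngrams(sentence, 2))   # the corresponding bigrams
```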

Similarly, a bigram model will split a given sentence into combinations of two-word groups. These trigrams or bigrams form the basic building blocks for calculating the frequencies of word combinations. The idea behind the calculation of frequencies of word groups goes like this. Suppose we want to calculate the frequency of the trigram “the quick red”. What we look for in this calculation is how often we find the combination of the words “the” and “quick” followed by “red” within the whole corpus. Suppose in our corpus there were 5 instances where the words “the” and “quick” were followed by the word “red”; then the frequency of this trigram is 5.
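Counting such frequencies over a whole corpus is simply a matter of tallying every n-gram that occurs. Here is a minimal sketch, using a tiny made-up corpus purely for illustration (a real corpus would contain millions of words):

```python
from collections import Counter

# A toy "corpus" of a few sentences (an assumption for illustration only)
corpus = [
    "the quick red fox jumps over the lazy brown dog",
    "the quick red car sped past the lazy brown dog",
    "the quick brown fox is quicker than the lazy dog",
]

def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Tally every trigram and bigram across all sentences in the corpus
trigram_counts = Counter(t for sentence in corpus for t in ngrams(sentence, 3))
bigram_counts = Counter(b for sentence in corpus for b in ngrams(sentence, 2))

print(trigram_counts["the quick red"])   # 2 in this toy corpus
print(bigram_counts["the quick"])        # 3 in this toy corpus
```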

Once the frequencies of the word groups are found, the next step is to calculate the probabilities of the trigrams. The probability of a trigram is just its frequency divided by the total number of trigrams within the corpus. Suppose there are around 500,000 trigrams in our corpus; then the probability of our trigram “the quick red” will be 5/500,000. What the prediction model actually works with, though, are conditional probabilities: the probability of an event happening given that something else has happened. In our trigram context, this means the probability of seeing the word “red” given that it was preceded by the words “the” and “quick”, which is estimated as the count of “the quick red” divided by the count of the bigram “the quick”. Conditional probabilities of this kind are the same building blocks used in Markov models such as the Hidden Markov Model (HMM), which is widely used in NLP. Extending the same concept to bigrams, it would mean the probability of seeing the second word given that we have seen the first word. So if “My God” is a bigram, then the conditional probability would be the probability of seeing the word “God” given that it was preceded by the word “My”.
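The conditional probability of a word given the two words before it can be estimated directly from the two count tables. A minimal sketch, with toy counts (assumed for illustration, matching the earlier sketch):

```python
from collections import Counter

def conditional_prob(context, word, trigram_counts, bigram_counts):
    """Estimate P(word | context) = count(context + word) / count(context)."""
    if bigram_counts[context] == 0:
        return 0.0                       # context never seen in the corpus
    return trigram_counts[f"{context} {word}"] / bigram_counts[context]

# Toy counts, assumed for illustration
trigram_counts = Counter({"the quick red": 2, "the quick brown": 1})
bigram_counts = Counter({"the quick": 3})

print(conditional_prob("the quick", "red", trigram_counts, bigram_counts))  # 2/3
```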

The trigrams and bigrams, along with the calculated probabilities, arranged in a huge table, form the basis of the word prediction algorithm. The mechanism of prediction works like this. Suppose you were planning to type “Oh my God” and you typed the first word “Oh”. The algorithm will quickly go through the n-gram table and identify the n-grams starting with the word “Oh”, in decreasing order of probability. So if the top entries in the n-gram table starting with “Oh” are “Oh come on”, “Oh my God” and “Oh Dear Lord”, in decreasing order of probability, the algorithm will predict the words “come”, “my” and “Dear” as your three choices as soon as you type the first word “Oh”. After you type “Oh”, suppose you then type “my”; the algorithm reworks the prediction and looks at the highest-probability n-gram combinations preceded by the words “Oh” and “my”. In this case the word “God” might be the most probable choice that gets predicted. The algorithm keeps giving predictions as you keep typing more and more words. At every instance of the texting process, the algorithm looks at the last two words you have already typed to predict the next word, and the process continues.
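Here is a minimal sketch of that lookup step. The tiny probability table and its values are assumptions made purely for illustration; a real table would be built from corpus counts as described above:

```python
# Toy table mapping a two-word context to candidate next words and their
# conditional probabilities (assumed values, for illustration only).
ngram_table = {
    "oh my": {"god": 0.55, "goodness": 0.25, "word": 0.05},
    "my god": {"that": 0.30, "what": 0.20, "no": 0.10},
}

def predict_next(context, table, k=3):
    """Return the k most probable next words for the last two words typed."""
    candidates = table.get(context.lower(), {})
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

print(predict_next("Oh my", ngram_table))   # ['god', 'goodness', 'word']
```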

The algorithm which I have explained here is a very simple one, involving n-grams and Markov-style probability models. Needless to say, there are more sophisticated approaches, such as those based on neural networks. I will explain neural networks and their applications in a future post.