Bayesian Inference – A naive perspective

Many people have been asking me about the unusual name I have given this blog – “Bayesian Quest”. Well, the name is inspired by one of the important theorems in statistics, the Bayes Theorem. There is also a branch of statistics called Bayesian Inference whose foundation is the Bayes Theorem. Bayesian Inference has shot into prominence in this age of ‘Big Data’ and is therefore widely used in machine learning. This week, I will give a perspective on the Bayes Theorem.

The essence of statistics is to draw inferences about an unknown population from samples. Let me elaborate with an example. Suppose you are part of an agency specializing in predicting poll outcomes of general elections. To publish the most accurate predictions, the ideal method would be to ask all the eligible voters in your country which party they are going to vote for. Obviously, we all know this is not possible, as the cost and time required to conduct such a survey would be prohibitive. So what do you, as a psephologist, do? That’s where statistics and statistical inference methods come in handy. What you would do in such a scenario is select representative samples of people from across the country and ask them about their voting preferences. In statistical parlance this is called sampling. The idea behind sampling is that the samples so selected (if selected carefully) will reflect the mood and voting preferences of the general population. This act of inferring the unknown parameters of the population from the known parameters of the sample is the essence of statistics. There are predominantly two philosophical approaches to statistical inference. The first, the more classical of the two, is called the Frequentist approach, and the second the Bayesian approach.

Let us first see how a frequentist will approach the problem of prediction. For the sake of simplicity, let us assume that there are only two political parties, party A and party B. Any party which gets more than 50% of the popular vote wins the election. A frequentist will start the inference by first defining a set of hypotheses. The first hypothesis, called the null hypothesis, will assert that party A will get more than 50% of the vote. The other hypothesis, called the alternate hypothesis, will state the contrary, i.e. party A will not get more than 50% of the vote. Given these hypotheses, the next task is to test their validity against the sample data. Please note that the two hypotheses are defined with respect to the population (all the eligible voters in the country) and not the sample.

Let our sample consist of 100 people who were interviewed. Out of this sample, 46 people said they will vote for party A, 38 people said they will vote for party B, and the remaining 16 people were undecided. The task at hand is to predict whether party A will get more than 50% in the general election, given the numbers we have observed in the sample. To do the inference, the frequentist will calculate a probability called the ‘P’ statistic (more commonly known as the p-value). The ‘P’ statistic in this case can be defined as follows – it is the probability of observing a result at least as unfavorable to party A as the one we saw, i.e. 46 or fewer people out of a sample of 100 saying they would vote for party A, assuming 50% or more of the population will vote for party A. Confused? Let me simplify this a bit. Suppose there is a definite mood among the public in favor of party A; then there is a high chance of seeing a sample where 40, 50 or even 60 people out of the 100 say they will vote for party A. However, there is a very low chance of seeing a sample with only 10 people out of 100 saying they will vote for party A. Please remember that these chances are with respect to our hypothesis that party A is very popular. On the contrary, if party A were very unpopular, then seeing only 10 people out of 100 saying they will vote for party A is very plausible. The chance, or probability, of seeing the numbers we saw in our sample under the condition that our hypothesis is true is the ‘P’ statistic. Once the ‘P’ statistic is calculated, it is compared to a threshold value, usually 5%. If the ‘P’ value is less than the threshold, we junk our null hypothesis that 50% or more people will vote for party A and go with the alternate hypothesis. On the contrary, if the ‘P’ value is more than 5%, we stick with our null hypothesis. This, in short, is how a frequentist will approach the problem.
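To make this concrete, here is a minimal sketch of the calculation, assuming the 100 responses can be modelled as a binomial sample and simply ignoring the 16 undecided voters. The one-sided p-value covers all outcomes at least as unfavorable to party A as the one observed.

```python
# A minimal sketch of the frequentist calculation, assuming the sample can be
# modelled as Binomial(100, p) and ignoring the 16 undecided voters.
from scipy.stats import binom

n = 100          # sample size
observed = 46    # respondents who said they will vote for party A
p_null = 0.5     # null hypothesis: at least 50% of the population favors party A

# One-sided p-value: probability of seeing 46 or fewer supporters out of 100
# if the true support were exactly 50%.
p_value = binom.cdf(observed, n, p_null)
print(f"p-value = {p_value:.3f}")   # roughly 0.24 here

if p_value < 0.05:
    print("Junk the null hypothesis: party A is unlikely to cross 50%")
else:
    print("Stick with the null hypothesis")
```

With these numbers the p-value is well above 5%, so the frequentist would stick with the null hypothesis.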

A Bayesian will approach this problem in a different way. A Bayesian will take into account historical data from past elections and then assume a probability of party A getting more than 50% of the popular vote. This assumption is called the Prior probability. Looking at the historical data of the past 10 elections, we find that party A got more than 50% of the votes in only 4 of them. In that scenario, we will assume the prior probability of party A getting more than 50% of the votes to be 0.4 (4 out of 10). Once we have assumed a prior probability, we then look at our observed sample data (46 out of 100 saying they will vote for party A) and determine the chance of seeing such data under each hypothesis. This chance is called the Likelihood. The likelihood and the prior are multiplied together (and normalized) to get the final probability, called the posterior probability. The posterior probability is our updated belief, based on the data we observed and the historical prior we assumed. So if the hypothesis that party A crosses 50% ends up with the higher posterior probability, we will conclude that party A has the better chance of winning. This is a rather naive explanation of the Bayesian approach.
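Here is a toy version of that update, under a strong simplifying assumption made purely for illustration: the two competing hypotheses are represented by fixed support levels of 55% and 45% (these numbers are my own, not from the example), while the 0.4 prior comes from the historical record above.

```python
# A toy Bayesian update. The 0.55 / 0.45 support levels are illustrative
# assumptions; the 0.4 prior comes from 4 wins in the last 10 elections.
from scipy.stats import binom

n, observed = 100, 46

prior_wins = 0.4      # prior: party A crosses 50%
prior_loses = 0.6     # prior: party A does not cross 50%

# Likelihood of seeing 46 supporters out of 100 under each hypothesis
like_wins = binom.pmf(observed, n, 0.55)    # "crosses 50%" modelled as 55% support
like_loses = binom.pmf(observed, n, 0.45)   # "falls short" modelled as 45% support

# Posterior = prior * likelihood, normalized so the two probabilities sum to 1
unnorm_wins = prior_wins * like_wins
unnorm_loses = prior_loses * like_loses
posterior_wins = unnorm_wins / (unnorm_wins + unnorm_loses)

print(f"Posterior probability that party A crosses 50%: {posterior_wins:.3f}")
```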

Now that you have seen both the Bayesian and the Frequentist approaches, you might be tempted to ask which of the two is better. Well, this debate has been going on for many years and there is no single right answer; it all depends on the context and the problem at hand. However, in the recent past Bayesian inference has gained a definite edge over Frequentist methods due to its ability to update prior beliefs as more data is observed. In addition, computing power is getting cheaper and faster, making Bayesian inference far more practical than it used to be. I will get into more examples of Bayesian inference in a future post.

 

The Recommendation Engine

I was recently browsing through Amazon, and guess what? All that was displayed to me was a bunch of books, books which I would probably never buy at all. I wasn’t quite surprised by the choices Amazon laid out for me. One reason for this is that I am a very dormant online buyer, so the choices Amazon laid out are a reflection of the fact that it doesn’t know me well at all. But wait a minute, did I just say that Amazon doesn’t know me? How can a website know me? Knowing, understanding, taking care are all traits supposed to be associated with living entities, not with static webpages. If you are also thinking the same way, then you are in for a huge surprise. Static webpages are part of the old dispensation; the new mantra is making everything, from webpages to billboards and every facet which touches customers, teem with life. All this is made possible through advances in the field of machine learning. Yes, machines are now equipped with sufficient intelligence to learn from their interactions with customers, so that they too start taking care of you and me. This is the new dispensation. In this post, I would like to unravel one such application of machine learning, which lies at the heart of online stores like Amazon, eBay etc.: the Recommendation Engine.

As an avid online buyer, you would have noticed that before logging in to any of these online stores, if you just browse the site, you will be shown a bunch of items scrolling before you. These could be items which are totally unrelated to your tastes. However, Amazon or any other online store decides to recommend them to you because they are its top selling or trending items. Bereft of any intelligence about you as a buyer, this is the best the website can lay out for you. This kind of recommendation is called a Non-Personalized recommendation. Such recommendations are made based on the top items being bought or searched for on the site.
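As a rough illustration, a non-personalized recommender can be as simple as counting purchases across all users and surfacing the best sellers; the purchase log below is entirely made up.

```python
# A non-personalized recommender: count purchases across all users and show
# the current best sellers. The purchase log is hypothetical.
from collections import Counter

purchases = [
    ("alice", "book"), ("bob", "phone"), ("carol", "book"),
    ("dave", "headphones"), ("erin", "book"), ("frank", "phone"),
]

item_counts = Counter(item for _, item in purchases)

def top_selling(n=3):
    """Return the n most purchased items, regardless of who is browsing."""
    return [item for item, _ in item_counts.most_common(n)]

print(top_selling(2))   # ['book', 'phone']
```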

Now once you log in, it is a totally different world. Based on your level of activity on the site, you will notice that many of the products recommended to you are more aligned to your tastes. The higher your level of activity, the more aligned to your tastes the recommended products become. This is the part I referred to in the beginning about the site understanding you. The more it understands you, the better it can take care of you. Interesting, isn’t it? This type of recommendation falls under the genre called Personalized recommendations.

Personalized recommendations predominantly work on an algorithm called collaborative filtering. A very simple analogy for collaborative filtering is a huge table, where the rows are users like you and me and the columns are the items which you or I have bought or shown interest in. This is one huge table, with millions and millions of items and as many customers. Each time you buy or even browse something, some value gets updated in your row, under the corresponding item column. One interesting point to note, however, is that you as an individual customer would have bought at most a few hundred kinds of items. This is minuscule compared to the millions of items which adorn the columns of the huge table, and the same is true for most other users. The number of items any single user has shown interest in is tiny in comparison to the total number of items in the table. This kind of representation is called a sparse representation.
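Here is a minimal sketch of such a sparse user-item table, with hypothetical users and items and implicit 0/1 “showed interest” values instead of explicit ratings.

```python
# A tiny sparse user-item table: 1 means the user bought or showed interest in
# the item, everything else is an implicit zero. Names and values are made up.
import numpy as np
from scipy.sparse import csr_matrix

users = ["alice", "bob", "carol"]
items = ["polo_shirt", "ice_bucket", "ice_scoop", "novel"]

# (user_index, item_index) pairs for the interactions that actually happened
interactions = [(0, 0), (0, 3), (1, 0), (2, 1), (2, 2)]

rows, cols = zip(*interactions)
data = np.ones(len(interactions))

user_item = csr_matrix((data, (rows, cols)), shape=(len(users), len(items)))
print(user_item.toarray())
# Only 5 of the 12 cells are filled; at Amazon scale the filled fraction
# is vastly smaller, which is why the table is stored in sparse form.
```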

So naturally you would ask: if each customer buys or shows interest in only a tiny percentage of items, how does Amazon recommend new things? That’s where the intricacies of the collaborative filtering algorithm kick in. As I said earlier, the table is a large table with millions of users. Considering those millions of users and the varied tastes each one has, there will be some transactions against virtually every item in the table. The essence of the collaborative filtering algorithm is to find similarities in this huge table: similarities between users who have bought similar kinds of items, similarities between items which are usually bought together, and so on. It is these similarities, extracted from that huge table, which form the basis of the recommendations. The idea is like this: if you and I both like casual dressing, we will be more inclined to browse for such brands. Based on our transactions, the algorithm will group the two of us as people having similar tastes. The next time you go ahead and buy a new Polo shirt, the algorithm will assume that I might also like such a shirt and will recommend the same kind of shirt to me too. This is how collaborative filtering works. In addition to the similarities between users, the algorithm also finds similarities between items, to further enhance its ability to recommend products.
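A bare-bones sketch of the user-similarity idea, again with made-up data: compute cosine similarity between interaction vectors, pick the user most similar to “me”, and recommend whatever that user has touched that I have not.

```python
# User-based collaborative filtering in miniature. Users, items and the 0/1
# interaction values are hypothetical.
import numpy as np

items = ["polo_shirt", "ice_bucket", "ice_scoop", "novel"]

# Rows: alice, bob (me), carol. 1 = bought / showed interest, 0 = no interaction.
matrix = np.array([
    [1, 0, 0, 1],   # alice
    [1, 0, 0, 0],   # bob (me)
    [0, 1, 1, 0],   # carol
])
me = 1

def cosine(a, b):
    """Cosine similarity between two interaction vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Pick the other user whose interaction vector is most similar to mine ...
others = [u for u in range(len(matrix)) if u != me]
most_similar = max(others, key=lambda u: cosine(matrix[me], matrix[u]))

# ... and recommend what they liked but I have not seen yet
recommendations = [items[i] for i in range(len(items))
                   if matrix[most_similar, i] > 0 and matrix[me, i] == 0]
print(recommendations)   # ['novel']
```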

In addition to the above, there is another type of recommendation. Say you want to buy an ice bucket and you start browsing various models of ice buckets. Once you zero in on the model you like and decide to add it to the cart, you might get a recommendation for an ice scoop saying – “Items usually bought together”. This is an example of similarity between items and is called Market Basket Analysis. The idea behind this algorithm is similar to the one mentioned above: the huge table is analysed, transactions where two or more items are frequently bought together are identified, and one of those items is recommended when the other is being bought.
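A simple way to sketch the “bought together” idea is to count how often pairs of items co-occur in the same transaction; the transactions below are hypothetical.

```python
# A minimal "bought together" sketch: count pairwise co-occurrences across
# transactions and suggest the strongest partner of the item in the cart.
from collections import Counter
from itertools import combinations

transactions = [
    {"ice_bucket", "ice_scoop"},
    {"ice_bucket", "ice_scoop", "tongs"},
    {"ice_bucket", "novel"},
    {"polo_shirt", "novel"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def bought_together(item):
    """Return the item most frequently purchased alongside `item`."""
    partners = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            partners[b] += count
        elif b == item:
            partners[a] += count
    return partners.most_common(1)[0][0] if partners else None

print(bought_together("ice_bucket"))   # 'ice_scoop'
```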

Now, the base of all these data products is the transactions you carry out in the virtual world. The websites you browse, the things you rate, the items you buy, the posts you comment on: all of this generates data which is channelled to make you buy more. And all of it happens without you realizing what is going on. So the next time you browse the net and suddenly find an ad for a new Polo shirt, do not be surprised. “Somebody is Watching”

Watching you…