The Logic of Logistic Regression

At the onset let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent type of problems encountered in a machine learning context – classification problem. I will also introduce one of the basic algorithms used in the classification context called the logistic regression.

classification

In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction there are basically two types of predictions  we encounter in a machine learning context. In the first type, given some data your aim is to estimate a real scalar value. For example, predicting the amount of rainfall  from meteorological data or predicting the stock prices based on the current economic environment or predicting sales based on the past market data are all valid use cases of the first type of prediction context. This genre of prediction problems is called the regression problem. The second type of problems deal with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not , predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit etc fall under this gamut of problem. The second type of problem is called the classification problem. As mentioned earlier classification problems are the most widely encountered ones in the machine learning domain and therefore I will devout considerable space to give an intuitive sense of the classification problem. In this post I will define the basic settings for classification problems.

Classification Problems Unplugged – Setting the context

In a machine learning setting we work around with two major components. One is the data we have at hand and the second are the parameters of the data. The dynamics between the data and the parameters provides us the results which we want i.e the correct prediction. Of these two components, the one which is available readily to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation and you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance in the future.

For simplicity let us assume that each customers data consists of three variables

  • Age of the customer
  • Income of the customer and
  • A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

Customer                Age                 Income                Propensity
Cust-1                                   22                      1000                           1
Cust-2                                   36                     5000                           6
Cust-3                                   62                     4500                            8

Suppose, we also have a set of parameters which were derived from our historical data on past leads and the conversion rate(i.e how many of the leads actually bought the insurance product).

Let the parameters be denoted by ‘W’ suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two data points and arrive at some relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

Customer           Age                 Income             Propensity           Total Score
Cust-1                  22 x 8         +     1000 x 3     +    1 x 10                  3186
Cust-2                 36 x 8         +    5000 x 3     +     6 x 10                  15,348
Cust-3                  62 x 8          +   4500 x 3     +    8 x 10                 14,076

Now that we have the weighted score for each customer, its time to arrive at some decisions. From our past experience we have also observed that any lead, obtaining a score of  more than 14,000 tend to buy an insurance policy. So based on this knowledge we can comfortably make prediction that customer 1 will not buy the insurance policy and that there is very high chance that customer 2 will buy the policy. Customer 3 is in the borderline and with little efforts one can convert this customer too. Equipped with this predictive knowledge, the sales force can then focus their attention to customer 2 & 3 so that they get more “bang for their buck”.

In the above toy example, we can observe some interesting dynamics at play,

  1. The derivation of the parameters for each variable – In machine learning, the quality of the results we obtain depend to a large extend on the parameters or weights we learn.
  2. The derivation of the total score – In this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function(multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions.The activation functions converts the parameters and data into a composite measure aiding the final decision.
  3. The decision boundary – The score(14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction  is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch out this space for more on that.

Bayesian Inference – A naive perspective

Many people have been asking me on the unusual name I have given for this Blog – “Bayesian Quest”. Well, the name is inspired from one of the important theorems in statistics ‘The Bayes Theorem’. There is also a branch in statistics called Bayesian Inference whose foundation is  the Bayes Theorem. Bayesian Inference has shot into prominence in this age of ‘Big Data’ and is therefore widely used in machine learning. This week, I will give a perspective on Bayes Theorem.

The essence of Statistics is to draw inference on an unknown population, from samples. Let me elaborate this with an example.  Suppose you are part of an agency specializing in predicting poll outcomes of general elections. To publish the most accurate predictions, the ideal method would be to ask  all the eligible voters within your country  which party they are going to vote. Obviously we all know that this is not possible as the cost and time required to conduct such a survey will be prohibitively expensive. So what do you, as a Psephologist do ? That’s where statistics and statistical inference methods comes in handy. What you would do in such a scenario is to select representative samples of people  from across the country and ask them questions on their voting preferences. In statistical parlance this is called sampling. The idea behind sampling is that, the sample sizes so selected( if selected carefully) will reflect the mood and voting preferences of the general population.  This act of inferring the unknown parameters of the population from the known parameters of the sample is the essence of statistics.There are predominantly two philosophical approaches for doing statistical inference. The first one, which is the more classical of the two is called the Frequentist approach and the second the Bayesian approach.

Let us first see how a frequentist will approach the problem of predictions. For the sake of simplicity let us assume that there are only two political parties, party A and party B.Any party which gets more than 50% of popular votes wins in the election. A frequentist will start their inference by first defining a set of hypothesis. The first hypothesis, which is called the null hypothesis, will ascertain that party A will get more than 50% vote. The other hypothesis, called the alternate hypothesis, will state the contrary i.e. party A will not get more than 50% vote. Given these hypothesis, the next task is to test the validity of these hypothesis from the sample data. Please note here, that the two hypothesis are defined with respect to population(all the eligible voters in the country) and not the sample.

Let  our sample size consist of 100 people who were interviewed. Out of this sample 46 people said they will vote for party A, 38 people said that they will vote for party B and the balance 16 people were undecided. The task at hand is to predict whether party A will get more than 50% in the general election given the numbers we have observed in the sample. To do the inference the frequentist will calculate a probability statistic called the ‘P’ statistic. The ‘P’ statistic in this case can be defined as follows – It is the probability of observing 46 people from a sample of  100 people who would vote for party A, assuming 50% or more of the population will vote for party A. Confused ????? ………….. Let me simplify this a bit more. Suppose there is a definite mood among the public in favor of party A, then there is a high chance of seeing a sample where  40 people or 50 or even 60 people out of the 100 saying that they will vote for party A. However there is very low chance to see a sample with only 10 people out of 100 saying that they will vote for party A. Please remember that these chances are with respect to our hypothesis that party A is very popular. On the contrary if party A were very unpopular, then the chance of seeing  10 people out of 100 saying they will vote for party A, is very plausible. The chance or probability of seeing the number we saw in our sample under the condition that our hypothesis is true is the ‘P’ statistic. Once the ‘P’ statistic is calculated , it is then compared to a threshold value usually 5%. If the ‘P’ value is less than the threshold value we will junk our null hypothesis that 50% or more people will vote for party A and will go with the alternate hypothesis. On the contrary if the P value is more than 5% we will stick with our null hypothesis. This in short is how a frequentist will approach the problem.

A Bayesian will approach this problem in a different way. A Bayesian will take into account historical data of past elections and then assume the probability of party A getting more than 50% of popular vote. This assumption is called the Prior probability.Looking at the historical data of the past 10 elections,  we find that only in 4 of them party A has got more than 50% of votes. In that scenario we will assume the prior probability of party A getting more than 50% of votes as .4( 4 out of 10). Once we have assumed a prior probability, we then look at our observed sample data ( 46 out of 100 saying they will vote for party A) and determine the possibility of seeing such data under the assumed prior. This possibility is called the Likelihood. The likelihood and the prior is multiplied together to get the final probability called the posterior probability. The posterior probability is our updated belief based on the data we observed and also the historical prior we assumed. So if party A has higher posterior probability than party B, we will assume that Party A has higher chance of getting more than 50% of votes than party B. This is rather a very naive explanation to the Bayesian approach.

Now that you have seen both Bayesian and Frequentist approaches you might be tempted to ask which is the better among the two. Well this debate has been going on for many years and there is no right answer. It all depends on the context and the problem which is at hand. However, in the recent past Bayesian inference has gained a definite edge over the Frequentist methods due to its ability to update prior beliefs through observation of more data. In addition, computing power is also getting cheaper and faster making Bayesian inference much more fulfilling than Frequentist methods. I will get into more examples of Bayesian inference in a future post.