Causal Estimation Methods for Machine Learning and Data Science, Part II – Propensity Score Matching


1.0 Introduction

In the world of data science, uncovering cause-and-effect relationships from observational data can be quite challenging. In our earlier discussion, we dived into using linear regression for causal estimation, a fundamental statistical method. In this post, we introduce propensity score matching. This technique builds on linear regression foundations, aiming to enhance our grasp of cause-and-effect dynamics. Propensity score matching is crafted to deal with confounding variables, offering a refined perspective on causal relationships. Throughout our exploration, we’ll showcase the potency of propensity scores across various industries, emphasizing their pivotal role in the art of causal inference.

2.0 Structure

We will be traversing the following topics as part of this blog.

  • Introduction to Propensity Score Matching (PSM)
  • Process for implementing PSM
    • Propensity Score Estimation
    • Matching Individuals
    • Stratification
    • Comparison and Causal Effect Estimation
  • PSM implementation from scratch using Python
    • Generating synthetic data and defining variables
    • Training the propensity model using logistic regression
    • Predicting the propensity scores for the data
    • Separating the treated and control groups
    • Using a nearest-neighbour algorithm to identify matches for the treated and control groups
    • Calculating the Average Treatment Effect on the Treated (ATT) and the Average Treatment Effect on the Control (ATC)
    • Calculation of the Average Treatment Effect (ATE)
  • Practical examples of PSM in Retail, Finance, Telecom and Manufacturing

3.0 Introduction to Propensity Score Matching

Propensity score matching is a statistical method used to estimate the causal effect of a treatment, policy, or intervention by balancing the distribution of observed covariates between treated and untreated units. In simpler terms, it aims to create a balanced comparison group for the treated group, making the treatment and control groups more comparable.

Consider a retail company evaluating the impact of a customer loyalty program on purchase behavior. The company wants to understand if customers enrolled in the loyalty program (treatment group) show a significant difference in their average purchase amounts compared to non-enrolled customers (control group). Covariates for these groups include customer demographics (age, income), historical purchase patterns, and frequency of interactions with the company.

To measure the difference in purchase amounts between these two groups, we need to factor in the covariates: age, income, historical purchase patterns, and interaction frequency. This is because there can be substantial differences in buying propensities between subgroups based on demographics such as age or income. To get a fair comparison, it is imperative to make these groups comparable so that the impact of the loyalty program can be measured accurately.

The propensity score serves as a balancing factor. It’s like a ticket that each customer gets, representing their likelihood or “propensity” to join the loyalty program based on their observed characteristics (age, income, etc.). Propensity score matching then pairs customers from the treatment group (enrolled in the program) with similar scores to those from the control group (not enrolled). By doing this, you create comparable sets of customers, making it as if they were randomly assigned to either the treatment or control group.

4.0 Process to implement causal estimation using Propensity Score Matching (PSM)

PSM is a meticulous process designed to balance the covariates between treated (enrolled in the loyalty program) and untreated (not enrolled) groups, ensuring a fair comparison that unveils the causal effect of the program on purchase behavior. Let’s delve into the steps of this approach, each crafted to enhance comparability and precision in our causal estimation journey.

4.1. Propensity Score Estimation: Unveiling Likelihoods

The journey commences with Propensity Score Estimation, employing a classification model like logistic regression to predict the probability of a customer enrolling in the loyalty program based on various covariates. The idea of the classification model is to capture the conditional probability of enrollment given the observed customer characteristics, such as demographics, historical purchase patterns, and interaction frequency.

The resulting propensity score (PS) for each customer represents the likelihood of joining the program and forms a key input for the subsequent matching step. The ability of a classification model such as logistic regression to discern these probabilities provides a nuanced understanding of enrollment likelihood, highlighting how much each covariate contributes to the decision.

The PS serves as a foundational element in the Propensity Score Matching journey, guiding the way to a balanced and unbiased comparison between treated and untreated individuals.
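Formally, the propensity score is the conditional probability of receiving the treatment given the observed covariates:

e(X) = P(T = 1 | X)

In our example, T = 1 denotes enrollment in the loyalty program and X collects the observed customer characteristics (age, income, historical purchases, interaction frequency).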

4.2. Matching Individuals: Crafting Equivalence

Once we have the propensity scores, the next step is Matching Individuals. Think of it like finding “loyalty twins” for each customer – someone from the treatment group (enrolled in the loyalty program) matched with a counterpart from the control group (not enrolled), all based on similar propensity scores. The methods for this matching process vary, from simple nearest neighbor matching to more sophisticated optimal matching. Nearest neighbor matching pairs customers with the closest propensity scores, like pairing a loyalty member with a non-member whose characteristics align closely. On the other hand, optimal matching goes a step further, finding the best pairs that minimize differences in propensity scores. These methods aim for a quasi-random pairing, making sure our comparisons are as unbiased as possible. It’s like assembling pairs of customers who, based on their propensity scores, could have been randomly assigned to either group in the loyalty program evaluation.

4.3. Stratification: Ensuring Comparable Strata

Stratification is like sorting our matched customers into groups based on their propensity scores. This helps us organize the data and makes our analysis more accurate. It’s a bit like putting customers with similar likelihoods of joining the loyalty program into different buckets. Why? Well, it minimizes the chance of mixing up customers with different characteristics. For example, if we didn’t do this, we might end up comparing loyal customers who joined the program with non-loyal customers who have very different behaviors. Stratification ensures that, within each group, customers are pretty similar in terms of their chances of joining. It’s a bit like creating smaller, more manageable sets of customers, making our analysis more precise, especially when it comes to evaluating the impact of the loyalty program.

4.4. Comparison and Causal Effect Estimation: Unveiling Impact

Once we’ve organized our customers into strata based on propensity scores, we dive into comparing the mean purchase amounts of those who joined the loyalty program (treated, Y=1) and those who didn’t (untreated, Y=0) within each stratum.

Once the strata are formed, we calculate the Average Causal Effect (ACE) for each stratum.

ACE_s = Ȳ_s1 − Ȳ_s0

This equation calculates the Average Causal Effect (ACE) for a specific stratum s. It is the difference between the mean purchase amount of treated individuals (Ȳ_s1) and untreated individuals (Ȳ_s0) within that stratum.

Next, we take the weighted average of the stratum-specific ACEs.

ACE = Σ_s w_s · ACE_s

Here, we take the sum of all stratum-specific ACEs, each multiplied by the proportion of individuals in that stratum (w_s). This gives an overall Average Causal Effect (ACE) that weights each stratum’s contribution by its size.

The ACE measures how much the loyalty program influenced purchase amounts. If ACE_s is positive, the program had a positive impact on purchases in that stratum. The weighted average accounts for stratum sizes, providing a comprehensive assessment of the program’s overall impact. A code sketch of this stratified calculation follows below.
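To make the formulas concrete, here is a minimal sketch of the stratified calculation. It assumes a pandas DataFrame named data with columns 'propensity', 'Treatment', and 'PurchaseAmount' (the same column names used in the implementation later in this post); the helper function name is our own.

import pandas as pd

def stratified_ace(data, n_strata=5):
    # Bucket individuals into strata of roughly equal size based on propensity score
    df = data.copy()
    df['stratum'] = pd.qcut(df['propensity'], q=n_strata, labels=False, duplicates='drop')
    weighted_sum, total = 0.0, 0
    for _, stratum in df.groupby('stratum'):
        treated = stratum.loc[stratum['Treatment'] == 1, 'PurchaseAmount']
        control = stratum.loc[stratum['Treatment'] == 0, 'PurchaseAmount']
        if len(treated) == 0 or len(control) == 0:
            continue  # skip strata without overlap between the two groups
        ace_s = treated.mean() - control.mean()   # stratum-specific ACE_s
        weighted_sum += ace_s * len(stratum)      # weight by stratum size
        total += len(stratum)
    return weighted_sum / total

# Usage (once propensity scores have been added to the DataFrame):
# print(stratified_ace(data))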

5.0 Implementing PSM from scratch

Let us now work out an example to demonstrate the concept of propensity scoring. We will generate some synthetic data and then develop the code to implement the estimation method using propensity score matching.

We will start by importing the necessary library files

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

Next let us synthetically generate data for this exercise.

# Generate synthetic data
np.random.seed(42)
# Covariates
n_samples = 1000
age = np.random.normal(35, 5, n_samples)
income = np.random.normal(50000, 10000, n_samples)
historical_purchase = np.random.normal(100, 20, n_samples)
# Convert age to an integer
age = age.astype(int)
# Treatment variable (loyalty program)
treatment = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
# Outcome variable (purchase amount)
purchase_amount = 50 + 10 * treatment + 5 * age + 0.5 * income + 2 * historical_purchase + np.random.normal(0, 10, n_samples)

# Create a DataFrame
data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'HistoricalPurchase': historical_purchase,
    'Treatment': treatment,
    'PurchaseAmount': purchase_amount
})
data.head()

Let us look at the above code in detail.

  1. Setting Seed: np.random.seed(42) ensures reproducibility by fixing the random seed.
  2. Generating Covariates:
    • age: Simulates ages from a normal distribution with a mean of 35 and a standard deviation of 5.
    • income: Simulates income from a normal distribution with a mean of 50000 and a standard deviation of 10000.
    • historical_purchase: Simulates historical purchase data from a normal distribution with a mean of 100 and a standard deviation of 20.
  3. Converting Age to Integer: age = age.astype(int) converts the ages to integers.
  4. Generating Treatment and Outcome:
    • treatment: Simulates a binary treatment variable (0 or 1) representing enrollment in a loyalty program. The p=[0.7, 0.3] argument sets the probability distribution for the treatment variable.
    • purchase_amount: Generates an outcome variable based on a linear combination of covariates and a random normal error term.
  5. Creating DataFrame:
    • data: Creates a pandas DataFrame with columns ‘Age,’ ‘Income,’ ‘HistoricalPurchase,’ ‘Treatment,’ and ‘PurchaseAmount’ using the generated data.

This synthetic dataset serves as a starting point for implementing propensity score matching, allowing you to compare the outcomes of treated and untreated groups while controlling for covariates.

Next let us define the variables for the propensity modelling step

# Step 1: Propensity Score Estimation (Logistic Regression)
X_covariates = data[['Age', 'Income', 'HistoricalPurchase']]
y_treatment = data['Treatment']
# Standardize covariates
scaler = StandardScaler()
X_covariates_standardized = scaler.fit_transform(X_covariates)

First we start by selecting the covariates (features) that will be used to estimate the propensity score. In this case, it includes ‘Age,’ ‘Income,’ and ‘HistoricalPurchase.’

After that we select the treatment variable. This variable represents whether an individual is enrolled in the loyalty program (1) or not (0).

After this we standardize the covariates using the fit_transform method. Standardization involves transforming the data to have a mean of 0 and a standard deviation of 1.

The standardized covariates will then be used in logistic regression to predict the probability of treatment. This propensity score is a crucial component in propensity score matching, allowing for the creation of comparable groups with similar propensities for treatment. Let us see that part next.

# Fit logistic regression model
propensity_model = LogisticRegression()
propensity_model.fit(X_covariates_standardized, y_treatment)

The logistic regression model is trained to understand the relationship between the standardized covariates and the binary treatment variable. This model learns the patterns in the covariates that are indicative of treatment assignment, providing a probability score that becomes a key element in creating balanced treatment and control groups.

In the next step, this model will be used in calculating probabilities, and in this case, predicting the probability of an individual being in the treatment group based on their covariates.

# Predict propensity scores
propensity_scores = propensity_model.predict_proba(X_covariates_standardized)[:, 1]
# Add the propensity scores to the data frame
data['propensity'] = propensity_scores
data.head()

Here we utilize the trained logistic regression model (propensity_model) to predict propensity scores for each observation in the dataset. The predict_proba method returns the predicted probabilities for both classes (0 and 1), and [:, 1] selects the probabilities for class 1 (the treatment group). After predicting the probabilities, we create a new column in the DataFrame (data) named ‘propensity’ and populate it with the predicted propensity scores.

The propensity scores represent the estimated probability of an individual being in the treatment group based on their covariates. In the context of the problem, the “treatment” refers to whether an individual has enrolled in the loyalty program of the company. A higher propensity score for an individual indicates a higher estimated probability or likelihood that this person would enroll in the loyalty program, given their observed characteristics.

As seen earlier propensity score matching aims to create comparable treatment and control groups. Adding the propensity scores to the dataset allows for subsequent steps in which individuals with similar propensity scores can be paired or matched.

# Find the treated and control groups
treated = data.loc[data['Treatment']==1]
control = data.loc[data['Treatment']==0]

# Fit the nearest neighbour model on the propensity scores of the control group (have not signed up)
control_neighbors = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(control["propensity"].values.reshape(-1, 1))

This section of the code is aimed at finding the nearest neighbors in the control group for each individual in the treatment group based on their propensity scores. Let’s break down the steps and provide intuition in the context of the problem statement:

The first two lines create subsets of the data: treated contains individuals who have received the treatment (enrolled in the loyalty program), and control contains individuals who have not.

The NearestNeighbors step fits the algorithm on the propensity scores of the control group. It essentially creates a model that can find the closest neighbour (individual) in the control group for any given propensity score. The choice of n_neighbors=1 means that, for each individual in the treatment group, the algorithm will find the single nearest neighbour in the control group based on their propensity scores.

In the context of the loyalty program, this step is pivotal for creating a balanced comparison. Each treated individual is paired with the most similar individual from the control group in terms of their likelihood of enrolling in the program (propensity score). This pairing helps to mitigate the effects of confounding variables and allows for a more accurate estimation of the causal effect of the loyalty program on purchase behavior.

We will see the step of matching people from the treated and control groups next.

# Find the distance of the control group to each member of the treated group ( Individuals who signed up)
distances, indices = control_neighbors.kneighbors(treated["propensity"].values.reshape(-1, 1))

In the above code, the goal is to find the distances and corresponding indices of the nearest neighbors in the control group for each individual in the treated group.

The kneighbors call uses the fitted Nearest Neighbors model (control_neighbors) to find the distances and indices of the nearest neighbours in the control group for each individual in the treated group.

  • distances: This variable will contain the distances between each treated individual and its nearest neighbor in the control group. The distance is a measure of how similar or close the propensity scores are.
  • indices: This variable will contain the indices of the nearest neighbors in the control group corresponding to each treated individual.

In propensity score matching, the objective is to create pairs of treated and control individuals with similar or close propensity scores. The distances and indices obtained here are crucial for assessing the quality of matching. Smaller distances and appropriate indices indicate successful pairing, contributing to a more reliable causal effect estimation.
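A common diagnostic at this point (optional, and not part of the original walkthrough) is to check covariate balance between the treated units and their matched controls using standardized mean differences; values close to zero suggest the matching worked well. A minimal sketch, reusing the treated, control, and indices objects defined above:

# Standardized mean differences of each covariate after matching
covariates = ['Age', 'Income', 'HistoricalPurchase']
matched_controls = control.iloc[indices.flatten()]   # nearest control match for each treated unit

for cov in covariates:
    diff = treated[cov].mean() - matched_controls[cov].mean()
    pooled_std = np.sqrt((treated[cov].var() + matched_controls[cov].var()) / 2)
    print(cov, 'standardized mean difference:', round(diff / pooled_std, 3))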

Let us now see the process of causal estimation using the distances and indices we just calculated.

# Calculation of the Average Treatment effect on the Treated ( ATT)
att = 0
numtreatedunits = treated.shape[0]
print('Number of treated units',numtreatedunits)
for i in range(numtreatedunits):
  treated_outcome = treated.iloc[i]["PurchaseAmount"].item()
  control_outcome = control.iloc[indices[i][0]]["PurchaseAmount"].item()
  att += treated_outcome - control_outcome
att /= numtreatedunits
print("Value of ATT",att)

The above code calculates the Average Treatment Effect on the Treated (ATT). Let us try to understand the intuition behind the above step.

The att variable accumulates the Average Treatment Effect on the Treated. The ATT is a measure of the average causal effect of the treatment (loyalty program) on the treated group; it specifically looks at how the purchase amounts of treated individuals differ from those of their matched controls.

numtreatedunits is the total number of treated individuals.

The for loop iterates through each treated individual. For each one, the difference between their observed purchase amount and the purchase amount of their matched control is calculated, and these differences are accumulated in att before being divided by the number of treated units.

A positive ATT indicates that, on average, the treated individuals had higher purchase amounts compared to their matched controls, suggesting a positive causal effect of the loyalty program on purchase amounts. A negative ATT would imply the opposite: on average, treated individuals had lower purchase amounts than their matched controls.

In our case the ATT is negative, indicating that, on average, treated individuals have lower purchase amounts than their matched controls.

What this process achieves is that, by matching each treated individual with their closest counterpart in the control group (based on propensity scores), we create a balanced comparison, reducing the impact of confounding variables. The ATT represents the average difference in outcomes between the treated and their matched controls, providing an estimate of the causal effect of the loyalty program on purchase behavior for those who enrolled.

Next, similar to the ATT, we calculate the Average Treatment Effect on the Control group (ATC). Let us look into those steps now.

# Computing ATC
treated_neighbors = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(treated["propensity"].values.reshape(-1, 1))
distances, indices = treated_neighbors.kneighbors(control["propensity"].values.reshape(-1, 1))
print(distances.shape)
# Calculating ATC from the neighbours of the control group
atc = 0
numcontrolunits = control.shape[0]
print('Number of control units',numcontrolunits)
for i in range(numcontrolunits):
  control_outcome = control.iloc[i]["PurchaseAmount"].item()
  treated_outcome = treated.iloc[indices[i][0]]["PurchaseAmount"].item()
  atc += treated_outcome - control_outcome
atc /= numcontrolunits
print("Value of ATC",atc)

Similar to the earlier step, we calculate the average treatment effect on the control group (ATC). The ATC measures the average causal effect of the treatment (loyalty program) on the control group; it evaluates how the purchase amounts of control individuals would have been affected had they received the treatment. The loop iterates through each control individual, accumulating the differences between the purchase amounts of their matched treated counterparts and their own purchase amounts.

While ATT focuses on the treated group, comparing the outcomes of treated individuals with their matched controls, ATC focuses on the control group. It assesses how the treatment would have affected the control group by comparing the outcomes of treated individuals’ matched controls with the outcomes of the controls.

Having calculated ATT and ATC, it is time to calculate the Average Treatment Effect (ATE), which can be computed as follows.

# Calculation of Average Treatment Effect
ate = (att * numtreatedunits + atc * numcontrolunits) / (numtreatedunits + numcontrolunits)
print("Value of ATE",ate)

ATE represents the weighted average of the Average Treatment Effect on the Treated (ATT) and the Average Treatment Effect on the Control (ATC). It combines the effects on the treated and control groups based on their respective sample sizes. ATE provides an overall measure of the average causal effect of the treatment (loyalty program) on the entire population, considering both the treated and control groups.

The calculated ATE value (about -140.38) indicates that, on average, the estimated effect of the treatment on purchase amounts is negative across the combined treated and control groups. A note of caution is in order, though: this synthetic dataset was generated with a true treatment effect of +10 and with treatment assigned independently of the covariates, so the estimated propensity scores are nearly uninformative and matched pairs can differ widely in income, which dominates the outcome. A large negative estimate here is therefore mostly an artifact of poor matching rather than evidence that the loyalty program is harmful; in a real analysis, such a result would be a prompt to check covariate balance and overlap before recommending a decision such as discontinuing the program.

Conclusion

Propensity score matching is a powerful technique in causal inference, particularly when dealing with observational data. By estimating the likelihood of receiving treatment (propensity score), we can create balanced groups, ensuring that treated and untreated individuals are comparable in terms of covariates. The step-by-step process involves estimating propensity scores, forming matched pairs, and calculating treatment effects. Through this method, we mitigate the impact of confounding variables, enabling more accurate assessments of causal relationships.

This method has valuable applications across multiple domains. Let us look at some of the important use cases.

Retail: Promotional Campaign Effectiveness

  • Treated Group: Customers exposed to a promotional campaign.
  • Control Group: Customers not exposed to the campaign.
  • Effects: Evaluate the impact of the promotional campaign on sales and customer engagement.
  • Covariates: Demographics, purchase history, and online engagement metrics.

In retail, understanding the effectiveness of promotional campaigns is crucial for allocating marketing resources wisely. The treated group in this scenario consists of customers who have been exposed to a specific promotional campaign, while the control group comprises customers who have not encountered the campaign. The goal is to assess the influence of the promotional efforts on sales and customer engagement. To ensure a fair comparison, covariates such as demographics, purchase history, and online engagement metrics are considered. Propensity score matching helps mitigate the impact of confounding variables, allowing for a more precise evaluation of the promotional campaign’s actual impact on retail performance.

Finance: Loan Default Prediction

  • Treated Group: Individuals who have received a specific type of loan.
  • Control Group: Individuals who have not received that particular loan.
  • Effects: Assess the impact of the loan on default rates and repayment behavior.
  • Covariates: Credit score, income, employment status.

In the financial sector, accurately predicting loan default rates is crucial for risk management. The treated group here consists of individuals who have received a specific type of loan, while the control group comprises individuals who have not been granted that particular loan. The objective is to evaluate the influence of this specific loan on default rates and overall repayment behavior. To account for potential confounding factors, covariates such as credit score, income, and employment status are considered. Propensity score matching aids in creating a balanced comparison between the treated and control groups, enabling a more accurate estimation of the causal effect of the loan on default rates within the financial domain.

Telecom: Customer Retention Strategies

  • Treated Group: Customers who have been exposed to a specific customer retention strategy.
  • Control Group: Customers who have not been exposed to any such strategy.
  • Effects: Evaluate the effectiveness of customer retention strategies on reducing churn.
  • Covariates: Contract length, usage patterns, customer complaints.

In the telecommunications industry, where customer churn is a significant concern, assessing the impact of retention strategies is vital. The treated group comprises customers who have experienced a particular retention strategy, while the control group consists of customers who have not been exposed to any such strategy. The aim is to examine the effectiveness of these retention strategies in reducing churn rates. Covariates such as contract length, usage patterns, and customer complaints are considered to control for potential confounding variables. Propensity score matching enables a more rigorous comparison between the treated and control groups, providing valuable insights into the causal effects of different retention strategies in the telecom sector.

Manufacturing: Production Process Optimization

  • Treated Group: Factories that have implemented a new and optimized production process.
  • Control Group: Factories continuing to use the old production process.
  • Effects: Evaluate the impact of the new production process on efficiency and product quality.
  • Covariates: Factory size, workforce skill levels, historical production performance.

In the manufacturing domain, continuous improvement is crucial for staying competitive. To assess the impact of a newly implemented production process, factories using the updated method constitute the treated group, while those adhering to the old process make up the control group. The effects under scrutiny involve measuring improvements in efficiency and product quality resulting from the new production process. Covariates such as factory size, workforce skill levels, and historical production performance are considered during the analysis to account for potential confounding factors. Propensity score matching aids in providing a robust causal estimation, enabling manufacturers to make data-driven decisions about process optimization.

Propensity score matching stands as a powerful tool allowing researchers and decision-makers to glean insights into the true impact of interventions. By carefully balancing treated and control groups based on covariates, we unlock the potential to discern causal relationships in observational data, overcoming confounding variables and providing a more accurate understanding of cause and effect. This method has widespread applications across diverse industries, from retail and finance to telecom, manufacturing, and marketing.


Unlocking Causal Insights: A Comprehensive Guide to Causal Estimation Methods for Machine Learning and Data Science

Let us embark on an insightful exploration into causal effects within machine learning and data science. In this blog, we will delve into causal estimation methods, emphasizing their importance in addressing complex business problems and decision-making processes. This comprehensive guide aims to shed light on crucial methodologies that unlock deeper insights in Causal AI. This blog is a continuation of our earlier blogs on causal analysis, which dealt with the following topics:

  1. Estimating the effect on loyalty program
  2. Back door, front door and instrument analysis in causal AI
  3. Causal graphs
  4. Confounders, Treatment and Control groups

We will take the discussions forward dealing with different methods to estimate the causal effect in this blog. Let us look at the contents of this article.

Structure

  1. Business context (Introduction to potential outcomes, individual treatment effect and average treatment effect)
  2. Causal estimation from scratch using Linear regression
    • Synthetic data generation for the problem statement
    • Fitting a linear model to map the causal relationship
    • Intervention and Average Treatment Effect Estimation
    • Quantifying Impact: Calculating ATE through difference in expected outcomes
    • Interpretation of results
  3. Causal estimation using Dowhy library
  4. Applications of causal estimation techniques in different domains
    • Marketing
    • Finance
    • Healthcare
    • Retail
  5. Conclusion

1.0 Business Context

We will start our discussions with a practical problem. Imagine a tailored tutoring program for students, designed to assess its influence on each student’s academic performance. This real-world business problem sets the stage for understanding the concepts of treatment and outcomes, which we covered in the previous article. Here, the tutoring program acts as the treatment, and the resulting change in academic performance becomes the outcome.

When discussing causal estimation, it is important to explore the concept of potential outcomes (PO). Potential outcomes refer to the different results a student could achieve under two conditions: one with the tutoring program (treatment) and the other without it (no treatment). In the context of our business problem, each student has the potential to demonstrate varied academic performance depending on whether they undergo the tutoring program or not. This introduces the concept of the Individual Treatment Effect (ITE), representing the specific causal impact of the treatment for an individual student. However, a significant challenge arises because we can only observe one outcome per student, making it impossible to directly measure both potential outcomes. For instance, if a student improves their academic performance after participating in the tutoring program, we can witness this positive outcome, but we cannot simultaneously observe the outcome that would have occurred if the same student had not received the tutoring.

This challenge motivates the concept of the Average Treatment Effect (ATE), which provides a broader perspective by considering the average impact across the entire group of students. Where the ITE cannot be observed directly, the ATE offers a comprehensive understanding of the overall effectiveness of the tutoring program, aggregating individual treatment effects into a holistic view of how the program influences academic performance across the diverse student population. The definitions below make this precise.
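Concretely, if Y_i(1) and Y_i(0) denote the potential outcomes for student i with and without the tutoring program, then:

ITE_i = Y_i(1) − Y_i(0)

ATE = E[Y(1) − Y(0)], estimated in a sample of N students as (1/N) Σ_i (Y_i(1) − Y_i(0))

Only one of Y_i(1) and Y_i(0) is ever observed for a given student, which is why the ATE has to be estimated rather than computed directly.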

As we navigate the intricacies of causal estimation methods, the interplay between ITE and ATE emerges as a key focal point. This exploration not only addresses the complexities of our business problem but also sets the stage for robust analyses into treatment effects within various contexts. Throughout this comprehensive guide, we’ll uncover various methods, including causal estimation using linear regression, propensity score matching, and instrumental variable analysis, to estimate the potential outcomes of treatments.

In this article we will dive deep into the estimation method using linear regression. We will be dealing with other methods in subsequent articles.

2.0 Causal estimation from scratch using Linear regression

Regression analysis is a data analysis method that aims to comprehend the connections between variables and unveil cause-and-effect patterns. For example, when investigating the impact of a new study technique on exam scores, regression analysis allows us to account for factors such as study time, sleep, and past grades. This systematic approach facilitates a quantitative assessment of the study technique’s impact while considering the influence of these additional variables.

In practical terms, you can employ regression analysis to explore how hours spent exercising affect weight loss, factoring in elements like diet and metabolism. This method proves valuable for complex questions with multiple variables. However, it is most effective when certain assumptions hold true, making it suitable for straightforward relationships without hidden factors affecting both the treatment and the outcome.

Let us start by generating synthetic data for our exercise with 100 students. The students have random study hours, sleep hours, and past grades.

2.1 Synthetic data generation for the analysis

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

n_students = 100
study_hours = np.random.randint(1, 10, n_students)
sleep_hours = np.random.randint(5, 9, n_students)
past_grades = np.random.randint(60, 100, n_students)

Next, a DataFrame is created with columns for study hours, sleep hours, past grades, and a binary column representing whether or not a student received the new study technique (0 for No, 1 for Yes).

data = pd.DataFrame({
    'StudyHours': study_hours,
    'SleepHours': sleep_hours,
    'PastGrades': past_grades,
    'NewStudyTechnique': np.random.choice([0, 1], n_students)
})

Next we generate a ‘DoExamScores’ column using a formula incorporating study hours, sleep hours, past grades, and the influence of the innovative study technique. To emulate real-world variations, we introduce random noise into the equation. This new variable serves as a key indicator, amalgamating crucial factors that contribute to exam performance.

data['DoExamScores'] = (
    70 + 2 * data['StudyHours'] + 3 * data['SleepHours'] + 5 * data['PastGrades'] +
    10 * data['NewStudyTechnique'] + np.random.normal(0, 5, n_students)
)

2.2 Fitting a linear model to map the causal relationship

We will now perform causal analysis using regression analysis on the data. The data will be split into features (X_do) and the target variable (y_do). A linear regression model is then fitted to estimate the relationship between the features and the target variable.

X_do = data[['StudyHours', 'SleepHours', 'PastGrades', 'NewStudyTechnique']]
y_do = data['DoExamScores']

model = LinearRegression()
model.fit(X_do, y_do)

2.3 Intervention and Average Treatment Effect Estimation

In the pursuit of estimating the Average Treatment Effect (ATE), we employ the concept of intervention, a pivotal step in causal analysis. Intervention, represented by the “do” operator, allows us to explore hypothetical scenarios by directly manipulating the treatment variable. This step is crucial for uncovering the causal impact of a specific treatment, irrespective of the actual treatment status of individuals.

Different from conditioning, where we analyze outcomes within specific groups based on the actual treatment received, intervention delves into the counterfactual—what would happen if everyone were exposed to the treatment, and conversely, if no one were. It provides a deeper understanding of the causal relationship, untangling the impact of the treatment from the complexities of observed data. Let us observe the code on how we achieve that

# Getting the intervention data sets
X_do_1 = pd.DataFrame.copy(X_do)
X_do_1['NewStudyTechnique'] = 1
X_do_0 = pd.DataFrame.copy(X_do)
X_do_0['NewStudyTechnique'] = 0

In our provided code, the concept of intervention comes to life with the creation of two datasets, X_do_1 and X_do_0, representing scenarios where everyone either received or did not receive the new study technique. By simulating these intervention scenarios and observing the resultant changes in predicted exam scores, we can effectively estimate the Average Treatment Effect, a critical measure in causal analysis. This process allows us to grasp the counterfactual world, answering the essential question of how the treatment influences the outcome when universally applied.

2.4 Quantifying Impact: Calculating ATE through difference in expected outcomes

Now that we have set up the counterfactual scenarios, let us answer the question we started with: whether the new study technique was indeed effective. This can be unravelled by calculating the average treatment effect, which we can obtain as follows.

ate_est = np.mean(model.predict(X_do_1) - model.predict(X_do_0))

The code for Average Treatment Effect (ATE) estimation, represented by ate_est = np.mean(model.predict(X_do_1) - model.predict(X_do_0)), calculates the mean difference in predicted outcomes between the scenarios where the new study technique is applied (X_do_1) and where it is not applied (X_do_0).

2.5 Interpretation of the results

  • The model.predict(X_do_1) represents the predicted outcomes when all individuals are exposed to the new study technique.
  • The model.predict(X_do_0) represents the predicted outcomes in a hypothetical scenario where none of the individuals are exposed to the new study technique.

The subtraction of these predicted outcomes gives the difference in outcomes between the two scenarios for each individual. Taking the mean of these differences provides the ATE, which represents the average impact of the new study technique across the entire group.

In this specific example:

  • The calculated ATE value of 9.53 suggests that, on average, the new study technique is associated with a positive change in the predicted outcomes.
  • This positive ATE indicates that, overall, the new study technique has a favorable impact on the exam scores of the students.

This interpretation aligns with the initial question of whether the new study technique is effective. The positive ATE value supports the hypothesis that the implementation of the new study technique contributes positively to the predicted exam scores, providing evidence of its effectiveness based on the model’s predictions.

Alternatively in a linear regression model, the coefficient of the treatment variable (‘NewStudyTechnique’) represents the change in the outcome variable for a one-unit change in the treatment variable. In this context, it gives us the estimated causal effect of the new study technique on exam scores, making it equivalent to the ATE.

ate_est = model.coef_[3]
print('ATE estimate:', ate_est)

We can see that both these values match.

3.0 Implementation of the example using DoWhy library

In our earlier implementation, we manually crafted a linear regression model and calculated the Average Treatment Effect (ATE) by manipulating the treatment variable. The idea of that implementation was to get an intuitive sense of what is happening under the hood. We now leverage the DoWhy library, a powerful tool designed for causal inference and analysis. DoWhy simplifies the process of creating causal models and estimating treatment effects by providing a unified interface for causal analysis.

import pandas as pd
import numpy as np
import dowhy
from dowhy import CausalModel

# Generate the data
n_students = 100
study_hours = np.random.randint(1, 10, n_students)
sleep_hours = np.random.randint(5, 9, n_students)
past_grades = np.random.randint(60, 100, n_students)

data = pd.DataFrame({
    'StudyHours': study_hours,
    'SleepHours': sleep_hours,
    'PastGrades': past_grades,
    'NewStudyTechnique': np.random.choice([0, 1], n_students)
})

data['DoExamScores'] = (
    70 + 2 * data['StudyHours'] + 3 * data['SleepHours'] + 5 * data['PastGrades'] +
    10 * data['NewStudyTechnique'] + np.random.normal(0, 5, n_students)
)

# Define the causal model
model = CausalModel(
    data=data,
    treatment='NewStudyTechnique',
    outcome='DoExamScores',
    common_causes=['StudyHours', 'SleepHours', 'PastGrades']
)

# Identify the causal effect using linear regression
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")

# Print the causal effect estimate
print(estimate.value)

In this implementation, the creation of the dataset is the same as we saw earlier. After creating the data frame, we define a causal model using the CausalModel class from DoWhy. We specify the treatment variable (‘NewStudyTechnique’), the outcome variable (‘DoExamScores’), and the potential common causes or confounders (‘StudyHours’, ‘SleepHours’, ‘PastGrades’).

The identify_effect method is called to identify the causal effect based on the specified causal model. This step involves determining the causal estimand, considering potential backdoor and front-door paths in the causal graph. To learn more about backdoor and front-door paths, you can refer to our previous blogs.

With the identified estimand, we proceed to estimate the causal effect using linear regression as the chosen method. The backdoor.linear_regression method is employed to handle the backdoor paths and calculate the effect of the treatment variable on the outcome. We can see from the output that both methods give very similar results.

The DoWhy implementation provides a more automated and standardized approach to causal analysis by handling various aspects of the causal graph, identification, and estimation, streamlining the process compared to the manual implementation from scratch.
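As an optional extension (not part of the original walkthrough), DoWhy also lets us stress-test the estimate with refutation methods, for example by replacing the treatment with a placebo; the exact report format may vary across DoWhy versions. A minimal sketch:

# Refute the estimate by replacing the treatment with a random placebo variable
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter"
)
print(refutation)  # the new effect should be close to zero if the original estimate is sound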

4.0 Applications of causal estimation techniques in different domains

Causal estimation is crucial for moving beyond correlation and understanding the true cause-and-effect relationships that drive outcomes. Linear regression, as a versatile tool, empowers professionals in marketing, finance, healthcare, retail, to name a few, to make data-driven decisions, optimize strategies, and improve overall performance. The significance of causal estimation lies in its ability to guide interventions, enhance predictive modeling, and provide a deeper understanding of the factors influencing complex systems. Let us look at potential use cases in different domains

4.1 Marketing:

Causal estimation using linear regression is instrumental in the field of marketing, where businesses aim to comprehend the effectiveness of their advertising strategies. In this context, the treatment could be the implementation of a specific advertising campaign, the outcome might be changes in consumer behavior or purchasing patterns, and potential confounders could include factors like seasonality, economic conditions, or competitor actions. By employing linear regression, marketers can isolate the impact of their advertising efforts, discerning which strategies contribute most significantly to desired outcomes. This approach enhances marketing ROI and aids in strategic decision-making.

4.2 Finance:

In finance, causal estimation with linear regression can be applied to unravel the complex relationships between economic factors and investment returns. Here, the treatment might represent changes in interest rates or economic policies, the outcome could be fluctuations in stock prices, and potential confounders may encompass market trends, geopolitical events, or industry-specific factors. By employing linear regression, financial analysts can discern the true impact of economic variables on investment performance, facilitating more informed portfolio management and risk assessment.

4.3 Healthcare:

Causal estimation using linear regression is paramount in healthcare for evaluating the effectiveness of different treatments and interventions. The treatment variable might represent a specific medical intervention, the outcome could be patient health outcomes or recovery rates, and potential confounders may include patient demographics, pre-existing health conditions, and lifestyle factors. Linear regression allows healthcare professionals to disentangle the impact of various treatments from confounding variables, aiding in evidence-based medical decision-making and personalized patient care.

4.4 Retail:

In the retail sector, understanding customer behavior is essential, and causal estimation with linear regression provides valuable insights. The treatment in this context might be changes in pricing strategies or promotional activities, the outcome could be changes in customer purchasing behavior, and potential confounders may involve external factors like economic conditions or competitor actions. By employing linear regression, retailers can identify the causal factors influencing customer decisions, allowing for more targeted and effective retail strategies.

5.0 Conclusion

The exploration of causal estimation methods unravels a powerful toolkit for understanding and deciphering complex cause-and-effect relationships in various domains. Whether applied to optimize marketing strategies, inform financial decisions, enhance healthcare interventions, or refine retail operations, causal estimation methods provide a nuanced lens through which to interpret data and guide decision-making. From the foundational concepts of potential outcomes to the intricacies of linear regression and sophisticated frameworks like DoWhy, this journey through causal estimation has underscored the importance of moving beyond mere correlation to discern true causal links.

As we embrace the significance of causal estimation in our data-driven era, the ongoing quest for refined methodologies and deeper insights continues. Whether in marketing, finance, healthcare, retail, or beyond, the application of causal estimation methods stands as a testament to the ever-evolving nature of data science and its profound impact on shaping a more informed and effective future.

Causal AI in Business: Estimating the Effect of a Member Rewards Program – Part I

Structure

  1. Introduction to causal analysis
  2. Implementation of causal analysis using DoWhy
  3. Business context – Member loyalty program
  4. Causal analysis process in a nutshell
  5. Process 1 : Defining the problem statement
  6. Process 2 : Data preparation through simulation
  7. Defining treatment and control for membership program
  8. Creation of simulated data for causal analysis
  9. Process 3 : Defining the causal graph structure
  10. Process 4 : Causal identification
    • Identification of back door, front door and instrumental variables
  11. Conclusion – Processes to be implemented in Part II of the series

Introduction

In this series, we’re introducing a paradigm that’s very important in the world of artificial intelligence, Causal AI. While traditional AI focuses on predicting outcomes based on data patterns, Causal AI delves much deeper. It’s all about understanding the ‘why’ behind those patterns, uncovering the intricate cause-and-effect relationships that govern our world. So, if you’ve ever wondered about the true reasons behind data trends, or if you’re curious about how different variables interact to create real-world impacts, then you’re in for an exciting journey.

Imagine you’re running an online streaming platform, and you’ve noticed that whenever you recommend certain songs to users, they not only listen to those songs but also stay on your platform longer, leading to increased revenue.

Traditional AI would say, “When we recommend these songs, users stay longer, and we earn more.” Causal AI takes it a step further, digging deeper to reveal the underlying cause. It might find that the songs you recommend have a specific beat or rhythm that makes users more engaged. It’s like discovering that the secret ingredient in a recipe is a unique spice and being able to explain, “When we recommend these songs, it causes users to stay longer and explore more music, which directly leads to increased revenue.”

This understanding is incredibly valuable. It means you can not only predict but also influence user behavior. You can fine-tune your recommendations to include songs with similar engaging beats, further enhancing user engagement and revenue.

Causal AI doesn’t just stop at stating an effect; it uncovers the causal chain and reveals the hidden connections between variables. It’s like being a detective in the world of data, solving mysteries and using those insights to drive success in various domains. Causal AI has immense implications for fields like healthcare, economics, social sciences, and beyond.

Causal Analysis Implementation – Introducing DoWhy

When it comes to learning causal analysis, there are different approaches: diving into the theories and building methods from the ground up, or taking advantage of existing packages and tools. The former option is rich in educational value but can be time-consuming. On the other hand, the latter, using specialized packages, can get the job done quickly but might leave you in the dark about what’s happening beneath the surface. In this series, we’re offering a hybrid approach. We’ll leverage a powerful package, DoWhy, while also taking a deep dive into the underlying theory.

Speaking of packages, there are several exceptional ones available, such as DoWhy, CausalML, CausalImpact, EconML, and more. Our journey into the world of causal analysis begins with DoWhy.

But why DoWhy? The answer is simplicity. DoWhy allows us to conduct complex causal analyses with just a few lines of code, making it a practical choice. However, don’t let this simplicity fool you. Behind each line of code lies a profound theoretical foundation, shaping every aspect of the analysis process.

In this series, we’re pulling back the curtain. We won’t just show you how to execute essential causal analysis steps; we’ll also explore the theories that underpin them. For example, consider the crucial task of identifying causal effects. DoWhy makes it look easy with a single line of code. But beneath that simplicity lies a sequence of steps, including:

  1. Building a causal graph
  2. Identifying causal effects
  3. Ensuring identifiability
  4. Estimating causal effects

Understanding each of these steps is critical to getting an intuitive understanding of what we are attempting to do. In this series, we will break down each implementation step into its component processes and then dissect each process to its core, understanding why we perform each step and how it contributes to our intuitive comprehension of causal relationships. A compact preview of how these steps map to DoWhy calls is sketched below. So join us as we bridge the gap between theory and practice, empowering you to wield the true power of Causal AI.
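As a preview (a minimal sketch; the toy data, column names T, Y, W1, W2, and effect size are placeholders of our own, not from a real dataset), the four steps above map onto just a few DoWhy calls:

import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy data: W1 and W2 confound both the treatment T and the outcome Y
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=500), rng.normal(size=500)
T = (W1 + W2 + rng.normal(size=500) > 0).astype(int)
Y = 3 * T + 2 * W1 + W2 + rng.normal(size=500)
df = pd.DataFrame({"T": T, "W1": W1, "W2": W2, "Y": Y})

# 1. Build the causal graph (here implicitly, from the named common causes)
model = CausalModel(data=df, treatment="T", outcome="Y", common_causes=["W1", "W2"])

# 2 & 3. Identify the causal effect and check that it is identifiable
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 4. Estimate the causal effect
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # should be close to the simulated effect of 3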

Problem Statement

As this is an introductory series, we will work with a simple problem statement. It is taken from the example available from Microsoft Research on estimating the effect of a Member Rewards program, from the following link. Let us first understand the business context for this problem statement.

Business Context – Members loyalty program

Image source: https://emarsys.com/learn/blog/best-retail-customer-loyalty-programs/

In today’s competitive business landscape, companies strive to attract and retain customers by offering various loyalty programs and subscription services. One common question that arises in these scenarios is whether these programs truly have a positive impact on business outcomes. To answer this question, we need to go beyond simple correlation and apply causal analysis techniques. In this blog post, we will explore how causal analysis can help us estimate the effect of a member rewards program on total sales, providing valuable insights for businesses.

Importance of the Problem:
The impact of a member rewards program on total sales is a critical aspect for businesses to evaluate. By understanding the causal relationship between offering such a program and customer spending, companies can make informed decisions about resource allocation, marketing strategies, and program optimization. Answering the question of whether the program is effective allows businesses to measure its return on investment, identify areas for improvement, and enhance customer loyalty.
The problem of estimating the effect of a member rewards program is just one example of many business problems that can be addressed using causal analysis. Consider the following scenarios:

  1. Pricing Strategy: A company wants to determine the impact of a price change on sales volume. Causal analysis can help isolate the effect of price from other factors and provide insights into optimal pricing strategies.
  2. Marketing Campaigns: Businesses invest significant resources in marketing campaigns. Causal analysis can help measure the effectiveness of different marketing channels, messages, or promotions, enabling companies to allocate their marketing budget wisely.
  3. Product Development: When introducing a new product or feature, companies need to assess its impact on customer satisfaction and sales. Causal analysis can uncover the causal relationship between the product and desired outcomes, guiding product development decisions.

Causal Analysis in a Nutshell:
Causal analysis is an analytical approach that aims to understand cause-and-effect relationships between variables. Unlike correlational analysis, which examines associations between variables, causal analysis goes a step further to determine the true causal impact of an intervention or treatment on an outcome of interest. Let us look at the steps involved in causal analysis. Many of the terms used in these steps might sound unfamiliar at this point; however, by the end of this series you will understand the nuances of all of them.

Steps in causal analysis
  1. Problem Statement Definition:
    To conduct a meaningful analysis, we define the problem statement and research question clearly. For instance, we pose the question: “What is the causal effect of the membership loyalty program on buying patterns?” Here, the focus is on understanding how the loyalty program influences customers’ buying propensities.
  2. Data Collection and Cleaning:
    In order to analyze the impact of a membership loyalty program on customer retention, we begin by collecting and cleaning relevant data. This may include customer profiles, enrollment information, and purchase history. For example, we gather data on the number of months a customer has been a member, their total spendings, and the frequency of their purchases.
  3. Causal Graph Specification:
    Constructing a causal graph allows us to visually represent the hypothesized causal relationships among variables in the membership loyalty program. Consider variables such as enrollment in the program, customer demographics, purchase behavior, and retention rates. For example, the graph may show that program enrollment affects customer satisfaction, which in turn influences their likelihood of remaining loyal to the brand. By specifying the causal graph, we aim to capture the potential causal pathways and identify confounding variables.
  4. Identification of Causal Effect:
    Identifying the causal effect involves determining the variables that need to be controlled for to estimate the impact of the loyalty program accurately. For instance, we identify potential confounders that could bias the estimation, such as age, income, or previous purchase frequency, which may affect both enrollment in the program and customer retention. The objective is to identify and account for the variables that could influence the causal relationship between the loyalty program and customer retention.
  5. Selection of Estimation Method:
    Based on the data characteristics and assumptions made in the causal graph, we select an appropriate estimation method. For example, we may choose propensity score matching, logistic regression with covariate adjustment, or survival analysis. The selected method should address confounding variables and provide reliable estimates of the causal effect. The objective is to choose a statistical technique that best suits the data and assumptions to estimate the impact accurately.
  6. Estimation of Causal Effect:
    Using the chosen estimation method, we estimate the causal effect of the membership loyalty program on customer retention. By running the statistical analysis and adjusting for confounding variables, we quantify the impact of the program. For example, we may find that customers who participate in the loyalty program have a 20% higher likelihood of remaining loyal compared to non-participants. The objective is to obtain a quantitative measure of the program’s impact on customer retention and assess its statistical significance.
  7. Sensitivity Analysis:
    To test the robustness of the estimated causal effect, we conduct sensitivity analyses. This involves exploring the sensitivity of the results to different model specifications, inclusion/exclusion of variables, or alternative estimation techniques. For example, we may vary the set of confounding variables or use different statistical models to assess the stability of the estimated effect. The objective is to evaluate the robustness of the findings and ensure that the estimated causal effect is not overly sensitive to modeling choices.

By applying causal analysis techniques, businesses can move beyond mere observations and correlations, gaining deeper insights into the drivers of business outcomes. This empowers decision-makers to make evidence-based choices, optimize strategies, and maximize their competitive advantage. So, let’s roll up our sleeves and delve into the exciting world of causal analysis. Together, we’ll uncover the story behind customer behaviors, the impact of our membership rewards program, and the strategies that will drive our business success.

Process 1 : Defining the problem statement

In causal analysis, it is crucial to define the problem statement and research question clearly to conduct a meaningful analysis. In the context of the membership loyalty program, we can define the problem statement as follows:

“What is the causal effect of the membership loyalty program on buying patterns?”

This problem statement aims to investigate the impact of the loyalty program on customers’ purchasing behaviors and understand whether the program has a positive influence on their buying propensities.

Alongside the problem statement, we can also define the counterfactual question:

“If customers had not enrolled in the membership loyalty program, how would their buying patterns differ?”

This counterfactual question explores the hypothetical scenario where customers did not participate in the program and seeks to determine the difference in their purchasing behaviors compared to the observed scenario.

Causal analysis processes, such as data preparation, causal graph specification, identification, estimation, and refutation, are aligned with providing answers to these questions. The data preparation stage involves cleaning and transforming the data to facilitate the analysis of the loyalty program’s effect on buying patterns. The specification of a causal graph helps identify the variables and their relationships relevant to the research question. The identification step determines whether the causal effect can be estimated given the available data and assumptions. Estimation techniques are then applied to estimate the actual causal effect. Finally, refutation methods are used to test the robustness and sensitivity of the causal model to different assumptions and specifications.

Next, let us now prepare the data required for our analysis.

Process 2 : Data preparation through simulation

Let’s embark on our journey of causal analysis by creating a simulated dataset for our problem statement. Imagine we have a bustling online platform with 10,000 customers, each with their unique preferences and behaviors. Some of these customers have eagerly signed up for our membership rewards program, while others have chosen not to participate. To add some excitement, let’s assign each signing-up customer a specific month when they joined the program, ranging from January to December.

import numpy as np
# Defining the users and the number of months
num_users = 10000
num_months = 12
# Random generation of sign up months
signup_months = np.random.choice(np.arange(1, num_months + 1), num_users) * np.random.randint(0, 2, size=num_users)

The code snippet provided generates a simulated dataset for the problem statement. Let’s break it down step by step:

  1. num_users = 10000: This line sets the number of users in our dataset to 10,000. Each user represents a customer on our online platform.
  2. num_months = 12: This line sets the number of months we want to simulate in our dataset. In this case, we have 12 months, representing a year.
  3. signup_months = np.random.choice(np.arange(1, num_months), num_users) * np.random.randint(0, 2, size=num_users): This line generates random signup months for the users in our dataset. Let’s break it down further:
  • np.arange(1, num_months + 1): This generates an array containing the numbers 1 to num_months (i.e., 1 to 12), representing the available signup months from January to December.
  • np.random.choice(...): This function randomly selects signup months for each user from the array generated in the previous step. The num_users parameter specifies the number of choices to make.
  • np.random.randint(0, 2, size=num_users): This generates an array of random 0s and 1s, representing whether a user has signed up (1) or not (0). The size parameter specifies the size of the array.
  • The * operator performs element-wise multiplication between the two arrays. As a result, users who have not signed up (0) get a signup month of 0, while users who have signed up (1) keep their randomly assigned signup month. A quick sanity check on the result is shown below.
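
As a quick, purely illustrative sanity check (reusing the arrays generated above), we can confirm that roughly half the users never sign up and that the signup months fall in the expected range:

# Illustrative sanity check on the simulated signup months
share_not_signed_up = np.mean(signup_months == 0)
print(f"Share of users who never signed up: {share_not_signed_up:.2f}")
print(f"Signup months present in the data: {sorted(np.unique(signup_months))}")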

Creating this dataset allows us to explore the impact of our membership rewards program on total sales and delve into the realm of causal analysis. By simulating customer behaviors, we can examine how different factors, such as program participation and signup month, influence customer spending patterns. Through this process, we can unravel the causal relationships that drive our business outcomes and gain valuable insights for decision-making.

To conduct our analysis on the membership program, we create a dataframe that contains the necessary data. The dataframe, named ‘df’, includes the following information:

  1. User ID: Each unique user is assigned a unique ID ranging from 1 to 10,000. This allows us to identify and track individual users throughout the analysis.
  2. Signup Month: This column indicates the month in which a specific user signed up for the membership program. The signup month is represented by a number between 1 and 12. A value of 0 indicates that the user has not signed up for the program. For example, a user with a signup month of 3 indicates they joined the program in March.
  3. Month: This column represents the months of the year, ranging from 1 to 12. It indicates the time period for which we have data on customer purchases and program participation.
  4. Purchase Presence: This feature indicates whether a customer made a purchase in a particular month. It is designed in a way that ensures every customer has a presence in each month of the year. For instance, a value of 1 indicates that the customer made a purchase during that month, while a value of 0 indicates no purchase.
  5. Spend: This column represents the aggregated purchase amount for each customer in a given month. The values are randomly generated based on a Poisson distribution with a mean of 500. This simulates the varying spending patterns of customers throughout the year.

The code snippet used to create the dataframe is as follows:

import pandas as pd

df = pd.DataFrame({
    'user_id': np.repeat(np.arange(num_users), num_months),
    'signup_month': np.repeat(signup_months, num_months),
    'month': np.tile(np.arange(1, num_months+1), num_users),
    'spend': np.random.poisson(500, num_users*num_months)
})

In this code, we utilize the np.repeat() function to repeat the values of user_id and signup_month for each month, ensuring that each user has an entry for every month in the dataframe. Similarly, the np.tile() function is used to replicate the sequence of months for each user. Lastly, the np.random.poisson() function generates random purchase amounts based on a Poisson distribution with a mean of 500.

The resulting dataframe df provides the necessary data to analyze the impact of the membership program on customer purchasing behavior. It allows us to examine the relationship between signup months, customer purchases, and spending patterns over time. Now that we have simulated the basic data set, let us introduce some important concepts of causal analysis i.e treatment group and control groups.

Defining treatment and control for membership program

In causal analysis, two essential concepts that play a vital role in evaluating the impact of a membership program are the treatment group and the control group. These groups are integral to understanding the causal effects of the program on customer behavior and outcomes. To know more about treatment group and control group you can refer to this link.

The treatment group in the context of our membership program refers to customers who have actively signed up and participated in the program. These individuals have voluntarily chosen to engage with the program and are exposed to the additional benefits, rewards, and experiences offered. By being part of the treatment group, these customers potentially experience a different set of circumstances compared to the control group.

On the other hand, the control group consists of customers who have not signed up for the membership program. They represent a baseline or reference group that reflects what would happen in the absence of the program. These customers do not receive the program’s benefits, rewards, or any specific treatment related to membership. The control group provides a comparison point to evaluate the impact of the program by serving as a counterfactual scenario.

By distinguishing between the treatment and control groups, we can assess the causal effect of the membership program on various outcomes, such as customer spending, engagement, loyalty, or any other relevant metrics. Understanding the distinct characteristics and behaviors of both the treatment and control groups enables us to explore questions such as:

  • How does customer spending differ between those who signed up for the program and those who did not?
  • What is the incremental impact of the program on customer retention or lifetime value?

Answering these questions requires careful analysis of the treatment and control groups, allowing us to draw valid causal inferences and make informed decisions regarding the membership program’s efficacy and potential optimization.

Having understood about treatment group and control group let us define them in our data set.

The code below manipulates the data frame to simulate the treatment effect and to generate data for the treatment and control groups.

# A customer is in the treatment group if and only if they signed up
df["treatment"] = df["signup_month"]>0
# Simulating an effect of month (monotonically decreasing--customers buy less later in the year)
df["spend"] = df["spend"] - df["month"]*10
# Simulating a simple treatment effect of 100
after_signup = (df["signup_month"] < df["month"]) & (df["treatment"])
df.loc[after_signup,"spend"] = df[after_signup]["spend"] + 100
df

Let us understand the processes which we have implemented in the code block above to simulate the data for our analysis.

  1. Defining the Treatment Group: The line df["treatment"] = df["signup_month"]>0 creates a new column named “treatment” in the data frame. A customer is assigned to the treatment group if their “signup_month” is greater than zero, indicating that they have signed up for the membership program. This step distinguishes between the customers who have actively engaged with the program (treatment group) and those who have not (control group).
  2. Simulating the Effect of Time: The line df["spend"] = df["spend"] - df["month"]*10 simulates a time-related effect on customer spending. It reduces the “spend” values based on the corresponding “month” values in a monotonically decreasing manner. This simulation implies that customers tend to spend less as the year progresses.
  3. Simulating a Treatment Effect: The line after_signup = (df["signup_month"] < df["month"]) & (df["treatment"]) creates a Boolean mask that identifies, for customers in the treatment group, the rows corresponding to months after their signup month. This condition represents the period after customers have signed up and are active members of the program.
  4. Applying the Treatment Effect: The line df.loc[after_signup,"spend"] = df[after_signup]["spend"] + 100 modifies the “spend” values for the customers who meet the condition defined by the “after_signup” mask. It adds a treatment effect of 100 to their spending, representing an increase in spending as a result of being part of the membership program.

By applying these simulations, the code generates a data frame that reflects the characteristics of the treatment and control groups. It incorporates the impact of time on customer spending and introduces a specific treatment effect for the customers who signed up and are still active members. This data frame allows for further analysis and estimation of the causal effect of the membership program on customer spending.
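
Before moving on, a naive, purely descriptive comparison of average spend between the two groups can be computed as below. It is shown only to motivate why a more careful causal adjustment is needed, and should not be read as a causal estimate.

# Naive, non-causal comparison of average monthly spend by treatment status
naive_diff = df.loc[df["treatment"], "spend"].mean() - df.loc[~df["treatment"], "spend"].mean()
print(f"Naive difference in mean monthly spend (treated - control): {naive_diff:.2f}")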

Process 3 : Defining the causal graph structure

In the context of causal analysis, constructing a causal graph is a fundamental step as it allows us to visually and systematically represent the relationships between variables within a causal system. A causal graph serves as a graphical depiction of the causal connections between variables, with directed edges indicating the direction of causation. When creating a causal graph, it is essential to have a deep understanding of the domain under study, as the structure of the graph should accurately reflect the true causal relationships that exist within that domain. By incorporating domain knowledge into the construction of the causal graph, we can ensure that the variables and their relationships depicted align with our understanding of how they influence one another in the real-world context. Ultimately, a well-designed causal graph provides a clear and organized representation of the causal structure, facilitating subsequent steps in the causal analysis process. To understand different types of causal graphs and its effect on causal analysis you can refer the following link.

In our problem statement, the outcome we are trying to understand is whether there is any change in the spending pattern after the signup month. To do this we need to get the average monthly spending before the signup month and after the signup month. The signup month is an important variable, which in our case defines the treatment. For the purpose of our analysis, we take a particular month as the signup month, let's say month = 3. The average spending from month 1 to month 2 will be treated as a new variable 'pre_spends', and the average spending from month 4 to month 12 will be 'post_spends'. Often it's the buying experience in the pre-spending period that influences the decision to join the membership program. Once a customer has signed up for the membership program, that decision will in turn influence the post-spending. So from a causal perspective, the relationship can be represented as follows.

Another important causal relationship to consider is the connection between the signup months and the treatment group assignment. The signup month variable directly influences whether a customer is categorized as part of the treatment group or not. For example, if a customer signs up in month 3, they are included in the treatment group. Additionally, the signup month has a direct relationship with post-spends, as it is associated with seasonal factors that can impact customer spending behavior. For instance, customers who sign up during holiday months might exhibit different post-spending patterns compared to those who sign up during non-holiday months. Understanding and exploring these relationships within a causal analysis framework can provide valuable insights into the effects of signup months on the treatment assignment and subsequent spending behaviour.

In the context of the above causal graph, we need to introduce a new concept called confounding. In causal analysis, a confounder refers to a variable that is associated with both the treatment variable and the outcome variable. It confounds the relationship between the treatment and the outcome because it may independently influence the outcome, creating a spurious association. Confounding variables can lead to biased estimates of the causal effect if not properly accounted for in the analysis.

In this case, the signup month can be considered a confounder. It is associated with both the treatment variable (membership program signup) and the outcome variable (post-spends). The signup month may influence the outcome through its relationship with seasonality and other factors that affect customer spending patterns.

Apart from the signup month, there may be other confounding variables that could impact both the treatment and the outcome. These confounders might include factors such as:

  1. Customer demographics: Variables like age, gender, income level, or occupation can influence both the decision to sign up for the membership program and post-spending behavior. For example, customers in higher income brackets may be more likely to join the program and spend more.
  2. Purchase history: The customers’ past purchasing behavior, including the frequency and value of purchases, can be a confounder. Customers who are frequent purchasers or who have a history of high-value purchases may exhibit different spending patterns after signing up for the membership program.
  3. Geographic location: The geographical location of customers may affect both their likelihood of signing up and their spending patterns. Customers from different regions may have different preferences, access to stores or products, or be influenced by local economic factors.

It is important to note that there may also be unobserved or unidentified confounders that could impact the relationship between the treatment and the outcome. These are confounding variables that are not included in the available dataset or are unknown. Unidentified confounders can introduce bias into the causal analysis and affect the accuracy of the estimated causal effects. Researchers must carefully consider and address both identified and unidentified confounders to obtain reliable causal estimates in their analysis. Please refer to the attached link to know more about confounders.

Although we have identified some variables that have a direct influence on the treatment and outcome variables (confounders), it is important to acknowledge that there may be other variables that impact the treatment variable but do not directly affect the outcome variable. Examples of such variables could include the type of products purchased, how long a user has held an account, or the geographic location of the customer. Since we may not have access to all these variables or they may not be directly measurable, we introduce a new variable called an instrumental variable, denoted as 'Z', in causal analysis. An instrumental variable is a variable that is causally related to the independent variable of interest (treatment) but is not causally related to the outcome variable (post-spends). It helps address potential omitted variable bias and improves the validity of causal inference by isolating the causal effect of the treatment on the outcome.

Let us now take into account all these relationships and then represent the final causal graph for this problem statement.

In this causal graph, “Z” is the instrumental variable that is causally related to the treatment (“Program Signup in month i”) but not causally related to “post_spends” (the outcome variable). To know more about instrumental variables, you can refer to the attached link.

To carry on with our causal analysis, which is about identifying whether the membership program had any effect or not, we need to transform the dataset and extract the relevant data. Let’s consider the example of a customer who signed up in the 3rd month. To do this, we modify the data frame by keeping only the customers who either did not sign up (signup_month = 0) or signed up in the 3rd month (signup_month = 3).

# Defining the signup month
i = 3
# Creating the data frame for customers who either did not sign up or signed up in month i (the 3rd month)

df_i_signupmonth = (
    df[df.signup_month.isin([0, i])]
    .groupby(["user_id", "signup_month", "treatment"])
    .apply(
        lambda x: pd.Series(
            {
                "pre_spends": x.loc[x.month < i, "spend"].mean(),
                "post_spends": x.loc[x.month > i, "spend"].mean(),
            }
        )
    )
    .reset_index()
)

The code snippet performs the following steps:

  1. We define the signup month of interest as ‘i’, which is set to 3 (representing the 3rd month).
  2. We create a new data frame, ‘df_i_signupmonth’, by selecting the rows from the original data frame where the signup month is either 0 (indicating no signup) or ‘i’ (the month of interest).
  3. Next, we group the data frame by ‘user_id’, ‘signup_month’, and ‘treatment’.
  4. Using the ‘apply’ function, we calculate the average pre-spending and post-spending for each customer who falls into the specified signup month category. The pre_spends are computed by taking the mean of ‘spend’ for the months before the signup month (i.e., where ‘month’ < i), and the post_spends are computed by taking the mean of ‘spend’ for the months after the signup month (i.e., where ‘month’ > i).
  5. Finally, we reset the index of the data frame to have a clean and organized structure.

The resulting ‘df_i_signupmonth’ data frame contains the pre_spends and post_spends for each customer who signed up in the 3rd month (or did not sign up at all). This transformed dataset allows us to analyze the average treatment effect for customers in the specified signup month, providing insights into how the membership program influenced their spending patterns.

The data frame after these transformations will look as below.

The created data frame, ‘df_i_signupmonth’, is structured in a way that enables further causal analysis. It contains the average spends before and after the signup month for each customer, allowing us to examine the causal effects of the membership program.
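
As a quick, illustrative peek at this frame (using the df_i_signupmonth data frame built above), we can compare the average pre and post spends across the two groups:

# Average pre and post spends by treatment status (descriptive only)
print(df_i_signupmonth.groupby("treatment")[["pre_spends", "post_spends"]].mean())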

Implementing the above processes in DoWhy

Now that we have seen some of the initial processes, let us implement them using the DoWhy package. The package can be pip installed as follows.

!pip install dowhy
!pip install EconML

Once the package is installed, we import the package and initialise the causal graph as follows

import dowhy
# Define the causal graph object 
causal_graph = """digraph {
treatment[label="Program Signup in month i"];
pre_spends;
post_spends;
Z->treatment;
pre_spends -> treatment;
treatment->post_spends;
signup_month->post_spends;
signup_month->treatment;
}"""

This code defines a causal graph using the DOT language. It models the relationships between treatment (program signup in month i), pre_spends, post_spends, Z (an instrumental variable), and signup_month, with arrows indicating the direction of causation. The graph encodes that Z and pre_spends affect treatment, treatment affects post_spends, and signup_month influences both treatment and post_spends. DoWhy uses this graph to identify and estimate the causal effect of the treatment on the outcome.

Now that we have defined the causal graph and implemented it in DoWhy, let us analyse the causal paths and identify all the components required for causal analysis.

Process 4 : Causal Identification

Causal identification refers to the process of determining whether a causal relationship can be inferred between a treatment or intervention and an outcome variable of interest. It involves establishing a causal link by identifying the effect of a specific cause (treatment) on a specific effect (outcome) while considering and controlling for potential confounding factors.

Identifying the causal effect is a critical step in conducting a causal analysis of a membership program. The objective is to determine the impact of the program on customer behavior, specifically their spending patterns. To identify the causal effect, we need to carefully consider the relationship between the treatment variable (program signup in a particular month) and the outcome variable of interest (average post-spends after signup), while accounting for potential confounding factors.

For example, let’s consider a scenario where the membership program offers exclusive discounts and promotions to customers who sign up during holiday months. If we only compare the average post-spends of customers who signed up during holiday months with those who did not sign up, we may mistakenly attribute any differences solely to the program. However, the observed differences could be influenced by other factors such as increased holiday spending trends.

To address this, we need to identify and control for potential confounding variables that could affect both the treatment and the outcome variable. In this case, factors like customers’ purchasing habits prior to signing up for the program, their demographics, or seasonal effects on spending patterns could act as confounders. By including these variables in our analysis and accounting for their influence, we can more accurately isolate the causal effect of the membership program on post-spends.

Causal identification involves the following.

  • Identification of backdoor variables: Backdoor variables are variables that are common causes of both the treatment and the outcome. They create confounding bias and can distort the estimated treatment effect if not properly accounted for. Identifying and adjusting for backdoor variables is crucial in establishing a causal relationship between the treatment and outcome. To know more about back door variables and its effects, please refer to this link.

In our context we have only one back door path as shown below.

  • Variable: Signup month
  • Relationship characteristics: Back door variable; fork-type connection; conditioning on this variable is required to adjust for its effect on the outcome

The only back door path we have is treatment ← signup month → post_spends. This causal path involves a fork type of association, and therefore, to get an unbiased estimate of the effect of the treatment on the outcome, we need to condition on the variable signup month. We will address this shortly.

For our membership program problem statement, we do not have any front door path, as seen in the figure above.

  • Identification of instrumental variables: Instrumental variables are variables that are causally related to the treatment but not directly related to the outcome. They are used in situations where there are unobserved confounders or measurement errors. Instrumental variables help to identify the causal effect of the treatment by exploiting their relationship with the treatment variable. You can learn more about instrument variables from this post

We have two instrument variables for our problem statement as seen in the figure below.

By carefully considering back door variables, instrumental variables and front door variables, causal path analysis allows for a more comprehensive understanding of the causal relationships between the treatment and outcome variables. It helps to mitigate confounding biases, identify indirect effects, and provide insights into the underlying causal mechanisms at play.

Next, we will implement these in DoWhy.

# Initialising the causal model
model = dowhy.CausalModel(data=df_i_signupmonth,
                     graph=causal_graph.replace("\n", " "),
                     treatment="treatment",
                     outcome="post_spends")
# Visualising the model
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

In this code snippet using DoWhy:

  1. Data Specification: data=df_i_signupmonth specifies the dataset (df_i_signupmonth) containing relevant variables.
  2. Graph Specification: graph=causal_graph.replace("\n", " ") provides the causal graph in DOT format, specifying the relationships between variables.
  3. CausalModel Initialization: dowhy.CausalModel(...) initializes a causal model.
  • treatment="treatment" indicates the treatment variable.
  • outcome="post_spends" specifies the outcome variable.

This sets up the structure for causal analysis in DoWhy. The causal graph is crucial for identifying and estimating causal effects, and the treatment and outcome variables are essential components for conducting causal inference.
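
Although the detailed treatment of identification and estimation follows later, a minimal sketch of how the identification step is typically invoked on this model in DoWhy is shown below; the printed estimand will depend on the graph and data defined above.

# Identify the causal effect implied by the graph (minimal sketch)
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)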

Having identified the causal relationship, the next task is to get into the important area of causal estimation. We will discuss causal estimation and the rest of the process in the second part of the series. Before we sign off on this first part, let us summarise our learning so far.

Wrapping up

In the initial segment of our exploration into causal analysis, we embarked on a journey unraveling the intricacies of causality. Beginning with a comprehensive introduction to the essence of causation, we delved into the significance of the problem statement and the crucial role of the DoWhy package in this domain. The process kicked off with the generation of synthetic data, followed by the critical task of specifying a causal graph. The focal point then shifted to the identification of causal effects, laying the foundation for the second part of our series.

In the second part of the series, we will navigate through the nuanced terrain of selecting estimation methods, a pivotal step in the causal analysis workflow. The journey continues with the estimation of causal effects, exploring the intricacies of propensity score estimation and culminating in the calculation of the Average Treatment Effect (ATE). The DoWhy package will serve as our guiding compass, enabling a practical implementation of these concepts. As we unravel these layers, a holistic understanding of causal analysis will emerge, bridging the theoretical with the practical to empower a deeper grasp of this fascinating field.

Causal Analysis Foundation Series : Back door, Front door and Instrumentals in Causal Analysis

Structure

  1. Introduction
  2. Back door variables
  3. Examples of effects on blocking different nodes in back door paths
  4. Importance of identifying all back door paths
  5. Back door adjustments
  6. Code example for back door adjustments
  7. Front door adjustments
  8. Front door variable adjustment
  9. Code example for front door adjustment
  10. Instrumental variables
  11. Code examples for conditioning on instrumental variables
  12. Conclusion

Introduction

Building upon our previous exploration of causal graphs, where we delved into the intricate dance of associations through chains, forks, and colliders, as well as the powerful concept of d-separation, we now venture into even deeper territory in the realm of causal analysis. In this part of our journey, we introduce you to three pivotal players in the world of causal inference: back door variables, front door variables, and instrumental variables. These concepts emerge as essential tools when we encounter scenarios where straightforward causal understanding becomes elusive due to the complexities of confounding, hidden variables, and unobserved factors. As we navigate through this blog, we’ll uncover how back door, front door, and instrumental variables act as guiding beacons, illuminating otherwise obscure causal pathways and enabling us to tease out the true effects of our interventions. So, let’s embark on this enlightening expedition to unveil the intricacies of these variables and how they illuminate the path in our quest for causal understanding.

Back door variables:

The back door path is a key concept in causal analysis that provides a systematic way to address the challenge of confounding variables and uncover true causal relationships. Intuitively, it’s like finding a controlled path that helps us navigate around confounders to understand the direct causal impact of one variable on another.

One effective way to obtain the true causal effect without confounders is through interventions, where we actively manipulate variables to observe their effects.

However, interventions are not always feasible due to ethical, practical, or logistical constraints. For instance, studying the effects of smoking on lung health by exposing individuals to smoking is unethical and harmful or investigating the impact of a natural disaster on a community’s mental health might require an intervention that triggers such an event, which is clearly not feasible.

This is where the significance of observational data comes into play. Observational data consists of information collected without direct intervention, reflecting real-world scenarios. As we discussed in an earlier blog on causal graphs, blocking paths is crucial to ensure that the effect we observe is indeed a direct causal relationship and not distorted by confounding. Back door paths are essentially the paths that lead to confounding, and identifying and blocking them becomes essential when working with observational data. Identifying all the back door paths and appropriately conditioning on them is a powerful strategy to achieve this. By conditioning on these paths, we effectively neutralize the impact of confounding variables and obtain causal estimates akin to those obtained through interventions. This technique empowers us to make informed decisions and draw meaningful conclusions from observational data.

In the upcoming sections, we will delve deeper into how we can identify back door paths using illustrative examples. This understanding will equip us with valuable tools to distinguish true causal relationships from spurious associations in complex datasets.

Let us look at the causal graph given above. From the graph we can see two back door paths from the treatment variable to the outcome. These back door paths have the effect of introducing spurious correlations between the treatment and the outcome because of confounding. As we saw in the blog on causal graphs and confounding, to eliminate these spurious correlations and to get the real causal effect between the treatment and outcome variables, we need to block the flow of association through these paths. One way to achieve this is by conditioning on the different variables in the back door paths. We saw this phenomenon in the blog on causal graphs. Let us consider the graph represented below, which shows the conditioning on M2 and SEC.

We will get the same effect, i.e. blocking the back door paths, if we condition on M1 & SEC and/or M3 & SEC.

So by blocking the back door paths, we are eliminating all spurious correlations between the treatment and outcome giving us the real causal effects.

To illustrate the concept further, consider the following examples:

Example 1: Student Performance and Study Hours

Suppose we’re exploring the causal link between sleep duration (X) and student performance (Y) while considering study hours (Z) as a potential confounder. It’s conceivable that students who sleep longer might have more time and energy to dedicate to studying, and this extra study time might enhance their performance.

Direct Path: If we disregard confounding, we might hastily conclude that more sleep directly leads to better academic performance. However, the confounding factor of study hours could be influencing both sleep duration and performance.

Back Door Path: In this context, the back door path would involve controlling for study hours (Z). By comparing students who put in similar study hours, we can better uncover the genuine causal relationship between sleep duration and performance. This approach helps us mitigate the influence of the potential confounder, study hours, thus providing a clearer picture of the true impact of sleep duration on academic performance.
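
A minimal simulation sketch (with made-up coefficients, purely to illustrate conditioning on the back door variable) is shown below: study hours drive both sleep duration and performance, and we compare the naive slope of performance on sleep with the average slope obtained within narrow strata of study hours.

import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical data-generating process (assumed numbers, for illustration only)
study_hours = rng.normal(5, 1.5, n)                          # confounder Z
sleep_hours = 8 - 0.3 * study_hours + rng.normal(0, 1, n)    # treatment X
performance = 2.0 * study_hours + 0.5 * sleep_hours + rng.normal(0, 1, n)  # outcome Y; true effect of sleep = 0.5

# Naive association: slope of performance on sleep, ignoring study hours
naive_slope = np.polyfit(sleep_hours, performance, 1)[0]

# Conditioning on the back door variable: compare within narrow bands of study hours
edges = np.quantile(study_hours, np.linspace(0, 1, 11)[1:-1])
bins = np.digitize(study_hours, edges)
within_slopes = [
    np.polyfit(sleep_hours[bins == b], performance[bins == b], 1)[0]
    for b in np.unique(bins) if np.sum(bins == b) > 50
]

print(f"Naive slope (confounded): {naive_slope:.2f}")
print(f"Average within-stratum slope: {np.mean(within_slopes):.2f}")

With these assumed coefficients, the naive slope is pulled far away from the truth by the confounder, while the average within-stratum slope lands close to the true coefficient of 0.5 used in the simulation.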

Example 2: Yoga, Sleep quality and Mental well-being

Let’s consider a scenario where we’re investigating the causal relationship between practicing yoga (X) and mental well-being (Y), while considering sleep quality (Z) as a potential confounder. People who regularly practice yoga might experience improved sleep quality, and sleep quality could also influence their mental well-being.

Direct Path: Without accounting for sleep quality, we might hastily conclude that practicing yoga directly enhances mental well-being. Yet, the presence of sleep quality as a confounder might impact both the decision to practice yoga and mental well-being.

Back Door Path: In this context, the back door path could involve controlling for sleep quality (Z). By comparing individuals with similar sleep quality, we can effectively isolate the genuine causal connection between practicing yoga and mental well-being. This approach helps us mitigate the confounding effect of sleep quality, allowing us to assess the true impact of practicing yoga on mental well-being without undue influence.

Importance of Identifying All Back Door Paths:

Identifying and adjusting for all relevant back door paths is crucial in causal analysis because it helps us avoid drawing incorrect conclusions due to confounding. By considering these paths, we can isolate the true causal relationship between variables and gain accurate insights into how changes in one variable affect another. Failure to consider back door paths can lead to biased estimates and incorrect causal interpretations.

Back door path identification is a strategic approach to untangle the complexity of causal relationships in the presence of confounding variables. By adjusting for these paths, we ensure that our conclusions about causality are accurate and reliable, even in the face of hidden influences. Let us now look at what adjusting for back door paths means

Back door adjustment

In the world of understanding cause and effect, things can get tricky when hidden factors, known as confounding variables, cloud our view. This is where the back door criterion comes in. It’s like a guide that helps us untangle the mess and find out what truly causes what. In our earlier example, let’s say we want to know if practicing yoga (X) really improves mental well-being (Y). But there’s a sneaky factor: sleep quality (Z). People who practice yoga tend to sleep well. Now, instead of doing interventions, which can be tricky, we can adjust by considering sleep quality. This is back door adjustment. It’s like simulating interventions without actually intervening. By doing this, we uncover the real connection between yoga and health while avoiding the confusion caused by that sneaky factor. So, the back door criterion is like a secret tool that helps us figure out real cause-and-effect relationships, even when things are a bit tangled up.

Mathematically, the back door adjustment brings in “conditioning” to counteract the impact of confounders. For three variables – let’s call them teaching method (X), student performance (Y), and socioeconomic background (Z) – the equation reads:

P(Y | do(X)) = Σ_Z P(Y | X, Z) · P(Z)

Now, let’s put this into a relatable classroom scenario. We’re looking at two teaching methods (Method A and B) used by students of different socioeconomic statuses (Low, Medium, and High SES) to see how they perform. Our dataset captures all this complexity, noting down the teaching method, socioeconomic status, and performance for each student.

Now, let’s use some simple code to apply this equation. In this example, we’ll focus on the back door path when we zero in on students with Low SES. By doing this, we make sure we account for the influence of socioeconomic status, and this helps us uncover the genuine impact of the teaching method on student performance.

import numpy as np

# Given data for the student example
# Each row is [teaching method, socioeconomic status, performance]
data = np.array([
    [0, 0, 1],  # Method A, Low SES, Pass
    [1, 1, 0],  # Method B, High SES, Fail
    [0, 0, 1],  # Method A, Low SES, Pass
    [1, 2, 1],  # Method B, Medium SES, Pass
    [1, 2, 0],  # Method B, Medium SES, Fail
    [0, 1, 0],  # Method A, High SES, Fail
    [1, 1, 1],  # Method B, High SES, Pass
    [0, 2, 1]   # Method A, Medium SES, Pass
])

# Extract columns for easier calculations
X = data[:, 0]  # Teaching method
Y = data[:, 2]  # Student performance
Z = data[:, 1]  # Socioeconomic status

# Apply the back door adjustment P(Y | do(X)) = sum_z P(Y | X, Z=z) * P(Z=z) for each value of X
p_Y_do_X_values = []
for x_val in np.unique(X):
    print('value of X', x_val)
    p_Y_do_X = 0

    for w in np.unique(Z):
        print('value of Z', w)
        # P(Y = 1 | X = x_val, Z = w): pass rate within this stratum (empty strata contribute 0)
        p_Y_given_x_w = np.nan_to_num(
            np.sum(Y[(X == x_val) & (Z == w)]) / np.sum((X == x_val) & (Z == w))
        )
        # P(Z = w): marginal probability of this confounder value
        p_w = np.mean(Z == w)
        # Weight the stratum-specific pass rate by P(Z = w) and accumulate
        p_Y_do_X += p_Y_given_x_w * p_w
    p_Y_do_X_values.append(p_Y_do_X)

# Print the results for all values of X
for i, x_val in enumerate(np.unique(X)):
    print(f"P(Y | do(X={x_val})) = {p_Y_do_X_values[i]:.4f}")

The purpose of this code is to get an intuitive understanding of what is going on in the equation. Here we have considered a toy data set. Let’s break down the code and its high-level explanation in the context of the back door equation:

Intuitive Explanation:
This code calculates the adjusted probability P(Y | do(X)) for various values of X using the back door adjustment. It iterates over different values of X and considers the influence of the confounder Z to account for potential bias. The back door approach allows us to adjust for confounding variables and isolate the causal effect of X on Y.
Imagine we’re in a classroom where we’re trying to understand how different teaching methods (X) affect student performance (Y), while taking into account their socioeconomic status (Z), which might act as a confounder. The goal is to adjust for the confounding effect of Z on the relationship between X and Y.

  1. Looping Over Values of X: The code iterates over each unique value of X (the different teaching methods). For each value of X, it calculates the adjusted probability P(Y | do(X)) using the back door adjustment equation.
  2. Conditioning on Z (Confounder): Inside the loop, the code further iterates over each unique value of Z (the different socioeconomic statuses). It considers each combination of X and Z so that the influence of Z can be controlled for.
  3. Stratum-Specific Probabilities: For each combination of X and Z, the code calculates P(Y | X = x_val, Z = w), which represents the likelihood of a student passing given a certain teaching method and socioeconomic status.
  4. Weighting by the Confounder Distribution: The code then calculates P(Z = w), the marginal proportion of students with that socioeconomic status in the data. This is the weight attached to each stratum in the adjustment formula.
  5. Back Door Adjustment: It multiplies each stratum-specific pass probability P(Y | X = x_val, Z = w) by the corresponding weight P(Z = w) and sums these products across all values of Z. This adjustment accounts for the confounding effect of socioeconomic status and isolates the causal effect of the teaching method X on student performance Y.
  6. Output and Interpretation: The adjusted probabilities for each value of X are printed. These represent the estimated causal effects of the teaching methods on student performance after accounting for socioeconomic status.

In this example we have demonstrated how the back door path is operationalized through iteration, conditioning, and weighted adjustment. This process helps us uncover causal relationships by accounting for confounding variables and yielding insights free from biases introduced by hidden factors.

Having seen back door paths and back door adjustments, let us look at what front door adjustment is.

Front Door Adjustments

In the previous section we discussed how back door paths and adjustments can help us unveil causal relationships. However, what if we’re faced with a scenario where we don’t have access to these back door variables? Imagine we have data generated based on a certain causal graph, but we can’t observe one of the crucial back door variables. This means our usual back door adjustments won’t work. In such cases, the front door criterion offers a solution. It’s a bit like taking a different path to get the same result.

Imagine a scenario where we want to understand how a cause, X (let’s say a new teaching method), affects an effect, Y (student performance). It’s plausible that other variables we may not observe directly, like a student’s motivation or background, could influence both the teaching method chosen (X) and student performance (Y). These unobserved variables are essentially the confounders that can muddy our attempts to establish a causal link between X and Y.

Now, suppose we don’t have access to data on these unobserved confounders. In such cases, establishing causality between X and Y becomes challenging because we can’t directly block the confounding paths through these unobserved variables. However, we observe another variable, M (let’s say the student’s engagement level or effort), that is influenced by X and influences Y.

The variable M is called as a mediator and it plays a significant role in getting the causal effect. It acts as an intermediary step that helps us understand the causal effect of X on Y, even when the direct path from X to Y is blocked by unobserved or unobservable variables (the confounders).

Front Door Variable Adjustment:

Front door variable adjustment is a method used to uncover the causal effect of X on Y through the mediator M, even when the direct path between X and Y is blocked by unobserved or unobservable variables. Let us look at the equation of the front door adjustment

P(y | do(X = x)) = Σ_z P(z | x) Σ_x' P(y | x', z) P(x')

Now, let’s dissect this equation step by step:

  1. Intervention on X: X = x denotes the value X is set to when we intervene. For example, if we want the probability that Y equals 1 when we intervene and set X to 1, we compute P(Y = 1 | do(X = 1)).
  2. Values of X before Intervention: X = x' denotes the values X could take before any intervention; these appear in the inner summation. For instance:
  • P(z = 0 | x = 1) * [Σ_x' P(Y = 1 | x', z = 0) P(x')] considers the probability of Z being 0 given X is 1, and then sums over all possible values of X before the intervention. It gives the contribution of the stratum z = 0 to the overall probability of Y = 1 under the intervention.
  • Similarly, P(z = 1 | x = 1) * [Σ_x' P(Y = 1 | x', z = 1) P(x')] accounts for the probability of Z being 1 given X is 1, again summing over all possible values of X before the intervention.

Now that we have seen the equations, let us break down the equation and provide an interpretation of the process using code. We will give a step by step explanation and relate it to the front door adjustment criteria. To illustrate the front door adjustment criteria, we’ll use a causal scenario involving three variables: X (the treatment), Z (the mediator), and Y (the outcome).

Step 1: Simulating Data

import numpy as np

# Number of simulated samples (value chosen arbitrarily for this illustration)
n_samples = 1000

# Simulate the distribution of X before intervention
p_x_before = np.array([0.4, 0.6])  # P(X=x')
x_before = np.random.choice([0, 1], size=n_samples, p=p_x_before)

Here, we start by simulating the distribution of X before any intervention. In this example, we have two possible values for X: 0 and 1. This represents different treatment options.

Step 2: Simulate the distribution of Z given X

# Simulate the distribution of Z given X
p_z_given_x = np.array([[0.7, 0.3], [0.2, 0.8]])  # P(Z=z | X=x')
z_given_x = np.zeros(n_samples)
for i in range(n_samples):
    z_given_x[i] = np.random.choice([0, 1], size=1, p=p_z_given_x[x_before[i]])

Now, we simulate the distribution of Z given X. This represents how the mediator (Z) behaves depending on the treatment (X). We have conditional probabilities here. For example, when X=0, the probabilities for Z=0 and Z=1 are 0.7 and 0.3, respectively.

Step 3: Simulate the distribution of Y given X, Z

# Simulate the distribution of Y given X, Z
p_y_given_xz = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.1, 0.9]]])

Here, we simulate the distribution of Y given X and Z. This represents how the outcome (Y) is influenced by both the treatment (X) and the mediator (Z). The nested array structure corresponds to conditional probabilities. For example, when X=0 and Z=0, the probability of Y=1 is 0.9.

Step 4: Normalize the Conditional Probability Distributions

# Normalize the conditional probability distributions
p_y_given_xz_norm = p_y_given_xz / np.sum(p_y_given_xz, axis=2, keepdims=True)

To ensure that our conditional probability distributions sum to 1, we normalize them. This normalization is crucial for accurate calculations.

Step 5: Calculate P(y=1 | do(X=x))

# Calculate P(y=1 | do(X=x)) using the front door formula:
# P(y | do(x)) = sum_z P(z | x) * sum_x' P(y | x', z) * P(x')
x_intervention = 1
p_y_do_x = 0

for z_val in [0, 1]:
    p_z_given_x_intervention = p_z_given_x[x_intervention, z_val]  # P(Z=z | X=x_intervention)

    # Inner sum over the pre-intervention values x' of X
    inner_sum = 0
    for x_prime in [0, 1]:
        p_y_given_xz_prime = p_y_given_xz_norm[x_prime, z_val, 1]  # P(Y=1 | X=x', Z=z)
        inner_sum += p_y_given_xz_prime * p_x_before[x_prime]      # weighted by P(X=x')

    p_y_do_x += p_z_given_x_intervention * inner_sum

This is where we apply the front door adjustment criteria. We want to find the probability of Y=1 when we intervene and set X to a specific value (x_intervention). For each value of Z (the mediator), we compute how likely that value of the mediator is under the intervention, and then take the inner sum over the pre-intervention values of X, weighting P(Y=1 | X=x', Z=z) by P(X=x'). We use the normalized conditional probability distributions to ensure accurate calculations.
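
To complete the sketch, we can print the adjusted probability; the exact number depends on the probability tables assumed above.

# Report the front-door-adjusted probability
print(f"P(Y=1 | do(X={x_intervention})) = {p_y_do_x:.4f}")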

Practical Use Case: Imagine X represents a medical treatment, Z is a biological mediator, and Y is a patient’s health outcome. The code simulates how we can estimate the effect of the treatment (X) on health (Y) while accounting for the mediator (Z). By adjusting for Z, we isolate the direct effect of the treatment, which is essential in medical research to understand treatment efficacy.

In summary, this code demonstrates the front door adjustment criteria by simulating a causal scenario and calculating the causal effect of a treatment variable while considering a mediator variable. It ensures that we account for the mediator’s influence on the outcome when estimating causal effects.

Instrumental Variables :

Imagine you want to know if eating more vegetables makes people healthier. But here’s the issue: people who eat more veggies might also do more exercise, and that exercise could be the real reason they’re healthier. So, there’s a sneaky variable, exercise, that you can’t really measure, but it’s affecting both eating veggies and being healthy.

Instrumental variables are like detective tools in such situations. They’re special variables that you can see and control. You use them to help figure out the real cause.

Here’s how they work: you find an instrumental variable that is linked to eating veggies (X) but has no direct effect on health (Y) other than through X. Because it shifts vegetable consumption without being tangled up with the hidden exercise factor, you can use it to uncover whether eating more veggies really makes people healthier.

So, instrumental variables help cut through the confusion caused by hidden factors, letting you discover the true cause-and-effect relationship between things, even when you can’t see all the pieces of the puzzle.

Selection of instrumental variables: Instrumental variables are carefully chosen based on specific criteria. They should be related to the outcome variable (Y) only through their influence on the treatment; they must not have any direct causal effect on the outcome (Y) apart from that influence. This criterion ensures that the IV only affects the outcome through its impact on the treatment, making it a valid instrument.

Causal Effect on the Treatment Variable: One critical characteristic of an instrumental variable is that it has a causal effect on the treatment variable (X). This means that changes in the IV lead to changes in X. This causal link is essential because it enables the IV to affect the treatment variable in a way that can be exploited to estimate causal effects.

Random Assignment: An ideal instrumental variable is one that is randomly assigned or naturally occurring in a way that is akin to randomization. This random assignment ensures that the IV is not influenced by any other confounding variables. In other words, the IV should not be correlated with any unobserved or omitted variables that might affect the outcome. Random assignment strengthens the validity of the IV approach and helps in isolating the causal relationship between X and Y.

Let us now write some code to understand the role of instrumental variables in causal analysis. We will take an example of education level and wages to explain the concept of instrumental variables. This example is heavily inspired by the example demonstrated in the attached link.

Let us first define the problem.

In many real-world scenarios, there is a common observation: individuals with higher levels of education tend to earn higher wages. This positive correlation between education and wages suggests that, on average, people with more years of schooling tend to have higher incomes. Let us generate some synthetic data to elucidate this fact and ask some questions on this phenomenon.

import numpy as np
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(0)

# Generate synthetic data
n_samples = 1000
education_years = np.random.randint(8, 20, n_samples)  # Simulate years of education (e.g., between 8 and 20)
# Simulate hourly wages as a linear function of education with added noise
hourly_wages = 15 * education_years + 10 + np.random.normal(0, 10, n_samples)

# Create a scatter plot to visualize the correlation
plt.figure(figsize=(8, 6))
plt.scatter(education_years, hourly_wages, alpha=0.5)
plt.title('Synthetic Data: Education vs. Hourly Wages')
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wages')
plt.grid(True)

# Add a trendline to visualize the positive correlation
z = np.polyfit(education_years, hourly_wages, 1)
p = np.poly1d(z)
plt.plot(education_years, p(education_years), "r--")

plt.show()

Now, the question arises: Does this positive correlation imply a causal relationship? In other words, does getting more education directly cause individuals to earn higher wages? Keep in mind that correlation alone does not prove causation. There are limitations to inferring causality from the correlation between years of education and hourly wages. Here’s an analysis of this specific scenario and our choice of actions:

  1. Causality Requires Controlled Experiments: To establish a causal relationship between an independent variable (in this case, years of education) and a dependent variable (hourly wages), it’s ideal to conduct controlled experiments. In a controlled experiment, researchers can randomly assign individuals to different levels of education and then observe the impact on wages. Random assignment helps ensure that any observed differences in wages are likely due to the manipulation of the independent variable (education) rather than other confounding factors.
  2. Non-Random Assignment in the Real World: However, in the real world, we often cannot conduct controlled experiments, especially when it comes to factors like education. Education is typically a choice made by individuals based on various factors such as personal interests, socioeconomic background, career goals, and more. It’s not randomly assigned like it would be in an experiment.
  3. Endogeneity and Confounding: When individuals choose their level of education based on their personal circumstances and goals, this introduces the possibility of endogeneity. Endogeneity means that there may be unobserved or omitted variables that influence both education choices and wages. These unobserved variables are the confounders which we have dealt with in depth in other examples. For example, personal motivation, intelligence, or family background could be confounding factors that affect both education choices and earning potential.
  4. Correlation vs. Causation: The positive correlation between education and wages simply means that, on average, individuals with more education tend to earn higher wages. However, this correlation does not, by itself, prove causation. It could be that individuals who are naturally more motivated or come from wealthier families are both more likely to pursue higher education and to earn higher wages later in life.

Let us assume that the unobserved confounder in our case is the motivation level of individuals. We will again generate some data to study its relationship with both the treatment (education) and the outcome (higher wages).

import numpy as np
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(0)

# Generate synthetic data for education years and hourly wages (same as before)
n_samples = 1000
education_years = np.random.randint(8, 20, n_samples)  # Simulate years of education (e.g., between 8 and 20)
hourly_wages = 15 * education_years + 10 + np.random.normal(0, 10, n_samples)

# Generate synthetic data for motivation levels
motivation_levels = 0.5 * education_years + 0.3 * hourly_wages + np.random.normal(0, 5, n_samples)

# Create scatter plots to visualize correlations
plt.figure(figsize=(12, 5))

# Education vs. Motivation Levels
plt.subplot(1, 2, 1)
plt.scatter(education_years, motivation_levels, alpha=0.5)
plt.title('Education vs. Motivation Levels')
plt.xlabel('Years of Education')
plt.ylabel('Motivation Levels')
plt.grid(True)

# Hourly Wages vs. Motivation Levels
plt.subplot(1, 2, 2)
plt.scatter(hourly_wages, motivation_levels, alpha=0.5)
plt.title('Hourly Wages vs. Motivation Levels')
plt.xlabel('Hourly Wages')
plt.ylabel('Motivation Levels')
plt.grid(True)

plt.tight_layout()
plt.show()

In this synthetic data, we’ve introduced motivation levels as a function of both education years and hourly wages, with some added random noise. As seen in the scatter plots, there is a positive correlation between motivation levels and both education years and hourly wages. This example is only to illustrate how an unobserved confounder (motivation levels) can influence both the treatment variable (education) and the outcome (hourly wages), potentially leading to a spurious correlation between education and wages.

However, we have a problem here. We know that motivation level is a confounder that influences both the treatment and the outcome, yet we cannot observe or collect data for a variable like motivation. So, without data on the confounder, how do we adjust for it to recover the true causal effect of the treatment on the outcome? This is where instrumental variables (IVs) come into play.

This is how we will go about this task:

  1. Identify an Instrument: An instrumental variable (IV) is a variable that has a causal relationship with the treatment variable (education) but does not have a direct causal effect on the outcome (hourly wages). The IV should only affect the outcome through its impact on the treatment.
  2. Random Assignment: The IV should be assigned randomly or at least exogenously (not influenced by other confounding variables). This ensures that the IV is not subject to the same confounding factors as education.

In practice, finding a suitable instrumental variable can be challenging. It needs to meet the criteria of relevance (correlated with education) and exogeneity (not directly affecting wages). Once an appropriate IV is identified, it can help us estimate the causal effect of education on wages while addressing the issue of unobserved confounders like motivation.

A good instrumental variable (IV) in the case of education and higher wages would be a variable that affects a person’s level of education but does not directly affect their wages. This IV should be correlated with education and exogenous, meaning it’s not influenced by the same factors that affect wages. One potential IV in this context is “compulsory schooling laws.”

Here’s how it works:

  1. Compulsory Schooling Laws: In some regions or countries, there are laws that require children to stay in school until a certain age or grade level. These laws effectively determine the minimum level of education a person must receive.
  2. Relevance: Compulsory schooling laws are relevant because they impact the educational attainment of individuals. People are compelled to stay in school up to a certain level, which influences their years of education.
  3. Exogeneity: The decision to implement compulsory schooling laws is typically made by governments and is not influenced by individual motivations, abilities, or job opportunities. Therefore, it can be considered exogenous in the context of individual wage determinants.
  4. Causal Effect: Compulsory schooling laws have a causal effect on education because they mandate a minimum level of schooling. However, they do not directly affect wages. The relationship between these laws and wages is mediated through the educational attainment they enforce.

Let us now generate more synthetic data to illustrate the desired relationships between the instrumental variable and the other variables.

import numpy as np
import matplotlib.pyplot as plt

# Number of samples
n_samples = 1000

# Generate compulsory schooling levels (IV)
compulsory_schooling = np.random.randint(10, 16, size=n_samples)  # Random values between 10 and 15

# Generate education levels influenced by compulsory schooling
education = compulsory_schooling + np.random.normal(0, 2, size=n_samples)  # Adding some noise

# Generate wages influenced by education
wages = 1000 + 500 * education + np.random.normal(0, 1000, size=n_samples)  # Adding noise

# Generate random motivation levels
motivation = np.random.normal(50, 20, size=n_samples)  # Random normal distribution

# Create subplots
fig, axs = plt.subplots(1, 3, figsize=(18, 5))

# Plot education vs. compulsory schooling
axs[0].plot(compulsory_schooling, education, 'bo', alpha=0.5)
axs[0].set_xlabel("Compulsory Schooling Years")
axs[0].set_ylabel("Education Years")
axs[0].set_title("Positive Correlation: Education vs. Compulsory Schooling")

# Plot wages vs. compulsory schooling
axs[1].plot(compulsory_schooling, wages, 'go', alpha=0.5)
axs[1].set_xlabel("Compulsory Schooling Years")
axs[1].set_ylabel("Hourly Wages")
axs[1].set_title("Positive Correlation: Wages vs. Compulsory Schooling")

# Plot motivation vs. compulsory schooling
axs[2].plot(compulsory_schooling, motivation, 'ro', alpha=0.5)
axs[2].set_xlabel("Compulsory Schooling Years")
axs[2].set_ylabel("Motivation Level")
axs[2].set_title("No Correlation: Motivation vs. Compulsory Schooling")

plt.tight_layout()
plt.show()

From the plots, we can clearly observe the positive correlation between the instrument variable (compulsory schooling) and both the treatment (education) and the outcome (hourly wages). Furthermore, we know that the instrument (compulsory schooling) has a causal effect on the treatment (education) because it determines the minimum number of years an individual must attend school. This satisfies one of the key conditions for a valid instrument.

Another critical condition for a valid instrument is that it should not be correlated with confounding variables that might distort the causal relationship. In our case, we can see from the plots that there is no apparent correlation between compulsory schooling and motivation levels, which is a good sign. This lack of correlation suggests that our instrument is not influenced by unobserved factors that could confound our analysis.

In summary, these plots demonstrate the significance of using an instrument variable in causal analysis. They illustrate that a well-chosen instrument can exhibit a clear causal effect on the treatment while also having a positive correlation with both the treatment and the outcome. Moreover, the absence of correlation between the instrument and potential confounders enhances the credibility of our causal inferences. Instrumental variables provide a powerful way to disentangle causation from correlation in situations where direct causality is challenging to establish, making them a valuable tool in causal analysis.
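Once an instrument satisfying these conditions is available, one common way to estimate the causal effect is two-stage least squares (2SLS): first regress the treatment on the instrument, then regress the outcome on the fitted values of the treatment. The snippet below is a minimal sketch of that idea with plain numpy, re-using the synthetic compulsory_schooling, education and wages variables from the previous code block (regenerated here with a fixed seed so the snippet runs on its own).

import numpy as np

np.random.seed(0)
n_samples = 1000

# Re-create the synthetic data from the previous snippet
compulsory_schooling = np.random.randint(10, 16, size=n_samples)
education = compulsory_schooling + np.random.normal(0, 2, size=n_samples)
wages = 1000 + 500 * education + np.random.normal(0, 1000, size=n_samples)

# Stage 1: regress the treatment (education) on the instrument (compulsory schooling)
X1 = np.column_stack([np.ones(n_samples), compulsory_schooling])
beta1, *_ = np.linalg.lstsq(X1, education, rcond=None)
education_hat = X1 @ beta1  # fitted education, driven only by the instrument

# Stage 2: regress the outcome (wages) on the fitted values of the treatment
X2 = np.column_stack([np.ones(n_samples), education_hat])
beta2, *_ = np.linalg.lstsq(X2, wages, rcond=None)

print(f"2SLS estimate of the effect of one extra year of education: {beta2[1]:.1f}")

Because wages were simulated with a true coefficient of 500 per year of education, the second-stage estimate should land close to that value, even though the confounder (motivation) never enters the estimation.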

Wrapping up

In this exploration of causal inference, we’ve delved into three essential concepts: the back door criterion, the front door criterion, and instrumental variables. The back door criterion serves as our guide in identifying the necessary variables for adjusting causal relationships and helps us uncover the true causal effects in observational data. Meanwhile, the front door criterion reveals an intriguing pathway where we can understand the impact of an exposure on an outcome, even when the direct path is blocked by unobserved variables. Finally, instrumental variables emerge as powerful tools to address endogeneity and establish causality by leveraging variables with known causal effects and correlations with both the treatment and the outcome.

These methods are not merely theoretical constructs; they play a pivotal role in extracting causal insights from real-world data. By diligently following the criteria of these causal inference techniques, researchers can navigate the complex landscape of observational data, disentangle confounding relationships, and ultimately gain a deeper understanding of cause and effect. These tools empower us to make more informed decisions, whether in public health, economics, or any field where causal relationships matter. As we continue to refine these methods and develop new ones, the horizon of causal inference expands, offering us ever more precise insights into the complex web of causality that shapes our world.

Causal Analysis Foundation Series : Causal Graphs – A powerful tool for Causal Analysis

Image Source : Google images, medium.com

Structure

  1. Introduction to causal graphs
  2. Representation of causality in causal graphs
  3. Components of Causal Graphs
  4. Different modes for flow of association between nodes
  5. Code example for conditioning in association through chains
  6. Association through forks
  7. Intuitive examples of association through forks
  8. Association through colliders
  9. Intuitive examples of association through colliders
  10. D-separation
  11. Examples of D-separation with different types of conditioning
  12. Conclusion

Introduction

Causal graphs, also known as causal diagrams or directed acyclic graphs (DAGs), are powerful tools used in causal analysis to represent and understand the causal relationships between variables. They provide a visual representation of how different variables interact and influence each other, shedding light on the complex web of cause and effect. Causal graphs play a fundamental role in causal analysis by helping us identify and estimate the causal effects of interventions or treatments on specific outcomes of interest. By visualizing the causal relationships, we can better understand the underlying mechanisms and make more informed decisions based on solid evidence.

For example, consider a study investigating the impact of a new marketing campaign on product sales. A causal graph can depict the relationships between variables such as the marketing campaign (treatment), sales volume (outcome), and other variables like product quality, customer awareness, demand, customer satisfaction and loyalty. By analyzing the causal graph, we can identify the key factors influencing sales and estimate the true causal effect of the marketing campaign on sales, while accounting for other variables that may also impact sales.

Causal graph for marketing campaign

In another example, let’s consider a healthcare study examining the effectiveness of a new drug in treating a specific disease. A causal graph can represent the relationships between variables such as the drug treatment (treatment), patient health outcomes (outcome), and potential confounding variables such as age, comorbidities, and lifestyle factors.

Causal graph for healthcare study

In both of these examples, causal graphs provide a clear visual representation of the causal relationships, helping us identify potential confounding factors and understand the pathways through which the treatment influences the outcome. This understanding is crucial for accurate estimation of causal effects and drawing meaningful conclusions from the data.

In the following sections, we will delve deeper into the construction and interpretation of causal graphs, exploring their significance in causal analysis and understanding how they guide the estimation of causal effects.

So, let’s embark on this journey into the world of causal graphs and unravel the secrets of cause and effect.

Representation of causality in causal graphs

In a causal graph, variables are represented as nodes, and the causal relationships between them are depicted using directed edges or arrows. The directionality of the arrows signifies the direction of causation. For example, if variable A influences variable B, there will be an arrow pointing from A to B in the graph.

The purpose of causal graphs is to provide a clear and structured representation of the causal relationships among variables. They help to identify the direct and indirect paths through which variables influence each other, facilitating the estimation of causal effects. Causal graphs also aid in identifying potential confounding variables, which are factors associated with both the treatment and the outcome, and guide the selection of appropriate statistical techniques to address confounding bias.

In our marketing campaign graph, we can understand the causal relationship between the marketing campaign and sales. From the graph we can see that marketing campaigns help in creating customer awareness, which in turn creates demand, ultimately leading to higher sales. There are also other causal paths which lead to higher sales from the node 'product quality'. All these representations of the interrelationships between variables have to be derived based on our domain understanding. A causal graph, in short, can be considered a visual representation of domain understanding, expressed through a web of causes and effects.

Causal variables influencing customer demand

Examining causal graphs helps in identifying key variables which influence the final outcome. We can also understand the role of different confounding variables enabling estimation of the true causal effect of any variable with the desired outcome.
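To make this concrete, the short sketch below encodes a version of the marketing-campaign graph as a directed acyclic graph using the networkx library. The node names and edges are assumptions read off the description above, so the exact arrows in the accompanying figure may differ.

import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical edge list based on the marketing-campaign description above
edges = [
    ("Marketing Campaign", "Customer Awareness"),
    ("Customer Awareness", "Demand"),
    ("Demand", "Sales"),
    ("Product Quality", "Demand"),
    ("Product Quality", "Customer Satisfaction"),
    ("Customer Satisfaction", "Loyalty"),
    ("Loyalty", "Sales"),
]
G = nx.DiGraph(edges)

# A causal graph must be a directed acyclic graph (DAG)
assert nx.is_directed_acyclic_graph(G)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue",
        node_size=2500, font_size=8, arrowsize=15)
plt.title("Sketch of the marketing campaign causal graph")
plt.show()

Writing the graph down programmatically also lets us run basic sanity checks, such as verifying that it contains no cycles, before using it for causal reasoning.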

Components of Causal Graphs

Nodes are the fundamental components of a causal graph. They represent the variables involved in the analysis, such as the treatment, outcome, confounders, and mediators. Each node corresponds to a specific variable and plays a distinct role in the causal relationships depicted by the graph.

Causal graph for study influencing impact of new teaching methods

For example, in a study evaluating the impact of a new teaching method on student performance, the nodes could include the teaching method (treatment), student performance (outcome), student’s socioeconomic status (confounder), and student’s motivation (mediator). Each of these nodes represents a distinct variable that is of interest in the analysis. Let us now understand each of these components as shown in the above graph.

The different roles that variables can play in a causal graph are:

  • Treatment: A treatment is a variable that is manipulated in the causal graph. For example, in a study of the effect of teaching methods on student performance, the treatment would be the different methods adopted by teachers in the school.
  • Outcome: The outcome is the variable that we are interested in measuring. In our example the performance of the student is the outcome which manifests the effect of the treatment.
  • Confounder: A confounder is a variable that is associated with both the treatment and the outcome. Confounders can bias the results of a study if they are not taken into account. In our example, Socio-Economic Status (SES) can be a confounder, meaning it is associated with both the new teaching method and student performance: schools with higher SES are more likely to implement new teaching methods, and they are also more likely to have students who perform well. This makes it difficult to say whether the new teaching method is actually causing the improvement in student performance, or whether the improvement simply reflects the students' higher SES backgrounds.
  • Mediator: A mediator is a variable that is affected by the treatment and then affects the outcome. Mediators can help to explain how the treatment works. In our case student motivation can be a mediator. The new teaching methods might motivate the students which in turn will influence their performance.
  • Shared effects: In addition to the variables mentioned above, there is another type called a shared effect. A variable that is caused by both the treatment and the outcome is called a common effect or a shared effect. In our example, the reputation of the school is influenced by both the teaching method (treatment) and student performance (outcome), making it a common effect of the treatment and the outcome in this scenario.

Causal graphs depict the logical flow of causation, showing the connections between variables and how they interact. By following the arrows in the graph, we can trace the pathways of influence and identify the direct and indirect effects between variables. Having explored the roles of variables and the flow of causation, the next section will delve into the various ways in which the association flows between nodes in a causal graph. Understanding these different patterns of association is crucial for uncovering the underlying causal mechanisms and estimating the causal effects accurately.

Different modes for flow of association between nodes

In a causal graph, the association between nodes can flow in different ways, indicating various types of relationships and causal pathways. Here are some common ways in which association flows between nodes:

  • Chain: A chain refers to a sequence of nodes connected by directed edges, where the association flows from one node to the next. In our example of the study on new teaching methods and student performance, we can see that there is a chain association between teaching methods, student motivation and student performance.
Association through chains

You can identify other chain associations within the causal network.

In a causal graph, a chain refers to a sequence of variables where each variable is influenced by the preceding variable in the chain. In a causal chain A -> B -> C, where A is the direct cause of B and B is the direct cause of C, changes in A can propagate through the chain and affect the subsequent variables. When A undergoes a change, it leads to a corresponding change in B. This change in B, in turn, influences C. The magnitude and direction of the changes in B and C are determined by the strength and nature of the causal relationships between the variables.

In our case, introducing better teaching methods (a change in A) is likely to improve the motivation level of the students (a change in B), which can subsequently lead to better exam performance (a change in C).

To understand the unique and direct causal effect of each variable on the outcome variable, we need to introduce independence between variables which are not directly connected in the causal chain. Independence can be brought about by conditioning on the mediator variable. For example, by conditioning on the motivation levels of students we can make teaching methods and student performance independent. Let us explore the intuition behind this with a detailed example.

In the study comparing teaching methods A and B and their impact on student performance, the causal graph indicates that teaching methods influence student motivation, which subsequently affects their performance. To understand the effect of conditioning on motivation levels, let’s consider two groups: students with high motivation and those with low motivation.

Conditioning on motivation level means that we selectively analyze the performance of students within each motivation group. By doing so, we focus on the relationship between teaching methods and performance within each group, while considering motivation as a mediator variable.

In this context, conditioning on motivation level means that if we know a student is highly motivated, their performance is likely to be higher, regardless of the teaching method they receive. By conditioning on motivation, the information about the teaching method becomes redundant in predicting performance. Thus, conditioning blocks the influence of the teaching method on performance, as the impact is already explained by the student’s motivation level.

In simpler terms, conditioning on motivation level makes teaching methods and student performance independent. By taking into account the motivation level of students, we remove the direct influence of teaching methods on performance. Instead, performance is explained by the level of motivation, making the teaching method redundant in predicting performance within each motivation group.

Let us implement a simple exercise to demonstrate the effect of conditioning based on the above use case.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# Generate synthetic data consistent with the chain described above:
# teaching method -> motivation -> performance
n_students = 100
teaching_methods = np.random.choice(['Method A', 'Method B'], size=n_students)
# Motivation depends on the teaching method (Method B motivates a larger share of students)
motivation_prob = np.where(teaching_methods == 'Method B', 0.7, 0.4)
motivation_levels = np.where(np.random.rand(n_students) < motivation_prob, 'High', 'Low')
# Performance depends on motivation (the mediator), not directly on the teaching method
performance = np.random.normal(loc=60, scale=10, size=n_students) + np.where(motivation_levels == 'High', 10, 0)

# Create a DataFrame
data = pd.DataFrame({'Teaching Method': teaching_methods, 'Motivation Level': motivation_levels, 'Performance': performance})

# Summary measures before conditioning
mean_performance_before = data.groupby('Teaching Method')['Performance'].mean()
std_dev_performance_before = data.groupby('Teaching Method')['Performance'].std()

# Summary measures after conditioning on high motivation
high_motivation_data = data[data['Motivation Level'] == 'High']
mean_performance_after = high_motivation_data.groupby('Teaching Method')['Performance'].mean()
std_dev_performance_after = high_motivation_data.groupby('Teaching Method')['Performance'].std()

# Display summary measures
print("Summary Measures Before Conditioning:")
print(f"Mean Performance:\n{mean_performance_before}\n")
print(f"Standard Deviation of Performance:\n{std_dev_performance_before}\n")

print("Summary Measures After Conditioning on High Motivation:")
print(f"Mean Performance:\n{mean_performance_after}\n")
print(f"Standard Deviation of Performance:\n{std_dev_performance_after}\n")

Let’s break down the exercise and connect it with student performance:

Intuition Behind the Exercise:

  1. Teaching Methods Influence Motivation: In the causal graph, teaching methods are shown to influence student motivation, which then affects student performance. This suggests that the choice of teaching method may indirectly impact performance through its influence on motivation.
  2. Conditioning on Motivation: Conditioning on a variable means that we are controlling for its effects. In this case, conditioning on motivation involves considering student performance while holding motivation levels constant. It allows us to isolate the impact of teaching methods on performance within specific motivation groups.

Connection with Student Performance:

  1. Before Conditioning:
  • Mean Performance: Calculating the mean performance before conditioning gives us an overall average performance for each teaching method, considering all students irrespective of their motivation levels.
  • Standard Deviation: The standard deviation provides a measure of the variability or spread of performance scores.
  2. After Conditioning on High Motivation:
  • Mean Performance: Calculating the mean performance after conditioning on high motivation provides an average performance for each teaching method but only among highly motivated students. This isolates the effect of teaching methods on performance within the high motivation group.
  • Standard Deviation: The standard deviation after conditioning gives us a measure of how performance varies within the high motivation group for each teaching method.

Connection with Student Performance (Intuitive Perspective):

  • Before Conditioning: If we only look at mean performance without considering motivation levels, it might be influenced by both the direct impact of teaching methods and the indirect impact through motivation. The standard deviation gives us a sense of the overall variability in performance.
  • After Conditioning on High Motivation: By focusing on highly motivated students, we aim to remove the potential influence of motivation on performance. If, after conditioning, the mean performance still differs between teaching methods, it suggests a direct impact of teaching methods on performance within the highly motivated subgroup.

In simpler terms, the exercise helps us understand how teaching methods might independently affect the performance of highly motivated students when we control for the variable of motivation. It provides insights into whether the observed performance differences are primarily due to teaching methods or if motivation plays a significant role.

Having learned about the flow of association through chains, let us understand the type of influence through forks.

  • Fork: A fork occurs when two or more nodes have a common cause, and their association is explained by this shared cause. It represents an indirect effect, where the association flows through a common parent node.
Association through forks

In a causal graph, a fork structure represents a scenario where a common cause influences two variables independently, leading to a causal association between these variables. Let’s consider the example of student performance we dealt earlier, where socio-economic status (SES) acts as the common cause for both teaching methods (Method A and Method B) and student performance.

In this context, socio-economic status can affect a student’s access to educational resources, parental support, and overall learning environment. Now, let’s examine the flow of association from socio-economic status to teaching methods and student performance:

  1. Socio-economic status (SES) → Teaching Methods:
    Students from different socio-economic backgrounds may have varying opportunities to access certain teaching methods. For instance, schools in low-income neighborhoods might have limited resources, leading to a preference for a particular teaching method (Method A or Method B). Thus, SES indirectly influences the selection of teaching methods.
  2. Socio-economic status (SES) → Student Performance:
    Students with higher socio-economic status often have access to better educational resources and a supportive learning environment. This can positively impact their academic performance. Conversely, students from lower socio-economic backgrounds may face challenges due to limited resources and support, potentially affecting their performance.

In the fork structure, socio-economic status acts as the common cause, influencing both teaching methods and student performance. The presence of socio-economic status in the causal graph can create an association between teaching methods and student performance, but it's crucial to remember that this association is not causal. Instead, it's the result of both variables being influenced by a common underlying factor, SES.

In the fork structure of a causal graph, there is a flow of influence from Teaching Methods (TM) to Student Performance (SP) through the common cause, Socio-Economic Status (SES). This means that the choice of teaching methods can indirectly impact student performance via its influence on SES. Let’s explore some intuitive examples to understand this flow of influence:

  1. Example 1 – High-Performing Schools:
    Consider a high-performing school in an affluent neighborhood where students have access to well-funded schools, highly qualified teachers, and various educational resources. In this scenario, the school might prefer a specific teaching method (Method A) that has shown promising results in previous years. As a result, students from this school, influenced by the selected teaching method and other resources, tend to perform well in exams and achieve higher academic outcomes. In this case, the causal influence from Method A to Student Performance passes through the higher SES associated with the school’s location.
  2. Example 2 – Low-Performing Schools:
    Now, let’s consider a low-performing school situated in a socioeconomically disadvantaged area. Due to budget constraints and limited resources, the school may opt for a different teaching method (Method B) that fits within their financial constraints. Students in this school may face challenges related to their SES, such as lack of access to tutoring or extracurricular activities. Consequently, the academic performance of these students might be adversely affected by these socio-economic factors. Here, the causal influence from Method B to Student Performance is mediated through the lower SES associated with the school’s location.

In both examples, we see how the choice of teaching methods (Method A and Method B) indirectly impacts student performance by acting through the common cause, Socio-Economic Status (SES). The flow of influence through SES highlights the importance of considering and accounting for confounding factors like SES in causal analysis to accurately estimate the true causal effect of teaching methods on student performance.

Conditioning on Socio-Economic Status (SES) can make Teaching Methods (TM) and Student Performance (SP) independent in the fork structure of the causal graph. By controlling for the common cause (SES), we can eliminate the indirect influence that Teaching Methods might have on Student Performance through SES, thereby making them independent. Let’s explore some intuitive examples to understand this concept:

Example 1 – SES as an Equalizer:
Consider a scenario where two schools, School X and School Y, have different teaching methods (Method A and Method B) but similar SES levels. School X follows Method A, while School Y adopts Method B. However, both schools are located in the same socio-economic neighborhood with comparable resources and facilities. When we compare the academic performance of students from both schools, we find that there is no significant difference between the groups. In this case, conditioning on SES equalizes the influence of the teaching methods on student performance, and the variables become independent.

Example 2 – SES as a Confounder:
Now, let’s consider another situation where two schools, School P and School Q, implement different teaching methods (Method C and Method D). However, School P is located in a higher SES area with ample educational resources, while School Q is situated in a lower SES neighborhood with limited access to educational facilities. When we compare the student performance between the schools, we may observe that School P’s students perform better on average than School Q’s students. In this case, SES acts as a confounding factor, as it influences both the choice of teaching methods and student performance. By conditioning on SES, we can control for this confounding effect and analyze the true causal impact of teaching methods on student performance independently.

Conditioning on Socio-Economic Status helps disentangle the causal relationship between Teaching Methods and Student Performance by removing the influence of the common cause. This process ensures that any observed associations between the variables are not biased by the influence of SES, leading to a more accurate understanding of the causal effects of different teaching methods on student performance.
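A small simulation makes this numeric. In the sketch below (with made-up effect sizes), SES is the common cause: high-SES schools are more likely to adopt Method A and their students score higher, while the teaching method itself has no effect on performance.

import numpy as np
import pandas as pd

np.random.seed(1)
n = 5000

# Fork: SES is a common cause of both the teaching method and performance
ses = np.random.choice(['Low', 'High'], size=n)

# High-SES schools are more likely to adopt Method A; the method itself has no effect
p_method_a = np.where(ses == 'High', 0.8, 0.2)
method = np.where(np.random.rand(n) < p_method_a, 'Method A', 'Method B')

# Performance depends only on SES, not on the teaching method
performance = np.random.normal(60, 10, n) + np.where(ses == 'High', 15, 0)

df = pd.DataFrame({'SES': ses, 'Method': method, 'Performance': performance})

# Before conditioning: Method A appears better purely because of the fork through SES
print(df.groupby('Method')['Performance'].mean())

# After conditioning on SES: the gap between methods disappears within each stratum
print(df.groupby(['SES', 'Method'])['Performance'].mean())

In the raw comparison Method A looks markedly better, yet within each SES stratum the two methods perform about the same, which is exactly what conditioning on the fork is meant to reveal.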

Having understood the chain and fork structures, let us now look at the next important causal association structure: the collider.

  • Collider: A collider is a node in a causal graph where two or more causal paths converge. It is a point of intersection that can induce a spurious association between its parents when conditioning on the collider.

Colliders are quite different from chains and forks. Chains represent a direct causal relationship between variables, where one variable causally influences the other in a sequential manner. In contrast, forks indicate two variables that are causally influenced by a common cause but are not directly related to each other. Colliders, on the other hand, represent a scenario where two variables have a common effect, and conditioning on the collider variable can lead to spurious associations between its parent variables. Unlike chains and forks, a collider blocks the path between its parent variables and keeps them independent as long as it is not conditioned on. Let us understand the collider in detail with our example.

In the context of our example, a collider, also known as an immoral node, plays a significant role in the causal graph. The teaching methods variable acts as a collider because it has incoming edges from both Teacher Quality and Socio-Economic Status (SES). Colliders represent a specific scenario where two variables have a common effect, and conditioning on the collider can lead to spurious associations between its parent variables.

In this case, Teacher Quality and SES are likely to influence the choice of teaching methods. For instance, in schools with high-quality teachers and resources (higher SES), there might be a preference for adopting innovative teaching methods. Conversely, schools with limited resources and lower teacher quality might stick to traditional teaching methods. However, the teaching methods themselves do not directly influence either Teacher Quality or SES.

When we condition on the teaching methods variable, it opens up a pathway of influence from Teacher Quality to SES. For instance, if we compare schools with different teaching methods, we might observe that schools with innovative teaching methods have higher teacher quality and higher SES. However, this association is spurious, as the true relationship between Teacher Quality and SES is not influenced by teaching methods but rather by their shared influence on teaching methods.

Colliders, or immoralities, in a causal graph can create spurious associations when conditioning on the collider variable. This highlights the importance of careful analysis and understanding of causal graphs to avoid making erroneous inferences based on observed associations that are not truly causal.

Conditioning on any descendant of a collider can also lead to the flow of association between the parents of the colliders. In our education example, where Teaching methods is the collider variable, conditioning on its descendant, Student motivation, can open a path between the parents of the collider, which are Teacher quality and Socioeconomic status (SES).

Let’s illustrate this scenario further:

  1. Teacher Quality (Parent of Teaching methods): This variable influences the choice of Teaching methods in a school. Schools with high-quality teachers may adopt different teaching methods compared to schools with lower-quality teachers.
  2. Socioeconomic Status (Parent of Teaching methods): SES of students’ families can also influence the selection of Teaching methods in schools. Schools in areas with higher SES might have access to more resources and funding, allowing them to implement different teaching methods.
  3. Teaching Methods (Collider): As previously discussed, this variable is influenced by both Teacher quality and SES. Conditioning on its descendant, Student motivation, may happen unintentionally when we study the relationship between Teaching methods and Student performance.
  4. Student Motivation (Descendant of Teaching methods): Teaching methods can affect students’ motivation, as certain teaching approaches may inspire and engage students more effectively.

Now, if we condition on Student motivation (by comparing the performance of students who are highly motivated and those with low motivation), we may inadvertently open a path between Teacher quality and Socioeconomic status. This means that any observed association between Teacher quality and Student performance may not be solely due to the direct causal relationship but may also be influenced by the path through Teaching methods and Student motivation.

This illustrates the importance of understanding the causal structure in a causal graph and the potential for bias when conditioning on descendants of colliders. Proper handling of colliders and their descendants is essential to draw accurate causal inferences and avoid spurious associations in causal analysis.
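The same point can be demonstrated numerically. In the sketch below (using a hypothetical selection rule), teacher quality and SES are generated independently, and a school adopts an innovative method when their sum is high; conditioning on that collider induces an association between the two parents that does not exist in the full population.

import numpy as np
import pandas as pd

np.random.seed(2)
n = 5000

# Teacher quality and SES are generated independently of each other
teacher_quality = np.random.normal(0, 1, n)
ses = np.random.normal(0, 1, n)

# Collider: innovative methods are adopted when quality plus SES is high
innovative = (teacher_quality + ses + np.random.normal(0, 0.5, n)) > 0

df = pd.DataFrame({'TeacherQuality': teacher_quality, 'SES': ses,
                   'Innovative': innovative})

# Unconditional correlation between the collider's parents is close to zero
print("All schools:", df['TeacherQuality'].corr(df['SES']).round(3))

# Conditioning on the collider induces a spurious negative correlation
subset = df[df['Innovative']]
print("Innovative schools only:", subset['TeacherQuality'].corr(subset['SES']).round(3))

Among the schools that adopted the innovative method, a low-SES school must typically have had a high-quality teacher to cross the adoption threshold, so the two parents become negatively correlated once we condition on the collider.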

The different types of association, the flow of association, and the ways of blocking that flow are all important for understanding a key concept in causal graphs called d-separation. Let us look at that now.

D-Separation

D-separation (the "d" stands for "directional") is a concept used in causal inference and graphical models to determine whether two sets of variables in a causal graph are independent or conditionally independent given a third set of variables. It helps to identify which paths between variables are blocked or unblocked based on the observed data and the causal relationships represented in the graph.

In D-separation, we use the concept of “blocking” to determine whether paths between variables are open (unblocked) or closed (blocked) for information flow. A path between two variables is considered open if there is no blocking variable in the path, and it allows information flow between those variables. On the other hand, a path is considered blocked if there is at least one blocking variable in the path, which prevents information flow between the variables.

D-separation involves three types of relationships between sets of variables in a causal graph:

  1. Active Paths: An active path is an open path that allows information flow between variables. It is not blocked by any of the conditioning variables.
  2. Inactive Paths: An inactive path is a closed path that does not allow information flow between variables. It is blocked by at least one of the conditioning variables.
  3. Potentially Active Paths: A potentially active path can be either active or inactive depending on the specific values of the conditioning variables.

By identifying the active and inactive paths between sets of variables, we can determine whether those variables are independent or conditionally independent given the conditioning variables. D-separation is a powerful tool in causal inference as it allows us to identify which associations between variables are likely due to causal relationships and which are spurious.

Let us now look at various graphs and find D-separation between the variables.

Examples of D-separation

To understand the concepts of blocked paths, active paths and d-separation, let us go back to the example we have been discussing so far, as depicted below.

Let us now consider the flow of association between Teacher Quality (TQ) and Student Performance (SP). As seen from the figure below, there are three paths for the flow of association from TQ to SP:

Path1 : TQ >> TM >> SR >> SP

Path2 : TQ >> TM >> SM >> SP

Path3 : TQ >> TM >> SEC >> SP

Let us now look at how these paths can be blocked. As we saw in the previous examples on chains, forks and colliders, paths can be blocked by conditioning on these variables. Let us start with conditioning on the empty set, i.e., no conditioning at all.

Case 1 : Conditioning on empty set ( No conditioning )

In this case as seen in the figure above, the only active path is through the chain type association in Path 2.

Path 1 and Path 3 would be blocked because SR and TM are colliders on those paths, and as we have seen, when there is no conditioning on a collider (or its descendants), the path through it is blocked.

Having seen the example let us look at the definition of d-separation. D-separation is a concept used in causal analysis and graphical models to determine the independence or conditional independence between two sets of nodes, X and Y. It involves examining all the possible paths connecting any node in set X to any node in set Y and evaluating whether these paths are blocked or influenced by another set of nodes, Z. When all paths between X and Y are blocked by Z, we say that X and Y are d-separated by Z, indicating that they are independent or conditionally independent given Z. D-separation plays a crucial role in analyzing causal relationships and making causal inferences from observational data, helping researchers identify valid causal effects and avoid spurious associations in complex causal structures.

In our case, TQ (set X) and SP (set Y) are not d-separated when conditioning on the empty set (set Z), as there is one active path between these nodes.

Case 2 : Conditioning on SM

Let us look at the case when we condition on SM. As in the earlier case, Path 1 is blocked at the collider SR. Path 2 becomes blocked because conditioning on SM blocks the influence flowing from TQ through the chain. Path 3, however, opens at the collider TM because of the conditioning on SM: SM is a descendant of TM, and conditioning on a collider or any of its descendants opens the path for influence to flow from TQ >> TM >> SEC >> SP.

As we have an open path between TQ and SP, these variables are not D-separated.

Case 3 : Conditioning on SM and SEC

When we condition on both SM and SEC, we block Path 3 as well. This is because SEC is a fork, and conditioning on a fork blocks the path.

Now we have a situation where all the paths between TQ and SP are blocked. This means that TQ and SP are d-separated, which in turn means these variables are independent when conditioning on SM and SEC.

Case 4 : Conditioning on SM, SEC and SR

When we condition on SR, Path 1 opens up as SR is a collider. The other paths remain blocked as seen from the previous case. This means that TQ and SP are not d-separated.
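These conclusions can be verified programmatically. The sketch below rebuilds the example graph in networkx using the abbreviated node names from the paths above; the edge list is an assumption reconstructed from the discussion, and the d-separation query is named d_separated in older networkx releases and is_d_separator in newer ones.

import networkx as nx

# Edge list reconstructed from the discussion: TQ and SEC cause TM,
# TM -> SM -> SP is the chain, SEC also affects SP directly,
# and SR (school reputation) is a shared effect of TM and SP
G = nx.DiGraph([
    ("TQ", "TM"), ("SEC", "TM"),
    ("TM", "SM"), ("SM", "SP"),
    ("SEC", "SP"),
    ("TM", "SR"), ("SP", "SR"),
])

# Pick whichever d-separation function this networkx version provides
d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

print(d_sep(G, {"TQ"}, {"SP"}, set()))                # Case 1: False, the chain path is open
print(d_sep(G, {"TQ"}, {"SP"}, {"SM"}))               # Case 2: False, SM opens the collider TM
print(d_sep(G, {"TQ"}, {"SP"}, {"SM", "SEC"}))        # Case 3: True, all paths blocked
print(d_sep(G, {"TQ"}, {"SP"}, {"SM", "SEC", "SR"}))  # Case 4: False, SR reopens path 1

The four printed results mirror Cases 1 through 4 above: only conditioning on {SM, SEC} d-separates TQ and SP.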

Intuition for d-separation

The concept of D-separation is fundamental to causal analysis and provides a framework to understand how the presence of certain nodes can block or open causal paths between other nodes in the graph.

When we have a set of nodes Z, D-separation helps us identify whether Z acts as a “shield” that blocks the flow of influence between two other sets of nodes, X and Y. If all the paths between X and Y are blocked by Z, it implies that X and Y are independent or conditionally independent given Z. In other words, conditioning on the nodes in set Z eliminates the confounding effects that may exist between X and Y due to their shared relationships with Z.

This concept is highly relevant in causal analysis because it enables us to discern true causal relationships from spurious associations. By identifying the nodes that act as "confounders", we can appropriately adjust for these variables when estimating causal effects and make valid causal inferences from observational data.

Conclusion

In conclusion, causal graphs play a pivotal role in advancing causal analysis and enhancing our understanding of cause-and-effect relationships in complex systems. By visually representing the causal relationships between variables, these graphs provide valuable insights into the underlying mechanisms that drive outcomes in various fields. They help us identify and control for confounding factors, mediating variables, and colliders, allowing for more accurate and reliable causal inference from observational data.

Embracing causal graphs in our research and decision-making processes can lead to better policy recommendations, improved healthcare interventions, and enhanced understanding of social dynamics. As we continue to delve into the world of causal analysis, we are encouraged to explore further the diverse applications of causal graphs in tackling real-world problems.

In our future blog posts, we will delve into more advanced topics in causal analysis, including counterfactual reasoning, sensitivity analysis, and techniques for causal identification and estimation. By expanding our knowledge and utilization of causal graphs, we can make more informed decisions, contribute to cutting-edge research, and shape a better understanding of the cause-and-effect relationships that govern the world around us.

Causal AI Foundation Series : Confounders in Causal Analysis

Structure

  1. What are confounding variables ?
  2. Importance of identifying the confounders
  3. Confounding bias and ways of removing it
  4. Unraveling Simpson’s Paradox: The Impact of Confounders on Correlation and Causation
  5. Mitigating the Influence of Confounders through Adjustment Formulae
  6. Conclusion

Introduction

Today, we’ll embark on a journey to unravel the mysteries of confounding and understand its significant role in causal inference. To start, let’s dive into a simple and relatable example: a beachside shop that sells both ice creams and summer hats. During the scorching summer days, the shop owner notices an intriguing trend – a strong positive correlation between the sales of ice creams and summer hats. It seems that on days when more ice creams are sold, more summer hats fly off the shelves as well.

Hat sales correlated with ice cream sales

At first glance, one might assume a direct causal relationship between these two items – selling more ice creams causes an increase in summer hat sales, or vice versa. However, as we delve deeper into the underlying factors influencing this pattern, we come across a lurking variable that plays a pivotal role – the temperature. On hot summer days, people are naturally drawn to buy both ice creams and summer hats to beat the heat and enjoy the sunshine. In this scenario, the temperature acts as a confounding variable, affecting both ice cream and summer hat sales independently.

Summer heat as confounder for hat sales and ice cream sales

As we explore the concept of confounding further, we will uncover its potential to lead us astray when estimating causal effects in various scenarios. Confounding variables can introduce biases in our analysis, leading to incorrect conclusions about cause-and-effect relationships. Join us on this enlightening journey as we learn about the intricacies of confounding, its implications in causal analysis, and how to tackle it to obtain accurate and meaningful insights. So, let’s unravel the mysteries of confounding and enhance our understanding of causal inference together!
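Before moving on, a quick simulation (with made-up effect sizes) makes the beachside-shop story concrete: temperature drives both ice cream and hat sales, the two sales series end up strongly correlated even though neither causes the other, and adjusting for temperature makes that correlation vanish.

import numpy as np

np.random.seed(7)
n_days = 365

# Daily temperature drives both ice cream and hat sales independently
temperature = np.random.normal(25, 7, n_days)
ice_cream_sales = 5 * temperature + np.random.normal(0, 10, n_days)
hat_sales = 3 * temperature + np.random.normal(0, 10, n_days)

# Strong correlation between the two sales series, despite no direct causal link
print("Correlation of ice cream and hat sales:",
      round(np.corrcoef(ice_cream_sales, hat_sales)[0, 1], 2))

# Partial out the confounder: correlate the residuals after removing temperature's effect
ice_resid = ice_cream_sales - np.poly1d(np.polyfit(temperature, ice_cream_sales, 1))(temperature)
hat_resid = hat_sales - np.poly1d(np.polyfit(temperature, hat_sales, 1))(temperature)
print("Correlation after adjusting for temperature:",
      round(np.corrcoef(ice_resid, hat_resid)[0, 1], 2))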

Importance of identifying the confounders

“Correlation is not causation” is a well-known phrase that serves as a poignant reminder of the potential pitfalls in making causal inferences solely based on observed associations. Let’s explore this idea through an example. Consider a study that examines the relationship between reading ability and shoe size in a group of students. Surprisingly, the study finds a strong positive correlation between these two variables – students with larger shoe sizes tend to have better reading skills. However, it would be misguided to conclude that bigger feet somehow enhance reading abilities or vice versa.

Ageing as the confounder for shoe size and reading skills

In this example, the concept of confounding comes into play. The lurking variable here is age. As students grow older, both their shoe sizes and reading abilities tend to increase. Age acts as a confounder, influencing both shoe size and reading ability independently. Therefore, the observed correlation between shoe size and reading ability is not indicative of a causal relationship between the two. Instead, it underscores the importance of carefully considering potential confounding factors before drawing any causal conclusions.

The phrase “correlation is not causation” serves as a cautionary reminder to remain vigilant in distinguishing between mere associations and genuine causal relationships. Without accounting for confounding variables, we risk falling prey to erroneous conclusions and misinterpretations of data. As we delve deeper into the realms of causal analysis, it becomes evident that understanding and addressing confounding is essential to unraveling the true cause-and-effect dynamics hidden within our observations.

Confounding bias and ways of removing it

Confounding bias poses a significant challenge in causal analysis, as it can lead to misleading conclusions and inaccurate assessments of the true causal relationship between a treatment and an outcome. A classic example of confounding bias arises when evaluating the efficacy of a new drug. Suppose the drug is primarily prescribed to older individuals, who are at higher risk for the targeted disease. Here, age acts as the confounder, influencing both the prescription of the drug and the likelihood of experiencing the disease outcome. Consequently, any adverse outcome in health among patients taking the drug could be mistakenly attributed to the drug’s effects, when in reality, it might be due to the age-related risk factors.

To disentangle the genuine impact of the new drug from the confounding effect of age, randomized controlled trials (RCTs) offer a powerful solution. In RCTs, patients are randomly assigned to either the treatment group, receiving the new drug, or the control group, receiving a placebo or standard treatment. Randomization ensures that potential confounders, like age, are evenly distributed among the groups, eliminating the bias that might have skewed the results in observational studies. By comparing the outcomes of both groups, researchers can confidently identify the true causal effect of the drug, making informed decisions about its efficacy and safety. Randomization helps to overcome confounding bias and strengthens the validity of causal analysis, providing more reliable evidence for medical practitioners and researchers.
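A short simulation illustrates why randomization removes this bias. In the hypothetical setup below, observational prescribing depends on age, so the treated patients are noticeably older than the controls; with a coin-flip assignment, the two groups end up with nearly identical age distributions.

import numpy as np

np.random.seed(3)
n = 10000

# Age is the confounder: older patients are more likely to be prescribed the drug
age = np.random.normal(60, 10, n)

# Observational prescribing: probability of treatment rises with age
p_treat_obs = 1 / (1 + np.exp(-(age - 60) / 5))
treated_obs = np.random.rand(n) < p_treat_obs

# Randomized trial: treatment assigned by a coin flip, independent of age
treated_rct = np.random.rand(n) < 0.5

print("Mean age (observational): treated %.1f vs control %.1f"
      % (age[treated_obs].mean(), age[~treated_obs].mean()))
print("Mean age (randomized):    treated %.1f vs control %.1f"
      % (age[treated_rct].mean(), age[~treated_rct].mean()))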

Unraveling Simpson’s Paradox: The Impact of Confounders on Correlation and Causation

The well-known adage “correlation does not imply causation” reminds us that just because two variables are correlated, it doesn’t necessarily mean that one causes the other. This is especially true when confounding variables come into play, causing a spurious correlation or even a reversal of the direction of the association.

Simpson’s paradox is a classic example that exemplifies this phenomenon. It occurs when the relationship between two variables changes when a third variable, a confounder, is not taken into account. The confounding variable influences both the correlated variables, leading to a misleading overall conclusion.

Consider the pass percentage of female and male students in a semester.

                             Female students     Male students
Semester pass percentage     78% (78/100)        81% (81/100)
Table 1 : Overall pass percentage

At first glance, we observe that male students have a slightly higher pass percentage, seemingly indicating better academic performance. However, upon closer examination, we discover that the elective choices of students play a significant role.

                               Female students     Male students
Elective A (easy elective)     92% (23/25)         87% (67/77)
Elective B (tough elective)    73% (55/75)         61% (14/23)
Table 2 : Breakup of pass percentage

When we analyze the pass percentages for each elective separately, we find that female students outperform male students in both the difficult and the easy electives. This means that, in reality, female students perform better in each category. However, the composition of elective choices creates an illusion of male students having an advantage in the overall results: female students tend to opt for the more challenging elective with a lower pass rate, while male students prefer the easier elective with a higher pass rate. This disparity in the choice of electives is what creates the illusion that male students perform better than female students.

This example illustrates the importance of considering confounders in causal analysis. Failing to account for confounding variables can lead to erroneous conclusions and the misinterpretation of correlations. In this case, the confounding variable is the choice of electives. The causal graph for this scenario is represented below.

Elective as confounder influencing gender and result

A more equitable comparison in this case would be to first condition on the electives and then compare the results of male and female students. This entails grouping students who take the same elective and then comparing the pass percentages of male and female students within these groups. For example, take the students who chose elective A and compare the pass percentages of male and female students (87% vs. 92%). This is the essence of one of the methodologies to address Simpson's paradox, called the adjustment formula. Let us look at the intuition behind the adjustment formula next.

Mitigating the Influence of Confounders through Adjustment Formulae

In our earlier example, we delved into the intricacies of Simpson’s Paradox, where a seemingly straightforward comparison between the overall pass percentages of female and male students (78% vs. 81%) revealed a nuanced distortion. This distortion arose from the fact that female students disproportionately opted for more challenging courses, thus affecting their overall pass rate. The pivotal question that emerges is how to render this comparison more equitable, while factoring in the complexities introduced by confounding variables.

To address this quandary, note first that the overall pass percentage of female students is simply a weighted average of their pass percentages in the two electives. In the original data, the proportion of female students taking the demanding elective was 75% (75/100), while the proportion opting for the less challenging elective was 25% (25/100). We can therefore recover the overall pass percentage for female students as follows:

Overall Pass Percentage of Females = Pass Percentage in Demanding Course × Proportion of Females in Demanding Course + Pass Percentage in Less Challenging Course × Proportion of Females in Less Challenging Course

Plugging in the numbers, this gives:

Overall Pass Percentage of Females = 73% × 75% + 92% × 25% ≈ 78%

Now, let’s transport ourselves to a hypothetical scenario where the entire student population consists solely of females. Although the individual pass percentages for the distinct courses remain unchanged (73% and 92%), what does undergo modification is the relative frequency or proportion of students opting for each course. With all students being females, the proportion of students enrolling in the demanding course would become (75 + 23)/200 = 49%. Similarly, the frequency of female students enrolling in the less challenging course would be (77 + 25)/200 = 51%.

In light of this adjustment, the recalculated overall pass percentage for the hypothetical all-female student population becomes:

Overall Pass Percentage (All-Female Scenario) = 73% × 49% + 92% × 51% ≈ 83%

Similarly, the overall pass percentage if all students were male would be:

Overall Pass Percentage (All-Male Scenario) = 61% × 49% + 87% × 51% ≈ 74%

These are now comparable numbers, and they clearly indicate that the female pass percentage is better than the male pass percentage. The revised formula we used above is called the adjustment formula.

This exercise showcases the influence of confounders on causal inferences and illustrates how adjustments, akin to this hypothetical scenario, can be applied to mitigate their impact. By understanding and effectively addressing confounding variables, we can attain a clearer and more accurate representation of the true relationships underlying the data.
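The same adjustment can be written as a few lines of Python using the counts from Tables 1 and 2 above: each group's per-elective pass rate is weighted by the proportion of the combined 200-student population taking that elective.

# Pass rates by (gender, elective), taken from Table 2
pass_rate = {
    ('Female', 'tough'): 55 / 75, ('Female', 'easy'): 23 / 25,
    ('Male', 'tough'):   14 / 23, ('Male', 'easy'):   67 / 77,
}

# Proportion of all 200 students taking each elective
n_total = 200
p_elective = {'tough': (75 + 23) / n_total, 'easy': (25 + 77) / n_total}

# Adjustment formula: weight each per-elective pass rate by the
# population-level elective proportions
for gender in ('Female', 'Male'):
    adjusted = sum(pass_rate[(gender, e)] * p_elective[e] for e in p_elective)
    print(f"Adjusted pass percentage for {gender} students: {adjusted:.1%}")

The adjusted figures come out at roughly 83% for female students and 74% for male students, matching the calculations above and reversing the misleading raw comparison.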

Conclusion

In conclusion, our journey through the intricate landscape of confounding has unveiled a fundamental truth: causation is not correlation. We’ve delved into the nuances of confounding variables, those sneaky influencers that can distort the apparent relationship between our variables of interest. The adjustment formulae have emerged as our trusty tools, helping us navigate the labyrinth of observational data by accounting for these confounders.

Understanding the distinction between causation and correlation is paramount in unraveling the true nature of relationships within data. Confounding variables, often hidden in the shadows, can lead us astray if not properly acknowledged and addressed. The adjustment formulae act as our guiding light, allowing us to discern the genuine causal effects from mere associations.

As we part ways with confounding, let’s carry forth this wisdom: in the pursuit of understanding causation, we must tread carefully, armed with the knowledge that correlation does not imply causation. By diligently adjusting for confounders, we can unearth the threads of true causation, weaving a clearer narrative in the intricate tapestry of observational analysis.

Causal AI Foundation Series : Treatment group and Control group

Structure

  1. Treatment and Control group in causality
  2. Assignment of individuals to treatment and control groups
  3. Role of randomization in selection of treatment and control groups
  4. Covariates and balancing covariates for treatment and control group
  5. Python code exercises on covariate balancing
  6. Conclusion

Introduction

Causal analysis is a powerful tool that finds its applications in diverse fields like healthcare, social sciences, and business. It’s the key to unlocking the secrets of cause-and-effect relationships, empowering us to make well-informed decisions. In this exciting journey through causal analysis, our first stop is to dive deep into the world of treatment and control groups. These groups are the building blocks of estimating causal effects, and we’re about to uncover their vital role.

Treatment and Control Group

In causal analysis, a treatment group consists of individuals who receive a particular intervention, while a control group consists of individuals or subjects who do not receive the intervention being studied. Let us look at an example.

Consider a marketing campaign aimed at increasing customer engagement on a company’s website. The treatment group would be the customers who are exposed to the campaign, while the control group would be customers who are not exposed to any promotional activities. By comparing the website engagement metrics between the treatment group and the control group, marketers can assess the effectiveness of the campaign in driving customer engagement.

Let us consider another example in the healthcare domain. Imagine a clinical trial studying the effectiveness of a new drug for treating a certain medical condition. The treatment group consists of participants who receive the new drug, while the control group comprises participants who receive a placebo or standard treatment. By comparing the outcomes between the treatment group and the control group, researchers can determine whether the new drug has a significant impact on improving the condition.

By carefully designing and analyzing studies using these groups, researchers can draw reliable conclusions about the impact of the treatment on the outcome variable. Treatment groups and control groups are essential components of causal analysis for several reasons:

  1. Establishing Causality: Comparing the outcomes of the treatment group and the control group helps establish a cause-and-effect relationship between the treatment and the outcome.
  2. Minimizing Bias: The use of control groups helps minimize bias and confounding in the estimation of the causal effect. By ensuring that the only difference between the treatment and control groups is the presence or absence of the treatment, researchers can attribute any observed differences in outcomes to the treatment itself, rather than other factors that may influence the outcome.
  3. Counterfactual Analysis: Control groups enable the creation of a counterfactual scenario, representing what would have happened to the treatment group had they not received the treatment. By comparing the actual outcomes of the treatment group with the outcomes of the control group, which stand in for this hypothetical scenario, researchers can estimate the causal effect of the treatment.

Assignment of individuals to treatment or control groups

Individuals are assigned to the treatment group or control group through a process called randomization. Randomization involves randomly assigning individuals to either group, ensuring that each individual has an equal chance of being assigned to either group. This helps minimize bias and ensures that the groups are comparable.
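As a simple illustration, random assignment might be implemented along the following lines; this is a minimal sketch with hypothetical participant identifiers, not a procedure prescribed by any particular study:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical participant identifiers
participants = [f'P{i:03d}' for i in range(1, 101)]

# Randomly assign each participant to treatment or control with equal probability
assignment = rng.choice(['treatment', 'control'], size=len(participants))

treatment_group = [p for p, a in zip(participants, assignment) if a == 'treatment']
control_group = [p for p, a in zip(participants, assignment) if a == 'control']

print(f'Treatment group size: {len(treatment_group)}')
print(f'Control group size:   {len(control_group)}')

With equal assignment probabilities the two groups end up roughly, but not exactly, the same size; if exactly equal sizes are required, the participant list can instead be shuffled and split in half.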

Here are some examples illustrating treatment group and control group in different scenarios:

  1. Clinical Trials: In a clinical trial for testing a new drug, participants are randomly assigned to either the treatment group, where they receive the experimental drug, or the control group, where they receive a placebo or standard treatment. The treatment group is used to assess the effectiveness of the new drug compared to the control group.
  2. Education Interventions: In an educational study investigating the impact of a teaching method on student performance, students in the treatment group might receive the new teaching method, while those in the control group receive the traditional teaching method. The treatment group is used to evaluate the effectiveness of the new teaching method.
  3. Marketing Campaigns: In a marketing study evaluating the effectiveness of a promotional campaign, customers might be divided into a treatment group, exposed to the campaign, and a control group, not exposed to the campaign. The treatment group helps assess the impact of the campaign on customer behavior compared to the control group.

Having seen the mechanism of assigning individuals to either group, let us explore the role of randomization in the whole exercise.

Role of randomization in selection of treatment and control groups

Randomization plays a crucial role in assigning individuals to treatment and control groups in causal analysis. It is a fundamental principle that helps ensure the validity and reliability of the findings. By randomly assigning individuals to groups, we introduce an element of chance that reduces the potential for biases that may creep up.

Consider a clinical trial investigating the efficacy of a new drug. Randomization ensures that participants have an equal chance of being assigned to either the treatment group receiving the drug or the control group receiving a placebo (or the standard treatment). Randomization also helps avoid selection bias, for instance when individuals with higher risk levels are disproportionately assigned to the treatment group. Individuals at higher risk also tend to have worse outcomes, so such an assignment would make the two groups less comparable, and any observed difference in outcomes would be attributable to the risk factor rather than to the efficacy of the treatment. This is where randomization comes into play: with random assignment, an individual with a given risk factor is equally likely to end up in either group, making the groups more comparable.

Covariates and balancing covariates for treatment and control group

Another important factor to consider when assigning individuals into treatment and control group is the influence of covariates. Covariates refer to additional variables or factors that may influence both the assignment of individuals to the treatment or control group and the outcome of interest.

For example, in a clinical trial, covariates like age, gender, ethnicity, and medical history can influence both the assignment to treatment groups and the treatment outcomes: older age or the presence of specific medical conditions may affect how individuals respond to the drug or experience side effects. To ensure a fair comparison between treatment and control groups, it is important to account for these covariates. By comparing individuals with similar covariate values across the treatment and control groups, researchers can create balanced comparisons. For instance, comparing the outcomes of individuals aged 50-60 in both groups takes into account the potential influence of age on the outcome, whereas comparing individuals aged 50-60 in the treatment group with individuals aged 20-30 in the control group would not, since age is not adequately controlled for. By balancing the covariates, researchers can make more accurate inferences about the treatment effect and minimize the potential confounding effects of covariates on the outcome.
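Before the fuller randomization example below, here is a minimal sketch of what such a covariate-restricted comparison might look like. The DataFrame, column names, and outcome model here are purely illustrative assumptions, not part of any real study:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative data: age, treatment flag and an outcome for 500 hypothetical patients
df = pd.DataFrame({
    'age': rng.integers(20, 71, size=500),
    'treated': rng.choice([0, 1], size=500),
})
df['outcome'] = 0.5 * df['treated'] + 0.05 * df['age'] + rng.normal(0, 1, size=500)

# Restrict the comparison to a common age band (50-60) so that age is held roughly constant
band = df[(df['age'] >= 50) & (df['age'] <= 60)]
print(band.groupby('treated')['outcome'].mean())  # control (0) vs treated (1) within the band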

Now let us look at some sample code that illustrates the concept of covariate balancing through randomization.

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)

# Sample data for Age and Medical History before randomization
age_before = np.random.normal(45, 10, 200)
med_history_before = np.random.choice([0, 1], size=200, p=[0.7, 0.3])

# Sample data for Age and Medical History after randomization
age_after = np.random.normal(50, 8, 200)
med_history_after = np.random.choice([0, 1], size=200, p=[0.5, 0.5])


# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot Age before and after randomization
axes[0, 0].hist(age_before, bins=20, alpha=0.6, color='blue', label='Before Randomization')
axes[0, 0].set_title('Age Distribution Before Randomization')
axes[0, 0].legend()

axes[0, 1].hist(age_after, bins=20, alpha=0.6, color='red', label='After Randomization')
axes[0, 1].set_title('Age Distribution After Randomization')
axes[0, 1].legend()

# Plot Medical History before and after randomization
axes[1, 0].hist(med_history_before, bins=[-0.5, 0.5, 1.5], rwidth=0.8, alpha=0.6, color='blue', label='Before Randomization')
axes[1, 0].set_title('Medical History Distribution Before Randomization')
axes[1, 0].set_xlabel('Medical History (0: No, 1: Yes)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_xticks([0, 1])
axes[1, 0].legend()

axes[1, 1].hist(med_history_after, bins=[-0.5, 0.5, 1.5], rwidth=0.8, alpha=0.6, color='red', label='After Randomization')
axes[1, 1].set_title('Medical History Distribution After Randomization')
axes[1, 1].set_xlabel('Medical History (0: No, 1: Yes)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_xticks([0, 1])
axes[1, 1].legend()

plt.tight_layout()
plt.show()

Let us explain this exercise and provide an intuitive understanding. In the context of randomized experiments, balancing covariates means ensuring that the distribution of certain variables (covariates) is similar or “balanced” between the treatment and control groups. This balance is crucial because it helps eliminate or reduce the potential influence of these covariates on the outcome, allowing researchers to make more accurate inferences about the treatment effect.

In the provided exercise with histograms, we’re examining two hypothetical covariates: Age and Medical History. Here’s how it connects to the concept of balancing covariates:

  1. Visualizing Covariate Distributions Before Randomization:
  • The left-hand histograms show the simulated Age and Medical History distributions before randomization (a mean age of about 45 and a 70/30 split on medical history).
  • Before randomization, these distributions can be skewed, so treatment and control groups drawn from such a population may differ noticeably, for example in average age or in the share of pre-existing conditions.
  2. Visualizing Covariate Distributions After Randomization:
  • The right-hand histograms show the distributions after randomization (a mean age of about 50 and a roughly 50/50 split on medical history in this simulation).
  • After randomization, the covariate distributions in the treatment and control groups should ideally be more balanced, with roughly equal proportions of individuals with and without a medical-history condition present in both groups.

The exercise relates to the concept of balancing covariates in the following ways:

  • Before Randomization: If covariates like age or medical history are imbalanced between treatment and control groups before randomization, it could lead to confounding effects. For example, if the treatment group has significantly older participants, any observed differences in the outcome may be due to age rather than the treatment itself.
  • After Randomization: The goal of randomization is to create groups that are comparable in terms of observed and unobserved covariates. After randomization, we hope to achieve balance in covariate distributions. This balance allows us to attribute any observed differences in outcomes between groups more confidently to the treatment, rather than to covariates.

This exercise illustrates the importance of ensuring covariate balance between treatment and control groups. Achieving this balance through randomization is a fundamental principle of experimental design, as it helps us isolate the true effect of the treatment variable while minimizing the impact of other covariates.
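A common way to quantify how well a covariate is balanced is the standardized mean difference (SMD) between the two groups. The sketch below, which simulates its own data purely for illustration, shows how it might be computed for a single covariate such as age:

import numpy as np

def standardized_mean_difference(treated, control):
    """SMD: difference in group means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(treated, ddof=1) + np.var(control, ddof=1)) / 2)
    return (np.mean(treated) - np.mean(control)) / pooled_sd

rng = np.random.default_rng(1)

# Simulated ages: clearly imbalanced groups versus groups drawn from the same distribution
age_treated_imbalanced = rng.normal(55, 8, 100)
age_control_imbalanced = rng.normal(45, 8, 100)
age_treated_randomized = rng.normal(50, 8, 100)
age_control_randomized = rng.normal(50, 8, 100)

print(f'SMD, imbalanced groups: {standardized_mean_difference(age_treated_imbalanced, age_control_imbalanced):.2f}')
print(f'SMD, randomized groups: {standardized_mean_difference(age_treated_randomized, age_control_randomized):.2f}')
# A common rule of thumb treats an absolute SMD below about 0.1 as acceptably balanced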

Another way of balancing based on covariates is stratification. Stratification involves dividing the population into strata based on specific covariate values and then randomly assigning individuals from each stratum to the treatment and control groups. This ensures that the groups are balanced within each stratum, even if there are differences across strata. For example, in a study examining the impact of a fitness program on weight loss, participants can be stratified by initial body mass index (BMI) categories, and within each category individuals are randomly assigned to the treatment or control group. This helps control for the potential effect of the covariate BMI on the outcome. Let us look at an exercise where we achieve stratification.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated data
np.random.seed(0)
n_samples = 200

# Generate age data
age = np.random.choice(['18-30', '31-45', '46-60', '61+'], size=n_samples)

# Create a function to simulate blood pressure based on age and treatment
def simulate_blood_pressure(age, treatment):
    base_pressure = {'18-30': 120, '31-45': 130, '46-60': 140, '61+': 150}
    treatment_effect = {'Treatment': -10, 'Control': 0}
    return base_pressure[age] + treatment_effect[treatment] + np.random.normal(0, 5)

# Create a DataFrame to store the data
data = pd.DataFrame({'Age': age})

# Assign treatment within each age stratum using stratum-specific probabilities
data.loc[data['Age'] == '18-30', 'Treatment'] = np.random.choice(['Treatment', 'Control'], size=len(data.loc[data['Age'] == '18-30']), p=[0.5, 0.5])
data.loc[data['Age'] == '31-45', 'Treatment'] = np.random.choice(['Treatment', 'Control'], size=len(data.loc[data['Age'] == '31-45']), p=[0.3, 0.7])
data.loc[data['Age'] == '46-60', 'Treatment'] = np.random.choice(['Treatment', 'Control'], size=len(data.loc[data['Age'] == '46-60']), p=[0.2, 0.8])
data.loc[data['Age'] == '61+', 'Treatment'] = np.random.choice(['Treatment', 'Control'], size=len(data.loc[data['Age'] == '61+']), p=[0.1, 0.9])

# Simulate blood pressure
data['BloodPressure'] = data.apply(lambda row: simulate_blood_pressure(row['Age'], row['Treatment']), axis=1)

# Create boxplots to visualize blood pressure by treatment within each age category
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Age', y='BloodPressure', hue='Treatment', palette='Set2')
plt.title('Blood Pressure by Age and Treatment')
plt.xlabel('Age')
plt.ylabel('Blood Pressure')
plt.show()

In the provided code, we are stratifying the population based on different age groups and then assigning treatment or control groups within each stratum.

  1. '18-30' Age Group: For individuals aged between 18 and 30, we want to balance the treatment and control groups, so we use a 50-50 split. This ensures that within this age group, half of the individuals receive treatment, and the other half serve as the control group.
  2. '31-45' Age Group: For individuals aged between 31 and 45, we assign a different split. In this case, we have a 30-70 split. This means that a smaller proportion (30%) of individuals in this age group receive treatment, while a larger proportion (70%) serve as the control group.
  3. '46-60' Age Group: For individuals aged between 46 and 60, we assign an even more skewed split towards the control group. We use a 20-80 split, meaning that only 20% of individuals in this age group receive treatment, while 80% serve as the control group.
  4. '61+' Age Group: Finally, for individuals aged 61 and above, we have the most skewed split towards the control group. We use a 10-90 split, where only 10% of individuals in this age group receive treatment, while 90% serve as the control group.

The rationale behind these varying splits is to show that individuals are assigned to the treatment and control groups according to the chosen probabilities within each age stratum, with randomization still deciding who receives the treatment inside each stratum. This accounts for potential differences in treatment effects across age categories and allows for more precise analysis by controlling for the influence of age as a covariate. Stratification helps prevent bias and ensures that comparisons between treatment and control groups are meaningful within each age stratum.
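To check the per-stratum assignment shares produced by the snippet above, a simple cross-tabulation of the simulated data works; this assumes the earlier code has been run so that data is in scope:

# Share of Treatment vs Control within each age stratum of the simulated `data`
proportions = pd.crosstab(data['Age'], data['Treatment'], normalize='index')
print(proportions.round(2))
# Each row should roughly match the probabilities chosen above, e.g. about 0.5 / 0.5 for '18-30'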

Balancing covariates through the techniques mentioned above helps reduce bias by ensuring that treatment and control groups are comparable with respect to the relevant covariates, which improves the validity of the causal inferences drawn from the analysis.

Conclusion

In this blog post, we explored the significance of treatment group and control group in causal analysis. The treatment group represents individuals who are exposed to a specific treatment or intervention, while the control group consists of individuals who do not receive the treatment and serve as a baseline for comparison.

We discussed the importance of randomization in assigning individuals to treatment and control groups. Randomization ensures that participants have an equal chance of being assigned to either group, minimizing selection bias and allowing for unbiased estimation of the treatment effect. Randomization helps create comparable groups, making it more likely that any observed differences in outcomes between the groups are due to the treatment rather than pre-existing characteristics.

Additionally, we delved into the concept of covariates, which are factors that may influence both the treatment assignment and the outcome. Covariates can include demographic information, medical history, or other relevant factors. Balancing covariates between treatment and control groups is crucial to ensure a fair comparison and reduce confounding. Techniques such as matching, stratification, and regression adjustment can be employed to achieve balance and improve the accuracy of causal inference.

By understanding the role of treatment group and control group, the assignment of individuals, the role of randomization, and the importance of balancing covariates, we gain valuable insights into the process of conducting causal analysis. Considering these factors enhances the validity of causal inferences and strengthens the conclusions drawn from the analysis.

In our future posts, we will dive deeper into the counterfactual question and explore sensitivity analysis, which further expands our understanding of the treatment and control groups and their impact on causal analysis. Stay tuned for more insights on this fascinating topic!