Causal Estimation methods for Machine learning and Data Science Part II – Propensity Score Matching

1.0 Introduction

In the world of data science, uncovering cause-and-effect relationships from observational data can be quite challenging. In our earlier discussion, we dived into using linear regression for causal estimation, a fundamental statistical method. In this post, we introduce propensity score matching. This technique builds on linear regression foundations, aiming to enhance our grasp of cause-and-effect dynamics. Propensity score matching is crafted to deal with confounding variables, offering a refined perspective on causal relationships. Throughout our exploration, we’ll showcase the potency of propensity scores across various industries, emphasizing their pivotal role in the art of causal inference.

2.0 Structure

We will be traversing the following topics as part of this blog.

Introduction to Propensity Score Matching ( PSM )
Process for implementing PSM
- Propensity Score Estimation
- Matching Individuals
- Stratification
- Comparison and Causal Effect Estimation
PSM implementation from scratch using python
- Generating synthetic data and defining variables
- Training the propensity model using logistic regression
- Predicting the propensity scores for the data
- Separating treated and control groups
- Match the data using the nearest neighbour algorithm to identify the neighbours for both treated and control groups
- Calculate Average Treatment Effect on Treated ( ATT ) and Average Treatment Effect on Control ( ATC )
- Calculation of ATE
Practical examples of PSM in Retail, Finance, Telecom and Manufacturing

3.0 Introduction to Propensity Score Matching

Propensity score matching is a statistical method used to estimate the causal effect of a treatment, policy, or intervention by balancing the distribution of observed covariates between treated and untreated units. In simpler terms, it aims to create a balanced comparison group for the treated group, making the treatment and control groups more comparable.

Consider a retail company evaluating the impact of a customer loyalty program on purchase behavior. The company wants to understand if customers enrolled in the loyalty program (treatment group) show a significant difference in their average purchase amounts compared to non-enrolled customers (control group). Covariates for these groups include customer demographics (age, income), historical purchase patterns, and frequency of interactions with the company.

To measure the difference in purchase amounts between these two groups, we need to factor in the covariates: age, income, historical purchase patterns, and interaction frequency. This is because there can be substantial differences in buying propensity between subgroups based on demographics like age or income. So to get a fair difference between these groups, it’s imperative to make them comparable so that you can accurately measure the impact of the loyalty program.

The propensity score serves as a balancing factor. It’s like a ticket that each customer gets, representing their likelihood or “propensity” to join the loyalty program based on their observed characteristics (age, income, etc.). Propensity score matching then pairs each customer from the treatment group (enrolled in the program) with a customer from the control group (not enrolled) who has a similar score. By doing this, you create comparable sets of customers, making it as if they were randomly assigned to either the treatment or control group.

4.0 Process to implement causal estimation using Propensity Score Matching ( PSM )

PSM is a meticulous process designed to balance the covariates between treated (enrolled in the loyalty program) and untreated (not enrolled) groups, ensuring a fair comparison that unveils the causal effect of the program on purchase behavior. Let’s delve into the steps of this approach, each crafted to enhance comparability and precision in our causal estimation journey.

4.1. Propensity Score Estimation: Unveiling Likelihoods

The journey commences with Propensity Score Estimation, employing a classification model like logistic regression to predict the probability of a customer enrolling in the loyalty program based on various covariates. The idea of the classification model is to capture the conditional probability of enrollment given the observed customer characteristics, such as demographics, historical purchase patterns, and interaction frequency.

The resulting propensity score (PS) for each customer represents the likelihood of joining the program, forming a key indicator for subsequent matching. A classification model like logistic regression estimates these probabilities from the covariates, and its coefficients indicate how strongly each covariate is associated with the decision to enroll.

The PS serves as a foundational element in the Propensity Score Matching journey, guiding the way to a balanced and unbiased comparison between treated and untreated individuals.
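In symbols, the propensity score can be written as below. This is a sketch only: the notation (T for the enrollment indicator, X for the covariate vector, β for the logistic regression coefficients) is assumed here for illustration and does not come from the post itself.

```latex
e(x) = P(T = 1 \mid X = x),
\qquad
\hat{e}(x) = \frac{1}{1 + \exp\!\big(-(\hat{\beta}_0 + \hat{\beta}^{\top} x)\big)}
```

The first expression is the definition of the score; the second is the form it takes when it is estimated with logistic regression.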

4.2. Matching Individuals: Crafting Equivalence

Once we have the propensity scores, the next step is Matching Individuals. Think of it like finding “loyalty twins” for each customer – someone from the treatment group (enrolled in the loyalty program) matched with a counterpart from the control group (not enrolled), all based on similar propensity scores. The methods for this matching process vary, from simple nearest neighbor matching to more sophisticated optimal matching. Nearest neighbor matching pairs customers with the closest propensity scores, like pairing a loyalty member with a non-member whose characteristics align closely. On the other hand, optimal matching goes a step further, finding the best pairs that minimize differences in propensity scores. These methods aim for a quasi-random pairing, making sure our comparisons are as unbiased as possible. It’s like assembling pairs of customers who, based on their propensity scores, could have been randomly assigned to either group in the loyalty program evaluation.
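To make nearest neighbor matching concrete, here is a minimal sketch on a handful of made-up propensity scores. The function name match_nearest and the score values are hypothetical, and the matching is done with replacement (one control can serve as the match for several treated customers), which is an assumption of this sketch.

```python
import numpy as np

def match_nearest(treated_scores, control_scores):
    """Return, for each treated score, the index of the closest control score.

    Matching is with replacement: a control can be picked more than once.
    """
    treated_scores = np.asarray(treated_scores, dtype=float)
    control_scores = np.asarray(control_scores, dtype=float)
    # Pairwise absolute gaps, shape (n_treated, n_control)
    gaps = np.abs(treated_scores[:, None] - control_scores[None, :])
    # Column index of the smallest gap in each row
    return gaps.argmin(axis=1)

# Hypothetical propensity scores for three members and four non-members
treated_scores = [0.62, 0.35, 0.80]
control_scores = [0.30, 0.58, 0.77, 0.41]

matches = match_nearest(treated_scores, control_scores)
print(matches)  # [1 0 2]
```

Each entry says which non-member is the “loyalty twin” of the corresponding member: the member with score 0.62 is paired with the non-member with score 0.58, and so on.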

4.3. Stratification: Ensuring Comparable Strata

Stratification is like sorting our matched customers into groups based on their propensity scores. This helps us organize the data and makes our analysis more accurate. It’s a bit like putting customers with similar likelihoods of joining the loyalty program into different buckets. Why? Well, it minimizes the chance of mixing up customers with different characteristics. For example, if we didn’t do this, we might end up comparing loyal customers who joined the program with non-loyal customers who have very different behaviors. Stratification ensures that, within each group, customers are pretty similar in terms of their chances of joining. It’s a bit like creating smaller, more manageable sets of customers, making our analysis more precise, especially when it comes to evaluating the impact of the loyalty program.
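As a rough illustration of the bucketing idea, the sketch below sorts ten made-up propensity scores into five equal-sized strata using pandas. The choice of five strata and the column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical propensity scores for ten customers
scores = pd.DataFrame({
    "propensity": [0.05, 0.12, 0.18, 0.27, 0.33, 0.41, 0.48, 0.56, 0.69, 0.83]
})

# Cut the scores into five quantile-based strata; labels=False returns 0..4
scores["stratum"] = pd.qcut(scores["propensity"], q=5, labels=False)

# Range and size of each stratum
print(scores.groupby("stratum")["propensity"].agg(["min", "max", "count"]))
```

Customers in the same stratum have similar chances of joining, so comparisons made inside a stratum are between like and like.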

4.4. Comparison and Causal Effect Estimation: Unveiling Impact

Once we’ve organized our customers into strata based on propensity scores, we compare the mean purchase amounts of those who joined the loyalty program (treated, T=1) and those who didn’t (untreated, T=0) within each stratum.

Once the above is done, we calculate the Average Causal Effect ( ACE ) for each stratum.

ACE_s = Y_s¹ − Y_s⁰

This equation calculates the Average Causal Effect (ACE) for a specific stratum (s). It’s the difference between the mean purchase amount for treated ( Y_s¹ ) and untreated ( Y_s⁰ ) individuals within that stratum.

Next we take the weighted average of the stratum-specific ACEs.

ACE = ∑_s w_s * ACE_s

Here, we take the sum of all stratum-specific ACEs, each multiplied by the proportion (w_s) of individuals in that stratum. It gives us an overall Average Causal Effect (ACE) that considers the contribution of each stratum based on its size.

ACE measures how much the loyalty program influenced purchase amounts. If ACE_s is positive, it suggests the program had a positive impact on purchases in that stratum. The weighted average considers stratum sizes, providing a comprehensive assessment of the program’s overall impact.
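The stratum-level calculation can be sketched on a tiny made-up table. The data values and column names below are hypothetical; the arithmetic follows the two formulas above.

```python
import pandas as pd

# Hypothetical data: two strata, treated (1) and untreated (0) customers
df = pd.DataFrame({
    "stratum":   [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "treatment": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "purchase":  [120, 100, 130, 110, 220, 200, 180, 170, 210, 190],
})

# Mean purchase per stratum and treatment arm (columns 0 and 1)
means = df.groupby(["stratum", "treatment"])["purchase"].mean().unstack("treatment")

# Stratum-specific effect: treated mean minus untreated mean
ace_s = means[1] - means[0]

# Stratum weights: share of all individuals that fall in each stratum
w_s = df["stratum"].value_counts(normalize=True).sort_index()

# Overall effect: weighted average of the stratum-specific effects
ace = (ace_s * w_s).sum()
print(ace_s.to_dict())  # {0: 20.0, 1: 30.0}
print(ace)
```

Here stratum 0 holds 40% of the customers with an effect of 20, and stratum 1 holds 60% with an effect of 30, so the overall ACE is 0.4 * 20 + 0.6 * 30 = 26.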

5.0 Implementing PSM from scratch

Let us now work out an example to demonstrate the concept of propensity scoring. We will generate some synthetic data and then develop the code to implement the estimation method using propensity score matching.

We will start by importing the necessary library files

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

Next let us synthetically generate data for this exercise.

# Generate synthetic data
np.random.seed(42)
# Covariates
n_samples = 1000
age = np.random.normal(35, 5, n_samples)
income = np.random.normal(50000, 10000, n_samples)
historical_purchase = np.random.normal(100, 20, n_samples)
# Convert age as an integer
age = age.astype(int)
# Treatment variable (loyalty program)
treatment = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
# Outcome variable (purchase amount)
purchase_amount = 50 + 10 * treatment + 5 * age + 0.5 * income + 2 * historical_purchase + np.random.normal(0, 10, n_samples)

# Create a DataFrame
data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'HistoricalPurchase': historical_purchase,
    'Treatment': treatment,
    'PurchaseAmount': purchase_amount
})
data.head()

Let us look at the above code in detail

Setting Seed: np.random.seed(42) ensures reproducibility by fixing the random seed.
Generating Covariates:
- age: Simulates ages from a normal distribution with a mean of 35 and a standard deviation of 5.
- income: Simulates income from a normal distribution with a mean of 50000 and a standard deviation of 10000.
- historical_purchase: Simulates historical purchase data from a normal distribution with a mean of 100 and a standard deviation of 20.
Converting Age to Integer: age = age.astype(int) converts the ages to integers.
Generating Treatment and Outcome:
- treatment: Simulates a binary treatment variable (0 or 1) representing enrollment in a loyalty program. The p=[0.7, 0.3] argument sets the probability distribution for the treatment variable.
- purchase_amount: Generates an outcome variable based on a linear combination of covariates and a random normal error term.
Creating DataFrame:
- data: Creates a pandas DataFrame with columns ‘Age,’ ‘Income,’ ‘HistoricalPurchase,’ ‘Treatment,’ and ‘PurchaseAmount’ using the generated data.

This synthetic dataset serves as a starting point for implementing propensity score matching, allowing you to compare the outcomes of treated and untreated groups while controlling for covariates.

Next let us define the variables for the propensity modelling step

# Step 1: Propensity Score Estimation (Logistic Regression)
X_covariates = data[['Age', 'Income', 'HistoricalPurchase']]
y_treatment = data['Treatment']
# Standardize covariates
scaler = StandardScaler()
X_covariates_standardized = scaler.fit_transform(X_covariates)

First we start by selecting the covariates (features) that will be used to estimate the propensity score. In this case, it includes ‘Age,’ ‘Income,’ and ‘HistoricalPurchase.’

After that we select the treatment variable. This variable represents whether an individual is enrolled in the loyalty program (1) or not (0).

After this we standardize the covariates using the fit_transform method. Standardization involves transforming the data to have a mean of 0 and a standard deviation of 1.

The standardized covariates will then be used in logistic regression to predict the probability of treatment. This propensity score is a crucial component in propensity score matching, allowing for the creation of comparable groups with similar propensities for treatment. Let us see that part next

# Fit logistic regression model
propensity_model = LogisticRegression()
propensity_model.fit(X_covariates_standardized, y_treatment)

The logistic regression model is trained to understand the relationship between the standardized covariates and the binary treatment variable. This model learns the patterns in the covariates that are indicative of treatment assignment, providing a probability score that becomes a key element in creating balanced treatment and control groups.

In the next step, this model will be used in calculating probabilities, and in this case, predicting the probability of an individual being in the treatment group based on their covariates.

# Predict propensity scores
propensity_scores = propensity_model.predict_proba(X_covariates_standardized)[:, 1]
# Add the propensity scores to the data frame
data['propensity'] = propensity_scores
data.head()

Here we utilize the trained logistic regression model (propensity_model) to predict propensity scores for each observation in the dataset. The predict_proba method returns the predicted probabilities for both classes (0 and 1), and [:, 1] selects the probabilities for class 1 (treatment group). After predicting the probabilities we create a new column in the DataFrame (data) named ‘propensity’ and populate it with the predicted propensity scores.
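To see what predict_proba returns, here is a small self-contained sketch on toy data (the six data points are made up for illustration). It shows that the output has one column per class, in the order given by classes_, and that each row sums to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one covariate, binary treatment label
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

print(model.classes_)     # column order of predict_proba: [0 1]
print(proba.shape)        # (6, 2): one row per sample, one column per class
print(proba.sum(axis=1))  # every row sums to 1

# Column 1 is the probability of class 1, i.e. the propensity score
scores = proba[:, 1]
```

Because the columns follow classes_, selecting [:, 1] gives the probability of treatment only when the treated class is coded as the larger label, as it is with 0 and 1.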

The propensity scores represent the estimated probability of an individual being in the treatment group based on their covariates. In the context of the problem, the “treatment” refers to whether an individual has enrolled in the loyalty program of the company. A higher propensity score for an individual indicates a higher estimated probability or likelihood that this person would enroll in the loyalty program, given their observed characteristics.

As seen earlier propensity score matching aims to create comparable treatment and control groups. Adding the propensity scores to the dataset allows for subsequent steps in which individuals with similar propensity scores can be paired or matched.

# Find the treated and control groups
treated = data.loc[data['Treatment']==1]
control = data.loc[data['Treatment']==0]

# Fit the nearest neighbour on the control group ( Have not signed up) propensity score
control_neighbors = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(control["propensity"].values.reshape(-1, 1))

This section of the code is aimed at finding the nearest neighbors in the control group for each individual in the treatment group based on their propensity scores. Let’s break down the steps and provide intuition in the context of the problem statement:

The first two lines split the data: treated holds the individuals who have received the treatment (enrolled in the loyalty program) and control holds the individuals who have not.

The last line fits the Nearest Neighbors algorithm on the propensity scores of the control group. It’s essentially creating a model that can find the closest neighbor (individual) in the control group for any given propensity score. The choice of n_neighbors=1 means that for each individual in the treatment group, the algorithm will find the single nearest neighbor in the control group based on their propensity scores.

In the context of the loyalty program, this step is pivotal for creating a balanced comparison. Each treated individual is paired with the most similar individual from the control group in terms of their likelihood of enrolling in the program (propensity score). This pairing helps to mitigate the effects of confounding variables and allows for a more accurate estimation of the causal effect of the loyalty program on purchase behavior.

We will see the step of matching people from treated and control group next

# Find the distance of the control group to each member of the treated group ( Individuals who signed up)
distances, indices = control_neighbors.kneighbors(treated["propensity"].values.reshape(-1, 1))

The above code uses the fitted Nearest Neighbors model (control_neighbors) to find, for each individual in the treated group, the distance to and the index of the nearest neighbor in the control group.

distances: This variable will contain the distances between each treated individual and its nearest neighbor in the control group. The distance is a measure of how similar or close the propensity scores are.
indices: This variable will contain the indices of the nearest neighbors in the control group corresponding to each treated individual.

In propensity score matching, the objective is to create pairs of treated and control individuals with similar or close propensity scores. The distances and indices obtained here are crucial for assessing the quality of matching. Smaller distances and appropriate indices indicate successful pairing, contributing to a more reliable causal effect estimation.
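One simple way to inspect match quality is to summarize the returned distances and count how often each control is reused. The sketch below does this on made-up scores; the summary statistics chosen here (mean gap, largest gap, reuse counts) are illustrative assumptions, not a prescribed diagnostic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical propensity scores, reshaped to one column as kneighbors expects
control_ps = np.array([0.30, 0.58, 0.77, 0.41]).reshape(-1, 1)
treated_ps = np.array([0.62, 0.35, 0.80, 0.60]).reshape(-1, 1)

nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
distances, indices = nn.kneighbors(treated_ps)

# Both arrays have shape (n_treated, 1)
print("mean gap:", distances.mean())
print("largest gap:", distances.max())

# How many treated units were matched to each control (0 means never used)
reuse = np.bincount(indices.ravel(), minlength=len(control_ps))
print("matches per control:", reuse)  # [1 2 1 0]
```

A large maximum gap or a few controls that are matched many times are both signs that some pairs may not be very comparable.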

Let us now see the process of causal estimation using the distance and indices we just calculated

# Calculation of the Average Treatment effect on the Treated ( ATT)
att = 0
numtreatedunits = treated.shape[0]
print('Number of treated units',numtreatedunits)
for i in range(numtreatedunits):
  treated_outcome = treated.iloc[i]["PurchaseAmount"].item()
  control_outcome = control.iloc[indices[i][0]]["PurchaseAmount"].item()
  att += treated_outcome - control_outcome
att /= numtreatedunits
print("Value of ATT",att)

The above code calculates the Average Treatment Effect on the Treated (ATT). Let us try to understand the intuition behind the above step.

att : The ATT is a measure of the average causal effect of the treatment (loyalty program) on the treated group. It specifically looks at how the purchase amounts of treated individuals differ from those of their matched controls.

numtreatedunits : The total number of treated individuals.

The for loop iterates through each of the treated individuals. For each treated individual, the difference between their observed purchase amount and the purchase amount of their matched control is calculated and accumulated in att. Dividing the total by numtreatedunits then gives the average.

A positive ATT indicates that, on average, the treated individuals had higher purchase amounts compared to their matched controls. This would suggest a positive causal effect of the loyalty program on purchase amounts. A negative ATT would imply the opposite, indicating that, on average, treated individuals had lower purchase amounts than their matched controls.

In our case the ATT is negative, indicating that on average treated individuals have lower purchase amounts than their matched controls.

What this process achieves is that, by matching each treated individual with their closest counterpart in the control group (based on propensity scores), we create a balanced comparison, reducing the impact of confounding variables. The ATT represents the average difference in outcomes between the treated and their matched controls, providing an estimate of the causal effect of the loyalty program on purchase behavior for those who enrolled.
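The loop can also be written without explicit iteration. The sketch below is an equivalent vectorized form on small made-up frames; the values and the hand-written indices array are hypothetical stand-ins for the treated and control DataFrames and the output of kneighbors.

```python
import numpy as np
import pandas as pd

# Hypothetical outcomes for three treated and four control customers
treated = pd.DataFrame({"PurchaseAmount": [150.0, 120.0, 180.0]})
control = pd.DataFrame({"PurchaseAmount": [110.0, 140.0, 170.0, 115.0]})

# Positional index of each treated unit's matched control,
# shaped (n_treated, 1) like the indices array returned by kneighbors
indices = np.array([[1], [0], [2]])

# Look up all matched control outcomes in one step
matched_control = control["PurchaseAmount"].to_numpy()[indices[:, 0]]

# Mean of (treated outcome - matched control outcome)
att = (treated["PurchaseAmount"].to_numpy() - matched_control).mean()
print(att)  # 10.0
```

The result is the same quantity as the loop computes; the vectorized form is simply shorter and faster on large data.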

Next, similar to the ATT, we calculate the Average Treatment Effect on the Control ( ATC ). Let us look into those steps now.

# Computing ATC
treated_neighbors = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(treated["propensity"].values.reshape(-1, 1))
distances, indices = treated_neighbors.kneighbors(control["propensity"].values.reshape(-1, 1))
print(distances.shape)
# Calculating ATC from the neighbours of the control group
atc = 0
numcontrolunits = control.shape[0]
print('Number of control units',numcontrolunits)
for i in range(numcontrolunits):
  control_outcome = control.iloc[i]["PurchaseAmount"].item()
  treated_outcome = treated.iloc[indices[i][0]]["PurchaseAmount"].item()
  atc += treated_outcome - control_outcome
atc /= numcontrolunits
print("Value of ATC",atc)

Similar to the earlier step, we calculate the Average Treatment Effect on the Control group ( ATC ). The ATC measures the average causal effect the treatment (loyalty program) would have had on the control group. It evaluates how the purchase amounts of control individuals would have been affected had they received the treatment. The loop iterates through each control individual, accumulating the difference between the purchase amount of that individual’s matched treated counterpart and the control individual’s own purchase amount.

While ATT focuses on the treated group, comparing the outcomes of treated individuals with their matched controls, ATC focuses on the control group. It assesses how the treatment would have affected the control group by comparing the outcome of each control individual with that of their matched treated individual.

Having calculated ATT and ATC, it’s time to calculate the Average Treatment Effect ( ATE ). This can be calculated as follows

# Calculation of Average Treatment Effect
ate = (att * numtreatedunits + atc * numcontrolunits) / (numtreatedunits + numcontrolunits)
print("Value of ATE",ate)

ATE represents the weighted average of the Average Treatment Effect on the Treated (ATT) and the Average Treatment Effect on the Control (ATC). It combines the effects on the treated and control groups based on their respective sample sizes. ATE provides an overall measure of the average causal effect of the treatment (loyalty program) on the entire population, considering both the treated and control groups.
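Written out, the three quantities computed in the code are as follows. The notation is assumed for this sketch: N_T and N_C are the numbers of treated and control units, Y is the purchase amount, T is the treatment indicator, and m(i) denotes the nearest-neighbour match of unit i in the opposite group.

```latex
\mathrm{ATT} = \frac{1}{N_T}\sum_{i:\,T_i=1}\big(Y_i - Y_{m(i)}\big),
\qquad
\mathrm{ATC} = \frac{1}{N_C}\sum_{j:\,T_j=0}\big(Y_{m(j)} - Y_j\big),
\qquad
\mathrm{ATE} = \frac{N_T\,\mathrm{ATT} + N_C\,\mathrm{ATC}}{N_T + N_C}
```

In both ATT and ATC the difference is taken as treated outcome minus control outcome, which is why the two can be combined directly in the ATE.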

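The calculated ATE value ( -140.38 ) is negative. Taken at face value, a negative ATE would suggest that individuals who received the treatment have lower purchase amounts than they would have had without it, and that the loyalty program should be discontinued. That reading needs caution here, though. The synthetic data was generated with a treatment coefficient of +10, and treatment was assigned at random, independently of the covariates. The propensity scores therefore carry almost no information, matched pairs can differ widely in income, and the 0.5 * income term (with a standard deviation of about 5,000) swamps the true effect. The negative estimate reflects this noise rather than a harmful program. With real observational data, where enrollment does depend on the covariates, the matching step is what makes the treated and control groups comparable.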
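Because the synthetic data is generated with a known treatment coefficient of 10, one way to sanity-check a matched estimate is to repeat the whole exercise over several random seeds and look at how much the estimate moves. The sketch below does this for the ATT. It is an illustration under assumptions: the helper att_for_seed is hypothetical, and it uses np.random.default_rng instead of np.random.seed, so its numbers will not reproduce the figures printed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def att_for_seed(seed, n_samples=1000):
    """Regenerate the synthetic data for one seed and return the matched ATT."""
    rng = np.random.default_rng(seed)
    age = rng.normal(35, 5, n_samples).astype(int)
    income = rng.normal(50000, 10000, n_samples)
    hist = rng.normal(100, 20, n_samples)
    treatment = rng.choice([0, 1], n_samples, p=[0.7, 0.3])
    # Same outcome equation as in the post: the true treatment effect is 10
    outcome = (50 + 10 * treatment + 5 * age + 0.5 * income
               + 2 * hist + rng.normal(0, 10, n_samples))

    # Propensity scores from logistic regression on standardized covariates
    X = StandardScaler().fit_transform(np.column_stack([age, income, hist]))
    ps = LogisticRegression().fit(X, treatment).predict_proba(X)[:, 1]

    # Match each treated unit to its nearest control on the propensity score
    is_treated = treatment == 1
    nn = NearestNeighbors(n_neighbors=1).fit(ps[~is_treated].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[is_treated].reshape(-1, 1))
    matched = outcome[~is_treated][idx[:, 0]]
    return float(np.mean(outcome[is_treated] - matched))

estimates = [att_for_seed(seed) for seed in range(5)]
print("True effect in the data-generating code: 10")
print("Matched ATT per seed:", np.round(estimates, 2))
print("Spread (std) across seeds:", round(float(np.std(estimates)), 2))
```

If the spread across seeds is large compared with the true effect, a single estimate from one seed says little about the sign or size of the effect.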

Conclusion

Propensity score matching is a powerful technique in causal inference, particularly when dealing with observational data. By estimating the likelihood of receiving treatment (propensity score), we can create balanced groups, ensuring that treated and untreated individuals are comparable in terms of covariates. The step-by-step process involves estimating propensity scores, forming matched pairs, and calculating treatment effects. Through this method, we mitigate the impact of confounding variables, enabling more accurate assessments of causal relationships.

This method has good uses within multiple domains. Let us look at some of the important use cases.

Retail: Promotional Campaign Effectiveness

Treated Group: Customers exposed to a promotional campaign.
Control Group: Customers not exposed to the campaign.
Effects: Evaluate the impact of the promotional campaign on sales and customer engagement.
Covariates: Demographics, purchase history, and online engagement metrics.

In retail, understanding the effectiveness of promotional campaigns is crucial for allocating marketing resources wisely. The treated group in this scenario consists of customers who have been exposed to a specific promotional campaign, while the control group comprises customers who have not encountered the campaign. The goal is to assess the influence of the promotional efforts on sales and customer engagement. To ensure a fair comparison, covariates such as demographics, purchase history, and online engagement metrics are considered. Propensity score matching helps mitigate the impact of confounding variables, allowing for a more precise evaluation of the promotional campaign’s actual impact on retail performance.

Finance: Loan Default Prediction

Treated Group: Individuals who have received a specific type of loan.
Control Group: Individuals who have not received that particular loan.
Effects: Assess the impact of the loan on default rates and repayment behavior.
Covariates: Credit score, income, employment status.

In the financial sector, accurately predicting loan default rates is crucial for risk management. The treated group here consists of individuals who have received a specific type of loan, while the control group comprises individuals who have not been granted that particular loan. The objective is to evaluate the influence of this specific loan on default rates and overall repayment behavior. To account for potential confounding factors, covariates such as credit score, income, and employment status are considered. Propensity score matching aids in creating a balanced comparison between the treated and control groups, enabling a more accurate estimation of the causal effect of the loan on default rates within the financial domain.

Telecom: Customer Retention Strategies

Treated Group: Customers who have been exposed to a specific customer retention strategy.
Control Group: Customers who have not been exposed to any such strategy.
Effects: Evaluate the effectiveness of customer retention strategies on reducing churn.
Covariates: Contract length, usage patterns, customer complaints.

In the telecommunications industry, where customer churn is a significant concern, assessing the impact of retention strategies is vital. The treated group comprises customers who have experienced a particular retention strategy, while the control group consists of customers who have not been exposed to any such strategy. The aim is to examine the effectiveness of these retention strategies in reducing churn rates. Covariates such as contract length, usage patterns, and customer complaints are considered to control for potential confounding variables. Propensity score matching enables a more rigorous comparison between the treated and control groups, providing valuable insights into the causal effects of different retention strategies in the telecom sector.

Manufacturing: Production Process Optimization

Treated Group: Factories that have implemented a new and optimized production process.
Control Group: Factories continuing to use the old production process.
Effects: Evaluate the impact of the new production process on efficiency and product quality.
Covariates: Factory size, workforce skill levels, historical production performance.

In the manufacturing domain, continuous improvement is crucial for staying competitive. To assess the impact of a newly implemented production process, factories using the updated method constitute the treated group, while those adhering to the old process make up the control group. The effects under scrutiny involve measuring improvements in efficiency and product quality resulting from the new production process. Covariates such as factory size, workforce skill levels, and historical production performance are considered during the analysis to account for potential confounding factors. Propensity score matching aids in providing a robust causal estimation, enabling manufacturers to make data-driven decisions about process optimization.

Propensity score matching stands as a powerful tool allowing researchers and decision-makers to glean insights into the true impact of interventions. By carefully balancing treated and control groups based on covariates, we unlock the potential to discern causal relationships in observational data, overcoming confounding variables and providing a more accurate understanding of cause and effect. This method has widespread applications across diverse industries, from retail and finance to telecom, manufacturing, and marketing.

Build your Computer Vision Application – Part VI: Road pothole detector using Tensorflow Object Detection API

This is the sixth post of the series where we build a road sign and pothole detection application. We will be using multiple methods throughout this series, including computer vision techniques using opencv, annotating images using labelImg, mastering the Tensorflow object detection API, training object detection models using transfer learning, object detection on video etc. This series will be split across 8 posts.

1. Introduction to object detection

2. Data set preparation and annotation Using LabelImg

3. Building your object detection model from scratch using Image pyramids and sliding window

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API ( This Post)

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In this post we will discuss in detail the process for training an object detector using the Tensorflow Object Detection API(TFODAPI).

Introduction

Over the past few posts of this series we explored many frameworks through which we created object detection models to detect potholes on roads. All the frameworks which we explored till post 5 were about some specific type of model. However, in this post we are going to do something different. We will learn about a great utility for object detection called the Tensorflow Object Detection API ( TFODAPI ). With this API we can train custom object detection models using different types of networks. In this post we will use TFODAPI to build our pothole detector. Let us dive in.

Installation of Tensorflow Object Detection API

The pre-requisite for Tensorflow Object Detection is the installation of Tensorflow. To install Tensorflow on your machine you can follow the following link.

Once Tensorflow is installed, we can proceed with the installation of TFODAPI . This installation has 4 major steps.

Downloading Tensorflow model garden
Protobuf installation / compilation
COCO API installation
Install object detection API.

You can do these step wise installation using the following link.

If the installation steps are correct, on testing your installation you should get the following screen

Once all the installations are correct you will have the following folder structure.

Please note that in the installation link provided above, the root folder would be named as 'Tensorflow', however in the installation followed here the root folder is named as 'TFODAPI'. Other than that, the important folder which you need to verify is the /models folder and the other folders created under it. Once this structure is in place, we can get into the next step which is to start the training process using the Custom object detector.

Training a Custom Object detector

Having installed the Tensorflow object detection API, it’s now time to get to the training process. In the training process we will be covering the following steps

Create the workspace for training
Generate tf records from the annotated dataset
Configure the training pipeline and monitor progress
Export the resulting model and use it to detect potholes

Let us start with the first process

Workspace for training

We start off, creating the following sub-folders within our existing folder structure.

We first create a folder called workspace, under the TFODAPI folder. The workspace folder is where we keep all the training configurations. Let us look at the subfolders of the workspace folder.

training_pothole : This folder is where the training process gets implemented. Each time we do a training, it is advisable to create a new training_pothole subfolder. This folder has different subfolders under it as follows.

annotations : This folder will contain the train and test data in a format called tf.records. We will see how to create the tf.records in a short while.

exported-models : After the training is complete we export the model object to do inference using the trained model. This folder will contain the model we will use for inference.

images : This folder contains the raw train and test images which we want to train.

models : This folder will contain a subfolder for each training job we implement. For example, I have created the current training using a ssd_resnet50 model. So you will find a folder related to that as shown in the image below

Once the training is initiated you will have all the training related checkpoints and also the *.config file which contains all the parameters within this subfolder.

pre-trained-models : This folder contains the pre-trained models which we use to initiate our training process. So every type of pretrained model we use will be in a separate subfolder as shown in the image below.

These are the different folders which you will have to create to initiate the training process.
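
If you would rather create these folders from Python than by hand, the following is a minimal sketch. The function name create_workspace and the SUBFOLDERS list are my own illustrative names, and the folder names are simply the ones described above.

```python
from pathlib import Path

# Sub-folders of training_pothole, named as described in this post
SUBFOLDERS = ["annotations", "exported-models", "images", "models", "pre-trained-models"]

def create_workspace(root):
    """Create workspace/training_pothole and its sub-folders under root."""
    training_dir = Path(root) / "workspace" / "training_pothole"
    for name in SUBFOLDERS:
        # parents=True creates the missing parent folders,
        # exist_ok=True makes the call safe to repeat
        (training_dir / name).mkdir(parents=True, exist_ok=True)
    return training_dir
```

You would call it with the path to your own root folder, for example create_workspace('/BayesianQuest/Pothole/TFODAPI').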

Having seen all the constituent folders within the workspace, let us now get into the training process. As a first step in the training process, let us create the train and test records.

Creating train and test records

Before creating the train and test records, we will have to split the total data into train and test sets using the train_test_split function in scikit learn. After creating the train and test sets, we will move those files inside the train and test folders which are within the images folder. We will do all these processes in the Jupyter notebook.

We will start by importing the necessary library files

import glob
import pandas as pd
import os
import random
from sklearn.model_selection import train_test_split
import shutil

Next let us change our current directory in the Jupyter notebook to the TFODAPI directory. Please note that you will have to give the correct path where your root folder lies instead of the path which is represented here below

%cd /BayesianQuest/Pothole/TFODAPI

Let us also list down all the images we annotated in post 2. We will be using the same set of images in this post.

# List down all the annotated images
random.seed(123)
# Initialize the folder where the annotated images are placed
datafolder = '/BayesianQuest/Pothole/data/annotatedImages'
# List down all the images in the data folder
images = glob.glob(datafolder + '/*.jpeg')
print(len(images))
images

As seen in the output, I have taken around 18 images for this process. The number of images you use is your prerogative; the more the better.

Let us now sort the images and then split the data into train and test sets.

# Let us sort the images and then split them into train and test sets
images.sort()

# Split the dataset into train and test sets
train_images, test_images = train_test_split(images,test_size = 0.1, random_state = 123)

print('Total train images :',len(train_images))
print('Total test images:',len(test_images))

After having split the data into train and test sets, we need to move the files into the images folder . We need to create two folders under the images folder and name them train and test.

# Creating the train and test folders inside the workspace images folder
!mkdir workspace/training_pothole/images/train workspace/training_pothole/images/test

Now that we have the train and test folders created let us move the files to the destination folders . We will move the file using the below function.

# Utility function to move images
def move_files_to_folder(list_of_files, destination_folder):
    for f in list_of_files:
        try:
            shutil.move(f, destination_folder)
        except Exception:
            # Print the file that could not be moved, then stop with the original error
            print(f)
            raise

Let us move the files using the above function

# Move the splits into their folders
move_files_to_folder(train_images, 'workspace/training_pothole/images/train')
move_files_to_folder(test_images, 'workspace/training_pothole/images/test/')

Next we will explore the creation of tf records, a format which is required to read data into TFODAPI.

Creation of tf.records file from the images

In this section we will switch gears and then go about executing the next process in python scripts.

When initiating training, we will be using many pre-defined methods and classes which come with the API. Most of them are within the models/research/object_detection folder in our root folder, TFODAPI, as shown below

To utilise them in our training and inference scripts, we need to add those paths to the environment. In Linux this can easily be done by exporting those paths in a shell script ( .sh file ). Let us first create a shell script to access all these paths.

Open a text editor, create a file called setup.sh and add the following lines to the file

#!/bin/sh
export  PYTHONPATH=$PYTHONPATH:/BayesianQuest/Pothole/TFODAPI/models/research:/BayesianQuest/Pothole/TFODAPI/models/research/slim

This file adds the TFODAPI/models/research and TFODAPI/models/research/slim folders to PYTHONPATH. The path to TFODAPI must be changed according to your specific paths. Also please note that the export keyword and the paths must be on the same line.

On Windows systems, you can add these paths to the environment variables.

Once the file is created, save that in the folder TFODAPI as shown below

To execute the shell script, open a terminal and then execute the following commands

There will not be any message or output after executing this script. You will be returned to your terminal prompt after execution.

This will ensure that all the paths are entered as environment variables.
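
If setting environment variables is inconvenient on your machine, a rough alternative is to extend sys.path from inside the Python script itself, before any object_detection import. The following is only a sketch under that assumption; the helper name add_tfodapi_paths is hypothetical and the root path must be replaced with your own.

```python
import sys

def add_tfodapi_paths(root):
    """Append the research and slim folders under root to sys.path if missing."""
    paths = [root + "/models/research", root + "/models/research/slim"]
    for p in paths:
        # Avoid adding the same folder twice when the script is re-run
        if p not in sys.path:
            sys.path.append(p)
    return paths
```

For example, add_tfodapi_paths('/BayesianQuest/Pothole/TFODAPI') placed at the top of a script makes the two folders importable for that process only, which is why the shell script remains the more general option.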

Creation of label maps

TFODAPI requires a label map, which maps each of the labels to an integer value. This label map is used both by the training and detection processes. The mapping is based on the classes present in the pothole_df.csv file we created in post 2 of this series.

# Reading the csv file
pothole_df = pd.read_csv('../pothole_df.csv')
pothole_df.head()

pothole_df['class'].unique()

To create a label map, open a text editor, create a file named label_map.pbtxt and include the below mapping in that file.

item {
    id: 1
    name: 'pothole'
}
item {
    id: 2
    name: 'vegetation'
}
item {
    id: 3
    name: 'sign'
}
item {
    id: 4
    name: 'vehicle'
}

This file has to be placed in the folder 'annotations' in our workspace.
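
Instead of typing the mapping by hand, the same text can be generated from a list of class names. This is a minimal sketch; the function name build_label_map is my own, and the class list is assumed to be in the same order as the mapping shown above, with ids starting at 1.

```python
def build_label_map(class_names):
    """Return label map text with one item per class, ids starting at 1."""
    items = []
    for idx, name in enumerate(class_names, start=1):
        items.append("item {\n    id: %d\n    name: '%s'\n}" % (idx, name))
    return "\n".join(items) + "\n"

# Class names in the order used by the mapping above
classes = ["pothole", "vegetation", "sign", "vehicle"]
label_map_text = build_label_map(classes)
```

The resulting label_map_text can then be written to annotations/label_map.pbtxt with a normal file write.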

Creation of tf.records

Now we have all the required files to create our tf.records. Let us open the text editor, create a file named generate_tfrecord.py and insert the following code.

import os
import glob
import pandas as pd
import io
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'    # Suppress TensorFlow logging (1)
import tensorflow.compat.v1 as tf
import argparse
from PIL import Image
from object_detection.utils import dataset_util, label_map_util


# Define the argument parser
arg = argparse.ArgumentParser()
arg.add_argument("-l","--labels-path",help="Path to the labels .pbtxt file",type=str)
arg.add_argument("-o","--output-path",help="Path to the output .tfrecord file",type=str)
arg.add_argument("-i","--image_dir",help="Path to the folder where the input image files are stored. ", type=str, default=None)
arg.add_argument("-a","--anot_file",help="Path to the annotation csv file. ", type=str, default=None)

args = arg.parse_args()

# Load the labels files
label_map = label_map_util.load_labelmap(args.labels_path)
label_map_dict = label_map_util.get_label_map_dict(label_map)

# Function to extract information from the images
def create_tf_example(path,annotRecords):
    with tf.gfile.GFile(path, 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size
    # Get the filename from the path
    filename = path.split("/")[-1].encode('utf8')
    image_format = b'jpeg'
    # Get all the lists to store the records
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []
    # Iterate through the annotation records and collect all the records
    for index, row in annotRecords.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(label_map_dict[row['class']])
    # Store all the examples in the format we want
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))

    return tf_example


def main(_):

    # Create the writer object
    writer = tf.python_io.TFRecordWriter(args.output_path)
    # Get the annotation file from the arguments
    annotFile = pd.read_csv(args.anot_file)
    # Get the path to the image directory
    path = os.path.join(args.image_dir)
    # Get the list of all files in the image directory
    imgFiles = glob.glob(path + "/*.jpeg")
    # Read each of the file and then extract the details
    for imgFile in imgFiles:
        # Get the file name from the path
        fname = imgFile.split("/")[-1]        
        # Get all the records for the filename from the annotation file
        annotRecords = annotFile.loc[annotFile.filename==fname,:]
        tf_example =  create_tf_example(imgFile,annotRecords)
        # Write the records to the required format
        writer.write(tf_example.SerializeToString())
    writer.close()
    print('Successfully created the TFRecord file: {}'.format(args.output_path))
if __name__ == '__main__':
    tf.app.run()

In lines 1-9 we import the necessary library files, and in lines 13-19 we define the arguments.

In line 14, we define the path to the label map ( .pbtxt ) file we created earlier

We define the path where we will be writing the .tfrecord file in line 15. In our case this is the path to the annotations folder.

The next argument we provide in line 16, is the path to the images folder. Here we give either the train folder or test folder.

The final argument, in line 17 is the path to the annotation file i.e pothole_df.csv file.

Next task is to process the label mapping file we created. For processing this file we use two utility functions which are part of the Tensorflow Object detection API, which we imported in line 9. After the processing in line 23, we get a label map dictionary, which is further used in creation of the tf.records files.

Lines 26-67 define a function used for extracting features from the images and the label maps to create the tf.record. Let us look at the function

The parameters to the function are the following

path : This is the path to the image we are going to process

annotRecords : These are the rows of the pothole_df.csv file which contain information about the image and the bounding boxes in that image.

Moving on inside the function, lines 26-29 use TensorFlow's gfile module for reading the input image file. This module provides an API that is close to Python's file I/O objects. TensorFlow exposes it as tf.io.gfile ( tf.gfile under the compat.v1 API used here ), so that you can use these implementations for saving and loading checkpoints, writing to TensorBoard logs, and accessing training data.

In lines 30-31, the image is opened and its dimensions are read.

The filename is extracted from the path in line 33 and in line 34 the file format is defined.

Lines 36-49 collect the bounding box coordinates, divided by the image width and height, into the respective lists, and also store the class name both as a string and as the numerical id from the label map.
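
To make the scaling step concrete, here is a small standalone sketch of the same division. The function name normalise_box and the 640 x 480 image with its pixel coordinates are made up purely for illustration.

```python
def normalise_box(xmin, ymin, xmax, ymax, width, height):
    """Scale pixel coordinates to the 0-1 range by dividing by the image size."""
    return (xmin / width, ymin / height, xmax / width, ymax / height)

# Hypothetical 640 x 480 image with one bounding box given in pixels
box = normalise_box(64, 120, 320, 360, width=640, height=480)
print(box)
```

The x values are divided by the width and the y values by the height, which is exactly what the script does for every row of the annotation records.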

Finally in lines 51-63, all the information extracted from the image and its class names is stored in a format called tf.train.Example. Once this information is packed in the tf.train.Example object it gets written to the tf.record format. That takes us to the end of the function and now we will see the complete process, where this function will be called to extract information from the images.

Lines 72-89 are where the process gets executed. Let us see them line by line.

In line 72, the writer is defined using the TFRecordWriter() method; it writes to the output path in the .record format ( for eg. train.record / test.record )

We read the annotation csv file in line 74, then get the path to the image directory in line 76 and list down all the image paths in line 78.

We then iterate through each of the image paths in line 80 for further feature extraction within the loop.

We extract the file name from the path in line 82 and then get all the annotation information for the file from the annotation csv file in line 84

We extract all the information of the file in line 85 using the create_tf_example() function we saw earlier and get the tf_example object. This object is finally written as a string in the .record in line 87

The writer object is closed after all the image files are processed.

We will save the generate_tfrecord.py in the scripts/preprocessing folder as shown below

To run the file, we will open a terminal and then execute the command in the following format.

$ python generate_tfrecord.py -i [path to images folder] -a [path to annotation csv file] -l [path to label map .pbtxt file] -o [path to the output .record file]

For example

You need to run this command separately for the train images and the test images, changing the path of the images folder and the .record file name depending on whether it is train or test. Once these scripts are executed you will find the train.record and test.record files in the annotations folder as shown below.

That takes us to the end of train and test record processing steps. Next we will start the training process.

Training the Pothole Detection model using pre-trained model

We will not be training the model from scratch; rather we will be fine tuning a pre-trained model for our purpose. The pre-trained model we will be using is SSD ResNet50 V1 FPN 640×640. These pre-trained models are available in the TensorFlow 2 Detection Model Zoo. Later on I would encourage you to implement the same detector using a Faster RCNN model from this repository.

We start our training process by downloading the model we want to implement from the TensorFlow 2 Detection Model Zoo.

Once we click on the link, a .tar.gz file gets downloaded to your local drive. Extract the contents of the tar file and then move the complete folder into the folder pre-trained-models. Since we extracted the model SSD ResNet50 V1 FPN 640×640, our folder, pre-trained-models will have the following structure.

If you download more models, you need to maintain a separate folder for each model you want to use. I have downloaded the Faster RCNN model also, and now the structure looks like the following.

Creating the training pipeline

After extracting the contents of the model to the pre-trained-models folder, we will now create a new folder under the folder workspace/training_pothole/models and name it my_ssd_resnet50_v1_fpn. We then copy the pipeline.config file from the folder pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8 and place it in the new folder my_ssd_resnet50_v1_fpn. Now the structure will look like the below.

Please note that I also have faster_rcnn model here. So for each model you download the structure will look like the above.

Now that we have copied the pipeline.config file, we will have to make changes to the file to cater to our specific purpose.

Change 1 : The first change we have to make is in line 3 for the number of classes. We need to change the number of classes to 4

Change 2 : The next change is in line 131 for the batch size. Depending on the number of examples, you need to change the batch size.

Change 3 : The next optional change is for the number of training steps as in line 152 and 154. Depending on the configuration of your machine you can change it to the number of steps you want to train the model.

Change 4 : Path to the check point of the pre-trained model in line 161

Change 5 : Change the fine tune checkpoint type to "detection" from the default "classification" in line 167

Change 6 : label_map_path and train record paths , line 172 and 174

Change 7: label_map_path and test record paths, line 182 and 186
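
Changes 1, 2, 4 and 5 can also be applied with a small script instead of editing by hand. The following is only a sketch, assuming the fields are called num_classes, batch_size, fine_tune_checkpoint and fine_tune_checkpoint_type and that the first occurrence of each is the one to change. The function name customise_config, the sample text and the checkpoint path are illustrative, so check the result against your own pipeline.config, whose line numbers may differ from the ones listed above.

```python
import re

def customise_config(text, num_classes, batch_size, checkpoint_path):
    """Return the config text with four fields replaced (first match of each)."""
    text = re.sub(r"num_classes:\s*\d+",
                  lambda m: "num_classes: %d" % num_classes, text, count=1)
    text = re.sub(r"batch_size:\s*\d+",
                  lambda m: "batch_size: %d" % batch_size, text, count=1)
    # The colon right after 'checkpoint' keeps this from matching the _type field
    text = re.sub(r'fine_tune_checkpoint:\s*"[^"]*"',
                  lambda m: 'fine_tune_checkpoint: "%s"' % checkpoint_path, text, count=1)
    text = re.sub(r'fine_tune_checkpoint_type:\s*"[^"]*"',
                  lambda m: 'fine_tune_checkpoint_type: "detection"', text, count=1)
    return text

# A heavily shortened, made-up stand-in for a real pipeline.config
sample = '''model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 64
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
'''
updated = customise_config(
    sample, num_classes=4, batch_size=8,
    checkpoint_path="pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0")
print(updated)
```

The label map and record paths of changes 6 and 7 appear once for training and once for evaluation, so I would still set those by hand to avoid mixing up the train and test records.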

Now that the config file is customised, it is time to start our training process.

Training the model

We have a script which is part of the API to do the training. This can be copied from the folder TFODAPI/models/research/object_detection/model_main_tf2.py. This needs to be placed in the training_pothole folder as shown below.

We are all set to start the training of our model. To start the training, you can change directory to the training_pothole folder and enter the following command on the terminal.

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

Training is a time consuming process. Depending on the speed of your computer it might take hours to complete. The process might seem stuck as no output will be printed for a long time. However you need to be patient and wait for it to complete. The metrics will be printed every 100 steps, as shown in the output above.

You will be able to monitor the training process using Tensorboard. You need to open a terminal, change directory to training_pothole and then enter the following command in the terminal

You will get the following output and tensorboard will be active on port 6006

Once you click on the link for the port 6006, you will see metrics like the below on tensorboard.

Once training is complete you will find a sessions folder called train and the checkpoints created inside the my_ssd_resnet50_v1_fpn folder.

We now need to export the trained models for the inference process. This means that the model object is exported from the latest checkpoint to a new folder from which we will do our predictions.

To get this done, we first need to copy the file, TFODAPI/models/research/object_detection/exporter_main_v2.py and then paste it inside the training_pothole folder.

Now open a terminal, change directory into the training_pothole directory and then enter the following command.

python exporter_main_v2.py --input_type image_tensor --pipeline_config_path models/my_ssd_resnet50_v1_fpn/pipeline.config --trained_checkpoint_dir models/my_ssd_resnet50_v1_fpn/ --output_directory exported-models/my_model

You will now see the model object and the checkpoint information in the exported-models/my_model folder.

We can now initiate the inference process after this.

Inference Process

The inference process is where we test the model on new images. We will implement the inference process using a new script. The code for the inference step is heavily inspired by the following link

Open your text editor, create a new file, name it inference_load_model.py and add the following code into it.

import time
from object_detection.utils import label_map_util
from object_detection.utils import config_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'    # Suppress TensorFlow logging (1)
import tensorflow as tf
import numpy as np
from PIL import Image
import warnings
warnings.filterwarnings('ignore')   # Suppress Matplotlib warnings
import glob

First we import all the necessary packages. The packages in lines 2-5 come from the API code we downloaded. These will be available in the object detection folder.

Next we will define some of the paths to the exported model folder.

# Define the path to the model directory
PATH_TO_MODEL_DIR = "exported-models/my_model"
PATH_TO_CFG = PATH_TO_MODEL_DIR + "/pipeline.config"
PATH_TO_CKPT = PATH_TO_MODEL_DIR + "/checkpoint"

In lines 16-18, we define the paths to the model we exported, the config file and the model checkpoint. This information will be used to load the model for predictions.

We will now load the model using the check point information.

print('Loading model... ', end='')
start_time = time.time()

# Load pipeline config and build a detection model
configs = config_util.get_configs_from_pipeline_file(PATH_TO_CFG)
model_config = configs['model']
detection_model = model_builder.build(model_config=model_config, is_training=False)

# Restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(PATH_TO_CKPT, 'ckpt-0')).expect_partial()

end_time = time.time()
elapsed_time = end_time - start_time
print('Done! Took {} seconds'.format(elapsed_time))

We load the model in line 26 and restore the checkpoint information in lines 29-30.

Next we will see two utility functions which will be used in the inference cycle.

@tf.function
def detect_fn(image):
    """Detect objects in image."""
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    detections = detection_model.postprocess(prediction_dict, shapes)

    return detections

def load_image_into_numpy_array(path):
    """Load an image from file into a numpy array. """
    return np.array(Image.open(path))

The first function generates the detections from the image. In line 39 the image is preprocessed, and in line 40 we run the prediction to get the prediction dictionary. The prediction dictionary consists of different elements which are required to create the bounding boxes for the objects. In line 41, the prediction dictionary is postprocessed to get the final detection dictionary, which again consists of the elements required for bounding box creation.

The second function in lines 45-47 is a simple one to convert the image into an np.array.

Next we will initialise the labels and also get the path of the test images in lines 49-53

# Get the annotations
PATH_TO_LABELS = "annotations/label_map.pbtxt"
category_index = label_map_util.create_category_index_from_labelmap(PATH_TO_LABELS,use_display_name=True)
# Get the paths of the images
IMAGE_PATHS = glob.glob("BayesianQuest/Pothole/data/test" + '/*.jpeg')

We now have all the components to start the inference process. We will iterate through each of the test images and then create the bounding boxes. Let us see the complete process for that now.

for image_path in IMAGE_PATHS:
    print('Running inference for {}... '.format(image_path), end='')
    # Convert image into a np array
    image_np = load_image_into_numpy_array(image_path)
    # Convert the image array to a tensor after expanding the dimension to include batch size also    
    input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
    # Get the detection
    detections = detect_fn(input_tensor)    
    # Get all the objects which were detected
    num_detections = int(detections.pop('num_detections'))
    detections = {key: value[0, :num_detections].numpy()
                  for key, value in detections.items()}
    detections['num_detections'] = num_detections
    # detection_classes should be ints.
    detections['detection_classes'] = detections['detection_classes'].astype(np.int64)
    # Create offset for labels for visualisation
    label_id_offset = 1
    image_np_with_detections = image_np.copy()
    # Visualise the images along with the bounding boxes and labels
    viz_utils.visualize_boxes_and_labels_on_image_array(
            image_np_with_detections,
            detections['detection_boxes'],
            detections['detection_classes']+label_id_offset,
            detections['detection_scores'],
            category_index,
            use_normalized_coordinates=True,
            max_boxes_to_draw=200,
            min_score_thresh=.45,
            agnostic_mode=False)
    # Show the images with bounding boxes
    img = Image.fromarray(image_np_with_detections, 'RGB')
    img.show()

We iterate through each of the test images in line 55 and then get the detections in line 62 after all the necessary pre-processing in the previous lines.

In the pipeline.config file the maximum number of total detections is set to 100 ( line 104 of the pipeline.config file ). Therefore every element in the detection dictionary caters to 100 objects. However, the number of objects actually detected could be far less than that. So for the next steps, we only need to take those objects which were detected by the model. Lines 64-69 implement the steps for selecting only those objects.

Once we have only the objects which were detected, it is time to visualise them along with the bounding boxes and the labels. These steps are implemented in lines 71-86. In line 82, we specify a threshold for accepting objects. Only those objects whose score is greater than the threshold will be visualised.
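
The two selection steps, trimming the padded outputs to the number of detections and then applying the score threshold, can be illustrated without TensorFlow. This is a simplified sketch using plain Python lists in place of tensors; the function name keep_confident_detections and the sample values are made up.

```python
def keep_confident_detections(boxes, classes, scores, num_detections, min_score=0.45):
    """Drop padded entries, then keep detections scoring above min_score."""
    kept = []
    # Slicing to num_detections removes the padding slots at the end
    trimmed = zip(boxes[:num_detections], classes[:num_detections], scores[:num_detections])
    for box, cls, score in trimmed:
        if score > min_score:
            kept.append((box, cls, score))
    return kept

# Hypothetical output padded to 4 slots, of which only 3 are real detections
boxes = [[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9], [0.2, 0.6, 0.3, 0.8], [0.0, 0.0, 0.0, 0.0]]
classes = [1, 4, 1, 0]
scores = [0.92, 0.51, 0.30, 0.0]
confident = keep_confident_detections(boxes, classes, scores, num_detections=3)
print(len(confident))
```

In the actual script the thresholding happens inside the visualisation utility through the min_score_thresh argument, so this sketch only mirrors the logic.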

To implement the script, open the terminal and enter the following command

You should see outputs similar to the below after this script is run.

We can see that there are some good localisations for the potholes. All these were achieved with very limited images. With more images and better pre-processing techniques, we will be able to get much better results than what we have now.

What Next ?

So far in this series we have seen different frameworks for object detection. We started with legacy methods like image pyramids and then explored more robust methods like RCNN and YOLO. Finally in this post, we learned to implement object detection using a great utility, the Tensorflow Object Detection API. Now we will move ahead from what we have learned so far. The next step is to apply the techniques we learned in some real world scenarios like analyzing video files. That will be our endeavor in the next post. To be notified of the next post please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

You can also access the code base for this series from the following git hub link

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Build your Computer Vision Application – Part V: Road pothole detector using YOLO-V5

This is the fifth post of the series where we build a pothole detection application. We will be using multiple methods throughout this series, which include computer vision techniques using Opencv, annotating images using LabelImg, mastering the Tensorflow object detection API, training object detection models using transfer learning, object detection on video etc. This series will be split across 8 posts.

1. Introduction to object detection

2. Data set preparation and annotation Using LabelImg

3. Building your object detection model from scratch using Image pyramids and sliding window

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLOV5 ( This Post )

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In this post we will build our pothole detector using YOLO-V5. Let us start our process.

Introduction to YOLO

YOLO, which stands for “You only look once”, is one of the most popular object detectors in use. The algorithm is designed so that the network generates its predictions in a single pass of forward propagation. YOLO achieves very high accuracy and works really well in real time detection.

YOLO takes a batch of images of shape (m, 224,224,3) and then outputs a list of bounding boxes along with their confidence scores and class labels, (p_c,b_x,b_y,b_w,b_h,c).

The output generated will be a grid of dimensions S x S ( eg. 19 x 19 ) with each grid cell having a set of B anchor boxes. Each box will contain 5 basic values: a confidence score and 4 bounding box values. Along with these 5 basic values, each box will also have the probabilities of the classes. So if there are 10 classes, there will be in total 15 ( 5 + 10 ) values in each box. Let us look at the process in detail
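
The arithmetic above can be written down as a tiny helper. This is a sketch that only follows the description in this post; the function name yolo_output_shape and the choice of B = 2 are my own assumptions.

```python
def yolo_output_shape(S, B, num_classes):
    """Return the output shape (S, S, B, 5 + num_classes) for an S x S grid."""
    # 1 confidence score + 4 bounding box values + one probability per class
    values_per_box = 5 + num_classes
    return (S, S, B, values_per_box)

# 19 x 19 grid, 2 anchor boxes per cell, 10 classes
shape = yolo_output_shape(S=19, B=2, num_classes=10)
print(shape)
```

With 10 classes every box carries the 15 values mentioned above, whatever the grid size is.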

The start of the process in YOLO is to divide the image into an S x S grid. Here S can be any integer value. For our example let us take S to be 4.

Each cell would predict B boxes with a confidence score. Again B can be decided based on the number of objects that can be contained in a cell. An important condition that needs to be met is that the center of the box should be within the cell. These B boxes are called the anchor boxes.

In our case, let us consider that B = 2. So each cell will predict 2 boxes where there is some probability of an object. Let us take the grid as shown in the above picture, where two boxes are predicted. That cell was able to detect a pothole and the car, and we can also see that the centers of the boxes are in the same cell. This process of predicting boxes happens for every cell within the image. In the course of this step multiple overlapping boxes will be predicted across all the cells of the image.
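
The condition that the center of a box must lie within the cell can be checked with a few lines of code. This is a minimal sketch, assuming pixel coordinates with the origin at the top left corner of the image; the function name cell_for_centre and the sample image size are illustrative.

```python
def cell_for_centre(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the point (cx, cy)."""
    # min(...) keeps points lying exactly on the right or bottom edge inside the grid
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

# Box center at (250, 130) in a hypothetical 400 x 400 image with S = 4
cell = cell_for_centre(cx=250, cy=130, img_w=400, img_h=400, S=4)
print(cell)
```

Each cell in this example is 100 pixels wide and tall, so the center falls in the second row and third column when counting from one.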

Along with the boxes and confidence scores a class probability map is also predicted. A class probability map gives the likelihood of the presence of a class in each of the cells. For example, vehicle in cell 2,3,4 …. and pothole in cell 9,10,11,…. etc.

The class probability maps enable the network to assign a class to each of the bounding boxes. Finally non maxima suppression is applied to reduce the number of overlapping boxes and get the bounding boxes of only the objects we want to classify.
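
To give a feel for non maxima suppression, here is a minimal sketch of the classic greedy version in plain Python. It is not the implementation used inside YOLO-V5; the function names, the 0.5 overlap threshold and the sample boxes are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes and one separate box
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first one heavily and has a lower score, so it is dropped, while the third box does not overlap anything and is kept.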

Having seen an overview of the end to end process, let us look at the output or predictions from each cell. Let us look specifically at a cell shown in the image below.

Each of the cells predicts a confidence score, which indicates if there is an object in the cell. Along with the confidence score, the bounding box of the object and the class of the object are also predicted. The class label can be an integer like 2 or 1, or it could be a one hot encoded representation of the predicted class ( eg. [0,0,1] ).

Having got an overview of YOLO , let us get into the implementation details.

Implementation of YOLO-V5

We will be managing the process through a Jupyter notebook. As this is a pre-trained model, we will not have too many activities to control in the process. The total process of implementation would have the following steps

Downloading the YOLO V5 model files
Preparing the annotated files
Preparing the train, validation and test sets
Implementing the training process
Executing the inference process using the trained model

We will be training our custom Yolo model using Pytorch. Let us start by importing all the packages we require.

import os
import glob
import random
import shutil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import PIL
from PIL import ImageDraw
from sklearn.model_selection import train_test_split
from IPython.display import Image  # for displaying images

In the first step we clone the official repository of YOLOV5. We do it from the terminal or we can execute the same from Jupyter notebook too. Let us clone the repository from the Jupyter notebook.

! git clone https://github.com/ultralytics/yolov5

After the clone we will find a folder of YOLOV5 created in the folder where the Jupyter notebook resides.

The Yolov5 folder will have many more default folders under it. The folder structure will look like the below.

Please note that the folder ‘potholeData‘ will not be part of the default yolov5 folder. This folder will be created by us in a moment from now.

We will now change directory to the yolov5 folder we just created. All the processes we execute from here on will be run from that folder.

Next we will prepare the annotated file

Prepare annotation file

To prepare the annotated file we will use the annotation csv file which we created in post 2. Let us first read the file.

# Reading the csv file
pothole_df = pd.read_csv('BayesianQuest/Pothole/pothole_df.csv')
pothole_df.head()

Now we will create a class map, which is a dictionary which maps each of our classes to an integer value.

# First get the list of all classes
classes = pothole_df['class'].unique().tolist()
# Create a dictionary for storing class to ID mapping
classMap = {}

for i,cls in enumerate(classes):
    # Map a class name to an integer ID
    classMap[cls] = i
    
classMap

Next we will extract the bounding box information of the images from the csv file in the specific format which is required for YoloV5. We also need to store the images and the annotation files ( labels ) in specific folders. Let us create the folders before we extract the bounding box information.

# Create the main data folder
!mkdir potholeData
# Create images and labels data folders
!mkdir potholeData/images
!mkdir potholeData/labels
# Create train,val and test data folders for both images and labels
!mkdir potholeData/images/train potholeData/images/val potholeData/images/test  potholeData/labels/train potholeData/labels/val potholeData/labels/test

After creation of these folders, our folder structure will look like the following

Now that we have created the data folders, let us start extracting the bounding box information. To do that we need to iterate through all the images we have and then write the bounding box information to a .txt file for each image, as required by YoloV5. Let us look at the code to do that.

# Creating the list of images from the csv file
imgs = pothole_df['filename'].unique().tolist()
# Loop through each of the images
for img in imgs:
    boundingDetails = []
    # First get the bounding box information for a particular image from the csv file
    boundingInfo = pothole_df.loc[pothole_df.filename == img,:]
    # Loop through each row of the details
    for idx, row in boundingInfo.iterrows():
        # Get the class Id for the row
        class_id = classMap[row["class"]]
        # Convert the bounding box info into the format for YOLOV5
        # Get the width
        bb_width = row['xmax'] - row['xmin']
        # Get the height
        bb_height = row['ymax'] - row['ymin']
        # Get the centre coordinates
        bb_xcentre = (row['xmin'] + row['xmax'])/2
        bb_ycentre = (row['ymin'] + row['ymax'])/2
        # Normalise the coordinates by dividing by the image width and height
        bb_xcentre /= row['width']
        bb_ycentre /= row['height']
        bb_width    /= row['width']
        bb_height   /= row['height']
        # Append details in the list
        boundingDetails.append("{} {:.3f} {:.3f} {:.3f} {:.3f}".format(class_id, bb_xcentre, bb_ycentre, bb_width, bb_height))
    # Create the file name to save this info
    file_name = os.path.join("potholeData/labels", img.split(".")[0] + ".txt")
    # Save the annotation to disk, closing the file once it is written
    with open(file_name, "w") as f:
        f.write("\n".join(boundingDetails) + "\n")

In line 2, we list all the image ids from the csv file and then iterate through each of the image ids in line 4.

We initialize a list in line 5 to capture the bounding box information and then get the bounding box information for the current image in line 7.

The bounding box information for each image is iterated through in line 9 and then we extract the class id in line 11 using the classMap dictionary we created.

In lines 14-19, the bounding box information is extracted. When we created the annotations in post 2, we extracted the co-ordinates of the top left corner and the bottom right corner. However Yolo requires the width, height and the co-ordinates of the center of the bounding box. In these lines we convert the coordinates to what is required by Yolo.

In lines 21-24, the co-ordinates are normalized by dividing them by the width and height of the image. The normalized values are appended to the list in line 26 and written to a text file from line 28 onwards.
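The conversion described above can be checked on a single box. The sketch below repeats the same arithmetic in a small function; `to_yolo_format` is a hypothetical helper and the numbers are made up for illustration.

```python
def to_yolo_format(xmin, ymin, xmax, ymax, img_width, img_height):
    """Convert corner coordinates (in pixels) to the normalised
    (x_centre, y_centre, width, height) values expected by Yolo."""
    bb_width = xmax - xmin
    bb_height = ymax - ymin
    bb_xcentre = (xmin + xmax) / 2
    bb_ycentre = (ymin + ymax) / 2
    # Normalise by the image dimensions so every value lies between 0 and 1
    return (bb_xcentre / img_width, bb_ycentre / img_height,
            bb_width / img_width, bb_height / img_height)


# Made-up box in a 1000 x 800 image, written as one annotation line for class 0
values = to_yolo_format(100, 200, 300, 400, img_width=1000, img_height=800)
print("0 " + " ".join("{:.3f}".format(v) for v in values))
# prints: 0 0.200 0.375 0.200 0.250
```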

After executing this step you will be able to see the annotations as txt files in the labels folder.

Having completed the annotation of the data, let us prepare the train, test and validation sets.

Preparing the train, test and validation sets

To train the Yolo model, we need all the train, test & validation images and annotation text files in the respective folders which we created ( eg : ‘/images/train’, ‘labels/train’ etc). In this section we will list down the paths of the images and annotation texts, split the paths to train, test and validation sets and then copy the images and annotation files to the right folders. Let us see how we do that.

First let us get the paths of the annotation text files and images

# Get the list of all annotations
annotations = glob.glob('potholeData/labels' + '/*.txt')
annotations

# Get the list of images from its folder
imagePath = '/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/BayesianQuest/Pothole/data/annotatedImages'
images = glob.glob(imagePath + '/*.jpeg')
images

Please remember to change the path of the images to the correct path where your images are placed in your system.

Next we sort the images and annotation files and then split the data into train/test/val sets

# Sort the annotations and images and then prepare the train, test and validation sets
images.sort()
annotations.sort()

# Split the dataset into train-valid-test splits 
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 123)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 123)
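The split above relies on the sorted image list and the sorted annotation list lining up one to one. A small check like the following can confirm that before splitting. `check_pairs` is a hypothetical helper, and it assumes each annotation file shares its base name with its image.

```python
import os


def check_pairs(images, annotations):
    """Return the image/annotation pairs whose base names differ.
    An empty list means both sorted lists line up."""
    if len(images) != len(annotations):
        raise ValueError("Got {} images but {} annotations".format(len(images), len(annotations)))
    mismatches = []
    for img, ann in zip(sorted(images), sorted(annotations)):
        # Compare file names without folder and extension
        img_stem = os.path.splitext(os.path.basename(img))[0]
        ann_stem = os.path.splitext(os.path.basename(ann))[0]
        if img_stem != ann_stem:
            mismatches.append((img, ann))
    return mismatches


# Made-up file names for illustration
print(check_pairs(["imgs/a.jpeg", "imgs/b.jpeg"], ["labels/b.txt", "labels/a.txt"]))
```

If the returned list is not empty, the images and labels would be paired incorrectly after the split, so it is worth fixing the file names first.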

Now we will create a utility function to copy the actual files from the source files to the destination folders.

# Utility function to copy files to the destination folder
def move_files_to_folder(list_of_files, destination_folder):
    for f in list_of_files:
        try:
            shutil.copy(f, destination_folder)
        except OSError:
            # Report the file that failed and stop, keeping the original error
            print(f)
            raise

Let us now copy the files using the above utility function

# Copy the splits into the respective folders
move_files_to_folder(train_images, 'potholeData/images/train')
move_files_to_folder(val_images, 'potholeData/images/val/')
move_files_to_folder(test_images, 'potholeData/images/test/')
move_files_to_folder(train_annotations, 'potholeData/labels/train/')
move_files_to_folder(val_annotations, 'potholeData/labels/val/')
move_files_to_folder(test_annotations, 'potholeData/labels/test/')

Now you will be able to see the images and annotation text files in the respective folders

Now we are ready to start the training.

Training the model

Before initiating the training process we have to create a special file, a .yaml file, which contains the paths to the train, test and val folders and also the class labels. Let us create the yaml file first. Open your text editor, create a file named 'potholeData.yaml' and copy the following into it.

train: /BayesianQuest/Pothole/yolov5/potholeData/images/train/
val:  /BayesianQuest/Pothole/yolov5/potholeData/images/val/
test: /BayesianQuest/Pothole/yolov5/potholeData/images/test/

# number of classes
nc: 4

# class names
names: ["pothole","vegetation", "sign","vehicle"]

Please note that for the first three lines, you need to give the full path to your images/train, images/val and images/test folders. The class names should be listed in the exact order in which we defined them in the classMap dictionary earlier. You need to save this .yaml file in the data folder.

Now it's time to start the training. To start the training you need to enter the following command in the Jupyter notebook. Alternatively you can also run the same command on the terminal
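Since the order of the names must match the ids in classMap, one way to avoid mistakes is to generate the file contents from the dictionary. The sketch below is one possible approach; `build_data_yaml` is a hypothetical helper, and the dictionary values and placeholder paths are assumptions for illustration that should be replaced with your own.

```python
def build_data_yaml(class_map, train_dir, val_dir, test_dir):
    """Build the text of the data file with class names ordered by class id."""
    # Sort the class names by their integer id so that position matches id
    names = [name for name, idx in sorted(class_map.items(), key=lambda kv: kv[1])]
    quoted = ", ".join('"{}"'.format(n) for n in names)
    lines = [
        "train: {}".format(train_dir),
        "val: {}".format(val_dir),
        "test: {}".format(test_dir),
        "",
        "# number of classes",
        "nc: {}".format(len(names)),
        "",
        "# class names",
        "names: [{}]".format(quoted),
    ]
    return "\n".join(lines) + "\n"


# Assumed mapping, matching the names used in this post
example_map = {"pothole": 0, "vegetation": 1, "sign": 2, "vehicle": 3}
yaml_text = build_data_yaml(example_map, "/path/to/images/train/",
                            "/path/to/images/val/", "/path/to/images/test/")
print(yaml_text)
```

The returned text can then be written to potholeData.yaml with a regular file write.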

!python train.py --img 640 --cfg yolov5m.yaml --hyp data/hyps/hyp.scratch-med.yaml --batch 4 --epochs 500 --data potholeData.yaml --weights yolov5m.pt --workers 4 --name yolo_pothole_det_m

Let us understand each of these parameters we give to initiate training

train.py : This is the training file which comes with the code when we clone the folder. This file contains all the methods to run the training.

img : This is the size, in pixels, to which the images are resized for training

cfg : This is the configuration file which defines the model architecture. This file would be available in the folder yolov5/models as shown below.

hyp : These are the hyperparameters for the model which are available in the data/hyps folder

batch : This is the batch size, which you define based on the number of images you have and the memory available on your machine

epochs : Number of training epochs

data : This is the yaml file which we created which has the path to the train/test/val files and also class information.

weights : These are the pre-trained weights of the model, which will be automatically downloaded as part of the script. YOLOv5 comes in several model sizes, including small, medium and large, denoted by a letter in the file name such as the 'm' in yolov5m.pt. Here we have selected the medium model. When you run the training process for the first time, this weights file gets downloaded into the yolov5 folder.

Weights file downloading during training execution

workers : This indicates the number of worker processes used for loading the data during training.

name : This is the name of the folder where the trained model and its checkpoints are stored. When you run the training command line, you will notice that a folder will be created with the same name as shown below. This will be inside a folder called ‘runs‘, which will be created inside the yolov5 folder.

Once the training command is executed, you will see output similar to below on the screen

The training is a time consuming activity and can be visualized on Tensorboard by entering the following command on a terminal. Please note that the terminal should be pointing to the yolov5 folder. The log details required to run Tensorboard will be available in runs/train folder

Once this command is executed, you will find the following output and will be able to visualize the training run on the browser in the following url http://localhost:6006/

Once you open the browser you will find a similar output

Once the training is complete, the trained model weights will be stored in the folder you defined with --name during the training process ( runs/train/yolo_pothole_det_m/weights/best.pt ). These weights will be used for your inference cycle.

Inference with the trained model

The inference will also be using a pre-defined script which comes with the Yolov5 package. Inference can be initiated using the following command on the Jupyter notebook.

!python detect.py --source potholeData/images/val/ --weights runs/train/yolo_pothole_det_m/weights/best.pt --max-det 3  --conf-thres 0.005 --classes 0 --name yolo_pothole_det_test_m1

Alternately you can also run the same on the terminal as below

Let us go through each of the parameters

detect.py : This is the file used for inference which is available in the yolov5 folder

source : This is the path where the validation images are kept for inference. You can point this to any folder containing the images you want predictions for.

weights : This is the path to the weight of the checkpointed model we trained. These weights will be used for inference.

max-det : This is a parameter to define the maximum number of objects you want to be detected in an image.

conf-thres : This is the confidence threshold above which you want the predictions to be visualized.

classes : This is a parameter to filter the classes we want to be displayed. In the example we have specified only the pothole class ( 0 ). If we want objects of other classes to be displayed, those class ids need to be listed with this parameter. ( e.g. --classes 0 3 )

name : This is the name of the folder where the images with the detected objects will be saved. You will find a folder with the name you defined in the following folder.
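The combined effect of conf-thres, classes and max-det can be pictured with a small filter over a list of detections. This is only an illustration of the idea, with made-up detections and a hypothetical function; it is not how detect.py is implemented internally.

```python
def filter_detections(detections, conf_thres, classes=None, max_det=None):
    """Keep detections above conf_thres, optionally restricted to some class ids,
    and return at most max_det of them, highest confidence first.
    Each detection is a (class_id, confidence) tuple."""
    kept = [d for d in detections if d[1] > conf_thres]
    if classes is not None:
        kept = [d for d in kept if d[0] in classes]
    # Highest confidence first, then cut the list to the allowed length
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept if max_det is None else kept[:max_det]


# Made-up detections: class 0 is pothole, class 3 is vehicle
detections = [(0, 0.90), (3, 0.80), (0, 0.40), (0, 0.002), (0, 0.30), (0, 0.10)]
result = filter_detections(detections, conf_thres=0.005, classes=[0], max_det=3)
print(result)
```

With the settings used in the command above, the vehicle detection and the very low confidence pothole are dropped, and only the three most confident potholes remain.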

Let us look at some of the images we have predicted

We can see that the bounding boxes have localized well. We should note that the number of images we used was very small and we still got some good results. With more images, we should be able to get better results.

With this we have come to the end of object detection using YOLOV5. Let us quickly recap what we have achieved in this post.

Downloaded the YOLOV5 scripts into our local folder
Learned how to pre-process the data for custom training using YOLOV5.
Trained the model and verified the best model
Used the best model to do inference on our test images.

We have come a long way and are now adept at training and doing inference using an advanced model like YOLOV5. I am sure this will be another great tool with which you could do your object detection project.

What Next ?

Having seen an advanced method like YOLOV5, we will now proceed to learn to use a great tool from Tensorflow called the Tensorflow Object Detection API ( TFODAPI ). Using this API we will be able to build different types of object detection models. We will cover pothole detection using TFODAPI in the next post. Watch this space for more.

To be notified of the next post please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

You can also access the code base for this series from the following git hub link

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowered by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialized in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Build you Computer Vision Application – Part III: Pothole detector from scratch using legacy methods (Image Pyramids and sliding window)

This is the third post of the series where we build a road sign and pothole detection application. We will be using multiple methods throughout this series, which include computer vision techniques using opencv, annotating images using labelImg, mastering the Tensorflow object detection API, training object detection models using transfer learning, object detection on video etc. This series will be split across 9 posts.

1. Introduction to object detection

2. Data set preparation and annotation Using labelImg

3. Building your object detection model from scratch using Image pyramids and Sliding window ( This post )

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In this post we build a custom object detector from scratch, progressively using different methods like image pyramids, sliding window and non maxima suppression. These are legacy methods which lay the foundation for many of the modern object detection methods. Let us look at the processes which will be covered in building an object detector from scratch.

Prepare the train and test sets from the annotated images ( Covered in the last post)
Build a classifier for detecting potholes
Build the inference pipeline using image pyramids and sliding window techniques to predict bounding boxes for potholes
Optimise the bounding boxes using Non Maxima suppression.

We will be covering all the topics from step 2 in this post. These posts are heavily inspired by the following posts.

Let us dive in.

Training a classifier on the data

In the last post we prepared our training data from positive and negative examples and then saved the data in h5py format. In this post we will use that data to build our pothole classifier. The classifier we will be building is a binary classifier which has a positive class and a negative class. We will be training this classifier using an SVM model. The choice of the SVM model is based on some earlier work which has been done in this space, however I would urge you to experiment with other classification models as well.

We will start off from where we stopped in the last section. We will read the database from disk and extract the labels and data

# Read the data base from disk
db = h5py.File(outputPath, "r")
# Extract the labels and data
(labels, data) = (db["pothole_features_all"][:, 0], db["pothole_features_all"][:, 1:])
# Close the data base
db.close()

print(labels.shape)
print(data.shape)

We will now use the data and labels to build the classifier

from sklearn.svm import SVC

# Build the SVM model
model = SVC(kernel="linear", C=0.01, probability=True, random_state=123)
model.fit(data, labels)

Once the model is fit we will save the model as a pickle file in the output folder.

import pickle

# Save the model in the output folder
modelPath = 'data/models/model.cpickle'
# The with block closes the file even if the write fails
with open(modelPath, "wb") as f:
    pickle.dump(model, f)

Please remember to create the 'models' folder in your local drive in the 'data' folder before saving the model. Once the model is saved you will be able to see the model pickle file within the path you specified.

Now that we have built the classifier, we will use it for object detection in the next section. We will be covering two concepts in the next section which are important for object detection: image pyramids and sliding windows. Let us get familiar with those concepts first.

Image Pyramids and Sliding window techniques

Let us try to understand the concept of image pyramids with an example. Let us assume that we have a window of fixed size and potholes are detected only if they fit perfectly inside the window. Let us look at how well the potholes are detected when using a fixed size window. Take the case of layer1 of the image below. We can see that the fixed size window was able to detect one of the potholes which was further down the road as it fit well within the window size. However the bigger pothole, which is at the near end of the image, is not detected because the window is smaller than the size of the pothole.

As a way to solve this, let us progressively reduce the size of the image, and try to fit the potholes to the fixed window size, as shown in the figure below. With the reduction in size of the image, the object we want to detect also reduces in size. Since our detection window remains the same, we are able to detect more potholes including the biggest one, when the image sizes are reduced. Thereby we will be able to detect most of the potholes which otherwise would not have been possible with a fixed size window and a constant size image. This is the concept behind image pyramids.

The name image pyramids signifies the fact that, if the scaled images are stacked vertically, then it will fit inside a pyramid as shown in the below figure.

The implementation of image pyramids can be done easily using scikit-image ( skimage ). There are many different types of image pyramid implementations. Some of the prominent ones are Gaussian pyramids and Laplacian pyramids. You can read about these pyramids in the link given here. Let us quickly look at the implementation of pyramids.

from skimage.transform import pyramid_gaussian
for imgPath in allFiles[-2:-1]:
    # Read the image
    image = cv2.imread(imgPath)
    # loop over the layers of the image pyramid and display them
    for (i, layer) in enumerate(pyramid_gaussian(image, downscale=1.2)):
        # Break the loop if the image size is less than our window size
        if layer.shape[1] < 80 or layer.shape[0] < 40:
            break
        print(layer.shape)

From the output we can see how the images are scaled down progressively.
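The sizes of the pyramid layers can also be worked out by hand. The sketch below assumes that each layer is the previous one divided by the downscale factor and rounded up, and that we stop once a layer is smaller than the window. `pyramid_shapes` is a hypothetical helper and not part of scikit-image, and the image size is made up.

```python
import math


def pyramid_shapes(height, width, downscale, min_height, min_width):
    """List the (height, width) of each pyramid layer, starting with the
    original image, until a layer becomes smaller than the minimum size."""
    if downscale <= 1:
        raise ValueError("downscale must be greater than 1")
    shapes = []
    while height >= min_height and width >= min_width:
        shapes.append((height, width))
        # Assumed rule: divide by the downscale factor and round up
        height = math.ceil(height / downscale)
        width = math.ceil(width / downscale)
    return shapes


# Made-up 160 x 320 image, halved at each layer, with a 40 x 80 minimum size
layers = pyramid_shapes(160, 320, downscale=2, min_height=40, min_width=80)
print(layers)
```

A smaller downscale factor such as the 1.2 used above produces many more layers, which means more windows to evaluate but a finer search over object sizes.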

Having seen the image pyramids, it's time to discuss the sliding window. Sliding windows are an effective method to identify objects in an image at various scales and locations. As the name suggests, this method involves a window of standard length and width which slides across an image to extract features. These features will be used in a classifier to identify the object of interest. Let us look at the code block below to understand the dynamics of the sliding window method.

# Read the image
image = cv2.imread(allFiles[-2])
# Define the window size
windowSize = [80,40]
# Define the step size
stepSize = 40
# slide a window across the image
for y in range(0, image.shape[0], stepSize):
    for x in range(0, image.shape[1], stepSize):
        # Clone the image
        clone = image.copy()
        # Draw a rectangle on the image 
        cv2.rectangle(clone, (x, y), (x + windowSize[0], y + windowSize[1]), (0, 255, 0), 2)
        # Show the image with the current window drawn on it
        plt.imshow(clone)
        plt.show()

To implement the sliding window we need to understand some of the parameters which are used. The first is the window size, which is the dimension of the fixed window we will be sliding across the image. We earlier calculated the size of this window to be [80,40], which was the average size of a pothole in our distribution. The second parameter is the step size. The step size is the number of pixels by which we move the fixed window across the image at each step. The smaller the step size, the more window positions we have to evaluate, and vice versa. We don't want to slide through every pixel and we definitely don't want to skip important features, so the step size is a necessary parameter. An ideal step size would depend on the image size. For our case let us experiment with the 'y' dimension of our fixed window, which is 40. I would encourage you to experiment with different step sizes and observe the results before finalising the step size.

To implement this method, we first iterate through the vertical direction, starting from 0 up to the height of the image in increments of the step size. We have an inner loop which iterates through the horizontal direction, ranging from 0 to the width of the image in increments of the step size. For each of these iterations we capture the x and y coordinates and then extract a rectangle with the same shape as the fixed window. In the above implementation we are only drawing a rectangle on the image to understand the dynamics. However when we implement this along with image pyramids, we will crop a region with the dimensions of the window as we slide across the image. Let us see some of the sample outputs of the sliding window.

From the above output we can see how the fixed window slides across the image both horizontally and vertically with a step size to extract features from regions of the image of the same size as the fixed window.

So far we have seen the pyramid and the sliding window implementations independently. These two methods have to be integrated to use it as an object detector. However for integrating them we need to convert the sliding window method into a function. Let us look at the function to implement sliding windows.

# Function to implement sliding window
def slidingWindow(image, stepSize, windowSize):    
    # slide a window across the image
    for y in range(0, image.shape[0], stepSize):
        for x in range(0, image.shape[1], stepSize):
            # yield the current window
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])

The function is not very different from what we implemented earlier. The only difference is that as the output we yield a tuple of the x, y coordinates and the crop of the image of the same size as the window size. Next we will see how we integrate this function with the image pyramids to implement our custom object detector.
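Near the right and bottom edges of the image the crop can be smaller than the window. The pure Python sketch below lists the window positions for a made-up image size and marks which ones are full-sized. `window_positions` is a hypothetical helper that mirrors the two loops of slidingWindow without needing an actual image.

```python
def window_positions(img_height, img_width, stepSize, windowSize):
    """Yield (x, y, is_full) for every top-left corner visited by the
    sliding window. windowSize is (width, height), as in the post."""
    for y in range(0, img_height, stepSize):
        for x in range(0, img_width, stepSize):
            # A window is full only if it lies completely inside the image
            is_full = (x + windowSize[0] <= img_width and
                       y + windowSize[1] <= img_height)
            yield (x, y, is_full)


# Made-up 120 x 160 image (height x width) with the settings used in the post
positions = list(window_positions(120, 160, stepSize=40, windowSize=(80, 40)))
full = [(x, y) for (x, y, is_full) in positions if is_full]
print(len(positions), len(full))
# prints: 12 9
```

The three positions in the last column start at x = 120 and only have 40 pixels of width left, which is why partial windows have to be skipped before feature extraction.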

Building the object detector

It's now time to bring together everything we have defined to create our object detector. As a first step let us load the model which we saved during the training phase

# Listing the path where we stored the model
modelPath = 'data/models/model.cpickle'
# Loading the model we trained earlier
with open(modelPath, "rb") as f:
    model = pickle.load(f)
model

Now let us look at the complete code to implement our object detector

# Initialise lists to store the bounding boxes and probabilities
boxes = []
probs = []
# Define the HOG parameters
orientations=12
pixelsPerCell=(4, 4)
cellsPerBlock=(2, 2)
# Define the fixed window size
windowSize=(80,40)
# Pick a random image from the image path to check our prediction
imgPath = sample(allFiles,1)[0]
# Read the image
image = cv2.imread(imgPath)
# Converting the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# loop over the image pyramid
for (i, layer) in enumerate(pyramid_gaussian(image, downscale=1.2)):
    # Identify the current scale of the image    
    scale = gray.shape[0] / float(layer.shape[0])
    # loop over the sliding window for each layer of the pyramid
    for (x, y, window) in slidingWindow(layer, stepSize=40, windowSize=(80,40)):
        # if the current window does not meet our desired window size, ignore it
        if window.shape[0] != windowSize[1] or window.shape[1] != windowSize[0]:
            continue
        # Let us now extract the hog features of this window within the image
        feat = hogFeatures(window,orientations,pixelsPerCell,cellsPerBlock,normalize=True).reshape(1,-1)
        # Get the prediction probabilities for the positive class ( potholes )
        prob = model.predict_proba(feat)[0][1] 
        
        # Check if the probability is greater than a threshold probability
        if prob > 0.95:            
            # Extract (x, y)-coordinates of the bounding box using the current scale 
            # Starting coordinates
            (startX, startY) = (int(scale * x), int(scale * y))
            # Ending coordinates
            endX = int(startX + (scale * windowSize[0]))
            endY = int(startY + (scale * windowSize[1]))
            # update the list of bounding boxes and probabilities
            boxes.append((startX, startY, endX, endY))
            probs.append(prob)
            
# loop over the bounding boxes and draw them
for (startX, startY, endX, endY) in boxes:
    cv2.rectangle(image, (startX, startY), (endX, endY), (0, 0, 255), 2)       

plt.imshow(image,aspect='equal')
plt.show()

To start off, we initialise two lists in lines 2-3 where we will store the bounding box coordinates and also the probabilities, which indicate the confidence of detecting potholes in the image.

We also define some important parameters which are required for HOG feature extraction method in lines 5-7

orientations
pixels per Cell
Cells per block

We also define the size of our fixed window in line 9

To test our process, we randomly sample an image from the list of images we have and then convert the image into gray scale in lines 11-15.

We then start the iterative loop to implement the image pyramids in line 17. For each iteration the input image is scaled down as per the scaling factor we defined. Next we calculate the running scale of the image in line 19. The scale is always the height of the original image divided by the height of the scaled down layer. We need the scale to blow up the x, y coordinates to the original size of the image later on.

Next we start the sliding window implementation in line 21. We provide the scaled down version of the image as the input along with the step size and the window size. The step size is the parameter which indicates by how much the window has to slide across the image. The window size indicates the size of the sliding window. We saw the mechanics of these when we looked at the sliding window function.

In lines 23-24 we ensure that we only take windows which match our fixed window size. For any window which passes this check, HOG features are extracted in line 26. On the extracted HOG features, we do a prediction in line 28. The prediction gives the probability of the window being a pothole or not. We extract only the probability of the positive class. We then take only those windows where the probability is greater than a threshold we have defined in line 31. We use a high threshold because the distributions of the positive and negative images are very similar. So to ensure that we get only the potholes, we have given a higher threshold. The threshold has been arrived at after a fair bit of experimentation. I would encourage you to try out different thresholds before finalising the threshold you want.

Once we get the predictions, we take the x and y coordinates and then blow them up to the original size using the scale we calculated earlier, in lines 34-37. We find the starting coordinates and the ending coordinates and then append them, along with the probabilities, to the lists we defined, in lines 39-40.

In lines 43-47, we loop through each of the bounding boxes, draw them on the image and display the result.
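The rescaling step in lines 34-37 can be tried on its own with a few numbers. The sketch below repeats the same arithmetic in a small function; `rescale_box` is a hypothetical helper and the values are made up for illustration.

```python
def rescale_box(x, y, scale, windowSize):
    """Map a window found at (x, y) in a scaled down layer back to the
    coordinates of the original image. windowSize is (width, height)."""
    startX, startY = int(scale * x), int(scale * y)
    endX = int(startX + scale * windowSize[0])
    endY = int(startY + scale * windowSize[1])
    return (startX, startY, endX, endY)


# Made-up example: a window at (40, 80) in a layer 1.5 times smaller than the original
box = rescale_box(40, 80, scale=1.5, windowSize=(80, 40))
print(box)
# prints: (60, 120, 180, 180)
```

Notice that the box grows with the scale: a window found in a heavily scaled down layer covers a larger area of the original image, which is how bigger potholes end up with bigger boxes.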

Let us look at the output we have got. We can see that there are multiple bounding boxes created around the area where there are potholes. We can be happy that the object detector is doing its job by localising the area around a pothole in most of the cases. However there are examples where the detector has detected objects other than potholes. We will come to that issue later. Let us first address another important issue.

All the images have multiple overlapping bounding boxes. Having a lot of bounding boxes can sometimes be cumbersome, say if we want to calculate the area where the pothole is present. We need to find a way to reduce the number of overlapping bounding boxes. This is where we use a technique called Non Maxima suppression. The objective of Non maxima suppression is to combine bounding boxes with significant overlap and get a single bounding box. The method which we will be implementing is inspired by this post

Non Maxima Suppression

We will be implementing a customised version of non maxima suppression, packaged as a function.

def maxOverlap(boxes):
    '''
    boxes : The coordinates of the boxes which contain the object
    returns : An array of boxes which do not have much overlap
    '''
    # Convert the bounding boxes into an array
    boxes = np.array(boxes)
    # Initialise a list to hold the selected boxes
    selected = []
    # Continue the loop while more than one box remains
    while len(boxes) > 1:
        # First calculate the area of the bounding boxes 
        x1 = boxes[:, 0]
        y1 = boxes[:, 1]
        x2 = boxes[:, 2]
        y2 = boxes[:, 3]
        area = (x2 - x1) * (y2 - y1)
        # Sort the bounding boxes based on its area    
        ids = np.argsort(area)
        #print('ids',ids)
        # Take the coordinates of the box with the largest area
        lx1 = boxes[ids[-1], 0]
        ly1 = boxes[ids[-1], 1]
        lx2 = boxes[ids[-1], 2]
        ly2 = boxes[ids[-1], 3]
        # Include the largest box into the selected list
        selected.append(boxes[ids[-1]].tolist())
        # Initialise a list for getting those ids that needs to be removed.
        remove = []
        remove.append(ids[-1])
        # We loop through each of the other boxes and find the overlap of the boxes with the largest box
        for id in ids[:-1]:
            #print('id',id)
            # The maximum of the starting x coordinates is where the overlap along the width starts
            ox1 = np.maximum(lx1, boxes[id,0])
            # The maximum of the starting y coordinates is where the overlap along the height starts
            oy1 = np.maximum(ly1, boxes[id,1])
            # The minimum of the ending x coordinates is where the overlap along the width ends
            ox2 = np.minimum(lx2, boxes[id,2])
            # The minimum of the ending y coordinates is where the overlap along the height ends
            oy2 = np.minimum(ly2, boxes[id,3])
            # Find the area of the overlap, clamping at zero so that disjoint boxes give no overlap
            oa = np.maximum(0, ox2 - ox1) * np.maximum(0, oy2 - oy1)
            # Find the ratio of the overlapping area of the smaller box with respect to its original area
            olRatio = oa/area[id]
            # If the overlap is greater than threshold include the id in the remove list
            if olRatio > 0.50:
                remove.append(id)                
        # Remove those ids from the original boxes
        boxes = np.delete(boxes, remove,axis = 0)
        # Break the while loop if nothing to remove
        if len(remove) == 0:
            break
    # Append the remaining boxes to the selected
    for i in range(len(boxes)):
        selected.append(boxes[i].tolist())
    return np.array(selected)

The input to the function is the set of bounding boxes we got after our prediction. Let me give the big picture of what this implementation is all about. We start with the box with the largest area and progressively eliminate boxes which have considerable overlap with it. We then take the boxes remaining after elimination and repeat the process till we get to the minimum number of boxes. Let us now walk through this implementation in the code above.

In line 7, we convert the bounding boxes into a numpy array and then initialise a list to store the bounding boxes we want to return in line 9.

Next, in line 11, we start the loop which keeps eliminating boxes till fewer than 2 boxes remain.

In lines 13-17, we calculate the area of all the bounding boxes and then sort them in ascending order in line 19.

We then take the coordinates of the box with the largest area in lines 22-25 and append the largest box to the selection list in line 27. We initialise a new list for the boxes which need to be removed and include the largest box in the removal list in line 30.

We then start another loop in line 32 to find the overlap of the other bounding boxes with the largest box. In lines 35-43, we find the coordinates of the overlapping portion of each of the other boxes with the largest box and then take the area of the overlapping portion. In line 45 we find the ratio of the overlapping area to the original area of the bounding box we are iterating over. If the ratio is larger than a threshold value, we add that box to the removal list in lines 47-48, as it has good overlap with the largest box. After iterating through all the boxes in the list, we have a list of boxes which have good overlap with the largest box. We then remove all those overlapping boxes and the current largest box from the original list of boxes in line 50. We continue this process till there are no more boxes to be removed. Finally we add the last remaining box to the selected list and then return the selection.
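To make the overlap ratio concrete, here is a small worked sketch on plain tuples. It is not the function above but a stripped-down illustration of the same ratio; `overlap_ratio` is a hypothetical helper, boxes are assumed to be (x1, y1, x2, y2), and the overlap width and height are clamped at zero so that boxes which do not intersect score zero.

```python
def overlap_ratio(large, small):
    """Fraction of `small`'s area that lies inside `large`; boxes are (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle
    ox1, oy1 = max(large[0], small[0]), max(large[1], small[1])
    ox2, oy2 = min(large[2], small[2]), min(large[3], small[3])
    # Clamp at zero so that boxes which do not intersect give no overlap
    overlap_area = max(0, ox2 - ox1) * max(0, oy2 - oy1)
    small_area = (small[2] - small[0]) * (small[3] - small[1])
    return overlap_area / small_area

largest = (0, 0, 100, 100)
# A 60 x 60 box sharing a 50 x 50 corner with the largest box: 2500 / 3600
ratio_heavy = overlap_ratio(largest, (50, 50, 110, 110))
# A box far away from the largest box
ratio_none = overlap_ratio(largest, (200, 200, 260, 260))
print(round(ratio_heavy, 3), ratio_none)
```

With the 0.50 threshold used in the function, the first box would be removed and the second would be kept for the next round.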

Let us implement this function and observe the result

# Get the selected list
selected = maxOverlap(boxes)

Now let us look at different examples after non maxima suppression.

# Get the image again
image = cv2.imread(imgPath)
# Make a copy of the image
clone = image.copy()
for (startX, startY, endX, endY) in selected:
    cv2.rectangle(clone, (startX, startY), (endX, endY), (0, 255, 0), 2)       

plt.imshow(clone,aspect='equal')
plt.show()

We can see that the bounding boxes are considerably reduced using our non maxima suppression implementation.

Improvement Opportunities

Even though we have got reasonable detection effectiveness, is the model we built perfect? Absolutely not. Let us look at some of the major pitfalls.

Misclassifications of objects :

From the outputs, we can see that we have misclassified some of the objects.

Most of the misclassifications we have seen are for vegetation. There are also cases where road signs are misclassified as potholes.

A major reason for the misclassifications is that our training data is limited. We used only 19 positive images and 20 negative examples, which is a very small data set for a task like this. Considering how limited the data set is, the classifier has done a decent job. The negative images also need more variety, for example road signs, vehicles and vegetation labelled as negative images. More positive images, together with more negative images covering a wider variety of objects likely to be found on roads, should improve the accuracy of the classifier.

Another strategy is to experiment with different types of classifiers. In our example we used an SVM classifier. It would be worthwhile to try other binary classifiers such as Logistic regression, Naive Bayes, Random forest and XGBoost. I would encourage you to try different classifiers and then verify the results.

Non detection of positive classes

Along with misclassifications, we have also seen non detection of positive classes.

As seen from the examples, there has been non detection in cases of potholes with water in them. In addition, some of the potholes which are further along the road are not detected.

These problems can again be addressed by including more variety in the positive images, such as potholes with water in them and potholes further away along the road. Another option is to preprocess the images with techniques like smoothing and blurring, thresholding, gradient and edge detection, contours and histograms. These methods help in highlighting the areas with potholes, which in turn helps detection. In addition, increasing the number of positive examples will also help in addressing the problems associated with non detection.

What Next ?

The idea behind this post was to give you a perspective on building an object detector from scratch. It was also an attempt to give you experience of working in cases where data sets are limited and where you have to create the necessary data sets yourself. I believe these exercises will equip you with the capabilities to deal with such issues in your projects.

Now that you have seen the basic ground-up approach, it is time to use this experience to learn more state of the art techniques. In the next post we will start with more advanced techniques and begin using transfer learning extensively. We will cover object detection using RCNN.

To be notified of the next post, please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

You can also access the code base for this series from the following GitHub link

Do you want to Climb the Machine Learning Knowledge Pyramid ?

I would also recommend two books I have co-authored. The first one specialises in deep learning, with practical hands on exercises and interactive video and audio aids for learning.

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Building Self Learning Recommendation system – VIII : Evaluating deployment options

This is the eighth and last post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts in which we progressively build a self learning recommendation system.

Recommendation system and reinforcement learning primer
Introduction to multi armed bandit problem
Self learning recommendation system as a K-armed bandit
Build the prototype of the self learning recommendation system : Part I
Build the prototype of the self learning recommendation system : Part II
Productionising the self learning recommendation system : Part I – Customer Segmentation
Productionising self learning recommendation system: Part II : Implementing self learning recommendations
Evaluating deployment options for the self learning recommendation systems. ( This post )

This post ties together everything we discussed in the previous two posts, in which we explored all the classes and methods we built for the application. In this post we will implement the driver file which controls all the processes and then explore different options to deploy this application.

Implementing the driver file

Now that we have seen all the classes and methods of the application, let us now see the main driver file which will control the whole process.

Open a new file and name it rlRecoMain.py and copy the following code into the file

import argparse
import pandas as pd
from utils import Conf,helperFunctions
from Data import DataProcessor
from processes import rfmMaker,rlLearn,rlRecomend
import os.path
from pymongo import MongoClient

# Construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument('-c','--conf',required=True,help='Path to the configuration file')
args = vars(ap.parse_args())

# Load the configuration file
conf = Conf(args['conf'])

print("[INFO] loading the raw files")
dl = DataProcessor(conf)

# Check if custDetails already exists. If not create it
if os.path.exists(conf["custDetails"]):
    print("[INFO] Loading customer details from pickle file")
    # Load the data from the pickle file
    custDetails = helperFunctions.load_files(conf["custDetails"])
else:
    print("[INFO] Creating customer details from csv file")
    # Let us load the customer Details
    custDetails = dl.gvCreator()
    # Starting the RFM segmentation process
    rfm = rfmMaker(custDetails,conf)
    custDetails = rfm.segmenter()
    # Save the custDetails file as a pickle file 
    helperFunctions.save_clean_data(custDetails,conf["custDetails"])

# Starting the self learning Recommendation system

# Check if the collections exist in Mongo DB
client = MongoClient(port=27017)
db = client.rlRecomendation

# Get all the collections from MongoDB
countCol = db["rlQuantdic"]
polCol = db["rlValuedic"]
rewCol = db["rlRewarddic"]
recoCountCol = db['rlRecotrack']

print(countCol.estimated_document_count())

# If Collections do not exist then create the collections in MongoDB
if countCol.estimated_document_count() == 0:
    print("[INFO] Main dictionaries empty")
    rll = rlLearn(custDetails, conf)
    # Consolidate all the products
    rll.prodConsolidator()
    print("[INFO] completed the product consolidation phase")
    # Get all the collections from MongoDB
    countCol = db["rlQuantdic"]
    polCol = db["rlValuedic"]
    rewCol = db["rlRewarddic"]

# start the recommendation phase
rlr = rlRecomend(custDetails,conf)
# Sample a state since the state is not available
stateId = rlr.stateSample()
print(stateId)

# Get the respective dictionaries from the collections

countDic = countCol.find_one({stateId: {'$exists': True}})
polDic = polCol.find_one({stateId: {'$exists': True}})
rewDic = rewCol.find_one({stateId: {'$exists': True}})

# The count dictionaries can exist even when the recommendation dictionary does not, so we handle it separately

if recoCountCol.estimated_document_count() == 0:
    print("[INFO] Recommendation tracking dictionary empty")
    recoCountdic = {}
else:
    # Get the dictionary from the collection
    recoCountdic = recoCountCol.find_one({stateId: {'$exists': True}})


print('recommendation count dic', recoCountdic)


# Initialise the Collection checker method
rlr.collfinder(stateId,countDic,polDic,rewDic,recoCountdic)
# Get the list of recommended products
seg_products = rlr.rlRecommender()
print(seg_products)
# Initiate customer actions

click_list,buy_list = rlr.custAction(seg_products)
print('click_list',click_list)
print('buy_list',buy_list)

# Get the reward functions for the customer action
rlr.rewardUpdater(seg_products,buy_list ,click_list)

We import all the necessary libraries and classes in lines 1-7.

Lines 10-12 detail the argument parser. We provide the path to our configuration file as the argument. We discussed the configuration file in detail in post 6 of this series. Once the path of the configuration file is passed as the argument, we read the configuration file and then load the values into the variable conf in line 15.

The first of the processes is to initialise the DataProcessor class in line 18. As you know from post 6, this class has the methods for loading and processing data. After this step, lines 21-33 implement the raw data loading and processing steps.

In line 21 we check if the processed data frame custDetails is already present in the output directory. If it is present, we load it from the folder in line 24. If we haven't created the custDetails data frame before, we initiate that action in line 28 using the gvCreator method we have seen earlier. In lines 30-31, we create the segments for the data using the segmenter method in the rfmMaker class. Finally, the custDetails data frame is saved as a pickle file in line 33.

Once the segmentation process is complete, the next step is to start the recommendation process. We first establish the connection with our database in lines 38-39. Then we get the 4 collections from MongoDB in lines 42-45. If a collection does not exist yet, we still get a collection object back, but its document count will be zero.

If the collections are empty, we need to create them. This is done in lines 50-59. We instantiate the rlLearn class in line 52 and then execute the prodConsolidator() method in line 54. Once this method is run, the collections will be created. Please refer to the prodConsolidator() method in post 7 for details. Once the collections are created, we get those collections in lines 57-59.

Next we instantiate the rlRecomend class in line 62 and then sample a stateId in line 64. Please note that sampling the stateId is only a workaround to simulate a state in the absence of real customer data. In a live application, the stateId would be created each time a customer logs into the system to buy products. As you know, the stateId is a combination of the customer's segment and the month and day on which the login happens. As there are no live customers, we simulate the stateId for our online recommendation process.
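In a live setting the stateId could be assembled from the customer's segment and the login time, along the lines of the sketch below. It follows the segment_month_period_day pattern of state IDs such as Q1_August_1_Monday used in this series. The function name `build_state_id` and the cut-off of 15 for the month period are assumptions made for illustration; in the application the cut-off comes from the configuration file.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def build_state_id(segment, when, month_per=15):
    """Combine segment, month, month period and weekday into a state ID string."""
    # Period is 1 when the day of the month is past the cut-off, otherwise 0
    period = int(when.day > month_per)
    return "_".join([str(segment), MONTHS[when.month - 1], str(period), DAYS[when.weekday()]])

# A customer from segment Q1 logging in on 23 August 2021
state_id = build_state_id("Q1", datetime(2021, 8, 23))
print(state_id)
```

The month and weekday names are spelled out in tuples rather than taken from strftime so that the result does not depend on the locale of the machine.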

Once we have sampled the stateId, we need to extract the dictionaries corresponding to that stateId from the MongoDB collections. We do that in lines 69-71. We extract the dictionary corresponding to the recommendations as a separate step in lines 75-80.
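It may help to see the shape of the documents involved. Each document stores one state, with the stateId as the key and a dictionary of products as the value, which is why the query checks that the stateId key exists. The sketch below uses a made-up document and two hypothetical helpers, `state_filter` and `unwrap_state`; it does not describe how the collfinder method handles the documents internally.

```python
def state_filter(state_id):
    """Build the filter passed to find_one to fetch the document for one state."""
    return {state_id: {"$exists": True}}

def unwrap_state(doc, state_id):
    """Return the product dictionary stored under state_id, or {} if nothing was found."""
    # find_one returns None when no document matches the filter
    if doc is None:
        return {}
    return doc.get(state_id, {})

# A made-up document shaped like the ones stored in the rlValuedic collection
doc = {"_id": "abc123", "Q1_August_1_Monday": {"85123A": 12.4, "22423": 7.9}}
values = unwrap_state(doc, "Q1_August_1_Monday")
print(state_filter("Q1_August_1_Monday"), values)
```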

Once all the dictionaries are extracted, we initialise them in line 87 using the collfinder method we explored in post 7. Once the dictionaries are initialised, we initiate the recommendation process in line 89 to get the list of recommended products.

Once we get the recommended products we simulate customer actions in line 93, and then finally update the rewards and values using rewardUpdater method in line 98.

This takes us to the end of the complete process to build the online recommendation process. Let us now see how this application can be run on the terminal

Figure 1 : Running the application on terminal

The application can be executed on the terminal with the below command

$ python rlRecoMain.py --conf config/custprof.json

The argument we give is the path to the configuration file. Please note that we need to change directory to the rlreco directory to run this code. The output from the implementation would be as below

The data can be seen in the MongoDB collections also. Let us look at ways to find the data in MongoDB collections.

To initialise Mongo db from terminal, use the following command

Figure 3 : Initialize Mongo

You should get the following output

Now to find all the data bases in Mongo DB you can use the below command

You will be able to see all the databases which you have created. The one marked in red is the database we created. Now, to use that database, the command is use rlRecomendation, as shown below. We will get a confirmation that the session has switched to the desired database.

To see all the collections we have made in this database we can use the below command.

From the output we can see all the collections we have created. Now to see some specific record within the collections, we can use the following command.

db.rlValuedic.find({"Q1_August_1_Monday":{$exists:true} })

In the above command we are trying to find all records in the collection rlValuedic for the stateID "Q1_August_1_Monday". Once we execute this command we get all the records in this collection for this specific stateID. You should get the below output.

The output displays all the products for that stateID and their value functions.

What we have implemented in code is a simulation of the complete process. To run this continuously for multiple customers, we can create another script with a list of desired customers and then execute the code multiple times. I will leave that step as an exercise for you to implement. Now let us look at different options to deploy this application.
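As a starting point for that exercise, one possible shape is a small wrapper which calls the driver several times. This is only a sketch: since the driver samples its own stateId, each run here simulates one more customer visit rather than a named customer, and `build_command` and `run_many` are hypothetical helpers.

```python
import subprocess
import sys

def build_command(conf_path, script="rlRecoMain.py"):
    """Command line for a single run of the driver file."""
    return [sys.executable, script, "--conf", conf_path]

def run_many(n_runs, conf_path, runner=subprocess.run):
    """Run the driver n_runs times and collect whatever the runner returns."""
    results = []
    for _ in range(n_runs):
        # check=True makes subprocess.run raise an error if a run fails
        results.append(runner(build_command(conf_path), check=True))
    return results

# Example, to be run from the rlreco directory:
# run_many(5, "config/custprof.json")
```

Passing the runner as a parameter keeps the loop easy to try out without actually launching the application.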

Deployment of application

The end product of any data science endeavour should be to build an application and share it with the world. There are different options to deploy Python applications. Let us look at some of the options available. I would encourage you to explore more methods and share your results.

Flask application with Heroku

A great option to deploy your application is to package it as a Flask application and then deploy it using Heroku. We have discussed this option in one of our earlier series, where we built a machine translation application. You can refer to this link for details. In this section we will discuss the nuances of building the application in Flask and then deploying it on Heroku. I will leave the implementation of the steps for you as an exercise.

When deploying the self learning recommendation system we have built, the first thing which we need to design is what the front end will contain. From the perspective of the processes we have implemented, we need to have the following processes controlled using the front end.

Training process : This is the process which takes the raw data, preprocesses it and then initialises all the dictionaries. It covers everything up to line 59 in the driver file rlRecoMain.py. We need to trigger the training process from the front end of the Flask application. In the background, all the steps up to line 59 should run and the dictionaries need to be updated.
Recommendation simulation : The second process which needs to be controlled is the one where we get the recommendations. This process starts with the simulation of the state from the front end. To do this we can provide a drop down of all the customer IDs on the Flask front end and take the system time details to form the stateID. Once this stateID is generated, we start the recommendation process, which covers everything from line 62 till line 90 in the driver file rlRecoMain.py. Please note that line 64 is the stateID simulation step, which will now be controlled from the front end, so that line need not be implemented. The final output, which is the list of all recommended products, needs to be displayed on the front end. It would be good to add images along with the products for visual impact.
Customer action simulation : Once the recommended products are displayed on the front end, we can send feedback from the front end, in terms of the products clicked and the products bought, through some widgets created in the front end. These widgets will take the place of line 93 in our implementation. The feedback from the front end needs to be collected as lists, which will take the place of click_list and buy_list given in lines 94-95. Once the customer actions are generated, the back end process in line 98 will have to kick in to update the dictionaries. Once the cycle is completed, we can build a refresh button on the screen to simulate the recommendation process again.

Once these processes are implemented using a Flask application, the application can be deployed on Heroku. This post will give you an overall guide to deploying the application on Heroku.

These are broad guidelines for building the application and then deploying it. They need not be the most efficient and effective ones. I would challenge each one of you to implement much better processes for deployment. Please share your implementations in the comments section below.

Other options for deployment

So far we have seen one option, which is to build the application using Flask and then deploy it using Heroku. There are other options for deployment too. Some of the notable ones are the following

Flask application on Ubuntu server
Flask application on Docker

The attached link is a great resource to learn about such deployment. I would challenge all of you to deploy using any of these implementation steps and share the implementation for the community to benefit.

Wrapping up.

This is the last post of the series and we hope that this series was informative.

We will start a new series in the near future. The next series will be on a specific problem in computer vision, namely object detection. In the next series we will be building a ‘Road pothole detector’ using different object detection algorithms. This series will touch upon different methods in object detection like Image Pyramids, RCNN, Yolo, Tensorflow Object detection API etc. Watch this space for the next series.

Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest GitHub repository

Building Self Learning Recommendation system – VII : Productionizing the application : II

This is the seventh post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts in which we progressively build a self learning recommendation system.

Recommendation system and reinforcement learning primer
Introduction to multi armed bandit problem
Self learning recommendation system as a K-armed bandit
Build the prototype of the self learning recommendation system : Part I
Build the prototype of the self learning recommendation system : Part II
Productionising the self learning recommendation system : Part I – Customer Segmentation
Productionising the self learning recommendation system: Part II – Implementing self learning recommendation ( This Post )
Evaluating different deployment options for the self learning recommendation systems.

This post builds on the previous post where we started off with productionizing the application using python scripts. In the last post we completed the customer segmentation part. In this post we continue from where we left off and then build the self learning system using python scripts. Let us get going.

Creation of States

Let us take a quick recap of the project structure and what we covered in the last post.

In the last post we were in the early part of our main driver file rlRecoMain.py. We explored rfmMaker class in file rfmProcess.py from the processes directory. We will now explore selfLearnProcess.py file in the same directory.

Open a new file and name it selfLearnProcess.py and insert the following code

import pandas as pd
from numpy.random import normal as GaussianDistribution
from collections import OrderedDict
from collections import Counter
import operator
from random import sample
import numpy as np
from pymongo import MongoClient
client = MongoClient(port=27017)
db = client.rlRecomendation



class rlLearn:
    def __init__(self,custDetails,conf):
        # Get the date  as a seperate column
        custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
        # Converting date to float for easy comparison
        custDetails['Date'] = custDetails['Date'].astype('float64')
        # Get the period of month column
        custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > conf['monthPer']))
        # Aggregate the custDetails to get a distribution of rewards
        rewardFull = custDetails.groupby(['Segment', 'Month', 'monthPeriod', 'Day', conf['product_id']])[conf['prod_qnty']].agg(
            'sum').reset_index()
        # Get these data frames for all methods
        self.custDetails = custDetails
        self.conf = conf
        self.rewardFull = rewardFull
        # Defining some dictionaries for storing the values
        self.countDic = {}  # Dictionary to store the count of products
        self.polDic = {}  # Dictionary to store the value distribution
        self.rewDic = {}  # Dictionary to store the reward distribution
        self.recoCountdic = {}  # Dictionary to store the recommendation counts

    # Method to find unique values of each of the variables
    def uniqeVars(self):
        # Finding unique value for each of the variables
        segments = list(self.rewardFull.Segment.unique())
        months = list(self.rewardFull.Month.unique())
        monthPeriod = list(self.rewardFull.monthPeriod.unique())
        days = list(self.rewardFull.Day.unique())
        return segments,months,monthPeriod,days

    # Method to consolidate all products
    def prodConsolidator(self):
        # Get all the unique values of the variables
        segments, months, monthPeriod, days = self.uniqeVars()
        # Creating the consolidated dictionary
        for seg in segments:
            for mon in months:
                for period in monthPeriod:
                    for day in days:
                        # Get the subset of the data
                        subset1 = self.rewardFull[(self.rewardFull['Segment'] == seg) & (self.rewardFull['Month'] == mon) & (
                                self.rewardFull['monthPeriod'] == period) & (self.rewardFull['Day'] == day)]
                        # Initialising a temporary dictionary for storing in MongoDB
                        tempDic = {}
                        # Check if the subset is valid
                        if len(subset1) > 0:
                            # Iterate through each of the subset and get the products and its quantities
                            stateId = str(seg) + '_' + mon + '_' + str(period) + '_' + day
                            # Define a dictionary for the state ID
                            self.countDic[stateId] = {}
                            tempDic[stateId] = {}
                            for i in range(len(subset1.StockCode)):
                                # Store in the Count dictionary
                                self.countDic[stateId][subset1.iloc[i]['StockCode']] = int(subset1.iloc[i]['Quantity'])
                                tempDic[stateId][subset1.iloc[i]['StockCode']] = int(subset1.iloc[i]['Quantity'])
                            # Dumping each record into MongoDB (insert_one replaces insert, which was removed in PyMongo 4)
                            db.rlQuantdic.insert_one(tempDic)

        # Consolidate the rewards and value functions based on the quantities
        for key in self.countDic.keys():
            # Creating two temporary dictionaries for loading in Mongodb
            tempDicpol = {}
            tempDicrew = {}
            # First get the dictionary of products for a state
            prodCounts = self.countDic[key]
            self.polDic[key] = {}
            self.rewDic[key] = {}
            # Initializing temporary dictionaries also
            tempDicpol[key] = {}
            tempDicrew[key] = {}
            # Update the policy values
            for pkey in prodCounts.keys():
                # Creating the value dictionary using a Gaussian process
                self.polDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)
                tempDicpol[key][pkey] = self.polDic[key][pkey]
                # Creating a reward dictionary using a Gaussian process
                self.rewDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)
                tempDicrew[key][pkey] = self.rewDic[key][pkey]
            # Dumping each of these in MongoDB (insert_one replaces insert, which was removed in PyMongo 4)
            db.rlRewarddic.insert_one(tempDicrew)
            db.rlValuedic.insert_one(tempDicpol)
        print('[INFO] Dumped the quantity dictionary,policy and rewards in MongoDB')

As usual we start with the import of the libraries we want in lines 1-7. In this implementation we make a small deviation from the prototype which we developed in the previous post. During the prototyping phase we predominantly relied on dictionaries to store data. Here, however, we will be storing data in MongoDB. Those of you who are not fully conversant with MongoDB can refer to some good tutorials on MongoDB like the one here. I will also explain the key features as and when required. In line 8, we import the MongoClient which is required for connections with the database. We then define the client using the default port number ( 27017 ) in line 9 and then name the database where we will store the recommendations in line 10. The name of the database we have selected is rlRecomendation. You are free to choose any name of your choice.

Let us now explore the rlLearn class. The constructor of the class which starts from line 15, takes the custDetails data frame and the configuration file as inputs. You would already be familiar with lines 17-23 from our prototyping phase, where we extract information to create states and then consolidate the data frame to get the quantities of each state. In lines 30-33, we create dictionaries where we store the relevant information like count of products, value distribution, reward distribution and the number of times the products are recommended.

The main method within the rlLearn class is the prodConslidator() method in lines 45-95. We have seen the details of this method in the prototyping phase. Just to recap, in this method we iterate through each of the components of our states and then store the quantities of each product under the state in the dictionaries. However, there is a subtle difference from what we did during the prototyping phase. Here we insert each state and its associated products into the MongoDB database we created, as shown in lines 70, 93 and 94. We create a temporary dictionary in line 57 to dump each state into MongoDB. We also store the data in the dictionaries, as we did during the prototyping phase, so that the data is available to the other methods in this class. The final outcome of this method is the creation of the count dictionary, value dictionary and reward dictionary from our data, and the insertion of this data into MongoDB.

This takes us to the end of the rlLearn class.

We now go back to the driver file rlRecoMain.py and explore the next important class, rlRecomend.

The rlRecomend class has the methods which are required for recommending products. This class has many methods and therefore we will go one by one through each of the methods. We have seen all these methods during the prototyping phase and therefore we will not get into detailed explanation of these methods here. For detailed explanation you can refer to the previous post.

Now on the selfLearnProcess.py start adding the code pertaining to the rlRecomend class.

class rlRecomend:
    def __init__(self, custDetails, conf):
        # Get the date  as a seperate column
        custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
        # Converting date to float for easy comparison
        custDetails['Date'] = custDetails['Date'].astype('float64')
        # Get the period of month column
        custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > conf['monthPer']))
        # Aggregate the custDetails to get a distribution of rewards
        rewardFull = custDetails.groupby(['Segment', 'Month', 'monthPeriod', 'Day', conf['product_id']])[
            conf['prod_qnty']].agg(
            'sum').reset_index()
        # Get these data frames for all methods
        self.custDetails = custDetails
        self.conf = conf
        self.rewardFull = rewardFull

The above code is for the constructor of the class ( lines 97 – 112 ), which is similar to the constructor of the rlLearn class. Here we consolidate the custDetails data frame and get the quantity of each product for the respective state.

Let us now look at the next two methods. Add the following code to the class we earlier created.

# Method to find unique values of each of the variables
    def uniqeVars(self):
        # Finding unique value for each of the variables
        segments = list(self.rewardFull.Segment.unique())
        months = list(self.rewardFull.Month.unique())
        monthPeriod = list(self.rewardFull.monthPeriod.unique())
        days = list(self.rewardFull.Day.unique())
        return segments, months, monthPeriod, days

    # Method to sample a state
    def stateSample(self):
        # Get the unique state elements
        segments, months, monthPeriod, days = self.uniqeVars()
        # Get the context of the customer. For the time being let us randomly select all the states
        seg = sample(segments, 1)[0]  # Sample the segment
        mon = sample(months, 1)[0]  # Sample the month
        monthPer = sample([0, 1], 1)[0]  # sample the month period
        day = sample(days, 1)[0]  # Sample the day
        # Get the state id by combining all these samples
        stateId = str(seg) + '_' + mon + '_' + str(monthPer) + '_' + day
        self.seg = seg
        return stateId

The first method , lines 115 – 121, is to get the unique values of segments, months, month-period and days. This information will be used in some of the methods we will see later on. The second method detailed in lines 124-135, is to sample a state id, through random sampling of the components of a state.

The next methods we will explore initialise the dictionaries if a state id has not been seen earlier. The first method initialises the dictionaries and the second method inserts a recommendation collection record in MongoDB if the state doesn't exist. Let us see the code for these methods.

  # Method to initialize a dictionary in case a state Id is not available
    def collfinder(self,stateId,countDic,polDic,rewDic,recoCountdic):
        # Defining some dictionaries for storing the values
        self.countDic = countDic  # Dictionary to store the count of products
        self.polDic = polDic  # Dictionary to store the value distribution
        self.rewDic = rewDic  # Dictionary to store the reward distribution
        self.recoCountdic = recoCountdic  # Dictionary to store the recommendation counts
        self.stateId = stateId
        print("[INFO] The current state is :", stateId)
        if self.countDic is None:
            print("[INFO] State ID do not exist")
            self.countDic = {}
            self.countDic[stateId] = {}
            self.polDic = {}
            self.polDic[stateId] = {}
            self.rewDic = {}
            self.rewDic[stateId] = {}
        if self.recoCountdic is None:
            self.recoCountdic = {}
            self.recoCountdic[stateId] = {}
        else:
            self.recoCountdic[stateId] = {}

# Method to update the recommendation dictionary
    def recoCollChecker(self):
        print("[INFO] Inside the recommendation collection")
        recoCol = db.rlRecotrack.find_one({self.stateId: {'$exists': True}})
        if recoCol is None:
            print("[INFO] Inserting the record in the recommendation collection")
            db.rlRecotrack.insert_one({self.stateId: {}})
        return recoCol

The inputs to the first method, as in line 138, are the state Id and the 4 dictionaries we extract from MongoDB, which we will see later on in the main script rlRecoMain.py. If no record exists for a specific state Id, the dictionaries we extract from MongoDB will be None, and therefore we need to initialize these dictionaries for storing the products, their values, rewards and the count of recommendations. The initialisation of these dictionaries is implemented in this method in lines 146-158.

The second initialisation method checks for the recommendation count dictionary of a specific state Id. We first look for the state Id in the collection in line 163. If the record doesn't exist, we insert a blank dictionary for that state in line 166.

Let us now look at the next two methods in the class

    # Create a function to get a list of products for a certain segment
    def segProduct(self,seg, nproducts):
        # Get the list of unique products for each segment
        seg_products = list(self.rewardFull[self.rewardFull['Segment'] == seg]['StockCode'].unique())
        seg_products = sample(seg_products, nproducts)
        return seg_products

    # This is the function to get the top n products based on value
    def sortlist(self,nproducts,seg):
        # Get the top products based on the values and sort them from product with largest value to least
        topProducts = sorted(self.polDic[self.stateId].keys(), key=lambda kv: self.polDic[self.stateId][kv])[-nproducts:][::-1]
        # If the topProducts is less than the required number of products nproducts, sample the delta
        while len(topProducts) < nproducts:
            print("[INFO] top products less than required number of products")
            segProducts = self.segProduct(seg, (nproducts - len(topProducts)))
            newList = topProducts + segProducts
            # Finding unique products
            topProducts = list(OrderedDict.fromkeys(newList))
        return topProducts

The method in lines 171-175 samples a list of products for a segment. This method is used in case the number of products in a particular state is less than the total number of products we want to recommend. In such cases, we randomly sample some products from the list of all products bought by customers in that segment and add them to the list of products we want to recommend. We will see this in action in the sortlist method (lines 178-188).

The sortlist method sorts the list of products based on the demand for each product and then returns the list of top products. The inputs to this method are the number of products we want to recommend and the segment ( line 178 ). We then get the top ‘n’ products by sorting the value dictionary based on the number of times a product is bought, as in line 180. If the number of products is less than the required number, additional products are sampled using the segProduct method we saw earlier. The final list of top products is then returned by this method.
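The ranking step can be tried out in isolation. The snippet below is a minimal sketch and not the class method itself: the function name top_n_products and the sample values are made up for illustration, and it ranks with reverse=True instead of slicing from the end of an ascending sort.

```python
def top_n_products(values, n):
    """Return up to n product ids, ordered from highest to lowest value."""
    # sorted() iterates over the dictionary keys; key=values.get ranks them by value
    return sorted(values, key=values.get, reverse=True)[:n]


# Hypothetical value dictionary for a single state
state_values = {"A": 2.5, "B": 7.1, "C": -0.4, "D": 3.3}
top_two = top_n_products(state_values, 2)
all_ranked = top_n_products(state_values, 10)
```

When a state holds fewer than n products, the slice simply returns what is available. That is the situation in which sortlist tops up the list with products sampled from the segment.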

The next method which we are going to explore is the one which controls the exploration and exploitation process thereby generating a list of products to be recommended. Let us add the following code to the class.

# This is the function to create the number of products based on exploration and exploitation
    def sampProduct(self,seg, nproducts,epsilon):
        # Initialise an empty list for storing the recommended products
        seg_products = []
        # Get the list of unique products for each segment
        Segment_products = list(self.rewardFull[self.rewardFull['Segment'] == seg]['StockCode'].unique())
        # Get the list of top n products based on value
        topProducts = self.sortlist(nproducts,seg)
        # Start a loop to get the required number of products
        while len(seg_products) < nproducts:
            # First find a probability
            probability = np.random.rand()
            if probability >= epsilon:
                # print(topProducts)
                # The top product would be first product in the list
                prod = topProducts[0]
                # Append the selected product to the list
                seg_products.append(prod)
                # Remove the top product once appended
                topProducts.pop(0)
                # Ensure that seg_products is unique
                seg_products = list(OrderedDict.fromkeys(seg_products))
            else:
                # If the probability is less than epsilon value randomly sample one product
                prod = sample(Segment_products, 1)[0]
                seg_products.append(prod)
                # Ensure that seg_products is unique
                seg_products = list(OrderedDict.fromkeys(seg_products))
        return seg_products

The inputs to the method are the segment, number of products to be recommended and the epsilon value which determines exploration and exploitation as shown in line 191. In line 195, we get the list of the products for the segment. This list is from where products are sampled during the exploration phase. We also get the list of top products which needs to be recommended in line 197, using the sortlist method we defined earlier. In lines 199-218 we implement the exploitation and exploration processes we discussed during the prototyping phase and finally we return the list of top products for recommendation.
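The same exploration and exploitation logic can be sketched as a standalone function using only the standard library. This is an illustrative sketch under a few assumptions, not the method above: the name epsilon_greedy_pick is hypothetical, random.random() stands in for np.random.rand(), and a guard is added so the loop cannot run forever when the catalogue is too small.

```python
import random


def epsilon_greedy_pick(top_products, all_products, n, epsilon):
    """Pick n unique products, exploiting with probability 1 - epsilon."""
    if len(set(all_products)) < n:
        raise ValueError("not enough distinct products to pick from")
    ranked = list(top_products)  # copy so the caller's list is not modified
    picked = []
    while len(picked) < n:
        if ranked and random.random() >= epsilon:
            # Exploit: take the best remaining product
            candidate = ranked.pop(0)
        else:
            # Explore: draw any product from the segment at random
            candidate = random.choice(all_products)
        # Keep the list unique while preserving the order of selection
        if candidate not in picked:
            picked.append(candidate)
    return picked


catalogue = ["A", "B", "C", "D", "E", "F"]
ranked_products = ["C", "A", "E"]
greedy = epsilon_greedy_pick(ranked_products, catalogue, 3, epsilon=0.0)
explore = epsilon_greedy_pick(ranked_products, catalogue, 3, epsilon=1.0)
```

With epsilon set to 0 the function always exploits and returns the ranked products in order. With epsilon set to 1 it only explores, so the result is a random set of three distinct products from the catalogue.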

The next method which we will explore is the one to update dictionaries after the recommendation process.

# This is the method for updating the dictionaries after recommendation
    def dicUpdater(self,prodList, stateId):        
        for prod in prodList:
            # Check if the product is in the dictionary
            if prod in list(self.countDic[stateId].keys()):
                # Update the count by 1
                self.countDic[stateId][prod] += 1                
            else:
                self.countDic[stateId][prod] = 1                
            if prod in list(self.recoCountdic[stateId].keys()):
                # Update the recommended products with 1
                self.recoCountdic[stateId][prod] += 1                
            else:
                # Initialise the recommended products as 1
                self.recoCountdic[stateId][prod] = 1                
            if prod not in list(self.polDic[stateId].keys()):
                # Initialise the value as 0
                self.polDic[stateId][prod] = 0                
            if prod not in list(self.rewDic[stateId].keys()):
                # Initialise the reward dictionary as 0
                self.rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)                
        print("[INFO] Completed the initial dictionary updates")

The inputs to this method, as in line 221, are the list of products to be recommended and the state Id. In lines 222-234, we iterate through each of the recommended products and increment the count in the dictionary if the product exists there, or initialize the count to 1 if the product wasn't available. Later on, in lines 235-240, we initialise the entries in the value dictionary and the reward dictionary if the products are not available in them.
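The check-then-update pattern used for the counts can also be written with dict.get and dict.setdefault. The following is only a sketch of that idiom with hypothetical names (register_recommendations and the sample dictionaries are made up). It leaves out the reward dictionary, which the method above initialises from a Gaussian draw.

```python
def register_recommendations(products, counts, reco_counts, values):
    """Update the per-state dictionaries for a list of recommended products."""
    for prod in products:
        # get() falls back to 0 for unseen products, so new entries start at 1
        counts[prod] = counts.get(prod, 0) + 1
        reco_counts[prod] = reco_counts.get(prod, 0) + 1
        # setdefault() only writes the initial value if the key is missing
        values.setdefault(prod, 0)


# Hypothetical dictionaries for one state
counts = {"A": 4}
reco_counts = {"A": 2}
values = {"A": 1.5}
register_recommendations(["A", "B"], counts, reco_counts, values)
```

Product A already exists, so only its counts go up and its value is left alone. Product B is new, so it starts with a count of 1 and a value of 0.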

The next method we will see is the one for initializing the dictionaries in case the context doesn't exist.

    def dicAdder(self,prodList, stateId):        
        # Loop through the product list
        for prod in prodList:
            # Initialise the count as 1
            self.countDic[stateId][prod] = 1
            # Initialise the value as 0
            self.polDic[stateId][prod] = 0
            # Initialise the recommended products as 1
            self.recoCountdic[stateId][prod] = 1
            # Initialise the reward dictionary as 0
            self.rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)
        print("[INFO] Completed the dictionary initialization")
        # Next update the collections with the respective updates        
        # Updating the quantity collection
        db.rlQuantdic.insert_one({stateId: self.countDic[stateId]})
        # Updating the recommendation tracking collection
        db.rlRecotrack.insert_one({stateId: self.recoCountdic[stateId]})
        # Updating the value function collection for the products
        db.rlValuedic.insert_one({stateId: self.polDic[stateId]})
        # Updating the rewards collection
        db.rlRewarddic.insert_one({stateId: self.rewDic[stateId]})
        print('[INFO] Completed updating all the collections')

If the state Id doesn't exist, the dictionaries are initialised as seen in lines 147-155. Once the dictionaries are initialised, the MongoDB collections are updated in lines 259-265.

The next method we are going to explore is one of the main methods, which integrates all the methods we have seen so far. This method implements the recommendation process. Let us explore this method.

# Method to sample a stateID and then initialize the dictionaries
    def rlRecommender(self):
        # Get the state ID which was stored earlier in collfinder
        stateId = self.stateId        
        # Start the recommendation process
        if len(self.polDic[stateId]) > 0:
            print("The context exists")
            # Implement the sampling of products based on exploration and exploitation
            seg_products = self.sampProduct(self.seg, self.conf["nProducts"],self.conf["epsilon"])
            # Check if the recommendation count collection exist
            recoCol = self.recoCollChecker()
            print('Recommendation collection existing :',recoCol)
            # Update the dictionaries of values and rewards
            self.dicUpdater(seg_products, stateId)
        else:
            print("The context does not exist")
            # Get the list of relevant products
            seg_products = self.segProduct(self.seg, self.conf["nProducts"])
            # Add products to the value dictionary and rewards dictionary
            self.dicAdder(seg_products, stateId)
        print("[INFO] Completed the recommendation process")

        return seg_products

The first step in the process is to get the state Id ( line 271 ) based on which we have to do all the recommendations. Once we have the state Id, we check if it is an existing state id in line 273. If it is an existing state Id we get the list of ‘n’ products for recommendation using the sampProduct method we saw earlier, where we implement exploration and exploitation. Once we get the products we initialise the recommendation collection in line 278. Finally we update all dictionaries using the dicUpdater method in line 281.

In lines 282-287, we implement a similar process for when the state Id doesn't exist. The only difference in this case is in the initialisation of the dictionaries in line 287, where we use the dicAdder method.

Once we complete the recommendation process, we get into simulating the customer action.

# Function to initiate customer action
    def custAction(self,segproducts):
        print('[INFO] getting the customer action')
        # Sample a value to get how many products will be clicked
        click_number = np.random.choice(np.arange(0, 10),
                                        p=[0.50, 0.35, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])
        # Sample products which will be clicked based on click number
        click_list = sample(segproducts, click_number)

        # Sample for buy values
        buy_number = np.random.choice(np.arange(0, 10),
                                      p=[0.70, 0.15, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])
        # Sample products which will be bought based on buy number
        buy_list = sample(segproducts, buy_number)

        return click_list, buy_list

Lines 296-305 implement the process for simulating the lists of products which are browsed and bought by the customer, based on the recommendation we made. The method returns the list of products which were browsed and the list of those which were bought. For detailed explanations of these methods, please refer to the previous post.
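The sampling idea can be sketched with the standard library alone. This is a hedged illustration rather than the method above: simulate_action is a hypothetical name, random.choices replaces np.random.choice, and the drawn count is capped at the size of the recommended list, which is an added safeguard for short lists.

```python
import random

# Probabilities of 0..9 interactions, copied from the click distribution above
CLICK_WEIGHTS = [0.50, 0.35, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001]


def simulate_action(products, weights):
    """Return a random subset of products whose size follows weights."""
    # random.choices returns a list, so take the single drawn count
    count = random.choices(range(len(weights)), weights=weights, k=1)[0]
    # Cap the count so sample() never asks for more items than exist
    count = min(count, len(products))
    return random.sample(products, count)


recommended = ["A", "B", "C"]
clicked = simulate_action(recommended, CLICK_WEIGHTS)
```

Because half of the probability mass sits on zero, most simulated customers click nothing, and larger click counts become progressively rarer.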

The next methods we will explore are the ones related to updating the values of the recommendation system.

    def getReward(self,loc):
        rew = GaussianDistribution(loc=loc, scale=1, size=1)[0].round(2)
        return rew

    def saPolicy(self,rew, prod):
        # This function gets the relevant algorithm for the policy update
        # Get the current value of the state        
        vcur = self.polDic[self.stateId][prod]        
        # Get the counts of the current product
        n = self.recoCountdic[self.stateId][prod]        
        # Calculate the new value
        Incvcur = (1 / n) * (rew - vcur)       
        return Incvcur

The getReward method on line 309 generates a reward from a Gaussian distribution centred around the reward value. We will see the use of this method in subsequent methods.

The saPolicy method in lines 313-321 updates the value of the state based on the simple averaging method in line 320. We have already seen these methods in our prototyping phase in the previous post.
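A small worked example shows why the increment (1 / n) * (rew - vcur) is called simple averaging. The sketch below uses made-up rewards and a hypothetical helper name. It assumes the value starts from the first reward and that n counts the rewards seen so far, which is a simplification of the article's setup, where the values are initialised from historic data.

```python
def incremental_mean_update(current_value, reward, n):
    """Return the running mean after the n-th reward is observed."""
    return current_value + (1 / n) * (reward - current_value)


# Hypothetical rewards for one product in one state
rewards = [5.0, 3.0, 10.0, -2.0]
value = 0.0
for n, reward in enumerate(rewards, start=1):
    value = incremental_mean_update(value, reward, n)

# The direct average of the same rewards, for comparison
plain_mean = sum(rewards) / len(rewards)
```

Under these assumptions the incrementally updated value matches the plain average of the rewards, without having to store the full reward history.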

Next we will see the method which uses both the above methods.

    def valueUpdater(self,seg_products, loc, custList, remove=True):
        for prod in custList:
            # Get the reward for the bought product. The reward will be centered around the defined reward for each action
            rew = self.getReward(loc)            
            # Update the reward in the reward dictionary
            self.rewDic[self.stateId][prod] += rew            
            # Update the policy based on the reward
            Incvcur = self.saPolicy(rew, prod)            
            self.polDic[self.stateId][prod] += Incvcur           
            # Remove the product from the list; it may already be gone if it was both bought and clicked
            if remove and prod in seg_products:
                seg_products.remove(prod)
        return seg_products

The inputs to this method are the recommended list of products, the mean reward ( click, buy or ignore), the corresponding list ( click list or buy list) and a flag to indicate if the product has to be removed from the recommendation list or not.

We iterate through all the products in the customer action list in line 324 and then get the reward in line 326. Once the reward is added to the reward dictionary in line 328, we get the incremental value in line 330, and this is added to the value dictionary in line 331. If the flag is True, we remove the product from the recommended list. Finally, the method returns the remaining recommendation list.

The next method is the last of the methods and ties the above three methods with the customer action.

# Function to update the reward dictionary and the value dictionary based on customer action
    def rewardUpdater(self, seg_products,custBuy=[], custClick=[]):
        # Check if there are any customer purchases
        if len(custBuy) > 0:
            seg_products = self.valueUpdater(seg_products, self.conf['buyReward'], custBuy)
            # Repeat the same process for customer click
        if len(custClick) > 0:
            seg_products = self.valueUpdater(seg_products, self.conf['clickReward'], custClick)
            # For those products not clicked or bought, give a penalty
        if len(seg_products) > 0:
            custList = seg_products.copy()
            seg_products = self.valueUpdater(seg_products, -2, custList,False)
        # Next update the collections with the respective updates
        print('[INFO] Updating all the collections')
        # Updating the quantity collection
        db.rlQuantdic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.countDic[self.stateId]})
        # Updating the recommendation tracking collection
        db.rlRecotrack.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.recoCountdic[self.stateId]})
        # Updating the value function collection for the products
        db.rlValuedic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.polDic[self.stateId]})
        # Updating the rewards collection
        db.rlRewarddic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.rewDic[self.stateId]})
        print('[INFO] Completed updating all the collections')

In lines 340-348, we update the values based on the products bought, clicked and ignored. Once the value dictionaries are updated, the respective MongoDB collections are updated in lines 352-358.

With this we have covered all the methods which are required for implementing the self learning recommendation system. Let us summarise our learning so far in this post.

Created the states and updated MongoDB with the states data. We used the historic data for initialisation of values.
Implemented the recommendation process by getting a list of products to be recommended to the customer
Explored customer response simulation wherein the customer response to the recommended products were implemented.
Updated the value functions and reward functions after customer response
Updated Mongo DB collections after the completion of the process for a customer.

What next ?

We are coming to the end of our series. The next post is where we tie all these methods together in the main driver file and see how these processes are implemented. We will also run the script on the terminal and observe the results. Once the application implementation is done, we will also explore avenues to deploy the application. Watch this space for the last post of the series.

The complete code base for the series is in the Bayesian Quest Git hub repository

VIII : Build and deploy data science products: Machine translation application -Build and deploy using Flask

“One measure of success will be the degree to which you build up others“

This is the last post of the series, and in this post we finally build and deploy the application we painstakingly developed over the past 7 posts. This series comprises 8 posts.

Understand the landscape of solutions available for machine translation
Explore sequence to sequence model architecture for machine translation.
Deep dive into the LSTM model with worked out numerical example.
Understand the back propagation algorithm for a LSTM model worked out with a numerical example.
Build a prototype of the machine translation model using a Google colab / Jupyter notebook.
Build the production grade code for the training module using Python scripts.
Building the Machine Translation application -From Prototype to Production : Inference process
Building the Machine Translation application: Build and deploy using Flask : ( This post)

Over the last two posts we covered the factory model and saw how we could build the model during the training phase. We also saw how the model was used for inference. In this section we will take the results of these predictions and build an app using flask. We will progressively work through the different processes of building the application.

Folder Structure

In our journey so far we progressively built many files which were required for the training phase and the inference phase. Now we are getting into the deployment phase, where we want to deploy the code we have built as an application. Many of the files we built during the earlier phases may not be required anymore in this phase. In addition, we want the application we deploy to be as light as possible for the sake of performance. For this purpose it is always a good idea to create a separate folder structure and a new virtual environment for deploying our application. We will select only the files necessary for deployment. Our final folder structure for this phase will look as follows

Let us progressively build this folder structure and the required files for building our machine translation application.

Setting up and Installing FLASK

When building an application in Flask, it is always good practice to create a virtual environment and then complete the application build process within that virtual environment. This way we can ensure that only application-specific libraries and packages are deployed to the hosting service. You will see later on that creating a separate folder and a new virtual environment will be vital for deploying the application in Heroku.

Let us first create a separate folder in our drive and then create a virtual environment within that folder. In a Linux based system, a separate folder can be created as follows

$ mkdir mtApp

Once the new directory is created let us change directory into the mtApp directory and then create a virtual environment. A virtual environment can be created on Linux with Python3 with the below script

mtApp $ python3 -m venv mtApp

Here the second mtApp is the name of our virtual environment. Do not get confused with the directory we created with the same name. The virtual environment which we created can be activated as below

mtApp $ source mtApp/bin/activate

Once the virtual environment is enabled we will get the following prompt.

(mtApp) ~$

In addition, you will notice that a new folder is created with the same name as the virtual environment.

Our next task is to install all the libraries which are required within the virtual environment we created.

(mtApp) ~$ pip install flask

(mtApp) ~$ pip install tensorflow

(mtApp) ~$ pip install gunicorn

That takes care of all the installations which are required to run our application. Let us now look through the individual folders and the files within it.

There will be three subfolders under the main application folder mtApp. The first subfolder, factoryModel, is a subset of the corresponding folder we maintained during the training phase. The second subfolder, mtApp, is the one created when the virtual environment was created. We don't have to do anything with that folder. The third folder, templates, is a folder specifically for the Flask application. The file app.py is the driver file for the Flask application. Let us now look into each of the folders.

Folder 1 : factoryModel:

The subfolders and files under the factoryModel folder are as shown below. These subfolders and its files are the same as what we have seen during the training phase.

The config folder contains the __init__.py file and the configuration file mt_config.py we used during the training and inference phases.

The output folder contains only a subset of the complete output folder we saw during the inference phase. We need only those files which are required to translate an input German string to English string. The model file we use is the one generated after the training phase.

The utils folder has the same helperFunctions script which we used during the training and inference phase.

Folder 2 : Templates :

The templates folder has two html templates which are required to visualise the outputs from the flask application. We will talk more about the contents of the html file in a short while along with our discussions on the flask app.

Flask Application

Now it's time to get to the main part of this article, which is building the script for the Flask application. The code base for the functionalities of the application will be the same as what we have seen during the inference phase. The difference is in how we use the predictions and display them in the web browser using the Flask application.

Let us now open a new file and name it app.py. Let us start building the code in this file

'''
This is the script for flask application
'''

from tensorflow.keras.models import load_model
from factoryModel.config import mt_config as confFile
from factoryModel.utils.helperFunctions import *
from flask import Flask,request,render_template

# Initializing the flask application
app = Flask(__name__)

## Define the file path to the model
modelPath = confFile.MODEL_PATH

# Load the model from the file path
model = load_model(modelPath)

Lines 5-8 import the libraries required for creating the application.

Line 11 creates the application object ‘app’ as an instance of the class ‘Flask’. The __name__ variable passed to the Flask class is a predefined Python variable which holds the name of the module in which it is used.

In line 14 we read the path to the saved model from the configuration file in the config folder.

In line 17 the model which we created during the training phase is loaded using the load_model() function in Keras.

Next we will load the required pickle files we saved after the training process. In lines 20-22 we initialize the paths to all the files and variables we saved as pickle files during the training phase. These paths are defined in the configuration file. Once the paths are initialized, the required files and variables are loaded from the respective pickle files in lines 24-27. We use the load_files() function we defined in the helper function script for loading the pickle files. You will notice that these steps are the same as the ones we used during the inference process.

In the next lines we will explore the visualisation processes for flask application.

@app.route('/')
def home():
	return render_template('home.html')

Lines 29-31 use a feature called the ‘decorator’. A decorator modifies the function which comes after it. The function which follows the decorator is a very simple one which returns the html template for our landing page. The landing page of the application is a simple text box where the sentence in the source language (German) has to be entered. The purpose of the decorator is to build a mapping between the function and the URL for the landing page. The URLs are defined through another important component called ‘routes’. A route ties a URL to the function which receives the inputs sent to that URL and returns the output to be displayed. Two routes are required for this application: one corresponding to the home page (‘/’) and a second one mapping to another webpage called ‘/translate’. The decorator, the route and the associated function work together as follows. The decorator registers the function against the route, the route fixes the URL at which the page is served, and the function returns the page to be displayed at that URL.

Next we will explore the second decorator, which returns the predictions.
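If decorators are new to you, the toy sketch below may help. It is not how Flask is implemented internally. It is only a plain Python illustration, with a hypothetical route() function and routes dictionary, of how a decorator can register a function against a URL.

```python
# Toy registry which maps a URL to the function that handles it.
# Both `routes` and `route` are hypothetical names used for illustration.
routes = {}


def route(url):
    """Return a decorator which registers the decorated function under `url`."""
    def decorator(func):
        routes[url] = func
        return func
    return decorator


@route('/')
def home():
    return 'landing page'


@route('/translate')
def get_translation():
    return 'translation page'


def dispatch(url):
    """Look up the function registered for `url` and call it."""
    return routes[url]()


print(dispatch('/'))
print(dispatch('/translate'))
```

When a request arrives for a URL, the framework looks up the registered function and calls it, which is what dispatch() imitates here.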

@app.route('/translate', methods=['POST', 'GET'])
def get_translation():
    if request.method == 'POST':

        result = request.form
        # Get the German sentence from the Input site
        gerSentence = str(result['input_text'])
        # Converting the text into the required format for prediction
        # Step 1 : Converting to an array
        gerAr = [gerSentence]
        # Clean the input sentence
        cleanText = cleanInput(gerAr)
        # Step 2 : Converting to sequences and padding them
        # Encode the inputsentence as sequence of integers
        seq1 = encode_sequences(Ger_tokenizer, int(Ger_stdlen), cleanText)
        # Step 3 : Get the translation
        translation = generatePredictions(model,Eng_tokenizer,seq1)
        # prediction = model.predict(seq1,verbose=0)[0]

        return render_template('result.html', trans=translation)

Line 33. Our application is designed to accept German sentences as input, translate them to English using the model we built and send the prediction back to the webpage. By default, a route only responds to ‘GET’ requests, which are what a browser sends when it simply asks for a page. Since the German sentence is submitted through a form using the ‘POST’ method, the route also has to accept ‘POST’ requests. This is done through the parameter methods=['POST','GET'] in the decorator.

Line 34 is the main function which translates the input German sentences to English sentences and then displays the predictions on the webpage.

Line 35 uses an ‘if’ statement to check that the request was made with the ‘POST’ method. The next line is where we read the web form which is used for getting the inputs from the application. Web forms are like templates which are used for receiving inputs from the users and also returning the output.

In Line 37 we assign request.form to a new variable called result. All the inputs submitted through the web form will be accessible through the variable result. There are two web forms which we use in the application: ‘home.html’ and ‘result.html’.

By default the web forms have to reside in a folder called templates (all lower case), next to app.py. Before we proceed with the rest of the code within the function we have to understand the web forms. Therefore let us build them. Open a new file, name it home.html and copy the following code.

<!DOCTYPE html>

<html>
<title>Machine Translation APP</title>
<body>
<form action = "/translate" method= "POST">

	<h3> German Sentence: </h3>

	<th> <input name='input_text' type="text" value = " " /> </th>

	<p><input type = "submit" value = "submit" /></p>

</form>
</body>
</html>

The prediction process in our application is initiated when we get the input German text from the ‘home.html’ form. In ‘home.html’ we define the name ( ‘input_text’ : line 10 in home.html) under which the German text is submitted. A default value can also be given using the value attribute, which will be overwritten when a new text is given as input. We also specify a submit button for submitting the input German sentence through the form, line 12.

Line 39 : As seen in line 37, the inputs from the web form are stored in the variable result. To access the input text, which the form in home.html submits under the name ‘input_text’, we look it up in the result variable ( result['input_text'] ). This input text is thereby stored in a variable ‘gerSentence’ as a string.

In line 42 the string object we received from the earlier line is converted to a list, as required during the prediction process.

In line 44 we clean the input text using the cleanInput() function we import from the helper functions. After cleaning the text we need to convert the input text into a sequence of integers, which is done in line 47. Finally in line 49, we generate the predicted English sentence.

For visualizing the translation we use the second html template result.html. Let us quickly review the template

<!DOCTYPE html>
<html>
<title>Machine Translation APP</title>

    <body>
          <h3> English Translation:  </h3>
            <tr>
                <th> {{ trans }} </th>
            </tr>
    </body>
</html>

This template is a very simple one where the only variable of interest is the variable trans on line 8.

The translation generated is relayed to result.html in line 51 by assigning the translation to the parameter trans .

if __name__ == '__main__':
    app.debug = True
    app.run()

Finally to run the app, the app.run() method has to be invoked as in line 56.

Let us now execute the application on the terminal. To execute the application run $ python app.py on the terminal. Always ensure that the terminal is pointing to the virtual environment we initialized earlier.

When the command is executed you should expect to get the following screen

Click the URL, or copy it into a browser, to see the application you built come live.

Congratulations you have your application running on the browser. Keep entering the German sentences you want to translate and see how the application performs.

Deploying the application

You have come a long way from where you began. You have now built an application using your deep learning model. Now the next question is where to go from here. The obvious route is to deploy the application on a production server so that your application is accessible to users on the web. We have different deployment options available. Some popular ones are

Heroku
Google App Engine
AWS
Azure
PythonAnywhere, etc.

Whichever option you choose, an application of this size is best deployed by subscribing to a paid service on one of these platforms. However, just to go through the motions and demonstrate the process, let us try to deploy the application on the free option of Heroku.

Deployment Process on Heroku

Heroku offers a free version for deployment however there are restrictions on the size of the application which can be hosted as a free service. Unfortunately our application would be much larger than the one allowed on the free version. However, here I would like to demonstrate the process of deploying the application on Heroku.

Step 1 : Creating the Heroku account.

The first step in the process is to create an account with Heroku. This can be done through the link https://www.heroku.com/. Once an account is created we get access to a dashboard which lists all the applications which we host in the platform.

Step 2 : Configuring git

Configuring ‘git’ is vital for deploying applications to Heroku. Git has to be installed first to our local system to make the deployment work. Git can be installed by following instructions in the link https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.

Once ‘git’ is installed it has to be configured with your user name and email id.

$ git config --global user.name "user.name"

$ git config --global user.email userName@mail.com

Step 3 : Installing Heroku CLI

The next step is to install the Heroku CLI and then log in to it. The detailed steps for installing the Heroku CLI are given in this link

https://devcenter.heroku.com/articles/heroku-cli

If you are using an Ubuntu system you can install the Heroku CLI using the command below

$ sudo snap install heroku --classic

Once Heroku is installed we need to log into the CLI once. This is done in the terminal with the following command

$ heroku login

Step 4 : Creating the Procfile and requirements.txt

There is a file called ‘Procfile’ in the root folder of the application which gives instructions on starting the application.

Procfile and requirements.txt in the application folder

The file can be created using any text editor and should be saved in the name ‘Procfile’. No extension should be specified for the file. The contents of the file should be as follows

web: gunicorn app:app --log-file -

Another important pre-requisite for the Heroku application is a file called ‘requirements.txt’. This file lists all the dependencies which need to be installed for running the application. Since the Procfile starts the application with gunicorn, make sure gunicorn is installed in the virtual environment so that it gets listed too. The requirements.txt file can be created using the below command.

$ pip freeze > requirements.txt

Step 5 : Initializing git and copying the required dependent files to Heroku

The above steps create the basic files which are required for running the application. The next task is to initialize git on the folder. To initialize git we need to go into the root folder where the app.py file exists and then initialize it with the below command

$ git init

Step 6 : Create application instance in Heroku

In order for git to push the application file to the remote Heroku server, an instance of the application needs to be created in Heroku. The command for creating the application instance is as shown below.

$ heroku create {application name}

Please replace the braces with the application name of your choice. For example if the application name you choose is 'gerengtran', it has to be enabled as follows

$ heroku create gerengtran

Step 7 : Pushing the application files to remote server

Once git is initialized and an instance of the application is created in Heroku, the application files can be set up in remote Heroku server by the following commands.

$ heroku git:remote -a {application name}

Please note that {application name} is the name of the application which you chose in the previous step. Whatever name you choose will be the name of the application in Heroku, and the external link to your application will contain that name.

Step 8 : Deploying the application and making it available as a web app

The final step of the process is to complete the deployment on Heroku and make the application available as a web app. This process starts with the command to add all the changes which you made to git.

$ git add .

Please note that there is a full stop( ‘.’ ) as part of the script after ‘add’ with a space in between .

After adding all the changes, we need to commit all the changes before finally deploying the application.

$ git commit -am "First submission"

The deployment will be completed with the below script after which the application will be up and running as a web app.

$ git push heroku master

When the files are pushed, if the deployment is successful you will get a url which is the link to the application. Alternatively, you can also go to Heroku console and activate your application. Below is the view of your console with all the applications listed. The application with the red box is the application which has been deployed

If you click on the link of the application ( red box) you get the page from where the application can be opened.

When the open app button is clicked the application is opened in a browser.

Wrapping up the series

With this we have achieved a good milestone of building an application and deploying it on the web for others to consume. I am a strong believer that the purpose of learning data science should be to enrich products and services. And the best way to learn how to enrich products and services is to build them yourself at a smaller scale. I hope you have gained a lot of confidence by building your application and then deploying it on the web. Before we bid adieu to this series, let us summarise what we have achieved and list the next steps.

In this series we first understood the solution landscape of machine translation applications and then understood different architecture choices. In the third and fourth posts we dived into the mathematics of an LSTM model, where we worked out a toy example for deriving the forward pass and backpropagation. In the subsequent posts we got down to the tasks of building our application. First we built a prototype and then converted it into production grade code. Finally we wrapped the functionalities we developed in a Flask application and understood the process of deploying it on Heroku.

You have definitely come a long way.

However looking back are there avenues for improvement ? Absolutely !!!

First of all, the model we built is a simple one. Machine translation is a complex process which requires far more sophisticated models for better results. Some of the model choices you can try out are the following

Change the model architecture. Experiment with different numbers of units and layers. Try variations like a bidirectional LSTM.
Use attention mechanisms on the LSTM layers. Attention mechanisms are seen to give good performance on machine translation tasks.
Move away from LSTM based sequence to sequence models and use state of the art models like Transformers.

The second set of optimizations you can try out is on the visualisations of the Flask application. The templates which are used here are very basic ones. You can experiment further with different templates and make the application visually attractive.

The final improvement areas are in the choices of deployment platforms. I would urge you to try out other deployment choices and let me know the results.

I hope all of you enjoyed this series. I definitely enjoyed writing it. I hope it benefits you and enables you to improve upon the methods used here.

I will be back again with more practical application building series like this. Watch this space for more

You can download the code for the deployment process from the following link

https://github.com/BayesianQuest/MachineTranslation/tree/master/Deployment/MTapp


VI : Build and deploy data science products: Machine translation application – From prototype to production. Introduction to the factory model

This is the sixth part of the series where we continue our pursuit of building a machine translation application. In this post we embark on a transformation process wherein we turn our prototype into production grade code.

This series comprises 8 posts.

In this section we will see how we can take the prototype which we built in the last article and turn it into production ready code. In the prototype building phase we were developing our code on a Jupyter/Colab notebook. However if we have to build an application and deploy it, notebooks would not be very effective. We have to convert the code we built on the notebook into production grade code using python scripts. We will be progressively building the scripts using a process I call the factory model. Let us see what a factory model is.

Factory Model

A factory model is a modularized process of generating business outcomes using machine learning models. There are some distinct phases in the process, which include

Ingestion/Extraction process : Process of getting data from source systems/locations
Transformation process : Transformation process entails transforming raw data ingested from multiple sources into a form fit for the desired business outcome
Preprocessing process: This process involves basic level of cleaning of the transformed data.
Feature engineering process : Feature engineering is the process of converting the preprocessed data into features which are required for model training.
Training process : This is the phase where the models are built from the featurized data.
Inference process : The models which were built during the training phase are then utilized to generate the desired business outcomes during the inference process.
Deployment process : The results of the inference process will have to be consumed by some process. The consumer of the inferences could be a BI report, a web service, an ERP application or any other downstream application. There is a whole set of processes involved in enabling the downstream systems to consume the results of the inference process. All these steps together are called the deployment process.

Needless to say all these processes are supported by an infrastructure layer which is also called the data engineering layer. This layer looks at the most efficient and effective way of running all these processes through modularization and parallelization.

All these processes have to be designed seamlessly to get the business outcomes in the most effective and efficient way. To take an analogy, it's like running a factory where raw materials get converted into a finished product which is then consumed by the end customers. In our case, the raw material is the data, the product is the model generated from the training phase and the consumers are any business processes which use the outcomes generated from the model.

Let us now see how we can execute the factory model to generate the business outcomes.

Project Structure

Before we dive deep into the scripts, let us look at our project structure.

Our root folder is the Machine Translation folder which contains two sub folders Data and factoryModel. The Data subfolder contains the raw data. The factoryModel folder contains different subfolders containing scripts for our processes. We will be looking at each of these scripts in detail in the subsequent sections. Finally we have two driver files mt_driver_train.py which is the driver file for the training process and mt_Inference.py which is the driver file for the inference process.

Let us first dive into the training phase scripts.

Training Phase

The first part of the factory model is the training phase, which comprises all the processes till the creation of the model. We will start off by building the supporting files and folders before we get into the driver file. We will first start with the configuration file.

Configuration file

When we were working with the notebook files, we were at liberty to change the parameters we wanted to vary, say for example the path to the input file or some hyperparameters like the number of dimensions of the embedding vector, on the notebook itself. However when an application is in production we would not have the luxury of changing the parameters and hyperparameters directly in the code base. To get over this problem we use configuration files. We consolidate all the parameters and hyperparameters of the model in the configuration file. All processes will pick the parameters from the configuration file for further processing.

The configuration file will be inside the config folder. Let us now build the configuration file.

Open a text editor like notepad++ or any other editor of your choice, open a new file and name it mt_config.py. Let us start adding the below code in this file.

'''
This is the configuration file for storing all the application parameters
'''

import os
from os import path


# This is the base path to the Machine Translation folder
BASE_PATH = '/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/BayesianQuest/MT/MachineTranslation'
# Define the path where data is stored
DATA_PATH = path.sep.join([BASE_PATH,'Data/deu.txt'])

In lines 5 and 6 we import the necessary library packages.

Line 10, we define the base path for the application. You need to change this path based on your specific path to the application. Once the base path is set, the rest of the paths will be derived out from it. In Line 12, we define the path to the raw data set folder. Note that we just join the name of the data folder and the raw text file with the base path to get the data path. We will be using the data path to read in the raw data.
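As a side note, path.sep.join() works here, but the 'Data/deu.txt' part still hard codes a forward slash. One alternative you could consider is os.path.join(), which inserts the separator of the operating system between every component. The sketch below uses a made up base path, and the location of the model file is an assumption for illustration only, not the article's actual configuration.

```python
import os

# Hypothetical base path, replace it with the path on your own system
BASE_PATH = os.path.join('home', 'user', 'MachineTranslation')

# Every other path is derived from the base path
DATA_PATH = os.path.join(BASE_PATH, 'Data', 'deu.txt')

# Assumed location of the saved model, for illustration only
MODEL_PATH = os.path.join(BASE_PATH, 'output', 'model.h5')

print(DATA_PATH)
print(MODEL_PATH)
```

With this pattern only BASE_PATH has to be edited when the project is moved to another machine.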

In the config folder there will be another file named __init__.py . This is a special file which tells Python to treat the config folder as part of the package. In this folder it will be an empty file with no code in it.

Loading Data

The next helper files we will build are those for loading raw files and preprocessing. The code we use for these purposes is the same code which we used for building the prototype. The first of these files will reside in the dataLoader folder.

In your text editor open a new file and name it as datasetloader.py and then add the below code into it

'''
Factory Model for Machine translation preprocessing.
This is the script for loading the data and preprocessing data
'''

import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# Creating the class to load data and then do the preprocessing as sequence of steps

class textLoader:
	def __init__(self , preprocessors = None):
		# This init method is to store the text preprocessing pipeline
		self.preprocessors = preprocessors
		# Initializing the preprocessors as an empty list of the preprocessors are None
		if self.preprocessors is None:
			self.preprocessors = []

	def loadDoc(self,filepath):
		# This is the function to read the file from the path provided
		# Open the file
		file = open(filepath,mode = 'rt',encoding = 'utf-8')
		# Reading the text
		text = file.read()
		#Once the file is read, applying the preprocessing steps one by one
		if self.preprocessors is not None:
			# Looping over all the preprocessing steps and applying them on the text data
			for p in self.preprocessors:
				text = p.preprocess(text)
				
		# Closing the file
		file.close()
				
		# Returning the text after all the preprocessing
		return text

Before addressing the code block line by line, let us get a big picture perspective of what we are trying to accomplish. When working with text you would have realised that different sources of raw text require different preprocessing treatments. A preprocessing method which we have used in one circumstance may not be warranted in a different one. So in this code block we are building a template called textLoader, which reads in raw data and then applies different preprocessing steps like a pipeline, as the situation warrants. Each of the individual preprocessing steps would be defined separately. The textLoader class first reads in the data and then applies the selected preprocessing steps one after the other. Let us now dive into the details of the code.

Lines 6 to 10 import all the necessary library packages for the process.

In line 14 we define the textLoader class. The constructor in line 15 takes the text preprocessor pipeline as the input. The preprocessors are given as a list, with None as the default value. The preprocessors provided to the constructor are stored in line 17. Lines 19-20 initialize an empty list if the preprocessors argument is None. If you haven't got a handle on why the preprocessors are defined this way, that is ok. This will be clearer when we define the actual preprocessors. Just hang on till then.

From line 22 we start the first function within this class. This function reads the raw text and then applies the preprocessing pipeline. Lines 25-27, where we open the text file and read the text, are the same as what we defined during the prototype phase in the last post. In line 29 we check whether any preprocessor pipeline has been defined. If one is defined, its steps are applied on the text one by one in lines 31-32. Each preprocessor in the pipeline implements its own .preprocess method. This method will become clear once we take a look at each of the preprocessors. We finally close the raw file and then return the processed text in lines 35-38.
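To see the pipeline idea in isolation, here is a minimal sketch with two toy preprocessors. The classes Lowercase and StripWhitespace and the function run_pipeline() are hypothetical and exist only for this illustration. The actual preprocessors of the application are built in the next section.

```python
class StripWhitespace:
    """Toy preprocessor: removes leading and trailing whitespace."""
    def preprocess(self, text):
        return text.strip()


class Lowercase:
    """Toy preprocessor: converts the text to lower case."""
    def preprocess(self, text):
        return text.lower()


def run_pipeline(text, preprocessors=None):
    """Apply every preprocessor in order, feeding each output to the next step."""
    for p in (preprocessors or []):
        text = p.preprocess(text)
    return text


result = run_pipeline('  Guten Morgen  ', [StripWhitespace(), Lowercase()])
print(result)
```

Because every step exposes the same preprocess() method, steps can be added, removed or reordered without touching the loader itself.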

The __init__.py file inside this folder will contain the following line for importing the textLoader class from the datasetloader.py file for any calling script.

from .datasetloader import textLoader

Processing Data : Preprocessing pipeline construction

Next we will create the files for preprocessing the text. In the last section we saw how the raw data was loaded and then preprocessing pipeline was applied. In this section we look into the preprocessing pipeline. The folder structure will be as shown in the figure.

There will be three preprocessor classes for processing the raw data.

SentenceSplit : Preprocessor to split raw text into pairs of English and German sentences. This class is inside the file splitsentences.py
cleanData : Preprocessor to apply cleaning steps like removing punctuation and whitespace. This class is inside the file datacleaner.py
TrainMaker : Preprocessor to tokenize text and then finally prepare the train and validation sets. This class is inside the file tokenizer.py

Let us now dive into each of the preprocessors.

Open a new file and name it splitsentences.py. Add the following code to this file.

'''
Script for preprocessing of text for Machine Translation
This is the class for splitting the text into sentences
'''

import string
from numpy import array

class SentenceSplit:
	def __init__(self,nrecords):
		# Creating the constructor for splitting the sentences
		# nrecords is the parameter which defines how many records you want to take from the data set
		self.nrecords = nrecords
		
	# Creating the new function for splitting the text
	def preprocess(self,text):
		sen = text.strip().split('\n')
		sen = [i.split('\t') for i in sen]
		# Saving into an array
		sen = array(sen)
		# Return only the first two columns as the third column is metadata. Also select the number of rows required
		return sen[:self.nrecords,:2]

This is the first of our preprocessors. This preprocessor splits the raw text and finally outputs an array of English and German sentence pairs.

After we import the required packages in lines 6-7, we define the class in line 9. We pass a variable nrecords to the constructor to subset the raw text and select number of rows we want to include for training.

The preprocess function starts in line 16. This is the function which we were accessing in line 32 of the textLoader class which we discussed in the last section. The rest is the same code we have used in the prototype building phase which includes

Splitting the text into lines, one sentence pair per line, in line 17
Splitting each line on tabs to get the English and German sentences ( line 18)

Finally we convert the processed sentences into an array and return only the first two columns of the array. Please note that the third column contains metadata of each line and therefore we exclude it from the returned array. We also subset the array based on the number of records we want.
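The splitting logic can be tried out on a few made up lines. This is only a sketch. It uses plain Python lists instead of the numpy array used in the class, and the sample text and the function name split_sentences() are assumptions for illustration.

```python
# Three made up lines in the layout: English <tab> German <tab> metadata
raw_text = (
    "Hi.\tHallo!\tattribution 1\n"
    "Run!\tLauf!\tattribution 2\n"
    "Wow!\tPotzdonner!\tattribution 3\n"
)


def split_sentences(text, nrecords):
    """Split text into rows, keep the first two columns of the first nrecords rows."""
    rows = [line.split('\t') for line in text.strip().split('\n')]
    return [row[:2] for row in rows[:nrecords]]


pairs = split_sentences(raw_text, 2)
print(pairs)
```

With nrecords set to 2 only the first two sentence pairs survive, and the metadata column is dropped from both.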

Now that the first preprocessor is complete, let us create the second preprocessor.

Open a new file and name it datacleaner.py and copy the below code.

'''
Script for preprocessing data for Machine Translation application
This is the class for removing the punctuations from sentences and also converting it to lower cases
'''

import string
from numpy import array
from unicodedata import normalize

class cleanData:
	def __init__(self):
		# Creating the constructor for removing punctuations and lowering the text
		pass
		
	# Creating the function for removing the punctuations and converting to lowercase
	def preprocess(self,lines):
		cleanArray = list()
		for docs in lines:
			cleanDocs = list()
			for line in docs:
				# Normalising unicode characters
				line = normalize('NFD', line).encode('ascii', 'ignore')
				line = line.decode('UTF-8')
				# Tokenize on white space
				line = line.split()
				# Removing punctuations from each token
				line = [word.translate(str.maketrans('', '', string.punctuation)) for word in line]
				# convert to lower case
				line = [word.lower() for word in line]
				# Remove tokens with numbers in them
				line = [word for word in line if word.isalpha()]
				# Store as string
				cleanDocs.append(' '.join(line))
			cleanArray.append(cleanDocs)
		return array(cleanArray)

This preprocessor is to clean the array of German and English sentences we received from the earlier preprocessor. The cleaning steps are the same as what we have seen in the previous post. Let us quickly dive in and understand the code block.

We start off by defining the cleanData class in line 10. The preprocess method starts in line 16 with the array from the previous preprocessing step as the input. We define two placeholder lists in line 17 and line 19. Line 18 loops through each sentence pair of the array and line 20 through the two sentences of each pair, carrying out the following cleaning operations

Lines 22-23: Normalise the text
Line 25: Split the text on whitespace into tokens
Line 27: Remove punctuation from each token
Line 29: Convert the text to lower case
Line 31: Remove tokens which contain numbers

Finally in line 33 all the tokens are joined together and appended into the cleanDocs list. In line 34 each cleaned sentence pair is appended into the cleanArray list, which is converted into an array and returned in line 35.
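To see what these steps do to a single sentence, here is a small sketch which applies the same operations to one made up German string. The function name clean_sentence() and the example sentence are assumptions for illustration.

```python
import string
from unicodedata import normalize


def clean_sentence(line):
    """Apply the cleaning steps described above to one sentence."""
    # Normalise unicode characters, accented letters lose their accents
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    # Tokenize on white space
    tokens = line.split()
    # Remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [word.translate(table) for word in tokens]
    # Convert to lower case
    tokens = [word.lower() for word in tokens]
    # Keep only the tokens which are purely alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    return ' '.join(tokens)


cleaned = clean_sentence("Wie geht's, Jürgen? 2 Äpfel!")
print(cleaned)  # wie gehts jurgen apfel
```

Notice that the umlauts are reduced to their base letters and the token "2" is dropped altogether.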

Let us now explore the third preprocessor.

Open a new file and name it tokenizer.py . This file is pretty long and therefore we will go over it function by function. Let us explore the file in detail

'''
This class has methods for tokenizing the text and preparing train and test sets
'''

import string
import numpy as np
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


class TrainMaker:
	def __init__(self):
		# Creating the constructor for creating the tokenizers
		pass
	
	# Creating an internal function for tokenizing the text	
	def tokenMaker(self,text):
		tokenizer = Tokenizer()
		tokenizer.fit_on_texts(text)
		return tokenizer

We import all the required packages in lines 5-10, after which we define the class and its constructor in lines 13-16. There is nothing going on in the constructor so we can conveniently pass over it.

The first function starts on line 19. This is a function we are familiar with from the previous post. It fits the tokenizer on the text. The first step is to instantiate the tokenizer object in line 20 and then fit the tokenizer object on the provided text in line 21. Finally the tokenizer object which is fit on the text is returned in line 22. This function will be used for creating the tokenizer dictionaries for both English and German text.

The next function which we will see is the sequenceMaker. In the previous post we saw how we convert text into sequences of integers. The sequenceMaker function is used for this task.
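If you want an intuition for what fitting a tokenizer produces, the toy sketch below builds a word index in plain Python. It is a simplified stand-in and not the Keras implementation. It assumes the text is already cleaned, and the function name build_word_index() is hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Map every word to an integer, the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(), start=1)
    }


word_index = build_word_index(['i am here', 'i am fine', 'i see'])
# Index 0 is kept free for padding, hence the + 1 for the vocabulary size
vocab_size = len(word_index) + 1

print(word_index)
print(vocab_size)
```

The word 'i' occurs most often and therefore receives index 1, followed by 'am' with index 2.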

		
	# Creating an internal function for encoding and padding sequences
	
	def sequenceMaker(self,tokenizer,stdlen,text):
		# Encoding sequences as integers
		seq = tokenizer.texts_to_sequences(text)
		# Padding the sequences to the standard length
		seq = pad_sequences(seq,maxlen=stdlen,padding = 'post')
		return seq

The inputs to the sequenceMaker function on line 26 are the tokenizer, the standard length of a sequence and the raw text which needs to be converted to sequences. First the text is converted to sequences of integers in line 28. As the sequences have to be of a standard length, they are padded to that length in line 30. The standard length integer sequences are then returned in line 31.
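
The encode and pad steps can be sketched in plain Python as below. This is only an illustration of the idea under some assumptions: the helper name encode_and_pad and the tiny word index are made up, words missing from the index are skipped, and sentences longer than the standard length lose their leading words, which mirrors the default 'pre' truncation of pad_sequences.

```python
def encode_and_pad(word_index, texts, stdlen):
    """Encode sentences as integers, then pad with zeros at the end ('post')."""
    padded = []
    for sentence in texts:
        # Words missing from the index are skipped
        seq = [word_index[w] for w in sentence.lower().split() if w in word_index]
        # Keep only the last stdlen tokens when the sentence is too long
        seq = seq[-stdlen:]
        # Add zeros at the end until the standard length is reached
        seq = seq + [0] * (stdlen - len(seq))
        padded.append(seq)
    return padded

# Hypothetical word index, just for this illustration
word_index = {"i": 1, "like": 2, "tea": 3, "very": 4, "much": 5}
sequences = encode_and_pad(word_index, ["i like tea", "i like tea very much"], 4)
print(sequences)  # [[1, 2, 3, 0], [2, 3, 4, 5]]
```

The short sentence is padded with a zero at the end, while the long sentence is cut down to four tokens.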

		
	# Creating another function to find the maximum length of the sequences	
	def qntLength(self,lines):
		doc_len = []
		# Getting the length of all the language sentences
		[doc_len.append(len(line.split())) for line in lines]
		return np.quantile(doc_len, .975)

The next function we will define finds the quantile length of the sentences. As seen in the previous post, we made the standard length of the sequences equal to the 97.5% quantile length of the respective text corpus. The function starts in line 34, where the complete text is given as input. We then create a placeholder list in line 35. In line 37 we parse through each of the lines and find the length of the sentence in words. The length of each sentence is stored in the placeholder list we created earlier. Finally in line 38, the 97.5% quantile of the lengths is returned to get the standard length.
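
A small worked example may help here. The five sentences below are made up for illustration. Note that np.quantile interpolates between values, so the result is usually not a whole number, which is why the standard length is wrapped in int() before it is used for padding.

```python
import numpy as np

# Made-up sentences with 1, 2, 3, 4 and 5 words
lines = ["go", "i ran", "he is late", "we love our dog", "she reads a book daily"]
doc_len = [len(line.split()) for line in lines]

# 97.5% quantile of the sentence lengths (linear interpolation by default)
q = np.quantile(doc_len, .975)
std_len = int(q)  # int() truncates towards zero

print(q)        # approximately 4.9
print(std_len)  # 4
```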

		
	# Creating the function for creating tokenizers and also creating the train and test sets from the given text
	def preprocess(self,docArray):
		# Creating tokenizer for English sentences
		eng_tokenizer = self.tokenMaker(docArray[:,0])
		# Finding the vocabulary size of the tokenizer
		eng_vocab_size = len(eng_tokenizer.word_index) + 1
		# Creating tokenizer for German sentences
		deu_tokenizer = self.tokenMaker(docArray[:,1])
		# Finding the vocabulary size of the tokenizer
		deu_vocab_size = len(deu_tokenizer.word_index) + 1
		# Finding the maximum length of English and German sequences
		eng_length = self.qntLength(docArray[:,0])
		ger_length = self.qntLength(docArray[:,1])
		# Splitting the train and test set
		train,test = train_test_split(docArray,test_size = 0.1,random_state = 123)
		# Calling the sequence maker function to create sequences of both train and test sets
		# Training data
		trainX = self.sequenceMaker(deu_tokenizer,int(ger_length),train[:,1])
		trainY = self.sequenceMaker(eng_tokenizer,int(eng_length),train[:,0])
		# Validation data
		testX = self.sequenceMaker(deu_tokenizer,int(ger_length),test[:,1])
		testY = self.sequenceMaker(eng_tokenizer,int(eng_length),test[:,0])
		return eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,docArray,trainX,trainY,testX,testY,eng_length,ger_length

We tie all the earlier functions together in the preprocess method, which starts in line 41. The input to this function is the array of English and German sentence pairs. The steps in this function are:

Line 43 : The English sentences are tokenized using the tokenMaker function created in line 19.
Line 45 : We find the vocabulary size for the English corpus.
Lines 47-49 : The above two steps are repeated for the German corpus.
Lines 51-52 : The standard lengths of the English and German sentences are found.
Line 54 : The array is split into train and test sets.
Line 57 : The input sequences for the training set are created using the sequenceMaker() function. Please note that the German sentences are the input variable ( trainX ).
Line 58 : The target sequences, which are the English sequences, are created in this step.
Lines 60-61 : The input and target sequences are created for the test set.

All the variables and the train and test sets are returned in line 62.

The __init__.py file inside this folder will contain the following lines

from .splitsentences import SentenceSplit
from .datacleaner import cleanData
from .tokenizer import TrainMaker

That takes us to the end of the preprocessing steps. Let us now start the model building process.

Model building Scripts

Open a new file and name it mtEncDec.py . Copy the following code into the file.

'''
This is the script and template for different models.
'''

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import TimeDistributed

class ModelBuilding:
	@staticmethod
	def EncDecbuild(in_vocab,out_vocab, in_timesteps,out_timesteps,units):
		# Initializing the model with Sequential class
		model = Sequential()
		# Initiating the embedding layer for the text
		model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
		# Adding the first LSTM layer
		model.add(LSTM(units))
		# Using the RepeatVector to map the input sequence length to output sequence length
		model.add(RepeatVector(out_timesteps))
		# Adding the second layer of LSTM 
		model.add(LSTM(units, return_sequences=True))
		# Adding the fully connected layer with a softmax layer for getting the probability
		model.add(TimeDistributed(Dense(out_vocab, activation='softmax')))
		# Compiling the model
		model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
		# Printing the summary of the model
		model.summary()
		return model

The model building script is straightforward. Here we implement the encoder decoder model we described extensively in the last post.

We start by importing all the necessary packages in lines 5-10. We then get to the meat of the model by defining the ModelBuilding class in line 12. The model we are using for our application is defined through the function EncDecbuild in line 14. The inputs to the function are:

in_vocab : This is the size of the German vocabulary
out_vocab : This is the size of the English vocabulary
in_timesteps : The standard sequence length of the German sentences
out_timesteps : The standard sequence length of the English sentences
units : Number of hidden units for the LSTM layers.

The progressive building of the model was covered extensively in the last post. Let us quickly run through the same here

In line 16 we initialize the Sequential class.
The next layer is the Embedding layer defined in line 18. This layer converts the text to word embedding vectors. The inputs are the German vocabulary size, the dimension required for the word embeddings and the sequence length of the input sequences. In this example we have kept the dimension of the word embedding the same as the number of units of the LSTM. However this is a parameter which can be experimented with.
In line 20 we add our first LSTM layer, which forms the encoder part.
We then perform the RepeatVector operation in line 22 so as to make the mapping between the encoder time steps and the decoder time steps.
We add our second LSTM layer for the decoder part in line 24.
The next layer is the dense layer, whose output size is equal to the English vocabulary size ( line 26 ).
Finally we compile the model using the ‘adam’ optimizer and then summarise the model in lines 28-30.
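
The RepeatVector step is often the least intuitive one, so here is a small NumPy sketch of what happens to the shapes. This is not Keras code. The array named encoded is a made-up stand-in for the output of the first LSTM layer, and the sizes are arbitrary.

```python
import numpy as np

batch, units, out_timesteps = 2, 3, 4

# Stand-in for the encoder LSTM output: one vector of size `units` per sentence
encoded = np.arange(batch * units).reshape(batch, units)

# Copy each sentence vector once per output time step, as RepeatVector does
repeated = np.repeat(encoded[:, np.newaxis, :], out_timesteps, axis=1)

print(encoded.shape)   # (2, 3)
print(repeated.shape)  # (2, 4, 3)
```

Every output time step of the decoder therefore starts from the same summary vector of the input sentence.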

So far we explored the file ecosystem for our application. Next we will tie all these together in the driver program.

Driver Program

Open a new file and name it mt_driver_train.py and start adding the following code blocks.

'''
This is the driver file which controls the complete training process
'''

from factoryModel.config import mt_config as confFile
from factoryModel.preprocessing import SentenceSplit,cleanData,TrainMaker
from factoryModel.dataLoader import textLoader
from factoryModel.models import ModelBuilding
from tensorflow.keras.callbacks import ModelCheckpoint
from factoryModel.utils.helperFunctions import *

## Define the file path to input data set
filePath = confFile.DATA_PATH

print('[INFO] Starting the preprocessing phase')

## Load the raw file and process the data
ss = SentenceSplit(50000)
cd = cleanData()
tm = TrainMaker()

Let us first look at the imports. In line 5 we import the configuration file which we defined earlier. Please note the folder structure we implemented for the application. The configuration file is imported from the config folder, which is inside the folder named factoryModel. Similarly, in line 6 we import all three preprocessing classes from the preprocessing folder. In line 7 we import the textLoader class from the dataLoader folder and finally in line 8 we import the ModelBuilding class from the models folder.

The first task we will do is to get the path of the files which we defined in the configuration file. We get the path to the raw data in line 13.

In lines 18-20 we instantiate the preprocessor classes, starting with SentenceSplit, then cleanData and finally TrainMaker. Please note that we pass a parameter to the SentenceSplit(50000) class to indicate that we want only 50000 rows of the raw data for processing.

Having seen the three preprocessing classes, let us now see how these preprocessors are tied together in a pipeline to be applied sequentially on the raw text. This is achieved in the next code block.

# Initializing the data set loader class and then executing the processing methods
tL = textLoader(preprocessors = [ss,cd,tm])
# Load the raw data, preprocess it and create the train and test sets
eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,text,trainX,trainY,testX,testY,eng_length,ger_length = tL.loadDoc(filePath)

In line 21 we instantiate the textLoader class. Please note that all the preprocessing classes are given sequentially in a list as the parameter to this class. This way we ensure that the preprocessors are applied one after the other when we run the textLoader class. Please take some time to review the textLoader class earlier in the post to understand the dynamics of the loading and preprocessing steps.

In line 23 we call the loadDoc function, which takes the path of the data set as the input. A lot goes on inside this method.

First, it loads the raw text using the file path provided.
On the raw text which is loaded, the three preprocessors are applied one after the other.
The last preprocessing step returns all the required data sets, like the train and test sets, along with the variables we require for modelling.
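
The pipeline idea can be sketched in a few lines of Python. Please treat this as a toy illustration and not as the textLoader class defined earlier in the post. The class names Lowercase, SplitWords and SimpleLoader are made up, and the sketch assumes that every preprocessor exposes a preprocess method.

```python
class Lowercase:
    # Hypothetical preprocessor: converts every sentence to lower case
    def preprocess(self, data):
        return [sentence.lower() for sentence in data]

class SplitWords:
    # Hypothetical preprocessor: splits every sentence into words
    def preprocess(self, data):
        return [sentence.split() for sentence in data]

class SimpleLoader:
    # Hypothetical loader: applies the preprocessors in the order given
    def __init__(self, preprocessors=None):
        self.preprocessors = preprocessors or []

    def load(self, data):
        for p in self.preprocessors:
            # The output of one preprocessor is the input to the next
            data = p.preprocess(data)
        return data

loader = SimpleLoader(preprocessors=[Lowercase(), SplitWords()])
result = loader.load(["Hello World", "Guten Tag"])
print(result)  # [['hello', 'world'], ['guten', 'tag']]
```

The order of the list matters, because each preprocessor works on whatever the previous one returned.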

We now come to the end of the preprocessing step. Next we take the preprocessed data and train the model.

Training the model

We have already built all the necessary scripts required for training. We will tie all those pieces together in the training phase. Enter the following lines of code in our script

### Initiating the training phase #########
# Initialise the model
model = ModelBuilding.EncDecbuild(int(deu_vocab_size),int(eng_vocab_size),int(ger_length),int(eng_length),256)
# Define the checkpoints
checkpoint = ModelCheckpoint('model.h5',monitor = 'val_loss',verbose = 1, save_best_only = True,mode = 'min')
# Fit the model on the training data set
model.fit(trainX,trainY,epochs = 50,batch_size = 64,validation_data=(testX,testY),callbacks = [checkpoint],verbose = 2)

In line 34, we initialize the model object. Please note that when we built the script, ModelBuilding was the name of the class and EncDecbuild was the method under the class. This is how we initialize the model object in line 34. The parameters we give are the German and English vocabulary sizes, the sequence lengths of the German and English sentences and the number of units for the LSTM ( which is what we adopt for the embedding size also ). We define the checkpoint in line 36.

We start the model fitting in line 38. At the end of the training process the best model, judged by the validation loss, is saved as model.h5, the file name we passed to ModelCheckpoint.

Saving the other files and variables

Once the training is done the model file is stored as a ‘model.h5’ file. However we need to save the other files and variables as pickle files so that we can utilise them during our inference process. We will create a script where we store all such utility functions for saving data. This script will reside in the utils folder. Open a new file and name it helperFunctions.py ( matching the import in the driver file ) and copy the following code.

'''
This script lists down all the helper functions which are required for processing raw data
'''

from pickle import load
from numpy import argmax
from tensorflow.keras.models import load_model
from pickle import dump

def save_clean_data(data,filename):
    dump(data,open(filename,'wb'))
    print('Saved: %s' % filename)

Lines 5-8 we import all the necessary packages.

The first function we will be creating is to dump any files as pickle files which is initiated in line 10. The parameters are the data and the filename of the data we want to save.

Line 11 dumps the data as pickle file with the file name we have provided. We will be using this utility function to save all the files and variables after the training phase.
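
To see the save step working end to end, here is a small round trip sketch. It uses a variant of save_clean_data that closes the file through a with block, and a loader named load_clean_data which is a hypothetical counterpart added only for this illustration. The temporary folder and the value 7.0 are also made up.

```python
import os
import pickle
import tempfile

def save_clean_data(data, filename):
    # The with block closes the file handle once the dump is complete
    with open(filename, 'wb') as f:
        pickle.dump(data, f)
    print('Saved: %s' % filename)

def load_clean_data(filename):
    # Hypothetical counterpart, used here only to check the round trip
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Save a value into a temporary folder and read it back
path = os.path.join(tempfile.mkdtemp(), 'eng_length.pkl')
save_clean_data(7.0, path)
restored = load_clean_data(path)
print(restored)  # 7.0
```

Please remember that pickle files should only be loaded from sources you trust.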

In our training driver file mt_driver_train.py add the following lines

### Saving the tokenizers and other variables as pickle files
save_clean_data(eng_tokenizer,'eng_tokenizer.pkl')
save_clean_data(eng_vocab_size,'eng_vocab_size.pkl')
save_clean_data(deu_tokenizer,'deu_tokenizer.pkl')
save_clean_data(deu_vocab_size,'deu_vocab_size.pkl')
save_clean_data(trainX,'trainX.pkl')
save_clean_data(trainY,'trainY.pkl')
save_clean_data(testX,'testX.pkl')
save_clean_data(testY,'testY.pkl')
save_clean_data(eng_length,'eng_length.pkl')
save_clean_data(ger_length,'ger_length.pkl')

Lines 42-52, we save all the variables we received from line 24 as pickle files.

Executing the script

Now that we have completed all the scripts, let us go ahead and execute them. Open a terminal and run the following command.

$ python mt_driver_train.py

All the scripts will be executed and finally the model files and other variables will be stored on disk. We will be using all the saved files in the inference phase. We will address the inference phase in the next post of the series.

Go to article 7 of this series : From prototype to production: Inference Process

You can download the notebook for the prototype using the following link

https://github.com/BayesianQuest/MachineTranslation/tree/master/Production


Logic of Logistic Regression – Part III


In our previous post on logistic regression we defined the concept of parameters and got a first-hand glimpse of the dynamics between the data set and the parameters to obtain our first set of predictions. In this part we will go further into how we optimize the parameters in order to improve the accuracy of our predictions. We will be dealing with the following concepts:

Deciphering the prediction errors
Minimizing errors through gradient descent and finding optimized parameters
Prediction with the optimized set of parameters.

Deciphering Prediction Errors

Let us revisit the toy example we discussed in our last post and dissect the below table which represented the dynamics of prediction.

[Image: table showing the features, the weighted sum and the predictions for the toy example]

To recap, let us list down our discussions in the last post on the dynamics involved in the above table.

We first assumed an initial set of parameters
Multiplied the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
Converted the weighted sum into predictions ( column 6) by applying the activation function (sigmoid function).

Let us take a moment to reflect on what the predictions really mean. The predictions are in fact the probabilities of the customer buying the insurance policy. For example, for the first customer, we are predicting that the probability of buying the insurance policy is about 17.9%.

However, when we talk about predictions, the first thing which comes to mind is the veracity of those predictions. How close to reality is the first set of predictions which we made ? If we recall, in our last discussion on the training set, we introduced a new column called the labels. The labels are in fact the reality. For example, looking at the labels column we know that the first two customers did not buy the insurance policy ( label of ‘0’ ) and the next two bought the insurance policy ( label of ‘1’ ). The veracity of our predictions can be gauged by comparing them with the reality manifested in the labels. By comparing, we can see that the first and last customer predictions are somewhat close to reality and the middle ones are pretty off target. In an ideal state, we want the first two predictions to be close to ‘0’ and the last two close to ‘1’. However, what we predicted has obviously deviated from reality. Such deviations are the errors in our predictions. We need to note that the calculation of error for a classification problem like ours is a little mathematically involved and is not as straightforward as subtracting the probability from the label. For the sake of simplicity let us not get into those mathematical calculations and stick to our understanding that there is some error for each example. From the errors of each example we can find the average error by summing up the errors of all examples and dividing by the number of examples ( 4 in our case ). In machine learning parlance the average error so obtained is called the ‘Cost’.
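
For readers who would like a peek at the calculation we skipped, here is a minimal sketch. It uses the log loss ( also called binary cross entropy ), which is the cost commonly used with logistic regression. The post does not state which formula it has in mind, so please treat this choice as an assumption. The first and last probabilities come from our toy example, while the two middle ones are made up for illustration.

```python
import math

def log_loss(labels, probs):
    """Average log loss over all examples."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Only one of the two terms is non-zero, depending on the label
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [0, 0, 1, 1]
# 0.179 and 0.814 are from the toy example; 0.60 and 0.45 are made-up values
probs = [0.179, 0.60, 0.45, 0.814]

cost = log_loss(labels, probs)
better_cost = log_loss(labels, [0.01, 0.01, 0.99, 0.99])
print(cost, better_cost)
```

Predictions which sit closer to the labels give a smaller cost, which is exactly the behaviour we want from an error measure.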

Now that we know that there is a ‘Cost’ involved in our predictions, our aim should be to minimize the cost so that our predictions are as close to reality as possible. However, the million dollar question is, how do we minimize the cost ? What are the levers we have to reduce our cost ? Going back to our toy example, the two entities we have played around with to get the predictions are the ‘data’ and the ‘parameters’. We cannot change the given data because it is fixed. So all we have to play around with are the parameters which we assumed. We have to change our parameters systematically so that we minimize the cost and get our predictions as close to reality as possible. One of the ways we do this is by a procedure called gradient descent.

Gradient Descent

To understand the concept of gradient descent let us look at some graphical representations.

[Figure: the cost plotted against the parameters]

A pictorial representation of the cost function looks like the one above. On the ‘X’ axis we have our parameters and on the ‘Y’ axis we have the cost. From the figure we can see that there is some set of parameters, ‘P’, with which we can get to the minimum cost ‘C_min’. Our aim is to find those parameters which will give us the minimum cost.

Let us represent the initial parameters we assumed as P_initial. For this set of parameters let us denote the cost we derived as C1, as given in the figure. We can see from the figure that by moving the P value to the left ( decreasing the parameters ) by some value we can get to the minimum value of cost. Alternatively, if our initial ‘P’ value were to be on the left side of the graph, we would have to move to the right ( increase the value of parameters ) to get to the minimum cost. The procedure for achieving this is called the gradient descent.

The idea behind gradient descent is represented pictorially as below.

[Figure: gradient descent moving down the cost curve in small steps]

We change the parameters by small steps in an iterative fashion so as to get to the minimum cost. To find the “small steps” mentioned in the previous line we use a trick we learned in high school calculus called the partial derivative. By taking the partial derivative at each point of the cost curve we get a value by which we have to adjust the parameters. With the new set of adjusted parameters we find the new cost. Again we find the partial derivative at the new cost level to get the next step which we have to take, and this process continues till we reach the minimum cost. An analogy to this process is like this. Suppose we are on top of a hill, blindfolded, and we want to find our way down the hill. The way we can do this is by feeling the ground with our foot to find those spots which are lower than the one where we are currently and then move to the new spot. From the new spot we repeat the process till we finally reach the bottom of the hill. Gradient descent works somewhat similar to this.

Summarizing our discussions on gradient descent, these are the steps we take to get the optimum parameters.

First, start off with the assumed random parameters.
Find the cost ( errors ) associated with the assumed parameters.
Find the small steps we have to take to alter our parameters, by taking partial derivative of the cost.
Reduce the parameters by the small steps and get a new set of parameters
Find the new cost associated with the new parameters.
Repeat the processes 3,4 & 5 till we get the most optimized cost.
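
The steps above can be sketched with NumPy as below. This is a minimal illustration under some assumptions and not a production implementation. The cost is the log loss, the learning rate of 0.1 and the 200 steps are arbitrary choices, and in the feature matrix only the ‘Age’ column follows the scaled values of our toy example, while the other two columns are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(X, y, w):
    # Average log loss for the current parameters
    p = sigmoid(X @ w)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def gradient_descent(X, y, w, learning_rate=0.1, steps=200):
    history = [cost_fn(X, y, w)]                    # step 2: cost of the assumed parameters
    for _ in range(steps):                          # step 6: repeat
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # step 3: partial derivatives of the cost
        w = w - learning_rate * grad                # step 4: move the parameters by a small step
        history.append(cost_fn(X, y, w))            # step 5: cost of the new parameters
    return w, history

# Columns: age, income, propensity (income and propensity values are hypothetical)
X = np.array([[-0.42, -0.30, -0.50],
              [-0.19, -0.20,  0.10],
              [ 0.03,  0.10, -0.10],
              [ 0.58,  0.40,  0.50]])
y = np.array([0, 0, 1, 1])

w_initial = np.ones(3)                              # step 1: assumed parameters
w_learned, history = gradient_descent(X, y, w_initial)
print(history[0], history[-1])
```

The cost recorded at the end is lower than the cost we started with, which shows the parameters moving towards the bottom of the hill.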

The optimized parameters which we finally get are called the learned parameters. Getting to these optimized parameters is the most involved part of machine learning. Once we learn the parameters using the training set, we are all set to do predictions, which is the objective of any machine learning process.

Doing Predictions

Having learned our set of optimized parameters from the training set, we are now equipped with enough ammunition to do predictions. For doing predictions we take a new set of data called the test set. However, there is a difference between the training set and the test set. The test set will not have any labels. Our job is to predict the labels from the parameters we have learned. So in the insurance company example, the test set would be the new set of leads which the sales team generated. We have to predict the likelihood of these leads buying an insurance policy. The way we do the prediction is as follows.

We take the optimized set of parameters learned from the training set
Multiply the parameters with the respective features ( columns 2,3 &4) to get the weighted sum.
Convert the weighted sum into probabilities ( column 6) by applying the activation function (sigmoid function).
We take a threshold point ( say 0.5 ). Any probability less than the threshold point is predicted as ‘0’ ( will not buy ) and anything greater than the threshold point is predicted as ‘1’ ( will buy ).
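
The prediction steps above can be sketched as follows. The parameters and the two test leads are made-up values, and treating a probability exactly equal to the threshold as ‘1’ is a choice made for this sketch, since the steps above do not say which way that case should go.

```python
import numpy as np

def predict(X, w, threshold=0.5):
    # Weighted sum of the features, converted to probabilities with the sigmoid
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    # Probabilities at or above the threshold are predicted as 1, the rest as 0
    return probs, (probs >= threshold).astype(int)

w = np.array([1.0, 1.0, 1.0])            # hypothetical learned parameters
X_test = np.array([[-0.5, -0.4, -0.6],   # hypothetical scaled features of two new leads
                   [ 0.4,  0.3,  0.5]])

probs, labels = predict(X_test, w)
print(probs)
print(labels)  # [0 1]
```

The first lead gets a probability below 0.5 and is predicted as ‘0’, while the second lead is predicted as ‘1’.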

The threshold point which we take to make a decision on our predictions is called the decision boundary. Needless to say, logistic regression is the basic model among a vast set of powerful classification algorithms. The significance of logistic regression is that it is the building block for the development of powerful algorithms like Support Vector Machines, Neural Networks etc. Having said that, there are many problem areas where we have to go for simple algorithms like logistic regression. Having dealt with the basic building blocks of classification problems, we will have further discussions on some of the most powerful algorithms in future posts. Until then, watch this space for more.

Logic of Logistic Regression – Part II


In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determine the quality of predictions:

Weights or parameters which we learn
The activation function, and
The decision boundary

In this second part of the series we will look deeper into the first two of those dynamic forces.

Concept of Parameters

In the first part of this series, when we were discussing the example, we assumed a set of parameters i.e W(age) = 8 ; W(income) = 3 and W(propensity) = 10. Quite naturally, a question a lot of people asked me was, where did we get those values from ? Well, as far as that example was concerned, they were just some assumed values. However, in the world of machine learning, the parameters are the Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why are the parameters so important ? To answer this let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns ( Age, Income & Propensity) are called its features and the ones along the rows are the examples. In short each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of the parameter is to act as a weighting factor for each of the features. In other words each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general if there are ‘n’ features there should be at least ‘n’ parameters ( However, in practice we will have n+1 parameters where the additional parameter is called the bias term. We will ignore that for the time being). Please note here that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.

[Image: training set with the additional labels column]

The above data set is an example of a training set. The ‘labels’ column represents the result or outcome for each record. The records with ‘0’ are negative examples and those with ‘1’ are the positive examples. In this context the negative examples are those customers who did not buy an insurance policy and the positive examples are the ones who bought one. The labels can also be interpreted from the perspective of the probability of buying. So all the negative examples are the ones where the probability of sales is low i.e near 0% and the positive ones are those with high probability i.e near 100%. In real life a training set can be made from the historical data of customers in the organisation i.e who are the customers ? How many of them bought a policy ? How many did not ? etc.

The way we go about the task of learning the parameters from the training set is as follows.

Random Assumption of Parameters : To start off, we pick some arbitrary values for the parameters. For example, let us assume the following values for the parameters ; W(age) = 1 ; W(income) = 1 and W(propensity) = 1

Scaling of the data : Once we have assumed the parameters, let us make a modification to the training data set. If we look at the values of each feature, the scales vary quite a bit. The values of the feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers, and so on. In machine learning, when the features fall on very different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract the average of the feature from each value and then divide by the range ( the difference between the maximum value and the minimum value ). Let us see this in action with the feature ‘Age’.
Average value of ‘Age’ = (28 + 32 + 36 + 46) / 4 = 35.5
Range of ‘Age’ = 46 – 28 = 18
Scaled value for the first data point (28) = (28 – 35.5) / 18 = -0.4167
Similarly we do it for the complete data set. The scaled data set is as represented below. Please note that we do not scale the labels.

Prediction with initial parameters : Once the data is scaled, we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which need to be applied to each feature of the data. Therefore the first step in arriving at a prediction is to multiply each parameter with the corresponding feature and add up the weighted features for each example. The same is carried out as below. Please note that the labels are not involved in any of these operations.
Let us study the above column closely. The weighted sum column, which we get by applying the parameters to each feature and adding them up, is the value which finally determines the prediction. However, for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. However, if you look at our weighted sum column, most of the values are outside the range of 0 and 1. So our challenge is to apply some mathematical operation to represent them as probabilities. The mathematical operation we use for this purpose is called the Activation Function. One of the most common activation functions used in classification problems is the Sigmoid function. By applying this function to the weighted sum column we convert it into numbers which can be interpreted as probabilities. The new data set after applying the activation function is as represented above.
Note that the probabilities column is our actual prediction and it can be interpreted as the probability that the customer will buy the insurance policy. So for the first customer there is only a 17.88% chance of buying the policy and for the last customer there is a high chance ( 81.4% ) of buying the policy.
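
The scaling and activation steps can be tried out with a few lines of Python. The ‘Age’ values are the ones from our toy data set. The weighted sum of -1.5 passed to the sigmoid is a made-up value, chosen only to show how a number outside the range of 0 and 1 gets squashed into a probability.

```python
import math

ages = [28, 32, 36, 46]

mean_age = sum(ages) / len(ages)     # 35.5
age_range = max(ages) - min(ages)    # 18

# Subtract the average and divide by the range
scaled_ages = [(a - mean_age) / age_range for a in ages]
print(round(scaled_ages[0], 4))      # -0.4167

def sigmoid(z):
    # Squashes any real number into the range 0 to 1
    return 1.0 / (1.0 + math.exp(-z))

p = sigmoid(-1.5)                    # hypothetical weighted sum
print(p)
```

A weighted sum of zero maps to a probability of exactly 0.5, negative sums map below 0.5 and positive sums map above it.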
Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which give the most accurate prediction. This all important step, called gradient descent, will be explained in the next part of the post. Please watch this space for the most important part of our logistic regression problem.

The Bayesian Quest

Unraveling the Enigma of Data Science

