Build your computer vision application IV : Building the pothole detector using RCNN

This is the fourth post of the series where we build a pothole detection application. We will be using multiple computer vision methods, which include annotating images using labelImg, learning about object detection and localisation, mastering the Tensorflow object detection API, training object detectors using transfer learning, object detection on video etc. This series will be split across 8 posts.

1. Introduction to object detection

2. Data set preparation and annotation using labelImg

3. Building your object detection model from scratch using Image pyramids and sliding window

4. Building your road pothole detector using RCNN ( This Post )

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In the last post we built an object detector from scratch using image pyramids and sliding window techniques. These are legacy techniques, but they are important because they lay the foundation for some of the advanced techniques. In this post we will make our foray into an advanced technique by learning about the RCNN family and then implementing an object detector using RCNN. Let us dive in.

RCNN family of object detectors

The RCNN framework was originally introduced by Girshick et al. in 2013. There have been several modifications to the original architecture, resulting in better performance over time. For some time the RCNN framework was the go-to model for object detection tasks.

Image Source : https://arxiv.org/pdf/1311.2524.pdf

The original RCNN algorithm contains the following key steps

  • Extract regions which potentially contain an object from the input image. Such extractions are called region proposal extractions. The extractions are done using an algorithm like selective search.
  • Use a pretrained CNN to extract features from the proposal regions.
  • Classify each extracted region, using a classifier like a Support Vector Machine (SVM).
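
The three steps above can be sketched as a small skeleton. This is only an illustration of how the stages connect, not the original implementation: the function name rcnn_detect and the three callables passed to it are hypothetical placeholders, and the toy stand-ins below replace selective search, the CNN and the SVM with trivial functions.

```python
import numpy as np


def rcnn_detect(image, propose_regions, extract_features, classify):
    """Run the three R-CNN stages and return (box, label, score) tuples.

    The three callables are hypothetical stand-ins for selective search,
    a pretrained CNN and a per-region classifier such as an SVM.
    """
    detections = []
    for (x, y, w, h) in propose_regions(image):   # stage 1: region proposals
        crop = image[y:y + h, x:x + w]
        features = extract_features(crop)         # stage 2: feature extraction
        label, score = classify(features)         # stage 3: classification
        detections.append(((x, y, w, h), label, score))
    return detections


# Toy stand-ins so the skeleton can be run end to end
toy_image = np.zeros((8, 8))
toy_image[2:5, 3:6] = 1.0  # a bright 3x3 "object"


def toy_proposals(img):
    return [(0, 0, 3, 3), (3, 2, 3, 3)]


def toy_features(crop):
    return [float(crop.mean())]


def toy_classifier(features):
    if features[0] > 0.5:
        return ("object", features[0])
    return ("background", 1.0 - features[0])


results = rcnn_detect(toy_image, toy_proposals, toy_features, toy_classifier)
for box, label, score in results:
    print(box, label, score)
```

In the implementation later in this post, selective search plays the role of the proposal function and the fine tuned network plays the role of both the feature extractor and the classifier.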

The original RCNN algorithm gave much better results than traditional methods like the sliding window and pyramid based methods. However, the system was slow because every proposal region had to be passed through the CNN separately. Besides, deep learning was not used for localising the objects in the image; that was mostly left to algorithms like selective search.

A significant improvement to the original RCNN algorithm, named Fast-RCNN, was published by the same author in 2015. It introduced some novel ideas such as the Region of Interest (ROI) pooling layer. Fast-RCNN runs a CNN once over the entire image to extract a feature map. Like RCNN, it still relies on selective search for region proposals, but each proposal is projected onto the feature map instead of being passed through the CNN separately. For every proposal, a fixed size window is extracted from the feature map and passed to the fully connected layers. This step is termed Region of Interest pooling. Two sets of fully connected layers then produce the class label of each region along with the location of its bounding box.
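
To make the idea of Region of Interest pooling concrete, here is a minimal NumPy sketch. It assumes a single-channel feature map and a region that is at least as large as the output grid; the function name roi_max_pool and the 2 x 2 output size are illustrative choices, not taken from the paper, where the pooling is applied to every channel of the feature map.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2-D feature map
    down to a fixed output_size, whatever the size of the region.
    Assumes the region has at least as many rows and columns as the output."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the rows and columns of the region into roughly equal bins
    row_bins = np.array_split(np.arange(region.shape[0]), out_h)
    col_bins = np.array_split(np.arange(region.shape[1]), out_w)
    pooled = np.zeros(output_size, dtype=feature_map.dtype)
    for i, rows in enumerate(row_bins):
        for j, cols in enumerate(col_bins):
            # Keep the strongest activation inside each bin
            pooled[i, j] = region[np.ix_(rows, cols)].max()
    return pooled


feature_map = np.arange(36, dtype=np.float32).reshape(6, 6)
pooled = roi_max_pool(feature_map, roi=(1, 1, 5, 4))
print(pooled)
```

Whatever the size of the proposal, the output always has the same shape, which is what allows proposals of different sizes to share the same fully connected layers.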

Within a couple of months of the publication of Fast-RCNN, another algorithm called Faster-RCNN was published, which improved upon Fast-RCNN.

The new algorithm introduced the Region Proposal Network (RPN), which eliminates the need for the selective search algorithm and builds the capability for region proposal into the R-CNN architecture itself. In this algorithm, anchors are placed uniformly across the entire image at varying scales and aspect ratios.

The image is split into equally spaced points called anchor points. At each anchor point, 9 different anchors are generated (3 scales x 3 aspect ratios in the original paper). During training, the Intersection over Union (IOU) of each anchor with the ground truth bounding boxes determines whether the anchor is labelled as containing an object, and the network learns to predict an objectness score from these labels. The objectness score is an indicator as to whether there is an object or not.
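
The anchor generation and the IOU calculation can be sketched in a few lines of NumPy. The scales, the anchor point and the ground truth box below are made-up values for illustration, and the helper names make_anchors and iou are hypothetical. In the Faster-RCNN paper, anchors with an IOU above 0.7 with a ground truth box are treated as positive and those below 0.3 as negative.

```python
import numpy as np


def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors as (x1, y1, x2, y2), all centred on (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while setting width / height to r
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)


def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so that boxes which do not overlap get zero intersection
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter)


anchors = make_anchors(cx=200, cy=200)
ground_truth = np.array([136, 136, 264, 264])  # hypothetical 128 x 128 pothole box
overlaps = iou(ground_truth, anchors)
best = int(np.argmax(overlaps))
print(anchors.shape, best, overlaps.round(2))
```

Here the square anchor at the middle scale coincides with the ground truth box, so it gets the highest IOU and would be labelled as a positive anchor.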

The objectness score is also used to filter down the number of proposals which are propagated to the subsequent binary classification and bounding box regression layers.

The binary classifier classifies the proposals as foreground (containing an object) or background (no object), and the regressor outputs the deltas, or adjustments, that need to be made to the reference anchor box to make it similar to the ground truth bounding box. After these two steps in the RPN layer, the proposals are sorted by their foreground probability score and then undergo non maxima suppression to reduce the overlapping bounding boxes.
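
The deltas produced by the regressor can be pictured with a short sketch. It follows the box parameterisation used in the RCNN papers, where the centre is shifted in proportion to the anchor size and the width and height are scaled through an exponential. The function name apply_deltas and the numbers are only for illustration.

```python
import numpy as np


def apply_deltas(anchor, deltas):
    """Adjust an anchor (x1, y1, x2, y2) with deltas (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1            # anchor width and height
    xa, ya = x1 + wa / 2, y1 + ha / 2    # anchor centre
    dx, dy, dw, dh = deltas
    cx = xa + dx * wa                    # shift the centre relative to the anchor size
    cy = ya + dy * ha
    w = wa * np.exp(dw)                  # scale the width and height
    h = ha * np.exp(dh)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])


anchor = (100.0, 100.0, 200.0, 200.0)
refined = apply_deltas(anchor, deltas=(0.1, -0.2, 0.0, np.log(2.0)))
print(refined)
```

With these made-up deltas the box moves 10 pixels to the right and 20 pixels up, keeps its width and doubles its height.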

The reduced set of bounding boxes is then propagated to an ROI pooling layer, which reduces the dimensions, and then through the fully connected layers to the final softmax and regressor layers. The softmax layer detects what type of object it is (whether it is a pothole or vegetation or sign board etc) and the regressor layer gives the adjusted bounding boxes for that object.

One of the biggest advantages of Faster RCNN over the previous versions is that all the moving parts are integrated into one single network, which also makes it considerably faster. We will leave the implementation of Faster RCNN to a subsequent post, where you could implement it using the Tensorflow object detection API.

Having got an overview of the RCNN family, let us get to the implementation of the RCNN network.

Implementation of pothole object detector using RCNN

Let us quickly get an overview of the steps involved in the implementation of the object detector using RCNN

  1. Creation of data sets with both positive and negative images. For creation of the data sets, we will be using the image annotation details we created in post 2. We will be using the same csv file which we created in post 2.
  2. Use transfer learning technique to build our classifier. The pre-trained model we will be using is the MobileNetV2
  3. Fine tune the pre-trained model as the classifier and save the model
  4. Perform selective search algorithm using OpenCV for generating region proposals
  5. Classify the proposal regions using the fine tuned MobileNetV2 model
  6. Perform non maxima suppression on the proposal regions

Let us start by importing the packages we require for this implementation

import os
import glob
import pandas as pd
import io
import cv2
import h5py
import numpy as np

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.image import extract_patches_2d
from imutils import paths
import matplotlib.pyplot as plt
import pickle
import imutils

Data Preprocessing

For data preprocessing we have to convert the data and labels into arrays so that we can train our models. We have two classes of data, i.e. the positive class which pertains to the potholes and the negative class which consists of images other than potholes. We have to preprocess both these sets of images separately.

Let us start the process with the positive class. We will be using the ‘csv’ file which we created in Post 2 for getting the required information on the positive classes. Let us read the csv files and create two empty lists to store the data and labels.

# Reading the csv file
pothole_df = pd.read_csv('pothole_df.csv')

Let us explore the head of the positive class information data frame

pothole_df.head()
Figure 1 : Positive class information

Each row of the data frame contains the file name of our image along with the localisation information of the pothole. We will be using this information to extract the region of interest (roi) from the image. Let us now get to creating the roi's from this information. To start off we will create two empty lists to store the roi features and the labels.

# Empty lists to store data and labels
data = []
labels = []

Next we will create a function to extract the regions of interest (roi's) from the positive class. This function is similar to the one which we created in the previous post.

Region of interest Extractor for positive and negative classes

# Function to extract the bounding box and preprocess the image
def roiExtractor(row, path):
    img = cv2.imread(os.path.join(path, row['filename']))
    # cv2.imread returns None when the file cannot be read
    if img is None:
        raise FileNotFoundError(os.path.join(path, row['filename']))
    # Get the bounding box elements
    bb = [int(row['xmin']), int(row['ymin']), int(row['xmax']), int(row['ymax'])]
    # Crop the image
    roi = img[bb[1]:bb[3], bb[0]:bb[2]]
    # Resize the crop to the input size of the network
    roi = cv2.resize(roi, (224, 224), interpolation=cv2.INTER_CUBIC)
    # Convert the image to an array
    roi = img_to_array(roi)
    # Preprocess the image
    roi = preprocess_input(roi)
    return roi

The inputs to the function are a row of the csv file and the path to the folder where the images are placed. We first read the image by joining the path to the images folder with the filename listed in the csv file. Once the image is read, the bounding box information for the image is extracted and the image is cropped so that only the pothole region remains. The crop is then resized to a standard size of (224,224), which is the input dimension we will use for the MobileNetV2 network. Finally, the crop is converted to an array and preprocessed. For MobileNetV2, the preprocess_input() method scales the pixel values so that they lie between -1 and 1.
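
To see what the preprocessing does to the pixel values, here is a small NumPy equivalent of the scaling that Keras applies for MobileNetV2 (divide by 127.5 and subtract 1). The function name scale_like_mobilenet_v2 is hypothetical and only meant for illustration. In the actual pipeline you should keep using preprocess_input().

```python
import numpy as np


def scale_like_mobilenet_v2(pixels):
    """Map pixel values from the range [0, 255] to the range [-1, 1]."""
    return pixels.astype("float32") / 127.5 - 1.0


sample = np.array([0, 127.5, 255])
scaled = scale_like_mobilenet_v2(sample)
print(scaled)
```

A black pixel maps to -1, a mid grey pixel to 0 and a white pixel to 1.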

We will now process the images with the function we just created. We iterate through each row of the csv file and keep only those rows where the class is 'pothole'. For each such row we get the roi using the roiExtractor function and append it to the data list we created. The label for the positive class (1) is appended to labels.

# This is the path where the images are placed. Change this path to the location you have defined
path = 'data/'
# Looping through the rows of the csv file
for idx, row in pothole_df.iterrows():
    if row['class'] == 'pothole':
        roi = roiExtractor(row, path)
        # Append the data and labels for the positive class
        data.append(roi)
        labels.append(1)
print(len(data))
print(data[0].shape)

I have 31 roi’s of the positive class with a shape of (224,224,3).

Having processed the positive examples, let us now extract the negative examples. As seen in the previous post the negative classes are general images of roads without potholes.

# Listing all the negative examples
path = 'data/Annotated'
roadFiles = glob.glob(path + '/*.jpeg')
print(len(roadFiles))

I have selected 21 negative examples. You are free to use as many examples as you like. The only point to ensure is that there is a good balance between the positive and negative classes. We will now process the negative class images.

# Looping through the images of negative class
for row in roadFiles:
    # Read the image
    img = cv2.imread(row)
    # Extract two random patches from the image
    patches = extract_patches_2d(img, (128, 128), max_patches=2)
    # Preprocess each patch in the same way as the positive class
    for patch in patches:
        # Resize the patch to the input size of the network
        roi = cv2.resize(patch, (224, 224), interpolation=cv2.INTER_CUBIC)
        # Convert the image to an array
        roi = img_to_array(roi)
        # Preprocess the image
        roi = preprocess_input(roi)
        # Append the patch to the data list and its label to the labels list
        data.append(roi)
        labels.append(0)

For the negative class, we iterate through each of the images and read them. We then extract two random patches of size (128,128) from each image using extract_patches_2d. Each patch is resized to the standard size, converted to an array and preprocessed. Finally the patches are appended to data and the label 0 is appended to labels.

Let us now take a count of the total examples we have

print(len(data))

We now have 73 examples, comprising 31 positive examples and 42 (21 images x 2 patches each) negative examples.

Preparing the train and test sets

We will now convert the data and labels into arrays and then perform one hot encoding on the labels to prepare our train and test sets.

# convert the data and labels to NumPy arrays
data = np.array(data, dtype="float32")
labels = np.array(labels)
print(data.shape)
print(labels.shape)
# perform one-hot encoding on the labels
lb = LabelBinarizer()
# Fit transform the labels array
labels = lb.fit_transform(labels)
# Convert this to categorical 
labels = to_categorical(labels)
print(labels.shape)
labels

After one hot encoding, the labels array has the shape (73,2), where the second dimension holds one column per class. The first column is our negative class [0] and the second one is the positive class [1].
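
If one hot encoding is new to you, the following small NumPy sketch shows the same transformation on five made-up labels. It uses np.eye instead of the LabelBinarizer and to_categorical calls used above, purely to show what the encoded array looks like.

```python
import numpy as np

# Five made-up labels: 1 = pothole, 0 = no pothole
raw_labels = np.array([1, 0, 0, 1, 1])

# Row i of the identity matrix is the one hot vector for class i
one_hot = np.eye(2, dtype="float32")[raw_labels]
print(one_hot.shape)
print(one_hot)

# argmax along the class dimension recovers the original labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The argmax step at the end is the same operation we will use later to turn the predicted probabilities back into class ids.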

Finally let us create our train and test sets using an 85:15 split. We are taking a higher proportion for the train set since we have very few training examples.

# Partition data to train and test set with 85 : 15 split
(trainX, testX, trainY, testY) = train_test_split(data, labels,test_size=0.15, stratify=labels, random_state=42)
print("training data shape :",trainX.shape)
print("testing data shape :",testX.shape)
print("training labels shape :",trainY.shape)
print("testing labels shape :",testY.shape)
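
As a quick sanity check on the shapes printed above, the expected split sizes can be worked out by hand. This sketch assumes that scikit-learn rounds the test set size up and gives the remaining examples to the train set, which is how train_test_split behaves when only test_size is given as a fraction.

```python
import math

n_samples = 73     # 31 positive + 42 negative examples
test_size = 0.15

# Test count is rounded up, the rest goes to the train set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

With 73 examples this gives 62 training examples and 11 test examples, which is a very small test set, so the reported metrics should be read with some caution.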

Now that we have finished the data processing, it's time to start our training process.

Training a MobilenetV2 model using transfer learning : Warming up phase

We will be building our object detector model using transfer learning process. To build our transfer learned model for pothole detection we will be using MobileNetV2 as our base network. We will remove the top layer and then build our custom layer to cater to our use case. Let us see how we build our network.

# Create the base network by removing the top of the MobileNetV2 model
baseNetwork = MobileNetV2(weights="imagenet", include_top=False,input_tensor=Input(shape=(224, 224, 3)))
# Create a custom head network on top of the basenetwork to cater to two classes.
topNetwork = baseNetwork.output
topNetwork = AveragePooling2D(pool_size=(5, 5))(topNetwork)
topNetwork = Flatten(name="flatten")(topNetwork)
topNetwork = Dense(128, activation="relu")(topNetwork)
topNetwork = Dropout(0.5)(topNetwork)
topNetwork = Dense(2, activation="softmax")(topNetwork)
# Place our custom top layer on top of the base layer. We will only train the custom top layer for now.
model = Model(inputs=baseNetwork.input, outputs=topNetwork)
# Freeze the base network so that they are not updated during the training process
for layer in baseNetwork.layers:
    layer.trainable = False

We first load the base network. The base network is MobileNetV2 and we exclude the top layer by specifying the parameter include_top=False. We also specify the shape of the input layer.

It's now time to specify our custom network. We build our custom network on top of the output of the base network. The custom head starts with an AveragePooling layer, followed by Flatten, Dense and Dropout layers, and ends with a final softmax layer for our 2 classes. We then define the model using the Model() class, with the baseNetwork input as the input and the custom network we have defined as the output.

In the final loop, we specify which layers need to be trained. Here we are specifying that the base network layers need not be trained. This is because the base network is already pre-trained and our custom layer is the one which is not trained. By specifying that only our custom layer be trained (or alternatively that the base network need not be trained), we are optimising only the custom layer. This process can be called the warming up process for the custom layer. Once the custom layer is warmed up after some iterations, we can specify that some layers of the base network can be trained too. We will perform all these steps.

First let us train our custom layer. We start off the process by defining our training parameters like learning rate, number of epochs and the batch size.

# Initialise the learning rate, epochs and batch size
LR = 1e-4
epoc = 5
bs = 16

You might be surprised that the number of epochs we have selected is only 5. Since the base network is pre-trained, we don't have to train the custom layer for many epochs. Besides, we are only warming up the custom layer.

Next let us define the data generator along with the augmentation layer.

# Create a image generator with data augmentation
aug = ImageDataGenerator(rotation_range=40,zoom_range=0.25,width_shift_range=0.2,height_shift_range=0.2,shear_range=0.30,
 horizontal_flip=True,fill_mode="nearest")

In the previous post we implemented manual data augmentation methods. Keras has a great method to do image augmentation during training using the ImageDataGenerator(). It lets us do all the augmentation we did manually in the previous post.

We have now defined most of the moving parts required for training. Let's now define the optimiser, compile the model and then fit the model with the data set.

# Compile the model
print("[INFO] compiling model...")
# Recent Keras versions expect learning_rate instead of the older lr argument
opt = Adam(learning_rate=LR)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
# Training the custom head network
print("[INFO] training the model...")
# validation_data is passed as arrays, so validation_steps is not needed
history = model.fit(aug.flow(trainX, trainY, batch_size=bs), steps_per_epoch=len(trainX) // bs,
 validation_data=(testX, testY), epochs=epoc)

Training some layers of the base network

We have done the warm up of the custom head we placed over the base network. Now let us also train some of the layers of the network along with the head. Let us first print out all the layers of the base network to determine the layers we want to train along with our head.

for (i,layer) in enumerate(baseNetwork.layers):
    print(" [INFO] {}\t{}".format(i,layer.__class__.__name__))

We iterate through each of the layers of the base network and print the index and type of each layer.

We can see that there are 153 layers in the base network. Let us train from layer 140 onwards and keep all the layers before 140 frozen.

for layer in baseNetwork.layers[140:]:
    layer.trainable = True

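# Compile the model again so that the change in trainable layers takes effect
print("[INFO] Compiling the model again...")
# Recent Keras versions expect learning_rate instead of the older lr argument
opt = Adam(learning_rate=LR)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
# Training the custom head along with the unfrozen layers of the base network
print("[INFO] Fine tuning the model along with some layers of base network...")
# validation_data is passed as arrays, so validation_steps is not needed
history = model.fit(aug.flow(trainX, trainY, batch_size=bs), steps_per_epoch=len(trainX) // bs,
 validation_data=(testX, testY), epochs=epoc)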

With the new training we can see that the accuracy has jumped to 98% from the initial 80%. Let us predict on test set and then print the classification report.

For generating the classification report let us convert the label names into a string as shown below

# Converting the target names as string for classification report
target_names = list(map(str,lb.classes_))

Let us now print the classification report and see how well our model is performing on the test set

# make predictions on the test set
print("[INFO] Generating inference...")
predictions = model.predict(testX, batch_size=bs)
# For each prediction we need to find the index with maximum probability 
predIdxs = np.argmax(predictions, axis=1)
# Print the classification report
print(classification_report(testY.argmax(axis=1), predIdxs,target_names=target_names))

The predictions are in the form of probabilities for each class. We extract the id of the class which has the maximum probability using the np.argmax method and then generate the classification report. We can see that we have a near perfect classification report.
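
To get a feel for the numbers in a classification report, here is a small NumPy sketch that computes precision, recall and F1 score for the pothole class by hand. The probabilities and true labels are made up for illustration, and the helper name binary_report is hypothetical. In practice the classification_report function from scikit-learn does this for every class.

```python
import numpy as np


def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the class given by positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up class probabilities for five test images: [no pothole, pothole]
probabilities = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]])
y_true = np.array([0, 1, 0, 1, 1])

# The predicted class is the index with the highest probability
y_pred = probabilities.argmax(axis=1)
report = binary_report(y_true, y_pred)
print(y_pred, report)
```

In this toy example two of the three predicted potholes are real, and two of the three real potholes are found, so precision, recall and F1 score all come to two thirds.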

Let us also visualise our training accuracy and loss and then save the figure.

# plot the training loss and accuracy
N = epoc
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, N), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), history.history["accuracy"], label="train_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend(loc="lower left")
plt.savefig("plot.png")
plt.show()

Let us finally save our model and the label binarizer so that we can use it later in our inference process

MODEL_PATH = "output/pothole_detector_RCNN.h5"
ENCODER_PATH = "output/label_encoder_RCNN.pickle"
# serialize the model to disk
print("[INFO] saving pothole detector model...")
model.save(MODEL_PATH, save_format="h5")
# serialize the label encoder to disk
print("[INFO] saving label encoder...")
# The with block closes the file even if writing fails
with open(ENCODER_PATH, "wb") as f:
    pickle.dump(lb, f)

We have completed the training cycle and have saved the model. Let us now implement the inference cycle.

Inference run for pothole detection

In the inference cycle, we will use the model we just built to localise and predict potholes in test images. Let us first load the model and the label encoder which we saved.

MODEL_PATH = "output/pothole_detector_RCNN.h5"
ENCODER_PATH = "output/label_encoder_RCNN.pickle"
print("[INFO] loading model and label binarizer...")
model = load_model(MODEL_PATH)
with open(ENCODER_PATH, "rb") as f:
    lb = pickle.load(f)

We have downloaded some test files. Let's list them here.

# Please change the path where your files are placed
testpath = 'data/test'
testFiles = glob.glob(testpath + '/*.jpeg')
testFiles

Let's plot one of the images

# load the input image from disk
image = cv2.imread(testFiles[2])
# Resize the image
image = imutils.resize(image, width=500)
# OpenCV loads images in BGR order, so convert to RGB for plotting
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB), aspect='equal')
plt.show()

We will use OpenCV to generate the bounding box proposals for the image. Detailed below are the specific steps for the selective search implementation using OpenCV. Note that cv2.ximgproc is part of the OpenCV contrib modules, so the opencv-contrib-python package needs to be installed. The set of proposals will be contained in the variable rects.

# Implementing selective search to generate bounding box proposals
print("[INFO] running selective search and generating bounding boxes...")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
rects = ss.process()

Let us look at how many proposals the selective search algorithm has generated

len(rects)

For this specific image, the selective search algorithm has generated 920 proposals. These are regions where there is a high probability of finding an object. As you might have noticed, this algorithm is pretty slow in identifying all the bounding boxes.

Next let us extract the region of interest from the image using the bounding boxes we obtained from the selective search algorithm. Let us explore the code

# Initialise lists to store the region of interest from the image and its bounding boxes
proposals = []
boxes = []
max_proposals = 100
# Iterate over the bounding box coordinates to extract region of interest from image
for (x, y, w, h) in rects[:max_proposals]:
    # Crop region of interest from the image
    roi = image[y:y + h, x:x + w]
    # Convert to RGB format as CV2 has output in BGR format
    roi = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
    # Resize image to our standard size
    roi = cv2.resize(roi, (224, 224), interpolation=cv2.INTER_CUBIC)
    # Preprocess the image
    roi = img_to_array(roi)
    roi = preprocess_input(roi)
    # Update the proposal and bounding boxes
    proposals.append(roi)
    boxes.append((x, y, x + w, y + h))

We first initialise two lists for storing the roi's and their bounding box co-ordinates. We also define the maximum number of proposals we want to process. This step is to improve the speed of computation by eliminating processing of too many proposals. This is a parameter you can vary and I would encourage you to try out different values for this parameter.

Next we iterate through the bounding boxes we want, to extract each region of interest and its bounding box. The steps are to crop the image, convert the crop to RGB format, resize it to the desired size and finally normalise the pixel values. The roi and bounding box are then appended to the lists we created earlier.
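
Selective search returns each proposal as (x, y, width, height), while the rest of the pipeline works with corner co-ordinates. The sketch below shows the same conversion in a vectorised form on two made-up proposals. The min_side filter is an extra assumption added for illustration and is not part of the code above; it simply drops proposals that are too small to be useful.

```python
import numpy as np

# Two made-up proposals in (x, y, w, h) format
toy_rects = np.array([[10, 20, 30, 40], [0, 0, 5, 5]])
max_proposals = 100
min_side = 10  # hypothetical minimum width and height in pixels

# Cap the number of proposals, then drop the very small ones
kept = toy_rects[:max_proposals]
kept = kept[(kept[:, 2] >= min_side) & (kept[:, 3] >= min_side)]

# Convert (x, y, w, h) to (x1, y1, x2, y2)
boxes_xyxy = np.column_stack([kept[:, 0], kept[:, 1],
                              kept[:, 0] + kept[:, 2], kept[:, 1] + kept[:, 3]])
print(boxes_xyxy)
```

Only the first proposal survives the size filter, and it is converted to the corner format (10, 20, 40, 60).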

It's now time to classify the proposal regions using the model we fine tuned. Before classification we have to convert the lists to NumPy arrays. Let us implement these processes.

# Convert proposals and bounding boxes to NumPy arrays
proposals = np.array(proposals, dtype="float32")
boxes = np.array(boxes, dtype="int32")
print("[INFO] proposal shape: {}".format(proposals.shape))
# Classify the proposals based on the fine tuned model
print("[INFO] classifying proposals...")
proba = model.predict(proposals)

Next we will extract those roi’s which are classified as ‘potholes’ from the overall predictions.

# Find the predicted labels 
labels = lb.classes_[np.argmax(proba, axis=1)]
# Get the ids where the predictions are 'Potholes'
idxs = np.where(labels == 1)[0]
idxs

The model prediction gives us the probability of each class. We find the predicted labels by taking the argmax of the predicted class probabilities. Once we have the labels, we extract the indexes of the pothole class, which in our case is 1.
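
The indexing used here can look a little dense, so here is the same pattern on four made-up predictions. The variable names are only for this illustration and do not overwrite anything in the pipeline.

```python
import numpy as np

# Made-up probabilities for four proposals: [no pothole, pothole]
toy_proba = np.array([[0.10, 0.90],
                      [0.70, 0.30],
                      [0.02, 0.98],
                      [0.55, 0.45]])
classes = np.array([0, 1])  # plays the role of lb.classes_

# Predicted label of every proposal
toy_labels = classes[np.argmax(toy_proba, axis=1)]
# Positions of the proposals predicted as pothole
pothole_idxs = np.where(toy_labels == 1)[0]
# Pothole probability of only those proposals
pothole_scores = toy_proba[pothole_idxs][:, 1]
print(toy_labels, pothole_idxs, pothole_scores)
```

Only the first and third proposals are kept, along with their pothole probabilities of 0.90 and 0.98.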

Next using the indexes we will extract the bounding boxes and probability of the ‘pothole’ class

# Using the indexes, extract the bounding boxes and prediction probabilities of 'pothole' class
boxes = boxes[idxs]
proba = proba[idxs][:, 1]

Next we will apply another filter and take only those bounding boxes which have a probability greater than a threshold value.

print(len(boxes))
# Filter the bounding boxes using a prediction probability threshold
pred_threshold = 0.995
# Select only those ids where the probability is greater than the threshold
idxs = np.where(proba >= pred_threshold)
boxes = boxes[idxs]
proba = proba[idxs]
print(len(boxes))

The threshold has been fixed in this case by experimenting with different values. This is another hyperparameter which needs to be arrived at by observing the predictions you obtain for your specific set of images. We can see that before filtering we had 97 bounding boxes, which got reduced to 22 after the filtering. These filtered bounding boxes will be used to localise potholes on the image. Let us visualise the filtered bounding boxes on the image.

# Clone the original image for visualisation and inserting text
clone = image.copy()
# Iterate through the bounding boxes and associated probabilities
for (box, prob) in zip(boxes, proba):
    # Draw the bounding box, label, and probability on the image
    (startX, startY, endX, endY) = box
    cv2.rectangle(clone, (startX, startY), (endX, endY),(0, 255, 0), 2)
    # Initialising the cordinate for writing the text
    y = startY - 10 if startY - 10 > 10 else startY + 10
    # Getting the text to be attached on top of the box
    text= "Pothole: {:.2f}%".format(prob * 100)
    # Visualise the text on the image
    cv2.putText(clone, text, (startX, y),cv2.FONT_HERSHEY_SIMPLEX, 0.25, (0, 255, 0), 1)
# Visualise the bounding boxes on the image (converted from BGR to RGB for plotting)
plt.imshow(cv2.cvtColor(clone, cv2.COLOR_BGR2RGB), aspect='equal')
plt.show() 

We clone the image and then iterate through the boxes. For each box we grab the co-ordinates and first draw the rectangle over the image with those co-ordinates. We then write the class name and the probability of the class on top of the bounding box. Finally we visualise the image with the bounding boxes and the text.

As we can see, we have bounding boxes over the potholes and over the regions around them. However, there are multiple overlapping boxes which ultimately need to be reduced. So our next task is to apply non maxima suppression to reduce the number of bounding boxes.

Non Maxima Suppression

We will use the same method we used in the previous post for the non maxima suppression. Let us get the function for non maxima suppression. For an explanation of this function please refer to the previous post.

def maxOverlap(boxes):
    '''
    boxes : This is the cordinates of the boxes which have the object
    returns : A list of boxes which do not have much overlap
    '''
    # Convert the bounding boxes into an array
    boxes = np.array(boxes)
    # Initialise a box to pick the ids of the selected boxes and include the largest box
    selected = []
    # Continue the loop till the number of ids remaining in the box is greater than 1
    while len(boxes) > 1:
        # First calculate the area of the bounding boxes 
        x1 = boxes[:, 0]
        y1 = boxes[:, 1]
        x2 = boxes[:, 2]
        y2 = boxes[:, 3]
        area = (x2 - x1) * (y2 - y1)
        # Sort the bounding boxes based on its area    
        ids = np.argsort(area)
        #print('ids',ids)
        # Take the coordinates of the box with the largest area
        lx1 = boxes[ids[-1], 0]
        ly1 = boxes[ids[-1], 1]
        lx2 = boxes[ids[-1], 2]
        ly2 = boxes[ids[-1], 3]
        # Include the largest box into the selected list
        selected.append(boxes[ids[-1]].tolist())
        # Initialise a list for getting those ids that needs to be removed.
        remove = []
        remove.append(ids[-1])
        # We loop through each of the other boxes and find the overlap of the boxes with the largest box
        for id in ids[:-1]:
            #print('id',id)
            # The maximum of the starting x cordinate is where the overlap along width starts
            ox1 = np.maximum(lx1, boxes[id,0])
            # The maximum of the starting y cordinate is where the overlap along height starts
            oy1 = np.maximum(ly1, boxes[id,1])
            # The minimum of the ending x cordinate is where the overlap along width ends
            ox2 = np.minimum(lx2, boxes[id,2])
            # The minimum of the ending y cordinate is where the overlap along height ends
            oy2 = np.minimum(ly2, boxes[id,3])
            # Find area of the overlapping coordinates, treating boxes that do not overlap as zero
            oa = max(0, ox2 - ox1) * max(0, oy2 - oy1)
            # Find the ratio of overlapping area of the smaller box with respect to its original area
            olRatio = oa/area[id]            
            # If the overlap is greater than threshold include the id in the remove list
            if olRatio > 0.40:
                remove.append(id)                
        # Remove those ids from the original boxes
        boxes = np.delete(boxes, remove,axis = 0)
        # Break the while loop if nothing to remove
        if len(remove) == 0:
            break
    # Append the remaining boxes to the selected
    for i in range(len(boxes)):
        selected.append(boxes[i].tolist())
    return np.array(selected)

Let us now apply the non maxima suppression function and eliminate the overlapping boxes.

# Applying non maxima suppression
selected = maxOverlap(boxes)
len(selected)

We can see that by applying non maxima suppression we have reduced the number of boxes from 22 to around 3. Let us now visualise the images with the selected list of bounding boxes after non maxima suppression.

clone = image.copy()
for (startX, startY, endX, endY) in selected:
    # Cast to int because the selected boxes come back as a NumPy array
    cv2.rectangle(clone, (int(startX), int(startY)), (int(endX), int(endY)), (0, 255, 0), 2)

plt.imshow(clone,aspect='equal')
plt.show()

We can see that the number of bounding boxes has reduced considerably and that the remaining boxes localise well to the two potholes.

With this we have come to the end of object detection using RCNN. Let us quickly recap what we have achieved in this post.

  1. We preprocessed the positive and negative classes of images and then built our train and test sets
  2. Fine tuned the MobileNet model to cater to our use case and made it our classifier.
  3. Built the inference pipeline using the fine tuned classifier
  4. Applied non maxima suppression to get the bounding boxes over the potholes.

We have come a long way and are now adept at implementing an advanced model like RCNN. However, there are still variations of this model which we could try. One of them is to implement an RCNN for multiple classes, so that, let's say, we predict potholes and also road signs with the same network. Implementing a multiclass RCNN would follow the same process with small changes to the model architecture and training. We will build a multiclass RCNN framework in a future post.

What Next ?

Having seen an advanced method like RCNN, we will move to another advanced method in the next post, which is YOLO. YOLO is a faster method than RCNN and will enable us to run the pothole detection process on video files. We will cover pothole detection using YOLO in the next post and then use it to detect potholes in videos in the subsequent post. Watch this space for more.

To be notified of the next post, please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

You can also access the code base for this series from the following GitHub link

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application, or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Build your Computer Vision Application – Part III: Pothole detector from scratch using legacy methods (Image Pyramids and sliding window)

This is the third post of the series where we build a road sign and pothole detection application. We will be using multiple methods throughout this series, which include computer vision techniques using OpenCV, annotating images using labelImg, mastering the Tensorflow object detection API, training object detection using transfer learning, object detection on video etc. This series will be split across 8 posts.

1. Introduction to object detection

2. Data set preparation and annotation using labelImg

3. Building your object detection model from scratch using Image pyramids and Sliding window ( This post )

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In this post we build a custom object detector from scratch, progressively using different methods like image pyramids, sliding windows and non maxima suppression. These are legacy methods which lay the foundation for many of the modern object detection methods. Let us look at the processes which will be covered in building an object detector from scratch.

  1. Prepare the train and test sets from the annotated images ( Covered in the last post)
  2. Build a classifier for detecting potholes
  3. Build the inference pipeline using image pyramids and sliding window techniques to predict bounding boxes for potholes
  4. Optimise the bounding boxes using Non Maxima suppression.

We will be covering all the topics from step 2 in this post. These posts are heavily inspired by the following posts.

Let us dive in.

Training a classifier on the data

In the last post we prepared our training data from positive and negative examples and then saved the data in HDF5 format using h5py. In this post we will use that data to build our pothole classifier. The classifier we will be building is a binary classifier which has a positive class and a negative class. We will be training this classifier using an SVM model. The choice of the SVM model is based on some earlier work done in this space; however, I would urge you to experiment with other classification models as well.

We will start off from where we stopped in the last section. We will read the database from disk and extract the labels and data

# Read the data base from disk
db = h5py.File(outputPath, "r")
# Extract the labels and data
(labels, data) = (db["pothole_features_all"][:, 0], db["pothole_features_all"][:, 1:])
# Close the data base
db.close()

print(labels.shape)
print(data.shape)

We will now use the data and labels to build the classifier

from sklearn.svm import SVC
# Build the SVM model
model = SVC(kernel="linear", C=0.01, probability=True, random_state=123)
model.fit(data, labels)

Once the model is fit we will save the model as a pickle file in the output folder.

import pickle
# Save the model in the output folder
modelPath = 'data/models/model.cpickle'
# The with block closes the file once the model is written
with open(modelPath, "wb") as f:
    pickle.dump(model, f)

Please remember to create the 'models' folder in your local drive in the 'data' folder before saving the model. Once the model is saved you will be able to see the model pickle file within the path you specified.
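The folder can also be created from code, so that the save step does not fail on a fresh machine. Below is a minimal sketch of that idea. The helper name saveModel, the temporary folder and the stand-in dictionary are assumptions made for illustration; in the post the object being saved is the fitted SVM model.

```python
import os
import pickle
import tempfile

def saveModel(model, modelPath):
    # Create the parent folder (for example 'data/models') if it is missing
    os.makedirs(os.path.dirname(modelPath), exist_ok=True)
    # The with block closes the file even if pickling fails
    with open(modelPath, "wb") as f:
        pickle.dump(model, f)
    return modelPath

# Stand-in object; in the post this would be the fitted SVC model
dummyModel = {"kernel": "linear", "C": 0.01}
# A temporary folder is used so that the sketch does not touch your own data folder
demoPath = os.path.join(tempfile.mkdtemp(), "data", "models", "model.cpickle")
saveModel(dummyModel, demoPath)

# Read the file back to confirm that the round trip works
with open(demoPath, "rb") as f:
    restored = pickle.load(f)
print(restored == dummyModel)
```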

Now that we have built the classifier, we will use it for object detection in the next section. We will be covering two concepts which are important for object detection, image pyramids and sliding windows. Let us get familiar with those concepts first.

Image Pyramids and Sliding window techniques

Let us try to understand the concept of image pyramids with an example. Let us assume that we have a window of fixed size and potholes are detected only if they fit perfectly inside the window. Let us look at how well the potholes are detected when using a fixed size window. Take the case of layer1 of the image below. We can see that the fixed size window was able to detect one of the potholes further down the road, as it fit well within the window size. However, the bigger pothole at the near end of the image is not detected, because the window was smaller than the size of the pothole.

As a way to solve this, let us progressively reduce the size of the image, and try to fit the potholes to the fixed window size, as shown in the figure below. With the reduction in size of the image, the object we want to detect also reduces in size. Since our detection window remains the same, we are able to detect more potholes including the biggest one, when the image sizes are reduced. Thereby we will be able to detect most of the potholes which otherwise would not have been possible with a fixed size window and a constant size image. This is the concept behind image pyramids.

The name image pyramids signifies the fact that, if the scaled images are stacked vertically, then it will fit inside a pyramid as shown in the below figure.

The implementation of image pyramids can be done easily using scikit-image. There are different types of image pyramids. Some of the prominent ones are Gaussian pyramids and Laplacian pyramids. You can read about these pyramids in the link given here. Let us quickly look at the implementation of pyramids.

from skimage.transform import pyramid_gaussian
for imgPath in allFiles[-2:-1]:
    # Read the image
    image = cv2.imread(imgPath)
    # loop over the layers of the image pyramid and display them
    for (i, layer) in enumerate(pyramid_gaussian(image, downscale=1.2)):
        # Break the loop if the image size is less than our window size
        if layer.shape[1] < 80 or layer.shape[0] < 40:
            break
        print(layer.shape)

From the output we can see how the images are scaled down progressively.
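To get a feel for how many layers a pyramid will have before the image becomes smaller than the window, the layer sizes can be worked out with simple arithmetic. The sketch below is only an approximation of what pyramid_gaussian does. The function name pyramidShapes, the 300 x 400 image size and the rounding up of each dimension are assumptions, so the shapes printed by the library may differ by a pixel.

```python
import math

def pyramidShapes(height, width, downscale=1.2, minHeight=40, minWidth=80):
    # Collect the (height, width) of each layer until it is smaller than the window
    shapes = []
    while height >= minHeight and width >= minWidth:
        shapes.append((height, width))
        # Each layer is smaller than the previous one by the downscale factor
        height = math.ceil(height / downscale)
        width = math.ceil(width / downscale)
    return shapes

# Hypothetical 300 x 400 image, not one of the images from the data set
for shape in pyramidShapes(300, 400):
    print(shape)
```

A smaller downscale factor gives more layers and a finer search over scales, at the cost of more windows to classify.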

Having seen image pyramids, it is time to discuss sliding windows. Sliding windows are an effective method to identify objects in an image at various scales and locations. As the name suggests, this method involves a window of standard length and width which slides across an image to extract features. These features will be used in a classifier to identify the object of interest. Let us look at the code block below to understand the dynamics of the sliding window method.

# Read the image
image = cv2.imread(allFiles[-2])
# Define the window size
windowSize = [80,40]
# Define the step size
stepSize = 40
# slide a window across the image
for y in range(0, image.shape[0], stepSize):
    for x in range(0, image.shape[1], stepSize):
        # Clone the image
        clone = image.copy()
        # Draw a rectangle on the image 
        cv2.rectangle(clone, (x, y), (x + windowSize[0], y + windowSize[1]), (0, 255, 0), 2)
        # Show the image with the current window drawn on it
        plt.imshow(clone, aspect='equal')
        plt.show()

To implement the sliding window we need to understand some of the parameters which are used. The first is the window size, which is the dimension of the fixed window we would be sliding across the image. We earlier calculated the size of this window to be [80,40], which was the average size of a pothole in our distribution. The second parameter is the step size. The step size is the number of pixels by which we move the fixed window across the image at each step. The smaller the step size, the more windows we have to evaluate, and vice versa. We don't want to slide through every pixel and we definitely don't want to skip important features, and therefore the step size is a necessary parameter. An ideal step size would depend on the image size. For our case let us experiment with the ‘y’ dimension of our fixed window, which is 40. I would encourage you to experiment with different step sizes and observe the results before finalising the step size.

To implement this method, we first iterate through the vertical distance starting from 0 to the height of the image with increments of the step size. We have an inner loop which iterates through the horizontal direction from 0 to the width of the image with increments of the step size. For each of these iterations we capture the x and y coordinates and then extract a rectangle with the same shape as the fixed window. In the above implementation we are only drawing a rectangle on the image to understand the dynamics. However, when we implement this along with image pyramids, we will crop a patch with the dimensions of the window as we slide across the image. Let us see some of the sample outputs of the sliding window.

From the above output we can see how the fixed window slides across the image both horizontally and vertically with a step size, to extract features from patches of the same size as the fixed window.
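The effect of the step size on the amount of work can be checked with a small counting exercise. The sketch below counts how many window positions are visited for a given step size and how many of them fit fully inside the image. The function name countWindows and the 300 x 400 image size are assumptions made for illustration.

```python
def countWindows(imageHeight, imageWidth, stepSize, windowSize=(80, 40)):
    # windowSize is (width, height), matching the convention used in this post
    total = 0
    full = 0
    for y in range(0, imageHeight, stepSize):
        for x in range(0, imageWidth, stepSize):
            total += 1
            # A window is full sized only if it does not run past the image border
            if x + windowSize[0] <= imageWidth and y + windowSize[1] <= imageHeight:
                full += 1
    return total, full

# Hypothetical 300 x 400 image (height x width)
for step in (10, 20, 40):
    print(step, countWindows(300, 400, step))
```

Halving the step size roughly quadruples the number of windows, which is why the step size is a trade off between speed and the risk of stepping over an object.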

So far we have seen the pyramid and the sliding window implementations independently. These two methods have to be integrated to use them as an object detector. However, for integrating them we need to convert the sliding window method into a function. Let us look at the function to implement sliding windows.

# Function to implement sliding window
def slidingWindow(image, stepSize, windowSize):    
    # slide a window across the image
    for y in range(0, image.shape[0], stepSize):
        for x in range(0, image.shape[1], stepSize):
            # yield the current window
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])

The function is not very different from what we implemented earlier. The only difference is that as the output we yield a tuple of the x,y coordinates and the crop of the image of the same size as the window size. Next we will see how we integrate this function with the image pyramids to implement our custom object detector.
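Before integrating the function, it helps to run it once on a dummy image. The sketch below repeats the generator so that it runs on its own and applies it to a synthetic 100 x 200 array, which is an assumption for illustration. It shows that the crops near the right and bottom borders come out smaller than the window, which is why a size check on each crop is needed when the detector is built.

```python
import numpy as np

def slidingWindow(image, stepSize, windowSize):
    # Same generator as above, repeated so that the sketch runs on its own
    for y in range(0, image.shape[0], stepSize):
        for x in range(0, image.shape[1], stepSize):
            yield (x, y, image[y:y + windowSize[1], x:x + windowSize[0]])

# Synthetic 100 x 200 grayscale image, used only for illustration
dummy = np.zeros((100, 200), dtype=np.uint8)
windows = list(slidingWindow(dummy, stepSize=40, windowSize=(80, 40)))

# Keep only the crops whose shape is exactly (height, width) = (40, 80)
fullSized = [(x, y) for (x, y, crop) in windows if crop.shape == (40, 80)]
print(len(windows), len(fullSized))
```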

Building the object detector

It is now time to bring together everything we have defined to create our object detector. As a first step let us load the model which we saved during the training phase

# Listing the path where we stored the model
modelPath = 'data/models/model.cpickle'
# Loading the model we trained earlier
model = pickle.loads(open(modelPath, "rb").read())
model

Now let us look at the complete code to implement our object detector

# Initialise lists to store the bounding boxes and probabilities
boxes = []
probs = []
# Define the HOG parameters
orientations=12
pixelsPerCell=(4, 4)
cellsPerBlock=(2, 2)
# Define the fixed window size
windowSize=(80,40)
# Pick a random image from the image path to check our prediction
imgPath = sample(allFiles,1)[0]
# Read the image
image = cv2.imread(imgPath)
# Converting the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# loop over the image pyramid
for (i, layer) in enumerate(pyramid_gaussian(image, downscale=1.2)):
    # Identify the current scale of the image    
    scale = gray.shape[0] / float(layer.shape[0])
    # loop over the sliding window for each layer of the pyramid
    for (x, y, window) in slidingWindow(layer, stepSize=40, windowSize=(80,40)):
        # if the current window does not meet our desired window size, ignore it
        if window.shape[0] != windowSize[1] or window.shape[1] != windowSize[0]:
            continue
        # Let us now extract the hog features of this window within the image
        feat = hogFeatures(window,orientations,pixelsPerCell,cellsPerBlock,normalize=True).reshape(1,-1)
        # Get the prediction probabilities for the positive class (potholes)
        prob = model.predict_proba(feat)[0][1] 
        
        # Check if the probability is greater than a threshold probability
        if prob > 0.95:            
            # Extract (x, y)-coordinates of the bounding box using the current scale 
            # Starting coordinates
            (startX, startY) = (int(scale * x), int(scale * y))
            # Ending coordinates
            endX = int(startX + (scale * windowSize[0]))
            endY = int(startY + (scale * windowSize[1]))
            # update the list of bounding boxes and probabilities
            boxes.append((startX, startY, endX, endY))
            probs.append(prob)
            
# loop over the bounding boxes and draw them
for (startX, startY, endX, endY) in boxes:
    cv2.rectangle(image, (startX, startY), (endX, endY), (0, 0, 255), 2)       

plt.imshow(image,aspect='equal')
plt.show() 

To start off we initialise two lists in lines 2-3 where we will store the bounding box coordinates and also the probabilities which indicate the confidence about detecting potholes in the image.

We also define some important parameters which are required for HOG feature extraction method in lines 5-7

  1. orientations
  2. pixels per Cell
  3. Cells per block

We also define the size of our fixed window in line 9

To test our process, we randomly sample an image from the list of images we have and then convert the image into gray scale in lines 11-15.

We then start the iterative loop to implement the image pyramids in line 17. For each iteration the input image is scaled down as per the scaling factor we defined. Next we calculate the running scale of the image in line 19. The scale is always the height of the original image divided by the height of the scaled down image. We need the scale to blow up the x,y coordinates to the original size of the image later on.

Next we start the sliding window implementation in line 21. We provide the scaled down version of the image as the input along with the step size and the window size. The step size is the parameter which indicates by how much the window has to slide across the image. The window size indicates the size of the sliding window. We saw the mechanics of these when we looked at the sliding window function.

In lines 23-24 we ensure that we only take windows which meet our window size specification. For any window which passes this check, HOG features are extracted in line 26. On the extracted HOG features, we do a prediction in line 28. The prediction gives the probability of the window being a pothole or not. We extract only the probability of the positive class. We then take only those windows where the probability is greater than the threshold we have defined in line 31. We give a high threshold because the distributions of the positive and negative images are very similar. So to ensure that we get only the potholes, we have given a higher threshold. The threshold has been arrived at after a fair bit of experimentation. I would encourage you to try out different thresholds before finalising the threshold you want.

Once we get the predictions, we take those x and y coordinates and then blow them up to the original size using the scale we calculated earlier, in lines 34-37. We find the starting coordinates and the ending coordinates and then append those coordinates to the lists we defined, in lines 39-40.

In lines 43-47, we loop through each of the coordinates and draw the bounding boxes on the image.
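The step of blowing the coordinates back up to the original image is easy to verify with a worked example. The sketch below mirrors the arithmetic used in the detector. The function name toOriginalCoords and the numbers (an original height of 300 and a layer height of 150) are assumptions chosen to keep the arithmetic simple.

```python
def toOriginalCoords(x, y, scale, windowSize=(80, 40)):
    # Scale the top left corner of the window back to the original image
    startX, startY = int(scale * x), int(scale * y)
    # The window itself also grows by the same factor
    endX = int(startX + scale * windowSize[0])
    endY = int(startY + scale * windowSize[1])
    return (startX, startY, endX, endY)

# Hypothetical numbers: original height 300, current layer height 150
scale = 300 / float(150)
print(toOriginalCoords(40, 80, scale))
```

With a scale of 2, a window found at (40, 80) in the smaller layer maps to a box that starts at (80, 160) and is 160 x 80 pixels in the original image. This is how a fixed 80 x 40 window ends up detecting potholes which are larger than the window.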

Let us look at the output we have got. We can see that there are multiple bounding boxes created around the areas where there are potholes. We can be happy that the object detector is doing its job by localising around a pothole in most of the cases. However, there are examples where the detector has detected objects other than potholes. We will come to that issue later. Let us first address another important issue.

All the images have multiple overlapping bounding boxes. Having a lot of bounding boxes can sometimes be cumbersome, say if we want to calculate the area where the pothole is present. We need to find a way to reduce the number of overlapping bounding boxes. This is where we use a technique called Non Maxima suppression. The objective of Non maxima suppression is to combine bounding boxes with significant overlap and get a single bounding box. The method which we would be implementing is inspired by this post

Non Maxima Suppression

We will implement a customised version of non maxima suppression through a function.

def maxOverlap(boxes):
    '''
    boxes : These are the coordinates of the boxes which have the object
    returns : A list of boxes which do not have much overlap
    '''
    # Convert the bounding boxes into an array
    boxes = np.array(boxes)
    # Initialise a box to pick the ids of the selected boxes and include the largest box
    selected = []
    # Continue the loop till the number of ids remaining in the box is greater than 1
    while len(boxes) > 1:
        # First calculate the area of the bounding boxes 
        x1 = boxes[:, 0]
        y1 = boxes[:, 1]
        x2 = boxes[:, 2]
        y2 = boxes[:, 3]
        area = (x2 - x1) * (y2 - y1)
        # Sort the bounding boxes based on its area    
        ids = np.argsort(area)
        #print('ids',ids)
        # Take the coordinates of the box with the largest area
        lx1 = boxes[ids[-1], 0]
        ly1 = boxes[ids[-1], 1]
        lx2 = boxes[ids[-1], 2]
        ly2 = boxes[ids[-1], 3]
        # Include the largest box into the selected list
        selected.append(boxes[ids[-1]].tolist())
        # Initialise a list for getting those ids that needs to be removed.
        remove = []
        remove.append(ids[-1])
        # We loop through each of the other boxes and find the overlap of the boxes with the largest box
        for id in ids[:-1]:
            #print('id',id)
            # The maximum of the starting x coordinate is where the overlap along width starts
            ox1 = np.maximum(lx1, boxes[id,0])
            # The maximum of the starting y coordinate is where the overlap along height starts
            oy1 = np.maximum(ly1, boxes[id,1])
            # The minimum of the ending x coordinate is where the overlap along width ends
            ox2 = np.minimum(lx2, boxes[id,2])
            # The minimum of the ending y coordinate is where the overlap along height ends
            oy2 = np.minimum(ly2, boxes[id,3])
            # Find area of the overlap, clamping at zero so that boxes which do not intersect get no overlap
            oa = np.maximum(0, ox2 - ox1) * np.maximum(0, oy2 - oy1)
            # Find the ratio of overlapping area of the smaller box with respect to its original area
            olRatio = oa/area[id]            
            # If the overlap is greater than threshold include the id in the remove list
            if olRatio > 0.50:
                remove.append(id)                
        # Remove those ids from the original boxes
        boxes = np.delete(boxes, remove,axis = 0)
        # Break the while loop if nothing to remove
        if len(remove) == 0:
            break
    # Append the remaining boxes to the selected
    for i in range(len(boxes)):
        selected.append(boxes[i].tolist())
    return np.array(selected)

The input to the function is the list of bounding boxes we got after our prediction. Let me give a big picture of what this implementation is all about. In this implementation we start with the box with the largest area and progressively eliminate boxes which have considerable overlap with the largest box. We then take the remaining boxes after elimination and repeat the process of elimination till we get to the minimum number of boxes. Let us now see this implementation in the code above.

In line 7, we convert the bounding boxes into a NumPy array and then initialise a list to store the bounding boxes we want to return in line 9.

Next in line 11, we start the continuous loop for elimination of the boxes, which runs till fewer than 2 boxes remain.

In lines 13-17, we calculate the area of all the bounding boxes and then sort them in ascending order in line 19.

We then take the coordinates of the box with the largest area in lines 22-25 and then append the largest box to the selection list in line 27. We initialise a new list for the boxes which need to be removed and then include the largest box in the removal list in line 30.

We then start another iterative loop to find the overlap of the other bounding boxes with the largest box in line 32. In lines 35-43, we find the coordinates of the overlapping portion of each of the other boxes with the largest box and then take the area of the overlapping portion. In line 45 we find the ratio of the overlapping area to the original area of the bounding box which we iterate through. If the ratio is larger than a threshold value, we include that box in the removal list in lines 47-48, as it has good overlap with the largest box. After iterating through all the boxes in the list, we will get a list of boxes which have good overlap with the largest box. We then remove all those overlapping boxes and the current largest box from the original list of boxes in line 50. We continue this process till there are no more boxes to be removed. Finally we add the last remaining box to the selected list and then return the selection.
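The overlap ratio is the heart of this method, so it is worth trying it on a few hand made boxes. The sketch below computes the ratio for a single pair of boxes. The function name overlapRatio and the box coordinates are assumptions for illustration. The sketch clamps negative widths and heights to zero, so that boxes which do not intersect get a ratio of zero.

```python
def overlapRatio(largeBox, smallBox):
    # Boxes are (startX, startY, endX, endY)
    ox1 = max(largeBox[0], smallBox[0])
    oy1 = max(largeBox[1], smallBox[1])
    ox2 = min(largeBox[2], smallBox[2])
    oy2 = min(largeBox[3], smallBox[3])
    # Clamp to zero so that boxes which do not intersect get no overlap
    overlapArea = max(0, ox2 - ox1) * max(0, oy2 - oy1)
    smallArea = (smallBox[2] - smallBox[0]) * (smallBox[3] - smallBox[1])
    return overlapArea / smallArea

largest = (0, 0, 100, 100)
inside = (10, 10, 60, 60)     # fully inside the largest box
partial = (80, 80, 120, 120)  # only a corner overlaps
apart = (200, 200, 240, 240)  # no overlap at all

for box in (inside, partial, apart):
    print(box, overlapRatio(largest, box))
```

With a threshold of 0.50, only the first box would be removed. Note that this ratio is measured against the area of the smaller box, which is different from the intersection over union measure used in many other non maxima suppression implementations.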

Let us implement this function and observe the result

# Get the selected list
selected = maxOverlap(boxes)

Now let us look at different examples after non maxima suppression.

# Get the image again
image = cv2.imread(imgPath)
# Make a copy of the image
clone = image.copy()
for (startX, startY, endX, endY) in selected:
    cv2.rectangle(clone, (startX, startY), (endX, endY), (0, 255, 0), 2)       

plt.imshow(clone,aspect='equal')
plt.show()

Figure : Non maxima suppression

We can see that the bounding boxes are considerably reduced using our non maxima suppression implementation.

Improvement Opportunities

Even though we have got reasonable detection effectiveness, is the model we built perfect? Absolutely not. Let us look at some of the major pitfalls

Misclassifications of objects :

From the outputs, we can see that we have misclassified some of the objects.

Most of the misclassifications we have seen are for vegetation. There are also cases where road signs are misclassified as potholes.

A major reason for the misclassifications is that our training data is limited. We used only 19 positive images and 20 negative examples, which is a very small data set for a task like this. Considering that the data set is limited, the classifier has done a decent job. Also, for the negative images we need to include some more variety, like road signs, vehicles, vegetation etc. labelled as negative images. More positive images, and more negative images with a little more variety of objects that are likely to be found on roads, will improve the classification accuracy of the classifier.

Another strategy is to experiment with different types of classifiers. In our example we used an SVM classifier. It would be worthwhile to try other binary classifiers such as Logistic regression, Naive Bayes, Random forest, XGBoost etc. I would encourage you to try out different classifiers and then verify the results.
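One way to run such an experiment is to score several classifiers with cross validation on the same features. The sketch below is only a template. It uses synthetic data from make_classification as a stand-in for the HOG features and pothole labels, and the choice of candidates and their parameters is an assumption, so the scores it prints say nothing about the pothole data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the HOG feature matrix and the pothole labels
X, y = make_classification(n_samples=200, n_features=20, random_state=123)

candidates = {
    "svm": SVC(kernel="linear", C=0.01, random_state=123),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=123),
}

scores = {}
for name, clf in candidates.items():
    # Mean accuracy over 5 cross validation folds
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```

To use this with the real data, replace X and y with the data and labels arrays read from the database earlier in this post.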

Non detection of positive classes

Along with misclassifications, we have also seen non detection of positive classes.

As seen from the examples, we can see that there has been non detection in cases of potholes with water in it. In addition some of the potholes which are further along the road are not detected.

These problems again can be corrected by including more variety in the positive images, such as potholes with water in them. It will also help to include images with potholes further away along the road. The other solution is to preprocess images with different techniques like smoothing and blurring, thresholding, gradient and edge detection, contours, histograms etc. These methods will help in highlighting the areas with potholes, which will help in better detection. In addition, increasing the number of positive examples will also help in addressing the problems associated with non detection.

What Next ?

The idea behind this post was to give you a perspective on building an object detector from scratch. This was also an attempt to give you experience in working in cases where the data sets are limited and where you have to create the necessary data sets. I believe these exercises will equip you with the capabilities to deal with such issues in your projects.

Now that you have seen the basic ground-up approach, it is time to use this experience to learn more state of the art techniques. In the next post we will cover object detection using RCNN, and from there on we will be using transfer learning techniques extensively.



Build your Computer Vision Application – Part II: Data preparation and Annotation

This is the second post of the series where we build a road sign and pothole detection application. We will be using multiple methods throughout this series, which include computer vision techniques using OpenCV, annotating images using labelImg, mastering the Tensorflow object detection API, training object detection using transfer learning, object detection on video etc. This series will be split across 8 posts.

1. Introduction to object detection

2. Data set preparation and annotation using labelImg ( This Post )

3. Building your road pothole detector from scratch using Image pyramids and Sliding window

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

In this post we will talk about the data annotation and data preparation stage of the process

Data Sets for Object Detection

In the last post we got introduced to object detection tasks. We also briefly discovered some of the leading approaches for object detection. When discussing model training approaches you would have noticed that the data sets for object detection are not exactly like the data sets which you would have encountered in your normal machine learning lifecycle. Object detection data sets have two sets of labels: one is the class label for the objects and the second is the bounding box for each of the objects. The bounding boxes contain the (x, y) coordinates of the four corners where the object is present. There are different publicly available data sets for object detection tasks, the COCO dataset being one of the most popular ones.

For the specific task which we are dealing with i.e. Pothole detection, we might not have annotated data sets. Therefore we will have to create that dataset which includes the class labels and the bounding boxes.
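To make the two sets of labels concrete, one annotated object can be pictured as a small record holding the class label and the bounding box. The sketch below is only an illustration. The class name Annotation, the file name and the numbers are made up, and the box is stored as its top left and bottom right points, from which the four corners can be derived.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    filename: str
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int

    def corners(self):
        # Four corners in the order top left, top right, bottom right, bottom left
        return [(self.xmin, self.ymin), (self.xmax, self.ymin),
                (self.xmax, self.ymax), (self.xmin, self.ymax)]

# Hypothetical record; the file name and numbers are made up for illustration
record = Annotation("road_01.jpg", "pothole", 120, 200, 260, 270)
print(record.label, record.corners())
```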

This post will talk about downloading data for pothole detection, creating the class labels and bounding boxes for the data and then extracting the necessary information from the annotations so that we can use it for training the model. In this exercise we will use a tool called labelImg for annotating the dataset.

Installing and Configuring labelImg

LabelImg is a free, open source tool for graphically labeling images. It’s written in Python and is an easy, free way to label images for your object detection projects.

Installing labelImg is quite simple. It can be installed with the pip command for Python 3 as shown below.

pip3 install labelImg

To know more about the installation and configuration you can refer to the following link.

Let's now look at how we collect data and annotate it using labelImg.

Raw Data Creation

The first task is to create the data set required for annotation and for training the model. The images used in this series were collected from Google Images.

You can download as many images as you want for this task. Always remember to get a good variety of images, covering the different types of objects you are likely to see on roads.

Annotation of the images

The annotation of the images is done using the labelImg application.

To activate the labelImg application, just invoke the labelImg command on the terminal as follows

Figure 1 : Activating labelImg on terminal

Once this is activated, a front end will open as follows

Figure 2 : Front end of labelImg

We start by selecting the directory where the files are stored, using the Open Dir icon. Once the directory is selected, all the images in the directory are listed in the application as follows

Figure 3 : Files list

We navigate one image at a time and draw bounding boxes around the objects we want to annotate. Once a bounding box is drawn, we enter the label we want to give to that object. When all the bounding boxes and labels for an image are done, the annotation is saved as an xml file.

Figure 4 : Annotating the images

Let us open one of the xml files and look at the information it contains. The xml file holds the bounding boxes and the class information of the image, as shown below.

Figure 5 : xml file information
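To make the structure concrete, here is a minimal sketch of what an annotation in the Pascal VOC layout written by labelImg typically looks like, and how it parses. The file name, image size and box coordinates are made-up placeholders for illustration, not values from the actual pothole data set.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout that labelImg writes.
# All names and numbers below are illustrative placeholders.
SAMPLE_XML = """
<annotation>
    <folder>data</folder>
    <filename>example.jpeg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>pothole</name>
        <bndbox>
            <xmin>100</xmin>
            <ymin>200</ymin>
            <xmax>180</xmax>
            <ymax>240</ymax>
        </bndbox>
    </object>
</annotation>
"""

root = ET.fromstring(SAMPLE_XML)
# Tags of the top-level children of the root element
top_level_tags = [child.tag for child in root]
# Each 'object' element carries one class label and one bounding box
boxes = []
for obj in root.findall('object'):
    bb = obj.find('bndbox')
    boxes.append((obj.find('name').text,
                  int(bb.find('xmin').text), int(bb.find('ymin').text),
                  int(bb.find('xmax').text), int(bb.find('ymax').text)))
print(top_level_tags)
print(boxes)
```

An image with several annotated objects simply has several 'object' elements, one per bounding box.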

We have now annotated all the files with class names and bounding boxes. Let us now extract the information from the xml files into a csv file.

Extracting the Information from annotation

In this section we will extract all the annotation information into a pandas data frame and later on to csv file. We will start with importing all the library files we require.

import os
import glob
import pandas as pd
import xml.etree.ElementTree as ET

Next let us list all the 'xml' files in the folder using the glob() method. We have to give the path of the folder where the xml files are stored.

# Define the path
path = 'data'
# Get the list of all files in the folder
allFiles = glob.glob(path + '/*.xml')
allFiles
Figure 6 : List of all xml files

Next we need to parse the 'xml' files and extract the information from them. We will use the 'ElementTree' module in the xml package to parse a file and get the relevant information.

# Get one of the files
xml_file = allFiles[0]
# Parse xml file and get the root
tree = ET.parse(xml_file)
root = tree.getroot()
# For each element of the root print the tag and the attribute
for child in root:
    print(child.tag, child.attrib)
Figure 7 : Extracted elements from xml file

We first parse the file with ET.parse() to get the 'tree' object and then call getroot() to get the 'root' of the xml file. The root contains all the elements as children. In the for loop we go through each of the elements of the xml file and print the tag and the attributes of the element. We can see the major elements printed. If we look at the raw xml file we can see all these elements listed there.

As seen in the output, the elements named 'object' are the bounding boxes we annotated in the earlier step. These objects contain the bounding box information we need. Before we extract the bounding box information, let us look at some basic methods to extract any information from the root.

filename = root.find('filename').text
filename
Output : Name of the image file

Here we extract the filename using the root.find() method. We need to specify which element we want to look into, which in our case is the element called 'filename', as that is how it is represented in the xml file. This element holds the name of the image that the annotation belongs to. To get the filename as a string we use the .text attribute.

Let us now get the width and height of the image. We can see from the xml file that this is contained in the element, 'size'

# Extract width and height of the image
width = int(root.find('size').find('width').text)
height = int(root.find('size').find('height').text)
print(width,height)

Here we use the find() method to extract the width and height and then convert the text into integers.

Our next task is to extract the class names and the bounding box coordinates. Both are contained in each of the 'object' elements. The class label of the object is stored under the element name 'name', and the bounding box coordinates are stored inside the 'bndbox' element under the element names 'xmin', 'ymin', 'xmax' and 'ymax'. Let us look at one of the sample object elements.

# Get all the 'object' elements
members = root.findall('object')
# Take the first one to extract the information as an example
member = members[0]
print(member.find('name').text)
print(member.find('bndbox').find('xmin').text)
Class label and x min coordinate of the object

The two print statements show the class name and one of the bounding box values, extracted using the find() method as seen before.

Now that we have seen all the moving parts, let us encapsulate them in a function and extract all the information into a pandas dataframe. This code is taken from this tutorial link for object detection.

def xml_to_pd(path):
    """Iterates through all .xml files (generated by labelImg) in a given directory and combines
    them in a single Pandas dataframe.

    Parameters:
    ----------
    path : str
        The path containing the .xml files
    Returns
    -------
    Pandas DataFrame
        The produced dataframe
    """

    xml_list = []
    # List down all the files within the path
    for xml_file in glob.glob(path + '/*.xml'):
        # Get the tree and the root of the xml files
        tree = ET.parse(xml_file)
        root = tree.getroot()
        # Get the filename, width and height from the respective elements
        filename = root.find('filename').text
        width = int(root.find('size').find('width').text)
        height = int(root.find('size').find('height').text)
        # Extract the class names and the bounding boxes of the classes
        for member in root.findall('object'):
            bndbox = member.find('bndbox')
            value = (filename,
                     width,
                     height,
                     member.find('name').text,
                     int(bndbox.find('xmin').text),
                     int(bndbox.find('ymin').text),
                     int(bndbox.find('xmax').text),
                     int(bndbox.find('ymax').text),
                     )
            xml_list.append(value)
    # Consolidate all the information into a data frame
    column_name = ['filename', 'width', 'height',
                   'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

Let us now extract the information of all the xml files and then convert it into a pandas data frame.

pothole_df = xml_to_pd(path)
pothole_df
Pandas dataframe containing the bounding box information

Finally let us save this label information in a csv file, as we will use it later for training our object detection models.

pothole_df.to_csv('pothole_df.csv',index=False)

Having prepared the data set, let us now look at the next process which is to prepare the train and test sets.

Preparing the Training and test sets

Building the training images involves multiple steps. Let us look at each of them.

Mixing positive and negative images

We just annotated the images with potholes along with their bounding boxes. We will be using those images for building the positive classes for the object detector. Along with the positive classes, we also need some negative examples. For negative examples we will take some images of roads without potholes. We will keep the positive and negative examples in separate folders and then use them for building the training data. We will also use some augmentation techniques to increase the training data. Let us dive deeper into the preparation of the training data set.

import os
import glob
import pandas as pd
import io
import cv2
from skimage import feature
import skimage
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.svm import SVC
import numpy as np
import argparse
import pickle
import matplotlib.pyplot as plt
from random import sample
%matplotlib inline

We start by importing all the required packages, as listed above. Next let us look at the positive examples, which are the images with potholes that were downloaded in the last post.

# Positive Images
path = 'data'
allFiles = glob.glob(path + '/*.jpeg')
print(len(allFiles))
allFiles

The above figure lists the images which were downloaded and annotated earlier. You are free to download any number of images. The more the better, as the classifier will perform better with more examples. Later on we will see how we augment these images with different augmentation techniques to increase the number of positive images. However, whatever augmentation techniques we use, they are not a substitute for a variety of positive images.

Let us now look at the negative classes of images. For negative classes we will be using images of normal roads. Let us look at some examples of the negative images.

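# Negative images
path = 'data/Annotated'
roadFiles = glob.glob(path + '/*.jpeg')
for imgPath in roadFiles[:2]:
    img = cv2.imread(imgPath)
    # OpenCV loads images as BGR, so convert to RGB before showing with matplotlib
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.show()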

These negative images were downloaded in the same way as the positive images, i.e. from Google Images. Again, the more examples the better. However, it is important to maintain a fair balance between the positive and negative examples.

Extracting HOG features from the images

Once the positive and negative images are collected, it is time to extract features from the images. There are different methods to extract features from images. The method we will be using is HOG features. HOG stands for 'Histogram of Oriented Gradients'. Let us take a quick tour of the HOG method.

Histogram of Oriented Gradients ( HOG )

HOG descriptors are used to represent the structure and appearance of the object in an image. This algorithm works on the principle that an object in an image can be modeled by the distribution of intensity gradients within the regions where the object resides. The implementation of this method entails dividing an image into small cells and then, for each cell, computing the histogram of oriented gradients for the pixels within that cell. The histograms across multiple cells are accumulated to form the feature vector. The dimensionality of these feature vectors depends on the dimensions of the image and the parameters of the HOG descriptor like pixels_per_cell, cells_per_block and orientations. You can refer to the following link to learn more about HOG descriptors
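To get a feel for what happens inside a single cell, here is a minimal, simplified sketch of an orientation histogram. It is not the scikit-image implementation: it uses plain central differences, unsigned orientations and no block normalisation, and the function name cell_histogram and the default of 9 bins are assumptions made for illustration.

```python
import math

def cell_histogram(cell, orientations=9):
    """Unsigned gradient-orientation histogram for one cell (a list of rows)."""
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * orientations
    bin_width = 180.0 / orientations
    # Border pixels are skipped so that central differences stay inside the cell
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the intensity gradient
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude
            hist[int(angle // bin_width) % orientations] += magnitude
    return hist

# A vertical edge: intensity jumps from 0 to 10 between the two middle columns
cell = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_histogram(cell)
print(hist)
```

For this vertical edge all the gradient energy lands in the first bin, because the intensity only changes along the x direction. Concatenating such histograms over all cells (after block normalisation, which is omitted here) is what produces the final feature vector.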

Let us now implement the methods for extracting the features and saving the data set on to disk. As a first step we will read the positive images which are the pothole images. We will read the data from the information in the csv file we created earlier. We will take the information and then extract only those patches which contain potholes. Let us first look at the csv file containing the data.

# Reading the csv file
pothole_df = pd.read_csv('pothole_df.csv')
pothole_df

As seen from the output, the data set extracted here contains only 65 rows, which comprise all the classes including vegetation, signs, potholes etc. From this csv file, we will extract only the pothole data. The number of images has been kept intentionally low, so that we can also explore some augmentation techniques to enhance the data set. When you embark on custom solutions where data sets are not available, you will have to resort to different augmentation techniques to improve your results.

Let us now explore the dimensions of the pothole bounding boxes we have, and look at their average width and height. This, as we will see later, is used to define the size of the window for the pyramid and sliding window techniques. We will use the pothole_df data frame to find the dimensions.

# Find the mean of the x dim and y dimensions of the pothole class
xdim = np.mean(pothole_df[pothole_df['class']=='pothole']['xmax'] - pothole_df[pothole_df['class']=='pothole']['xmin'])
ydim = np.mean(pothole_df[pothole_df['class']=='pothole']['ymax'] - pothole_df[pothole_df['class']=='pothole']['ymin'])
print(xdim,ydim)

We will round off the dimensions to [80,40] which we will adopt as the window dimensions for the pyramid and sliding window methods.

# We will take the windows dimension as these dimensions rounded off
winDim = [80,40]

Once the annotations are read from the csv file, it is time to extract the patches of potholes we require from the images. There are two functions which we require to extract the features we want. The first one extracts the HOG features from the image. Let us look at that function first.

# Defining the hog structure
def hogFeatures(image,orientations,pixelsPerCell,cellsPerBlock,normalize=True):
    # Extracting the hog features from the image
    feat = feature.hog(image, orientations=orientations, pixels_per_cell=pixelsPerCell,cells_per_block = cellsPerBlock, transform_sqrt = normalize, block_norm="L1")
    feat[feat < 0] = 0
    return feat

The inputs to this function are the image from which we want to extract the features, the orientations, pixels per cell, cells per block and the normalize flag.

Inside the function, we extract the features from the image using the feature.hog() method, passing in all our parameters. Once we extract the features, we set any negative feature values to 0. The extracted features are then returned by the function in the last line.

The next method we will see is the one to augment our images. There are different types of augmentation techniques which are useful. We will be using techniques like flipping (both horizontal and vertical) and rotating the images to different angles. Let us see the function to augment our images.

# Defining the function for image augmentation
def imgAug(roi,ht,wd,extensive=True):
    # Initialise the empty list to store images
    rois = []
    # Resize the ROI to the desired size. Note that cv2.resize takes the
    # target size as (width, height), so imgAug(roi,80,40) gives a patch
    # that is 80 pixels wide and 40 pixels high
    roi = cv2.resize(roi, (ht,wd), interpolation=cv2.INTER_AREA)
    # Append the resized image
    rois.append(roi)
    # Augment the image by flipping horizontally
    rois.append(cv2.flip(roi, 1))
    if extensive:
        # Flip vertically and rotate by 90 degrees in both directions
        rois.append(cv2.flip(roi, 0))
        rois.append(cv2.rotate(roi, cv2.ROTATE_90_CLOCKWISE))
        rois.append(cv2.rotate(roi, cv2.ROTATE_90_COUNTERCLOCKWISE))
        # Rotate to other angles
        for rot in [15,45,60,75,85]:
            # Get the rotation matrix about the centre of the patch
            rotMatrix = cv2.getRotationMatrix2D((ht/2,wd/2),rot,1)
            # Rotate the image using the rotation matrix
            rois.append(cv2.warpAffine(roi,rotMatrix,(ht,wd)))
    return rois

The inputs to the function are the patch of image we want to augment along with the dimensions to which we want to resize the image. We also define a parameter called extensive to control whether we apply all the methods or just a simple horizontal flip.

We first initialise a list to store all the augmented images and then resize the image with cv2.resize(). The resized image is then appended to the list.

The first augmentation technique is the horizontal flip, done with cv2.flip(roi, 1). The parameter 1 stands for flipping around the y axis.

If we want extensive augmentation, we proceed with the other types of augmentation. The first of these are the vertical flip, the clockwise rotation and the anticlockwise rotation, all inside the if extensive block.

Then we do 5 different rotations based on the list of angles specified in the for loop. You can try out more angles of your choice. To do the rotation we first have to define a rotation matrix with cv2.getRotationMatrix2D(). Its input parameters are the centre of the image, the angle by which we have to rotate and the scale factor. We have chosen a scale of 1. You can try different scale values and see their effect on the image.

Once the rotation matrix is defined, the image is rotated using the method cv2.warpAffine(). Here we give the patch of image, the rotation matrix and the dimensions of the output image as inputs.

We finally append all the augmented images into the list and then return the rois.
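If the geometry of the flips and 90 degree rotations is hard to picture, the following dependency-free sketch reproduces them on a tiny grid of numbers. The helper names are hypothetical stand-ins written for illustration. They mimic what cv2.flip and cv2.rotate do to pixel positions, but they are not the OpenCV functions themselves.

```python
def flip_horizontal(img):
    # Mirror left-right, analogous to cv2.flip(img, 1)
    return [row[::-1] for row in img]

def flip_vertical(img):
    # Mirror top-bottom, analogous to cv2.flip(img, 0)
    return img[::-1]

def rotate_90_clockwise(img):
    # Analogous to cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
    return [list(col) for col in zip(*img[::-1])]

def shape(img):
    # (rows, cols) of a rectangular list of lists
    return (len(img), len(img[0]))

# Tiny 2 x 3 "image" so the effect of each operation is easy to read
img = [[1, 2, 3],
       [4, 5, 6]]
h_flipped = flip_horizontal(img)
v_flipped = flip_vertical(img)
rotated = rotate_90_clockwise(img)
print(h_flipped, shape(h_flipped))
print(v_flipped, shape(v_flipped))
print(rotated, shape(rotated))
```

Note that the flips keep the shape of the patch, while a 90 degree rotation swaps its height and width. With extensive=True the imgAug function returns 10 images per patch: the resized original, two flips, two 90 degree rotations and five further rotations.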

The overall process to extract the features consists of two functions as given below.

# Functions to extract the bounding boxes and the hog features
def roiExtractor(row,path):
    img = cv2.imread(path + row['filename'])    
    # Get the bounding box elements
    bb = [int(row['xmin']),int(row['ymin']),int(row['xmax']),int(row['ymax'])]
    # Crop the image
    roi = img[bb[1]:bb[3], bb[0]:bb[2]]
    # Get the list of augmented images
    rois = imgAug(roi,80,40)
    return rois

def featExtractor(rois,data,labels,positive=True):
    for roi in rois:
        # Extract hog features
        feat = hogFeatures(roi,orientations,pixelsPerCell,cellsPerBlock,normalize=True)
        # Append data and labels
        data.append(feat)
        labels.append(int(1))        
    return data,labels

The first of these functions reads an image based on the information from the csv file and then crops the image using the bounding box coordinates. Finally, the cropped image is passed to imgAug() for augmentation.

The second function takes the augmented images derived using the first function and extracts the HOG features from each of them. We append the features to the list data, and a label of 1 is appended to the list labels as these are the positive examples.

Having seen all the functions let us now see the process of preparing the data sets.

# Extracting pothole patches from the data
path = 'data/'
# Parameters for extracting HOG features
orientations=12
pixelsPerCell=(4, 4)
cellsPerBlock=(2, 2)
# Empty lists to store data and labels
data = []
labels = []
# Looping through the rows of the data frame
for idx, row in pothole_df.iterrows():
    if row['class'] == 'pothole':
        rois = roiExtractor(row,path)
        data,labels = featExtractor(rois,data,labels)

The process is quite straightforward. We first define the parameters for HOG feature extraction. Then we initialise two empty lists to store the data and the labels. We then loop through each of the rows of the pothole data frame and extract the rois and features if the class of the row is 'pothole'.

Those were the positive examples. It is now time to extract features for the negative examples. Let us first list all the negative examples and then loop through them.

# Listing all the negative examples
path = 'data/Annotated'
roadFiles = glob.glob(path + '/*.jpeg')
roadFiles
# Looping through the files
for row in roadFiles:
    # Read the image
    img = cv2.imread(row)
    # Extract patches; patch_size is (height, width), so (40,80) matches the 80 x 40 window
    patches = extract_patches_2d(img,(40,80),max_patches=10)
    # For each patch do the augmentation
    for patch in patches:        
        # Get the list of augmented images
        rois = imgAug(patch,80,40,False)
        # Extract the features using HOG        
        for roi in rois:
            feat = hogFeatures(roi,orientations,pixelsPerCell,cellsPerBlock,normalize=True)
            data.append(feat)
            labels.append(int(-1))

To extract the negative examples, we first iterate through the files and read each file. Since we don't have to crop a specific area within the image, we adopt a different strategy to generate examples. We extract a number of patches of a fixed window size from the image. This is implemented through the extract_patches_2d() method in Sklearn. The window size is based on the dimensions we fixed earlier. We also specify the number of patches we want to extract through the max_patches parameter. For each patch we extract, we do only a horizontal flip, as it wouldn't make sense to do any other augmentation steps for the images of roads. We then extract the HOG features with hogFeatures(), as we did for the positive examples. The labels for these examples are -1 as these are the negative images.
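Conceptually, sampling a limited number of patches amounts to picking random top-left corners such that the patch still fits inside the image. The sketch below illustrates only that idea. It is not how Sklearn implements extract_patches_2d(), and the function name, image size, patch size and seed are all hypothetical values chosen for illustration.

```python
import random

def random_patch_origins(img_h, img_w, patch_h, patch_w, max_patches, seed=0):
    """Pick random top-left (row, col) corners for patches that fit inside the image."""
    rng = random.Random(seed)
    return [(rng.randint(0, img_h - patch_h), rng.randint(0, img_w - patch_w))
            for _ in range(max_patches)]

# Hypothetical 300 x 400 road image and ten patches of 40 rows by 80 columns
origins = random_patch_origins(img_h=300, img_w=400, patch_h=40, patch_w=80, max_patches=10)
# Each patch contributes the original plus one horizontal flip
negatives_per_image = len(origins) * 2
print(origins)
print(negatives_per_image)
```

With ten patches per road image and one flip per patch, every negative image contributes 20 feature vectors to the training data.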

Having extracted the features and the labels, we will now write the data to disk in HDF5 format using h5py.

import h5py
import numpy as np
# Define the output path
outputPath = 'data/pothole_features_all.hdf5'
# Create the database and write method
db = h5py.File(outputPath, "w")
dataset = db.create_dataset('pothole_features_all', (len(data), len(data[0]) + 1), dtype="float")
dataset[0:len(data)] = np.c_[labels, data]
db.close()

In this implementation we first define the outputPath and then create the database in write mode ("w"). To create the dataset we use the create_dataset() method, giving the name and the dimensions of the dataset. We increase the second dimension by 1 as we will be storing the label in the same dataset. We finally store the dataset as a numpy array where the labels and data are concatenated column-wise using the np.c_ object of numpy. After this step the new database is created in the specified path.
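The row layout that results from this concatenation can be pictured with a small dependency-free sketch. The two labels and the three-element feature vectors below are toy values made up for illustration. The real rows hold one label followed by the HOG features.

```python
# Toy stand-ins for the real labels and HOG feature vectors
labels = [1, -1]
data = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]]

# Same layout as np.c_[labels, data]: the label first, then the features
rows = [[float(lbl)] + feats for lbl, feats in zip(labels, data)]

# When reading the data back, column 0 is the label and the rest are features
read_labels = [int(r[0]) for r in rows]
read_feats = [r[1:] for r in rows]
print(rows)
print(read_labels, read_feats)
```

Keeping the label in the first column means that a single dataset holds everything needed for training, at the cost of remembering to split the columns again after loading.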

We can read the database using the h5py.File() method. Let us look at the name of the data set we gave earlier by taking the keys() of the database.

# Read the h5py file in read-only mode
db = h5py.File(outputPath, "r")
list(db.keys())
# Shape of the data
db["pothole_features_all"].shape

You can see the shape of the data set we created. We have 730 examples, positive and negative combined. We can also see that each row has 8209 columns, which is the label plus the 8208 HOG features.
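The figure of 8208 features can be checked by hand from the HOG parameters used above. The helper below is a small sketch written for this check. It assumes that blocks slide one cell at a time, and the function name is hypothetical.

```python
def hog_feature_length(height, width, orientations, pixels_per_cell, cells_per_block):
    """Length of a HOG vector when blocks slide one cell at a time."""
    cells_y = height // pixels_per_cell[0]
    cells_x = width // pixels_per_cell[1]
    blocks_y = cells_y - cells_per_block[0] + 1
    blocks_x = cells_x - cells_per_block[1] + 1
    return blocks_y * blocks_x * cells_per_block[0] * cells_per_block[1] * orientations

# 40 x 80 patch, 12 orientations, 4 x 4 pixel cells, 2 x 2 cell blocks
n_features = hog_feature_length(40, 80, 12, (4, 4), (2, 2))
print(n_features)       # 8208
print(n_features + 1)   # 8209, after adding the label column
```

A patch that is 40 pixels high and 80 pixels wide has 10 x 20 cells, which gives 9 x 19 = 171 block positions. Each block holds 2 x 2 cells with 12 orientation bins, so 171 x 4 x 12 = 8208.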

That takes us to the end of the data preparation stage for building our object detector. In the next post we will take this data and build our object detector from scratch.

What Next ?

In the next post, we will explore different techniques to build our custom object detector. We will be covering the following topics in the next post

  1. Building a classifier using the training data
  2. Introduce the concept of Image pyramids and sliding windows
  3. Using Image pyramids and sliding windows to extract bounding boxes for your images
  4. Use non maxima suppression to eliminate overlap of bounding boxes.

We will be covering a lot of ground in the next post. The next post will be published next week. To be notified of the next post please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

You can also access the code base for this series from the following git hub link

Build your computer vision application : Pothole detection application – Introduction to Object Detection

This post is the start of a series where we embark on a journey into computer vision. In this series we will build a pothole detection application. We will be using multiple methods throughout this series, including computer vision techniques using OpenCV, annotating images using labelImg, mastering object detection techniques like RCNN, YOLO and the Tensorflow object detection API, training object detection models using transfer learning, object detection on video etc. This series will be split across the following posts.

1. Introduction to object detection ( This post)

2. Data set preparation and annotation Using labelImg

3. Building your road pothole detector from scratch using Image pyramids and Sliding window

4. Building your road pothole detector using RCNN

5. Building your road pothole detector using YOLO

6. Building your road pothole detector using Tensorflow object detection API

7. Building your video analytics application for detecting potholes

8. Deploying your video analytics application for detection of potholes

You will be covering a lot of ground in this series, and by the end of it you will have built a good understanding of different computer vision techniques. Let us get going on this exciting journey with an introduction to object detection.

Introduction to Object detection

Object detection entails detecting and localising objects in images or video. The object detection process involves multiple techniques including object annotation, image preprocessing, bounding box localisation and image classification, to name a few. Object detection has broad applications in personal devices, public services and industrial processes. One prominent use case you meet in day to day use is the bounding box drawn around faces by the camera on your phone.

Object detection on phone

Beyond such simple use cases as face detection, object detection techniques are used in real world applications with large impact. Road traffic accident prevention, detection of defects on factory assembly lines and detection for military purposes are some of the notable examples.

Object detection is not a new phenomenon. It has existed for as long as computer vision has existed and has evolved a great deal from the earlier techniques. The early processes of object detection involved manually extracting features and then using classifiers to detect objects.

Some of the earlier techniques for feature extraction were HOG (Histogram of Oriented Gradients), Haar and SIFT (Scale-Invariant Feature Transform). Once features were extracted using these algorithms, the images were classified using classifiers like SVM (Support Vector Machine) or other classification algorithms like Random Forest or AdaBoost.

These traditional machine learning techniques relied on extracting and classifying low-level feature information and, for that reason, did not scale well to complex use cases. With the advent of deep learning, a host of techniques matured which scaled well to multiple use cases. Some of the prominent ones are R-CNN (Region-based Convolutional Neural Networks), SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once). These techniques provide much greater accuracy than the traditional techniques. Many frameworks like Tensorflow and Pytorch now provide custom object detection capabilities. Transformer-based architectures are also being used widely for vision tasks, including object detection. One well-known example is the Vision Transformer, which was introduced for image classification.

In this post, we will look at the evolution of different techniques and understand these techniques conceptually. This post will lay the foundation for the object detection application we will be building progressively over this series. In the course of this series we will get hands on experience in each of these techniques and then finally tie all of them together in the pothole detection application where we will use the trained model to detect potholes on videos. I can assure you that this is going to be a very exciting journey.

Evolution of object detection techniques

When learning about Object detection, let us start from the legacy methods. The idea is to understand how different techniques evolved over time.

Template matching

Template matching can be termed a naive approach for detecting objects in an image. In this method, a template of the object which we want to detect is slid across the image and the correlation of the template with the input image is captured. The location where the correlation is the highest is predicted as the location of the object.

Template matching

As shown in the figure above, the template of the pothole is slid across the image. The correlation coefficient between pixel intensities of the template and image is captured and the best matching location is identified. This can easily be implemented using frameworks like OpenCV.
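The idea can be sketched in a few lines of plain Python. The example below slides a tiny template over a tiny grid of intensities and scores each position with a zero-mean normalised cross-correlation. The grid values and the function names are made up for illustration. In OpenCV the same operation is available as cv2.matchTemplate().

```python
import math

def ncc(patch, template):
    """Zero-mean normalised cross-correlation between two equal-sized grids."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    p_mean, t_mean = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - p_mean) * (b - t_mean) for a, b in zip(p, t))
    den = math.sqrt(sum((a - p_mean) ** 2 for a in p) *
                    sum((b - t_mean) ** 2 for b in t))
    # A flat patch has no variance, so treat its correlation as 0
    return num / den if den else 0.0

def match_template(image, template):
    """Slide the template over the image and return the best (score, x, y)."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, x, y)
    return best

# Toy 5 x 5 image with a 2 x 2 pattern placed at column 2, row 2
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 9, 1, 0],
         [0, 0, 1, 9, 0],
         [0, 0, 0, 0, 0]]
template = [[9, 1],
            [1, 9]]
score, x, y = match_template(image, template)
print(score, x, y)
```

The best score is found where the template sits exactly on top of the pattern. If the pattern in the image were larger or smaller than the template, no position would line up this well, which leads to the scale problem discussed next.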

Being a simple method, template matching is also fraught with limitations. One problem which often crops up is the one related to different scales used for template and the image. If the scales for the template and image are different, then detection of objects, very often, becomes erroneous.

Another problem is the visual deviation between the object in the template and the object in the image. If the appearance of the object in the image differs from that of the template, detection of the object suffers considerably.

Template matching is one of the earlier methods employed for object detection and is no longer used in modern object detectors. Next we will explore a method whose concepts are used in many of the advanced methods: image pyramids and sliding windows.

Image pyramid and sliding window methods for Object detection

Let us try to understand the concept of image pyramids with an example. Let us assume that we have a window of fixed size for detecting potholes and that potholes are detected only if the pothole fits perfectly inside the box. With such a fixed sized window we might not be able to detect all potholes that might be present in an image. Take the case of layer 1 of the image below. We can see that the fixed sized window was able to detect one of the potholes further down the road as it fit well within the window size. However, the bigger potholes at the near end of the image are not detected as the box is smaller than the pothole.

As a way to solve this, let us progressively reduce the size of the image while keeping the size of the box constant. This can be seen as we traverse from layer 1 to layer 7 in the figure below. With the reduction in size of the image, the object we want to detect also reduces in size, and as our detection window remains the same, we are able to detect potholes of multiple sizes.

Object detection with fixed size window

This process of progressively scaling an image to detect objects is the underlying technique used in image pyramids. The name image pyramid signifies the fact that, if the scaled images are stacked vertically, they fit inside a pyramid as shown in the figure below.

Image Pyramids

There are many different types of image pyramid implementation. Some of the prominent ones are Gaussian pyramids and Laplacian pyramids.
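The progressive rescaling can be sketched by simply listing the size of each layer. The scale factor of 2.0, the 640 x 480 starting size and the 80 x 40 minimum size below are assumptions chosen for illustration. In practice each layer would also be resized, for example with cv2.resize().

```python
def pyramid_sizes(width, height, scale=1.5, min_size=(80, 40)):
    """Yield the (width, height) of each pyramid layer, largest first."""
    # Stop once the image becomes smaller than the detection window
    while width >= min_size[0] and height >= min_size[1]:
        yield (width, height)
        width, height = int(width / scale), int(height / scale)

layers = list(pyramid_sizes(640, 480, scale=2.0))
print(layers)
```

A smaller scale factor gives more layers and a better chance of matching an object's size, at the cost of more windows to classify.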

Image pyramids alone do not help in detecting objects. This method has to be implemented in conjunction with a method called sliding windows which enables detection of objects in an image at various scales and locations. As the name suggests, this method involves sliding a window of standard length and width accross an image to extract features. These features will then be used in a classifier to identify the object of interest.

Sliding window across the image to detect objects
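
The sliding part can be sketched in a few lines. The snippet below only generates the window coordinates; the function name and the step size are assumptions for illustration, and the feature extraction and classification steps are left out.

```python
def sliding_window(img_w, img_h, win_w, win_h, step):
    """Yield (x_min, y_min, x_max, y_max) for every window position."""
    # Move top to bottom, and left to right within each row
    for y in range(0, img_h - win_h + 1, step):
        for x in range(0, img_w - win_w + 1, step):
            yield (x, y, x + win_w, y + win_h)


windows = list(sliding_window(img_w=8, img_h=6, win_w=4, win_h=4, step=2))
print(len(windows), windows[0], windows[-1])
```

A smaller step gives more windows and finer localisation, at the cost of more classifier calls per image.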

We will be building an object detector from scratch using image pyramids and sliding windows in the third post of this series.

Next let us get to know some of the advanced methods for object detection which are built on deep learning models.

RCNN Framework

This framework was originally introduced by Girshick et al. in 2013. There have been several modifications to the original architecture, resulting in better performance over time. For some time the RCNN framework was the go-to model for object detection tasks.

The original RCNN algorithm contains the following key steps

  • Extract regions which potentially contain an object from the input image. Such extractions are called region proposal extractions. The extraction was done using an algorithm like selective search.
  • Use a pretrained CNN to extract features from the proposal regions.
  • Classify each extracted region using a classifier like Support Vector Machines (SVM).

The original RCNN algorithm gave much better results than traditional methods like the sliding window and pyramid based methods. However this system was slow. Besides, deep learning was not used for localising the objects in the image and it was mostly left to algorithms like selective search.

Fast-RCNN Architecture: Image Source : https://arxiv.org/pdf/1504.08083.pdf

A significant improvement was made to the original RCNN algorithm by the same author in 2015. This algorithm was named Fast-RCNN. It introduced some novel ideas, such as the Region of Interest pooling layer. The Fast-RCNN algorithm ran a CNN over the entire image once to extract a feature map from it. For each proposal region, a fixed-size window was extracted from the feature map and then passed to fully connected layers to get the output label for that region. This step was termed Region of Interest pooling. Two sets of fully connected layers were used to get the class labels of the regions along with the locations of the bounding boxes for each region.
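
The pooling step can be illustrated on a toy feature map. This is a minimal pure Python sketch, not the implementation from the paper: the function name is hypothetical, the feature map is a small list of lists, and the way the region is split into bins (floor for the start, ceiling for the end) is one common choice that is assumed here.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_h x out_w grid.

    x1 and y1 are exclusive; feature_map is a list of rows.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row span of bin i: floor for the start, ceiling for the end
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + -(-((j + 1) * roi_w) // out_w)
            # Keep only the strongest activation inside the bin
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2))
```

Whatever the size of the proposal region, the output always has the same shape, which is what allows regions of different sizes to share the same fully connected layers.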

Faster RCNN : Image Source : https://arxiv.org/pdf/1506.01497.pdf

Within a couple of months of the publication of the Fast-RCNN algorithm, another algorithm called Faster-RCNN was published, which improved upon Fast-RCNN. The new algorithm had another salient feature called the Region Proposal Network (RPN), which was introduced to eliminate the need for selective search algorithms and build the capability for region proposal into the R-CNN architecture itself. In this algorithm, anchors were placed uniformly across the entire image at varying scales and aspect ratios. These anchors would be examined by the RPN, which then output proposals as to where an object is likely to exist.
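
To get a feel for what placing anchors uniformly at varying scales and aspect ratios means, here is a small sketch. The function name, the stride, the scales and the ratios are all assumptions for illustration and are not the values used in the paper.

```python
import math


def make_anchors(img_w, img_h, stride, scales, ratios):
    """Return anchors as (x_min, y_min, x_max, y_max) centred on a regular grid."""
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for scale in scales:
                for ratio in ratios:  # ratio = width / height
                    # Keep the area close to scale * scale for every ratio
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(img_w=64, img_h=32, stride=16,
                       scales=(16, 32), ratios=(0.5, 1.0, 2.0))
print(len(anchors), anchors[1])
```

Every grid position gets one anchor per combination of scale and ratio, so the number of anchors grows quickly with the image size. The RPN then scores each anchor for how likely it is to contain an object.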

R-CNN architectures generate potential bounding box regions in an image. These potential regions are then classified using a classifier. The classified regions are then post-processed to refine the bounding boxes, eliminate duplicate detections and rescore the boxes based on other objects in the image. We will be implementing an object detector using RCNN in the fourth post of this series.
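
Eliminating duplicate detections is commonly done with non-maximum suppression (NMS). The sketch below is a plain greedy version written for illustration; the function names, the sample boxes and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap heavily with a kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The first two boxes overlap almost completely, so only the higher scoring one survives, while the third box is far away and is kept.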

YOLO Algorithm

YOLO, which is an acronym for 'You only look once', is a simple algorithm which treats the object detection task as a single regression problem, going from image pixels to bounding box coordinates and class probabilities in one straight-through process. This algorithm has at its core a single convolutional network which predicts multiple bounding boxes and class probabilities simultaneously.

Figure 4 Yolo Algorithm : Image Source – https://arxiv.org/pdf/1506.02640.pdf

An input image is divided into a grid of equal-sized square cells. Each grid cell predicts multiple bounding boxes and confidence scores for those boxes. The confidence scores indicate how confident the model is that the box contains an object. YOLO combines all components of object detection into a single neural network, and this network uses features from the entire image to predict each bounding box. The bounding boxes of all classes are predicted simultaneously.
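
In the YOLO paper, the grid cell that contains the centre of an object is the one responsible for detecting it. A minimal sketch of that lookup is given below; the function name and the sample numbers are assumptions for illustration.

```python
def responsible_cell(box_cx, box_cy, img_w, img_h, grid_size=7):
    """Return (row, col) of the grid cell that contains the box centre."""
    # Clamp so that a centre lying exactly on the far edge stays inside the grid
    col = min(int(box_cx / img_w * grid_size), grid_size - 1)
    row = min(int(box_cy / img_h * grid_size), grid_size - 1)
    return row, col


cell = responsible_cell(box_cx=300, box_cy=100, img_w=448, img_h=448, grid_size=7)
print(cell)
```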

There are multiple variations of YOLO, from YOLOv1 through YOLOv5 to PP-YOLOv2, released in April 2021. The accuracy of these models is close to, but usually not better than, that of R-CNNs. Where they stand apart is in their detection speed, which makes them a good choice for real-time video or camera feeds. We will be implementing a YOLO object detector in the fifth post of this series.

Single Shot Detector (SSD) Algorithm

When we discussed the R-CNN architecture, we understood that it has multiple processes, including

  1. A Region Proposal Network (RPN)
  2. ROI pooling
  3. Image classifier

All these processes are encapsulated in a single framework, which considerably slows down the training process. In addition, inference is also very slow, which makes real-time object detection painful. We saw how many of these issues were solved in the YOLO algorithm. SSD, like YOLO, is another approach which addresses these issues and thereby achieves localization and detection in a single forward pass of the network at inference time. The SSD framework was introduced by Liu et al. in their 2015 paper, SSD : Single Shot Multibox Detector.

SSD : Image Source – https://arxiv.org/pdf/1512.02325.pdf

SSD has a base network, which is typically pre-trained on large data sets like ImageNet. When this framework was first introduced, VGG16 was used as the base network. However, there are now lighter base networks than VGG16, like MobileNet and SqueezeNet, which are considerably more efficient.

The SSD framework uses an approach similar to the MultiBox algorithm published by Szegedy et al. During training it needs an input image and ground truth boxes for each object. A set of default boxes with different scales and aspect ratios is evaluated at each location of each feature map during the training process. For each of these boxes, the shape offsets and the confidence scores for the object categories are predicted. The default boxes are then compared to the ground truth bounding boxes, and losses are calculated based on localisation and confidence.
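
The shape offsets mentioned above can be made concrete with a small sketch. It follows the parameterisation given in the SSD paper, where boxes are described by their centre, width and height, and leaves out any scaling constants that particular implementations add. The function names and the sample boxes are assumptions for illustration.

```python
import math


def encode_offsets(gt, default):
    """Offsets of a ground truth box relative to a default box.

    Both boxes are (cx, cy, w, h). Returns (dx, dy, dw, dh).
    """
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return ((g_cx - d_cx) / d_w,
            (g_cy - d_cy) / d_h,
            math.log(g_w / d_w),
            math.log(g_h / d_h))


def decode_offsets(offsets, default):
    """Invert encode_offsets to turn predicted offsets back into a box."""
    dx, dy, dw, dh = offsets
    d_cx, d_cy, d_w, d_h = default
    return (d_cx + dx * d_w,
            d_cy + dy * d_h,
            d_w * math.exp(dw),
            d_h * math.exp(dh))


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.45, 0.3, 0.1)
offsets = encode_offsets(gt_box, default_box)
restored = decode_offsets(offsets, default_box)
print(offsets)
print(restored)
```

During training the network learns to predict these offsets for every matched default box. At inference time the decode step turns the predictions back into box coordinates.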

The SSD framework provides a unified end-to-end framework for object detection. However, one criticism of SSD is that it does not detect small objects in an image very well. A common workaround for this problem is to increase the size of the image. Despite these small drawbacks, SSD provides an excellent end-to-end framework for object detection.

Object detection using Tensorflow object detection API

Tensorflow object detection API is a framework that makes the task of training and deploying object detection models very easy. The API also makes use of many pre-trained models, which adds to the flexibility of the framework. Different types of model architectures can be easily implemented from scratch using the API framework. This ensures fewer moving parts when implementing complex tasks like object detection. When doing object detection, the TFOD API is a go-to tool to quickly implement and scale an object detection model. We will also be implementing our object detection framework using the TFOD API in post 6 of this series.

What Next ?

In this post we reviewed some of the frameworks for object detection. The idea was to understand some of the critical parts of each of these frameworks. In the subsequent posts, we will train our pothole models using some of the most important frameworks.

In the next post we will deal with the preparation of an annotated data set for our purpose. We will see how we can use the labelImg framework to annotate the data sets and then extract the classes and bounding boxes from the annotated images. Preparation of your own data set will enable you to build your custom object detectors. Publicly available data sets like the COCO data set for object detection come with annotated images for a certain set of objects. However, if you want to put object detection to use for custom use cases like pothole detection or detection of defective parts in an assembly line, you will have to prepare your own data sets. The next post will enable you to build your own annotated data sets for your custom projects.

The next post will be published next week. To be notified of the next post please subscribe to this blog post. You can also subscribe to our Youtube channel for all the videos related to this series.

Watch this space for more.

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application, or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in machine learning, subscribe to our Youtube channel.

I would also recommend two books I have co-authored. The first one specialises in deep learning, with practical hands-on exercises and interactive video and audio aids for learning.

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Building Self Learning Recommendation system – VIII : Evaluating deployment options

This is the eighth and last post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts in which we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit
  4. Build the prototype of the self learning recommendation system : Part I
  5. Build the prototype of the self learning recommendation system : Part II
  6. Productionising the self learning recommendation system : Part I – Customer Segmentation
  7. Productionising self learning recommendation system: Part II : Implementing self learning recommendations
  8. Evaluating deployment options for the self learning recommendation systems. ( This post )

This post ties together everything we discussed in the previous two posts, in which we explored all the classes and methods we built for the application. In this post we will implement the driver file which controls all the processes and then explore different options to deploy this application.

Implementing the driver file

Now that we have seen all the classes and methods of the application, let us now see the main driver file which will control the whole process.

Open a new file and name it rlRecoMain.py and copy the following code into the file

import argparse
import pandas as pd
from utils import Conf,helperFunctions
from Data import DataProcessor
from processes import rfmMaker,rlLearn,rlRecomend
import os.path
from pymongo import MongoClient

# Construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument('-c','--conf',required=True,help='Path to the configuration file')
args = vars(ap.parse_args())

# Load the configuration file
conf = Conf(args['conf'])

print("[INFO] loading the raw files")
dl = DataProcessor(conf)

# Check if custDetails already exists. If not create it
if os.path.exists(conf["custDetails"]):
    print("[INFO] Loading customer details from pickle file")
    # Load the data from the pickle file
    custDetails = helperFunctions.load_files(conf["custDetails"])
else:
    print("[INFO] Creating customer details from csv file")
    # Let us load the customer Details
    custDetails = dl.gvCreator()
    # Starting the RFM segmentation process
    rfm = rfmMaker(custDetails,conf)
    custDetails = rfm.segmenter()
    # Save the custDetails file as a pickle file 
    helperFunctions.save_clean_data(custDetails,conf["custDetails"])

# Starting the self learning Recommendation system

# Check if the collections exist in Mongo DB
client = MongoClient(port=27017)
db = client.rlRecomendation

# Get all the collections from MongoDB
countCol = db["rlQuantdic"]
polCol = db["rlValuedic"]
rewCol = db["rlRewarddic"]
recoCountCol = db['rlRecotrack']

print(countCol.estimated_document_count())

# If Collections do not exist then create the collections in MongoDB
if countCol.estimated_document_count() == 0:
    print("[INFO] Main dictionaries empty")
    rll = rlLearn(custDetails, conf)
    # Consolidate all the products
    rll.prodConsolidator()
    print("[INFO] completed the product consolidation phase")
    # Get all the collections from MongoDB
    countCol = db["rlQuantdic"]
    polCol = db["rlValuedic"]
    rewCol = db["rlRewarddic"]

# start the recommendation phase
rlr = rlRecomend(custDetails,conf)
# Sample a state since the state is not available
stateId = rlr.stateSample()
print(stateId)

# Get the respective dictionaries from the collections

countDic = countCol.find_one({stateId: {'$exists': True}})
polDic = polCol.find_one({stateId: {'$exists': True}})
rewDic = rewCol.find_one({stateId: {'$exists': True}})

# The count dictionaries can exist but still recommendation dictionary can not exist. So we need to take this seperately

if recoCountCol.estimated_document_count() == 0:
    print("[INFO] Recommendation tracking dictionary empty")
    recoCountdic = {}
else:
    # Get the dictionary from the collection
    recoCountdic = recoCountCol.find_one({stateId: {'$exists': True}})


print('recommendation count dic', recoCountdic)


# Initialise the Collection checker method
rlr.collfinder(stateId,countDic,polDic,rewDic,recoCountdic)
# Get the list of recommended products
seg_products = rlr.rlRecommender()
print(seg_products)
# Initiate customer actions

click_list,buy_list = rlr.custAction(seg_products)
print('click_list',click_list)
print('buy_list',buy_list)

# Get the reward functions for the customer action
rlr.rewardUpdater(seg_products,buy_list ,click_list)

We import all the necessary libraries and classes in lines 1-7.

Lines 10-12 detail the argument parser process. We provide the path to our configuration file as the argument. We discussed the configuration file in detail in post 6 of this series. Once the path of the configuration file is passed as the argument, we read the configuration file and then load its values into the variable conf in line 15.

The first of the processes is to initialise the DataProcessor class in line 18. As you know from post 6, this class has the methods for loading and processing data. After this step, lines 21-33 implement the raw data loading and processing steps.

In line 21 we check if the processed data frame custDetails is already present in the output directory. If it is present we load it from the folder in line 24. If we haven't created the custDetails data frame before, we initiate that action in line 28 using the gvCreator method we have seen earlier. In lines 30-31, we create the segments for the data using the segmenter method in the rfmMaker class. Finally the custDetails data frame is saved as a pickle file in line 33.

Once the segmentation process is complete the next step is to start the recommendation process. We first establish the connection with our database in lines 38-39. Then we get the 4 collections from MongoDB in lines 42-45. If a collection does not exist yet, we still get a handle to it, but its document count will be 0.

If the collections are empty, we need to create them. This is done in lines 50-59. We instantiate the rlLearn class in line 52 and then execute the prodConsolidator() method in line 54. Once this method is run the collections will be created. Please refer to the prodConsolidator() method in post 7 for details. Once the collections are created, we get those collections in lines 57-59.

Next we instantiate the rlRecomend class in line 62 and then sample a stateId in line 64. Please note that the sampling of the stateId is only a workaround to simulate a state in the absence of real customer data. If we were to have a live application, then the stateId would be created each time a customer logs into the system to buy products. As you know, the stateId is a combination of the customer's segment and the month and day on which the login happens. As there are no live customers, we are simulating the stateId for our online recommendation process.
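
For a live application, the stateId could be assembled from the customer's segment and the login time. The sketch below is only an illustration: the function name is hypothetical, the format follows the example stateId "Q1_August_1_Monday" used in this series, and the threshold of 15 for the period of the month is an assumption standing in for the monthPer value in the configuration file.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday")


def make_state_id(segment, when, month_per=15):
    """Build a stateId of the form segment_month_period_day."""
    # period is 1 for the later part of the month and 0 otherwise
    period = int(when.day > month_per)
    return "_".join([str(segment), MONTHS[when.month - 1],
                     str(period), DAYS[when.weekday()]])


state_id = make_state_id("Q1", datetime(2021, 8, 23))
print(state_id)
```

In a deployed system, the segment would be looked up for the customer who logs in, and datetime.now() would replace the fixed date.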

Once we have sampled the stateId, we need to extract the dictionaries corresponding to that stateId from the MongoDB collections. We do that in lines 69-71. We extract the dictionary corresponding to the recommendations as a separate step in lines 75-80.

Once all the dictionaries are extracted, we initialise them in line 87 using the collfinder method we explored in post 7. Once the dictionaries are initialised we initiate the recommendation process in line 89 to get the list of recommended products.

Once we get the recommended products we simulate customer actions in line 93, and then finally update the rewards and values using the rewardUpdater method in line 98.

This takes us to the end of the complete process to build the online recommendation process. Let us now see how this application can be run on the terminal.

Figure 1 : Running the application on terminal

The application can be executed on the terminal with the below command

$ python rlRecoMain.py --conf config/custprof.json

The argument we give is the path to the configuration file. Please note that we need to change directory to the rlreco directory to run this code. The output from the implementation would be as below

The data can be seen in the MongoDB collections also. Let us look at ways to find the data in MongoDB collections.

To initialise Mongo db from terminal, use the following command

Figure 3 : Initialize Mongo

You should get the following output

Now to find all the data bases in Mongo DB you can use the below command

You will be able to see all the databases which you have created. The one marked in red is the database we created. Now, to use that database, the command is use rlRecomendation, as shown below. We will get a message confirming that we have switched to the desired database.

To see all the collections we have made in this database we can use the below command.

From the output we can see all the collections we have created. Now to see some specific record within the collections, we can use the following command.

db.rlValuedic.find({"Q1_August_1_Monday":{$exists:true} })

In the above command we are trying to find all records in the collection rlValuedic for the stateID "Q1_August_1_Monday". Once we execute this command we get all the records in this collection for this specific stateID. You should get the below output.

The output displays all the products for that stateID and their value functions.

What we have implemented in code is a simulation of the complete process. To run this continuously for multiple customers, we can create another script with a list of desired customers and then execute the code multiple times. I will leave that step as an exercise for you to implement. Now let us look at different options to deploy this application.

Deployment of application

The end product of any data science endeavour should be to build an application and share it with the world. There are different options to deploy python applications. Let us look at some of the options available. I would encourage you to explore more methods and share your results.

Flask application with Heroku

A great option to deploy your application is to package it as a Flask application and then deploy it using Heroku. We have discussed this option in one of our earlier series, where we built a machine translation application. You can refer to this link for details. In this section we will discuss the nuances of building the application in Flask and then deploying it on Heroku. I will leave the implementation of the steps for you as an exercise.

When deploying the self learning recommendation system we have built, the first thing which we need to design is what the front end will contain. From the perspective of the processes we have implemented, we need to have the following processes controlled using the front end.

  1. Training process : This is the process which takes the raw data, preprocesses the data and then initialises all the dictionaries. This includes all the processes till line 59 in the driver file rlRecoMain.py. We need to trigger the training process from the front end of the Flask application. In the background, all the processes till line 59 should run and the dictionaries need to be updated.
  2. Recommendation simulation : The second process which needs to be controlled is the one where we get the recommendations. The start of this process is the simulation of the state from the front end. To do this we can provide a drop down of all the customer IDs on the Flask front end and take the system time details to form the stateID. Once this stateID is generated, we start the recommendation process, which includes all the processes from line 62 till line 90 in the driver file rlRecoMain.py. Please note that line 64 is the stateID simulating process, which will now be controlled from the front end, so that line need not be implemented. The final output, which is the list of all recommended products, needs to be displayed on the front end. It will be good to add some images along with the products for visual impact.
  3. Customer action simulation : Once the recommended products are displayed on the front end, we can send feedback from the front end in terms of the products clicked and the products bought, through some widgets created in the front end. These widgets will take the place of line 93 in our implementation. This feedback from the front end needs to be collected as lists, which will take the place of click_list and buy_list given in lines 94-95. Once the customer actions are generated, the back end process in line 98 will have to kick in to update the dictionaries. Once the cycle is completed we can build a refresh button on the screen to simulate the recommendation process again.

Once these processes are implemented using a Flask application, the application can be deployed on Heroku. This post will give you an overall guide to deploying the application on Heroku.

These are broad guidelines for building the application and then deploying it. They need not be the most efficient and effective ones. I would challenge each one of you to implement much better processes for deployment. Please share your implementations in the comments section below.

Other options for deployment

So far we have seen one option, which is to build the application using Flask and then deploy it using Heroku. There are other options too for deployment. Some of the notable ones are the following

  • Flask application on Ubuntu server
  • Flask application on Docker

The attached link is a great resource to learn about such deployment. I would challenge all of you to deploy using any of these implementation steps and share the implementation for the community to benefit.

Wrapping up.

This is the last post of the series and we hope that this series was informative.

We will start a new series in the near future. The next series will tackle a specific problem in computer vision, namely object detection. In it we will be building a 'Road pothole detector' using different object detection algorithms. The series will touch upon different methods in object detection like image pyramids, RCNN, YOLO, the Tensorflow object detection API etc. Watch this space for the next series.

Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest GitHub repository.

Building Self Learning Recommendation system – VII : Productionizing the application : II

This is the seventh post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts in which we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit
  4. Build the prototype of the self learning recommendation system : Part I
  5. Build the prototype of the self learning recommendation system : Part II
  6. Productionising the self learning recommendation system : Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation ( This Post )
  8. Evaluating different deployment options for the self learning recommendation systems.

This post builds on the previous post where we started off with productionizing the application using python scripts. In the last post we completed the customer segmentation part. In this post we continue from where we left off and then build the self learning system using python scripts. Let us get going.

Creation of States

Let us take a quick recap of the project structure and what we covered in the last post.

In the last post we were in the early part of our main driver file rlRecoMain.py. We explored the rfmMaker class in the file rfmProcess.py from the processes directory. We will now explore the selfLearnProcess.py file in the same directory.

Open a new file and name it selfLearnProcess.py and insert the following code

import pandas as pd
from numpy.random import normal as GaussianDistribution
from collections import OrderedDict
from collections import Counter
import operator
from random import sample
import numpy as np
from pymongo import MongoClient
client = MongoClient(port=27017)
db = client.rlRecomendation



class rlLearn:
    def __init__(self,custDetails,conf):
        # Get the date as a separate column
        custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
        # Converting date to float for easy comparison
        custDetails['Date'] = custDetails['Date'].astype('float64')
        # Get the period of month column
        custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > conf['monthPer']))
        # Aggregate the custDetails to get a distribution of rewards
        rewardFull = custDetails.groupby(['Segment', 'Month', 'monthPeriod', 'Day', conf['product_id']])[conf['prod_qnty']].agg(
            'sum').reset_index()
        # Get these data frames for all methods
        self.custDetails = custDetails
        self.conf = conf
        self.rewardFull = rewardFull
        # Defining some dictionaries for storing the values
        self.countDic = {}  # Dictionary to store the count of products
        self.polDic = {}  # Dictionary to store the value distribution
        self.rewDic = {}  # Dictionary to store the reward distribution
        self.recoCountdic = {}  # Dictionary to store the recommendation counts

    # Method to find unique values of each of the variables
    def uniqeVars(self):
        # Finding unique value for each of the variables
        segments = list(self.rewardFull.Segment.unique())
        months = list(self.rewardFull.Month.unique())
        monthPeriod = list(self.rewardFull.monthPeriod.unique())
        days = list(self.rewardFull.Day.unique())
        return segments,months,monthPeriod,days

    # Method to consolidate all products
    def prodConsolidator(self):
        # Get all the unique values of the variables
        segments, months, monthPeriod, days = self.uniqeVars()
        # Creating the consolidated dictionary
        for seg in segments:
            for mon in months:
                for period in monthPeriod:
                    for day in days:
                        # Get the subset of the data
                        subset1 = self.rewardFull[(self.rewardFull['Segment'] == seg) & (self.rewardFull['Month'] == mon) & (
                                self.rewardFull['monthPeriod'] == period) & (self.rewardFull['Day'] == day)]
                        # Initializing a temporary dictionary for storing in MongoDB
                        tempDic = {}
                        # Check if the subset is valid
                        if len(subset1) > 0:
                            # Iterate through each of the subset and get the products and its quantities
                            stateId = str(seg) + '_' + mon + '_' + str(period) + '_' + day
                            # Define a dictionary for the state ID
                            self.countDic[stateId] = {}
                            tempDic[stateId] = {}
                            for i in range(len(subset1.StockCode)):
                                # Store in the Count dictionary
                                self.countDic[stateId][subset1.iloc[i]['StockCode']] = int(subset1.iloc[i]['Quantity'])
                                tempDic[stateId][subset1.iloc[i]['StockCode']] = int(subset1.iloc[i]['Quantity'])
                            # Dumping each record into mongo db
                            db.rlQuantdic.insert_one(tempDic)

        # Consolidate the rewards and value functions based on the quantities
        for key in self.countDic.keys():
            # Creating two temporary dictionaries for loading in Mongodb
            tempDicpol = {}
            tempDicrew = {}
            # First get the dictionary of products for a state
            prodCounts = self.countDic[key]
            self.polDic[key] = {}
            self.rewDic[key] = {}
            # Initializing temporary dictionaries also
            tempDicpol[key] = {}
            tempDicrew[key] = {}
            # Update the policy values
            for pkey in prodCounts.keys():
                # Creating the value dictionary using a Gaussian process
                self.polDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)
                tempDicpol[key][pkey] = self.polDic[key][pkey]
                # Creating a reward dictionary using a Gaussian process
                self.rewDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)
                tempDicrew[key][pkey] = self.rewDic[key][pkey]
            # Dumping each of these in mongo db
            db.rlRewarddic.insert_one(tempDicrew)
            db.rlValuedic.insert_one(tempDicpol)
        print('[INFO] Dumped the quantity dictionary,policy and rewards in MongoDB')

As usual, we start by importing the libraries we need in lines 1-7. In this implementation we make a small deviation from the prototype we developed in the previous post. During the prototyping phase we relied predominantly on dictionaries to store data. Here, however, we store the data in MongoDB. Those of you who are not fully conversant with MongoDB can refer to some good tutorials on MongoDB like the one here. I will also explain the key features as and when required. In line 8, we import MongoClient, which is required for connecting to the database. We then define the client using the default port number ( 27017 ) in line 9 and name the database where we will store the recommendations in line 10. The name of the database we have selected is rlRecomendation . You are free to choose any name you like.

Let us now explore the rlLearn class. The constructor of the class, which starts at line 15, takes the custDetails data frame and the configuration file as inputs. You will already be familiar with lines 17-23 from our prototyping phase, where we extract information to create states and then consolidate the data frame to get the quantities for each state. In lines 30-33, we create dictionaries to store the relevant information: the count of products, the value distribution, the reward distribution and the number of times each product is recommended.

The main method within the rlLearn class is the prodConslidator() method in lines 45-95. We have seen the details of this method in the prototyping phase. Just to recap, in this method we iterate through each of the components of our states and then store the quantities of each product under the state in the dictionaries. However, there is a subtle difference from what we did during the prototyping phase. Here we insert each state and its associated products into the MongoDB database we created, as shown in lines 70, 93 and 94. We create a temporary dictionary in line 57 to dump each state into MongoDB. We also store the data in the dictionaries, as we did during the prototyping phase, so that the data is available to the other methods in this class. The final outcome of this method is the creation of the count dictionary, value dictionary and reward dictionary from our data, and the insertion of this data into MongoDB.
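To make the shape of these dictionaries concrete, here is a minimal sketch of the consolidation step. It is not the rlLearn code itself: the function name build_state_dicts, the tuple layout of the rows and the sample state ids are assumptions made for illustration, random.gauss from the standard library stands in for the GaussianDistribution call, and the MongoDB inserts are left out.

```python
import random
from collections import defaultdict


def build_state_dicts(rows, seed=0):
    """Build count, value and reward dictionaries keyed by state id.

    rows is an iterable of (state_id, product_id, quantity) tuples
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    counts = defaultdict(dict)
    for state_id, product, qty in rows:
        # Sum the quantities per (state, product)
        counts[state_id][product] = counts[state_id].get(product, 0) + int(qty)

    values, rewards = {}, {}
    for state_id, prod_counts in counts.items():
        # Initial value and reward are drawn around the historic quantity
        values[state_id] = {p: round(rng.gauss(q, 1), 2) for p, q in prod_counts.items()}
        rewards[state_id] = {p: round(rng.gauss(q, 1), 2) for p, q in prod_counts.items()}
    return dict(counts), values, rewards


# Hypothetical sample data: state id is segment_month_monthPeriod_day
rows = [
    ("1_Jan_0_Monday", "85123A", 6),
    ("1_Jan_0_Monday", "85123A", 4),
    ("1_Jan_0_Monday", "71053", 2),
    ("2_Feb_1_Friday", "22423", 12),
]
counts, values, rewards = build_state_dicts(rows)
print(counts)
```

Each top-level key is a state id and each inner dictionary maps a product to a number, which is the same nesting that is written to MongoDB as one document per state.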

This takes us to the end of the rlLearn class.

We now go back to the driver file rlRecoMain.py and explore the next important class, rlRecomend.

The rlRecomend class has the methods which are required for recommending products. This class has many methods and therefore we will go one by one through each of the methods. We have seen all these methods during the prototyping phase and therefore we will not get into detailed explanation of these methods here. For detailed explanation you can refer to the previous post.

Now, in selfLearnProcess.py, start adding the code for the rlRecomend class.

class rlRecomend:
    def __init__(self, custDetails, conf):
        # Get the day of the month as a separate column
        custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
        # Converting date to float for easy comparison
        custDetails['Date'] = custDetails['Date'].astype('float64')
        # Get the period of month column
        custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > conf['monthPer']))
        # Aggregate the custDetails to get a distribution of rewards
        rewardFull = custDetails.groupby(['Segment', 'Month', 'monthPeriod', 'Day', conf['product_id']])[
            conf['prod_qnty']].agg(
            'sum').reset_index()
        # Get these data frames for all methods
        self.custDetails = custDetails
        self.conf = conf
        self.rewardFull = rewardFull

The above code is for the constructor of the class ( lines 97-112 ), which is similar to the constructor of the rlLearn class. Here we consolidate the custDetails data frame and get the count of each product for the respective state.

Let us now look at the next two methods. Add the following code to the class we earlier created.

    # Method to find unique values of each of the variables
    def uniqeVars(self):
        # Finding unique value for each of the variables
        segments = list(self.rewardFull.Segment.unique())
        months = list(self.rewardFull.Month.unique())
        monthPeriod = list(self.rewardFull.monthPeriod.unique())
        days = list(self.rewardFull.Day.unique())
        return segments, months, monthPeriod, days

    # Method to sample a state
    def stateSample(self):
        # Get the unique state elements
        segments, months, monthPeriod, days = self.uniqeVars()
        # Get the context of the customer. For the time being let us randomly select all the states
        seg = sample(segments, 1)[0]  # Sample the segment
        mon = sample(months, 1)[0]  # Sample the month
        monthPer = sample([0, 1], 1)[0]  # sample the month period
        day = sample(days, 1)[0]  # Sample the day
        # Get the state id by combining all these samples
        stateId = str(seg) + '_' + mon + '_' + str(monthPer) + '_' + day
        self.seg = seg
        return stateId

The first method, in lines 115-121, gets the unique values of segments, months, month-period and days. This information will be used in some of the methods we will see later on. The second method, detailed in lines 124-135, samples a state id through random sampling of the components of a state.
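The state id format can be sketched in isolation as follows. This is an illustration under assumptions: the function name sample_state_id and the sample segment, month and day values are made up, and rng.choice is used in place of sample(..., 1)[0], which picks one element in the same way.

```python
import random


def sample_state_id(segments, months, days, rng=random):
    """Sample one value per state component and join them with underscores."""
    seg = rng.choice(segments)      # customer segment
    mon = rng.choice(months)        # month of purchase
    month_per = rng.choice([0, 1])  # 0 = first part of the month, 1 = second part
    day = rng.choice(days)          # day of the week
    state_id = "_".join([str(seg), str(mon), str(month_per), str(day)])
    return state_id, seg


# Hypothetical component values
state_id, seg = sample_state_id([0, 1, 2, 3], ["Jan", "Feb"], ["Monday", "Friday"],
                                random.Random(1))
print(state_id)
```

The segment is returned alongside the state id because the later methods need it to sample products for that segment.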

The next methods we will explore initialise dictionaries if a state id has not been seen earlier. The first method initialises the dictionaries and the second method inserts a recommendation collection record in MongoDB if the state doesn't exist. Let us see the code for these methods.

    # Method to initialize a dictionary in case a state Id is not available
    def collfinder(self,stateId,countDic,polDic,rewDic,recoCountdic):
        # Defining some dictionaries for storing the values
        self.countDic = countDic  # Dictionary to store the count of products
        self.polDic = polDic  # Dictionary to store the value distribution
        self.rewDic = rewDic  # Dictionary to store the reward distribution
        self.recoCountdic = recoCountdic  # Dictionary to store the recommendation counts
        self.stateId = stateId
        print("[INFO] The current state is :", stateId)
        if self.countDic is None:
            print("[INFO] State ID do not exist")
            self.countDic = {}
            self.countDic[stateId] = {}
            self.polDic = {}
            self.polDic[stateId] = {}
            self.rewDic = {}
            self.rewDic[stateId] = {}
        if self.recoCountdic is None:
            self.recoCountdic = {}
            self.recoCountdic[stateId] = {}
        else:
            self.recoCountdic[stateId] = {}

    # Method to update the recommendation dictionary
    def recoCollChecker(self):
        print("[INFO] Inside the recommendation collection")
        recoCol = db.rlRecotrack.find_one({self.stateId: {'$exists': True}})
        if recoCol is None:
            print("[INFO] Inserting the record in the recommendation collection")
            db.rlRecotrack.insert_one({self.stateId: {}})
        return recoCol

The inputs to the first method, as in line 138, are the state Id and the four dictionaries we extract from MongoDB, which we will see later on in the main script rlRecoMain.py. If no record exists for a specific state Id, the dictionaries we extract from MongoDB would be null, and therefore we need to initialise them to store the product counts, values, rewards and recommendation counts. The initialisation of these dictionaries is implemented in this method in lines 146-158.

The second initialisation method checks the recommendation count collection for a specific state Id. We first look up the state Id in the collection in line 163. If the record doesn't exist, we insert a blank dictionary for that state in line 166.
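The find-then-insert pattern can be tried without a running database. The sketch below uses a small in-memory class, InMemoryCollection, which is a hypothetical stand-in written only for this example and not part of pymongo. It mimics just the two calls used above, find_one and insert_one, and understands only the {field: {'$exists': True}} style of filter.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection (illustration only).

    Supports only filters of the form {field: {'$exists': True/False}}.
    """

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        field, cond = next(iter(query.items()))
        wanted = cond["$exists"]
        for doc in self.docs:
            if (field in doc) == wanted:
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(doc)


def ensure_state_record(collection, state_id):
    """Insert an empty record for state_id if none exists; return what was found."""
    existing = collection.find_one({state_id: {"$exists": True}})
    if existing is None:
        collection.insert_one({state_id: {}})
    return existing


coll = InMemoryCollection()
first = ensure_state_record(coll, "1_Jan_0_Monday")   # nothing there yet
second = ensure_state_record(coll, "1_Jan_0_Monday")  # found on the second call
print(first, second)
```

As in recoCollChecker, the return value is the record found before any insert, so the first call for a new state returns None even though the record exists afterwards.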

Let us now look at the next two methods in the class

    # Create a function to get a list of products for a certain segment
    def segProduct(self,seg, nproducts):
        # Get the list of unique products for each segment
        seg_products = list(self.rewardFull[self.rewardFull['Segment'] == seg]['StockCode'].unique())
        seg_products = sample(seg_products, nproducts)
        return seg_products

    # This is the function to get the top n products based on value
    def sortlist(self,nproducts,seg):
        # Get the top products based on the values and sort them from product with largest value to least
        topProducts = sorted(self.polDic[self.stateId].keys(), key=lambda kv: self.polDic[self.stateId][kv])[-nproducts:][::-1]
        # If the topProducts is less than the required number of products nproducts, sample the delta
        while len(topProducts) < nproducts:
            print("[INFO] top products less than required number of products")
            segProducts = self.segProduct(seg, (nproducts - len(topProducts)))
            newList = topProducts + segProducts
            # Finding unique products
            topProducts = list(OrderedDict.fromkeys(newList))
        return topProducts

The method in lines 171-175 samples a list of products for a segment. This method is used in case the number of products in a particular state is less than the total number of products we want to recommend. In such cases, we randomly sample some products from the list of all products bought by customers in that segment and add them to the list of products we want to recommend. We will see this in action in the sortlist method (lines 178-188).

The sortlist method sorts the list of products based on the demand for each product and then returns the list of top products. The inputs to this method are the number of products we want to recommend and the segment ( line 178 ). We then get the top ‘n’ products by sorting the value dictionary based on the number of times a product is bought, as in line 180. If the number of products is less than the required number, products are sampled using the segProduct method we saw earlier. The final list of top products is then returned by this method.
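A compact way to see the sort-and-pad idea is the sketch below. It is an illustrative variant, not the sortlist method: the name top_n_products and the sample values are assumptions, and the padding is done by shuffling the unused products from the pool once instead of sampling in a loop.

```python
import random


def top_n_products(values, n, fallback_pool, rng=random):
    """Return up to n products: highest value first, padded from fallback_pool."""
    # Rank products from largest value to smallest and keep the first n
    ranked = sorted(values, key=values.get, reverse=True)[:n]
    # Pad with products from the pool that are not already in the list
    candidates = [p for p in fallback_pool if p not in ranked]
    rng.shuffle(candidates)
    ranked.extend(candidates[: n - len(ranked)])
    return ranked


# Hypothetical values for one state
vals = {"a": 1.0, "b": 3.0, "c": 2.0}
print(top_n_products(vals, 2, ["a", "b", "c", "d"]))
padded = top_n_products(vals, 5, ["a", "b", "c", "d", "e"], random.Random(0))
print(padded)
```

One design difference worth noting: if the pool has too few products to reach n, this sketch returns a shorter list, whereas sampling more items than a list contains with random.sample raises a ValueError.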

The next method which we are going to explore is the one which controls the exploration and exploitation process thereby generating a list of products to be recommended. Let us add the following code to the class.

    # This is the function to create the number of products based on exploration and exploitation
    def sampProduct(self,seg, nproducts,epsilon):
        # Initialise an empty list for storing the recommended products
        seg_products = []
        # Get the list of unique products for each segment
        Segment_products = list(self.rewardFull[self.rewardFull['Segment'] == seg]['StockCode'].unique())
        # Get the list of top n products based on value
        topProducts = self.sortlist(nproducts,seg)
        # Start a loop to get the required number of products
        while len(seg_products) < nproducts:
            # First find a probability
            probability = np.random.rand()
            if probability >= epsilon:
                # print(topProducts)
                # The top product would be first product in the list
                prod = topProducts[0]
                # Append the selected product to the list
                seg_products.append(prod)
                # Remove the top product once appended
                topProducts.pop(0)
                # Ensure that seg_products is unique
                seg_products = list(OrderedDict.fromkeys(seg_products))
            else:
                # If the probability is less than epsilon value randomly sample one product
                prod = sample(Segment_products, 1)[0]
                seg_products.append(prod)
                # Ensure that seg_products is unique
                seg_products = list(OrderedDict.fromkeys(seg_products))
        return seg_products

The inputs to the method are the segment, the number of products to be recommended and the epsilon value, which determines the balance between exploration and exploitation, as shown in line 191. In line 195, we get the list of products for the segment. This is the list from which products are sampled during the exploration phase. We also get the list of top products to be recommended in line 197, using the sortlist method we defined earlier. In lines 199-218 we implement the exploitation and exploration processes we discussed during the prototyping phase, and finally we return the list of products for recommendation.
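The epsilon-greedy selection can be sketched on its own as below. This is a simplified variant under assumptions: the name epsilon_greedy_pick is hypothetical, the standard library random module replaces np.random.rand, exploration picks only from products not yet chosen, and the loop falls back to exploration when the list of top products runs out, so that it always terminates.

```python
import random


def epsilon_greedy_pick(top_products, pool, n, epsilon, rng=random):
    """Pick n unique products: explore with probability epsilon, else exploit."""
    remaining_top = list(top_products)  # copy so the caller's list is untouched
    picked = []
    n = min(n, len(set(pool)))  # cannot pick more unique products than the pool holds
    while len(picked) < n:
        explore = rng.random() < epsilon
        if explore or not remaining_top:
            # Exploration: a random product that has not been picked yet
            unpicked = [p for p in pool if p not in picked]
            prod = rng.choice(unpicked)
        else:
            # Exploitation: the best remaining product
            prod = remaining_top.pop(0)
        if prod not in picked:
            picked.append(prod)
    return picked


pool = ["a", "b", "c", "d", "e"]
greedy = epsilon_greedy_pick(["a", "b", "c"], pool, 3, epsilon=0.0)
explored = epsilon_greedy_pick(["a", "b", "c"], pool, 4, epsilon=1.0, rng=random.Random(3))
print(greedy, explored)
```

With epsilon set to 0 the result is simply the top products in order, and with epsilon set to 1 every pick is random, which makes the two extremes easy to check before choosing a value such as 0.1.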

The next method which we will explore is the one to update dictionaries after the recommendation process.

    # This is the method for updating the dictionaries after recommendation
    def dicUpdater(self,prodList, stateId):        
        for prod in prodList:
            # Check if the product is in the dictionary
            if prod in list(self.countDic[stateId].keys()):
                # Update the count by 1
                self.countDic[stateId][prod] += 1                
            else:
                self.countDic[stateId][prod] = 1                
            if prod in list(self.recoCountdic[stateId].keys()):
                # Update the recommended products with 1
                self.recoCountdic[stateId][prod] += 1                
            else:
                # Initialise the recommended products as 1
                self.recoCountdic[stateId][prod] = 1                
            if prod not in list(self.polDic[stateId].keys()):
                # Initialise the value as 0
                self.polDic[stateId][prod] = 0                
            if prod not in list(self.rewDic[stateId].keys()):
                # Initialise the reward dictionary as 0
                self.rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)                
        print("[INFO] Completed the initial dictionary updates")

The inputs to this method, as in line 221, are the list of products to be recommended and the state Id. In lines 222-234, we iterate through each of the recommended products and increment its count in the dictionary if the product exists there, or initialise the count to 1 if it doesn't. Later on, in lines 235-240, we initialise the value dictionary and the reward dictionary for products that are not yet in them.
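The same increment-or-initialise logic can be written more compactly with dict.get and dict.setdefault. The sketch below works on the inner dictionaries of a single state; the function name update_after_recommendation is hypothetical, and a fixed initial_reward replaces the Gaussian draw around 0 to keep the example deterministic.

```python
def update_after_recommendation(prod_list, counts, reco_counts, values, rewards,
                                initial_reward=0.0):
    """Update the per-state dictionaries in place for each recommended product."""
    for prod in prod_list:
        # Increment the count, starting from 0 when the product is new
        counts[prod] = counts.get(prod, 0) + 1
        reco_counts[prod] = reco_counts.get(prod, 0) + 1
        # Only initialise value and reward when the product is not there yet
        values.setdefault(prod, 0)
        rewards.setdefault(prod, initial_reward)


# Hypothetical dictionaries for one state
counts = {"a": 2}
reco_counts = {}
values = {"a": 1.5}
rewards = {}
update_after_recommendation(["a", "b"], counts, reco_counts, values, rewards)
print(counts, reco_counts, values, rewards)
```

Checking membership with "prod in dictionary" or using get and setdefault also avoids building a list of keys on every iteration.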

The next method we will see initialises the dictionaries in case the context doesn't exist.

    def dicAdder(self,prodList, stateId):        
        # Loop through the product list
        for prod in prodList:
            # Initialise the count as 1
            self.countDic[stateId][prod] = 1
            # Initialise the value as 0
            self.polDic[stateId][prod] = 0
            # Initialise the recommended products as 1
            self.recoCountdic[stateId][prod] = 1
            # Initialise the reward dictionary as 0
            self.rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)
        print("[INFO] Completed the dictionary initialization")
        # Next update the collections with the respective updates        
        # Updating the quantity collection
        db.rlQuantdic.insert_one({stateId: self.countDic[stateId]})
        # Updating the recommendation tracking collection
        db.rlRecotrack.insert_one({stateId: self.recoCountdic[stateId]})
        # Updating the value function collection for the products
        db.rlValuedic.insert_one({stateId: self.polDic[stateId]})
        # Updating the rewards collection
        db.rlRewarddic.insert_one({stateId: self.rewDic[stateId]})
        print('[INFO] Completed updating all the collections')

If the state Id doesn't exist, the dictionaries are initialised as seen in lines 147-155. Once the dictionaries are initialised, the MongoDB collections are updated in lines 259-265.

The next method we are going to explore is one of the main methods, which integrates all the methods we have seen so far. This method implements the recommendation process. Let us explore this method.

    # Method to run the recommendation process for the current state Id
    def rlRecommender(self):
        # Get the state Id set earlier by the collfinder method
        stateId = self.stateId
        # Start the recommendation process
        if len(self.polDic[stateId]) > 0:
            print("The context exists")
            # Implement the sampling of products based on exploration and exploitation
            seg_products = self.sampProduct(self.seg, self.conf["nProducts"],self.conf["epsilon"])
            # Check if the recommendation count collection exist
            recoCol = self.recoCollChecker()
            print('Recommendation collection existing :',recoCol)
            # Update the dictionaries of values and rewards
            self.dicUpdater(seg_products, stateId)
        else:
            print("The context doesn't exist")
            # Get the list of relevant products
            seg_products = self.segProduct(self.seg, self.conf["nProducts"])
            # Add products to the value dictionary and rewards dictionary
            self.dicAdder(seg_products, stateId)
        print("[INFO] Completed the recommendation process")

        return seg_products

The first step in the process is to get the state Id ( line 271 ) based on which we have to do all the recommendations. Once we have the state Id, we check if it is an existing state id in line 273. If it is an existing state Id we get the list of ‘n’ products for recommendation using the sampProduct method we saw earlier, where we implement exploration and exploitation. Once we get the products we initialise the recommendation collection in line 278. Finally we update all dictionaries using the dicUpdater method in line 281.

In lines 282-287, we implement a similar process for the case where the state Id doesn't exist. The only difference in this case is in the initialisation of the dictionaries in line 287, where we use the dicAdder method.

Once we complete the recommendation process, we get into simulating the customer action.

    # Function to initiate customer action
    def custAction(self,segproducts):
        print('[INFO] getting the customer action')
        # Sample a value to get how many products will be clicked
        click_number = np.random.choice(np.arange(0, 10),
                                        p=[0.50, 0.35, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])
        # Sample products which will be clicked based on click number
        click_list = sample(segproducts, click_number)

        # Sample for buy values
        buy_number = np.random.choice(np.arange(0, 10),
                                      p=[0.70, 0.15, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])
        # Sample products which will be bought based on buy number
        buy_list = sample(segproducts, buy_number)

        return click_list, buy_list

Lines 296-305 implement the process for simulating the lists of products which are browsed and bought by the customer based on the recommendation we made. The method returns the list of products which were browsed and the list of products which were bought. For detailed explanations of these methods, please refer to the previous post.
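For readers who want to try the simulation outside the class, here is a minimal sketch using only the standard library. The name simulate_customer is hypothetical and random.choices replaces np.random.choice; the probability lists are the ones used in the method above. The sketch also caps the sampled count at the length of the recommendation list, which is an addition for safety when fewer than ten products are recommended.

```python
import random

# Probability of clicking / buying 0, 1, 2, ... 9 of the recommended products
CLICK_WEIGHTS = [0.50, 0.35, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001]
BUY_WEIGHTS = [0.70, 0.15, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001]


def simulate_customer(recommended, rng=random):
    """Return (click_list, buy_list) sampled from the recommended products."""
    max_k = len(recommended)
    # Draw how many products are clicked and bought, capped at the list length
    n_click = min(rng.choices(range(10), weights=CLICK_WEIGHTS)[0], max_k)
    n_buy = min(rng.choices(range(10), weights=BUY_WEIGHTS)[0], max_k)
    # Sample that many distinct products for each action
    return rng.sample(recommended, n_click), rng.sample(recommended, n_buy)


recommended = list("abcdefghij")
clicks, buys = simulate_customer(recommended, random.Random(7))
print(clicks, buys)
```

Because the click and buy lists are sampled independently, the same product can appear in both lists, which the value update step has to tolerate.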

The next methods we will explore are the ones related to updating the values of the recommendation system.

    def getReward(self,loc):
        rew = GaussianDistribution(loc=loc, scale=1, size=1)[0].round(2)
        return rew

    def saPolicy(self,rew, prod):
        # This function computes the incremental value update using simple averaging
        # Get the current value of the state        
        vcur = self.polDic[self.stateId][prod]        
        # Get the counts of the current product
        n = self.recoCountdic[self.stateId][prod]        
        # Calculate the new value
        Incvcur = (1 / n) * (rew - vcur)       
        return Incvcur

The getReward method on line 309 generates a reward from a Gaussian distribution centred around the reward value. We will see the use of this method in subsequent methods.

The saPolicy method in lines 313-321 updates the value of the state based on the simple averaging method in line 320. We have already seen these methods in our prototyping phase in the previous post.
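The simple averaging rule is V_new = V_old + (1/n) * (reward - V_old), where n is the number of rewards seen so far. A small worked example shows why this is called averaging: starting from zero, the running value equals the mean of the rewards at every step. The function name incremental_mean and the reward sequence are made up for illustration. In the application itself the starting value is seeded from historic quantities and n is the recommendation count, so the value there is not exactly a plain mean.

```python
def incremental_mean(rewards):
    """Apply V += (1/n) * (rew - V) for each reward and record V after each step."""
    value, history = 0.0, []
    for n, rew in enumerate(rewards, start=1):
        value += (1 / n) * (rew - value)
        history.append(value)
    return history


# Hypothetical rewards: buy, click, ignore, buy
rewards = [5, 1, -2, 5]
history = incremental_mean(rewards)
for step, value in enumerate(history, start=1):
    print(step, round(value, 3), round(sum(rewards[:step]) / step, 3))
```

The benefit of the incremental form is that only the current value and the count need to be stored, not the full list of past rewards.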

Next we will see the method which uses both the above methods.

    def valueUpdater(self,seg_products, loc, custList, remove=True):
        for prod in custList:
            # Get the reward for the bought product. The reward will be centered around the defined reward for each action
            rew = self.getReward(loc)            
            # Update the reward in the reward dictionary
            self.rewDic[self.stateId][prod] += rew            
            # Update the policy based on the reward
            Incvcur = self.saPolicy(rew, prod)            
            self.polDic[self.stateId][prod] += Incvcur           
            # Remove the product from the list; it may already be gone if it was both bought and clicked
            if remove and prod in seg_products:
                seg_products.remove(prod)
        return seg_products

The inputs to this method are the recommended list of products, the mean reward ( click, buy or ignore), the corresponding list ( click list or buy list) and a flag to indicate if the product has to be removed from the recommendation list or not.

We iterate through all the products in the customer action list in line 324 and get the reward in line 326. Once the reward is incremented in the reward dictionary in line 328, we get the incremental value in line 330, and this is added to the value dictionary in line 331. If the flag is True, we remove the product from the recommended list. Finally, the method returns the remaining recommendation list.

The next method is the last of the methods and ties the above three methods with the customer action.

    # Function to update the reward dictionary and the value dictionary based on customer action
    def rewardUpdater(self, seg_products,custBuy=[], custClick=[]):
        # Check if there are any customer purchases
        if len(custBuy) > 0:
            seg_products = self.valueUpdater(seg_products, self.conf['buyReward'], custBuy)
            # Repeat the same process for customer click
        if len(custClick) > 0:
            seg_products = self.valueUpdater(seg_products, self.conf['clickReward'], custClick)
            # For those products not clicked or bought, give a penalty
        if len(seg_products) > 0:
            custList = seg_products.copy()
            seg_products = self.valueUpdater(seg_products, -2, custList,False)
        # Next update the collections with the respective updates
        print('[INFO] Updating all the collections')
        # Updating the quantity collection
        db.rlQuantdic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.countDic[self.stateId]})
        # Updating the recommendation tracking collection
        db.rlRecotrack.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.recoCountdic[self.stateId]})
        # Updating the value function collection for the products
        db.rlValuedic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.polDic[self.stateId]})
        # Updating the rewards collection
        db.rlRewarddic.replace_one({self.stateId: {'$exists': True}}, {self.stateId: self.rewDic[self.stateId]})
        print('[INFO] Completed updating all the collections')

In lines 340-348, we update the values based on the products bought, clicked and ignored. Once the value dictionaries are updated, the respective MongoDB collections are updated in lines 352-358.
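The buy, click and ignore passes can be followed end to end with the sketch below. It is a simplification under assumptions: the name apply_feedback is hypothetical, the rewards are applied without Gaussian noise so that the numbers are easy to verify by hand, the reward dictionary is omitted, and nothing is written to MongoDB. The default reward values mirror the configuration (5 for a buy, 1 for a click) and the penalty of -2 used in the method above.

```python
def apply_feedback(values, reco_counts, recommended, bought, clicked,
                   buy_reward=5, click_reward=1, ignore_reward=-2):
    """Update values in place for one customer; return the ignored products."""
    remaining = list(recommended)

    def update(prod, rew):
        # Simple averaging step, with n taken from the recommendation count
        n = reco_counts[prod]
        values[prod] += (1 / n) * (rew - values[prod])

    for prod in bought:
        update(prod, buy_reward)
        if prod in remaining:
            remaining.remove(prod)
    for prod in clicked:
        update(prod, click_reward)
        if prod in remaining:  # may already be removed if it was also bought
            remaining.remove(prod)
    # Whatever is left was neither bought nor clicked, so it gets the penalty
    for prod in remaining:
        update(prod, ignore_reward)
    return remaining


values = {"a": 0.0, "b": 0.0, "c": 0.0}
reco_counts = {"a": 1, "b": 1, "c": 1}
ignored = apply_feedback(values, reco_counts, ["a", "b", "c"],
                         bought=["a"], clicked=["a", "b"])
print(values, ignored)
```

Product "a" appears in both lists here on purpose, to show that a product which is bought and clicked is updated twice but removed from the remaining list only once.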

With this we have covered all the methods which are required for implementing the self learning recommendation system. Let us summarise our learning so far in this post.

  • Created the states and updated MongoDB with the states data. We used the historic data for initialisation of values.
  • Implemented the recommendation process by getting a list of products to be recommended to the customer
  • Explored customer response simulation, wherein the customer's response to the recommended products was simulated.
  • Updated the value functions and reward functions after customer response
  • Updated Mongo DB collections after the completion of the process for a customer.

What next ?

We are coming to the tail end of our series. The next post is where we tie all these methods together in the main driver file and see how these processes are implemented. We will also run the script on the terminal and observe the results. Once the application implementation is done, we will also explore avenues to deploy the application. Watch this space for the last post of the series.

The complete code base for the series is in the Bayesian Quest GitHub repository.

Building Self Learning Recommendation system – VI : Productionizing the application : I

This is the sixth post of our series on building a self learning recommendation system using reinforcement learning. The series consists of the following 8 posts, in which we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit
  4. Build the prototype of the self learning recommendation system : Part I
  5. Build the prototype of the self learning recommendation system : Part II
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation ( This post )
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

This post builds on the previous post, where we started off by building the prototype of the application in Jupyter notebooks. In this post we will see how to convert our prototype into Python scripts. Converting the prototype into Python scripts is important because the scripts are the basis for building an application and then deploying it for general consumption.

File Structure for the project

First let us look at the file structure of our project.

The directory RL_Recomendations is the main directory, which contains the other folders required for the project. Of these directories, rlreco is a virtual environment we will create, and all our working directories are within this virtual environment. Along with the folders we also have the script rlRecoMain.py, which is the main driver script for the application. We will now go through some of the steps in creating this folder structure.

When building an application it is always a good practice to create a virtual environment and then complete the application build process within the virtual environment. We talked about this in one of our earlier series for building machine translation applications . This way we can ensure that only application specific libraries and packages are present when we deploy our application.

Let us first create a separate folder in our drive and then create a virtual environment within that folder. In a Linux based system, a separate folder can be created as follows

$ mkdir RL_Recomendations

Once the new directory is created let us change directory into the RL_Recomendations directory and then create a virtual environment. A virtual environment can be created on Linux with Python3 with the below script

RL_Recomendations $ python3 -m venv rlreco

Here the rlreco is the name of our virtual environment. The virtual environment which we created can be activated as below

RL_Recomendations $ source rlreco/bin/activate

Once the virtual environment is enabled we will get the following prompt.

(rlreco) ~$

In addition, you will notice that a new folder is created with the same name as the virtual environment. We will use that folder to create all the folders and main files required for our application. Let us traverse through our driver file and then create all the folders and files required for our application.

Create the driver file

Open a file using your favourite editor, name it rlRecoMain.py and then insert the following code.

import argparse
import pandas as pd
from utils import Conf,helperFunctions
from Data import DataProcessor
from processes import rfmMaker,rlLearn,rlRecomend
from utils import helperFunctions
import os.path
from pymongo import MongoClient

In lines 1-2 we import the libraries which we require for our application. In line 3 we import the Conf class from the utils folder.

So first let us create a folder called utils, which will have the following file structure.

The utils folder has a file called Conf.py which contains the Conf class and another file called helperFunctions.py . The first file controls the configuration functions and the second file contains some of the helper functions like saving data into pickle files. We will get to that in a moment.

First let us open a new python file Conf.py and copy the following code.

from json_minify import json_minify
import json

class Conf:

    def __init__(self,confPath):
        # Read the json file and load it into a dictionary
        conf = json.loads(json_minify(open(confPath).read()))
        self.__dict__.update(conf)
    def __getitem__(self, k):
        return self.__dict__.get(k,None)

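The Conf class is a simple class whose constructor loads the configuration file, which is in json format, in line 8. Once the configuration file is loaded, its elements can be read with dictionary-style indexing on the Conf object, for example conf['nclust'], which is handled by the __getitem__ method; a key that is not in the file returns None. We will see more of how this is used later.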

We have talked about the Conf class which loads the configuration file; however, we haven't made the configuration file yet. As you may know, a configuration file contains all the parameters of the application. Let us see the directory structure of the configuration file.

Figure : config folder and configuration file

You can now create the folder called config, under the rlreco folder and then open a file in your editor and then name it custprof.json and include the following code.

{

  /****
  * paths required
  ****/

  "inputData" : "/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/Datasets/Retail/OnlineRetail.csv",
  "custDetails" : "/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/BayesianQuest/RL_Recomendations/rlreco/output/custDetails.pkl",

  /****
  * Column mapping
  ****/

  "order_id" : "InvoiceNo",
  "product_id": "StockCode",
  "product" : "Description",
  "prod_qnty" : "Quantity",
  "order_date" : "InvoiceDate",
  "unit_price" : "UnitPrice",
  "customer_id" : "CustomerID",
  /****
  * Parameters
  ****/

  "nclust" : 4,
  "monthPer" : 15,
  "epsilon" : 0.1,
  "nProducts" : 10,
  "buyReward" : 5,
  "clickReward": 1
}

As you can see, the config file contains all the configuration items required by the application. The first part stores the paths to the raw file and the processed pickle file. The second part maps the column names in the raw file to the names used in our application. The third part contains all the parameters required for the application. The Conf class which we saw earlier reads the json file and loads all these parameters into memory for use in the application.
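The block comments in the file above are not valid json, which is why the Conf class passes the text through json_minify before calling json.loads. As a rough illustration only, the sketch below strips /* ... */ comments with a regular expression. The function name strip_block_comments is hypothetical, this is not how json_minify is implemented, and the naive pattern would break on string values that happen to contain /*.

```python
import json
import re

def strip_block_comments(text):
    # Remove every /* ... */ span, including ones that cover several lines
    return re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

raw = """
{
  /****
  * Parameters
  ****/
  "nclust" : 4,
  "epsilon" : 0.1
}
"""

# After the comments are gone the text is plain json
params = json.loads(strip_block_comments(raw))
print(params)
```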

Let's come back to the utils folder and create the second file, which we will name helperFunctions.py, and insert the following code.

from pickle import load
from pickle import dump
import numpy as np


# Function to Save data to pickle form
def save_clean_data(data,filename):
    dump(data,open(filename,'wb'))
    print('Saved: %s' % filename)

# Function to load pickle data from disk
def load_files(filename):
    return load(open(filename,'rb'))

This file contains two functions. The first function, starting in line 7, saves data in pickle format to the specified path. The second function, in line 12, loads a pickle file and returns the data. These two handy functions will be used later in our project.
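A quick round trip shows how the two helpers work together. The sketch below is a variant that opens the files in with-blocks so that they are closed promptly; the sample dictionary and the temporary path are assumptions made only for the demonstration.

```python
import os
import tempfile
from pickle import dump, load

def save_clean_data(data, filename):
    # The with-block closes the file even if pickling fails
    with open(filename, "wb") as f:
        dump(data, f)

def load_files(filename):
    with open(filename, "rb") as f:
        return load(f)

# Illustrative data, not taken from the real data set
sample_data = {"Q1": [10002, 10080], "Q2": [10120]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "custDetails.pkl")
    save_clean_data(sample_data, path)
    restored = load_files(path)

print(restored == sample_data)
```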

We will come back to the main file rlRecoMain.py and look at the next import on line 4. In this line we import the DataProcessor class from the folder Data. Let us take a look at the folder called Data.

Create the data processor module

The class and the methods associated with the class are in the file dataLoader.py. Let us first create the folder, Data and then open a file named dataLoader.py and insert the following code.

import os
import pandas as pd
import pickle
import numpy as np
import random
from utils import helperFunctions
from datetime import datetime, timedelta,date
from dateutil.parser import parse

class DataProcessor:
    def __init__(self,configfile):
        # This is the first method in the DataProcessor class
        self.config = configfile

    # This is the method to load data from the input files
    def dataLoader(self):
        inputPath = self.config["inputData"]
        dataFrame = pd.read_csv(inputPath,encoding = "ISO-8859-1")
        return dataFrame

    # This is the method for parsing dates
    def dateParser(self):
        custDetails = self.dataLoader()
        # Parsing the date
        custDetails['Parse_date'] = custDetails[self.config["order_date"]].apply(lambda x: parse(x))
        # Parsing the weekday
        custDetails['Weekday'] = custDetails['Parse_date'].apply(lambda x: x.weekday())
        # Parsing the Day
        custDetails['Day'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%A"))
        # Parsing the Month
        custDetails['Month'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%B"))
        # Getting the year
        custDetails['Year'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%Y"))
        # Getting year and month together as one feature
        custDetails['year_month'] = custDetails['Year'] + "_" +custDetails['Month']

        return custDetails

    def gvCreator(self):
        custDetails = self.dateParser()
        # Creating gross value column
        custDetails['grossValue'] = custDetails[self.config["prod_qnty"]] * custDetails[self.config["unit_price"]]

        return custDetails

The constructor of the DataProcessor class takes the loaded configuration as its input and makes it available to all the other methods in line 13.

The DataProcessor class has three methods, dataLoader, dateParser and gvCreator. The last method is the driving method, which internally calls the other two. Let us look at the gvCreator method.

The dateParser method is called first within the gvCreator method in line 40. The dateParser method in turn calls the dataLoader method in line 23. The dataLoader method loads the customer data as a pandas data frame in line 18 and returns it to the dateParser method in line 23. The dateParser method takes the custDetails data frame and then extracts all the date related fields in lines 25-35. We saw this in detail during the prototyping phase in the previous post.

Once the dates are parsed in the custDetails data frame, it is returned to the gvCreator method in line 40, where the ‘gross value’ is calculated by multiplying the unit price and the product quantity. Finally the processed custDetails data frame is returned.
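To see what these derived fields look like for a single transaction, here is a small sketch that uses only the standard library. The sample row and the date format string are assumptions made for illustration; the DataProcessor class itself relies on dateutil's parse and on pandas.

```python
from datetime import datetime

# One illustrative transaction row (values are made up for this example)
row = {"InvoiceDate": "12/1/2010 8:26", "Quantity": 6, "UnitPrice": 2.55}

# Assumed date format; the DataProcessor class uses dateutil's parse instead
parsed = datetime.strptime(row["InvoiceDate"], "%m/%d/%Y %H:%M")

features = {
    "Weekday": parsed.weekday(),      # Monday is 0, Sunday is 6
    "Day": parsed.strftime("%A"),     # full weekday name
    "Month": parsed.strftime("%B"),   # full month name
    "Year": parsed.strftime("%Y"),
}
# Year and month together as one feature
features["year_month"] = features["Year"] + "_" + features["Month"]
# Gross value is quantity multiplied by unit price
features["grossValue"] = row["Quantity"] * row["UnitPrice"]

print(features)
```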

Now we will come back to the rlRecoMain file and then look at the three other classes, rfmMaker, rlLearn and rlRecomend, which we import in line 5 of the file rlRecoMain.py. These are imported from the ‘processes’ folder. Let us look at the composition of the processes folder.

We have three files in the folder, processes.

The first one is the __init__.py file, which initialises the package. Let us see its contents. Open a file, name it __init__.py and add the following lines of code.

from .rfmProcess import rfmMaker
from .selfLearnProcess import rlLearn,rlRecomend

Create customer segmentation modules

In lines 1-2 of the __init__.py file we make the three classes (rfmMaker, rlLearn and rlRecomend) available to the package. The class rfmMaker is in the file rfmProcess.py and the other two classes are in the file selfLearnProcess.py.

Let us open a new file, name it rfmProcess.py and then insert the following code.

import sys
sys.path.append('path_to_the_folder/RL_Recomendations/rlreco')
import pandas as pd
import lifetimes
from sklearn.cluster import KMeans
from utils import helperFunctions



class rfmMaker:
    def __init__(self,custDetails,conf):
        self.custDetails = custDetails
        self.conf = conf

    def rfmMatrix(self):
        # Converting data to RFM format
        RfmAgeTrain = lifetimes.utils.summary_data_from_transaction_data(self.custDetails, self.conf['customer_id'], 'Parse_date','grossValue')
        # Reset the index
        RfmAgeTrain = RfmAgeTrain.reset_index()
        return RfmAgeTrain

    # Function for ordering cluster numbers

    def order_cluster(self,cluster_field_name, target_field_name, data, ascending):
        # Group the data on the clusters and summarise the target field(recency/frequency/monetary) based on the mean value
        data_new = data.groupby(cluster_field_name)[target_field_name].mean().reset_index()
        # Sort the data based on the values of the target field
        data_new = data_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
        # Create a new column called index for storing the sorted index values
        data_new['index'] = data_new.index
        # Merge the summarised data onto the original data set so that the index is mapped to the cluster
        data_final = pd.merge(data, data_new[[cluster_field_name, 'index']], on=cluster_field_name)
        # From the final data drop the cluster name as the index is the new cluster
        data_final = data_final.drop([cluster_field_name], axis=1)
        # Rename the index column to cluster name
        data_final = data_final.rename(columns={'index': cluster_field_name})
        return data_final

    # Function to do the cluster ordering for each cluster
    #

    def clusterSorter(self,target_field_name,RfmAgeTrain, ascending):
        # Defining the number of clusters
        nclust = self.conf['nclust']
        # Make the subset data frame using the required feature
        user_variable = RfmAgeTrain[['CustomerID', target_field_name]].copy()
        # let us take four clusters indicating 4 quadrants
        kmeans = KMeans(n_clusters=nclust)
        kmeans.fit(user_variable[[target_field_name]])
        # Create the cluster field name from the target field name
        cluster_field_name = target_field_name + 'Cluster'
        # Create the clusters
        user_variable[cluster_field_name] = kmeans.predict(user_variable[[target_field_name]])
        # Sort and reset index
        user_variable = user_variable.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
        # Sort the data frame according to cluster values
        user_variable = self.order_cluster(cluster_field_name, target_field_name, user_variable, ascending)
        return user_variable


    def clusterCreator(self):
        
        # data : This is the dataframe for which we want to create the clusters
        # clustName : This is the variable name
        # nclust : Number of clusters to be created
        
        # Get the RFM data Frame
        RfmAgeTrain = self.rfmMatrix()
        # Implementing for user recency
        user_recency = self.clusterSorter('recency', RfmAgeTrain,False)
        #print('recency grouping',user_recency.groupby('recencyCluster')['recency'].mean().reset_index())
        # Implementing for user frequency
        user_freqency = self.clusterSorter('frequency', RfmAgeTrain, True)
        #print('frequency grouping',user_freqency.groupby('frequencyCluster')['frequency'].mean().reset_index())
        # Implementing for monetary values
        user_monetary = self.clusterSorter('monetary_value', RfmAgeTrain, True)
        #print('monetary grouping',user_monetary.groupby('monetary_valueCluster')['monetary_value'].mean().reset_index())

        # Merging the individual data frames with the main data frame
        RfmAgeTrain = pd.merge(RfmAgeTrain, user_monetary[["CustomerID", 'monetary_valueCluster']], on='CustomerID')
        RfmAgeTrain = pd.merge(RfmAgeTrain, user_freqency[["CustomerID", 'frequencyCluster']], on='CustomerID')
        RfmAgeTrain = pd.merge(RfmAgeTrain, user_recency[["CustomerID", 'recencyCluster']], on='CustomerID')
        # Calculate the overall score
        RfmAgeTrain['OverallScore'] = RfmAgeTrain['recencyCluster'] + RfmAgeTrain['frequencyCluster'] + RfmAgeTrain['monetary_valueCluster']
        return RfmAgeTrain

    def segmenter(self):
        
        #This is the script to create segments after the RFM analysis
        
        # Get the RFM data Frame
        RfmAgeTrain = self.clusterCreator()
        # Segment data
        RfmAgeTrain['Segment'] = 'Q1'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 0), 'Segment'] = 'Q2'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 1), 'Segment'] = 'Q2'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 2), 'Segment'] = 'Q3'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 4), 'Segment'] = 'Q4'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 5), 'Segment'] = 'Q4'
        RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 6), 'Segment'] = 'Q4'

        # Merging the customer details with the segment
        custDetails = pd.merge(self.custDetails, RfmAgeTrain, on=['CustomerID'], how='left')
        # Saving the details as a pickle file
        helperFunctions.save_clean_data(custDetails,self.conf["custDetails"])
        print("[INFO] Saved customer details ")

        return custDetails

The rfmMaker class contains methods which carry out the following tasks. The first task is converting the custDetails data frame to the RFM format. We saw this in the previous post, where we used the lifetimes library to convert the data frame to the RFM format. This process is implemented in the rfmMatrix method in lines 15-20.

Once the data is in the RFM format, the next task, as we saw in the previous post, is to create the clusters for recency, frequency and monetary values. During our prototyping phase we decided to adopt 4 clusters for each of these variables. Here we pass the number of clusters through the configuration file, as seen in line 44, and then create the clusters using the KMeans method as shown in lines 48-49. Once the clusters are created, they are sorted to get a logical order. We saw these steps during the prototyping phase and they are implemented using the clusterCreator method (lines 61-85), the clusterSorter method (lines 42-58) and the order_cluster method (lines 24-37). The clusterCreator method drives the process for each of the three variables, clusterSorter fits the clusters and order_cluster relabels them in a logical order. These functions are explained in detail in the last post.
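The ordering step is easier to follow on a tiny example. KMeans assigns arbitrary cluster numbers, so the clusters are renumbered according to the mean of the target field. The sketch below reproduces that idea in plain Python; relabel_by_mean is a hypothetical helper written for this illustration and is not part of the project code, which does the same thing with pandas in order_cluster.

```python
from collections import defaultdict

def relabel_by_mean(values, labels, ascending=True):
    # Collect the values that fall in each original cluster
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    # Mean of the target field per cluster
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    # Order the clusters by their mean and number them 0, 1, 2, ...
    ordered = sorted(means, key=means.get, reverse=not ascending)
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[label] for label in labels]

# Illustrative frequencies and the arbitrary labels a clustering step might return
frequency = [1, 2, 100, 110, 50]
raw_labels = [2, 2, 0, 0, 1]

print(relabel_by_mean(frequency, raw_labels, ascending=True))
print(relabel_by_mean(frequency, raw_labels, ascending=False))
```

With ascending=True the cluster with the smallest mean gets the number 0, which is the ordering used for frequency and monetary value. Recency is ordered with ascending=False so that more recent customers get the higher cluster number.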

After the clusters are made and sorted, the next task is to merge them with the original data frame. This is done in the latter part of the clusterCreator method (lines 80-82). As we saw in the prototyping phase, we merge all three cluster details into the original data frame and then create the overall score by summing the scores of the individual clusters (line 84). Finally this data frame is returned to the final method, segmenter, for defining the segments.

Our final task is to combine the clusters into 4 distinct segments, as seen in the prototyping phase. We do these steps in the segmenter method (lines 94-100). After these steps we have 4 segments, ‘Q1’ to ‘Q4’, and these segments are merged into the custDetails data frame (line 103).
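The score to segment rules can also be read as a simple lookup. The sketch below copies the rules exactly as they are written in the segmenter method; segment_for is a hypothetical helper used only for this illustration. With four clusters per variable the overall score can range from 0 to 9, and any score without an explicit rule, such as 3, keeps the default segment ‘Q1’.

```python
# Score-to-segment rules copied from the segmenter method
SEGMENT_RULES = {0: "Q2", 1: "Q2", 2: "Q3", 4: "Q4", 5: "Q4", 6: "Q4"}

def segment_for(overall_score):
    # Scores without an explicit rule keep the default segment 'Q1'
    return SEGMENT_RULES.get(overall_score, "Q1")

for score in range(0, 10):
    print(score, segment_for(score))
```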

That takes us to the end of this post. So let us summarise what we have covered in this post.

  • Created the folder structure for the project
  • Created a virtual environment and activated the virtual environment
  • Created folders like Config, Data, Processes and Utils, and then created the corresponding files
  • Created the code and files for data loading, data clustering and segmenting using the RFM process

We will now get into other aspects of building our self learning system in the next post.

What Next ?

Now that we have explored the rfmMaker class in the file rfmProcess.py, in the next post we will define the classes and methods for implementing the recommendation and self learning processes. The next post will be published next week. To be notified of the next post please subscribe to this blog. You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest GitHub repository.

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application, or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in machine learning, subscribe to our Youtube channel.

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Building Self Learning Recommendation system – V : Prototype Phase II : Self Learning Implementation

This is the fifth post of our series on building a self learning recommendation system using reinforcement learning. This post of the series builds on the previous post where we segmented customers using RFM analysis. This series consists of the following posts.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit
  4. Build the prototype of the self learning recommendation system : Part I
  5. Build the prototype of the self learning recommendation system: Part II ( This post )
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

Introduction

In the last post we saw how to create customer segments from transaction data. In this post we will use the customer segments to create states of the customer. Before making the states let us make some assumptions based on the buying behaviour of customers.

  1. Customers in the same segment have very similar buying behaviours
  2. The second assumption we will make is that the buying pattern of customers varies across the months. Within each month we assume that the buying behaviour in the first 15 days is different from the buying behaviour in the rest of the month. These assumptions are made only to demonstrate how such assumptions influence the creation of different states of the customer. One can go much more granular, for example by assuming that the buying pattern changes every week in a month, i.e. the buying pattern in the first week is different from that of the second week and so on. With each level of granularity the number of states required will increase. Ideally such decisions need to be made considering the business dynamics and based on real customer buying behaviours.
  3. The next assumption is based on the days of the week. We assume that the buying behaviour of customers also varies across different days of the week.

Based on these assumptions, each state will have four tiers i.e

Customer segment >> month >> within first 15 days or not >> day of the week.

Let us now see how this assumption can be carried forward to create different states for our self learning recommendation system.

As a first step towards creation of states, we will create some more variables from the existing variables. We will be using the same dataframe we created till the segmentation phase, which we discussed in the last post.

# Feature engineering of the customer details data frame
# Get the day of the month as a separate column
custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
# Converting date to float for easy comparison
custDetails['Date'] = custDetails['Date'].astype('float64')
# Get the period of month column
custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > 15))

custDetails.head()

Let us look closely at the changes. In line 3 we extract the day of the month and then convert it into a float type in line 5. The purpose of taking the day is to find out which transactions happened on or before the 15th of the month and which happened after the 15th. We extract those details in line 7, where we create a binary flag: 0 if the date falls within the first 15 days of the month and 1 if it falls after the 15th. Now all the data points required to create the state are in place. These individual data points will be combined to form the state (i.e. Segment-Month-Monthperiod-Day). We will get into the nuances of state creation next.
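As a quick illustration of how the four tiers combine, the following sketch builds a state id for a given segment and date using only the standard library. The function make_state_id is a hypothetical helper for this example; in the post the same pieces are taken from the columns of the custDetails data frame.

```python
from datetime import datetime

def make_state_id(segment, when):
    # 0 for the 1st to the 15th of the month, 1 for the 16th onwards
    month_period = int(when.day > 15)
    # Segment, month name, month period and weekday name joined by underscores
    return "_".join([segment, when.strftime("%B"), str(month_period), when.strftime("%A")])

print(make_state_id("Q1", datetime(2011, 4, 8)))
print(make_state_id("Q4", datetime(2011, 9, 21)))
```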

Initialization of values

When we discussed the K armed bandit in post 2, we saw the functions for generating the rewards and values. What we will do next is initialize the reward function and the value function for the states. A widely used approach is to initialize those values to zero. However, we already have data on each state and the product buying frequency for each of these states. We will aggregate the quantities of each product by state combination to create our initial value functions.

# Aggregate custDetails to get a distribution of rewards
rewardFull = custDetails.groupby(['Segment','Month','monthPeriod','Day','StockCode'])['Quantity'].agg('sum').reset_index()

rewardFull

From the output, we can see the state wise distribution of products. For example, for the state Q1_April_0_Friday we find that 120 units of product ‘10002’ were bought, and so on. So the consolidated data frame represents the propensity of buying each product. We will make the propensity of buying the basis for the initial values of each product.

Now that we have consolidated the data, we will get into the task of creating our reward and value distribution. We will extract information relevant for each state and then load the data into different dictionaries for ease of use. We will kick off these processes by first extracting the unique values of each of the components of our states.

# Finding the unique values of each component of the state
segments = list(rewardFull.Segment.unique())
print('segments',segments)
months = list(rewardFull.Month.unique())
print('months',months)
monthPeriod = list(rewardFull.monthPeriod.unique())
print('monthPeriod',monthPeriod)
days = list(rewardFull.Day.unique())
print('days',days)

In lines 16-22, we take the unique values of each of the components of our state and store them as lists. We will use these lists to create our reward and value function dictionaries. First let us create the dictionaries in which we are going to store the values.

# Defining some dictionaries for storing the values
countDic = {} # Dictionary to store the count of products
polDic = {} # Dictionary to store the value distribution
rewDic = {} # Dictionary to store the reward distribution
recoCount = {} # Dictionary to store the recommendation counts

Let us now implement the process of initializing the reward and value functions.

for seg in segments:
    for mon in months:
        for period in monthPeriod:
            for day in days:
                # Get the subset of the data
                subset1 = rewardFull[(rewardFull['Segment'] == seg) & (rewardFull['Month'] == mon) & (
                            rewardFull['monthPeriod'] == period) & (rewardFull['Day'] == day)]                
                # Check if the subset is valid
                if len(subset1) > 0:
                    # Iterate through each of the subset and get the products and its quantities
                    stateId = str(seg) + '_' + mon + '_' + str(period) + '_' + day
                    # Define a dictionary for the state ID
                    countDic[stateId] = {}                    
                    for i in range(len(subset1.StockCode)):
                        countDic[stateId][subset1.iloc[i]['StockCode']] = int(subset1.iloc[i]['Quantity'])

That's an ugly looking loop. Let us unravel it. In lines 30-33, we implement nested loops to go through each component of our state, starting from segment, then month, month period and finally day. We then get the data which corresponds to that combination of components in line 35. In line 38 we check whether there is any data pertaining to the state we are interested in. If there is valid data, we first define an ID for the state by combining all the components in line 40. In line 42, we define an inner dictionary for each element of the countDic dictionary. The key of the countDic dictionary is the state id we defined in line 40. In the inner dictionary we store each product as a key and the corresponding quantity of the product as its value in line 44.
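Since the aggregated data already has one row per state and product, the same dictionary can also be built in a single pass over the rows. The sketch below shows this with plain Python tuples standing in for the rows of rewardFull; the rows are illustrative and count_dic is a hypothetical name. States with no data never appear in the rows, so no length check is needed.

```python
from collections import defaultdict

# Illustrative aggregated rows: (Segment, Month, monthPeriod, Day, StockCode, Quantity)
rows = [
    ("Q1", "April", 0, "Friday", "10002", 120),
    ("Q1", "April", 0, "Friday", "10080", 36),
    ("Q4", "September", 1, "Wednesday", "10002", 12),
]

count_dic = defaultdict(dict)
for seg, mon, period, day, stock_code, quantity in rows:
    # Same state id format as in the nested loop
    state_id = str(seg) + "_" + mon + "_" + str(period) + "_" + day
    # Inner dictionary: product code -> quantity bought in this state
    count_dic[state_id][stock_code] = int(quantity)

count_dic = dict(count_dic)
print(len(count_dic))
print(count_dic["Q1_April_0_Friday"])
```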

Let us look at the total number of states in the countDic

len(countDic)

You will notice that there are 572 states formed. Let us look at the data for some of the states.

stateId = 'Q4_September_1_Wednesday'
countDic[stateId]

From the output we can see how, for each state, the products and their frequency of purchase are listed. This will form the basis of our reward distribution and also the value distribution. We will create that next.

Consolidation of rewards and value distribution

from numpy.random import normal as GaussianDistribution
# Consolidate the rewards and value functions based on the quantities
for key in countDic.keys():    
    # First get the dictionary of products for a state
    prodCounts = countDic[key]
    polDic[key] = {}
    rewDic[key] = {}    
    # Update the policy values
    for pkey in prodCounts.keys():
        # Creating the value dictionary using a Gaussian process
        polDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)
        # Creating a reward dictionary using a Gaussian process
        rewDic[key][pkey] = GaussianDistribution(loc=prodCounts[pkey], scale=1, size=1)[0].round(2)

In line 50, we iterate through each of the states in countDic. Please note that the key of the dictionary is the state. In line 52, we store the products and their counts for a state in another variable, prodCounts. The prodCounts dictionary has the product id as its key and the buying frequency as its value. In lines 53 and 54, we create the inner dictionaries for the state in the value and reward dictionaries. In line 56 we loop through each product of the state and make it the key of the inner dictionaries of the reward and value dictionaries. For each product we generate a random number from a Gaussian distribution with the frequency of purchase of the product as the mean and a standard deviation of 1. The value and the reward are drawn separately, so the two numbers for a product are close to each other but not identical. At the end of the iterations, we get a distribution of rewards and values for each state and the products within each state. The distribution is centred around the frequency of purchase of each product under the state.
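The effect of this initialisation is easy to see on a handful of products. The sketch below uses random.gauss from the standard library as a stand-in for numpy.random.normal; the purchase counts, the fixed seed and the names prod_counts, pol and rew are assumptions made for the illustration.

```python
import random

random.seed(0)  # fixed seed so that the sketch is repeatable

# Illustrative purchase counts for a single state
prod_counts = {"10002": 120, "10080": 36, "10120": 5}

pol = {}
rew = {}
for product, count in prod_counts.items():
    # Draw once for the value and once for the reward, both centred on the count
    pol[product] = round(random.gauss(count, 1), 2)
    rew[product] = round(random.gauss(count, 1), 2)

print(pol)
print(rew)
```

Every number lands close to the purchase count of its product, so frequently bought products start with higher values than rarely bought ones.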

Let us take a look at some sample values of both the dictionaries

polDic[stateId]
rewDic[stateId]

We have the necessary ingredients for building our self learning recommendation engine. Let us now think about the actual process in an online recommendation system. When a customer visits the ecommerce site, we first need to understand the state of that customer, which consists of the segment of the customer, the current month, the half of the month in which the customer is logging in and the day on which the customer is logging in. This is the information we require to create the states.

For our purpose we will simulate the context of the customer using random sampling.

Simulation of customer action

# Get the context of the customer. For the time being let us randomly select all the states
seg = sample(['Q1','Q2','Q3','Q4'],1)[0] # Sample the segment
mon = sample(['January','February','March','April','May','June','July','August','September','October','November','December'],1)[0] # Sample the month
monthPer = sample([0,1],1)[0] # sample the month period
day = sample(['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],1)[0] # Sample the day
# Get the state id by combining all these samples
stateId = str(seg) + '_' +  mon + '_' + str(monthPer) + '_' + day
print(stateId)

In lines 64-67, we sample each component of the state using the sample function from Python's random module, and then in line 68 we combine them to form the state id. We will be using the state id for the recommendation process. The recommendation process has the following steps.

Process 1 : Initialize dictionaries

A check is done to find whether the value or reward dictionaries which we defined earlier contain the state which we sampled. If the state exists we take the dictionary corresponding to the sampled state; if the state doesn't exist, we initialise an empty dictionary for the state. Let us look at the function to do that.

def collfinder(dictionary,stateId):
    # dictionary : This is the dictionary where we check if the state exists
    # stateId : StateId to be checked    
    if stateId in dictionary.keys():        
        mycol = {}
        mycol[stateId] = dictionary[stateId]
    else:
        # Initialise the state Id in the dictionary
        dictionary[stateId] = {}
        # Return the state specific collection
        mycol = {}
        mycol[stateId] = dictionary[stateId]
        
    return mycol[stateId],mycol,dictionary

In line 71, we define the function. The inputs are the dictionary and the state id we want to verify. We first check if the state id exists in the dictionary in line 74. If it exists we create a new dictionary called mycol in line 75 and then load all the products and their values for the state into the mycol dictionary in line 76.

If the state doesn't exist, we first initialise the state in line 79 and then repeat the same steps as in lines 75-76.
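Both branches of the function end up doing the same thing once the state is present, so the behaviour can be sketched compactly with dict.setdefault. The function collfinder_compact below is a hypothetical equivalent written for illustration, not a replacement for the code in the post, and the dictionary contents are made up.

```python
def collfinder_compact(dictionary, stateId):
    # setdefault returns the existing entry, or stores and returns a new empty dict
    state_values = dictionary.setdefault(stateId, {})
    # Same three return values as collfinder
    return state_values, {stateId: state_values}, dictionary

polDic = {"Q1_April_0_Friday": {"10002": 120.3}}

# A state that already exists
known, known_col, polDic = collfinder_compact(polDic, "Q1_April_0_Friday")
# A state that has not been seen before
new, new_col, polDic = collfinder_compact(polDic, "Q2_May_1_Monday")

print(known)
print(new)
print(sorted(polDic))
```

Note that the returned inner dictionary is the same object that is stored in the main dictionary, so later updates to it are visible through both.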

Let us now implement this step for the dictionaries which we have already created.

# Check for the policy Dictionary
mypolDic,mypol,polDic = collfinder(polDic,stateId)

Let us check the mypol dictionary.

mypol

We can see the policy dictionary for the state we defined. We will now repeat the process for the reward dictionary and the count dictionary.

# Check for the Reward Dictionary
myrewDic, staterew,rewDic = collfinder(rewDic,stateId)
# Check for the Count Dictionary
myCount,quantityDic,countDic = collfinder(countDic,stateId)

Both these dictionaries are similar to the policy dictionary above.

We will also create a similar dictionary for the recommended products, to keep count of all the products which are recommended. Since we haven't created a recommendation dictionary yet, we will initialise it and create the state in the recommendation dictionary.

# Initializing the recommendation dictionary
recoCountdic = {}
# Check the recommendation count dictionary
myrecoDic,recoCount,recoCountdic = collfinder(recoCountdic,stateId)

We will now get into the second process, which is the recommendation process.

Process 2 : Recommendation process

We start the recommendation process based on the epsilon greedy method. Let us define the overall process for the recommendation system.

As mentioned earlier, one of our basic premises was that customers within the same segment have similar buying propensities. So the products which we recommend to a customer will be picked from all the products bought by customers belonging to that segment. The first task in the process is therefore to get all the products relevant for the segment to which the customer belongs. We sort the products in descending order based on the frequency of product purchase.

Implementing the self learning recommendation system using epsilon greedy process

Next we start the epsilon greedy process, as learned in post 2, to select the top n products we want to recommend. To begin this process, we generate a random number between 0 and 1. If the random value is greater than or equal to the epsilon value, we pick the first product in the sorted list of products for the segment. Once a product is picked we remove it from the list to ensure that we don't pick it again. This, as we learned when we implemented the K-armed bandit problem, is the exploitation phase.

The above was the case when the random number was greater than or equal to the epsilon value. If the random number is less than the epsilon value, we get into the exploration phase. We randomly sample a product from the universe of products for the segment. Here again we restrict our exploration to the universe of products relevant for the segment. However, one could design the exploration to go outside the universe of the segment and explore from the basket of all products for all customers.

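We continue the exploitation and exploration process till we get the top n products we want. We will now look at the functions which implement this process. They rely on sample from the random module, OrderedDict from the collections module and numpy imported as np.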
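Before going through the full functions, here is a compact, self-contained sketch of the explore and exploit loop described above. It is a simplified illustration and not the implementation used in the post: pick_products and its parameters are hypothetical names, the values are made up, and exploration samples from a combined pool of all known products so that the loop always terminates.

```python
import random

def pick_products(values, universe, n, epsilon, rng=random):
    # Products ranked from the highest value to the lowest
    ranked = sorted(values, key=values.get, reverse=True)
    # All distinct products that can be recommended, in a stable order
    pool = list(dict.fromkeys(list(universe) + ranked))
    # Never ask for more distinct products than actually exist
    n = min(n, len(pool))
    chosen = []
    while len(chosen) < n:
        if ranked and rng.random() >= epsilon:
            # Exploitation: take the best remaining product
            prod = ranked.pop(0)
        else:
            # Exploration: pick any product at random
            prod = rng.choice(pool)
        # Keep the recommendations unique
        if prod not in chosen:
            chosen.append(prod)
    return chosen

values = {"a": 5, "b": 9, "c": 1}
universe = ["a", "b", "c", "d", "e"]

print(pick_products(values, universe, 2, epsilon=0.0))
print(pick_products(values, universe, 3, epsilon=0.1, rng=random.Random(1)))
```

With epsilon set to 0 the function always exploits and returns the highest valued products in order. Larger values of epsilon mix in more randomly explored products.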

# Create a function to get a list of products for a certain segment
def segProduct(seg, nproducts,rewardFull):
    # Get the list of unique products for each segment
    seg_products = list(rewardFull[rewardFull['Segment'] == seg]['StockCode'].unique())
    seg_products = sample(seg_products, nproducts)
    return seg_products

# This is the function to get the top n products based on value
def sortlist(nproducts, stateId,seg,mypol):
    # Get the top products based on the values and sort them from product with largest value to least
    topProducts = sorted(mypol[stateId].keys(), key=lambda kv: mypol[stateId][kv])[-nproducts:][::-1]
    # If the topProducts is less than the required number of products nproducts, sample the delta
    while len(topProducts) < nproducts:
        print("[INFO] top products less than required number of products")
        segProducts = segProduct(seg,(nproducts - len(topProducts)),rewardFull)
        newList = topProducts + segProducts
        # Finding unique products
        topProducts = list(OrderedDict.fromkeys(newList))
    return topProducts

# This is the function to create the number of products based on exploration and exploitation
def sampProduct(seg, nproducts, stateId, epsilon,mypol):
    # Initialise an empty list for storing the recommended products
    seg_products = []
    # Get the list of unique products for each segment
    Segment_products = list(rewardFull[rewardFull['Segment'] == seg]['StockCode'].unique())
    # Get the list of top n products based on value
    topProducts = sortlist(nproducts, stateId,seg,mypol)
    # Start a loop to get the required number of products
    while len(seg_products) < nproducts:
        # First find a probability
        probability = np.random.rand()
        if probability >= epsilon:            
            # The top product would be first product in the list
            prod = topProducts[0]
            # Append the selected product to the list
            seg_products.append(prod)
            # Remove the top product once appended
            topProducts.pop(0)
            # Ensure that seg_products is unique
            seg_products = list(OrderedDict.fromkeys(seg_products))
        else:
            # If the probability is less than epsilon value randomly sample one product
            prod = sample(Segment_products, 1)[0]
            seg_products.append(prod)
            # Ensure that seg_products is unique
            seg_products = list(OrderedDict.fromkeys(seg_products))
    return seg_products

In line 117 we define the function to get the recommended products. The input parameters for the function are the segment, the number of products we want to recommend, the state id, the epsilon value and the policy dictionary. We initialise a list to store the recommended products in line 119 and then extract all the products relevant for the segment in line 121. We then sort the segment products using the function ‘sortlist’ defined in line 104. The keys of the value dictionary are sorted by their values and the top n products are selected in descending order in line 106. If the number of products in the dictionary is less than the number of products we want to recommend, we randomly select the remaining products from the list of products for the segment. Lines 99-100 in the function ‘segProduct’ are where we take the list of unique products for the segment and randomly sample the required number of products, which are returned to the call in line 110. In line 111 the additional products are joined with the top products. In line 112 duplicates are removed from the new list while preserving the order in which the products were added, and the list is returned to the calling function in line 123.

Lines 125-142 implement the epsilon greedy process for product recommendation. This is a loop which continues till we get the required number of products for recommending. In line 127 a random probability score is generated, and in line 128 we check whether it is greater than or equal to the epsilon value. If it is, we extract the topmost product from the list of products in line 130 and then append it to the recommendation candidate product list in line 132. After extraction of the top product, it is removed from the list in line 134. In line 136 duplicates are removed from the candidate list while preserving the order in which products were added.

Lines 137-142 handle the case when the random score is less than the epsilon value, i.e. the exploration stage. In this stage we randomly sample a product from the list of products relevant to the segment and append it to the list of recommendation candidates. The final list of candidate products to be recommended is returned in line 143.
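To see the exploitation and exploration branches in isolation, here is a minimal, self-contained sketch on toy data. It is not the notebook code: the function name `epsilon_greedy_pick`, the toy value dictionary and the product labels are all hypothetical, and the extra guard on an empty ranking is our own assumption.

```python
import random

def epsilon_greedy_pick(values, candidates, n_products, epsilon, rng):
    """Pick n_products unique items, exploiting `values` or exploring `candidates`."""
    # Rank known products from highest to lowest value (the exploitation queue)
    ranked = sorted(values, key=values.get, reverse=True)
    picked = []
    while len(picked) < n_products:
        if rng.random() >= epsilon and ranked:
            # Exploitation: take the best remaining product
            prod = ranked.pop(0)
        else:
            # Exploration: sample any product from the segment universe
            prod = rng.choice(candidates)
        # Keep the recommendation list free of duplicates
        if prod not in picked:
            picked.append(prod)
    return picked

# Hypothetical values for one state and a hypothetical segment universe
toy_values = {"A": 4.2, "B": 1.3, "C": -0.5, "D": 0.8}
toy_universe = ["A", "B", "C", "D", "E", "F"]
rng = random.Random(42)

# epsilon = 0 never explores, so the result is simply the top 3 by value
greedy = epsilon_greedy_pick(toy_values, toy_universe, 3, epsilon=0.0, rng=rng)
# epsilon = 0.3 mixes in random products from the universe
mixed = epsilon_greedy_pick(toy_values, toy_universe, 3, epsilon=0.3, rng=rng)
print(greedy)
print(mixed)
```

With epsilon set to 0 the sketch degenerates to pure exploitation, which is a quick way to sanity check the ranking before turning exploration on. The sketch assumes the universe holds at least as many unique products as we ask for.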

Process 3 : Updation of all relevant dictionaries

In the last section we saw the process of selecting the products for recommendation. The next process we will cover is how the products recommended are updated in the relevant dictionaries like quantity dictionary, value dictionary, reward dictionary and recommendation dictionary. Again we will use a function to update the dictionaries. The first function we will see is the one used to update sampled products.

def dicUpdater(prodList, stateId,countDic,recoCountdic,polDic,rewDic):
    # Loop through each of the products
    for prod in prodList:        
        # Check if the product is in the dictionary
        if prod in list(countDic[stateId].keys()):
            # Update the count by 1
            countDic[stateId][prod] += 1            
        else:
            countDic[stateId][prod] = 1            
        if prod in list(recoCountdic[stateId].keys()):            
            # Update the recommended products with 1
            recoCountdic[stateId][prod] += 1           
        else:
            # Initialise the recommended products as 1
            recoCountdic[stateId][prod] = 1            
        if prod not in list(polDic[stateId].keys()):
            # Initialise the value as 0
            polDic[stateId][prod] = 0            
        if prod not in list(rewDic[stateId].keys()):            
            # Initialise the reward with a draw from a standard Gaussian distribution
            rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)     
            
    # Return all the dictionaries after update
    return countDic,recoCountdic,polDic,rewDic

The inputs for the function are the recommended products (prodList), the state ID, the count dictionary, the recommendation dictionary, the value dictionary and the reward dictionary, as shown in line 144.

A loop is executed in lines 146-166 to go through each product in the product list. In line 148 a check is made to find out if the product is in the count dictionary, that is, whether the product was ever bought under that state. If the product was bought before, the count is incremented by 1. If the product was not bought earlier, the count for that product under that state is initialised as 1 in line 152.

The next step is to update the recommendation count for the same product. The same logic as above applies. If the product was recommended before for that state, the number is incremented by 1; if not, the number is initialised to 1 in lines 153-158.

The next task is to verify if there is a value distribution for this product for the specific state, as in lines 159-161. If the value distribution does not exist, it is initialised to zero. However, we don't update the value distribution here. The update to the value distribution happens later on. We will come to that in a moment.

The last check is to verify if the product exists in the reward dictionary for that state in lines 162-164. If it doesn't exist then it is initialised with a draw from a Gaussian distribution. Again we don't update the reward here as this is done later on.

Now that we have seen the function for updating the dictionaries, we will get into a function which initialises the dictionaries. This process is required if a particular state has never been seen in any of the dictionaries. Let us get to that function.

def dicAdder(prodList, stateId,countDic,recoCountdic,polDic,rewDic):
    countDic[stateId] = {}
    polDic[stateId] = {}
    recoCountdic[stateId] = {}
    rewDic[stateId] = {}    
    # Loop through the product list
    for prod in prodList:
        # Initialise the count as 1
        countDic[stateId][prod] = 1
        # Initialise the value as 0
        polDic[stateId][prod] = 0
        # Initialise the recommended products as 1
        recoCountdic[stateId][prod] = 1
        # Initialise the reward with a draw from a standard Gaussian distribution
        rewDic[stateId][prod] = GaussianDistribution(loc=0, scale=1, size=1)[0].round(2)
    # Return all the dictionaries after update
    return countDic,recoCountdic,polDic,rewDic

The inputs to this function, as seen in line 168, are the same as what we saw in the update function. In lines 169-172, we initialise the inner dictionaries for the current state. In lines 174-182, all the inner dictionaries are initialised for the respective products. The count and recommendation dictionaries are initialised to 1 and the value dictionary is initialised to 0. The reward dictionary is initialised using a Gaussian distribution. Finally the updated dictionaries are returned in line 184.
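The two functions above share most of their logic, so as an aside, here is a hedged sketch of how the "initialise if missing, otherwise increment" pattern could be folded into one helper with `dict.setdefault` and `dict.get`. The helper name `upsert_recommendations`, the state id and the product labels are hypothetical, and the standard library `random.gauss` stands in for the Gaussian draw used in the notebook.

```python
import random

def upsert_recommendations(prod_list, state_id, count_dic, reco_dic, pol_dic, rew_dic, rng):
    """Create missing state/product entries, then bump the counters."""
    for dic in (count_dic, reco_dic, pol_dic, rew_dic):
        # Create the inner dictionary the first time a state is seen
        dic.setdefault(state_id, {})
    for prod in prod_list:
        # Counters default to 0 and are incremented, so a new product ends at 1
        count_dic[state_id][prod] = count_dic[state_id].get(prod, 0) + 1
        reco_dic[state_id][prod] = reco_dic[state_id].get(prod, 0) + 1
        # Values start at 0 and are left untouched if they already exist
        pol_dic[state_id].setdefault(prod, 0)
        # Rewards start from a draw of a standard Gaussian distribution
        if prod not in rew_dic[state_id]:
            rew_dic[state_id][prod] = round(rng.gauss(0, 1), 2)
    return count_dic, reco_dic, pol_dic, rew_dic

rng = random.Random(0)
count_dic, reco_dic, pol_dic, rew_dic = {}, {}, {}, {}
state = "Q1_Monday_S1"  # hypothetical state id

# First call behaves like dicAdder, second call behaves like dicUpdater
upsert_recommendations(["P1", "P2"], state, count_dic, reco_dic, pol_dic, rew_dic, rng)
upsert_recommendations(["P2", "P3"], state, count_dic, reco_dic, pol_dic, rew_dic, rng)
print(count_dic[state])
```

Keeping two separate functions, as in the prototype, makes the "context exists" and "context does not exist" branches explicit; the merged helper trades that explicitness for less duplicated code.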

Next we start the recommendation process using all the functions we have defined so far.

nProducts = 10
epsilon=0.1

# Get the list of recommended products and update the dictionaries.The process is executed for a scenario when the context exists and does not exist
if len(mypolDic) > 0:    
    print("The context exists")
    # Implement the sampling of products based on exploration and exploitation
    seg_products = sampProduct(seg, nProducts , stateId, epsilon,mypol)
    # Update the dictionaries of values and rewards
    countDic,recoCountdic,polDic,rewDic = dicUpdater(seg_products, stateId,countDic,recoCountdic,polDic,rewDic)
else:
    print("The context doesn't exist")
    # Get the list of relevant products
    seg_products = segProduct(seg, nProducts, rewardFull)
    # Add products to the value dictionary and rewards dictionary
    countDic,recoCountdic,polDic,rewDic = dicAdder(seg_products, stateId,countDic,recoCountdic,polDic,rewDic)

We define the number of products and the epsilon value in lines 185-186. In line 189 we check if the state exists, which would mean that there would be some products in the dictionary. If the state exists, we get the list of recommended products in line 192 using the ‘sampProduct’ function we saw earlier. After getting the list of products we update all the dictionaries in line 194.

If the state doesn't exist, then products are randomly sampled using the ‘segProduct’ function in line 198. As before, we update the dictionaries in line 200.

Process 4 : Customer Action

So far we have implemented the recommendation process. In a real-world application, the products we generated are displayed as recommendations to the customer. Based on the recommendations received, the customer can take one of the following actions.

  1. Customer could buy one or more of the recommended products
  2. Customer could browse through the recommended products
  3. Customer could ignore all the recommendations.

Based on the customer actions, we need to give feedback to the online learning system as to how good the recommendations were. Obviously the first scenario is the most desired one, the second one indicates some level of interest and the last one is the undesirable outcome. From a self learning perspective we need to reinforce the desirable behaviours and discourage undesirable behaviours by devising proper reward systems.

Just like we simulated customer states, we will create some functions to simulate customer actions. We define probability distributions to simulate the customer's propensity for buying or clicking a product. From these distributions we draw how many products get bought and how many get clicked, and then sample that many products from our recommended list. Please note that these processes are only required as we are not implementing on a real system. When we implement this process in a real system, we get all this feedback from the choices made by the customer.

def custAction(segproducts):
    print('[INFO] getting the customer action')
    # Sample a value to get how many products will be clicked    
    click_number = np.random.choice(np.arange(0, 10), p=[0.50,0.35,0.10, 0.025, 0.015,0.0055, 0.002,0.00125,0.00124,0.00001])
    # Sample products which will be clicked based on click number
    click_list = sample(segproducts,click_number)

    # Sample for buy values    
    buy_number = np.random.choice(np.arange(0, 10), p=[0.70,0.15,0.10, 0.025, 0.015,0.0055, 0.002,0.00125,0.00124,0.00001])
    # Sample products which will be bought based on buy number
    buy_list = sample(segproducts,buy_number)

    return click_list,buy_list

The input to the function is the list of recommended products, as seen from line 201. We then simulate the number of products the customer is going to click using the probability distribution shown in line 204. From the probability distribution we can see there is a 50% chance of not clicking any product, a 35% chance of clicking one product and so on. Once we get the number of products which are likely to be clicked, we sample that many products from the recommended product list. We do a similar process for products that are likely to be bought in lines 209-211. Finally we return the list of products that will be clicked and bought. Please note that there is a high likelihood that the returned lists will be empty as the probability distributions are skewed heavily towards that possibility. Let us implement that function and see what we get.

click_list,buy_list = custAction(seg_products)
print(click_list)
print(buy_list)

So from the simulation, we can see that the customer browsed one product but did not buy any of the products. Please note that you might get a very different simulation when you try this, as it is a random sampling of products.
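Since the simulation depends entirely on the two probability vectors, it can help to check them on their own. The short sketch below reuses the probabilities from the function above, confirms they sum to 1 (which `np.random.choice` requires) and computes the expected number of clicked and bought products per round. The variable names and the seed are our own choices.

```python
import numpy as np

counts = np.arange(0, 10)
click_p = np.array([0.50, 0.35, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])
buy_p = np.array([0.70, 0.15, 0.10, 0.025, 0.015, 0.0055, 0.002, 0.00125, 0.00124, 0.00001])

# The sampling call requires the probabilities to sum to 1
assert np.isclose(click_p.sum(), 1.0) and np.isclose(buy_p.sum(), 1.0)

# Expected number of clicked / bought products per recommendation round
expected_clicks = float(counts @ click_p)
expected_buys = float(counts @ buy_p)

# Empirical share of rounds in which nothing is clicked
rng = np.random.default_rng(7)
draws = rng.choice(counts, size=20_000, p=click_p)
share_no_click = float((draws == 0).mean())
print(expected_clicks, expected_buys, share_no_click)
```

On average fewer than one product is clicked and roughly half a product is bought per round, which is why most simulated rounds end mainly with penalties for ignored products.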

Now that we have got the customer action, our next step is to get rewards based on the customer actions. Let us define the reward as 5 points if the customer has bought a product, 1 point if the customer has clicked the product and -2 if the customer has done neither of these. We will define some functions to update the value dictionaries based on the rewards.

def getReward(loc):
    rew = GaussianDistribution(loc=loc, scale=1, size=1)[0].round(2)
    return rew

def saPolicy(rew, stateId, prod,polDic,recoCountdic):
    # This function computes the increment used for the policy (value) update
    # Get the current value of the state
    vcur = polDic[stateId][prod]
    # Get the number of times the current product was recommended
    n = recoCountdic[stateId][prod]
    # Calculate the increment to the current value (simple averaging)
    Incvcur = (1 / n) * (rew - vcur)
    return Incvcur

def valueUpdater(seg_products, loc,custList,stateId,rewDic,polDic,recoCountdic, remove=True):
    for prod in custList:       
        # Get the reward for the bought product. The reward will be centered around the defined reward for each action
        rew = getReward(loc)        
        # Update the reward in the reward dictionary
        rewDic[stateId][prod] += rew        
        # Update the policy based on the reward
        Incvcur = saPolicy(rew, stateId, prod,polDic,recoCountdic)        
        polDic[stateId][prod] += Incvcur        
        # Remove the bought product from the product list, if it is still there
        if remove and prod in seg_products:
            seg_products.remove(prod)
    return seg_products,rewDic,polDic,recoCountdic

The main function is in line 231, whose inputs are the following,

seg_products : segment products we earlier derived

loc : reward for action (i.e. 5 for buy, 1 for browse and -2 for ignoring)

custList : The list of products which are clicked or bought by the customer

stateId : The state ID

rewDic,polDic,recoCountdic : Reward dictionary, value dictionary and recommendation count dictionary for updates

An iterative loop is initiated from line 232 to iterate through all the products in the corresponding list (buy or click list). First we get the corresponding reward for the action in line 234. This line calls the function defined in line 217, which returns the reward from a Gaussian distribution centred at the reward location (5, 1 or -2). Once we get the reward we update the reward dictionary in line 236 with the new reward.

In line 238 we call the function ‘saPolicy’ to get the increment to the value for the action. The function ‘saPolicy’, defined in line 221, takes the reward, state id, product and dictionaries as input and outputs the increment for updating the policy dictionary.

In line 224, we get the current value for the state and the product, and in line 226 we get the number of times that product was recommended. The increment is calculated in line 228 through the simple averaging method we dealt with in our post on K armed bandits. The increment is then returned to the calling function and added to the existing value in lines 238-239. To avoid re-recommending the current product to the customer we do a check in line 241 and then remove it from the segment products in line 242. The updated list of segment products along with the updated dictionaries are then returned in line 243.

Let us now look at the implementation of these functions.

if len(buy_list) > 0:
    seg_products,rewDic,polDic,recoCountdic = valueUpdater(seg_products, 5, buy_list,stateId,rewDic,polDic,recoCountdic)
    # Repeat the same process for customer click
if len(click_list) > 0:
    seg_products,rewDic,polDic,recoCountdic = valueUpdater(seg_products, 1, click_list,stateId,rewDic,polDic,recoCountdic)
    # For those products not clicked or bought, give a penalty
if len(seg_products) > 0:
    custList = seg_products.copy()
    seg_products,rewDic,polDic,recoCountdic = valueUpdater(seg_products, -2, custList,stateId ,rewDic,polDic,recoCountdic, False)

In lines 245, 248 and 252 we update the values for the buy list, click list and the ignored products respectively. In the process all the dictionaries also get updated.
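The simple averaging update used in ‘saPolicy’ can be checked in isolation. The sketch below applies the incremental rule, new value = old value + (1/n) * (reward - old value), to a hypothetical stream of rewards and compares the result with the plain arithmetic mean. The helper name `incremental_update` and the reward values are illustrative, and the equivalence holds under the assumption that n counts the rewards received so far for that product.

```python
def incremental_update(value, reward, n):
    """Return the new running average after the n-th reward (n starts at 1)."""
    return value + (1 / n) * (reward - value)

# Hypothetical rewards received by one product in one state
rewards = [5.2, 0.8, -1.9, 4.7]

value = 0.0
for n, rew in enumerate(rewards, start=1):
    value = incremental_update(value, rew, n)

# The incremental rule should reproduce the ordinary mean of the rewards
batch_mean = sum(rewards) / len(rewards)
print(round(value, 4), round(batch_mean, 4))
```

The advantage of the incremental form is that we never have to store the full reward history; the current value and the count are enough.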

That takes us to the end of all the processes for the self learning system. When implementing these processes as a system, we have to keep executing them one by one. Let us summarise all the processes which need to be repeated to build this self learning recommendation system.

  1. Identify the customer context by simulating the states. In a real life system we don't have to simulate this information as it will be available when a customer logs in.
  2. Initialise the dictionaries for the state id we generated.
  3. Get the list of products to be recommended based on the state id.
  4. Update the dictionaries based on the list of products which were recommended.
  5. Simulate customer actions on the recommended products. Again, in real systems we don't simulate customer actions as they will be captured online.
  6. Update the value dictionary and reward dictionary based on customer actions.

All these 6 steps will have to be repeated for each customer instance. Once this cycle runs for some continuous steps, we will get the value dictionaries updated and dynamically aligned to individual customer segments.
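The six steps above can be condensed into a toy loop to see how the values align over repeated cycles. This is only a sketch under simplifying assumptions: a single state, one product recommended per round, hypothetical product names and hypothetical mean rewards of 5, 1 and -2 standing in for bought, clicked and ignored products.

```python
import random

def run_toy_loop(true_means, epsilon=0.1, rounds=2000, seed=1):
    """Epsilon-greedy loop over hypothetical products with noisy rewards."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in true_means}   # value dictionary
    counts = {p: 0 for p in true_means}     # recommendation counts
    products = list(true_means)
    for _ in range(rounds):
        if rng.random() >= epsilon:
            # Exploit: recommend the product with the highest current value
            prod = max(products, key=values.get)
        else:
            # Explore: recommend a random product
            prod = rng.choice(products)
        counts[prod] += 1
        # Simulated customer feedback: reward centred on the product's mean
        reward = rng.gauss(true_means[prod], 1.0)
        # Simple averaging update of the value
        values[prod] += (reward - values[prod]) / counts[prod]
    return values, counts

values, counts = run_toy_loop({"buy_often": 5.0, "click_only": 1.0, "ignored": -2.0})
best = max(values, key=values.get)
print(best, counts)
```

After enough rounds the product with the highest mean reward dominates the recommendations, while the small epsilon keeps the other products sampled often enough for their values to stay up to date.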

What next ?

In this post we built our self learning recommendation system using Jupyter notebooks. Next we will productionise these processes using Python scripts. When we productionise these processes, we will also use a MongoDB database to store and retrieve data. We will start the productionising phase in the next post.

The complete code base for the series is in the Bayesian Quest GitHub repository.

Building Self Learning Recommendation system – IV : Prototype Phase I: Segmenting the customers.

This is the fourth post of our series on building a self learning recommendation system using reinforcement learning. In the coming posts of the series we will expand on our understanding of the reinforcement learning problem and build an application for recommending products. These are the different posts of the series where we will progressively build our recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit
  4. Build the prototype of the self learning recommendation system: Part I ( This post )
  5. Build the prototype of the self learning recommendation system: Part II
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

Introduction

In the last post of the series we formulated the idea of how we can build the self learning recommendation system as a K armed bandit. In this post we will go ahead and start building the prototype of our self learning system based on the idea we developed. We will be using a Jupyter notebook to build our prototype. Let us dive in.

Processes for building our self learning recommendation system

Let us take a bird's-eye view of the recommendation system we are going to build. We will implement the following processes.

  1. Cleaning of the data set
  2. Segment the customers using RFM segmentation
  3. Creation of states for contextual recommendation
  4. Creation of reward and value distributions
  5. Implement the self learning process using simple averaging method
  6. Simulate customer actions to initiate self learning for recommendations

The first two processes will be implemented in this post and the remaining processes will be covered in the next post.

Cleaning the data set

The data set we will be using for this exercise is the online retail data set. Let us load the data set in our system and get familiar with the data. First let us import all the necessary library files.

from pickle import load
from pickle import dump
import numpy as np
import pandas as pd
from dateutil.parser import parse
import os
from collections import Counter
import operator
from random import sample

We will now define a simple function to load the data using pandas.

def dataLoader(orderPath):
    # This is the method to load data from the input files
    orders = pd.read_csv(orderPath,encoding = "ISO-8859-1")
    return orders

The above function reads the CSV file and returns the data frame. Let us use this function to load the data and view the head of the data.

# Please define your specific path where the data set is loaded
filename = "OnlineRetail.csv"
# Let us load the customer Details
custDetails = dataLoader(filename)
custDetails.head()
Figure 1 : Retail data set

Further in the exercise we have to work a lot with dates, and therefore we need to extract relevant details from the date column like the day, weekday, month, year etc. We will do that with the date parser library. Let us now parse the date column and create new columns storing the details we extract after parsing the dates.

#Parsing  the date
custDetails['Parse_date'] = custDetails["InvoiceDate"].apply(lambda x: parse(x))
# Parsing the weekday
custDetails['Weekday'] = custDetails['Parse_date'].apply(lambda x: x.weekday())
# Parsing the Day
custDetails['Day'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%A"))
# Parsing the Month
custDetails['Month'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%B"))
# Extracting the year
custDetails['Year'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%Y"))
# Combining year and month together as one feature
custDetails['year_month'] = custDetails['Year'] + "_" +custDetails['Month']

custDetails.head()
Figure 2 : Data frame after date parsing

As seen from line 22, we have used a lambda function to first parse the ‘InvoiceDate’ column. The parsed date is stored in a new column called ‘Parse_date’. After parsing the dates, we carry out different operations on the parsed date, again using lambda functions. The different operations we carry out are

  1. Extract the weekday number (0 for Monday) and store it in a new column called ‘Weekday’ : line 24
  2. Extract the name of the day of the week and store it in the column ‘Day’ : line 26
  3. Extract the name of the month and store it in the column ‘Month’ : line 28
  4. Extract the year and store it in the column ‘Year’ : line 30

Finally, in line 32 we combine the year and month to form a new column called ‘year_month’. This is done to enable easy filtering of data based on the combination of a year and month.
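The row by row parsing above works, but it can be slow on a large file. As an alternative sketch, the same columns can be derived with pandas' vectorised `pd.to_datetime` and the `.dt` accessor. The toy data frame and the "month/day/year hour:minute" date layout are assumptions made for illustration; the layout of your CSV may differ, in which case the `format` argument needs to change.

```python
import pandas as pd

# Toy frame; the "month/day/year hour:minute" layout is an assumption about the CSV
toy = pd.DataFrame({"InvoiceDate": ["12/1/2010 8:26", "1/15/2011 14:05"]})

# Parse the whole column at once instead of row by row
toy["Parse_date"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")
# Weekday number, where 0 is Monday
toy["Weekday"] = toy["Parse_date"].dt.weekday
# Name of the day and of the month
toy["Day"] = toy["Parse_date"].dt.day_name()
toy["Month"] = toy["Parse_date"].dt.month_name()
# Year as a string, so it can be concatenated with the month
toy["Year"] = toy["Parse_date"].dt.strftime("%Y")
toy["year_month"] = toy["Year"] + "_" + toy["Month"]
print(toy)
```

The resulting columns carry the same information as the ones created with the lambda functions, so the rest of the notebook should not need to change.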

We will also create a column which gives the gross value of each purchase. Gross value can be calculated by multiplying the quantity with the unit price.

# Creating gross value column
custDetails['grossValue'] = custDetails["Quantity"] * custDetails["UnitPrice"]
custDetails.head()
Figure 3 :Customer Details Data frame

The reason we are calculating the gross value is to use it for segmentation of customers which will be dealt with in the next section. This takes us to the end of the initial preparation of the data set. Next we start creating customer segments.

Creating Customer Segments

In the last post, where we formulated the problem statement, we identified that the customer segment could be one of the important components of the states. In addition to the customer segment, the other components are the day of purchase and the period of the month. So our next endeavour is to prepare data to create the different states we require. We will start with defining the customer segment.

There are different approaches to creating customer segments. In this post we will use the RFM analysis to create customer segments. Let us get going with creation of customer segments from our data set. We will continue on the same notebook we were using so far.

import lifetimes

In line 39, we import the lifetimes package to create the RFM data from our transactional dataset. Next we will use the package to convert the transaction data to the specific format.

# Converting data to RFM format
RfmAgeTrain = lifetimes.utils.summary_data_from_transaction_data(custDetails, 'CustomerID', 'Parse_date', 'grossValue')
RfmAgeTrain

The process for getting the frequency, recency and monetary value is very simple using the lifetimes package, as shown in line 42. From the output we can see the RFM data frame formed with each customer ID as an individual row. For each customer, the frequency and the recency in days are represented along with the average monetary value for the customer. We will be using these values for creating clusters of customer segments.

Before we work further, let us clean the data frame a bit by resetting the index values as shown in line 44.
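To build some intuition for what the summary contains, here is a pandas-only sketch on toy transactions. It follows the definitions as we understand them from the lifetimes documentation (frequency as the number of repeat purchase days, recency as the days between the first and the latest purchase, T as the days between the first purchase and the end of the observation window). It is an approximation for illustration, not the package's implementation, and it leaves out the monetary value. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical transactions; customer 1 buys twice on the same day
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "Parse_date": pd.to_datetime(
        ["2011-01-01", "2011-01-10", "2011-01-10", "2011-01-05", "2011-02-04", "2011-01-20"]
    ),
    "grossValue": [10.0, 20.0, 5.0, 7.0, 9.0, 3.0],
})

# Collapse to one row per customer and purchase day
daily = (
    tx.assign(day=tx["Parse_date"].dt.normalize())
      .groupby(["CustomerID", "day"], as_index=False)["grossValue"].sum()
)
# End of the observation window: the last purchase day in the data
period_end = daily["day"].max()

per_customer = daily.groupby("CustomerID")["day"].agg(first="min", last="max", n_days="count")
rfm = pd.DataFrame({
    # Repeat purchase days: days with a purchase, minus the first one
    "frequency": per_customer["n_days"] - 1,
    # Days between the first and the latest purchase
    "recency": (per_customer["last"] - per_customer["first"]).dt.days,
    # Days between the first purchase and the end of the observation window
    "T": (period_end - per_customer["first"]).dt.days,
})
print(rfm)
```

Note that under these definitions a customer with a single purchase day has a frequency and a recency of 0, which is worth keeping in mind when interpreting the clusters later.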

RfmAgeTrain = RfmAgeTrain.reset_index()
RfmAgeTrain

What we will now do is to use the recency, frequency and monetary values separately to create clusters. We will use the K-means clustering technique and determine the number of clusters required. Many parts of the code used for clustering are taken from the following post on customer segmentation.

In lines 46-47 we import the KMeans clustering method and the matplotlib library.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

For the purpose of getting the recency matrix, let us take a subset of the data frame with only the customer ID and the recency value, as shown in lines 48-49.

user_recency = RfmAgeTrain[['CustomerID','recency']].copy()
user_recency.head()

In any clustering problem, as you might know, one of the critical tasks is to determine the number of clusters, which in the K-means algorithm is a parameter. We will use the well known elbow method to find the optimum number of clusters.

# Initialize a dictionary to store sum of squared error
sse = {}
recency = user_recency[['recency']]

# Loop through different cluster combinations
for k in range(1,10):
    # Fit the Kmeans model using the iterated cluster value
    kmeans = KMeans(n_clusters=k,max_iter=2000).fit(recency)
    # Store the cluster against the sum of squared error for each cluster formation   
    sse[k] = kmeans.inertia_
    
# Plotting all the clusters
plt.figure()
plt.plot(list(sse.keys()),list(sse.values()))
plt.xlabel("Number of clusters")
plt.show()
Figure 4 : Plot of number of clusters

In line 51, we initialise a dictionary to store the sum of squared errors for each k-means fit, and in line 52 we take a subset of the data frame, ‘recency’, with only the recency values.

From line 55, we start a loop to iterate through different cluster values. For each cluster value, we fit the k-means model in line 57. In line 59 we store the sum of squared errors for that number of clusters in the dictionary we initialised.

In lines 62-65, we visualise the number of clusters against the sum of squared errors, which gives an indication of the right k value to choose.
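As a side check of how the elbow behaves, the following sketch runs the same loop on synthetic one-dimensional data drawn around four hypothetical centres, where we know the "right" answer in advance. Instead of plotting, it prints the relative drop in the sum of squared errors gained by each additional cluster. The centres, sample sizes, `n_init` and `random_state` are our own choices to make the sketch repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D "recency" values drawn around four hypothetical centres
rng = np.random.default_rng(0)
centres = [10, 100, 200, 320]
toy_recency = np.concatenate([rng.normal(c, 5, size=100) for c in centres]).reshape(-1, 1)

sse = {}
for k in range(1, 8):
    # n_init and random_state are set only to make the sketch repeatable
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_recency)
    sse[k] = model.inertia_

# Relative drop in SSE gained by adding the k-th cluster
drops = {k: (sse[k - 1] - sse[k]) / sse[k - 1] for k in range(2, 8)}
print({k: round(v, 3) for k, v in drops.items()})
```

On data like this the relative drop is large up to the true number of clusters and small afterwards, which is the numerical counterpart of the bend we look for in the plot. Real data rarely gives such a clean signal, so the choice of k remains a judgment call.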

From the plot we can see that the elbow tapers at cluster values of 2, 3 and 4, and one of these values can be taken as the number of clusters. Let us choose 4 clusters for our purpose and then refit the data.

# let us take four clusters 
kmeans = KMeans(n_clusters=4)
# Fit the model on the recency data
kmeans.fit(user_recency[['recency']])
# Predict the clusters for each of the customer
user_recency['RecencyCluster'] = kmeans.predict(user_recency[['recency']])
user_recency

In line 67, we instantiate the KMeans class using 4 clusters. We then use the fit method on the recency values in line 69. Once the model is fit, we predict the cluster for each customer in line 71.

From the output we can see that the recency cluster is predicted against each customer ID. We will clean up this data frame a bit, by resetting the index.

user_recency.sort_values(by='recency',ascending=False).reset_index(drop=True)

From the output we can see that the data is ordered according to the clusters. Let us also look at how the clusters are mapped vis a vis the actual recency value. For doing this, we will group the data with respect to each cluster and then find the mean of the recency value, as in line 74.

user_recency.groupby('RecencyCluster')['recency'].mean().reset_index()

From the output we see the mean value of recency for each cluster. We can clearly see that the mean values are well separated between the clusters. However, the mean values are not mapped in a logical (increasing or decreasing) order of the clusters. From the output we can see that cluster 3 is mapped to the smallest recency value (7.72). The next smallest value (115.85) is mapped to cluster 0 and so on. So there is no specific ordering between the cluster labels and the mean values. This might be a problem when we combine the clusters for recency, frequency and monetary value to derive a combined score. So it is necessary to sort the clusters in an ordered fashion. We will use a custom function to get the order right. Let us see the function.

# Function for ordering cluster numbers

def order_cluster(cluster_field_name,target_field_name,data,ascending):    
    # Group the data on the clusters and summarise the target field(recency/frequency/monetary) based on the mean value
    data_new = data.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    # Sort the data based on the values of the target field
    data_new = data_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    # Create a new column called index for storing the sorted index values
    data_new['index'] = data_new.index
    # Merge the summarised data onto the original data set so that the index is mapped to the cluster
    data_final = pd.merge(data,data_new[[cluster_field_name,'index']],on=cluster_field_name)
    # From the final data drop the cluster name as the index is the new cluster
    data_final = data_final.drop([cluster_field_name],axis=1)
    # Rename the index column to cluster name
    data_final = data_final.rename(columns={'index':cluster_field_name})
    return data_final

In line 77, we define the function and its inputs. Let us look at the inputs to the function

cluster_field_name : This is the field name we give to the cluster in the data set like “RecencyCluster”.

target_field_name : This is the field pertaining to our target values, like ‘recency’, ‘frequency’ and ‘monetary_value’.

data : This is the data frame containing the cluster information and target values, for eg ( user_recency)

ascending : This is a flag indicating whether the data has to be sorted in ascending order or not

In line 79, we group the data based on the cluster and summarise the data under each group to get the mean of the target variable. The idea is to sort the summarised data frame on these mean values, in the direction given by the ‘ascending’ flag, which is done in line 81. Once the data is sorted, we form a new feature with the data frame index as its values, in line 83. Now the index values will act as sorted cluster values and we will get a mapping between the existing cluster values and the new cluster values which are sorted. In line 85, we merge the summarised data frame with the original data frame so that the new cluster values are mapped to all the values in the data frame. Once the new sorted cluster labels are mapped to the original data frame, the old cluster labels are dropped in line 87 and the column is renamed in line 89.
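To see the relabelling in action, here is a minimal sketch which runs the same function on a small toy data frame. The toy values and the names ‘toy’, ‘ordered’ and ‘cluster_means’ are made up for illustration and are not part of the retail data set.

```python
import pandas as pd

# Same function as above, repeated so that this snippet runs on its own
def order_cluster(cluster_field_name, target_field_name, data, ascending):
    data_new = data.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    data_new = data_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    data_new['index'] = data_new.index
    data_final = pd.merge(data, data_new[[cluster_field_name, 'index']], on=cluster_field_name)
    data_final = data_final.drop([cluster_field_name], axis=1)
    data_final = data_final.rename(columns={'index': cluster_field_name})
    return data_final

# Toy data (hypothetical): the cluster labels are deliberately not in order of recency
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'recency': [5, 10, 120, 110, 300, 310],
    'RecencyCluster': [2, 2, 0, 0, 1, 1],
})

# ascending=False gives label 0 to the cluster with the highest mean recency
ordered = order_cluster('RecencyCluster', 'recency', toy, False)
cluster_means = ordered.groupby('RecencyCluster')['recency'].mean()
print(cluster_means)
```

After the call, the mean recency falls steadily as the cluster label goes from 0 to 2, which is the ordering we want for recency.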

Now that we have defined the function, let us implement it and sort the data frame in a logical order in line 91.

user_recency = order_cluster('RecencyCluster','recency',user_recency,False)

Next we will summarise the new sorted data frame and check if the clusters are mapped in a logical order.

user_recency.groupby('RecencyCluster')['recency'].mean().reset_index()

From the above output we can see that the cluster numbers are mapped in a logical order of decreasing recency.
We now need to repeat the process for frequency and monetary values. For convenience we will wrap all these processes in a new function.

def clusterSorter(target_field_name,ascending):    
    # Make the subset data frame using the required feature
    # A copy is taken so that adding a column does not act on a slice of RfmAgeTrain
    user_variable = RfmAgeTrain[['CustomerID',target_field_name]].copy()
    # let us take four clusters indicating 4 quadrants
    kmeans = KMeans(n_clusters=4)
    kmeans.fit(user_variable[[target_field_name]])
    # Create the cluster field name from the target field name
    cluster_field_name = target_field_name + 'Cluster'
    # Create the clusters
    user_variable[cluster_field_name] = kmeans.predict(user_variable[[target_field_name]])
    # Sort and reset index (the result is assigned back, otherwise the sort is discarded)
    user_variable = user_variable.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    # Sort the data frame according to cluster values
    user_variable = order_cluster(cluster_field_name,target_field_name,user_variable,ascending)
    return user_variable

Let us now implement this function to get the clusters for frequency and monetary values.

# Implementing for user frequency
user_freqency = clusterSorter('frequency',True)
user_freqency.groupby('frequencyCluster')['frequency'].mean().reset_index()
# Implementing for monetary values
user_monetary = clusterSorter('monetary_value',True)
user_monetary.groupby('monetary_valueCluster')['monetary_value'].mean().reset_index()

Let us now sit back and look at the three results which we got and try to analyse the results. For recency, we implemented the process using ‘ascending’ value as ‘False’ and the other two with ascending value as ‘True’. Why do you think we did it this way ?

To answer this, let us look at these three variables from the perspective of the behaviour we desire from a customer. We would want customers who are very recent, very frequent and who spend a lot of money. So from a recency perspective, fewer days is good behaviour as this indicates very recent customers. The reverse is true for frequency and monetary value, where higher values are the desirable behaviour. This is why we used 'ascending = False' for the recency variable: the clusters are sorted so that the least recent customers (more days) fall in cluster ‘0’ and the mean number of days comes down as we go towards cluster 3. So in effect we are making cluster 3 the group of most desirable customers. The reverse applies to frequency and monetary value, where we gave 'ascending = True' to make cluster 3 the group of most desirable customers.

Now that we have obtained the clusters for each of the variables separately, it is time to combine them into one data frame and then get a consolidated score which will become the segments we want.

Let us first combine each of the individual dataframes we created with the original data frame

# Merging the individual data frames with the main data frame
RfmAgeTrain = pd.merge(RfmAgeTrain,user_monetary[["CustomerID",'monetary_valueCluster']],on='CustomerID')
RfmAgeTrain = pd.merge(RfmAgeTrain,user_freqency[["CustomerID",'frequencyCluster']],on='CustomerID')
RfmAgeTrain = pd.merge(RfmAgeTrain,user_recency[["CustomerID",'RecencyCluster']],on='CustomerID')
RfmAgeTrain.head()

In lines 115-117, we combine the individual dataframes to our main dataframe. We combine them on the ‘CustomerID’ field. After combining we have a consolidated data frame with each individual cluster label mapped to each customer id as shown below

Let us now add the individual cluster labels to get a combined cluster score.

# Calculate the overall score
RfmAgeTrain['OverallScore'] = RfmAgeTrain['RecencyCluster'] + RfmAgeTrain['frequencyCluster'] + RfmAgeTrain['monetary_valueCluster']
RfmAgeTrain

Let us group the data based on the ‘OverallScore’ and find the mean values of each of our variables , recency, frequency and monetary.

RfmAgeTrain.groupby('OverallScore')[['frequency','recency','monetary_value']].mean().reset_index()

From the output we can see how the new clusters are distributed. There is some level of logical demarcation according to the cluster labels. The higher cluster labels (4, 5 & 6) have high monetary values, high frequency levels and mid level recency. The first two clusters (0 & 1) have lower monetary values, high recency and low levels of frequency. Another stand out cluster is cluster 3, which has the lowest monetary value, lowest frequency and the lowest recency. We can very well go with these clusters as they are, or we can combine clusters which demonstrate similar trends/behaviours. However this decision needs to be made based on the number of customers we have under each of these new clusters. Let us get those numbers first.

RfmAgeTrain.groupby('OverallScore')['frequency'].count().reset_index()

From the counts, we can see that the higher scores (4, 5, 6) have very few customers relative to the other clusters. So it would make sense to combine them into one single segment. As these clusters have higher values we will make them customer segment ‘Q4’. Cluster 3 has some of the lowest relative scores and so we will make it segment ‘Q1’. We can also combine clusters 0 & 1 into a single segment, as the number of customers in those two clusters is also low, and make it segment ‘Q2’. Finally cluster 2 would be segment ‘Q3’. Let's implement these steps next.

RfmAgeTrain['Segment'] = 'Q1'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 0) ,'Segment']='Q2'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 1),'Segment']='Q2'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 2),'Segment']='Q3'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 4),'Segment']='Q4'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 5),'Segment']='Q4'
RfmAgeTrain.loc[(RfmAgeTrain.OverallScore == 6),'Segment']='Q4'

RfmAgeTrain

After allocating the clusters to the respective segments, the subsequent data frame will look as above. Let us now take the mean values of each of these segments to understand how the segment values are distributed.

RfmAgeTrain.groupby('Segment')[['frequency','recency','monetary_value']].mean().reset_index()

From the output we can see that the monetary value and frequency values are in ascending order across the customer segments. The value of recency is not ordered in any fashion. However that does not matter, as all we are interested in is the segmentation of the customer data into four segments. Finally let us merge the segment information to the original customer transaction data.

# Merging the customer details with the segment
custDetails = pd.merge(custDetails, RfmAgeTrain, on=['CustomerID'], how='left')
custDetails.head()

The above output is just part of the final dataframe. From the output we can see that the segment data is updated to the original data frame.
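Looking back at the segment allocation step, the same score to segment mapping can also be written as a single dictionary lookup. The snippet below is only a sketch on a toy data frame: the names ‘score_to_segment’ and ‘toy_scores’ are hypothetical, and the ‘Q1’ fallback mirrors the default value used in the original code.

```python
import pandas as pd

# Score to segment mapping taken from the allocation described above
score_to_segment = {0: 'Q2', 1: 'Q2', 2: 'Q3', 3: 'Q1', 4: 'Q4', 5: 'Q4', 6: 'Q4'}

# Toy data frame (hypothetical) with one customer per overall score
toy_scores = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105, 106, 107],
    'OverallScore': [0, 1, 2, 3, 4, 5, 6],
})

# Look up the segment for each score; any score not in the mapping falls back to 'Q1'
toy_scores['Segment'] = toy_scores['OverallScore'].map(score_to_segment).fillna('Q1')
print(toy_scores)
```

Keeping the mapping in one dictionary makes it easier to review or change the allocation later, for example if the counts per score look different on another data set.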

With that we complete the first step of our process. Let us summarise what we have achieved so far.

  • Preprocessed data to extract information required to generate states
  • Transformed data to the RFM format.
  • Clustered data with respect to recency, frequency and monetary values and then generated the composite score.
  • Derived 4 segments based on the cluster data.

Having completed the segmentation of customers, we are all set to embark on the most important processes.

What Next ?

The next step is to take the segmentation information and then construct our states and action strategies from them. This will be dealt with in the next post. Let us take a peek into the processes we will implement in the next post.

  1. Create states and actions from the customer segments we just created
  2. Initialise the value distribution and rewards distribution
  3. Build the self learning recommendation system using the epsilon greedy method
  4. Simulate customer actions to get the feedback
  5. Update the value distribution based on customer feedback and improve recommendations

There is a lot of ground to be covered in the next post. Please subscribe to this blog to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest Git hub repository

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Building Self Learning Recommendation system – III : Recommendation System as a K-armed Bandit

This is the third post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts wherein we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit ( This post )
  4. Build the prototype of the self learning recommendation system: Part I
  5. Build the prototype of the self learning recommendation system: Part II
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

Introduction

In our previous post we implemented a couple of experiments with the K-armed bandit. When we discussed the idea of K-armed bandits in the context of recommendation systems, we briefly touched upon the idea that the buying behavior of a customer depends on the customer's context. In this post we will take the idea of context forward and see how the context will be used to build the recommendation system using the K-armed bandit solution.

Defining the context for customer buying

When we discussed reinforcement learning in our first post, we learned about the elements of a reinforcement learning setting like state, actions, rewards etc. Let us now identify these elements in the context of the recommendation system we are building.

State

When we discussed reinforcement learning in the first post, we learned that when an agent interacts with the environment at each time step, the agent manifests a certain state. In the example of the robot picking trash, the different states were that of high charge or low charge. However in the context of the recommendation system, what would be our states? Let us try to derive the states from the context of a customer who makes an online purchase. What are the influencing factors which define the product the customer buys? Some of these are

  • The segment the customer belongs to
  • The season or time of year the purchase is made
  • The day on which the purchase is made

There could be many other influencing factors other than these. For simplicity let us restrict ourselves to these factors for now. A state could be made from the combinations of all these factors. Let us arrive at these factors through some exploratory analysis of the data.

The data set we will be using is the online retail data set available in the UCI Machine learning library. We will download the data, place it in a local folder and then read the file from the local folder.

import numpy as np
import pandas as pd
from dateutil.parser import parse

Lines 1-3 import the packages needed for loading and parsing the data. Let us now load the data as a pandas data frame

# Please use the path to the actual data
filename = "data/Online Retail.xlsx"
# Let us load the customer Details
custDetails = pd.read_excel(filename, engine='openpyxl')
custDetails.head()
Figure 1: Head of the retail data set

In line 5, we load the data from disk and read the Excel sheet using the ‘openpyxl’ engine. Please remember to pip install the ‘openpyxl’ package if it is not available.

Let us now parse the date column using date parser and extract information from the date column.

#Parsing  the date
custDetails['Parse_date'] = custDetails["InvoiceDate"].apply(lambda x: parse(str(x)))
# Parsing the weekday
custDetails['Weekday'] = custDetails['Parse_date'].apply(lambda x: x.weekday())
# Parsing the Day
custDetails['Day'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%A"))
# Parsing the Month
custDetails['Month'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%B"))
# Getting the year
custDetails['Year'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%Y"))
# Getting year and month together as one feature
custDetails['year_month'] = custDetails['Year'] + "_" +custDetails['Month']
# Feature engineering of the customer details data frame
# Get the date as a separate column
custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
# Converting date to float for easy comparison
custDetails['Date']  = custDetails['Date'] .astype('float64')
# Get the period of month column
custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > 15))

custDetails.head()
Figure 2 : Parsed Data

As seen from line 11, we have used a lambda function to first parse the ‘InvoiceDate’ column. The parsed date is stored in a new column called ‘Parse_date’. After parsing the dates, we carry out different operations on the parsed date, again using lambda functions. The different operations we carry out are

  1. Extract the weekday as an integer (Monday is 0, Sunday is 6) and store it in a new column called ‘Weekday’ : line 13
  2. Extract the name of the day of the week and store it in the column ‘Day’ : line 15
  3. Extract the month and store in the column ‘Month’ : line 17
  4. Extract year and store in the column ‘Year’ : line 19

In line 21 we combine the year and month to form a new column called ‘year_month’. This is done to enable easy filtering of data, based on the combination of a year and month.

We make some more changes in lines 24-28. In line 24, we extract the day of the month and then convert it into a float type in line 26. The purpose of taking the day of the month is to find out which of these transactions happened on or before the 15th of the month and which after the 15th. We extract those details in line 28, where we create a binary flag which is 1 if the transaction falls after the 15th of the month and 0 otherwise.
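The same date features can also be derived without lambda functions, using the pandas ‘.dt’ accessor on a datetime column. The snippet below is a minimal sketch on three made up invoice dates. It is an alternative for illustration and not the code used in the rest of this series.

```python
import pandas as pd

# Toy invoice dates (hypothetical values)
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 08:26:00',
        '2011-01-16 10:00:00',
        '2011-03-15 12:30:00',
    ])
})

# Weekday as an integer, Monday is 0 and Sunday is 6
toy['Weekday'] = toy['InvoiceDate'].dt.weekday
# Name of the day and of the month
toy['Day'] = toy['InvoiceDate'].dt.day_name()
toy['Month'] = toy['InvoiceDate'].dt.month_name()
# Year as a string so that it can be concatenated with the month
toy['Year'] = toy['InvoiceDate'].dt.strftime('%Y')
toy['year_month'] = toy['Year'] + '_' + toy['Month']
# 1 if the transaction is after the 15th of the month, 0 otherwise
toy['monthPeriod'] = (toy['InvoiceDate'].dt.day > 15).astype(int)

print(toy)
```

On a data set of this size the vectorised accessor is usually faster than applying a lambda row by row, though the resulting columns are the same.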

We will also create a column which gives the gross value of each purchase. Gross value can be calculated by multiplying the quantity with the unit price. After that we will consolidate the data for each unique invoice number and then explore some of the elements of the states which we are interested in

# Creating gross value column
custDetails['grossValue'] = custDetails["Quantity"] * custDetails["UnitPrice"]
# Consolidating across the invoice number for gross value
retailConsol = custDetails.groupby('InvoiceNo')['grossValue'].sum().reset_index()
print(retailConsol.shape)
retailConsol.head()
Figure 3: Aggregated Data

Now that we have got the data consolidated based on each invoice number, let us merge the date related features from the original data frame with this consolidated data. We merge the consolidated data with the custDetails data frame and then drop all the duplicate data so that we get a record per invoice number, along with its date features.

# Merge the other information like date, week, month etc
retail = pd.merge(retailConsol,custDetails[["InvoiceNo",'Parse_date','Weekday','Day','Month','Year','year_month','monthPeriod']],how='left',on='InvoiceNo')
# dropping ALL duplicate values
retail.drop_duplicates(subset ="InvoiceNo",keep = 'first', inplace = True)
print(retail.shape)
retail.head()
Figure 4 : Consolidated data
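The consolidate, merge and drop duplicates steps above can also be done in one pass with named aggregation, summing the gross value and keeping the first date feature seen for each invoice. The snippet below is a sketch on a toy data frame with made up values, and it carries only the ‘Month’ feature to keep the example short.

```python
import pandas as pd

# Toy transactions (hypothetical): invoice A1 has two line items
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'B2'],
    'Quantity': [2, 1, 3],
    'UnitPrice': [2.5, 4.0, 1.0],
    'Month': ['December', 'December', 'January'],
})
toy['grossValue'] = toy['Quantity'] * toy['UnitPrice']

# One row per invoice: total gross value plus the first month recorded for that invoice
per_invoice = (
    toy.groupby('InvoiceNo')
       .agg(grossValue=('grossValue', 'sum'), Month=('Month', 'first'))
       .reset_index()
)
print(per_invoice)
```

This works here because all line items of an invoice share the same date features, so taking the first value is equivalent to keeping the first duplicate.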

Let us first look at the month wise consolidation of data and then plot the data. We will use a function to map each month to its position in the calendar. This is required to plot the data in the order of the months. The function ‘monthMapping’ maps each month name to an integer value and then sorts the data frame on that value.

# Create a map for each month
def monthMapping(mnthTrend):
    # Get the map
    mnthMap = {"January": 1, "February": 2,"March": 3, "April": 4,"May": 5, "June": 6,"July": 7, "August": 8,"September": 9, "October": 10,"November": 11, "December": 12}
    # Create a new feature for month
    mnthTrend['mnth'] = mnthTrend.Month
    # Replace with the numerical value
    mnthTrend['mnth'] = mnthTrend['mnth'].map(mnthMap)
    # Sort the data frame according to the month value
    return mnthTrend.sort_values(by = 'mnth').reset_index()

We will use the above function to consolidate the data according to the months and then plot month wise grossvalue data

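# Plotting libraries used for the charts in this post
import matplotlib.pyplot as plt
import seaborn as sns

mnthTrend = retail.groupby(['Month'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)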
# sort the months in the right order
mnthTrend = monthMapping(mnthTrend)
sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=mnthTrend, x='Month', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can see that there is a fair amount of month on month variability in the data. We will therefore take the month as one of the context items on which the states can be constructed.
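As an aside on the month ordering used for the plot above, another way to get the months in calendar order is to turn the column into an ordered categorical. This is a sketch on a toy data frame with made up values, and the names ‘month_order’ and ‘toy_sorted’ are only for illustration.

```python
import pandas as pd

# Calendar order of the month names as they appear in the 'Month' column
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Toy monthly averages (hypothetical values), deliberately out of order
toy = pd.DataFrame({
    'Month': ['March', 'January', 'December', 'February'],
    'grossValue': [30.0, 10.0, 120.0, 20.0],
})

# An ordered categorical sorts by calendar position instead of alphabetically
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)
toy_sorted = toy.sort_values('Month').reset_index(drop=True)
print(toy_sorted)
```

With this approach no helper column is needed, since the ordering lives in the column type itself.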

Let us now look at buying pattern within each month and check how the buying pattern is within the first 15 days and the latter half

# Aggregating data for the first 15 days and latter 15 days
fortnighTrend = retail.groupby(['monthPeriod'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)

sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=fortnighTrend, x='monthPeriod', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can see that there is a small difference between buying patterns in the first 15 days of the month and the latter half of the month. Even though the difference is not significant, we will still take this difference as another context.

Next let us aggregate the data as per the days of the week and check the trend

# Aggregating data across weekdays
dayTrend = retail.groupby(['Weekday'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)

sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=dayTrend, x='Weekday', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can also see that there is quite a bit of variability in buying patterns across the days of the week. We will therefore take the week days also as another context.

So far we have observed 4 different features, which will become our context for recommending products. The context which we have defined would act as the states from the reinforcement learning setting perspective. Let us now look at the big picture of how we will formulate the recommendation task as reinforcement learning setting.

The Big Picture

Figure 5: The Big Picture

We will now have a look at the big picture of this implementation. The above figure is the representation of what we will implement in code in the next few posts.

The process starts with the customer context, consisting of segment, month, period in the month and day of the week. The combination of all the contexts will form the state. From an implementation perspective we will run simulations to generate the context, since we do not have a real system where customers log in and their context is captured automatically.
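To get a feel for the size of the state space, the combinations of the four context items can be enumerated as below. This is only a sketch: the string encoding with an underscore separator and the integer coding of month and weekday are assumptions made for illustration, and the actual construction of states is taken up in the later posts.

```python
from itertools import product

# Context items (the value coding here is an assumption for illustration)
segments = ['Q1', 'Q2', 'Q3', 'Q4']   # customer segments
months = list(range(1, 13))           # 1 = January ... 12 = December
month_periods = [0, 1]                # 0 = up to the 15th, 1 = after the 15th
weekdays = list(range(7))             # 0 = Monday ... 6 = Sunday

# Every combination of the four context items is one candidate state
states = ['_'.join(str(item) for item in combo)
          for combo in product(segments, months, month_periods, weekdays)]

print(len(states))
print(states[:3])
```

Not every combination need appear in the historical data, so the number of states actually observed can be smaller than this upper bound.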

Based on the context, the system will recommend different products to the customer. From a reinforcement learning context these are the actions which are taken from each state. The initial recommendation of products ( actions taken) will be based on the value function learned from the historical data.

The customer will give rewards/feedback based on the actions taken (products recommended). The feedback would be the manifestation of the choices the customer makes, like the products the customer buys, browses or ignores from the recommended list. Depending on the choice made by the customer, a certain reward will be generated. Again from an implementation perspective, since we do not have real customers giving feedback, we will be simulating the customer feedback mechanism.

Finally, the value functions will be updated from the rewards generated, using the simple averaging method. Based on the value updates, the bandit will learn and adapt to the affinities of the customers in the long run.
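The simple averaging method keeps, for each action, the running mean of the rewards received so far. It can be written incrementally as new value = old value + (reward - old value) / n, where n is the number of times the action has been taken. Below is a minimal sketch with made up rewards. The function name ‘update_value’ is hypothetical and is not the function used later in the series.

```python
def update_value(old_value, reward, n):
    """Incremental sample average after receiving the n-th reward for an action."""
    return old_value + (reward - old_value) / n

# Made up rewards for one recommended product in one state
rewards = [1.0, 0.0, 1.0, 1.0]

value = 0.0
for n, reward in enumerate(rewards, start=1):
    value = update_value(value, reward, n)

# The incremental estimate matches the plain average of the rewards
print(value, sum(rewards) / len(rewards))
```

The incremental form means we only need to store the current value and the count for each state and action, not the full history of rewards.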

What next ?

In this post we explored the data and then got a big picture of what we will implement going forward. In the next post we will start implementing these processes and building a prototype using Jupyter notebook. Later on we will build an application using Python scripts and then explore options to deploy the application. Watch this space for more.

 The next post will be published next week. Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest Git hub repository
