VI : Build and deploy data science products: Machine translation application – From prototype to production. Introduction to the factory model


This is the sixth part of the series in which we continue our pursuit of building a machine translation application. In this post we embark on a transformation process wherein we convert our prototype into production-grade code.

This series comprises 8 posts.

  1. Understand the landscape of solutions available for machine translation.
  2. Explore the sequence-to-sequence model architecture for machine translation.
  3. Deep dive into the LSTM model with a worked-out numerical example.
  4. Understand the backpropagation algorithm for an LSTM model, worked out with a numerical example.
  5. Build a prototype of the machine translation model using a Google Colab / Jupyter notebook.
  6. Build the production-grade code for the training module using Python scripts (this post).
  7. Building the Machine Translation application – From Prototype to Production: Inference process.
  8. Build the machine translation application using Flask and understand the process to deploy the application on Heroku.

In this section we will see how we can take the prototype which we built in the last article and convert it into production-ready code. In the prototype-building phase we were developing our code on a Jupyter/Colab notebook. However, if we have to build and deploy an application, notebooks would not be very effective. We have to convert the code we built on the notebook into production-grade code using Python scripts. We will be progressively building the scripts using a process I call the factory model. Let us see what a factory model is.

Factory Model

A factory model is a modularized process of generating business outcomes using machine learning models. There are some distinct phases in the process, which include

  1. Ingestion/Extraction process : Process of getting data from source systems/locations
  2. Transformation process : Transformation process entails transforming raw data ingested from multiple sources into a form fit for the desired business outcome
  3. Preprocessing process : This process involves a basic level of cleaning of the transformed data.
  4. Feature engineering process : Feature engineering is the process of converting the preprocessed data into features which are required for model training.
  5. Training process : This is the phase where the models are built from the featurized data.
  6. Inference process : The models which were built during the training phase are then utilized to generate the desired business outcomes during the inference process.
  7. Deployment process : The results of the inference process will have to be consumed by some process. The consumer of the inferences could be a BI report, a web service, an ERP application or any downstream application. There is a whole set of processes involved in enabling the downstream systems to consume the results of the inference process. All these steps are called the deployment process.

Needless to say, all these processes are supported by an infrastructure layer, also called the data engineering layer. This layer looks at the most efficient and effective way of running all these processes through modularization and parallelization.

All these processes have to be designed to work together seamlessly to get the business outcomes in the most effective and efficient way. To take an analogy, it is like running a factory where raw materials get converted into a finished product which is then consumed by the end customers. In our case, the raw material is the data, the product is the model generated from the training phase and the consumers are any business processes which use the outcomes generated from the model.

Let us now see how we can execute the factory model to generate the business outcomes.

Project Structure

Before we dive deep into the scripts, let us look at our project structure.

Our root folder is the Machine Translation folder, which contains two subfolders: Data and factoryModel. The Data subfolder contains the raw data. The factoryModel folder contains different subfolders holding the scripts for our processes. We will be looking at each of these scripts in detail in the subsequent sections. Finally, we have two driver files: mt_driver_train.py, the driver file for the training process, and mt_Inference.py, the driver file for the inference process.

Let us first dive into the training phase scripts.

Training Phase

The first part of the factory model is the training phase, which comprises all the processes up to the creation of the model. We will start off by building the supporting files and folders before we get into the driver file. We will first start with the configuration file.

Configuration file

When we were working with the notebook files, we were at liberty to change the parameters we wanted to vary, say for example the path to the input file or hyperparameters like the number of dimensions of the embedding vector, on the notebook itself. However, when an application is in production we would not have the luxury of changing the parameters and hyperparameters directly in the code base. To get over this problem we use configuration files. We consolidate all the parameters and hyperparameters of the model in the configuration file. All processes will pick the parameters from the configuration file for further processing.

The configuration file will be inside the config folder. Let us now build the configuration file.

Open a text editor like Notepad++ or any other editor of your choice, create a new file and name it mt_config.py. Let us start by adding the below code to this file.

'''
This is the configuration file for storing all the application parameters
'''

import os
from os import path


# This is the base path to the Machine Translation folder
BASE_PATH = '/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/BayesianQuest/MT/MachineTranslation'
# Define the path where data is stored
DATA_PATH = path.sep.join([BASE_PATH,'Data/deu.txt'])

In lines 5 and 6 we import the necessary library packages.

In line 10 we define the base path for the application. You need to change this path based on where the application resides on your system. Once the base path is set, the rest of the paths will be derived from it. In line 12 we define the path to the raw data set. Note that we just join the name of the data folder and the raw text file with the base path to get the data path. We will be using the data path to read in the raw data.
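As the application grows, paths to other artefacts, for example the location where the trained model or the tokenizer pickle files will be stored, can be derived from BASE_PATH in exactly the same way. The lines below are only an illustrative sketch; the folder and file names used here are assumptions and should be adapted to your own project structure.

# Illustrative additions to mt_config.py ; folder and file names are assumptions
MODEL_PATH = path.sep.join([BASE_PATH,'factoryModel/output/model.h5'])
ENG_TOK_PATH = path.sep.join([BASE_PATH,'factoryModel/output/eng_tokenizer.pkl'])
DEU_TOK_PATH = path.sep.join([BASE_PATH,'factoryModel/output/deu_tokenizer.pkl'])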

In the config folder there will be another file named __init__.py. This is a special file which tells Python to treat the config folder as a package. This file will be empty, with no code in it.

Loading Data

The next helper files we will build are those for loading the raw file and preprocessing it. The code we use for these purposes is the same code which we used for building the prototype. This file will reside in the dataLoader folder.

In your text editor open a new file, name it datasetloader.py and then add the below code to it.

'''
Factory Model for Machine translation preprocessing.
This is the script for loading the data and preprocessing data
'''

import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# Creating the class to load data and then do the preprocessing as sequence of steps

class textLoader:
	def __init__(self , preprocessors = None):
		# This init method is to store the text preprocessing pipeline
		self.preprocessors = preprocessors
		# Initializing the preprocessors as an empty list if the preprocessors are None
		if self.preprocessors is None:
			self.preprocessors = []

	def loadDoc(self,filepath):
		# This is the function to read the file from the path provided
		# Open the file
		file = open(filepath,mode = 'rt',encoding = 'utf-8')
		# Reading the text
		text = file.read()
		#Once the file is read, applying the preprocessing steps one by one
		if self.preprocessors is not None:
			# Looping over all the preprocessing steps and applying them on the text data
			for p in self.preprocessors:
				text = p.preprocess(text)
				
		# Closing the file
		file.close()
				
		# Returning the text after all the preprocessing
		return text

Before addressing the code block line by line, let us get a big-picture perspective of what we are trying to accomplish. When working with text you would have realised that different sources of raw text require different preprocessing treatments. A preprocessing method which we have used in one circumstance may not be warranted in a different one. So in this code block we are building a template called textLoader, which reads in raw data and then applies different preprocessing steps like a pipeline, as the situation warrants. Each of the individual preprocessing steps will be defined separately. The textLoader class first reads in the data and then applies the selected preprocessors one after the other. Let us now dive into the details of the code.

Lines 6 to 10 import all the necessary library packages for the process.

In line 14 we define the textLoader class. The constructor in line 15 takes the text preprocessing pipeline as its input. The preprocessors are given as a list, with a default value of None. The preprocessors provided in the constructor are stored in line 17. Lines 19-20 initialize an empty list if the preprocessor argument is None. If you haven't got a handle on why the preprocessors are defined this way, it is ok. This will become clearer when we define the actual preprocessors. Just hang on till then.

From line 22 we start the first function within this class. This function reads the raw text and then applies the processing pipeline. Lines 25-27, where we open the text file and read the text, are the same as what we defined during the prototype phase in the last post. We do a check to see if any preprocessor pipeline has been defined in line 29. If a pipeline is defined, the preprocessors are applied on the text one by one in lines 31-32. The .preprocess method is specific to each of the preprocessors in the pipeline. This method will become clear once we take a look at each of the preprocessors. We finally close the raw file and return the processed text in lines 35-38.
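To make the pipeline contract concrete, below is a minimal, hypothetical sketch of how textLoader could be used with a toy preprocessor. The lowerCase class and the file path used here are purely illustrative and are not part of the application; the only requirement the pipeline imposes is that each preprocessor exposes a preprocess() method which takes text and returns text.

# A toy preprocessor, shown only to illustrate the pipeline contract
class lowerCase:
	def preprocess(self, text):
		# Any object exposing a preprocess() method can be slotted into the pipeline
		return text.lower()

# Hypothetical usage : preprocessors are applied in the order they are listed
tL = textLoader(preprocessors = [lowerCase()])
processedText = tL.loadDoc('/path/to/some/rawfile.txt')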

The __init__.py file inside this folder will contain the following line for importing the textLoader class from the datasetloader.py file for any calling script.

from .datasetloader import textLoader

Processing Data : Preprocessing pipeline construction

Next we will create the files for preprocessing the text. In the last section we saw how the raw data was loaded and then the preprocessing pipeline was applied. In this section we look into the preprocessing pipeline itself. The folder structure will be as shown in the figure.

There will be three preprocessor classes for processing the raw data.

  • SentenceSplit : Preprocessor to split the raw text into pairs of English and German sentences. This class is inside the file splitsentences.py
  • cleanData : Preprocessor to apply cleaning steps like removing punctuation and extra whitespace. This class is included in the datacleaner.py file.
  • TrainMaker : Preprocessor to tokenize the text and then prepare the train and validation sets. This class is contained in the tokenizer.py file

Let us now dive into each of the preprocessors.

Open a new file and name it splitsentences.py. Add the following code to this file.

'''
Script for preprocessing of text for Machine Translation
This is the class for splitting the text into sentences
'''

import string
from numpy import array

class SentenceSplit:
	def __init__(self,nrecords):
		# Creating the constructor for splitting the sentences
		# nrecords is the parameter which defines how many records you want to take from the data set
		self.nrecords = nrecords
		
	# Creating the new function for splitting the text
	def preprocess(self,text):
		sen = text.strip().split('\n')
		sen = [i.split('\t') for i in sen]
		# Saving into an array
		sen = array(sen)
		# Return only the first two columns as the third column is metadata. Also select the number of rows required
		return sen[:self.nrecords,:2]

This is the first of our preprocessors. This preprocessor splits the raw text and finally outputs an array of English and German sentence pairs.

After we import the required packages in lines 6-7, we define the class in line 9. We pass a variable nrecords to the constructor to subset the raw text and select the number of rows we want to include for training.

The preprocess function starts in line 16. This is the function which we were accessing in line 32 of the textLoader class which we discussed in the last section. The rest is the same code we have used in the prototype building phase which includes

  • Splitting the text into sentences in line 17
  • Splitting each sentence on tab spaces to get the German and English sentences (line 18)

Finally we convert the processed sentences into an array and return only the first two columns of the array. Please note that the third column contains metadata of each line and therefore we exclude it from the returned array. We also subset the array based on the number of records we want.
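As a quick, hedged illustration of what this preprocessor does, consider two raw lines in the tab-separated format of the deu.txt file (English sentence, German sentence, attribution metadata). The text below is only indicative and is not taken from the actual file.

# Indicative raw text in the deu.txt format : English \t German \t metadata
rawText = "Hi.\tHallo!\tCC-BY 2.0 ...\nRun!\tLauf!\tCC-BY 2.0 ..."
ss = SentenceSplit(nrecords = 2)
print(ss.preprocess(rawText))
# Returns a (2, 2) array of [English, German] sentence pairs ; the metadata column is dropped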

Now that the first preprocessor is complete, let us create the second one.

Open a new file and name it datacleaner.py and copy the below code.

'''
Script for preprocessing data for Machine Translation application
This is the class for removing punctuation from sentences and also converting them to lower case
'''

import string
from numpy import array
from unicodedata import normalize

class cleanData:
	def __init__(self):
		# Creating the constructor for removing punctuations and lowering the text
		pass
		
	# Creating the function for removing the punctuations and converting to lowercase
	def preprocess(self,lines):
		cleanArray = list()
		for docs in lines:
			cleanDocs = list()
			for line in docs:
				# Normalising unicode characters
				line = normalize('NFD', line).encode('ascii', 'ignore')
				line = line.decode('UTF-8')
				# Tokenize on white space
				line = line.split()
				# Removing punctuations from each token
				line = [word.translate(str.maketrans('', '', string.punctuation)) for word in line]
				# convert to lower case
				line = [word.lower() for word in line]
				# Remove tokens with numbers in them
				line = [word for word in line if word.isalpha()]
				# Store as string
				cleanDocs.append(' '.join(line))
			cleanArray.append(cleanDocs)
		return array(cleanArray)

This preprocessor is to clean the array of German and English sentences we received from the earlier preprocessor. The cleaning steps are the same as what we have seen in the previous post. Let us quickly dive in and understand the code block.

We start off by defining the cleanData class in line 10. The preprocess method starts in line 16 with the array from the previous preprocessing step as its input. We define two placeholder lists in lines 17 and 19. In line 20 we loop through each sentence pair of the array and carry out the following cleaning operations

  • Lines 22-23 : Normalise the unicode characters
  • Line 25 : Split the text on whitespace
  • Line 27 : Remove punctuation from each token
  • Line 29 : Convert the text to lower case
  • Line 31 : Remove tokens containing numbers

Finally, in line 33 all the tokens of a sentence are joined back together and appended to the cleanDocs list. In line 34 each cleaned sentence pair is appended to the cleanArray list, which is converted into an array and returned in line 35.
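To see these cleaning steps in action on a couple of sentence pairs, here is a small illustrative example; the input array mirrors the output format of the SentenceSplit preprocessor and the sentences are made up.

from numpy import array
# Illustrative input : an array of [English, German] sentence pairs
pairs = array([["Run!", "Lauf!"], ["I'm OK.", "Es geht mir gut."]])
cd = cleanData()
print(cd.preprocess(pairs))
# Expected output : [['run' 'lauf'] ['im ok' 'es geht mir gut']]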

Let us now explore the third preprocessor.

Open a new file and name it tokenizer.py. This file is pretty long and therefore we will go over it function by function. Let us explore the file in detail.

'''
This class has methods for tokenizing the text and preparing train and test sets
'''

import string
import numpy as np
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


class TrainMaker:
	def __init__(self):
		# Creating the constructor for creating the tokenizers
		pass
	
	# Creating an internal function for tokenizing the text	
	def tokenMaker(self,text):
		tokenizer = Tokenizer()
		tokenizer.fit_on_texts(text)
		return tokenizer	

We import all the required packages in lines 5-10, after which we define the constructor in lines 13-16. There is nothing going on in the constructor, so we can conveniently pass over it.

The first function starts on line 19. This is a function we are familiar with from the previous post. It fits a tokenizer on the given text. The first step is to instantiate the tokenizer object in line 20 and then fit it on the provided text in line 21. Finally, the tokenizer object which has been fit on the text is returned in line 22. This function will be used for creating the tokenizer dictionaries for both the English and German text.
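The snippet below is a small, made-up example of what a fitted tokenizer holds; it only illustrates the word_index mapping which the downstream functions rely on.

from tensorflow.keras.preprocessing.text import Tokenizer
# Illustrative corpus of cleaned English sentences
corpus = ['tom is here', 'tom is happy', 'i am happy']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
print(tokenizer.word_index)
# e.g. {'tom': 1, 'is': 2, 'happy': 3, 'here': 4, 'i': 5, 'am': 6}
# The vocabulary size used later is len(tokenizer.word_index) + 1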

The next function we will see is sequenceMaker. In the previous post we saw how we convert text into sequences of integers. The sequenceMaker function is used for this task.

		
	# Creating an internal function for encoding and padding sequences
	
	def sequenceMaker(self,tokenizer,stdlen,text):
		# Encoding sequences as integers
		seq = tokenizer.texts_to_sequences(text)
		# Padding the sequences with respect to the standard length
		seq = pad_sequences(seq,maxlen=stdlen,padding = 'post')
		return seq

The inputs to the sequenceMaker function on line 26 are the tokenizer, the maximum length of a sequence and the raw text which needs to be converted to sequences. First the text is converted to sequences of integers in line 28. As the sequences have to be of a standard length, they are padded to the maximum length in line 30. The standard-length integer sequences are then returned in line 31.
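Continuing the illustrative tokenizer example from the previous function, the lines below show how texts_to_sequences and pad_sequences behave; the standard length of 5 is an arbitrary choice made only for the illustration.

from tensorflow.keras.preprocessing.sequence import pad_sequences
# Using the illustrative tokenizer fitted above
seq = tokenizer.texts_to_sequences(['tom is happy'])
print(seq)                                               # e.g. [[1, 2, 3]]
print(pad_sequences(seq, maxlen = 5, padding = 'post'))  # e.g. [[1 2 3 0 0]]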

		
	# Creating another function to find the maximum length of the sequences	
	def qntLength(self,lines):
		doc_len = []
		# Getting the length of all the language sentences
		[doc_len.append(len(line.split())) for line in lines]
		return np.quantile(doc_len, .975)

The next function we define finds the quantile length of the sentences. As seen in the previous post, we made the standard length of the sequences equal to the 97.5 % quantile length of the respective text corpus. The function starts in line 34, where the complete text is given as input. We then create a placeholder list in line 35. In line 37 we parse through each line and find the total length of the sentence. The length of each sentence is stored in the placeholder list we created earlier. Finally, in line 38 the 97.5 % quantile of the lengths is returned to get the standard length.
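The quantile calculation can be verified quickly on a toy list of sentences; the sentences and numbers below are only for illustration.

import numpy as np
# Illustrative sentences of lengths 2, 3 and 5 words
lines = ['hallo welt', 'wie geht es', 'ich bin sehr froh heute']
doc_len = [len(line.split()) for line in lines]
print(doc_len)                     # [2, 3, 5]
print(np.quantile(doc_len, .975))  # 4.9 , which (after casting to int) is used as the standard length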

		
	# Creating the function for creating tokenizers and also creating the train and test sets from the given text
	def preprocess(self,docArray):
		# Creating tokenizer for English sentences
		eng_tokenizer = self.tokenMaker(docArray[:,0])
		# Finding the vocabulary size of the tokenizer
		eng_vocab_size = len(eng_tokenizer.word_index) + 1
		# Creating tokenizer for German sentences
		deu_tokenizer = self.tokenMaker(docArray[:,1])
		# Finding the vocabulary size of the tokenizer
		deu_vocab_size = len(deu_tokenizer.word_index) + 1
		# Finding the maximum length of English and German sequences
		eng_length = self.qntLength(docArray[:,0])
		ger_length = self.qntLength(docArray[:,1])
		# Splitting the train and test set
		train,test = train_test_split(docArray,test_size = 0.1,random_state = 123)
		# Calling the sequence maker function to create sequences of both train and test sets
		# Training data
		trainX = self.sequenceMaker(deu_tokenizer,int(ger_length),train[:,1])
		trainY = self.sequenceMaker(eng_tokenizer,int(eng_length),train[:,0])
		# Validation data
		testX = self.sequenceMaker(deu_tokenizer,int(ger_length),test[:,1])
		testY = self.sequenceMaker(eng_tokenizer,int(eng_length),test[:,0])
		return eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,docArray,trainX,trainY,testX,testY,eng_length,ger_length

We tie all the earlier functions together in the preprocess method starting in line 41. The input to this function is the array of English and German sentence pairs. The various processes under this function are

  • Line 43 : Tokenizing English sentences using the tokenizer function created in line 19
  • Line 45 : We find the vocabulary size for the English corpus
  • Lines 47-49 the above two processes are repeated for German corpus
  • Lines 51-52 : The standard lengths of the English and German sentences are found out
  • Line 54 : The array is split into train and test sets.
  • Line 57 : The input sequences for the training set are created using the sequenceMaker() function. Please note that the German sentences are the input variable (trainX).
  • Line 58 : The target sequence which is the English sequence is created in this step.
  • Lines 60-61: The input and target sequences are created for the test set

All the variables and the train and test sets are returned in line 62.

The __init__.py file inside this folder will contain the following lines

from .splitsentences import SentenceSplit
from .datacleaner import cleanData
from .tokenizer import TrainMaker

That takes us to the end of the preprocessing steps. Let us now start the model building process.

Model building Scripts

Open a new file and name it mtEncDec.py. Copy the following code into the file.

'''
This is the script and template for different models.
'''

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import TimeDistributed

class ModelBuilding:
	@staticmethod
	def EncDecbuild(in_vocab,out_vocab, in_timesteps,out_timesteps,units):
		# Initializing the model with Sequential class
		model = Sequential()
		# Initiating the embedding layer for the text
		model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
		# Adding the first LSTM layer
		model.add(LSTM(units))
		# Using the RepeatVector to map the input sequence length to output sequence length
		model.add(RepeatVector(out_timesteps))
		# Adding the second layer of LSTM 
		model.add(LSTM(units, return_sequences=True))
		# Adding the fully connected layer with a softmax layer for getting the probability
		model.add(TimeDistributed(Dense(out_vocab, activation='softmax')))
		# Compiling the model
		model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
		# Printing the summary of the model
		model.summary()
		return model

The model-building script is straightforward. Here we implement the encoder-decoder model we described extensively in the last post.

We start by importing all the necessary packages in lines 5-10. We then get to the meat of the model by defining the ModelBuilding class in line 12. The model we are using for our application is defined through a function EncDecbuild in line 14. The inputs to the function are the

  • in_vocab : This is the size of the German vocabulary
  • out_vocab : This is the size of the English vocabulary
  • in_timesteps : The standard sequence length of the German sentences
  • out_timesteps : Standard sequence length of English sentences
  • units : Number of hidden units for the LSTM layers.

The progressive building of the model was covered extensively in the last post. Let us quickly run through the same here

  • In line 16 we initialize the Sequential class
  • The next layer is the Embedding layer defined in line 18. This layer converts the text to word embedding vectors. The inputs are the German vocabulary size, the dimension required for the word embeddings and the sequence length of the input sequences. In this example we have kept the dimension of the word embedding same as the number of units of LSTM. However this is a parameter which can be experimented with.
  • Line 20, we initialize our first LSTM unit.
  • We then perform the Repeat vector operation in Line 22 so as to make the mapping between the encoder time steps and decoder time steps
  • We add our second LSTM layer for the decoder part in Line 24.
  • The next layer is the dense layer whose output size is equal to the English vocabulary size.(Line 26)
  • Finally we compile the model using ‘adam’ optimizer and then summarise the model in lines 28-30

So far we explored the file ecosystem for our application. Next we will tie all these together in the driver program.

Driver Program

Open a new file and name it mt_driver_train.py and start adding the following code blocks.

'''
This is the driver file which controls the complete training process
'''

from factoryModel.config import mt_config as confFile
from factoryModel.preprocessing import SentenceSplit,cleanData,TrainMaker
from factoryModel.dataLoader import textLoader
from factoryModel.models import ModelBuilding
from tensorflow.keras.callbacks import ModelCheckpoint
from factoryModel.utils.helperFunctions import *

## Define the file path to input data set
filePath = confFile.DATA_PATH

print('[INFO] Starting the preprocessing phase')

## Load the raw file and process the data
ss = SentenceSplit(50000)
cd = cleanData()
tm = TrainMaker()

Let us first look at the library imports. In line 5 we import the configuration file which we defined earlier. Please note the folder structure we implemented for the application. The configuration file is imported from the config folder which is inside the folder named factoryModel. Similarly, in line 6 we import all three preprocessing classes from the preprocessing folder. In line 7 we import the textLoader class from the dataLoader folder and finally in line 8 we import the ModelBuilding class from the models folder.

The first task is to get the paths which we defined in the configuration file. We get the path to the raw data in line 13.

In lines 18-20 we instantiate the preprocessor classes, starting with SentenceSplit, then cleanData and finally TrainMaker. Please note that we pass a parameter to the SentenceSplit(50000) class to indicate that we want only 50000 rows of the raw data for processing.

Having seen the three preprocessing classes, let us now see how these preprocessors are tied together in a pipeline to be applied sequentially on the raw text. This is achieved in the next code block.

# Initializing the data set loader class and then executing the processing methods
tL = textLoader(preprocessors = [ss,cd,tm])
# Load the raw data, preprocess it and create the train and test sets
eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,text,trainX,trainY,testX,testY,eng_length,ger_length = tL.loadDoc(filePath)

In line 21 we instantiate the textLoader class. Please note that all the preprocessing classes are given sequentially as a list parameter to this class. This way we ensure that each of the preprocessors is applied one after the other when the loading method is executed. Please take some time to review the textLoader class earlier in the post to understand the dynamics of the loading and preprocessing steps.

In line 23 we call the loadDoc function, which takes the path of the data set as its input. There are a lot of processes going on in this method.

  • First, the raw text is loaded using the file path provided.
  • On the raw text which is loaded, the three preprocessors are applied one after the other.
  • The last preprocessing step returns all the required data sets, like the train and test sets, along with the variables we require for modelling.

We now come to the end of the preprocessing step. Next we take the preprocessed data and train the model.

Training the model

We have already built all the necessary scripts required for training. We will tie all those pieces together in the training phase. Enter the following lines of code in our script

### Initiating the training phase #########
# Initialise the model
model = ModelBuilding.EncDecbuild(int(deu_vocab_size),int(eng_vocab_size),int(ger_length),int(eng_length),256)
# Define the checkpoints
checkpoint = ModelCheckpoint('model.h5',monitor = 'val_loss',verbose = 1, save_best_only = True,mode = 'min')
# Fit the model on the training data set
model.fit(trainX,trainY,epochs = 50,batch_size = 64,validation_data=(testX,testY),callbacks = [checkpoint],verbose = 2)

In line 34 we initialize the model object. Please note that when we built the script, ModelBuilding was the name of the class and EncDecbuild was the method or function under the class. This is how we initialize the model object in line 34. The various parameters we give are the German and English vocabulary sizes, the sequence lengths of the German and English sentences and the number of units for the LSTM (which is what we adopt for the embedding size as well). We define the checkpoint variable in line 36.

We start the model fitting in line 38. At the end of the training process the best model is saved to disk as model.h5, the filename we provided to the ModelCheckpoint callback.

Saving the other files and variables

Once the training is done the model is stored as a model.h5 file. However, we need to save other files and variables as pickle files so that we can utilise them during our inference process. We will create a script where we store all such utility functions for saving data. This script will reside in the utils folder. Open a new file, name it helperFunctions.py and copy the following code.

'''
This script lists down all the helper functions which are required for processing raw data
'''

from pickle import load
from numpy import argmax
from tensorflow.keras.models import load_model
from pickle import dump

def save_clean_data(data,filename):
    dump(data,open(filename,'wb'))
    print('Saved: %s' % filename)

In lines 5-8 we import all the necessary packages.

The first function we create is used to dump any data as a pickle file; it is defined in line 10. The parameters are the data and the filename under which we want to save it.

Line 11 dumps the data as a pickle file with the filename we have provided. We will be using this utility function to save all the files and variables after the training phase.
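During the inference phase we will need to read these pickle files back. A matching load utility, sketched below, can live in the same helperFunctions.py script; the function name load_files is an assumption and can be changed to suit your convention.

def load_files(filename):
    # Hypothetical counterpart of save_clean_data : reads a pickle file back from disk
    data = load(open(filename,'rb'))
    print('Loaded: %s' % filename)
    return data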

In our training driver file mt_driver_train.py add the following lines

### Saving the tokenizers and other variables as pickle files
save_clean_data(eng_tokenizer,'eng_tokenizer.pkl')
save_clean_data(eng_vocab_size,'eng_vocab_size.pkl')
save_clean_data(deu_tokenizer,'deu_tokenizer.pkl')
save_clean_data(deu_vocab_size,'deu_vocab_size.pkl')
save_clean_data(trainX,'trainX.pkl')
save_clean_data(trainY,'trainY.pkl')
save_clean_data(testX,'testX.pkl')
save_clean_data(testY,'testY.pkl')
save_clean_data(eng_length,'eng_length.pkl')
save_clean_data(ger_length,'ger_length.pkl')

In lines 42-52 we save all the variables we received from the loadDoc call as pickle files.

Executing the script

Now that we have completed all the scripts, let us go ahead and execute them. Open a terminal and run the following command.

$ python mt_driver_train.py

All the scripts will be executed and finally the model files and other variables will be stored on disk. We will be using all the saved files in the inference phase. We will address the inference phase in the next post of the series.

Go to article 7 of this series : From prototype to production: Inference Process

You can download the notebook for the prototype using the following link

https://github.com/BayesianQuest/MachineTranslation/tree/master/Production

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhancement, the more empowered you become. The best way to acquire knowledge is by practical application, or learning by doing. If you are inspired by the prospect of being empowered by practical knowledge in machine learning, I would recommend two books I have co-authored. The first one specialises in deep learning with practical hands-on exercises and interactive video and audio aids for learning.

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Data Science for Predictive Maintenance

Over the past few months, many people have been asking me to write on what it entails to do a data science project end to end, i.e. from the business problem definition phase to modelling and its final deployment. When I pondered that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods with code on how to deal with a problem. However, an end-to-end view of what it takes to do a data science project for a specific business use case is a little hard to find. In this post I will give an end-to-end perspective on tackling a business use case within the framework of data science. We will deal with a predictive maintenance business use case. The use case involved is to predict the end life of large industrial batteries.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture on the life cycle of a data science project

Data Science Process

The above figure is a depiction of the big picture on what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is with a business problem. The problem we have at hand is to try to predict the end life of large industrial batteries. When we encounter such a business problem, the first thing which should come to our mind is the key variables which will come into play. For this specific example of batteries, some of the key variables which determine the state of health of batteries are conductance, discharge, voltage, current and temperature.

The next question we need to ask is about the lead indicators or trends within these variables which will help in solving the business problem. This is where we also have to take inputs from the domain team. For the case of batteries, it turns out that a key trend which can indicate a propensity for failure is a drop in conductance values. The conductance of batteries will drop over time; however, the rate at which the conductance values drop will be accelerated before points of failure. This is a vital clue which we will have to be cognizant of when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge, the voltage will initially drop to a minimum level and then it will regain the voltage. This is called the “Coup de Fouet” effect. Every manufacturer of batteries prescribes standards and control charts as to how much the voltage can drop and how the regaining process should be. Any deviation from these standards and control charts would mean anomalous behaviour. This is another set of indicators which we will have to look out for when we explore the data.

In addition to the above two indicators there are many other factors which one has to be aware of which will indicate failure. During the business exploration phase we have to identify all such factors which are related to the business problem we are set out to solve and formulate hypotheses about them. Once we formulate our hypotheses, we have to look out for evidence/trends within the data supporting these hypotheses. With respect to the two variables which we have discussed above, some hypotheses we can formulate are the following.

  1. Gradual drop in conductance over time entails normal behaviour and sudden drop would mean anomalous behaviour
  2. Deviation from the manufacturer-prescribed “Coup de Fouet” effect would indicate anomalous behaviour

When we go about exploring the data, hypotheses like the above will be points of reference in terms of the trends we will have to look out for in the variables involved. The more hypotheses we formulate based on domain expertise, the better it will be at the exploratory stage. Now that we have seen what the business discovery phase entails, let us encapsulate our discussions on the key considerations within this phase

  1. Understand the business problem which we are set out to solve
  2. Identify all key variables related to the business problem
  3. Identify the lead indicators within these variables which will help in solving the business problem.
  4. Formulate hypotheses about the lead indicators

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective, it is time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to try to understand some critical aspects about how data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following

  • Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase ?
  • What is the mechanism of data capture ? Does the data capture mechanism differ according to the variables ?
  • What is the frequency of data capture ? Does it vary across the variables ?
  • Does the volume of data captured, vary according to the frequency and variables involved ?

In the case of the battery prediction problem, there are three different data sets. These data sets pertain to different sets of variables. The frequency of data collection and the volume of data captured also vary. Some of the key data sets involved are the following

  • Conductance data set : Data pertaining to the conductance of the batteries. This is collected every 2-3 days. Some of the key data points collected along with the conductance data include
    • Time stamp when the conductance data was taken
    • Unique identifier for each battery
    • Other related information like manufacturer, installation location, model, string it was connected to etc.
  • Terminal voltage data : Data pertaining to the voltage and temperature of the battery. This is collected every day. Key data points include
    • Voltage of the battery
    • Temperature
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc.
  • Discharge data : Discharge data is collected once every 3 months. Key variables include
    • Discharge voltage
    • Current at which the voltage discharges
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc.

Data sets for battery end life prediction

As seen, we have to play around with three very distinct data sets with different sets of variables, different frequencies at which the data points arrive and different volumes of data for each of the variables involved. One of the key challenges one would encounter is in connecting all these variables together into a coherent data set which will help in the predictive task. It would be easier to get this done if we can formulate the predictive problem by connecting the available data sets to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem and then connect it with the data points we have at hand. The predictive problem requires us to predict two things

  1. Which battery will fail &
  2. In which period of time in the future the battery will fail.

Since the prediction is at a battery level, our unit of reference for formulating the predictive problem is individual battery. This means that all the variables which are present across the multiple data sets have to be consolidated at the individual battery level.

The next question is, at what period of time do we have to consolidate the variables for each battery? To answer this question, we will have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are captured at different intervals. In addition, the volume of data collected for each of those variables at those instances of time also varies substantially.

  • Conductance : One reading of a battery captured once every 3 days.
  • Voltage & Temperature : 4-5 readings per battery captured every day.
  • Discharge : A set of readings captured every second at different intervals of a day, once every 3 months (approximately 4500 – 5000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in future, we will have to have our model learn the behavior of these variables across time periods. However we have to select a time period, where we will have sufficient data points for each of the variables. The ideal time period we should choose in this scenario is every 3 months as discharge data is available only once every 3 months. This would mean that all the data points for each battery for each variable would have to be consolidated to a single record for every 3 months. So if each battery has around 3 years of data it would entail 12 records for a battery.

Another aspect we have to look at is how 3 months of data points for a battery can be consolidated into one record corresponding to each variable. For this we have to resort to some suitable consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering. We will deal with those aspects in detail when we talk about the exploratory analysis and feature engineering phases.

The next important point which we have to deal with is the labeling of the response variable. Since the business problem is to predict which battery would fail, the response variable would be classifying whether a record of a battery falls under a failure class or not. However there is a lacuna in this approach. What we want is to predict well ahead of time when a battery is likely to fail, and therefore we will have to factor the “when” part into the classification task. This would entail looking at samples of batteries which have actually failed and identifying the point of time when failure happened. We label that point as the “failure point” and then look back in time from the failure point to classify the periods leading to failure. Since the consolidation period for data points is three months, we can fix the “looking back” period also to be 3 months. This would mean that for those samples of batteries where we know the failure point, we look at the record which is one time period (3 months) before failure and label it as 1 period before failure; the record which corresponds to 6 months before failure will be labelled as 2 periods before failure, and so on. We can continue labeling the data according to periods before failure till we reach a comfortable point in time ahead of failure (say 1 year). If the comfortable period we have in mind is 1 year, we would have 4 failure classes, i.e. 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records before the 1 year period of time can be labelled as “Normal Periods”. This labeling strategy will mean that our predictive problem is a multinomial classification problem, with 5 classes (4 failure period classes and 1 normal period class).

The labeling strategy discussed above is for samples of batteries within our data set which have actually failed and where we know when the failure happened. However, if we do not have information about which batteries have failed and which have not, we have to resort to intense exploratory analysis to first determine samples of batteries which have failed and then label them according to the labeling strategy discussed above. We will discuss how we can use exploratory analysis to identify batteries which have failed in the next post. Needless to say, the records of all batteries which have not failed will be labelled as “Normal Periods”.
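To make the labeling strategy concrete, the sketch below shows one way of coding it, assuming the consolidated data for a single battery sits in a pandas DataFrame with one row per 3-month period and that the failure period for that battery is known. The column names period and label, and the function name itself, are assumptions made purely for illustration.

import pandas as pd

def label_battery(df, failure_period, lookback = 4):
    # df : one battery's consolidated records, one row per 3-month period
    # failure_period : the period index at which this battery is known to have failed
    # Records 1 to lookback periods before failure get a failure class,
    # everything earlier is labelled as a normal period
    def make_label(period):
        gap = failure_period - period
        if 1 <= gap <= lookback:
            return '%d_periods_before_failure' % gap
        return 'normal_period'
    return df.assign(label = df['period'].apply(make_label))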

Now that we have seen the predictive problem formulation part, let us recap our discussions so far. The predictive problem formulation step involves the following

  1. Understand the business problem and formulate the response variables.
  2. Identify the unit of reference to which the business problem will apply ( each battery in our case)
  3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables are generated
  4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
  5. From the data set, identify those units which have failed and which have not failed. Such information will generally be available from past maintenance contracts for each unit.
  6. Adopt a labeling strategy for both the failed units and normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records as failure classes till a convenient period (1 year in this case). All records before that period will be labelled the same as the units which have not failed (“Normal Periods”).

So far we discussed the first three phases of the data science process, namely business discovery, data discovery and data preparation. The next phase we will discuss is one of the critical steps of the process, namely exploratory analysis. It is in this phase that we leverage domain knowledge and observe our hypotheses in the data.

Exploratory Analysis – Unravelling latent trends

This phase entails digging deep to get a feel of the data and extract intuitions for feature engineering. When embarking upon exploratory analysis, it would be a good idea to get inputs from domain team on the relation between variables and the business problem. Such inputs are often the starting point for this phase.

Let us now get to the context of our preventive maintenance problem and evolve a philosophy for exploratory analysis. In the case of industrial batteries, a key variable which affects the state of health of a battery is its conductance. It turns out that an indicator of the failing health of a battery is a precipitous drop in conductance. Armed with this information, our next task should be to identify, from our available data set, batteries that have a higher probability of failing. Since a precipitous fall in conductance is an indicator of failing health, the conductance data of unhealthy batteries will have more variance than that of the normal ones. So the best way to separate failing batteries from the normal ones would be to apply some consolidating metric like standard deviation or variance on the conductance data and further drill deep into samples which stand apart from the normal population.


Separating potential failure cases

The above is a plot depicting the standard deviation of conductance for all batteries. What might be of interest to us is the red zone, which we can call the “Potential Failure Zone”. The potential failure zone consists of those batteries whose conductance values show high standard deviation. Batteries with failing health are likely to exhibit a large fall in conductance and, as a corollary, their values will also show higher standard deviation. This implies that the samples of batteries which have a higher probability of failure will in all likelihood be from this failure zone. However, to ascertain this hypothesis we will have to dig deep into the batteries in the failure zone and look for patterns which might differentiate them from normal batteries. Another objective of digging deep is also to elicit clues from the underlying patterns on what features to include in the predictive model. We will discuss more on feature extraction when we discuss feature engineering. Now let us come back to our discussion on digging deep into the failure zone and ferreting out significant patterns. It has to be noted that in addition to the samples in the failure zone we will also have to observe patterns from the normal zone to help separate the wheat from the chaff. Intuitions derived by observing different patterns will become vital during the feature engineering stage.
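A hedged sketch of this consolidation step is given below, assuming the conductance readings sit in a long-format pandas DataFrame with battery_id and conductance columns; the column names and the quantile threshold are assumptions for illustration, not part of the actual project.

import pandas as pd

def flag_potential_failures(cond_df, quantile = 0.95):
    # Consolidate each battery's conductance readings into a single standard deviation value
    battery_std = cond_df.groupby('battery_id')['conductance'].std()
    # Batteries whose variability stands apart from the population fall in the potential failure zone
    threshold = battery_std.quantile(quantile)
    return battery_std[battery_std > threshold].index.tolist()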

Identifying failure zones by comparison

The above figure is a comparison of patterns from either zone. The figure on the left is from the failure zone and the one on the right is from the normal zone. We can clearly see how the precipitous fall is manifested in the sample from the failure zone. The other aspect to note is the magnitude of the fall. Every battery will have degrading conductance over time. However, the magnitude of degradation is what differentiates the unhealthy battery from a normal one. We can observe from the plot on the left that the fall in conductance is more than 50%, whereas for the battery on the right the drop is more muted. Another aspect we can observe is the slope of the conductance. As evident from the two plots, the slope of the conductance profile for the battery on the left is much steeper over time than the one on the right. These intuitions which we have derived so far might become critical in the overall scheme of feature engineering and modelling. Similar to the intuitions we have unearthed so far, more could be extracted by observing more samples. The philosophy behind exploratory analysis entails visualizing more and more samples, observing patterns and extracting clues for feature engineering. The more time we spend doing this, the more ammunition we get for feature engineering.

Let us now try to encapsulate the philosophy of exploratory analysis in few steps

  1. Take inputs from domain team related to the problem we are trying to solve. In our case the clue which we got was the relation between conductance and health of batteries.
  2. Identify any consolidating metric for the variable under consideration to separate out anomalous samples. In the example above we used standard deviation of conductance values to find anomalies.
  3. Once the samples are demarcated using the consolidation metric, visualize samples from different sets to identify discernible patterns in data.
  4. From the patterns we observe, root out clues for feature engineering. In our example we identified that the % fall in conductance and the slope of conductance over time could be potential features.

Multivariate Exploration

So far we were limited to the analysis of a single variable, i.e. conductance. However, to get more meaningful insights we have to connect other variables, layer by layer, to the initial variable we have analysed. As far as the battery is concerned, some of the critical variables other than conductance are voltage and discharge. Let us connect these two variables along with the conductance profile to gain more intuition from the data.

Combining different variables to observe trends

The above figure is a plot which depicts three variables across the same time span. The idea of plotting multiple variables together across a common time span is to unearth any discernible trends we can see together. A cursory look at this plot will reveal some obvious observations.

  1. The fall in current and voltage in conjunction with drop in conductance.
  2. The cyclic nature of the voltage profile.
  3. A gradual drop in the troughs of the voltage profile.

Having made some observations, we now need to ascertain whether these observations can be codified into some definitive trends. This can be verified only by observing plots for many samples of similar variables. By sampling data pertaining to many batteries, if we get similar observations then we can be sure that we have unearthed some trends explaining the behaviour of different variables. However, just unearthing some trends will not suffice. We have to derive some intuitions from such trends which will help in transforming the raw variables into a form which will help in the modelling task. This is achieved by feature engineering the raw variables.

Feature Engineering

Many a time the given set of raw variables will not suffice for extracting the required predictive power from the model. We will have to transform the raw variables to generate new variables which give us the extra thrust towards better predictive metrics. What transformation has to be done will be based on the intuitions we built during the exploratory analysis phase and also on domain knowledge. For the case of batteries, let us revisit some of the intuitions we built during the exploratory analysis phase and see how they can be used for feature engineering.

During our discussions with the domain team we found out that a precipitous fall in conductance is an indicator of the failing health of a battery. So a probable feature we can extract from the conductance variable is the slope of the data points over a fixed time span. The rationale for such a feature is this: if a precipitous fall in conductance over time is an indicator of failing health, then the slope of the data points for a battery which is failing will be steeper than that of a battery which is healthy. It was observed that such a transformation had a positive influence on the predictive metrics. The dynamics of the transformation are as follows: if we have conductance data for a battery for three years, we can take consecutive three-month windows of conductance data, take the slope of the data points in each window and make it a feature. By doing this, the number of rows of data for the variable also gets consolidated to a much smaller number.
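One way of realising this transformation is sketched below: for each battery the conductance readings are grouped into consecutive three-month windows and the slope of a simple linear fit over each window becomes a feature. The column names, the resampling rule and the assumption of a datetime timestamp column are all illustrative.

import numpy as np
import pandas as pd

def conductance_slope_features(cond_df):
    # cond_df : assumed columns battery_id, timestamp (datetime) and conductance
    cond_df = cond_df.set_index('timestamp')
    slopes = []
    for battery, grp in cond_df.groupby('battery_id'):
        # Consecutive three-month windows of conductance readings
        for period, window in grp.resample('3M'):
            if len(window) < 2:
                continue
            x = np.arange(len(window))
            # Slope of a first order polynomial fit over the window
            slope = np.polyfit(x, window['conductance'].values, 1)[0]
            slopes.append({'battery_id': battery, 'period': period, 'cond_slope': slope})
    return pd.DataFrame(slopes)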

Let us also look at another example of feature engineering which we can introduce for the variable discharge voltage. As seen from the above figure, the discharge voltage follows a wave-like profile. It turns out that when a battery discharges, the voltage first drops and then rises. This is the “Coup de Fouet” (CDF) effect mentioned earlier. Now our thought should be, how do we combine the observed wave-like pattern and the knowledge about CDF into a feature? Again we have to dig into domain knowledge. As per the theory on the state of health of batteries, there are standards for the CDF profile of a healthy battery and that of a failing battery. These are prescribed by the manufacturer of the battery. For example, the manufacturing standards prescribe a certain depth to which the voltage will fall during discharge and a certain height to which it will rise during a typical CDF effect. The deviance between the observed CDF and the manufacturer-prescribed standard can be taken as another feature. Similarly we can also think of other features related to voltage, like the depth of discharge (DOD), number of cycles etc. Our focus should be on using the available domain knowledge to transform raw variables into features.

As seen from the above two examples, the essence of feature engineering is all about translating the domain knowledge and the trends seen in the data into more meaningful features. The quality of the models which are built depends a lot on the strength of the features built. Now that we have seen the feature engineering phase, let us look at the modelling strategy for this use case.

Modelling Phase

In the initial part of this article we discussed the labelling strategy for training the model. Since the use case is to predict which battery will fail and at what period of time, we have to look back in time from the failure point label to create the different classes related to periods of failure. In this specific case, the different features were created by consolidating 3 months of data into a single row. So one period before failure denotes 3 months before failure. If the requirement is to predict failure 6 months prior to when it is likely to happen, then we will have 4 different classes, i.e. failure point, one period before failure (3 months prior to the failure point), two periods before failure (6 months prior to the failure point) and normal state. All periods prior to 6 months can be labelled as the normal state.

With respect to modelling, we can spot check different classification algorithms (logistic regression, Naive Bayes, SVM, Random Forest, XGBoost etc.). The choice of the final model will be based on the accuracy metrics (sensitivity, specificity etc.) of the spot-checked models. Another aspect worth noting is that the data set is likely to be highly imbalanced, i.e. the normal battery class will outnumber the failure classes disproportionately. It is a good idea to try out class balancing methods on the data set before modelling.
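The snippet below is one way such a spot check could be organised with scikit-learn. It is only a sketch: X and y stand in for the featurized data set and the labels created earlier (a synthetic imbalanced data set is generated here purely so the snippet runs end to end), and balanced class weights with balanced accuracy scoring are just one way of handling the class imbalance mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the featurized data and the four failure-period classes
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=4, weights=[0.85, 0.05, 0.05, 0.05],
                           random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "naive_bayes": GaussianNB(),
    "svm": SVC(class_weight="balanced"),
    "random_forest": RandomForestClassifier(n_estimators=200, class_weight="balanced"),
}

for name, model in models.items():
    # Balanced accuracy is less forgiving of the dominant 'normal' class
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: mean balanced accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```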

Wrapping up

This post brings down the curtain on an end to end view of a predictive analytics use case for industrial batteries. Any use case within the manufacturing sector can be quite challenging, as the variables involved are very technical and require a lot of intervention from the related domain teams. Constant engagement of domain specialists as part of the data science team is very important for the success of such projects.

I have tried my best to capture the nuances of such a difficult use case and to cover the critical elements in the process. In case of any clarifications on the use case and the details of its implementation you can connect with me through the following email id: bayesianquest@gmail.com. Looking forward to hearing from you. Till then let me sign off.

Watch this space for more such use cases.

Applied Data Science Series : Solving a Predictive Maintenance Business Problem – Part II

 


 

In the first part of the applied data science series, we discussed the first three phases of the data science process, namely business discovery, data discovery and data preparation. In the business discovery phase we talked about how the business problem, i.e. predicting the end of life of batteries, defines the choice of variables that come into play. In the data discovery phase we discussed data sufficiency and other considerations like the variety and velocity of data, and how these considerations affect the data science problem formulation. In the last phase we touched upon how the data points and their various constituents drive the predictive problem formulation. In this post we will discuss how exploratory analysis can be used for getting insights for feature engineering.

Exploratory Analysis – Unraveling latent trends

This phase entails digging deep to get a feel of the data and extract intuitions for feature engineering. When embarking upon exploratory analysis, it is a good idea to get inputs from the domain team on the relation between the variables and the business problem. Such inputs are often the starting point for this phase.

Let us now get to the context of our preventive maintenance problem and evolve a philosophy for exploratory analysis. In the case of industrial batteries, a key variable which affects the state of health of a battery is its conductance. It turns out that an indicator of the failing health of a battery is a precipitous drop in conductance. Armed with this information, our next task is to identify, from our available data set, the batteries that have a higher probability of failure. Since a precipitous fall in conductance is an indicator of failing health, the conductance data of unhealthy batteries will have more variance than that of the normal ones. So the best way to identify failing batteries is to apply a consolidating metric like standard deviation or variance on the conductance data and then drill deep into the samples which stand apart from the normal population.
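As an illustration, this consolidation step could look something like the sketch below. The column names and the cut-off (here the 95th percentile of standard deviation) are assumptions made for the example; in practice the threshold would be set by inspecting the plot and consulting the domain team.

```python
import pandas as pd

def flag_potential_failure_zone(df: pd.DataFrame, cutoff_quantile: float = 0.95) -> pd.DataFrame:
    """Consolidate conductance per battery into its standard deviation and flag
    batteries whose variability places them in the potential failure zone.

    Assumed columns: battery_id, conductance (one row per reading).
    """
    per_battery = (df.groupby("battery_id")["conductance"]
                     .std()
                     .rename("conductance_sd")
                     .reset_index())
    cutoff = per_battery["conductance_sd"].quantile(cutoff_quantile)
    per_battery["potential_failure_zone"] = per_battery["conductance_sd"] > cutoff
    return per_battery
```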

[Figure: standard deviation of conductance for all batteries, with the “Potential Failure Zone” highlighted in red]

The plot above depicts the standard deviation of conductance for all batteries. What is of interest to us is the red zone, which we can call the “Potential Failure Zone”. The potential failure zone consists of those batteries whose conductance values show a high standard deviation. Batteries with failing health are likely to exhibit a large fall in conductance and, as a corollary, their values will also show a higher standard deviation. This implies that the batteries with a higher probability of failure will in all likelihood come from this zone. However, to ascertain this hypothesis we have to dig deep into the batteries in the failure zone and look for patterns which might differentiate them from normal batteries. Another objective of digging deep is to elicit clues from the underlying patterns on what features to include in the predictive model. We will discuss more on feature extraction when we discuss feature engineering. Now let us come back to our discussion on digging deep into the failure zone and ferreting out significant patterns. It has to be noted that in addition to the samples in the failure zone we also have to observe patterns from the normal zone, to help separate the wheat from the chaff. The intuitions derived by observing these different patterns become vital during the feature engineering stage.

[Figure: comparison of conductance profiles – a battery from the failure zone (left) versus a battery from the normal zone (right)]

The above figure is a comparison of patterns from the two zones. The plot on the left is from the failure zone and the one on the right is from the normal zone. We can clearly see how the precipitous fall is manifested in the sample from the failure zone. The other aspect to note is the magnitude of the fall. Every battery will have degrading conductance over time; however, the magnitude of the degradation is what differentiates an unhealthy battery from a normal one. We can observe from the plot on the left that the fall in conductance is more than 50%, whereas for the battery on the right the drop is much more muted. Another aspect we can observe is the slope of conductance: as evident from the two plots, the conductance profile of the battery on the left is much steeper over time than that of the one on the right. These intuitions might become critical in the overall scheme of feature engineering and modelling. Similar to the intuitions which we have unearthed so far, more can be extracted by observing more samples. The philosophy behind exploratory analysis entails visualizing more and more samples, observing patterns and extracting clues for feature engineering. The more time we spend on this, the more ammunition we get for feature engineering.

Wrapping up

So far we discussed different considerations for the exploratory analysis phase. To summarize, here are some pointers for this phase.

  1. Take inputs from the domain team related to the problem we are trying to solve. In our case the clue which we got was the relation between conductance and the health of batteries.
  2. Identify a consolidating metric for the variable under consideration to separate out anomalous samples. In the example above we used the standard deviation of conductance values to find anomalies.
  3. Once the samples are demarcated using the consolidating metric, visualize samples from the different sets to identify discernible patterns in the data.
  4. From the patterns we observe, ferret out clues for feature engineering. In our example we identified that the percentage fall in conductance and the slope of conductance over time could be potential features.

The above pointers are general guidelines on how one should think through the exploratory analysis phase.

The discussions so far were centered on exploratory analysis on a single variable. Next we have to connect other variables to the one which we already observed and identify trends in unison. When we combine trends from multiple variables we will be able to unravel more insights for feature engineering. We will continue our discussions on combining more variables and subsequent feature engineering in our next post. Watch out this space for more.

 

Logic of Logistic Regression – Part III

 


In our previous post on logistic regression we defined the concept of parameters and had a first-hand glimpse of the dynamics between the data set and the parameters in obtaining our first set of predictions. In this part we will go further into how we optimize the parameters in order to improve the accuracy of our predictions. We will be dealing with the following concepts:

  1. Deciphering the prediction errors
  2. Minimizing errors through gradient descent and finding optimized parameters
  3. Prediction with the optimized set of parameters.

Deciphering Prediction Errors

Let us revisit the toy example we discussed in our last post and dissect the below table which represented the dynamics of prediction.

[Table: the toy example from the last post – features, weighted sums, predictions and labels for the four customers]

To recap, let us list down our discussions in  the last post on the dynamics involved in the above table.

  • We first assumed an initial set of parameters
  • Multiplied the parameters with the respective features (columns 2, 3 & 4) to get the weighted sum.
  • Converted the weighted sum into predictions (column 6) by applying the activation function (sigmoid function).

Let us take a moment to reflect on what the predictions really mean. The predictions are in fact the probabilities of the customer buying the insurance policy. For example, for the first customer, we are predicting that the probability that the customer will buy the insurance policy is about 17.9%.

However, when we talk about predictions the first thing which comes to mind is the veracity of those predictions. How close to reality is this first set of predictions? If we recall, in our last discussion on the training set we introduced a new column called the labels. The labels are in fact the reality!! For example, looking at the labels column we know that the first two customers did not buy the insurance policy (label of ‘0’) and the next two bought it. The veracity of our predictions can be gauged by comparing our predictions with the reality manifested in the labels. By comparing we can see that the first and last predictions are somewhat close to reality while the middle ones are pretty off target. Ideally, we want the first two predictions to be close to zero and the last two pretty close to ‘1’. However, what we predicted has obviously deviated from the reality. Such deviations are the errors inherent in our predictions. We need to note that the calculation of error for a classification problem like ours is a little more mathematically involved and is not as straightforward as subtracting the probability from the labels. For the sake of simplicity let us not get into those calculations and stick to the understanding that there is some error inherent in each example. From the errors of each example we can find the average error by summing up the errors of all examples and dividing by the number of examples (4 in our case). In machine learning parlance the average error so obtained is also called the ‘Cost’.
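For the mathematically curious, the cost typically used for this kind of classification problem is the log loss (also called cross-entropy). A minimal sketch is given below; the first and last predictions are the ones from our toy table, while the middle two are illustrative placeholders for the “off target” predictions.

```python
import numpy as np

def average_cost(labels: np.ndarray, predictions: np.ndarray) -> float:
    """Average cross-entropy ('Cost') over all examples."""
    eps = 1e-12                                    # guard against log(0)
    p = np.clip(predictions, eps, 1 - eps)
    errors = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(errors.mean())

labels = np.array([0, 0, 1, 1])
predictions = np.array([0.179, 0.85, 0.30, 0.814])   # middle two are placeholders
print(round(average_cost(labels, predictions), 3))    # roughly 0.876
```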

Now that we know that there is a ‘Cost’ involved in our predictions, our aim should be to minimize the cost so that our predictions are as close to reality as possible. However, the million dollar question is: how do we minimize the cost? What are the levers we have to reduce our costs? Going back to our toy example, the two entities we have played around with to get the predictions are the ‘data’ and the ‘parameters’. We cannot change the given data because it is fixed. So all we have to play around with are the parameters which we assumed. We have to change our parameters systematically so that we minimize the cost and get our predictions as close to reality as possible. One of the ways we do this is by a procedure called gradient descent.

Gradient Descent

To understand the concept of gradient descent let us look at some graphical representations.

[Figure: the cost curve – cost on the ‘Y’ axis plotted against parameter values on the ‘X’ axis, with the minimum cost C_min attained at parameters P]

A pictorial representation of the cost function looks like the above. On the ‘X’ axis we have our parameters and on the ‘Y’ axis we have the cost. From the figure we can see that there is a set of parameters ‘P’ with which we can get to the minimum cost ‘C_min’. Our aim is to find those parameters which will give us the minimum cost.

Let us represent the initial parameters we assumed as P_initial, and for this set of parameters let us denote the cost we derived as C1, as given in the figure. We can see from the figure that by moving the P value to the left (decreasing the parameters) by some amount we can get to the minimum cost. Alternatively, if our initial ‘P’ value were on the left side of the graph, we would have to move to the right (increase the value of the parameters) to get to the minimum cost. The procedure for achieving this is called gradient descent.

The idea behind gradient descent is represented pictorially as below.

[Figure: gradient descent – taking small iterative steps down the cost curve towards the minimum cost]

We decrease the parameters in small steps in an iterative fashion so as to get to the minimum cost. To find the “small steps” mentioned in the previous line we use a trick we learned in high school calculus called the partial derivative. By taking the partial derivative at each point of the cost curve we get a value by which we have to reduce the parameters. With the new set of reduced parameters we find the new cost. Again we take the partial derivative at the new cost level to get the next step, and this process continues till we reach the minimum cost. An analogy for this process is the following: suppose we are on top of a hill, blindfolded, and we want to find our way down. We can do this by feeling the ground with our foot to find a spot which is lower than the one where we are currently standing, and then moving to that new spot. From the new spot we repeat the process till we finally reach the bottom of the hill. Gradient descent works somewhat similarly.

 

Summarizing our discussions on gradient descent, these are the steps we take to get the optimum parameters (a small sketch in code follows the list).

  1. Start off with an assumed set of random parameters.
  2. Find the cost (error) associated with the assumed parameters.
  3. Find the small step by which we have to alter our parameters, by taking the partial derivative of the cost.
  4. Reduce the parameters by this small step and get a new set of parameters.
  5. Find the new cost associated with the new parameters.
  6. Repeat steps 3, 4 & 5 till we reach the minimum cost.
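Putting those steps together, a bare-bones sketch of gradient descent for our logistic regression setting could look like the following. The learning rate, iteration count and the choice of zeros (instead of random values) as the starting parameters are assumptions for the sketch; X is the scaled feature matrix and y the labels from the training set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, learning_rate=0.1, iterations=1000):
    """Learn logistic regression parameters by plain gradient descent.

    X : (m, n) matrix of scaled features, y : (m,) array of 0/1 labels.
    """
    m, n = X.shape
    w = np.zeros(n)                                # step 1: assumed starting parameters
    for _ in range(iterations):
        predictions = sigmoid(X @ w)               # step 2: predictions behind the cost
        gradient = X.T @ (predictions - y) / m     # step 3: partial derivatives
        w = w - learning_rate * gradient           # steps 4 & 5: small step, new parameters
    return w                                       # the learned parameters
```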

The optimized parameters which we finally get are called the learned parameters. Getting to these optimized parameters is the most involved part of machine learning. Once we learn the parameters using the training set, we are all set to do predictions, which is the objective of any machine learning process.

Doing Predictions

Having learned our set of optimized parameters from the training set, we are now equipped with enough ammunition to do predictions. For doing predictions we take a new set of data called the test set. However, there is a difference between the training set and the test set: the test set will not have any labels. Our job is to predict the labels using the parameters we have learned. So in the insurance company example, the test set would be the new set of leads which the sales team generated, and we have to predict the likelihood of these leads buying an insurance policy. The way we do the prediction is as follows (a short sketch in code follows the list).

  • We take the optimized set of parameters learned from the training set
  • Multiply the parameters with the respective features (columns 2, 3 & 4) to get the weighted sum.
  • Convert the weighted sum into probabilities by applying the activation function (sigmoid function).
  • Take a threshold point (say 0.5). Any probability less than the threshold point is predicted as ‘0’ (will not buy) and anything greater than the threshold point is predicted as ‘1’ (will buy).
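A small sketch of this prediction step, continuing with the gradient_descent function sketched earlier, could be:

```python
import numpy as np

def predict(X_test, w, threshold=0.5):
    """Apply the learned parameters to the test set and use the decision boundary."""
    probabilities = 1.0 / (1.0 + np.exp(-(X_test @ w)))   # sigmoid of the weighted sums
    return (probabilities >= threshold).astype(int)        # 1 = will buy, 0 = will not buy

# Example usage: labels_pred = predict(X_test, gradient_descent(X_train, y_train))
```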

The threshold point which we take to make a decision on our predictions is called the decision boundary. Needless to say, logistic regression is the basic model among a vast set of powerful classification algorithms. The significance of logistic regression is that it is the building block for the development of powerful algorithms like Support Vector Machines, Neural Networks etc. Having said that, there are many problem areas where we have to go for simple algorithms like logistic regression. Having dealt with the basic building blocks of classification problems, we will have further discussions on some of the most powerful algorithms in future posts. Until then, watch out this space for more.

Logic of Logistic Regression – Part II


In the first part of this series on Logistic Regression, we set the stage for unveiling the logic behind logistic regression. We stopped our discussion by identifying three dynamic forces at play which determine the quality of predictions:

  1. Weights or parameters which we learn
  2. The activation function, and
  3. The decision boundary

In this second part of the series we will look deeper into the first two of those dynamic forces.

Concept of Parameters

In the first part of this series, when we were discussing the example, we assumed a set of parameters, i.e. W(age) = 8, W(income) = 3 and W(propensity) = 10. Quite naturally, a question a lot of people asked me was: where did we get those values from? Well, as far as that example was concerned, they were just some assumed values. However, in the world of machine learning, the parameters are its Holy Grail. The cardinal purpose of the algorithms and theorems of machine learning is to enable the pursuit of the right set of parameters. But why are the parameters so important? To answer this let us look at what the parameters help us achieve.

Let us revisit the toy data set which we used in the first part. Let us first understand this data set before we get into understanding the parameters.

As can be seen, this data set consists of rows and columns. The data along the columns (Age, Income & Propensity) are called its features and the ones along the rows are the examples. In short, each customer record in this data set is an example.

Now that we have seen the data set, let us now see the dynamics between the parameters and the data.

The role of the parameter is to act as a weighting factor for each of the features. In other words, each feature will have a unique parameter playing the role of a weight. Our example data set has three features and therefore the number of parameters we will have is also three. In general, if there are ‘n’ features there should be at least ‘n’ parameters (in practice we will have n+1 parameters, where the additional parameter is called the bias term; we will ignore that for the time being). Please note that the number of parameters does not depend on the number of examples.

Having looked at the anatomy of the data set and parameters, let us look at how the parameters are learned from a given data set.

Learning Parameters from data

The data set which is used for learning parameters is called a training set. There is a subtle difference between a training set and the one shown above. For the training set we will have an additional column and this additional column is for the labels or dependent variables.

[Table: the training set – Age, Income and Propensity features along with a ‘labels’ column]

The above data set is an example of a training set. The ‘labels’ column represents the result or outcome for each record. The records with ‘0’ are the negative examples and those with ‘1’ are the positive examples. In this context a negative example means a customer who did not buy an insurance policy and a positive example one who did. The labels can also be interpreted from the perspective of the probability of buying: all the negative examples are the ones where the probability of a sale is low, i.e. near 0%, and the positive ones are those with a high probability, i.e. near 100%. In real life a training set can be made from the historical data of customers in the organisation, i.e. who are the customers? How many of them bought a policy? How many did not? etc.

The way we go about the task of learning the parameters from the training set is as follows:

  • Random assumption of parameters: To start off, we randomly select some arbitrary values for the parameters. For example, let us assume the following values: W(age) = 1, W(income) = 1 and W(propensity) = 1.
  • Scaling of the data: Once we have assumed the parameters, let us do some modification on the training data set. If we note the values of each feature, the scale of values varies quite a bit: the values of the feature ‘Age’ are all two digit numbers, the values of ‘Income’ are four digit numbers and so on. In machine learning, when the values fall within very different scales, the accuracy of prediction gets affected. So it is a good practice to normalize the data. One popular way is to subtract the average of the feature from each value and then divide by the range (the difference between the maximum and minimum values). Let us see this in action with the feature ‘Age’. The average value of ‘Age’ = (28 + 32 + 36 + 46) / 4 = 35.5 and the range of ‘Age’ = 46 – 28 = 18, so the scaled value of the first data point is (28 – 35.5) / 18 = -0.4167. Similarly we scale the complete data set. Please note that we do not scale the labels. [Table: the scaled training set]
  • Prediction with initial parameters: Once the data is scaled, we go to the next step of using the assumed parameters for prediction. As mentioned earlier, the parameters are like weights which need to be applied on each feature of the data. Therefore the first step in arriving at a prediction is to multiply the parameters with the corresponding features and add up the weighted features for each example. Please note that the labels are not involved in any of these operations. [Table: the weighted sum for each example] Let us study the weighted sum column closely. The weighted sum, which is obtained by applying the parameters on each feature and adding them up, is the value which finally determines the prediction. However, for a classification problem the most intuitive way of representing the prediction is in terms of probabilities. As you know, when you represent a value as a probability it has to be within the range of ‘0’ and ‘1’. If you look at our weighted sum column, most of the values are outside the range of 0 & 1. So our challenge is to apply some mathematical operation to represent them as probabilities. The mathematical operation we use for this purpose is called the activation function. One of the most common activation functions used in classification problems is the sigmoid function. By applying this function on the weighted sum column we convert it into numbers which can be interpreted as probabilities. [Table: the predictions after applying the sigmoid function] The probabilities column is our actual prediction and it can be interpreted as the probability that the customer will buy the insurance policy. So for the first customer there is only a 17.88% chance of buying the policy and for the last customer there is a high chance (81.4%) of him/her buying the policy. A small sketch in code of these scaling and activation steps is given at the end of this post. Now that we have seen how we apply the activation function to get the prediction, we are a step closer to our final goal of learning the right parameters which give the most accurate prediction. This all important step, called gradient descent, will be explained in the next part of the post. Please watch out this space for the most important part of our logistic regression problem.
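To round off this part, here is a minimal sketch in code of the scaling and activation steps described in the list above. Only the ‘Age’ feature is worked out, since its values are given in the post; the other features and the assumed initial parameters W = [1, 1, 1] would be handled in exactly the same way.

```python
import numpy as np

def scale(feature: np.ndarray) -> np.ndarray:
    """Subtract the average of the feature and divide by its range (max - min)."""
    return (feature - feature.mean()) / (feature.max() - feature.min())

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

age = np.array([28, 32, 36, 46])
print(scale(age))   # first value: (28 - 35.5) / 18 = -0.4167

# With the full scaled feature matrix X (one row per customer, one column per
# feature) and the assumed initial parameters w = np.array([1, 1, 1]), the
# predictions would simply be: probabilities = sigmoid(X @ w)
```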

The Logic of Logistic Regression

At the outset let me take this opportunity to wish each one of you a very happy and prosperous New Year. In this post I will start the discussion around one of the most frequent types of problems encountered in a machine learning context – the classification problem. I will also introduce one of the basic algorithms used in the classification context, called logistic regression.


In one of my earlier posts on machine learning I mentioned that the essence of machine learning is prediction. When we talk about prediction there are basically two types we encounter in a machine learning context. In the first type, given some data your aim is to estimate a real scalar value. For example, predicting the amount of rainfall from meteorological data, predicting stock prices based on the current economic environment or predicting sales based on past market data are all valid use cases of the first type. This genre of prediction problems is called the regression problem. The second type of problem deals with predicting the category or class the observed examples fall into. For example, classifying whether a given mail is spam or not, predicting whether a prospective lead will buy an insurance policy or not, or processing images of handwritten digits and classifying the images under the correct digit all fall under this gamut of problems. The second type is called the classification problem. As mentioned earlier, classification problems are the most widely encountered ones in the machine learning domain and therefore I will devote considerable space to giving an intuitive sense of the classification problem. In this post I will define the basic setting for classification problems.

Classification Problems Unplugged – Setting the context

In a machine learning setting we work with two major components. One is the data we have at hand and the second is the parameters of the data. The dynamics between the data and the parameters provides us the result we want, i.e. the correct prediction. Of these two components, the one which is readily available to us is the data. The parameters are something which we have to learn or derive from the available data. Our ability to learn the correct set of parameters determines the efficacy of our prediction. Let me elaborate with a toy example.

Suppose you are part of an insurance organisation, you have a large set of customer data and you would like to predict which of these customers are likely to buy a health insurance policy in the future.

For simplicity let us assume that each customer's data consists of three variables:

  • Age of the customer
  • Income of the customer and
  • A propensity factor based on the interest the customer shows for health insurance products.

Let the data for 3 of our leads look like the below

Customer     Age     Income     Propensity
Cust-1        22       1000          1
Cust-2        36       5000          6
Cust-3        62       4500          8

Suppose we also have a set of parameters which were derived from our historical data on past leads and their conversion rate (i.e. how many of the leads actually bought the insurance product).

Let the parameters be denoted by ‘W’ suffixed by the name of the variable, i.e

W(age) = 8 ; W(income) = 3 ; W(propensity) = 10

Once we have the data and the parameters, our next task is to use these two together and arrive at a relative scoring for the leads so that we can make predictions. For this, let us multiply the parameters with the corresponding variables and find a weighted score for each customer.

Customer     Age         Income          Propensity     Total Score
Cust-1       22 x 8   +  1000 x 3    +   1 x 10         3,186
Cust-2       36 x 8   +  5000 x 3    +   6 x 10         15,348
Cust-3       62 x 8   +  4500 x 3    +   8 x 10         14,076

Now that we have the weighted score for each customer, it is time to arrive at some decisions. From our past experience we have also observed that any lead obtaining a score of more than 14,000 tends to buy an insurance policy. Based on this knowledge we can comfortably predict that customer 1 will not buy the insurance policy and that there is a very high chance that customer 2 will buy the policy. Customer 3 is on the borderline and with a little effort one can convert this customer too. Equipped with this predictive knowledge, the sales force can focus their attention on customers 2 & 3 so that they get more “bang for their buck”.
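Expressed as a short snippet (with the weights, customer data and the 14,000 threshold taken from the example above), the same calculation looks like this:

```python
# Weighted scoring of the three leads using the assumed parameters
weights = {"age": 8, "income": 3, "propensity": 10}
customers = {
    "Cust-1": {"age": 22, "income": 1000, "propensity": 1},
    "Cust-2": {"age": 36, "income": 5000, "propensity": 6},
    "Cust-3": {"age": 62, "income": 4500, "propensity": 8},
}
THRESHOLD = 14000   # score above which past leads have tended to buy a policy

for name, data in customers.items():
    score = sum(weights[k] * data[k] for k in weights)
    verdict = "likely to buy" if score > THRESHOLD else "unlikely to buy"
    print(f"{name}: {score} -> {verdict}")
# Cust-1: 3186 -> unlikely to buy
# Cust-2: 15348 -> likely to buy
# Cust-3: 14076 -> likely to buy
```

Note that customer 3 clears the threshold only marginally, which matches the “borderline” reading above.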

In the above toy example, we can observe some interesting dynamics at play,

  1. The derivation of the parameters for each variable – in machine learning, the quality of the results we obtain depends to a large extent on the parameters or weights we learn.
  2. The derivation of the total score – in this example we multiplied the weights with the data and summed the results to get a score. In effect we applied a function (multiplication and addition) to get a score. In machine learning parlance such functions are called activation functions. The activation function converts the parameters and data into a composite measure aiding the final decision.
  3. The decision boundary – the score (14,000) used to demarcate the examples as to whether the lead can be converted or not.

The efficacy of our prediction  is dependent on how well we are able to represent the interplay between all these dynamic forces. This in effect is the big picture on what we try to achieve through machine learning.

Now that we have set our context, I will delve deeper into these dynamics in the next part of this post. In the next part I will primarily be dealing with the dynamics of parameter learning. Watch out this space for more on that.