VI : Build and deploy data science products: Machine translation application – From prototype to production. Introduction to the factory model

Source: brainyquote.com

This is the sixth part of the series where we continue on our pursuit to build a machine translation application. In this post we embark on a transformation process where in we transform our prototype into a production grade code.

This series comprises of 8 posts.

  1. Understand the landscape of solutions available for machine translation
  2. Explore sequence to sequence model architecture for machine translation.
  3. Deep dive into the LSTM model with worked out numerical example.
  4. Understand the back propagation algorithm for a LSTM model worked out with a numerical example.
  5. Build a prototype of the machine translation model using a Google colab / Jupyter notebook.
  6. Build the production grade code for the training module using Python scripts.( This post)
  7. Building the Machine Translation application -From Prototype to Production : Inference process
  8. Build the machine translation application using Flask and understand the process to deploy the application on Heroku

In this section we will see how we can take the prototype which we built in the last article into a production ready code. In the prototype building phase we were developing our code on a Jupyter/Colab notebook. However if we have to build an application and deploy it, notebooks would not be very effective. We have to convert the code we built on the notebook into production grade code using python scripts. We will be progressively building the scripts using a process, I call, as the factory model. Let us see what a factory model is.

Factory Model

A Factory model is a modularized process of generating business outcomes using machine learning models. There are some distinct phases in the process which includes

  1. Ingestion/Extraction process : Process of getting data from source systems/locations
  2. Transformation process : Transformation process entails transforming raw data ingested from multiple sources into a form fit for the desired business outcome
  3. Preprocessing process: This process involves basic level of cleaning of the transformed data.
  4. Feature engineering process : Feature engineering is the process of converting the preprocessed data into features which are required for model training.
  5. Training process : This is the phase where the models are built from the featurized data.
  6. Inference process : The models which were built during the training phase is then utilized to generate the desired business outcomes during the inference process.
  7. Deployment process : The results of the inference process will have to be consumed by some process. The consumer of the inferences could be a BI report or a web service or an ERP application or any downstream applications. There is a whole set of process which is involved in enabling the down stream systems to consume the results of the inference process. All these steps are called the deployment process.

Needless to say all these processes are supported by an infrastructure layer which is also called the data engineering layer. This layer looks at the most efficient and effective way of running all these processes through modularization and parallelization.

All these processes have to be designed seamlessly to get the business outcomes in the most effective and efficient way. To take an analogy its like running a factory where raw materials gets converted into a finished product and thereby gets consumed by the end customers. In our case, the raw material is the data, the product is the model generated from the training phase and the consumers are any business process which uses the outcomes generated from the model.

Let us now see how we can execute the factory model to generate the business outcomes.

Project Structure

Before we dive deep into the scripts, let us look at our project structure.

Our root folder is the Machine Translation folder which contains two sub folders Data and factoryModel. The Data subfolder contains the raw data. The factoryModel folder contains different subfolders containing scripts for our processes. We will be looking at each of these scripts in detail in the subsequent sections. Finally we have two driver files mt_driver_train.py which is the driver file for the training process and mt_Inference.py which is the driver file for the inference process.

Let us first dive into the training phase scripts.

Training Phase

The first part of the factory model is the training phase which comprises of all the processes till the creation of the model. We will start off by building the supporting files and folders before we get into the driver file. We will first start with the configuration file.

Configuration file

When we were working with the notebook files, we were at a liberty to change the pararmeters we wanted to vary, say for example the path to the input file or some hyperparameters like the number of dimensions of the embedding vector, on the notebook itself. However when an application is in production we would not have the luxury to change the parameters and hyperparameters directly in the code base. To get over this problem we use the configuration files. We consolidate all the parameters and hyperparameters of the model on to the configuration file. All processes will pick the parameters from the configuration file for further processing.

The configuration file will be inside the config folder. Let us now build the configuration file.

Open a word editor like notepad++ or any other editor of your choice and open a new file and name it mt_config.py. Let us start adding the below code in this file.

'''
This is the configuration file for storing all the application parameters
'''

import os
from os import path


# This is the base path to the Machine Translation folder
BASE_PATH = '/media/acer/7DC832E057A5BDB1/JMJTL/Tomslabs/BayesianQuest/MT/MachineTranslation'
# Define the path where data is stored
DATA_PATH = path.sep.join([BASE_PATH,'Data/deu.txt'])

Lines 5 and 6, we import the necessary library packages.

Line 10, we define the base path for the application. You need to change this path based on your specific path to the application. Once the base path is set, the rest of the paths will be derived out from it. In Line 12, we define the path to the raw data set folder. Note that we just join the name of the data folder and the raw text file with the base path to get the data path. We will be using the data path to read in the raw data.

In the config folder there will be another file named __init__.py . This is a special file which tells Python to treat the config folder as part of the package. This file inside this folder will be an empty file with no code in it

Loading Data

The next helper files we will build are those for loading raw files and preprocessing. The code we use for these purposes are the same code which we used for building the prototype. This file will reside in the dataLoader folder

In your text editor open a new file and name it as datasetloader.py and then add the below code into it

'''
Factory Model for Machine translation preprocessing.
This is the script for loading the data and preprocessing data
'''

import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# Creating the class to load data and then do the preprocessing as sequence of steps

class textLoader:
	def __init__(self , preprocessors = None):
		# This init method is to store the text preprocessing pipeline
		self.preprocessors = preprocessors
		# Initializing the preprocessors as an empty list of the preprocessors are None
		if self.preprocessors is None:
			self.preprocessors = []

	def loadDoc(self,filepath):
		# This is the function to read the file from the path provided
		# Open the file
		file = open(filepath,mode = 'rt',encoding = 'utf-8')
		# Reading the text
		text = file.read()
		#Once the file is read, applying the preprocessing steps one by one
		if self.preprocessors is not None:
			# Looping over all the preprocessing steps and applying them on the text data
			for p in self.preprocessors:
				text = p.preprocess(text)
				
		# Closing the file
		file.close()
				
		# Returning the text after all the preprocessing
		return text

Before addressing the code block line by line, let us get a big picture perspective of what we are trying to accomplish. When working with text you would have realised that different sources of raw text requires different preprocessing treatments. A preprocessing method which we have used for one circumstance may not be warranted in a different one. So in this code block we are building a template called textLoader, which reads in raw data and then applies different preprocessing steps like a pipeline as the situation warrants. Each of the individual preprocessing steps would be defined seperately. The textLoader class first reads in the data and then applies the selected preprocessing one after the other. Let us now dive into the details of the code.

Lines 6 to 10 imports all the necessary library packages for the process.

Line 14 we define the textLoader class. The constructor in line 15 takes the text preprocessor pipeline as the input. The prepreprocessors are given as lists. The default value is taken as None. The preprocessors provided in the constructor is initialized in line 17. Lines 19-20 initializes an empty list if the preprocessor argument is none. If you havent got a handle of why the preprocessors are defined this way, it is ok. This will be more clear when we define the actual preprocessors. Just hang on till then.

From line 22 we start the first function within this class. This function is to read the raw text and the apply the processing pipeline. Lines 25 – 27, where we open the text file and read the text is the same as what we defined during the prototype phase in the last post. We do a check to see if we have defined any preprocessor pipeline in line 29. If there are any pipeline defined those are applied on the text one by one in lines 31-32. The method .preprocess is specific to each of the preprocessor in the pipeline. This method would be clear once we take a look at each of the preprocessors. We finally close the raw file and the return the processed text in lines 35-38.

The __init__.py file inside this folder will contain the following line for importing the textLoader class from the datasetloader.py file for any calling script.

from .datasetloader import textLoader

Processing Data : Preprocessing pipeline construction

Next we will create the files for preprocessing the text. In the last section we saw how the raw data was loaded and then preprocessing pipeline was applied. In this section we look into the preprocessing pipeline. The folder structure will be as shown in the figure.

There would be three preprocessors classes for processing the raw data.

  • SentenceSplit : Preprocessor to split raw text into pair of English and German sentences. This class is inside the file splitsentences.py
  • cleanData : Preprocessor to apply cleaning steps like removing punctuations, removing whitespaces which is included in the datacleaner.py file.
  • TrainMaker : Preprocessor to tokenize text and then finally prepare the train and validation sets contined in the tokenizer.py file

Let us now dive into each of the preprocessors.

Open a new file and name it splitsentences.py. Add the following code to this file.

'''
Script for preprocessing of text for Machine Translation
This is the class for splitting the text into sentences
'''

import string
from numpy import array

class SentenceSplit:
	def __init__(self,nrecords):
		# Creating the constructor for splitting the sentences
		# nrecords is the parameter which defines how many records you want to take from the data set
		self.nrecords = nrecords
		
	# Creating the new function for splitting the text
	def preprocess(self,text):
		sen = text.strip().split('\n')
		sen = [i.split('\t') for i in sen]
		# Saving into an array
		sen = array(sen)
		# Return only the first two columns as the third column is metadata. Also select the number of rows required
		return sen[:self.nrecords,:2]

This is the first or our preprocessors. This preprocessor splits the raw text and finally outputs an array of English and German sentence pairs.

After we import the required packages in lines 6-7, we define the class in line 9. We pass a variable nrecords to the constructor to subset the raw text and select number of rows we want to include for training.

The preprocess function starts in line 16. This is the function which we were accessing in line 32 of the textLoader class which we discussed in the last section. The rest is the same code we have used in the prototype building phase which includes

  • Splitting the text into sentences in line 17
  • Splitting each sentece on tab spaces to get the German and English sentences ( line 18)

Finally we convert the processed sentences into an array and return only the first two columns of the array. Please note that the third column contains metadata of each line and therefore we exclude it from the returned array. We also subset the array based on the number of records we want.

Now that the first preprocessor is complete,let us now create the second preprocessor.

Open a new file and name it datacleaner.py and copy the below code.

'''
Script for preprocessing data for Machine Translation application
This is the class for removing the punctuations from sentences and also converting it to lower cases
'''

import string
from numpy import array
from unicodedata import normalize

class cleanData:
	def __init__(self):
		# Creating the constructor for removing punctuations and lowering the text
		pass
		
	# Creating the function for removing the punctuations and converting to lowercase
	def preprocess(self,lines):
		cleanArray = list()
		for docs in lines:
			cleanDocs = list()
			for line in docs:
				# Normalising unicode characters
				line = normalize('NFD', line).encode('ascii', 'ignore')
				line = line.decode('UTF-8')
				# Tokenize on white space
				line = line.split()
				# Removing punctuations from each token
				line = [word.translate(str.maketrans('', '', string.punctuation)) for word in line]
				# convert to lower case
				line = [word.lower() for word in line]
				# Remove tokens with numbers in them
				line = [word for word in line if word.isalpha()]
				# Store as string
				cleanDocs.append(' '.join(line))
			cleanArray.append(cleanDocs)
		return array(cleanArray)

This preprocessor is to clean the array of German and English sentences we received from the earlier preprocessor. The cleaning steps are the same as what we have seen in the previous post. Let us quickly dive in and understand the code block.

We start of by defining the cleanData class in line 10. The preprocess method starts in line 16 with the array from the previous preprocessing step as the input. We define two placeholder lists in line 17 and line 19. In line 20 we loop through each of the sentence pair of the array and the carry out the following cleaning operations

  • Lines 22-23, normalise the text
  • Line 25 : Split the text to remove the whitespaces
  • Line 27 : Remove punctuations from each sentence
  • Line 29: Convert the text to lower case
  • Line 31: Remove numbers from text

Finally in line 33 all the tokens are joined together and appended into the cleanDocs list. In line 34 all the individual sentences are appended into the cleanArray list and converted into an array which is returned in line 35.

Let us now explore the third preprocessor.

Open a new file and name it tokenizer.py . This file is pretty long and therefore we will go over it function by function. Let us explore the file in detail

'''
This class has methods for tokenizing the text and preparing train and test sets
'''

import string
import numpy as np
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


class TrainMaker:
	def __init__(self):
		# Creating the constructor for creating the tokenizers
		pass
	
	# Creating an internal function for tokenizing the text	
	def tokenMaker(self,text):
		tokenizer = Tokenizer()
		tokenizer.fit_on_texts(text)
		return tokenizer	

We down load all the required packages in lines 5-10, after which we define the constructor in lines 13-16. There is nothing going on in the constructor so we can conveniently pass it over.

The first function starts on line 19. This is a function we are familiar with in the previous post. This function fits the tokenizer function on text. The first step is to instantiate the tokenizer object in line 20 and then fit the tokenizer object on the provided text in line 21. Finally the tokenizer object which is fit on the text is returned in line 22. This function will be used for creating the tokenizer dictionaries for both English and German text.

The next function which we will see is the sequenceMaker. In the previous post we saw how we convert text as sequence of integers. The sequenceMaker function is used for this task.

		
	# Creating an internal function for encoding and padding sequences
	
	def sequenceMaker(self,tokenizer,stdlen,text):
		# Encoding sequences as integers
		seq = tokenizer.texts_to_sequences(text)
		# Padding the sequences with respect standard length
		seq = pad_sequences(seq,maxlen=stdlen,padding = 'post')
		return seq

The inputs to the sequenceMaker function on line 26 are the tokenizer , the maximum length of a sequence and the raw text which needs to be converted to sequences. First the text is converted to sequences of integers in line 28. As the sequences have to be of standard legth, they are padded to the maximum length in line 30. The standard length integer sequences is then returned in line 31.

		
	# Creating another function to find the maximum length of the sequences	
	def qntLength(self,lines):
		doc_len = []
		# Getting the length of all the language sentences
		[doc_len.append(len(line.split())) for line in lines]
		return np.quantile(doc_len, .975)

The next function we will define is the function to find the quantile length of the sentences. As seen from the previous post we made the standard length of the sequences equal to the 97.5 % quantile length of the respective text corpus. The function starts in line 34 where the complete text is given as input. We then create a placeholder in line 35. In line 37 we parse through each of the line and the find the total length of the sentence. The length of each sentence is stored in the placeholder list we created earlier. Finally in line 38, the 97.5 quantile of the length is returned to get the standard length.

		
	# Creating the function for creating tokenizers and also creating the train and test sets from the given text
	def preprocess(self,docArray):
		# Creating tokenizer forEnglish sentences
		eng_tokenizer = self.tokenMaker(docArray[:,0])
		# Finding the vocabulary size of the tokenizer
		eng_vocab_size = len(eng_tokenizer.word_index) + 1
		# Creating tokenizer for German sentences
		deu_tokenizer = self.tokenMaker(docArray[:,1])
		# Finding the vocabulary size of the tokenizer
		deu_vocab_size = len(deu_tokenizer.word_index) + 1
		# Finding the maximum length of English and German sequences
		eng_length = self.qntLength(docArray[:,0])
		ger_length = self.qntLength(docArray[:,1])
		# Splitting the train and test set
		train,test = train_test_split(docArray,test_size = 0.1,random_state = 123)
		# Calling the sequence maker function to create sequences of both train and test sets
		# Training data
		trainX = self.sequenceMaker(deu_tokenizer,int(ger_length),train[:,1])
		trainY = self.sequenceMaker(eng_tokenizer,int(eng_length),train[:,0])
		# Validation data
		testX = self.sequenceMaker(deu_tokenizer,int(ger_length),test[:,1])
		testY = self.sequenceMaker(eng_tokenizer,int(eng_length),test[:,0])
		return eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,docArray,trainX,trainY,testX,testY,eng_length,ger_length

We tie all the earlier functions in the preprocess method starting in line 41. The input to this function is the English, German sentence pair as array. The various processes under this function are

  • Line 43 : Tokenizing English sentences using the tokenizer function created in line 19
  • Line 45 : We find the vocabulary size for the English corpus
  • Lines 47-49 the above two processes are repeated for German corpus
  • Lines 51-52 : The standard lengths of the English and German senetences are found out
  • Line 54 : The array is split to train and test sets.
  • Line 57 : The input sequences for the training set is created using the sequenceMaker() function. Please note that the German sentences are the input variable ( TrainX).
  • Line 58 : The target sequence which is the English sequence is created in this step.
  • Lines 60-61: The input and target sequences are created for the test set

All the variables and the train and test sets are returned in line 62

The __init__.py file inside this folder will contain the following lines

from .splitsentences import SentenceSplit
from .datacleaner import cleanData
from .tokenizer import TrainMaker

That takes us to the end of the preprocessing steps. Let us now start the model building process.

Model building Scripts

Open a new file and name it mtEncDec.py . Copy the following code into the file.

'''
This is the script and template for different models.
'''

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import TimeDistributed

class ModelBuilding:
	@staticmethod
	def EncDecbuild(in_vocab,out_vocab, in_timesteps,out_timesteps,units):
		# Initializing the model with Sequential class
		model = Sequential()
		# Initiating the embedding layer for the text
		model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
		# Adding the first LSTM layer
		model.add(LSTM(units))
		# Using the RepeatVector to map the input sequence length to output sequence length
		model.add(RepeatVector(out_timesteps))
		# Adding the second layer of LSTM 
		model.add(LSTM(units, return_sequences=True))
		# Adding the fully connected layer with a softmax layer for getting the probability
		model.add(TimeDistributed(Dense(out_vocab, activation='softmax')))
		# Compiling the model
		model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
		# Printing the summary of the model
		model.summary()
		return model

The model building scripts is straight forward. Here we implement the encoder decoder model we described extensively in the last post.

We start by importing all the necessary packages in lines 5-10. We then get to the meat of the model by defining the ModelBuilding class in line 12. The model we are using for our application is defined through a function EncDecbuild in line 14. The inputs to the function are the

  • in_vocab : This is the size of the German vocabulary
  • out_vocab : This is the size of the Enblish vocabulary
  • in_timesteps : The standard sequence length of the German sentences
  • out_timesteps : Standard sequence length of Enblish sentences
  • units : Number of hidden units for the LSTM layers.

The progressive building of the model was covered extensively in the last post. Let us quickly run through the same here

  • Line 16 we initialize the sequential class
  • The next layer is the Embedding layer defined in line 18. This layer converts the text to word embedding vectors. The inputs are the German vocabulary size, the dimension required for the word embeddings and the sequence length of the input sequences. In this example we have kept the dimension of the word embedding same as the number of units of LSTM. However this is a parameter which can be experimented with.
  • Line 20, we initialize our first LSTM unit.
  • We then perform the Repeat vector operation in Line 22 so as to make the mapping between the encoder time steps and decoder time steps
  • We add our second LSTM layer for the decoder part in Line 24.
  • The next layer is the dense layer whose output size is equal to the English vocabulary size.(Line 26)
  • Finally we compile the model using ‘adam’ optimizer and then summarise the model in lines 28-30

So far we explored the file ecosystem for our application. Next we will tie all these together in the driver program.

Driver Program

Open a new file and name it mt_driver_train.py and start adding the following code blocks.

'''
This is the driver file which controls the complete training process
'''

from factoryModel.config import mt_config as confFile
from factoryModel.preprocessing import SentenceSplit,cleanData,TrainMaker
from factoryModel.dataLoader import textLoader
from factoryModel.models import ModelBuilding
from tensorflow.keras.callbacks import ModelCheckpoint
from factoryModel.utils.helperFunctions import *

## Define the file path to input data set
filePath = confFile.DATA_PATH

print('[INFO] Starting the preprocessing phase')

## Load the raw file and process the data
ss = SentenceSplit(50000)
cd = cleanData()
tm = TrainMaker()

Let us first look at the library file importing part. In line 5 we import the configuration file which we defined earlier. Please note the folder structure we implemented for the application. The configuration file is imported from the config folder which is inside the folder named factoryModel. Similary in line 6 we import all three preprocessing classes from the preprocessing folder. In line 7 we import the textLoader class from the dataLoader folder and finally in line 8 we import the ModelBuilding class from the models folder.

The first task we will do is to get the path of the files which we defined in the configuration file. We get the path to the raw data in line 13.

Lines 18-20, we instantiate the preprocessor classes starting with the SentenceSplit, cleanData and finally the trainMaker classes. Please note that we pass a parameter to the SentenceSplit(50000) class to indicate that we want only 50000 rows of the raw data, for processing.

Having seen the three preprocessing classes, let us now see how these preprocessors are tied together in a pipeline to be applied sequentially on the raw text. This is achieved in next code block

# Initializing the data set loader class and then executing the processing methods
tL = textLoader(preprocessors = [ss,cd,tm])
# Load the raw data, preprocess it and create the train and test sets
eng_tokenizer,eng_vocab_size,deu_tokenizer,deu_vocab_size,text,trainX,trainY,testX,testY,eng_length,ger_length = tL.loadDoc(filePath)

Line 21 we instantiate the textLoader class. Please note that all the preprocessing classes are given sequentially in a list as the parameters to this class. This way we ensure that each of the preprocessors are implemented one after the other when we implement the textLoader class. Please take some time to review the class textLoader earlier in the post to understand the dynamics of the loading and preprocessing steps.

In Line 23 we implement the loadDoc function which takes the path of the data set as the input. There are lots of processes which goes on in this method.

  • First loads the raw text using the file path provided.
  • On the raw text which is loaded, the three preprocessors are implemented one after the other
  • The last preprocessing step returns all the required data sets like the train and test sets along with the variables we require for modelling.

We now come to the end of the preprocessing step. Next we take the preprocessed data and train the model.

Training the model

We have already built all the necessary scripts required for training. We will tie all those pieces together in the training phase. Enter the following lines of code in our script

### Initiating the training phase #########
# Initialise the model
model = ModelBuilding.EncDecbuild(int(deu_vocab_size),int(eng_vocab_size),int(ger_length),int(eng_length),256)
# Define the checkpoints
checkpoint = ModelCheckpoint('model.h5',monitor = 'val_loss',verbose = 1, save_best_only = True,mode = 'min')
# Fit the model on the training data set
model.fit(trainX,trainY,epochs = 50,batch_size = 64,validation_data=(testX,testY),callbacks = [checkpoint],verbose = 2)

In line 34, we initialize the model object. Please note that when we built the script ModelBuilding was the name of the class and EncDecbuild was the method or function under the class. This is how we initialize the model object in line 34. The various parameter we give are the German and English vocabulary sizes, sequence lenghts of the German and English senteces and the number of units for LSTM ( which is what we adopt for the embedding size also). We define the checkpoint variables in line 36.

We start the model fitting in line 38. At the end of the training process the best model is saved in the path we have defined in the configuration file.

Saving the other files and variables

Once the training is done the model file is stored as a 'model.h5‘ file. However we need to save other files and variables as pickle files so that we utilise them during our inference process. We will create a script where we store all such utility functions for saving data. This script will reside in the utils folder. Open a new file and name it helperfunctions.py and copy the following code.

'''
This script lists down all the helper functions which are required for processing raw data
'''

from pickle import load
from numpy import argmax
from tensorflow.keras.models import load_model
from pickle import dump

def save_clean_data(data,filename):
    dump(data,open(filename,'wb'))
    print('Saved: %s' % filename)

Lines 5-8 we import all the necessary packages.

The first function we will be creating is to dump any files as pickle files which is initiated in line 10. The parameters are the data and the filename of the data we want to save.

Line 11 dumps the data as pickle file with the file name we have provided. We will be using this utility function to save all the files and variables after the training phase.

In our training driver file mt_driver_train.py add the following lines

### Saving the tokenizers and other variables as pickle files
save_clean_data(eng_tokenizer,'eng_tokenizer.pkl')
save_clean_data(eng_vocab_size,'eng_vocab_size.pkl')
save_clean_data(deu_tokenizer,'deu_tokenizer.pkl')
save_clean_data(deu_vocab_size,'deu_vocab_size.pkl')
save_clean_data(trainX,'trainX.pkl')
save_clean_data(trainY,'trainY.pkl')
save_clean_data(testX,'testX.pkl')
save_clean_data(testY,'testY.pkl')
save_clean_data(eng_length,'eng_length.pkl')
save_clean_data(ger_length,'ger_length.pkl')

Lines 42-52, we save all the variables we received from line 24 as pickle files.

Executing the script

Now that we have completed all the scripts, let us go ahead and execute the scripts. Open a terminal and give the following command line arguments to run the script.

$ python mt_driver_train.py

All the scripts will be executed and finally the model files and other variables will be stored on disk. We will be using all the saved files in the inference phase. We will address the inference phase in the next post of the series.

Go to article 7 of this series : From prototype to production: Inference Process

You can download the notebook for the prototype using the following link

https://github.com/BayesianQuest/MachineTranslation/tree/master/Production

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

V : Build and deploy data science products: Machine translation application-Develop the prototype

Source:boagworld.com

”Prototyping is the conversation you have with your ideas”

Tom Wujec

This is the fifth part of the series where we see our theoretical foundation on machine translation come to fruition. This series comprises of 8 posts.

  1. Understand the landscape of solutions available for machine translation
  2. Explore sequence to sequence model architecture for machine translation.
  3. Deep dive into the LSTM model with worked out numerical example.
  4. Understand the back propagation algorithm for a LSTM model worked out with a numerical example.
  5. Build a prototype of the machine translation model using a Google colab / Jupyter notebook.( This post)
  6. Build the production grade code for the training module using Python scripts.
  7. Building the Machine Translation application -From Prototype to Production : Inference process
  8. Build the machine translation application using Flask and understand the process to deploy the application on Heroku

In the previous 4 posts we understood the solution landscape for machine translation ,explored different architecture choices for sequence to sequence models and did a deep dive into the forward pass and back propagation algorithm for LSTMs. Having set a theoretical foundation on the application, it is time to build a prototype of the machine translation application. We will be building the prototype using a Google Colab / Jupyter notebook.

Building the prototype

The prototype building phase will consist of the following steps.

  1. Loading the raw data
  2. Preprocessing the raw data for machine translation
  3. Preparing the train and test sets
  4. Building the encoder – decoder architecture
  5. Training the model
  6. Getting the predictions

Let us get started in building the prototype of the application on a notebook

Downloading the raw text

Let us first grab the raw data for this application. The data can be downloaded from the link below.

http://www.manythings.org/anki/deu-eng.zip

This is also available in the github repository. The raw text consists of English sentences paired with the corresponding German sentence. Once the data text file is downloaded let us upload the data in our Google drive. If you do not want to do the prototype in Colab, you can download it in your local drive and then use a Jupyter notebook also for the purpose.

Preprocessing the text

Before starting the processes, let us import all the packages we will be using for the process

import string
import re
from numpy import array, argmax, random, take
from numpy.random import shuffle
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, RepeatVector
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model
from tensorflow.keras import optimizers
import matplotlib.pyplot as plt
% matplotlib inline
pd.set_option('display.max_colwidth', 200)
from pickle import dump
from unicodedata import normalize
from tensorflow.keras.models import load_model

The raw text which we have downloaded needs to be opened and progressively preprocessed through series of processing steps to ultimately get the train and test set which we require for building our models. Let us first define the path for the text, so as to take it from the google drive. This path has to be changed by you based on the path in which you load the data

# Define the path to the raw data set 
fileurl = '/content/drive/My Drive/Bayesian Quest/deu.txt'

Once the path is defined, let us read the text data.

# open the file 
file = open(fileurl, mode='rt', encoding='utf-8') 
# read all text 
text = file.read()

The text which is read from the text file would be in the format shown below

text[0:200]
Output of first 200 characters of text

From the output we can see that each record is seperated by a line (\n) and within each record the data we want is seperated by tabs (\t).So we can first split each record on new lines (\n) and after that each line we split on the tabs (\t) to get the data in the format we want

# Split the text into individual lines
lines = text.strip().split('\n')
# Splitting each line based on tab spaces and creating a list
lines = [line.split('\t') for line in lines]
# Visualizing first 5 lines
lines[0:5]

We can see that the processed records are stored as lists with each list containing an enlish word, its German translation and some metadata about the data. Let us store these lists as an array for convenience and then display the shape of the array.

# Storing the lines into an array
mtData = array(lines)
# Displaying the shape of the array
print(mtData.shape)
Shape of array

All the above steps we can represent as a function. Let us construct the function which will be used to load the data and do basic preprocessing of the data.

# function to read raw text file
def read_text(filename):
    # open the file
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    
    # Split the text into individual lines
    lines = text.strip().split('\n')
    # Splitting each line based on tab spaces and creating a list
    lines = [line.split('\t') for line in lines]

    file.close()
    return array(lines)

We can call the function to load the data and convert it into an array of English and German sentences. We can also see that the raw data has more than 200,000 rows and three columns. We dont require the third column and therefore we can eliminate them. In addition processing all rows would also be computationally expensive. Let us take the first 50000 rows. However this decision is left to you on how many rows you want based on the capacity of your machine.

# Reading the data using the function
mtData = read_text(fileurl)
# Taking only 50000 rows of data
mtData = mtData[:50000,:2]
print(mtData.shape)
mtData[0:10]

With the array format, the data is in a neat format with the first column being English and the second one the corresponding German sentence. However if you notice the text, there are lot of punctuations and other characters which are unwanted. We also need to standardize the text to lower case. Let us now crank up our cleaning process. The following are the processes which we will follow

  1. Normalize all unicode characters,which are special characters found in a language, to its corresponding ascii format. We will be using a library called ‘unicodedata’ for this normalization.
  2. Tokenize the string to individual words
  3. Convert all the characters to lower case
  4. Remove all punctuations from the text
  5. Remove all non alphabets from text

Since there are multiple processes involved we will be wrapping all these processes in a function. Let us look at the code which implements this.

# Cleaning the document for all unwanted characters

def cleanDocs(lines):
  cleanArray = list()
  for docs in lines:
    cleanDocs = list()
    for line in docs:
      # Normalising unicode characters
      line = normalize('NFD', line).encode('ascii', 'ignore')
      line = line.decode('UTF-8')
      # Tokenize on white space
      line = line.split()
      # Removing punctuations from each token
      line = [word.translate(str.maketrans('', '', string.punctuation)) for word in line]
      # convert to lower case
      line = [word.lower() for word in line]
      # Remove tokens with numbers in them
      line = [word for word in line if word.isalpha()]
      # Store as string
      cleanDocs.append(' '.join(line))
    cleanArray.append(cleanDocs)
  return array(cleanArray)

The input to the function is the array which we created in the earlier step. We first initialize some empty lists to store the processed text in Line 3.

Lines 5 – 7, we loop through each row ( docs) and then through each column (line) of the row. The first process is to normalize the special characters . This is done through the normalize function available in the ‘unicodedata’ package. We use a normalization method called ‘NFD’ which maintains the same form of the characters in lines 9-10. The next process is to tokenize the string to individual words by applying the split() function in line 12. We then proceed to remove all unwanted punctuations using the translate() function in line 14 . After this process we convert the text to lower case and then retain only the charachters which are alphabets using the isalpha() function in lines 16-18. We join the individual columns within a row using the join() function and then store the processed row in the ‘cleanArray’ list in lines 20-21. The final output after the whole process looks quite clean and is ready for further processing.

# Cleaning the sentences
cleanMtDocs = cleanDocs(mtData)
cleanMtDocs[0:10]

Nueral Translation Data Set Preperation

Now that we have completed the initial preprocessing, its now time to get closer to the core process. Let us first prepare the data sets in the required format we want for modelling. The various steps which we will follow for preparation of data set are

  1. Tokenizing the text and creating vocabulary dictionaries for English and German sentences
  2. Define the sequence length for both English and German text
  3. Encode the text sequences as integer sequences
  4. Split the data set into train and test sets

Let us see each of these processes

Tokenization and vocabulary creation

Tokenization is the process of splitting the string to individual unique words or tokens. So if the string is

"Hi I am enjoying this learning and I look forward for more"

The unique tokens vocabulary would look like the following

{'i': 1, 'hi': 2, 'am': 3, , 'enjoying': 4 , 'this': 5 , 'learning': 6 'and': 7, , 'look': 8 , 'forward': 9, 'for': 10, 'more': 11}

Note that only unique words are taken and each token is given an index which will come in handy when we encode the tokens in later steps. So let us go ahead and prepare the tokens. Please note that we will be creating seperate vocabulary for English words and German words.

# Instantiating the tokenizer class
tokenizer = Tokenizer()

The function which does tokenization is the Tokenizer() class which could be imported from tensorflow.keras as shown above. The first step is to instantiate the Tokenizer() class. Next we will see how to fit text to the tokenizer object we created.

# Fit the tokenizer on the text
tokenizer.fit_on_texts(string)

Fitting the text is done using the fit_on_texts() method. This method splits the strings and then creates the vocabulary we saw earlier. Since these steps have to be repeated multiple times, let us package them as a function

# Function for creating tokenizers
def createTokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

Let us use the above function to create the tokenizer for English words and look at the total length of words in English

# Create English Tokenizer
eng_tokenizer = createTokenizer(cleanMtDocs[:,0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
print(eng_vocab_size)

We can see that the length of the English vocabulary is 6255. This is after we incremented the actual vocabulary size with 1 to account for any words which is not part of the vocabulary. Let us list down the first 10 words of the English vocabulary.

# Listing the first 10 items of the English tokenizer
list(eng_tokenizer.word_index.items())[0:10]

From the output we can see how the words are assigned an index value. Similary we will create the German vocabulary also

# Create German tokenizer
ger_tokenizer = createTokenizer(cleanMtDocs[:,1])
# Defining German Vocabulary
ger_vocab_size = len(ger_tokenizer.word_index) + 1

Now that we have tokenized the German and English sentences, the next task is to define a standard sequence length for these languges

Define Sequence lengths for German and English sentences

From our earlier introduction on sequence models, we know that we need data in sequences. A prerequisite in building sequence models is the sequences to be of standard lenght. However if we look at our corpus of both English and German sentences the lengths of each sentence will vary. We need to adopt a strategy for standardizing this length. One common strategy would be to adopt the maximum length of all the sentences as the standard sequence. Sentences which will have length lesser than the maximum length will have its indexes filled with zeros.However one pitfall of this strategy is, processing will be expensive. Let us say the length of the biggest sentence is 50 and most of the other sentences are of length ranging from 8 to 12. We have a situation wherein for just one sentence we unnecessarily increase the length of all other sentences by filling dummy values. When data sets become large, having all sentences standardized to the longest sentence will make the computation expensive.

To get over such issues we will adopt a strategy of finding a length under which majority of the sentences fall. This can be done by taking a high quantile value under which majority of the sentence lengths fall.

Let us implement this strategy. To start off we will have to count the lengths of all the sentences in the corpus

# Create an empty list to store all english sentence lenghts
len_english = []
# Getting the length of all the English sentences
[len_english.append(len(line.split())) for line in cleanMtDocs[:,0]]
len_english[0:10]

In line 2 we first created an empty list 'len_english'. Next we iterated through all the sentences in the corpus and found the length of each of the sentences and then appended each sentence lengths to the list we created, line 4.

Similarly we will create the list of all German sentence lenghts.

len_German = []
# Getting the length of all the English sentences
[len_German.append(len(line.split())) for line in cleanMtDocs[:,1]]
len_German[0:10]

After getting a distribution of all the lengths of English sentences, let us find the quantile value at 97.5% under which majority of the sentences fall.

# Find the quantile length
engLength = np.quantile(len_english, .975)
engLength

From the quantile value we can see that a sequence length of 5.0 would be a good value to adopt as majority of the sentences would fall within this length. Similarly let us calculate for the German sentences also.

# Find the quantile length
gerLength = np.quantile(len_German, .975)
gerLength

We will be using the sequence lengths we have calculated in the next process where we encode the word tokens as sequences of integers.

Encode the sequences as integers

Earlier we tokenized all the unique words and created vocabulary dictionaries. In those dictionaries we have a mapping of the word and an integer value for the word. For example let us display the first 5 tokens of the english vocabulary

# First 5 tokens and its integers of English tokenizer
list(eng_tokenizer.word_index.items())[0:5]

We can see that each tokens are associated with an integer value . In our sequence model we will be using the integer values instead of the tokens themselves. This process of converting the tokens to its corresponding integer values is called the encoding. We have a method called ‘texts_to_sequences’ in the tokenizer() to convert the tokens to integer sequences.

The standard length of the sequence which we calculated in the previous section will be the length of each of these integer encoding. However what happens if a sentence string has length more than the the standard length ? Well in that case the sentence string will be curtailed to the standard length. In the case of a sentence having length less than the standard length, the additional lengths will be filled with zeros. This process is called padding.

The above two processes will be implemented in a function for convenience. Let us look at the code implementation.

# Function for encoding and padding sequences

def encode_sequences(tokenizer,length, lines):
    # Sequences as integers
    X = tokenizer.texts_to_sequences(lines)
    # Padding the sentences with 0
    X = pad_sequences(X,maxlen=length,padding='post')
    return X

The above function takes three variables

tokenizer : Which is the language tokenizer we created earlier

length : The standard length

lines : Which is our data

In line 5 each line is converted to sequenc of integers using the 'texts_to_sequences' method and then padded using pad_sequences method, line 7. The parameter value of padding = 'post' means that the zeros are added after the corresponding length of the sentence till the standard length is reached.

Let us now use this function to prepare the integer sequence data for both English and German sentences. We will split the data set into train and test sets first and then encode the sequences. Please remember that German sequences are our X variable and English sentences are our Y variable as we are translating from German to English.

# Preparing the train and test splits
from sklearn.model_selection import train_test_split
# split data into train and test set
train, test = train_test_split(cleanMtDocs, test_size=0.1, random_state = 123)
print(train.shape)
print(test.shape)
# Creating the X variable for both train and test sets
trainX = encode_sequences(ger_tokenizer,int(gerLength),train[:,1])
testX = encode_sequences(ger_tokenizer,int(gerLength),test[:,1])
print(trainX.shape)
print(testX.shape)

Let us display first few rows of the training set

# Displaying first 5 rows of the traininig set
trainX[0:5]

From the visualization of the training set we can see the integer encoding of the sequences and also padding of the sequences . Similarly let us repeat the process for English sentences also.

# Creating the Y variable both train and test
trainY = encode_sequences(eng_tokenizer,int(engLength),train[:,0])
testY = encode_sequences(eng_tokenizer,int(engLength),test[:,0])
print(trainY.shape)
print(testY.shape)

We have come to the end of the preprocessing steps. Let us now get to the heart of the process which is defining the model and then training the model with the preprocessed training data.

Nueral Translation Model Building

In this section we will look into the building blocks of the model. We will define the model structure in a function as shown below. Let us dive into details of the model

def defineModel(src_vocab,tar_vocab,src_timesteps,tar_timesteps,n_units):
    model = Sequential()
    model.add(Embedding(src_vocab,n_units,input_length=src_timesteps,mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units,return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab,activation='softmax')))
    # Compiling the model
    model.compile(optimizer = 'adam',loss='sparse_categorical_crossentropy')
    # Summarising the model
    model.summary()
    
    return model

In the second article of this series we were introduced to the encoder-decoder architecture. We will be manifesting the encoder architecture within this code block. From the above code uptill line 5 is the encoder part and the remaining is the decoder part.

Let us now walk through each layer in this architecture.

Line 2 : Sequential Class

As you know neural networks, work on the basis of various layers stacked one after the other. In Keras, representation of the model as a stack of layers is initialized using a class called Sequential(). The sequential class is usable for most of the cases except in cases where one has to share multiple layers or have multiple inputs and outputs. For the latter case the functional API in keras is used. Since the model we have defined is quite straight forward, using sequential class will suffice.

Line 3 : Embedding Layer

A basic requirement for a neural network model is the input to be in numerical format. In our case our inputs are text format. So we have to convert this text into some numerical features. Word embedding is a very effective way of representing the sequence of texts in the form of numbers ensuring that the syntactic relationship between words in the sequence is also maintained.

Embedding layer in Keras can be explained in simple terms as a look up dictionary between the unique words in the vocabulary and the corresponding vector of that word. The vector for each word which is the representation of the semantic similarity is learned during the training process. The Embedding function within Keras requires the following parameters vocab_size, embedding_size and sequence_length

Vocab_size : The vocab size is required to initialize the matrix of unique words and its corresponding vectors. The unique indexes of each word is initialized based on the vocab size. Let us look at an example to illustrate this.

Suppose there are two sentences with the following words

Embedding gets the semantic relationship between words

‘Semantic relationships manifests the context

For demonstration purpose let us assume that the initial vector representation of these words are as shown in the table below.

IndexWordVector
0Embedding[0.02 , 0.01 , 0.12]
1gets[0.21 , 0.41 , 0.52]
2the[0.22 , 0.61 , 0.02]
3semantic[0.71 , 0.01 , 0.32]
4Relationship[0.85 ,-0.23 , -0.52]
5between[0.21 , -0.45 , 0.62]
6words[-0.29 , 0.91 , 0.052]
7manifests[0.121 , 0.401 , 0.352]
8context[0.721 , 0.531 , -0.592]

Let us understand each of the parameters of the embedding layer based on the above table. In our model the vocab size for the encoder part is the German vocabulary size. This is represented as src_vocab, which stands for source vocabulary. For the toy example we considered, our vocab size is 9 as there are 9 unique words in the above table.

embedding size : The second parameter which needs to be supplied is the embedding size. This represents the size of the vector for each word in the matrix. In the example matrix shown above the vector size is 3. The size of the embedding vector is a parameter which can be altered to get the right semantic relationship between the sequences of words in the sentence

sequence length : The sequence length represents the number of words which are required in each input sentence. As seen earlier during preprocessing, a pre-requisite for the LSTM layer was for the length of sequences to be standardized. If a particular sequence has less number of words than the sequence length, it was padded with dummy vectors so that the length was standard. For illustration purpose let us assume that the sequence length = 10. The representation of these two sentence sequences in the vector form will be as follows

[Embedding, gets, the ,semantic, relationship, between, words] => [[0.02 , 0.01 , 0.12], [0.21 , 0.41 , 0.52], [0.22 , 0.61 , 0.02], [0.71 , 0.01 , 0.32], [0.85 ,-0.23 , -0.52], [0.21 , -0.45 , 0.62], [-0.29 , 0.91 , 0.052], [0.00 , 0.00, 0.00], [0.00 , 0.00, 0.00]]

[Semantic, relationships, manifests ,the, context] => [[0.71 , 0.01 , 0.32], [0.85 ,-0.23 , -0.52], [0.121 , 0.401 , 0.352] ,[0.22 , 0.61 , 0.02], [0.721 , 0.531 , -0.592], [0.00 , 0.00, 0.00], [0.00 , 0.00, 0.00], [0.00 , 0.00, 0.00], [0.00 , 0.00, 0.00], [0.00 , 0.00, 0.00]]

The last parameter mask_zero = True is to inform the Model that some part of the data is padding data.

The final output from the embedding layer after providing all the above inputs will be a three dimensional matrix of the following shape (No. of samples ,sequence length , embedding size). Let us view this pictorially

As seen from the above figure, let each rectangular block represent the vector representation of a word in the sequence. The depth of the block will be the embedding size dimensions. Multiple words along the ‘X’ axis will form a sequence and multiple such sequences along the ‘Y’ axis will represent the number of examples we have in the corpora.

Line 4 : Sequence to sequence Layer (LSTM)

The next layer in the model is the sequence to sequence layer which in our case is a LSTM. We discussed in detail the dynamics of the LSTM layer in the third and fourth articles of the series. The number of hidden units is defined as a parameter when defining the LSTM unit.

Line 5 : Repeat Vector

In our machine translation application, we need to produce output which is equal in length with the standard sequence length of the target language ( English) . However our input at the encoder phase is equal in length to the source sequence ( German ). We therefore need a mechanism to map the output from the encoder phase to the number of sequences of the decoder phase. A ‘Repeat Vector’ is that operation which maps the input sequences (German sequence) to that of the output sequences ( English sequence). The below figure gives a pictorial representation of the operation.

As seen in the figure above we have to match the output from the encoder and the decoder. The sequence length of the encoder will be equal to the source sequence length ( German) and the length of the decoder will have to be the length of the target sequence ( English). Repeat vector can be described as a trick to match them. The output vector of the encoder where the information of the complete sequence is encoded is repeated in this operation. It is important to note that there are no weights and parameters in this operation.

Line 6 : LSTM Layer ( with return sequence is true)

The next layer is another LSTM unit. The dynamics within this unit is the same as the previous LSTM unit. The only difference in the output. In the previous LSTM unit we never had any output from each of the sequences. The output sequences is controlled by the parameter return_sequences. By default it is ‘False’. However in this case we have specified the return_sequences = True . This means that we need to have an output from each of the sequences. When we keep the return_sequences = False only the last sequence will have an output.

Line 7 : Time Distributed – Dense Layer with Softmax activation

This is the final layer of the network. This layer receives the output from the pervious LSTM layer which has outputs equal to the target sequence. Each of these sequences are then connected to a dense layer or a fully connected layer. Dense layer in Keras is synonymous to the dot product of the output and weight matrix along with addition of the bias term.

Dense = dot(Wy , Y) + by

Wy = Weight matrix of the Dense layer

Y = Output from each of the LSTM sequence

by = bias term for each sequence

After the dense operation, the resultant vector is taken through a softmax layer which converts the output to a probability distribution around the vocabulary of the target language. Another term to note is the command Time distributed. This implies that each sequence output which we get out of the LSTM layer has to be applied to a separate dense operation and a subsequent Softmax layer. So at the end of all the operation we will get a probability distribution around the target vocabulary from each of the output

Time Distributed Dense Layer

Line 9 Optimizer

In this layer the optimizer function and the loss functions are defined. The loss function we have defined is sparse_cross entropy, which is beneficial from a training perspective. If we use categorical_cross entropy we would require one hot encoding of the output matrix which can be very expensive to train given the huge size of the target vocabulary. Sparse_cross entropy gives us a great alternate.

Line 11 Summary

The last line is the summary of the model. Let us try to unravel each of the parameters of the summary level based on our understanding of the LSTM

The summary displays the model layer by layer the way we built it. The first layer is the embedding layer where the output shape is (None,6,256). None stands for the number of examples we have. The other two are the length of the source sequence ( src_timesteps = gerLength) and the embedding size ( 256 ).

Next we applied a LSTM layer with 256 hidden units which is represented as (None , 256 ). Please note that we will only have one output from this LSTM layer as we have not specified return_sequences = True.

After the single LSTM layer we have the repeat vector operations which copies the single output of the LSTM to a length equal to the target language length (engLength = 5).

We have another LSTM layer after the repeat vector operation. However in this LSTM layer we have defined the output as return_sequences=True . Therefore we have outputs of 256 units each for each of the sequence resulting in the output dimension of ( None, 5 , 256).

Finally we have the time distributed dense layer. We earlier saw that the time distributed dense layer will be a dense operation on each of the time sequence. Each sequence will be of the form Dense = dot(Wy , Y) + by. The weight matrix Wy will have a dimension of (256,6225 ) where 6225 is the dimension of the target vocabulary ( eng_vocab_size = 6225). Y is the output from each of the LSTM layer from the previous layer which has a dimension ( 1, 256 ). So the dot product of both these matrices will be

[ 1, 256 ] x [256,6225] = >> [1, 6225]

The above is for one time step. When there are 5 time steps for the target language we will get a dimension of ( None , 5 , 6225)

Model fitting

Having defined the model and the optimization function its time to fit the model on the data.

# Fitting the model
checkpoint = ModelCheckpoint('model1.h5',monitor='val_loss',verbose=1,save_best_only=True,mode='min')
model.fit(trainX,trainY,epochs=50,batch_size=64,validation_data=(testX,testY),callbacks=[checkpoint],verbose=2)

The initiation of both the forward and backward propagation is through the model.fit function. In this function we provide the inputs (trainX and trainY), the number of epochs , the batch size for each pass of the optimizing function and also the validation set. We also define the checkpointing to save our models based on the validation score. The model fitting process or training process is a time consuming step. During the train phase the forward pass, error identification and the back propogation processes will kick in.

With this we come to the end of the training process. Let us look back and summarize the model architecture to get a big picture of the process.

Model Big picture

Having seen the model components, let us now get a big picture as to the whole process and how the forward and back propagation work together to learn the required parameters from the data.

The start of the process is the creation of the features for the model namely the embedding layer. The inputs for the input layer are the source vocabulary size, embedding size and the length of the sequences. The output we get out of this is a three dimensional matrix with number of examples, sequence length and the embedding size as the three dimensions.

The embedding layer is then supplied to the first LSTM layer as input with each time step receiving an embedding layer . There will not be any output for each time step of the sequence. The only output will be from the last time step which is then given as input to the next LSTM layer. The number of time steps of the second LSTM unit will be the equal to length of the target language sequence. To ensure that the LSTM has inputs equal to the target sequences, the repeat vector function is used to copy the output from the previous LSTM layer to all the time steps of the second LSTM layer.

The second LSTM layer will given intermediate outputs for each of the time steps. Each of these outputs are then fed into a dense layer. The output of the dense layer will be a vector equal to the vocabulary length of the target language. This vector is then passed on to the softmax layer to convert it into a probability distribution around the target vocabulary. The output from the softmax layer, which is the prediction is compared with the actual label and the difference would be the error.

Once the error is generated, it has to be back propagated to all the parts of the network to get the gradients of each of the parameters. The error will start propagating first from the dense layer and then would propagate to each of the sequence of the second LSTM unit. Within the LSTM unit the error will start propogating from the last sequence and then will progressively move towards the first sequence. During the movement of the error from the last sequence to the first, the respective errors from each of the sequences are added to the propagated error so as to get the gradients. The final weight gradient would be sum of the gradients obtained from each of the sequence of the LSTM as seen from the numerical example on back propagation. The gradient with respect to each of the inputs will also be calculated by summing across all the time step. The sum total of the gradients of the inputs from the second LSTM layer will be propagated back to the first LSTM layer.

In the first LSTM layer, the gradient received from the top layer will be propagated from the last time sequence. The error propagates progressively through each time step. In this LSTM there will not be any error to be added at each sequence as there were no output for each of the sequence except for the last layer. Along with all the weight gradients , the gradient vector for the embedding vector is also calculated. All these operations are carried out for all the epochs and finally the model weights are learned, which help in the final prediction.

Once the training is over, we get the most optimised parameters inside the model object. This model object is then used to predict on the test data set. Let us now look at the prediction or inference phase of the process.

Inference Process

The proof of the pudding of the model we created is the predictions we get from a test set. Let us first look at how the predictions would be from the model which we just created

# Generating the predictions
prediction = model.predict(testX,verbose=0)
prediction.shape

We get the prediction from the model using model.predict() method with the test data as its input. The prediction we get would be of shape ( num_examples, target_sequence_length,target_vocabulary_size). Each example will be a sequence of probability distribution around the target vocabulary. For each sequence the predicted word would be the index of the vocabulary where the probability is the greatest. Let us demonstrate this with a figure.

Let us assume that the vocabulary has only three words [ I , Learning , Am] with indexes as [1,2,3] respectively. On predicting with the model we will get a probability distribution on each sequence as shown in the figure above. For the first sequence the probability for the first index word is 0.6 and the other two are 0.2 and 0.2 resepectively. So from the probability distribution the word in the first index has the largest probability and that will be the predicted word for that sequence. So based on the index with the maximum probability for the entire sequence we get the predictions as [1,3,2] which translates to [I , Am, Learning] as per the vocabulary.

To get the index of each of the sequences, we use a function called argmax(). This is how the code to get the indexes of the predictions will look

# Getting the prediction index along the last axis ( Vocabulary size axis)
predIndex = [argmax(vector,axis = -1) for vector in prediction]
predIndex[0:3]

In the above code axis = -1 means that the argmax has to be taken on the last dimension of the prediction which is along the vocabulary dimension. The prediction we get will be in the form of sequences of integers having the same sequence length as the target vocabulary.

If we look at the first 3 predictions we can see that the predictions are integers which have to be converted to the corresponding words. This can be done using the tokenizer dictionary we created earlier. Let us look at how this is done

# Creating the reverse dictionary
reverse_eng = eng_tokenizer.index_word

The index_word, method of the tokenizer class generates the word for an input index. In the above step we have created a dictionary called reverse_eng which outputs a word when given an index. For a sequence of predictions we have to loop through all the indexes of the predictions and then generate the predicted words as shown below.

# Converting the tokens to a sentence
preds = []
for pred in predIndex[0]:
  if pred == 0:
        continue 
  preds.append(reverse_eng[pred])  
print(' '.join(preds))

In the above code block in line 2 we first initialized an empty list preds . We then iterated through each of the indexes in lines 3-6 and generated the corresponding word for the index using the reverse_eng dictionary. The generated words are finally appended to the preds list. We joined all the words in the list together get our predicted sentence.

Let us now package all the inference code we have seen so far into two functions.

# Creating a function for converting sequences
def Convertsequence(tokenizer,source):
    target = list()
    reverse_eng = tokenizer.index_word
    for i in source:
        if i == 0:
            continue
        target.append(reverse_eng[int(i)])
    return ' '.join(target)

The first function is to convert the sequence of predictions to a sentence.

# Function to generate predictions from source data
def generatePredictions(model,tokenizer,data):
    prediction = model.predict(data,verbose=0)
    AllPreds = []
    for i in range(len(prediction)):
        predIndex = [argmax(prediction[i, :, :], axis=-1)][0]
        target = Convertsequence(tokenizer,predIndex)
        AllPreds.append(target)
    return AllPreds

The second function is to generate predictions from the test set and then generate the predicted sentence. The first function we defined is used inside the generatePredictions function.

Now that we have understood how the predictions can be generated let us go ahead and generate predictions for the first 20 examples of the test set and evaluate the results.

# Generate predictions
predSent = generatePredictions(model,eng_tokenizer,testX[0:20,:])
for i in range(len(testY[0:20])):
    targetY = Convertsequence(eng_tokenizer,testY[i:i+1][0])
    print("Original sentence : {} :: Prediction : {}".format([targetY],[predSent[i]]))

From the output we can see that the predictions are pretty close in a lot of the examples. We can also see that there are some instances where the context is understood and predicted with different words like the examples below

There are also predictions which are way off the target

However considering the fact that the model we used was simple and the data set we used were relatively small, the model does a reasonably okay job.

Inference on your own sentences

Till now we predicted on the test set. Let us see how we can generate predictions from an input sentence we provide.

To generate predictions from our own input sentences, we have to first clean the input sentences and then tokenize them to transform it to the format the model understands. Let us look at the functions which does these tasks.

def cleanInput(lines):
    cleanSent = []
    cleanDocs = list()
    for docs in lines.split():
        line = normalize('NFD', docs).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        line = [line.translate(str.maketrans('', '', string.punctuation))]
        line = line[0].lower()
        cleanDocs.append(line)
    cleanSent.append(' '.join(cleanDocs))
    return array(cleanSent)

The first function is the cleaning function. This is an abridged version of the cleaning function we used for our original data set. The second function we will use is the encode_sequences function we used earlier. Using these functions let us go ahead and generate our predictions.

# Trying different input sentences
inputSentence = 'Es ist ein großartiger Tag' # It is a great day ?

The first sentence we will try is the German equivalent of 'It is a great day ?'.

Let us clean the input text first using the function we developed

# Clean the input sentence
cleanText = cleanInput(inputSentence)

Next we will encode this sentence into sequence of integers

# Encode the inputsentence as sequence of integers
seq1 = encode_sequences(ger_tokenizer,int(gerLength),cleanText)

Let us get our predictions and print them out

# Generate the prediction
predSent = generatePredictions(model,eng_tokenizer,seq1)

print("Original sentence : {} :: Prediction : {}".format([cleanText[0]],predSent))

Its not a great prediction isnt it ?? Let us try couple more sentences

inputSentence1 ='Heute wird es regnen' #  it's going to rain Today
inputSentence2 ='Ich habe im Radio gesprochen' # I spoke on the radio

for sentence in [inputSentence1,inputSentence2]:
  cleanText = cleanInput(sentence)
  seq1 = encode_sequences(ger_tokenizer,int(gerLength),cleanText)
  # Generate the prediction
  predSent = generatePredictions(model,eng_tokenizer,seq1)

  print("Original sentence : {} :: Prediction : {}".format([cleanText[0]],predSent))

We can see that the predictions on our own sentences are not promising .

Why is it that the test set gave us reasonable predictions and our own sentences are not giving good predicitons ? Well one obvious reason is that the distribution of words we used could be different from the distribution which was used for training. Besides,the model we used was a simple one and the data set also relatively small. All these could be the reasons for bad predictions on our own sentences. So how do we improve the quality of predictions ? There are different ways to do that. Let us see some of them.

  1. Use bigger data set for training and train for longer epochs.
  2. Change the model architecture. Experiment with different number of units and number of layers. Try variations like bidirectional LSTM
  3. Try out different regularization methods like drop out.
  4. Use attention mechanisms

There are different avenues for improvement. I would urge you to try out different choices and let me know how your fared.

Next Steps

Congratulations, we have successfully built a prototype for machine translation system. The next step in our journey is to convert this prototype into an application. We will address that in the next post.

Go to article 6 of this series : From prototype to production

You can download the notebook for the prototype using the following link

https://github.com/BayesianQuest/MachineTranslation/tree/master/Prototype

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

IV : Build and Deploy Data Science Products : Looking under the hood of Machine translation model – LSTM Backpropagation

Source: drivezone.com

“True knowledge come with deep understanding of a topic and its inner working”

Albert Einsteen

This is the fourth part of the series where we continue on our quest to understand the innerworking of a LSTM model. Deep understanding of the model is a step towards acquiring comprehensive knowledge on our machine translation application. This series comprises of 8 posts.

  1. Understand the landscape of solutions available for machine translation
  2. Explore sequence to sequence model architecture for machine translation.
  3. Deep dive into the LSTM model with worked out numerical example.
  4. Understand the back propagation algorithm for a LSTM model worked out with a numerical example.( This post)
  5. Build a prototype of the machine translation model using a Google colab / Jupyter notebook.
  6. Build the production grade code for the training module using Python scripts.
  7. Building the Machine Translation application -From Prototype to Production : Inference process
  8. Build the machine translation application using Flask and understand the process to deploy the application on Heroku

In the previous 3 posts we understood the solution landscape for machine translation ,explored different architecture choices for sequence to sequence models and did a deep dive into the forward propagation algorithm. Having understood the forward propagation its now time to explore the back propagation of the LSTM model

Back Propagation Through Time

We already know that Recurrent networks have a time component and we saw the calculations of different components during the forward propogation phase. We saw in the previous post that we traversed one time step at a time to get the expected outputs.

The backpropagation operation works in the reverse order and traverses one time step at a time in the reverse order to get the gradients of all he parameters. This process is called back propagation through time.

The initiation step of the back propagation is the error term .As you know the dynamics of optimization entails calculating the error between the predicted output and ground truth, propagating the gradient of the error to the layers thereby updating the parameters of the model. The work horse of back propagation is partial differentiation,using the chain rule. Let us see that in action.

Backpropagation calculation @ time step 2

We start with the output from our last layer which as calculated in the forward propagation stage is ( refer the figure above)

a t = -0.141

The label for this time step is

Yt = 0.3

In this toy example we will take a simple loss function , a squared loss. The error would be derived as the average of squared difference between the last time step and the ground truth (label).

Error = ( at – yt )2/2

Before we start the back propagation it would be a good idea to write down all the equations in the order in which we will be taking the derivative.

  1. Error = ( at – yt )2/2
  2. at = tanh( Ct ) * Ґo
  3. Ct = Ґu * C~ + Ґf * Ct-1
  4. Ґo = sigmoid(Wo *[xt , at-1] + bo)
  5. C~ = tanh(Wc *[xt , at-1] + bc)
  6. Ґu = sigmoid(Wu *[xt , at-1] + bu)
  7. Ґf = sigmoid(Wf *[xt , at-1] + bf)

We mentioned earlier that the dynamics of backpropogation is the propogation of the gradients . But why is getting the gradients important and what information does it carry ? Let us answer these questions.

A gradient represents unit rate of change i.e the rate at which parameters have to change to get a desired reduction in error. The error which we get is dependent on how close to reality our initial assumptions of weights were. If our initial assumption of weights were far off from reality, the error also would be large and viceversa. Now our aim is to adjust our initial weights so that the error is reduced. This adjustment is done through the back propagation algorithm. To make the adjustment we need to know the quantum of adjustment and also the direction( i.e whether we have to add or subtract the adjustments from initially assumed weights). To derive the quantum and direction we use partial differentiation. You would have learned in school that partial differentiation gives you the rate of change of a variable with respect to another. In this case we need to know the rate at which the error would change when we make adjustments to our assumed parameters like weights and bias terms.

Our goal is to get the rate of change of error with respect to the weights and the biases. However if you look at our first equation

Error = ( at – yt )2/2

we can see that it dosent have any weights in it. We only have the variable at . But we know from our forward propagation equations that at is derived by different operations involving weights and biases. Let us traverse downwards from the error term and trace out different trails to the weights and biases.

The above figure represents different trails ( purple,green,red and blue ) to reach the weights and biases from the error term. Let us first traverse the purple coloured trail.

The purple coloured trail is to calculate the gradients associated with the output gate. When we say gradients it entails finding the rate of change of the error term with respect to the weights and biases associated with the output gate, which in mathematical form is represented as ∂E/∂Wo. If we look down the purple trail we can see that Wo appears in the equation at the tail end of the trail. This is where we apply the chain rule of differentiation. Chain rule of differentiation helps us to differentiate the error with respect to the connecting links till we get to the terms which we want, like weights and biases. This operation can be represented using the following equation.

This image has an empty alt attribute; its file name is image.png

The green coloured trail is a longer one. Please note that the initial part of the green coloured trail, where the differentiation with respect to error term is involved ( ∂E/∂a ), is the same as the purple trail. From the second box onwards a distinct green trail takes shape and continues till the box with the update weight (Wu). The equation for this can be represented as follows

∂E/∂Wu = ∂E/∂a * ∂a/∂Ct * ∂Ct/ ∂Γu * ∂Γu/∂Wu

The other trails are similar. We will traverse through each of these trails using the numerical values we got after the forward propagation stage in the last post.

Gradients of Equation 1 : Error = ( at – yt )2/2

The first step in backpropagation is to take the derivative of the error with respect to at

dat = ∂E/∂at = ∂/∂at[ (at -y)2/2]
= 2 * (at-y)/ 2
= at-y

Let us substitute the values

Derivative 2.1.1.Numerical EqnValue
dat = at– y= -0.141 – 0.3-0.441

If you are thinking that thats all for this term then you are in for a surprise. In the case of a sequence model like LSTM there would be an error term associated with each time step and also an error term which back propagates from the previous time step. The error term for certain time step would be the sum total of both these errors. Let me explain this pictorially.

This image has an empty alt attribute; its file name is timestep_errorearlierstep.jpeg

Let us assume that we are taking the derivative of the error with respect to the output from the first time step (∂E/∂at-1), which is represented in the above figure as the time step on the left. Now if you notice the same output term at-1 is also propogated to the second time step during the forward propagation stage. So when we take the derivative there is a gradient of this output term which is with respect to the error from the second time step. This gets propogated through all the equations within the second time step using the chain rule as represented by the purple arrow. This gradient has to be added to the gradient derived from the first time step to get the final gradient for that output term.

However in the case of the top example since we are taking the gradient of the second time step and there are no further time step, this term which gets propogated from the previous time step would be 0. So ideally the equation for the derivative for the second output should be written as follows

Derivative – 1.1Numerical EqnValue
da = at – y + 0= -0.141 – 0.3 + 0-0.441

In this equation ‘0’ corresponds to the gradient from the third time step, which in this case dosent exist.

Gradients of Equation 2 : [at = tanh( Ct ) * Ґo ]

Having found the gradient for the first equation which had the error term, the next step is to find the gradient of the output term at and its component terms Ct and Ґo.

Let us first differentiate it with respect to Ct. The equations for this step are as follows

∂E/∂Ct = ∂E/∂at * ∂at/∂Ct
= da * ∂at/∂Ct

In this equation we have already found the value of the first term which is ∂E/∂at in the first step. Next we have to find the partial derivative of the second term ∂at/∂Ct

∂at/∂Ct = Γo * ∂/∂Ct [ tanh(Ct)]
= Γo * [1 - tanh2(Ct)]

Please note => ∂/∂x [ tanh(x)]= 1 – tanh2(x)

So the complete derivation for ∂E/∂Ct is

∂E/∂Ct = da * Γo * [1 - tanh2(Ct)]

So the above is the derivation of the partial derivative with respect to state 2. Well not quite. There is one more term to be added to this where state 2 will appear. Let me demonstrate that. Let us assume that we had 3 time steps as shown in the table below

We can see that the term Ctappears in the 3rd time step as circled in the table. When we take the derivative of error of time step 2 with respect to Ct we will have to take the derivative from the third time step also. However in our case since the third step doesn’t exist that term will be ‘0’ as of now. However when we take the derivative of the first time step we will have to consider the corresponding term from the second time step. We will come to that when we take the derivative of the first time step. For the time being it is ‘0’ for us now as there is no third time step.

Derivative – 2.2.1Numerical EqnValue
dCt =
da * Ґo * (1 – tanh2(Ct ) + 0
= -0.441 * 0.24* (1-tanh2(-0.674 ) =
-0.441 * 0.24* ( 1 – (-0.59 * -0.59))
-0.069

Let us now take the gradient with the second term of equation 2 which is Ґo. The complete equation for this term is as follows

∂E/Γo = ∂E/∂at * ∂at/∂Γo
= da * ∂at/∂Γo

The above equation is very similar to the earlier derivation. However there are some nuances with the derivation of the term with Γo . If you remember this term is a sigmoid gate with the following equation.

Γo = sigmoid(Wo *[xt , at-1] + bo)
= sigmoid(u)
Where u = Wo *[xt , at-1] + bo

When we take the derivative of the output term with respect to Γo (∂at/∂Γo ), this should be with respect to the terms inside the sigmoid function ( u). So ∂at/∂Γo would actually mean ∂at/∂u . So the entire equation can be rewritten as

at = tanh(Ct) * sigmoid(u) where Γo = sigmoid(u).

Therefore ∂at/∂Γo = tanh(Ct) * Γo *( 1 - Γo )

Please note if y = sigmoid(x) , ∂y/∂x = y(1-y)

The complete equation for the term ∂E/Γo =

da * tanh(Ct) * Γo *( 1 - Γo )

Substituting the numerical terms we get

Derivative – 2.2.2Numerical EqnValue
d Ґ0 = da *tanh(Ct)* Ґo *(1 – Ґo)= -0.441 * tanh(-0.674) * 0.24 *(1-0.24)0.047

Gradients of Equation 3 : [ Ct = Ґu * C~ + Ґf * Ct-1 ]

Let us now find the gradients with respect to the third equation

This equation has 4 terms, Ґu, C~ , Ґf and Ct-1 , for which we have to calculate the gradients. Let us start from the first term Ґu whose equation is the following

∂E/Γu = ∂E/∂at * ∂at/∂Ct * ∂Ct/ ∂Γu

However the first two terms of the above equation, ∂E/∂at * ∂at/∂Ct were already calculated in derivation 2.1, which can be represented as dCt . The above equation can be re-written as

∂E/Γu = dCt * ∂Ct/ ∂Γu

From the new equation we are left with the second part of the equation which is ∂Ct / ∂Γu ,which is the derivative of equation 3 with respect to Γu.

Ct = Ґu * C~ + Ґf * Ct-1.......... (3)

∂Ctu = C~ * ∂/ ∂Γu [ Ґu ] + 0.........(3.1)

In equation 3 we can see that there are two components, one with the term Ґu in it and the other with Ґf in it. The partial derivative of the first term which is partial derivative with respect to Ґu is what is represented in the first half of equation 3.1. The partial derivative of second half which is the part with the gate Ґf will be 0 as there is no Ґu in it. Equation 3.1 represents the final form of the partial derivative.

Now similar to derivative 2.2 which we developed earlier, the partial derivative of the sigmoid gate ∂/ ∂Γu [ Ґu ] will be Γu * ( 1 - Γu ). The final form of equation 3.1 would be

∂Ctu = C~ * Γu * ( 1 - Γu )

The solution for the gradient with respect to the update gate would be

∂E/Γu = dCt* C~ * Γu * ( 1 - Γu )

Derivative 2.3.1Numerical EqnValue
d Ґu = dCt *C~* Ґu *(1 – Ґu)= -0.069 * -0.63 * 0.755 *(1-0.755)0.0080

Next let us calculate the gradient with respect to the internal state C~ . The complete equation for this is as follows

∂E/C~ = ∂E/∂at * ∂at/∂Ct * ∂Ct/ ∂C~

= dCt * ∂Ct/ ∂C~

= dCt * ∂/ ∂C~ [ Ґu * C~]

= dCt * Ґu * ∂/ ∂C~ [ C~]

However we know that, C~ = tanh(Wc *[x , a] + bc) . Let us represent the terms within the tanh function as u .

C~ = tanh(u), where u = Wc *[x , a] + bc

Similar to the derivations we have done for the sigmoid gates, we take the derivatives with respect to the terms within the tanh() function which is ‘u’ . Therefore

∂/ ∂C~ [ C~] = 1-tanh2 (u)

= 1 - tanh2 (Wc *[x , a] + bc)

= 1 - (C~ )2

since, C~ = tanh(Wc *[x , a] + bc) and therefore

(C~ )2 = tanh2 (Wc *[x , a] + bc)

The final equation for this term would be

∂E/C~ = dCt * Ґu * 1 - (C~ )2

Derivative 2.3.2Numerical EqnValue
dC~ = dCt* Ґu * (1 – (C~)2)= -0.069 * 0.755 * (1-(-0.63)2)-0.0314

The gradient with respect to the third term Ґf would be very similar to the derivation of the first term Ґu

∂E/Γf = ∂E/∂at * ∂at/∂Ct* ∂Ct/ ∂Γf

= dCt* Ct-1 * Γf * ( 1 - Γf )

Derivative 2.3.3Numerical EqnValue
d Ґf = dCt* Ct-1* Ґf *(1 – Ґf)= -0.069 * -0.33 * 0.60 *(1-0.60)0.0055

Finally we come to the gradient with respect to the fourth term Ct-1 ,the equation for which is as follows

∂E/Ct-1 = ∂E/∂at * ∂at/∂Ct* ∂Ct/ ∂Ct-1

= dCt* Γf

Derivative 2.3.4Numerical EqnValue
dCt-1 = dCtf= -0.069 * 0.60-0.0414

In this step we have got the gradient of cell state 1 which will come in handy when we find the gradients of time step 1.

Gradients of previous time step output (at-1)

In the previous step we calculated the gradients with respect to equation 3. Now it is time to find the gradients with respect to the output from time step 1 , at-1.

However one fact which we have to be cognizant is that at-1 is present in 4 different equations, 4,5,6 & 7. So we have to take derivative with respect to all these equations and then sum it up. The gradient of at-1 within equation 4 is represented as below

∂E/∂at-1 = ∂E/∂at * ∂at/∂Ct * ∂Ct/ ∂Γo * ∂Γo/∂at-1

However we have already found the gradient of the first three terms of the above equation, which is with respect to the terms within the sigmoid function i.e ‘u’ as dΓo in derivative 2.1. Therefore the above equation can be simplified as

∂E/∂at-1 = dΓo * ∂Γo/∂at-1

The term ∂Γo/∂at-1 in reality is ∂u/∂at-1 where u = Wo *[x , at-1] + bo, because when we took the derivative of Γo we took it with respect to all the terms within the sigmoid() function,which we called as ‘u’.

From the above equation the derivative will take the form

∂Γo/∂at-1 = ∂u/∂at-1 = Wo

The complete equation from the gradient is therefore

∂E/∂at-1 = dΓo * Wo

There are some nuances to be taken care in the above equation since there is a multiplication by Wo . When we looked at the equation for the forward pass we saw that to get the equation of the gates, we originally had two weights, one for the x term and the other for the ‘a’ term as below

Ґo = sigmoid(Wo*[x ] + Uo* [a] + bo)

This equation was simplified by concatenating both the weight parameters and the corresponding x & a vectors to a form given below.

Ґo = sigmoid(Wo *[x , a] + bo)

So in effect there is a part of the final weight parameter Wo which is applied to ‘x’ and another part which is applied to ‘a’. Our initial value of the weight parameter Wo was [-0.75 ,-0.95 , -0.34]. The first two values are the values corresponding to ‘x’ as it is of dimension 2 and the last value ( -0.34) is what is applicable for ‘a’. So in our final equation for the gradient of at-1, ∂E/∂at-1 = dΓo * Wo , we will multiply o only with -0.34.

Similar to the above equation we have to take the derivative of at-1for all other equations 5,6 and 7 which will take the form

Equation 5 = > ∂E/∂at-1 = dC~* Wc

Equation 6= > ∂E/∂at-1 = dΓu* Wu

Equation 7 = > ∂E/∂at-1 = dΓf* Wf

The final equation of will be the sum total of all these components

Derivative 2.4.1Numerical EqnValue
dat-1 = Wo * o + Wc * dC~ + Wu * dҐu + Wf * dҐf= [(-0.34 * 0.047) + (-0.13 * -0.0314) + (1.31 * 0.0080) + (-0.13*0.0055) ]-0.00213

Now that we have calculated the gradients of all the components for time step 2, let us proceed with the calculations for time step 1

Back Propagation @ time step 1.

All the equations and derivations for time step 1 is similar to time step 2. Let us calculate the gradients of all the equations as we did with time step 1

Gradients of Equation 1 : Error term

Gradient with respect to error term of the current time step

Derivative 1.1.1Numerical EqnValue
dat-1 = at-1 – yt-1= -0.083 – 0.8-0.883

However we know that dat-1 = Gradient with respect current layer + Gradient from next time step, as shown in the figure below

Gradient propagated from the 2nd layer is the derivative of at-1which was derived as the last step of the previous time step ( Derivative 2.4.1)

Total Gradient

Derivative 1.1.1Numerical EqnValue
dat-1 = Gradient from this layer + Gradient from previous layer= -0.883 + -0.00213-0.88513

Gradients of Equation 2

Next we have to find the gradients of equation 2 with respect to the cell state, Ct-1and Ґo.When deriving gradient of cell state state we discussed that the cell state of the current layer appears in the next layer also, which will have to be considered. So the total derivative would be

Derivative 1.2.1Formulae
dCt-1 = dCt-1 in current time step + dCt-1 from next time stepdCt-1 = da * Ґo * (1 – tanh2(Ct-1 ) + dCt-1f ( Derivative 3.4)
Derivative 1.2.1Numerical EqnValue
dCt-1 = da * Ґo * (1 – tanh2(Ct-1 ) + dCt-1in next layer= -0.88513 * 0.26 * (1-tanh2(-0.33 ) + (-0.0414)
= -0.88513* 0.26 * ( 1 – (-0.319 * -0.319)) – 0.0414
-0.25

Next is the gradient with respect to Ґo .

Derivative 1.2.2Numerical EqnValue
d Ґ0 = da *tanh(Ct-1)* Ґo *(1 – Ґo)= -0.88513 * tanh(-0.33) * 0.26 *(1-0.26)0.054

Gradients of Equation 3

  • Derivative with respect to Ґu
Derivative 1.3.1Numerical EqnValue
d Ґu = dCt-1 *C~* Ґu *(1 – Ґu)= -0.25 * -0.39 * 0.848 *(1-0.848)0.013
  • Derivative with respect to C~
Derivative 1.3.2Numerical EqnValue
dC~ = dCt-1 * Ґu * (1 – (C~)2)= -0.25 * 0.848 * (1-(-0.39)2)-0.18
  • Derivative with respect to Ґf
Derivative 1.3.3Numerical EqnValue
d Ґf = dCt-1 *C0* Ґf *(1 – Ґf)= -0.25 * 0 * 0.443 *(1-0.443)0
  • Derivative with respect to initial cell state C<0>
Derivative 1.3.4Numerical EqnValue
dC0 = dCt-1f= -0.25 * 0.443-0.11

Gradients of initial output (a0)

Similar to the previous time step this has 4 components pertaining to equations 4,5,6 & 7

Derivative 1.4.1Numerical EqnValue
da0 = Wo * o + Wc * dC~ + Wu * dҐu + Wf * dҐf= [(-0.34 * 0.054) + (-0.13 * -0.18) + (1.31 * 0.013) + (-0.13*0) ]0.022

Now that we have completed the gradients for both time steps let us tabularize the results of all the gradients we have got so far

EqnGradientsValues
2.1.1dat = at – yt + 00.441
2.2.1dCt = dat * Ґo * (1 – tanh2(Ct ) + 0-0.069
2.2.2d Ґ0 = dat * tanh(Ct )* Ґo *(1 – Ґo)0.047
2.3.1d Ґu = dCt * C~ * Ґu *(1 – Ґu)0.0080
2.3.2dC~ = dCt * Ґu * (1 – (C~)2)-0.0314
2.3.3d Ґf = dCt * Ct-1 * Ґf *(1 – Ґf)0.0055
2.3.4dCt-1 = dCt * Ґf-0.0414
2.4.1dat-1 = Wo * dҐo + Wc * dC~ + Wu * dҐu + Wf * dҐf-0.00213
1.1.1dat-1 = at-1 – yt-1 + eq (2.4.1)-0.88513
1.2.1dCt-1 = dat-1 * Ґo * (1 – tanh2(Ct-1 ) + eq 2.3.4-0.25
1.2.2d Ґ0 = dat-1 * tanh(Ct-1 )* Ґo *(1 – Ґo)0.054
1.3.1d Ґu = dCt-1* C~ * Ґu *(1 – Ґu)0.013
1.3.2dC~ = dCt-1* Ґu * (1 – (C~)2)-0.18
1.3.3d Ґf = dCt-1 * C0 * Ґf *(1 – Ґf)0
1.3.4dC0 = dCt-1 * Ґf-0.11
1.4.1da0 = Wo * dҐo + Wc * dC~ + Wu * dҐu + Wf * dҐf0.022

Gradients with respect to weights

The next important derivative which we have to derive is with respect to the weights. We have to remember that the weights of an LSTM is shared across all the time steps. So the derivative of the weight will be the sum total of the derivatives from each individual time step. Let us first define the equation for the derivative of one of the weights, Wu.

The relevant equation for this is the following

The first three terms of the gradient is equal to u which was already derived through equations 2.3.1 and 1.3.1 in the table above. Also remember u is the gradient with respect to the terms inside the sigmoid function ( i.e Wu *[xt , at-1] + bu) . Therefore the derivative of the last term, ∂Γu/∂Wu would be

∂Γu/∂Wu = [xt , at-1]

The complete equation for the gradient with respect to the weight Wu would be

∂E/∂Wu = u * [xt , at-1]

The important thing to note in the above equation is the dimensions of each of the terms. The first term, u, is a scalar of dimension (1,1) , however the second term is a vector of dimension (1 ,3) . The resultant gradient would be another vector of dimension (1,3) as the scalar value will be multiplied with all the terms of the vector. We will come to that shortly. However for now let us find the gradients of all other weights.The derivation for other weights are similar to the one we saw. The equations for the gradients with respect to all weights for time step 1 are as follows

DerivativeEquation
dWf= dҐf * [xt , at-1]
dWo= dҐo * [xt , at-1]
dWc= dC~ * [ xt , at-1]
dWu= dҐu * [xt , at-1]

The total weight derivative would be sum of weight derivatives of all the time steps.

dW = dW1 + dW2

As discussed above to find the total gradient it would be convenient and more efficient to stack all these equations in a matrix form and then multiply it with the input terms and then adding them across the different time steps. This operation can be represented as below

Let us substitue the numberical values and calculate the gradients for the weights

This image has an empty alt attribute; its file name is image-19.png

The matrix multiplication will have the following values

This image has an empty alt attribute; its file name is image-25.png

The final gradients for all the weights are the following

This image has an empty alt attribute; its file name is image-24.png

Gradients with respect to biases

The next task is to get the gradients of the bias . The derivation of the gradients for bias is similar to that of the weights. The equation for the bias terms would be as follows

Similar to what we have done for the weights, the first three terms of the gradient is equal to u . The derivative of the fourth term which are the terms inside the sigmoid function ( i.e Wu *[xt , at-1] + bu) will be

∂Γu/∂bu = 1

The complete equation for the gradient with respect to the weight Wu would be

∂E/∂bu = u * 1

The final gradient of the bias term would be the sum of the gradients of the first time step and the second. As we have seen in case of the weights, the matrix form would be as follows

This image has an empty alt attribute; its file name is image-27.png

The final gradients for the bias terms are

This image has an empty alt attribute; its file name is image-28.png

Weights and bias updates

Calculating the gradients using back propagation is not an end by itself. After the gradients are calculated, they are used to update the initial weights and biases .

The equation is as follows

Wnew = Wold - α * Gradients

Here α is a constant which is the learning rate. Let us assume it to be 0.01

The new weights would be as follows

This image has an empty alt attribute; its file name is image.png

Similarly the updated bias would be

This image has an empty alt attribute; its file name is image-3.png

As you would have learned, these updated weights and bias terms would take the place of the initial weights and biases in the next forward pass and then the back progagation again kicks in to calculate the new set of gradients which will be applied on the updated weights and gradients to get the new set of parameters. This process will continue till the pre-defined epochs.

Error terms when softmax is used

The toy example which we saw just now has a squared error term, which was used for backpropagation. The adoption of such an example was only to demonstrate the concepts with a simple numerical example. However in the problem which we are dealing with or for that matter many of the problems which we will deal with will have a softmax layer as its final layer and thereafter cross entropy as its error function. How would the backpropogation derivation differ when we have a different error term than what we have just done in the toy example ? The major change would be in terms of how the error term is generated and how the error is propogated till the output term at. After this step the flow will be the same as what we we have seen earlier.

Let us quickly see an example for a softmax layer and a cross entropy error term. To demonstrate this we will have to revisit the forward pass from the point we generate the output from each layer.

The above figure is a representation of the equations for the forward pass and backward pass of an LSTM. Let us look at each of those steps

Dense Layer

We know that the output layer at from a LSTM will be a vector with dimension equal to number of units of the LSTM. However for the application which we are trying to build the output we require is the most probable word in the target vocabulary. For example if we are translating from German to English, given a German sentence, for each time step we need to predict a corresponding English word. The prediction from the final layer would be in the form of a probability distribution over all the words in the English vocabulary we have in our corpus.

This image has an empty alt attribute; its file name is image-14.png

Let us look at the above representation to understand this better. We have an input German sentence 'Wie geht es dir' which translates to 'How are you'. For simiplicity let us assume that there are only 3 words in the target vocabulary ( English vocabulary). Now the predictions will have to be a probability distribution over the three words over the target vocabulary and the index which has the highest probability will be the prediction. In the prediction we see that the first time step has the maximum probability ( 0.6) on the second index which corresponds to the word ‘How’ in the vocabulary. The second and third time steps have maximum probability on the first and third indexes respectively giving us the predicted string as ‘How are you’. ( Please note that in the figure above the index of the probability is from bottom to top, which means the bottom most box corresponds to index 1 and top most to index 3)

Coming back to our equation, the output layer at is only a vector and not a probability distribution. To get a probability distribution we need to have a dense layer and a final softmax layer. Let us understand the dynamics of how the conversion from the output layer to the probability distribution happens.

The first stage is the dense layer where the output layer vector is converted to a vector with the same dimension as of the vocabulary. This is achieved by the multiplication of the output vector with the weights of the dense layer. The weight matrix will have the dimension [ length of vocabulary , num of units in output layer]. So if there are 3 words in the vocabulary and one unit in the output layer then weight matrix will be of dimension [ 3, 1], ie it has 3 rows and one column. Another way of seeing this dimension is each row of the weight matrix corresponds to each word in the vocabulary. The dense layer would be derived by the dot product of weight matrix with the output layer

Z = Wy * at

The dimensions of the resultant vector will be as follows

[3,1] * [1,1] => [3,1]

The resultant vector Z after the dense layer operation will have 3 rows and 1 column as shown below.

This image has an empty alt attribute; its file name is image-19.png

This vector is still not a probability distribution. To convert it to a probability distribution we take the softmax of this dense layer and the resultant vector will be a probability distribution with the same dimension.

This image has an empty alt attribute; its file name is image-20.png

The resultant probability distribution will be called as Y^ which will have three components ( equal to the dimension of the vocabulary), each component will be the probability of the corresponding word in the vocabulary

This image has an empty alt attribute; its file name is image-21.png

Having seen the forward pass let us look at how the back propogation works. Let us start with the error term which in this case will be cross entropy loss as this is a classification problem. The cross entropy loss will have the form

This image has an empty alt attribute; its file name is image-29.png

In this equation the term y is the true label in one hot encoded form. So if the first index ( y1) is the true label for this example the label vector in one hot encoded format will be

This image has an empty alt attribute; its file name is image-34.png

Now let us get into the motions of backpropogation. We need to back propogate till at after which the backpropogation equation will be the same as what we derived in the toy example. The complete back propogation equation till at according to the chain rule will be

Let us look at the derivations term by term

Back propogation derivation for first term

The first equation ∂E/∂Y^ is a differentiation with respect to a vector as Y^ ,having three components, Y1 , Y2 and Y3 . A differentiation with a vector is called a Jacobian which will again be a vector of the same dimension as Y.

This image has an empty alt attribute; its file name is image-27.png

Let us look at deriving each of these terms within the vector

This image has an empty alt attribute; its file name is image-30.png

Please note ∂/∂y(Logy) = 1/y

Similarly we get

This image has an empty alt attribute; its file name is image-31.png

So the Jacobian of ∂E/∂Y^ will be

This image has an empty alt attribute; its file name is image-32.png

Suppose the first label (y1) is the true label. Then the one hot encoded form which is [ 1 0 0 ] will make the above Jacobian

This image has an empty alt attribute; its file name is image-36.png

Back propogation derivation for second term

Let us do the derivation for the second term ∂Y^/∂Z . This term is a little more interesting as both the Y and Z are vectors. The differentiation will result a Jacobian matrix of the form

This image has an empty alt attribute; its file name is image-38.png

Let us look at the first row and get the derivatives first. To recap let us look at the equations involved in the derivatives

This image has an empty alt attribute; its file name is image-40.png

Let us take the derivative of the first term ∂Y1^/∂Z1 . This term will be the derivative of the first element in the matrix

This image has an empty alt attribute; its file name is image-41.png

Taking the derivation based on the division rule of differentiation we get

This image has an empty alt attribute; its file name is image-44.png

Please note ∂/∂y(ey) = ey

Taking the common terms in the numerator and denominator and re-arranging the equation we get

This image has an empty alt attribute; its file name is image-45.png

Dividing through inside the bracket we get

This image has an empty alt attribute; its file name is image-47.png

Which can be simplified as

This image has an empty alt attribute; its file name is image-48.png

Since

This image has an empty alt attribute; its file name is image-49.png

Let us take the derivative of the second term ∂Y1^/∂Z2 . This term will be the derivative of the first element in the matrix with respect to Z2

This image has an empty alt attribute; its file name is image-50.png

With these two derivations we can get all the values of the Jacobian. The final form of the Jacobian would be as follows

This image has an empty alt attribute; its file name is image-57.png

Well that was a long derivation. Now to get on to the third term.

Back propogation derivation for third term

Let us do the derivation for the last term ∂Z/ ∂at . We know from the dense layer we have

Z = Wy * at

So in vector form this will be

This image has an empty alt attribute; its file name is image-53.png

So when we take the derivation we get another Jacobian vector of the form

This image has an empty alt attribute; its file name is image-54.png

So thats all in this derivation. Now let us tie everything together to get the derivative with respect to the output term using the chain rule

Gradient with respect to the output term

We earlier saw that the equation of the gradient as

Let us substitute with the derivations which we already found out

This image has an empty alt attribute; its file name is image-58.png

The dot product of the first two terms will get you

This image has an empty alt attribute; its file name is image-59.png

The dot product of the above term with the last vector will give you the result you want

This image has an empty alt attribute; its file name is image-61.png

This is the derivation till ∂E/∂at . The rest of the derivation down from here to various components inside the LSTM layer will be the same as we have seen earlier in the toy example.

In terms of dimensions let us convince ourselves that we get the dimension equal to the dimension of ∂at

[1 , 3 ] * [3, 3] * [3, 1] ==> [1,1]

Wrapping up

That takes us to the end of the “Looking inside the hood” sessions for our model. In the two sessions we saw the forward propagation part of the LSTM cell and also derived the backward propagation part of the LSTM using toy examples. These examples are aimed at giving you an intuitive sense of what is going on inside the cells. Having seen the mathematical details, let us now get into real action. In the next post we will build our prototype using python on a Jupyter notebook. We will be implementing the encoder decoder architecture using LSTM. Having equipped with the nuances of the encoder decoder architecture and also the inner working of the LSTM you would be in a better position to appreciate the models which we will be using to build our Machine translation application.

Go to article 5 of this series : Building the prototype using Jupyter notebook

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

III : Build and Deploy Data Science Products : Looking under the hood of Machine translation model – LSTM Forward Propagation

Source : How stuff works

“Look deep into nature and you will understand everything better”

Albert Einsteen

This is the third part of our series on building a machine translation application. In the last two posts we understood the solution landscape for machine translation and also explored different architecture choices for sequence to sequence models. In this post we take a deep dive into the dynamics of the model we use for machine translation, LSTM model. This series consists of 8 posts.

  1. Understand the landscape of solutions available for machine translation
  2. Explore sequence to sequence model architecture for machine translation.
  3. Deep dive into the LSTM model with worked out numerical example.( This post)
  4. Understand the back propagation algorithm for a LSTM model worked out with a numerical example.
  5. Build a prototype of the machine translation model using a Google colab / Jupyter notebook.
  6. Build the production grade code for the training module using Python scripts.
  7. Building the Machine Translation application -From Prototype to Production : Inference process
  8. Build the machine translation application using Flask and understand the process to deploy the application on Heroku

Dissecting the LSTM network

I was recently reading the book ” The Agony and the Ecstacy’ written by Irving Stone. This book was about the Reniassence genius, master sculptor and artist Michelangelo. When sculptuing human forms, in his quest for perfection , Miehelangelo used to spent months dissecting dead bodies to understand the anotomy of human beings. His thought process was that unless he understood in detail how each fibre of human muscle work, it would be difficult to bring his work to life. I think his experience in dissecting and understanding the anatomy of the human body has had a profound impact on his masterpieces like Moses, Pieta,David and his paintings in the Sistine Chapel.

Michaelangelo’s Moses,Pieta, David & Sistine chapel frescos

I too believe in that philosophy of getting a handle on the inner working of algorithms to really appreciate how they can be used for getting the right business outcomes. In this post we will understand the LSTM network in depth and explore its therotical underpinnings. We will see a worked out example of the forward pass for a LSTM network.

Forward pass of the LSTM

Let us learn the dynamics of the forward pass of LSTM with a simple network. Our network has two time steps as represented in the below figure. The first time step is represented as 't-1' and the subsequent one as time step 't'

Let us try to understand each of the terms in the above network. A LSTM unit receives as its input the following

  1. c<t-2> : The cell state of the previous time step
  2. a<t-2> : The output from the previous time step
  3. x<t-1> : The input of the present time step

The cell state is the unit which is responsible for trasmitting the context accross different time steps. At each time step certain add and forget operations happens to the context transmitted from the previous time steps. These Operations are controlled through multiple gates. Let us understand each of the gates.

Forget Gate

The forget gate determines what part of the input have to be introduced into cell state and what needs to be forgotten. The forget gate operation can be represented as follows

Ґf = sigmoid(Wf*[ xt ] + Uf * [ at-1 ] + bf)

There are two weight parameters ( Wf and Uf ) which transforms the input ( xt ) and the output from the previous time step ( at-1) . This equation can be simplified by concatenating both the weight parameters and the corresponding xt & at vectors to a form given below.

Ґf = sigmoid(Wf *[xt , at-1] + bf)

Ґf is the forget gate

Wf is the new weight matrix got by concatenating [ Wf , Uf]

[xt , at-1]is the concatenation of the current time step input and the previous time step output from the

bf is the bias term.

The purpose of the sigmoid function is to quash the values within the bracket to act as a gate with values between 0 & 1 . These gates are used to control the flow of information. A value of 0 means no information can flow and 1 means all information needs to pass through. We will see more of those steps in a short while.

Update Gate

Update gate equation is similar to that of the forget gate . The only difference is the use of a different weight for this operation.

Ґu = sigmoid(Wu *[xt , at-1] + bu)

Wu is the weight matrix

Bu is the bias term for the update gate operation

All other operations and terms are similar to that in the forget gate

Input activation

In this operation the input layer is activated using a tanh non linear activation.

C~ = tanh(Wc *[x , a] + bc)

C~ is the input activation

Wc is the weight matrix

bc is the bias term which is added.

operation converts the terms within the bracket to values between -1 & 1 . Let us take a pause and analyse why a sigmoid is used for the gate operations and tanh used for the input activation layers.

The property of sigmoid is to give an output between 0 and 1. So in effect after the sigmoid gate, we either add to the available information or do not add any thing at all. However for the input activation we also might need to forget some items. Forgetting is done by having negative values as output. tanh layer ranges from -1 to 1 which you can see have negative values. This will ensure that we will be able to forget some elments and remember others when using the tanh operation.

Internal Cell State

Now that we have seen some of the building block operations, let us see how all of them come together. The first operation where all these individual terms come together is to define the internal cell state.

We already know that the forget and update gates which have values ranging between 0 to 1, act as controllers of information. The forget gate is applied on the previous time step cell state and then decides which of the information within the previous cell state has to be retained and what has to be eliminated.

Ґf * C<t-1>

The update gate is applied on the input activation information and determines which of these information needs to be retained and what needs to be eliminated .

Ґu * C~

These two informations block i.e the balance of the previous cell state and the selected information of the input activation are combined together to form the current cell state. This is represented in the equation as below.

C<t> = Ґu * C~ + Ґf * C<t-1>

Output Gate

Now that the cell state is defined it is time to work on the output from the current cell. As always, before we define the output candidates we first define the decision gate. The operations in the output gate is similar to the forget gate and the update gate .

Ґo = sigmoid(Wo *[x , a] + bo)

Wo is the weight matrix

Bo is the bias term for the update gate operation

Output

The final operation within the LSTM cell is to define the output layer. The output candidates are determined by carrying out a tanh() operation on the internal cell state. The output decision gate is then applied on this candidate to derive the output from the network. The equation for the output is as follows

a<t> = tanh(C<t>) * Ґo

In this operation using the tanh operation on the cell state we arrive at some candidates to be forgotten ( -ve values) and some to be remembered or added to the context. The decision on which of these have to be there in the output is decided by the final gate, output gate.

This sums up the mathematical operations within LSTM. Let us see these operations in action using a numerical example.

Dynamics of the Forward Pass

Now that we have seen the individual components of a LSTM let us understand the real dynamics using a toy numerical examples.

The basic building block of LSTM like any neural network is its hidden layer, which comprises of a set of neurons. The number of neurons within its hidden unit is a hyperparameter when initializing a LSTM. The dimensions of all the other components of a LSTM depends on the dimension of the hidden unit. Let us now define the dimensions of all the components of the LSTM.

ComponentDescriptionDimension of the component
LSTM hidden unitSize of the LSTM unit ( No of nuerons of the hidden unit)(n_a)
mNumber of examples(m)
n_xSize of inputs(n_x)
C<t-1>Dimension of previous cell state(n_a , m)
a<t-1>Dimensions of previous output(n_a , m)
x<t>Current state input(n_x , m)
[ x<t> , a<t-1> ]Concatenation of output of previous time step and current time step input(n_x + n_a, m)
Wf, Wu, Wc, WoWeights for all the gates(n_a , n_x + n_a)
bf bu bc b0Bias term for all operations(n_a ,1)
WyWeight for the output(n_y , n_a)
byBias term for the output(n_y ,1)

Let us now look at how the dimensions of the different outputs evolve after different operations within the LSTM .

Please note that when we do matrix multiplications with two matrices of size ( a,b) * (b,c) we get an output of size (a,c)
ComponentOperationDimensions
Ґf : Forget gatesigmoid(Wf * [x , a] + bf)(n_a, n_x + n_a) * (n_x + n_a ,m) + (n_a,1) = > (n_a , m).
Sigmoid is applied element wise and therefore dimension doesn’t change.
* : denotes matrix multiplication
Ґu: Update gatesigmoid(Wu *[x , a] + bu)(n_a, n_x+n_a ) * (n_x+n_a,m) + (n_a,1) = > (n_a , m)
C~: Input activationtanh(Wc *[x , a] + bc)(n_a, n_x + n_a) * (n_x + n_a , m) + (n_a, 1) = > (n_a, m).
Ґo : Output gate(Wo *[x , a] + bo)(n_a, n_x+n_a ) * (n_x + n_a ,m) + (n_a,1) = > (n_a,m)
C<t> : Current stateҐu x C~ + Ґf x C<t-1>(n_a, m) x (n_a, m) + (n_a, m) x (n_a, m) = > (n_a, m)
x: denotes element wise multiplication
a<t> : Output at current time steptanh(C<t>) x Ґo(n_a, m) x (n_a, m) => (n_a, m).

Let us do a toy example with a two time step network with random inputs and observe the dynamics of LSTM.

The network is as defined below with the following inputs for each time steps. We also define the actual outputs for each time step. As you might be aware the actual output will not be relevant during the forward pass, however it will be relevant during the back propogation phase.

Toy example with LSTM

Our toy example will have two time steps with its inputs (Xt) having two features as shown in the figure above. For time step 1 the input is Xt-1 = [0.4,0.3] and for time step 2 the input is Xt = [0.2,0.6]. As there are two features, the size of the input unit is n_x = 2. Let us tabulate these values

VariableDescriptionValuesDimension
X t-1Input for the first time step[0.4, 0.3](n_x , m)
= > (2 ,1)
XtInput for the second time step[0.2, 0.6](n_x , m)
= > (2 ,1)

For simplicity the hidden layer of the LSTM has only one unit which means that n_a = 1. For the first time step we can assume initial values for the cell state Ct-2 and output from previous layers at-2 as ‘0’.

VariableDescriptionValuesDimension
Ct-2Initial cell state[0](n_a , m) = > (1 ,1)
at-2Initial output from previous cell[0](n_a , m) = > (1 ,1)

Next we have to define the values for the weights and biases for all the gates. Let us randomly initialize values for the weights. As far as the weights are concerned, what needs to be carefully defined are the dimensions of the weights. In the earlier table where we defined the dimensions of all the components we defined the dimension of the weights as (n_a , n_x + n_a). But why do the weights be with these dimensions ? Let us dig deeper.

From our earlier discussions we know that the weights are used to get the sigmoid gates which are multiplied element wise on the cell states. For example

Ct = Ґu * C~ + Ґf * Ct-1

or

at = tanh(Ct) * Ґo.

From these equations we see that the gates are multiplied element wise to the cell states. To do an element wise multiplication, the gates have to be of the same dimensions as the cell state, i.e. (n_a, m). However, to derive the gates, we need to do a dot product of the initialised weights with the concatenation of previous cell state and the input vector [n_x+n_a]. Therefore to get an output dimension of (n_a, m) we need to have the weights with dimensions of (n_a , n_x + n_a) so that the equation of the gate ,Ґf = sigmoid(Wf *[x , a] + bf), generates an output of dimension of (n_a ,m ). In terms of matrix multiplication dynamics this equation can be represented as below

Having seen how the dimensions are derived, let us tabulate the values of weights and its biases .Please note that the values for all the weight matrices and its biases are randomly initialized.

WeightDescriptionValuesDimension
Wf,Forget gate Weight[-2.3 , 0.6 , -0.13 ]
[n_a , n_x + n_a] => (1,3)
bfForget gate bias[0.51][n_a] => 1
WuUpdate gate weight[1.51 ,-0.61 , 1.31][n_a , n_x + n_a] => (1,3)
buUpdate gate bias[1.30][n_a] => 1
Wc,Input activation weight[0.82,-0.57,-0.13][n_a , n_x + n_a] => (1,3)
bcInternal state bias[-0.57][n_a] => 1
WoOutput gate weight[-0.75 ,-0.95 , -0.34][n_a , n_x + n_a] => (1,3)
b0Output gate bias[-0.46][n_a] => 1

Having defined the initial values and the dimensions let us now traverse through each of the time steps and unravel the numerical example for forward propagation.

Time Step 1 :

Inputs : X t-1 = [0.4, 0.3]

Initial values of the previous state

at-2= [0] ,

Ct-2 = [0]

Forget gate => Ґf = sigmoid(Wf *[x , a] + bf) =>

= sigmoid( [-2.3 , 0.6 , -0.13 ] * [0.4, 0.3, 0] + [0.51] )

= sigmoid(((-2.3 * 0.4) + (0.6 * 0.3) + (-0.13 * 0 )) + 0.51)

= sigmoid(-0.23) = 0.443

Please note  sigmoid (-0.23) = 1/(1 + e(-(-0.23))

Update gate => Ґu = sigmoid(Wu *[x , a] + bu) =>

= sigmoid( [1.51 ,-0.61 , 1.31] * [0.4, 0.3, 0] + [1.30] )

= sigmoid((1.51 * 0.4) + (-0.61 * 0.3) + (1.31 * 0 ) + 1.30)

= sigmoid(1.721) = 0.848

Input activation => C~ = tanh(Wc *[x , a] + bc)

= tanh( [0.82,-0.57,-0.13] * [0.4, 0.3, 0] + [-0.57] )

= tanh (((0.82 * 0.4) + (-0.57 * 0.3) + (-0.13 * 0 )) + -0.57)

= tanh(-0.413) = -0.39

Please note tanh = ex – e-x / ( ex + e-x) where x = -0.413
= e-0.413 – e-(-0.413) / ( e-0.413 + e-(-0.413)) = -0.39

Output Gate => Ґo = sigmoid(Wo *[x , a] + bo)

= sigmoid( [-0.75 ,-0.95 , -0.34] * [0.4, 0.3, 0] + [-0.46] )

= sigmoid(((-0.75 * 0.4) + (-0.95 * 0.3) + (-0.34 * 0 )) + -0.46)

= sigmoid(-1.045)= 0.26

We now have all the components required to calculate the internal state and the outputs

Internal state => Ct-1 = Ґu * C~ + Ґf * Ct-2

= 0.848 * -0.39 + 0.443 * 0

= -0.33

Output => at-1 = tanh(Ct-1) * Ґo

= tanh(-0.33) * 0.26 = -0.083

Let us now represent all the numerical values for the first time step on the network.

With the calculated values of time step 1 let us proceed to calculating the values of time step 2

Time Step 2:

Inputs : Xt = [0.2, 0.6]

Values of the previous state output and cell states

at-1 = [-0.083]

Ct-1 = [-0.33]

Forget gate => Ґf = sigmoid(Wf *[xt , at-1] + bf) =>

= sigmoid( [-2.3 , 0.6 , -0.13 ] * [0.2, 0.6, -0.083] + [0.51] )

= sigmoid(((-2.3 * 0.2) + (0.6 * 0.6) + (-0.13 * -0.083 )) + 0.51)

= sigmoid(0.421) = 0.60

Update gate => Ґu = sigmoid(Wu *[xt , at-1] + bu) =>

= sigmoid( [1.51 ,-0.61 , 1.31] * [0.2, 0.6, -0.083] + [1.30] )

= sigmoid(((1.51 * 0.2) + (-0.61 * 0.6) + (1.31 * -0.083 )) + 1.30)

= sigmoid(1.13) = 0.755

Input activation => C~ = tanh(Wc *[xt , at-1] + bc)

= tanh( [0.82,-0.57,-0.13] * [0.2, 0.6, -0.083] + [-0.57] )

= tanh(((0.82 * 0.2) + (-0.57 * 0.6) + (-0.13 * -0.083 )) + -0.57)

= tanh(-0.737) = -0.63

Output Gate => Ґo = sigmoid(Wo *[x , a] + bo)

= sigmoid( [[-0.75 ,-0.95 , -0.34] * [0.2, 0.6, -0.083] + [-0.46] )

= sigmoid(((-0.75 * 0.2) + (-0.95 * 0.6) + (-0.34 * -0.083 )) + -0.46)

= sigmoid(-1.15178)= 0.24

Internal state => Ct = Ґu * C~ + Ґf * Ct-1

= 0.755 * -0.63 + 0.60 * -0.33

= -0.674

Output => at = tanh(Ct) * Ґo

= tanh(-0.674) * 0.24 = -0.1410252

Let us now represent the second time step within the LSTM unit

Second Time step

Let us also look at both the time steps together with all its numerical values

This sums a single forward pass for the LSTM. Once the forward pass is calculated the next step is to determine the error term and the backpropagating the error to determine the adjusted weights and bias terms. We will see those steps in the back propagation steps, which will be covered in the next post.

Go to article 4 of this series : Back propagation of the LSTM unit

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

Deep Learning Workshop

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Data Science for Predictive Maintenance

Over the past few months, many people have been asking me to write on what it entails to do a data science project end to end i.e from the business problem defining phase to modelling and its final deployment. When I pondered on that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods with code on how to deal with a problem. However an end to end view of what it takes to do a data science project for a specific business use case is little hard to find. In this post I would be giving an end to end perspective on tackling a business use case within the framework of Data Science. We will deal with a predictive maintenance business use case. The use case involved is to predict the end life of large industrial batteries.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture on the life cycle of a data science project

Data Science Process

The above figure is a depiction of the big picture on what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is with a business problem. The problem we have at hand is to try to predict the end life of large industrial batteries. When we are encountered with such a business problem, the first thing which should come to our mind is on the key variables which will come into play . For this specific example of batteries some of the key variables which determine the state of health of batteries are conductance, discharge , voltage, current and temperature.

The next questions which we need to ask is on the lead indicators or trends within these variables, which will help in solving the business problem. This is where we also have to take inputs from the domain team. For the case of batteries, it turns out that a key trend which can indicate propensity for failure  is drop in conductance values. The conductance of batteries will drop over time, however the rate at which the conductance values drop will be accelerated before points of failure. This is a vital clue which we will have to be cognizant about when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge the voltage will initially drop to a minimum level and then it will regain the voltage. This is called the “Coup de Fouet” effect. Every manufacturer of batteries will prescribes standards and control charts as to how much, voltage can drop and how the regaining process should be. Any deviation from these standards and control charts would mean anomalous behaviors. This is another set of indicator which will have to look out for when we explore data.

In addition to the above two indicators there are many other factors which one would have to be aware of which will indicate failure. During the business exploration phase we have to identify all such factors which are related to the business problem which we are to solve and formulate hypothesis about them. Once we formulate our hypothesis we have to look out for evidences / trends within the data about these hypothesis. With respect to the two variables which we have discussed above some hypothesis we can formulate are the following.

  1. Gradual drop in conductance over time entails normal behaviour and sudden drop would mean anomalous behaviour
  2. Deviation from manufactured prescribed “Coup de Fouet” effect would indicate anomalous behaviour

When we go about in exploring data, hypothesis like the above will be point of reference in terms of trends which we will have to look out on the variables involved. The more hypothesis we formulate based on domain expertise the better it would be at the exploratory stage. Now that we have seen what it entails within the business discovery phase, let us encapsulate our discussions on key considerations within the business discovery phase

  1. Understand the business problem which we are set out to solve
  2. Identify all key variables related to the business problem
  3. Identify the lead indicators within these variable which will help in solving the business problem.
  4. Formulate hypothesis about the lead indicators

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective now its time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to try to understand some critical aspects about how data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following

  • Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase ?
  • What is the mechanism of data capture ? Does the data capture mechanism differ according to the variables ?
  • What is the frequency of data capture ? Does it vary across the variables ?
  • Does the volume of data captured, vary according to the frequency and variables involved ?

In the case of the battery prediction problem, there are three different data sets . These data sets pertained to different set of variables. The frequency of data collection and the volume of data captured also varies. Some of the key data sets involved are the following

  • Conductance data set : Data Pertaining to the conductance of the batteries. This is collected every 2-3 days . Some of the key data points collected along with the conductance data include
    • Time stamp when the conductance data was taken
    • Unique identifier for each battery
    • Other related information like manufacturer , installation location, model , string it was connected to etc
  • Terminal voltage data : Data pertaining to Voltage and temperature of battery. This is collected every day. Key data points include
    • Voltage of the battery
    • Temperature
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc
  • Discharge Data : Discharge data is collected once every 3 months. Key variable include
    • Discharge voltage
    • Current at which voltage discharges
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc
Data sets for battery end life prediction

As seen, we have to play around with three very distinct data sets with different sets of variables, different frequency of time when the data points arrive and different volume of data for each of the variables involved. One of the key challenges, one would encounter is in connecting all these variables together into a coherent data set, which will help in the predictive task. It would be easier to get this done if we can formulate the predictive problem by connecting the data sets available to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem we have at hand and then connect it with the data points which we have at hand.  The predictive problem requires us to predict two things

  1. Which battery will fail &
  2.  Which period of time in future will the battery fail.

Since the prediction is at a battery level, our unit of reference for formulating the predictive problem is individual battery. This means that all the variables which are present across the multiple data sets have to be consolidated at the individual battery level.

The next question is, at what period of time do we have to consolidate the variables for each battery ? To answer this question, we will have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are capture at different intervals. In addition the volume of data collected for each of those variables at those instances of time also vary substantially.

  • Conductance : One reading of a battery captured once every 3 days.
  • Voltage & Temperature : 4-5 readings per battery captured every day.
  • Discharge : A set of reading captured every second at different intervals of a day once every 3 months (approximately 4500 – 5000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in future, we will have to have our model learn the behavior of these variables across time periods. However we have to select a time period, where we will have sufficient data points for each of the variables. The ideal time period we should choose in this scenario is every 3 months as discharge data is available only once every 3 months. This would mean that all the data points for each battery for each variable would have to be consolidated to a single record for every 3 months. So if each battery has around 3 years of data it would entail 12 records for a battery.

Another aspect we have to look at is how 3 months of data points for a battery can be consolidated to make one record corresponding to each variable. For this we have to resort to some suitable form of consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering . We will deal with those aspects in detail when we talk about exploratory analysis and feature engineering phases.

The next important point which we have to deal with would be the labeling of the response variable. Since the business problem is to predict which battery would fail, the response variable would be classifying whether a record of a battery falls under a failure class or not. However there is a lacunae in this approach. What we want is to predict well ahead of time when a battery is likely to fail and therefore we will have to factor in the “when” part also into the classification task. This would entail, looking at samples of batteries which has actually failed and identifying the point of time when failure happened. We label that point as “failure point” and then look back in time from the failure point to classify periods leading to failure. Since the consolidation period for data points is three months, we can fix the “looking back” period also to be 3 months. This would mean, for those samples of batteries where we know the failure point, we look at the record which is one time period( 3 months) before failure and label the data as 1 period before failure, record of data which corresponds to 6 month before failure will be labelled as 2 periods before failure and so on. We can continue labeling the data according to periods before failure, till we reach a comfortable point in time ahead of failure ( say 1 year). If the comfortable period we have in mind is 1 year, we would have 4 failure classes i.e 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records before the 1 year period of time can be labelled as “Normal Periods”. This labeling strategy will mean that our predictive problem is a multinomial classification problem, with 5 classes ( 4 failure period classes and 1 normal period class).

The above discussed, labeling strategy is for samples of batteries within our data set which have actually failed and where we know when the failure has happened. However if we do not have information about the list of batteries which have failed and which have not failed, we have to resort to intense exploratory analysis to first determine samples of batteries which have failed and then label them according to the labeling strategy discussed above. We can discuss about how we can use exploratory analysis to identify batteries which have failed, in the next post. Needless to say, the records of all batteries which have not failed, will be labelled as “Normal Periods”.

Now that we have seen the predictive problem formulation part, let us recap our discussions so far. The predictive problem formulation step involves the following

  1. Understand the business problem and formulate the response variables.
  2. Identify the unit of reference to which the business problem will apply ( each battery in our case)
  3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables are generated
  4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
  5. From the data set, identify those units which have failed and which have not failed. Such information will generally be available from past maintenance contracts for each units.
  6. Adopt a labeling strategy for both the failed units and normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records as failed classes till a convenient period( 1 year in this case). All records before that period will be labelled the same as the units which have not failed ( “Normal Periods”)

So far we discussed first three phases of the data science process namely business discovery, data discovery and data preparation.The next phase which we will discuss about one of the critical steps of the process namely exploratory. It is in this phase where we leverage the domain knowledge and observe our hypothesis in the data.

Exploratory Analysis – Unravelling latent trends

This phase entails digging deep to get a feel of the data and extract intuitions for feature engineering. When embarking upon exploratory analysis, it would be a good idea to get inputs from domain team on the relation between variables and the business problem. Such inputs are often the starting point for this phase.

Let us now get to the context of our preventive maintenance problem and evolve a philosophy for exploratory analysis.In the case of industrial batteries, a key variable which affects the state of health of a battery is its conductance. It turns out that an indicator of failing health of  battery is the precipitous drop in conductance. Armed with this information our next task should be to  identify, from our available data set,batteries that have higher probability to fail. Since precipitous fall in conductance is an indicator of failing health,the conductance data of  unhealthy batteries will have more variance than the normal ones. So the best way to identify failing batteries from the normal ones would be to apply some consolidating metric like standard deviation or variance on the conductance data and further drill deep on samples which stand apart from the normal population.


Separating potential failure cases

The above is a plot depicting standard deviation of conductance for all batteries. Now what might be of interest to us is the red zone which we can call the “Potential failure Zone”. The potential failure zone consists of those batteries whose conductance values show high standard deviation. Batteries with failing health are likely to exhibit large fall in conductance and as a corollary their values will also show higher standard deviation. This implies that the samples of batteries which have higher probability of failure will in all likelihood be from this failure zone. However to ascertain this hypothesis we will have to dig deep into batteries in the failure zone and look for patterns which might differentiate them from normal batteries. Another objective to dig deep is also to elicit clues from the underlying patterns on what features to include in the predictive model. We will discuss more on the feature extraction when we discuss about feature engineering. Now let us come back to our discussion on digging deep into the failure zone and ferreting out significant patterns. It has to be noted that in addition to the samples in the failure zone we will also have to observe patterns from the normal zone to help separate wheat from the chaff . Intuitions derived by observing different patterns would become vital during feature engineering stage.

Identifying failure zones by comparison

The above figure is a comparison of patterns from either zones. The figure on the left is from the failure zone and the one on the right is from the other. We can clearly see how the precipitous fall is manifested in the sample from the failure zone. The other aspect to note is also the magnitude of the fall. Every battery will have degrading conductance over time. However the magnitude of  degradation is what differentiates the unhealthy  battery from a normal one. We can observe from the plot on the left that the fall in conductance is more than 50%, however for the battery to the right the drop is more muted.  Another aspect we can observe is the slope of conductance. As evident from the two plots, the slope of  conductance profile for the battery on the left is much more steeper over time than the one on the right. These intuitions which we have derived so far might become critical from the overall scheme of feature engineering and modelling. Similar to the intuitions which we have disinterred so far, more could be extracted by observing more samples. The philosophy behind exploratory analysis entails visualizing more and more samples, observing patterns and extracting clues for feature engineering. The more time we spend on doing this more ammunition we get for feature engineering.

Let us now try to encapsulate the philosophy of exploratory analysis in few steps

  1. Take inputs from domain team related to the problem we are trying to solve. In our case the clue which we got was the relation between conductance and health of batteries.
  2. Identify any consolidating metric for the variable under consideration to separate out anomalous samples. In the example above we used standard deviation of conductance values to find anomalies.
  3. Once the samples are demarcated using the consolidation metric, visualize samples from different sets to identify discernible patterns in data.
  4. From the patterns we observe root out clues for feature engineering. In our example we identified that % fall in conductance and slope of conductance over time could be potential features.

Multivariate Exploration

So far we were limited to analysis of a single variable i.e conductance. However to get more meaningful insights we have to connect other variables layer by layer to the initial variable which we have analysed to get more insights on the problem. As far as battery is concerned some of the critical variables other than conductance are voltage and discharge. Let us connect these two variables along with the conductance profile to gain more intuitions from the data.

Combining different variables to observe trends

The above figure is a plot which depicts three variables across the same time span. The idea of plotting multiple variables together across a common time span is to unearth any discernible trends we can see together. A cursory look at this plot will reveal some obvious observations.

  1. The fall in current and voltage in conjunction with drop in conductance.
  2. The cyclic nature of the voltage profile.
  3. A gradual drop in the troughs of the voltage profile.

Having made some observations,we now need to ascertain whether these observations can be codified to some definitive trends. This can be verified only by observing plots for many samples of similar variables. By sampling data pertaining to many batteries if we can get similar observations, then we can be sure that we have unearthed some trends explaining behaviors of different variables. However just unearthing some trends will not suffice. We have to get some intuitions from such trends which will help in transforming the raw variables to some form which will help in the modelling task. This is achieved by feature engineering the raw variables.

Feature Engineering

Many a times the given set of raw variables will not suffice for extracting the required predictive power from the model. We will have to transform the raw variables to generate new variables giving us the extra thrust towards better predictive metrics. What transformation has to be done, will be based on the intuitions we build during the exploratory analysis phase and also by combining domain knowledge. For the case of batteries let us revisit some of the intuitions we build during the exploratory analysis phase and see how these intuitions we build can be used for feature engineering.

During our discussions with domain team we found out that precipitous fall in conductance is an indicator of failing health of a battery. So a probable feature we can extract from the conductance variable is the slope of the data points over a fixed time span.The rationale for such a feature is this, if precipitous fall in conductance over time is an indicator of failing health of a battery  then the slope of data points for a battery which is failing will be more steeper than the battery which is healthy. It was observed that through such transformation there was a positive influence on predictive metrics. The dynamics of such transformation is as follows, if we have conductance data for the battery for three years, we can take consecutive three month window of conductance data and take the slope of all the data points and make it as a feature.  By doing this, the number of rows of data for the variable also gets consolidated to much fewer numbers.

Let us also look at another example of feature engineering which we can introduce to the variable, discharge voltage. As seen from the above figure, the discharge voltage follows a wave like profile. It turns out that when a battery discharges the voltage first drops and then it rises. This behavior is called the “Coupe De Fouet” (CDF) effect. Now our thought should be, how do we combine the observed wave like pattern and the knowledge about CDF into a feature ? Again we have to dig into domain knowledge. As per theory on the state of health of batteries there are standards for the CDF profile of a healthy battery and that of a failing battery. These are prescribed by the manufacturer of the battery. For example the manufacturing standards prescribe certain depth to which the voltage will fall during discharge and certain height to which it will go up during a typical CDF effect. The deviance between the observed CDF and the manufacture prescribed standard can be taken as another feature. Similarly we can also think of other features related to voltage, like depth of discharge ( DOD), number of cycles etc. Our focus should be in using the available domain knowledge to transform raw variables into features.

As seen from the above two examples the essence of feature engineering is all about translating the domain knowledge and the trends seen in the data to more meaningful features. The veracity of the models which are built depends a lot on the strength of  the features built. Now that we have seen the feature engineering phase let us now look at modelling strategy for this use case.

Modelling Phase

In the initial part of this article we discussed labelling strategy for training the model. Since the use case is to predict which battery would fail and at what period of time, we have to look back in time from the failure point label for creating different classes related to periods of failure. In this specific case, the different features were created by consolidating 3 months of data into a single row. So one period before failure would denote 3 months before failure. So if the requirement is to predict failure 6 months prior to when it is likely to happen, then we will have 4 different classes i.e  failure point,one period before failure(3 months prior to failure point) ,two periods before failure and (6 months prior to failure point) & normal state. All periods prior to 6 months can be labelled as normal state.

With respect to modelling, we can spot check with different classification algorithms ( logistic regression, Naive bayes, SVM, Random Forest, XGboost .. etc). The choice of final model will be based on the accuracy metrics ( sensitivity , specificity etc) of the spot checked models. Another aspect which might be useful to note is also that, data set could be highly unbalanced i.e the number of normal battery classes is likely to outnumber the failure classes disproportionately. It will be a good idea to try out class balancing methods on the data set before modelling.

Wrapping up

This post brings down curtains to an end to end view of a predictive analytics use case for industrial batteries. Any use case within the manufacturing sector can be quite challenging as the variables involved are very technical and would require lot of interventions from related domain teams. Constant engagement of domain specialist as part of the data science team is very important for the success of such projects.

I have tried my best to write the nuances of such a difficult use case. I have tried to cover the critical elements in the process. In case of any clarifications on the use case and details of its implementation you can connect with me through the following email id bayesianquest@gmail.com. Looking forward to hearing from you.  Till then let me sign off.

Watch this space for more such use cases.

Applied Data Science Series : Solving a Predictive Maintenance Business Problem – Part III

battery2

In the previous post of the series we discussed the exploratory analysis phase and saw how the combination of domain knowledge and single variable exploration unravels intuitions from the data. In this post we will expand our analysis to multiple variables and then see how intuitions we develop during the exploration phase, can lead to generating new features for modelling.

In the example we were discussing, we were limited to analysis of a single variable i.e conductance. However to get more meaningful insights we have to connect other variables layer by layer to the initial variable which we have analysed to get more insights on the problem. As far as battery is concerned some of the critical variables other than conductance are voltage and discharge. Let us connect these two variables along with the conductance profile to gain more intuitions from the data.

Multivariable_plot

The above figure is a plot which depicts three variables across the same time span. The idea of plotting multiple variables together across a common time span is to unearth any discernible trends we can see together. A cursory look at this plot will reveal some obvious observations.

  1. The fall in current and voltage in conjunction with drop in conductance.
  2. The cyclic nature of the voltage profile.
  3. A gradual drop in the troughs of the voltage profile.

Having made some observations,we now need to ascertain whether these observations can be codified to some definitive trends. This can be verified only by observing plots for many samples of similar variables. By sampling data pertaining to many batteries if we can get similar observations, then we can be sure that we have unearthed some trends explaining behaviors of different variables. However just unearthing some trends will not suffice. We have to get some intuitions from such trends which will help in transforming the raw variables to some form which will help in the modelling task. This is achieved by feature engineering the raw variables.

Feature Engineering

Many a times the given set of raw variables will not suffice for extracting the required predictive power from the model. We will have to transform the raw variables to generate new variables giving us the extra thrust towards better predictive metrics. What transformation has to be done, will be based on the intuitions we build during the exploratory analysis phase and also by combining domain knowledge. For the case of batteries let us revisit some of the intuitions we build during the exploratory analysis phase and see how these intuitions we build can be used for feature engineering.

In the previous post , we found out that precipitous fall in conductance is an indicator of failing health of a battery. So a probable feature we can extract from the conductance variable is the slope of the data points over a fixed time span.The rationale for such a feature is this, if precipitous fall in conductance over time is an indicator of failing health of a battery  then the slope of data points for a battery which is failing will be more steeper than the battery which is healthy. It was observed that through such transformation there was a positive influence on predictive metrics. The dynamics of such transformation is as follows, if we have conductance data for the battery for three years, we can take consecutive three month window of conductance data and take the slope of all the data points and make it as a feature.  By doing this, the number of rows of data for the variable also gets consolidated to much fewer numbers.

Let us also look at another example of feature engineering which we can introduce to the variable, discharge voltage. As seen from the above figure, the discharge voltage follows a wave like profile. It turns out that when a battery discharges the voltage first drops and then it rises. This behavior is called the “Coupe De Fouet” (CDF) effect. Now our thought should be, how do we combine the observed wave like pattern and the knowledge about CDF into a feature ? Again we have to dig into domain knowledge. As per theory on the state of health of batteries there are standards for the CDF profile of a healthy battery and that of a failing battery. These are prescribed by the manufacturer of the battery. For example the manufacturing standards prescribe certain depth to which the voltage will fall during discharge and certain height to which it will go up during a typical CDF effect. The deviance between the observed CDF and the manufacture prescribed standard can be taken as another feature. Similarly we can also think of other features related to voltage, like depth of discharge ( DOD), number of cycles etc. Our focus should be in using the available domain knowledge to transform raw variables into features.

As seen from the above two examples the essence of feature engineering is all about translating the domain knowledge and the trends seen in the data to more meaningful features. The veracity of the models which are built depends a lot on the strength of  the features built. Now that we have seen the feature engineering phase let us now look at modelling strategy for this use case.

Modelling Phase

In the first part of this use case we discussed about labeling strategy for training the model. Since the use case is to predict which battery would fail and at what period of time, we have to look back in time from the failure point label for creating different classes related to periods of failure. In this specific case, the different features were created by consolidating 3 months of data into a single row. So one period before failure would denote 3 months before failure. So if the requirement is to predict failure 6 months prior to when it is likely to happen, then we will have 4 different classes i.e  failure point,one period before failure(3 months prior to failure point) ,two periods before failure and (6 months prior to failure point) & normal state. All periods prior to 6 months can be labelled as normal state.

With respect to modelling, we can spot check with different classification algorithms ( logistic regression, Naive bayes, SVM, Random Forest, XGboost .. etc). The choice of final model will be based on the accuracy metrics ( sensitivity , specificity etc) of the spot checked models. Another aspect which might be useful to note is also that, data set could be highly unbalanced i.e the number of normal battery classes is likely to outnumber the failure classes disproportionately. It will be a good idea to try out class balancing methods on the data set before modelling.

Wrapping up

This post brings down curtains to the three part series on predictive analytics for industrial batteries. Any use case within the manufacturing sector can be quite challenging as the variables involved are very technical and would require lot of interventions from related domain teams. Constant engagement of domain specialist as part of the data science team is very important for the success of such projects.

I have tried my best to write the nuances of such a difficult use case. I have tried to cover the critical elements in the process. In case of any clarifications on the use case and details of its implementation you can connect with me through the following email id bayesianquest@gmail.com. Looking forward to hearing from you.  Till then let me sign off.

Watch this space for more such use cases.

Applied Data Science Series : Solving a Predictive Maintenance Business Problem – Part II

 

ExploratoryAnalysis

 

In the first part of the applied data science series, we discussed about first three phases of the data science process namely business discovery, data discovery and data preparation. In business discover phase we talked on how the business problem i.e. predicting end life of batteries, defines the choice of  variables that comes into play. In the data discovery phase we discussed data sufficiency and other considerations like variety and velocity of data and how these considerations affect the data science problem formulation. In the last phase we touched upon how the data points and its various constituents drive the predictive problem formulation. In this post we will discuss further on how exploratory analysis can be used for getting insights for feature engineering.

Exploratory Analysis – Unraveling latent trends

This phase entails digging deep to get a feel of the data and extract intuitions for feature engineering. When embarking upon exploratory analysis, it would be a good idea to get inputs from domain team on the relation between variables and the business problem. Such inputs are often the starting point for this phase.

Let us now get to the context of our preventive maintenance problem and evolve a philosophy for exploratory analysis.In the case of industrial batteries, a key variable which affects the state of health of a battery is its conductance. It turns out that an indicator of failing health of  battery is the precipitous drop in conductance. Armed with this information our next task should be to  identify, from our available data set,batteries that have higher probability to fail. Since precipitous fall in conductance is an indicator of failing health,the conductance data of  unhealthy batteries will have more variance than the normal ones. So the best way to identify failing batteries from the normal ones would be to apply some consolidating metric like standard deviation or variance on the conductance data and further drill deep on samples which stand apart from the normal population.

SD1_Plot The above is a plot depicting standard deviation of conductance for all batteries. Now what might be of interest to us is the red zone which we can call the “Potential failure Zone”. The potential failure zone consists of those batteries whose conductance values show high standard deviation. Batteries with failing health are likely to exhibit large fall in conductance and as a corollary their values will also show higher standard deviation. This implies that the samples of batteries which have higher probability of failure will in all likelihood be from this failure zone. However to ascertain this hypothesis we will have to dig deep into batteries in the failure zone and look for patterns which might differentiate them from normal batteries. Another objective to dig deep is also to elicit clues from the underlying patterns on what features to include in the predictive model. We will discuss more on the feature extraction when we discuss about feature engineering. Now let us come back to our discussion on digging deep into the failure zone and ferreting out significant patterns. It has to be noted that in addition to the samples in the failure zone we will also have to observe patterns from the normal zone to help separate wheat from the chaff . Intuitions derived by observing different patterns would become vital during feature engineering stage.

Conductance_Comparison

The above figure is a comparison of patterns from either zones. The figure on the left is from the failure zone and the one on the right is from the other. We can clearly see how the precipitous fall is manifested in the sample from the failure zone. The other aspect to note is also the magnitude of the fall. Every battery will have degrading conductance over time. However the magnitude of  degradation is what differentiates the unhealthy  battery from a normal one. We can observe from the plot on the left that the fall in conductance is more than 50%, however for the battery to the right the drop is more muted.  Another aspect we can observe is the slope of conductance. As evident from the two plots, the slope of  conductance profile for the battery on the left is much more steeper over time than the one on the right. These intuitions which we have derived so far might become critical from the overall scheme of feature engineering and modelling. Similar to the intuitions which we have disinterred so far, more could be extracted by observing more samples. The philosophy behind exploratory analysis entails visualizing more and more samples, observing patterns and extracting clues for feature engineering. The more time we spend on doing this more ammunition we get for feature engineering.

Wrapping up

So far we discussed different considerations for the exploratory analysis phase. To summarize, here are some of the pointers during this phase.

  1. Take inputs from domain team related to the problem we are trying to solve. In our case the clue which we got was the relation between conductance and health of batteries.
  2. Identify any consolidating metric for the variable under consideration to separate out anomalous samples. In the example above we used standard deviation of conductance values to find anomalies.
  3. Once the samples are demarcated using the consolidation metric, visualize samples from different sets to identify discernible patterns in data.
  4. From the patterns we observe root out clues for feature engineering. In our example we identified that % fall in conductance and slope of conductance over time could be potential features.

The above pointers are general guidelines on how one should think through during  exploratory analysis phase.

The discussions so far were centered on exploratory analysis on a single variable. Next we have to connect other variables to the one which we already observed and identify trends in unison. When we combine trends from multiple variables we will be able to unravel more insights for feature engineering. We will continue our discussions on combining more variables and subsequent feature engineering in our next post. Watch out this space for more.

 

Applied Data Science Series : Solving a Predictive Maintenance Business Problem

JMJPFU

Over the past few months, many people have been asking me to write on what it entails to do a data science project end to end i.e from the business problem defining phase to modelling and its final deployment. When I pondered on that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods with code on how to deal with a problem. However an end to end view of what it takes to do a data science project for a specific business use case is little hard to find. From this week onward, we would be starting a new series  called the Applied Data Science Series. In this series I would be giving an end to end perspective on tackling business use cases or societal problems within the framework of Data Science. In this first article of the applied data science series we will deal with a predictive maintenance business use case. The use case involved is to predict the end life of large industrial batteries, which falls under the genre of use cases called preventive maintenance use cases.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture on the life cycle of a data science projectBigPicture.

The above figure is a depiction of the big picture on what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is with a business problem. The problem we have at hand is to try to predict the end life of large industrial batteries. When we are encountered with such a business problem, the first thing which should come to our mind is on the key variables which will come into play . For this specific example of batteries some of the key variables which determine the state of health of batteries are conductance, discharge , voltage, current and temperature.

The next questions which we need to ask is on the lead indicators or trends within these variables, which will help in solving the business problem. This is where we also have to take inputs from the domain team. For the case of batteries, it turns out that a key trend which can indicate propensity for failure  is drop in conductance values. The conductance of batteries will drop over time, however the rate at which the conductance values drop will be accelerated before points of failure. This is a vital clue which we will have to be cognizant about when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge the voltage will initially drop to a minimum level and then it will regain the voltage. This is called the “Coup de Fouet” effect. Every manufacturer of batteries will prescribes standards and control charts as to how much, voltage can drop and how the regaining process should be. Any deviation from these standards and control charts would mean anomalous behaviors. This is another set of indicator which will have to look out for when we explore data.

In addition to the above two indicators there are many other factors which one would have to be aware of which will indicate failure. During the business exploration phase we have to identify all such factors which are related to the business problem which we are to solve and formulate hypothesis about them. Once we formulate our hypothesis we have to look out for evidences / trends within the data about these hypothesis. With respect to the two variables which we have discussed above some hypothesis we can formulate are the following.

  1. Gradual drop in conductance over time would mean normal behavior and sudden drop would mean anomalous behavior
  2. Deviation from manufactured prescribed “Coup de Fouet” effect would indicate anomalous behavior

When we go about in exploring data, hypothesis like the above will be point of reference in terms of trends which we will have to look out on the variables involved. The more hypothesis we formulate based on domain expertise the better it would be at the exploratory stage. Now that we have seen what it entails within the business discovery phase, let us encapsulate our discussions on key considerations within the business discovery phase

  1. Understand the business problem which we are set out to solve
  2. Identify all key variables related to the business problem
  3. Identify the lead indicators within these variable which will help in solving the business problem.
  4. Formulate hypothesis about the lead indicators

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective now its time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to try to understand some critical aspects about how data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following

  • Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase ?
  • What is the mechanism of data capture ? Does the data capture mechanism differ according to the variables ?
  • What is the frequency of data capture ? Does it vary across the variables ?
  • Does the volume of data captured, vary according to the frequency and variables involved ?

In the case of the battery prediction problem, there are three different data sets . These data sets pertained to different set of variables. The frequency of data collection and the volume of data captured also varies. Some of the key data sets involved are the following

  • Conductance data set : Data Pertaining to the conductance of the batteries. This is collected every 2-3 days . Some of the key data points collected along with the conductance data include
    • Time stamp when the conductance data was taken
    • Unique identifier for each battery
    • Other related information like manufacturer , installation location, model , string it was connected to etc
  • Terminal voltage data : Data pertaining to Voltage and temperature of battery. This is collected every day. Key data points include
    • Voltage of the battery
    • Temperature
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc
  • Discharge Data : Discharge data is collected once every 3 months. Key variable include
    • Discharge voltage
    • Current at which voltage discharges
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc

DataSets

As seen, we have to play around with three very distinct data sets with different sets of variables, different frequency of time when the data points arrive and different volume of data for each of the variables involved. One of the key challenges, one would encounter is in connecting all these variables together into a coherent data set, which will help in the predictive task. It would be easier to get this done if we can formulate the predictive problem by connecting the data sets available to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem we have at hand and then connect it with the data points which we have at hand.  The predictive problem requires us to predict two things

  1. Which battery will fail &
  2.  Which period of time in future will the battery fail.

Since the prediction is at a battery level, our unit of reference for formulating the predictive problem is individual battery. This means that all the variables which are present across the multiple data sets have to be consolidated at the individual battery level.

The next question is, at what period of time do we have to consolidate the variables for each battery ? To answer this question, we will have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are capture at different intervals. In addition the volume of data collected for each of those variables at those instances of time also vary substantially.

  • Conductance : One reading of a battery captured once every 3 days.
  • Voltage & Temperature : 4-5 readings per battery captured every day.
  • Discharge : A set of reading captured every second at different intervals of a day once every 3 months (approximately 4500 – 5000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in future, we will have to have our model learn the behavior of these variables across time periods. However we have to select a time period, where we will have sufficient data points for each of the variables. The ideal time period we should choose in this scenario is every 3 months as discharge data is available only once every 3 months. This would mean that all the data points for each battery for each variable would have to be consolidated to a single record for every 3 months. So if each battery has around 3 years of data it would entail 12 records for a battery.

DataConsolidation

Another aspect we have to look at is how 3 months of data points for a battery can be consolidated to make one record corresponding to each variable. For this we have to resort to some suitable form of consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering . We will deal with those aspects in detail when we talk about exploratory analysis and feature engineering phases.

The next important point which we have to deal with would be the labeling of the response variable. Since the business problem is to predict which battery would fail, the response variable would be classifying whether a record of a battery falls under a failure class or not. However there is a lacunae in this approach. What we want is to predict well ahead of time when a battery is likely to fail and therefore we will have to factor in the “when” part also into the classification task. This would entail, looking at samples of batteries which has actually failed and identifying the point of time when failure happened. We label that point as “failure point” and then look back in time from the failure point to classify periods leading to failure. Since the consolidation period for data points is three months, we can fix the “looking back” period also to be 3 months. This would mean, for those samples of batteries where we know the failure point, we look at the record which is one time period( 3 months) before failure and label the data as 1 period before failure, record of data which corresponds to 6 month before failure will be labelled as 2 periods before failure and so on. We can continue labeling the data according to periods before failure, till we reach a comfortable point in time ahead of failure ( say 1 year). If the comfortable period we have in mind is 1 year, we would have 4 failure classes i.e 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records before the 1 year period of time can be labelled as “Normal Periods”. This labeling strategy will mean that our predictive problem is a multinomial classification problem, with 5 classes ( 4 failure period classes and 1 normal period class).

Failure-Labelling

The above discussed, labeling strategy is for samples of batteries within our data set which have actually failed and where we know when the failure has happened. However if we do not have information about the list of batteries which have failed and which have not failed, we have to resort to intense exploratory analysis to first determine samples of batteries which have failed and then label them according to the labeling strategy discussed above. We can discuss about how we can use exploratory analysis to identify batteries which have failed, in the next post. Needless to say, the records of all batteries which have not failed, will be labelled as “Normal Periods”.

Now that we have seen the predictive problem formulation part, let us recap our discussions so far. The predictive problem formulation step involves the following

  1. Understand the business problem and formulate the response variables.
  2. Identify the unit of reference to which the business problem will apply ( each battery in our case)
  3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables are generated
  4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
  5. From the data set, identify those units which have failed and which have not failed. Such information will generally be available from past maintenance contracts for each units.
  6. Adopt a labeling strategy for both the failed units and normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records as failed classes till a convenient period( 1 year in this case). All records before that period will be labelled the same as the units which have not failed ( “Normal Periods”)

Wrapping up till we meet again

So far we have discussed the initial two phase of a data science project . The first phase entails defining the business problem and carrying out the business discovery. In the next phase, which is the data discovery phase, we align the available data points to the business problem and then formulate the predictive problem. Once we have a clear understanding of how the predictive problem have to be formulated our next task will be to get into exploratory analysis and feature engineering phases. These phases and the subsequent phases would be dealt in detail in the next post of this series. Watch out this space for more.

 

Classification Algorithms: Random Forest – Part II

randomforest

 

In the first part of this series we set the context for Random Forest algorithm by introducing the tree based algorithm for classification problems. In this post we will look at some of the limitations of the tree based model and how they were overcome paving the way to a powerful model – Random Forest. Two major methods that were employed to overcome those pitfalls are Bootstrapping and Bagging. We will discuss them first before delving into random forest.

Bootstrapping and Bagging

When we discussed the tree based model we saw that such models are very intuitive i.e. they are easy to interpret. However such models suffer from a major drawback i.e high variance.Let us understand what high variance means in this context. Suppose we were to have a data set which we divide it into three parts. If three different tree models were fit on these data sets and we were to predict the result of a new observation based on these three models. The result we might get from each of these three models for the same observation can be very different. This is what we call in statistical jargon as ” Model with high variance”. High variance obviously is bad as the reliability of the results we get is compromised. One effective way to overcome high variance is to do averaging. This would mean taking multiple data sets, fitting a tree based model on each of these data sets, do predictions on new observations and then averaging the results got from each of the tree model to get a more reliable result. This seems a very plausible solution. However we have a major problem here. Doing averaging would require having multiple data sets. But what if the data we have is quite limited and obtaining additional data is prohibitively expensive ?

……….. Lo and Behold, we have a powerful method to help us out of this predicament and it is called Bootstrapping.

The etymological meaning of the word Bootstrapping is “Pulling oneself up by ones bootstrap”.In essence it means doing some task considered impossible. In statistics bootstrapping procedure entails sampling from the available data set with replacement. Let me elaborate with an example. Suppose our data set were to have 10 observations ( rows 1:10). From this data set we were to randomly pick an observation, say row 6. After that we replace the row 6 into the data set and we randomly pick another number. Say this time we got row 8. We again put this observation back and repeat the process till we get around 10 observations. Let us assume that the first set of observations we picked looks like this : 6,8,4,8,5,6,9,1,2,5. You might have noticed that there are observations which repeat within the above set. That is perfectly all-right in bootstrapping. We continue this process till we get a collection of bootstrapped samples of 10 observations each. Once we get a collection or a bag of bootstrapped data sets, we fit a tree model for each of these sets, carry out predictions and then average the results. This whole process is called bagging. Bagging helps us get over our original problem of high variance and the results mirror more closely to reality.

Random Forest

Now that we have discussed bootstrapping and bagging we are in a position to get into the nuances of random forest. Random Forest algorithm provides an improvement over bagging in terms of de-correlating the trees. Let me elaborate the de-correlating part. When we were discussing the tree based methods in the last post, we talked about splitting the data set based on the best features.When we grow our trees on the bootstrapped samples , more often than not it is those set of best features which gets picked, to do the split and thereby grow trees. This will result in getting a bunch of trees which look almost the same or in statistical terms “co-related”. We also have discussed that the final result will be obtained by averaging results from all the tree models grown on the bootstrapped samples. It works out that averaging predictions from co-related trees will result in sub-optimal predictions.

To overcome this, Random Forest algorithm randomly picks a smaller subset of features to do split. If there were “P” features in the data set, the subset picked is approximately √P.  The idea of randomly picking a subset of features for each tree is to avoid being biased towards the best predictors. In the new setting, all the predictors have equal chance of being picked and the tree models will be more “representative”. Averaging the results from these representative trees will provide more accurate predictions. In effect the combination of bootstrapping, bagging and random picking of features provides the robustness inherent in the random forest model.

Out of Bag Error Estimation

There is a very straight forward method to estimate the error in a bagged model and it is called “Out of Bag”(OOB) error estimation. In the example we discussed on bootstrapping ,we had 10 observations in our first sample, (6,8,4,8,5,6,9,1,2,5). We can see that the following observations ( 3,7,10) have not been picked in the first bootstrapped sample. These elements are called “Out of Bag” observations. In general it is seen that in the bootstrapping process approximately only 2/3rd of the observations are generally picked. That means about 1/3rd of the observations are OOB in each bootstrapped sample. OOBs have some very important purpose in the overall scheme of things i.e. they act as test beds for estimating error in the model. Let me emphasize this idea with an example. Let us take the case of observation 3. As seen, it is an OOB observation for the first bootstrapped sample. Let us assume that the same observation ends up as OOB for the 6th and 12th bootstrapped data set too. When a tree model is fit on the first, sixth and the twelfth bootstrapped set, the observation 3 will be used as a test set to predict three distinct results corresponding to each model. The three results for observation 3 will thereby be averaged(for regression) to get a single prediction. In case of classification problems the most prevalent class out of the three will be taken. Once we get one single prediction by averaging, the error is estimated by comparing against the true class the observation 3 fall into. Similarly the error estimation is done for all the OOB elements to get an overall aggregation of error. This method of error estimation eliminates the need for cross validation which can be cumbersome for large data sets.

Wrapping Up

The ideas behind random forest model i.e bootstrapping, bagging, random feature selection etc has aided the making of a very powerful algorithm. However random forest is not bereft of pitfalls. One major pitfalls of the model is that it cant be interpreted easily. However the positives of this model far outweighs the negative and because of this random forest is one of the most powerful algorithms providing realistic results.

It is time to wrap up our discussion on tree based algorithms and random forest in particular. From the next post onward we start a new series called the “Mind of a Data Scientist”. In this series we do an exploratory walk, through the thought process of a data scientist in enabling, data driven informed decision making. Watch out this space for more

Classification Algorithms : Random Forest – Part I, Setting the Context

TreesIn the last few posts, where we discussed  Logistic regression, we had a fair bit of discussions on classification problems. Classification problems are the most prevalent ones we encounter in the real world machine learning setting and it is important to deal with various facets of this problem.In the next few posts, we will decipher some of the popular algorithms used within the classification context. The first of those algorithms which we are discussing is called the Random Forest. It is one of the most popular and powerful algorithms, which is currently used in the classification setting. In addition to deciphering the dynamics of Random Forest, we will also be looking at a practical applications powered by Random Forest algorithm. The practical application we will be dealing with is a movie sentiment analyser using a Random Forest Model. Outlined below is the path we will traverse in our discussions,

  • Random Forest :setting the context – An introduction to Tree based methods
  • Introduction to bootstrapping and deciphering the dynamics of Random Forest
  • Random Forest in action – Introduction of the movie sentiment analyser and Feature selection approaches
  • Decoding the movie sentiment analyser with the Random Forest model.

Setting the stage with Tree based Methods

Random Forest algorithm falls under a genre of prediction method called the Tree Based Method. Understanding the logic behind Tree based methods will help immensely in getting a better handle on Random Forest. So let us start our discussion on Tree based methods.

Tree based methods resonates very closely to the way humans take decisions . Let me start off with a small example of how I take decision on whether I should take an umbrella when I want to go for a walk. The decision tree is as depicted below.

Tree

Figure 1

Looking at the above decision tree let us try to get some intuition on the key components. As seen from the tree, there are three decision gates which splits my decision space thereby helping me to take a more informed decision. In machine learning parlance these decision gates are called “Predictor space” or “Feature Space”. At each predictor space, there is also a value which splits the predictor space. For example for the first feature space, “Did it rain in the past few days”, the value is a binary values i.e Yes/No ( or 1/0 numerically). It is the combination of a predictor space and its corresponding value which will help in finally arriving at a decision. The values need not be binary in nature. It can be a continuous numbers or some ordinal value. Taking cues from the above toy example, let us look at a real data set and try to under stand the tree based method in more detail.

The data set we are going to analyse is called the “Heart” data set. It is available in the UCI Machine Learning Repository.  A snapshot of the data set is as given below.

heart

(Source: UCI Machine Learning Repository)

This data set consists of  records of 269 patients,along the rows, with symptoms of heart ailments. The columns 1 to 13  are predictors/features which necessarily are various attributes helping in detection of heart disease. The 14th column (“Pred”) is the outcome variable, which indicates whether that patient has heart disease or not, a value of “2” indicates the prevalence of heart disease and “1” absence of heart disease. Our aim is to understand how a tree based algorithm learns from a training set like the above and helps in predicting  the prevalence of heart disease in a patient.

The first step in a tree based algorithm is to find a predictor(from the 13 predictors) and a value to split the data set into two distinct regions or groups. In our toy example, the predictor we used for the first split was the one which indicated the presence of rain in the previous days. For the time being let us assume that the best predictor to do our split is the 13th predictor “Thal”. This predictor is the measure of Thalium stress test and contains 3 types of values , 3,6 & 7. A value of 3 indicates normal behavior, 6 indicates a fixed defect and 7 a reversible defect. Let us not worry too much about what fixed defect and reversible defects are as they are medical jargon. However we will only be concerned about the values 3,6 & 7. Let us assume that the value at which we do the split is  3. In essence any patient who has the result of Thalium stress test as 3 will be in group1 and the ones who have more than 3 will be another group. The first split is show in figure below

split1

Figure 2

As seen above, the first split divides the total of 269 patients into two group of 151 and 118 respectively. In group 1 out of the 151 patients 119 of them have the outcome variable as 1( absence of heart disease) and the other 32 have outcome of 2 ( presence of heart disease). The corresponding value for group 2 are 31 and 87 respectively.

The next step is to grow the trees deeper by splitting the two branches we got after the first split with some other predictor. The choice of predictor for each branch can be same or different. I will come to the criteria for the choice of predictor later on. For the time being let us assume the same predictor is used to split the two branches further. Let the split happen with the 10th predictor (ST) and a value of 0.5. Any record with value less than 0.5 will be in a group to the left and more than 0.5 will belong to the branch on the right. The split is depicted as below.

split2

Figure 3

This process of splitting the nodes continues till a certain threshold is reached. The threshold can be, say ,till no region has more than 5 observations. The last layer we get after all the splits are called the terminal nodes.

Now that we have seen the process of splitting(growing a tree) let us deal with two important aspects we did not explain in detail,

  1. How do we decide on the predictor/feature and values for making a split ?
  2. How do we do the predictions for any test set observations?

Selecting the predictors and its values

The process of selecting a predictor is an iterative one.We pick one predictor at a time and an arbitrary value within the predictor space and carry out an assessment if the picked predictor and value is the best one possible for carrying out the split.The way we do assessment of whether the predictor is the best one has its paralells with the way we find the right set of parameters by overall cost minimization in logistic regression. In tree based method we do the assessment through a method called minimization of classification error rate. The name might look intimidating, however the idea behind it is quite simple. When we carry out the node splitting process the labels or outcomes of all the observations are approximated to the most prevalent outcome. In the first split we did (figure 2) we split the observations in two branches. In the left branch out of the total 151 observations 119 of them have outcome as 1. So the most prevalent outcome in the left branch is outcome 1. Therefore all observations which fall in the left branch will be approximated as outcome 1, irrespective of what its true outcome is. In the process of approximating to the most prevalent outcome, we also make some classification errors. For the case of the left branch, there were 32 observations which belonged to outcome 2 which we are approximating as outcome 1. This obviously is the error we will inherit because of our approximation. This error as percentage of the total number of observations in that branch is called the classification error rate ( i.e 32/151). The criteria for the selecting the right predictor and the right value to do the split is the one which yields the lowest classification error rate. In a nutshell the overall process  for selecting the best  predictor and split value is as follows

  1. Pick a predictor and a value within the predictor space for doing a split
  2. Calculate the classification error rate.
  3. Repeat the process with another predictor and value noting the classification error rate obtained in each selection.
  4. Pick the predictor-value combination which yields the lowest classification error rate to do the split of that particular node.

Predictions for any test set observations

So far, what we have discussed is the training process. At the end of the training process what we learn are a set of best predictors and its corresponding values to do splits ,at each node. We have also discussed about the approximation of outcomes to the most prevalent outcome for calculation of classification error rate. The approximation of outcome  has another utility, i.e to determine the class of the terminal nodes. After the tree is grown to its full depth the terminal nodes will be categorized to an outcome which is most prevalent in that node. This categorization is required when we have to do predictions for any test set.

Once the learning is done on the training set, the way we do prediction on a test set is quite straight forward. The test set examples are split according to the splits we learned during the training process. After the splits, the test set examples finally ends up in one of the many terminal nodes. The prediction for the test set example will be the same as the outcome of the terminal node where it ends up.

Wrapping Up

As mentioned earlier tree based methods is the basis for understanding advanced algorithms like Random Forest. Now that we have seen the tree based methods we are well equipped to decipher Random Forest. We will do that in the next post. Watch out this space for your safari of Random Forest.