Applied Data Science Series : Solving a Predictive Maintenance Business Problem


Over the past few months, many people have been asking me to write about what it entails to do a data science project end to end, i.e. from the business problem definition phase through modelling and final deployment. When I pondered that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods with code on how to deal with a problem. However, an end to end view of what it takes to do a data science project for a specific business use case is a little hard to find. From this week onward, we will be starting a new series called the Applied Data Science Series. In this series I will give an end to end perspective on tackling business use cases or societal problems within the framework of Data Science. In this first article of the applied data science series we will deal with a predictive maintenance business use case. The use case is to predict the end of life of large industrial batteries, which falls under the genre of use cases called predictive maintenance use cases.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture of the life cycle of a data science project.

[Figure: The big picture of a data science project life cycle]

The above figure is a depiction of the big picture of what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is a business problem. The problem we have at hand is to predict the end of life of large industrial batteries. When we encounter such a business problem, the first thing that should come to mind is the key variables which will come into play. For this specific example of batteries, some of the key variables which determine the state of health of batteries are conductance, discharge, voltage, current and temperature.

The next question we need to ask is about the lead indicators or trends within these variables which will help in solving the business problem. This is where we also have to take inputs from the domain team. In the case of batteries, it turns out that a key trend which can indicate a propensity for failure is a drop in conductance values. The conductance of batteries will drop over time; however, the rate at which the conductance values drop accelerates before points of failure. This is a vital clue which we will have to be cognizant of when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge, the voltage will initially drop to a minimum level and then the battery will regain its voltage. This is called the "Coup de Fouet" effect. Every manufacturer of batteries prescribes standards and control charts for how much the voltage can drop and how the regaining process should proceed. Any deviation from these standards and control charts would mean anomalous behavior. This is another set of indicators which we will have to look out for when we explore the data.
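As a rough illustration, here is a minimal sketch of how a single discharge curve could be checked against such control limits. The nominal voltage and the dip/recovery thresholds below are purely illustrative assumptions, not actual manufacturer specifications.

```python
import numpy as np

def check_coup_de_fouet(voltages, nominal_v=12.0,
                        max_dip_fraction=0.10, min_recovery_fraction=0.97):
    """Flag a discharge voltage series as anomalous if the initial dip is too
    deep or the voltage fails to recover close to its nominal level.
    All thresholds here are made-up illustrative values."""
    voltages = np.asarray(voltages, dtype=float)
    trough = voltages.min()        # deepest point of the initial dip
    recovery = voltages[-1]        # level regained at the end of the window
    dip_ok = trough >= nominal_v * (1 - max_dip_fraction)
    recovery_ok = recovery >= nominal_v * min_recovery_fraction
    return dip_ok and recovery_ok

# Example: a curve that dips to 11.2 V and recovers to 11.9 V on a 12 V battery
print(check_coup_de_fouet([12.0, 11.4, 11.2, 11.5, 11.8, 11.9]))  # True (within limits)
```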

In addition to the above two indicators, there are many other factors one would have to be aware of which can indicate failure. During the business exploration phase we have to identify all such factors related to the business problem we set out to solve and formulate hypotheses about them. Once we formulate our hypotheses, we have to look for evidence and trends within the data which support them. With respect to the two variables discussed above, some hypotheses we can formulate are the following.

  1. A gradual drop in conductance over time would mean normal behavior, while a sudden drop would mean anomalous behavior (see the sketch just after this list).
  2. Deviation from the manufacturer-prescribed "Coup de Fouet" effect would indicate anomalous behavior.
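To make the first hypothesis concrete, here is a minimal sketch of how one might look for evidence of it during exploration: flag readings where the conductance drops much faster than the battery's own typical decline. The column names and the 3x threshold are illustrative assumptions, not values derived from the actual data.

```python
import pandas as pd

def flag_sudden_drops(readings, factor=3.0):
    """readings is assumed to have columns ['battery_id', 'timestamp', 'conductance'].
    Marks readings whose drop is several times larger than the battery's typical drop."""
    readings = readings.sort_values(['battery_id', 'timestamp']).copy()
    readings['delta'] = readings.groupby('battery_id')['conductance'].diff()
    # A battery's typical change per reading (median of absolute changes)
    typical = (readings.groupby('battery_id')['delta']
               .transform(lambda s: s.abs().median()))
    readings['sudden_drop'] = readings['delta'] < -factor * typical
    return readings
```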

When we go about exploring the data, hypotheses like the above will be our points of reference for the trends we have to look for in the variables involved. The more hypotheses we formulate based on domain expertise, the better placed we will be at the exploratory stage. Now that we have seen what the business discovery phase entails, let us encapsulate our discussions on the key considerations within this phase.

  1. Understand the business problem which we set out to solve.
  2. Identify all key variables related to the business problem.
  3. Identify the lead indicators within these variables which will help in solving the business problem.
  4. Formulate hypotheses about the lead indicators.

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective, it is time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to understand some critical aspects of how the data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following:

  • Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase?
  • What is the mechanism of data capture? Does the data capture mechanism differ according to the variables?
  • What is the frequency of data capture? Does it vary across the variables?
  • Does the volume of data captured vary according to the frequency and the variables involved?

In the case of the battery prediction problem, there are three different data sets. These data sets pertain to different sets of variables. The frequency of data collection and the volume of data captured also vary. The key data sets involved are the following:

  • Conductance data set: Data pertaining to the conductance of the batteries. This is collected every 2-3 days. Some of the key data points collected along with the conductance data include
    • Time stamp when the conductance reading was taken
    • Unique identifier for each battery
    • Other related information like manufacturer, installation location, model, the string it was connected to, etc.
  • Terminal voltage data: Data pertaining to the voltage and temperature of the battery. This is collected every day. Key data points include
    • Voltage of the battery
    • Temperature
    • Other related information like battery identifier, manufacturer, installation location, model, string data, etc.
  • Discharge data: Discharge data is collected once every 3 months. Key variables include
    • Discharge voltage
    • Current at which the voltage discharges
    • Other related information like battery identifier, manufacturer, installation location, model, string data, etc.

[Figure: The three data sets and their variables]

As seen, we have to work with three very distinct data sets, with different sets of variables, different frequencies at which data points arrive and different volumes of data for each of the variables involved. One of the key challenges one would encounter is in connecting all these variables into a coherent data set which will help in the predictive task. It would be easier to get this done if we can formulate the predictive problem by connecting the available data sets to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem we have at hand and connect it with the data points at our disposal. The predictive problem requires us to predict two things:

  1. Which battery will fail, and
  2. In which period of time in the future the battery will fail.

Since the prediction is at the battery level, our unit of reference for formulating the predictive problem is the individual battery. This means that all the variables present across the multiple data sets have to be consolidated at the individual battery level.

The next question is, over what period of time do we have to consolidate the variables for each battery? To answer this question, we have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are captured at different intervals. In addition, the volume of data collected for each of those variables at those instances of time also varies substantially.

  • Conductance: One reading per battery captured once every 3 days.
  • Voltage & Temperature: 4-5 readings per battery captured every day.
  • Discharge: A set of readings captured every second at different intervals of a day, once every 3 months (approximately 4,500-5,000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in the future, we have to let our model learn the behavior of these variables across time periods. However, we have to select a time period for which we will have sufficient data points for each of the variables. The ideal time period to choose in this scenario is 3 months, as discharge data is available only once every 3 months. This means that all the data points for each battery and each variable have to be consolidated into a single record for every 3 months. So if each battery has around 3 years of data, it would entail 12 records per battery.

[Figure: Consolidating the data sets into one record per battery per 3-month period]
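As a minimal sketch of the connecting step, the three data sets could be joined on the battery and the 3-month period, assuming each has already been reduced to one record per battery per period (how that reduction is done is sketched below). The frame and column names (conductance_q, voltage_q, discharge_q, battery_id, period) are illustrative assumptions.

```python
import pandas as pd

def consolidate(conductance_q, voltage_q, discharge_q):
    """Join the three quarterly data sets on battery and period.
    Each input is assumed to hold one record per battery per 3-month period,
    keyed by the columns 'battery_id' and 'period' (illustrative names)."""
    return (conductance_q
            .merge(voltage_q, on=['battery_id', 'period'], how='outer')
            .merge(discharge_q, on=['battery_id', 'period'], how='outer')
            .sort_values(['battery_id', 'period']))
```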

Another aspect we have to look at is how 3 months of data points for a battery can be consolidated into one record for each variable. For this we have to resort to a suitable consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering. We will deal with those aspects in detail when we talk about the exploratory analysis and feature engineering phases.
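For a flavour of what such a consolidation could look like, here is a small sketch that collapses raw conductance readings into one record per battery per 3-month period using a few candidate metrics (mean, minimum and a crude slope). The choice of metrics and the column names are assumptions to be validated during exploratory analysis and feature engineering.

```python
import pandas as pd

def quarterly_conductance(raw):
    """raw is assumed to have columns ['battery_id', 'timestamp', 'conductance'].
    Returns one row per battery per quarter with candidate consolidation metrics."""
    raw = raw.copy()
    raw['period'] = raw['timestamp'].dt.to_period('Q')
    return (raw.groupby(['battery_id', 'period'])['conductance']
               .agg(cond_mean='mean',
                    cond_min='min',
                    cond_slope=lambda s: s.diff().mean())  # average change per reading
               .reset_index())
```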

The next important point we have to deal with is the labeling of the response variable. Since the business problem is to predict which battery will fail, the response variable classifies whether a record of a battery falls under a failure class or not. However, there is a lacuna in this approach. What we want is to predict well ahead of time when a battery is likely to fail, and therefore we have to factor the "when" part into the classification task as well. This entails looking at samples of batteries which have actually failed and identifying the point in time when failure happened. We label that point as the "failure point" and then look back in time from the failure point to classify the periods leading to failure. Since the consolidation period for data points is three months, we can fix the "looking back" period to be 3 months as well. This means that, for those samples of batteries where we know the failure point, we take the record which is one time period (3 months) before failure and label it as 1 period before failure, the record which corresponds to 6 months before failure is labelled as 2 periods before failure, and so on. We can continue labeling the data according to periods before failure till we reach a comfortable point in time ahead of failure (say 1 year). If the comfortable period we have in mind is 1 year, we will have 4 failure classes, i.e. 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records before the 1 year period can be labelled as "Normal Periods". This labeling strategy means that our predictive problem is a multinomial classification problem with 5 classes (4 failure period classes and 1 normal period class).

[Figure: Labeling records of a failed battery relative to the failure point]
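The labeling strategy can be sketched in a few lines of code. The layout of the consolidated records and the column names are assumptions for illustration; only the logic (records 1-4 quarters before the known failure point get failure classes, everything else stays "Normal") follows the description above.

```python
import pandas as pd

def label_battery(df, failure_idx=None, horizon=4):
    """df holds the consolidated quarterly records of one battery, ordered in time.
    failure_idx is the positional index of the record where failure occurred
    (None for batteries that never failed)."""
    df = df.sort_values('period').reset_index(drop=True).copy()
    df['label'] = 'Normal'
    if failure_idx is not None:
        for k in range(1, horizon + 1):
            row = failure_idx - k
            if row >= 0:
                df.loc[row, 'label'] = f'{k}_periods_before_failure'
    return df
```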

The labeling strategy discussed above is for the samples of batteries within our data set which have actually failed and for which we know when the failure happened. However, if we do not have information about which batteries have failed and which have not, we have to resort to intense exploratory analysis to first determine the samples of batteries which have failed and then label them according to the labeling strategy discussed above. We will discuss how exploratory analysis can be used to identify failed batteries in the next post. Needless to say, the records of all batteries which have not failed will be labelled as "Normal Periods".

Now that we have seen the predictive problem formulation, let us recap our discussions so far. The predictive problem formulation step involves the following:

  1. Understand the business problem and formulate the response variable.
  2. Identify the unit of reference to which the business problem applies (each battery in our case).
  3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables is generated.
  4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
  5. From the data set, identify the units which have failed and those which have not. Such information will generally be available from past maintenance contracts for each unit.
  6. Adopt a labeling strategy for both the failed units and the normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records as failure classes up to a convenient period (1 year in this case). All records before that period are labelled the same as the units which have not failed ("Normal Periods").

Wrapping up till we meet again

So far we have discussed the initial two phases of a data science project. The first phase entails defining the business problem and carrying out the business discovery. In the next phase, the data discovery phase, we align the available data points to the business problem and then formulate the predictive problem. Once we have a clear understanding of how the predictive problem has to be formulated, our next task will be to get into the exploratory analysis and feature engineering phases. These and the subsequent phases will be dealt with in detail in the next post of this series. Watch this space for more.

 


Machine Learning in Action – Word Prediction

In my previous blog on machine learning, I explained the science behind how a machine learns from its parameters. This week, I will delve into a very common application which we use in our day to day life: next word prediction.

When we text on our smartphones, all of us would have appreciated how our phones make typing so easy by predicting or suggesting the word we have in mind. Many would also have noticed that our phones predict words which we tend to use regularly in our personal lexicon. Our phones have learned from our pattern of usage and are giving us a personalized offering. This genre of machine learning falls under a very potent field called Natural Language Processing (NLP).

Natural Language Processing deals with the ways in which machines derive their learning from human languages. The basic input within the NLP world is a corpus (plural: corpora), which essentially is a collection of words or groups of words within the language. Some of the most prominent corpora for English are the Brown Corpus, the American National Corpus, etc. Even Google has its own linguistic corpora with which it achieves many of the amazing features in its products. Deriving learning out of the corpora is the essence of NLP. In the context we are discussing, i.e. word prediction, it is about learning from the corpora to do prediction. Let us now see how we do it.

The way we learn from the corpora is through the use of some simple rules of probability. It all starts with calculating the frequencies of words or groups of words within the corpora. For finding the frequencies, what we use is something called an n-gram model, where the "n" stands for the number of words which are grouped together. The most common n-gram models are the trigram and the bigram models. For example, the sentence "the quick red fox jumps over the lazy brown dog" has the following word-level trigrams (source: Wikipedia):

the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog

Similarly, a bigram model will split a given sentence into combinations of two-word groups. These groups of trigrams or bigrams form the basic building blocks for calculating the frequencies of word combinations. The idea behind calculating the frequencies of word groups goes like this. Suppose we want to calculate the frequency of the trigram "the quick red". What we look for in this calculation is how often the combination of the words "the" and "quick" is followed by "red" within the whole corpus. Suppose in our corpus there were 5 instances where the words "the" and "quick" were followed by the word "red"; then the frequency of this trigram is 5.
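A minimal sketch of this counting step, using the example sentence as a tiny stand-in for a real corpus:

```python
from collections import Counter

# Count word-level trigram frequencies in a (tiny) corpus
corpus = "the quick red fox jumps over the lazy brown dog"
tokens = corpus.lower().split()

trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
trigram_counts = Counter(trigrams)

print(trigram_counts[('the', 'quick', 'red')])   # 1 in this tiny example
```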

Once the frequencies of the word groups are found, the next step is to calculate the probabilities of the trigrams. The probability is just the frequency divided by the total number of trigrams within the corpus. Suppose there are around 500,000 trigrams in our corpus; then the probability of our trigram "the quick red" will be 5/500,000. The probabilities so calculated come under a conditional probability model, the same idea that underlies the Hidden Markov Model (HMM). By conditional probability we mean the probability of an event happening given that something else has happened. In our trigram model context it means the probability of seeing the word "red" given that it was preceded by the words "the" and "quick". Extending the same concept to bigrams, it would mean the probability of seeing the second word given that we have seen the first word. So if "My God" is a bigram, then the conditional probability would be the probability of seeing the word "God" having just seen the word "My".

The trigrams and bigrams, along with the calculated probabilities, arranged in a huge table form the basis of the word prediction algorithm. The mechanism of prediction works like this. Suppose you were planning to type "Oh my God" and you typed the first word "Oh". The algorithm will quickly go through the n-gram table and identify the n-grams starting with the word "Oh" in the order of their probabilities. So if the top entries in the n-gram table starting with "Oh" are "Oh come on", "Oh my God" and "Oh Dear Lord", in decreasing order of probability, the algorithm will predict the words "come", "my" and "Dear" as your three choices as soon as you type the first word "Oh". After you type "Oh" and then type "my", the algorithm reworks the prediction and looks at the highest probability n-gram combinations preceded by the words "Oh" and "my". In this case the word "God" might be the most probable choice predicted. The algorithm keeps on giving predictions as you keep typing more and more words. At every instance of the texting process, the algorithm looks at the last two words you have already typed to predict the running word, and the process continues.
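The prediction step can be sketched as a lookup table keyed on the last two words. The toy token list below is made up purely for illustration; a real system would build the table from a large corpus and a full probability model.

```python
from collections import Counter, defaultdict

def build_next_word_table(tokens):
    """Map each pair of consecutive words to a counter of the words that follow it."""
    table = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        table[(w1, w2)][w3] += 1
    return table

def predict(table, w1, w2, top_n=3):
    """Return the most frequent (i.e. most probable) next words after the pair (w1, w2)."""
    counts = table[(w1.lower(), w2.lower())]
    return [word for word, _ in counts.most_common(top_n)]

tokens = "oh my god oh my dear oh my god oh come on".split()
table = build_next_word_table(tokens)
print(predict(table, "oh", "my"))   # ['god', 'dear'] -- 'god' was seen twice, 'dear' once
```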

The algorithm I have explained here is a very simple one involving n-grams and HMM-style probabilities. Needless to say, there are more sophisticated approaches involving models like neural networks. I will explain neural networks and their applications in a future post.

Machine Learning: Teaching a machine to learn

In my previous post on recommendation engines, I fleetingly mentioned machine learning. Talking about machine learning, what comes to my mind is a recent conversation I had with my uncle. He was asking me what I was working on, and I started mentioning machine learning and data science. He listened very attentively and later told my mother that he had absolutely no clue what I was talking about. So I thought it would be a good idea to try and unravel the science behind machine learning.

Let me start with an analogy. When we were toddlers, whenever we saw something new, say a dog, our parents would point and tell us, "Look, a dog". This is how we start to learn about things around us, from inputs such as these that we receive from our parents and elders. The science behind machine learning works in a pretty similar way. In this context, the toddler is the machine and the elder which teaches the machine is a bunch of data.

In very simple terms, the setup for a machine learning context works like this. The machine is fed with a set of data. This data consists of two parts: one part is called the features and the other the labels. Let me elaborate a little more. Suppose we are training the machine to identify the image of a dog. As a first step we feed multiple images of dogs to the machine. Each image which is fed, say a jpeg or png image, consists of millions of pixels. Each pixel in turn is composed of some value of the three primary colors red, green and blue. The values of these primary colors range between 0 and 255. This is called the pixel intensity. For example, the pixel intensity for the color orange would be (255, 102, 0), where 255 is the intensity of its red component, 102 its green component and 0 its blue component. Likewise, every pixel in an image will have various combinations of these primary colors.

[Figure: RGB color components of a pixel]

These pixel intensities are the features of the image which are provided as inputs to the machine. Against each of these features, we also provide a class or category describing the features we provided. This is the label. This data set is our basic input. To visualize the data set, think of it as a huge table of pixel values and their labels. If we have, say, 10 pixels per image and there are 10 images, our table will have 10 rows, one corresponding to each image, and for each row there will be 11 columns. The first 10 columns correspond to the pixel values and the 11th column is the label.
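A tiny sketch of that data layout, with made-up numbers purely for illustration: each row is one image, the first 10 columns are pixel intensities (features) and the last column is the label.

```python
import numpy as np

rng = np.random.default_rng(0)

pixels = rng.integers(0, 256, size=(10, 10))   # 10 images x 10 pixel intensities each
labels = np.ones((10, 1), dtype=int)           # 1 = "dog" for every image in this toy set

training_table = np.hstack([pixels, labels])   # shape (10, 11): 10 features + 1 label per row
print(training_table.shape)                    # (10, 11)
```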

Now that we have provided the machine its data, let us look at how it learns. For this, let me take you back to your school days. In basic geometry, you would have learnt the equation of a line as Y = C + (theta * X). In this equation, the variable C is called the intercept and theta the slope of the line. These two variables govern the properties of the line Y. The relevance of these variables is that, if we are given any other value of X, then by our knowledge of C and theta we will be able to predict the corresponding point on the line. So by learning two parameters we are in effect able to predict an outcome. This is the essence of machine learning. In a machine learning setup, the machine is made to learn the parameters from the features which are provided. Equipped with the knowledge of these parameters, the machine will be able to predict the most probable values of Y (outcomes) when new values of X (features) are provided.
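Here is a minimal sketch of "learning the parameters" for the line Y = C + theta * X using ordinary least squares. The data points are made up for illustration; the point is simply that C and theta are recovered from data and then reused for prediction.

```python
import numpy as np

# Made-up data generated from a "true" line Y = 2.0 + 0.5 * X plus a little noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 0.5 * X + np.random.default_rng(0).normal(0, 0.1, size=X.shape)

# Stack a column of ones so the intercept C is learned alongside the slope theta
A = np.column_stack([np.ones_like(X), X])
(C, theta), *_ = np.linalg.lstsq(A, Y, rcond=None)

print(C, theta)              # close to the true values 2.0 and 0.5
Y_new = C + theta * 6.0      # predict the outcome for an unseen X
```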

In our dog identification example, the X values are the pixel intensities of the images we provided, and Y denotes the labels of the dogs. The parameters are learned from the provided data. If we give the machine new values of X which contain, say, features of both dogs and cats, the machine will correctly identify which is a dog and which is a cat with its knowledge of the parameters. The first set of data which we provide to the machine for it to learn the parameters is called the training set, and the new data which we provide for prediction is called the test set. The above-mentioned genre of machine learning is called supervised learning. Needless to say, the earlier equation of the line is just one among multiple types of algorithms used in machine learning; this one is called linear regression. There are many algorithms like these which enable machines to learn parameters and carry out predictions.

What I have described here is a very simple version of machine learning. Advances are being made in this field, and scientists are trying to mimic the learning mechanism of the human brain on machines. An important and growing field aligned to this idea is called deep learning. I will delve into deep learning in a future post.

The power of machine learning is quite prevalent in the world around us, and quite often the learning is inconspicuous. As a matter of fact, we are all party to the training process without realizing it. A very popular example is the photo tagging process on Facebook. When we tag pictures which we post on Facebook, we are in fact providing labels, enabling a machine to learn. Facebook's powerful machines will extract features from the photos we tag. Next time we tag a new photo, Facebook will automatically predict the correct tag through the parameters it has learned. So next time you tag a picture on Facebook, realize that you are also playing your part in teaching a machine to learn.