Data Science Strategy Safari : Aligning Data Science Strategy to Org Strategy


Back when I was a management student, a classic work on strategy that inspired me was “Strategy Safari” by Henry Mintzberg, Bruce Ahlstrand and Joseph Lampel. Strategy Safari describes different perspectives on strategy, as summarised in the matrix in Figure 1.


Figure 1 : Facets of Strategy Formulation

These multiple facets of strategy played a significant part in defining my perspectives on strategy. There is no doubt that the works of other greats in the field, like Peter Drucker and Michael Porter, also shaped my thinking and my perspectives on strategic management. However, what put this book at the top of my list of favourites is the different angles from which the authors looked at the field of strategic management. The title of this post draws its inspiration from Mintzberg’s seminal work. In this post, I attempt to take you on a safari through the data science strategy formulation process.

Data science strategy formulation : The big question

When formulating a data science strategy, a pertinent question to ask is this: with the tremendous strides data science is making in influencing business outcomes, should data science strategy lead organisational strategy, or should it, like any other functional strategy, be aligned to organisational strategy? In my opinion, like any other functional strategy, data science strategy should be aligned to organisational strategy. The data science domain would have no meaning if it were not used to support the organisation in meeting its overall objectives. For this very reason I strongly believe that data science strategy has to be derived from organisational strategy. So the next question is: how do we define a data science strategy which is aligned to organisational strategy? To answer that question, let us decipher the strategic alignment framework.

Data science strategy safari : Alignment is the game

Strategic alignment is the process by which an organisation’s competencies, resources and actions are aligned to its planned organisational objectives. Data science has become a critical competency that an organisation has to build to have an edge in this digitally connected era. However, it is equally important that the output from a data science engagement, i.e. predictions, recommendations, inferential studies and the like, fits well into the overall scheme of strategic objectives the organisation wishes to pursue. This can be achieved by traversing the processes of the alignment framework. Figure 2 depicts the data science strategic alignment framework.


Figure 2 : Data Science Strategic Alignment Framework

The strategic alignment framework can be summarised into the following steps

  1. Analyse the critical functions within the business value chain.
  2. Within each function, identify critical performance indicators.
  3. From each of the performance indicators derive predictive or inferential use cases which will help in realisation of those performance indicators. Create a web of such use cases which are aligned to each of those performance indicators.
  4. For each use case, identify the business factors which influence that particular use case.
  5. For each factor, identify the related data points.
  6. Identify the systems and subsystems which generate these data points and figure out ways to connect them to implement the use case.

These steps can be demarcated, based on their value, as “Strategic alignment” steps and “Operational alignment” steps. The first four belong to the first category and the remaining two to the second.

Let us see how the strategic alignment framework manifests for an insurance company, taking the case of a single function within the value chain, i.e. ‘Customer Management’.


Figure 3 : Alignment process for an Insurance company

The trail of analysis for the customer management function is depicted in Figure 3 above. To ensure that data science strategy is aligned to organisational goals, the first step of the process is to identify Key Performance Indicators (KPIs) for each function within the value chain. For the ‘customer management’ function which we are analysing, one critical KPI which has a substantial impact on the top line and bottom line is “Improving customer retention rate”.

Having identified a critical performance indicator, alignment to it entails deriving data science use cases which will help in achieving this performance indicator. For the customer management function, a use case which will help in improving customer retention rate would be to predict the probability of premium renewal. The output from this use case can be used for targeted campaigns towards customers who have a low probability of renewing premiums, thereby enabling achievement of the KPI. In addition to use cases which are directly related to the KPI, we should also derive related use cases which will enable the process of achieving that KPI. For example, having identified which customers to target, it would also be valuable to know the specifics of how to target them, such as predicting the right time and channel to reach out to them or predicting the right price point for specific offers.

In a similar fashion, we have to look across all the functions and the critical metrics within each function, and derive all primary and related predictive use cases. These use cases can be formed into an interconnected web called the Strategic Alignment Map (SAM). Figure 4 below is a representative SAM depicting the business value chain, its critical functions, the interconnected web of use cases and the category of each use case (Natural Language Processing, Inferential, Machine Learning/Deep Learning, Other AI etc.). A comprehensive SAM forms the blueprint for aligning data science projects to organisational strategy and for indicating interdependencies between different use cases and models. In addition, it gives a view of the various data science competencies which are required to add value to the organisation.


Figure 4 : Strategic Alignment Map
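
To make the idea of an interconnected web concrete, here is a minimal sketch of how a small slice of a SAM could be held in code. The three use cases come from the customer management example above; the dictionary layout, the `category` values, the `depends_on` field and the helper names are my own assumptions for illustration, not a standard representation.

```python
from collections import defaultdict

# Hypothetical slice of a Strategic Alignment Map for one function.
# "category" and "depends_on" are assumed fields for this sketch.
sam = {
    "predict_premium_renewal": {
        "function": "Customer Management",
        "kpi": "Improve customer retention rate",
        "category": "Machine Learning/Deep Learning",
        "depends_on": [],
    },
    "predict_contact_time_and_channel": {
        "function": "Customer Management",
        "kpi": "Improve customer retention rate",
        "category": "Machine Learning/Deep Learning",
        "depends_on": ["predict_premium_renewal"],
    },
    "predict_offer_price_point": {
        "function": "Customer Management",
        "kpi": "Improve customer retention rate",
        "category": "Machine Learning/Deep Learning",
        "depends_on": ["predict_premium_renewal"],
    },
}


def use_cases_by_kpi(sam_map):
    """Group use case names under the (function, KPI) pair they serve."""
    grouped = defaultdict(list)
    for name, spec in sam_map.items():
        grouped[(spec["function"], spec["kpi"])].append(name)
    return dict(grouped)


def build_order(sam_map):
    """Order use cases so that each one comes after the use cases it
    depends on. Assumes the dependencies contain no cycles."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dependency in sam_map[name]["depends_on"]:
            visit(dependency)
        ordered.append(name)

    for name in sam_map:
        visit(name)
    return ordered


print(use_cases_by_kpi(sam))
print(build_order(sam))
```

Keeping the dependencies explicit is what lets the map double as a rough project sequence: the renewal model has to exist before the channel and price point models can use its output.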

Now that we have seen the process of creating the SAM, let us dive deeper and decipher the operational aspects of data science strategy.

Once we have an interconnected web of use cases critical for the organisation, the next task is to get data acquisition and integration strategies aligned to the overall strategy. To do so, we first have to know what kind of data points are required for implementing the use cases depicted in the SAM, and also the characteristics of those data points, such as formats, velocity, frequency and the data systems which generate them. A good approach to derive those details is to look at each use case, identify the business factors influencing it and then work our way downwards.

For our specific use case, i.e. predicting renewal rate, some factors which have a bearing on the renewal rate are

(a) competition, (b) pricing, (c) customer experience & expectations, (d) channel effectiveness etc.

A comprehensive list of factors like the above has to be identified through close discussions with business/domain teams. Having identified the various factors affecting each use case, the next task is to identify the data points related to each factor. Some of the major data points related to the factors influencing renewal rate are depicted in Figure 5 below.


Figure 5 : Data points related to factors

The requirement for data points related to each factor governs data sourcing and integration strategies. From the various data points depicted above we can see that data requirements can be from within the organisation and also from external sources. For example, data points related to competition in all probability will have to be acquired from external sources. Other data points predominantly can be acquired from various systems within the organisation.

In addition, the factor analysis will also help in determining the data types related to each use case. Some of the data types related to the identified data points are as follows

  • Traditional RDBMS data (e.g. demographics, customer records, policy transactions etc.)
  • Text data (customer reviews)
  • Voice (call centre data)
  • Log files (channel usage metrics, channel cookies etc.)

To have a comprehensive view of data requirements, one has to look at each factor through different facets. The facets through which one has to look at each factor are listed below

  • What are the data points?
  • How varied are the data types?
  • What are the sources of data?
  • Are they external or internal?
  • At what frequency are these generated and captured?
  • How do we connect them together for implementing the use cases?

These comprehensive views derived on the data requirements will help in aligning different components of data engineering strategies like data acquisition, data integration, data pre-processing and cleansing, data storage etc to overall organisational strategy.
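
The facets above can be captured as one record per data point. The sketch below is only an illustration: the factor names and data types follow the renewal example, while the source names and capture frequencies are hypothetical values I made up to fill in the record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    name: str        # what the data point is
    factor: str      # business factor it relates to
    data_type: str   # RDBMS, text, voice, log files ...
    source: str      # system that generates it (hypothetical names)
    external: bool   # True if acquired from outside the organisation
    frequency: str   # how often it is captured (assumed values)


requirements = [
    DataPoint("competitor premium rates", "competition", "RDBMS",
              "market data vendor", True, "monthly"),
    DataPoint("customer reviews", "customer experience", "text",
              "review portal", False, "daily"),
    DataPoint("call centre recordings", "customer experience", "voice",
              "call centre system", False, "daily"),
    DataPoint("channel usage metrics", "channel effectiveness", "log files",
              "web and app logs", False, "hourly"),
]


def external_sources(points):
    """Sources that have to be acquired from outside the organisation."""
    return sorted({p.source for p in points if p.external})


def data_types_by_factor(points):
    """Data types that each business factor brings into scope."""
    result = {}
    for p in points:
        result.setdefault(p.factor, set()).add(p.data_type)
    return result


print(external_sources(requirements))
print(data_types_by_factor(requirements))
```

A table like this, filled in with the domain teams, gives the data engineering side a direct view of which sources need acquisition agreements and which data types need dedicated pre-processing.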

Wrapping Up

Having seen the data science strategic alignment framework in action, one cannot help but wonder whether we can draw parallels between the framework and some of the perspectives in Mintzberg’s “Strategy Safari”. The process steps encapsulated within this framework have elements of the Learning, Cognitive and Planning Schools of strategy formulation. However, at the end of the day this framework, like any other framework, is aimed at structuring one’s thought process towards the achievement of certain objectives. The objective it aims to accomplish is to ensure that your data science efforts are aligned to overall organisational goals and strategies.




Applied Data Science Series : Solving a Predictive Maintenance Business Problem – Part III


In the previous post of the series we discussed the exploratory analysis phase and saw how the combination of domain knowledge and single variable exploration unravels intuitions from the data. In this post we will expand our analysis to multiple variables and then see how intuitions we develop during the exploration phase, can lead to generating new features for modelling.

In the example we were discussing, we were limited to the analysis of a single variable, i.e. conductance. However, to get more meaningful insights we have to connect other variables, layer by layer, to the variable we analysed initially. As far as the battery is concerned, some of the critical variables other than conductance are voltage and discharge. Let us connect these two variables with the conductance profile to gain more intuitions from the data.


The above figure is a plot which depicts three variables across the same time span. The idea of plotting multiple variables together across a common time span is to unearth any discernible trends we can see together. A cursory look at this plot will reveal some obvious observations.

  1. The fall in current and voltage in conjunction with drop in conductance.
  2. The cyclic nature of the voltage profile.
  3. A gradual drop in the troughs of the voltage profile.

Having made some observations, we now need to ascertain whether these observations can be codified into definitive trends. This can be verified only by observing plots of the same variables for many samples. If we get similar observations by sampling data pertaining to many batteries, then we can be reasonably confident that we have unearthed trends explaining the behavior of the different variables. However, just unearthing trends will not suffice. We have to draw intuitions from such trends which will help in transforming the raw variables into a form that helps the modelling task. This is achieved by feature engineering the raw variables.

Feature Engineering

Many times the given set of raw variables will not suffice for extracting the required predictive power from the model. We will have to transform the raw variables to generate new variables which give us the extra thrust towards better predictive metrics. Which transformations to apply will depend on the intuitions we build during the exploratory analysis phase, combined with domain knowledge. For the case of batteries, let us revisit some of the intuitions we built during the exploratory analysis phase and see how they can be used for feature engineering.

In the previous post, we found that a precipitous fall in conductance is an indicator of the failing health of a battery. So a probable feature we can extract from the conductance variable is the slope of the data points over a fixed time span. The rationale for such a feature is this: if a precipitous fall in conductance over time is an indicator of failing health, then the slope of the data points for a failing battery will be steeper than that for a healthy battery. It was observed that this transformation had a positive influence on predictive metrics. The mechanics of the transformation are as follows: if we have conductance data for a battery for three years, we can take consecutive three-month windows of conductance data, compute the slope of the data points in each window and use it as a feature. By doing this, the number of rows of data for the variable is also consolidated to a much smaller number.
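
A minimal sketch of the slope feature, assuming readings arrive as (day number, conductance) pairs, that a three-month window is approximated as 90 days and that windows do not overlap. The function names and the synthetic readings are illustrative, not taken from the actual project.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def windowed_slopes(readings, window_days=90):
    """readings: list of (day_number, conductance) for one battery.
    Returns one slope per consecutive, non-overlapping window."""
    if not readings:
        return []
    start = min(day for day, _ in readings)
    buckets = {}
    for day, value in readings:
        buckets.setdefault((day - start) // window_days, []).append((day, value))
    slopes = []
    for index in sorted(buckets):
        days = [d for d, _ in buckets[index]]
        values = [v for _, v in buckets[index]]
        # A slope needs readings on at least two distinct days.
        if len(set(days)) >= 2:
            slopes.append(slope(days, values))
    return slopes


# Synthetic readings every 3 days for 180 days (made-up numbers).
healthy = [(day, 100.0 - 0.01 * day) for day in range(0, 180, 3)]
failing = [(day, 100.0 - 0.30 * day) for day in range(0, 180, 3)]

healthy_slopes = windowed_slopes(healthy)
failing_slopes = windowed_slopes(failing)
print(healthy_slopes, failing_slopes)
```

Each window collapses roughly thirty readings into a single number, which is the row consolidation mentioned above, and the failing battery shows up as a much more negative slope.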

Let us also look at another example of feature engineering, this time for the variable discharge voltage. As seen from the above figure, the discharge voltage follows a wave-like profile. It turns out that when a battery discharges, the voltage first drops and then rises. This behavior is called the “Coup de Fouet” (CDF) effect. Now our thought should be: how do we combine the observed wave-like pattern and the knowledge about CDF into a feature? Again we have to dig into domain knowledge. As per the theory on the state of health of batteries, there are standards for the CDF profile of a healthy battery and that of a failing battery. These are prescribed by the manufacturer of the battery. For example, the manufacturing standards prescribe a certain depth to which the voltage will fall during discharge and a certain height to which it will rise during a typical CDF effect. The deviation between the observed CDF and the manufacturer-prescribed standard can be taken as another feature. Similarly, we can also think of other features related to voltage, like depth of discharge (DOD), number of cycles etc. Our focus should be on using the available domain knowledge to transform raw variables into features.
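
One way the CDF deviation could be computed is sketched below. It assumes the voltage samples cover only the early part of a discharge, where the dip and recovery occur, and it measures deviation as the sum of absolute differences in dip depth and recovery height. Both that choice of measure and the standard values used here are my assumptions for illustration; real values would come from the manufacturer.

```python
def coup_de_fouet_profile(voltages):
    """voltages: samples in time order from the start of a discharge,
    limited to the early dip-and-recovery region (assumption).
    Returns (dip_depth, recovery_height)."""
    start = voltages[0]
    trough_index = min(range(len(voltages)), key=voltages.__getitem__)
    trough = voltages[trough_index]
    peak_after_trough = max(voltages[trough_index:])
    return start - trough, peak_after_trough - trough


def cdf_deviation(voltages, standard_dip, standard_recovery):
    """Sum of absolute differences between the observed profile and the
    prescribed one. This measure is an illustrative choice."""
    dip, recovery = coup_de_fouet_profile(voltages)
    return abs(dip - standard_dip) + abs(recovery - standard_recovery)


# Made-up voltage trace and hypothetical manufacturer standard (volts).
observed = [2.10, 2.02, 1.96, 1.98, 2.00, 2.01]
deviation = cdf_deviation(observed, standard_dip=0.10, standard_recovery=0.05)
print(coup_de_fouet_profile(observed), deviation)
```

A battery whose dip is deeper than prescribed, or whose recovery falls short, ends up with a larger deviation value, which is the signal the model can pick up.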

As seen from the above two examples, the essence of feature engineering is translating domain knowledge and the trends seen in the data into more meaningful features. The veracity of the models which are built depends a lot on the strength of the features built. Now that we have seen the feature engineering phase, let us look at the modelling strategy for this use case.

Modelling Phase

In the first part of this use case we discussed the labeling strategy for training the model. Since the use case is to predict which battery would fail and in which period of time, we have to look back in time from the failure point label to create different classes related to periods before failure. In this specific case, the features were created by consolidating 3 months of data into a single row, so one period before failure denotes 3 months before failure. If the requirement is to predict failure 6 months before it is likely to happen, then we will have 4 different classes, i.e. failure point, one period before failure (3 months prior to the failure point), two periods before failure (6 months prior to the failure point) and normal state. All periods more than 6 months before the failure point can be labelled as normal state.

With respect to modelling, we can spot check different classification algorithms (logistic regression, Naive Bayes, SVM, Random Forest, XGBoost etc.). The choice of the final model will be based on the accuracy metrics (sensitivity, specificity etc.) of the spot-checked models. Another aspect worth noting is that the data set could be highly unbalanced, i.e. the records in the normal class are likely to outnumber those in the failure classes disproportionately. It is a good idea to try out class balancing methods on the data set before modelling.
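
As a small illustration of two of the ideas above, the sketch below computes inverse-frequency class weights, which is one possible class balancing method (resampling is another), and the sensitivity and specificity for one class treated as the positive class. The label names and counts are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so
    that rare classes count for more during training."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def sensitivity_specificity(actual, predicted, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up label distribution: normal records dominate.
labels = ["normal"] * 90 + ["one_period_before"] * 6 + ["failure_point"] * 4
weights = balanced_class_weights(labels)

# Made-up predictions for a handful of records.
actual = ["failure_point", "failure_point", "normal", "normal", "normal"]
predicted = ["failure_point", "normal", "normal", "normal", "failure_point"]
sens, spec = sensitivity_specificity(actual, predicted, "failure_point")
print(weights, sens, spec)
```

Weights like these can typically be passed to classifiers that accept per-class weights. With unbalanced data, per-class sensitivity is usually more informative than overall accuracy, since a model that always predicts the normal class would still score high on accuracy.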

Wrapping up

This post brings down the curtain on the three-part series on predictive analytics for industrial batteries. Any use case within the manufacturing sector can be quite challenging, as the variables involved are very technical and require a lot of intervention from the related domain teams. Constant engagement of domain specialists as part of the data science team is very important for the success of such projects.

I have tried my best to capture the nuances of such a difficult use case and to cover the critical elements in the process. In case you need any clarifications on the use case or details of its implementation, you can connect with me by email. Looking forward to hearing from you. Till then, let me sign off.

Watch this space for more such use cases.

Applied Data Science Series : Solving a Predictive Maintenance Business Problem – Part II




In the first part of the applied data science series, we discussed the first three phases of the data science process, namely business discovery, data discovery and data preparation. In the business discovery phase we talked about how the business problem, i.e. predicting the end of life of batteries, defines the choice of variables that come into play. In the data discovery phase we discussed data sufficiency and other considerations like variety and velocity of data, and how these considerations affect the data science problem formulation. In the last phase we touched upon how the data points and their various constituents drive the predictive problem formulation. In this post we will discuss how exploratory analysis can be used to get insights for feature engineering.

Exploratory Analysis – Unraveling latent trends

This phase entails digging deep to get a feel of the data and extract intuitions for feature engineering. When embarking upon exploratory analysis, it would be a good idea to get inputs from domain team on the relation between variables and the business problem. Such inputs are often the starting point for this phase.

Let us now get to the context of our preventive maintenance problem and evolve a philosophy for exploratory analysis. In the case of industrial batteries, a key variable which affects the state of health of a battery is its conductance. It turns out that an indicator of the failing health of a battery is a precipitous drop in conductance. Armed with this information, our next task should be to identify, from our available data set, the batteries that have a higher probability of failing. Since a precipitous fall in conductance is an indicator of failing health, the conductance data of unhealthy batteries will have more variance than that of normal ones. So a practical way to separate failing batteries from normal ones is to apply a consolidating metric, like standard deviation or variance, to the conductance data and then drill deeper into the samples which stand apart from the normal population.

The above is a plot depicting the standard deviation of conductance for all batteries. What is of interest to us is the red zone, which we can call the “Potential Failure Zone”. The potential failure zone consists of those batteries whose conductance values show high standard deviation. Batteries with failing health are likely to exhibit a large fall in conductance and, as a corollary, their values will also show higher standard deviation. This implies that the batteries which have a higher probability of failure will in all likelihood be from this failure zone. However, to test this hypothesis we will have to dig deep into the batteries in the failure zone and look for patterns which might differentiate them from normal batteries. Another objective of digging deep is to elicit clues from the underlying patterns on what features to include in the predictive model. We will discuss feature extraction further when we discuss feature engineering. Now let us come back to digging deep into the failure zone and ferreting out significant patterns. Note that in addition to the samples in the failure zone, we will also have to observe patterns from the normal zone to help separate the wheat from the chaff. Intuitions derived by observing different patterns will become vital during the feature engineering stage.
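
A minimal sketch of this screening step, assuming conductance readings are already grouped per battery. The cut-off that defines the potential failure zone is a hypothetical value here; in practice it would be chosen by looking at the distribution of standard deviations across all batteries. The readings are made up.

```python
from statistics import pstdev


def conductance_spread(readings_by_battery):
    """Population standard deviation of conductance for each battery."""
    return {battery: pstdev(values) for battery, values in readings_by_battery.items()}


def potential_failure_zone(spread, threshold):
    """Batteries whose standard deviation exceeds the chosen threshold."""
    return sorted(battery for battery, sd in spread.items() if sd > threshold)


# Made-up conductance readings for two batteries.
readings_by_battery = {
    "A": [100, 99, 98, 97],   # gradual degradation
    "B": [100, 95, 70, 45],   # precipitous fall
}

spread = conductance_spread(readings_by_battery)
flagged = potential_failure_zone(spread, threshold=10)  # threshold is an assumption
print(spread, flagged)
```

The flagged list is only a shortlist for closer inspection, not a prediction of failure.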


The comparison figure shows patterns from the two zones. The plot on the left is from the failure zone and the one on the right is from the normal zone. We can clearly see how the precipitous fall is manifested in the sample from the failure zone. The other aspect to note is the magnitude of the fall. Every battery will have degrading conductance over time. However, the magnitude of degradation is what differentiates an unhealthy battery from a normal one. We can observe from the plot on the left that the fall in conductance is more than 50%, whereas for the battery on the right the drop is more muted. Another aspect we can observe is the slope of conductance. As evident from the two plots, the slope of the conductance profile for the battery on the left is much steeper over time than that of the one on the right. These intuitions might become critical in the overall scheme of feature engineering and modelling. More such intuitions can be extracted by observing more samples. The philosophy behind exploratory analysis entails visualizing more and more samples, observing patterns and extracting clues for feature engineering. The more time we spend doing this, the more ammunition we get for feature engineering.
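
The magnitude of the fall can be turned into a number quite simply. The sketch below assumes the fall is measured from the first reading to the last reading of the series, which is one possible definition; the readings are made up.

```python
def percent_fall(conductance):
    """Percentage drop from the first to the last reading of a series
    (assumed definition of the fall for this sketch)."""
    first, last = conductance[0], conductance[-1]
    return 100.0 * (first - last) / first


failing_battery = [100, 96, 80, 45]
normal_battery = [100, 98, 97, 95]

print(percent_fall(failing_battery), percent_fall(normal_battery))
```

A value like this, computed per battery, is a candidate feature alongside the slope of the conductance profile.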

Wrapping up

So far we discussed different considerations for the exploratory analysis phase. To summarize, here are some of the pointers during this phase.

  1. Take inputs from domain team related to the problem we are trying to solve. In our case the clue which we got was the relation between conductance and health of batteries.
  2. Identify any consolidating metric for the variable under consideration to separate out anomalous samples. In the example above we used standard deviation of conductance values to find anomalies.
  3. Once the samples are demarcated using the consolidation metric, visualize samples from different sets to identify discernible patterns in data.
  4. From the patterns we observe root out clues for feature engineering. In our example we identified that % fall in conductance and slope of conductance over time could be potential features.

The above pointers are general guidelines on how one should think through the exploratory analysis phase.

The discussions so far were centered on exploratory analysis of a single variable. Next we have to connect other variables to the one we have already observed and identify trends in unison. When we combine trends from multiple variables, we will be able to unravel more insights for feature engineering. We will continue our discussion on combining more variables and the subsequent feature engineering in our next post. Watch this space for more.


Applied Data Science Series : Solving a Predictive Maintenance Business Problem


Over the past few months, many people have been asking me to write about what it entails to do a data science project end to end, i.e. from the business problem definition phase to modelling and final deployment. When I pondered that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods, with code, on how to deal with a problem. However, an end-to-end view of what it takes to do a data science project for a specific business use case is a little hard to find. From this week onward, we will be starting a new series called the Applied Data Science Series. In this series I will give an end-to-end perspective on tackling business use cases or societal problems within the framework of data science. In this first article of the series we will deal with a predictive maintenance business use case: predicting the end of life of large industrial batteries, which falls under the genre of preventive maintenance use cases.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture of the life cycle of a data science project.

The above figure is a depiction of the big picture on what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is a business problem. The problem we have at hand is to predict the end of life of large industrial batteries. When we encounter such a business problem, the first thing which should come to mind is the key variables which will come into play. For this specific example, some of the key variables which determine the state of health of batteries are conductance, discharge, voltage, current and temperature.

The next question we need to ask is about the lead indicators or trends within these variables which will help in solving the business problem. This is where we have to take inputs from the domain team. For the case of batteries, it turns out that a key trend which can indicate a propensity for failure is a drop in conductance values. The conductance of batteries will drop over time; however, the rate at which the conductance values drop accelerates before points of failure. This is a vital clue which we have to be cognizant of when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge, the voltage will initially drop to a minimum level and then recover. This is called the “Coup de Fouet” effect. Every manufacturer of batteries prescribes standards and control charts for how much the voltage can drop and how the recovery should proceed. Any deviation from these standards and control charts would indicate anomalous behavior. This is another set of indicators which we will have to look out for when we explore the data.

In addition to the above two indicators, there are many other factors indicating failure which one would have to be aware of. During the business exploration phase we have to identify all such factors related to the business problem we are to solve and formulate hypotheses about them. Once we formulate our hypotheses, we have to look for evidence and trends within the data that support or contradict them. With respect to the two variables discussed above, some hypotheses we can formulate are the following.

  1. Gradual drop in conductance over time would mean normal behavior and sudden drop would mean anomalous behavior
  2. Deviation from the manufacturer-prescribed “Coup de Fouet” effect would indicate anomalous behavior

When we explore the data, hypotheses like the above will be our points of reference for the trends we have to look out for in the variables involved. The more hypotheses we formulate based on domain expertise, the better it will be at the exploratory stage. Now that we have seen what the business discovery phase entails, let us encapsulate the key considerations within this phase

  1. Understand the business problem which we have set out to solve
  2. Identify all key variables related to the business problem
  3. Identify the lead indicators within these variables which will help in solving the business problem
  4. Formulate hypotheses about the lead indicators

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective, it is time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to try to understand some critical aspects about how data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following

  • Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase?
  • What is the mechanism of data capture? Does the data capture mechanism differ according to the variables?
  • What is the frequency of data capture? Does it vary across the variables?
  • Does the volume of data captured vary according to the frequency and variables involved?

In the case of the battery prediction problem, there are three different data sets. These data sets pertain to different sets of variables. The frequency of data collection and the volume of data captured also vary. The key data sets involved are the following

  • Conductance data set : Data pertaining to the conductance of the batteries. This is collected every 2-3 days. Some of the key data points collected along with the conductance data include
    • Time stamp when the conductance data was taken
    • Unique identifier for each battery
    • Other related information like manufacturer, installation location, model, string it was connected to etc.
  • Terminal voltage data : Data pertaining to the voltage and temperature of the battery. This is collected every day. Key data points include
    • Voltage of the battery
    • Temperature
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc.
  • Discharge data : Discharge data is collected once every 3 months. Key variables include
    • Discharge voltage
    • Current at which voltage discharges
    • Other related information like battery identifier, manufacturer, installation location, model, string data etc.


As seen, we have to work with three very distinct data sets with different sets of variables, different frequencies at which the data points arrive and different volumes of data for each of the variables involved. One of the key challenges one would encounter is connecting all these variables together into a coherent data set which will help in the predictive task. This becomes easier if we formulate the predictive problem by connecting the available data sets to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem we have at hand and then connect it with the data points which we have at hand.  The predictive problem requires us to predict two things

  1. Which battery will fail, and
  2. In which period of time in the future the battery will fail.

Since the prediction is at a battery level, our unit of reference for formulating the predictive problem is the individual battery. This means that all the variables which are present across the multiple data sets have to be consolidated at the individual battery level.

The next question is: over what period of time do we have to consolidate the variables for each battery? To answer this question, we have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are captured at different intervals. In addition, the volume of data collected for each of those variables at those instances of time also varies substantially.

  • Conductance : One reading of a battery captured once every 3 days.
  • Voltage & Temperature : 4-5 readings per battery captured every day.
  • Discharge : A set of readings captured every second, at different intervals of a day, once every 3 months (approximately 4500 – 5000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in the future, our model has to learn the behavior of these variables across time periods. However, we have to select a time period for which we have sufficient data points for each of the variables. The natural choice in this scenario is 3 months, as discharge data is available only once every 3 months. This would mean that all the data points for each battery, for each variable, have to be consolidated into a single record for every 3 months. So if each battery has around 3 years of data, it would entail 12 records per battery (4 quarters for each of the 3 years).


Another aspect we have to look at is how 3 months of data points for a battery can be consolidated into one record for each variable. For this we have to choose a suitable consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering. We will deal with those aspects in detail when we talk about the exploratory analysis and feature engineering phases.
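To make the consolidation step concrete, here is a minimal sketch of how raw readings could be collapsed into one value per battery per 3 month period. The tuple layout, the battery identifier "B1", the sample values and the use of the mean are all assumptions made for illustration; as noted above, the actual consolidation metric would only be chosen after exploratory analysis.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def quarter_of(d):
    """Map a date to a (year, quarter) pair, e.g. 2015-05-10 -> (2015, 2)."""
    return (d.year, (d.month - 1) // 3 + 1)


def consolidate(readings, metric=mean):
    """Collapse raw readings into one value per battery per quarter.

    readings: iterable of (battery_id, reading_date, value) tuples.
    metric:   placeholder aggregation (the mean here is an assumption,
              not the metric the final model would necessarily use).
    """
    buckets = defaultdict(list)
    for battery_id, reading_date, value in readings:
        buckets[(battery_id, quarter_of(reading_date))].append(value)
    return {key: metric(values) for key, values in sorted(buckets.items())}


# Hypothetical conductance readings for a single battery.
readings = [
    ("B1", date(2015, 1, 3), 1200.0),
    ("B1", date(2015, 2, 14), 1180.0),
    ("B1", date(2015, 4, 2), 1100.0),
    ("B1", date(2015, 6, 20), 1060.0),
]
quarterly = consolidate(readings)
for key, value in quarterly.items():
    print(key, value)
```

The same function could be run once per variable (conductance, voltage, temperature, discharge) with a different metric passed in for each, and the results joined on the (battery, quarter) key.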

The next important point we have to deal with is the labeling of the response variable. Since the business problem is to predict which battery would fail, the response variable would classify whether a record of a battery falls under a failure class or not. However there is a lacuna in this approach. What we want is to predict, well ahead of time, when a battery is likely to fail, and therefore we have to factor the “when” part also into the classification task. This would entail looking at samples of batteries which have actually failed and identifying the point of time when failure happened. We label that point as the “failure point” and then look back in time from the failure point to classify the periods leading to failure.

Since the consolidation period for data points is three months, we can fix the “looking back” period also to be 3 months. This would mean that, for those samples of batteries where we know the failure point, we take the record which is one time period ( 3 months) before failure and label it as 1 period before failure, the record which corresponds to 6 months before failure is labelled as 2 periods before failure, and so on. We can continue labeling the data according to periods before failure till we reach a comfortable point in time ahead of failure ( say 1 year). If the comfortable period we have in mind is 1 year, we would have 4 failure classes i.e. 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records earlier than 1 year before failure can be labelled as “Normal Periods”. This labeling strategy means that our predictive problem is a multi-class classification problem with 5 classes ( 4 failure period classes and 1 normal period class).
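The labeling strategy can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `label_periods`, the integer encoding (0 for “Normal Periods”, k for “k periods before failure”) and the decision to keep only the records that precede the failure point are all hypothetical choices, not something fixed by the discussion above.

```python
def label_periods(n_periods, failure_index=None, horizon=4):
    """Label the 3-month records of one battery, oldest record first.

    n_periods:     number of consolidated records for the battery.
    failure_index: position of the period in which failure happened,
                   or None for a battery that never failed.
    horizon:       number of periods before failure that get their own
                   class (4 periods = 1 year).
    Returns one integer label per record: 0 means "Normal Period",
    k means "k periods before failure".
    """
    if failure_index is None:
        return [0] * n_periods

    labels = []
    # Assumption: only records before the failure point are labelled.
    for i in range(min(failure_index, n_periods)):
        gap = failure_index - i
        labels.append(gap if gap <= horizon else 0)
    return labels


# A battery with 3 years of data (12 records) that failed in its 9th period.
failed = label_periods(12, failure_index=8)
# A battery with 3 years of data that never failed.
healthy = label_periods(12)
print(failed)
print(healthy)
```

For the failed battery the four records immediately before failure receive the classes 4, 3, 2 and 1, and everything earlier is treated as normal, which gives the 5 classes described above.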


The labeling strategy discussed above is for samples of batteries within our data set which have actually failed and where we know when the failure happened. However, if we do not have information about which batteries have failed and which have not, we have to resort to intense exploratory analysis to first determine the samples of batteries which have failed and then label them according to the labeling strategy discussed above. We will discuss how exploratory analysis can be used to identify failed batteries in the next post. Needless to say, the records of all batteries which have not failed will be labelled as “Normal Periods”.

Now that we have seen the predictive problem formulation part, let us recap our discussions so far. The predictive problem formulation step involves the following

  1. Understand the business problem and formulate the response variables.
  2. Identify the unit of reference to which the business problem will apply ( each battery in our case)
  3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables are generated
  4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
  5. From the data set, identify those units which have failed and which have not failed. Such information will generally be available from the past maintenance contracts for each unit.
  6. Adopt a labeling strategy for both the failed units and the normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records with failure classes up to a convenient period before failure ( 1 year in this case). All records before that period will be labelled the same as the records of units which have not failed ( “Normal Periods”).

Wrapping up till we meet again

So far we have discussed the initial two phases of a data science project. The first phase entails defining the business problem and carrying out the business discovery. In the next phase, which is the data discovery phase, we align the available data points to the business problem and then formulate the predictive problem. Once we have a clear understanding of how the predictive problem has to be formulated, our next task will be to get into the exploratory analysis and feature engineering phases. These phases and the subsequent phases will be dealt with in detail in the next post of this series. Watch this space for more.


Mind of a Data Scientist – Part II


In the last post of this series we had a glimpse into the nuances of the business discovery and data engineering phases. These phases dealt with breaking down a business problem into the factors which influence it and collating data points related to the business problem. In this post, we will go further and see how the data we collected is analysed to give us insights for our modeling process. This phase is called the data discovery phase.

Data Discovery Phase

This phase is one of the most critical phases in the whole life cycle, where one gets acclimatized to the data structure and the interrelationships between the variables. There are two perspectives from which we approach the data discovery phase. One is the business perspective and the second is the statistical perspective. Both these perspectives can be depicted as follows.


The business perspective deals with the relationships between the variables from the domain of the business problem. In contrast, the statistical perspective looks more at the statistical characteristics of the data at hand, like its distributions, normality, skew etc. To help us elucidate these concepts let us take a case study.

Let us assume that a client of ours who has various cell sites approaches us with a problem they are grappling with. They would like to know in advance the state of health of the batteries which are powering their cell sites. They want our help in predicting when their batteries would fail. For this they have given us historical data of the measurements they have taken over time. Some of the key variables involved are readings related to conductance, voltage, current, temperature, cell site location etc. Our client has also given us some clues as to what might constitute the failure of a battery. They have asked us to look at trends where the conductance values show a precipitous fall over time, which might be an indicator of failing batteries. Equipped with this information, let us see how we can go about our task of data discovery. Let us first look at it from the business perspective.

Data Discovery – Business Perspective

The best way to embark on the data discovery phase is to think from the perspective of our business problem. Our business problem was to predict the impending failure of batteries. The obvious question which comes to mind is, what constitutes failure of a battery ? We might not have a clear cut recipe for failure at this point of time, however what we have is a trail which we have to follow. The trail we have is that of batteries which show a trend of dropping conductance over time. To follow this trail we need to first separate those batteries with a falling trend from those which do not show that trend. The next question would be, how do we separate out those batteries which have a falling trend from the rest ? The best way to do that is to go for some aggregating metric for the basic unit connected with our business problem. Let me elaborate on the last sentence with a pictorial representation of our data set.


Let the sample of the data we have at hand be as shown in the figure above. We have a number of batteries, say around 20,000 of them. For each battery we have readings of conductance over a time period of around 2 – 3 years. Each battery is associated with a plant ( cell location). A plant may have multiple batteries, however a battery will be associated with only one plant.

Now that we have seen the structure of our data set, let us come back to the earlier statement i.e. “aggregating metric for the basic unit connected with the business problem”. Looking at this statement there are two main terms which are important.

  1. Basic unit &
  2. Aggregating metric.

In our case the basic unit connected with the business problem is the individual battery. If our business problem were to predict plant sites which can potentially fail, then our basic unit would be each plant site. The second term, the aggregating metric, is an aggregated measure of a variable associated with the basic unit under consideration. In our case it would be some aggregation of the conductance of each battery. Again, the type of aggregation metric depends on the business problem. So let us take a step back into the problem we set out for ourselves. We were concerned about identifying the batteries which had a falling trend. The more pronounced the falling trend, the more likely it is to be a failing battery. So when we think about an aggregating metric we should think about a metric which will accentuate the spread of data. A very handy metric to represent the spread of data is the standard deviation. So if we aggregate the values of each battery by taking the standard deviation of its conductance, we have a very effective method to identify the set of batteries we want. The same is represented in the plot below.


The above figure is a plot of the batteries along the x axis and the standard deviation of conductance along the y axis. We can see that our aggregating metric clearly separates the batteries into two groups, one with a standard deviation of less than 100 and the other with more than 300. The second group, i.e. batteries A & C, whose standard deviation is way above the rest, are potentially the cases we are looking for. Let us also plot the actual conductance values of these batteries over time to corroborate our hypothesis.


We can clearly see from the above plot that batteries A & C show a dropping trend, which was indicated by the high standard deviation for these batteries. So taking an aggregating metric like this helps us zero in on the cases which we want to dig into further.
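The aggregation described above can be sketched as follows. The conductance readings and the cut-off of 200 are made-up values chosen only so that the two groups (below 100 and above 300) mentioned in the text show up; real data would need the cut-off to be read off the plot.

```python
from statistics import pstdev

# Hypothetical conductance readings over time, keyed by battery.
conductance = {
    "A": [1500, 1450, 1300, 1000, 700],    # falling trend
    "B": [1500, 1490, 1510, 1495, 1505],   # stable
    "C": [1400, 1350, 1100, 800, 600],     # falling trend
    "D": [1450, 1460, 1440, 1455, 1445],   # stable
}

# Aggregating metric: standard deviation of conductance per battery.
spread = {battery: pstdev(values) for battery, values in conductance.items()}

# Assumed cut-off lying between the two groups seen in the plot.
THRESHOLD = 200
suspects = sorted(b for b, s in spread.items() if s > THRESHOLD)

for battery, s in spread.items():
    print(battery, round(s, 1))
print("Batteries to investigate:", suspects)
```

A high standard deviation only says that the readings are widely spread, not that they are falling, which is why plotting the actual conductance over time for the shortlisted batteries remains a necessary second step.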

Deep Diving

Now that we have identified the set of batteries which could potentially be problematic, the next step is to dive deep into those cases and try to identify other indicators which are associated with falling conductance. We need to look closely at some pictorial representation of the data and then ask further questions

  1. Is there any particular period of time when such trends happen ?
  2. Are there any specific patterns which we can unearth before the falling trend in conductance ?
  3. Is there anything special about the slope of the curve which shows a falling trend ? … etc

We need to look at all discernible patterns within that variable and build our intuitions on them. Once we build our intuitions on one variable, it is time to move further and associate other variables. We can bring in variables like voltage, current, temperature etc. and see how they behave with respect to the specific trends which we saw when we analysed only one variable (conductance). Some of the trends we can look at are the following

  1. How have voltage, current or temperature behaved during the period when we saw a drop in conductance ?
  2. Are there any specific trends for these variables before we saw the trend of falling conductance ?
  3. How have these variables behaved after the fall in conductance values ?
  4. Are there any prospects for more variables other than the ones we have ? … etc

These are the kinds of questions we have to ask to help us unearth the various relationships which exist between the variables in our data set. Asking all these questions and slicing and dicing each of the variables helps us achieve the following

  1. Helps in determining the relative importance of variables
  2. Provides a rough idea about the relationships between variables
  3. Gives insights into any variables that need to be derived out of the existing variables
  4. Gives us intuitions on any new variables which need to be brought in

All insights we unearth by asking such questions will help us immensely when we get into the downstream modelling activities.

Summing Up

Now that we have seen the business perspective of the data discovery phase, let us encapsulate the main steps in the process

  1. Identify a variable which potentially gives an indication of the problem we are trying to solve
  2. Derive some aggregation metric for the identified variable to help us split the basic units related to our problem
  3. Dive deep into the cases we have earmarked and look for trends with respect to the variable we are looking at
  4. Introduce other variables and look for associations between the newly introduced variables and the trends we saw in the first variable.
  5. Look for relationships between variables which give clues to the problem statement
  6. Build intuitions on any new variable that can be introduced which can help in solving the problem.

The above is a set of broad guidelines on how we can structure our thought process for the business perspective of the data discovery phase. In the next post we will deal with the statistical perspective of data discovery and how we can connect the dots between both these perspectives so as to give us intuitions for feature engineering and modelling. Watch this space for more.




Mind of the Data Scientist – Part I



Over the past few months various people have been asking me to give them an end to end view of what it entails to be a data scientist. When I was contemplating this request I thought, rather than just providing an end to end process, let's go a little deeper into how she or he thinks when confronted with an analytic problem. So from this week we are starting a new series called “The Mind of a Data Scientist”. The name of the series might ring a bell to many of you due to its similarity with Kenichi Ohmae’s famous book “The Mind of the Strategist”. Well, the name of the series is inspired by Kenichi Ohmae’s book. However the similarity ends with the name. The path we would tread when trying to unravel the thinking process of a data scientist is as depicted below.


The above depiction is a bird’s eye view of the maze a data scientist has to traverse in trying to address a problem. So let us tread this path and embark on a safari through the mind of a data scientist.

Business Discovery : In the Beginning……

As always, in the beginning there was some business challenge or problem which paved the way to a data science initiative. To be more contextual let us take an example. Let's assume Eggs Incorporated, an agro products company, approached us to help them in predicting the yield of eggs. To help solve this business problem they gave us the historic data available in their internal systems.

So where do you think we will start in our quest to solve the problem at hand ? The best way to start is by building our intuitions and hypotheses on the factors which determine the variable we are going to predict. We can call this variable the response variable, which in our case is the yield of egg production. To gain intuitions on the key factors which affect our response variable we have to embark on some secondary research and also engage with the business folks of Eggs Inc. We can call this phase of our safari the business discovery phase. During this phase we build our intuitions on the key factors which affect our response variable. These key factors are called the independent variables or features. Through our business discovery phase we find that the key features which affect the yield of egg production are temperature, availability of electricity, good water, nutrients, quality of chicken feed, prevalence of diseases, vaccinations etc. In addition to the identification of key features, we also build our intuitions on the relationships between the features and the response variable, like ….

What kind of relationship exists between temperature and the yield of eggs ?

Does the kind of chicken feed affect the yield ?

Is there an association between availability of electricity and the yield ?

…… etc.

The intuitions we build in the beginning will help us when we do our discovery of the data in later phases. After gaining intuitions on the variables that come into play and the relationships that exist between them, the next task is to validate our intuitions and hypotheses. Let us see how we do that.

The Grind …… : Getting the data ready to test our intuitions and hypothesis

To validate our hypotheses and intuitions we need to have data points related to the problem we are trying to solve. Aggregating these data points in the format we want is the most tedious and time consuming part of our journey. Many of these data points might be available in various forms and modes within the organisation. There would also be a need to supplement the data available within the organisation with what is available outside, for example social media data or open data available in the public domain. Our aim would be to get all the relevant data points into a neat form and shape so that we can work our way through them. There are no set rules as to how we do it. The only guide for us in getting this task accomplished is the problem statement we have set out to solve.

When we talk about getting the data ready, we have to do an assessment of the four V’s connected with data

  1. Volume of data
  2. Variety of data
  3. Velocity of data and
  4. Veracity of data.

Volume deals with the quantum of data we have at our disposal to play with. In most cases, the larger the volume, the better it is for creating a more representative model. However, bigger volumes also pose challenges in terms of the speed and ability of the resources we have at hand to process this data. Volume assessment will help us decide on adopting suitable parallel processing technologies so as to speed up the processing time.

Variety refers to the disparate forms in which our data points are generated at the source. Data might reside in many forms i.e. traditional RDBMS, text, images, videos, log files etc. The more disparate the data sets are, the more complex our aggregation process is. The variety of data points will give clues on the adoption of the right data aggregating technologies.

The third ‘V’ i.e. velocity deals with the frequency at which data points are generated. There could be data points which are generated very regularly, like web stream data, whereas there could also be data which is generated intermittently. The velocity of data is an important consideration in feature engineering and also in the adoption of the right data aggregation technologies.

The last ‘V’, i.e. veracity, concerns the reliability and quality of each data point and hence the value it provides in the overall context of the problem. If we are not judicious in the selection of variables based on their veracity, we will be inundated with a deluge of noisy variables, making it difficult to extract signals from the data we have.

All the above factors have to be borne in mind when we set about our task of molding the data points into a form which will make later analysis easy. The complexity and the importance of the whole process have given rise to a stream called Data Engineering. In short, Data Engineering is all about extracting, collecting and processing the myriad data points so that they become amenable to downstream value realization processes.

Wrapping up the first part…

So far we have seen the formulation of the business problem and the engineering of the data points to give shape and direction to our subsequent steps in the data science journey. In the next post we will deal with two other critical elements in our life-cycle, namely exploratory data analysis and feature engineering. These processes are instrumental in the formulation of the right model for the problem. Watch this space for more as we take our safari through the mind of the data scientist.

Classification Algorithms: Random Forest – Part II



In the first part of this series we set the context for Random Forest algorithm by introducing the tree based algorithm for classification problems. In this post we will look at some of the limitations of the tree based model and how they were overcome paving the way to a powerful model – Random Forest. Two major methods that were employed to overcome those pitfalls are Bootstrapping and Bagging. We will discuss them first before delving into random forest.

Bootstrapping and Bagging

When we discussed the tree based model we saw that such models are very intuitive i.e. they are easy to interpret. However such models suffer from a major drawback i.e. high variance. Let us understand what high variance means in this context. Suppose we have a data set which we divide into three parts, and we fit a separate tree model on each part. If we then predict the result of a new observation with each of these three models, the results we get for the same observation can be very different. This is what we call, in statistical jargon, a “model with high variance”. High variance is obviously bad, as the reliability of the results we get is compromised. One effective way to overcome high variance is averaging. This would mean taking multiple data sets, fitting a tree based model on each of these data sets, doing predictions on new observations and then averaging the results from each of the tree models to get a more reliable result. This seems a very plausible solution. However we have a major problem here. Averaging requires multiple data sets. But what if the data we have is quite limited and obtaining additional data is prohibitively expensive ?

……….. Lo and Behold, we have a powerful method to help us out of this predicament and it is called Bootstrapping.

The word bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”. In essence it means doing some task considered impossible. In statistics, the bootstrapping procedure entails sampling from the available data set with replacement. Let me elaborate with an example. Suppose our data set has 10 observations ( rows 1:10). From this data set we randomly pick an observation, say row 6. After that we put row 6 back into the data set and randomly pick another observation. Say this time we got row 8. We again put this observation back and repeat the process till we have 10 observations. Let us assume that the first set of observations we picked looks like this : 6,8,4,8,5,6,9,1,2,5. You might have noticed that there are observations which repeat within the above set. That is perfectly all right in bootstrapping. We continue this process till we get a collection of bootstrapped samples of 10 observations each. Once we have a collection, or a bag, of bootstrapped data sets, we fit a tree model on each of these sets, carry out predictions and then average the results. This whole process is called bagging. Bagging helps us get over our original problem of high variance, and the results mirror reality more closely.
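The procedure above can be sketched with the standard library alone. To keep the example short, the “model” fit on each bootstrapped sample is simply the mean of the sample, which is a stand-in assumed for illustration and not a tree model; the function names and the seeds are likewise hypothetical.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return rng.choices(data, k=len(data))


def bagged_estimate(data, fit, n_samples, seed=0):
    """Fit one model per bootstrapped sample and average the predictions.

    fit: any function that turns a sample into a single prediction.
         A real bagged model would fit a decision tree here.
    """
    rng = random.Random(seed)
    predictions = [fit(bootstrap_sample(data, rng)) for _ in range(n_samples)]
    return mean(predictions)


rows = list(range(1, 11))                         # observations 1 to 10
sample = bootstrap_sample(rows, random.Random(42))
print("One bootstrapped sample:", sample)         # repeats are expected

estimate = bagged_estimate(rows, fit=mean, n_samples=200)
print("Bagged estimate:", estimate)
```

Swapping `fit` for a function that trains a tree and predicts a new observation would turn this sketch into bagging as described in the text.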

Random Forest

Now that we have discussed bootstrapping and bagging, we are in a position to get into the nuances of random forest. The Random Forest algorithm provides an improvement over bagging by de-correlating the trees. Let me elaborate on the de-correlating part. When we were discussing the tree based methods in the last post, we talked about splitting the data set based on the best features. When we grow our trees on the bootstrapped samples, more often than not it is the same set of best features which gets picked to do the splits and thereby grow the trees. This results in a bunch of trees which look almost the same or, in statistical terms, are “correlated”. We have also discussed that the final result is obtained by averaging the results from all the tree models grown on the bootstrapped samples. It turns out that averaging the predictions of highly correlated trees reduces variance far less than averaging uncorrelated ones, so the predictions remain sub-optimal.

To overcome this, the Random Forest algorithm randomly picks a smaller subset of features as candidates each time a split is made. If there are “P” features in the data set, the subset picked is approximately √P. The idea of randomly picking a subset of features at each split is to avoid being biased towards the best predictors. In the new setting, all the predictors have an equal chance of being picked and the tree models will be more “representative”. Averaging the results from these representative trees provides more accurate predictions. In effect, the combination of bootstrapping, bagging and random picking of features provides the robustness inherent in the random forest model.
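The random picking of features can be illustrated in isolation. The sketch below only shows how a candidate subset of roughly √P features might be drawn for a single split; the feature names are hypothetical and the search for the best split among the candidates is left out.

```python
import math
import random


def candidate_features(features, rng):
    """Pick roughly sqrt(P) features at random as candidates for one split."""
    k = max(1, round(math.sqrt(len(features))))
    return rng.sample(features, k)


# Hypothetical feature names, P = 9.
features = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9"]
rng = random.Random(7)

# A fresh subset is drawn for each split, so a dominant feature
# is not available as a candidate every time.
for split in range(3):
    print("Split", split, "candidates:", candidate_features(features, rng))
```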

Out of Bag Error Estimation

There is a very straightforward method to estimate the error in a bagged model and it is called “Out of Bag” (OOB) error estimation. In the example we discussed on bootstrapping, we had 10 observations in our first sample (6,8,4,8,5,6,9,1,2,5). We can see that the observations ( 3,7,10) have not been picked in the first bootstrapped sample. These elements are called “Out of Bag” observations. In general, only about 2/3rd of the observations are picked in each bootstrapped sample. That means about 1/3rd of the observations are OOB for each bootstrapped sample. OOBs serve a very important purpose in the overall scheme of things i.e. they act as test beds for estimating the error in the model. Let me illustrate this idea with an example. Let us take the case of observation 3. As seen, it is an OOB observation for the first bootstrapped sample. Let us assume that the same observation ends up as OOB for the 6th and 12th bootstrapped data sets too. When a tree model is fit on each of the first, sixth and twelfth bootstrapped sets, observation 3 is used as a test case, giving three predictions, one from each model. For regression problems the three results for observation 3 are averaged to get a single prediction. For classification problems the most prevalent class out of the three is taken. Once we have a single prediction by averaging or voting, the error is estimated by comparing it against the true value or class of observation 3. The error estimation is done similarly for all the OOB elements to get an overall aggregation of error. This method of error estimation largely removes the need for cross validation, which can be cumbersome for large data sets.
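The OOB bookkeeping for the example above can be checked with a few lines of code. The sample is the one from the text; the class names passed to `majority_vote` are hypothetical. The “about 1/3rd” figure comes from the chance that a given observation is missed in all n draws, which is (1 − 1/n)^n and approaches 1/e ≈ 0.368 as n grows.

```python
from collections import Counter

rows = range(1, 11)
first_sample = [6, 8, 4, 8, 5, 6, 9, 1, 2, 5]

# Observations never drawn into the first bootstrapped sample.
oob = sorted(set(rows) - set(first_sample))
print("OOB observations:", oob)


def oob_fraction(n):
    """Chance that a given observation is left out of a sample of size n."""
    return (1 - 1 / n) ** n


def majority_vote(predictions):
    """Combine the class predictions of the trees for which a row was OOB."""
    return Counter(predictions).most_common(1)[0][0]


print("Expected OOB share for n = 1000:", round(oob_fraction(1000), 3))

# Hypothetical predictions for observation 3 from the 1st, 6th and 12th trees.
print("OOB prediction:", majority_vote(["fail", "normal", "fail"]))
```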

Wrapping Up

The ideas behind the random forest model i.e. bootstrapping, bagging, random feature selection etc. have aided the making of a very powerful algorithm. However random forest is not bereft of pitfalls. One major pitfall of the model is that it can't be interpreted easily. However the positives of this model far outweigh the negatives, and because of this random forest is one of the most powerful algorithms providing realistic results.

It is time to wrap up our discussion on tree based algorithms and random forest in particular. From the next post onward we start a new series called the “Mind of a Data Scientist”. In this series we take an exploratory walk through the thought process of a data scientist in enabling data driven, informed decision making. Watch this space for more.