# III : Build and Deploy Data Science Products : Looking under the hood of Machine translation model – LSTM Forward Propagation

This is the third part of our series on building a machine translation application. In the last two posts we understood the solution landscape for machine translation and also explored different architecture choices for sequence to sequence models. In this post we take a deep dive into the dynamics of the model we use for machine translation, LSTM model. This series consists of 8 posts.

Dissecting the LSTM network

I was recently reading the book ” The Agony and the Ecstacy’ written by Irving Stone. This book was about the Reniassence genius, master sculptor and artist Michelangelo. When sculptuing human forms, in his quest for perfection , Miehelangelo used to spent months dissecting dead bodies to understand the anotomy of human beings. His thought process was that unless he understood in detail how each fibre of human muscle work, it would be difficult to bring his work to life. I think his experience in dissecting and understanding the anatomy of the human body has had a profound impact on his masterpieces like Moses, Pieta,David and his paintings in the Sistine Chapel.

I too believe in that philosophy of getting a handle on the inner working of algorithms to really appreciate how they can be used for getting the right business outcomes. In this post we will understand the LSTM network in depth and explore its therotical underpinnings. We will see a worked out example of the forward pass for a LSTM network.

Forward pass of the LSTM

Let us learn the dynamics of the forward pass of LSTM with a simple network. Our network has two time steps as represented in the below figure. The first time step is represented as `'t-1'` and the subsequent one as time step `'t'`

Let us try to understand each of the terms in the above network. A LSTM unit receives as its input the following

1. `c<t-2>` : The cell state of the previous time step
2. `a<t-2> ` : The output from the previous time step
3. `x<t-1> ` : The input of the present time step

The cell state is the unit which is responsible for trasmitting the context accross different time steps. At each time step certain add and forget operations happens to the context transmitted from the previous time steps. These Operations are controlled through multiple gates. Let us understand each of the gates.

Forget Gate

The forget gate determines what part of the input have to be introduced into cell state and what needs to be forgotten. The forget gate operation can be represented as follows

`Ґf = sigmoid(Wf*[ xt ] + Uf * [ at-1 ] + bf)`

There are two weight parameters `( Wf and Uf )` which transforms the input `( xt )` and the output from the previous time step `( at-1)` . This equation can be simplified by concatenating both the weight parameters and the corresponding `xt & at` vectors to a form given below.

`Ґf = sigmoid(Wf *[xt , at-1] + bf)`

`Ґf `is the forget gate

Wf is the new weight matrix got by concatenating `[ Wf , Uf]`

`[xt , at-1]`is the concatenation of the current time step input and the previous time step output from the

`bf `is the bias term.

The purpose of the sigmoid function is to quash the values within the bracket to act as a gate with values between 0 & 1 . These gates are used to control the flow of information. A value of 0 means no information can flow and 1 means all information needs to pass through. We will see more of those steps in a short while.

Update Gate

Update gate equation is similar to that of the forget gate . The only difference is the use of a different weight for this operation.

`Ґu = sigmoid(Wu *[xt , at-1] + bu)`

`Wu` is the weight matrix

`Bu` is the bias term for the update gate operation

All other operations and terms are similar to that in the forget gate

Input activation

In this operation the input layer is activated using a tanh non linear activation.

`C~ = tanh(Wc *[x , a] + bc)`

`C~` is the input activation

`Wc` is the weight matrix

`bc` is the bias term which is added.

operation converts the terms within the bracket to values between -1 & 1 . Let us take a pause and analyse why a `sigmoid` is used for the gate operations and `tanh` used for the input activation layers.

The property of sigmoid is to give an output between 0 and 1. So in effect after the sigmoid gate, we either add to the available information or do not add any thing at all. However for the input activation we also might need to forget some items. Forgetting is done by having negative values as output. tanh layer ranges from -1 to 1 which you can see have negative values. This will ensure that we will be able to forget some elments and remember others when using the tanh operation.

Internal Cell State

Now that we have seen some of the building block operations, let us see how all of them come together. The first operation where all these individual terms come together is to define the internal cell state.

We already know that the forget and update gates which have values ranging between 0 to 1, act as controllers of information. The forget gate is applied on the previous time step cell state and then decides which of the information within the previous cell state has to be retained and what has to be eliminated.

`Ґf * C<t-1>`

The update gate is applied on the input activation information and determines which of these information needs to be retained and what needs to be eliminated .

`Ґu * C~`

These two informations block i.e the balance of the previous cell state and the selected information of the input activation are combined together to form the current cell state. This is represented in the equation as below.

`C<t> = Ґu * C~ + Ґf * C<t-1>`

Output Gate

Now that the cell state is defined it is time to work on the output from the current cell. As always, before we define the output candidates we first define the decision gate. The operations in the output gate is similar to the forget gate and the update gate .

`Ґo = sigmoid(Wo *[x , a] + bo)`

`Wo `is the weight matrix

`Bo` is the bias term for the update gate operation

Output

The final operation within the LSTM cell is to define the output layer. The output candidates are determined by carrying out a tanh() operation on the internal cell state. The output decision gate is then applied on this candidate to derive the output from the network. The equation for the output is as follows

`a<t> = tanh(C<t>) * Ґo`

In this operation using the tanh operation on the cell state we arrive at some candidates to be forgotten ( -ve values) and some to be remembered or added to the context. The decision on which of these have to be there in the output is decided by the final gate, output gate.

This sums up the mathematical operations within LSTM. Let us see these operations in action using a numerical example.

Dynamics of the Forward Pass

Now that we have seen the individual components of a LSTM let us understand the real dynamics using a toy numerical examples.

The basic building block of LSTM like any neural network is its hidden layer, which comprises of a set of neurons. The number of neurons within its hidden unit is a hyperparameter when initializing a LSTM. The dimensions of all the other components of a LSTM depends on the dimension of the hidden unit. Let us now define the dimensions of all the components of the LSTM.

Let us now look at how the dimensions of the different outputs evolve after different operations within the LSTM .

``Please note that when we do matrix multiplications with two matrices of size ( a,b) * (b,c) we get an output of size (a,c)``

Let us do a toy example with a two time step network with random inputs and observe the dynamics of LSTM.

The network is as defined below with the following inputs for each time steps. We also define the actual outputs for each time step. As you might be aware the actual output will not be relevant during the forward pass, however it will be relevant during the back propogation phase.

Our toy example will have two time steps with its inputs (Xt) having two features as shown in the figure above. For time step 1 the input is Xt-1 = `[0.4,0.3] `and for time step 2 the input is Xt = `[0.2,0.6]`. As there are two features, the size of the input unit is n_x = 2. Let us tabulate these values

For simplicity the hidden layer of the LSTM has only one unit which means that `n_a = 1. `For the first time step we can assume initial values for the cell state Ct-2 and output from previous layers at-2 as ‘0’.

Next we have to define the values for the weights and biases for all the gates. Let us randomly initialize values for the weights. As far as the weights are concerned, what needs to be carefully defined are the dimensions of the weights. In the earlier table where we defined the dimensions of all the components we defined the dimension of the weights as `(n_a , n_x + n_a).` But why do the weights be with these dimensions ? Let us dig deeper.

From our earlier discussions we know that the weights are used to get the sigmoid gates which are multiplied element wise on the cell states. For example

`Ct = Ґu * C~ + Ґf * Ct-1 `

or

`at = tanh(Ct) * Ґo. `

From these equations we see that the gates are multiplied element wise to the cell states. To do an element wise multiplication, the gates have to be of the same dimensions as the cell state, i.e. `(n_a, m)`. However, to derive the gates, we need to do a dot product of the initialised weights with the concatenation of previous cell state and the input vector `[n_x+n_a]`. Therefore to get an output dimension of `(n_a, m)` we need to have the weights with dimensions of `(n_a , n_x + n_a)` so that the equation of the gate ,`Ґf = sigmoid(Wf *[x , a] + bf)`, generates an output of dimension of `(n_a ,m )`. In terms of matrix multiplication dynamics this equation can be represented as below

Having seen how the dimensions are derived, let us tabulate the values of weights and its biases .Please note that the values for all the weight matrices and its biases are randomly initialized.

Having defined the initial values and the dimensions let us now traverse through each of the time steps and unravel the numerical example for forward propagation.

#### Time Step 1 :

Inputs : X t-1 = [0.4, 0.3]

Initial values of the previous state

at-2= [0] ,

Ct-2 = [0]

Forget gate => Ґf = sigmoid(Wf *[x , a] + bf) =>

`= sigmoid( [-2.3 , 0.6 , -0.13 ] * [0.4, 0.3, 0] + [0.51] )`

`= sigmoid(((-2.3 * 0.4) + (0.6 * 0.3) + (-0.13 * 0 )) + 0.51)`

`= sigmoid(-0.23) = 0.443`

`Please note  sigmoid (-0.23) = 1/(1 + e(-(-0.23))`

Update gate => Ґu = sigmoid(Wu *[x , a] + bu) =>

`= sigmoid( [1.51 ,-0.61 , 1.31] * [0.4, 0.3, 0] + [1.30] )`

`= sigmoid((1.51 * 0.4) + (-0.61 * 0.3) + (1.31 * 0 ) + 1.30)`

`= sigmoid(1.721) = 0.848`

Input activation => C~ = tanh(Wc *[x , a] + bc)

`= tanh( [0.82,-0.57,-0.13] * [0.4, 0.3, 0] + [-0.57] )`

`= tanh (((0.82 * 0.4) + (-0.57 * 0.3) + (-0.13 * 0 )) + -0.57)`

`= tanh(-0.413) = -0.39`

`Please note tanh = ex – e-x / ( ex + e-x) where x = -0.413`
`= e-0.413 – e-(-0.413) / ( e-0.413 + e-(-0.413)) = -0.39`

Output Gate => Ґo = sigmoid(Wo *[x , a] + bo)

`= sigmoid( [-0.75 ,-0.95 , -0.34] * [0.4, 0.3, 0] + [-0.46] )`

`= sigmoid(((-0.75 * 0.4) + (-0.95 * 0.3) + (-0.34 * 0 )) + -0.46)`

`= sigmoid(-1.045)= 0.26`

We now have all the components required to calculate the internal state and the outputs

Internal state => Ct-1 = Ґu * C~ + Ґf * Ct-2

`= 0.848 * -0.39 + 0.443 * 0`

`= -0.33`

Output => at-1 = tanh(Ct-1) * Ґo

`= tanh(-0.33) * 0.26 = -0.083`

Let us now represent all the numerical values for the first time step on the network.

With the calculated values of time step 1 let us proceed to calculating the values of time step 2

#### Time Step 2:

`Inputs : Xt = [0.2, 0.6] `

Values of the previous state output and cell states

`at-1 = [-0.083]`

`Ct-1 = [-0.33]`

Forget gate => Ґf = sigmoid(Wf *[xt , at-1] + bf) =>

`= sigmoid( [-2.3 , 0.6 , -0.13 ] * [0.2, 0.6, -0.083] + [0.51] )`

`= sigmoid(((-2.3 * 0.2) + (0.6 * 0.6) + (-0.13 * -0.083 )) + 0.51)`

`= sigmoid(0.421) = 0.60`

Update gate => Ґu = sigmoid(Wu *[xt , at-1] + bu) =>

`= sigmoid( [1.51 ,-0.61 , 1.31] * [0.2, 0.6, -0.083] + [1.30] )`

`= sigmoid(((1.51 * 0.2) + (-0.61 * 0.6) + (1.31 * -0.083 )) + 1.30)`

`= sigmoid(1.13) = 0.755`

Input activation => C~ = tanh(Wc *[xt , at-1] + bc)

`= tanh( [0.82,-0.57,-0.13] * [0.2, 0.6, -0.083] + [-0.57] )`

`= tanh(((0.82 * 0.2) + (-0.57 * 0.6) + (-0.13 * -0.083 )) + -0.57)`

=` tanh(-0.737) = -0.63`

Output Gate => Ґo = sigmoid(Wo *[x , a] + bo)

`= sigmoid( [[-0.75 ,-0.95 , -0.34] * [0.2, 0.6, -0.083] + [-0.46] )`

`= sigmoid(((-0.75 * 0.2) + (-0.95 * 0.6) + (-0.34 * -0.083 )) + -0.46)`

`= sigmoid(-1.15178)= 0.24`

Internal state => Ct = Ґu * C~ + Ґf * Ct-1

`= 0.755 * -0.63 + 0.60 * -0.33`

`= -0.674`

Output => at = tanh(Ct) * Ґo

`= tanh(-0.674) * 0.24 = -0.1410252`

Let us now represent the second time step within the LSTM unit

Let us also look at both the time steps together with all its numerical values

This sums a single forward pass for the LSTM. Once the forward pass is calculated the next step is to determine the error term and the backpropagating the error to determine the adjusted weights and bias terms. We will see those steps in the back propagation steps, which will be covered in the next post.

Go to article 4 of this series : Back propagation of the LSTM unit

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

# II : Build and Deploy Data Science Products : Exploring Sequence to Sequence architecture for Machine Translation.

This is the second part of our series on building a machine translation application. In this post we explore sequence to sequence model architecture in greater depth. This series consists of the following eight posts.

In the first part of this series we surveyed the solution landscape of machine translation applications and understood why sequence to sequence models are best suited for machine translation. In this post we will go little deeper and expore architectur choices for sequence to sequence models. We will specifically look at the encoder – decoder architecture which will be the specific architecture we will use for machine translation. We will also get a glimpse of the LSTM model which is the building block for the machine translation application we would be building.

We already know that the problem of machine translation entails deciphering sequence of words in a source language to predict a sequence of target language. For example if you look at the following input German sequence

`Ich freue mich darauf, etwas über maschinelle Übersetzung zu lernen.`
```Which can be translated to

I look forward to learning about machine translation```

From these sequences we can observe the following.

1. The length of input sequence and the length of the target sequence are different
2. There is no one to one mapping between words from the input language to the target language
3. There is dependence on the context which needs to be learned from the input language to get the best translation for the target language.

The inherent complexities like these in machine translation made models like multi layer perceptron ineffective for machine translation. The need of the hour was a model architecuture which was capable of looking accross sequences of words and understand the context of the source language to effectively translate to the target language. This is where Recurrent Neural Networks (RNNs) became popular for solving machine translation problems. Let us now take a deeper look at RNNs.

Recurrent Neural Networks ( RNNs)

RNN models which fall under the category of Sequence to sequence models are designed to learn the context of any input language. But why is learning the context important ? Let us understand this with a simple example.

Suppose we are predicting the next character in a sequence for the string “Happy B….”. We need to predict the next character after the letter ‘B’. For the time being let us assume that we are ignoring the word “Happy” falling before the letter B. In such a scenario the best bet would be to look for all the words which start with “B” and choose the word which is most frequent. Let us say the most frequent word starting with “B” is the word “Baby”. So the next character which will be predicted would be the letter “a”. Now let us imagine that we started looking at all the characters which preceeds B. Given the information about the preceeding charachters “H”,”A”,”P”,”P”,”Y” “B”, then the probability of predicting ‘i’ would be the highest since the word “Birthday” is the most likely word given the context “Happy B” . This is where the concept of context becomes very significant. Language translation depends a lot on the context and therefore there was the need to adopt an architecture where context was learned. Sequence to sequence models like RNNs became an obvious choice.

The dynamics of RNN can be represented as above. The circular nodes represents each time step in the sequence. Each of the time steps receives an input represetend as the arrow pointing upwards. In this context each letter in the string becomes the input at each time step. With each character input the output or the prediction is represented at the top. So given the letter ‘H’ the prediction is the letter ‘A’. Once the letter ‘A’ is predicted it becomes the next input and we need to predict the next letter given the context that we had the letter ‘H’ at the previous time step. At each time step we can also see that there is an arrow which points to the right. This is the information or context each time step passes on to the subsequent time step enabling it to predict contextually.

Unlike vanilla neural networks where each layer has a set of parameters, RNNs shares the same parameters accross all the time steps. Because the parameters are shared accross all time steps, the implementation of back propogation is a little different for the case of RNNs. The type of back propogation implemented in RNN is called Back propogation through time(BPTT). We will be covering the dynamics of BPTT with a toy example in the fourth blog of this series.

Earlier we saw that the RNN keeps the context of the previous time steps in memory and applies it when predicting for the time step in consideration. However in practice vanilla RNNs fails when it encounters large sequences. The parameters blow up or shrink to very small values in such cases. These scenarios are called exploding gradients and vanishing gradients respectively. So in practice a RNN can only leaverage few time steps to extract the context. To over come these shortcomings different variations sequence to sequence models are used. One such variation is the LSTM Long Short Term Memory network. We will be using the LSTM network in our application for machine translation. Let us first look at what an LSTM looks like.

Long Short Term Memory Network ( LSTM)

LSTMs, like vanialla RNNs, have the recurrent connections which entails that the context from the previous time steps are passed on to the current time step when generating an output. However we discussed in the previous section on RNN that they suffer from a major problem of exploding or vanishing gradients when encountered with long sequences. This shortcoming was overcome by building a memory block in LSTMs.

The LSTM has three information sources,two from previous time steps and one from the current time step. The first one is the cell state denoted by ‘Ct’ . The cell state transmits the information about the context from the previous cell states. The second information which passes from the previous layer is its output denoted by ‘ht’. The third is the input for the present time step. In our context of predicting characters, the input from the time step t1 is the letter ‘H’. All these inputs get processed within the LSTM layer enabling it to have memory for longer sequences. We will be having a very detailed worked out example on the dynamics of LSTM in the next post.

An important part of building applications using sequence to sequence models is the selection of right architecture for the use case. Let us now look at different architecture choices for different use cases.

Network Architecture for Sequence to Sequence Models

There are different architecture choices for sequence to sequence models which varies according to the use case. Some of the prominent ones are

• Many to one architecture

This is architecture is ideal for use cases like sentiment analysis where seeing a sequences of words in a string, predict a single output which in this case is the sentiment.

• One to many architecture

This architecture is well suited for use cases like image translation. In such use cases, an image is provided as the input and a sequence of words describing the image is predicted as output. In this case there is one input and multiple outputs.

• Many to many architecture

This is the architecuture which is ideal for a use case like Machine translation. In this architecture, a sequence of words is given as input and the output is also another sequence of words. The below figure is a representation of German to English translation using the many to many architecture.

This architecture is also called Encoder-Decoder architecture. We will see the encoder-decoder architecture in greater depth during our prototype building phase.

Wrapping up

Its now time to wrap up our discussion on sequence to sequence. In this post we had an introduction on RNNs and in specific LSTM which we will be using for the machine translation application. We also looked at different types of architecture choices and identified the encoder-decoder architecture which will be more suited for our use case.

Having seen the conceptual level introduction of sequence to sequence models its time to look under the hood of the LSTM model. In the next post we will work out a toy numerical example and understand in greater depth how LSTM works.

Go to article 3 of the series : Deep dive into the LSTM model with worked out numerical example.

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

# I : Build and Deploy Data Science Products : A Practical Guide to Building a Machine Translation Application.

I was searching for a good quote to start this blog and that’s when I came across the above quote by Benjamin Franklin. I think the above quote best sums up what we are going to achieve in this series. We are going to invest our time in gaining an end to end perspective of a use case. We would be embarking on an exciting journey where we will get to experience a machine learning use case in its full glory, right from its theoretical base to building an application and deploying it. Our learning objectives are summed up in the below figure.

This journey is going to be a 8 post series. In this series we will take a use case, understand the solution landscape and its evolution, explore different architecture choices, look under the hood of the architecture to understand the nuts and bolts, build a prototype, convert the prototype into production ready code, build an application from the production ready code and finally understand the process for deploying the application .The use case we will be dealing with will be Machine Translation. By the end of the series you would have working knowledge on how to build and deploy a Machine translation application, which translates, German sentences into English. This series will comprise of the following posts.

The first four posts lays the theoretical base and in the subsequent 4 posts we will see how the theory can be put to action. You can also watch videos of this series on Youtube.

Let us get started on this journey with an introduction to machine translation.

#### Introduction to Machine Translation

Language translation has always been a tough nut to crack. What makes it tough is the variations in structure and lexicon when one traverses from one language to the other. For this reason the problem of automated language translation or Machine translation has fascinated and inspired the best minds. Over the past decade some trailblazing advances have happened within this field. We have now reached a stage where machine translation has become quite ubiquitous. These technologies are now embedded in all our devices, mobiles, watches, desktops, tablets etc and have become an integral part of our every day life. A common example is the Google Translate service which has the capability to identify our input languge and subsequently translate it to multitudes of languages.

Machine translation technologies have transcended different approaches before reaching the state we are in at present. Let us take a quick look at the evolution of the solution landscape of machine translation.

#### Evolution of Solution landscape for Machine Translation

The journey to the current state of the art translation technologies tells a fascinating tale of the strides in machine learning.

The evolution of machine translation can be demarcated to three distinct phase. Let us look at each one of them and understand its distinct characteristics.

##### Classical Machine Translation

Classical machine translation methods relies heavily on linguisitc rules and deep domain knowledge to translate from a source language to a target language. There are three approaches under this method.

Direct Translation

Source : Speech and Language processing : Daniel Jurafsky, James H Martin: 2nd Edition.

As the name suggests this method adopts a word-to-word translation of the source language to the target language. After the word to word translation a re-ordering of the translated words are required based on linguistic rules formulated between the source language and target language.

Let us look at an example

In the above example, the first two boxes represent the source English sentence and the final translated Spanish sentences respectively. The last box is a word to word mapping of the translated Spanish sentence to its English conuterpart. We can see how the word to word translation has been transformed by re-ordering to form a coherent sentence in the target language. These transformations are aided by comprehensive linguistic rules and deep domain knowledge.

Transfer Method

In the example we saw on direct translation method, we saw how the mapping of the English words for the translated Spanish sentence had a complete different ordering from the source English sentence. Every language has such structural charachteristics inherent in them. Transfer methods looks at tapping the structural differences between different language pairs.

Unlike the direct method where there is word to word tranlation followed by re-ordering, transfer methods relies on codification of the `contrastive knowledge` i.e difference between languages, for translation from the source to the target language. Similar to the direct method, this method also relies on deep domain knowledge and codification of complex rules governing language construction.

Interlingua Method

The intelingua method works on a completely different approach to the word to word and contrastive translations methods we have already seen.

The intelingua method resonates very closely to the process by which human translators work. When translating , a human translator understands the meaning of the source sentence and translate it to the target language so that the essence of the conversation is not lost. There might not be a word to word mapping of the source sentence and translated sentence. However the meaning would remain intact. This is the principle adopted in the intelingua methods. Like the other two methods in the classical approach, intelingua method also depends on the rich codification of rules and dictionaries

The classical machine translation methods were effective for a large set of use cases. However the classical methods relied on comprehensive set of rules and large dictionaries. Building such knowledge base was a mammoth task requiring specialised skills and expertise. The complexity increased many fold when designing systems able to handle translation of multiple languages. There was a need for an approach different from the domain intensive classical techniques. This led to the rise in popularity of the statistical methods in machine translation.

Statistical Machine Translation

When we explored the classical methods we understood the over dependence on domain knowledge in creating linguistic rules and dictionaries. However it was also a fact that no amount of domain knowledge was enough to handle the intricate nuances of languages. What if phrases, idioms and specialised usages in a language do not have any parallels in another language ? In such circumstances what a linguist would do is to go for the closest match given the source language.

This idea of selecting the most probable sentence in the target languge given a sentence in source language is what is leaveraged in statistical machine translation.

Statistical methods builds probabilistic models that aims at maximizing the probability of the target sentence which best captures the essence of the source sentence. In probability terms we can represent this as

`argmaxT P(T|S)`

where T and S are the target and source languages respectively. The above form is the representation of a posterior probability as per Bayes Theorm. This is proportional to

`= argmaxT P(S|T) * P(T)`

The first term `( P(S|T) ) ` is called the translation model and can be interpreted as the likelihood of finding the source sentence given the target sentence. The second term `P(T)` is called the language model which represents the conditional probability of a word in the languge given some preceeding words.

The statistical model aims at finding the conditional probabilities of words within a corpora and using these probabilities find the best possible translation. Statistical machine translation models make use of large corpora or text available on the source and target languages. Eventhough statistical methods were effective, they also had some weaknesses. This method was predominantly focussed on phrases being translated thereby compromising the broder context of the target language. This method struggled when required to translate to a target language which was different in context from the source context. These shortcomings paved the way to advances in other methods which were more robust to retaining the context between the source and target languages.

Neural Machine Translation

Neural machine translation is a different approach where artifical neural networks are used for machine translation. In the statistical machine translation approaches we saw that it uses multiple components like the translation model and language model to do the translations.In NMT models the entire sentence is a single integrated model. In term of approach there isnt drastic deviations from the statistical approaches. However NMTs uses vector representations of words and sentences, which helps in retaining the context of the source and target sentences.

There are different approaches for machine translation using artificial neural networks. One of the earlier approach was to use a multi layer perceptron or a fully connected network for machine translation. However these models werent effective for large sequences of sentences.

Many shortfalls of the earlier approaches were addressed by the adoption of Recurrent Neural network models (RNNs) for machine translation. RNNs are those class of neural networks suited for sequence data. Languages as you know are manifestations of sequence of words with interdependencies between the words within the sequence. RNNs are capable of handling such interdependencies which made such class of models more suited for machine translation. There are different variations of Sequence models which are used for machine translation like encoder-decoder, encoder-decoder with attention etc. We will be using the encoder-decoder models for building our application and will be dealt with in greater depth in the next post.

The state of the art models for machine translation currently are the Transformer models. Transformer models make use of the concept of attention and then builds on it.

Wrapping up the discussions

In this post we introduced the landscape of machine translation approaches. We got introduced to different generations of machine translations solutions starting from the classical approaches,statistical machine translation and neural machine translation approaches.

In the next post we will dive deep into different types of sequence to sequence models and will understand different architecture choices for implementing sequence to sequence models.

We will continue our discussion in the second part of the series which is on sequence to sequence models. See you there.

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

# Applied Data Science Series : Solving a Predictive Maintenance Business Problem

JMJPFU

Over the past few months, many people have been asking me to write on what it entails to do a data science project end to end i.e from the business problem defining phase to modelling and its final deployment. When I pondered on that request, I thought it made sense. The data science literature is replete with articles on specific algorithms or definitive methods with code on how to deal with a problem. However an end to end view of what it takes to do a data science project for a specific business use case is little hard to find. From this week onward, we would be starting a new series  called the Applied Data Science Series. In this series I would be giving an end to end perspective on tackling business use cases or societal problems within the framework of Data Science. In this first article of the applied data science series we will deal with a predictive maintenance business use case. The use case involved is to predict the end life of large industrial batteries, which falls under the genre of use cases called preventive maintenance use cases.

The big picture

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture on the life cycle of a data science project.

The above figure is a depiction of the big picture on what it entails to solve a business problem from a Data Science perspective. Let us deal with each of the components end to end.

In the Beginning …… : Business Discovery

The start of any data science project is with a business problem. The problem we have at hand is to try to predict the end life of large industrial batteries. When we are encountered with such a business problem, the first thing which should come to our mind is on the key variables which will come into play . For this specific example of batteries some of the key variables which determine the state of health of batteries are conductance, discharge , voltage, current and temperature.

The next questions which we need to ask is on the lead indicators or trends within these variables, which will help in solving the business problem. This is where we also have to take inputs from the domain team. For the case of batteries, it turns out that a key trend which can indicate propensity for failure  is drop in conductance values. The conductance of batteries will drop over time, however the rate at which the conductance values drop will be accelerated before points of failure. This is a vital clue which we will have to be cognizant about when we go for detailed exploratory analysis of the variables.

The other key variable which can come into play is the discharge. When a battery is allowed to discharge the voltage will initially drop to a minimum level and then it will regain the voltage. This is called the “Coup de Fouet” effect. Every manufacturer of batteries will prescribes standards and control charts as to how much, voltage can drop and how the regaining process should be. Any deviation from these standards and control charts would mean anomalous behaviors. This is another set of indicator which will have to look out for when we explore data.

In addition to the above two indicators there are many other factors which one would have to be aware of which will indicate failure. During the business exploration phase we have to identify all such factors which are related to the business problem which we are to solve and formulate hypothesis about them. Once we formulate our hypothesis we have to look out for evidences / trends within the data about these hypothesis. With respect to the two variables which we have discussed above some hypothesis we can formulate are the following.

1. Gradual drop in conductance over time would mean normal behavior and sudden drop would mean anomalous behavior
2. Deviation from manufactured prescribed “Coup de Fouet” effect would indicate anomalous behavior

When we go about in exploring data, hypothesis like the above will be point of reference in terms of trends which we will have to look out on the variables involved. The more hypothesis we formulate based on domain expertise the better it would be at the exploratory stage. Now that we have seen what it entails within the business discovery phase, let us encapsulate our discussions on key considerations within the business discovery phase

1. Understand the business problem which we are set out to solve
2. Identify all key variables related to the business problem
3. Identify the lead indicators within these variable which will help in solving the business problem.

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective now its time to look at the data we have at hand.

And then came data ……. : Data Discovery

In the data discovery phase we have to try to understand some critical aspects about how data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following

• Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase ?
• What is the mechanism of data capture ? Does the data capture mechanism differ according to the variables ?
• What is the frequency of data capture ? Does it vary across the variables ?
• Does the volume of data captured, vary according to the frequency and variables involved ?

In the case of the battery prediction problem, there are three different data sets . These data sets pertained to different set of variables. The frequency of data collection and the volume of data captured also varies. Some of the key data sets involved are the following

• Conductance data set : Data Pertaining to the conductance of the batteries. This is collected every 2-3 days . Some of the key data points collected along with the conductance data include
• Time stamp when the conductance data was taken
• Unique identifier for each battery
• Other related information like manufacturer , installation location, model , string it was connected to etc
• Terminal voltage data : Data pertaining to Voltage and temperature of battery. This is collected every day. Key data points include
• Voltage of the battery
• Temperature
• Other related information like battery identifier, manufacturer, installation location, model, string data etc
• Discharge Data : Discharge data is collected once every 3 months. Key variable include
• Discharge voltage
• Current at which voltage discharges
• Other related information like battery identifier, manufacturer, installation location, model, string data etc

As seen, we have to play around with three very distinct data sets with different sets of variables, different frequency of time when the data points arrive and different volume of data for each of the variables involved. One of the key challenges, one would encounter is in connecting all these variables together into a coherent data set, which will help in the predictive task. It would be easier to get this done if we can formulate the predictive problem by connecting the data sets available to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.

Formulating the Predictive Problem : Connecting the dots……

To help formulate the predictive problem, let us revisit the business problem we have at hand and then connect it with the data points which we have at hand.  The predictive problem requires us to predict two things

1. Which battery will fail &
2.  Which period of time in future will the battery fail.

Since the prediction is at a battery level, our unit of reference for formulating the predictive problem is individual battery. This means that all the variables which are present across the multiple data sets have to be consolidated at the individual battery level.

The next question is, at what period of time do we have to consolidate the variables for each battery ? To answer this question, we will have to look at the frequency of data collection for each variable. In the case of our battery data set, the data points for each of the variables are capture at different intervals. In addition the volume of data collected for each of those variables at those instances of time also vary substantially.

• Conductance : One reading of a battery captured once every 3 days.
• Voltage & Temperature : 4-5 readings per battery captured every day.
• Discharge : A set of reading captured every second at different intervals of a day once every 3 months (approximately 4500 – 5000 data points collected in a day).

Since we have to predict the probability of failure at a period of time in future, we will have to have our model learn the behavior of these variables across time periods. However we have to select a time period, where we will have sufficient data points for each of the variables. The ideal time period we should choose in this scenario is every 3 months as discharge data is available only once every 3 months. This would mean that all the data points for each battery for each variable would have to be consolidated to a single record for every 3 months. So if each battery has around 3 years of data it would entail 12 records for a battery.

Another aspect we have to look at is how 3 months of data points for a battery can be consolidated to make one record corresponding to each variable. For this we have to resort to some suitable form of consolidation metric for each variable. What that consolidation metric should be can be finalized after exploratory analysis and feature engineering . We will deal with those aspects in detail when we talk about exploratory analysis and feature engineering phases.

The next important point which we have to deal with would be the labeling of the response variable. Since the business problem is to predict which battery would fail, the response variable would be classifying whether a record of a battery falls under a failure class or not. However there is a lacunae in this approach. What we want is to predict well ahead of time when a battery is likely to fail and therefore we will have to factor in the “when” part also into the classification task. This would entail, looking at samples of batteries which has actually failed and identifying the point of time when failure happened. We label that point as “failure point” and then look back in time from the failure point to classify periods leading to failure. Since the consolidation period for data points is three months, we can fix the “looking back” period also to be 3 months. This would mean, for those samples of batteries where we know the failure point, we look at the record which is one time period( 3 months) before failure and label the data as 1 period before failure, record of data which corresponds to 6 month before failure will be labelled as 2 periods before failure and so on. We can continue labeling the data according to periods before failure, till we reach a comfortable point in time ahead of failure ( say 1 year). If the comfortable period we have in mind is 1 year, we would have 4 failure classes i.e 1 period before failure, 2 periods before failure, 3 periods before failure and 4 periods before failure. All records before the 1 year period of time can be labelled as “Normal Periods”. This labeling strategy will mean that our predictive problem is a multinomial classification problem, with 5 classes ( 4 failure period classes and 1 normal period class).

The above discussed, labeling strategy is for samples of batteries within our data set which have actually failed and where we know when the failure has happened. However if we do not have information about the list of batteries which have failed and which have not failed, we have to resort to intense exploratory analysis to first determine samples of batteries which have failed and then label them according to the labeling strategy discussed above. We can discuss about how we can use exploratory analysis to identify batteries which have failed, in the next post. Needless to say, the records of all batteries which have not failed, will be labelled as “Normal Periods”.

Now that we have seen the predictive problem formulation part, let us recap our discussions so far. The predictive problem formulation step involves the following

1. Understand the business problem and formulate the response variables.
2. Identify the unit of reference to which the business problem will apply ( each battery in our case)
3. Look at the key variables related to the unit of reference and the volume and velocity at which data for these variables are generated
4. Depending on the velocity of data, decide on a data consolidation period and identify the number of records which will be present for the unit of reference.
5. From the data set, identify those units which have failed and which have not failed. Such information will generally be available from past maintenance contracts for each units.
6. Adopt a labeling strategy for both the failed units and normal units. Identify the number of classes which will be applied to all records of the units. For the failed units, label the records as failed classes till a convenient period( 1 year in this case). All records before that period will be labelled the same as the units which have not failed ( “Normal Periods”)

Wrapping up till we meet again

So far we have discussed the initial two phase of a data science project . The first phase entails defining the business problem and carrying out the business discovery. In the next phase, which is the data discovery phase, we align the available data points to the business problem and then formulate the predictive problem. Once we have a clear understanding of how the predictive problem have to be formulated our next task will be to get into exploratory analysis and feature engineering phases. These phases and the subsequent phases would be dealt in detail in the next post of this series. Watch out this space for more.

# Machine Learning in Action – Word Prediction

In my previous blog on machine learning, I explained the science behind how a machine learns from its parameters. In this week, I will delve on a very common application which we use in our day to day life – Next Word Prediction.

When we text with our smartphones all  of us would have appreciated how our phones make our typing so easy by predicting or suggesting the word which we have in mind. And many would also have noticed the fact that, our phones predict words which we tend to use regularly in our personal lexicon. Our phones have learned from our pattern of usage and is giving us a personalized offering. This genre of machine learning falls under a very potent field called the Natural Language Processing ( NLP).

Natural Language Processing, deals with ways in which machines derives its learning from human languages. The basic input within the NLP world is something called a Corpora, which essentially is a collection of words or groups of words, within the language. Some of the most prominent corpora for English are Brown Corpus, American National Corpus etc. Even Google has its own linguistic corpora with which it achieves many of the amazing features in many of its products. Deriving learning out of the corpora is the essence of NLP. In the context which we are discussing, i.e. word prediction, its about learning from the corpora to do prediction. Let us now see, how we do it.

The way we do learning from the corpora is through the use of some simple rules in probabilities. It all starts with calculating the frequencies of words or group of words within the corpora. For finding the frequencies, what we use is something called a n-gram model, where the “n” stands for the number of words which are grouped together. The most common n-gram models are the trigram and the bigram models. For example the sentence “the quick red fox jumps over the lazy brown dog” has the following word level trigrams:(Source : Wikipedia)

```the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog
```

Similarly a bi-gram model will split a given sentence into combinations of two word groups. These groups of trigrams or bigrams forms the basic building blocks for calculating the frequencies of word combinations. The idea behind calculation of frequencies of word groups goes like this. Suppose we want to calculate the frequency of the trigram “the quick red”. What we look for in this calculation is how often we find the combination of the words “the” and “quick” followed by “red” within the whole corpora. Suppose in our corpora there were other 5 instances where the words “the” and “quick” was followed by the word “red”, then the frequency of this trigram is 5.

Once the frequencies of the words are found, the next step is to calculate the probabilities of the trigram. The probability is just the frequency divided by the total number of trigrams within the corpora.Suppose there are around 500,000 trigrams in our corpora, then the probability of our trigram “the quick red” will be 5/500,000.The probabilities so calculated comes under a subjective probability model called the Hidden Markov Model(HMM).By the term subjective probability what we mean is the probability of an event happening subject to something else happening. In our trigram model context it means,the probability of seeing the word “red” subject to having preceded with words “the” and “quick”. Extending the same concept to bigrams, it would mean probability of seeing the second word subject to have seen the first word. So if “My God” is a bigram, then the subjective probability would be the probability of seeing the word “God” followed by the word “My”

The trigrams and bigrams along with the calculated probabilities arranged in a huge table forms the basis of the word prediction algorithm.The mechanism of prediction works like this. Suppose you were planning to type “Oh my God” and you typed the first word “Oh”. The algorithm will quickly go through the n-gram table and identify those n-grams starting with word “Oh” in the order of its probabilities. So if the top words in the n-gram table starting with “Oh” are “Oh come on”,”Oh my God” and “Oh Dear Lord” in decreasing order of probabilities, the algorithm will predict the words “Come” ,”my” and “Dear” as your three choices as soon as you type the first word “Oh”.After you type “Oh” you also type “my” the algorithm reworks the prediction and looks at the highest probabilities of n-gram combinations preceded with words “Oh” and “my”. In this case the word “God” might be the most probable choice which is predicted. The algorithm will keep on giving prediction as you keep on typing more and more words. At every instance of your texting process the algorithm will look at the penultimate two words you have already typed to do the prediction of the running word and the process continues.

The algorithm which I have explained here is a very simple algorithm involving n-grams and HMM models. Needless to say there are more complex models which involves more complex models like Neural Networks. I will explain about Neural Networks and its applications in a future post.

# Machine Learning: Teaching a machine to learn

In my previous post on recommendation engines, I fleetingly mentioned about machine learning. Talking about machine learning, what comes to my mind is a recent conversation I had with my uncle. He was asking me on what I was working on and I started mentioning about machine learning and data science . He listened very attentively and later on told my mother that he had absolutely no clue  what I was talking about. So  I thought it would be a good idea to try and unravel the science behind machine learning.

Let me start with an analogy. When we were all toddlers whenever we saw something new ,say a dog, our  parents would point and tell us “Look , a dog”. This is how we start to learn about things around us,from inputs such as these that we receive from our parents and elders . The science behind  machine learning works pretty similar. In this context, the toddler is the machine and the elder which teaches the machine is a bunch of  data .

In very simple terms the setup for a machine learning context works  like this. The machine is fed with a set of data. This data consists of two parts, one part is called  features and the other labels. Let me elaborate a little bit more. Suppose we are training the machine to identify the image of a dog. As a first step we feed multiple images of dogs to the machine. Each image which is fed, say a jpeg or png image, consists of millions of pixels. Each pixel in turn is composed of some value of the three primary colors Red, Blue and Green. The values of these primary colors ranges between 0 to 256. This is called the pixel intensity. For example the pixel intensity for the color orange would be (255,102,0), where 255 is the intensity of its red component, 102 its green component and 0 its blue component. Like wise, every pixel in an image will have various combinations of these primary colors.

These pixel intensities are the features of the image which are provided as inputs to the machine. Against each of these features, we also provide a class or category describing the features we provided. This is the label. This data set is our basic input. To  visualize the data set, think of it as a huge table of pixel values and its labels. If we have,say 10 pixels per image and there are 10 images. Our table will have 10 rows, corresponding to each image and for each row there would be 11 columns. The first 10 columns would correspond to  pixel values and the 11th column would be the label.

Now that we have provided the machine its data, let us look at how it learns. For this let me take you back to your school days. In your basic geometry, you would have learnt the equation of a line as Y = C + (theta * X). In this equation, the variable C is called the intercept and theta the slope of the line. These two variables govern the  properties of the line Y . The relevance of these variables is that, if we are given any other value of X, then by our knowledge of C and theta we will be able to predict or create a line. So by learning  two parameters we are in effect predicting an outcome. This  is the essence of machine learning. In a machine learning, setup the machine is made to learn the parameters from the features which is provided.Equipped with the knowledge of these parameters the machine will be able to predict the most probable values of Y(Outcomes) when new values of X(features) are provided.

In our dog identification example, the X values are the pixel intensities of the images we provided, Y denotes labels of the dogs. The parameters are learned from the provided data. If we are to give the machine new values of X’s which contain say  features of both dogs and cats, the machine will correctly identify which is a dog and which is a cat, with its knowledge of the parameters. The first set of data which we provide to the machine for it to learn parameters is called the training set and the new data which we provide for prediction is called the test set. The above mentioned genre of machine learning is called Supervised Learning. Needless to say, the earlier equation of the line is one among multiple types of algorithm used in machine learning. This type of algorithm for the line is called linear regression. There are multiple algorithms like these which enables machines to learn parameters and carry out predictions.

What I have described herewith is a very simple version of machine learning. Advances are being made in this field and scientists are trying to mimic the learning mechanism of human brain on to machines. An important and growing field aligned to this idea is called Deep Learning. I will delve on deep learning in a future post.

The power of machine learning is quite prevalent in the world around us and quite often the learning  is inconspicuous. As a matter of fact, we are all party to the training process inconspicuously. A very popular example is the photo tagging process in Facebook. When we tag pictures which we post on Facebook, we are in fact providing labels enabling a machine to learn.  Facebook’s powerful machines will extract features from the photos we tag. Next time we tag a new photo, Facebook will automatically predict the correct tag through the parameters which it has learned. So next time you tag a picture on Facebook, realize that you are also playing your part in teaching a machine to learn.