IV : Build and Deploy Data Science Products : Looking under the hood of Machine translation model – LSTM Backpropagation

“True knowledge come with deep understanding of a topic and its inner working”
Albert Einsteen

This is the fourth part of the series where we continue on our quest to understand the innerworking of a LSTM model. Deep understanding of the model is a step towards acquiring comprehensive knowledge on our machine translation application. This series comprises of 8 posts.

Understand the landscape of solutions available for machine translation
Explore sequence to sequence model architecture for machine translation.
Deep dive into the LSTM model with worked out numerical example.
Understand the back propagation algorithm for a LSTM model worked out with a numerical example.( This post)
Build a prototype of the machine translation model using a Google colab / Jupyter notebook.
Build the production grade code for the training module using Python scripts.
Building the Machine Translation application -From Prototype to Production : Inference process
Build the machine translation application using Flask and understand the process to deploy the application on Heroku

In the previous 3 posts we understood the solution landscape for machine translation ,explored different architecture choices for sequence to sequence models and did a deep dive into the forward propagation algorithm. Having understood the forward propagation its now time to explore the back propagation of the LSTM model

Back Propagation Through Time

We already know that Recurrent networks have a time component and we saw the calculations of different components during the forward propogation phase. We saw in the previous post that we traversed one time step at a time to get the expected outputs.

The backpropagation operation works in the reverse order and traverses one time step at a time in the reverse order to get the gradients of all he parameters. This process is called back propagation through time.

The initiation step of the back propagation is the error term .As you know the dynamics of optimization entails calculating the error between the predicted output and ground truth, propagating the gradient of the error to the layers thereby updating the parameters of the model. The work horse of back propagation is partial differentiation,using the chain rule. Let us see that in action.

Backpropagation calculation @ time step 2

We start with the output from our last layer which as calculated in the forward propagation stage is ( refer the figure above)

a _t = -0.141

The label for this time step is

Y_t = 0.3

In this toy example we will take a simple loss function , a squared loss. The error would be derived as the average of squared difference between the last time step and the ground truth (label).

Error = ( a_t – y_t )²/2

Before we start the back propagation it would be a good idea to write down all the equations in the order in which we will be taking the derivative.

Error = ( a_t – y_t )²/2
a_t = tanh( C_t ) * Ґ_o
C_t = Ґ_u * C^~ + Ґ_f * C_t-1
Ґ_o = sigmoid(W_o *[x_t , a_t-1] + b_o)
C^~ = tanh(W_c *[x_t , a_t-1] + b_c)
Ґ_u = sigmoid(W_u *[x_t , a_t-1] + b_u)
Ґ_f = sigmoid(W_f*[x_t , a_t-1] + b_f)

We mentioned earlier that the dynamics of backpropogation is the propogation of the gradients . But why is getting the gradients important and what information does it carry ? Let us answer these questions.

A gradient represents unit rate of change i.e the rate at which parameters have to change to get a desired reduction in error. The error which we get is dependent on how close to reality our initial assumptions of weights were. If our initial assumption of weights were far off from reality, the error also would be large and viceversa. Now our aim is to adjust our initial weights so that the error is reduced. This adjustment is done through the back propagation algorithm. To make the adjustment we need to know the quantum of adjustment and also the direction( i.e whether we have to add or subtract the adjustments from initially assumed weights). To derive the quantum and direction we use partial differentiation. You would have learned in school that partial differentiation gives you the rate of change of a variable with respect to another. In this case we need to know the rate at which the error would change when we make adjustments to our assumed parameters like weights and bias terms.

Our goal is to get the rate of change of error with respect to the weights and the biases. However if you look at our first equation

Error = ( a_t – y_t )²/2

we can see that it dosent have any weights in it. We only have the variable a_t . But we know from our forward propagation equations that a_t is derived by different operations involving weights and biases. Let us traverse downwards from the error term and trace out different trails to the weights and biases.

The above figure represents different trails ( purple,green,red and blue ) to reach the weights and biases from the error term. Let us first traverse the purple coloured trail.

The purple coloured trail is to calculate the gradients associated with the output gate. When we say gradients it entails finding the rate of change of the error term with respect to the weights and biases associated with the output gate, which in mathematical form is represented as ∂E/∂W_o. If we look down the purple trail we can see that W_o appears in the equation at the tail end of the trail. This is where we apply the chain rule of differentiation. Chain rule of differentiation helps us to differentiate the error with respect to the connecting links till we get to the terms which we want, like weights and biases. This operation can be represented using the following equation.

This image has an empty alt attribute; its file name is image.png

The green coloured trail is a longer one. Please note that the initial part of the green coloured trail, where the differentiation with respect to error term is involved ( ∂E/∂a ), is the same as the purple trail. From the second box onwards a distinct green trail takes shape and continues till the box with the update weight (W_u). The equation for this can be represented as follows

∂E/∂Wu = ∂E/∂a * ∂a/∂C_t * ∂C_t/ ∂Γu * ∂Γu/∂Wu

The other trails are similar. We will traverse through each of these trails using the numerical values we got after the forward propagation stage in the last post.

Gradients of Equation 1 : Error = ( a_t – y_t )²/2

The first step in backpropagation is to take the derivative of the error with respect to a_t

da_t = ∂E/∂a_t = ∂/∂a_t[ (a_t -y)²/2]
= 2 * (a_t-y)/ 2
= a_t-y

Let us substitute the values

Derivative 2.1.1.	Numerical Eqn	Value
da_t = a_t– y	= -0.141 – 0.3	-0.441

If you are thinking that thats all for this term then you are in for a surprise. In the case of a sequence model like LSTM there would be an error term associated with each time step and also an error term which back propagates from the previous time step. The error term for certain time step would be the sum total of both these errors. Let me explain this pictorially.

This image has an empty alt attribute; its file name is timestep_errorearlierstep.jpeg

Let us assume that we are taking the derivative of the error with respect to the output from the first time step (∂E/∂a_t-1), which is represented in the above figure as the time step on the left. Now if you notice the same output term a_t-1 is also propogated to the second time step during the forward propagation stage. So when we take the derivative there is a gradient of this output term which is with respect to the error from the second time step. This gets propogated through all the equations within the second time step using the chain rule as represented by the purple arrow. This gradient has to be added to the gradient derived from the first time step to get the final gradient for that output term.

However in the case of the top example since we are taking the gradient of the second time step and there are no further time step, this term which gets propogated from the previous time step would be 0. So ideally the equation for the derivative for the second output should be written as follows

Derivative – 1.1	Numerical Eqn	Value
da = a_t – y + 0	= -0.141 – 0.3 + 0	-0.441

In this equation ‘0’ corresponds to the gradient from the third time step, which in this case dosent exist.

Gradients of Equation 2 : [a_t = tanh( C_t ) * Ґ_o ]

Having found the gradient for the first equation which had the error term, the next step is to find the gradient of the output term a_t and its component terms C_t and Ґ_o.

Let us first differentiate it with respect to C_t. The equations for this step are as follows

∂E/∂C_t = ∂E/∂a_t * ∂a_t/∂C_t = da * ∂a_t/∂C_t

In this equation we have already found the value of the first term which is ∂E/∂a_t in the first step. Next we have to find the partial derivative of the second term ∂a_t/∂C_t

∂a_t/∂C_t = Γ_o * ∂/∂C_t [ tanh(C_t)] = Γ_o * [1 - tanh²(C_t)]

Please note => ∂/∂x [ tanh(x)]= 1 – tanh²(x)

So the complete derivation for ∂E/∂C_t is

∂E/∂C_t = da * Γ_o * [1 - tanh²(C_t)]

So the above is the derivation of the partial derivative with respect to state 2. Well not quite. There is one more term to be added to this where state 2 will appear. Let me demonstrate that. Let us assume that we had 3 time steps as shown in the table below

We can see that the term C_tappears in the 3^rd time step as circled in the table. When we take the derivative of error of time step 2 with respect to C_t we will have to take the derivative from the third time step also. However in our case since the third step doesn’t exist that term will be ‘0’ as of now. However when we take the derivative of the first time step we will have to consider the corresponding term from the second time step. We will come to that when we take the derivative of the first time step. For the time being it is ‘0’ for us now as there is no third time step.

Derivative – 2.2.1	Numerical Eqn	Value
dC_t = da * Ґ_o* (1 – tanh²(C_t ) + 0	= -0.441 * 0.24* (1-tanh²(-0.674 ) = -0.441 * 0.24* ( 1 – (-0.59 * -0.59))	-0.069

Let us now take the gradient with the second term of equation 2 which is Ґ_o. The complete equation for this term is as follows

∂E/Γ_o = ∂E/∂a_t * ∂a_t/∂Γ_o = da * ∂a_t/∂Γ_o

The above equation is very similar to the earlier derivation. However there are some nuances with the derivation of the term with Γ_o . If you remember this term is a sigmoid gate with the following equation.

Γ_o = sigmoid(W_o *[x_t , a_t-1] + b_o) = sigmoid(u) Where u = W_o *[x_t , a_t-1] + b_o

When we take the derivative of the output term with respect to Γ_o (∂a_t/∂Γ_o ), this should be with respect to the terms inside the sigmoid function ( u). So ∂a_t/∂Γ_o would actually mean ∂a_t/∂u . So the entire equation can be rewritten as

a_t = tanh(C_t) * sigmoid(u) where Γ_o = sigmoid(u).

Therefore ∂a_t/∂Γ_o = tanh(C_t) * Γ_o *( 1 - Γ_o )

Please note if y = sigmoid(x) , ∂y/∂x = y(1-y)

The complete equation for the term ∂E/Γ_o =

da * tanh(C_t) * Γ_o *( 1 - Γ_o )

Substituting the numerical terms we get

Derivative – 2.2.2	Numerical Eqn	Value
d Ґ₀ = da tanh(C_t) Ґo *(1 – Ґo)	= -0.441 * tanh(-0.674) * 0.24 *(1-0.24)	0.047

Gradients of Equation 3 : [ C_t = Ґ_u * C^~ + Ґ_f * C_t-1 ]

Let us now find the gradients with respect to the third equation

This equation has 4 terms, Ґ_u, C^~ , Ґ_f and C_t-1 , for which we have to calculate the gradients. Let us start from the first term Ґ_u whose equation is the following

∂E/Γ_u = ∂E/∂a_t * ∂a_t/∂C_t * ∂C_t/ ∂Γ_u

However the first two terms of the above equation, ∂E/∂a_t * ∂a_t/∂C_t were already calculated in derivation 2.1, which can be represented as dC_t . The above equation can be re-written as

∂E/Γ_u = dC_t * ∂C_t/ ∂Γ_u

From the new equation we are left with the second part of the equation which is ∂C_t / ∂Γ_u ,which is the derivative of equation 3 with respect to Γ_u.

C_t = Ґ_u * C~ + Ґ_f * C_t-1.......... (3)

∂C_t/Γ_u = C~ * ∂/ ∂Γ_u [ Ґ_u ] + 0.........(3.1)

In equation 3 we can see that there are two components, one with the term Ґ_u in it and the other with Ґ_f in it. The partial derivative of the first term which is partial derivative with respect to Ґ_u is what is represented in the first half of equation 3.1. The partial derivative of second half which is the part with the gate Ґ_f will be 0 as there is no Ґ_u in it. Equation 3.1 represents the final form of the partial derivative.

Now similar to derivative 2.2 which we developed earlier, the partial derivative of the sigmoid gate ∂/ ∂Γ_u [ Ґ_u ] will be Γ_u * ( 1 - Γ_u ). The final form of equation 3.1 would be

∂C_t/Γ_u = C~ * Γ_u * ( 1 - Γ_u )

The solution for the gradient with respect to the update gate would be

∂E/Γ_u = dC_t* C~ * Γ_u * ( 1 - Γ_u )

Derivative 2.3.1	Numerical Eqn	Value
d Ґ_u = dC_t C^~ Ґ_u *(1 – Ґ_u)	= -0.069 * -0.63 * 0.755 *(1-0.755)	0.0080

Next let us calculate the gradient with respect to the internal state C^~ . The complete equation for this is as follows

∂E/C^~ = ∂E/∂a_t * ∂a_t/∂C_t * ∂C_t/ ∂C^~

= dC_t * ∂C_t/ ∂C^~

= dC_t * ∂/ ∂C^~ [ Ґ_u * C~]

= dC_t * Ґ_u * ∂/ ∂C^~ [ C~]

However we know that, C^~ = tanh(Wc *[x , a] + bc) . Let us represent the terms within the tanh function as u .

C^~ = tanh(u), where u = Wc *[x , a] + bc

Similar to the derivations we have done for the sigmoid gates, we take the derivatives with respect to the terms within the tanh() function which is ‘u’ . Therefore

∂/ ∂C^~ [ C~] = 1-tanh² (u)

= 1 - tanh² (Wc *[x , a] + bc)

= 1 - (C^~ )²

since, C^~ = tanh(Wc *[x , a] + bc) and therefore

(C^~ )² = tanh² (Wc *[x , a] + bc)

The final equation for this term would be

∂E/C^~ = dC_t * Ґ_u * 1 - (C^~ )²

Derivative 2.3.2	Numerical Eqn	Value
dC^~ = dC_t* Ґ_u* (1 – (C^~)²)	= -0.069 * 0.755 * (1-(-0.63)²)	-0.0314

The gradient with respect to the third term Ґ_f would be very similar to the derivation of the first term Ґ_u

∂E/Γ_f = ∂E/∂a_t * ∂a_t/∂C_t* ∂C_t/ ∂Γ_f

= dC_t* C_t-1 * Γ_f * ( 1 - Γ_f )

Derivative 2.3.3	Numerical Eqn	Value
d Ґ_f = dC_t* C_t-1* Ґ_f *(1 – Ґ_f)	= -0.069 * -0.33 * 0.60 *(1-0.60)	0.0055

Finally we come to the gradient with respect to the fourth term C_t-1 ,the equation for which is as follows

∂E/C_t-1 = ∂E/∂a_t * ∂a_t/∂C_t* ∂C_t/ ∂C_t-1

= dC_t* Γ_f

Derivative 2.3.4	Numerical Eqn	Value
dC_t-1 = dC_t *Ґ_f	= -0.069 * 0.60	-0.0414

In this step we have got the gradient of cell state 1 which will come in handy when we find the gradients of time step 1.

Gradients of previous time step output (a_t-1)

In the previous step we calculated the gradients with respect to equation 3. Now it is time to find the gradients with respect to the output from time step 1 , a_t-1.

However one fact which we have to be cognizant is that a_t-1 is present in 4 different equations, 4,5,6 & 7. So we have to take derivative with respect to all these equations and then sum it up. The gradient of a_t-1 within equation 4 is represented as below

∂E/∂a_t-1 = ∂E/∂a_t * ∂a_t/∂C_t * ∂C_t/ ∂Γ_o * ∂Γ_o/∂a_t-1

However we have already found the gradient of the first three terms of the above equation, which is with respect to the terms within the sigmoid function i.e ‘u’ as dΓ_o in derivative 2.1. Therefore the above equation can be simplified as

∂E/∂a_t-1 = dΓ_o * ∂Γ_o/∂a_t-1

The term ∂Γ_o/∂a_t-1 in reality is ∂u/∂a_t-1 where u = W_o *[x , a_t-1] + b_o, because when we took the derivative of Γ_o we took it with respect to all the terms within the sigmoid() function,which we called as ‘u’.

From the above equation the derivative will take the form

∂Γ_o/∂a_t-1 = ∂u/∂a_t-1 = W_o

The complete equation from the gradient is therefore

∂E/∂a_t-1 = dΓ_o * W_o

There are some nuances to be taken care in the above equation since there is a multiplication by W_o . When we looked at the equation for the forward pass we saw that to get the equation of the gates, we originally had two weights, one for the x term and the other for the ‘a’ term as below

Ґ_o = sigmoid(W_o*[x ] + U_o* [a] + b_o)

This equation was simplified by concatenating both the weight parameters and the corresponding x & a vectors to a form given below.

Ґ_o = sigmoid(W_o*[x , a] + b_o)

So in effect there is a part of the final weight parameter W_o which is applied to ‘x’ and another part which is applied to ‘a’. Our initial value of the weight parameter W_o was [-0.75 ,-0.95 , -0.34]. The first two values are the values corresponding to ‘x’ as it is of dimension 2 and the last value ( -0.34) is what is applicable for ‘a’. So in our final equation for the gradient of a_t-1, ∂E/∂a_t-1 = dΓ_o * W_o , we will multiply dΓ_o only with -0.34.

Similar to the above equation we have to take the derivative of a_t-1for all other equations 5,6 and 7 which will take the form

Equation 5 = > ∂E/∂a_t-1 = dC^~* W_c

Equation 6= > ∂E/∂a_t-1 = dΓ_u* W_u

Equation 7 = > ∂E/∂a_t-1 = dΓ_f* W_f

The final equation of will be the sum total of all these components

Derivative 2.4.1	Numerical Eqn	Value
da_t-1 = W_{o }dҐ_o + W_c dC^~ + W_u * dҐ_u + W_f * dҐ_f	= [(-0.34 * 0.047) + (-0.13 * -0.0314) + (1.31 * 0.0080) + (-0.13*0.0055) ]	-0.00213

Now that we have calculated the gradients of all the components for time step 2, let us proceed with the calculations for time step 1

Back Propagation @ time step 1.

All the equations and derivations for time step 1 is similar to time step 2. Let us calculate the gradients of all the equations as we did with time step 1

Gradients of Equation 1 : Error term

Gradient with respect to error term of the current time step

Derivative 1.1.1	Numerical Eqn	Value
da_t-1 = a_t-1 – y_t-1	= -0.083 – 0.8	-0.883

However we know that da_t-1 = Gradient with respect current layer + Gradient from next time step, as shown in the figure below

Gradient propagated from the 2^nd layer is the derivative of a_t-1which was derived as the last step of the previous time step ( Derivative 2.4.1)

Total Gradient

Derivative 1.1.1	Numerical Eqn	Value
da_t-1= Gradient from this layer + Gradient from previous layer	= -0.883 + -0.00213	-0.88513

Gradients of Equation 2

Next we have to find the gradients of equation 2 with respect to the cell state, C_t-1and Ґ_o.When deriving gradient of cell state state we discussed that the cell state of the current layer appears in the next layer also, which will have to be considered. So the total derivative would be

Derivative 1.2.1	Formulae
dC_t-1 = dC_t-1 in current time step + dC_t-1 from next time step	dC_t-1 = da * Ґ_o* (1 – tanh²(C_t-1 ) + dC_t-1 *Ґ_f( Derivative 3.4)

Derivative 1.2.1	Numerical Eqn	Value
dC_t-1 = da * Ґ_o* (1 – tanh²(C_t-1 ) + dC_t-1in next layer	= -0.88513 * 0.26 * (1-tanh²(-0.33 ) + (-0.0414) = -0.88513* 0.26 * ( 1 – (-0.319 * -0.319)) – 0.0414	-0.25

Next is the gradient with respect to Ґ_o .

Derivative 1.2.2	Numerical Eqn	Value
d Ґ₀ = da tanh(C_t-1) Ґo *(1 – Ґo)	= -0.88513 * tanh(-0.33) * 0.26 *(1-0.26)	0.054

Gradients of Equation 3

Derivative with respect to Ґ_u

Derivative 1.3.1	Numerical Eqn	Value
d Ґ_u = dC_t-1 C^~ Ґ_u *(1 – Ґ_u)	= -0.25 * -0.39 * 0.848 *(1-0.848)	0.013

Derivative with respect to C^~

Derivative 1.3.2	Numerical Eqn	Value
dC^~ = dC_t-1 * Ґ_u* (1 – (C^~)²)	= -0.25 * 0.848 * (1-(-0.39)²)	-0.18

Derivative with respect to Ґ_f

Derivative 1.3.3	Numerical Eqn	Value
d Ґ_f = dC_t-1 C₀ Ґ_f *(1 – Ґ_f)	= -0.25 * 0 * 0.443 *(1-0.443)	0

Derivative with respect to initial cell state C^<0>

Derivative 1.3.4	Numerical Eqn	Value
dC₀ = dC_t-1 *Ґ_f	= -0.25 * 0.443	-0.11

Gradients of initial output (a₀)

Similar to the previous time step this has 4 components pertaining to equations 4,5,6 & 7

Derivative 1.4.1	Numerical Eqn	Value
da₀ = W_{o }dҐ_o + W_c dC^~ + W_u * dҐ_u + W_f * dҐ_f	= [(-0.34 * 0.054) + (-0.13 * -0.18) + (1.31 * 0.013) + (-0.13*0) ]	0.022

Now that we have completed the gradients for both time steps let us tabularize the results of all the gradients we have got so far

Eqn	Gradients	Values
2.1.1	da_t = a_t – y_t + 0	0.441
2.2.1	dC_t = da_t * Ґo * (1 – tanh2(C_t ) + 0	-0.069
2.2.2	d Ґ0 = da_t * tanh(C_t )* Ґo *(1 – Ґo)	0.047
2.3.1	d Ґu = dC_t * C~ * Ґu *(1 – Ґu)	0.0080
2.3.2	dC~ = dC_t * Ґu * (1 – (C~)2)	-0.0314
2.3.3	d Ґf = dC_t * C_t-1 * Ґf *(1 – Ґf)	0.0055
2.3.4	dC_t-1 = dC_t * Ґf	-0.0414
2.4.1	da_t-1 = Wo * dҐo + Wc * dC~ + Wu * dҐu + Wf * dҐf	-0.00213
1.1.1	da_t-1 = a_t-1 – y_t-1 + eq (2.4.1)	-0.88513
1.2.1	dC_t-1 = da_t-1 * Ґo * (1 – tanh2(C_t-1 ) + eq 2.3.4	-0.25
1.2.2	d Ґ0 = da_t-1 * tanh(C_t-1 )* Ґo *(1 – Ґo)	0.054
1.3.1	d Ґu = dC_t-1* C~ * Ґu *(1 – Ґu)	0.013
1.3.2	dC~ = dC_t-1* Ґu * (1 – (C~)2)	-0.18
1.3.3	d Ґf = dC_t-1 * C₀ * Ґf *(1 – Ґf)	0
1.3.4	dC₀ = dC_t-1 * Ґf	-0.11
1.4.1	da₀ = Wo * dҐo + Wc * dC~ + Wu * dҐu + Wf * dҐf	0.022

Gradients with respect to weights

The next important derivative which we have to derive is with respect to the weights. We have to remember that the weights of an LSTM is shared across all the time steps. So the derivative of the weight will be the sum total of the derivatives from each individual time step. Let us first define the equation for the derivative of one of the weights, W_u.

The relevant equation for this is the following

The first three terms of the gradient is equal to dҐ_u which was already derived through equations 2.3.1 and 1.3.1 in the table above. Also remember dҐ_u is the gradient with respect to the terms inside the sigmoid function ( i.e W_u *[x_t , a_t-1] + b_u) . Therefore the derivative of the last term, ∂Γ_u/∂W_u would be

∂Γ_u/∂W_u = [x_t , a_t-1]

The complete equation for the gradient with respect to the weight W_u would be

∂E/∂W_u = dҐ_u * [x_t , a_t-1]

The important thing to note in the above equation is the dimensions of each of the terms. The first term, dҐ_u, is a scalar of dimension (1,1) , however the second term is a vector of dimension (1 ,3) . The resultant gradient would be another vector of dimension (1,3) as the scalar value will be multiplied with all the terms of the vector. We will come to that shortly. However for now let us find the gradients of all other weights.The derivation for other weights are similar to the one we saw. The equations for the gradients with respect to all weights for time step 1 are as follows

Derivative	Equation
dW_f	= dҐ_{f }[x_t , a_t-1]*
dW_o	= dҐo_*[x_t , a_t-1]
dW_c	= dC^~_*[ x_t , a_t-1]
dW_u	= dҐ_{u }[x_t , a_t-1]*

The total weight derivative would be sum of weight derivatives of all the time steps.

dW = dW₁ + dW₂

As discussed above to find the total gradient it would be convenient and more efficient to stack all these equations in a matrix form and then multiply it with the input terms and then adding them across the different time steps. This operation can be represented as below

Let us substitue the numberical values and calculate the gradients for the weights

This image has an empty alt attribute; its file name is image-19.png

The matrix multiplication will have the following values

This image has an empty alt attribute; its file name is image-25.png

The final gradients for all the weights are the following

This image has an empty alt attribute; its file name is image-24.png

Gradients with respect to biases

The next task is to get the gradients of the bias . The derivation of the gradients for bias is similar to that of the weights. The equation for the bias terms would be as follows

Similar to what we have done for the weights, the first three terms of the gradient is equal to dҐ_u . The derivative of the fourth term which are the terms inside the sigmoid function ( i.e W_u *[x_t , a_t-1] + b_u) will be

∂Γ_u/∂b_u = 1

The complete equation for the gradient with respect to the weight W_u would be

∂E/∂b_u = dҐ_u * 1

The final gradient of the bias term would be the sum of the gradients of the first time step and the second. As we have seen in case of the weights, the matrix form would be as follows

This image has an empty alt attribute; its file name is image-27.png

The final gradients for the bias terms are

This image has an empty alt attribute; its file name is image-28.png

Weights and bias updates

Calculating the gradients using back propagation is not an end by itself. After the gradients are calculated, they are used to update the initial weights and biases .

The equation is as follows

W_new = W_old - α * Gradients

Here α is a constant which is the learning rate. Let us assume it to be 0.01

The new weights would be as follows

Similarly the updated bias would be

This image has an empty alt attribute; its file name is image-3.png

As you would have learned, these updated weights and bias terms would take the place of the initial weights and biases in the next forward pass and then the back progagation again kicks in to calculate the new set of gradients which will be applied on the updated weights and gradients to get the new set of parameters. This process will continue till the pre-defined epochs.

Error terms when softmax is used

The toy example which we saw just now has a squared error term, which was used for backpropagation. The adoption of such an example was only to demonstrate the concepts with a simple numerical example. However in the problem which we are dealing with or for that matter many of the problems which we will deal with will have a softmax layer as its final layer and thereafter cross entropy as its error function. How would the backpropogation derivation differ when we have a different error term than what we have just done in the toy example ? The major change would be in terms of how the error term is generated and how the error is propogated till the output term a_t. After this step the flow will be the same as what we we have seen earlier.

Let us quickly see an example for a softmax layer and a cross entropy error term. To demonstrate this we will have to revisit the forward pass from the point we generate the output from each layer.

The above figure is a representation of the equations for the forward pass and backward pass of an LSTM. Let us look at each of those steps

Dense Layer

We know that the output layer a_t from a LSTM will be a vector with dimension equal to number of units of the LSTM. However for the application which we are trying to build the output we require is the most probable word in the target vocabulary. For example if we are translating from German to English, given a German sentence, for each time step we need to predict a corresponding English word. The prediction from the final layer would be in the form of a probability distribution over all the words in the English vocabulary we have in our corpus.

This image has an empty alt attribute; its file name is image-14.png

Let us look at the above representation to understand this better. We have an input German sentence 'Wie geht es dir' which translates to 'How are you'. For simiplicity let us assume that there are only 3 words in the target vocabulary ( English vocabulary). Now the predictions will have to be a probability distribution over the three words over the target vocabulary and the index which has the highest probability will be the prediction. In the prediction we see that the first time step has the maximum probability ( 0.6) on the second index which corresponds to the word ‘How’ in the vocabulary. The second and third time steps have maximum probability on the first and third indexes respectively giving us the predicted string as ‘How are you’. ( Please note that in the figure above the index of the probability is from bottom to top, which means the bottom most box corresponds to index 1 and top most to index 3)

Coming back to our equation, the output layer a_t is only a vector and not a probability distribution. To get a probability distribution we need to have a dense layer and a final softmax layer. Let us understand the dynamics of how the conversion from the output layer to the probability distribution happens.

The first stage is the dense layer where the output layer vector is converted to a vector with the same dimension as of the vocabulary. This is achieved by the multiplication of the output vector with the weights of the dense layer. The weight matrix will have the dimension [ length of vocabulary , num of units in output layer]. So if there are 3 words in the vocabulary and one unit in the output layer then weight matrix will be of dimension [ 3, 1], ie it has 3 rows and one column. Another way of seeing this dimension is each row of the weight matrix corresponds to each word in the vocabulary. The dense layer would be derived by the dot product of weight matrix with the output layer

Z = W_y* a_t

The dimensions of the resultant vector will be as follows

[3,1] * [1,1] => [3,1]

The resultant vector Z after the dense layer operation will have 3 rows and 1 column as shown below.

This vector is still not a probability distribution. To convert it to a probability distribution we take the softmax of this dense layer and the resultant vector will be a probability distribution with the same dimension.

This image has an empty alt attribute; its file name is image-20.png

The resultant probability distribution will be called as Y^{^} which will have three components ( equal to the dimension of the vocabulary), each component will be the probability of the corresponding word in the vocabulary

This image has an empty alt attribute; its file name is image-21.png

Having seen the forward pass let us look at how the back propogation works. Let us start with the error term which in this case will be cross entropy loss as this is a classification problem. The cross entropy loss will have the form

This image has an empty alt attribute; its file name is image-29.png

In this equation the term y is the true label in one hot encoded form. So if the first index ( y₁) is the true label for this example the label vector in one hot encoded format will be

This image has an empty alt attribute; its file name is image-34.png

Now let us get into the motions of backpropogation. We need to back propogate till a_t after which the backpropogation equation will be the same as what we derived in the toy example. The complete back propogation equation till a_t according to the chain rule will be

Let us look at the derivations term by term

Back propogation derivation for first term

The first equation ∂E/∂Y^{^} is a differentiation with respect to a vector as Y^{^} ,having three components, Y₁ , Y₂ and Y₃ . A differentiation with a vector is called a Jacobian which will again be a vector of the same dimension as Y.

Let us look at deriving each of these terms within the vector

This image has an empty alt attribute; its file name is image-30.png

Please note ∂/∂y(Logy) = 1/y

Similarly we get

This image has an empty alt attribute; its file name is image-31.png

So the Jacobian of ∂E/∂Y^{^} will be

This image has an empty alt attribute; its file name is image-32.png

Suppose the first label (y₁) is the true label. Then the one hot encoded form which is [ 1 0 0 ] will make the above Jacobian

This image has an empty alt attribute; its file name is image-36.png

Back propogation derivation for second term

Let us do the derivation for the second term ∂Y^{^}/∂Z . This term is a little more interesting as both the Y and Z are vectors. The differentiation will result a Jacobian matrix of the form

This image has an empty alt attribute; its file name is image-38.png

Let us look at the first row and get the derivatives first. To recap let us look at the equations involved in the derivatives

This image has an empty alt attribute; its file name is image-40.png

Let us take the derivative of the first term ∂Y₁^{^}/∂Z₁ . This term will be the derivative of the first element in the matrix

This image has an empty alt attribute; its file name is image-41.png

Taking the derivation based on the division rule of differentiation we get

This image has an empty alt attribute; its file name is image-44.png

Please note ∂/∂y(e^y) = e^y

Taking the common terms in the numerator and denominator and re-arranging the equation we get

This image has an empty alt attribute; its file name is image-45.png

Dividing through inside the bracket we get

This image has an empty alt attribute; its file name is image-47.png

Which can be simplified as

This image has an empty alt attribute; its file name is image-48.png

Since

This image has an empty alt attribute; its file name is image-49.png

Let us take the derivative of the second term ∂Y₁^{^}/∂Z₂ . This term will be the derivative of the first element in the matrix with respect to Z₂

This image has an empty alt attribute; its file name is image-50.png

With these two derivations we can get all the values of the Jacobian. The final form of the Jacobian would be as follows

This image has an empty alt attribute; its file name is image-57.png

Well that was a long derivation. Now to get on to the third term.

Back propogation derivation for third term

Let us do the derivation for the last term ∂Z/ ∂a_t . We know from the dense layer we have

Z = W_y* a_t

So in vector form this will be

This image has an empty alt attribute; its file name is image-53.png

So when we take the derivation we get another Jacobian vector of the form

This image has an empty alt attribute; its file name is image-54.png

So thats all in this derivation. Now let us tie everything together to get the derivative with respect to the output term using the chain rule

Gradient with respect to the output term

We earlier saw that the equation of the gradient as

Let us substitute with the derivations which we already found out

This image has an empty alt attribute; its file name is image-58.png

The dot product of the first two terms will get you

This image has an empty alt attribute; its file name is image-59.png

The dot product of the above term with the last vector will give you the result you want

This image has an empty alt attribute; its file name is image-61.png

This is the derivation till ∂E/∂a_t . The rest of the derivation down from here to various components inside the LSTM layer will be the same as we have seen earlier in the toy example.

In terms of dimensions let us convince ourselves that we get the dimension equal to the dimension of ∂a_t

[1 , 3 ] * [3, 3] * [3, 1] ==> [1,1]

Wrapping up

That takes us to the end of the “Looking inside the hood” sessions for our model. In the two sessions we saw the forward propagation part of the LSTM cell and also derived the backward propagation part of the LSTM using toy examples. These examples are aimed at giving you an intuitive sense of what is going on inside the cells. Having seen the mathematical details, let us now get into real action. In the next post we will build our prototype using python on a Jupyter notebook. We will be implementing the encoder decoder architecture using LSTM. Having equipped with the nuances of the encoder decoder architecture and also the inner working of the LSTM you would be in a better position to appreciate the models which we will be using to build our Machine translation application.

Go to article 5 of this series : Building the prototype using Jupyter notebook

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, I would recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning