Building Self Learning Recommendation system – III : Recommendation System as a K-armed Bandit

This is the third post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 8 posts where in we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem
  3. Self learning recommendation system as a K-armed bandit ( This post )
  4. Build the prototype of the self learning recommendation system: Part I
  5. Build the prototype of the self learning recommendation system: Part II
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

Introduction

In our previous post we implemented couple of experiments with K-armed bandit. When we discussed the idea of the K-armed bandits from the context of recommendation systems, we briefly touched upon the idea that the buying behavior of a customer depends on the customers context. In this post we will take the idea of the context forward and how the context will be used to build the recommendation system using the K-armed bandit solution.

Defining the context for customer buying

When we discussed about reinforcement learning in our first post, we learned about the elements of a reinforcement learning setting like state, actions, rewards etc. Let us now identify these elements in the context of the recommendation system we are building.

State

When we discussed about reinforcement learning in the first post, we learned that when an agent interacts with the environment at each time step, the agent manifests a certain state. In the example of the robot picking trash the different states were that of high charge or low charge. However in the context of the recommendation system, what would be our states ? Let us try to derive the states from the context of a customer who makes an online purchase. What would be those influencing factors which defines the product the customer buys ? Some of these are

  • The segment the customer belongs
  • The season or time of year the purchase is made
  • The day in which purchase is made

There could be many other influencing factors other than this. For simplicity let us restrict to these factors for now. A state could be made from the combinations of all these factors. Let us arrive at these factors through some exploratory analysis of the data

The data set we would be using is the online retail data set available in the UCI Machine learning library. We will download the data and the place it in local folder and the read the file from the local folder.

import numpy as np
import pandas as pd
from dateutil.parser import parse

Lines 1-3 imports all the necessary packages for our purpose. Let us now load the data as a pandas data frame

# Please use the path to the actual data
filename = "data/Online Retail.xlsx"
# Let us load the customer Details
custDetails = pd.read_excel(filename, engine='openpyxl')
custDetails.head()
Figure 1: Head of the retail data set

In line 5, we load the data from disk and then read the excel shee using the ‘openpyxl’ engine. Please note to pip install the ‘openpyxl’ package if not available.

Let us now parse the date column using date parser and extract information from the date column.

#Parsing  the date
custDetails['Parse_date'] = custDetails["InvoiceDate"].apply(lambda x: parse(str(x)))
# Parsing the weekdaty
custDetails['Weekday'] = custDetails['Parse_date'].apply(lambda x: x.weekday())
# Parsing the Day
custDetails['Day'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%A"))
# Parsing the Month
custDetails['Month'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%B"))
# Getting the year
custDetails['Year'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%Y"))
# Getting year and month together as one feature
custDetails['year_month'] = custDetails['Year'] + "_" +custDetails['Month']
# Feature engineering of the customer details data frame
# Get the date  as a seperate column
custDetails['Date'] = custDetails['Parse_date'].apply(lambda x: x.strftime("%d"))
# Converting date to float for easy comparison
custDetails['Date']  = custDetails['Date'] .astype('float64')
# Get the period of month column
custDetails['monthPeriod'] = custDetails['Date'].apply(lambda x: int(x > 15))

custDetails.head()
Figure 2 : Parsed Data

As seen from line 11 we have used the lambda() function to first parse the ‘date’ column. The parsed date is stored in a new column called ‘Parse_date’. After parsing the dates first, we carry out different operations, again using the lambda() function on the parsed date. The different operations we carry out are

  1. Extract weekday and store it in a new column called ‘Weekday’ : line 13
  2. Extract the day of the week and store it in the column ‘Day’ : line 15
  3. Extract the month and store in the column ‘Month’ : line 17
  4. Extract year and store in the column ‘Year’ : line 19

In line 21 we combine the year and month to form a new column called ‘year_month’. This is done to enable easy filtering of data, based on the combination of a year and month.

We make some more changes from line 24-28. In line 24, we extract the date of the month and then convert it into a float type in line 26. The purpose of taking the date is to find out which of these transactions have happened before 15th of the month and which after 15th. We extract those details in line 28, where we create a binary points ( 0 & 1) as to whether a date falls in the last 15 days or the first 15 days of the month.

We will also create a column which gives you the gross value of each puchase. Gross value can be calculated by multiplying the quantity with unit price. After that we will consolidate the data for each unique invoice number and then explore some of the elements of states which we want to explore

# Creating gross value column
custDetails['grossValue'] = custDetails["Quantity"] * custDetails["UnitPrice"]
# Consolidating accross the invoice number for gross value
retailConsol = custDetails.groupby('InvoiceNo')['grossValue'].sum().reset_index()
print(retailConsol.shape)
retailConsol.head()
Figure 3: Aggregated Data

Now that we have got the data consolidated based on each invoice number, let us merge the date related features from the original data frame with this consolidated data. We merge the consolidated data with the custDetails data frame and then drop all the duplicate data so that we get a record per invoice number, along with its date features.

# Merge the other information like date, week, month etc
retail = pd.merge(retailConsol,custDetails[["InvoiceNo",'Parse_date','Weekday','Day','Month','Year','year_month','monthPeriod']],how='left',on='InvoiceNo')
# dropping ALL duplicate values
retail.drop_duplicates(subset ="InvoiceNo",keep = 'first', inplace = True)
print(retail.shape)
retail.head()
Figure 4 : Consolidated data

Let us first look at the month wise consolidation of data and then plot the data. We will use a functions to map the months to its index position. This is required to plot the data according to months. The function ‘monthMapping‘, maps an integer value to the month and then sort the data frame.

# Create a map for each month
def monthMapping(mnthTrend):
    # Get the map
    mnthMap = {"January": 1, "February": 2,"March": 3, "April": 4,"May": 5, "June": 6,"July": 7, "August": 8,"September": 9, "October": 10,"November": 11, "December": 12}
    # Create a new feature for month
    mnthTrend['mnth'] = mnthTrend.Month
    # Replace with the numerical value
    mnthTrend['mnth'] = mnthTrend['mnth'].map(mnthMap)
    # Sort the data frame according to the month value
    return mnthTrend.sort_values(by = 'mnth').reset_index()

We will use the above function to consolidate the data according to the months and then plot month wise grossvalue data

mnthTrend = retail.groupby(['Month'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)
# sort the months in the right order
mnthTrend = monthMapping(mnthTrend)
sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=mnthTrend, x='Month', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can see that there is sufficient amount of variability of data month on month. So therefore we will take months as one of the context items on which the states can be constructed.

Let us now look at buying pattern within each month and check how the buying pattern is within the first 15 days and the latter half

# Aggregating data for the first 15 days and latter 15 days
fortnighTrend = retail.groupby(['monthPeriod'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)

sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=fortnighTrend, x='monthPeriod', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can see that there is as small difference between buying patterns in the first 15 days of the month and the latter half of the month. Eventhough the difference is not significant, we will still take this difference as another context.

Next let us aggregate data as per the days of the week and and check the trend

# Aggregating data accross weekdays
dayTrend = retail.groupby(['Weekday'])['grossValue'].agg('mean').reset_index().sort_values(by = 'grossValue',ascending = False)

sns.set(rc = {'figure.figsize':(20,8)})
sns.lineplot(data=dayTrend, x='Weekday', y='grossValue')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

We can also see that there is quite a bit of variability of buying patterns accross the days of the week. We will therefore take the week days also as another context

So far we have observed 4 different features, which will become our context for recommending products. The context which we have defined would act as the states from the reinforcement learning setting perspective. Let us now look at the big picture of how we will formulate the recommendation task as reinforcement learning setting.

The Big Picture

Figure 5: The Big Picture

We will now have a look at the big picture of this implementation. The above figure is the representation of what we will implement in code in the next few posts.

The process starts with the customer context, consisting of segment, month, period in the month and day of the week. The combination of all the contexts will form the state. From an implementation perspective we will run simulations to generate the context since we do not have a real system where customers logs in and thereby we automatically capture context.

Based on the context, the system will recommend different products to the customer. From a reinforcement learning context these are the actions which are taken from each state. The initial recommendation of products ( actions taken) will be based on the value function learned from the historical data.

The customer will give rewards/feedback based on the actions taken( products recommended ). The feedback would be the manifestation of the choices the customer make. The choice the customer makes like the products the customer buys, browses and ignores from the recommended list. Depending on the choice made by the customer, a certain reward will be generated. Again from an implementation perspective, since we do not have real customers giving feedback, we will be simulating the customer feedback mechanism.

Finally the update of the value functions based on the reward generated will be done based on the simple averaging method. Based on the value update, the bandit will learn and adapt to the affinities of the customers in the long run.

What next ?

In this post we explored the data and then got a big picture of what we will implement going forward. In the next post we will start implementing these processes and building a prototype using Jupyter notebook. Later on we will build an application using Python scripts and then explore options to deploy the application. Watch out this space for more.

 The next post will be published next week. Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest Git hub repository

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

Building Self Learning Recommendation system using Reinforcement Learning – II : The bandit problem

This is the second post of our series on building a self learning recommendation system using reinforcement learning. This series consists of 7 posts where in we progressively build a self learning recommendation system.

  1. Recommendation system and reinforcement learning primer
  2. Introduction to multi armed bandit problem ( This post )
  3. Self learning recommendation system as a bandit problem
  4. Build the prototype of the self learning recommendation system: Part I
  5. Build the prototype of the self learning recommendation system: Part II
  6. Productionising the self learning recommendation system: Part I – Customer Segmentation
  7. Productionising the self learning recommendation system: Part II – Implementing self learning recommendation
  8. Evaluating different deployment options for the self learning recommendation systems.

Introduction

Figure 1 : Reinforcement Learning Setting

In our previous post we introduced different types of recommendation systems and explored some of the basic elements of reinforcement learning. We found out that reinforcement learning problems evaluates different actions when the agent is in a specific state. The action taken generates a certain reward. In other words we get a feedback on how good the action was based on the reward we got. However we wont get the feed back as to whether the action taken was the best available. This is what contrasts reinforcement learning from supervised learning. In supervised learning the feed back is instructive and gives you the quantum of the correctness of an action based on the error. Since reinforcement learning is evaluative, it depends a lot on exploring different actions under different states to find the best one. This tradeoff between exploration and exploitation is the bedrock of reinforcement learning problems like the K armed bandit. Let us dive in.

The Bandit Problem.

In this section we will try to understand K armed bandit problem setting from the perspective of product recommendation.

A recommendation system recommends a set of products to a customer based on the customers buying patterns which we call as the context. The context of the customer can be the segment the customer belongs to, the period in which the customer buys, like which month, which week of the month, which day of the week etc. Once recommendations are made to a customer, the customer based on his or her affinity can take different type of actions i.e. (i) ignore the recommendation (ii) click on the product and further explore (iii) buy the recommended product. The objective of the recommendation system would be to recommend those products which are most likely to be accepted by the customer or in other words maximize the value from the recommendations.

Based on the recommendation example let us try to draw parallels to the K armed bandit. The K-armed bandit is a slot machine which has ‘K’ different arms or levers. Each pull of the lever can have a different outcome. The outcomes can vary from no payoff to winning a jackpot. Your objective is to find the best arm which yields the best payoff through repeated selection of the ‘K’ arms. This is where we can draw parallels’ between armed bandits and recommendation systems. The products recommended to a customer are like the levers of the bandit. The value realization from the recommended products happens based on whether the customer clicks on the recommended product or buys them. So the aim of the recommendation system is to identify the products which will generate the best value i.e which will very likely be bought or clicked by the customer.

Figure 2 : Recommendation system as K lever bandits

Having set the context of the problem statement , we will understand in depth the dynamics of the K-armed bandit problem and couple of solutions for solving them. This will lay the necessary foundation for us to try this in creating our recommendation system.

Non-Stationary Armed bandit problem

When we discussed about reinforcement learning we learned about the reward function. The reward function can be of two types, stationary and non-stationary. In stationary type the reward function will not change over time. So over time if we explore different levers we will be able to figure out which lever gives the best value and stick to it. In contrast,in the non stationary problem, the reward function changes over time. For non stationary problem, identifying the arms which gives the best value will be based on observing the rewards generated in the past for each of the arms. This scenario is more aligned with real life cases where we really do not know what would drive a customer at a certain point of time. However we might be able to draw a behaviour profile by observing different transactions over time. We will be exploring the non-stationary type of problem in this post.

Exploration v/s exploitation

Figure 3 : Should I exploit the current lever or explore ?

One major dilemma in problems like the bandit is the choice between exploration and exploitation. Let us explain this with our context. Let us say after few pulls of the first four levers we found that lever 3 has been consistently giving good rewards. In this scenario, a prudent strategy would be to keep on pulling the 3rd lever as we are sure that this is the best known lever. This is called exploitation. In this case we are exploiting our knowledge about the lever which gives the best reward. We also call the exploitation of the best know lever as the greedy method.

However the question is, will exploitation of our current knowledge guarantee that we get the best value in the long run ? The answer is no. This is because, so far we have only tried the first 4 levers, we haven’t tried the other levers from 5 to 10. What if there was another lever which is capable of giving higher reward ? How will we identify those unknown high value levers if we keep sticking to our known best lever ? This dilemma is called the exploitation v/s exploration. Having said that, resorting to always exploring will also be not judicious. It is found out that a mix of exploitation and exploration yields the best value over a long run.

Methods which adopt a mix of exploitation and exploration are called ε greedy methods. In such methods we exploit the greedy method most of the time. However at some instances, say with a small probability of ε we randomly sample from other levers also so that we get a mix of exploitation and exploration. We will explore different ε greedy methods in the subsequent sections

Simple averaging method

In our discussions so far we have seen that the dynamics of reinforcement learning involves actions taken from different states yielding rewards based on the state-action pair chosen. The ultimate aim is to maximize the rewards in the long run. In order to maximize the overall rewards, it is required to exploit the actions which gets you the maximum rewards in the long run. However to identify the actions with the highest potential we need to estimate the value of that action over time. Let us first explore one of the methods called simple averaging method.

Let us denote the value of an action (a) at time t as Qt(a). Using simple averaging method Qt(a) can be estimated by summing up all the rewards received for the action (a) divided by the number of times action (a) was selected. This can be represented mathematically as

In this equation R1 .. Rn-1 represents the rewards received till time (t) for action (a)

However we know that the estimate of value are a moving average, which means that there would be further instances when action (a) will be selected and corresponding rewards received. However it would be tedious to always sum up all the rewards and then divide it by the number of instances. To avoid such tedious steps, the above equation can be rewritten as follows

This is a simple update formulae where Qn+1 is the new estimate for the n+1 occurance of action a, Qn is the estimate till the nth try and Rn is the reward received for the nth try .

In simple terms this formulae can be represented as follows

New Estimate <----- Old estimate + Step Size [ Reward - Old Estimate]

For simple averaging method the Step Size is the reciprocal of the number of times that particular action was selected ( 1/n)

Now that we have seen the estimate generation using the simple averaging method, let us look at the complete algorithm.

  1. Initialize values for the bandit arms from 1 to K. Usually we initialize a value of 0 for all the bandit arms
  2. Define matrices to store the Value estimates for all the arms ( Qt(a) ) and initialize it to zero
  3. Define matrices to store the tracker for all the arms i.e a tracker which stores the number of times each arm was pulled
  4. Start a iterative loop and
    • Sample a random probability value
    • if the probability value is greater than ε, pick the arm with the largest value. If the probability value is less than ε, randomly pick an arm.
  5. Get the reward for the selected arm
  6. Update the number tracking matrix with 1 for the arm which was selected
  7. Update the Qt(a) matrix, for the arm which was picked using the simple averaging formulae.

Let us look at python implementation of the simple averaging problem next

Implementation of Simple averaging method for K armed bandit

In this implementation we will experiment with around 2000 different bandits with each bandit having 10 arms each. We will be evaluating these bandits for around 10000 steps. Finally we will average the values across all the bandits for each time step. Let us dive into the implementation.

Let us first import all the required packages for the implementation in lines 1-4

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from numpy.random import normal as GaussianDistribution

We will start off by defining all the parameters of our bandit implementation. We would have 2000 seperate bandit experiments. Each bandit experiment will run for around 10000 steps. As defined earlier each bandit will have 10 arms. Let us now first define these parameters

# Define the armed bandit variables
nB = 2000 # Number of bandits
nS = 10000 # Number of steps we will take for each bandit
nA = 10 # Number of arms or actions of the bandit
nT = 2 # Number of solutions we would apply

As we discussed in the previous post the way we arrive at the most optimal policy is through the rewards an agent receives in the process of interacting with the environment. The policy defines the actions the agent will take. In our case, the actions we are going to take is the arms which we are going to pull. The reward which we get from our actions is based on the internal calibration of the armed bandit. The policy we will adopt is a mix of exploitation and exploration. This means that most of the time we will exploit the which action which was found to give the best reward. However once in a while we also do a bit of exploration. The exploration is controlled by a parameter ε.

Next let us define the containers to store the rewards which we get from each arm and also to track whether the reward we got was the most optimal reward.

# Defining the rewards container
rewards = np.full((nT, nB, nS), fill_value=0.)
# Defining the optimal selection container
optimal_selections = np.full((nT, nB, nS), fill_value=0.)
print('Rewards tracker shape',rewards.shape)
print('Optimal reward tracker shape',optimal_selections.shape)

We saw earlier that the policy with which we would pull each arm would be a mixture of exploitation and exploration. The way we do exploitation is by looking at the average reward obtained from each arm and then selecting the arm which has the maximum reward. For tracking the rewards obtained from each arm we initialize some values for each of the arm and then store the rewards we receive after each pull of the arm.

To start off we initialize all these values as zero as we don’t have any information about the arms and its reward possibilities.

# Set the initial values of our actions
action_Mental_model = np.full(nA, fill_value=0.0) # action_value_estimates > action_Mental_model
print(action_Mental_model.shape)
action_Mental_model

The rewards generated by each arm of the bandit is through the internal calibration of the bandit. Let us also define how that calibration has to be. For this case we will assume that the internal calibration follows a non stationary process. This means that with each pull of the armed bandit the existing value of the armed bandit is incremented by a small value. The value to increment the internal value of the armed bandits is through a Gaussian process with its mean at 0 and a standard deviation of 1.

As a start we will initialize the calibrated values of the bandit to be zero.

# Initialize the bandit calibration values
arm_caliberated_value = np.full(nA, fill_value=0.0) 
arm_caliberated_value

We also need to track how many times a particular action was selected. Therefore we define a counter to store those values.

# Initialize the count of how many times an action was selected
arm_selected_count = np.full(nA, fill_value=0, dtype="int64") 
arm_selected_count

The last of the parameters we will define is the exploration probability value. This value defines how often we would be exploring non greedy arms to find their potential.

# Define the epsilon (ε) value 
epsilon=0.1

Now we are ready to start our experiments. The first step in the process is to decide whether we want to do exploration or exploitation. To decide this , we randomly sample a value between 0 and 1 and compare it with the exploration probability value ( ε) value we selected. If the sampled value is less than the epsilon value, we will explore, otherwise we will exploit. To explore we randomly choose one of the 10 actions or bandit arms irrespective of the value we know it has. If the random probability value is greater than the epsilon value we go into the exploitation zone. For exploitation we pick the arm which we know generates the maximum reward.

# First determine whether we need to explore or exploit
probability = np.random.rand()
probability

The value which we got is greater than the epsilon value and therefore we will resort to exploitation. If the value were to be less than 0.1 (epsilon value : ε ) we would have explored different arms. Please note that the probability value you will get will be different as this is a random generation process.

Now,let us define a decision mechanism so as to give us the arm which needs to be pulled ( our action) based on the probabiliy value.

# Our decision mechanism
if probability >= epsilon:
  my_action = np.argmax(action_Mental_model)
else:
  my_action = np.random.choice(nA)
print('Selected Action',my_action)

In the above section, in line 31 we check whether the probability we generated is greater than the epsilon value . if It it is greater, we exploit our knowledge about the value of the arms and select the arm which has so far provided the greatest reward ( line 33 ). If the value is less than the epsilon value, we resort to exploration wherein we randomly select an arm as shown in line 35. We can see that the action selected is the first action ( index 0) as we are still in the initial values.

Once we have selected our action (arm) ,we have to determine whether the arm is the best arm in terms of the reward potential in comparison with other arms of the bandit. To do that, we find the arm of the bandit which provides the greatest reward. We do this by taking the argmax of all the values of the bandit as in line 38.

# Find the most optimal arm of the bandits based on its internal calibration calculations
optimal_calibrated_arm = np.argmax(arm_caliberated_value)
optimal_calibrated_arm

Having found the best arm its now time to determine if the value which we as the user have received is equal to the most optimal value of the bandit. The most optimal value of the bandit is the value corresponding to the best arm. We do that in line 40.

# Find the value corresponding to the most optimal calibrated arm
optimal_calibrated_value = arm_caliberated_value[optimal_calibrated_arm]

Now we check if the maximum value of the bandit is equal to the value the user has received. If both are equal then the user has made the most optimal pull, otherwise the pull is not optimal as represented in line 42.

# Check whether the value corresponding to action selected by the user and the internal optimal action value are same.
optimal_pull = float(optimal_calibrated_value == arm_caliberated_value[my_action])
optimal_pull

As we are still on the initial values we know that both values are the same and therefore the pull is optimal as represented by the boolean value 1.0 for optimal pull.

Now that we have made the most optimal pull, we also need to get rewards conssumerate with our action. Let us assume that the rewards are generated from the armed bandit using a gaussian process centered on the value of the arm which the user has pulled.

# Calculate the reward which is a random distribution centered at the selected action value
reward = GaussianDistribution(loc=arm_caliberated_value[my_action], scale=1, size=1)[0]
reward

1.52

In line 45 we generate rewards using a Gaussian distribution with its mean value as the value of the arm the user has pulled. In this example we get a value of around 1.52 which we will further store as the reward we have received. Please note that since this is a random generation process, the values you would get could be different from this value.

Next we will keep track of the arms we pulled in the current experiment.

# Update the arm selected count by 1
arm_selected_count[my_action] += 1
arm_selected_count

Since the optimal arm was the first arm, we update the count of the first arm as 1 as shown in the output.

Next we are going to update our estimated value of each of the arms we select. The values we will be updating will be a function of the reward we get and also the current value it already has. So if the current value is Vcur, then the new value to be updated will be Vcur + (1/n) * (r - Vcur) where n is the number of times we have visited that particular arm and 'r' the reward we have got for pulling that arm.

To calcualte this updated value we need to first find the following values

Vcur and n . Let us get those values first

Vcur would be estimated value corresponding to the arm we have just pulled

# Get the current value of our action
Vcur = action_Mental_model[my_action]
Vcur

0.0

n would be the number of times the current arm was pulled

# Get the count of the number of times the arm was exploited
n = arm_selected_count[my_action]
n

1

Now we will update the new value against the estimates of the arms we are tracking.

# Update the new value for the selected action
action_Mental_model[my_action] = Vcur + (1/n) * (reward - Vcur)
action_Mental_model

As seen from the output the current value of the arm we pulled is updated in the tracker. With each successive pull of the arm, we will keep updating the reward estimates. After updating the value generated from each pull the next task we have to do is to update the internal calibration of the armed bandit as we are dealing with a non stationary value function.

# Increment the calibration value based on a Gaussian distribution
increment = GaussianDistribution(loc=0, scale=0.01, size=nA)
# Update the arm values with the updated value
arm_caliberated_value += increment
# Updated arm value
arm_caliberated_value

As seen from lines 59-64, we first generate a small incremental value from a Gaussian distribution with mean 0 and standard deviation 0.01. We add this value to the current value of the internal calibration of the arm to get the new value. Please note that you will get a different value for these processes as this is a random generation of values.

These are the set of processes for one iteration of a bandit. We will continue these iterations for 2000 bandits and for each bandit we will iterate for 10000 steps. In order to run these processes for all the iterations, it is better to represent many of the processes as separate functions and then iterate it through. Let us get going with that task.

Function 1 : Function to select actions

The first of the functions is the one to generate the actions we are going to take.

def Myaction(epsilon,action_Mental_model):
    probability = np.random.rand()
    if probability >= epsilon:
        return np.argmax(action_Mental_model)

    return np.random.choice(nA)

Function 2 : Function to check whether action is optimal and generate rewards

The next function is to check whether our action is the most optimal one and generate the reward for our action.

def Optimalaction_reward(my_action,arm_caliberated_value):
  # Find the most optimal arm of the bandits based on its internal calibration calculations
  optimal_calibrated_arm = np.argmax(arm_caliberated_value)
  # Then find the value corresponding to the most optimal calibrated arm
  optimal_calibrated_value = arm_caliberated_value[optimal_calibrated_arm]
  # Check whether the value of the test bed corresponding to action selected by the user and the internal optimal action value of the test bed are same.
  optimal_pull = float(optimal_calibrated_value == arm_caliberated_value[my_action])
  # Calculate the reward which is a random distribution centred at the selected action value
  reward = GaussianDistribution(loc=arm_caliberated_value[my_action], scale=1, size=1)[0]
  return optimal_pull,reward

Function 3 : Function to update the estimated value of arms of the bandit

def updateMental_model(my_action, reward,arm_selected_count,action_Mental_model):
  # Update the arm selected count with the latest count
  arm_selected_count[my_action] += 1
  # find the current value of the arm selected
  Vcur = action_Mental_model[my_action]
  # Find the number of times the arm was pulled
  n = arm_selected_count[my_action]
  # Update the value of the current arm 
  action_Mental_model[my_action] = Vcur + (1/n) * (reward - Vcur)
  # Return the arm selected and our mental model
  return arm_selected_count,action_Mental_model

Function 4 : Function to increment reward values of the bandits

The last of the functions is the function we use to make the reward generation non-stationary.

def calibrateArm(arm_caliberated_value):
    increment = GaussianDistribution(loc=0, scale=0.01, size=nA)
    arm_caliberated_value += increment
    return arm_caliberated_value

Now that we have defined the functions, we will use these functions to iterate through different bandits and multiple steps for each bandit.

for nB_i in tqdm(range(nB)):
  # Initialize the calibration values for the bandits
  arm_caliberated_value = np.full(nA, fill_value=0.0)
  # Set the initial values of the mental model for each bandit
  action_Mental_model = np.full(nA, fill_value=0.0)
  # Initialize the count of how many times an arm was selected
  arm_selected_count = np.full(nA, fill_value=0, dtype="int64")
  # Define the epsilon value for probability of exploration
  epsilon=0.1
  for nS_i in range(nS):
    # First select an action using the helper function
    my_action = Myaction(epsilon,action_Mental_model)
    # Check whether the action is optimal and calculate the reward
    optimal_pull,reward = Optimalaction_reward(my_action,arm_caliberated_value)
    # Update the mental model estimates with the latest action selected and also the reward received
    arm_selected_count,action_Mental_model = updateMental_model(my_action, reward,arm_selected_count,action_Mental_model)
    # store the rewards
    rewards[0][nB_i][nS_i] = reward
    # Update the optimal step selection counter
    optimal_selections[0][nB_i][nS_i] = optimal_pull
    # Recalibrate the bandit values
    arm_caliberated_value = calibrateArm(arm_caliberated_value)

In line 96, we start the first iterative loop to iterate through each of the set of bandits . Lines 98-104, we initialize the value trackers of the bandit and also the rewards we receive from the bandits. Finally we also define the epsilon value. From lines 105-117, we carry out many of the processes we mentioned earlier like

  • Selecting our action i.e the arm we would be pulling ( line 107)
  • Validating whether our action is optimal or not and getting the rewards for our action ( line 109)
  • Updating the count of our actions and updating the rewards for the actions ( line 111 )
  • Store the rewards and optimal action counts ( lines 113-115)
  • Incrementing the internal value of the bandit ( line 117)

Let us now run the processes and capture the values.

Let us now average the rewards which we have got accross the number of bandit experiments and visualise the reward trends as the number of steps increase.

# Averaging the rewards for all the bandits along the number of steps taken
avgRewards = np.average(rewards[0], axis=0)
avgRewards.shape
plt.plot(avgRewards, label='Sample weighted average')
plt.legend()
plt.xlabel("Steps")
plt.ylabel("Average reward")
plt.show()

From the plot we can see that the average value of rewards increases as the number of steps increases. This means that with increasing number of steps, we move towards optimality which is reflected in the rewards we get.

Let us now look at the estimated values of each arm and also look at how many times each of the arms were pulled.

# Average rewards received by each arm
action_Mental_model

From the average values we can see that the last arm has the highest value of 1.1065. Let us now look at the counts where these arms were pulled.

# No of times each arm was pulled
arm_selected_count

From the arm selection counts, we can see that the last arm was pulled the maximum. This indicates that as the number of steps increased our actions were aligned to the arms which gave the maximum value.

However even though the average value increased with more steps, does it mean that most of the times our actions were the most optimal ? Let us now look at how many times we selected the most optimal actions by visualizing the optimal pull counts.

# Plot of the most optimal actions 
average_run_optimality = np.average(optimal_selections[0], axis=0)
average_run_optimality.shape
plt.plot(average_run_optimality, label='Simple weighted averaging')
plt.legend()
plt.xlabel("Steps")
plt.ylabel("% Optimal action")
plt.show()

From the above plot we can see that there is an increase in the counts of optimal actions selected in the initial steps after which the counts of the optimal actions, plateau’s. And finally we can see that the optimal actions were selected only around 40% of the time. This means that even though there is an increasing trend in the reward value with number of steps, there is still room for more value to be obtained. So if we increase the proportion of the most optimal actions, there would be a commensurate increase in the average value which will be rewarded by the bandits. To achieve that we might have to tweak the way how the rewards are calculated and stored for each arm. One effective way is to use the weighted averaging method

Weighted Averaging Method

When we were dealing with the simple averaging method, we found that the update formule was as follows

New Estimate <----- Old estimate + Step Size [ Reward - Old Estimate]

In the formule, the Step Size for simple averaging method is the reciprocal of the number of times that particular action was selected ( 1/n)

In weighted averaging method we make a small variation in the step size. In this method we use a constant step size method called alpha. The new update formule would be as follows

Qn+1 = Qn + alpha * (reward - Qn)

Usually we take some small values of alpha less than 1 say 0.1 or 0.01 or values similar to that.

Let us now try the weighted averaging method with a step size of 0.1 and observe what difference this method have on the optimal values of each arm.

In the weighted averaging method all the steps are the same as the simple averaging, except for the arm update method which is a little different. Let us define the new update function.

def updateMental_model_WA(my_action, reward,action_Mental_model):
  alpha=0.1 
  qn = action_Mental_model[my_action]
  action_Mental_model[my_action] = qn + alpha * (reward - qn)
  return action_Mental_model

Let us now run the process again with the updated method. Please note that we store the values in the same rewards and optimal_selection matrices. We store the value of weighted average method in index [1]

for nB_i in tqdm(range(nB)):
  # Initialize the calibration values for the bandits
  arm_caliberated_value = np.full(nA, fill_value=0.0)
  # Set the initial values of the mental model for each bandit
  action_Mental_model = np.full(nA, fill_value=0.0)  
  # Define the epsilon value for probability of exploration
  epsilon=0.1
  for nS_i in range(nS):
    # First select an action using the helper function
    my_action = Myaction(epsilon,action_Mental_model)
    # Check whether the action is optimal and calculate the reward
    optimal_pull,reward = Optimalaction_reward(my_action,arm_caliberated_value)
    # Update the mental model estimates with the latest action selected and also the reward received
    action_Mental_model = updateMental_model_WA(my_action, reward,action_Mental_model)
    # store the rewards
    rewards[1][nB_i][nS_i] = reward
    # Update the optimal step selection counter
    optimal_selections[1][nB_i][nS_i] = optimal_pull
    # Recalibrate the bandit values
    arm_caliberated_value = calibrateArm(arm_caliberated_value)

Let us look at the plots for the weighted averaging method.

average_run_rewards = np.average(rewards[1], axis=0)
average_run_rewards.shape
plt.plot(average_run_rewards, label='weighted average')

plt.legend()
plt.xlabel("Steps")
plt.ylabel("Average reward")
plt.show()

From the plot we can see that the average reward increasing with number of steps. We can also notice that the average values obtained higher than the simple averaging method. In the simple averaging method the average value was between 1 and 1.2. However in the weighted averaging method the average value reaches within the range of 1.2 to 1.4. Let us now see how the optimal pull counts fare.

average_run_optimality = np.average(optimal_selections[1], axis=0)
average_run_optimality.shape
plt.plot(average_run_optimality, label='Weighted averaging')
plt.legend()
plt.xlabel("Steps")
plt.ylabel("% Optimal action")
plt.show()

We can observe from the above plot that we take the optimal action for almost 80% of the time as the number of steps progress towards 10000. If you remember the optimal action percentage was around 40% for the simple averaging method. The plots show that the weighted averaging method performs better than the simple averaging method.

Wrapping up

In this post we have understood two methods of finding optimal values for a K armed bandit. The solution space is not limited to these two methods and there are many more methods for solving the bandit problem. The list below are just few of them

  • Upper Confidence Bound Algorithm ( UCB )
  • Bayesian UCB Algorithm
  • Exponential weighted Algorithm
  • Softmax Algorithm

Bandit problems are very useful for many use cases like recommendation engines, website optimization, click through rate etc. We will see more use cases of bandit algorithm in some future posts

What next ?

Having understood the bandit problem, our next endeavor would be to use the concepts in building a self learning recommendation system. The next post would be a pre-cursor to that. In the next post we will formulate our problem context and define the processes for building the self learning recommendation system using a bandit algorithm. This post will be released next week ( Jan 17th 2022).

Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest Git hub repository

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!

.

Building Self learning Recommendation System using Reinforcement Learning : Part I

In our previous series on building data science products we learned how to build a machine translation application and how to deploy the application. In this post we start a new series where in we will build a self learning recommendation system. We will be building this system using reinforcement learning methods. We will be leveraging the principles of the bandit problem to build our self learning recommendation engine. This series will be split across 8 posts.

Let us start our series with a primer on Recommendations systems and Reinforcement learning.

Primer on Recommendation Systems

Recommendation systems (RS) are omnipresent phenomenon which we encounter in our day to day life. Powering the modern day e-commerce systems are gigantic ‘recommendation engines’, looking at each individual transactions, mapping customer profiles and using them to recommend personalized products and services.

Before we embark on building our self learning recommendation system, it will be a good idea to take a quick tour on different types of recommendation systems powering large e-commerce systems of the modern era.

For convenience let us classify recommendation systems into three categories,

Figure 1 : Different types of Recommendation Systems

Traditional recommendation systems leverages data involving user-item interactions ( buying behavior of users ,ratings given by users, etc. ) and attribute information of items ( textual descriptions of items ). Some of the popular type of recommendation systems under the traditional umbrella are

  • Collaborative Filtering Recommendation Systems : The fundamental principle behind collaborative filtering systems is that, two or more individuals who share similar interests tend to have similar propensities for buying. The similarity between individuals are unearthed using the ratings they give, their browsing patterns or their buying patterns. There are different types of collaborative filtering systems like user-based collaborative filtering, item-based collaborative filtering, model based methods ( decision trees, Bayesian models, latent factor models ),etc.
Figure 2 : Collaborative filtering Recommendation Systems
  • Content Based Recommendation Systems : In content based recommendation systems, attribute descriptions of the items are the basis of recommendations. For example if you are a person who watched the Jason Borne series movies and haven’t given any ratings, content based recommendation systems would infer your tastes from the attributes of the movies you watched like action thriller ,CIA , covert operations etc. Based on these attributes the system would recommend movies like Mission Impossible series, as they follow a similar genre.
Figure 3 : Content Based Recommendation Systems
  • Knowledge Based Recommendation Systems : These type of systems make recommendations based on similarity between a users requirements and an item descriptions. Knowledge based recommendation systems are usually useful in context where the purchases infrequent like buying an automobile, real estate, luxury goods etc.
Figure 4 : Knowledge Based Recommendation System
  • Hybrid Recommendation Systems : Hybrid systems or Ensemble systems as they might be called combine best features of the above mentioned approaches to generate recommendations. For example Netflix uses a combination of collaborative filtering ( based on user ratings) and content based( attribute descriptions of movies) to make recommendations.

Traditional class of recommendation systems like collaborative filtering are predominantly linear in its approach. However personalization of customer preferences are not necessarily linear and therefore there was need for modelling recommendation systems for behavior data which are mostly non linear and this led to the rise of deep learning based systems. There are many proponents of deep learning methods some of the notable ones are Youtube, ebay, Twitter and Spotify. Let us now see some of the most popular types of deep learning based recommendation systems.

  • Multi Layer Perceptron based Recommendation Systems : Multi layer perceptron or MLP based recommendation systems are feed-forward neural network systems with multiple layers between the input layer and output layer. The basic setting for this approach is to vectorize user information and item information as the basic inputs. This input layer is fed into the feed forward network and then the output is whether there is an interaction for the item or not. By modelling these interactions as a MLP we will be able to rank specific products in terms of the propensity of the user to that item.
Figure 5: MLP Based Recommendation System
  • CNN based Recommendation Systems : Convolutional Neural networks ( CNN ) are great feature extractors, i.e. they extract global level and local level features. These features are used for providing context which will aid in better recommendations.
  • RNN based Recommendation System: RNNs are good choices when there are sequences of data. In the context of a recommendation systems use cases like recommending what the user will click next can be treated as a sequence to sequence problem. In such problems the interactions between the user and an item at each session will be the basic data and the output will be what the customer clicked next. So if we have data pertaining to session information and the interaction of the user to items during the sessions, we will be able to model a RNN to be used as a Recommender system.
  • Neural attention based Recommendation Systems : Attention based recommendation systems leverage the attention mechanism which has great utility in use cases like machine translation, image captioning, to name a few. Attention based recommendation systems are more apt for recommending multimedia items like photos and videos. In multimedia recommendation, the user preferences can be implicit ( likes, views ). These implicit feed back need not always mean that a user liked that item. For example I might give a like to a video or photo shared by a friend even if I really don’t like those items. In such cases, attention based models attempt to weight the user-item interactions to give more emphasis to parts of the video or image which could be more aligned to the users preferences.
Figure 6 : Deep Learning Based Recommendation System ( Image source : dzone.com/articles/building-a-recommendation-system-using-deep-learni)

The above are some of the prevalent deep learning based recommendation systems. In addition to these there are other deep learning algorithms like Restricted Boltzmann machines based recommendation systems , Autoencoder based recommendation systems and Neural autoregressive recommendation systems . We will explore creation of recommendation systems with these types of models in a future post. Deep learning methods for recommendation systems have tremendous abilities to model non linear relationships between user interactions with items. However on the flip side, there is a severe problem of interpretability of models and the hunger for more data in deep learning methods. Off late deep reinforcement learning methods are widely used as recommendation systems. Deep reinforcement learning systems have abilities to model with large number of states and action spaces and this has made reinforcement learning methods to be used as recommendation systems. Let us explore some of the prominent types of reinforcement learning based recommendation systems.

Since this series is about the application of reinforcement learning to the application of recommendation systems, let us first understand what reinforcement learning is and then get into application of reinforcement learning as recommendation systems.

Primer on Reinforcement Learning

Unlike the supervised learning setting where there is a guide telling you what the right action is, reinforcement learning relies on the environment to discover what the right action is. The learning in reinforcement learning is through interaction with an environment. In the reinforcement learning setting there is an agent ( recommendation system in our context) which receives rewards ( feed back from users like buying, clicking) from the environment ( users ) . The rewards acts as an indicator as to whether the course of action taken by the agent is right or wrong. The agent ultimately learns to take the right action through feed backs received from the environment over a period of time.

Figure 7 : Reinforcement Learning Setting

Elements of Reinforcement Learning

Let us try to understand different elements of reinforcement learning with an example of a robot picking trash.

We will first explore an element of reinforcement learning called the ‘State’. A state can be defined as the representation of the environment in which a task has to be performed. In the context of our robot, we can say that it has two states.

State 1 : High charge

State 2 : Low charge.

Depending on the state the robot is in, it has three decision points to make.

  1. The robot can go on searching for trash.
  2. The robot can wait at its current location so that some one will pick up trash and give it to the robot
  3. The robot can got to its charging station to recharge so that it doesn’t go off power.

These decision points which are taken at each state is called an ‘Action‘ in reinforcement learning parlance.

Let us represent the states and its corresponding actions for our robot

From the above figure we can observe the states of the robot and the actions the robot can take. When the robot has high charge, there would only be two actions the robot is likely to take as there would be no point in recharging as the current charge is high.

Depending on the current state and the action taken from that state, the robot will transition to the next state. Let us look at some possible states the robot can end up based on the initial state and the action it takes.

State : High Charge ,Action : Search

When the current state is high charge and the action taken is search, there are two possible states the robot can attain, stay in its high charge( because the search was quickly over) or deplete its charge and end up with low charge.

State : High Charge ,Action : Wait

If the robot decides to wait when it is high on charge, the robot continues in its state of high charge.

State : Low Charge ,Action : Search

When the charge is low and the robot decides to search there can be two resultant states. One plausible scenario is for the charge to be completely drained making the robot unable to take further action. In such circumstance someone will have to physically take the robot to the charging point and the robot ends up with high charge.

The second scenario is when the robot do not make extensive search and as a result doesn’t drain much. In this scenario the robot continues in its state of low charge.

State : Low Charge ,Action : Wait

When the action is to wait with low charge the robot continues to remain in a state of low charge.

State : Low Charge ,Action : Recharge

Recharging the robot will enable the robot to return to a state of high charge.

Based on our discussions let us now represent the states, different action choices and the subsequent states the robot will end up

So far we have seen different starting states and subsequent states the robot ends up based on the action choices the robot makes. However, what about the consequences for the different action choices the robot makes ? We can see that there are some desirable consequences and some undesirable consequences. For example remaining in high charge state by searching for trash is a desirable consequence. However draining off its charge is an undesirable consequence. To optimize the behavior of the robot we need to encourage desirable consequences and strictly discourage undesirable tendencies. How do we inculcate desirable tendencies and discourage undesirable ones ? This is where the concept of rewards comes in.

The sole purpose of the robot is to collect as much trash as possible. This purpose can be effectively done only when the robot searches for trash. However in the process of searching for trash the robot is also supposed to take care of itself i.e. it should ensure that it has enough charge to go about the search so that it doesn’t drain of charge and make itself ineffective. So the desired behavior for the robot is to search and collect trash and the undesired behavior is to get into a drained state. In order to inculcate the desired behaviors we can introduce rewards when the robot collects trash and also penalizes the robot when it drains itself of its charge. The other actions of waiting and recharging will not have any reward or penalties involved. This system of rewards will imbibe right behaviors in the robot.

The example we have seen of the robot is a manifestation of reinforcement learning. Let us now try to derive the elements of reinforcement learning from the context of the robot.

As seen from the robot example, reinforcement learning is the process of learning what to do at different scenarios based on feed back received. Within this context the part of the robot which learns and decides what to do is called the agent .

The context within which an agent interacts is called the environment. In the context of the robot, it is the space where the robot interacts in the process of carrying out its task of picking trash.

When the agent interacts with its environment, at each time step, the agent manifests a certain state. In our example we saw that the robot had two states of high charge and low charge.

From each state the agent carries out certain actions. The actions an agent carries out from a state will determine the subsequent state. In our context we saw how the actions like searching, waiting or recharging from a starting state defined the state the robot ended up.

The other important element is the reward function. The reward function quantifies the consequences of following certain actions from a state. The kind of reward an agent receives for a state-action pair defines the desirability of doing that action given its state. The objective of an agent is to maximize the rewards it gets in the long run.

The reward function is the quantification of the consequences which is got immediately after following a certain action. It doesn’t look far out in the future whether the course of action is good in the long term. That is what a value function does. A value function looks at maximizing the rewards which gets accumulated over a long term horizon. Imagine that the robot was in a state of low charge and then it spots some trash at a certain distance. So the robot decides to search and pick that trash as it would give an immediate reward. However in the process of searching and picking up the trash its charge drains off and the robot become ineffective, getting a large penalty in the process. In this case, even though the short term reward was good, the long term effect was harmful. If the robot had instead moved to its charging station to get charged and then gone in search of the trash, the overall value would have been more rewarding.

The next element of the reinforcement learning context is the policy. A policy defines how the agent has to behave at different circumstances at a given time. It guides the agent on what needs to be done depending on the circumstances. Let us revisit the situation we saw earlier on the decision point of the robot to recharge or to search for trash it spotted. Let us say there was a policy which said that the robot will have to recharge when the charge drops below a certain threshold. In such cases, the robot could have avoided the situation where the charge was drained. The policy is like the heart of the reinforcement learning context. The policy drives the behavior of agents at different situations.

An optional element of a reinforcement context is the model of the environment. A model is a broad representation of how an environment will behave. Given a state and the action taken from the state a model can be used to predict the next states and also the rewards which will be generated from those actions. A model is used for planning the course of action the agent has to take based on the situation the agent is in.

To sum up, we have seen that Reinforcement learning is a framework which aims at automating the task of learning and decision making. The automation of the learning and decision making process is achieved through the interaction between an agent and its environment through its various states, actions and rewards. The end objective of an agent is to maximize the value function and to learn a policy which maximizes the value function. We will be diving more deeper into some specific types of reinforcement learning problems in the future posts. Let us now look at some of the approaches to solve a reinforcement learning problem.

Different approaches using reinforcement learning

  • Multi-armed bandits : The name multi-armed bandits is derived from the context of a gambler who tries to maximize his/her returns by pulling multiple arms of a slot machine. The gambler through the process of exploration has to find which of the n arms provide the best rewards and once a set of best arms are identified, try to exploit those arms to maximize the rewards he/she gets from the process. In the context of reinforcement learning the problem can be formulated from the perspective of an agent who tries to get sufficient information of the environment ( different slots ) based on extensive exploration and then using the information gained to maximize the returns. Different use cases where multi armed bandits can be used involves clinical trials, recommendation systems, A/B testing, etc.
Figure 8 : Multi Armed Bandit – Exploration v/s Exploitation
  • Markov Decision Process and Dynamic Programming: Markov decision process falls under a class of algorithms called the model based algorithms. The main constituents of a Markov process involves the agent which interacts with the environment by taking actions from different states in which the agent finds itself. In a model based process there is a well defined probability distribution when going from one state to the other. This probability distribution is called the transition probability. The below figure depicts the transition probability of the robot we saw earlier. For example, if the robot is at state of low charge and it takes the action search, it would remain in the same state with probability and attains high charge with transition probability of 1-. Similarly, from a high state, when taking the action wait, the robot will remain in high charge with probability of 1.
Figure 9 : Markov Decision Process for a Robot ( Image source : Reinforcement Learning, Sutton & Barto )

Markov decision process entails that when moving from one state to the other, the history of all the states in which an agent was, doesn’t matter. All that matters is the current state. Markov decision processes are generally best implemented by a collection of algorithms called dynamic programming. Dynamic programming helps in computing the most optimal policies as a Markov decision process given a perfect model of the environment. What we mean by a perfect model of the environment is where we know all the states, actions and the transition probabilities when moving from one state to the other.

There are different use cases involving MDP process, some of the notable ones include determination of number of patients in a hospital, reducing wait time at intersections etc.

  • Monte Carlo Methods : Monte Carlo methods unlike dynamic programming and Markov decision processes, do not make assumptions on knowledge on the environment. Monte Carlo methods learns through experience. These methods rely on sampling sequences of states, actions and rewards to attain the most optimal solution.
  • Temporal difference Methods : Temporal difference methods can be said as a combination of dynamic programming methods and Monte Carlo methods. Temporal difference methods can learn from experience like Monte Carlo methods and they also can also estimate the value function based on earlier learning without waiting for the end of an episode. Due to its simplicity temporal difference methods are great for learning experiences derived from interaction with environment in an online mode. Temporal difference methods are great to make long term predictions like predicting customer purchases, weather patterns, election outcomes etc.
Figure 10 : Comparison of backup diagram for MC,TD & DP ( Image source : David Silver’s RL Course, lecture 4 )
  • Deep Reinforcement Learning methods : Deep reinforcement learning methods combine traditional reinforcement learning and deep learning techniques. One pre-requisite of traditional reinforcement learning is the understanding of states and making decisions on what actions to take from each state. However reinforcement learning gets constrained when the number of states become very huge as in the case of many of the online data sets. This is where deep reinforcement learning techniques comes in handy. Deep reinforcement learning algorithms are able to take large input sets, which has large state spaces and make decisions on what actions to take to optimize the end objective. Deep reinforcement learning methods have wide applications in robotic, natural language processing, computer vision, finance, healthcare to name a few.
Figure 11 : Deep Reinforcement Learning

Having got an overview of the types of reinforcement learning systems, let us look at how reinforcement learning approaches can be used for building recommendation systems.

Reinforcement learning for recommendation systems

User interactions with items are sequential and it has a rich context to it. For this reason the problem of predicting the best item to a user can also be viewed as a sequential decision problem. In the primer on reinforcement systems, we learned that in a reinforcement learning setting, an agent aims to maximize a numerical reward through interactions with an environment. Bringing this to the recommendation system context, it is like the recommendation system (agent) trying to recommend an item ( an action ) to the user to maximize the user satisfaction ( reward ).

Let us now look at some of the approaches in which reinforcement learning is used as recommendation systems.

  • Multi armed bandit based recommendation systems : Recommendation systems can learn policies or decisions on what to recommend to whom by broadly two approaches. The first one is the traditional offline learning mode which we explored at the start of this article. The next approach is the online learning mode where the recommendation system will suggest an item to the user based on the users context like time of day, place, history, previous interactions etc. One of the basic type of systems which implement the online system is the multi armed bandit approach. This approach will basically treat the recommendation task like pulling the levers of an armed bandit.
Figure 12 : Multi-Armed Bandit as Recommendation System
  • Normal reinforcement learning based recommendation systems : Many of the reinforcement learning approaches which we explored earlier like MDP, Monte Carlo and Temporal Difference are widely used as recommendation systems. One such example is in the use of MDP based recommendation systems for recommending songs to users. In this problem setting the states represents the list of songs to be recommended, the action is the act of listening to songs, the transition probability is the probability of selecting a particular song having listened to a song and the reward is the implicit feed back received when the user actually listens to the recommended song.
  • Deep Reinforcement Learning based recommendation systems : Deep reinforcement learning systems have the ability to learn multiple states and action spaces. In a typical personalized online recommendation systems the number of states and actions are quite large and deep reinforcement learning systems are good fit for such use cases. Take the case of an approach to recommend movies based on a framework called Deep Deterministic Policy Gradient Framework. In this use case, user preferences are used to learn a policy which will thereby be used to select the movies to be recommended for the user. The learning of the policy is done using the deep deterministic policy gradient framework, which enables learning policies dynamically. The dynamic policy vector is then applied on the candidate set of movies to get a personalized set of movies to the user.

There are different use cases and multiple approaches to use reinforcement learning systems as recommendation systems. We will deal with more sophisticated reinforcement learning based recommendation systems in future posts.

What Next ?

So far in this post we have taken a quick overview of the main concepts. Obviously the proof of the pudding is in the eating. So we will get to that in the next post.

As this series is based on the multi armed bandit approach for recommendation systems, we will get hands on programming experience with multi armed bandit problems. In the next post we will build a multi armed bandit problem formulation from scratch using Python and then implement some simulated experiments using multi armed bandits. The next post will be published next week. Please subscribe to this blog post to get notifications when the next post is published.

You can also subscribe to our Youtube channel for all the videos related to this series.

The complete code base for the series is in the Bayesian Quest Git hub repository

Do you want to Climb the Machine Learning Knowledge Pyramid ?

Knowledge acquisition is such a liberating experience. The more you invest in your knowledge enhacement, the more empowered you become. The best way to acquire knowledge is by practical application or learn by doing. If you are inspired by the prospect of being empowerd by practical knowledge in Machine learning, subscribe to our Youtube channel

I would also recommend two books I have co-authored. The first one is specialised in deep learning with practical hands on exercises and interactive video and audio aids for learning

This book is accessible using the following links

The Deep Learning Workshop on Amazon

The Deep Learning Workshop on Packt

The second book equips you with practical machine learning skill sets. The pedagogy is through practical interactive exercises and activities.

The Data Science Workshop Book

This book can be accessed using the following links

The Data Science Workshop on Amazon

The Data Science Workshop on Packt

Enjoy your learning experience and be empowered !!!!