MAB Analysis of Softmax Algorithm

Moving beyond the Epsilon Greedy algorithm, the Softmax algorithm offers a further optimisation: it makes exploration itself reward-aware, increasing the chance of collecting rewards even while exploring.

To get a better intuition, consider the following two cases of 2 Bernoulli arms:

  • The first arm has 0.1 reward average while the other has 0.9 reward average.
  • The first arm has 0.1 reward average while the other has 0.11 reward average.

Using Epsilon Greedy in both cases, for the specified epsilon fraction of exploration trials, the algorithm chooses uniformly at random between the two arms, regardless of how different their average rewards are.

Therein lies the opportunity for the optimisation that the Softmax algorithm provides. In any given round, instead of exploring all arms uniformly, it assigns each arm a different probability of being chosen based on the differences in their current reward averages.

In summary, Softmax proposes the following probability distribution for choosing each arm at any given round:

p_i = exp(r_i / tau) / Σ_j exp(r_j / tau)

where r_i represents the current average reward of arm i at the current round.

The tau parameter (often called the temperature) is a hyperparameter that ultimately determines how much randomisation is introduced:

  • When tau is large, the exponential terms of all arms approach a value of 1, and thus every arm has a roughly equal chance of being chosen; exploration becomes essentially uniform.
  • When tau is small, each arm's exponential term scales sharply with its current average return, and thus arms with higher average returns have a much higher chance of being chosen, as the short numerical sketch below illustrates.
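
To make the effect of tau concrete, here is a minimal illustrative sketch (not part of the book's code) that computes the Softmax probabilities for the two-arm example above, with average rewards of 0.1 and 0.9, under a large and a small temperature:

import math

def softmax_probs(values, tau):
    # Exponentially weight each arm's average reward by the temperature tau
    weights = [math.exp(v / tau) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

values = [0.1, 0.9]                 # current average rewards of the two arms
print(softmax_probs(values, 10.0))  # large tau: roughly [0.48, 0.52], close to uniform
print(softmax_probs(values, 0.1))   # small tau: roughly [0.0003, 0.9997], close to greedy

With a large tau the two arms are almost indistinguishable, while with a small tau nearly all of the probability mass is placed on the better arm.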

It is important to note that the structured exploration of the Softmax algorithm blends exploitation and exploration. Instead of segmenting the trials into separate exploration and exploitation phases, Softmax incorporates exploitation by increasing the chance of picking the higher-return arm, while still keeping it possible to pick the lower-return arm (which is a form of exploration).

The following analysis is based on the book “Bandit Algorithms for Website Optimization” by John Myles White. For further understanding of the code, I have included comments.

Below is the code that sets up the Softmax algorithm and progressively updates the counts and values of the arms.

  • Counts: The number of times each arm has been pulled.
  • Values: The known mean reward of each arm. In the case of a Bernoulli arm, the value represents the probability of a reward, which ranges from 0 to 1.
import math
import random

# Arm selection based on Softmax probability
def categorical_draw(probs):
    z = random.random()
    cum_prob = 0.0
    
    for i in range(len(probs)):
        prob = probs[i]
        cum_prob += prob
        
        if cum_prob > z:
            return i
    return len(probs) - 1

# Softmax algorithm
class Softmax:
    def __init__(self, tau, counts, values):
        self.tau = tau
        self.counts = counts # Number of pulls recorded for each arm (a list with one entry per arm)
        self.values = values # Average reward observed for each arm (a list with one entry per arm)
        return
    
    # Initialise k number of arms
    def initialize(self, n_arms):
        self.counts = [0 for col in range(n_arms)]
        self.values = [0.0 for col in range(n_arms)]
        return
    
    def select_arm(self):
        # Calculate Softmax probabilities based on each round
        z = sum([math.exp(v / self.tau) for v in self.values])
        probs = [math.exp(v / self.tau) / z for v in self.values]
        
        # Use categorical_draw to pick arm
        return categorical_draw(probs)
    
    # Choose to update chosen arm and reward
    def update(self, chosen_arm, reward):
        # update counts pulled for chosen arm
        self.counts[chosen_arm] = self.counts[chosen_arm] + 1
        n = self.counts[chosen_arm]
        
        # Update average/mean value/reward for chosen arm
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
        return

As per previous discussion in the Epsilon Greedy article, we will use a Bernoulli distribution to represent the reward function of each arm.

class BernoulliArm():
    def __init__(self, p):
        self.p = p
    
    # Reward system based on Bernoulli
    def draw(self):
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0
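
As a quick sanity check (an illustrative snippet, not from the original notebook), repeatedly drawing from an arm with p = 0.9 should return a reward of 1.0 roughly 90% of the time:

random.seed(1)
arm = BernoulliArm(0.9)
draws = [arm.draw() for _ in range(10000)]
print(sum(draws) / len(draws))  # should be close to 0.9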

To proceed with any further analysis, an operational script is required to run the simulation. The following code creates a simulation for a specific tau value.

  • num_sims: Represents the number of independent simulations, each of length equal to ‘horizon’.
  • horizon: Represents the number of time steps/trials per round of simulation
def test_algorithm(algo, arms, num_sims, horizon):
    
    # Initialise variables for duration of accumulated simulation (num_sims * horizon_per_simulation)
    chosen_arms = [0.0 for i in range(num_sims * horizon)]
    rewards = [0.0 for i in range(num_sims * horizon)]
    cumulative_rewards = [0 for i in range(num_sims * horizon)]
    sim_nums = [0.0 for i in range(num_sims * horizon)]
    times = [0.0 for i in range(num_sims * horizon)]
    
    for sim in range(num_sims):
        sim = sim + 1
        algo.initialize(len(arms))
        
        for t in range(horizon):
            t = t + 1
            index = (sim - 1) * horizon + t - 1
            sim_nums[index] = sim
            times[index] = t
            
            # Selection of best arm and engaging it
            chosen_arm = algo.select_arm()
            chosen_arms[index] = chosen_arm
            
            # Engage chosen Bernoulli Arm and obtain reward info
            reward = arms[chosen_arm].draw()
            rewards[index] = reward
            
            # Accumulate rewards within each simulation
            if t == 1:
                cumulative_rewards[index] = reward
            else:
                cumulative_rewards[index] = cumulative_rewards[index - 1] + reward
                
            algo.update(chosen_arm, reward)
    
    return [sim_nums, times, chosen_arms, rewards, cumulative_rewards]

Simulation of Arms with relatively big differences in Means

Similar to what was done previously in the Epsilon Greedy analysis, the simulation comprises the following:

  • Create 5 arms, four of which have average reward of 0.1, and the last/best has average reward of 0.9.
  • Save simulation output to a tab-separated file
  • Create 5000 independent simulations for each tau value, for a total of 5 tau values ranging from 0.1 to 0.5.

The choice of 5000 independent simulations is driven by the need to determine the average performance. Any single simulation is subject to stochastic noise and its performance might be skewed by random chance, so it is important to run a reasonably large number of simulations and evaluate the mean performance.

import random
import numpy as np

random.seed(1)
# out of 5 arms, 1 arm is clearly the best
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
# Shuffling arms
random.shuffle(means)

# Create list of Bernoulli Arms with Reward Information
arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

f = open("standard_results_soft.tsv", "w+")

# Create simulations for each tau/temperature value
for tau in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = Softmax(tau, [], [])
    algo.initialize(n_arms)
    results = test_algorithm(algo, arms, 5000, 250)
    
    # Store data
    for i in range(len(results[0])):
        f.write(str(tau) + "\t")
        f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()

Using some data preprocessing and basic Altair visualisation, we can plot the probability of pulling the best arm for each tau value.
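
The snippet below is a minimal sketch of one way to do this preprocessing (the column names simply mirror the write order of the simulation script above, and the exact notebook code may differ): load the TSV, flag whether the best arm was chosen at each time step, and average that flag over the 5000 simulations before plotting.

import pandas as pd
import altair as alt

# Columns follow the write order used in the simulation script above
df = pd.read_csv("standard_results_soft.tsv", sep="\t",
                 names=["tau", "sim", "time", "chosen_arm", "reward", "cum_reward"])

best_arm = np.argmax(means)  # index of the best arm from the setup above
df["chose_best"] = (df["chosen_arm"] == best_arm).astype(int)

# Probability of choosing the best arm at each time step, averaged over simulations
prob_best = df.groupby(["tau", "time"], as_index=False)["chose_best"].mean()

alt.Chart(prob_best).mark_line().encode(
    x="time:Q",
    y=alt.Y("chose_best:Q", title="Probability of choosing best arm"),
    color="tau:N",
)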

For different values of tau, there are different asymptotic limits to the eventual rate of choosing the best arm. As mentioned previously, the higher the value of tau, the higher the degree of randomisation in picking arms, which explains the different asymptotes. For a tau value of 0.1, the rate does seem to eventually converge close to 1.

The rate of convergence towards their corresponding asymptotes seems to be similar for all values of tau, except for 0.1, which takes about twice as long as the rest. Viewed in terms of approaching convergence, there appears to be a form of exponential decay towards each asymptote, as shown by the nearly identical gradients at the start (time steps 0 to 10).

Looking at the cumulative rewards, there is a stark difference between the Softmax algorithm and the Epsilon Greedy algorithm. We saw that in the Epsilon Greedy simulation, the epsilon value of 0.1 had to play catch-up since it was not exploring enough in the early stage of the experiment to discover the best arm. Thus, its cumulative return was below that of other epsilon values for a large portion of the 250 time steps.

For the Softmax algorithm, we see that it provides sufficient exploration amidst exploitation. Thus, lower values of tau remained consistently higher in terms of cumulative reward. This is intuitive: the lower the value of tau, the lower the degree of randomised exploration, and thus the lower the chance of picking poorer-return arms.

It must also be noted that for Epsilon Greedy, the cumulative reward values ranged between 140 and 180. In this case, Softmax tau values of 0.1 and 0.2 reach a cumulative reward range of 200 to 220, which means they consistently exploited the best arm and did so quite early on.

Simulation of Arms with relatively smaller differences in Means

The previous analysis was a simulation exercise on arms with big differences in reward returns. We extend the analysis to a situation where the arms' reward means are much closer.

In the following case, we simulate 5 arms, 4 of which have a mean of 0.8 while the last/best has a mean of 0.9.
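
The setup mirrors the earlier simulation script, with only the means changed (the output file name here is illustrative):

random.seed(1)
# Out of 5 arms, the best arm is only slightly better than the rest
means = [0.8, 0.8, 0.8, 0.8, 0.9]
n_arms = len(means)
random.shuffle(means)

arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

f = open("close_results_soft.tsv", "w+")
for tau in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = Softmax(tau, [], [])
    algo.initialize(n_arms)
    results = test_algorithm(algo, arms, 5000, 250)
    for i in range(len(results[0])):
        f.write(str(tau) + "\t")
        f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()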

With the reward returns of the arms being much closer now, we see a big deterioration in the performance of the Softmax algorithm. The rate of choosing the best arm now ranges between 0.22 and 0.32. It is interesting to observe that a tau value of 0.2 has an asymptote of 0.32, while a tau value of 0.1 has an asymptote of 0.25 (similar to that of a tau value of 0.4).

This seems to imply that the reduced difference in reward means makes it harder to determine which is the best arm, especially for a tau value of 0.1. In fact, with the higher degree of randomisation given by a tau value of 0.2 (or even 0.3), we obtain better rates of selecting the best arm.

It should be noted that in this scenario, the Epsilon Greedy algorithm's rate of choosing the best arm is actually higher, ranging from 0.5 to 0.7, compared to the Softmax algorithm's range of 0.22 to 0.32. This seems to imply that Epsilon Greedy might be better suited for multi-armed bandit settings where the differences in means are much smaller.

Similar to what we saw in the Epsilon Greedy analysis, the cumulative rewards for all tau values are so close that they are practically indistinguishable. Again, this is probably due to the fact that the reward means of all arms are quite close.

Since the arms are close in average returns, it is more interesting to examine the overall cumulative regret, i.e. the accumulated difference between the expected reward of the best arm and that of the arm actually chosen at each time step.
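
A rough sketch of how this can be computed from the simulation output (assuming the same dataframe layout as in the preprocessing snippet above):

# Expected regret per pull: best mean minus the mean of the arm actually chosen
best_mean = max(means)
df["regret"] = best_mean - df["chosen_arm"].astype(int).map(lambda a: means[a])

# Average regret per time step over simulations, then accumulate it per tau
avg_regret = df.groupby(["tau", "time"], as_index=False)["regret"].mean()
avg_regret["cum_regret"] = avg_regret.groupby("tau")["regret"].cumsum()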

Based on the cumulative regret plots, we see that a tau value of 0.2 is the best at around 15.5, while a tau value of 0.5 is the worst at 19.2. On the whole, the Softmax algorithm is worse off than the Epsilon Greedy algorithm, which had a range of 12.3 to 14.8. For the better-performing tau values there is a tapering off in the plots, but as a whole, the Softmax algorithm does seem to fare worse than Epsilon Greedy.

Summary

In this analysis of Softmax algorithm, we covered the algorithmic intuition as well as the experimental results of applying the Softmax algorithm. The results were benchmarked against the Epsilon Greedy algorithm which was covered previously.

A key takeaway is that for arms with closer means, the Softmax algorithm does not seem to be as robust at identifying the best arm; Epsilon Greedy appears more suitable in that setting.

For reference on this project on bandit simulations analysis, please refer to this Github repo. For quick reference on the actual code, please refer to this Jupyter notebook.

Kenneth Foo Fangwei

Hi there! My name is Kenneth Foo and I'm originally from Singapore. I graduated with a Master of Data Science degree from the University of British Columbia, and I'm currently working as a Senior Quantitative Engineer at OPALai (https://www.opalai.com/). With a passion for product data science and personalisation using machine learning, my key interests lie in topics such as bandit algorithms, online learning, Bayesian statistics, time series prediction, and causality frameworks.