## Hierarchical Actions and Reinforcement Learning

One of the issues of reinforcement learning is how it handles hierarchical actions.

## What are Hierarchical Actions?

In order to explain hierarchical actions, let us take a look at a real-world example. Consider the task of baking a sweet potato pie. The high-level action of making a sweet potato pie can be broken down into numerous low-level sub steps: cut the sweet potatoes, cook the sweet potatoes, add sugar, add flour, etc.

You will also notice that each of the low-level sub steps mentioned above can further be broken down into even further steps. For example, the task of cutting a sweet potato can be broken down into the following steps: move right arm to the right, orient right arm above the pie, bring arm down, etc.

Each of those sub steps of sub steps can then be further broken down into even smaller steps. For example, “moving right arm to the right” might involve thousands of different muscle contractions. Can you see where we are going here?

Reinforcement learning involves training a software agent to learn by experience through trial and error. A basic reinforcement learning algorithm would need to do a search over thousands of low-level actions in order to execute the task of making a sweet potato pie. Thus, reinforcement learning methods would quickly get inefficient for tasks that require a large number of low-level actions.

## How to Solve the Hierarchical Action Problem

One way to solve the hierarchical action problem is to represent a high-level behavior (e.g. making a sweet potato pie) as a small sequence of high-level actions.

For example, where the solution of making a sweet potato pie might entail 1000 low-level actions, we might condense these actions into 10 high-level actions. We could then have a single master policy that switches between each of the 10 sub-policies (one for each action) every N timesteps. The algorithm explained here is known as meta-learning shared hierarchies and is explained in more detail at OpenAi.com.

We could also integrate supervised learning techniques such as ID3 decision trees. Each sub-policy would be represented as a decision tree where the appropriate action taken is the output of the tree. The input would be a transformed version of the state and reward that was received from the environment. In essence, you would have decisions taken within decisions.

## Partial Observability and Reinforcement Learning

In this post, I’m going to discuss how supervised learning can address the partial observability issue in reinforcement learning.

## What is Partial Observability?

In a lot of the textbook examples of reinforcement learning, we assume that the agent, for example a robot, can perfectly observe the environment around it in order to extract relevant information about the current state. When this is the case, we say that the environment around the agent is fully observable

However, in many cases, such as in the real world, the environment is not always fully observable. For example, there might be noisy sensors, missing information about the state, or outside interferences that prohibit an agent from being able to develop an accurate picture of the state of its surrounding environment. When this is the case, we say that the environment is partially observable.

Let us take a look at an example of partial observability using the classic cart-pole balancing task that is often found in discussions on reinforcement learning.

Below is a video demonstrating the cart-pole balancing task. The goal is to keep to keep a pole from falling over by making small adjustments to the cart support underneath the pole.

In the video above, the agent learns to keep the pole balanced for 30 minutes after 600 trials. The state of the world consists of two parts:

1. The pole angle
2. The angular velocity

However, what happens if one of those parts is missing? For example, the pole angle reading might disappear.

Also, what happens if the readings are noisy, where the pole angle and angular velocity measurements deviate significantly from the true value?

In these cases, a reinforcement learning policy that depends only on the current observation xt (where x is the pole angle or angular velocity value and time t) will suffer in performance. This in a nutshell is the partial observability problem that is inherent in reinforcement learning techniques.

## Addressing Partial Observability Using Long Short-Term Memory (LSTM) Networks

One strategy for addressing the partial observability problem (where information about the actual state of the environment is missing or noisy) is to use long short-term memory neural networks. In contrast to artificial feedforward neural networks which have a one-way flow of information from the input layer, LSTMs have feedback connections. Past information persists from run to run of the network, giving the system a “memory.” This memory can then be used to make predictions about the current state of the environment.

The details of exactly how the memory explained above is created is described in this paper written by Bram Baker of the Unit of Experimental and Theoretical Psychology at Leyden University. Baker showed that LSTM neural networks can help improve reinforcement learning policies by creating a “belief state.” This “belief state” is based on probabilities of reward, state transitions, and observations, given prior states and actions.

Thus, when the actual state (as measured by a robot’s sensor for example) is unavailable or super noisy, an agent can use belief state information generated by an LSTM to determine the appropriate action to take.

## Combining Deep Neural Networks With Reinforcement Learning for Improved Performance

The performance of reinforcement learning can be improved by incorporating supervised learning techniques. Let us take a look at a concrete example.

You all might be familiar with the Roomba robot created by iRobot. The Roomba robot is perhaps the most popular robot vacuum sold in the United States.

The Roomba is completely autonomous, moving around the room with ease, cleaning up dust, pet hair, and dirt along the way. In order to do its job, the Roomba contains a number of sensors that enable it to perceive the current state of the environment (i.e. your house).

Let us suppose that the Roomba is governed by a reinforcement learning policy. This learning policy could be improved if we have accurate readings of the current state of the environment. And one way to improve these readings is to incorporate computer vision.

Since reinforcement learning depends heavily on accurate readings of the current state of the environment, we could use deep neural networks (a supervised learning technique) to pre-train the robot so that it can perform common computer vision tasks such as recognizing objects, localizing objects, and classifying objects before we even start running the reinforcement learning algorithm. These “readings” would improve the state portion of the reinforcement learning loop.

Deep neural networks have already displayed remarkable accuracy for computer vision problems. We can use these techniques to enable the robot to get a more accurate reading of the current state of the environment so that it can then take the most appropriate actions towards maximizing cumulative reward.

# How Does the Boltzmann Distribution Fit Into the Discussion of Epsilon Greedy Search?

In order to answer your question, let us take a closer look at the definition of epsilon greedy search. With our knowledge of how that works, we can then see how the Boltzmann distribution fits into the discussion of epsilon greedy search.

# What is Epsilon Greedy Search?

When you are training an agent (e.g. race car, robot, etc.) with an algorithm like Q-learning, you can either have the agent take a random action with probability ϵ or have the agent be greedy and take the action that corresponds to its policy with probability 1-ϵ (i.e. the action for a given state that has the highest Q-value). The former is known as exploration while the latter is called exploitation. In reinforcement learning, we have this constant dichotomy of:

• exploration vs. exploitation
• learn vs. earn
• not greedy vs. greedy
• Exploration: Try a new bar in your city.
• Exploitation: Go to the same watering hole you have been going to for decades.
• Exploitation: Get a job.
• Exploration: Try to make new friends.
• Exploitation: Keep inviting over your college buddies.
• Exploitation: Call the ex.
• Exploitation (with probability 1-ϵ): Make a decision based on the best information (i.e. policy) that is currently available.

The epsilon greedy algorithm in which ϵ is 0.20 says that most of the time the agent will select the trusted action a, the one prescribed by its policy π(s) -> a. However, 20% of the time, the agent will choose a random action instead of following its policy.

We often want to have the epsilon-greedy algorithm in place for a reinforcement learning problem because often what is best for the agent long term (e.g. trying something totally random that pays off in a big way down the road) might not be the best for the agent in the short term (e.g. sticking with the best option we already know).

# What Does the Boltzmann Distribution Have to Do With Epsilon Greedy Search?

Notice in the epsilon greedy search section above, I said that 20% of the time the agent will choose a random action instead of following its policy. The problem with this is that it treats all actions equally when making a decision on what action to take. What happens though if some actions might look more promising than others? Plain old epsilon greedy search cannot handle a situation like this. The fact is that, in the real world, all actions are not created equal.

A common method is to use the Boltzmann distribution (also known as Gibbs distribution). Rather than blindly accepting any random action when it comes time for the agent to explore the environment from a given state s, the agent selections an action a (from a set of actions A) with probability:

What this system is doing above is ranking and weighting all actions in the set of possible actions based on their Q-values. This system is often referred to as softmax selection rules.

Take a closer look at the equation above to see what we are doing here. A really high value of tau means that all actions are equally likely to be selected because we are diluting the impact of the Q-values for each action (by dividing by tau). However, as tau gets lower and lower, there will be greater differences in the selection probabilities for each action. The action with the highest Q[s,a] value is therefore much more likely to get selected. And when tau gets close to zero, the Boltzmann selection criteria I outlined above becomes indistinguishable from greedy search. For an extremely low value of tau, the agent will select the action with the highest Q-value and therefore never explore the environment via a random action.

## Value Iteration vs. Q-Learning Algorithm in Python Step-By-Step

In this post, I will walk you through how to implement the value iteration and Q-learning reinforcement learning algorithms from scratch, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to implement and compare the performance of the value iteration and Q-learning reinforcement learning algorithms on the racetrack problem (Barto, Bradtke, & Singh, 1995).

In the racetrack problem, the goal is to control the movement of a race car along a predefined racetrack so that the race car travels from the starting position to the finish line in a minimum amount of time. The race car must stay on the racetrack and avoid crashing into walls along the way. The racetrack problem is analogous to a time trial rather than a competitive race since there are no other cars on the track to avoid.

Note: Before you deep dive into a section below, I recommend you check out my introduction to reinforcement learning post so you can familiarize yourself with what reinforcement learning is and how it works.

Without further ado, let’s get started!

# Testable Hypotheses

The two reinforcement learning algorithms implemented in this project were value iteration and Q-learning. Both algorithms were tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to take from the starting position to the finish line was calculated for each algorithm-racetrack combination.

Using the implementations of value iteration and Q-learning, three hypotheses will be tested.

## Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

I hypothesize that value iteration and Q-learning will both generate policies that will enable the race car to successfully finish the race on all racetracks tested (i.e. move from the starting position of the racetracks to the finish line).

## Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

I hypothesize that value iteration will generate a learning policy faster than Q-learning because it has access to the transition and reward functions (explained in detail in the next section “Algorithms Implemented”).

## Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

I hypothesize that it will take longer for the car to finish the race for the crash scenario in which the race car needs to return to the original starting position each time it crashes into a wall. In other words, Bad Crash Version 1 (return to nearest open position) performance will be better than Bad Crash Version 2 (return to start) performance.

# How Value Iteration Works

In the case of the value iteration reinforcement learning algorithm, the race car (i.e. agent) knows two things before taking a specific action (i.e. accelerate in x and/or y directions) (Alpaydin, 2014):

1. The probabilities of ending up in other new states given a particular action is taken from a current state.
• More formally, the race car knows the transition function.
• As discussed in the previous section, the transition function takes as input the current state s and selected action a and outputs the probability of transitioning to some new state s’.
2. The immediate reward (e.g. race car gets -1 reward each time it moves to a new state) that would be received for taking a particular action from a current state.
• More formally, this is known as the reward function.
• The reward function takes as input the current state s and selected action a and outputs the reward.

Value iteration is known as a model-based reinforcement learning technique because it tries to learn a model of the environment where the model has two components:

1. Transition Function
• The race car looks at each time it was in a particular state s and took a particular action a and ended up in a new state s’. It then updates the transition function (i.e. transition probabilities) according to these frequency counts.
2. Reward Function
• This function answers the question: “Each time the race car was in a particular state s and took a particular action a, what was the reward?”

In short, in value iteration, the race car knows the transition probabilities and reward function (i.e. the model of the environment) and uses that information to govern how the race car should act when in a particular state. Being able to look ahead and calculate the expected rewards of a potential action gives value iteration a distinct advantage over other reinforcement learning algorithms when the set of states is small in size.

Let us take a look at an example. Suppose the following:

• The race car is in state so, where s0 = <x0, y0, vx0, vy0>, corresponding to the x-position on a racetrack, y-position on a race track, velocity in the x direction, and velocity in the y direction, at timestep 0.
• The race car takes action a0, where a0 = (ax0, ay0) = (change in velocity in x-direction, change in velocity in y-direction)
• The race car then observes the new state, where the new state is s1, where s1 = <x1, y1, vx1, vy1>.
• The race car receives a reward r0 for taking action a0 in state s0.

Putting the bold-faced variables together, we get the following expression which is known as an experience tuple. Tuples are just like lists except the data inside of them cannot be changed.

Experience Tuple = <s0, a0, s1, r0>

What the experience tuple above says is that if the race car is in state s0 and takes action a0, the race car will observe a new state s1 and will receive reward r0.

Then at the next time step, we generate another experience tuple that is represented as follows.

Experience Tuple = <s1, a1, s2, r1>

This process of collecting experience tuples as the race car explores the race track (i.e. environment) happens repeatedly.

Because value iteration is a model based approach, it builds a model of the transition function T[s, a, s’] and reward function R[s,a,s’] using the experience tuples.

• The transition function can be built and updated by adding up how many times the race car was in state s and took a particular action a and then observed a new state s’. Recall that T[s, a, s’] stands for the probability the race car finds itself in a new state s’ (at the next timestep) given that it was in state s and took action a.
• The reward function can be built by examining how many times the race car was in state s and took a particular action a and received a reward r. From that information the average reward for that particular scenario can be calculated.

Once these models are built, the race car can then can use value iteration to determine the optimal values of each state (hence the name value iteration). In some texts, values are referred to as utilities (Russell, Norvig, & Davis, 2010).

What are optimal values?

Each state s in the environment (denoted as <xt, yt, vxt, vyt > in this racetrack project) has a value V(s). Different actions can be taken in a given state s. The optimal values of each state s are based on the action a that generates the best expected cumulative reward.

Expected cumulative reward is defined as the immediate reward that the race car would receive if it takes action a and ends up in the state s + the discounted reward that the race car would receive if it always chose the best action (the one that maximizes total reward) from that state onward to the terminal state (i.e. the finish line of the racetrack).

V*(s) = best possible (immediate reward + discounted future reward)

where the * means “optimal.”

The reason why those future rewards are discounted (typically by a number in the range [0,1), known as gamma γ) is because rewards received far into the future are worth less than rewards received immediately. For all we know, the race car might have a gas-powered engine, and there is always the risk of running out of gas. After all, would you rather receive \$10 today or \$10 ten years from now? \$10 received today is worth more (e.g. you can invest it to generate even more cash). Future rewards are more uncertain so that is why we incorporate the discount rate γ.

It is common in control applications to see state-action pairs denoted as Q*(s, a) instead of V*(s) (Alpaydin, 2014). Q*(s,a) is the [optimal] expected cumulative reward when the race car takes an action a in state s and always takes the best action after that timestep. All of these values are typically stored in a table. The table maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state]

Note that Q(s,a) above is not labeled Q*(s,a). Only once value iteration is done can we label it Q*(s,a) because it takes time to find the optimal Q values for each state-action pair.

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep of the race, the race car observes the state (i.e. position and velocity) and decides what action to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state. The car then takes that action.

The pseudocode for calculating the optimal value for each state-action pair (denoted as Q*(s,a)) in the environment is below. This algorithm is the value iteration algorithm because it iterates over all the state-action pairs in the environment and gives them a value based on the cumulative expected reward of that state-action pair (Alpaydin, 2014; Russell, Norvig, & Davis, 2010; Sutton & Barto, 2018):

## Value Iteration Algorithm Pseudocode

### Inputs

• States:
• List of all possible values of x
• List of all possible values of y
• List of all possible values of vx
• List of all possible values of vy
• Actions:
• List of all the possible values of ax
• List of all possible values of ay
• Model:
• Model = Transition Model T[s, a, s’] + Reward Model R[s, a, s’]
• Where Model is a single table with the following row entries
• [s, a, s’, probability of ending up in a new state s’ given state s and action a, immediate reward for ending up in new state s’ given state s and action a]
• = [s, a, s’, T(s, a, s’), R(s, a, s’)]
• Note that the reward will be -1 for each state except the finish line states (i.e. absorbing or terminal states), which will have a reward of 0.
• Discount Rate:
• γ
• Where 0 ≤ γ < 1
• If γ = 0, rewards received immediately are the only thing that counts.
• As γ gets close to 1, rewards received further in the future count more.
• Error Threshold
• Ɵ
• Ɵ is a small number that helps us to determine when we have optimal values for each state-action pair (also known as convergence).
• Ɵ helps us know when the values of each state-action pair, denoted as Q(s,a), stabilize (i.e. stop changing a lot from run to run of the loop).

### Process

• Create a table V(s) that will store the optimal Q-value for each state. This table will help us determine when we should stop the algorithm and return the output. Initialize all the values of V(s) to arbitrary values, except the terminal state (i.e. finish line state) that has a value of 0.
• Initialize all Q(s,a) to arbitrary values, except the terminal state (i.e. finish line states) that has a value of 0.
• Initialize a Boolean variable called done that is a flag to indicate when we are done building the model (or set a fixed number of training iterations using a for loop).
• While (Not Done)
• Initialize a value called Δ to 0. When this value gets below the error threshold Ɵ, we exit the main loop and return the output.
• For all possible states of the environment
• v := V(s)            // Extract and store the most recent value of the state
• For all possible actions the race car (agent) can take
• Q(s,a) = (expected immediate reward given the state s and an action a) + (discount rate) * [summation_over_all_new_states(probability of ending up in a new state s’given a state s and action a * value of the new state s’)]
• More formally,
•
• where  = expected immediate return when taking action a from state s
• V(s) := maximum value of Q(s,a) across all the different actions that the race car can take from state s
• Δ := maximum(Δ, |v – V(s)|)
• Is Δ < Ɵ? If yes, we are done (because the values of V(s) have converged i.e. stabilized), otherwise we are not done.

### Output

The output of the value iteration algorithm is the value table, the table that maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state]

### Policy Determination

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep t of the race, the race car observes the state st (i.e. position and velocity) and decides what action at to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state (the action that will generate the highest expected cumulative reward). The car then takes that action at*, where at* means the optimal action to take at time t.

More formally, the optimal policy for a given states s at timestep t is π*(s) where:

For each state s do

## Value Iteration Summary

So, to sum up, on a high level, the complete implementation of an autonomous race car using value iteration has two main steps:

• Step 1: During the training phase, calculate the value of each state-action pair.
• Step 2: At each timestep of the time trial, given a current state s, select the action a where the state-action pair value (i.e. Q*(s,a)) is the highest.

# How Q-Learning Works

## Overview

In most real-world cases, it is not computationally efficient to build a complete model (i.e. transition and reward function) of the environment beforehand.  In these cases, the transition and reward functions are unknown. To solve this issue, model-free reinforcement learning algorithms like Q-learning were invented (Sutton & Barto, 2018) .

In the case of model-free learning, the race car (i.e. agent) has to interact with the environment and learn through trial-and-error. It cannot look ahead to inspect what the value would be if it were to take a particular action from some state. Instead, in Q-learning, the race car builds a table of Q-values denoted as Q(s,a) as it takes an action a from a state s and then receives a reward r. The race car uses the Q-table at each timestep to determine the best action to take given what it has learned in previous timesteps.

One can consider Q[s,a] as a two-dimensional table that has the states s as the rows and the available actions a as the columns. The value in each cell of this table (i.e. the Q-value) represents the value of taking action a in state s.

Just like in value iteration, value has two components, the immediate reward and the discounted future reward.

More formally,

Q[s,a] = immediate reward + discounted future reward

Where immediate reward is the reward you get when taking action a from state s, and discounted future reward is the reward you get for taking actions in the future, discounted by some discount rate γ. The reason why future rewards are discounted is because rewards received immediately are more certain than rewards received far into the future (e.g. \$1 received today is worth more than \$1 received fifty years from now).

So, how does the race car use the table Q[s,a] to determine what action to take at each timestep of the time trial (i.e. what is the policy given a state…π(s))?

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmaxa(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all position actions given a state and selects the action that has the highest Q-value.

After Q-learning is run a number of times, an optimal policy will eventually be found. The optimal policy is denoted as π*(s). Similarly the optimal Q-value is Q*[s,a].

## How the Q[s,a] Table Is Built

Building the Q[s,a] table can be broken down into the following steps:

1. Choose the environment we want to train on.
• In this race car problem, the racetrack environment is what we want the race car to explore and train on.
• The racetrack provides the x and y position data that constitutes a portion of the state s that will be observed at each timestep by the racecar (agent).
2. Starting from the starting position, the race car moves through the environment, observing the state s at a given timestep
3. The race car consults the policy π(s), the function that takes as input the current state and outputs an action a.
4. The action a taken by the race car changes the state of the environment from s to s’. The environment also generates a reward r that is passed to the race car.
5. Steps 2-4 generate a complete experience tuple (i.e. a tuple is a list that does not change) of the form <s, a, s’, r> = <state, action, new state, reward>.
6. The experience tuple in 5 is used to update the Q[s,a] table.
7. Repeat steps 3-6 above until the race car gets to the terminal state (i.e. the finish line of the racetrack).

With the Q[s,a] table built, the race car can now test the policy. Testing the policy means starting at the starting position and consulting the Q[s,a] table at each timestep all the way until the race car reaches the finish line. We count how many timesteps it took for the race car to reach the finish line. If the time it took the race car to complete the time trial is not improving, we are done. Otherwise, we make the race car go back through the training process and test its policy again with an updated Q[s,a] table.

We expect that eventually, the performance of the race car will reach a plateau.

## How the Q[s,a] Table Is Updated

With those high level steps, let us take a closer look now at how the Q[s,a] table is updated in step 6 in the previous section.

Each time the race car takes an action, the environment transitions to a new state s’ and gives the race car a reward r. This information is used by the race car to update its Q[s,a] table.

The update rule consists of two parts, where Q’[s,a] is the new updated Q-value for an action a taken at state s.

Q’[s,a] = Q[s,a] + α * (improved estimate of Q[s,a] – Q[s,a])

where α is the learning rate.

The learning rate is a number 0 < α ≤ 1 (commonly 0.2). Thus, the new updated Q-value is a blend of the old Q-value and the improved estimate of the Q value for a given state-action pair. The higher the learning rate, the more new information is considered when updating the Q-value. A learning rate of 1 means that the new updated Q-value is only considering the new information.

The equation above needs to be expanded further.

Q’[s,a] =Q[s,a] + α*(immediate reward + discounted future reward – Q[s,a])

Q’[s,a] = Q[s,a] + α * (r + γ * future reward – Q[s,a])

Where γ is the discount rate. It is typically by a number in the range [0,1). It means that rewards received by the race car (agent) in the far future are worth less than an immediate reward.

Continuing with the expansion of the equation, we have:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, optimal action a’ at new state s’] – Q[s,a])

Note in the above equation that we assume that the race car reaches state s’ and takes the best action from there, where best action is the action a’ that has the highest Q-value for that given new state s’.

More formally the update rule is as follows:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, argmaxa’(Q[s’,a’])] – Q[s,a])

Where argmaxa’(Q[s’,a’]) returns the action a’ that has the highest Q value for a given state s’.

## How an Action Is Selected at Each Step

The performance of Q-learning is improved the more the race car explores different states and takes different actions in those states. In other words, the more state-action pairs the race car explores, the better Q[s,a] table the race car is able to build.

One strategy for forcing the race car to explore as much of the state-action space as possible is to add randomness into the Q-learning procedure. This is called exploration. More specifically, at the step of Q-learning where the race car selects an action based on the Q[s,a] table:

• There is a 20% chance ε that the race car will do some random action, and a 80% chance the race car will do the action with the highest Q-value (as determined by the Q[s,a] table). The latter is known as exploitation.
• If the race car does some random action, the action is chosen randomly from the set of possible actions the race car can perform.

I chose 20% as the starting probability (i.e. ε) of doing some random action, but it could be another number (e.g. 30%) As the race car gains more experience and the Q[s,a] table gets better and better, we want the race car to take fewer random actions and follow its policy more often. For this reason, it is common to reduce ε with each successive iteration, until we get to a point where the race car stops exploring random actions.

## Q-Learning Algorithm Pseudocode

Below are is the Q-learning algorithm pseudocode on a high level (Alpaydin, 2014; Sutton & Barto, 2018).

### Inputs

• Learning rate α, where 0 < α ≤ 1 (e.g. 0.2)
• Discount rate γ (e.g. 0.9)
• Exploration probability ε, corresponding to the probability that the race car will take a random action given a state (e.g. 0.2)
• Reduction of exploration probability, corresponding to how much we want to decrease ε at each training iteration (defined as a complete trip from the starting position to the finish line terminal state) of the Q-learning process. (e.g. 0.5)
• Number of training iterations (e.g. 1,000,000)
• States:
• List of all possible values of x
• List of all possible values of y
• List of all possible values of vx
• List of all possible values of vy
• Actions:
• List of all the possible values of ax
• List of all possible values of ay

### Process

• Initialize the Q[s,a] table to small random values, except for the terminal state which will have a value of 0.
• For a fixed number of training iterations
• Initialize the state to the starting position of the race track
• Initialize a Boolean variable to see if the race car has crossed the finish line (i.e. reached the terminal state)
• While the race car has not reached the finish line
• Select the action a using the Q[s,a] table that corresponds to the highest Q value given state s
• Take action a
• Observe the reward r and the new state s’
• Update the Q[s,a] table:
• Q[s,a] := Q[s,a] + α * (r + γ * Q[s’, argmaxa’(Q[s’,a’])] – Q[s,a])
• s := s’
• Check if we have crossed the finish line

### Output

Algorithm returns the Q[s,a] table which is used to determine the optimal policy.

### Policy Determination

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmaxa(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all position actions given a state and selects the action that has the highest Q-value.

Here is a good video where Professor Tucker Balch from Georgia Tech goes through Q-learning, step-by-step:

# Experimental Design and Tuning Process

## Experimental Design

The Q-learning and value iteration algorithms were implemented for the racetrack problem and then tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to find the finish line was calculated for each algorithm-racetrack combination. The number of training iterations was also monitored. 10 experiments for each algorithm-racetrack combination were performed to create data to graph a learning curve.

At each timestep t of the racetrack problem, we have the following variables (below). We assume the racetrack is a Cartesian grid with an x-axis and a y-axis, and that the system is governed by the standard laws of kinematics:

• State = <xt, yt, vxt, vyt> = variables that describe the current situation (i.e. environment)
• xt = x coordinate of the location of the car at timestep t
• yt = y coordinate of the location of the car at timestep t
• vxt = velocity in the x direction at time step t, where vxt = xt – xt-1
• Maximum vxt = 5
• Minimum vxt = -5
• vyt = velocity in the y direction at timestep t, where vyt = yt – yt-1
• Action = <axt, ayt> = control variables = what the race car (i.e. agent) can do to influence the state
• axt = accelerate in the x direction at timestep t, where axt = vxt  – vx t-1
• -1 (decrease velocity in x-direction by 1)
• 0 (keep same velocity), or
• 1 (increase velocity in x-direction by 1)
• ayt = acceleration in the y direction at timestep t, where ayt = vyt  – vy t-1
• -1 (decrease velocity in y-direction by 1)
• 0 (keep same velocity)
• 1 (increase velocity in y-direction by 1)

### Example Calculation

At t = 0, the race car observes the current state. Location is (2,2) and velocity is (1,0).

• State = <xt, yt, vxt, vyt> = <x0, y0, vx0, vy0> = <2, 2, 1, 0>
• This means the race car is moving east one grid square per timestep because the x-component of the velocity is 1, and the y-component is 0.

After observing the state, the race car selects an action. It accelerates with acceleration (1,1).

• Action=(ax0, ay0)=(1, 1) = (increase velocity in x-direction by 1, increase velocity in y-direction by 1)

At t = 1, the race car observes the state.

• Velocity is now (vx0 + ax0, vy0 + ay0) = (1 + 1, 0 + 1) = (vx1, vy1) = (2, 1)
• Position (i.e. x and y coordinates) is now (x0 + vx1, y0 + vy1) = (2 + 2, 2 + 1) = (x1, y1) = (4, 3)
• Thus, putting it all together, the new state is now <x1, y1, vx1, vy1> = <4, 3, 2, 1>

### Acceleration is Not Guaranteed

There is another twist with the racetrack problem. At any timestep t, when the race car attempts to accelerate, there is a 20% chance that the attempt will fail. In other words, each time the race car selects an action, there is a:

• 20% chance that the attempt fails, and velocity remains unchanged at the next timestep; i.e. (axt, ayt) = (0, 0)
• 80% chance velocity changes at the next timestep; (axt, ayt) = (selected acceleration in the x-direction, selected acceleration in y-direction)

### Must Avoid Crashing Into Walls

The race car must stay on the track. Crashing into walls is bad. Two versions of a “bad crash” will be implemented in this project:

• Bad Crash Version 1: If the car crashes into a wall,
• The car is placed at the nearest position on the track to the place where the car crashed.
• The velocity is set to (0, 0).
• Bad Crash Version 2: If the car crashes into a wall,
• The car must go back to the original starting position.
• The velocity is set to (0, 0).

In this project, the performance of both versions of bad crash will be compared side by side for the reinforcement learning algorithms implemented in this project.

### Reward Function

Because the goal is for the race car to go from the starting position to the finish line in the minimum number of timesteps, we assign a reward of -1 for each move the car makes. Because the reward is negative, this means that the race car incurs a cost (i.e. penalty) of 1 each time the race car moves to another position. The goal is to get to the finish line in as few moves (i.e. time steps) as possible in order to minimize the total cost (i.e. maximize the total reward).

The finish line is the final state (often referred to as terminal or absorbing state). This state has a reward of 0. There is no requirement to stop right on the finish position (i.e. finish line). It is sufficient to cross it.

## Description of Any Tuning Process Applied

In order to compare the performance of the algorithm for the different crash scenarios, the principal hyperparameters were kept the same for all runs. These hyperparameters are listed below.

### Learning Rate

The Learning rate α was set to 0.2.

### Discount Rate

The discount rate γ was set to 0.9.

### Exploration Probability

The exploration probability ε was set to 0.2 as dictated by the problem statement.

### Number of Training Iterations

The number of training iterations was varied in order to computer the learning curve (see Results section below)

# Input Files

Here are the tracks used in this project (S= Starting States; F = Finish States, # = Wall, . = Track):

# Value Iteration Code in Python

Here is the full code for value iteration. This and the racetrack text files above are all you need to run the program:

```import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: value_iteration.py
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the value iteration reinforcement learning
# algorithm for the racetrack problem.
# The racetrack problem is described in full detail in:
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning :
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Value_Iteration"
FILENAME = "R-track.txt"
THIS_TRACK = "R_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
# we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a
# according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 10 # A single training iteration runs through all
# possible states s
NO_RACES = 10 # How many times the race car does a single time trial from
# starting position to the finish line
FRAME_TIME = 0.3 # How many seconds between frames printed to the console
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
MAX_TRAIN_ITER = 50 # Maximum number of training iterations

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# All actions A the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
(-1,0) , (0,0),  (1,0),
(-1,1) , (0,1),  (1,1)]

"""
This method reads in the environment (i.e. racetrack)
:param str filename
:return environment: list of lists
:rtype list
"""
# Open the file named filename in READ mode.
# Make the file an object named 'file'
with open(filename, 'r') as file:

# readline() then returns a list of the lines
# of the input file

# Close the file
file.close()

# Initialize an empty list
environment = []

# Adds a counter to each line in the environment_data list,
# i is the index of each line and line is the actual contents.
# enumerate() helps get not only the line of the environment but also
# the index of that line (i.e. row)
for i,line in enumerate(environment_data):
# Ignore the first line that contains the dimensions of the racetrack
if i > 0:
# Remove leading or trailing whitespace if applicable
line = line.strip()

# If the line is empty, ignore it
if line == "": continue

# Creates a list of lines. Within each line is a list of
# individual characters
# The stuff inside append(stuff) first creates a new_list = []
# It then appends all the values in a given line to that new
# list (e.g. new_list.append(all values inside the line))
# Then we append that new list to the environment list.
# Therefoer, environment is a list of lists.
environment.append([x for x in line])

# Return the environment (i.e. a list of lists/lines)
return environment

def print_environment(environment, car_position = [0,0]):
"""
This method reads in the environment and current
(y,x) position of the car and prints the environment to the console
:param list environment
:param list car_position
"""
# Store value of current grid square
temp = environment[car_position[0]][car_position[1]]

# Move the car to current grid square
environment[car_position[0]][car_position[1]] = "X"

# Delay
time.sleep(FRAME_TIME)

# Clear the printed output
clear()

# For each line in the environment
for line in environment:

# Initialize a string
text = ""

# Add each character to create a line
for character in line:
text += character

# Print the line of the environment
print(text)

# Retstore value of current grid square
environment[car_position[0]][car_position[1]] = temp

def clear():
"""
This method clears the print output
"""
os.system( 'cls' )

def get_random_start_position(environment):
"""
This method reads in the environment and selects a random
starting position on the racetrack (y, x). Note that
(0,0) corresponds to the upper left corner of the racetrack.
:param list environment: list of lines
:return random starting coordinate (y,x) on the racetrack
:rtype tuple
"""
# Collect all possible starting positions on the racetrack
starting_positions = []

# For each row in the environment
for y,row in enumerate(environment):

# For each column in each row of the environment
for x,col in enumerate(row):

# If we are at the starting position
if col == START:

# Add the coordiante to the list of available
# starting positions in the environment
starting_positions += [(y,x)]

# Random shuffle the list of starting positions
shuffle(starting_positions)

# Select a starting position
return starting_positions[0]

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
"""
Get the new velocity values
:param tuple old_vel: (vy, vx)
:param tuple accel: (ay, ax)
:param int min_vel: Minimum velocity of the car
:param int max_vel: Maximum velocity of the car
:return new velocities in y and x directions
"""
new_y = old_vel[0] + accel[0]
new_x = old_vel[1] + accel[1]
if new_x < min_vel: new_x = min_vel
if new_x > max_vel: new_x = max_vel
if new_y < min_vel: new_y = min_vel
if new_y > max_vel: new_y = max_vel

# Return the new velocities
return new_y, new_x

def get_new_position(old_loc, vel, environment):
"""
Get a new position using the old position and the velocity
:param tuple old_loc: (y,x) position of the car
:param tuple vel: (vy,vx) velocity of the car
:param list environment
:return y+vy, x+vx: (new_y,new_x)
"""
y,x = old_loc[0], old_loc[1]
vy, vx = vel[0], vel[1]

# new_y = y+vy, new_x = x + vx
return y+vy, x+vx

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
0), open = [TRACK, START, GOAL]):
"""
Locate the nearest open cell in order to handle crash scenario.
Distance is calculated as the Manhattan distance.
Start from the crash grid square and expand outward from there with
a radius of 1, 2, 3, etc. Forms a diamond search pattern.

For example, a Manhattan distance of 2 would look like the following:
.
...
..#..
...
.
If velocity is provided, search in opposite direction of velocity so that
there is no movement over walls
:param list environment
:param int ycrash: y coordinate where crash happened
:param int xcrash: x coordinate where crash happened
:param int vy: velocity in y direction when crash occurred
:param int vx: velocity in x direction when crash occurred
:param list of strings open: Contains environment types
:return tuple of the nearest open y and x position on the racetrack
"""
# Record number of rows (lines) and columns in the environment
rows = len(environment)
cols = len(environment[0])

# Add expanded coverage for searching for nearest open cell

# Generate a search radius for each scenario

# If car is not moving in y direction
if vy == 0:
# If the velocity in y-direction is negative
elif vy < 0:
# Search in the positive direction
y_off_range = range(0, radius + 1)
else:
# Otherwise search in the negative direction

# For each value in the search radius range of y
for y_offset in y_off_range:

# Start near to crash site and work outwards from there
y = y_crash + y_offset

# If car is not moving in x direction
if vx == 0:
# If the velocity in x-direction is negative
elif vx < 0:
x_range = range(x_crash, x_crash + x_radius + 1)
# If the velocity in y-direction is positive
else:
x_range = range(x_crash - x_radius, x_crash + 1)

# For each value in the search radius range of x
for x in x_range:
# We can't go outside the environment(racetrack) boundary
if y < 0 or y >= rows: continue
if x < 0 or x >= cols: continue

# If we find and open cell, return that (y,x) open cell
if environment[y][x] in open:
return(y,x)

# No open grid squares found on the racetrack
return

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
"""
This method generates the new state s' (position and velocity) from the old
state s and the action a taken by the race car. It also takes as parameters
the two different crash scenarios (i.e. go to nearest
open position on the race track or go back to start)
:param int old_y: The old y position of the car
:param int old_x: The old x position of the car
:param int old_vy: The old y velocity of the car
:param int old_vx: The old x velocity of the car
:param tuple accel: (ay,ax) - acceleration in y and x directions
:param list environment: The racetrack
:param boolean deterministic: True if we always follow the policy
:return s' where s' = new_y, new_x, new_vy, and new_vx
:rtype int
"""
# This method is deterministic if the same output is returned given
# the same input information
if not deterministic:

# If action fails (car fails to take the prescribed action a)
if random() > PROB_ACCELER_SUCCESS:
#print("Car failed to accelerate!")
accel = (0,0)

# Using the old velocity values and the new acceleration values,
# get the new velocity value
new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

# Using the new velocity values, update with the new position
temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),(
environment))

# Find the nearest open cell on the racetrack to this new position
new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy,
new_vx)
# If a crash happens (i.e. new position is not equal to the nearest
# open position on the racetrack
if new_y != temp_y or new_x != temp_x:

# If this is a crash in which we would have to return to the
# starting position of the racetrack and we are not yet
# on the finish line
if bad_crash and environment[new_y][new_x] != GOAL:

new_y, new_x = get_random_start_position(environment)

# Velocity of the race car is set to 0.
new_vy, new_vx = 0,0

# Return the new state
return new_y, new_x, new_vy, new_vx

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
"""
This method returns the policy pi(s) based on the action taken in each state
that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
best action that the race car should take in each state is the one that
maximizes the value of Q. (* means optimal)
:param int cols: Number of columns in the environment
:param int rows: Number of rows (i.e. lines) in the environment
:param list vel_range: Range of the velocity of the race car
:param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
:return pi : the policy
:rtype: dictionary: key is the state tuple, value is the
action tuple (ay,ax)
"""
# Create an empty dictionary called pi
pi = {}

# For each state s in the environment
for y in range(rows):
for x in range(cols):
for vy in vel_range:
for vx in vel_range:
# Store the best action for each state...the one that
# maximizes the value of Q.
# argmax looks across all actions given a state and
# returns the index ai of the maximum Q value
pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]

# Return pi(s)
return(pi)

def value_iteration(environment, bad_crash = False, reward = (
0.0), no_training_iter = NO_TRAINING_ITERATIONS):
"""
This method is the value iteration algorithm.
:param list environment
:param int reward of the terminal states (i.e. finish line)
:param int no_training_iter
:return policy pi(s) which maps a given state to an optimal action
:rtype dictionary
"""
# Calculate the number of rows and columns of the environment
rows = len(environment)
cols = len(environment[0])

# Create a table V(s) that will store the optimal Q-value for each state.
# This table will help us determine when we should stop the algorithm and
# return the output. Initialize all the values of V(s) to arbitrary values,
# except the terminal state (i.e. finish line state) that has a value of 0.
# values[y][x][vy][vx]
# Read from left to right, we create a list of vx values. Then for each
# vy value we assign the list of vx values. Then for each x value, we assign
# the list of vy values (which contain a list of vx values), etc.
# This is called list comprehension.
values = [[[[random() for _ in vel_range] for _ in vel_range] for _ in (
line)] for line in environment]

# Set the finish line states to 0
for y in range(rows):
for x in range(cols):
# Terminal state has a value of 0
if environment[y][x] == GOAL:
for vy in vel_range:
for vx in vel_range:
values[y][x][vy][vx] = reward

# Initialize all Q(s,a) to arbitrary values, except the terminal state
# (i.e. finish line states) that has a value of 0.
# Q[y][x][vy][vx][ai]
Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
vel_range)] for _ in line] for line in environment]

# Set finish line state-action pairs to 0
for y in range(rows):
for x in range(cols):
# Terminal state has a value of 0
if environment[y][x] == GOAL:
for vy in vel_range:
for vx in vel_range:
for ai, a in enumerate(actions):
Q[y][x][vy][vx][ai] = reward

# This is where we train the agent (i.e. race car). Training entails
# optimizing the values in the tables of V(s) and Q(s,a)
for t in range(no_training_iter):

# Keep track of the old V(s) values so we know if we reach stopping
# criterion
values_prev = deepcopy(values)

# When this value gets below the error threshold, we stop training.
# This is the maximum change of V(s)
delta = 0.0

# For all the possible states s in S
for y in range(rows):
for x in range(cols):
for vy in vel_range:
for vx in vel_range:

# If the car crashes into a wall
if environment[y][x] == WALL:

# Wall states have a negative value
# I set some arbitrary negative value since
# we want to train the car not to hit walls
values[y][x][vy][vx] = -9.9

# Go back to the beginning
# of this inner for loop so that we set
# all the other wall states to a negative value
continue

# For each action a in the set of possible actions A
for ai, a in enumerate(actions):

# The reward is -1 for every state except
# for the finish line states
if environment[y][x] == GOAL:
r = reward
else:
r = -1

# Get the new state s'. s' is based on the current
# state s and the current action a
new_y, new_x, new_vy, new_vx = act(
y,x,vy,vx,a,environment,deterministic = True,

# V(s'): value of the new state when taking action
# a from state s. This is the one step look ahead.
value_of_new_state = values_prev[new_y][new_x][
new_vy][new_vx]

# Get the new state s'. s' is based on the current
# state s and the action (0,0)
new_y, new_x, new_vy, new_vx = act(
y,x,vy,vx,(0,0),environment,deterministic = (

# V(s'): value of the new state when taking action
# (0,0) from state s. This is the value if for some
# reason the race car attemps to accelerate but
# fails
value_of_new_state_if_action_fails = values_prev[
new_y][new_x][new_vy][new_vx]

# Expected value of the new state s'
# Note that each state-action pair has a unique
# value for s'
expected_value = (
PROB_ACCELER_SUCCESS * value_of_new_state) + (
PROB_ACCELER_FAILURE * (
value_of_new_state_if_action_fails))

# Update the Q-value in Q[s,a]
# immediate reward + discounted future value
Q[y][x][vy][vx][ai] = r + (
DISC_RATE * expected_value)

# Get the action with the highest Q value
argMaxQ = np.argmax(Q[y][x][vy][vx])

# Update V(s)
values[y][x][vy][vx] = Q[y][x][vy][vx][argMaxQ]

# Make sure all the rewards to 0 in the terminal state
for y in range(rows):
for x in range(cols):
# Terminal state has a value of 0
if environment[y][x] == GOAL:
for vy in vel_range:
for vx in vel_range:
values[y][x][vy][vx] = reward

# See if the V(s) values are stabilizing
# Finds the maximum change of any of the states. Delta is a float.
delta = max([max([max([max([abs(values[y][x][vy][vx] - values_prev[y][
x][vy][vx]) for vx in vel_range]) for vy in (
vel_range)]) for x in range(cols)]) for y in range(rows)])

# If the values of each state are stabilizing, return the policy
# and exit this method.
if delta < ERROR_THRES:
return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

def do_time_trial(environment, policy, bad_crash = False, animate = True,
max_steps = MAX_STEPS):
"""
Race car will do a time trial on the race track according to the policy.
:param list environment
:param dictionary policy: A dictionary containing the best action for a
given state. The key is the state y,x,vy,vx and value is the action
(ax,ay) acceleration
:param boolean bad_crash: The crash scenario. If true, race car returns to
starting position after crashes
:param boolean animate: If true, show the car on the racetrack at each
timestep
:return i: Total steps to complete race (i.e. from starting position to
finish line)
:rtype int

"""
# Copy the environment
environment_display = deepcopy(environment)

# Get a starting position on the race track
starting_pos = get_random_start_position(environment)
y,x = starting_pos
vy,vx = 0,0  # We initialize velocity to 0

# Keep track if we get stuck
stop_clock = 0

# Begin time trial
for i in range(max_steps):

# Show the race car on the racetrack
if animate:
print_environment(environment_display, car_position = [y,x])

a = policy[(y,x,vy,vx)]

# If we are at the finish line, stop the time trial
if environment[y][x] == GOAL:
return i

# Take action and get new a new state s'

# Determine if the car gets stuck
if vy == 0 and vx == 0:
stop_clock += 1
else:
stop_clock = 0

# We have gotten stuck as the car has not been moving for 5 timesteps
if stop_clock == 5:
return max_steps

# Program has timed out
return max_steps

def main():
"""
The main method of the program
"""
print("Welcome to the Racetrack Control Program!")
" Reinforcement Learning Algorithm\n")
print("Track: " + THIS_TRACK)
print()
print("What happens to the car if it crashes into a wall?")
option_1 = """1. Starts from the nearest position on the track to the
place where it crashed."""
option_2 = """2. Returns back to the original starting position."""
print(option_1)
print(option_2)
crash_scenario = int(input("Crash scenario (1 or 2): "))
no_training_iter = int(input(
"Enter the initial number of training iterations (e.g. 5): "))
print("\nThe race car is training. Please wait...")

# Directory where the racetrack is located
#racetrack_name = input("Enter the path to your input file: ")
racetrack_name = FILENAME

# How many times the race car will do a single time trial
races = NO_RACES

while(no_training_iter < MAX_TRAIN_ITER):

# Keep track of the total number of steps
total_steps = 0

# Record the crash scenario
if crash_scenario == 1:
else:

# Retrieve the policy
no_training_iter=no_training_iter)

for each_race in range(races):
total_steps += do_time_trial(racetrack, policy, bad_crash = (

print("Number of Training Iterations: " + str(no_training_iter))
if crash_scenario == 1:
print("Bad Crash Scenario: " + option_1)
else:
print("Bad Crash Scenario: " + option_2)
print("Average Number of Steps the Race Car Needs to Take Before " +
"Finding the Finish Line: " + str(total_steps/races) + " steps\n")
print("\nThe race car is training. Please wait...")

# Delay
time.sleep(FRAME_TIME + 4)

# Testing statistics
test_stats_file = THIS_TRACK
test_stats_file += "_"
test_stats_file += ALGORITHM_NAME + "_iter"
test_stats_file += str(no_training_iter)+ "_cr"
test_stats_file += str(crash_scenario) + "_stats.txt"

## Open a test_stats_file
outfile_ts = open(test_stats_file,"w")

outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write("Track: ")
outfile_ts.write(THIS_TRACK)
outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
if crash_scenario == 1:
outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
else:
outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
outfile_ts.write("Average Number of Steps the Race Car Took " +
"Before Finding the Finish Line: " + str(total_steps/races) +
" steps\n")

# Show functioning of the program
trace_runs_file = THIS_TRACK
trace_runs_file += "_"
trace_runs_file += ALGORITHM_NAME + "_iter"
trace_runs_file += str(no_training_iter) + "_cr"
trace_runs_file += str(crash_scenario) + "_trace.txt"

if no_training_iter <= 5:
## Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")

# Print trace runs that demonstrate proper functioning of the code
outfile_tr.write(str(policy))

outfile_tr.close()

## Close the files
outfile_ts.close()

no_training_iter += 5

main()
```

# Q-Learning Code in Python

And here is the code for Q-Learning:

```import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: q_learning.py
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the Q-learning reinforcement learning
# algorithm for the racetrack problem
# The racetrack problem is described in full detail in:
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning :
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Q_Learning"
FILENAME = "L-track.txt"
THIS_TRACK = "L_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
# we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a
# according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 500000 # A single training iteration runs through all
# possible states s
TRAIN_ITER_LENGTH = 10 # The maximum length of each training iteration
MAX_TRAIN_ITER = 10000000 # Maximum number of training iterations
NO_RACES = 10 # How many times the race car does a single time trial from
# starting position to the finish line
LEARNING_RATE = 0.25 # If applicable, also known as alpha
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
FRAME_TIME = 0.1 # How many seconds between frames printed to the console

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# Actions the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
(-1,0) , (0,0),  (1,0),
(-1,1) , (0,1),  (1,1)]

"""
This method reads in the environment (i.e. racetrack)
:param str filename
:return environment: list of lists
:rtype list
"""
# Open the file named filename in READ mode.
# Make the file an object named 'file'
with open(filename, 'r') as file:

# readline() then returns a list of the lines
# of the input file

# Close the file
file.close()

# Initialize an empty list
environment = []

# Adds a counter to each line in the environment_data list,
# i is the index of each line and line is the actual contents.
# enumerate() helps get not only the line of the environment but also
# the index of that line (i.e. row)
for i,line in enumerate(environment_data):
# Ignore the first line that contains the dimensions of the racetrack
if i > 0:
# Remove leading or trailing whitespace if applicable
line = line.strip()

# If the line is empty, ignore it
if line == "": continue

# Creates a list of lines. Within each line is a list of
# individual characters
# The stuff inside append(stuff) first creates a new_list = []
# It then appends all the values in a given line to that new
# list (e.g. new_list.append(all values inside the line))
# Then we append that new list to the environment list.
# Therefoer, environment is a list of lists.
environment.append([x for x in line])

# Return the environment (i.e. a list of lists/lines)
return environment

def print_environment(environment, car_position = [0,0]):
"""
This method reads in the environment and current
(y,x) position of the car and prints the environment to the console
:param list environment
:param list car_position
"""
# Store value of current grid square
temp = environment[car_position[0]][car_position[1]]

# Move the car to current grid square
environment[car_position[0]][car_position[1]] = "X"

# Delay
time.sleep(FRAME_TIME)

# Clear the printed output
clear()

# For each line in the environment
for line in environment:

# Initialize a string
text = ""

# Add each character to create a line
for character in line:
text += character

# Print the line of the environment
print(text)

# Retstore value of current grid square
environment[car_position[0]][car_position[1]] = temp

def clear():
"""
This method clears the print output
"""
os.system( 'cls' )

def get_random_start_position(environment):
"""
This method reads in the environment and selects a random
starting position on the racetrack (y, x). Note that
(0,0) corresponds to the upper left corner of the racetrack.
:param list environment: list of lines
:return random starting coordinate (y,x) on the racetrack
:rtype tuple
"""
# Collect all possible starting positions on the racetrack
starting_positions = []

# For each row in the environment
for y,row in enumerate(environment):

# For each column in each row of the environment
for x,col in enumerate(row):

# If we are at the starting position
if col == START:

# Add the coordiante to the list of available
# starting positions in the environment
starting_positions += [(y,x)]

# Random shuffle the list of starting positions
shuffle(starting_positions)

# Select a starting position
return starting_positions[0]

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
"""
This method generates the new state s' (position and velocity) from the old
state s and the action a taken by the race car. It also takes as parameters
the two different crash scenarios (i.e. go to nearest
open position on the race track or go back to start)
:param int old_y: The old y position of the car
:param int old_x: The old x position of the car
:param int old_vy: The old y velocity of the car
:param int old_vx: The old x velocity of the car
:param tuple accel: (ay,ax) - acceleration in y and x directions
:param list environment: The racetrack
:param boolean deterministic: True if we always follow the policy
:return s' where s' = new_y, new_x, new_vy, and new_vx
:rtype int
"""
# This method is deterministic if the same output is returned given
# the same input information
if not deterministic:

# If action fails (car fails to take the prescribed action a)
if random() > PROB_ACCELER_SUCCESS:
#print("Car failed to accelerate!")
accel = (0,0)

# Using the old velocity values and the new acceleration values,
# get the new velocity value
new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

# Using the new velocity values, update with the new position
temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),(
environment))

# Find the nearest open cell on the racetrack to this new position
new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy,
new_vx)
# If a crash happens (i.e. new position is not equal to the nearest
# open position on the racetrack
if new_y != temp_y or new_x != temp_x:

# If this is a crash in which we would have to return to the
# starting position of the racetrack and we are not yet
# on the finish line
if bad_crash and environment[new_y][new_x] != GOAL:

new_y, new_x = get_random_start_position(environment)

# Velocity of the race car is set to 0.
new_vy, new_vx = 0,0

# Return the new state
return new_y, new_x, new_vy, new_vx

def get_new_position(old_loc, vel, environment):
"""
Get a new position using the old position and the velocity
:param tuple old_loc: (y,x) position of the car
:param tuple vel: (vy,vx) velocity of the car
:param list environment
:return y+vy, x+vx: (new_y,new_x)
"""
y,x = old_loc[0], old_loc[1]
vy, vx = vel[0], vel[1]

# new_y = y+vy, new_x = x + vx
return y+vy, x+vx

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
"""
Get the new velocity values
:param tuple old_vel: (vy, vx)
:param tuple accel: (ay, ax)
:param int min_vel: Minimum velocity of the car
:param int max_vel: Maximum velocity of the car
:return new velocities in y and x directions
"""
new_y = old_vel[0] + accel[0]
new_x = old_vel[1] + accel[1]
if new_x < min_vel: new_x = min_vel
if new_x > max_vel: new_x = max_vel
if new_y < min_vel: new_y = min_vel
if new_y > max_vel: new_y = max_vel

# Return the new velocities
return new_y, new_x

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
0), open = [TRACK, START, GOAL]):
"""
Locate the nearest open cell in order to handle crash scenario.
Distance is calculated as the Manhattan distance.
Start from the crash grid square and expand outward from there with
a radius of 1, 2, 3, etc. Forms a diamond search pattern.

For example, a Manhattan distance of 2 would look like the following:
.
...
..#..
...
.
If velocity is provided, search in opposite direction of velocity so that
there is no movement over walls
:param list environment
:param int ycrash: y coordinate where crash happened
:param int xcrash: x coordinate where crash happened
:param int vy: velocity in y direction when crash occurred
:param int vx: velocity in x direction when crash occurred
:param list of strings open: Contains environment types
:return tuple of the nearest open y and x position on the racetrack
"""
# Record number of rows (lines) and columns in the environment
rows = len(environment)
cols = len(environment[0])

# Add expanded coverage for searching for nearest open cell

# Generate a search radius for each scenario

# If car is not moving in y direction
if vy == 0:
# If the velocity in y-direction is negative
elif vy < 0:
# Search in the positive direction
y_off_range = range(0, radius + 1)
else:
# Otherwise search in the negative direction

# For each value in the search radius range of y
for y_offset in y_off_range:

# Start near to crash site and work outwards from there
y = y_crash + y_offset

# If car is not moving in x direction
if vx == 0:
# If the velocity in x-direction is negative
elif vx < 0:
x_range = range(x_crash, x_crash + x_radius + 1)
# If the velocity in y-direction is positive
else:
x_range = range(x_crash - x_radius, x_crash + 1)

# For each value in the search radius range of x
for x in x_range:
# We can't go outside the environment(racetrack) boundary
if y < 0 or y >= rows: continue
if x < 0 or x >= cols: continue

# If we find and open cell, return that (y,x) open cell
if environment[y][x] in open:
return(y,x)

# No open grid squares found on the racetrack
return

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
"""
This method returns the policy pi(s) based on the action taken in each state
that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
best action that the race car should take in each state is the one that
maximizes the value of Q. (* means optimal)
:param int cols: Number of columns in the environment
:param int rows: Number of rows (i.e. lines) in the environment
:param list vel_range: Range of the velocity of the race car
:param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
:return pi : the policy
:rtype: dictionary: key is the state tuple, value is the
action tuple (ay,ax)
"""
# Create an empty dictionary called pi
pi = {}

# For each state s in the environment
for y in range(rows):
for x in range(cols):
for vy in vel_range:
for vx in vel_range:
# Store the best action for each state...the one that
# maximizes the value of Q.
# argmax looks across all actions given a state and
# returns the index ai of the maximum Q value
pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]

# Return pi(s)
return(pi)

def q_learning(environment, bad_crash = False, reward = 0.0, no_training_iter =(
NO_TRAINING_ITERATIONS), train_iter_length = TRAIN_ITER_LENGTH):
"""
Return a policy pi that maps states to actions
Each episode uses a different initial state. This forces the agent to fully
:param list environment
:param int reward of the terminal states (i.e. finish line)
:param int no_training_iter
:param int train_iter_length
:return policy pi(s) which maps a given state to an optimal action
:rtype dictionary
"""
rows = len(environment)
cols = len(environment[0])

# Initialize all Q(s,a) to arbitrary values, except the terminal state
# (i.e. finish line states) that has a value of 0.
# Q[y][x][vy][vx][ai]
Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
vel_range)] for _ in line] for line in environment]

# Set finish line state-action pairs to 0
for y in range(rows):
for x in range(cols):
# Terminal state has a value of 0
if environment[y][x] == GOAL:
for vy in vel_range:
for vx in vel_range:
for ai, a in enumerate(actions):
Q[y][x][vy][vx][ai] = reward

# We run many training iterations for different initial states in order
# to explore the environment as much as possible
for iter in range(no_training_iter):

# Reset all the terminal states to the value of the goal
for y in range(rows):
for x in range(cols):
if environment[y][x] == GOAL:
Q[y][x] = [[[reward for _ in actions] for _ in (
vel_range)] for _ in vel_range]

# Select a random initial state
# from anywhere on the racetrack
y = np.random.choice(range(rows))
x = np.random.choice(range(cols))
vy = np.random.choice(vel_range)
vx = np.random.choice(vel_range)

# Do a certain number of iterations for each episode
for t in range(train_iter_length):
if environment[y][x] == GOAL:
break
if environment[y][x] == WALL:
break

# Choose the best action for the state s
a = np.argmax(Q[y][x][vy][vx])

# Act and then observe a new state s'
new_y, new_x, new_vy, new_vx = act(y, x, vy, vx, actions[
r = -1

# Update the Q table based on the immediate reward received from
# taking action a in state s plus the discounted future reward
Q[y][x][vy][vx][a] = ((1 - LEARNING_RATE)*Q[y][x][vy][vx][a] +
LEARNING_RATE*(r + DISC_RATE*max(Q[new_y][new_x][
new_vy][new_vx])))

# The new state s' now becomes s
y, x, vy, vx = new_y, new_x, new_vy, new_vx

return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

def do_time_trial(environment, policy, bad_crash = False, animate = True,
max_steps = MAX_STEPS):
"""
Race car will do a time trial on the race track according to the policy.
:param list environment
:param dictionary policy: A dictionary containing the best action for a
given state. The key is the state y,x,vy,vx and value is the action
(ax,ay) acceleration
:param boolean bad_crash: The crash scenario. If true, race car returns to
starting position after crashes
:param boolean animate: If true, show the car on the racetrack at each
timestep
:return i: Total steps to complete race (i.e. from starting position to
finish line)
:rtype int

"""
# Copy the environment
environment_display = deepcopy(environment)

# Get a starting position on the race track
starting_pos = get_random_start_position(environment)
y,x = starting_pos
vy,vx = 0,0  # We initialize velocity to 0

# Keep track if we get stuck
stop_clock = 0

# Begin time trial
for i in range(max_steps):

# Show the race car on the racetrack
if animate:
print_environment(environment_display, car_position = [y,x])

a = policy[(y,x,vy,vx)]

# If we are at the finish line, stop the time trial
if environment[y][x] == GOAL:
return i

# Take action and get new a new state s'

# Determine if the car gets stuck
if vy == 0 and vx == 0:
stop_clock += 1
else:
stop_clock = 0

# We have gotten stuck as the car has not been moving for 5 timesteps
if stop_clock == 5:
return max_steps

# Program has timed out
return max_steps

def main():
"""
The main method of the program
"""
print("Welcome to the Racetrack Control Program!")
" Reinforcement Learning Algorithm\n")
print("Track: " + THIS_TRACK)
print()
print("What happens to the car if it crashes into a wall?")
option_1 = """1. Starts from the nearest position on the track to the
place where it crashed."""
option_2 = """2. Returns back to the original starting position."""
print(option_1)
print(option_2)
crash_scenario = int(input("Crash scenario (1 or 2): "))
no_training_iter = int(input(
"Enter the initial number of training iterations (e.g. 500000): "))
print("\nThe race car is training. Please wait...")

# Directory where the racetrack is located
#racetrack_name = input("Enter the path to your input file: ")
racetrack_name = FILENAME

# How many times the race car will do a single time trial
races = NO_RACES

while(no_training_iter <= MAX_TRAIN_ITER):

# Keep track of the total number of steps
total_steps = 0

# Record the crash scenario
if crash_scenario == 1:
else:

# Retrieve the policy
policy =  q_learning(racetrack, bad_crash = (

for each_race in range(races):
total_steps += do_time_trial(racetrack, policy, bad_crash = (

print("Number of Training Iterations: " + str(no_training_iter))
if crash_scenario == 1:
print("Bad Crash Scenario: " + option_1)
else:
print("Bad Crash Scenario: " + option_2)
print("Average Number of Steps the Race Car Needs to Take Before " +
"Finding the Finish Line: " + str(total_steps/races) + " steps\n")
print("\nThe race car is training. Please wait...")

# Delay
time.sleep(FRAME_TIME + 4)

# Testing statistics
test_stats_file = THIS_TRACK
test_stats_file += "_"
test_stats_file += ALGORITHM_NAME + "_iter"
test_stats_file += str(no_training_iter)+ "_cr"
test_stats_file += str(crash_scenario) + "_stats.txt"

## Open a test_stats_file
outfile_ts = open(test_stats_file,"w")

outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write("Track: ")
outfile_ts.write(THIS_TRACK)
outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
if crash_scenario == 1:
outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
else:
outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
outfile_ts.write("Average Number of Steps the Race Car Took " +
"Before Finding the Finish Line: " + str(total_steps/races) +
" steps\n")

# Show functioning of the program
trace_runs_file = THIS_TRACK
trace_runs_file += "_"
trace_runs_file += ALGORITHM_NAME + "_iter"
trace_runs_file += str(no_training_iter) + "_cr"
trace_runs_file += str(crash_scenario) + "_trace.txt"

if no_training_iter <= 5:
## Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")

# Print trace runs that demonstrate proper functioning of the code
outfile_tr.write(str(policy))

outfile_tr.close()

## Close the files
outfile_ts.close()

if no_training_iter == 500000:
no_training_iter += 500000
else:
no_training_iter += 1000000

main()

```

# Experimental Output and Analysis

Here are the results:

# Analysis

## Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

Both the value iteration and Q-learning algorithms were successfully able to finish the race for both crash scenarios and for both the R-track and the L-track. The race car completed the R-track in fewer time steps than the L-track, which was expected.

The crash policy in which the car had to return to the starting position after it crashed versus going to the nearest open position negatively impacted performance for both algorithms. Time trial performance was superior for the value iteration algorithm compared to the Q learning algorithm. This result was expected given that the value iteration algorithm had access to the transition and the reward probabilities during the training phase. In other words, value iteration it gets a complete model of the entire track, whereas Q-learning has to be forced to explore the environment in order to learn.

## Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

My hypothesis was correct. Value iteration generated a learning policy faster than Q-learning because it had access to the transition and reward functions. Q-learning required more than 6,000,000 training iterations on the R-track to achieve the time trial performance that the value iteration algorithm was able to produce in only 45 training iterations (holding all other hyperparameters constant). And whereas the Q-learning algorithm took 2-3 hours to train to get good time trial results, value iteration never took more than 10 minutes for both the R-track and L-track under both of the bad crash scenarios.

## Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

The results show that my hypothesis was correct. It took longer for the car to finish the race for the crash scenario in which the race car needed to return to the original starting position each time it crashed into a wall. In other words, Bad Crash Version 1 (return to start) performance was better than Bad Crash Version 2 (return to nearest open position) performance. These results are what I expected since always returning to the starting position after a crash hampers the ability of the race car to continually explore new states.

# Summary and Conclusions

My hypotheses were correct. Here are the following conclusions:

• Value iteration and Q-learning are powerful reinforcement learning algorithms that can enable an agent to learn autonomously.
• Value iteration led to faster learning than the Q-learning algorithm.
• A crash policy in which the race car always returns to the starting position after a crash negatively impacts performance.

# Video Demonstration

Here is a video demonstration of the algorithms in action.

# References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 81-138.

Russell, S. J., Norvig, P., & Davis, E. (2010). Artificial intelligence : A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning : An Introduction . Cambridge, Massachusetts: The MIT Press.

## How Reinforcement Learning Works

Let’s examine what reinforcement learning is all about by looking at an example.

We have a race car that drives all by itself (i.e. autonomous race car). The race car will move around in an environment. Formally, the race car is known in reinforcement learning language as the agent. The agent is the entity in a reinforcement learning problem that is learning and making decisions. An agent in this example is a race car, but in other applications it could be a robot, self-driving car, autonomous helicopter, etc.

The race car will interact with the environment. The environment in this example is the racetrack. It is represented as the little cloud in the image above. Formally, the environment is the world in which the agent learns and makes decisions. In this example, it is a race track, but in other applications it could be a chessboard, bedroom, warehouse, stock market, factory, etc.

In order to interact with the environment, the race car has to take actions (i.e. accelerate, decelerate, turn). Each time the race car takes an action it will change the environment. Formally, actions are the set of things that an agent can do. In a board game like chess, for example, the set of actions could be move-left, move-right, move-up, and move-down.

Each time the environment changes, the race car will sense the environment. In this example, sensing the environment means taking an observation of its current position on the racetrack and its velocity (i.e. speed). The position and velocity variables are what we call the current state of the environment. The state is what describes the current environment. In this example, position and velocity constitute the state, but in other applications the state could be any kind of observable value (i.e. roll, pitch, yaw, angle of rotation, force, etc.)

After the race car observes the new state, it has to process that state in order to decide what to do next. That decision-making process is called the policy of the agent. You can think of the policy as a function. It takes as inputs the observed state (denoted as s) and outputs an action (denoted as a). The policy is commonly represented as the Greek letter π. So, the function is formally π(s) = a.

In robotics, the cycle I described above is often referred to as the sense-think-act cycle

Diving into even greater detail on how this learning cycle works…once the policy of the race car (i.e. the agent) outputs an action, the race car then takes that action. That action causes the environment to transition to a new state. That transition is often represented as the letter T (i.e. the transition function). The transition function takes as input the previous state and the new action. The output of the transition function is a new state.

The robot then observes the new state, examines its policy, and executes a new action. This cycle repeats over and over again.

So the question now is, how is the policy π(s) determined? In other words, how does the race car know what action to take (i.e. accelerate, decelerate, turn) when it observes a particular state (i.e. position on the racetrack and velocity)?

In order to answer those questions, we need to add another piece to the diagram. That piece is known as the reward. The reward is what helps the race car develop a policy (i.e. a plan of what action to take in each state). Let us draw the reward piece on the diagram and then take a closer look at what this is.

Every time the race car is in a particular state and takes an action, there is a reward that the race car receives from the environment for taking that action in that particular state. That reward can be either a positive reward or a negative reward. Consider the reward the feedback that the race car (i.e. agent) receives from the environment for taking an action. The reward tells the race car how well it is doing.

It is helpful to imagine that there is a pocket on the side of the race car that stores “money.” Every time the race car takes an action from a particular state, the environment either adds money to that pocket (positive reward) or removes money from that pocket (negative reward).

The actual value for the reward (i.e. r in the diagram above) depends on what is known as the reward function. We can denote the reward function as R. The reward function takes as inputs a state, action, and new state and outputs the reward.

R[s, a, s’] —-> r

The reward is commonly an integer (e.g. +1 for a positive reward; -1 for a negative reward). In this project, the reward for each move is -1. The reason why the reward for each move is -1 is because we want to teach the race car to move from the starting position to the finish line in the minimum number of timesteps. By penalizing the race car each time it makes a move, we ensure that it gradually learns how to avoid penalties and finishes the race as quickly as possible.

Inside the race car, there is an algorithm that keeps track of what the rewards were when it took an action in a particular state. This algorithm helps the race car improve over time. Improvement in this case means learning a policy that enables it to take actions (given a particular state) that maximize its total reward.

We will represent this algorithm as Q, where Q is used by the race car in order to develop an optimal policy, a policy that enables the race car to maximize its total reward.

# Reinforcement Learning Summary

Summarizing the steps in the previous section, on a high level reinforcement learning has the following steps:

1. The agent (i.e race car) takes an observation of the environment. This observation is known as the state.
2. Based on this state, the race car uses its policy to determine what action to take, where policy is typically a lookup table that matches states to particular actions.
3. Based on the state and the action taken by the agent, the environment transitions to a new state and generates a reward.
4. The reward is passed from the environment to the agent.
5. The agent keeps track of the rewards it receives for the actions it takes in the environment from a particular state. It uses the reward information to update its policy. The goal of the agent is to learn the best policy over time (i.e. where policy is typically a lookup table that tells the race car what action to take given a particular state). The best policy is the one that maximizes the total reward of the agent.

# Markov Decision Problem

The description of reinforcement learning I outlined in the previous section is known as a Markov decision problem. A Markov decision problem has the following components:

• Set of states S: All the different values of s that can be observed by the race car (e.g. position on the racetrack, velocity). In this project, the set of states is all the different values that <xt, yt, vxt, vyt> can be.
• Set of actions A: All the different values of a that are the actions performed by the race car on the environment (e.g. accelerate, decelerate, turn). In this project, the set of actions is all the different values that = <axt, ayt> can be.
• Transition Function T[s, a, s’]
• Where T[s, a, s’] stands for the probability the race car finds itself in a new state s’ (at the next timestep) given that it was in state s and took action a.
• In probability notation, this is formally written as:
• T[s, a, s’] = P[new state = s’ | state = s AND action = a], where P means probability and | means “given” or “conditioned upon.”
• Reward Function R[s,a,s’]
• Given a particular state s and a particular action a that changes to state from s to s’, the reward function determines what the reward will be (e.g. -1 reward for all portions of the racetrack except for the finish line, which has reward of 0).

The goal of a reinforcement learning algorithm is to find the policy (denoted as π(s)) that can tell the race car the best action to take in any particular state, where “best” in this context means the action that will maximize the total reward the race car receives from the environment. The end result is the best sequence of actions the race car can take to maximize its reward (thereby finishing the race in the minimum number of timesteps). The sequence of actions the race car takes from the starting position to the finish line is known denotes an episode or trial. (Alpaydin, 2014). Recall that the policy is typically a lookup table that matches states (input) to actions (output).

There are a number of algorithms that can be used to enable the race car to learn a policy that will enable it to achieve its goal of moving from the starting position of the racetrack to the finish line as quickly as possible. The two algorithms implemented in this project were value iteration and Q-learning. These algorithms will be discussed in detail in future posts.

This video series by Dr. Tucker Balch, a professor at Georgia Tech, provides an excellent introduction to reinforcement learning.

In order to answer this question, let us break it down piece-by-piece.

# What is Quickprop?

Normal backpropagation when training an artificial feedforward neural network can take a long time to converge on the optimal set of weights. Updating the weights in the first layers of a neural network requires errors that are propagated backwards from the output layer. This procedure might work well for small networks, but when there are a lot of hidden layers, standard backpropagation can really slow down the training process.

Scott E. Fahlman, a Professor Emeritus at Carnegie Mellon University in the School of Computer Science, decided in the late 1980s to speed up backpropagation by inventing a new algorithm called Quickprop which was inspired by Newton’s method, a popular root-finding algorithm.

The idea behind QuickProp is to take the biggest steps possible without overstepping the solution (Kirk, 2015). Instead of the error information propagating backwards from the output layer (as is the case with standard backpropagation), QuickProp needs just the current and previous values of the weights, as well as the error derivative that was calculated during the most recent epoch. This twist on backpropagation means that the weight update process is now entirely local with respect to each node.

In Quickprop, we assume that the shape of the cost function for a given weight when near its optimal value is a paraboloid. With this assumption we can then follow a method similar to Newton’s method for converging on the optimal solution.

The downside of Quickprop is that it can get chaotic if the surface of the cost function has a lot of local minima (Kirk, 2015).

Title Image Source: Wikimedia Commons

# What is the Vanishing Gradient Problem?

The vanishing gradient problem is as follows: As a neural network contains more and more layers, the gradients of the cost function with respect to the weights at each of the nodes gets smaller and smaller, approaching zero. Teeny tiny gradients make the network increasingly more difficult to train.

The vanishing gradient problem is caused by the nature of backpropagation as well as the activation function (e.g. logistic sigmoid function). Backpropagation computes gradients using the chain rule. And since the sigmoid function generates gradients in the range (0, 1), at each step of training, you are multiplying a chain of small numbers together in order to calculate the gradients.

Consistently multiplying smaller and smaller gradient values produces smaller and smaller weight update values since the magnitude of the weight updates (i.e. step size) is proportional to the learning rate and the gradients.

Thus, because the gradients become vanishingly small, the algorithm takes longer and longer to converge on an optimal set of weights. And since the weights will only make teeny tiny changes during each step of the training phase, the weights do not update effectively, and the neural network network losses accuracy.

Cascade correlation is a method that addresses the question of how many hidden layers and how many hidden nodes we need. Instead of starting with a fixed network with a set number of hidden layers and nodes per hidden layer, the cascade correlation method starts with a minimal network and then trains and adds one hidden layer at a time.

Some scientists have combined the Cascade correlation learning architecture with the Quickprop weight update rule (Abrahart, 2004). When combined in this fashion, Cascade correlation prevents there from being a vanishing gradient problem due to the fact the gradient is never actually propagated from output all the way back to input through the various layers.

As noted by Marquez et al. (2018), “such training of deep networks in a cascade, directly circumvents the well-known vanishing gradient problem by ensuring that the output is always adjacent to the layer being trained.” This is because of the way the weights are locked as new layers are inserted, and the inputs are still tied directly to the outputs.

So to answer the question posed in the title, it’s really the cascade correlation architecture that has the greatest effect on reducing the vanishing gradient problem even though there is some benefit from the Quickprop algorithm.

# References

Abrahart, Robert J., Pauline E. Kneale, and Linda M. See. Neural networks for hydrological modelling. Leiden New York: A.A. Balkema Publishers, 2004. Print.

Marquez, Enrique S., Jonathon S. Hare, Mahesan Niranjan: Deep Cascade Learning. IEEE Trans. Neural Netw. Learning Syst. 29(11): 5475-5485 (2018).

Kirk, Matthew, et al. Thoughtful Machine learning : A Test-driven Approach. Sebastopol, California: O’Reilly, 2015. Print.

## Artificial Feedforward Neural Network With Backpropagation From Scratch

In this post, I will walk you through how to build an artificial feedforward neural network trained with backpropagation, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to evaluate the performance of an artificial feedforward neural network trained with backpropagation and to compare the performance using no hidden layers, one hidden layer, and two hidden layers. Five different data sets from the UCI Machine Learning Repository are used to compare performance: Breast Cancer, Glass, Iris, Soybean (small), and Vote.

We will use our neural network to do the following:

• Predict if someone has breast cancer
• Identify glass type
• Identify flower species
• Determine soybean disease type
• Classify a representative as either a Democrat or Republican based on their voting patterns

I hypothesize that the neural networks with no hidden layers will outperform the networks with two hidden layers. My hypothesis is based on the notion that the simplest solutions are often the best solutions (i.e. Occam’s Razor).

The classification accuracy of the algorithms on the data sets will be evaluated as follows, using five-fold stratified cross-validation:

• Accuracy = (number of correct predictions)/(total number of predictions)

Title image source: Wikimedia commons

# What is an Artificial Feedforward Neural Network Trained with Backpropagation?

## Background

An artificial feed-forward neural network (also known as multilayer perceptron) trained with backpropagation is an old machine learning technique that was developed in order to have machines that can mimic the brain. Neural networks were the focus of a lot of machine learning research during the 1980s and early 1990s but declined in popularity during the late 1990s.

Since 2010, neural networks have experienced a resurgence in popularity due to improvements in computing speed and the availability of massive amounts of data with which to train large-scale neural networks. In the real world, neural networks have been used to recognize speech, caption images, and even help self-driving cars learn how to park autonomously.

## The Brain as Inspiration for Artificial Neural Networks

In order to understand neural networks, it helps to first take a look at the basic architecture of the human brain. The brain has 1011 neurons (Alpaydin, 2014). Neurons are cells inside the brain that process information.

Each neuron contains a number of input wires called dendrites. Each neuron also has one output wire called an axon. The axon is used to send messages to other neurons. The axon of a sending neuron is connected to the dendrites of the receiving neuron via a synapse.

So, in short, a neuron receives inputs from dendrites, performs a computation, and sends the output to other neurons via the axon. This process is how information flows through the brain. The messages sent between neurons are in the form of electric pulses.

An artificial neural network, the one used in machine learning, is a simplified model of the actual human neural network explained above. It is typically composed of zero or more layers.

Each layer of the neural network is made up of nodes (analogous to neurons in the brain). Nodes of one layer are connected to nodes in another layer by connection weights, which are typically just floating-point numbers (e.g. 0.23342341). These numbers represent the strength of the connection between two nodes.

The job of a node in a hidden layer is to:

1. Receive input values from each node in a preceding layer
2. Compute a weighted sum of those input values
3. Send that weighted sum through some activation function (e.g. logistic sigmoid function or hyperbolic tangent function)
4. Send the result of the computation in #3 to each node in the next layer of the neural network.

Thus, the output from the nodes in a given layer becomes the input for all nodes in the next layer.

The output layer of a network does steps 1-3 above. However, the result of the computation from step #3 is a class prediction instead of an input to another layer (since the output layer is the final layer).

Here is a diagram of the process I explained above:

Here is a diagram showing a single layer neural network:

b stands for the bias term. This is a constant. It is like the b in the equation for a line, y = mx + b. It enables the model to have flexibility because, without that bias term, you cannot as easily adapt the weighted sum of inputs (i.e. mx) to fit the data (i.e. in the example of a simple line, the line cannot move up and down the y-axis without that b term).

w in the diagram above stands for the weights, and x stands for the input values.

Here is a similar diagram, but now it is a two-layer neural network instead of single layer.

And here is one last way to look at the same thing I explained above:

Note that the yellow circles on the left represent the input values. w represents the weights. The sigma inside the box means that we calculated the weighted sum of the input values. We run that through the activation function f(S)…e.g. sigmoid function. And then out of that, pops the output, which is passed on to the nodes in the following layer.

Neural networks that contain many layers, for example more than 100, are called deep neural networks. Deep neural networks are the cornerstone of the rapidly growing field known as deep learning.

### Training Phase

The objective during the training phase of a neural network is to determine all the connection weights. At the start of training, the weights of the network are initialized to small random values close to 0. After this step, training proceeds to the two main phases of the algorithm: forward propagation and backpropagation.

#### Forward Propagation

During the forward propagation phase of a neural network, we process one instance (i.e. one set of inputs) at a time. Hidden layers extract important features contained in the input data by computing a weighted sum of the inputs and running the result through the logistic sigmoid activation function. This output relays to nodes in the next hidden layer where the data is transformed yet again. This process continues until the data reaches the output layer.

The output of the output layer is a predicted class value, which in this project is a one-hot encoded class prediction vector. The index of the vector corresponds to each class. For example, if a 1 is in the 0 index of the vector (and a 0 is in all other indices of the vector), the class prediction is class 0. Because we are dealing with 0s and 1s, the output vector can also be considered the probability that an instance is in a given class.

#### Backpropagation

After the input signal produced by a training instance propagates through the network one layer at a time to the output layer, the backpropagation phase commences. An error value is calculated at the output layer. This error corresponds to the difference between the class predicted by the network and the actual (i.e. true) class of the training instance.

The prediction error is then propagated backward from the output layer to the input layer. Blame for the error is assigned to each node in each layer, and then the weights of each node of the neural network are updated accordingly (with the goal to make more accurate class predictions for the next instance that flows through the neural network) using stochastic gradient descent for the weight optimization procedure.

Note that weights of the neural network are adjusted on a training instance by training instance basis. This online learning method is the preferred one for classification problems of large size (Ĭordanov & Jain, 2013).

The forward propagation and backpropagation phases continue for a certain number of epochs. A single epoch finishes when each training instance has been processed exactly once.

### Testing Phase

Once the neural network has been trained, it can be used to make predictions on new, unseen test instances. Test instances flow through the network one-by-one, and the resulting output (which is a vector of class probabilities) determines the classification.

Below is a helpful video by Andrew Ng, a professor at Stanford University, that explains neural networks and is helpful for getting your head around the math. The video gets pretty complicated in some spots (esp. where he starts writing all sorts of mathematical notation and derivatives). My advice is to lookup anything that he explains that isn’t clear. Take it slow as you are learning about neural networks. There is no rush. This stuff isn’t easy to understand on your first encounter with it. Over time, the fog will begin to lift, and you will be able to understand how it all works.

# Artificial Feedforward Neural Network Trained with Backpropagation Algorithm Design

The Logistic Regression algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

## Required Data Set Format for Feedforward Neural Network Trained with Backpropagation

Columns (0 through N)

• 0: Instance ID
• 1: Attribute 1
• 2: Attribute 2
• 3: Attribute 3
• N: Actual Class

• N + 1: Predicted Class
• N + 2: Prediction Correct? (1 if yes, 0 if no)

## Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

### Modification of Attribute Values

• The actual class value was changed to “Benign” or “Malignant.”
• Attribute values were normalized to be in the range 0 to 1.
• Class values were vectorized using one-hot encoding.

### Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

## Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

### Modification of Attribute Values

• Attribute values were normalized to be in the range 0 to 1.
• Class values were vectorized using one-hot encoding.

### Missing Data

There are no missing values in this data set.

## Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

### Modification of Attribute Values

• Attribute values were normalized to be in the range 0 to 1.
• Class values were vectorized using one-hot encoding.

### Missing Data

There were no missing attribute values.

## Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

### Modification of Attribute Values

• Attribute values were normalized to be in the range 0 to 1.
• Class values were vectorized using one-hot encoding.
• Attribute values that were all the same value were removed.

### Missing Data

There are no missing values in this data set.

## Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

• 267 Democrats
• 168 Republicans

### Modification of Attribute Values

• I did the following modifications:
• Changed all “y” to 1 and all “n” to 0.
• Class values were vectorized using one-hot encoding.

### Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).

I used stochastic gradient descent for optimizing the weights.

In normal gradient descent, we need to calculate the partial derivative of the cost function with respect to each weight. For each partial derivative, we have to tally up the terms for each training instance to compute the partial derivative of the cost function with respect to that weight. What this means is that, if we have a lot of attributes and a large dataset, gradient descent is slow. For this reason, stochastic gradient descent was chosen since weights are updated after each training instance (as opposed to after all training instances).

Here is a good video that explains stochastic gradient descent.

## Logistic (Sigmoid) Activation Function

The logistic (sigmoid) activation function was used for the nodes in the neural network.

# Description of Any Tuning Process Applied

## Learning Rate

Some tuning was performed in this project. The learning rate was set to 0.1, which was different than the 0.01 value that is often used for multi-layer feedforward neural networks (Montavon, 2012). Lower values resulted in much longer training times and did not result in large improvements in classification accuracy.

## Epochs

The number of epochs chosen for the main runs of the algorithm on the data sets was chosen to be 1000. Other values were tested, but the number of epochs did not have a large impact on classification accuracy.

## Number of Nodes per Hidden Layer

In order to tune the number of nodes per hidden layer, I used a constant learning rate and constant number of epochs. I then calculated the classification accuracy for each data set for a set number of nodes per hidden layer. I performed this process using networks with one hidden layer and networks with two hidden layers. The results of this tuning process are below.

Note that the mean classification accuracy across all data sets when one hidden layer was used for the neural network reached a peak at eight nodes per hidden layer. This value of eight nodes per hidden layer was used for the actual runs on the data sets.

For two hidden layers, the peak mean classification accuracy was attained at five nodes per hidden layer. Thus, when the algorithm was run on the data sets for two hidden layers, I used five nodes per hidden layer for each data set to compare the classification accuracy across the data sets.

# Artificial Feedforward Neural Network Trained with Backpropagation Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the full code for the neural network. This is all you need to run the program:

```import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
from random import shuffle # Import shuffle() method from the random module
from random import seed # Import seed() method from the random module
from random import random # Import random() method from the random module
from collections import Counter # Used for counting
from math import exp # Import exp() function from the math module

# File name: neural_network.py
# Date created: 7/30/2019
# Python version: 3.7
# Description: An artificial feedforward neural network trained
#   with backpropagation (also called multilayer perceptron)

# Required Data Set Format
# Columns (0 through N)
# 0: Attribute 0
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class

# N + 1: Predicted Class
# N + 2: Prediction Correct?

ALGORITHM_NAME = "Feedforward Neural Network With Backpropagation"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def normalize(dataset):
"""
Normalize the attribute values so that they are between 0 and 1, inclusive
:param pandas_dataframe dataset: The original dataset as a Pandas dataframe
:return: normalized_dataset
:rtype: Pandas dataframe
"""
# Generate a list of the column names
column_names = list(dataset)

# For every column except the actual class column
for col in range(0, len(column_names) - 1):
temp = dataset[column_names[col]] # Go column by column
minimum = temp.min() # Get the minimum of the column
maximum = temp.max() # Get the maximum of the column

# Normalized all values in the column so that they
# are between 0 and 1.
# x_norm = (x_i - min(x))/(max(x) - min(x))
dataset[column_names[col]] = dataset[column_names[col]] - minimum
dataset[column_names[col]] = dataset[column_names[col]] / (
maximum - minimum)

normalized_dataset = dataset

return normalized_dataset

def get_five_stratified_folds(dataset):
"""
Implementation of five-fold stratified cross-validation. Divide the data
set into five random folds. Make sure that the proportion of each class
in each fold is roughly equal to its proportion in the entire data set.
:param pandas_dataframe dataset: The original dataset as a Pandas dataframe
:return: five_folds
:rtype: list of folds where each fold is a list of instances(i.e. examples)
"""
# Create five empty folds
five_folds = list()
fold0 = list()
fold1 = list()
fold2 = list()
fold3 = list()
fold4 = list()

# Get the number of columns in the data set
class_column = len(dataset[0]) - 1

# Shuffle the data randomly
shuffle(dataset)

# Generate a list of the unique class values and their counts
classes = list()  # Create an empty list named 'classes'

# For each instance in the dataset, append the value of the class
# to the end of the classes list
for instance in dataset:
classes.append(instance[class_column])

# Create a list of the unique classes
unique_classes = list(Counter(classes).keys())

# For each unique class in the unique class list
for uniqueclass in unique_classes:

# Initialize the counter to 0
counter = 0

# Go through each instance of the data set and find instances that
# are part of this unique class. Distribute them among one
# of five folds
for instance in dataset:

# If we have a match
if uniqueclass == instance[class_column]:

# Allocate instance to fold0
if counter == 0:

# Append this instance to the fold
fold0.append(instance)

# Increase the counter by 1
counter += 1

# Allocate instance to fold1
elif counter == 1:

# Append this instance to the fold
fold1.append(instance)

# Increase the counter by 1
counter += 1

# Allocate instance to fold2
elif counter == 2:

# Append this instance to the fold
fold2.append(instance)

# Increase the counter by 1
counter += 1

# Allocate instance to fold3
elif counter == 3:

# Append this instance to the fold
fold3.append(instance)

# Increase the counter by 1
counter += 1

# Allocate instance to fold4
else:

# Append this instance to the fold
fold4.append(instance)

# Reset the counter to 0
counter = 0

# Shuffle the folds
shuffle(fold0)
shuffle(fold1)
shuffle(fold2)
shuffle(fold3)
shuffle(fold4)

# Add the five stratified folds to the list
five_folds.append(fold0)
five_folds.append(fold1)
five_folds.append(fold2)
five_folds.append(fold3)
five_folds.append(fold4)

return five_folds

def initialize_neural_net(
no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs):
"""
Generates a new neural network that is ready to be trained.
Network (list of layers): 0+ hidden layers, and output layer
Input Layer (list of attribute values): A row from the training set
Hidden Layer (list of dictionaries): A set of nodes (i.e. neurons)
Output Layer (list of dictionaries): A set of nodes, one node per class
Node (dictionary): Contains a set of weights, one weight for each input
to the layer containing that node + an additional weight for the bias.
Each node is represented as a dictionary that stores key-value pairs
Each key corresponds to a property of that node (e.g. weights).
Weights will be initialized to small random values between 0 and 1.
:param int no_inputs: Numper of inputs (i.e. attributes)
:param int no_hidden_layers: Numper of hidden layers (0 or more)
:param int no_nodes_per_hidden_layer: Numper of nodes per hidden layer
:param int no_outputs: Numper of outputs (one output node per class)
:return: network
:rtype:list (i.e. list of layers: hidden layers, output layer)
"""

# Create an empty list
network = list()

# Create the the hidden layers
hidden_layer = list()
hl_counter = 0

# Create the output layer
output_layer = list()

# If this neural network contains hidden layers
if no_hidden_layers > 0:

# Build one hidden layer at a time
for layer in range(no_hidden_layers):

# Reset to an empty hidden layer
hidden_layer = list()

# If this is the first hidden layer
if hl_counter == 0:

# Build one node at a time
for node in range(no_nodes_per_hidden_layer):

initial_weights = list()

# Each node in the hidden layer has no_inputs + 1 weights,
# initialized to a random number in the range [0.0, 1.0)
for i in range(no_inputs + 1):
initial_weights.append(random())

# Add the node to the first hidden layer
hidden_layer.append({'weights':initial_weights})

# Finished building the first hidden layer
hl_counter += 1

# Add this first hidden layer to the front of the neural
# network
network.append(hidden_layer)

# If this is not the first hidden layer
else:

# Build one node at a time
for node in range(no_nodes_per_hidden_layer):

initial_weights = list()

# Each node in the hidden layer has
# no_nodes_per_hidden_layer + 1 weights, initialized to
# a random number in the range [0.0, 1.0)
for i in range(no_nodes_per_hidden_layer + 1):
initial_weights.append(random())

hidden_layer.append({'weights':initial_weights})

# Add this newly built hidden layer to the neural network
network.append(hidden_layer)

# Build the output layer
for outputnode in range(no_outputs):

initial_weights = list()

# Each node in the output layer has no_nodes_per_hidden_layer
# + 1 weights, initialized to a random number in
# the range [0.0, 1.0)
for i in range(no_nodes_per_hidden_layer + 1):
initial_weights.append(random())

# Add this output node to the output layer
output_layer.append({'weights':initial_weights})

# Add the output layer to the neural network
network.append(output_layer)

# A neural network has no hidden layers
else:

# Build the output layer
for outputnode in range(no_outputs):

initial_weights = list()

# Each node in the hidden layer has no_inputs + 1 weights,
# initialized to a random number in the range [0.0, 1.0)
for i in range(no_inputs + 1):
initial_weights.append(random())

# Add this output node to the output layer
output_layer.append({'weights':initial_weights})

network.append(output_layer)

# Finished building the initial neural network
return network

def weighted_sum_of_inputs(weights, inputs):
"""
Calculates the weighted sum of inputs plus the bias
:param list weights: A list of weights. Each node has a list of weights.
:param list inputs: A list of input values. These can be a single row
of attribute values or the output from nodes from the previous layer
:return: weighted_sum
:rtype: float
"""
# We assume that the last weight is the bias value
# The bias value is a special weight that does not multiply with an input
# value (or we could assume its corresponding input value is always 1)
# The bias is similar to the intercept constant b in y = mx + b. It enables
# a (e.g. sigmoid) curve to be shifted to create a better fit
# to the data. Without the bias term b, the line always goes through the
# origin (0,0) and cannot adapt as well to the data.
# In y = mx + b, we assume b * x_0 where x_0 = 1

# Initiate the weighted sum with the bias term. Assume the last weight is
# the bias term
weighted_sum = weights[-1]

for index in range(len(weights) - 1):
weighted_sum += weights[index] * inputs[index]

return weighted_sum

def sigmoid(weighted_sum_of_inputs_plus_bias):
"""
Run the weighted sum of the inputs + bias through
the sigmoid activation function.
:param float weighted_sum_of_inputs_plus_bias: Node summation term
:return: sigmoid(weighted_sum_of_inputs_plus_bias)
"""
return 1.0 / (1.0 + exp(-weighted_sum_of_inputs_plus_bias))

def forward_propagate(network, instance):
"""
Instances move forward through the neural network from one layer
to the next layer. At each layer, the outputs are calculated for each
node. These outputs are the inputs for the nodes in the next layer.
The last set of outputs is the output for the nodes in the output
layer.
:param list network: List of layers: 0+ hidden layers, 1 output layer
:param list instance (a single training/test instance from the data set)
:return: outputs
:rtype: list
"""
inputs = instance

# For each layer in the neural network
for layer in network:

# These will store the outputs for this layer
new_inputs = list()

# For each node in this layer
for node in layer:

# Calculate the weighted sum + bias term
weighted_sum = weighted_sum_of_inputs(node['weights'], inputs)

# Run the weighted sum through the activation function
# and store the result in this node's dictionary.
# Now the node's dictionary has two keys, weights and output.
node['output'] = sigmoid(weighted_sum)

# Used for debugging
#print(node)

# Add the output of the node to the new_inputs list
new_inputs.append(node['output'])

# Update the inputs list
inputs = new_inputs

# We have reached the output layer
outputs = inputs

return outputs

def sigmoid_derivative(output):
"""
The derivative of the sigmoid activation function with respect
to the weighted summation term of the node.
Formally (after a lot of calculus), this derivative is:
derivative = sigmoid(weighted_sum_of_inputs_plus_bias) *
(1 - sigmoid(weighted_sum_of_inputs_plus_bias))
= node_ouput * (1 - node_output)
This method is used during the backpropagation phase.
:param list output: Output of a node (generated during the forward
propagation phase)
:return: sigmoid_der
:rtype: float
"""
return output * (1.0 - output)

def back_propagate(network, actual):
"""
In backpropagation, the error is computed between the predicted output by
the network and the actual output as determined by the data set. This error
propagates backwards from the output layer to the first hidden layer. The
weights in each layer are updated along the way in response to the error.
The goal is to reduce the prediction error for the next training instance
that forward propagates through the network.
:param network list: The neural network
:param actual list: A list of the actual output from the data set
"""
# Iterate in reverse order (i.e. starts from the output layer)
for i in reversed(range(len(network))):

# Work one layer at a time
layer = network[i]

# Keep track of the errors for the nodes in this layer
errors = list()

# If this is a hidden layer
if i != len(network) - 1:

# For each node_j in this hidden layer
for j in range(len(layer)):

# Reset the error value
error = 0.0

# Calculate the weighted error.
# The error values come from the error (i.e. delta) calculated
# at each node in the layer just to the "right" of this layer.
# This error is weighted by the weight connections between the
# node in this hidden layer and the nodes in the layer just
# to the "right" of this layer.
for node in network[i + 1]:
error += (node['weights'][j] * node['delta'])

# Add the weighted error for node_j to the
# errors list
errors.append(error)

# If this is the output layer
else:

# For each node in the output layer
for j in range(len(layer)):

# Store this node (i.e. dictionary)
node = layer[j]

# Actual - Predicted = Error
errors.append(actual[j] - node['output'])

# Calculate the delta for each node_j in this layer
for j in range(len(layer)):
node = layer[j]

# Add an item to the node's dictionary with the
# key as delta.
node['delta'] = errors[j] * sigmoid_derivative(node['output'])

def update_weights(network, instance, learning_rate):
"""
After the deltas (errors) have been calculated for each node in
each layer of the neural network, the weights can be updated.
new_weight = old_weight + learning_rate * delta * input_value
:param list network: List of layers: 0+ hidden layers, 1 output layer
:param list instance: A single training/test instance from the data set
:param float learning_rate: Controls step size in the stochastic gradient
descent procedure.
"""
# For each layer in the network
for layer_index in range(len(network)):

# Extract all the attribute values, excluding the class value
inputs = instance[:-1]

# If this is not the first hidden layer
if layer_index != 0:

# Go through each node in the previous layer and add extract the
# output from that node. The output from the previous layer
# is the input to this layer.
inputs = [node['output'] for node in network[layer_index - 1]]

# For each node in this layer
for node in network[layer_index]:

# Go through each input value
for j in range(len(inputs)):

# Update the weights
node['weights'][j] += learning_rate * node['delta'] * inputs[j]

# Updating the bias weight
node['weights'][-1] += learning_rate * node['delta']

def train_neural_net(
network, training_set, learning_rate, no_epochs, no_outputs):
"""
Train a neural network that has already been initialized.
Training is done using stochastic gradient descent where the weights
are updated one training instance at a time rather than after the
entire training set (as is the case with gradient descent).
:param list network: The neural network, which is a list of layers
:param list training_set: A list of training instances (i.e. examples)
:param float learning_rate: Controls step size of gradient descent
:param int no_epochs: How many passes we will make through training set
:param int no_outputs: The number of output nodes (equal to # of classes)
"""
# Go through the entire training set a fixed number of times (i.e. epochs)
for epoch in range(no_epochs):

# Update the weights one instance at a time
for instance in training_set:

# Forward propagate the training instance through the network
# and produce the output, which is a list.
outputs = forward_propagate(network, instance)

# Vectorize the output using one hot encoding.
# Create a list called actual_output that is the same length
# as the number of outputs. Put a 1 in the place of the actual
# class.
actual_output = [0 for i in range(no_outputs)]
actual_output[int(instance[-1])] = 1

back_propagate(network, actual_output)
update_weights(network, instance, learning_rate)

def predict_class(network, instance):
"""
Make a class prediction given a trained neural network and
an instance from the test data set.
:param list network: The neural network, which is a list of layers
:param list instance: A single training/test instance from the data set
:return class_prediction
:rtype int
"""
outputs = forward_propagate(network, instance)

# Return the index that has the highest probability. This index
# is the class value. Assume class values begin at 0 and go
# upwards by 1 (i.e. 0, 1, 2, ...)
class_prediction = outputs.index(max(outputs))

return class_prediction

def calculate_accuracy(actual, predicted):
"""
Calculates the accuracy percentages
:param list actual: Actual class values
:param list predicted: predicted class values
:return: classification_accuracy
:rtype: float (as a percentage)
"""
number_of_correct_predictions = 0
for index in range(len(actual)):
if actual[index] == predicted[index]:
number_of_correct_predictions += 1

classification_accuracy = (
number_of_correct_predictions / float(len(actual))) * 100.0
return classification_accuracy

def get_test_set_predictions(
training_set, test_set, learning_rate, no_epochs,
no_hidden_layers, no_nodes_per_hidden_layer):
"""
This method is the workhorse.
A new neutal network is initialized.
The network is trained on the training set.
The trained neural network is used to generate predictions on the
test data set.
:param list training_set
:param list test_set
:param float learning_rate
:param int no_epochs
:param int no_hidden_layers
:param int no_nodes_per_hidden_layer
:return network, class_predictions
:rtype list, list
"""
# Get the number of attribute values
no_inputs = len(training_set[0]) - 1

# Calculate the number of unique classes
no_outputs = len(set([instance[-1] for instance in training_set]))

# Build a new neural network
network = initialize_neural_net(
no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs)

train_neural_net(
network, training_set, learning_rate, no_epochs, no_outputs)

# Store the class predictions for each test instance
class_predictions = list()
for instance in test_set:
cl_prediction = predict_class(network, instance)
class_predictions.append(cl_prediction)

# Return the learned model as well as the class predictions
return network, class_predictions

###############################################################

def main():
"""
The main method of the program
"""
LEARNING_RATE = 0.1 # Used for stochastic gradient descent procedure
NO_EPOCHS = 1000 # Epoch is one complete pass through training data
NO_HIDDEN_LAYERS = 1 # Number of hidden layers
NO_NODES_PER_HIDDEN_LAYER = 8 # Number of nodes per hidden layer

# Welcome message
print("Welcome to the " +  ALGORITHM_NAME + " Program!")
print()

# Directory where data set is located
#data_path = input("Enter the path to your input file: ")
data_path = "vote.txt"

# Read the full text file and store records in a Pandas dataframe

# Show functioning of the program
#trace_runs_file = input("Enter the name of your trace runs file: ")
trace_runs_file = "vote_nn_trace_runs.txt"

## Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")

# Testing statistics
#test_stats_file = input("Enter the name of your test statistics file: ")
test_stats_file = "vote_nn_test_stats.txt"

## Open a test_stats_file
outfile_ts = open(test_stats_file,"w")

# Generate a list of the column names
column_names = list(pd_data_set)

# The input layer in the neural network
# will have one node for each attribute value
no_of_inputs = len(column_names) - 1

# Make a list of the unique classes
list_of_unique_classes = pd.unique(pd_data_set["Actual Class"])

# The output layer in the neural network
# will have one node for each class value
no_of_outputs = len(list_of_unique_classes)

# Replace all the class values with numbers, starting from 0
# in the Pandas dataframe.
for cl in range(0, len(list_of_unique_classes)):
pd_data_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)

# Normalize the attribute values so that they are all between 0
# and 1, inclusive
normalized_pd_data_set = normalize(pd_data_set)

# Convert normalized Pandas dataframe into a list
dataset_as_list = normalized_pd_data_set.values.tolist()

# Set the seed for random number generator
seed(1)

# Get a list of 5 stratified folds because we are doing
# five-fold stratified cross-validation
fv_folds = get_five_stratified_folds(dataset_as_list)

# Keep track of the scores for each of the five experiments
scores = list()

experiment_counter = 0
for fold in fv_folds:

print()
print("Running Experiment " + str(experiment_counter) + " ...")
print()
outfile_tr.write("Running Experiment " + str(
experiment_counter) + " ...\n")
outfile_tr.write("\n")

# Get all the folds and store them in the training set
training_set = list(fv_folds)

# Four folds make up the training set
training_set.remove(fold)

# Combined all the folds so that all we have is a list
# of training instances
training_set = sum(training_set, [])

# Initialize a test set
test_set = list()

# For each instance in this test fold
for instance in fold:

# Create a copy and store it
copy_of_instance = list(instance)
test_set.append(copy_of_instance)

# Get the trained neural network and the predicted values
# for each test instance
neural_net, predicted_values = get_test_set_predictions(
training_set, test_set,LEARNING_RATE,NO_EPOCHS,
NO_HIDDEN_LAYERS,NO_NODES_PER_HIDDEN_LAYER)
actual_values = [instance[-1] for instance in fold]
accuracy = calculate_accuracy(actual_values, predicted_values)
scores.append(accuracy)

# Print the learned model
print("Experiment " + str(
experiment_counter) + " Trained Neural Network")
print()
for layer in neural_net:
print(layer)
print()
outfile_tr.write("Experiment " + str(
experiment_counter) + " Trained Neural Network")
outfile_tr.write("\n")
outfile_tr.write("\n")
for layer in neural_net:
outfile_tr.write(str(layer))
outfile_tr.write("\n")
outfile_tr.write("\n\n")

# Print the classifications on the test instances
print("Experiment " + str(
experiment_counter) + " Classifications on Test Instances")
print()
outfile_tr.write("Experiment " + str(
experiment_counter) + " Classifications on Test Instances")
outfile_tr.write("\n\n")
test_df = pd.DataFrame(test_set, columns=column_names)

test_df = test_df.reindex(
columns=[*test_df.columns.tolist(
), 'Predicted Class', 'Prediction Correct?'])

# Add the predicted values to the "Predicted Class" column
# Indicate if the prediction was correct or not.
for pre_val_index in range(len(predicted_values)):
test_df.loc[pre_val_index, "Predicted Class"] = predicted_values[
pre_val_index]
if test_df.loc[pre_val_index, "Actual Class"] == test_df.loc[
pre_val_index, "Predicted Class"]:
test_df.loc[pre_val_index, "Prediction Correct?"] = "Yes"
else:
test_df.loc[pre_val_index, "Prediction Correct?"] = "No"

# Replace all the class values with the name of the class
for cl in range(0, len(list_of_unique_classes)):
test_df["Actual Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
test_df["Predicted Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)

# Print out the test data frame
print(test_df)
print()
print()
outfile_tr.write(str(test_df))
outfile_tr.write("\n\n")

# Go to the next experiment
experiment_counter += 1

print("Experiments Completed.\n")
outfile_tr.write("Experiments Completed.\n\n")

# Print the test stats
print("------------------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics")
print("------------------------------------------------------------------")
print("Data Set : " + data_path)
print()
print("Learning Rate: " + str(LEARNING_RATE))
print("Number of Epochs: " + str(NO_EPOCHS))
print("Number of Hidden Layers: " + str(NO_HIDDEN_LAYERS))
print("Number of Nodes Per Hidden Layer: " + str(
NO_NODES_PER_HIDDEN_LAYER))
print()
print("Accuracy Statistics for All 5 Experiments: %s" % scores)
print()
print()
print("Classification Accuracy: %.3f%%" % (
sum(scores)/float(len(scores))))

outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write("Data Set : " + data_path +"\n\n")
outfile_ts.write("Learning Rate: " + str(LEARNING_RATE) + "\n")
outfile_ts.write("Number of Epochs: " + str(NO_EPOCHS) + "\n")
outfile_ts.write("Number of Hidden Layers: " + str(
NO_HIDDEN_LAYERS) + "\n")
outfile_ts.write("Number of Nodes Per Hidden Layer: " + str(
NO_NODES_PER_HIDDEN_LAYER) + "\n")
outfile_ts.write(
"Accuracy Statistics for All 5 Experiments: %s" % str(scores))
outfile_ts.write("\n\n")
outfile_ts.write("Classification Accuracy: %.3f%%" % (
sum(scores)/float(len(scores))))

## Close the files
outfile_tr.close()
outfile_ts.close()

main()
```

# Artificial Feedforward Neural Network Trained with Backpropagation Output

Here are the trace runs:

Here are the results:

Here are the test statistics for each data set:

# Analysis

## Breast Cancer Data Set

The breast cancer data set results were in line with what I expected. The simpler model, the one with no hidden layers, ended up generating the highest classification accuracy. Classification accuracy was just short of 97%. In other words, the neural network that had no hidden layers successfully classified a patient as either malignant or benign with an almost 97% accuracy.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data (699 instances in this case) can lead to better learning and better classification accuracy on new, unseen instances.

## Glass Data Set

The performance of the neural network on the glass data set was the worst out of all of the data sets. The ability of the network to correctly identify the type of glass given the attribute values never exceeded 70%.

I hypothesize that the poor performance on the glass data set is due to the high numbers of classes combined with a relatively smaller data set.

## Iris Data Set

Classification accuracy was superb on the iris dataset, attaining a classification accuracy around 97%. The results of the iris dataset were surprising given that the more complicated neural network with two hidden layers and five nodes per hidden layer outperformed the simpler neural network that had no hidden layers. In this case, it appears that the iris dataset benefited from the increasing layers of abstraction provided by a higher number of layers.

## Soybean Data Set (small)

Performance on the soybean data set was stellar and was the highest of all of the data sets but also had the largest standard deviation for the classification accuracy. Note that classification accuracy reached a peak of 100% using one hidden layer and eight nodes for the hidden layer. However, when I added an additional hidden layer, classification accuracy dropped to under 70%.

The reason for the high standard deviation of the classification accuracy is unclear, but I hypothesize it has to do with the relatively small number of training instances. Future work would need to be performed with the soybean large dataset available from the UCI Machine Learning Repository to see if these results remain consistent.

The results of the soybean runs suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

## Vote Data Set

The vote data set did not yield the stellar performance of the soybean data set, but classification accuracy was still solid at ~96% using one hidden layer and eight nodes per hidden layer. These results are in line with what I expected because voting behavior should provide a powerful predictor of whether a candidate is a Democrat or Republican. I would have been surprised had I observed classification accuracies that were lower since members of Congress tend to vote along party lines on most issues.

# Summary and Conclusions

My hypothesis was incorrect. In some cases, simple neural networks with no hidden layers outperformed more complex neural networks with 1+ hidden layers. However, in other cases, more complex neural networks with multiple hidden layers outperformed the network with no hidden layers. The reason why some data is more amenable to networks with hidden layers instead of without hidden layers is unclear.

Other conclusions include the following:

• Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
• Large numbers of relevant attributes can help a neural network create more accurate classifications.
• Neural networks are powerful and can achieve excellent results on both binary and multi-class classification problems.

# References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Ĭordanov, I., & Jain, L. C. (2013). Innovations in Intelligent Machines –3 : Contemporary Achievements in Intelligent Systems. Berlin: Springer.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methodes of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Montavon, G. O. (2012). Neural Networks : Tricks of the Trade. New York: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25

## Logistic Regression Algorithm From Scratch

In this post, I will walk you through the Logistic Regression algorithm step-by-step.

• We will develop the code for the algorithm from scratch using Python.
• We will run the algorithm on real-world data sets from the UCI Machine Learning Repository.

# What is Logistic Regression?

Logistic regression, contrary to the name, is a classification algorithm. Unlike linear regression which outputs a continuous value (e.g. house price) for the prediction, Logistic Regression transforms the output into a probability value (i.e. a number between 0 and 1) using what is known as the logistic sigmoid function. This function is also known as the squashing function since it maps a line — that can run from negative infinity to positive infinity along the y-axis — to a number between 0 and 1.

Here is what the graph of the sigmoid function looks like:

The function is called the sigmoid function because it is s-shaped. Here is what the sigmoid function looks like in mathematical notation:

where:

• h(z) is the predicted probability of a given instance (i.e. example) being in the positive class…that is the class represented as 1 in a data set. For example, in an e-mail classification data set, this would be the probability that a given e-mail instance is spam (If h(z) = 0.73, for example, that would mean that the instance has a 73% chance of being spam).
• 1- h(z) is the probability of an instance being in the negative class, the class represented as 0 (e.g. not spam). h(z) is always a number between 0 and 1. Going back to the example in the bullet point above, this would mean that the instance has a 27% change of being not spam.
• z is the input (e.g. a weighted sum of the attributes of a given instance)
• e is Euler’s number

z is commonly expressed as the dot product, w · x, where w is a 1-dimensional vector containing the weights for each attribute, and x is a vector containing the values of each attribute for a specific instance of the data set (i.e. example).

Often the dot product, w · x, is written as matrix multiplication. In that case, z = wTx where T means transpose of the single dimensional weight vector w. The symbol Ɵ is often used in place of w.

So substituting w · x into the sigmoid equation, we getthe following equation:

where

• w is a 1-dimensional vector containing the weights for each attribute.
• The subscript w on hw means the attributes x are weighted by the weight vector w.
• hw(x) is the probability (a value between 0 and 1) that an instance is a member of the positive class (i.e. probability an e-mail is spam).
• x is a vector containing the values of each attribute for a specific instance of the data set.
• w · x = w0x0 + w1x1 + w2x2 + …. + wdx(analogous to the equation of a line y = mx + b from grade school)
• d is the number of attributes in the data set
• x0 = 1 by convention, for all instances. This attribute has to be added by the programmer for all instances. It is known formally as the “bias” term.

As is the case for many machine learning algorithms, the starting point for Logistic Regression is to create a trained model. One we have a trained model, we can use it to make predictions on new, unseen instances.

## Training Phase

Creating a trained model entails determining the weight vector w. Once we have the weights, we can make predictions on new unseen examples. All we need are the values of the attributes of those examples (i.e. the x values), and we can weight the x values with the values of w to compute the probabilities h(x) for that example using the sigmoid function.

The rule for making predictions using the sigmoid function is as follows:

• If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
• If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)

To determine the weights in linear regression, the sum of the squared error was the cost function (where error = actual values – predicted values by the line). The cost function represents how wrong a prediction is. In linear regression, it represents how wrong a line of best fit is on a set of observed training instances. The lower the sum of the squared error, the better a line fits the training data, and, in theory, the better the line will predict new, unseen instances.

Instead, the cost function in Logistic Regression is called cross-entropy. Without getting too detailed into the mathematics and notation of this particular equation, the cross-entropy equation is the one that we want to minimize. Minimizing this equation will yield us a sigmoid curve that best fits the training data and enables us to make the best classification predictions possible for new, unseen test instances. A minimum of the cost function is attained when the gradient of the cost function is close to zero (i.e. the calculated weights stop changing). The formal term for the gradient of the cost function getting close to zero is called convergence.

In order to minimize the cost function, we need to find its gradient (i.e. derivative, slope, etc.) and determine the values for the weight vector w that make its derivative as close to 0 as possible. We cannot just set the gradient to 0 and then enter x-values and calculate the weights directly. Instead, we have to use a method called gradient descent in order to find the weights.

In the gradient descent algorithm for Logistic Regression, we:

1. Start off with an empty weight vector (initialized to random values between -0.01 and 0.01). The size of the vector is equal to the number of attributes in the data set.
2. Initialize an empty weight change vector initialized to all zeros. The size of the vector is equal to the number of attributes in the data set.
3. For each training instance, one at a time.
• a. Make a probability prediction by calculating the weighted sum of the attribute values and running that value through the sigmoid function.
• b. We evaluate the gradient of the cost function by plugging in the actual (i.e. observed) class value and the predicted class value from bullet point 3a above.
• c. The gradient value from 3b gets added to the weight change vector.
4. After we finish with the last training instance from 3, we multiply each value in the weight change vector by a learning rate (commonly 0.01).
5. The vector from 4 gets added to the empty weight vector to update the weights.
6. We then ask two questions
• a. Have the weights continued to change (i.e. is the norm (i.e. magnitude) of the weight change vector less than a certain threshold like 0.001)?
• b. Have we been through the data set less than 10,000 (or whatever we set the maximum iterations to) times?
• c. If the answer is yes to both 6a and 6b, go back to step 2. Otherwise, we return the final weight vector, exiting the algorithm.

The gradient descent pseudocode for Logistic Regression is provided in Figure 10.6 of Introduction to Machine Learning by Ethem Alpaydin (Alpaydin, 2014).

## Testing Phase

Once training is completed, we have the weights and can use these weights, attribute values, and the sigmoid function to make predictions for the set of test instances.

Predictions for a given test instance are made using the aforementioned sigmoid function:

Where the rule for making predictions using the sigmoid function is as follows:

• If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
• If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)

## Multi-class Logistic Regression

A Multi-class Logistic Regression problem is a twist on the binary Logistic Regression method presented above. Multi-class Logistic Regression can make predictions on both binary and multi-class classification problems.

In order to make predictions for multi-class datasets, we take the training set and create multiple separate binary classification problems (one for each class in the data set). For each of those training sets that we generated, we set the class values for one class to 1 (representing the positive class), and we set all other classes to 0 (i.e. the negative class).

In other words, if there are k classes in a data set, k separate training sets are generated. In each of those k separate training sets, one class is set to 1 and all other classes are set to 0.

In Multi-class Logistic Regression, the training phase entails creating k different weight vectors, one for each class rather than just a single weight vector (which was the case in binary Logistic Regression). Each weight vector will help to predict the probability of an instance being a member of that class. Thus, in the testing phase, when there is an unseen new instance, three different predictions need to be made. This method is called the one-vs-all strategy, sometimes called one-vs-rest.

The rule for making predictions for a given instance are as follows:

• For each new test instance,
• Make k separate probability predictions.
• Pick the class that has the highest probability (i.e. the class that is the most enthusiastic about that instance being a member of its class)

Other multi-class Logistic Regression algorithms include Softmax Regression and the one-vs-one strategy. The one-vs-all strategy was selected due to its popularity as being the default strategy used in practice for many of the well-known machine learning libraries for Python (Rebala, Ravi, & Churiwala, 2019)

## Video

Here is an excellent video on logistic regression that explains the whole process I described above, step-by-step.

# Logistic Regression Algorithm Design

The Logistic Regression algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

## Required Data Set Format for Logistic Regression

Columns (0 through N)

• 0: Instance ID
• 1: Attribute 1
• 2: Attribute 2
• 3: Attribute 3
• N: Actual Class

• N + 1: Predicted Class
• N + 2: Prediction Correct? (1 if yes, 0 if no)

## Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

### Modification of Attribute Values

The actual class value was changed to “Benign” or “Malignant.”

I transformed the attributes into binary numbers so that the algorithms could process the data properly and efficiently. If attribute value was greater than 5, the value was changed to 1, otherwise it was 0.

### Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

## Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

### Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

### Missing Data

There are no missing values in this data set.

## Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

### Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

### Missing Data

There were no missing attribute values.

## Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

### Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

### Missing Data

There are no missing values in this data set.

## Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

• 267 Democrats
• 168 Republicans

### Modification of Attribute Values

I did the following modifications:

• Changed all “y” to 1 and all “n” to 0.

### Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).

## Description of Any Tuning Process Applied

Some tuning was performed in this project. The learning rate was set to 0.01 by convention. A higher learning rate (0.5) resulted in poor results for the norm of the gradient (>1).

The stopping criteria for gradient descent was as follows:

• Maximum iterations = 10,000
• Euclidean norm of weight change vector < 0.001

When I tried max iterations at 100, the Euclidean norm of the weight change vector returned high values (> 0.2) which indicated that I needed to set a higher max iterations value in order to have a higher chance of convergence (i.e. weights stop changing) based on the norm stopping criteria.

# Logistic Regression Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the driver code. This is where the main method is located:

```import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
import five_fold_stratified_cv
import logistic_regression

# File name: logistic_regression_driver.py
# Date created: 7/19/2019
# Python version: 3.7
# Description: Driver of the logistic_regression.py program

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class

# for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

ALGORITHM_NAME = "Logistic Regression"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def main():

print("Welcome to the " +  ALGORITHM_NAME + " Program!")
print()

# Directory where data set is located
data_path = input("Enter the path to your input file: ")
#data_path = "iris.txt"

# Read the full text file and store records in a Pandas dataframe

# Show functioning of the program
trace_runs_file = input("Enter the name of your trace runs file: ")
#trace_runs_file = "iris_logistic_regression_trace_runs.txt"

# Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")

# Testing statistics
test_stats_file = input("Enter the name of your test statistics file: ")
#test_stats_file = "iris_logistic_regression_test_stats.txt"

# Open a test_stats_file
outfile_ts = open(test_stats_file,"w")

# The number of folds in the cross-validation
NO_OF_FOLDS = 5

# Generate the five stratified folds
fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
pd_data_set)

training_dataset = None
test_dataset = None

# Create an empty array of length 5 to store the accuracy_statistics
# (classification accuracy)
accuracy_statistics = np.zeros(NO_OF_FOLDS)

# Run Logistic Regression the designated number of times as indicated by the
# number of folds
for experiment in range(0, NO_OF_FOLDS):

print()
print("Running Experiment " + str(experiment + 1) + " ...")
print()
outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
outfile_tr.write("\n")

# Each fold will have a chance to be the test data set
if experiment == 0:
test_dataset = fold0
training_dataset = pd.concat([
fold1, fold2, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 1:
test_dataset = fold1
training_dataset = pd.concat([
fold0, fold2, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 2:
test_dataset = fold2
training_dataset = pd.concat([
fold0, fold1, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 3:
test_dataset = fold3
training_dataset = pd.concat([
fold0, fold1, fold2, fold4], ignore_index=True, sort=False)
else:
test_dataset = fold4
training_dataset = pd.concat([
fold0, fold1, fold2, fold3], ignore_index=True, sort=False)

accuracy, predictions, weights_for_each_class, no_of_instances_test = (
logistic_regression.logistic_regression(training_dataset,test_dataset))

# Print the trace runs of each experiment
print("Accuracy:")
print(str(accuracy * 100) + "%")
print()
print("Classifications:")
print(predictions)
print()
print("Learned Model:")
print(weights_for_each_class)
print()
print("Number of Test Instances:")
print(str(no_of_instances_test))
print()

outfile_tr.write("Accuracy:")
outfile_tr.write(str(accuracy * 100) + "%\n\n")
outfile_tr.write("Classifications:\n")
outfile_tr.write(str(predictions) + "\n\n")
outfile_tr.write("Learned Model:\n")
outfile_tr.write(str(weights_for_each_class) + "\n\n")
outfile_tr.write("Number of Test Instances:")
outfile_tr.write(str(no_of_instances_test) + "\n\n")

# Store the accuracy in the accuracy_statistics array
accuracy_statistics[experiment] = accuracy

outfile_tr.write("Experiments Completed.\n")
print("Experiments Completed.\n")

# Write to a file
outfile_ts.write("----------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write("----------------------------------------------------------\n")
outfile_ts.write("Data Set : " + data_path + "\n")
outfile_ts.write("\n")
outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
outfile_ts.write(np.array2string(
accuracy_statistics, precision=2, separator=',',
suppress_small=True))
outfile_ts.write("\n")
outfile_ts.write("\n")
accuracy = np.mean(accuracy_statistics)
accuracy *= 100
outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")

# Print to the console
print()
print("----------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics")
print("----------------------------------------------------------")
print("Data Set : " + data_path)
print()
print()
print("Accuracy Statistics for All 5 Experiments:")
print(accuracy_statistics)
print()
print()
print("Classification Accuracy : " + str(accuracy) + "%")
print()

# Close the files
outfile_tr.close()
outfile_ts.close()

main()
```

Here is the code for logistic regression:

```import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library

# File name: logistic_regression.py
# Date created: 7/19/2019
# Python version: 3.7
# Description: Multi-class logistic regression using one-vs-all.

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class

# This program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

def sigmoid(z):
"""
Parameters:
z: A real number
Returns:
1.0/(1 + np.exp(-z))
"""
return 1.0/(1 + np.exp(-z))

"""
Gradient descent for logistic regression. Follows method presented
in the textbook Introduction to Machine Learning 3rd Edition by
Ethem Alpaydin (pg. 252)

Parameters:
training_set: The training instances as a Numpy array
Returns:
weights: The vector of weights, commonly called w or THETA
"""

no_of_columns_training_set = training_set.shape[1]
no_of_rows_training_set = training_set.shape[0]

# Extract the attributes from the training set.
# x is still a 2d array
x = training_set[:,:(no_of_columns_training_set - 1)]
no_of_attributes = x.shape[1]

# Extract the classes from the training set.
# actual_class is a 1d array.
actual_class = training_set[:,(no_of_columns_training_set - 1)]

# Set a learning rate
LEARNING_RATE = 0.01

# Set the maximum number of iterations
MAX_ITER = 10000

# Set the iteration variable to 0
iter = 0

# Set a flag to determine if we have exceeded the maximum number of
# iterations
exceeded_max_iter = False

# Set the tolerance. When the euclidean norm of the gradient vector
# (i.e. magnitude of the changes in the weights) gets below this value,
# stop iterating through the while loop

# Set a flag to determine if we have reached the minimum of the
# cost (i.e. error) function.
converged = False

# Create the weights vector with random floats between -0.01 and 0.01
# The number of weights is equal to the number of attributes
weights = np.random.uniform(-0.01,0.01,(no_of_attributes))
changes_in_weights = None

# Keep running the loop below until convergence on the minimum of the
# cost function or we exceed the max number of iterations
while(not(converged) and not(exceeded_max_iter)):

# Initialize a weight change vector that stores the changes in
# the weights at each iteration
changes_in_weights = np.zeros(no_of_attributes)

# For each training instance
for inst in range(0, no_of_rows_training_set):

# Calculate weighted sum of the attributes for
# this instance
output = np.dot(weights, x[inst,:])

# Calculate the sigmoid of the weighted sum
# This y is the probability that this instance belongs
# to the positive class
y =  sigmoid(output)

# Calculate difference
difference = (actual_class[inst] - y)

# Multiply the difference by the attribute vector
product = np.multiply(x[inst,:], difference)

# For each attribute, update the weight changes

# Calculate the step size
step_size = np.multiply(changes_in_weights, LEARNING_RATE)

# Update the weights vector

# Test to see if we have converged on the minimum of the error
# function

converged = True

# Update the number of iterations
iter += 1

# If we have exceeded the maximum number of iterations
if (iter > MAX_ITER):
exceeded_max_iter = True

#For debugging purposes
#print("Number of Iterations: " + str(iter - 1))
#print(changes_in_weights)
#print()
return weights

def logistic_regression(training_set, test_set):
"""
Multi-class one-vs-all logistic regression
Parameters:
training_set: The training instances as a Pandas dataframe
test_set: The test instances as a Pandas dataframe
Returns:
accuracy: Classification accuracy as a decimal
predictions: Classifications of all the test instances as a
Pandas dataframe
weights_for_each_class: The weight vectors for each class (one-vs-all)
no_of_instances_test: The number of test instances
"""

# Remove the instance ID column
training_set = training_set.drop(
training_set.columns[[0]], axis=1)
test_set = test_set.drop(
test_set.columns[[0]], axis=1)

# Make a list of the unique classes
list_of_unique_classes = pd.unique(training_set["Actual Class"])

# Replace all the class values with numbers, starting from 0
# in both the test and training sets.
for cl in range(0, len(list_of_unique_classes)):
training_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)
test_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)

# Insert a column of 1s in column 0 of both the training
# and test sets. This is the bias and helps with gradient
# descent. (i.e. X0 = 1 for all instances)
training_set.insert(0, "Bias", 1)
test_set.insert(0, "Bias", 1)

# Convert dataframes to numpy arrays
np_training_set = training_set.values
np_test_set = test_set.values

test_set = test_set.reindex(
columns=[*test_set.columns.tolist(
), 'Predicted Class', 'Prediction Correct?'])

############################# Training Phase ##############################

no_of_columns_training_set = np_training_set.shape[1]
no_of_rows_training_set = np_training_set.shape[0]

# Create and store a training set for each unique class
# to create separate binary classification
# problems
trainingsets = []
for cl in range(0, len(list_of_unique_classes)):

# Create a copy of the training set
temp = np.copy(np_training_set)

# This class becomes the positive class 1
# and all other classes become the negative class 0
for row in range(0, no_of_rows_training_set):
if (temp[row, (no_of_columns_training_set - 1)]) == cl:
temp[row, (no_of_columns_training_set - 1)] = 1
else:
temp[row, (no_of_columns_training_set - 1)] = 0

# Add the new training set to the trainingsets list
trainingsets.append(temp)

# Calculate and store the weights for the training set
# of each class. Execute gradient descent on each training set
# in order to calculate the weights
weights_for_each_class = []

for cl in range(0, len(list_of_unique_classes)):
weights_for_each_class.append(weights_for_this_class)

# Used for debugging
#print(weights_for_each_class[0])
#print()
#print(weights_for_each_class[1])
#print()
#print(weights_for_each_class[2])

########################### End of Training Phase #########################

############################# Testing Phase ###############################

no_of_columns_test_set = np_test_set.shape[1]
no_of_rows_test_set = np_test_set.shape[0]

# Extract the attributes from the test set.
# x is still a 2d array
x = np_test_set[:,:(no_of_columns_test_set - 1)]
no_of_attributes = x.shape[1]

# Extract the classes from the test set.
# actual_class is a 1d array.
actual_class = np_test_set[:,(no_of_columns_test_set - 1)]

# Go through each row (instance) of the test data
for inst in range(0,  no_of_rows_test_set):

# Create a scorecard that keeps track of the probabilities of this
# instance being a part of each class
scorecard = []

# Calculate and store the probability for each class in the scorecard
for cl in range(0, len(list_of_unique_classes)):

# Calculate weighted sum of the attributes for
# this instance
output = np.dot(weights_for_each_class[cl], x[inst,:])

# Calculate the sigmoid of the weighted sum
# This is the probability that this instance belongs
# to the positive class
this_probability = sigmoid(output)

scorecard.append(this_probability)

most_likely_class = scorecard.index(max(scorecard))

# Store the value of the most likely class in the "Predicted Class"
# column of the test_set data frame
test_set.loc[inst, "Predicted Class"] = most_likely_class

# Update the 'Prediction Correct?' column of the test_set data frame
# 1 if correct, else 0
if test_set.loc[inst, "Actual Class"] == test_set.loc[
inst, "Predicted Class"]:
test_set.loc[inst, "Prediction Correct?"] = 1
else:
test_set.loc[inst, "Prediction Correct?"] = 0

# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (test_set["Prediction Correct?"].sum())/(len(test_set.index))

# Store the revamped dataframe
predictions = test_set

# Replace all the class values with the name of the class
for cl in range(0, len(list_of_unique_classes)):
predictions["Actual Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
predictions["Predicted Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)

# Replace 1 with Yes and 0 with No in the 'Prediction
# Correct?' column
predictions['Prediction Correct?'] = predictions[
'Prediction Correct?'].map({1: "Yes", 0: "No"})

# Reformat the weights_for_each_class list of arrays
weights_for_each_class = pd.DataFrame(np.row_stack(weights_for_each_class))

# Rename the row names
for cl in range(0, len(list_of_unique_classes)):
row_name = str(list_of_unique_classes[cl] + " weights")
weights_for_each_class.rename(index={cl:row_name}, inplace=True)

# Get a list of the names of the attributes
training_set_names = list(training_set.columns.values)
training_set_names.pop() # Remove 'Actual Class'

# Rename the column names
for col in range(0, len(training_set_names)):
col_name = str(training_set_names[col])
weights_for_each_class.rename(columns={col:col_name}, inplace=True)

# Record the number of test instances
no_of_instances_test = len(test_set.index)

# Return statement
return accuracy, predictions, weights_for_each_class, no_of_instances_test

```

Here is the code for five-fold stratified cross-validation:

```import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure
# that the proportion of each class in each group is roughly equal to its
# proportion in the entire data set.

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class

def get_five_folds(instances):
"""
Parameters:
instances: A Pandas data frame containing the instances
Returns:
fold0, fold1, fold2, fold3, fold4
Five folds whose class frequency distributions are
each representative of the entire original data set (i.e. Five-Fold
Stratified Cross Validation)
"""
# Shuffle the data set randomly
instances = instances.sample(frac=1).reset_index(drop=True)

# Record the number of columns in the data set
no_of_columns = len(instances.columns) # number of columns

# Record the number of rows in the data set
no_of_rows = len(instances.index) # number of rows

# Create five empty folds (i.e. Panda Dataframes: fold0 through fold4)
fold0 = pd.DataFrame(columns=(instances.columns))
fold1 = pd.DataFrame(columns=(instances.columns))
fold2 = pd.DataFrame(columns=(instances.columns))
fold3 = pd.DataFrame(columns=(instances.columns))
fold4 = pd.DataFrame(columns=(instances.columns))

# Record the column of the Actual Class
actual_class_column = no_of_columns - 1

# Generate an array containing the unique
# Actual Class values
unique_class_list_df = instances.iloc[:,actual_class_column]
unique_class_list_df = unique_class_list_df.sort_values()
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

unique_class_list_np_size = unique_class_list_np.size

# For each unique class in the unique Actual Class array
for unique_class_list_np_idx in range(0, unique_class_list_np_size):

# Initialize the counter to 0
counter = 0

# Go through each row of the data set and find instances that
# are part of this unique class. Distribute them among one
# of five folds
for row in range(0, no_of_rows):

# If the value of the unique class is equal to the actual
# class in the original data set on this row
if unique_class_list_np[unique_class_list_np_idx] == (
instances.iloc[row,actual_class_column]):

# Allocate instance to fold0
if counter == 0:

# Extract data for the new row
new_row = instances.iloc[row,:]

# Append that entire instance to fold
fold0.loc[len(fold0)] = new_row

# Increase the counter by 1
counter += 1

# Allocate instance to fold1
elif counter == 1:

# Extract data for the new row
new_row = instances.iloc[row,:]

# Append that entire instance to fold
fold1.loc[len(fold1)] = new_row

# Increase the counter by 1
counter += 1

# Allocate instance to fold2
elif counter == 2:

# Extract data for the new row
new_row = instances.iloc[row,:]

# Append that entire instance to fold
fold2.loc[len(fold2)] = new_row

# Increase the counter by 1
counter += 1

# Allocate instance to fold3
elif counter == 3:

# Extract data for the new row
new_row = instances.iloc[row,:]

# Append that entire instance to fold
fold3.loc[len(fold3)] = new_row

# Increase the counter by 1
counter += 1

# Allocate instance to fold4
else:

# Extract data for the new row
new_row = instances.iloc[row,:]

# Append that entire instance to fold
fold4.loc[len(fold4)] = new_row

# Reset counter to 0
counter = 0

return fold0, fold1, fold2, fold3, fold4

```

# Logistic Regression Output

Here are the trace runs:

Here are the results:

Here are the test statistics for each data set:

# Analysis

## Breast Cancer Data Set

I hypothesize that performance was high on this algorithm because of the large number of instances (699 in total). This data set had the highest number of instances out of all the data sets.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.

## Glass Data Set

I hypothesize that the poor performance on the glass data set is due to the high numbers of classes combined with a relatively smaller data set.

## Iris Data Set

Classification accuracy on the iris data set was satisfactory. This data set was small, and more training data would be needed to see if accuracy could be improved by giving the algorithm more data to learn the underlying relationship between the attributes and the flower types.

## Soybean Data Set (small)

I hypothesize that the large numbers of attributes in the soybean data set (35) helped balance the relatively small number of training instances. These results suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

## Vote Data Set

The results show that classification algorithms like Logistic Regression can have outstanding performance on large data sets that are binary classification problems.

# Summary and Conclusions

• Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
• Large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.
• Classification algorithms like Logistic Regression can achieve excellent classification accuracy on binary classification problems, but performance on multi-class classification algorithms can yield mixed results.

# References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Kelleher, J. D., Namee, B., & Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. Cambridge, Massachusetts: The MIT Press.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methodes of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Rebala, G., Ravi, A., & Churiwala, S. (2019). An Introduction to Machine Learning. Switzerland: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25

Y. Ng, A., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic , 841-848.

## The Difference Between Generative and Discriminative Classifiers

In this post, I will explain the difference between generative classifiers and discriminative classifiers.

Let us suppose we have a class that we want to predict H (hypothesis) and a set of attributes E (evidence). The goal of classification is to create a model based on E and H that can predict the class H given a set of new, unseen attributes E. However, both classifier types, generative and discriminative, go about this classification process differently.

# Generative Classifiers

Classification algorithms such as Naïve Bayes are known as generative classifiers. Generative classifiers take in training data and create probability estimates. Specifically, they estimate the following:

• P(H): The probability of the hypothesis (e.g. spam or not spam). This value is the class prior probability (e.g. probability an e-mail is spam before taking any evidence into account).
• P(E|H): The probability of the evidence given the hypothesis (e.g. probability an e-mail contains the phrase “Buy Now” given that an e-mail is spam). This value is known as the likelihood.

Once the probability estimates above have been computed, the model then uses Bayes Rule to make predictions, choosing the most likely class, based on which class maximizes the expression P(E|H) * P(H).

# Discriminative Classifiers

Rather than estimate likelihoods, discriminative classifiers like Logistic Regression estimate P(H|E) directly. A decision boundary is created that creates a dividing line/plane between instances of one class and instances of another class. New, unseen instances are classified based on which side of the line/plane they fall. In this way, a direct mapping is generated from attributes E to class labels H.

# An Example Using an Analogy

Here is an analogy that demonstrates the difference between generative and discriminative classifiers. Suppose we live in a world in which there are only two classes of animals, cats and rabbits. We want to build a robot that can automatically classify a new animal as either a cat or a rabbit. How would we train this robot using a discriminative algorithm like Logistic Regression?

With a discriminative algorithm, we would feed the model a set of training data containing instances of cats and instances of rabbits. The discriminative algorithm would try to find a straight line/plane (a decision boundary) that separates instances of cats from instances of rabbits. This line would be created by examining the differences in the attributes (e.g. herbivore vs. carnivore, long oval ears vs. small triangular ears, hopping vs. walking, etc.)

Once the training step is complete, the discriminative algorithm is then ready to classify new unseen animals. It will look at new, unseen animals and check which side of the decision boundary the animal should go. The animal is classified based on the side of the decision boundary it falls into.

In contrast, a generative learning algorithm like Naïve Bayes will take in training data and develop a model of what a cat and rabbit should look like. Once trained, a new, unseen animal is compared to the model of a cat and the model of a rabbit. It is then classified based on whether it looks more like the cat instances the model was trained on or the rabbit instances the model was trained on.

Past research has shown that discriminative classifiers like Logistic Regression generally perform better on classification tasks than generative classifiers like Naïve Bayes (Y. Ng & Jordan, 2001).

As a final note, generative classifiers are called generative because we can use the probabilistic information of the data to generate more instances. In other words, given a class y, you can generate its respective attributes x.

# References

Y. Ng, A., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic , 841-848.