Value Iteration vs. Q-Learning Algorithm in Python Step-By-Step

In this post, I will walk you through how to implement the value iteration and Q-learning reinforcement learning algorithms from scratch, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to implement and compare the performance of the value iteration and Q-learning reinforcement learning algorithms on the racetrack problem (Barto, Bradtke, & Singh, 1995).

In the racetrack problem, the goal is to control the movement of a race car along a predefined racetrack so that the race car travels from the starting position to the finish line in a minimum amount of time. The race car must stay on the racetrack and avoid crashing into walls along the way. The racetrack problem is analogous to a time trial rather than a competitive race since there are no other cars on the track to avoid.

Note: Before you deep dive into a section below, I recommend you check out my introduction to reinforcement learning post so you can familiarize yourself with what reinforcement learning is and how it works.

Without further ado, let’s get started!

Table of Contents

Testable Hypotheses

The two reinforcement learning algorithms implemented in this project were value iteration and Q-learning. Both algorithms were tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to take from the starting position to the finish line was calculated for each algorithm-racetrack combination.

Using the implementations of value iteration and Q-learning, three hypotheses will be tested.

Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

I hypothesize that value iteration and Q-learning will both generate policies that will enable the race car to successfully finish the race on all racetracks tested (i.e. move from the starting position of the racetracks to the finish line).

Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

I hypothesize that value iteration will generate a learning policy faster than Q-learning because it has access to the transition and reward functions (explained in detail in the next section “Algorithms Implemented”).  

Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

I hypothesize that it will take longer for the car to finish the race for the crash scenario in which the race car needs to return to the original starting position each time it crashes into a wall. In other words, Bad Crash Version 1 (return to nearest open position) performance will be better than Bad Crash Version 2 (return to start) performance.

Return to Table of Contents

How Value Iteration Works

In the case of the value iteration reinforcement learning algorithm, the race car (i.e. agent) knows two things before taking a specific action (i.e. accelerate in x and/or y directions) (Alpaydin, 2014):

  1. The probabilities of ending up in other new states given a particular action is taken from a current state.
    • More formally, the race car knows the transition function.
    • As discussed in the previous section, the transition function takes as input the current state s and selected action a and outputs the probability of transitioning to some new state s’.
  2. The immediate reward (e.g. race car gets -1 reward each time it moves to a new state) that would be received for taking a particular action from a current state.
    • More formally, this is known as the reward function.
    • The reward function takes as input the current state s and selected action a and outputs the reward.

Value iteration is known as a model-based reinforcement learning technique because it tries to learn a model of the environment where the model has two components:

  1. Transition Function
    • The race car looks at each time it was in a particular state s and took a particular action a and ended up in a new state s’. It then updates the transition function (i.e. transition probabilities) according to these frequency counts.
  2. Reward Function
    • This function answers the question: “Each time the race car was in a particular state s and took a particular action a, what was the reward?”

In short, in value iteration, the race car knows the transition probabilities and reward function (i.e. the model of the environment) and uses that information to govern how the race car should act when in a particular state. Being able to look ahead and calculate the expected rewards of a potential action gives value iteration a distinct advantage over other reinforcement learning algorithms when the set of states is small in size.

Let us take a look at an example. Suppose the following:

  • The race car is in state so, where s0 = <x0, y0, vx0, vy0>, corresponding to the x-position on a racetrack, y-position on a race track, velocity in the x direction, and velocity in the y direction, at timestep 0.
  • The race car takes action a0, where a0 = (ax0, ay0) = (change in velocity in x-direction, change in velocity in y-direction)
  • The race car then observes the new state, where the new state is s1, where s1 = <x1, y1, vx1, vy1>.
  • The race car receives a reward r0 for taking action a0 in state s0.

Putting the bold-faced variables together, we get the following expression which is known as an experience tuple. Tuples are just like lists except the data inside of them cannot be changed.

Experience Tuple = <s0, a0, s1, r0>

What the experience tuple above says is that if the race car is in state s0 and takes action a0, the race car will observe a new state s1 and will receive reward r0.

Then at the next time step, we generate another experience tuple that is represented as follows.

Experience Tuple = <s1, a1, s2, r1>

This process of collecting experience tuples as the race car explores the race track (i.e. environment) happens repeatedly.

Because value iteration is a model based approach, it builds a model of the transition function T[s, a, s’] and reward function R[s,a,s’] using the experience tuples.

  • The transition function can be built and updated by adding up how many times the race car was in state s and took a particular action a and then observed a new state s’. Recall that T[s, a, s’] stands for the probability the race car finds itself in a new state s’ (at the next timestep) given that it was in state s and took action a.
  • The reward function can be built by examining how many times the race car was in state s and took a particular action a and received a reward r. From that information the average reward for that particular scenario can be calculated.

Once these models are built, the race car can then can use value iteration to determine the optimal values of each state (hence the name value iteration). In some texts, values are referred to as utilities (Russell, Norvig, & Davis, 2010).

What are optimal values?

Each state s in the environment (denoted as <xt, yt, vxt, vyt > in this racetrack project) has a value V(s). Different actions can be taken in a given state s. The optimal values of each state s are based on the action a that generates the best expected cumulative reward.

Expected cumulative reward is defined as the immediate reward that the race car would receive if it takes action a and ends up in the state s + the discounted reward that the race car would receive if it always chose the best action (the one that maximizes total reward) from that state onward to the terminal state (i.e. the finish line of the racetrack).

V*(s) = best possible (immediate reward + discounted future reward)

where the * means “optimal.”

The reason why those future rewards are discounted (typically by a number in the range [0,1), known as gamma γ) is because rewards received far into the future are worth less than rewards received immediately. For all we know, the race car might have a gas-powered engine, and there is always the risk of running out of gas. After all, would you rather receive $10 today or $10 ten years from now? $10 received today is worth more (e.g. you can invest it to generate even more cash). Future rewards are more uncertain so that is why we incorporate the discount rate γ.

It is common in control applications to see state-action pairs denoted as Q*(s, a) instead of V*(s) (Alpaydin, 2014). Q*(s,a) is the [optimal] expected cumulative reward when the race car takes an action a in state s and always takes the best action after that timestep. All of these values are typically stored in a table. The table maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state] 

Note that Q(s,a) above is not labeled Q*(s,a). Only once value iteration is done can we label it Q*(s,a) because it takes time to find the optimal Q values for each state-action pair.

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep of the race, the race car observes the state (i.e. position and velocity) and decides what action to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state. The car then takes that action.

The pseudocode for calculating the optimal value for each state-action pair (denoted as Q*(s,a)) in the environment is below. This algorithm is the value iteration algorithm because it iterates over all the state-action pairs in the environment and gives them a value based on the cumulative expected reward of that state-action pair (Alpaydin, 2014; Russell, Norvig, & Davis, 2010; Sutton & Barto, 2018):

Value Iteration Algorithm Pseudocode

Inputs

  • States:
    • List of all possible values of x
    • List of all possible values of y
    • List of all possible values of vx
    • List of all possible values of vy
  • Actions:
    • List of all the possible values of ax
    • List of all possible values of ay
  • Model:
    • Model = Transition Model T[s, a, s’] + Reward Model R[s, a, s’]
    • Where Model is a single table with the following row entries
      • [s, a, s’, probability of ending up in a new state s’ given state s and action a, immediate reward for ending up in new state s’ given state s and action a]
      • = [s, a, s’, T(s, a, s’), R(s, a, s’)]
      • Note that the reward will be -1 for each state except the finish line states (i.e. absorbing or terminal states), which will have a reward of 0.
  • Discount Rate:
    • γ
      • Where 0 ≤ γ < 1
      • If γ = 0, rewards received immediately are the only thing that counts.
      • As γ gets close to 1, rewards received further in the future count more.
  • Error Threshold
    • Ɵ
      • Ɵ is a small number that helps us to determine when we have optimal values for each state-action pair (also known as convergence).
      • Ɵ helps us know when the values of each state-action pair, denoted as Q(s,a), stabilize (i.e. stop changing a lot from run to run of the loop).

Process

  • Create a table V(s) that will store the optimal Q-value for each state. This table will help us determine when we should stop the algorithm and return the output. Initialize all the values of V(s) to arbitrary values, except the terminal state (i.e. finish line state) that has a value of 0.
  • Initialize all Q(s,a) to arbitrary values, except the terminal state (i.e. finish line states) that has a value of 0.
  • Initialize a Boolean variable called done that is a flag to indicate when we are done building the model (or set a fixed number of training iterations using a for loop).
  • While (Not Done)
    • Initialize a value called Δ to 0. When this value gets below the error threshold Ɵ, we exit the main loop and return the output.
    • For all possible states of the environment
      • v := V(s)            // Extract and store the most recent value of the state
      • For all possible actions the race car (agent) can take
        • Q(s,a) = (expected immediate reward given the state s and an action a) + (discount rate) * [summation_over_all_new_states(probability of ending up in a new state s’given a state s and action a * value of the new state s’)]
        • More formally,
          •  
          • where  = expected immediate return when taking action a from state s
      • V(s) := maximum value of Q(s,a) across all the different actions that the race car can take from state s
      • Δ := maximum(Δ, |v – V(s)|)
    • Is Δ < Ɵ? If yes, we are done (because the values of V(s) have converged i.e. stabilized), otherwise we are not done.

Output

The output of the value iteration algorithm is the value table, the table that maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state] 

Policy Determination

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep t of the race, the race car observes the state st (i.e. position and velocity) and decides what action at to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state (the action that will generate the highest expected cumulative reward). The car then takes that action at*, where at* means the optimal action to take at time t.

More formally, the optimal policy for a given states s at timestep t is π*(s) where:

For each state s do

value-iter-policy-equation

Value Iteration Summary

So, to sum up, on a high level, the complete implementation of an autonomous race car using value iteration has two main steps:

  • Step 1: During the training phase, calculate the value of each state-action pair.
  • Step 2: At each timestep of the time trial, given a current state s, select the action a where the state-action pair value (i.e. Q*(s,a)) is the highest.

Return to Table of Contents

How Q-Learning Works

Overview

In most real-world cases, it is not computationally efficient to build a complete model (i.e. transition and reward function) of the environment beforehand.  In these cases, the transition and reward functions are unknown. To solve this issue, model-free reinforcement learning algorithms like Q-learning were invented (Sutton & Barto, 2018) .

In the case of model-free learning, the race car (i.e. agent) has to interact with the environment and learn through trial-and-error. It cannot look ahead to inspect what the value would be if it were to take a particular action from some state. Instead, in Q-learning, the race car builds a table of Q-values denoted as Q(s,a) as it takes an action a from a state s and then receives a reward r. The race car uses the Q-table at each timestep to determine the best action to take given what it has learned in previous timesteps.

One can consider Q[s,a] as a two-dimensional table that has the states s as the rows and the available actions a as the columns. The value in each cell of this table (i.e. the Q-value) represents the value of taking action a in state s.

Just like in value iteration, value has two components, the immediate reward and the discounted future reward.

More formally,

Q[s,a] = immediate reward + discounted future reward

Where immediate reward is the reward you get when taking action a from state s, and discounted future reward is the reward you get for taking actions in the future, discounted by some discount rate γ. The reason why future rewards are discounted is because rewards received immediately are more certain than rewards received far into the future (e.g. $1 received today is worth more than $1 received fifty years from now).

So, how does the race car use the table Q[s,a] to determine what action to take at each timestep of the time trial (i.e. what is the policy given a state…π(s))?

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmaxa(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all position actions given a state and selects the action that has the highest Q-value.

After Q-learning is run a number of times, an optimal policy will eventually be found. The optimal policy is denoted as π*(s). Similarly the optimal Q-value is Q*[s,a].

How the Q[s,a] Table Is Built

Building the Q[s,a] table can be broken down into the following steps:

  1. Choose the environment we want to train on.
    • In this race car problem, the racetrack environment is what we want the race car to explore and train on.
    • The racetrack provides the x and y position data that constitutes a portion of the state s that will be observed at each timestep by the racecar (agent).
  2. Starting from the starting position, the race car moves through the environment, observing the state s at a given timestep
  3. The race car consults the policy π(s), the function that takes as input the current state and outputs an action a.
  4. The action a taken by the race car changes the state of the environment from s to s’. The environment also generates a reward r that is passed to the race car.
  5. Steps 2-4 generate a complete experience tuple (i.e. a tuple is a list that does not change) of the form <s, a, s’, r> = <state, action, new state, reward>.
  6. The experience tuple in 5 is used to update the Q[s,a] table.
  7. Repeat steps 3-6 above until the race car gets to the terminal state (i.e. the finish line of the racetrack).

With the Q[s,a] table built, the race car can now test the policy. Testing the policy means starting at the starting position and consulting the Q[s,a] table at each timestep all the way until the race car reaches the finish line. We count how many timesteps it took for the race car to reach the finish line. If the time it took the race car to complete the time trial is not improving, we are done. Otherwise, we make the race car go back through the training process and test its policy again with an updated Q[s,a] table.

We expect that eventually, the performance of the race car will reach a plateau.

How the Q[s,a] Table Is Updated

With those high level steps, let us take a closer look now at how the Q[s,a] table is updated in step 6 in the previous section.

Each time the race car takes an action, the environment transitions to a new state s’ and gives the race car a reward r. This information is used by the race car to update its Q[s,a] table.

The update rule consists of two parts, where Q’[s,a] is the new updated Q-value for an action a taken at state s.

Q’[s,a] = Q[s,a] + α * (improved estimate of Q[s,a] – Q[s,a])

where α is the learning rate.

The learning rate is a number 0 < α ≤ 1 (commonly 0.2). Thus, the new updated Q-value is a blend of the old Q-value and the improved estimate of the Q value for a given state-action pair. The higher the learning rate, the more new information is considered when updating the Q-value. A learning rate of 1 means that the new updated Q-value is only considering the new information.

The equation above needs to be expanded further.

Q’[s,a] =Q[s,a] + α*(immediate reward + discounted future reward – Q[s,a])

Q’[s,a] = Q[s,a] + α * (r + γ * future reward – Q[s,a])

Where γ is the discount rate. It is typically by a number in the range [0,1). It means that rewards received by the race car (agent) in the far future are worth less than an immediate reward.

Continuing with the expansion of the equation, we have:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, optimal action a’ at new state s’] – Q[s,a])

Note in the above equation that we assume that the race car reaches state s’ and takes the best action from there, where best action is the action a’ that has the highest Q-value for that given new state s’.

More formally the update rule is as follows:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, argmaxa’(Q[s’,a’])] – Q[s,a])

Where argmaxa’(Q[s’,a’]) returns the action a’ that has the highest Q value for a given state s’.

How an Action Is Selected at Each Step

The performance of Q-learning is improved the more the race car explores different states and takes different actions in those states. In other words, the more state-action pairs the race car explores, the better Q[s,a] table the race car is able to build.

One strategy for forcing the race car to explore as much of the state-action space as possible is to add randomness into the Q-learning procedure. This is called exploration. More specifically, at the step of Q-learning where the race car selects an action based on the Q[s,a] table:

  • There is a 20% chance ε that the race car will do some random action, and a 80% chance the race car will do the action with the highest Q-value (as determined by the Q[s,a] table). The latter is known as exploitation.
  • If the race car does some random action, the action is chosen randomly from the set of possible actions the race car can perform.

I chose 20% as the starting probability (i.e. ε) of doing some random action, but it could be another number (e.g. 30%) As the race car gains more experience and the Q[s,a] table gets better and better, we want the race car to take fewer random actions and follow its policy more often. For this reason, it is common to reduce ε with each successive iteration, until we get to a point where the race car stops exploring random actions.

Q-Learning Algorithm Pseudocode

Below are is the Q-learning algorithm pseudocode on a high level (Alpaydin, 2014; Sutton & Barto, 2018).

Inputs

  • Learning rate α, where 0 < α ≤ 1 (e.g. 0.2)
  • Discount rate γ (e.g. 0.9)
  • Exploration probability ε, corresponding to the probability that the race car will take a random action given a state (e.g. 0.2)
  • Reduction of exploration probability, corresponding to how much we want to decrease ε at each training iteration (defined as a complete trip from the starting position to the finish line terminal state) of the Q-learning process. (e.g. 0.5)
  • Number of training iterations (e.g. 1,000,000)
  • States:
    • List of all possible values of x
    • List of all possible values of y
    • List of all possible values of vx
    • List of all possible values of vy
  • Actions:
    • List of all the possible values of ax
    • List of all possible values of ay

Process

  • Initialize the Q[s,a] table to small random values, except for the terminal state which will have a value of 0.
  • For a fixed number of training iterations
    • Initialize the state to the starting position of the race track
    • Initialize a Boolean variable to see if the race car has crossed the finish line (i.e. reached the terminal state)
    • While the race car has not reached the finish line
      • Select the action a using the Q[s,a] table that corresponds to the highest Q value given state s
      • Take action a
      • Observe the reward r and the new state s’
      • Update the Q[s,a] table:
        • Q[s,a] := Q[s,a] + α * (r + γ * Q[s’, argmaxa’(Q[s’,a’])] – Q[s,a])
      • s := s’
      • Check if we have crossed the finish line

Output

Algorithm returns the Q[s,a] table which is used to determine the optimal policy.

Policy Determination

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmaxa(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all position actions given a state and selects the action that has the highest Q-value.

Helpful Video

Here is a good video where Professor Tucker Balch from Georgia Tech goes through Q-learning, step-by-step:

Return to Table of Contents

Experimental Design and Tuning Process

Experimental Design

The Q-learning and value iteration algorithms were implemented for the racetrack problem and then tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to find the finish line was calculated for each algorithm-racetrack combination. The number of training iterations was also monitored. 10 experiments for each algorithm-racetrack combination were performed to create data to graph a learning curve.

At each timestep t of the racetrack problem, we have the following variables (below). We assume the racetrack is a Cartesian grid with an x-axis and a y-axis, and that the system is governed by the standard laws of kinematics:

  • State = <xt, yt, vxt, vyt> = variables that describe the current situation (i.e. environment)
    • xt = x coordinate of the location of the car at timestep t
    • yt = y coordinate of the location of the car at timestep t
    • vxt = velocity in the x direction at time step t, where vxt = xt – xt-1
      • Maximum vxt = 5
      • Minimum vxt = -5
    • vyt = velocity in the y direction at timestep t, where vyt = yt – yt-1
  • Action = <axt, ayt> = control variables = what the race car (i.e. agent) can do to influence the state
    • axt = accelerate in the x direction at timestep t, where axt = vxt  – vx t-1
      • -1 (decrease velocity in x-direction by 1)
      • 0 (keep same velocity), or
      • 1 (increase velocity in x-direction by 1)
    • ayt = acceleration in the y direction at timestep t, where ayt = vyt  – vy t-1
      • -1 (decrease velocity in y-direction by 1)
      • 0 (keep same velocity)
      • 1 (increase velocity in y-direction by 1)

Example Calculation

At t = 0, the race car observes the current state. Location is (2,2) and velocity is (1,0).

  • State = <xt, yt, vxt, vyt> = <x0, y0, vx0, vy0> = <2, 2, 1, 0>
  • This means the race car is moving east one grid square per timestep because the x-component of the velocity is 1, and the y-component is 0.

After observing the state, the race car selects an action. It accelerates with acceleration (1,1).

  • Action=(ax0, ay0)=(1, 1) = (increase velocity in x-direction by 1, increase velocity in y-direction by 1)

At t = 1, the race car observes the state.

  • Velocity is now (vx0 + ax0, vy0 + ay0) = (1 + 1, 0 + 1) = (vx1, vy1) = (2, 1)
  • Position (i.e. x and y coordinates) is now (x0 + vx1, y0 + vy1) = (2 + 2, 2 + 1) = (x1, y1) = (4, 3)
  • Thus, putting it all together, the new state is now <x1, y1, vx1, vy1> = <4, 3, 2, 1>

Acceleration is Not Guaranteed

There is another twist with the racetrack problem. At any timestep t, when the race car attempts to accelerate, there is a 20% chance that the attempt will fail. In other words, each time the race car selects an action, there is a:

  • 20% chance that the attempt fails, and velocity remains unchanged at the next timestep; i.e. (axt, ayt) = (0, 0)
  • 80% chance velocity changes at the next timestep; (axt, ayt) = (selected acceleration in the x-direction, selected acceleration in y-direction)

Must Avoid Crashing Into Walls

The race car must stay on the track. Crashing into walls is bad. Two versions of a “bad crash” will be implemented in this project:

  • Bad Crash Version 1: If the car crashes into a wall,
    • The car is placed at the nearest position on the track to the place where the car crashed.
    • The velocity is set to (0, 0).
  • Bad Crash Version 2: If the car crashes into a wall,
    • The car must go back to the original starting position.
    • The velocity is set to (0, 0).

In this project, the performance of both versions of bad crash will be compared side by side for the reinforcement learning algorithms implemented in this project.

Reward Function

Because the goal is for the race car to go from the starting position to the finish line in the minimum number of timesteps, we assign a reward of -1 for each move the car makes. Because the reward is negative, this means that the race car incurs a cost (i.e. penalty) of 1 each time the race car moves to another position. The goal is to get to the finish line in as few moves (i.e. time steps) as possible in order to minimize the total cost (i.e. maximize the total reward).

The finish line is the final state (often referred to as terminal or absorbing state). This state has a reward of 0. There is no requirement to stop right on the finish position (i.e. finish line). It is sufficient to cross it.

Description of Any Tuning Process Applied

In order to compare the performance of the algorithm for the different crash scenarios, the principal hyperparameters were kept the same for all runs. These hyperparameters are listed below.

Learning Rate

The Learning rate α was set to 0.2.

Discount Rate

The discount rate γ was set to 0.9.

Exploration Probability

The exploration probability ε was set to 0.2 as dictated by the problem statement.

Number of Training Iterations

The number of training iterations was varied in order to computer the learning curve (see Results section below)

Return to Table of Contents

Input Files

Here are the tracks used in this project (S= Starting States; F = Finish States, # = Wall, . = Track):

Value Iteration Code in Python

Here is the full code for value iteration. This and the racetrack text files above are all you need to run the program:

import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: value_iteration.py
# Author: Addison Sears-Collins
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the value iteration reinforcement learning
# algorithm for the racetrack problem. 
# The racetrack problem is described in full detail in: 
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using 
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and 
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning : 
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Value_Iteration"
FILENAME = "R-track.txt"
THIS_TRACK = "R_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
                 # we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a 
                            # according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 10 # A single training iteration runs through all
                            # possible states s
NO_RACES = 10 # How many times the race car does a single time trial from 
              # starting position to the finish line
FRAME_TIME = 0.3 # How many seconds between frames printed to the console
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
MAX_TRAIN_ITER = 50 # Maximum number of training iterations

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# All actions A the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
           (-1,0) , (0,0),  (1,0),
           (-1,1) , (0,1),  (1,1)]

def read_environment(filename):
    """
    This method reads in the environment (i.e. racetrack)
    :param str filename
    :return environment: list of lists
    :rtype list
    """
    # Open the file named filename in READ mode.
    # Make the file an object named 'file'
    with open(filename, 'r') as file:

        # Read until end of file using readline() 
        # readline() then returns a list of the lines
        # of the input file
        environment_data = file.readlines()
    
    # Close the file
    file.close()

    # Initialize an empty list
    environment = []

    # Adds a counter to each line in the environment_data list,
    # i is the index of each line and line is the actual contents.
    # enumerate() helps get not only the line of the environment but also 
    # the index of that line (i.e. row)
    for i,line in enumerate(environment_data):
        # Ignore the first line that contains the dimensions of the racetrack
        if i > 0:
            # Remove leading or trailing whitespace if applicable
            line = line.strip()

            # If the line is empty, ignore it
            if line == "": continue

            # Creates a list of lines. Within each line is a list of 
            # individual characters
            # The stuff inside append(stuff) first creates a new_list = []
            # It then appends all the values in a given line to that new 
            # list (e.g. new_list.append(all values inside the line))
            # Then we append that new list to the environment list.
            # Therefoer, environment is a list of lists.
            environment.append([x for x in line])

    # Return the environment (i.e. a list of lists/lines)
    return environment

def print_environment(environment, car_position = [0,0]):
    """
    This method reads in the environment and current 
    (y,x) position of the car and prints the environment to the console
    :param list environment
    :param list car_position 
    """
    # Store value of current grid square
    temp = environment[car_position[0]][car_position[1]]

    # Move the car to current grid square
    environment[car_position[0]][car_position[1]] = "X"

    # Delay 
    time.sleep(FRAME_TIME)

    # Clear the printed output
    clear()

    # For each line in the environment
    for line in environment: 

        # Initialize a string
        text = ""

        # Add each character to create a line
        for character in line: 
            text += character

        # Print the line of the environment
        print(text)

    # Retstore value of current grid square
    environment[car_position[0]][car_position[1]] = temp

def clear():
    """
    This method clears the print output
    """    
    os.system( 'cls' )

def get_random_start_position(environment):
    """
    This method reads in the environment and selects a random
    starting position on the racetrack (y, x). Note that 
    (0,0) corresponds to the upper left corner of the racetrack.
    :param list environment: list of lines
    :return random starting coordinate (y,x) on the racetrack
    :rtype tuple
    """
    # Collect all possible starting positions on the racetrack
    starting_positions = []

    # For each row in the environment
    for y,row in enumerate(environment):

        # For each column in each row of the environment
        for x,col in enumerate(row):

            # If we are at the starting position
            if col == START:

                # Add the coordiante to the list of available
                # starting positions in the environment
                starting_positions += [(y,x)]

    # Random shuffle the list of starting positions
    shuffle(starting_positions)

    # Select a starting position
    return starting_positions[0]

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
    """
    Get the new velocity values
    :param tuple old_vel: (vy, vx)
    :param tuple accel: (ay, ax)
    :param int min_vel: Minimum velocity of the car
    :param int max_vel: Maximum velocity of the car
    :return new velocities in y and x directions
    """
    new_y = old_vel[0] + accel[0] 
    new_x = old_vel[1] + accel[1]
    if new_x < min_vel: new_x = min_vel
    if new_x > max_vel: new_x = max_vel
    if new_y < min_vel: new_y = min_vel
    if new_y > max_vel: new_y = max_vel
    
    # Return the new velocities
    return new_y, new_x

def get_new_position(old_loc, vel, environment):
    """
    Get a new position using the old position and the velocity
    :param tuple old_loc: (y,x) position of the car
    :param tuple vel: (vy,vx) velocity of the car
    :param list environment
    :return y+vy, x+vx: (new_y,new_x)
    """
    y,x = old_loc[0], old_loc[1]
    vy, vx = vel[0], vel[1]

    # new_y = y+vy, new_x = x + vx    
    return y+vy, x+vx

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
        0), open = [TRACK, START, GOAL]):
    """
    Locate the nearest open cell in order to handle crash scenario.
    Distance is calculated as the Manhattan distance.
    Start from the crash grid square and expand outward from there with
    a radius of 1, 2, 3, etc. Forms a diamond search pattern.
    
    For example, a Manhattan distance of 2 would look like the following:     
            .
           ...
          ..#..
           ... 
            .   
    If velocity is provided, search in opposite direction of velocity so that
    there is no movement over walls
    :param list environment
    :param int ycrash: y coordinate where crash happened
    :param int xcrash: x coordinate where crash happened
    :param int vy: velocity in y direction when crash occurred
    :param int vx: velocity in x direction when crash occurred
    :param list of strings open: Contains environment types
    :return tuple of the nearest open y and x position on the racetrack
    """ 
    # Record number of rows (lines) and columns in the environment
    rows = len(environment)
    cols = len(environment[0])    
   
    # Add expanded coverage for searching for nearest open cell
    max_radius = max(rows,cols)

    # Generate a search radius for each scenario
    for radius in range(max_radius):

        # If car is not moving in y direction
        if vy == 0: 
            y_off_range = range(-radius, radius + 1)
        # If the velocity in y-direction is negative
        elif vy < 0:
            # Search in the positive direction
            y_off_range = range(0, radius + 1)
        else:
            # Otherwise search in the negative direction
            y_off_range = range(-radius, 1)

        # For each value in the search radius range of y
        for y_offset in y_off_range:

            # Start near to crash site and work outwards from there
            y = y_crash + y_offset
            x_radius = radius - abs(y_offset)

            # If car is not moving in x direction
            if vx == 0:
                x_range = range(x_crash - x_radius, x_crash + x_radius + 1)
            # If the velocity in x-direction is negative
            elif vx < 0:
                x_range = range(x_crash, x_crash + x_radius + 1)
            # If the velocity in y-direction is positive
            else:
                x_range = range(x_crash - x_radius, x_crash + 1)

            # For each value in the search radius range of x
            for x in x_range:
                # We can't go outside the environment(racetrack) boundary
                if y < 0 or y >= rows: continue
                if x < 0 or x >= cols: continue

                # If we find and open cell, return that (y,x) open cell
                if environment[y][x] in open: 
                    return(y,x)        
    
    # No open grid squares found on the racetrack
    return

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
    False),bad_crash = False):
    """
    This method generates the new state s' (position and velocity) from the old
    state s and the action a taken by the race car. It also takes as parameters
    the two different crash scenarios (i.e. go to nearest
    open position on the race track or go back to start)
    :param int old_y: The old y position of the car
    :param int old_x: The old x position of the car
    :param int old_vy: The old y velocity of the car
    :param int old_vx: The old x velocity of the car
    :param tuple accel: (ay,ax) - acceleration in y and x directions
    :param list environment: The racetrack
    :param boolean deterministic: True if we always follow the policy
    :param boolean bad_crash: True if we return to start after crash
    :return s' where s' = new_y, new_x, new_vy, and new_vx
    :rtype int
    """ 
    # This method is deterministic if the same output is returned given
    # the same input information
    if not deterministic:

        # If action fails (car fails to take the prescribed action a)
        if random() > PROB_ACCELER_SUCCESS: 
            #print("Car failed to accelerate!")
            accel = (0,0)
 
    # Using the old velocity values and the new acceleration values,
    # get the new velocity value
    new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

    # Using the new velocity values, update with the new position
    temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),( 
                                     environment))

    # Find the nearest open cell on the racetrack to this new position
    new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy, 
                                     new_vx)
    # If a crash happens (i.e. new position is not equal to the nearest
    # open position on the racetrack
    if new_y != temp_y or new_x != temp_x:

        # If this is a crash in which we would have to return to the
        # starting position of the racetrack and we are not yet
        # on the finish line
        if bad_crash and environment[new_y][new_x] != GOAL:

            # Return to the start of the race track
            new_y, new_x = get_random_start_position(environment)
        
        # Velocity of the race car is set to 0.
        new_vy, new_vx = 0,0

    # Return the new state
    return new_y, new_x, new_vy, new_vx

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
    """
    This method returns the policy pi(s) based on the action taken in each state
    that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
    best action that the race car should take in each state is the one that 
    maximizes the value of Q. (* means optimal)
    :param int cols: Number of columns in the environment
    :param int rows: Number of rows (i.e. lines) in the environment
    :param list vel_range: Range of the velocity of the race car 
    :param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
    :return pi : the policy
    :rtype: dictionary: key is the state tuple, value is the 
       action tuple (ay,ax)
    """
    # Create an empty dictionary called pi
    pi = {}

    # For each state s in the environment
    for y in range(rows): 
        for x in range(cols):
            for vy in vel_range:
                for vx in vel_range:
                    # Store the best action for each state...the one that
                    # maximizes the value of Q.
                    # argmax looks across all actions given a state and 
                    # returns the index ai of the maximum Q value
                    pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]        
                    
    # Return pi(s)
    return(pi)

def value_iteration(environment, bad_crash = False, reward = (
        0.0), no_training_iter = NO_TRAINING_ITERATIONS):
    """
    This method is the value iteration algorithm.
    :param list environment
    :param boolean bad_crash
    :param int reward of the terminal states (i.e. finish line)
    :param int no_training_iter
    :return policy pi(s) which maps a given state to an optimal action
    :rtype dictionary
    """
    # Calculate the number of rows and columns of the environment
    rows = len(environment)
    cols = len(environment[0])

    # Create a table V(s) that will store the optimal Q-value for each state. 
    # This table will help us determine when we should stop the algorithm and 
    # return the output. Initialize all the values of V(s) to arbitrary values, 
    # except the terminal state (i.e. finish line state) that has a value of 0.
    # values[y][x][vy][vx] 
    # Read from left to right, we create a list of vx values. Then for each
    # vy value we assign the list of vx values. Then for each x value, we assign
    # the list of vy values (which contain a list of vx values), etc.
    # This is called list comprehension.
    values = [[[[random() for _ in vel_range] for _ in vel_range] for _ in (
        line)] for line in environment]

    # Set the finish line states to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:                 
                        values[y][x][vy][vx] = reward

    # Initialize all Q(s,a) to arbitrary values, except the terminal state 
    # (i.e. finish line states) that has a value of 0.
    # Q[y][x][vy][vx][ai]
    Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
        vel_range)] for _ in line] for line in environment]

    # Set finish line state-action pairs to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:   
                        for ai, a in enumerate(actions):                        
                            Q[y][x][vy][vx][ai] = reward

    # This is where we train the agent (i.e. race car). Training entails 
    # optimizing the values in the tables of V(s) and Q(s,a)
    for t in range(no_training_iter):

        # Keep track of the old V(s) values so we know if we reach stopping 
        # criterion
        values_prev = deepcopy(values)

        # When this value gets below the error threshold, we stop training.
        # This is the maximum change of V(s)
        delta = 0.0

        # For all the possible states s in S
        for y in range(rows):
            for x in range(cols):
                for vy in vel_range:
                    for vx in vel_range:
                        
                        # If the car crashes into a wall
                        if environment[y][x] == WALL:

                            # Wall states have a negative value
                            # I set some arbitrary negative value since
                            # we want to train the car not to hit walls
                            values[y][x][vy][vx] = -9.9

                            # Go back to the beginning
                            # of this inner for loop so that we set
                            # all the other wall states to a negative value
                            continue

                        # For each action a in the set of possible actions A
                        for ai, a in enumerate(actions):

                            # The reward is -1 for every state except
                            # for the finish line states
                            if environment[y][x] == GOAL:
                                r = reward
                            else:
                                r = -1

                            # Get the new state s'. s' is based on the current 
                            # state s and the current action a
                            new_y, new_x, new_vy, new_vx = act(
                                y,x,vy,vx,a,environment,deterministic = True, 
                                bad_crash = bad_crash)

                            # V(s'): value of the new state when taking action
                            # a from state s. This is the one step look ahead.
                            value_of_new_state = values_prev[new_y][new_x][
                                new_vy][new_vx]

                            # Get the new state s'. s' is based on the current 
                            # state s and the action (0,0)
                            new_y, new_x, new_vy, new_vx = act(
                                y,x,vy,vx,(0,0),environment,deterministic = (
                                    True), bad_crash = bad_crash)

                            # V(s'): value of the new state when taking action
                            # (0,0) from state s. This is the value if for some
                            # reason the race car attemps to accelerate but 
                            # fails
                            value_of_new_state_if_action_fails = values_prev[
                                new_y][new_x][new_vy][new_vx]

                            # Expected value of the new state s'
                            # Note that each state-action pair has a unique 
                            # value for s'
                            expected_value = (
                                PROB_ACCELER_SUCCESS * value_of_new_state) + (
                                PROB_ACCELER_FAILURE * (
                                    value_of_new_state_if_action_fails))

                            # Update the Q-value in Q[s,a]
                            # immediate reward + discounted future value
                            Q[y][x][vy][vx][ai] = r + (
                                DISC_RATE * expected_value)

                        # Get the action with the highest Q value
                        argMaxQ = np.argmax(Q[y][x][vy][vx])

                        # Update V(s)
                        values[y][x][vy][vx] = Q[y][x][vy][vx][argMaxQ]

        # Make sure all the rewards to 0 in the terminal state
        for y in range(rows):
            for x in range(cols):
                # Terminal state has a value of 0
                if environment[y][x] == GOAL:
                    for vy in vel_range:
                        for vx in vel_range:                 
                            values[y][x][vy][vx] = reward

        # See if the V(s) values are stabilizing
        # Finds the maximum change of any of the states. Delta is a float.
        delta = max([max([max([max([abs(values[y][x][vy][vx] - values_prev[y][
            x][vy][vx]) for vx in vel_range]) for vy in (
            vel_range)]) for x in range(cols)]) for y in range(rows)])

        # If the values of each state are stabilizing, return the policy
        # and exit this method.
        if delta < ERROR_THRES:
            return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

    return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

def do_time_trial(environment, policy, bad_crash = False, animate = True, 
                  max_steps = MAX_STEPS):
    """
    Race car will do a time trial on the race track according to the policy.   
    :param list environment
    :param dictionary policy: A dictionary containing the best action for a 
        given state. The key is the state y,x,vy,vx and value is the action 
        (ax,ay) acceleration
    :param boolean bad_crash: The crash scenario. If true, race car returns to
        starting position after crashes
    :param boolean animate: If true, show the car on the racetrack at each 
        timestep
    :return i: Total steps to complete race (i.e. from starting position to 
        finish line)
    :rtype int

    """
    # Copy the environment
    environment_display = deepcopy(environment)

    # Get a starting position on the race track
    starting_pos = get_random_start_position(environment)
    y,x = starting_pos
    vy,vx = 0,0  # We initialize velocity to 0

    # Keep track if we get stuck
    stop_clock = 0    

    # Begin time trial
    for i in range(max_steps):        

        # Show the race car on the racetrack
        if animate: 
            print_environment(environment_display, car_position = [y,x])
        
        # Get the best action given the current state
        a = policy[(y,x,vy,vx)]

        # If we are at the finish line, stop the time trial
        if environment[y][x] == GOAL: 
            return i 
        
        # Take action and get new a new state s'
        y,x,vy,vx = act(y, x, vy, vx, a, environment, bad_crash = bad_crash)

        # Determine if the car gets stuck
        if vy == 0 and vx == 0:
            stop_clock += 1
        else:
            stop_clock = 0

        # We have gotten stuck as the car has not been moving for 5 timesteps
        if stop_clock == 5:
            return max_steps
        
    # Program has timed out
    return max_steps

def main():
    """
    The main method of the program
    """    
    print("Welcome to the Racetrack Control Program!")
    print("Powered by the " + ALGORITHM_NAME + 
          " Reinforcement Learning Algorithm\n")
    print("Track: " + THIS_TRACK)
    print()
    print("What happens to the car if it crashes into a wall?")
    option_1 = """1. Starts from the nearest position on the track to the 
        place where it crashed."""
    option_2 = """2. Returns back to the original starting position."""
    print(option_1)
    print(option_2)
    crash_scenario = int(input("Crash scenario (1 or 2): "))
    no_training_iter = int(input(
        "Enter the initial number of training iterations (e.g. 5): "))
    print("\nThe race car is training. Please wait...")

    # Directory where the racetrack is located
    #racetrack_name = input("Enter the path to your input file: ") 
    racetrack_name = FILENAME
    racetrack = read_environment(racetrack_name)

    # How many times the race car will do a single time trial
    races = NO_RACES

    while(no_training_iter < MAX_TRAIN_ITER):
    
        # Keep track of the total number of steps
        total_steps = 0

        # Record the crash scenario
        bad_crash = False
        if crash_scenario == 1:
            bad_crash = False
        else:
            bad_crash = True

        # Retrieve the policy
        policy = value_iteration(racetrack, bad_crash = bad_crash,
                             no_training_iter=no_training_iter) 
 
        for each_race in range(races):
            total_steps += do_time_trial(racetrack, policy, bad_crash = (
            bad_crash), animate = True)

        print("Number of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            print("Bad Crash Scenario: " + option_1)
        else:
            print("Bad Crash Scenario: " + option_2)
        print("Average Number of Steps the Race Car Needs to Take Before " + 
              "Finding the Finish Line: " + str(total_steps/races) + " steps\n")
        print("\nThe race car is training. Please wait...")

        # Delay 
        time.sleep(FRAME_TIME + 4)

        # Testing statistics
        test_stats_file = THIS_TRACK 
        test_stats_file += "_"
        test_stats_file += ALGORITHM_NAME + "_iter"
        test_stats_file += str(no_training_iter)+ "_cr" 
        test_stats_file += str(crash_scenario) + "_stats.txt" 

        ## Open a test_stats_file 
        outfile_ts = open(test_stats_file,"w")
        
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write("Track: ") 
        outfile_ts.write(THIS_TRACK)
        outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
        else:
            outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
        outfile_ts.write("Average Number of Steps the Race Car Took " + 
              "Before Finding the Finish Line: " + str(total_steps/races) + 
              " steps\n")

        # Show functioning of the program
        trace_runs_file = THIS_TRACK 
        trace_runs_file += "_"
        trace_runs_file += ALGORITHM_NAME + "_iter"
        trace_runs_file += str(no_training_iter) + "_cr"
        trace_runs_file += str(crash_scenario) + "_trace.txt"

        if no_training_iter <= 5:
            ## Open a new file to save trace runs
            outfile_tr = open(trace_runs_file,"w") 
        
            # Print trace runs that demonstrate proper functioning of the code
            outfile_tr.write(str(policy))  
            
            outfile_tr.close()

        ## Close the files        
        outfile_ts.close()

        no_training_iter += 5

main()

Q-Learning Code in Python

And here is the code for Q-Learning:

import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: q_learning.py
# Author: Addison Sears-Collins
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the Q-learning reinforcement learning
# algorithm for the racetrack problem
# The racetrack problem is described in full detail in: 
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using 
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and 
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning : 
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Q_Learning"
FILENAME = "L-track.txt"
THIS_TRACK = "L_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
                 # we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a 
                            # according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 500000 # A single training iteration runs through all
                            # possible states s
TRAIN_ITER_LENGTH = 10 # The maximum length of each training iteration
MAX_TRAIN_ITER = 10000000 # Maximum number of training iterations
NO_RACES = 10 # How many times the race car does a single time trial from 
              # starting position to the finish line
LEARNING_RATE = 0.25 # If applicable, also known as alpha
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
FRAME_TIME = 0.1 # How many seconds between frames printed to the console

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# Actions the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
           (-1,0) , (0,0),  (1,0),
           (-1,1) , (0,1),  (1,1)]

def read_environment(filename):
    """
    This method reads in the environment (i.e. racetrack)
    :param str filename
    :return environment: list of lists
    :rtype list
    """
    # Open the file named filename in READ mode.
    # Make the file an object named 'file'
    with open(filename, 'r') as file:

        # Read until end of file using readline() 
        # readline() then returns a list of the lines
        # of the input file
        environment_data = file.readlines()
    
    # Close the file
    file.close()

    # Initialize an empty list
    environment = []

    # Adds a counter to each line in the environment_data list,
    # i is the index of each line and line is the actual contents.
    # enumerate() helps get not only the line of the environment but also 
    # the index of that line (i.e. row)
    for i,line in enumerate(environment_data):
        # Ignore the first line that contains the dimensions of the racetrack
        if i > 0:
            # Remove leading or trailing whitespace if applicable
            line = line.strip()

            # If the line is empty, ignore it
            if line == "": continue

            # Creates a list of lines. Within each line is a list of 
            # individual characters
            # The stuff inside append(stuff) first creates a new_list = []
            # It then appends all the values in a given line to that new 
            # list (e.g. new_list.append(all values inside the line))
            # Then we append that new list to the environment list.
            # Therefoer, environment is a list of lists.
            environment.append([x for x in line])

    # Return the environment (i.e. a list of lists/lines)
    return environment

def print_environment(environment, car_position = [0,0]):
    """
    This method reads in the environment and current 
    (y,x) position of the car and prints the environment to the console
    :param list environment
    :param list car_position 
    """
    # Store value of current grid square
    temp = environment[car_position[0]][car_position[1]]

    # Move the car to current grid square
    environment[car_position[0]][car_position[1]] = "X"

    # Delay 
    time.sleep(FRAME_TIME)

    # Clear the printed output
    clear()

    # For each line in the environment
    for line in environment: 

        # Initialize a string
        text = ""

        # Add each character to create a line
        for character in line: 
            text += character

        # Print the line of the environment
        print(text)

    # Retstore value of current grid square
    environment[car_position[0]][car_position[1]] = temp

def clear():
    """
    This method clears the print output
    """    
    os.system( 'cls' )

def get_random_start_position(environment):
    """
    This method reads in the environment and selects a random
    starting position on the racetrack (y, x). Note that 
    (0,0) corresponds to the upper left corner of the racetrack.
    :param list environment: list of lines
    :return random starting coordinate (y,x) on the racetrack
    :rtype tuple
    """
    # Collect all possible starting positions on the racetrack
    starting_positions = []

    # For each row in the environment
    for y,row in enumerate(environment):

        # For each column in each row of the environment
        for x,col in enumerate(row):

            # If we are at the starting position
            if col == START:

                # Add the coordiante to the list of available
                # starting positions in the environment
                starting_positions += [(y,x)]

    # Random shuffle the list of starting positions
    shuffle(starting_positions)

    # Select a starting position
    return starting_positions[0]

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
    False),bad_crash = False):
    """
    This method generates the new state s' (position and velocity) from the old
    state s and the action a taken by the race car. It also takes as parameters
    the two different crash scenarios (i.e. go to nearest
    open position on the race track or go back to start)
    :param int old_y: The old y position of the car
    :param int old_x: The old x position of the car
    :param int old_vy: The old y velocity of the car
    :param int old_vx: The old x velocity of the car
    :param tuple accel: (ay,ax) - acceleration in y and x directions
    :param list environment: The racetrack
    :param boolean deterministic: True if we always follow the policy
    :param boolean bad_crash: True if we return to start after crash
    :return s' where s' = new_y, new_x, new_vy, and new_vx
    :rtype int
    """ 
    # This method is deterministic if the same output is returned given
    # the same input information
    if not deterministic:

        # If action fails (car fails to take the prescribed action a)
        if random() > PROB_ACCELER_SUCCESS: 
            #print("Car failed to accelerate!")
            accel = (0,0)
 
    # Using the old velocity values and the new acceleration values,
    # get the new velocity value
    new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

    # Using the new velocity values, update with the new position
    temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),( 
                                     environment))

    # Find the nearest open cell on the racetrack to this new position
    new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy, 
                                     new_vx)
    # If a crash happens (i.e. new position is not equal to the nearest
    # open position on the racetrack
    if new_y != temp_y or new_x != temp_x:

        # If this is a crash in which we would have to return to the
        # starting position of the racetrack and we are not yet
        # on the finish line
        if bad_crash and environment[new_y][new_x] != GOAL:

            # Return to the start of the race track
            new_y, new_x = get_random_start_position(environment)
        
        # Velocity of the race car is set to 0.
        new_vy, new_vx = 0,0

    # Return the new state
    return new_y, new_x, new_vy, new_vx

def get_new_position(old_loc, vel, environment):
    """
    Get a new position using the old position and the velocity
    :param tuple old_loc: (y,x) position of the car
    :param tuple vel: (vy,vx) velocity of the car
    :param list environment
    :return y+vy, x+vx: (new_y,new_x)
    """
    y,x = old_loc[0], old_loc[1]
    vy, vx = vel[0], vel[1]

    # new_y = y+vy, new_x = x + vx    
    return y+vy, x+vx

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
    """
    Get the new velocity values
    :param tuple old_vel: (vy, vx)
    :param tuple accel: (ay, ax)
    :param int min_vel: Minimum velocity of the car
    :param int max_vel: Maximum velocity of the car
    :return new velocities in y and x directions
    """
    new_y = old_vel[0] + accel[0] 
    new_x = old_vel[1] + accel[1]
    if new_x < min_vel: new_x = min_vel
    if new_x > max_vel: new_x = max_vel
    if new_y < min_vel: new_y = min_vel
    if new_y > max_vel: new_y = max_vel
    
    # Return the new velocities
    return new_y, new_x

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
        0), open = [TRACK, START, GOAL]):
    """
    Locate the nearest open cell in order to handle crash scenario.
    Distance is calculated as the Manhattan distance.
    Start from the crash grid square and expand outward from there with
    a radius of 1, 2, 3, etc. Forms a diamond search pattern.
    
    For example, a Manhattan distance of 2 would look like the following:     
            .
           ...
          ..#..
           ... 
            .   
    If velocity is provided, search in opposite direction of velocity so that
    there is no movement over walls
    :param list environment
    :param int ycrash: y coordinate where crash happened
    :param int xcrash: x coordinate where crash happened
    :param int vy: velocity in y direction when crash occurred
    :param int vx: velocity in x direction when crash occurred
    :param list of strings open: Contains environment types
    :return tuple of the nearest open y and x position on the racetrack
    """ 
    # Record number of rows (lines) and columns in the environment
    rows = len(environment)
    cols = len(environment[0])    
   
    # Add expanded coverage for searching for nearest open cell
    max_radius = max(rows,cols)

    # Generate a search radius for each scenario
    for radius in range(max_radius):

        # If car is not moving in y direction
        if vy == 0: 
            y_off_range = range(-radius, radius + 1)
        # If the velocity in y-direction is negative
        elif vy < 0:
            # Search in the positive direction
            y_off_range = range(0, radius + 1)
        else:
            # Otherwise search in the negative direction
            y_off_range = range(-radius, 1)

        # For each value in the search radius range of y
        for y_offset in y_off_range:

            # Start near to crash site and work outwards from there
            y = y_crash + y_offset
            x_radius = radius - abs(y_offset)

            # If car is not moving in x direction
            if vx == 0:
                x_range = range(x_crash - x_radius, x_crash + x_radius + 1)
            # If the velocity in x-direction is negative
            elif vx < 0:
                x_range = range(x_crash, x_crash + x_radius + 1)
            # If the velocity in y-direction is positive
            else:
                x_range = range(x_crash - x_radius, x_crash + 1)

            # For each value in the search radius range of x
            for x in x_range:
                # We can't go outside the environment(racetrack) boundary
                if y < 0 or y >= rows: continue
                if x < 0 or x >= cols: continue

                # If we find and open cell, return that (y,x) open cell
                if environment[y][x] in open: 
                    return(y,x)        
    
    # No open grid squares found on the racetrack
    return

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
    """
    This method returns the policy pi(s) based on the action taken in each state
    that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
    best action that the race car should take in each state is the one that 
    maximizes the value of Q. (* means optimal)
    :param int cols: Number of columns in the environment
    :param int rows: Number of rows (i.e. lines) in the environment
    :param list vel_range: Range of the velocity of the race car 
    :param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
    :return pi : the policy
    :rtype: dictionary: key is the state tuple, value is the 
       action tuple (ay,ax)
    """
    # Create an empty dictionary called pi
    pi = {}

    # For each state s in the environment
    for y in range(rows): 
        for x in range(cols):
            for vy in vel_range:
                for vx in vel_range:
                    # Store the best action for each state...the one that
                    # maximizes the value of Q.
                    # argmax looks across all actions given a state and 
                    # returns the index ai of the maximum Q value
                    pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]        
                    
    # Return pi(s)
    return(pi)

def q_learning(environment, bad_crash = False, reward = 0.0, no_training_iter =( 
    NO_TRAINING_ITERATIONS), train_iter_length = TRAIN_ITER_LENGTH):
    """ 
    Return a policy pi that maps states to actions
    Each episode uses a different initial state. This forces the agent to fully
    explore the environment to create a more informed Q[s,a] table.
    :param list environment
    :param boolean bad_crash
    :param int reward of the terminal states (i.e. finish line)
    :param int no_training_iter
    :param int train_iter_length
    :return policy pi(s) which maps a given state to an optimal action
    :rtype dictionary
    """
    rows = len(environment)
    cols = len(environment[0])    

    # Initialize all Q(s,a) to arbitrary values, except the terminal state 
    # (i.e. finish line states) that has a value of 0.
    # Q[y][x][vy][vx][ai]
    Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
        vel_range)] for _ in line] for line in environment]

    # Set finish line state-action pairs to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:   
                        for ai, a in enumerate(actions):                        
                            Q[y][x][vy][vx][ai] = reward

    # We run many training iterations for different initial states in order
    # to explore the environment as much as possible
    for iter in range(no_training_iter):
               
        # Reset all the terminal states to the value of the goal
        for y in range(rows):
            for x in range(cols): 
                if environment[y][x] == GOAL: 
                    Q[y][x] = [[[reward for _ in actions] for _ in (
                        vel_range)] for _ in vel_range] 
        
        # Select a random initial state
        # from anywhere on the racetrack
        y = np.random.choice(range(rows))
        x = np.random.choice(range(cols))
        vy = np.random.choice(vel_range) 
        vx = np.random.choice(vel_range) 

        # Do a certain number of iterations for each episode
        for t in range(train_iter_length):
            if environment[y][x] == GOAL: 
                break
            if environment[y][x] == WALL: 
                break

            # Choose the best action for the state s
            a = np.argmax(Q[y][x][vy][vx])           
            
            # Act and then observe a new state s'
            new_y, new_x, new_vy, new_vx = act(y, x, vy, vx, actions[
                a], environment, bad_crash = bad_crash) 
            r = -1

            # Update the Q table based on the immediate reward received from
            # taking action a in state s plus the discounted future reward
            Q[y][x][vy][vx][a] = ((1 - LEARNING_RATE)*Q[y][x][vy][vx][a] + 
                LEARNING_RATE*(r + DISC_RATE*max(Q[new_y][new_x][
                    new_vy][new_vx])))

            # The new state s' now becomes s
            y, x, vy, vx = new_y, new_x, new_vy, new_vx

    return(get_policy_from_Q(cols, rows, vel_range, Q, actions))


def do_time_trial(environment, policy, bad_crash = False, animate = True, 
                  max_steps = MAX_STEPS):
    """
    Race car will do a time trial on the race track according to the policy.   
    :param list environment
    :param dictionary policy: A dictionary containing the best action for a 
        given state. The key is the state y,x,vy,vx and value is the action 
        (ax,ay) acceleration
    :param boolean bad_crash: The crash scenario. If true, race car returns to
        starting position after crashes
    :param boolean animate: If true, show the car on the racetrack at each 
        timestep
    :return i: Total steps to complete race (i.e. from starting position to 
        finish line)
    :rtype int

    """
    # Copy the environment
    environment_display = deepcopy(environment)

    # Get a starting position on the race track
    starting_pos = get_random_start_position(environment)
    y,x = starting_pos
    vy,vx = 0,0  # We initialize velocity to 0

    # Keep track if we get stuck
    stop_clock = 0    

    # Begin time trial
    for i in range(max_steps):        

        # Show the race car on the racetrack
        if animate: 
            print_environment(environment_display, car_position = [y,x])
        
        # Get the best action given the current state
        a = policy[(y,x,vy,vx)]

        # If we are at the finish line, stop the time trial
        if environment[y][x] == GOAL: 
            return i 
        
        # Take action and get new a new state s'
        y,x,vy,vx = act(y, x, vy, vx, a, environment, bad_crash = bad_crash)

        # Determine if the car gets stuck
        if vy == 0 and vx == 0:
            stop_clock += 1
        else:
            stop_clock = 0

        # We have gotten stuck as the car has not been moving for 5 timesteps
        if stop_clock == 5:
            return max_steps
        
    # Program has timed out
    return max_steps

def main():
    """
    The main method of the program
    """    
    print("Welcome to the Racetrack Control Program!")
    print("Powered by the " + ALGORITHM_NAME + 
          " Reinforcement Learning Algorithm\n")
    print("Track: " + THIS_TRACK)
    print()
    print("What happens to the car if it crashes into a wall?")
    option_1 = """1. Starts from the nearest position on the track to the 
        place where it crashed."""
    option_2 = """2. Returns back to the original starting position."""
    print(option_1)
    print(option_2)
    crash_scenario = int(input("Crash scenario (1 or 2): "))
    no_training_iter = int(input(
        "Enter the initial number of training iterations (e.g. 500000): "))
    print("\nThe race car is training. Please wait...")

    # Directory where the racetrack is located
    #racetrack_name = input("Enter the path to your input file: ") 
    racetrack_name = FILENAME
    racetrack = read_environment(racetrack_name)

    # How many times the race car will do a single time trial
    races = NO_RACES

    while(no_training_iter <= MAX_TRAIN_ITER):
    
        # Keep track of the total number of steps
        total_steps = 0

        # Record the crash scenario
        bad_crash = False
        if crash_scenario == 1:
            bad_crash = False
        else:
            bad_crash = True

        # Retrieve the policy
        policy =  q_learning(racetrack, bad_crash = (
            bad_crash),no_training_iter=no_training_iter)
 
        for each_race in range(races):
            total_steps += do_time_trial(racetrack, policy, bad_crash = (
            bad_crash), animate = False)

        print("Number of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            print("Bad Crash Scenario: " + option_1)
        else:
            print("Bad Crash Scenario: " + option_2)
        print("Average Number of Steps the Race Car Needs to Take Before " + 
              "Finding the Finish Line: " + str(total_steps/races) + " steps\n")
        print("\nThe race car is training. Please wait...")

        # Delay 
        time.sleep(FRAME_TIME + 4)

        # Testing statistics
        test_stats_file = THIS_TRACK 
        test_stats_file += "_"
        test_stats_file += ALGORITHM_NAME + "_iter"
        test_stats_file += str(no_training_iter)+ "_cr" 
        test_stats_file += str(crash_scenario) + "_stats.txt" 

        ## Open a test_stats_file 
        outfile_ts = open(test_stats_file,"w")
        
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write("Track: ") 
        outfile_ts.write(THIS_TRACK)
        outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
        else:
            outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
        outfile_ts.write("Average Number of Steps the Race Car Took " + 
              "Before Finding the Finish Line: " + str(total_steps/races) + 
              " steps\n")

        # Show functioning of the program
        trace_runs_file = THIS_TRACK 
        trace_runs_file += "_"
        trace_runs_file += ALGORITHM_NAME + "_iter"
        trace_runs_file += str(no_training_iter) + "_cr"
        trace_runs_file += str(crash_scenario) + "_trace.txt"

        if no_training_iter <= 5:
            ## Open a new file to save trace runs
            outfile_tr = open(trace_runs_file,"w") 
        
            # Print trace runs that demonstrate proper functioning of the code
            outfile_tr.write(str(policy))  
            
            outfile_tr.close()

        ## Close the files        
        outfile_ts.close()

        if no_training_iter == 500000:
            no_training_iter += 500000
        else:
            no_training_iter += 1000000
           
main()

Return to Table of Contents

Experimental Output and Analysis

Here are the results:

value-iteration-1
value-iteration-2
q-learning-1
q-learning-2
q-learning-3

Analysis

Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

Both the value iteration and Q-learning algorithms were successfully able to finish the race for both crash scenarios and for both the R-track and the L-track. The race car completed the R-track in fewer time steps than the L-track, which was expected.

The crash policy in which the car had to return to the starting position after it crashed versus going to the nearest open position negatively impacted performance for both algorithms. Time trial performance was superior for the value iteration algorithm compared to the Q learning algorithm. This result was expected given that the value iteration algorithm had access to the transition and the reward probabilities during the training phase. In other words, value iteration it gets a complete model of the entire track, whereas Q-learning has to be forced to explore the environment in order to learn.

Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

My hypothesis was correct. Value iteration generated a learning policy faster than Q-learning because it had access to the transition and reward functions. Q-learning required more than 6,000,000 training iterations on the R-track to achieve the time trial performance that the value iteration algorithm was able to produce in only 45 training iterations (holding all other hyperparameters constant). And whereas the Q-learning algorithm took 2-3 hours to train to get good time trial results, value iteration never took more than 10 minutes for both the R-track and L-track under both of the bad crash scenarios.

Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

The results show that my hypothesis was correct. It took longer for the car to finish the race for the crash scenario in which the race car needed to return to the original starting position each time it crashed into a wall. In other words, Bad Crash Version 1 (return to start) performance was better than Bad Crash Version 2 (return to nearest open position) performance. These results are what I expected since always returning to the starting position after a crash hampers the ability of the race car to continually explore new states.

Summary and Conclusions

My hypotheses were correct. Here are the following conclusions:

  • Value iteration and Q-learning are powerful reinforcement learning algorithms that can enable an agent to learn autonomously.
  • Value iteration led to faster learning than the Q-learning algorithm.
  • A crash policy in which the race car always returns to the starting position after a crash negatively impacts performance.

Return to Table of Contents

Video Demonstration

Here is a video demonstration of the algorithms in action.

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 81-138.

Russell, S. J., Norvig, P., & Davis, E. (2010). Artificial intelligence : A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning : An Introduction . Cambridge, Massachusetts: The MIT Press.

Return to Table of Contents

Artificial Feedforward Neural Network With Backpropagation From Scratch

In this post, I will walk you through how to build an artificial feedforward neural network trained with backpropagation, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to evaluate the performance of an artificial feedforward neural network trained with backpropagation and to compare the performance using no hidden layers, one hidden layer, and two hidden layers. Five different data sets from the UCI Machine Learning Repository are used to compare performance: Breast Cancer, Glass, Iris, Soybean (small), and Vote.

We will use our neural network to do the following:

  • Predict if someone has breast cancer
  • Identify glass type
  • Identify flower species
  • Determine soybean disease type
  • Classify a representative as either a Democrat or Republican based on their voting patterns

I hypothesize that the neural networks with no hidden layers will outperform the networks with two hidden layers. My hypothesis is based on the notion that the simplest solutions are often the best solutions (i.e. Occam’s Razor).

The classification accuracy of the algorithms on the data sets will be evaluated as follows, using five-fold stratified cross-validation:

  • Accuracy = (number of correct predictions)/(total number of predictions)

Title image source: Wikimedia commons

Table of Contents

What is an Artificial Feedforward Neural Network Trained with Backpropagation?

neural_network-1

Background

An artificial feed-forward neural network (also known as multilayer perceptron) trained with backpropagation is an old machine learning technique that was developed in order to have machines that can mimic the brain. Neural networks were the focus of a lot of machine learning research during the 1980s and early 1990s but declined in popularity during the late 1990s.

Since 2010, neural networks have experienced a resurgence in popularity due to improvements in computing speed and the availability of massive amounts of data with which to train large-scale neural networks. In the real world, neural networks have been used to recognize speech, caption images, and even help self-driving cars learn how to park autonomously.

The Brain as Inspiration for Artificial Neural Networks

neuron_t_png

In order to understand neural networks, it helps to first take a look at the basic architecture of the human brain. The brain has 1011 neurons (Alpaydin, 2014). Neurons are cells inside the brain that process information.

Each neuron contains a number of input wires called dendrites. Each neuron also has one output wire called an axon. The axon is used to send messages to other neurons. The axon of a sending neuron is connected to the dendrites of the receiving neuron via a synapse.

So, in short, a neuron receives inputs from dendrites, performs a computation, and sends the output to other neurons via the axon. This process is how information flows through the brain. The messages sent between neurons are in the form of electric pulses.

An artificial neural network, the one used in machine learning, is a simplified model of the actual human neural network explained above. It is typically composed of zero or more layers.

neural-network

Each layer of the neural network is made up of nodes (analogous to neurons in the brain). Nodes of one layer are connected to nodes in another layer by connection weights, which are typically just floating-point numbers (e.g. 0.23342341). These numbers represent the strength of the connection between two nodes.

The job of a node in a hidden layer is to:

  1. Receive input values from each node in a preceding layer
  2. Compute a weighted sum of those input values
  3. Send that weighted sum through some activation function (e.g. logistic sigmoid function or hyperbolic tangent function)
  4. Send the result of the computation in #3 to each node in the next layer of the neural network.

Thus, the output from the nodes in a given layer becomes the input for all nodes in the next layer.

The output layer of a network does steps 1-3 above. However, the result of the computation from step #3 is a class prediction instead of an input to another layer (since the output layer is the final layer).

Here is a diagram of the process I explained above:

Here is a diagram showing a single layer neural network:

b stands for the bias term. This is a constant. It is like the b in the equation for a line, y = mx + b. It enables the model to have flexibility because, without that bias term, you cannot as easily adapt the weighted sum of inputs (i.e. mx) to fit the data (i.e. in the example of a simple line, the line cannot move up and down the y-axis without that b term).

w in the diagram above stands for the weights, and x stands for the input values.

Here is a similar diagram, but now it is a two-layer neural network instead of single layer.

And here is one last way to look at the same thing I explained above:

artificial_neuron_scheme

Note that the yellow circles on the left represent the input values. w represents the weights. The sigma inside the box means that we calculated the weighted sum of the input values. We run that through the activation function f(S)…e.g. sigmoid function. And then out of that, pops the output, which is passed on to the nodes in the following layer.

Neural networks that contain many layers, for example more than 100, are called deep neural networks. Deep neural networks are the cornerstone of the rapidly growing field known as deep learning.

Training Phase

The objective during the training phase of a neural network is to determine all the connection weights. At the start of training, the weights of the network are initialized to small random values close to 0. After this step, training proceeds to the two main phases of the algorithm: forward propagation and backpropagation.

Forward Propagation

During the forward propagation phase of a neural network, we process one instance (i.e. one set of inputs) at a time. Hidden layers extract important features contained in the input data by computing a weighted sum of the inputs and running the result through the logistic sigmoid activation function. This output relays to nodes in the next hidden layer where the data is transformed yet again. This process continues until the data reaches the output layer.

The output of the output layer is a predicted class value, which in this project is a one-hot encoded class prediction vector. The index of the vector corresponds to each class. For example, if a 1 is in the 0 index of the vector (and a 0 is in all other indices of the vector), the class prediction is class 0. Because we are dealing with 0s and 1s, the output vector can also be considered the probability that an instance is in a given class.

Backpropagation

After the input signal produced by a training instance propagates through the network one layer at a time to the output layer, the backpropagation phase commences. An error value is calculated at the output layer. This error corresponds to the difference between the class predicted by the network and the actual (i.e. true) class of the training instance.

The prediction error is then propagated backward from the output layer to the input layer. Blame for the error is assigned to each node in each layer, and then the weights of each node of the neural network are updated accordingly (with the goal to make more accurate class predictions for the next instance that flows through the neural network) using stochastic gradient descent for the weight optimization procedure.

Note that weights of the neural network are adjusted on a training instance by training instance basis. This online learning method is the preferred one for classification problems of large size (Ĭordanov & Jain, 2013).

The forward propagation and backpropagation phases continue for a certain number of epochs. A single epoch finishes when each training instance has been processed exactly once.

Testing Phase

Once the neural network has been trained, it can be used to make predictions on new, unseen test instances. Test instances flow through the network one-by-one, and the resulting output (which is a vector of class probabilities) determines the classification. 

Helpful Video

Below is a helpful video by Andrew Ng, a professor at Stanford University, that explains neural networks and is helpful for getting your head around the math. The video gets pretty complicated in some spots (esp. where he starts writing all sorts of mathematical notation and derivatives). My advice is to lookup anything that he explains that isn’t clear. Take it slow as you are learning about neural networks. There is no rush. This stuff isn’t easy to understand on your first encounter with it. Over time, the fog will begin to lift, and you will be able to understand how it all works.

Return to Table of Contents

Artificial Feedforward Neural Network Trained with Backpropagation Algorithm Design

The Logistic Regression algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

Required Data Set Format for Feedforward Neural Network Trained with Backpropagation

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  • N: Actual Class

The program then adds two additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

Modification of Attribute Values

  • The actual class value was changed to “Benign” or “Malignant.”
  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There are no missing values in this data set.

Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There were no missing attribute values.

Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.
  • Attribute values that were all the same value were removed.

Missing Data

There are no missing values in this data set.

Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

  • 267 Democrats
  • 168 Republicans

Modification of Attribute Values

  • I did the following modifications:
    • Changed all “y” to 1 and all “n” to 0.
  • Class values were vectorized using one-hot encoding.

Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).

Stochastic Gradient Descent

I used stochastic gradient descent for optimizing the weights.

In normal gradient descent, we need to calculate the partial derivative of the cost function with respect to each weight. For each partial derivative, we have to tally up the terms for each training instance to compute the partial derivative of the cost function with respect to that weight. What this means is that, if we have a lot of attributes and a large dataset, gradient descent is slow. For this reason, stochastic gradient descent was chosen since weights are updated after each training instance (as opposed to after all training instances).

Here is a good video that explains stochastic gradient descent.

Logistic (Sigmoid) Activation Function

The logistic (sigmoid) activation function was used for the nodes in the neural network.

Description of Any Tuning Process Applied

Learning Rate

Some tuning was performed in this project. The learning rate was set to 0.1, which was different than the 0.01 value that is often used for multi-layer feedforward neural networks (Montavon, 2012). Lower values resulted in much longer training times and did not result in large improvements in classification accuracy.

Epochs

The number of epochs chosen for the main runs of the algorithm on the data sets was chosen to be 1000. Other values were tested, but the number of epochs did not have a large impact on classification accuracy.

Number of Nodes per Hidden Layer

In order to tune the number of nodes per hidden layer, I used a constant learning rate and constant number of epochs. I then calculated the classification accuracy for each data set for a set number of nodes per hidden layer. I performed this process using networks with one hidden layer and networks with two hidden layers. The results of this tuning process are below.

tuning-artificial-neural-network

Note that the mean classification accuracy across all data sets when one hidden layer was used for the neural network reached a peak at eight nodes per hidden layer. This value of eight nodes per hidden layer was used for the actual runs on the data sets.

For two hidden layers, the peak mean classification accuracy was attained at five nodes per hidden layer. Thus, when the algorithm was run on the data sets for two hidden layers, I used five nodes per hidden layer for each data set to compare the classification accuracy across the data sets.

Return to Table of Contents

Artificial Feedforward Neural Network Trained with Backpropagation Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the full code for the neural network. This is all you need to run the program:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from random import shuffle # Import shuffle() method from the random module
from random import seed # Import seed() method from the random module
from random import random # Import random() method from the random module
from collections import Counter # Used for counting
from math import exp # Import exp() function from the math module

# File name: neural_network.py
# Author: Addison Sears-Collins
# Date created: 7/30/2019
# Python version: 3.7
# Description: An artificial feedforward neural network trained 
#   with backpropagation (also called multilayer perceptron)

# Required Data Set Format
# Columns (0 through N)
# 0: Attribute 0
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# 2 additional columns are added for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct?

ALGORITHM_NAME = "Feedforward Neural Network With Backpropagation"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def normalize(dataset):
    """
    Normalize the attribute values so that they are between 0 and 1, inclusive
    :param pandas_dataframe dataset: The original dataset as a Pandas dataframe
    :return: normalized_dataset
    :rtype: Pandas dataframe
    """
    # Generate a list of the column names 
    column_names = list(dataset) 

    # For every column except the actual class column
    for col in range(0, len(column_names) - 1):  
        temp = dataset[column_names[col]] # Go column by column
        minimum = temp.min() # Get the minimum of the column
        maximum = temp.max() # Get the maximum of the column

        # Normalized all values in the column so that they
        # are between 0 and 1.
        # x_norm = (x_i - min(x))/(max(x) - min(x))
        dataset[column_names[col]] = dataset[column_names[col]] - minimum
        dataset[column_names[col]] = dataset[column_names[col]] / (
            maximum - minimum)

    normalized_dataset = dataset

    return normalized_dataset

def get_five_stratified_folds(dataset):
    """
    Implementation of five-fold stratified cross-validation. Divide the data
    set into five random folds. Make sure that the proportion of each class 
    in each fold is roughly equal to its proportion in the entire data set.
    :param pandas_dataframe dataset: The original dataset as a Pandas dataframe
    :return: five_folds
    :rtype: list of folds where each fold is a list of instances(i.e. examples)
    """
    # Create five empty folds
    five_folds = list()
    fold0 = list()
    fold1 = list()
    fold2 = list()
    fold3 = list()
    fold4 = list()

    # Get the number of columns in the data set
    class_column = len(dataset[0]) - 1

    # Shuffle the data randomly
    shuffle(dataset)

    # Generate a list of the unique class values and their counts
    classes = list()  # Create an empty list named 'classes'

    # For each instance in the dataset, append the value of the class
    # to the end of the classes list
    for instance in dataset:
        classes.append(instance[class_column])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in dataset:

            # If we have a match
            if uniqueclass == instance[class_column]:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    shuffle(fold0)
    shuffle(fold1)
    shuffle(fold2)
    shuffle(fold3)
    shuffle(fold4)

    # Add the five stratified folds to the list
    five_folds.append(fold0)
    five_folds.append(fold1)
    five_folds.append(fold2)
    five_folds.append(fold3)
    five_folds.append(fold4)

    return five_folds

def initialize_neural_net(
    no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs):
    """
    Generates a new neural network that is ready to be trained.
    Network (list of layers): 0+ hidden layers, and output layer
    Input Layer (list of attribute values): A row from the training set 
    Hidden Layer (list of dictionaries): A set of nodes (i.e. neurons)
    Output Layer (list of dictionaries): A set of nodes, one node per class
    Node (dictionary): Contains a set of weights, one weight for each input 
      to the layer containing that node + an additional weight for the bias.
      Each node is represented as a dictionary that stores key-value pairs
      Each key corresponds to a property of that node (e.g. weights).
      Weights will be initialized to small random values between 0 and 1.
    :param int no_inputs: Numper of inputs (i.e. attributes)
    :param int no_hidden_layers: Numper of hidden layers (0 or more)
    :param int no_nodes_per_hidden_layer: Numper of nodes per hidden layer
    :param int no_outputs: Numper of outputs (one output node per class)
    :return: network
    :rtype:list (i.e. list of layers: hidden layers, output layer)
    """

    # Create an empty list
    network = list()

    # Create the the hidden layers
    hidden_layer = list()
    hl_counter = 0

    # Create the output layer
    output_layer = list()

    # If this neural network contains hidden layers
    if no_hidden_layers > 0:

        # Build one hidden layer at a time
        for layer in range(no_hidden_layers):

            # Reset to an empty hidden layer
            hidden_layer = list()

            # If this is the first hidden layer
            if hl_counter == 0:

                # Build one node at a time
                for node in range(no_nodes_per_hidden_layer):

                    initial_weights = list()
                    
                    # Each node in the hidden layer has no_inputs + 1 weights, 
                    # initialized to a random number in the range [0.0, 1.0)
                    for i in range(no_inputs + 1):
                        initial_weights.append(random())

                    # Add the node to the first hidden layer
                    hidden_layer.append({'weights':initial_weights})

                # Finished building the first hidden layer
                hl_counter += 1

                # Add this first hidden layer to the front of the neural 
                # network
                network.append(hidden_layer)

            # If this is not the first hidden layer
            else:

                # Build one node at a time
                for node in range(no_nodes_per_hidden_layer):

                    initial_weights = list()
                    
                    # Each node in the hidden layer has 
                    # no_nodes_per_hidden_layer + 1 weights, initialized to 
                    # a random number in the range [0.0, 1.0)
                    for i in range(no_nodes_per_hidden_layer + 1):
                        initial_weights.append(random())

                    hidden_layer.append({'weights':initial_weights})

                # Add this newly built hidden layer to the neural network
                network.append(hidden_layer)

        # Build the output layer
        for outputnode in range(no_outputs):

            initial_weights = list()
                    
            # Each node in the output layer has no_nodes_per_hidden_layer 
            # + 1 weights, initialized to a random number in 
            # the range [0.0, 1.0)
            for i in range(no_nodes_per_hidden_layer + 1):
                initial_weights.append(random())

            # Add this output node to the output layer
            output_layer.append({'weights':initial_weights})

        # Add the output layer to the neural network
        network.append(output_layer)
    
    # A neural network has no hidden layers
    else:

        # Build the output layer
        for outputnode in range(no_outputs):
        
            initial_weights = list()
                    
            # Each node in the hidden layer has no_inputs + 1 weights, 
            # initialized to a random number in the range [0.0, 1.0)
            for i in range(no_inputs + 1):
                initial_weights.append(random())

            # Add this output node to the output layer
            output_layer.append({'weights':initial_weights})

        network.append(output_layer)

    # Finished building the initial neural network
    return network

def weighted_sum_of_inputs(weights, inputs):
    """
    Calculates the weighted sum of inputs plus the bias
    :param list weights: A list of weights. Each node has a list of weights.
    :param list inputs: A list of input values. These can be a single row
        of attribute values or the output from nodes from the previous layer
    :return: weighted_sum
    :rtype: float
    """
    # We assume that the last weight is the bias value
    # The bias value is a special weight that does not multiply with an input
    # value (or we could assume its corresponding input value is always 1)
    # The bias is similar to the intercept constant b in y = mx + b. It enables
    # a (e.g. sigmoid) curve to be shifted to create a better fit
    # to the data. Without the bias term b, the line always goes through the 
    # origin (0,0) and cannot adapt as well to the data.
    # In y = mx + b, we assume b * x_0 where x_0 = 1

    # Initiate the weighted sum with the bias term. Assume the last weight is
    # the bias term
    weighted_sum = weights[-1]

    for index in range(len(weights) - 1):
        weighted_sum += weights[index] * inputs[index]

    return weighted_sum

def sigmoid(weighted_sum_of_inputs_plus_bias):
    """
    Run the weighted sum of the inputs + bias through
    the sigmoid activation function.
    :param float weighted_sum_of_inputs_plus_bias: Node summation term
    :return: sigmoid(weighted_sum_of_inputs_plus_bias)
    """
    return 1.0 / (1.0 + exp(-weighted_sum_of_inputs_plus_bias))

def forward_propagate(network, instance):
    """
    Instances move forward through the neural network from one layer
    to the next layer. At each layer, the outputs are calculated for each 
    node. These outputs are the inputs for the nodes in the next layer.
    The last set of outputs is the output for the nodes in the output 
    layer.
    :param list network: List of layers: 0+ hidden layers, 1 output layer
    :param list instance (a single training/test instance from the data set)
    :return: outputs
    :rtype: list
    """
    inputs = instance

    # For each layer in the neural network
    for layer in network:

        # These will store the outputs for this layer
        new_inputs = list()

        # For each node in this layer
        for node in layer:

            # Calculate the weighted sum + bias term
            weighted_sum = weighted_sum_of_inputs(node['weights'], inputs)

            # Run the weighted sum through the activation function
            # and store the result in this node's dictionary.
            # Now the node's dictionary has two keys, weights and output.
            node['output'] = sigmoid(weighted_sum)

            # Used for debugging
            #print(node)

            # Add the output of the node to the new_inputs list
            new_inputs.append(node['output'])

        # Update the inputs list
        inputs = new_inputs

    # We have reached the output layer
    outputs = inputs

    return outputs

def sigmoid_derivative(output):
    """
    The derivative of the sigmoid activation function with respect 
    to the weighted summation term of the node.
    Formally (after a lot of calculus), this derivative is:
        derivative = sigmoid(weighted_sum_of_inputs_plus_bias) * 
        (1 - sigmoid(weighted_sum_of_inputs_plus_bias))
                   = node_ouput * (1 - node_output)
    This method is used during the backpropagation phase. 
    :param list output: Output of a node (generated during the forward
        propagation phase)
    :return: sigmoid_der
    :rtype: float
    """
    return output * (1.0 - output)

def back_propagate(network, actual):
    """
    In backpropagation, the error is computed between the predicted output by 
    the network and the actual output as determined by the data set. This error 
    propagates backwards from the output layer to the first hidden layer. The 
    weights in each layer are updated along the way in response to the error. 
    The goal is to reduce the prediction error for the next training instance 
    that forward propagates through the network.
    :param network list: The neural network
    :param actual list: A list of the actual output from the data set
    """
    # Iterate in reverse order (i.e. starts from the output layer)
    for i in reversed(range(len(network))):

        # Work one layer at a time
        layer = network[i]

        # Keep track of the errors for the nodes in this layer
        errors = list()

        # If this is a hidden layer
        if i != len(network) - 1:

            # For each node_j in this hidden layer
            for j in range(len(layer)):

                # Reset the error value
                error = 0.0

                # Calculate the weighted error. 
                # The error values come from the error (i.e. delta) calculated
                # at each node in the layer just to the "right" of this layer. 
                # This error is weighted by the weight connections between the 
                # node in this hidden layer and the nodes in the layer just 
                # to the "right" of this layer.
                for node in network[i + 1]:
                    error += (node['weights'][j] * node['delta'])

                # Add the weighted error for node_j to the
                # errors list
                errors.append(error)
        
        # If this is the output layer
        else:

            # For each node in the output layer
            for j in range(len(layer)):
                
                # Store this node (i.e. dictionary)
                node = layer[j]

                # Actual - Predicted = Error
                errors.append(actual[j] - node['output'])

        # Calculate the delta for each node_j in this layer
        for j in range(len(layer)):
            node = layer[j]

            # Add an item to the node's dictionary with the 
            # key as delta.
            node['delta'] = errors[j] * sigmoid_derivative(node['output'])

def update_weights(network, instance, learning_rate):
    """
    After the deltas (errors) have been calculated for each node in 
    each layer of the neural network, the weights can be updated.
    new_weight = old_weight + learning_rate * delta * input_value
    :param list network: List of layers: 0+ hidden layers, 1 output layer
    :param list instance: A single training/test instance from the data set
    :param float learning_rate: Controls step size in the stochastic gradient
        descent procedure.
    """
    # For each layer in the network
    for layer_index in range(len(network)):

        # Extract all the attribute values, excluding the class value
        inputs = instance[:-1]

        # If this is not the first hidden layer
        if layer_index != 0:

            # Go through each node in the previous layer and add extract the
            # output from that node. The output from the previous layer
            # is the input to this layer.
            inputs = [node['output'] for node in network[layer_index - 1]]

        # For each node in this layer
        for node in network[layer_index]:

            # Go through each input value
            for j in range(len(inputs)):
                
                # Update the weights
                node['weights'][j] += learning_rate * node['delta'] * inputs[j]
          
            # Updating the bias weight 
            node['weights'][-1] += learning_rate * node['delta']

def train_neural_net(
    network, training_set, learning_rate, no_epochs, no_outputs):
    """
    Train a neural network that has already been initialized.
    Training is done using stochastic gradient descent where the weights
    are updated one training instance at a time rather than after the
    entire training set (as is the case with gradient descent).
    :param list network: The neural network, which is a list of layers
    :param list training_set: A list of training instances (i.e. examples)
    :param float learning_rate: Controls step size of gradient descent
    :param int no_epochs: How many passes we will make through training set
    :param int no_outputs: The number of output nodes (equal to # of classes)
    """
    # Go through the entire training set a fixed number of times (i.e. epochs)
    for epoch in range(no_epochs):
   
        # Update the weights one instance at a time
        for instance in training_set:

            # Forward propagate the training instance through the network
            # and produce the output, which is a list.
            outputs = forward_propagate(network, instance)

            # Vectorize the output using one hot encoding. 
            # Create a list called actual_output that is the same length 
            # as the number of outputs. Put a 1 in the place of the actual 
            # class.
            actual_output = [0 for i in range(no_outputs)]
            actual_output[int(instance[-1])] = 1
            
            back_propagate(network, actual_output)
            update_weights(network, instance, learning_rate)

def predict_class(network, instance):
    """
    Make a class prediction given a trained neural network and
    an instance from the test data set.
    :param list network: The neural network, which is a list of layers
    :param list instance: A single training/test instance from the data set
    :return class_prediction
    :rtype int
    """
    outputs = forward_propagate(network, instance)

    # Return the index that has the highest probability. This index
    # is the class value. Assume class values begin at 0 and go
    # upwards by 1 (i.e. 0, 1, 2, ...)
    class_prediction = outputs.index(max(outputs))
    
    return class_prediction

def calculate_accuracy(actual, predicted):
    """
    Calculates the accuracy percentages
    :param list actual: Actual class values
    :param list predicted: predicted class values
    :return: classification_accuracy
    :rtype: float (as a percentage)
    """
    number_of_correct_predictions = 0
    for index in range(len(actual)):
        if actual[index] == predicted[index]:
            number_of_correct_predictions += 1
    
    classification_accuracy = (
        number_of_correct_predictions / float(len(actual))) * 100.0
    return classification_accuracy

def get_test_set_predictions(
    training_set, test_set, learning_rate, no_epochs, 
    no_hidden_layers, no_nodes_per_hidden_layer):
    """
    This method is the workhorse. 
    A new neutal network is initialized.
    The network is trained on the training set.
    The trained neural network is used to generate predictions on the
    test data set.
    :param list training_set
    :param list test_set
    :param float learning_rate
    :param int no_epochs
    :param int no_hidden_layers
    :param int no_nodes_per_hidden_layer
    :return network, class_predictions
    :rtype list, list
    """
    # Get the number of attribute values
    no_inputs = len(training_set[0]) - 1

    # Calculate the number of unique classes
    no_outputs = len(set([instance[-1] for instance in training_set]))
    
    # Build a new neural network
    network = initialize_neural_net(
        no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs)

    train_neural_net(
        network, training_set, learning_rate, no_epochs, no_outputs)
    
    # Store the class predictions for each test instance
    class_predictions = list()
    for instance in test_set:
        cl_prediction = predict_class(network, instance)
        class_predictions.append(cl_prediction)

    # Return the learned model as well as the class predictions
    return network, class_predictions

###############################################################

def main():
    """
    The main method of the program
    """
    LEARNING_RATE = 0.1 # Used for stochastic gradient descent procedure
    NO_EPOCHS = 1000 # Epoch is one complete pass through training data
    NO_HIDDEN_LAYERS = 1 # Number of hidden layers
    NO_NODES_PER_HIDDEN_LAYER = 8 # Number of nodes per hidden layer

    # Welcome message
    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Directory where data set is located
    #data_path = input("Enter the path to your input file: ") 
    data_path = "vote.txt"

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Show functioning of the program
    #trace_runs_file = input("Enter the name of your trace runs file: ") 
    trace_runs_file = "vote_nn_trace_runs.txt"

    ## Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    #test_stats_file = input("Enter the name of your test statistics file: ") 
    test_stats_file = "vote_nn_test_stats.txt"

    ## Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Generate a list of the column names 
    column_names = list(pd_data_set) 

    # The input layer in the neural network 
    # will have one node for each attribute value
    no_of_inputs = len(column_names) - 1

    # Make a list of the unique classes
    list_of_unique_classes = pd.unique(pd_data_set["Actual Class"])

    # The output layer in the neural network 
    # will have one node for each class value
    no_of_outputs = len(list_of_unique_classes)

    # Replace all the class values with numbers, starting from 0
    # in the Pandas dataframe.
    for cl in range(0, len(list_of_unique_classes)):
        pd_data_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)

    # Normalize the attribute values so that they are all between 0 
    # and 1, inclusive
    normalized_pd_data_set = normalize(pd_data_set)

    # Convert normalized Pandas dataframe into a list
    dataset_as_list = normalized_pd_data_set.values.tolist()

    # Set the seed for random number generator
    seed(1)

    # Get a list of 5 stratified folds because we are doing
    # five-fold stratified cross-validation
    fv_folds = get_five_stratified_folds(dataset_as_list)
    
    # Keep track of the scores for each of the five experiments
    scores = list()
    
    experiment_counter = 0
    for fold in fv_folds:
        
        print()
        print("Running Experiment " + str(experiment_counter) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(
            experiment_counter) + " ...\n")
        outfile_tr.write("\n")

        # Get all the folds and store them in the training set
        training_set = list(fv_folds)

        # Four folds make up the training set
        training_set.remove(fold)        

        # Combined all the folds so that all we have is a list
        # of training instances
        training_set = sum(training_set, [])
        
        # Initialize a test set
        test_set = list()
        
        # For each instance in this test fold
        for instance in fold:
            
            # Create a copy and store it
            copy_of_instance = list(instance)
            test_set.append(copy_of_instance)
        
        # Get the trained neural network and the predicted values
        # for each test instance
        neural_net, predicted_values = get_test_set_predictions(
            training_set, test_set,LEARNING_RATE,NO_EPOCHS,
            NO_HIDDEN_LAYERS,NO_NODES_PER_HIDDEN_LAYER)
        actual_values = [instance[-1] for instance in fold]
        accuracy = calculate_accuracy(actual_values, predicted_values)
        scores.append(accuracy)

        # Print the learned model
        print("Experiment " + str(
            experiment_counter) + " Trained Neural Network")
        print()
        for layer in neural_net:
            print(layer)
        print()
        outfile_tr.write("Experiment " + str(
            experiment_counter) + " Trained Neural Network")
        outfile_tr.write("\n")
        outfile_tr.write("\n")
        for layer in neural_net:
            outfile_tr.write(str(layer))
            outfile_tr.write("\n")
        outfile_tr.write("\n\n")

        # Print the classifications on the test instances
        print("Experiment " + str(
            experiment_counter) + " Classifications on Test Instances")
        print()
        outfile_tr.write("Experiment " + str(
            experiment_counter) + " Classifications on Test Instances")
        outfile_tr.write("\n\n")
        test_df = pd.DataFrame(test_set, columns=column_names)

        # Add 2 additional columns to the testing dataframe
        test_df = test_df.reindex(
        columns=[*test_df.columns.tolist(
        ), 'Predicted Class', 'Prediction Correct?'])

        # Add the predicted values to the "Predicted Class" column
        # Indicate if the prediction was correct or not.
        for pre_val_index in range(len(predicted_values)):
            test_df.loc[pre_val_index, "Predicted Class"] = predicted_values[
                pre_val_index]
            if test_df.loc[pre_val_index, "Actual Class"] == test_df.loc[
                pre_val_index, "Predicted Class"]:
                test_df.loc[pre_val_index, "Prediction Correct?"] = "Yes"
            else:
                test_df.loc[pre_val_index, "Prediction Correct?"] = "No"

        # Replace all the class values with the name of the class
        for cl in range(0, len(list_of_unique_classes)):
            test_df["Actual Class"].replace(
                cl, list_of_unique_classes[cl] ,inplace=True)
            test_df["Predicted Class"].replace(
                cl, list_of_unique_classes[cl] ,inplace=True)

        # Print out the test data frame
        print(test_df)   
        print()
        print()
        outfile_tr.write(str(test_df))   
        outfile_tr.write("\n\n")

        # Go to the next experiment
        experiment_counter += 1
    
    print("Experiments Completed.\n")
    outfile_tr.write("Experiments Completed.\n\n")

    # Print the test stats   
    print("------------------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("------------------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print("Learning Rate: " + str(LEARNING_RATE))
    print("Number of Epochs: " + str(NO_EPOCHS))
    print("Number of Hidden Layers: " + str(NO_HIDDEN_LAYERS))
    print("Number of Nodes Per Hidden Layer: " + str(
        NO_NODES_PER_HIDDEN_LAYER))
    print()
    print("Accuracy Statistics for All 5 Experiments: %s" % scores)
    print()
    print()
    print("Classification Accuracy: %.3f%%" % (
        sum(scores)/float(len(scores))))

    outfile_ts.write(
        "------------------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write(
        "------------------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path +"\n\n")
    outfile_ts.write("Learning Rate: " + str(LEARNING_RATE) + "\n")
    outfile_ts.write("Number of Epochs: " + str(NO_EPOCHS) + "\n")
    outfile_ts.write("Number of Hidden Layers: " + str(
        NO_HIDDEN_LAYERS) + "\n")
    outfile_ts.write("Number of Nodes Per Hidden Layer: " + str(
        NO_NODES_PER_HIDDEN_LAYER) + "\n")
    outfile_ts.write(
        "Accuracy Statistics for All 5 Experiments: %s" % str(scores))
    outfile_ts.write("\n\n")
    outfile_ts.write("Classification Accuracy: %.3f%%" % (
        sum(scores)/float(len(scores))))

    ## Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Return to Table of Contents

Artificial Feedforward Neural Network Trained with Backpropagation Output

Here are the trace runs:

Here are the results:

full-results-neural-network

Here are the test statistics for each data set:

Analysis

Breast Cancer Data Set

The breast cancer data set results were in line with what I expected. The simpler model, the one with no hidden layers, ended up generating the highest classification accuracy. Classification accuracy was just short of 97%. In other words, the neural network that had no hidden layers successfully classified a patient as either malignant or benign with an almost 97% accuracy.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data (699 instances in this case) can lead to better learning and better classification accuracy on new, unseen instances.

Glass Data Set

The performance of the neural network on the glass data set was the worst out of all of the data sets. The ability of the network to correctly identify the type of glass given the attribute values never exceeded 70%.

I hypothesize that the poor performance on the glass data set is due to the high numbers of classes combined with a relatively smaller data set.

Iris Data Set

Classification accuracy was superb on the iris dataset, attaining a classification accuracy around 97%. The results of the iris dataset were surprising given that the more complicated neural network with two hidden layers and five nodes per hidden layer outperformed the simpler neural network that had no hidden layers. In this case, it appears that the iris dataset benefited from the increasing layers of abstraction provided by a higher number of layers.

Soybean Data Set (small)

Performance on the soybean data set was stellar and was the highest of all of the data sets but also had the largest standard deviation for the classification accuracy. Note that classification accuracy reached a peak of 100% using one hidden layer and eight nodes for the hidden layer. However, when I added an additional hidden layer, classification accuracy dropped to under 70%.

The reason for the high standard deviation of the classification accuracy is unclear, but I hypothesize it has to do with the relatively small number of training instances. Future work would need to be performed with the soybean large dataset available from the UCI Machine Learning Repository to see if these results remain consistent.

The results of the soybean runs suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

Vote Data Set

The vote data set did not yield the stellar performance of the soybean data set, but classification accuracy was still solid at ~96% using one hidden layer and eight nodes per hidden layer. These results are in line with what I expected because voting behavior should provide a powerful predictor of whether a candidate is a Democrat or Republican. I would have been surprised had I observed classification accuracies that were lower since members of Congress tend to vote along party lines on most issues.

Summary and Conclusions

My hypothesis was incorrect. In some cases, simple neural networks with no hidden layers outperformed more complex neural networks with 1+ hidden layers. However, in other cases, more complex neural networks with multiple hidden layers outperformed the network with no hidden layers. The reason why some data is more amenable to networks with hidden layers instead of without hidden layers is unclear.

Other conclusions include the following:

  • Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
  • Large numbers of relevant attributes can help a neural network create more accurate classifications.
  • Neural networks are powerful and can achieve excellent results on both binary and multi-class classification problems.

Return to Table of Contents

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Ĭordanov, I., & Jain, L. C. (2013). Innovations in Intelligent Machines –3 : Contemporary Achievements in Intelligent Systems. Berlin: Springer.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methodes of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Montavon, G. O. (2012). Neural Networks : Tricks of the Trade. New York: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25

Return to Table of Contents

Logistic Regression Algorithm From Scratch

In this post, I will walk you through the Logistic Regression algorithm step-by-step.

  • We will develop the code for the algorithm from scratch using Python.
  • We will run the algorithm on real-world data sets from the UCI Machine Learning Repository.

Table of Contents

What is Logistic Regression?

Logistic regression, contrary to the name, is a classification algorithm. Unlike linear regression which outputs a continuous value (e.g. house price) for the prediction, Logistic Regression transforms the output into a probability value (i.e. a number between 0 and 1) using what is known as the logistic sigmoid function. This function is also known as the squashing function since it maps a line — that can run from negative infinity to positive infinity along the y-axis — to a number between 0 and 1.

Here is what the graph of the sigmoid function looks like:

sigmoid-curve
Source: (Kelleher, Namee, & Arcy, 2015)

The function is called the sigmoid function because it is s-shaped. Here is what the sigmoid function looks like in mathematical notation:

sigmoid-mathematical-equation

where:

  • h(z) is the predicted probability of a given instance (i.e. example) being in the positive class…that is the class represented as 1 in a data set. For example, in an e-mail classification data set, this would be the probability that a given e-mail instance is spam (If h(z) = 0.73, for example, that would mean that the instance has a 73% chance of being spam).
  • 1- h(z) is the probability of an instance being in the negative class, the class represented as 0 (e.g. not spam). h(z) is always a number between 0 and 1. Going back to the example in the bullet point above, this would mean that the instance has a 27% change of being not spam.
  • z is the input (e.g. a weighted sum of the attributes of a given instance)
  • e is Euler’s number

z is commonly expressed as the dot product, w · x, where w is a 1-dimensional vector containing the weights for each attribute, and x is a vector containing the values of each attribute for a specific instance of the data set (i.e. example).

Often the dot product, w · x, is written as matrix multiplication. In that case, z = wTx where T means transpose of the single dimensional weight vector w. The symbol Ɵ is often used in place of w.

So substituting w · x into the sigmoid equation, we getthe following equation:

sigmoid-curve-2

where

  • w is a 1-dimensional vector containing the weights for each attribute.
  • The subscript w on hw means the attributes x are weighted by the weight vector w.
  • hw(x) is the probability (a value between 0 and 1) that an instance is a member of the positive class (i.e. probability an e-mail is spam).
  • x is a vector containing the values of each attribute for a specific instance of the data set.
  • w · x = w0x0 + w1x1 + w2x2 + …. + wdx(analogous to the equation of a line y = mx + b from grade school)
    • d is the number of attributes in the data set
    • x0 = 1 by convention, for all instances. This attribute has to be added by the programmer for all instances. It is known formally as the “bias” term.

As is the case for many machine learning algorithms, the starting point for Logistic Regression is to create a trained model. One we have a trained model, we can use it to make predictions on new, unseen instances.

Training Phase

Creating a trained model entails determining the weight vector w. Once we have the weights, we can make predictions on new unseen examples. All we need are the values of the attributes of those examples (i.e. the x values), and we can weight the x values with the values of w to compute the probabilities h(x) for that example using the sigmoid function.

The rule for making predictions using the sigmoid function is as follows:

  • If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
  • If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)

To determine the weights in linear regression, the sum of the squared error was the cost function (where error = actual values – predicted values by the line). The cost function represents how wrong a prediction is. In linear regression, it represents how wrong a line of best fit is on a set of observed training instances. The lower the sum of the squared error, the better a line fits the training data, and, in theory, the better the line will predict new, unseen instances.

Instead, the cost function in Logistic Regression is called cross-entropy. Without getting too detailed into the mathematics and notation of this particular equation, the cross-entropy equation is the one that we want to minimize. Minimizing this equation will yield us a sigmoid curve that best fits the training data and enables us to make the best classification predictions possible for new, unseen test instances. A minimum of the cost function is attained when the gradient of the cost function is close to zero (i.e. the calculated weights stop changing). The formal term for the gradient of the cost function getting close to zero is called convergence.

In order to minimize the cost function, we need to find its gradient (i.e. derivative, slope, etc.) and determine the values for the weight vector w that make its derivative as close to 0 as possible. We cannot just set the gradient to 0 and then enter x-values and calculate the weights directly. Instead, we have to use a method called gradient descent in order to find the weights.

In the gradient descent algorithm for Logistic Regression, we:

  1. Start off with an empty weight vector (initialized to random values between -0.01 and 0.01). The size of the vector is equal to the number of attributes in the data set.
  2. Initialize an empty weight change vector initialized to all zeros. The size of the vector is equal to the number of attributes in the data set.
  3. For each training instance, one at a time.
    • a. Make a probability prediction by calculating the weighted sum of the attribute values and running that value through the sigmoid function.
    • b. We evaluate the gradient of the cost function by plugging in the actual (i.e. observed) class value and the predicted class value from bullet point 3a above.
    • c. The gradient value from 3b gets added to the weight change vector.
  4. After we finish with the last training instance from 3, we multiply each value in the weight change vector by a learning rate (commonly 0.01).
  5. The vector from 4 gets added to the empty weight vector to update the weights.
  6. We then ask two questions
    • a. Have the weights continued to change (i.e. is the norm (i.e. magnitude) of the weight change vector less than a certain threshold like 0.001)?
    • b. Have we been through the data set less than 10,000 (or whatever we set the maximum iterations to) times?
    • c. If the answer is yes to both 6a and 6b, go back to step 2. Otherwise, we return the final weight vector, exiting the algorithm.

The gradient descent pseudocode for Logistic Regression is provided in Figure 10.6 of Introduction to Machine Learning by Ethem Alpaydin (Alpaydin, 2014).

Testing Phase

Once training is completed, we have the weights and can use these weights, attribute values, and the sigmoid function to make predictions for the set of test instances.

Predictions for a given test instance are made using the aforementioned sigmoid function:

sigmoid-curve-2-1

Where the rule for making predictions using the sigmoid function is as follows:

  • If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
  • If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)

Multi-class Logistic Regression

A Multi-class Logistic Regression problem is a twist on the binary Logistic Regression method presented above. Multi-class Logistic Regression can make predictions on both binary and multi-class classification problems.

In order to make predictions for multi-class datasets, we take the training set and create multiple separate binary classification problems (one for each class in the data set). For each of those training sets that we generated, we set the class values for one class to 1 (representing the positive class), and we set all other classes to 0 (i.e. the negative class).

In other words, if there are k classes in a data set, k separate training sets are generated. In each of those k separate training sets, one class is set to 1 and all other classes are set to 0.

In Multi-class Logistic Regression, the training phase entails creating k different weight vectors, one for each class rather than just a single weight vector (which was the case in binary Logistic Regression). Each weight vector will help to predict the probability of an instance being a member of that class. Thus, in the testing phase, when there is an unseen new instance, three different predictions need to be made. This method is called the one-vs-all strategy, sometimes called one-vs-rest.

The rule for making predictions for a given instance are as follows:

  • For each new test instance,
    • Make k separate probability predictions.
    • Pick the class that has the highest probability (i.e. the class that is the most enthusiastic about that instance being a member of its class)

Other multi-class Logistic Regression algorithms include Softmax Regression and the one-vs-one strategy. The one-vs-all strategy was selected due to its popularity as being the default strategy used in practice for many of the well-known machine learning libraries for Python (Rebala, Ravi, & Churiwala, 2019)

Video

Here is an excellent video on logistic regression that explains the whole process I described above, step-by-step.

Return to Table of Contents

Logistic Regression Algorithm Design

The Logistic Regression algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

Required Data Set Format for Logistic Regression

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  • N: Actual Class

The program then adds two additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

Modification of Attribute Values

The actual class value was changed to “Benign” or “Malignant.”

I transformed the attributes into binary numbers so that the algorithms could process the data properly and efficiently. If attribute value was greater than 5, the value was changed to 1, otherwise it was 0.

Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There are no missing values in this data set.

Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There were no missing attribute values.

Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There are no missing values in this data set.

Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

  • 267 Democrats
  • 168 Republicans

Modification of Attribute Values

I did the following modifications:

  • Changed all “y” to 1 and all “n” to 0.

Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).

Description of Any Tuning Process Applied

Some tuning was performed in this project. The learning rate was set to 0.01 by convention. A higher learning rate (0.5) resulted in poor results for the norm of the gradient (>1).

The stopping criteria for gradient descent was as follows:

  • Maximum iterations = 10,000
  • Euclidean norm of weight change vector < 0.001

When I tried max iterations at 100, the Euclidean norm of the weight change vector returned high values (> 0.2) which indicated that I needed to set a higher max iterations value in order to have a higher chance of convergence (i.e. weights stop changing) based on the norm stopping criteria.

Return to Table of Contents

Logistic Regression Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the driver code. This is where the main method is located:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
import five_fold_stratified_cv
import logistic_regression

# File name: logistic_regression_driver.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Driver of the logistic_regression.py program

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# The logistic_regression.py program then adds 2 additional columns 
# for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

ALGORITHM_NAME = "Logistic Regression"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 
    #data_path = "iris.txt"

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 
    #trace_runs_file = "iris_logistic_regression_trace_runs.txt"

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 
    #test_stats_file = "iris_logistic_regression_test_stats.txt"

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
        pd_data_set)

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run Logistic Regression the designated number of times as indicated by the 
    # number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = pd.concat([
               fold1, fold2, fold3, fold4], ignore_index=True, sort=False)                
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = pd.concat([
               fold0, fold2, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = pd.concat([
               fold0, fold1, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold4], ignore_index=True, sort=False) 
        else:
            test_dataset = fold4
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold3], ignore_index=True, sort=False) 
        
        accuracy, predictions, weights_for_each_class, no_of_instances_test = (
        logistic_regression.logistic_regression(training_dataset,test_dataset))

        # Print the trace runs of each experiment
        print("Accuracy:")
        print(str(accuracy * 100) + "%")
        print()
        print("Classifications:")
        print(predictions)
        print()
        print("Learned Model:")
        print(weights_for_each_class)
        print()
        print("Number of Test Instances:")
        print(str(no_of_instances_test))
        print() 

        outfile_tr.write("Accuracy:")
        outfile_tr.write(str(accuracy * 100) + "%\n\n")
        outfile_tr.write("Classifications:\n")
        outfile_tr.write(str(predictions) + "\n\n")
        outfile_tr.write("Learned Model:\n")
        outfile_tr.write(str(weights_for_each_class) + "\n\n")
        outfile_tr.write("Number of Test Instances:")
        outfile_tr.write(str(no_of_instances_test) + "\n\n")

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.\n")

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")
    outfile_ts.write("\n")
    accuracy = np.mean(accuracy_statistics)
    accuracy *= 100
    outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
   
    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    print()
    print("Classification Accuracy : " + str(accuracy) + "%")
    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Here is the code for logistic regression:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
 
# File name: logistic_regression.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Multi-class logistic regression using one-vs-all. 
 
# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
 
# This program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

def sigmoid(z):
    """
    Parameters:
        z: A real number
    Returns: 
        1.0/(1 + np.exp(-z))
    """
    return 1.0/(1 + np.exp(-z))

def gradient_descent(training_set):
    """
    Gradient descent for logistic regression. Follows method presented
    in the textbook Introduction to Machine Learning 3rd Edition by 	
    Ethem Alpaydin (pg. 252)

    Parameters:
      training_set: The training instances as a Numpy array
    Returns:
      weights: The vector of weights, commonly called w or THETA
    """   

    no_of_columns_training_set = training_set.shape[1]
    no_of_rows_training_set = training_set.shape[0]

    # Extract the attributes from the training set.
    # x is still a 2d array
    x = training_set[:,:(no_of_columns_training_set - 1)]
    no_of_attributes = x.shape[1]

    # Extract the classes from the training set.
    # actual_class is a 1d array.
    actual_class = training_set[:,(no_of_columns_training_set - 1)]

    # Set a learning rate
    LEARNING_RATE = 0.01

    # Set the maximum number of iterations
    MAX_ITER = 10000

    # Set the iteration variable to 0
    iter = 0

    # Set a flag to determine if we have exceeded the maximum number of
    # iterations
    exceeded_max_iter = False

    # Set the tolerance. When the euclidean norm of the gradient vector 
    # (i.e. magnitude of the changes in the weights) gets below this value, 
    # stop iterating through the while loop
    GRAD_TOLERANCE = 0.001
    norm_of_gradient = None

    # Set a flag to determine if we have reached the minimum of the 
    # cost (i.e. error) function.
    converged = False

    # Create the weights vector with random floats between -0.01 and 0.01
    # The number of weights is equal to the number of attributes
    weights = np.random.uniform(-0.01,0.01,(no_of_attributes))
    changes_in_weights = None

    # Keep running the loop below until convergence on the minimum of the 
    # cost function or we exceed the max number of iterations
    while(not(converged) and not(exceeded_max_iter)):
        
        # Initialize a weight change vector that stores the changes in 
        # the weights at each iteration
        changes_in_weights = np.zeros(no_of_attributes)

        # For each training instance
        for inst in range(0, no_of_rows_training_set):

            # Calculate weighted sum of the attributes for
            # this instance
            output = np.dot(weights, x[inst,:])
                
            # Calculate the sigmoid of the weighted sum
            # This y is the probability that this instance belongs
            # to the positive class
            y =  sigmoid(output)

            # Calculate difference
            difference = (actual_class[inst] - y)

            # Multiply the difference by the attribute vector
            product = np.multiply(x[inst,:], difference)

            # For each attribute, update the weight changes 
            # i.e. the gradient vector
            changes_in_weights = np.add(changes_in_weights,product)
        
        # Calculate the step size
        step_size = np.multiply(changes_in_weights, LEARNING_RATE)

        # Update the weights vector
        weights = np.add(weights, step_size)

        # Test to see if we have converged on the minimum of the error
        # function
        norm_of_gradient = np.linalg.norm(changes_in_weights)

        if (norm_of_gradient < GRAD_TOLERANCE):
            converged = True

        # Update the number of iterations
        iter += 1

        # If we have exceeded the maximum number of iterations
        if (iter > MAX_ITER):
            exceeded_max_iter = True

    #For debugging purposes
    #print("Number of Iterations: " + str(iter - 1))
    #print("Norm of the gradient: " + str(norm_of_gradient))
    #print(changes_in_weights)
    #print()
    return weights


def logistic_regression(training_set, test_set):
    """
    Multi-class one-vs-all logistic regression
    Parameters:
      training_set: The training instances as a Pandas dataframe
      test_set: The test instances as a Pandas dataframe
    Returns:
      accuracy: Classification accuracy as a decimal
      predictions: Classifications of all the test instances as a 
        Pandas dataframe
      weights_for_each_class: The weight vectors for each class (one-vs-all)
      no_of_instances_test: The number of test instances
    """   

    # Remove the instance ID column
    training_set = training_set.drop(
        training_set.columns[[0]], axis=1)
    test_set = test_set.drop(
        test_set.columns[[0]], axis=1)

    # Make a list of the unique classes
    list_of_unique_classes = pd.unique(training_set["Actual Class"])

    # Replace all the class values with numbers, starting from 0
    # in both the test and training sets.
    for cl in range(0, len(list_of_unique_classes)):
        training_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)
        test_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)

    # Insert a column of 1s in column 0 of both the training
    # and test sets. This is the bias and helps with gradient
    # descent. (i.e. X0 = 1 for all instances)
    training_set.insert(0, "Bias", 1)
    test_set.insert(0, "Bias", 1)

    # Convert dataframes to numpy arrays
    np_training_set = training_set.values
    np_test_set = test_set.values

    # Add 2 additional columns to the testing dataframe
    test_set = test_set.reindex(
        columns=[*test_set.columns.tolist(
        ), 'Predicted Class', 'Prediction Correct?'])

    ############################# Training Phase ##############################

    no_of_columns_training_set = np_training_set.shape[1]
    no_of_rows_training_set = np_training_set.shape[0]

    # Create and store a training set for each unique class
    # to create separate binary classification
    # problems
    trainingsets = []
    for cl in range(0, len(list_of_unique_classes)):

        # Create a copy of the training set
        temp = np.copy(np_training_set)

        # This class becomes the positive class 1
        # and all other classes become the negative class 0
        for row in range(0, no_of_rows_training_set):
            if (temp[row, (no_of_columns_training_set - 1)]) == cl:
                temp[row, (no_of_columns_training_set - 1)] = 1
            else:
                temp[row, (no_of_columns_training_set - 1)] = 0
        
        # Add the new training set to the trainingsets list
        trainingsets.append(temp)

    # Calculate and store the weights for the training set
    # of each class. Execute gradient descent on each training set
    # in order to calculate the weights
    weights_for_each_class = []

    for cl in range(0, len(list_of_unique_classes)):
        weights_for_this_class = gradient_descent(trainingsets[cl])
        weights_for_each_class.append(weights_for_this_class)

    # Used for debugging
    #print(weights_for_each_class[0])
    #print()
    #print(weights_for_each_class[1])
    #print()
    #print(weights_for_each_class[2])

    ########################### End of Training Phase #########################

    ############################# Testing Phase ###############################

    no_of_columns_test_set = np_test_set.shape[1]
    no_of_rows_test_set = np_test_set.shape[0]

    # Extract the attributes from the test set.
    # x is still a 2d array
    x = np_test_set[:,:(no_of_columns_test_set - 1)]
    no_of_attributes = x.shape[1]

    # Extract the classes from the test set.
    # actual_class is a 1d array.
    actual_class = np_test_set[:,(no_of_columns_test_set - 1)]

    # Go through each row (instance) of the test data
    for inst in range(0,  no_of_rows_test_set):

        # Create a scorecard that keeps track of the probabilities of this
        # instance being a part of each class
        scorecard = []

        # Calculate and store the probability for each class in the scorecard
        for cl in range(0, len(list_of_unique_classes)):

            # Calculate weighted sum of the attributes for
            # this instance
            output = np.dot(weights_for_each_class[cl], x[inst,:])

            # Calculate the sigmoid of the weighted sum
            # This is the probability that this instance belongs
            # to the positive class
            this_probability = sigmoid(output)

            scorecard.append(this_probability)

        most_likely_class = scorecard.index(max(scorecard))

        # Store the value of the most likely class in the "Predicted Class" 
        # column of the test_set data frame
        test_set.loc[inst, "Predicted Class"] = most_likely_class

        # Update the 'Prediction Correct?' column of the test_set data frame
        # 1 if correct, else 0
        if test_set.loc[inst, "Actual Class"] == test_set.loc[
            inst, "Predicted Class"]:
            test_set.loc[inst, "Prediction Correct?"] = 1
        else:
            test_set.loc[inst, "Prediction Correct?"] = 0

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (test_set["Prediction Correct?"].sum())/(len(test_set.index))

    # Store the revamped dataframe
    predictions = test_set

    # Replace all the class values with the name of the class
    for cl in range(0, len(list_of_unique_classes)):
        predictions["Actual Class"].replace(
            cl, list_of_unique_classes[cl] ,inplace=True)
        predictions["Predicted Class"].replace(
            cl, list_of_unique_classes[cl] ,inplace=True)

    # Replace 1 with Yes and 0 with No in the 'Prediction 
    # Correct?' column
    predictions['Prediction Correct?'] = predictions[
        'Prediction Correct?'].map({1: "Yes", 0: "No"})

    # Reformat the weights_for_each_class list of arrays
    weights_for_each_class = pd.DataFrame(np.row_stack(weights_for_each_class))
 
    # Rename the row names
    for cl in range(0, len(list_of_unique_classes)):
        row_name = str(list_of_unique_classes[cl] + " weights")        
        weights_for_each_class.rename(index={cl:row_name}, inplace=True)

    # Get a list of the names of the attributes
    training_set_names = list(training_set.columns.values)
    training_set_names.pop() # Remove 'Actual Class'

    # Rename the column names
    for col in range(0, len(training_set_names)):
        col_name = str(training_set_names[col])        
        weights_for_each_class.rename(columns={col:col_name}, inplace=True)

    # Record the number of test instances
    no_of_instances_test = len(test_set.index)

    # Return statement
    return accuracy, predictions, weights_for_each_class, no_of_instances_test

Here is the code for five-fold stratified cross-validation:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

def get_five_folds(instances):
    """
    Parameters:
        instances: A Pandas data frame containing the instances
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Shuffle the data set randomly
    instances = instances.sample(frac=1).reset_index(drop=True)

    # Record the number of columns in the data set
    no_of_columns = len(instances.columns) # number of columns

    # Record the number of rows in the data set
    no_of_rows = len(instances.index) # number of rows

    # Create five empty folds (i.e. Panda Dataframes: fold0 through fold4)
    fold0 = pd.DataFrame(columns=(instances.columns))
    fold1 = pd.DataFrame(columns=(instances.columns))
    fold2 = pd.DataFrame(columns=(instances.columns))
    fold3 = pd.DataFrame(columns=(instances.columns))
    fold4 = pd.DataFrame(columns=(instances.columns))

    # Record the column of the Actual Class
    actual_class_column = no_of_columns - 1

    # Generate an array containing the unique 
    # Actual Class values
    unique_class_list_df = instances.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    unique_class_list_np_size = unique_class_list_np.size

    # For each unique class in the unique Actual Class array
    for unique_class_list_np_idx in range(0, unique_class_list_np_size):

        # Initialize the counter to 0
        counter = 0

        # Go through each row of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for row in range(0, no_of_rows):

            # If the value of the unique class is equal to the actual
            # class in the original data set on this row
            if unique_class_list_np[unique_class_list_np_idx] == (
                instances.iloc[row,actual_class_column]):

                    # Allocate instance to fold0
                    if counter == 0:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold0.loc[len(fold0)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold1
                    elif counter == 1:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold1.loc[len(fold1)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold2
                    elif counter == 2:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold2.loc[len(fold2)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold3
                    elif counter == 3:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold3.loc[len(fold3)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold4
                    else:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold4.loc[len(fold4)] = new_row
                                    
                        # Reset counter to 0
                        counter = 0
        
    return fold0, fold1, fold2, fold3, fold4

Return to Table of Contents

Logistic Regression Output

Here are the trace runs:

Here are the results:

logistic-regression-results

Here are the test statistics for each data set:

Analysis

Breast Cancer Data Set

I hypothesize that performance was high on this algorithm because of the large number of instances (699 in total). This data set had the highest number of instances out of all the data sets.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.

Glass Data Set

I hypothesize that the poor performance on the glass data set is due to the high numbers of classes combined with a relatively smaller data set.

Iris Data Set

Classification accuracy on the iris data set was satisfactory. This data set was small, and more training data would be needed to see if accuracy could be improved by giving the algorithm more data to learn the underlying relationship between the attributes and the flower types.

Soybean Data Set (small)

I hypothesize that the large numbers of attributes in the soybean data set (35) helped balance the relatively small number of training instances. These results suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

Vote Data Set

The results show that classification algorithms like Logistic Regression can have outstanding performance on large data sets that are binary classification problems.

Summary and Conclusions

  • Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
  • Large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.
  • Classification algorithms like Logistic Regression can achieve excellent classification accuracy on binary classification problems, but performance on multi-class classification algorithms can yield mixed results.

Return to Table of Contents

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Kelleher, J. D., Namee, B., & Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. Cambridge, Massachusetts: The MIT Press.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methodes of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Rebala, G., Ravi, A., & Churiwala, S. (2019). An Introduction to Machine Learning. Switzerland: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25

Y. Ng, A., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic , 841-848.

Return to Table of Contents

How to Set Up Anaconda for Windows 10

In this post, I will show you how to set up Anaconda. Anaconda is a free, open-source distribution of Python (and R). The goal of Anaconda is to be a free “one-stop-shop” for all your Python data science and machine learning needs. It contains the key packages you need to build cool machine learning projects.

Requirements

Here are the requirements:

  • Set up Anaconda.
  • Set up Jupyter Notebook.
  • Install important libraries.
  • Learn basic Anaconda commands.

Directions

Install Anaconda

Go to the Anaconda website and click “Download.”

setting-up-anaconda-1

Choose the latest version of Python. In my case, that is Python 3.7. Click “Download” to download the Anaconda installer for your operating system type (i.e. Windows, macOS, or Linux). 

setting-up-anaconda-2

Follow the instructions to install the program:

setting-up-anaconda-3
setting-up-anaconda-4
setting-up-anaconda-5
setting-up-anaconda-6

Verify Anaconda is installed by searching for “Anaconda Navigator” on your computer.

Open Anaconda Navigator.

setting-up-anaconda-7

Follow the instructions here for creating a “Hello World” program. You can use Spyder, Jupyter Notebooks, or the Anaconda Prompt (terminal). If you use Jupyter Notebooks, you will need to open the notebooks in Firefox, Google Chrome or another web browser.

Check to make sure that you have IPython installed. Use the following command (in an Anaconda Prompt window) to check:

where ipython
setting-up-anaconda-8

Make sure that you have pip installed. Pip is the package management system for Python.

where pip
setting-up-anaconda-9

Make sure that you have conda installed. Conda is Anaconda’s package management system.

where conda
setting-up-anaconda-10

Install Some Libraries

Install OpenCV

To install OpenCV, use the following command in the Anaconda Prompt:

pip install opencv-contrib-python

Type the following command to get a list of packages and make sure opencv-contrib-python is installed.

conda list

Install dlib

Install cmake.

pip install cmake

Install the C++ library called dlib

pip install dlib
setting-up-anaconda-11

Type the following command and take a look at the list to see if dlib is successfully installed:

conda list

Install Tesseract

Go to Tesseract at UB Mannheim.

Download the Tesseract for your system.

Set it up by following the prompts.

setting-up-anaconda-12

Once Tesseract OCR is downloaded, find it on your system.

Copy the name of the file it is located in. In my case, that is:

C:\Program Files\Tesseract-OCR

Search for “Environment Variables” on your computer.

setting-up-anaconda-13

Under “System Variables,” click “Path,” and then click Edit.

Add the path: C:\Program Files\Tesseract-OCR

setting-up-anaconda-14

Click OK a few times to close all windows.

Open up the Anaconda Prompt.

Type this command to see if tesseract is installed on your system.

where tesseract

Now, apply the Python binding to the packages using the following commands:

pip install tesseract
pip install pytesseract

Install TensorFlow

Type the following command in the Anaconda Prompt:

pip install tensorflow

Install TensorFlow hub using this command:

pip install tensorflow-hub

Now install tflearn.

pip install tflearn

Now, install the Keras neural network library.

pip install keras

Install the Rest of the Libraries

Type the following commands to install the rest of the libraries:

pip install pillow
pip install SimpleITK

Learn Basic Anaconda Commands

Changing Directories

If you ever want to change directories to the D drive instead of the C drive, open Anaconda Prompt on your computer and type the following commands, in order

D:
cd D:\XXXX\XXXX\XXXX\XXXX

where D:\XXXX\XXXX\XXXX\XXXX is the file path.

Listing Directory Contents

Type the dir command to list the contents of a directory.

dir

Creating a Jupyter Notebook

Open Anaconda Prompt on your computer and type the following command:

jupyter notebook

Converting a Jupyter Notebook into Python Files

If you want to convert a Jupyter Notebook into Python files, change to that directory and type the following command:

jupyter nbconvert --to script *.ipynb

Congratulations if you made it this far! You have all the libraries installed that you need to do fundamental image processing and computer vision work in Python.

Iterative Dichotomiser 3 (ID3) Algorithm From Scratch

In this post, I will walk you through the Iterative Dichotomiser 3 (ID3) decision tree algorithm step-by-step. We will develop the code for the algorithm from scratch using Python. We will also run the algorithm on real-world data sets from the UCI Machine Learning Repository.

Table of Contents

What is the Iterative Dichotomiser 3 Algorithm?

Iterative Dichotomiser 3 (ID3)

Unpruned

In the unpruned ID3 algorithm, the decision tree is grown to completion (Quinlan, 1986). The Iterative Dichotomiser 3 (ID3) algorithm is used to create decision trees and was invented by John Ross Quinlan. The decision trees in ID3 are used for classification, and the goal is to create the shallowest decision trees possible. 

For example, consider a decision tree to help us determine if we should play tennis or not based on the weather:

  • The interior nodes of the tree are the decision variables that are based on the attributes in the data set (e.g. weather outlook [attribute]: sunny, overcast, rainy [attribute values])
  • The leaves of the tree store the classifications (e.g. play tennis or don’t play tennis)

Before we get into the details of ID3, let us examine how decision trees work.

How decision trees work

We start with a data set in which each row of the data set corresponds to a single instance. The columns of the data set are the attributes, where the rightmost column is the class label. 

Each attribute can take on a set of values. During training of a decision tree, we split the instances in the data set into subsets (i.e. groups) based on the values of the attributes. For example, the weather outlook attribute can be used to create a subset of ‘sunny’ instances, ‘overcast’, and ‘rainy’ instances. We then split each of those subsets even further based on another attribute (e.g. temperature). 

The objective behind building a decision tree is to use the attribute values to keep splitting the data into smaller and smaller subsets (recursively) until each subset consists of a single class label (e.g. play tennis or don’t play tennis). We can then classify fresh test instances based on the rules defined by the decision tree.

How ID3 Works

Starting from the root of the tree, ID3 builds the decision tree one interior node at a time where at each node we select the attribute which provides the most information gain if we were to split the instances into subsets based on the values of that attribute. How is “most information” determined? It is determined using the idea of entropy reduction which is part of Shannon’s Information Theory. 

Entropy is a measure of the amount of disorder or uncertainty (units of entropy are bits). A data set with a lot of disorder or uncertainty does not provide us a lot of information. 

A good way to think about entropy is how certain we would feel if we were to guess the class of a random training instance. A data set in which there is only one class has 0 entropy (high information here because we know with 100% certainty what the class is given an instance). However, if there is a data set in which each instance is a different class, entropy would be high because you have no certainty when trying to predict the class of a random instance. 

Entropy is thus a measure of how heterogeneous a data set is. If there are many different class types in a data set, and each class type has the same probability of occurring, the entropy (also impurity) of the data set will be high. The greater the entropy is, the higher the uncertainty, and the less information gained. 

Getting back to the running example, in a binary classification problem with positive instances p (play tennis) and negative instances n (don’t play tennis), the entropy contained in a data set is defined mathematically as follows (base 2 log is used as convention in Shannon’s Information Theory).

I(p,n) = entropy of a data set 

= weighted sum of the logs of the probabilities of each possible outcome 

= -[(p/(p+n))log2(p/(p+n)) + (n/(p+n))log2(n/(p+n))]

= -(p/(p+n))log2(p/(p+n)) + -(n/(p+n))log2(n/(p+n))

Where:

  • I = entropy
  • p = number of positive instances (e.g. number of play tennis instances)
  • n = number of negative instances (e.g. number of don’t play tennis instances)

The negative sign at the beginning of the equation is used to change the negative values returned by the log function to positive values (Kelleher, Namee, & Arcy, 2015). This negation ensures that we the logarithmic term returns small numbers for high probabilities (low entropy; greater certainty) and large numbers for small probabilities (higher entropy; less certainty).

Notice that the weights in the summation are the actual probabilities themselves. These weights make sure that classes that appear most frequently in a data set make the biggest impact on the entropy of the data set.

Also notice that if there are only instances with class p (i.e. play tennis) and none with class n (i.e. don’t play tennis) in the data set, the entropy will be 0.

I(p,n) = -(p/(p+n))log2(p/(p+n)) +  -(n/(p+n))log2(n/(p+n))

= -(p/(p+0))log2(p/(p+0)) +  -(0/(p+0))log2(0/(p+0))

= -(p/p)log2(1) + 0

= 0 + 0

= 0

It is easiest to explain the full ID3 algorithm using actual numbers, so below I will demonstrate how the ID3 algorithm works using an example.

ID3 Example

Continuing from the example in the previous section, we want to create a decision tree that will help us determine if we should play tennis or not.

We have four attributes in our data set:

  • Weather outlook: sunny, overcast, rain
  • Temperature: hot, mild, cool
  • Humidity: high, normal
  • Wind: weak, strong

We have two classes:

  • p = play tennis class (“Yes”)
  • n = don’t play tennis class (“No”)

Here is the data set to train the tree:

id3-1

Source: (Mitchell, 1997)

Step 1: Calculate the Prior Entropy of the Data Set

We have 14 total instances: 9 instances of p and 5 instances of n. With the frequency counts of each unique class, we can calculate the prior entropy of this data set where:

  • p = number of positive instances (e.g. number of play tennis instances)
  • n = number of negative instances (e.g. number of don’t play tennis instances)

Prior Entropy of Data Set = I(p,n)

= weighted sum of the logs of the probabilities of each possible outcome 

= -[(p/(p+n))log2(p/(p+n)) + (n/(p+n))log2(n/(p+n))]

= -(p/(p+n))log2(p/(p+n)) + -(n/(p+n))log2(n/(p+n)) 

= -(9/(9+5))log2(9/(9+5)) + -(5/(9+5))log2(5/(9+5))

= -(9/(9+5))log2(9/(9+5)) + -(5/(9+5))log2(5/(9+5))

= 0.9403

This value above is indicative of how much entropy is remaining prior to doing any splitting.

Step 2: Calculate the Information Gain for Each Attribute

For each attribute in the data set (e.g. outlook, temperature, humidity, and wind): 

  • Calculate the total number of instances in the data set
  • Initialize a running weighted entropy score sum variable to 0.
  • Partition the data set instances into subsets based on the attribute values. For example, weather outlook is an attribute. We create the “sunny” subset, “overcast” subset, and “rainy” subset. Therefore, this attribute would create three partitions (or subsets).
  • For each attribute value subset:
    • Calculate the number of instances in this subset
    • Calculate the entropy score of this subset using the frequency counts of each unique class.
    • Calculate the weighting of this attribute value by dividing the number of instances in this subset by the total number of instances in the data set
    • Add the weighted entropy score for this attribute value to the running weighted entropy score sum.
  • The final running weighted entropy score sum is indicative of how much entropy would remain if we were to split the current data set using this attribute. It is the remaining entropy for this attribute.
  • Calculate the information gain for this attribute where:
    • Information gain = prior entropy of the data set from Step 1 – remaining entropy for this attribute from Step 2
  • Record the information gain for this attribute and repeat the process above for the other attributes

For example, consider the first attribute (e.g. weather outlook). The remaining entropy for outlook can be calculated as follows:

Remaining Entropy for Weather Outlook  =

= (# of sunny/# of instances)*Isunny(p,n) + (# of overcast/# of instances)*Iovercast(p,n)  + (# of rainy/# of instances)*Irainy(p,n) 

= (5/14)*Isunny(p,n) + (4/14)*Iovercast(p,n)  + (5/14)*Irainy(p,n) 

Where:

Isunny(p,n) = I(number of sunny AND Yes, number of sunny AND No) 

= I(2,3)

= -(p/(p+n))log2(p/(p+n)) + -(n/(p+n))log2(n/(p+n))  

= -(2/(2+3))log2(2/(2+3)) + -(3/(2+3))log2(3/(2+3)) 

= 0.9710

Iovercast(p,n) = I(number of overcast AND Yes, number of overcast AND No) 

= I(4,0)

= -(p/(p+n))log2(p/(p+n)) + -(n/(p+n))log2(n/(p+n))  

= -(4/(4+0))log2(4/(4+0)) + -(0/(4+0))log2(0/(4+0))  

= 0

Irainy(p,n) = I(number of rainy AND Yes, number of rainy AND No) 

= I(3,2)

= -(p/(p+n))log2(p/(p+n)) + -(n/(p+n))log2(n/(p+n))  

= -(3/(3+2))log2(3/(3+2)) + -(2/(3+2))log2(2/(3+2))   

= 0.9710

Remaining Entropy for Weather Outlook =

= Weighted sum of the entropy scores of each attribute value subset 

= (5/14)*Isunny(p,n) + (4/14)*Iovercast(p,n)  + (5/14)*Irainy(p,n) 

= (5/14) * (0.9710) + (4/14) * 0 + (5/14) * (0.9710)

= 0.6936

Info Gain for Weather Outlook   = Prior entropy of the data set – remaining entropy if we split using this attribute

= 0.9403 – 0.6936

= 0.246

Now that we have the information gain for weather outlook, we need to find the remaining entropy and information gain for the other attributes using the same procedure. The remaining entropies for the other attributes are as follows:

  • Remaining Entropy for Temperature = 0.911
  • Remaining Entropy for Humidity = 0.789
  • Remaining Entropy for Wind = 0.892

The information gains are as follows:

  • Information Gain for Temperature = 0.9403 – 0.911 = 0.0293
  • Information Gain for Humidity = 0.9403 – 0.789 = 0.1513
  • Information Gain for Wind = 0.9403 – 0.892 = 0.0483

Step 3: Calculate the Attribute that Had the Maximum Information Gain

The information gain for weather outlook is 0.246, so it provides the most information and becomes the label for the root node.

Step 4: Partition the Data Set Based on the Values of the Attribute that Had the Maximum Information Gain

The data set is then partitioned based on the values this maximum information gain attribute can take (e.g. sunny, overcast, and rain). The partitions become the outgoing branches from the root node. These branches terminate in new unlabeled nodes. The maximum information gain attribute is removed from these partitions and is not used again to make any more splits from these new unlabeled nodes. An attribute can only be used as a decision variable once on a given path in a tree but can be used multiple times in an entire tree.

Step 5: Repeat Steps 1-4 for Each Branch of the Tree Using the Relevant Partition

We stop when:

  1. Every instance in the partition is part of the same class. Return a decision tree that is made up of a leaf node and labeled with the class of the instances.
  2. There are no more attributes left. Return a decision tree that is made up of a leaf node labeled with the class that comprised the majority of the instances.
  3. No instances in the partition. We return a decision tree that is made up of a leaf node and label with the most common class in the parent node.

Here is the final decision tree:

id3-2

Source: (Mitchell, 1997)

Information Gain Ratio

An issue with just using information gain as the selection criteria for the root is that it has a strong bias towards selecting attributes that contain a large number of different values (which could lead to trees that overfit to the data). One way to rectify this is to use what is called the gain ratio. It acts as a kind of normalization factor.

The gain ratio defines an additional term that penalizes attributes that create a large number of partitions. The gain ratio chooses attributes based on the ratio between their gain and their intrinsic information content. It represents the proportion of information generated that is useful for classification. The attribute with the highest gain ratio is selected as the attribute that splits the data at that node.

Gain Ratio of an Attribution = Information Gain of the Attribute / Split Information of the Attribute

The clearest way to demonstrate the information gain is to show an example. Going back to our original data set, suppose we partition the data set based on the outlook attribute. This partition will result in three subsets: “sunny” subset (frequency count of 5), “overcast” subset (frequency count of 4), and “rainy” subset (frequency count of 5). 

Split Information of Outlook = -[(5/14) * log2(5/14) + (4/14) * log2(4/14) + (5/14) * log2(5/14)] = 1.577

The gain was calculated as 0.246, so the gain ratio is 0.246 / 1.577 = 0.156.

For humidity, we can also calculate the gain ratio.

Split Information of Humidity = -[(7/14) * log2(7/14) + (7/14) * log2(7/14)] = 1

The gain was calculated as 0.1513, so the gain ratio is 0.1513 / 1 = 0.1513.

Pruned

One common way to resolve the issue of overfitting is to do reduced error pruning (Mitchell, 1997). 

  • Training, validation, and test sets are used.
  • A portion of the original data set is withheld as a validation set. This validation set is not used during the training phase of the tree (i.e. while the decision tree is getting built).
  • Once training is finished and the decision tree is built, the tree is tested on the validation set.
  • Each node is considered as a candidate for pruning. Pruning means removing the subtree at that node, making it a leaf, and assigning the most common class at the node.
  • A node is removed if the resulting pruned tree has a classification accuracy (as tested against the validation set) that is at least as good as the accuracy of the original decision tree.
    • If replacing the subtree by a leaf results in improved classification accuracy, replace the subtree with that leaf. 
  • Pruning continues until no further improvement in the classification accuracy occurs.
  • This process happens in a depth-first manner, starting from the leaves.

Return to Table of Contents

Iterative Dichotomiser 3 Algorithm Design

The ID3 algorithm was implemented from scratch with and without reduced error pruning. The abalone, car evaluation, and image segmentation data sets were then preprocessed to meet the input requirements of the algorithms.

Five-Fold Cross-Validation

In this project, for the pruned version of ID3, 10% of the data was pulled out of the original data set to be used as a validation set. The remaining 90% of the data was used for five-fold stratified cross-validation to evaluate the performance of the models. For the unpruned version, the entire data set was used for five-fold stratified cross-validation.

validationset

Required Data Set Format for ID3

Required Data Set Format for ID3-Unpruned

  • Columns (0 through N)
    • 0: Class
    • 1: Attribute 1
    • 2: Attribute 2
    • 3: Attribute 3
    • N: Attribute N

Required Data Set Format for ID3-Pruned

  • Columns (0 through N)
    • 0: Class
    • 1: Attribute 1
    • 2: Attribute 2
    • 3: Attribute 3
    • N: Attribute N

Abalone Data Set

The original abalone data set contains 4177 instances, 8 attributes, and 29 classes (Waugh, 1995). The purpose of this data set is to predict the age of an abalone from physical measurements.

Modification of Attribute Values

All attribute values except for the sex attribute were made discrete by putting the continuous values into bins based on the quartile the value fell into. This modification follows the principle of Occam’s Razor.

Some of the classes only had one or two instances, so the class values (ages of the abalones) were converted into three classes of approximately equal size:

  • Young = 0
  • Middle-Aged = 1
  • Old = 2

Missing Data

There were no missing attribute values.

Car Evaluation Data Set

The original car evaluation data set contains 6 attributes and 1728 instances (Bohanec, 1997). The purpose of the data set is to predict car acceptability based on price, comfort, and technical specifications.

Modification of Attribute Values

No modification of attribute values was necessary since all attributes were discrete.

Missing Data

There were no missing attribute values.

Image Segmentation Data Set

This data set is used for classification. It contains 19 attributes, 210 instances, and 7 classes (Brodley, 1990). This data set was created from a database of seven outdoor images.

Modification of Attribute Values

All attribute values were made discrete by putting the continuous values into bins based on the quartile the value fell into.

The class values were already discrete and were left as-is.

Missing Data

There were no missing attribute values.

Return to Table of Contents

Iterative Dichotomiser 3 Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the code that parses the input file:

import csv # Library to handle csv-formatted files

# File name: parse.py
# Author: Addison Sears-Collins
# Date created: 7/6/2019
# Python version: 3.7
# Description: Used for parsing the input data file

def parse(filename):
    """
    Parameters: 
        filename: Name of a file
    Returns: 
        data: Information on the attributes and the data as a list of 
              dictionaries. Each instance is a different dictionary.
    """

    # Initialize an empty list named 'data'
    data = []

    # Open the file in READ mode.
    # The file object is named 'file'
    with open(filename, 'r') as file:

        # Convert the file object named file to a csv.reader object. Save the
        # csv.reader object as csv_file
        csv_file = csv.reader(file)

        # Return the current row (first row) and advance the iterator to the
        # next row. Since the first row contains the attribute names (headers),
        # save them in a list called headers
        headers = next(csv_file)

        # Extract each of the remaining data rows one row at a time
        for row in csv_file:
            # append method appends an element to the end of the list
            # The element that is appended is a dictionary.
            # A dictionary contains search key-value pairs, analogous to
            # word-definition in a regular dictionary.
            # In this case, each instance is a separate dictionary.
            # The zip method joins two lists together so that we have
            # attributename(header)-value(row) pairs for each instance
            # in the data set
            data.append(dict(zip(headers, row)))

    return data

##Used for debugging
#name_of_file =  "abalone.txt" 
#data = parse(name_of_file)
#print(*data, sep = "\n")
#print()
#print(str(len(data)))

Here is the code for five-fold stratified cross-validation:

import random
from collections import Counter # Used for counting

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/7/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

def get_five_folds(instances):
    """
    Parameters:
        instances: A list of dictionaries where each dictionary is an instance. 
            Each dictionary contains attribute:value pairs 
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Create five empty folds
    fold0 = []
    fold1 = []
    fold2 = []
    fold3 = []
    fold4 = []

    # Shuffle the data randomly
    random.shuffle(instances)

    # Generate a list of the unique class values and their counts
    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in instances:

            # If we have a match
            if uniqueclass == instance['Class']:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    random.shuffle(fold0)
    random.shuffle(fold1)
    random.shuffle(fold2)
    random.shuffle(fold3)
    random.shuffle(fold4)

    # Return the folds
    return  fold0, fold1, fold2, fold3, fold4

Here is the ID3 code:

from node import Node # Import the Node class from the node.py file
from math import log # We need this to compute log base 2
from collections import Counter # Used for counting

# File name: id3.py
# Author: Addison Sears-Collins
# Date created: 7/5/2019
# Python version: 3.7
# Description: Iterative Dichotomiser 3 algorithm

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

def ID3(instances, default):
    """
    Parameters:
      instances: A list of dictionaries where each dictionary is an instance. 
                 Each dictionary contains attribute:value pairs 
                 e.g.: instances =
                   {'Class':'Play','Outlook':'Sunny','Temperature':'Hot'}
                   {'Class':'Don't Play','Outlook':'Rain','Temperature':'Cold'}
                   {'Class':'Play','Outlook':'Overcast','Temperature':'Hot'}
                   ...
                   etc.
                 The first attribute:value pair is the 
                 target variable (i.e. the class of that instance)
      default: The default class label (e.g. 'Play')
    Returns:
      tree: An object of the Node class
    """    
    # The len method returns the number of items in the list
    # If there are no more instances left, return a leaf that is labeled with 
    # the default class
    if len(instances) == 0:
        return Node(default)

    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # If all instances have the same class label or there is only one instance
    # remaining, create a leaf node labeled with that class. 
    # Counter(list) creates a tally of each element in the list. This tally is 
    # represented as an element:tally pair.
    if len(Counter(classes)) == 1 or len(classes) == 1:
        tree = Node(mode_class(instances))
        return tree

    # Otherwise, find the best attribute, the attribute that maximizes the gain 
    # ratio of the data set, to be the next decision node.
    else:
        # Find the name of the most informative attribute of the data set
        # e.g. "Outlook"
        best_attribute = most_informative_attribute(instances)

        # Initialize a tree with the most common class
        # e.g. "Play"
        tree = Node(mode_class(instances))

        # The most informative attribute becomes this decision node
        # e.g. "Outlook" becomes this node
        tree.attribute = best_attribute

        best_attribute_values = []

        # The branches of the node are the values of the best_attribute
        # e.g. "Sunny", "Overcast", "Rainy"
        # Go through each instance and create a list of the values of 
        # best_attribute
        for instance in instances:
            try:
                best_attribute_values.append(instance[best_attribute])
            except:
                no_best_attribute = True
        # Create a list of the unique best attribute values
        # Set is like a list except it extracts nonduplicate (unique) 
        # items from a list. 
        # In short, we create a list of the set of unique
        # best attribute values.
        tree.attribute_values = list(set(best_attribute_values))

        # Now we need to split the instances. We will create separate subsets
        # for each best attribute value. These become the child nodes
        # i.e. "Sunny", "Overcast", "Rainy" subsets
        for best_attr_value_i in tree.attribute_values:

            # Generate the subset of instances
            instances_i = []
            # Go through one instance at a time
            for instance in instances:
                # e.g. If this instance has "Sunny" as its best attribute value
                if instance[best_attribute] == best_attr_value_i:
                    instances_i.append(instance) #Add this instance to the list

            # Create a subtree recursively
            subtree = ID3(instances_i, mode_class(instances))

            # Initialize the values of the subtree
            subtree.instances_labeled = instances_i

            # Keep track of the state of the subtree's parent (i.e. tree)
            subtree.parent_attribute = best_attribute # parent node
            subtree.parent_attribute_value = best_attr_value_i # branch name

            # Assign the subtree to the appropriate branch
            tree.children[best_attr_value_i] = subtree

        # Return the decision tree
        return tree


def mode_class(instances):
    """
    Parameters: 
      instances: A list of dictionaries where each dictionary is an instance. 
        Each dictionary contains attribute:value pairs 
    Returns:
      Name of the most common class (e.g. 'Don't Play')
    """

    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # The 1 ensures that we get the top most common class
    # The [0][0] ensures we get the name of the class label and not the tally
    # Return the name of the most common class of the instances
    return Counter(classes).most_common(1)[0][0]

def prior_entropy(instances):
    """
    Calculate the entropy of the data set with respect to the actual class
    prior to splitting the data.
    Parameters:
      instances: A list of dictionaries where each dictionary is an instance. 
        Each dictionary contains attribute:value pairs 
    Returns:
      Entropy value in bits
    """
    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list    
    classes = []  # Create an empty list named 'classes'

    for instance in instances:
        classes.append(instance['Class'])
    counter = Counter(classes)

    # If all instances have the same class, the entropy is 0
    if len(counter) == 1:
        return 0
    else:
    # Compute the weighted sum of the logs of the probabilities of each 
    # possible outcome 
        entropy = 0
        for c, count_of_c in counter.items():
            probability = count_of_c / len(classes)
            entropy += probability * (log(probability, 2))
        return -entropy

def entropy(instances, attribute, attribute_value):
    """
    Calculate the entropy for a subset of the data filtered by attribute value
    Parameters:
      instances: A list of dictionaries where each dictionary is an instance. 
        Each dictionary contains attribute:value pairs 
      attribute: The name of the attribute (e.g. 'Outlook')
      attribute_value: The value of the attribute (e.g. 'Sunny')
    Returns:
      Entropy value in bits
    """
    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list    
    classes = []  # Create an empty list named 'classes'

    for instance in instances:
        if instance[attribute] == attribute_value:
            classes.append(instance['Class'])
    counter = Counter(classes)

    # If all instances have the same class, the entropy is 0
    if len(counter) == 1:
        return 0
    else:
    # Compute the weighted sum of the logs of the probabilities of each 
    # possible outcome 
        entropy = 0
        for c, count_of_c in counter.items():
            probability = count_of_c / len(classes)
            entropy += probability * (log(probability, 2))
        return -entropy

def gain_ratio(instances, attribute):
    """
    Calculate the gain ratio if we were to split the data set based on the values
    of this attribute.
    Parameters:
      instances: A list of dictionaries where each dictionary is an instance. 
        Each dictionary contains attribute:value pairs 
      attribute: The name of the attribute (e.g. 'Outlook')
    Returns:
      The gain ratio
    """
    # Record the entropy of the combined set of instances
    priorentropy = prior_entropy(instances)

    values = []

    # Create a list of the attribute values for each instance
    for instance in instances:
        values.append(instance[attribute])
    counter = Counter(values) # Store the frequency counts of each attribute value

    # The remaining entropy if we were to split the instances based on this attribute
    # This is a weighted entropy score sum
    remaining_entropy = 0

    # This variable is used for the gain ratio calculation
    split_information = 0

    # items() method returns a list of all dictionary key-value pairs
    for attribute_value, attribute_value_count in counter.items():
        probability = attribute_value_count/len(values)
        remaining_entropy += (probability * entropy(
            instances, attribute, attribute_value))
        split_information += probability * (log(probability, 2))

    information_gain = priorentropy - remaining_entropy

    split_information = -split_information

    gainratio = None

    if split_information != 0:
        gainratio = information_gain / split_information
    else:
        gainratio = -1000

    return gainratio

def most_informative_attribute(instances):
    """
    Choose the attribute that provides the most information if you were to
    split the data set based on that attribute's values. This attribute is the 
    one that has the highest gain ratio.
    Parameters:
      instances: A list of dictionaries where each dictionary is an instance. 
        Each dictionary contains attribute:value pairs 
      attribute: The name of the attribute (e.g. 'Outlook')
    Returns:
      The name of the most informative attribute
    """
    selected_attribute = None
    max_gain_ratio = -1000

    # instances[0].items() extracts the first instance in instances
    # for key, value iterates through each key-value pair in the first
    # instance in instances
    # In short, this code creates a list of the attribute names
    attributes = [key for key, value in instances[0].items()]
    # Remove the "Class" attribute name from the list
    attributes.remove('Class')

    # For every attribute in the list of attributes
    for attribute in attributes:
        # Calculate the gain ratio and store that value
        gain = gain_ratio(instances, attribute)

        # If we have a new most informative attribute
        if gain > max_gain_ratio:
            max_gain_ratio = gain
            selected_attribute = attribute

    return selected_attribute

def accuracy(trained_tree, test_instances):
    """
    Parameters:
        trained_tree: A tree that has already been trained
        test_instances: A set of test instances
    Returns:
        Classification accuracy (# of correct predictions/# of predictions) 
    """
    # Set the counter to 0
    no_of_correct_predictions = 0

    for test_instance in test_instances:
        if predict(trained_tree, test_instance) == test_instance['Class']:
            no_of_correct_predictions += 1

    return no_of_correct_predictions / len(test_instances)

def predict(node, test_instance):
    '''
    Parameters:
        node: A trained tree node
        test_instance: A single test instance
    Returns:
        Class value (e.g. "Play")
    '''
    # Stopping case for the recursive call.
    # If this is a leaf node (i.e. has no children)
    if len(node.children) == 0:
        return node.label
    # Otherwise, we are not yet on a leaf node.
    # Call predict method recursively until we get to a leaf node.
    else:
        # Extract the attribute name (e.g. "Outlook") from the node. 
        # Record the value of the attribute for this test instance into 
        # attribute_value (e.g. "Sunny")
        attribute_value = test_instance[node.attribute]

        # Follow the branch for this attribute value assuming we have 
        # an unpruned tree.
        if attribute_value in node.children and node.children[
            attribute_value].pruned == False:
            # Recursive call
            return predict(node.children[attribute_value], test_instance)

        # Otherwise, return the most common class
        # return the mode label of examples with other attribute values for the current attribute
        else:
            instances = []
            for attr_value in node.attribute_values:
                instances += node.children[attr_value].instances_labeled
            return mode_class(instances)

TREE = None
def prune(node, val_instances):
    """
    Prune the tree recursively, starting from the leaves
    Parameters:
        node: A tree that has already been trained
        val_instances: The validation set        
    """
    global TREE
    TREE = node

    def prune_node(node, val_instances):
        # If this is a leaf node
        if len(node.children) == 0:
            accuracy_before_pruning = accuracy(TREE, val_instances)
            node.pruned = True

            # If no improvement in accuracy, no pruning
            if accuracy_before_pruning >= accuracy(TREE, val_instances):
                node.pruned = False
            return

        for value, child_node in node.children.items():
            prune_node(child_node, val_instances)

        # Prune when we reach the end of the recursion
        accuracy_before_pruning = accuracy(TREE, val_instances)
        node.pruned = True

        if accuracy_before_pruning >= accuracy(TREE, val_instances):
            node.pruned = False

    prune_node(TREE, val_instances)

Here is the code for the nodes. This is needed in order to run ID3 (above).

# File name: node.py
# Author: Addison Sears-Collins
# Date created: 7/6/2019
# Python version: 3.7
# Description: Used for constructing nodes for a tree

class Node:
  
  # Method used to initialize a new node's data fields with initial values
  def __init__(self, label):

    # Declaring variables specific to this node
    self.attribute = None  # Attribute (e.g. 'Outlook')
    self.attribute_values = []  # Values (e.g. 'Sunny')
    self.label = label   # Class label for the node (e.g. 'Play')
    self.children = {}   # Keeps track of the node's children
    
    # References to the parent node
    self.parent_attribute = None
    self.parent_attribute_value = None

    # Used for pruned trees
    self.pruned = False  # Is this tree pruned? 
    self.instances_labeled = []

Here is the code that displays the results. This is the driver program:

import id3
import parse
import random
import five_fold_stratified_cv
from matplotlib import pyplot as plt


# File name: results.py
# Author: Addison Sears-Collins
# Date created: 7/6/2019
# Python version: 3.7
# Description: Results of the Iterative Dichotomiser 3 runs
# This source code is the driver for the entire program

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

ALGORITHM_NAME = "Iterative Dichotomiser 3"

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Enter the name of your input file
    #file_name = 'car.txt'
    file_name = input("Enter the name of your input file (e.g. car.txt): ") 
    

    # Show functioning of the program
    #trace_runs_file = 'car_id3_trace_runs.txt'
    trace_runs_file = input(
       "Enter the name of your trace runs file (e.g. car_id3_trace_runs.txt): ")     

    # Save the output graph of the results
    #imagefile = 'car_id3_results.png'
    imagefile = input(
        "Enter the name of the graphed results file (e.g. foo.png): ")     

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    outfile_tr.write("Welcome to the " +  ALGORITHM_NAME + " Program!" + "\n")
    outfile_tr.write("\n")

    data = parse.parse(file_name)
    pruned_accuracies_avgs = []
    unpruned_accuracies_avgs = []

    # Shuffle the data randomly
    random.shuffle(data)

    # This variable is used for the final graph. Places
    # upper limit on the x-axis.
    # 10% of is pulled out for the validation set
    # 20% of that set is used for testing in the five-fold
    # stratified cross-validation
    # Round up to the nearest value of 10
    upper_limit = (round(len(data) * 0.9 * 0.8) - round(
        len(data) * 0.9 * 0.8) % 10) + 10
    #print(str(upper_limit)) # Use for debugging
    if upper_limit <= 10:
        upper_limit = 50

    # Get the most common class in the data set.
    default = id3.mode_class(data)

    # Pull out 10% of the data to be used as a validation set
    # The remaining 90% of the data is used for cross validation.
    validation_set = data[: 1*len(data)//10]
    data = data[1*len(data)//10 : len(data)]

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
        data)

    # Generate lists to hold the training and test sets for each experiment
    testset = []
    trainset = []

    # Create the training and test sets for each experiment
    # Experiment 0
    testset.append(fold0)
    trainset.append(fold1 + fold2 + fold3 + fold4)

    # Experiment 1
    testset.append(fold1)
    trainset.append(fold0 + fold2 + fold3 + fold4)

    # Experiment 2
    testset.append(fold2)
    trainset.append(fold0 + fold1 + fold3 + fold4)

    # Experiment 3
    testset.append(fold3)
    trainset.append(fold0 + fold1 + fold2 + fold4)
    
    # Experiment 4
    testset.append(fold4)
    trainset.append(fold0 + fold1 + fold2 + fold3)

    step_size = len(trainset[0])//20

    for length in range(10, upper_limit, step_size):
        print('Number of Training Instances:', length)
        outfile_tr.write('Number of Training Instances:' + str(length) +"\n")
        pruned_accuracies = []
        unpruned_accuracies = []

        # Run all 5 experiments for 5-fold stratified cross-validation
        for experiment in range(5):

            # Each experiment has a training and testing set that have been 
            # preassigned.
            train = trainset[experiment][: length]
            test = testset[experiment]

            # Pruned
            tree = id3.ID3(train, default)
            id3.prune(tree, validation_set)
            acc = id3.accuracy(tree, test)
            pruned_accuracies.append(acc)

            # Unpruned
            tree = id3.ID3(train, default)
            acc = id3.accuracy(tree, test)
            unpruned_accuracies.append(acc) 
        
        # Calculate and store the average classification 
        # accuracies for each experiment
        avg_pruned_accuracies = sum(pruned_accuracies) / len(pruned_accuracies)
        avg_unpruned_accuracies = sum(unpruned_accuracies) / len(unpruned_accuracies)

        print("Classification Accuracy for Pruned Tree:", avg_pruned_accuracies) 
        print("Classification Accuracy for Unpruned Tree:", avg_unpruned_accuracies)
        print()
        outfile_tr.write("Classification Accuracy for Pruned Tree:" + str(
            avg_pruned_accuracies) + "\n") 
        outfile_tr.write("Classification Accuracy for Unpruned Tree:" + str(
                avg_unpruned_accuracies) +"\n\n")

        # Record the accuracies, so we can plot them later
        pruned_accuracies_avgs.append(avg_pruned_accuracies)
        unpruned_accuracies_avgs.append(avg_unpruned_accuracies) 
    
    # Close the file
    outfile_tr.close()

    plt.plot(range(10, upper_limit, step_size), pruned_accuracies_avgs, label='pruned tree')
    plt.plot(range(10, upper_limit, step_size), unpruned_accuracies_avgs, label='unpruned tree')
    plt.xlabel('Number of Training Instances')
    plt.ylabel('Classification Accuracy on Test Instances')
    plt.grid(True)
    plt.title("Learning Curve for " +  str(file_name))
    plt.legend()
    plt.savefig(imagefile) 
    plt.show()
    

main()

Return to Table of Contents

Iterative Dichotomiser 3 Output

Here are the trace runs:

Here are the results:

Abalone Data Set

abalone_id3_results

Car Evaluation Data Set

car_id3_results

Image Segmentation Data Set

segmentation_id3_results

Analysis

Abalone Data Set

While the overall classification accuracies on the data set were only ~60%, the data show the impact that pruning a decision tree can have on improving prediction performance. The pruned tree performed better than the unpruned tree on the same set of test instances, irrespective of how many training instances the decision trees were trained on. These results suggest that smaller decision trees can generalize better than larger trees and can more accurately classify new test instances. This pruning impact was more pronounced on this data set than any of the other data sets evaluated in this project.

Car Evaluation Data Set

Classification accuracy for both the pruned and unpruned trees was high on this data set, plateauing at ~92%. This performance was the best of any of the data sets examined in this project.

When the decision trees were trained on a relatively small number of training instances, the pruned trees outperformed the unpruned trees. However, beyond 600 training instances, unpruned trees began to have higher classification accuracy. 

It is not clear if the superior accuracy of the unpruned tree algorithm on large numbers of training instances was due to overfitting on the training data or if there was actually an improvement in performance. More training data would be needed to examine if this effect remains consistent for larger numbers of training instances.

Image Segmentation Data Set

For the segmentation data set, neither the unpruned nor the pruned trees outperformed the other consistently. Overall classification accuracy quickly improved for both ID3 algorithms when the decision trees were trained on more training instances. Performance plateaued at ~72%, which was not as good as the car evaluation data set but was better than the abalone data set.

The results show that creating smaller decision trees by pruning the branches might not lead to improved classification performance on unseen new test instances, in some cases. 

Summary and Conclusions

We can conclude that decision trees can create a convenient set of if-then-else decision rules that can be used to classify new unseen test instances. Classification accuracy for both the pruned and unpruned ID3 algorithms was >60% for all data sets, reaching a peak of ~92% on the car evaluation data set.

Neither type of tree, unpruned and pruned, consistently outperformed the other in terms of classification accuracy. The results suggest the impact of pruning depends highly on the data set being evaluated. On the abalone data set, pruning the tree reduced overfitting and led to improved performance. On the car evaluation data set, the performance was mixed. On the image segmentation data set, performance was relatively the same for both ID3-unpruned and ID3-pruned.

In a real-world setting, the overhead associated with reduced error pruning would need to be balanced against the increased speed and simplicity with which new unseen test instances could be classified. Future work would need to compare the performance of the ID3-pruned and ID3-unpruned algorithms on other classification data sets.

Return to Table of Contents

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Bohanec, M. (1997, 6 1). Car Evaluation Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Brodley, C. (1990, November 1). Image Segmentation Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Image+Segmentation

James, G. (2013). An Introduction to Statistical Learning: With Applications in R. New York: Springer.

Kelleher, J. D., Namee, B., & Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. Cambridge, Massachusetts: The MIT Press.

Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.

Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 81-106.

Waugh, S. (1995, 12 1). Abalone Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Abalone

Return to Table of Contents

K-Nearest Neighbors Algorithm From Scratch | Machine Learning

In this post, I will walk you through the k-nearest neighbors algorithm (k-NN classification and k-NN regression), step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the image segmentation data set from the UCI Machine Learning Repository. The purpose of the data set is to classify the instances into seven different outdoor images (e.g. sky, foliage, cement, window, path, grass, etc.) based on pixel data. This data set gives a taste of computer vision mixed with machine learning.

Without further ado, let’s get started!

Table of Contents

What is the K-Nearest Neighbors Algorithm?

The K-Nearest Neighbors Algorithm is a supervised learning algorithm (the target variable we want to predict is available) that is used for both classification and regression problems. Let’s see how it works for both types of problems.

K-Nearest Neighbors Classification

For the k-NN classification algorithm, we have N training instances. Let us suppose that these training instances fall into one of two classes, A and B.

1-knn

We want to classify a new instance C. We do this by identifying the k nearest neighbors of C and classifying C based on the class that had the most nearest neighbors to C (out of a total of k nearest neighbors).

For example, let k = 3. The three nearest neighbors of C are A, A, and B. Since A had the most nearest neighbors, C belongs to class A.

How do we select k?

A common method for choosing k is to take the square root of N, the number of instances and subtract 1 if that number is odd. However, in this project, we will tune the value for k and record the results to see which value for k achieved optimal performance as measured by either classification accuracy or mean squared error.

How do we determine the k nearest neighbors?

A common method for measuring distance is to use the Euclidean distance formula. That is, given points p =(p1,p2,…,pn) and q = (q1,q2,…,qn), the distance d between p and q is:

d(p,q) = √((q1 – p1)2 + (q2-p2)2 + …+(qn – pn)2)

This is a version of the Pythagorean Theorem.

In short, the algorithm for k-NN classification is as follows. For each test instance, we:

  1. Compute the distance to every training instance
  2. Select the k closest instances and their class
  3. Output the class that occurs most frequently among the k closest instances

Special notes on k-NN classification:

  • Select an odd value for k for two-class problems in order to avoid ties.
  • Complexity of the algorithm depends on finding the nearest neighbor for each training instance.
  • k-NN needs labeled data sets.

k-NN is known as a lazy learning algorithm because a model is not built until the time a test instance arrives. Thus, there is no prior training on the training data as would be the case with an eager learning algorithm.

Also k-NN is a non-parametric algorithm because it does not assume any particular form of the data set (unlike algorithms like linear regression, logistic regression, etc.). The only assumption is that Euclidean distance can be consistently calculated between points.

K-Nearest Neighbors Regression

k-NN regression is a minor modification of k-NN classification. In k-NN regression, we are trying to predict a real number instead of a class. Instead of classifying a test instance based on the most frequently occurring class among the k nearest neighbors, we take the average of the target variable of the k nearest neighbors.

For example, let’s suppose we wanted to predict the age of someone given their weight. We have the following data:

2-knn

The question is: how old is someone given they weigh 175 pounds?

Suppose k = 3. We find the three nearest neighbors to 175. They are:

  • (170,50)
  • (180,52)
  • (156,43)

We take the average of the target variable:

Average = (50 + 52 + 43) / 3 = 48.3

This is our answer.

In short, the algorithm for k-NN regression is as follows. For each test instance, we:

  1. Compute the distance to every training instance
  2. Select the k closest instances and the values of their target variables
  3. Output the mean of the values of the target variables

Return to Table of Contents

K-Nearest Neighbors Algorithm Design

The first thing we need to do is preprocess the image segmentation data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class

Modification of Attribute Values

The image segmentation data set contains 19 attributes, 210 instances, and 7 classes. This data set was created from a database of seven outdoor images. Below are some of the modifications I made.

Classes were made numerical:

  • BRICKFACE = 1.0
  • SKY = 2.0
  • FOLIAGE = 3.0
  • CEMENT = 4.0
  • WINDOW = 5.0
  • PATH = 6.0
  • GRASS = 7.0

Region-pixel-count was removed since it was 9 for all instances.

I also normalized the attributes so that they are all between 0 and 1.

Here is what the first few columns of your data set should look like after all that preprocessing:

dataset-image-segment

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

K-Nearest Neighbors Algorithm in Python, Coded From Scratch

Here is the full code for the k-nearest neighbors algorithm (Note that I used five-fold stratified cross-validation to produce the final classification accuracy statistics). You might want to copy and paste it into a document since it is pretty large and hard to see on a single web page:

import numpy as np # Import Numpy library

# File name: knn.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Implementation of the k-nearest neighbors algorithm (k-NN)
# from scratch

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Regression Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

class Knn:

    # Constructor
    #   Parameters:
    #     k: k value for k-NN
    #     problem_type: ('r' for regression or 'c' for classification)
    def __init__(self, k, problem_type):
        self.__k = k
        self.__problem_type = problem_type

    # Parameters:
    #   training_set: The folds used for training (2d numpy array)
    #   test_instance: The test instance we need to find the neighbors for 
    #                  (numpy array)
    # Returns: 
    #   The k most similar neighbors from the training set for a given 
    #   test instance. It will be a 2D numpy array where the first column 
    #   will hold the Actual class value of the training instance and the 
    #   second column will store the distance to the test instance...
    #   (actual class value, distance). 
    def get_neighbors(self, training_set, test_instance):

        # Record the number of training instances in the training set
        no_of_training_instances = np.size(training_set,0)

        # Record the number of columns in the training set
        no_of_training_columns = np.size(training_set,1)

        # Record the column index of the actual class of the training_set
        actual_class_column = None
        # If classification problem
        if self.__problem_type == "c":
            actual_class_column = no_of_training_columns - 1
        # If regression problem
        else:
            actual_class_column = no_of_training_columns - 2

        # Create an empty 2D array called actual_class_and_distance. This 
        # array should be the same length as the number of training instances. 
        # The first column will hold the Actual Class value of the training 
        # instance, and the second column will store the distance to the 
        # test instance...(actual class value, distance). 
        actual_class_and_distance = np.zeros((no_of_training_instances, 2))

        neighbors = None

        # For each row (training instance) in the training set
        for row in range(0, no_of_training_instances):

            # Record the actual class value in the 
            # actual_class_and_distance array (column 0)
            actual_class_and_distance[row,0] = training_set[
                row,actual_class_column]

            # Initialize a temporary training instance copied from this 
            # training instance
            temp_training_instance = np.copy(training_set[row,:])

            # Initialize a temporary test instance copied from this 
            # test_instance
            temp_test_instance = np.copy(test_instance)

            # If this is a classification problem
            if self.__problem_type == "c":
            
                # Update temporary training instance with Instance ID 
                # and the Actual Class pieces removed
                temp_training_instance = np.delete(temp_training_instance,[
                    0,actual_class_column])

                # Update temporary test instance with the Instance ID 
                # and the Actual Class pieces removed
                temp_test_instance = np.delete(temp_test_instance,[
                    0,actual_class_column])

            # If this is a regression problem
            else:

                # Update temporary training instance with Instance ID, Actual 
                # Class, and Stratification Bin pieces removed
                temp_training_instance = np.delete(temp_training_instance,[
                    0,actual_class_column,actual_class_column+1])

                # Update temporary test instance with the Instance ID, Actual
                # Class, and Stratification Bin pieces removed
                temp_test_instance = np.delete(temp_test_instance,[
                    0,actual_class_column,actual_class_column+1])

            # Calculate the euclidean distance from the temporary test
            # instance to the temporary training instance
            distance = np.linalg.norm(
                temp_test_instance - temp_training_instance)

            # Record the distance in the actual_class_and_distance 
            # array (column 1)
            actual_class_and_distance[row,1] = distance

        # Sort the actual_class_and_distance 2D array by the 
        # distance column (column 1) in ascending order.
        actual_class_and_distance = actual_class_and_distance[
                actual_class_and_distance[:,1].argsort()]

        k = self.__k

        # Extract the first k rows of the actual_class_and_distance array
        neighbors = actual_class_and_distance[:k,:]

        return neighbors

    # Generate a prediction based on the most frequent class or averaged 
    # target variable value of the neighbors
    # Parameters: 
    #   neighbors - 1D array (actual_class_value)
    # Returns:
    #   predicted_class_value
    def make_prediction(self, neighbors):
    
        prediction = None

        # If this is a classification problem
        if self.__problem_type == "c":
            
            #  Prediction is the most frequent value in column 0 of
            #  the neighbors array
            neighborsint = neighbors.astype(int)
            prediction = np.bincount(neighborsint).argmax()
            
        # If this is a regression problem
        else:

            # Prediction is the average of the neighbors array
            prediction = np.mean(neighbors)

        return prediction

    # Parameters: 
    #   actual_class_array
    #   predicted_class_array
    # Returns: 
    #   accuracy: Either classification accuracy (for 
    #   classification problems) or 
    #   mean squared error (for regression problems)
    def get_accuracy(self, actual_class_array, predicted_class_array):

        # Initialize accuracy variable
        accuracy = None

        # Initialize decision variable
        decision = None

        counter = None

        actual_class_array_size = actual_class_array.size

        # If this is a classification problem
        if self.__problem_type == "c":
            
            counter = 0

            # For each element in the actual class array
            for row in range(0,actual_class_array_size):

                # If actual class value is equal to value in predicted
                # class array
                if actual_class_array[row] == predicted_class_array[row]:
                    decision = "correct"
                    counter += 1
                else:
                    decision = "incorrect"

            classification_accuracy = counter / (actual_class_array_size)

            accuracy = classification_accuracy

        # If this is a regression problem
        else:

            # Initialize an empty squared error array. 
            # Needs to be same length as number of test instances
            squared_error_array = np.empty(actual_class_array_size)

            squared_error = None

            # For each element in the actual class array
            for row in range(0,actual_class_array_size):

                # Calculate the squared error
                squared_error = (abs((actual_class_array[
                        row] - predicted_class_array[row]))) 
                squared_error *= squared_error
                squared_error_array[row] = squared_error

            mean_squared_error = np.mean(squared_error_array)
            
            accuracy = mean_squared_error
        
        return accuracy

Here is the driver program that calls and executes the code above:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from five_fold_stratified_cv import FiveFoldStratCv
from knn import Knn

# File name: knn_driver.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Driver for the knn.py program 
# (K-Nearest Neighbors)

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Regression Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

################# INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-Nearest Neighbor"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
##############################################################################

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " program!")
    print()
    print("Running " + ALGORITHM_NAME + ". Please wait...")
    print()

    # k value that will be tuned
    k = eval(input("Enter a value for k: ") )

    # "c" for classification or "r" for regression
    problem = input("Press  \"c\" for classification or \"r\" for regression: ") 

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Convert dataframe into a Numpy array
    np_data_set = pd_data_set.to_numpy(copy=True)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Create an object of class FiveFoldStratCv
    fivefolds1 = FiveFoldStratCv(np_data_set,problem)

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Create an object of class Knn
    knn1 = Knn(k,problem) 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = fivefolds1.get_five_folds()

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy for classification problems or mean squared
    # error for regression problems)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run k-NN the designated number of times as indicated by the number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = np.concatenate((
                fold1, fold2, fold3, fold4), axis=0)
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = np.concatenate((
                fold0, fold2, fold3, fold4), axis=0)
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = np.concatenate((
                fold0, fold1, fold3, fold4), axis=0)
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold4), axis=0)
        else:
            test_dataset = fold4
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold3), axis=0)

        # Actual class column index of the test dataset
        actual_class_column = None           
     
        # If classification problem
        if problem == "c":
            actual_class_column = np.size(test_dataset,1) - 1
        # If regression problem
        else:
            actual_class_column = np.size(test_dataset,1) - 2

        # Create an array of the actual_class_values of the test instances
        actual_class_values = test_dataset[:,actual_class_column]

        no_of_test_instances = np.size(test_dataset,0)

        # Make an empty array called predicted_class_values which will 
        # store the predicted class values. It should be the same length 
        # as the number of test instances
        predicted_class_values = np.zeros(no_of_test_instances)

        # For each row in the test data set
        for row in range(0, no_of_test_instances):
  
            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the k nearest neighbors for each test instance
            this_instance = test_dataset[row,:]
            neighbor_array = knn1.get_neighbors(training_dataset,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)
  
            # Record the prediction in the predicted_class_values array
            predicted_class_values[row] = prediction

        # Calculate the classification accuracy of the predictions 
        # (k-NN classification) or the mean squared error (k-NN regression)
        accuracy = knn1.get_accuracy(actual_class_values,predicted_class_values)

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

        # If classification problem
        if problem == "c":
            temp_acc = accuracy * 100
            outfile_tr.write("Classification Accuracy: " + str(temp_acc) + "%\n")
            outfile_tr.write("\n")
            print("Classification Accuracy: " + str(temp_acc) + "%\n")
        # If regression problem
        else:
            outfile_tr.write("Mean Squared Error: " + str(accuracy) + "\n")
            outfile_tr.write("\n")
            print("Mean Squared Error: " + str(accuracy) + "\n")

    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.")
    print()

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")

    # Write the relevant stats to a file
    outfile_ts.write("\n")

    if problem == "c":
        outfile_ts.write("Problem Type : Classification" + "\n")
    else:
        outfile_ts.write("Problem Type : Regression" + "\n")

    outfile_ts.write("\n")
    outfile_ts.write("Value for k : " + str(k) + "\n")
    outfile_ts.write("\n")

    accuracy = np.mean(accuracy_statistics)

    if problem == "c":
        accuracy *= 100
        outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
    else: 
        outfile_ts.write("Mean Squared Error : " + str(accuracy) + "\n")

    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    # Write the relevant stats to a file
    print()

    if problem == "c":
        print("Problem Type : Classification")
    else:
        print("Problem Type : Regression")

    print()
    print("Value for k : " + str(k))
    print()
    if problem == "c":
        print("Classification Accuracy : " + str(accuracy) + "%")
    else: 
        print("Mean Squared Error : " + str(accuracy))

    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Return to Table of Contents

Output Statistics of the K-Nearest Neighbors Algorithm on the Image Segmentation Data Set

This section shows the results for the runs of the k-nearest neighbors algorithm on the image segmentation data set. I used a k-value of 4, but you can feel free to change this and see what accuracy value you get.

Test Statistics

knn_stats

Trace Runs

Here are the trace runs of the algorithm:

trace-runs-k-nearest-neighbors

Return to Table of Contents

Condensed K-Nearest Neighbors Algorithm

The time and space complexity of the regular k-nearest neighbors algorithm described above is directly proportional to the number of instances in the training set. This could potentially present a problem with massively large data sets. This is where the condensed k-nearest neighbors algorithm (ck-NN) comes in handy.

The idea behind ck-NN is to reduce the size of the training set by selecting the smallest subset of the training data set that results in no loss of classification accuracy. By systematically removing ineffective instances, we reduce the computation time as well as the storage requirement.

Here is how condensed k-NN works:

We start with an empty bin called STORE.

1. Place the first instance in STORE

2. Check whether the second instance can be classified correctly by 1-nearest neighbor using the instance in STORE as the reference.

3. Repeat step 2 for all other instances in the data set.

4. Repeat step 2 on the data set, doing continuous passes over the data set until either

OR

5. Use the instances in STORE as the input for the k-NN classification algorithm.

Here is the code in Python for ck-NN:

import numpy as np # Import Numpy library
from knn import Knn

# File name: cknn.py
# Author: Addison Sears-Collins
# Date created: 6/21/2019
# Python version: 3.7
# Description: Implementation of the condensed k-nearest neighbors 
# algorithm (k-NN) from scratch

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

class CondensedKnn:

    # Constructor
    #   Parameters:
    #     training_set: The training set that we need to prune
    #     problem_type: ('c' for classification)
    def __init__(self, training_set, problem_type="c"):
        self.__training_set = training_set
        self.__problem_type = problem_type

    # Parameters:
    #   None.
    # Returns: 
    #   A training set that has irrelevant instances removed 
    def get_trainingset(self):
        
        # Initialize a condensed training set. Copy it from the actual 
        # training set
        condensed_training_set = np.copy(self.__training_set)

        # Record the number of instances in the condensed training_set
        no_of_training_instances = np.size(condensed_training_set,0)

        # Record the number of columns in the condensed training set
        no_of_training_columns = np.size(condensed_training_set,1)

        # Print the initial number of instances in the condensed training set
        print("\nBefore condensing: " + str(
            no_of_training_instances) + " training instances\n")

        # Record the column index of the actual class of the training_set
        actual_class_column = no_of_training_columns - 1

        # Initialize an array named store with the first instance of the 
        # condensed training_set
        store = np.copy(condensed_training_set[0,:])

        # Create a 2D array
        new_row = np.copy(condensed_training_set[0,:])
        store = np.vstack([store,new_row])

        # For the second instance to the last instance in the condensed 
        # training_set
        row = 1
        while row < no_of_training_instances:
            
            # Record the actual class value
            actual_class_value = condensed_training_set[
                row,actual_class_column]

            # Create an object of class Knn
            knn1 = Knn(1,self.__problem_type) 

            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the nearest neighbor for each instance
            this_instance = condensed_training_set[row,:]
            neighbor_array = knn1.get_neighbors(store,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)

            # If actual class value is not equal to the prediction
            # Append that instance to the store array
            # Remove this instance from the condensed training_set
            if actual_class_value != prediction:
                new_row = np.copy(this_instance)
                store = np.vstack([store,new_row])

                condensed_training_set = np.delete(
                    condensed_training_set, row, 0)
                no_of_training_instances -= 1
            
            row += 1

        # Declare the stopping criteria. We stop when either one complete
        # pass is made through the condensed training set with no more 
        # transfers of instances to store or there are no more instances 
        # remaining in the condensed training data set
  
        no_more_transfers_to_store = False
        no_more_instances_left = None
  
        # Update the number of instances in the condensed training_set
        no_of_training_instances = np.size(condensed_training_set,0)
        
        if no_of_training_instances > 0:
            no_more_instances_left = False
        else:
            no_more_instances_left = True

        while not(no_more_transfers_to_store) and not(no_more_instances_left):
            # Reset the number of transfers_made to 0
            transfers_made = 0

            # For the second instance to the last instance in the condensed 
            # training_set
            row = 0
            while row < no_of_training_instances:

                # Record the actual class value
                actual_class_value = condensed_training_set[
                    row,actual_class_column]

                # Create an object of class Knn
                knn1 = Knn(1,self.__problem_type) 

                # Neighbor array is a 2D array containing the neighbors 
                # (actual class value, distance) 
                # Get the nearest neighbor for each instance
                this_instance = condensed_training_set[row,:]
                neighbor_array = knn1.get_neighbors(store,this_instance)

                # Extract the actual class values
                neighbors_arr = neighbor_array[:,0]
  
                # Predicted class value stored in the variable prediction
                prediction = knn1.make_prediction(neighbors_arr)

                # If actual class value is not equal to the prediction
                # Append that instance to the store array
                # Remove this instance from the condensed training_set
                if actual_class_value != prediction:
                    new_row = np.copy(this_instance)
                    store = np.vstack([store,new_row])

                    condensed_training_set = np.delete(
                        condensed_training_set, row, 0)
                    no_of_training_instances -= 1
                    transfers_made += 1
            
                row += 1
        
            # Update the number of instances in the condensed training_set
            no_of_training_instances = np.size(condensed_training_set,0)

            if no_of_training_instances > 0:
                no_more_instances_left = False
            else:
                no_more_instances_left = True

            if transfers_made > 0:
                no_more_transfers_to_store = False
            else: 
                no_more_transfers_to_store = True

        # Delete row 0 from the store
        store = np.delete(store,0,0)

        # Print the final number of instances in the store
        print("After condensing: " + str(
            np.size(store,0)) + " training instances\n")
        return store

Here is the code (Python) for the driver that executes the program above. It uses the regular k-NN code I presented earlier in this post as well as the five-fold stratified cross-validation code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from five_fold_stratified_cv import FiveFoldStratCv
from knn import Knn
from cknn import CondensedKnn

# File name: cknn_driver.py
# Author: Addison Sears-Collins
# Date created: 6/21/2019
# Python version: 3.7
# Description: Driver for the cknn.py program 
# (Condensed K-Nearest Neighbors)

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

################# INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "Condensed K-Nearest Neighbor"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
##############################################################################

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " program!")
    print()
    print("Running " + ALGORITHM_NAME + ". Please wait...")
    print()

    # k value that will be tuned
    k = eval(input("Enter a value for k: ") )

    # "c" for classification
    problem = input("Press  \"c\" for classification: ") 

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Convert dataframe into a Numpy array
    np_data_set = pd_data_set.to_numpy(copy=True)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Create an object of class FiveFoldStratCv
    fivefolds1 = FiveFoldStratCv(np_data_set,problem)

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Create an object of class Knn
    knn1 = Knn(k,problem) 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = fivefolds1.get_five_folds()

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy for classification problems or mean squared
    # error for regression problems)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run k-NN the designated number of times as indicated by the number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = np.concatenate((
                fold1, fold2, fold3, fold4), axis=0)
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = np.concatenate((
                fold0, fold2, fold3, fold4), axis=0)
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = np.concatenate((
                fold0, fold1, fold3, fold4), axis=0)
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold4), axis=0)
        else:
            test_dataset = fold4
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold3), axis=0)
        
        # Create an object of class CondensedKnn
        cknn1 = CondensedKnn(training_dataset,problem)

        # Get a new, smaller training set with fewer instances
        training_dataset = cknn1.get_trainingset()

        # Actual class column index of the test dataset
        actual_class_column = np.size(test_dataset,1) - 1

        # Create an array of the actual_class_values of the test instances
        actual_class_values = test_dataset[:,actual_class_column]

        no_of_test_instances = np.size(test_dataset,0)

        # Make an empty array called predicted_class_values which will 
        # store the predicted class values. It should be the same length 
        # as the number of test instances
        predicted_class_values = np.zeros(no_of_test_instances)

        # For each row in the test data set
        for row in range(0, no_of_test_instances):
  
            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the k nearest neighbors for each test instance
            this_instance = test_dataset[row,:]
            neighbor_array = knn1.get_neighbors(training_dataset,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)
  
            # Record the prediction in the predicted_class_values array
            predicted_class_values[row] = prediction

        # Calculate the classification accuracy of the predictions 
        # (k-NN classification) or the mean squared error (k-NN regression)
        accuracy = knn1.get_accuracy(actual_class_values,predicted_class_values)

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

        # Stats
        temp_acc = accuracy * 100
        outfile_tr.write("Classification Accuracy: " + str(temp_acc) + "%\n")
        outfile_tr.write("\n")
        print("Classification Accuracy: " + str(temp_acc) + "%\n")


    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.")
    print()

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")

    # Write the relevant stats to a file
    outfile_ts.write("\n")

    outfile_ts.write("Problem Type : Classification" + "\n")

    outfile_ts.write("\n")
    outfile_ts.write("Value for k : " + str(k) + "\n")
    outfile_ts.write("\n")

    accuracy = np.mean(accuracy_statistics)

    accuracy *= 100
    outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
   
    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    # Write the relevant stats to a file
    print()

    print("Problem Type : Classification")
    print()
    print("Value for k : " + str(k))
    print()
    print("Classification Accuracy : " + str(accuracy) + "%")
    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Here are the results for those runs as well as a comparison to regular k-NN:

cknn-stats
knn-cknn-comparison

Classification accuracy was greater than 80% for both the k-NN and ck-NN runs. The classification accuracy for the k-NN runs was about 4 percentage points greater than ck-NN. However, the variability of the classification accuracy for the five experiments in the ck-NN runs was greater than for the k-NN runs.

In a real-world setting for classification, I would select the k-NN algorithm over ck-NN because its performance is more consistent. In addition, while ck-NN might result in improved run times for the actual k-NN runs, there is still the prior overhead associated with having to prune the training set of irrelevant instances.

Return to Table of Contents

K-means Clustering Algorithm From Scratch | Machine Learning

In this post, I will walk you through the k-means clustering algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

Table of Contents

What is the K-means Clustering Algorithm?

The K-means algorithm is an unsupervised clustering algorithm (the target variable we want to predict is not available) that is used to group unlabeled data set instances into clusters based on similar attributes.

Step 1: Choose a Value for K and Select the Initial Centroids

The first step of K-means is to determine the valve for k, the number of clusters you want to group your data set into. We then randomly select k cluster centroids (i.e. k random instances) from the data set, where k is some positive integer.

Step 2: Cluster Assignment

The next step is the cluster assignment step. This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it.

Step 3: Move Centroid

The next part is the move centroid step. We move each of the k centroids to the average of all the instances that were assigned to that cluster centroid.

Step 4: Repeat Steps 2 and 3 Until Centroids Stop Changing

We keep running the cluster assignment and move centroid steps until the cluster centroids stop changing. At that point, the data is deemed clustered. 

Andrew Ng, a professor of machine learning at Stanford University, does a good job of showing you visually how k-means clustering works:

Return to Table of Contents

K-means Clustering Algorithm Design

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class (used for classification accuracy calculation)

The program then adds 4 additional columns.

  • N + 1: Cluster
  • N + 2: Silhouette Coefficient
  • N + 3: Predicted Class
  • N + 4: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

iris-data-set

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set.

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

binarization-kmeans

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

K-means Clustering Algorithm in Python, Coded From Scratch

K-means appears to be particularly sensitive to the starting centroids. The starting centroids for the k clusters were chosen at random. When these centroids started out poor, the algorithm took longer to converge to a solution. Future work would be to fine-tune the initial centroid selection process. 

Here is the full code for k-means clustering:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: kmeans.py
# Author: Addison Sears-Collins
# Date created: 6/12/2019
# Python version: 3.7
# Description: Implementation of K-means clustering algorithm from scratch.
# K-means algorithm is a clustering algorithm that is used to group 
# unlabeled data set instances into clusters based on similar attributes.

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class (used for classification accuracy calculation)

# This program then adds 4 additional columns.
# N + 1: Cluster
# N + 2: Silhouette Coefficient
# N + 3: Predicted Class
# N + 4: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-means"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_kmeans_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_kmeans_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_kmeans_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Read the full text file and store records in a Pandas dataframe
pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Copy the dataframe into a new dataframe so we don't mess up the
# original data
pd_data_set = pd_full_data_set.copy() 

# Calculate the number of instances, columns, and attributes in the
# training data set. Assumes 1 column for the instance ID and 1 column
# for the class. Record the index of the column that contains 
# the actual class
no_of_instances = len(pd_data_set.index) # number of rows
no_of_columns = len(pd_data_set.columns) # number of columns
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Store class values in a column and then create a list of unique
# classes and store in a dataframe and a Numpy array
unique_class_list_df = pd_data_set.iloc[:,actual_class_column]
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

# Record the number of unique classes in the data set
num_unique_classes = len(unique_class_list_df)

# Record the value for K, the number of clusters
K = num_unique_classes

# Remove the Instance and the Actual Class Column to create an unlabled
# data set
instance_id_colname = pd_data_set.columns[0]
class_column_colname = pd_data_set.columns[actual_class_column]
pd_data_set = pd_data_set.drop(columns = [ # Each row is a different instance
        instance_id_colname, class_column_colname]) 

# Convert dataframe into a Numpy array
np_data_set = pd_data_set.to_numpy(copy=True)

# Randomly select k instances from the data set. 
# These will be the cluster centroids for the first iteration
# of the algorithm.
centroids = np_data_set[np.random.choice(np_data_set.shape[
    0], size=K, replace=False), :]


##################### Cluster Assignment Step ################################
# Go through each instance and assign that instance to the closest 
# centroid (based on Euclidean distance).

# Initialize an array which will contain the cluster assignments for each
# instance.
cluster_assignments = np.empty(no_of_instances)

# Goes True if new centroids are the same as the old centroids
centroids_the_same = False

# Sets the maximum number of iterations
max_iterations = 300

while max_iterations > 0 and not(centroids_the_same):
    # Go through each data point and assign it to the nearest centroid
    for row in range(0, no_of_instances):
    
        this_instance = np_data_set[row]

        # Calculate the Euclidean distance of each instance in the data set
        # from each of the centroids
        # Find the centroid with the minimum distance and assign the instance
        # to that centroid.
        # Record that centroid in the cluster assignments array.
    
        # Reset the minimum distance to infinity
        min_distance = float("inf")

        for row_centroid in range(0, K):
            this_centroid = centroids[row_centroid]
        
            # Calculate the Euclidean distance from this instance to the
            # centroid
            distance = np.linalg.norm(this_instance - this_centroid)

            # If we have a centroid that is closer to this instance,
            # update the cluster assignment for this instance.
            if distance < min_distance:
                cluster_assignments[row] = row_centroid
                min_distance = distance # Update the minimum distance

    # Print after each cluster assignment has completed
    print("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:")
    print(cluster_assignments)
    print()
    print("Now calculating the new centroids...")
    print()

    outfile3.write("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:"+ "\n")
    outfile3.write(str(cluster_assignments))
    outfile3.write("\n")
    outfile3.write("\n")
    outfile3.write("Now calculating the new centroids..." + "\n")
    outfile3.write("\n")


    ##################### Move Centroid Step ################################
    # Calculate the centroids of the clusters by computing the average
    # of the attribute values of the instances in each cluster
    # For each row in the centroids 2D array

    # Store the old centroids
    old_centroids = centroids.copy()

    for row_centroid in range(0, K):

        # For each column of each row of the centroids 2D array
        for col_centroid in range(0, no_of_attributes):

            # Reset the running sum and the counter
            running_sum = 0.0
            count = 0.0
            average = None

            for row in range(0, no_of_instances):

                # If this instance belongs to this cluster
                if(row_centroid == cluster_assignments[row]):
                
                    # Add this value to the running sum
                    running_sum += np_data_set[row,col_centroid]

                    # Increment the counter
                    count += 1
        
                    if (count > 0):
                        # Calculate the average
                        average = running_sum / count

            # Update the centroids array with this average
            centroids[row_centroid,col_centroid] = average
    
    # Print to after each cluster assignment has completed
    print("New centroids have been created. Here they are:")
    print(centroids)
    print()

    outfile3.write("New centroids have been created. Here they are:" + "\n")
    outfile3.write(str(centroids))
    outfile3.write("\n")
    outfile3.write("\n")

    # Check if cluster centroids are the same
    centroids_the_same = np.array_equal(old_centroids,centroids)

    if centroids_the_same:
        print(
        "Cluster membership is unchanged. Stopping criteria has been met.")
        outfile3.write("Cluster membership is unchanged. ")
        outfile3.write("Stopping criteria has been met." + "\n")
        outfile3.write("\n")

    # Update the number of iterations
    max_iterations -= 1

# Record the actual class column name
actual_class_col_name = pd_full_data_set.columns[len(
    pd_full_data_set.columns) - 1]

# Add 4 additional columns to the original data frame
pd_full_data_set = pd_full_data_set.reindex(
      columns=[*pd_full_data_set.columns.tolist(
      ), 'Cluster', 'Silhouette Coefficient', 'Predicted Class', (
      'Prediction Correct?')])

# Add the final cluster assignments to the Pandas dataframe
pd_full_data_set['Cluster'] = cluster_assignments

outfile3.write("Calculating the Silhouette Coefficients. Please wait..." + "\n")
outfile3.write("\n")
print()
print("Calculating the Silhouette Coefficients. Please wait...")
print()
################## Calculate the Silhouette Coefficients ######################
# Rewards clusterings that have good cohesion and good separation. Varies 
# between 1 and -1. -1 means bad clustering, 1 means great clustering.

# 1. For each instance calculate the average distance to all other instances 
# in that cluster. This is a.
# 2. (Find the average distance to all the instances in the nearest neighbor 
# cluster). For each instance and any cluster that does not contain the 
# instance calculate the average distance to all
# of the points in that other cluster. Then return the minimum such value
# over all of the clusters. This is b.
# 3. For each instance calculate the Silhouette Coefficient s where
# s = (b-a)/max(a,b)
# Store the value in the data frame

silhouette_column = actual_class_column + 2

# Go through one instance at a time
for row in range(0, no_of_instances):

    this_instance = np_data_set[row]
    this_cluster = cluster_assignments[row]

    a = None
    running_sum = 0.0
    counter = 0.0

    # Calculate the average distance to all other instances 
    # in this cluster. This is a.
    # Go through one instance at a time
    for row_2 in range(0, no_of_instances):

        # If the other instance is in the same cluster as this instance
        if this_cluster == cluster_assignments[row_2]:

            # Calculate the distance
            distance = np.linalg.norm(this_instance - np_data_set[row_2])

            # Add the distance to the running sum
            running_sum += distance
            counter += 1

    # Calculate the value for a
    if counter > 0:
        a = running_sum / counter

    # For each instance and any cluster that does not contain the 
    # instance calculate the average distance to all
    # of the points in that other cluster. Then return the minimum such value
    # over all of the clusters. This is b.
    b = float("inf") 
    
    for clstr in range(0, K):

        running_sum = 0.0
        counter = 0.0

        # Must be other clusters, not the one this instance is in
        if clstr != this_cluster:

            # Calculate the average distance to instances in that 
            # other cluster
            for row_3 in range(0, no_of_instances):

                if cluster_assignments[row_3] == clstr:

                    # Calculate the distance
                    distance = np.linalg.norm(this_instance - np_data_set[
                        row_3])

                    # Add the distance to the running sum
                    running_sum += distance
                    counter += 1
        
            if counter > 0:
                avg_distance_to_cluster = running_sum / counter
        
            # Update b if we have a new minimum
            if avg_distance_to_cluster < b:
                b = avg_distance_to_cluster

    # Calculate the Silhouette Coefficient s where s = (b-a)/max(a,b)
    s = (b - a) / max(a,b)

    # Store the Silhouette Coefficient in the Pandas data frame
    pd_full_data_set.iloc[row,silhouette_column] = s

#################### Predict the Class #######################################
# For each cluster, determine the predominant class and assign that 
# class to the cluster. Then determine if the prediction was correct.
# Create a data frame that maps clusters to actual classes
class_mappings = pd.DataFrame(index=range(K),columns=range(1))

for clstr in range(0, K):

    # Select rows whose column equals that cluster value
    temp_df = pd_full_data_set.loc[pd_full_data_set['Cluster'] == clstr]
    
    # Select the predominant class
    class_mappings.iloc[clstr,0] = temp_df.mode()[actual_class_col_name][0]

cluster_column = actual_class_column + 1
pred_class_column = actual_class_column + 3
pred_correct_column = actual_class_column + 4

# Assign the relevant class to each instance
# See if prediction was correct
for row in range(0, no_of_instances):

    # Go through each of the clusters to check if the instance is a member
    # of that cluster
    for clstr in range(0, K):
        if clstr == pd_full_data_set.iloc[row,cluster_column]:

            # Assign the relevant class to this instance
            pd_full_data_set.iloc[
                row,pred_class_column] = class_mappings.iloc[clstr,0]

    # If the prediction was correct
    if pd_full_data_set.iloc[row,pred_class_column] == pd_full_data_set.iloc[
        row,actual_class_column]:
        pd_full_data_set.iloc[row,pred_correct_column] = 1
    else: # If incorrect prediction
        pd_full_data_set.iloc[row,pred_correct_column] = 0

# Write dataframe to a file
pd_full_data_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Print data frame to the console
print()
print()
print("Data Set")
print(pd_full_data_set)
print()
print()

################### Summary Statistics #######################################
# Calculate the average Silhouette Coefficient for the data set
# Calculate the accuracy of the clustering-based classifier

# Open a new file to save the summary statistics
outfile1 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile1.write("\n")
outfile1.write("Number of Instances : " + str(no_of_instances) + "\n")
outfile1.write("\n")
outfile1.write("Value for k : " + str(K) + "\n")

# Calculate average Silhouette Coefficient for the data set
silhouette_coefficient = pd_full_data_set.loc[
    :,"Silhouette Coefficient"].mean()

# Write the Silhouette Coefficient to the file
outfile1.write("Silhouette Coefficient : " + str(
    silhouette_coefficient) + "\n")
      
# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (pd_full_data_set.iloc[
        :,pred_correct_column].sum())/no_of_instances

accuracy *= 100

# Write accuracy to the file
outfile1.write("Accuracy : " + str(accuracy) + "%\n")

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Instances : " + str(no_of_instances))
print("Value for k : " + str(K))

# Print the Silhouette Coefficient to the console
print("Silhouette Coefficient : " + str(
    silhouette_coefficient))

# Print accuracy to the console
print("Accuracy : " + str(accuracy) + "%")

# Close the files
outfile1.close()
outfile3.close()

Return to Table of Contents

Output Statistics of K-means Clustering on the Iris Data Set

This section shows the results for the runs of K-means Clustering on the iris data set.

Test Statistics

test-stats-kmeans

Trace Runs

Here are the trace runs of the algorithm.

Output

Here is the output of the algorithm.

Return to Table of Contents

Stepwise Forward Selection Algorithm From Scratch

In this post, I will walk you through the Stepwise Forward Selection algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python and use it for feature selection for the Naive Bayes algorithm we previously developed. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

Table of Contents

What is the Stepwise Forward Selection Algorithm?

One of the fundamental ideas in machine learning is the Curse of Dimensionality – as the number of attributes in the training set grows, you need exponentially more instances to learn the underlying concept. And the more instances you have, the more computational time is required to process the data.

In other words, adding attributes improves performance, to a point. Beyond that point, you need to have exponentially more instances in order for performance to improve. This phenomenon is commonly known as the Hughes Effect (or Hughes Phenomenon).

curse-of-dimensionality

Image Source: Hughes, G. (1968). On the Mean Accuracy Of Statistical Pattern Recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.

In the above image, n = number of attributes and m = number of instances.

One technique for combatting the Curse of Dimensionality is known as Stepwise Forward Selection (SFS). SFS involves selecting only the most relevant attributes for learning and discarding the rest. The metric that determines what attribute is “most relevant” is determined by the programmer. In this project, I will use the classification accuracy of Naïve Bayes.

Because SFS is just a method to create an optimal attribute subset, it needs to be wrapped with another algorithm (in this case Naïve Bayes) that does the actual learning. For this reason, SFS is known as a wrapper method.

The goal of SFS is to reduce the number of features that a machine learning algorithm needs to examine by eliminating unnecessary features and finding an optimal attribute subset.

Suppose we have attributes A1 through Ak which are used to describe a data set D.

A = <A1,…Ak>

SFS begins with an empty attribute set:

A0 = <>

It then selects an attribute from A, one at a time, until the performance (e.g. as measured by classification accuracy) of the target learning algorithm (e.g. Naïve Bayes) on a subset of the training data fails to improve performance. 

Return to Table of Contents

Stepwise Forward Selection Implementation

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class

The program I developed (code is below) then adds 2 additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

preprocessing-data

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set. Here is what an iris flower looks like in case you are interested:

iris-flower

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

binarized-data

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

Stepwise Forward Selection Algorithm in Python, Coded From Scratch

Naïve Bayes was wrapped with SFS for feature selection. Classification accuracy was used to select the optimal attribute at each step. Once the optimal attribute subset was obtained, Naïve Bayes was run on the complete training set (67% of instances) and testing set (33% of instances), and classification accuracy was calculated.

Here is the code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: sfs_naive_bayes.py
# Author: Addison Sears-Collins
# Date created: 6/10/2019
# Python version: 3.7
# Description: Implementation of Naive Bayes which uses Stepwise Forward 
# Selection (SFS) for feature selection. This code works for multi-class 
# classification problems (e.g. democrat/republican/independent)
# Calculate P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0) and
# P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1) and
# P(E1|CL2)P(E2|CL2)P(E3|CL2)...P(E#|CL2) * P(CL2), etc. and predict the
# class with the maximum result. 
# E is an attribute, and CL means class.
# Only need class prior probability and likelihoods to make a prediction
# (i.e. the numerator of Bayes formula) since denominators are same for both 
# the P(CL0|E1,E2,E3...)*P(CL0) and P(CL1|E1,E2,E3...)*P(CL1), etc. cases 
# where P means "probability of" and | means "given".

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class 

# This program then adds 2 additional columns for the testing set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "SFS-Wrapped Naive Bayes"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_naive_bayes_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_naive_bayes_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_naive_bayes_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
TRAINING_DATA_PRCT = 0.67 # % of data set used for training. Default 0.67
testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing
SEED = 99  # SEED for the random number generator. Default: 99
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Naive Bayes algorithm. Accepts a Pandas dataframeas its parameter
def naive_bayes(pd_data):
    # Create a training dataframe by sampling random instances from original data.
    # random_state guarantees that the pseudo-random number generator generates 
    # the same sequence of random numbers each time.
    pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED)

    # Create a testing dataframe. Dropping the training data from the original
    # dataframe ensures training and testing dataframes have different instances
    pd_testing_data = pd_data.drop(pd_training_data.index)

    ######COMMENT OUT THESE TWO LINES BEFORE RUNNING ################
    #pd_training_data = pd_data.iloc[:20] # Used for testing only, gets 1st 20 rows
    #pd_testing_data = pd_data.iloc[20:22,:] # Used for testing only, rows 20 &amp; 21

    # Calculate the number of instances, columns, and attributes in the
    # training data set. Assumes 1 column for the instance ID and 1 column
    # for the class. Record the index of the column that contains 
    # the actual class
    no_of_instances_train = len(pd_training_data.index) # number of rows
    no_of_columns_train = len(pd_training_data.columns) # number of columns
    no_of_attributes = no_of_columns_train - 2
    actual_class_column = no_of_columns_train - 1

    # Store class values in a column, sort them, then create a list of unique
    # classes and store in a dataframe and a Numpy array
    unique_class_list_df = pd_training_data.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    # Record the number of unique classes in the data set
    num_unique_classes = len(unique_class_list_df)

    # Record the frequency counts of each class in a Numpy array
    freq_cnt_class = pd_training_data.iloc[:,actual_class_column].value_counts(
        sort=True)

    # Record the frequency percentages of each class in a Numpy array
    # This is a list of the class prior probabilities
    class_prior_probs = pd_training_data.iloc[:,actual_class_column].value_counts(
        normalize=True, sort=True)

    # Add 2 additional columns to the testing dataframe
    pd_testing_data = pd_testing_data.reindex(
                      columns=[*pd_testing_data.columns.tolist(
                      ), 'Predicted Class', 'Prediction Correct?'])

    # Calculate the number of instances and columns in the
    # testing data set. Record the index of the column that contains the 
    # predicted class and prediction correctness (1 if yes; 0 if no)
    no_of_instances_test = len(pd_testing_data.index) # number of rows
    no_of_columns_test = len(pd_testing_data.columns) # number of columns
    predicted_class_column = no_of_columns_test - 2
    prediction_correct_column = no_of_columns_test - 1

    ######################### Training Phase of the Model ########################
    # Create a an empty dictionary
    my_dict = {}

    # Calculate the likelihood tables for each attribute. If an attribute has
    # four levels, there are (# of unique classes x 4) different probabilities 
    # that need to be calculated for that attribute.
    # Start on the first attribute and make your way through all the attributes
    for col in range(1, no_of_attributes + 1):

        # Record the name of this column 
        colname = pd_training_data.columns[col]

        # Create a dataframe containing the unique values in the column
        unique_attribute_values_df = pd_training_data[colname].drop_duplicates()
        # Create a Numpy array containing the unique values in the column
        unique_attribute_values_np = pd_training_data[colname].unique()
    
        # Calculate likelihood of the attribute given each unique class value
        for class_index in range (0, num_unique_classes):
        
            # For each unique attribute value, calculate the likelihoods 
            # for each class
            for attr_val in range (0, unique_attribute_values_np.size) :
                running_sum = 0

                # Calculate N(unique attribute value &amp;&amp; class value)
                # Where N means "number of" and &amp;&amp; means "and"
                # Go through each row of the training set
                for row in range(0, no_of_instances_train):
                    if (pd_training_data.iloc[row,col] == (
                        unique_attribute_values_df.iloc[attr_val])) and (
                        pd_training_data.iloc[row, actual_class_column] == (
                        unique_class_list_df.iloc[class_index])):
                            running_sum += 1

                # With N(unique attribute value &amp;&amp; class value) as the numerator
                # we now need to divide by the total number of times the class
                # appeared in the data set
                try:
                    denominator = freq_cnt_class[class_index]
                except:
                    denominator = 1.0
            
                likelihood = running_sum / denominator
            
                # Add a new likelihood to the dictionary
                # Format of search key is 
                # <attribute_name><attribute_value><class_value>
                search_key = str(colname) + str(unique_attribute_values_df.iloc[
                             attr_val]) + str(unique_class_list_df.iloc[
                             class_index])
                my_dict[search_key] = likelihood
 
    # Print the likelihood table to the console
    # print(pd.DataFrame.from_dict(my_dict, orient='index'))

    ################# End of Training Phase of the Naive Bayes Model ########

    ################# Testing Phase of the Naive Bayes Model ################

    # Proceed one instance at a time and calculate the prediction
    for row in range(0, no_of_instances_test):

        # Initialize the prediction outcome
        predicted_class = unique_class_list_df.iloc[0]
        max_numerator_of_bayes = 0.0

        # Calculate the Bayes equation numerator for each test instance
        # That is: P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0),
        # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1)...
        for class_index in range (0, num_unique_classes):

            # Reset the running product with the class
            # prior probability, P(CL)
            try:
                running_product = class_prior_probs[class_index]
            except:
                running_product = 0.0000001 # Class not found in data set
        
            # Calculation of P(CL) * P(E1|CL) * P(E2|CL) * P(E3|CL)...
            # Format of search key is 
            # <attribute_name><attribute_value><class_value>
            # Record each search key value
            for col in range(1, no_of_attributes + 1):
                attribute_name = pd_testing_data.columns[col]
                attribute_value = pd_testing_data.iloc[row,col]
                class_value = unique_class_list_df.iloc[class_index]

                # Set the search key
                key = str(attribute_name) + str(
                          attribute_value) + str(class_value)

                # Update the running product
                try:
                    running_product *= my_dict[key]
                except:
                    running_product *= 0

            # Record the prediction if we have a new max
            # Bayes numerator
            if running_product > max_numerator_of_bayes:
                max_numerator_of_bayes = running_product
                predicted_class = unique_class_list_df.iloc[
                             class_index] # New predicted class

        # Store the prediction in the dataframe
        pd_testing_data.iloc[row,predicted_class_column] = predicted_class
    
        # Store if the prediction was correct
        if predicted_class == pd_testing_data.iloc[row,actual_class_column]:
            pd_testing_data.iloc[row,prediction_correct_column] = 1
        else: 
            pd_testing_data.iloc[row,prediction_correct_column] = 0

    print("-------------------------------------------------------")
    print("Learned Model Predictions on Testing Data Set")
    print("-------------------------------------------------------")

    # Print the revamped dataframe
    print(pd_testing_data)

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (pd_testing_data.iloc[
        :,prediction_correct_column].sum())/no_of_instances_test
    # Return classification accuracy
    return accuracy
    ####################### End Testing Phase ######################################    

# Stepwise forward selection method accepts a Pandas dataframe as a parameter
# and then returns the optimal dataframe from that dataframe.
def sfs(pd_data):

    # Record the name of the column that contains the instance IDs and the class
    instance_id_colname = pd_data.columns[0]
    no_of_columns = len(pd_data.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = pd_data.columns[class_column_index]

    # Record the number of available attributes
    no_of_available_attributes = no_of_columns - 2

    # Create a dataframe containing the available attributes by removing
    # the Instance and the Class Column
    available_attributes_df = pd_data.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Create an empty optimal attribute dataframe containing only the
    # Instance and the Class Columns
    optimal_attributes_df = pd_data[[instance_id_colname,class_column_colname]]

    # Set the base performance to a really low number
    base_performance = -9999.0

    # Check whether adding a new attribute to the optimal attributes dataframe
    # improves performance
    # While there are still available attributes left
    while no_of_available_attributes > 0: 
        # Set the best performance to a low number
        best_performance = -9999.0

        # Initialize the best attribute variable to a placeholder
        best_attribute = "Placeholder"

        # For all attributes in the available attribute data frame
        for col in range(0, len(available_attributes_df.columns)):

            # Record the name of this attribute
            this_attr = available_attributes_df.columns[col]
        
            # Create a new dataframe with this attribute inserted
            temp_opt_attr_df = optimal_attributes_df.copy()
            temp_opt_attr_df.insert(
                loc=1,column=this_attr,value=(available_attributes_df[this_attr]))

            # Run Naive Bayes on this new dataframe and return the 
            # classification accuracy
            current_performance = naive_bayes(temp_opt_attr_df)

            # Find the new attribute that yielded the greatest
            # classification accuracy
            if current_performance > best_performance:
                best_performance = current_performance
                best_attribute = this_attr

        # Did adding another feature lead to improvement?
        if best_performance > base_performance:
            base_performance = best_performance

            # Add the best attribute to the optimal attribute data frame
            optimal_attributes_df.insert(
                loc=1,column=best_attribute,value=(
                available_attributes_df[best_attribute]))

            # Remove the best attribute from the available attribute data frame
            available_attributes_df = available_attributes_df.drop(
                columns = [best_attribute]) 

            # Print the best attribute to the console
            print()
            print(str(best_attribute) + " added to the optimal attribute subset")
            print()

            # Write to file
            outfile3.write(str(
                best_attribute) + (
                " added to the optimal attribute subset") + "\n")
           
            # Decrement the number of available attributes by 1
            no_of_available_attributes -= 1

            # Print number of attributes remaining to the console
            print()
            print(str(no_of_available_attributes) + " attributes remaining")
            print()
            print()

            # Write to file
            outfile3.write(str(no_of_available_attributes) + (
                " attributes remaining") + "\n" + "\n")
           
        else:
            print()
            print("Performance did not improve this round.")
            print("End of Stepwise Forward Selection.")
            print()
            outfile3.write("Performance did not improve this round." + "\n")
            outfile3.write("End of Stepwise Forward Selection." + "\n")
            break

    # Return the optimal attribute set
    return optimal_attributes_df


def main():
    # Read the full text file and store records in a Pandas dataframe
    pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

    # Used for feature selection of large data sets only
    #pd_full_data_set = pd_full_data_set.sample(frac=0.10, random_state=SEED)
    
    # Runs SFS on the full data set to find the optimal attribute data frame
    optimal_attribute_set = sfs(pd_full_data_set)

    # Write dataframe to a file
    optimal_attribute_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

    # Open a new file to save the summary statistics
    outfile2 = open(TEST_STATS_FILE,"w") 

    # Write to a file
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write("Data Set : " + DATA_PATH + "\n")

    # Write the relevant stats to a file
    outfile2.write("\n")
    outfile2.write("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)) + "\n")
       
    # Run Naive Bayes on the optimal attribute set
    accuracy = naive_bayes(optimal_attribute_set)
    accuracy *= 100

    # Write accuracy to the file
    outfile2.write("Accuracy : " + str(accuracy) + "%\n")

    # Print statistics to console
    print()
    print()
    print("-------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics (Testing)")
    print("-------------------------------------------------------")
    print("Data Set : " + DATA_PATH)

    # Print the relevant stats to the console
    print()
    print("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)))

    # Print accuracy to the console
    print()
    print("Accuracy : " + str(accuracy) + "%")
    print()

    # Create a dataframe containing the optimal attributes by removing
    # the Instance and the Class Column
    instance_id_colname = optimal_attribute_set.columns[0]
    no_of_columns = len(optimal_attribute_set.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = optimal_attribute_set.columns[class_column_index]
    optimal_attribute_set = optimal_attribute_set.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Write a list of the optimal attribute set to a file
    outfile2.write("Optimal Attribute Subset : " + str(list(
        optimal_attribute_set)))

    # Print a list of the optimal attribute set
    print("Optimal Attribute Subset : " + str(list(optimal_attribute_set)))
    print()
    print()   
    
    # Close the files
    outfile2.close()
    outfile3.close()

main() # Call the main function

Return to Table of Contents

Output Statistics of Stepwise Forward Selection on the Iris Data Set

This section shows the results for the runs of Stepwise Forward Selection wrapped with Naïve Bayes on the iris data set.

Test Statistics

sfs-test-stats-1

Trace Runs

stepwise-forward-selection-trace-runs

Here is a link to the output file.

Return to Table of Contents

Winnow2 Algorithm From Scratch | Machine Learning

In this post, I will walk you through the Winnow2 machine learning algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the breast cancer data set from the UCI Machine Learning Repository. Our goal is to be able to predict if a patient has breast cancer or not based on ten different attributes. Without further ado, let’s get started!

Table of Contents

What is the Winnow2 Algorithm?

The Winnow2 algorithm was invented by Nick Littlestone in the late 1980s. Winnow2 is an example of supervised learning. Supervised learning is the most common type of machine learning.

In supervised learning, you feed the machine learning algorithm data that contains a set of attributes (also known as x, predictors, inputs, features, etc.) and the corresponding classification (also known as label, output, target variable, correct answer, etc.). Each instance (also known as data point, example, etc.) helps the Learner (the prediction mathematical model we are trying to build) learn the association between the attributes and the class. The job of the Learner is to create a model that enables it to use just the values of the attributes to predict the class.

For example, consider how a supervised learning algorithm would approach the breast cancer data set that has 699 instances. Each instance is a different medical patient case seen by a doctor.

winnow2-algorithm-1

For each instance, we have:

  • 9 attributes: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.
  • 2 classes: malignant (breast cancer detected) or benign (breast cancer not detected)

The end goal of a classification supervised learning algorithm like Winnow2 is to develop a model that can accurately predict if a new patient has breast cancer or not based on his or her attribute values. And in order to do that, the Learner must be trained on the existing data set in order to learn the association between the 9 attributes and the class.

In Winnow2, we need to preprocess a data set so that both the attributes and the class are binary values, 0 (zero) or 1 (one). For example, in the example I presented above, the class value for each instance is either 0 (benign – breast cancer not detected) or 1 (malignant – breast cancer detected).

Return to Table of Contents

Algorithm Steps

The Winnow2 algorithm continuously updates as each new instance arrives (i.e. an online learning algorithm). Here is how Winnow2 works on a high level:

Step 1: Learner (prediction model) receives an instance (data point):

Since Winnow2 can only work with 0s and 1s, the attribute values need to have already been preprocessed to be either 0 or 1. For example, using the breast cancer data example, attributes like clump thickness need to first be binarized so that they are either 0 or 1. One-hot encoding is the method used in this project in order to perform this binarization.

Step 2: Learner makes a prediction of the class the instance belongs to based on the values of the attributes in the instance

For example, does the Learner believe the patient belongs to the breast cancer class based on the values of the attributes? The Learner predicts 0 if no (benign) or 1 if yes (malignant).

Step 3: Is the Prediction Correct?

The Learner is told the correct class after it makes a prediction. If the learner is correct, nothing happens. Learning (i.e. amending the mathematical model) only occurs when the Learner gets an incorrect answer.

There are two ways for the prediction to be incorrect:

winnow2-algorithm-5

In order to “learn”, a process called “Promotion” (to be explained in the next section) takes place when the Learner incorrectly predicts 0. This situation is a “false negative” result. A false negative occurs when the Learner predicts a person does not have a disease or condition when the person actually does have it.

“Demotion” takes place when the Learner incorrectly predicts 1. This situation is a “false positive” result. A false positive (also known as “false alarm”) occurs when the Learner predicts a person has a specific disease or condition when the person actually does not have it.

Promotion and Demotion: Building the Prediction Model One Instance at a Time

To understand the promotion and demotion activities of Winnow2, we need to first examine the Learner’s mathematical model, the tool being used to predict whether an instance belongs to the benign (0) class or malignant (1) class.

When the Learner receives an instance (“Step 1” from the previous section), it runs the attributes through the following weighted sum:

winnow2-algorithm-6

where:

  • d is the total number of attributes
  • wi is the weighting of the ith attribute  
  • xi is the value of the ith attribute in binary format
  • f(x) is the weighted sum (e.g. w1x1 + w2x2 + w3x3 + …. + wdxd)

The Learner then predicts an instance’s class (e.g. 1 for malignant and 0 for benign in our breast cancer example) as follows where:

hofx
  • h(x) is the predicted class
  • Ɵ is a constant threshold (commonly set to 0.5)

As mentioned before, if the learner makes an incorrect prediction, either promotion or demotion occurs. In both promotion and demotion, the weights wi are modified by using a constant parameter α which is any value greater than 1 (commonly set to 2).

Initially all the weights wi for each attribute are set to 1. They are then adjusted as follows:

winnow2-algorithm-2

Return to Table of Contents

Winnow2 Example Using Real Data

Let’s take a look at the Winnow2 algorithm using an example with real data. We have three different attributes (which have already been converted from real numbers into binary form) and a class variable (which has also been converted into binary form). Our goal is to develop a model that can use just those three attributes to predict if a medical patient has breast cancer or not.

  • 3 attributes: clump thickness (x1), uniformity of cell size(x2), uniformity of cell shape (x3),
  • 2 classes (labels): malignant (breast cancer detected) or benign (breast cancer not detected)

Remember:

  • All weights w for each attribute are initially set to 1. This is known as the “weight vector.”
  • Ɵ = 0.5. This is our threshold.
  • α = 2
  • d = 3 because there are 3 attributes.

Here is the original data set with the attributes (inputs) and the class (output):

winnow2-algorithm-3

Here is what we have after we run Winnow2. We proceeded row-by-row in the original data set (one instance at a time), which generated this table:

winnow2-algorithm-4

Return to Table of Contents

Winnow2 Implementation

I implemented Winnow2 from scratch in Python and then ran the algorithm on a real-world data set, the breast cancer data set from the UCI Machine Learning Repository.

Preprocessing the Example Data Set – Breast Cancer

The breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign. I transformed the attributes and classes into binary numbers (making them discrete is fine too, but they cannot be continuous) so that the algorithms could process the data properly and efficiently. If attribute value was greater than 5, the value was changed to 1, otherwise it was 0. I also changed Benign from 2 to 0 and Malignant from 4 to 1.

There were 16 missing attribute values in the data set, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

Finally, for each run, data sets were split into 67% for training and 33% for testing. Summary statistics were then calculated.

Here is the format the input data needs to take in the raw text/csv file. Attributes can be numerical or text, but in this example the were all numerical:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1 (in binary)
  • 2: Attribute 2 (in binary)
  • 3: Attribute 3 (in binary)
  • N: Actual Class (in binary)

The program then adds 8 additional columns:

  • N + 1: Weighted Sum (of the attributes)
  • N + 2: Predicted Class (in binary)…Weighted Sum > 0? (1 if yes; 0 if no)
  • N + 3: True Positive (1 if yes; O if no)
  • N + 4: False Positive (1 if yes; 0 if no)
  • N + 5: False Negative (1 if yes; 0 if no)
  • N + 6: True Negative (1 if yes; 0 if no)
  • N + 7: Promote (1 if yes; 0 if no)  [for training set only]
  • N + 8: Demote (1 if yes; 0 if no) [for training set only]

Here is a link to the input file after all that preprocessing was performed.

Here is a link to the output file after the algorithm below was run on the input file.

Here is a link to the final weights after the algorithm below was run on the input file.

Return to Table of Contents

Winnow2 Algorithm in Python, Coded From Scratch

Here is the code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: winnow2.py
# Author: Addison Sears-Collins
# Date created: 5/31/2019
# Python version: 3.7
# Description: Implementation of the Winnow2 machine learning
# algorithm invented by Nick Littlestone. Used for 2-class classificaiton
# problems (e.g. cancer/no cancer....spam/not spam, etc...)
# Nick Littlestone (1988). "Learning Quickly When Irrelevant Attributes 
# Abound: A New Linear-threshold Algorithm", Machine Learning 285–318(2)

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 (in binary)
# 2: Attribute 2 (in binary)
# 3: Attribute 3 (in binary)
# ...
# N: Actual Class (in binary)

# This program then adds 8 additional columns:
# N + 1: Weighted Sum (of the attributes)
# N + 2: Predicted Class (in binary)...Weighted Sum > 0? (1 if yes; 0 if no)
# N + 3: True Positive (1 if yes; O if no)
# N + 4: False Positive (1 if yes; 0 if no)
# N + 5: False Negative (1 if yes; 0 if no)
# N + 6: True Negative (1 if yes; 0 if no)
# N + 7: Promote (1 if yes; 0 if no) [for training set only]
# N + 8: Demote (1 if yes; 0 if no)  [for training set only]

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "Winnow2"
THETA = 0.5   # This is the threshold constant for the Winnow2 algorithm
ALPHA = 2.0    # This is the adjustment constant for promotion &amp; demotion
DATA_PATH = "breast_cancer_dataset.txt"  # Directory where data set is located
TRAIN_WEIGHTS_FILE = "breast_cancer_winnow2_train_weights.txt" # Weights of learned model
TRAIN_OUT_FILE = "breast_cancer_winnow2_train_out.txt" # Training phase of the model
TEST_STATS_FILE = "breast_cancer_winnow2_test_stats.txt" # Testing statistics
TEST_OUT_FILE = "breast_cancer_winnow2_test_out.txt" # Testing phase of the model
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
CLASS_IF_ONE = "Malignant" # If Class value is 1 (e.g. Malignant, Spam, etc.)
CLASS_IF_ZERO = "Benign"  # If Class value is 0 (e.g. Benign, Not Spam, etc.)
TRAINING_DATA_PRCT = 0.67 # % of data set used for training
testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing
SEED = 99  # SEED for the random number generator. Default: 99
#############################################################################

# Read a text file and store records in a Pandas dataframe
pd_data = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Create a training dataframe by sampling random instances from original data.
# random_state guarantees that the pseudo-random number generator generates 
# the same sequence of random numbers each time.
pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED)

# Create a testing dataframe. Dropping the training data from the original
# dataframe ensures training and testing dataframes have different instances
pd_testing_data = pd_data.drop(pd_training_data.index)

# Convert training dataframes to Numpy arrays
np_training_data = pd_training_data.values
np_testing_data = pd_testing_data.values

#np_training_data = pd_data.values # Used for testing only
#np_testing_data = pd_data.values  # Used for testing only

################ Begin Training Phase #####################################

# Calculate the number of instances, columns, and attributes in the data set
# Assumes 1 column for the instance ID and 1 column for the class
# Record the index of the column that contains the actual class
no_of_instances = np_training_data.shape[0]
no_of_columns = np_training_data.shape[1]
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Initialize the weight vector. Initialize all weights to 1.
# First column of weight vector is not used (i.e. Instance ID)
weights = np.ones(no_of_attributes + 1)

# Create a new array that has 8 columns, initialized to 99 for each value
extra_columns_train = np.full((no_of_instances, 8),99)

# Add extra columns to the training data set
np_training_data = np.append(np_training_data, extra_columns_train, axis=1)

# Make sure it is an array of floats
np_training_data = np_training_data.astype(float)

# Build the learning model one instance at a time
for row in range(0, no_of_instances):

    # Set the weighted sum to 0
    weighted_sum = 0

    # Calculate the weighted sum of the attributes
    for col in range(1, no_of_attributes + 1):
        weighted_sum += (weights[col] * np_training_data[row,col])

    # Record the weighted sum into column N + 1, the column just to the right
    # of the actual class column
    np_training_data[row, actual_class_column + 1] = weighted_sum

    # Set the predicted class to 99
    predicted_class = 99

    # Learner's prediction: Is the weighted sum > THETA?
    if weighted_sum > THETA:
        predicted_class = 1
    else:
        predicted_class = 0

    # Record the predicted class into column N + 2
    np_training_data[row, actual_class_column + 2] = predicted_class

    # Record the actual class into a variable
    actual_class = np_training_data[row, actual_class_column]

    # Initialize the prediction outcomes
    # These variables are standard inputs into a "Confusion Matrix"
    true_positive = 0   # Predicted class = 1; Actual class = 1 (hit)
    false_positive = 0  # Predicted class = 1; Actual class = 0 (false alarm)
    false_negative = 0  # Predicted class = 0; Actual class = 1 (miss)
    true_negative = 0   # Predicted class = 0; Actual class = 0 

    # Determine the outcome of the Learner's prediction
    if predicted_class == 1 and actual_class == 1:
        true_positive = 1
    elif predicted_class == 1 and actual_class == 0:
        false_positive = 1
    elif predicted_class == 0 and actual_class == 1:
        false_negative = 1
    else:
        true_negative = 1

    # Record the outcome of the Learner's prediction
    np_training_data[row, actual_class_column + 3] = true_positive
    np_training_data[row, actual_class_column + 4] = false_positive
    np_training_data[row, actual_class_column + 5] = false_negative
    np_training_data[row, actual_class_column + 6] = true_negative

    # Set the promote and demote variables to 0
    promote = 0
    demote = 0

    # Promote if false negative
    if false_negative == 1:
        promote = 1
   
    # Demote if false positive
    if false_positive == 1:
        demote = 1

    # Record if either a promotion or demotion is needed
    np_training_data[row, actual_class_column + 7] = promote
    np_training_data[row, actual_class_column + 8] = demote

    # Run through each attribute and see if it is equal to 1
    # If attribute is 1, we need to either demote or promote (adjust the
    # corresponding weight by ALPHA).
    if demote == 1:
        for col in range(1, no_of_attributes + 1):
            if(np_training_data[row,col] == 1):
                weights[col] /= ALPHA
    if promote == 1:
        for col in range(1, no_of_attributes + 1):
            if(np_training_data[row,col] == 1):
                weights[col] *= ALPHA

# Open a new file to save the weights
outfile1 = open(TRAIN_WEIGHTS_FILE,"w") 

# Write the weights of the Learned model to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(" " + ALGORITHM_NAME + " Training Weights\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")
outfile1.write("\n----------------------------\n")
outfile1.write("Weights of the Learned Model\n")
outfile1.write("----------------------------\n")
for col in range(1, no_of_attributes + 1):
    colname = pd_training_data.columns[col]
    s = str(weights[col])
    outfile1.write(colname + " : " + s + "\n")

# Write the relevant constants used in the model to a file
outfile1.write("\n")
outfile1.write("\n")
outfile1.write("-----------\n")
outfile1.write("Constants\n")
outfile1.write("-----------\n")
s = str(THETA)
outfile1.write("THETA = " + s + "\n")
s = str(ALPHA)
outfile1.write("ALPHA = " + s + "\n")

# Close the weights file
outfile1.close()

# Print the weights of the Learned model
print("----------------------------------------------------------")
print(" " + ALGORITHM_NAME + " Results")
print("----------------------------------------------------------")
print("Data Set : " + DATA_PATH)
print()
print()
print("----------------------------")
print("Weights of the Learned Model")
print("----------------------------")
for col in range(1, no_of_attributes + 1):
    colname = pd_training_data.columns[col]
    s = str(weights[col])
    print(colname + " : " + s)

# Print the relevant constants used in the model
print()
print()
print("-----------")
print("Constants")
print("-----------")
s = str(THETA)
print("THETA = " + s)
s = str(ALPHA)
print("ALPHA = " + s)
print()

# Print the learned model runs in binary form
print("-------------------------------------------------------")
print("Learned Model Runs of the Training Data Set (in binary)")
print("-------------------------------------------------------")
print(np_training_data)
print()
print()

# Convert Numpy array to a dataframe
df = pd.DataFrame(data=np_training_data)

# Replace 0s and 1s in the attribute columns with False and True
for col in range(1, no_of_attributes + 1):
    df[[col]] = df[[col]].replace([0,1],["False","True"])

# Replace values in Actual Class column with more descriptive values
df[[actual_class_column]] = df[[actual_class_column]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Replace values in Predicted Class column with more descriptive values
df[[actual_class_column + 2]] = df[[actual_class_column + 2]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Change prediction outcomes to more descriptive values
for col in range(actual_class_column + 3,actual_class_column + 9):
    df[[col]] = df[[col]].replace([0,1],["No","Yes"])

# Rename the columns
df.rename(columns={actual_class_column + 1 : "Weighted Sum" }, inplace = True)
df.rename(columns={actual_class_column + 2 : "Predicted Class" }, inplace = True)
df.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)
df.rename(columns={actual_class_column + 7 : "Promote" }, inplace = True)
df.rename(columns={actual_class_column + 8 : "Demote" }, inplace = True)

# Change remaining columns names from position numbers to descriptive names
for pos in range(0,actual_class_column + 1):
    df.rename(columns={pos : pd_data.columns[pos] }, inplace = True)

print("-------------------------------------------------------")
print("Learned Model Runs of the Training Data Set (readable) ")
print("-------------------------------------------------------")
# Print the revamped dataframe
print(df)

# Write revamped dataframe to a file
df.to_csv(TRAIN_OUT_FILE, sep=",", header=True)
################ End Training Phase #####################################

################ Begin Testing Phase ######################################

# Calculate the number of instances, columns, and attributes in the data set
# Assumes 1 column for the instance ID and 1 column for the class
# Record the index of the column that contains the actual class
no_of_instances = np_testing_data.shape[0]
no_of_columns = np_testing_data.shape[1]
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Create a new array that has 6 columns, initialized to 99 for each value
extra_columns_test = np.full((no_of_instances, 6),99)

# Add extra columns to the testing data set
np_testing_data = np.append(np_testing_data, extra_columns_test, axis=1)

# Make sure it is an array of floats
np_testing_data = np_testing_data.astype(float)

# Test the learning model one instance at a time
for row in range(0, no_of_instances):

    # Set the weighted sum to 0
    weighted_sum = 0

    # Calculate the weighted sum of the attributes
    for col in range(1, no_of_attributes + 1):
        weighted_sum += (weights[col] * np_testing_data[row,col])

    # Record the weighted sum into column N + 1, the column just to the right
    # of the actual class column
    np_testing_data[row, actual_class_column + 1] = weighted_sum

    # Set the predicted class to 99
    predicted_class = 99

    # Learner's prediction: Is the weighted sum > THETA?
    if weighted_sum > THETA:
        predicted_class = 1
    else:
        predicted_class = 0

    # Record the predicted class into column N + 2
    np_testing_data[row, actual_class_column + 2] = predicted_class

    # Record the actual class into a variable
    actual_class = np_testing_data[row, actual_class_column]

    # Initialize the prediction outcomes
    # These variables are standard inputs into a "Confusion Matrix"
    true_positive = 0   # Predicted class = 1; Actual class = 1 (hit)
    false_positive = 0  # Predicted class = 1; Actual class = 0 (false alarm)
    false_negative = 0  # Predicted class = 0; Actual class = 1 (miss)
    true_negative = 0   # Predicted class = 0; Actual class = 0 

    # Determine the outcome of the Learner's prediction
    if predicted_class == 1 and actual_class == 1:
        true_positive = 1
    elif predicted_class == 1 and actual_class == 0:
        false_positive = 1
    elif predicted_class == 0 and actual_class == 1:
        false_negative = 1
    else:
        true_negative = 1

    # Record the outcome of the Learner's prediction
    np_testing_data[row, actual_class_column + 3] = true_positive
    np_testing_data[row, actual_class_column + 4] = false_positive
    np_testing_data[row, actual_class_column + 5] = false_negative
    np_testing_data[row, actual_class_column + 6] = true_negative

# Convert Numpy array to a dataframe
df = pd.DataFrame(data=np_testing_data)

# Replace 0s and 1s in the attribute columns with False and True
for col in range(1, no_of_attributes + 1):
    df[[col]] = df[[col]].replace([0,1],["False","True"])

# Replace values in Actual Class column with more descriptive values
df[[actual_class_column]] = df[[actual_class_column]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Replace values in Predicted Class column with more descriptive values
df[[actual_class_column + 2]] = df[[actual_class_column + 2]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Change prediction outcomes to more descriptive values
for col in range(actual_class_column + 3,actual_class_column + 7):
    df[[col]] = df[[col]].replace([0,1],["No","Yes"])

# Rename the columns
df.rename(columns={actual_class_column + 1 : "Weighted Sum" }, inplace = True)
df.rename(columns={actual_class_column + 2 : "Predicted Class" }, inplace = True)
df.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)

df_numerical = pd.DataFrame(data=np_testing_data) # Keep the values in this dataframe numerical
df_numerical.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)

# Change remaining columns names from position numbers to descriptive names
for pos in range(0,actual_class_column + 1):
    df.rename(columns={pos : pd_data.columns[pos] }, inplace = True)

print("-------------------------------------------------------")
print("Learned Model Predictions on Testing Data Set")
print("-------------------------------------------------------")
# Print the revamped dataframe
print(df)

# Write revamped dataframe to a file
df.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Open a new file to save the summary statistics
outfile2 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile2.write("----------------------------------------------------------\n")
outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile2.write("----------------------------------------------------------\n")
outfile2.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile2.write("\n")
outfile2.write("Number of Test Instances : " + 
    str(np_testing_data.shape[0])+ "\n")

tp = df_numerical["True Positive"].sum()
s = str(int(tp))
outfile2.write("True Positives : " + s + "\n")

fp = df_numerical["False Positive"].sum()
s = str(int(fp))
outfile2.write("False Positives : " + s + "\n")

fn = df_numerical["False Negative"].sum()
s = str(int(fn))
outfile2.write("False Negatives : " + s + "\n")

tn = df_numerical["True Negative"].sum()
s = str(int(tn))
outfile2.write("True Negatives : " + s + "\n")

accuracy = (tp + tn)/(tp + tn + fp + fn)
accuracy *= 100
s = str(accuracy)
outfile2.write("Accuracy : " + s + "%\n")

specificity = (tn)/(tn + fp)
specificity *= 100
s = str(specificity)
outfile2.write("Specificity : " + s + "%\n")

precision = (tp)/(tp + fp)
precision *= 100
s = str(precision)
outfile2.write("Precision : " + s + "%\n")

recall = (tp)/(tp + fn)
recall *= 100
s = str(recall)
outfile2.write("Recall : " + s + "%\n")

neg_pred_value = (tn)/(tn + fn)
neg_pred_value *= 100
s = str(neg_pred_value)
outfile2.write("Negative Predictive Value : " + s + "%\n")

miss_rate = (fn)/(fn + tp)
miss_rate *= 100
s = str(miss_rate)
outfile2.write("Miss Rate  : " + s + "%\n")

fall_out = (fp)/(fp + tn)
fall_out *= 100
s = str(fall_out)
outfile2.write("Fall-Out : " + s + "%\n")

false_discovery_rate = (fp)/(fp + tp)
false_discovery_rate *= 100
s = str(false_discovery_rate)
outfile2.write("False Discovery Rate : " + s + "%\n")

false_omission_rate = (fn)/(fn + tn)
false_omission_rate *= 100
s = str(false_omission_rate)
outfile2.write("False Omission Rate  : " + s + "%\n")

f1_score = (2 * tp)/((2 * tp) + fp + fn)
s = str(f1_score)
outfile2.write("F1 Score: " + s)

# Close the weights file
outfile2.close()

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Test Instances : " + 
    str(np_testing_data.shape[0]))

s = str(int(tp))
print("True Positives : " + s)

s = str(int(fp))
print("False Positives : " + s)

s = str(int(fn))
print("False Negatives : " + s)

s = str(int(tn))
print("True Negatives : " + s)

s = str(accuracy)
print("Accuracy : " + s + "%")

s = str(specificity)
print("Specificity : " + s + "%")

s = str(precision)
print("Precision : " + s + "%")

s = str(recall)
print("Recall : " + s + "%")

s = str(neg_pred_value)
print("Negative Predictive Value : " + s + "%")

s = str(miss_rate)
print("Miss Rate  : " + s + "%")

s = str(fall_out)
print("Fall-Out : " + s + "%")

s = str(false_discovery_rate)
print("False Discovery Rate : " + s + "%")

s = str(false_omission_rate)
print("False Omission Rate  : " + s + "%")

s = str(f1_score)
print("F1 Score: " + s)


###################### End Testing Phase ######################################

Return to Table of Contents

Output Statistics of Winnow2 on the Breast Cancer Data Set

Here is a link to a screenshot of the summary statistics:

breast_cancer_results

Return to Table of Contents

Naive Bayes Algorithm From Scratch | Machine Learning

In this post, I will walk you through the Naive Bayes machine learning algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm a real-world data sets from the UCI Machine Learning Repository. On one of the data sets, we will predict if a patient has breast cancer or not based on ten different attributes. Let’s get started!

Table of Contents

What is Naive Bayes?

The Naive Bayes algorithm is a technique based on Bayes Theorem for calculating the probability of a hypothesis (H) given some pieces of evidence (E).

For example, suppose we are trying to identify if a person is sick or not. Our hypothesis is that the person is sick.

nurse_checks_blood_pressure_1

We would naturally take a look at the evidence (eye color, body temperature, blood pressure, etc.) to determine if the person is sick or not. Each piece of evidence provides us clues. From that evidence, we can then use the Naive Bayes algorithm to calculate two probabilities:

  • Probability 1: The probability that the person is sick given she has red eyes, a body temperature of 99°F, and has normal blood pressure.
  • Probability 2: The probability that the person is not sick given she has red eyes, a body temperature of 99°F, and has normal blood pressure.

We then classify the person as being sick or not based on which probability (Probability 1 vs. Probability 2) is the highest.

Mathematically, Bayes theorem can be expressed as follows:

naive-bayes-1

Or in expanded form, we have:

naive-bayes-2

Or…

naive-bayes-3

Where:

  • P = probability
  • | = given
  • E = evidence (e.g. red eyes, body temperature, etc.)
  • H = hypothesis (e.g. sick)
  • ¬ = not
  • P(H|E1, E2,E3,…,EN) = posterior probability: the probability of a hypothesis after taking the evidence into account (e.g. probability of being sick given all this evidence)
  • P(E1, E2,E3,…,EN|H)= likelihood: the probability of the evidence given the hypothesis (e.g. probability of having red eyes given that a person is sick)
  • P(H) = class prior probability: the known probability of the hypothesis (e.g. probability of being sick for the population or entire sample of instances)

The equation above says: “The probability of the hypothesis (e.g. a person is sick) given the evidence (e.g. eye color, body temperature, blood pressure) is equal to the probability of the evidence given the hypothesis times the probability of the hypothesis divided by the probability of the evidence.”

The key assumption with the Bayes theorem is that all of the attributes are conditionally independent. In other words, the occurrence of one piece of evidence gives no information about the probability of another piece of evidence occurring.

For example, Bayes theorem would assume that the probability of having red eyes gives no information about the probability of having a high body temperature. We know this is not often the case. Such an assumption is naive, and that is why we call this classification algorithm the Naive Bayes algorithm.

We can therefore rewrite the equation based on the probability rule of conditional independence, which is:

naive-bayes-4

Bayes equation can be rewritten as:

naive-bayes-5

Return to Table of Contents

Algorithm Steps

Training Phase

Recall in Naive Bayes, for a 2-class classification problem (e.g. sick or not sick), we need to calculate two probabilities for each instance. The highest probability is our prediction.:

Probability 1 (sick): The probability that the person is sick given she has red eyes, a body temperature of 99°F, and has normal blood pressure.

naive-bayes-6-1

Probability 2 (not sick): The probability that the person is not sick given she has red eyes, a body temperature of 99°F, and has a normal blood pressure.

naive-bayes-7

If Probability 1 > Probability 2, she is sick. Otherwise, she is not sick

Notice, the denominators above are both equal. Because they are both equal, we can ignore them for training our model since all we need to do is to compare the numerators.

  • Probability 1 (sick): The probability that the person is sick given she has red eyes, a body temperature of 99°F, and has a normal blood pressure.
naive-bayes-8
  • Probability 2 (not sick): The probability that the person is not sick given she has red eyes, a body temperature of 99°F, and has a normal blood pressure.
naive-bayes-9

This makes our lives easier, since now all Naive Bayes algorithm needs to do to train on a data set is to calculate those values in blue above in order to make a classification prediction (sick or not sick). That is, we need to calculate two class prior probabilities (sick or not sick for the whole sample or population) plus the conditional probability of each unique value in each attribute for each class:

Number of Probabilities Calculated during Training Phase of Naive Bayes = 2 class prior probabilities + 2 * (# of unique values for E1) + 2 * (# of unique values for E2) + … 2 * (# of unique values for EN)

If all pieces of evidence were binary (e.g. red eyes, no red eyes) and the class is binary (sick or not sick), we would need to calculate four probabilities for each attribute. The total number of probabilities calculated in the training phase is therefore (where N is the number of attributes):

Number of Probabilities Calculated during Training Phase of Naive Bayes = 2 + 4N

For example, let’s see all the probabilities that would need to be calculated for binary attribute E1 (e.g. red eyes or no red eyes):

naive-bayes-10

We have our two class prior probabilities. These will be the same for all attributes:

  1. P(1) = (Total number of people that are sick in the training data set) / (Total number of people in the training data set)
  2. P(0) =  (Total number of people that are not sick in the training data set) / (Total number of people in the training data set)

And since (N = number, S = total number of instances in the training data set)…

naive-bayes-11

To complete the table for attribute E1, we calculate four different probabilities:

naive-bayes-12

We have to store these probabilities somewhere so they can be looked up during the testing phase. In my program, I stored them in a Python dictionary, with the following search key: <attribute_name><attribute_value><class_value>.

For example, the search key redeyes01, would return the probability of not having red eyes given someone is sick:

naive-bayes-13

That’s it. Once we have the tables for each attribute along with the class prior probabilities, the algorithm can go to work and make predictions for new instances.

Return to Table of Contents

Testing Phase

Having calculated the required probabilities and stored them somewhere, the algorithm is ready to make its predictions for new instances. As mentioned in the previous section, for each instance (i.e. row of the testing data set), two calculations need to be made and then the results are compared.

1. Probability 1 (sick): The probability that the person is sick given she has red eyes, a body temperature of 99°F, and has normal blood pressure.

naive-bayes-14

2.Probability 2 (not sick): The probability that the person is not sick given she has red eyes, a body temperature of 99°F, and has a normal blood pressure.

naive-bayes-15

3. If Probability 1 > Probability 2, she is sick. Otherwise, she is not sick

4. Proceed to the next instance and repeat 1-3.

Return to Table of Contents

Naive Bayes Implementation

The Naive Bayes algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

Required Data Set Format for Naïve Bayes

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  • N: Actual Class

The program then adds two additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

Modification of Attribute Values

The actual class value was changed to “Benign” or “Malignant.”

I transformed the attributes into binary numbers so that the algorithms could process the data properly and efficiently. If attribute value was greater than 5, the value was changed to 1, otherwise it was 0.

Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There are no missing values in this data set.

Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There were no missing attribute values.

Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

Modification of Attribute Values

If attribute values were greater than the median of the attribute, value was changed to 1, otherwise it was set to 0.

Missing Data

There are no missing values in this data set.

Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

  • 267 Democrats
  • 168 Republicans

Modification of Attribute Values

I did the following modifications:

  • Changed all “y” to 1 and all “n” to 0.

Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).

Return to Table of Contents

Naive Bayes Algorithm in Python, Coded From Scratch

Here are the input files for the code below:

Here is the driver code that contains the main method. I recommend copying and pasting it into a text editor like Notepad++ or an IDE so that you don’t have to do any horizontal scrolling to see the entire code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
import five_fold_stratified_cv
import naive_bayes

# File name: naive_bayes_driver.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Driver for the naive_bayes.py program 
# (Naive Bayes)

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# The naive_bayes.py program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

ALGORITHM_NAME = "Naive Bayes"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 
    #data_path = "breast_cancer.txt"

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 
    #trace_runs_file = "breast_cancer_naive_bayes_trace_runs.txt"

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 
    #test_stats_file = "breast_cancer_naive_bayes_test_stats.txt"

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
        pd_data_set)

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run Naive Bayes the designated number of times as indicated by the 
    # number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = pd.concat([
               fold1, fold2, fold3, fold4], ignore_index=True, sort=False)                
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = pd.concat([
               fold0, fold2, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = pd.concat([
               fold0, fold1, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold4], ignore_index=True, sort=False) 
        else:
            test_dataset = fold4
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold3], ignore_index=True, sort=False) 
        
        # Run Naive Bayes
        accuracy, predictions, learned_model, no_of_instances_test = (
            naive_bayes.naive_bayes(training_dataset,test_dataset))

        # Replace 1 with Yes and 0 with No in the 'Prediction 
        # Correct?' column
        predictions['Prediction Correct?'] = predictions[
            'Prediction Correct?'].map({1: "Yes", 0: "No"})

        # Print the trace runs of each experiment
        print("Accuracy:")
        print(str(accuracy * 100) + "%")
        print()
        print("Classifications:")
        print(predictions)
        print()
        print("Learned Model (Likelihood Table):")
        print(learned_model)
        print()
        print("Number of Test Instances:")
        print(str(no_of_instances_test))
        print() 

        outfile_tr.write("Accuracy:")
        outfile_tr.write(str(accuracy * 100) + "%\n\n")
        outfile_tr.write("Classifications:\n")
        outfile_tr.write(str(predictions) + "\n\n")
        outfile_tr.write("Learned Model (Likelihood Table):\n")
        outfile_tr.write(str(learned_model) + "\n\n")
        outfile_tr.write("Number of Test Instances:")
        outfile_tr.write(str(no_of_instances_test) + "\n\n")

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.\n")

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")
    outfile_ts.write("\n")
    accuracy = np.mean(accuracy_statistics)
    accuracy *= 100
    outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
   
    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    print()
    print("Classification Accuracy : " + str(accuracy) + "%")
    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Here is the code for Naive Bayes:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
 
# File name: naive_bayes.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of Naive Bayes 
# This code works for multi-class 
# classification problems (e.g. democrat/republican/independent)
# Calculate P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0) and
# P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1) and
# P(E1|CL2)P(E2|CL2)P(E3|CL2)...P(E#|CL2) * P(CL2), etc. and 
# predict the class with the maximum result. 
# E is an attribute, and CL means class.
# Only need class prior probability and likelihoods to make a prediction
# (i.e. the numerator of Bayes formula) since denominators are 
# same for both the P(CL0|E1,E2,E3...)*P(CL0) and 
# P(CL1|E1,E2,E3...)*P(CL1), etc. cases where P means "probability of" 
# and | means "given".
 
# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
 
# This program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

def naive_bayes(training_set,test_set):
    """
    Parameters:
      training_set: The training instances as a Pandas dataframe
      test_set: The test instances as a Pandas dataframe
    Returns:
      accuracy: Classification accuracy as a decimal
      predictions: Classifications of all the test instances as a 
        Pandas dataframe
      learned_model: The likelihood table that is produced
        during the training phase
      no_of_instances_test: The number of test instances
    """   
 
    # Calculate the number of instances, columns, and attributes in the
    # training data set. Assumes 1 column for the instance ID and 1 column
    # for the class. Record the index of the column that contains 
    # the actual class
    no_of_instances_train = len(training_set.index) # number of rows
    no_of_columns_train = len(training_set.columns) # number of columns
    no_of_attributes = no_of_columns_train - 2
    actual_class_column = no_of_columns_train - 1
 
    # Store class values in a column, sort them, then create a list of unique
    # classes and store in a dataframe and a Numpy array
    unique_class_list_df = training_set.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df
 
    # Record the number of unique classes in the data set
    num_unique_classes = len(unique_class_list_df)
 
    # Record the frequency counts of each class in a Numpy array
    freq_cnt_class = training_set.iloc[:,actual_class_column].value_counts(
        sort=True)
 
    # Record the frequency percentages of each class in a Numpy array
    # This is a list of the class prior probabilities
    class_prior_probs = training_set.iloc[:,actual_class_column].value_counts(
        normalize=True, sort=True)
 
    # Add 2 additional columns to the testing dataframe
    test_set = test_set.reindex(
        columns=[*test_set.columns.tolist(
        ), 'Predicted Class', 'Prediction Correct?'])
 
    # Calculate the number of instances and columns in the
    # testing data set. Record the index of the column that contains the 
    # predicted class and prediction correctness (1 if yes; 0 if no)
    no_of_instances_test = len(test_set.index) # number of rows
    no_of_columns_test = len(test_set.columns) # number of columns
    predicted_class_column = no_of_columns_test - 2
    prediction_correct_column = no_of_columns_test - 1
 
    ######################### Training Phase of the Model #####################
    # Create a an empty dictionary
    my_dict = {}
 
    # Calculate the likelihood tables for each attribute. If an attribute has
    # four levels, there are (# of unique classes x 4) different probabilities 
    # that need to be calculated for that attribute.
    # Start on the first attribute and make your way through all the attributes
    for col in range(1, no_of_attributes + 1):
 
        # Record the name of this column 
        colname = training_set.columns[col]
 
        # Create a dataframe containing the unique values in the column
        unique_attribute_values_df = training_set[colname].drop_duplicates()

        # Create a Numpy array containing the unique values in the column
        unique_attribute_values_np = training_set[colname].unique()
     
        # Calculate likelihood of the attribute given each unique class value
        for class_index in range (0, num_unique_classes):
         
            # For each unique attribute value, calculate the likelihoods 
            # for each class
            for attr_val in range (0, unique_attribute_values_np.size) :
                running_sum = 0
 
                # Calculate N(unique attribute value and class value)
                # Where N means "number of" 
                # Go through each row of the training set
                for row in range(0, no_of_instances_train):
                    if (training_set.iloc[row,col] == (
                        unique_attribute_values_df.iloc[attr_val])) and (
                        training_set.iloc[row, actual_class_column] == (
                        unique_class_list_df.iloc[class_index])):
                            running_sum += 1
 
                # With N(unique attribute value and class value) as the numerator
                # we now need to divide by the total number of times the class
                # appeared in the data set
                try:
                    denominator = freq_cnt_class[class_index]
                except:
                    denominator = 1.0
             
                likelihood = min(1.0,(running_sum / denominator))
             
                # Add a new likelihood to the dictionary
                # Format of search key is 
                # <attribute_name><attribute_value><class_value>
                search_key = str(colname) + str(
                    unique_attribute_values_df.iloc[
                    attr_val]) + str(unique_class_list_df.iloc[
                    class_index])
                my_dict[search_key] = likelihood
  
    # Print the likelihood table to the console
    learned_model = pd.DataFrame.from_dict(my_dict, orient='index')
 
    ################# End of Training Phase of the Naive Bayes Model ########
 
    ################# Testing Phase of the Naive Bayes Model ################
 
    # Proceed one instance at a time and calculate the prediction
    for row in range(0, no_of_instances_test):
 
        # Initialize the prediction outcome
        predicted_class = unique_class_list_df.iloc[0]
        max_numerator_of_bayes = 0.0
 
        # Calculate the Bayes equation numerator for each test instance
        # That is: P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0),
        # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1)...
        for class_index in range (0, num_unique_classes):
 
            # Reset the running product with the class
            # prior probability, P(CL)
            try:
                running_product = class_prior_probs[class_index]
            except:
                running_product = 0.0000001 # Class not found in data set
         
            # Calculation of P(CL) * P(E1|CL) * P(E2|CL) * P(E3|CL)...
            # Format of search key is 
            # <attribute_name><attribute_value><class_value>
            # Record each search key value
            for col in range(1, no_of_attributes + 1):
                attribute_name = test_set.columns[col]
                attribute_value = test_set.iloc[row,col]
                class_value = unique_class_list_df.iloc[class_index]
 
                # Set the search key
                key = str(attribute_name) + str(
                          attribute_value) + str(class_value)
 
                # Update the running product
                try:
                    running_product *= my_dict[key]
                except:
                    running_product *= 0
 
            # Record the prediction if we have a new max
            # Bayes numerator
            if running_product > max_numerator_of_bayes:
                max_numerator_of_bayes = running_product
                predicted_class = unique_class_list_df.iloc[
                             class_index] # New predicted class
 
        # Store the prediction in the dataframe
        test_set.iloc[row,predicted_class_column] = predicted_class
     
        # Store if the prediction was correct
        if predicted_class == test_set.iloc[row,actual_class_column]:
            test_set.iloc[row,prediction_correct_column] = 1
        else: 
            test_set.iloc[row,prediction_correct_column] = 0
 
    # Store the revamped dataframe
    predictions = test_set

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (test_set.iloc[
        :,prediction_correct_column].sum())/no_of_instances_test
 
    # Return statement
    return  accuracy, predictions, learned_model, no_of_instances_test 
    ####################### End Testing Phase #################################

Here is the code for five-fold stratified cross-validation:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

def get_five_folds(instances):
    """
    Parameters:
        instances: A Pandas data frame containing the instances
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Shuffle the data set randomly
    instances = instances.sample(frac=1).reset_index(drop=True)

    # Record the number of columns in the data set
    no_of_columns = len(instances.columns) # number of columns

    # Record the number of rows in the data set
    no_of_rows = len(instances.index) # number of rows

    # Create five empty folds (i.e. Panda Dataframes: fold0 through fold4)
    fold0 = pd.DataFrame(columns=(instances.columns))
    fold1 = pd.DataFrame(columns=(instances.columns))
    fold2 = pd.DataFrame(columns=(instances.columns))
    fold3 = pd.DataFrame(columns=(instances.columns))
    fold4 = pd.DataFrame(columns=(instances.columns))

    # Record the column of the Actual Class
    actual_class_column = no_of_columns - 1

    # Generate an array containing the unique 
    # Actual Class values
    unique_class_list_df = instances.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    unique_class_list_np_size = unique_class_list_np.size

    # For each unique class in the unique Actual Class array
    for unique_class_list_np_idx in range(0, unique_class_list_np_size):

        # Initialize the counter to 0
        counter = 0

        # Go through each row of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for row in range(0, no_of_rows):

            # If the value of the unique class is equal to the actual
            # class in the original data set on this row
            if unique_class_list_np[unique_class_list_np_idx] == (
                instances.iloc[row,actual_class_column]):

                    # Allocate instance to fold0
                    if counter == 0:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold0.loc[len(fold0)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold1
                    elif counter == 1:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold1.loc[len(fold1)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold2
                    elif counter == 2:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold2.loc[len(fold2)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold3
                    elif counter == 3:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold3.loc[len(fold3)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold4
                    else:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold4.loc[len(fold4)] = new_row
                                    
                        # Reset counter to 0
                        counter = 0
        
    return fold0, fold1, fold2, fold3, fold4

Return to Table of Contents

Output Statistics of Naive Bayes

Here are the trace runs:

Here are the results:

results-naive-bayes

Here are the test statistics for each data set:

Return to Table of Contents

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Kelleher, J. D., Namee, B., & Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. Cambridge, Massachusetts: The MIT Press.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methodes of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Rebala, G., Ravi, A., & Churiwala, S. (2019). An Introduction to Machine Learning. Switzerland: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25

Y. Ng, A., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic , 841-848.

Return to Table of Contents