In order to answer this question, let us break it down piece-by-piece.
What is Quickprop?
Normal backpropagation when training an artificial feedforward neural network can take a long time to converge on the optimal set of weights. Updating the weights in the first layers of a neural network requires errors that are propagated backwards from the output layer. This procedure might work well for small networks, but when there are a lot of hidden layers, standard backpropagation can really slow down the training process.
Scott E. Fahlman, a Professor Emeritus at Carnegie Mellon University in the School of Computer Science, decided in the late 1980s to speed up backpropagation by inventing a new algorithm called Quickprop which was inspired by Newton’s method, a popular root-finding algorithm.
The idea behind QuickProp is to take the biggest steps possible without overstepping the solution (Kirk, 2015). Instead of the error information propagating backwards from the output layer (as is the case with standard backpropagation), QuickProp needs just the current and previous values of the weights, as well as the error derivative that was calculated during the most recent epoch. This twist on backpropagation means that the weight update process is now entirely local with respect to each node.
In Quickprop, we assume that the shape of the cost function for a given weight when near its optimal value is a paraboloid. With this assumption we can then follow a method similar to Newton’s method for converging on the optimal solution.
The downside of Quickprop is that it can get chaotic if the surface of the cost function has a lot of local minima (Kirk, 2015).
The vanishing gradient problem is as follows: As a neural network contains more and more layers, the gradients of the cost function with respect to the weights at each of the nodes gets smaller and smaller, approaching zero. Teeny tiny gradients make the network increasingly more difficult to train.
The vanishing gradient problem is caused by the nature of backpropagation as well as the activation function (e.g. logistic sigmoid function). Backpropagation computes gradients using the chain rule. And since the sigmoid function generates gradients in the range (0, 1), at each step of training, you are multiplying a chain of small numbers together in order to calculate the gradients.
Consistently multiplying smaller and smaller gradient values produces smaller and smaller weight update values since the magnitude of the weight updates (i.e. step size) is proportional to the learning rate and the gradients.
Thus, because the gradients become vanishingly small, the algorithm takes longer and longer to converge on an optimal set of weights. And since the weights will only make teeny tiny changes during each step of the training phase, the weights do not update effectively, and the neural network network losses accuracy.
What is Cascade Correlation?
Cascade correlation is a method that addresses the question of how many hidden layers and how many hidden nodes we need. Instead of starting with a fixed network with a set number of hidden layers and nodes per hidden layer, the cascade correlation method starts with a minimal network and then trains and adds one hidden layer at a time.
How Does Quickprop Address the Vanishing Gradient Problem in Cascade Correlation?
Some scientists have combined the Cascade correlation learning architecture with the Quickprop weight update rule (Abrahart, 2004). When combined in this fashion, Cascade correlation prevents there from being a vanishing gradient problem due to the fact the gradient is never actually propagated from output all the way back to input through the various layers.
As noted by Marquez et al. (2018), “such training of deep networks in a cascade, directly circumvents the well-known vanishing gradient problem by ensuring that the output is always adjacent to the layer being trained.” This is because of the way the weights are locked as new layers are inserted, and the inputs are still tied directly to the outputs.
So to answer the question posed in the title, it’s really the cascade correlation architecture that has the greatest effect on reducing the vanishing gradient problem even though there is some benefit from the Quickprop algorithm.
References
Abrahart, Robert J., Pauline E. Kneale, and Linda M. See. Neural networks for hydrological modelling. Leiden New York: A.A. Balkema Publishers, 2004. Print.
Marquez, Enrique S., Jonathon S. Hare, Mahesan Niranjan: Deep Cascade Learning. IEEE Trans. Neural Netw. Learning Syst. 29(11): 5475-5485 (2018).
Kirk, Matthew, et al. Thoughtful Machine learning : A Test-driven Approach. Sebastopol, California: O’Reilly, 2015. Print.
In this post, I will walk you through how to build an artificial feedforward neural network trained with backpropagation, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.
Our end goal is to evaluate the performance of an artificial feedforward neural network trained with backpropagation and to compare the performance using no hidden layers, one hidden layer, and two hidden layers. Five different data sets from the UCI Machine Learning Repository are used to compare performance: Breast Cancer, Glass, Iris, Soybean (small), and Vote.
We will use our neural network to do the following:
Predict if someone has breast cancer
Identify glass type
Identify flower species
Determine soybean disease type
Classify a representative as either a Democrat or Republican based on their voting patterns
I hypothesize that the neural networks with no hidden layers will outperform the networks with two hidden layers. My hypothesis is based on the notion that the simplest solutions are often the best solutions (i.e. Occam’s Razor).
What is an Artificial Feedforward Neural Network Trained with Backpropagation?
Background
An artificial feed-forward neural network (also known as multilayer perceptron) trained with backpropagation is an old machine learning technique that was developed in order to have machines that can mimic the brain. Neural networks were the focus of a lot of machine learning research during the 1980s and early 1990s but declined in popularity during the late 1990s.
Since 2010, neural networks have experienced a resurgence in popularity due to improvements in computing speed and the availability of massive amounts of data with which to train large-scale neural networks. In the real world, neural networks have been used to recognize speech, caption images, and even help self-driving cars learn how to park autonomously.
The Brain as Inspiration for Artificial Neural Networks
In
order to understand neural networks, it helps to first take a look at the basic
architecture of the human brain. The brain has 1011 neurons (Alpaydin,
2014). Neurons
are cells inside the brain that process information.
Each neuron contains a number of input wires called dendrites. Each neuron also has one output wire called an axon. The axon is used to send messages to other neurons. The axon of a sending neuron is connected to the dendrites of the receiving neuron via a synapse.
So,
in short, a neuron receives inputs from dendrites, performs a computation, and
sends the output to other neurons via the axon. This process is how information
flows through the brain. The messages sent between neurons are in the form of
electric pulses.
An
artificial neural network, the one used in machine learning, is a simplified
model of the actual human neural network explained above. It is typically
composed of zero or more layers.
Each
layer of the neural network is made up of nodes (analogous to neurons in the
brain). Nodes of one layer are connected to nodes in another layer by
connection weights, which are typically just floating-point numbers (e.g.
0.23342341). These numbers represent the strength of the connection between two
nodes.
The
job of a node in a hidden layer is to:
Receive
input values from each node in a preceding layer
Compute
a weighted sum of those input values
Send
that weighted sum through some activation function (e.g. logistic sigmoid
function or hyperbolic tangent function)
Send
the result of the computation in #3 to each node in the next layer of the
neural network.
Thus,
the output from the nodes in a given layer becomes the input for all nodes in the
next layer.
The
output layer of a network does steps 1-3 above. However, the result of the
computation from step #3 is a class prediction instead of an input to another
layer (since the output layer is the final layer).
Here is a diagram of the process I explained above:
Here is a diagram showing a single layer neural network:
b stands for the bias term. This is a constant. It is like the b in the equation for a line, y = mx + b. It enables the model to have flexibility because, without that bias term, you cannot as easily adapt the weighted sum of inputs (i.e. mx) to fit the data (i.e. in the example of a simple line, the line cannot move up and down the y-axis without that b term).
w in the diagram above stands for the weights, and x stands for the input values.
Here is a similar diagram, but now it is a two-layer neural network instead of single layer.
And here is one last way to look at the same thing I explained above:
Note that the yellow circles on the left represent the input values. w represents the weights. The sigma inside the box means that we calculated the weighted sum of the input values. We run that through the activation function f(S)…e.g. sigmoid function. And then out of that, pops the output, which is passed on to the nodes in the following layer.
Neural networks that contain many layers, for example more than 100, are called deep neural networks. Deep neural networks are the cornerstone of the rapidly growing field known as deep learning.
Training Phase
The objective during the training phase of a neural network is to determine all the connection weights. At the start of training, the weights of the network are initialized to small random values close to 0. After this step, training proceeds to the two main phases of the algorithm: forward propagation and backpropagation.
Forward Propagation
During
the forward propagation phase of a neural network, we process one instance (i.e.
one set of inputs) at a time. Hidden layers extract important features
contained in the input data by computing a weighted sum of the inputs and
running the result through the logistic sigmoid activation function. This
output relays to nodes in the next hidden layer where the data is transformed
yet again. This process continues until the data reaches the output layer.
The
output of the output layer is a predicted class value, which in this project is
a one-hot encoded class prediction vector. The index of the vector corresponds
to each class. For example, if a 1 is in the 0 index of the vector (and a 0 is in
all other indices of the vector), the class prediction is class 0. Because we
are dealing with 0s and 1s, the output vector can also be considered the
probability that an instance is in a given class.
Backpropagation
After
the input signal produced by a training instance propagates through the network
one layer at a time to the output layer, the backpropagation phase commences.
An error value is calculated at the output layer. This error corresponds to the
difference between the class predicted by the network and the actual (i.e.
true) class of the training instance.
The
prediction error is then propagated backward from the output layer to the input
layer. Blame for the error is assigned to each node in each layer, and then the
weights of each node of the neural network are updated accordingly (with the
goal to make more accurate class predictions for the next instance that flows
through the neural network) using stochastic gradient descent for the weight
optimization procedure.
Note
that weights of the neural network are adjusted on a training instance by
training instance basis. This online learning method is the preferred one for
classification problems of large size (Ĭordanov
& Jain, 2013).
The
forward propagation and backpropagation phases continue for a certain number of
epochs. A single epoch finishes when each training instance has been processed
exactly once.
Testing Phase
Once the neural network has been trained, it can be used to make predictions on new, unseen test instances. Test instances flow through the network one-by-one, and the resulting output (which is a vector of class probabilities) determines the classification.
Helpful Video
Below is a helpful video by Andrew Ng, a professor at Stanford University, that explains neural networks and is helpful for getting your head around the math. The video gets pretty complicated in some spots (esp. where he starts writing all sorts of mathematical notation and derivatives). My advice is to lookup anything that he explains that isn’t clear. Take it slow as you are learning about neural networks. There is no rush. This stuff isn’t easy to understand on your first encounter with it. Over time, the fog will begin to lift, and you will be able to understand how it all works.
Required Data Set Format for Feedforward Neural Network Trained with Backpropagation
Columns (0 through N)
0: Instance ID
1: Attribute 1
2: Attribute 2
3: Attribute 3
…
N: Actual Class
The program then adds two
additional columns for the testing set.
N + 1: Predicted Class
N + 2: Prediction Correct? (1 if yes, 0 if no)
Breast Cancer Data Set
This breast cancer data set
contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg,
1992).
Modification of Attribute Values
The actual class value was changed to “Benign”
or “Malignant.”
Attribute values were normalized to be in the
range 0 to 1.
Class values were vectorized using one-hot
encoding.
Missing Data
There were 16 missing attribute values, each denoted with a “?”. I
chose a random number between 1 and 10 (inclusive) to fill in the data.
Glass Data Set
This glass data set contains 214
instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to
identify the type of glass.
Modification of Attribute Values
Attribute values were normalized to be in the
range 0 to 1.
Class values were vectorized using one-hot
encoding.
Missing Data
There are no missing values in this data set.
Iris Data Set
This data set contains 3 classes
of 50 instances each (150 instances in total), where each class refers to a
different type of iris plant (Fisher, 1988).
Modification of Attribute Values
Attribute values were normalized to be in the
range 0 to 1.
Class values were vectorized using one-hot
encoding.
Missing Data
There were no missing attribute values.
Soybean Data Set (small)
This soybean (small) data set
contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to
determine the disease type.
Modification of Attribute Values
Attribute values were normalized to be in the
range 0 to 1.
Class values were vectorized using one-hot
encoding.
Attribute values that were all the same value were
removed.
Missing Data
There are no missing values in this data set.
Vote Data Set
This data set includes votes for
each of the U.S. House of Representatives Congressmen (435 instances) on the 16
key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify
the representative as either a Democrat or Republican.
267 Democrats
168 Republicans
Modification of Attribute Values
I did the following modifications:
Changed all “y” to 1 and all “n” to 0.
Class values were vectorized using one-hot
encoding.
Missing Data
Missing values were denoted as “?”. To fill in
those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).
Stochastic Gradient Descent
I used stochastic gradient descent for optimizing the
weights.
In normal gradient descent, we need
to calculate the partial derivative of the cost function with respect to each
weight. For each partial derivative, we have to tally up the terms for each training
instance to compute the partial derivative of the cost function with respect to
that weight. What this means is that, if we have a lot of attributes and a
large dataset, gradient descent is slow. For this reason, stochastic gradient
descent was chosen since weights are updated after each training instance (as
opposed to after all training instances).
Here is a good video that explains stochastic gradient descent.
Some tuning was performed in this
project. The learning rate was set to 0.1, which was different than the 0.01
value that is often used for multi-layer feedforward neural networks (Montavon, 2012). Lower values resulted in much
longer training times and did not result in large improvements in
classification accuracy.
Epochs
The number of epochs chosen for
the main runs of the algorithm on the data sets was chosen to be 1000. Other
values were tested, but the number of epochs did not have a large impact on classification
accuracy.
Number of Nodes per Hidden Layer
In order to tune the number of
nodes per hidden layer, I used a constant learning rate and constant number of
epochs. I then calculated the classification accuracy for each data set for a
set number of nodes per hidden layer. I performed this process using networks
with one hidden layer and networks with two hidden layers. The results of this
tuning process are below.
Note that the mean classification
accuracy across all data sets when one hidden layer was used for the neural
network reached a peak at eight nodes per hidden layer. This value of eight nodes
per hidden layer was used for the actual runs on the data sets.
For two hidden layers, the peak
mean classification accuracy was attained at five nodes per hidden layer. Thus,
when the algorithm was run on the data sets for two hidden layers, I used five
nodes per hidden layer for each data set to compare the classification accuracy
across the data sets.
Here is the full code for the neural network. This is all you need to run the program:
import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
from random import shuffle # Import shuffle() method from the random module
from random import seed # Import seed() method from the random module
from random import random # Import random() method from the random module
from collections import Counter # Used for counting
from math import exp # Import exp() function from the math module
# File name: neural_network.py
# Author: Addison Sears-Collins
# Date created: 7/30/2019
# Python version: 3.7
# Description: An artificial feedforward neural network trained
# with backpropagation (also called multilayer perceptron)
# Required Data Set Format
# Columns (0 through N)
# 0: Attribute 0
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class
# 2 additional columns are added for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct?
ALGORITHM_NAME = "Feedforward Neural Network With Backpropagation"
SEPARATOR = "," # Separator for the data set (e.g. "\t" for tab data)
def normalize(dataset):
"""
Normalize the attribute values so that they are between 0 and 1, inclusive
:param pandas_dataframe dataset: The original dataset as a Pandas dataframe
:return: normalized_dataset
:rtype: Pandas dataframe
"""
# Generate a list of the column names
column_names = list(dataset)
# For every column except the actual class column
for col in range(0, len(column_names) - 1):
temp = dataset[column_names[col]] # Go column by column
minimum = temp.min() # Get the minimum of the column
maximum = temp.max() # Get the maximum of the column
# Normalized all values in the column so that they
# are between 0 and 1.
# x_norm = (x_i - min(x))/(max(x) - min(x))
dataset[column_names[col]] = dataset[column_names[col]] - minimum
dataset[column_names[col]] = dataset[column_names[col]] / (
maximum - minimum)
normalized_dataset = dataset
return normalized_dataset
def get_five_stratified_folds(dataset):
"""
Implementation of five-fold stratified cross-validation. Divide the data
set into five random folds. Make sure that the proportion of each class
in each fold is roughly equal to its proportion in the entire data set.
:param pandas_dataframe dataset: The original dataset as a Pandas dataframe
:return: five_folds
:rtype: list of folds where each fold is a list of instances(i.e. examples)
"""
# Create five empty folds
five_folds = list()
fold0 = list()
fold1 = list()
fold2 = list()
fold3 = list()
fold4 = list()
# Get the number of columns in the data set
class_column = len(dataset[0]) - 1
# Shuffle the data randomly
shuffle(dataset)
# Generate a list of the unique class values and their counts
classes = list() # Create an empty list named 'classes'
# For each instance in the dataset, append the value of the class
# to the end of the classes list
for instance in dataset:
classes.append(instance[class_column])
# Create a list of the unique classes
unique_classes = list(Counter(classes).keys())
# For each unique class in the unique class list
for uniqueclass in unique_classes:
# Initialize the counter to 0
counter = 0
# Go through each instance of the data set and find instances that
# are part of this unique class. Distribute them among one
# of five folds
for instance in dataset:
# If we have a match
if uniqueclass == instance[class_column]:
# Allocate instance to fold0
if counter == 0:
# Append this instance to the fold
fold0.append(instance)
# Increase the counter by 1
counter += 1
# Allocate instance to fold1
elif counter == 1:
# Append this instance to the fold
fold1.append(instance)
# Increase the counter by 1
counter += 1
# Allocate instance to fold2
elif counter == 2:
# Append this instance to the fold
fold2.append(instance)
# Increase the counter by 1
counter += 1
# Allocate instance to fold3
elif counter == 3:
# Append this instance to the fold
fold3.append(instance)
# Increase the counter by 1
counter += 1
# Allocate instance to fold4
else:
# Append this instance to the fold
fold4.append(instance)
# Reset the counter to 0
counter = 0
# Shuffle the folds
shuffle(fold0)
shuffle(fold1)
shuffle(fold2)
shuffle(fold3)
shuffle(fold4)
# Add the five stratified folds to the list
five_folds.append(fold0)
five_folds.append(fold1)
five_folds.append(fold2)
five_folds.append(fold3)
five_folds.append(fold4)
return five_folds
def initialize_neural_net(
no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs):
"""
Generates a new neural network that is ready to be trained.
Network (list of layers): 0+ hidden layers, and output layer
Input Layer (list of attribute values): A row from the training set
Hidden Layer (list of dictionaries): A set of nodes (i.e. neurons)
Output Layer (list of dictionaries): A set of nodes, one node per class
Node (dictionary): Contains a set of weights, one weight for each input
to the layer containing that node + an additional weight for the bias.
Each node is represented as a dictionary that stores key-value pairs
Each key corresponds to a property of that node (e.g. weights).
Weights will be initialized to small random values between 0 and 1.
:param int no_inputs: Numper of inputs (i.e. attributes)
:param int no_hidden_layers: Numper of hidden layers (0 or more)
:param int no_nodes_per_hidden_layer: Numper of nodes per hidden layer
:param int no_outputs: Numper of outputs (one output node per class)
:return: network
:rtype:list (i.e. list of layers: hidden layers, output layer)
"""
# Create an empty list
network = list()
# Create the the hidden layers
hidden_layer = list()
hl_counter = 0
# Create the output layer
output_layer = list()
# If this neural network contains hidden layers
if no_hidden_layers > 0:
# Build one hidden layer at a time
for layer in range(no_hidden_layers):
# Reset to an empty hidden layer
hidden_layer = list()
# If this is the first hidden layer
if hl_counter == 0:
# Build one node at a time
for node in range(no_nodes_per_hidden_layer):
initial_weights = list()
# Each node in the hidden layer has no_inputs + 1 weights,
# initialized to a random number in the range [0.0, 1.0)
for i in range(no_inputs + 1):
initial_weights.append(random())
# Add the node to the first hidden layer
hidden_layer.append({'weights':initial_weights})
# Finished building the first hidden layer
hl_counter += 1
# Add this first hidden layer to the front of the neural
# network
network.append(hidden_layer)
# If this is not the first hidden layer
else:
# Build one node at a time
for node in range(no_nodes_per_hidden_layer):
initial_weights = list()
# Each node in the hidden layer has
# no_nodes_per_hidden_layer + 1 weights, initialized to
# a random number in the range [0.0, 1.0)
for i in range(no_nodes_per_hidden_layer + 1):
initial_weights.append(random())
hidden_layer.append({'weights':initial_weights})
# Add this newly built hidden layer to the neural network
network.append(hidden_layer)
# Build the output layer
for outputnode in range(no_outputs):
initial_weights = list()
# Each node in the output layer has no_nodes_per_hidden_layer
# + 1 weights, initialized to a random number in
# the range [0.0, 1.0)
for i in range(no_nodes_per_hidden_layer + 1):
initial_weights.append(random())
# Add this output node to the output layer
output_layer.append({'weights':initial_weights})
# Add the output layer to the neural network
network.append(output_layer)
# A neural network has no hidden layers
else:
# Build the output layer
for outputnode in range(no_outputs):
initial_weights = list()
# Each node in the hidden layer has no_inputs + 1 weights,
# initialized to a random number in the range [0.0, 1.0)
for i in range(no_inputs + 1):
initial_weights.append(random())
# Add this output node to the output layer
output_layer.append({'weights':initial_weights})
network.append(output_layer)
# Finished building the initial neural network
return network
def weighted_sum_of_inputs(weights, inputs):
"""
Calculates the weighted sum of inputs plus the bias
:param list weights: A list of weights. Each node has a list of weights.
:param list inputs: A list of input values. These can be a single row
of attribute values or the output from nodes from the previous layer
:return: weighted_sum
:rtype: float
"""
# We assume that the last weight is the bias value
# The bias value is a special weight that does not multiply with an input
# value (or we could assume its corresponding input value is always 1)
# The bias is similar to the intercept constant b in y = mx + b. It enables
# a (e.g. sigmoid) curve to be shifted to create a better fit
# to the data. Without the bias term b, the line always goes through the
# origin (0,0) and cannot adapt as well to the data.
# In y = mx + b, we assume b * x_0 where x_0 = 1
# Initiate the weighted sum with the bias term. Assume the last weight is
# the bias term
weighted_sum = weights[-1]
for index in range(len(weights) - 1):
weighted_sum += weights[index] * inputs[index]
return weighted_sum
def sigmoid(weighted_sum_of_inputs_plus_bias):
"""
Run the weighted sum of the inputs + bias through
the sigmoid activation function.
:param float weighted_sum_of_inputs_plus_bias: Node summation term
:return: sigmoid(weighted_sum_of_inputs_plus_bias)
"""
return 1.0 / (1.0 + exp(-weighted_sum_of_inputs_plus_bias))
def forward_propagate(network, instance):
"""
Instances move forward through the neural network from one layer
to the next layer. At each layer, the outputs are calculated for each
node. These outputs are the inputs for the nodes in the next layer.
The last set of outputs is the output for the nodes in the output
layer.
:param list network: List of layers: 0+ hidden layers, 1 output layer
:param list instance (a single training/test instance from the data set)
:return: outputs
:rtype: list
"""
inputs = instance
# For each layer in the neural network
for layer in network:
# These will store the outputs for this layer
new_inputs = list()
# For each node in this layer
for node in layer:
# Calculate the weighted sum + bias term
weighted_sum = weighted_sum_of_inputs(node['weights'], inputs)
# Run the weighted sum through the activation function
# and store the result in this node's dictionary.
# Now the node's dictionary has two keys, weights and output.
node['output'] = sigmoid(weighted_sum)
# Used for debugging
#print(node)
# Add the output of the node to the new_inputs list
new_inputs.append(node['output'])
# Update the inputs list
inputs = new_inputs
# We have reached the output layer
outputs = inputs
return outputs
def sigmoid_derivative(output):
"""
The derivative of the sigmoid activation function with respect
to the weighted summation term of the node.
Formally (after a lot of calculus), this derivative is:
derivative = sigmoid(weighted_sum_of_inputs_plus_bias) *
(1 - sigmoid(weighted_sum_of_inputs_plus_bias))
= node_ouput * (1 - node_output)
This method is used during the backpropagation phase.
:param list output: Output of a node (generated during the forward
propagation phase)
:return: sigmoid_der
:rtype: float
"""
return output * (1.0 - output)
def back_propagate(network, actual):
"""
In backpropagation, the error is computed between the predicted output by
the network and the actual output as determined by the data set. This error
propagates backwards from the output layer to the first hidden layer. The
weights in each layer are updated along the way in response to the error.
The goal is to reduce the prediction error for the next training instance
that forward propagates through the network.
:param network list: The neural network
:param actual list: A list of the actual output from the data set
"""
# Iterate in reverse order (i.e. starts from the output layer)
for i in reversed(range(len(network))):
# Work one layer at a time
layer = network[i]
# Keep track of the errors for the nodes in this layer
errors = list()
# If this is a hidden layer
if i != len(network) - 1:
# For each node_j in this hidden layer
for j in range(len(layer)):
# Reset the error value
error = 0.0
# Calculate the weighted error.
# The error values come from the error (i.e. delta) calculated
# at each node in the layer just to the "right" of this layer.
# This error is weighted by the weight connections between the
# node in this hidden layer and the nodes in the layer just
# to the "right" of this layer.
for node in network[i + 1]:
error += (node['weights'][j] * node['delta'])
# Add the weighted error for node_j to the
# errors list
errors.append(error)
# If this is the output layer
else:
# For each node in the output layer
for j in range(len(layer)):
# Store this node (i.e. dictionary)
node = layer[j]
# Actual - Predicted = Error
errors.append(actual[j] - node['output'])
# Calculate the delta for each node_j in this layer
for j in range(len(layer)):
node = layer[j]
# Add an item to the node's dictionary with the
# key as delta.
node['delta'] = errors[j] * sigmoid_derivative(node['output'])
def update_weights(network, instance, learning_rate):
"""
After the deltas (errors) have been calculated for each node in
each layer of the neural network, the weights can be updated.
new_weight = old_weight + learning_rate * delta * input_value
:param list network: List of layers: 0+ hidden layers, 1 output layer
:param list instance: A single training/test instance from the data set
:param float learning_rate: Controls step size in the stochastic gradient
descent procedure.
"""
# For each layer in the network
for layer_index in range(len(network)):
# Extract all the attribute values, excluding the class value
inputs = instance[:-1]
# If this is not the first hidden layer
if layer_index != 0:
# Go through each node in the previous layer and add extract the
# output from that node. The output from the previous layer
# is the input to this layer.
inputs = [node['output'] for node in network[layer_index - 1]]
# For each node in this layer
for node in network[layer_index]:
# Go through each input value
for j in range(len(inputs)):
# Update the weights
node['weights'][j] += learning_rate * node['delta'] * inputs[j]
# Updating the bias weight
node['weights'][-1] += learning_rate * node['delta']
def train_neural_net(
network, training_set, learning_rate, no_epochs, no_outputs):
"""
Train a neural network that has already been initialized.
Training is done using stochastic gradient descent where the weights
are updated one training instance at a time rather than after the
entire training set (as is the case with gradient descent).
:param list network: The neural network, which is a list of layers
:param list training_set: A list of training instances (i.e. examples)
:param float learning_rate: Controls step size of gradient descent
:param int no_epochs: How many passes we will make through training set
:param int no_outputs: The number of output nodes (equal to # of classes)
"""
# Go through the entire training set a fixed number of times (i.e. epochs)
for epoch in range(no_epochs):
# Update the weights one instance at a time
for instance in training_set:
# Forward propagate the training instance through the network
# and produce the output, which is a list.
outputs = forward_propagate(network, instance)
# Vectorize the output using one hot encoding.
# Create a list called actual_output that is the same length
# as the number of outputs. Put a 1 in the place of the actual
# class.
actual_output = [0 for i in range(no_outputs)]
actual_output[int(instance[-1])] = 1
back_propagate(network, actual_output)
update_weights(network, instance, learning_rate)
def predict_class(network, instance):
"""
Make a class prediction given a trained neural network and
an instance from the test data set.
:param list network: The neural network, which is a list of layers
:param list instance: A single training/test instance from the data set
:return class_prediction
:rtype int
"""
outputs = forward_propagate(network, instance)
# Return the index that has the highest probability. This index
# is the class value. Assume class values begin at 0 and go
# upwards by 1 (i.e. 0, 1, 2, ...)
class_prediction = outputs.index(max(outputs))
return class_prediction
def calculate_accuracy(actual, predicted):
"""
Calculates the accuracy percentages
:param list actual: Actual class values
:param list predicted: predicted class values
:return: classification_accuracy
:rtype: float (as a percentage)
"""
number_of_correct_predictions = 0
for index in range(len(actual)):
if actual[index] == predicted[index]:
number_of_correct_predictions += 1
classification_accuracy = (
number_of_correct_predictions / float(len(actual))) * 100.0
return classification_accuracy
def get_test_set_predictions(
training_set, test_set, learning_rate, no_epochs,
no_hidden_layers, no_nodes_per_hidden_layer):
"""
This method is the workhorse.
A new neutal network is initialized.
The network is trained on the training set.
The trained neural network is used to generate predictions on the
test data set.
:param list training_set
:param list test_set
:param float learning_rate
:param int no_epochs
:param int no_hidden_layers
:param int no_nodes_per_hidden_layer
:return network, class_predictions
:rtype list, list
"""
# Get the number of attribute values
no_inputs = len(training_set[0]) - 1
# Calculate the number of unique classes
no_outputs = len(set([instance[-1] for instance in training_set]))
# Build a new neural network
network = initialize_neural_net(
no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs)
train_neural_net(
network, training_set, learning_rate, no_epochs, no_outputs)
# Store the class predictions for each test instance
class_predictions = list()
for instance in test_set:
cl_prediction = predict_class(network, instance)
class_predictions.append(cl_prediction)
# Return the learned model as well as the class predictions
return network, class_predictions
###############################################################
def main():
"""
The main method of the program
"""
LEARNING_RATE = 0.1 # Used for stochastic gradient descent procedure
NO_EPOCHS = 1000 # Epoch is one complete pass through training data
NO_HIDDEN_LAYERS = 1 # Number of hidden layers
NO_NODES_PER_HIDDEN_LAYER = 8 # Number of nodes per hidden layer
# Welcome message
print("Welcome to the " + ALGORITHM_NAME + " Program!")
print()
# Directory where data set is located
#data_path = input("Enter the path to your input file: ")
data_path = "vote.txt"
# Read the full text file and store records in a Pandas dataframe
pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)
# Show functioning of the program
#trace_runs_file = input("Enter the name of your trace runs file: ")
trace_runs_file = "vote_nn_trace_runs.txt"
## Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")
# Testing statistics
#test_stats_file = input("Enter the name of your test statistics file: ")
test_stats_file = "vote_nn_test_stats.txt"
## Open a test_stats_file
outfile_ts = open(test_stats_file,"w")
# Generate a list of the column names
column_names = list(pd_data_set)
# The input layer in the neural network
# will have one node for each attribute value
no_of_inputs = len(column_names) - 1
# Make a list of the unique classes
list_of_unique_classes = pd.unique(pd_data_set["Actual Class"])
# The output layer in the neural network
# will have one node for each class value
no_of_outputs = len(list_of_unique_classes)
# Replace all the class values with numbers, starting from 0
# in the Pandas dataframe.
for cl in range(0, len(list_of_unique_classes)):
pd_data_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)
# Normalize the attribute values so that they are all between 0
# and 1, inclusive
normalized_pd_data_set = normalize(pd_data_set)
# Convert normalized Pandas dataframe into a list
dataset_as_list = normalized_pd_data_set.values.tolist()
# Set the seed for random number generator
seed(1)
# Get a list of 5 stratified folds because we are doing
# five-fold stratified cross-validation
fv_folds = get_five_stratified_folds(dataset_as_list)
# Keep track of the scores for each of the five experiments
scores = list()
experiment_counter = 0
for fold in fv_folds:
print()
print("Running Experiment " + str(experiment_counter) + " ...")
print()
outfile_tr.write("Running Experiment " + str(
experiment_counter) + " ...\n")
outfile_tr.write("\n")
# Get all the folds and store them in the training set
training_set = list(fv_folds)
# Four folds make up the training set
training_set.remove(fold)
# Combined all the folds so that all we have is a list
# of training instances
training_set = sum(training_set, [])
# Initialize a test set
test_set = list()
# For each instance in this test fold
for instance in fold:
# Create a copy and store it
copy_of_instance = list(instance)
test_set.append(copy_of_instance)
# Get the trained neural network and the predicted values
# for each test instance
neural_net, predicted_values = get_test_set_predictions(
training_set, test_set,LEARNING_RATE,NO_EPOCHS,
NO_HIDDEN_LAYERS,NO_NODES_PER_HIDDEN_LAYER)
actual_values = [instance[-1] for instance in fold]
accuracy = calculate_accuracy(actual_values, predicted_values)
scores.append(accuracy)
# Print the learned model
print("Experiment " + str(
experiment_counter) + " Trained Neural Network")
print()
for layer in neural_net:
print(layer)
print()
outfile_tr.write("Experiment " + str(
experiment_counter) + " Trained Neural Network")
outfile_tr.write("\n")
outfile_tr.write("\n")
for layer in neural_net:
outfile_tr.write(str(layer))
outfile_tr.write("\n")
outfile_tr.write("\n\n")
# Print the classifications on the test instances
print("Experiment " + str(
experiment_counter) + " Classifications on Test Instances")
print()
outfile_tr.write("Experiment " + str(
experiment_counter) + " Classifications on Test Instances")
outfile_tr.write("\n\n")
test_df = pd.DataFrame(test_set, columns=column_names)
# Add 2 additional columns to the testing dataframe
test_df = test_df.reindex(
columns=[*test_df.columns.tolist(
), 'Predicted Class', 'Prediction Correct?'])
# Add the predicted values to the "Predicted Class" column
# Indicate if the prediction was correct or not.
for pre_val_index in range(len(predicted_values)):
test_df.loc[pre_val_index, "Predicted Class"] = predicted_values[
pre_val_index]
if test_df.loc[pre_val_index, "Actual Class"] == test_df.loc[
pre_val_index, "Predicted Class"]:
test_df.loc[pre_val_index, "Prediction Correct?"] = "Yes"
else:
test_df.loc[pre_val_index, "Prediction Correct?"] = "No"
# Replace all the class values with the name of the class
for cl in range(0, len(list_of_unique_classes)):
test_df["Actual Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
test_df["Predicted Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
# Print out the test data frame
print(test_df)
print()
print()
outfile_tr.write(str(test_df))
outfile_tr.write("\n\n")
# Go to the next experiment
experiment_counter += 1
print("Experiments Completed.\n")
outfile_tr.write("Experiments Completed.\n\n")
# Print the test stats
print("------------------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics")
print("------------------------------------------------------------------")
print("Data Set : " + data_path)
print()
print("Learning Rate: " + str(LEARNING_RATE))
print("Number of Epochs: " + str(NO_EPOCHS))
print("Number of Hidden Layers: " + str(NO_HIDDEN_LAYERS))
print("Number of Nodes Per Hidden Layer: " + str(
NO_NODES_PER_HIDDEN_LAYER))
print()
print("Accuracy Statistics for All 5 Experiments: %s" % scores)
print()
print()
print("Classification Accuracy: %.3f%%" % (
sum(scores)/float(len(scores))))
outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write(
"------------------------------------------------------------------\n")
outfile_ts.write("Data Set : " + data_path +"\n\n")
outfile_ts.write("Learning Rate: " + str(LEARNING_RATE) + "\n")
outfile_ts.write("Number of Epochs: " + str(NO_EPOCHS) + "\n")
outfile_ts.write("Number of Hidden Layers: " + str(
NO_HIDDEN_LAYERS) + "\n")
outfile_ts.write("Number of Nodes Per Hidden Layer: " + str(
NO_NODES_PER_HIDDEN_LAYER) + "\n")
outfile_ts.write(
"Accuracy Statistics for All 5 Experiments: %s" % str(scores))
outfile_ts.write("\n\n")
outfile_ts.write("Classification Accuracy: %.3f%%" % (
sum(scores)/float(len(scores))))
## Close the files
outfile_tr.close()
outfile_ts.close()
main()
The breast cancer data set results
were in line with what I expected. The simpler model, the one with no hidden
layers, ended up generating the highest classification accuracy. Classification
accuracy was just short of 97%. In other words, the neural network that had no
hidden layers successfully classified a patient as either malignant or benign
with an almost 97% accuracy.
These results also suggest that
the amount of training data has a direct impact on performance. Higher amounts
of data (699 instances in this case) can lead to better learning and better
classification accuracy on new, unseen instances.
Glass Data Set
The performance of the neural
network on the glass data set was the worst out of all of the data sets. The
ability of the network to correctly identify the type of glass given the
attribute values never exceeded 70%.
I hypothesize that the poor
performance on the glass data set is due to the high numbers of classes
combined with a relatively smaller data set.
Iris Data Set
Classification accuracy was
superb on the iris dataset, attaining a classification accuracy around 97%. The
results of the iris dataset were surprising given that the more complicated
neural network with two hidden layers and five nodes per hidden layer outperformed
the simpler neural network that had no hidden layers. In this case, it appears
that the iris dataset benefited from the increasing layers of abstraction
provided by a higher number of layers.
Soybean Data Set (small)
Performance on the soybean data set
was stellar and was the highest of all of the data sets but also had the
largest standard deviation for the classification accuracy. Note that
classification accuracy reached a peak of 100% using one hidden layer and eight
nodes for the hidden layer. However, when I added an additional hidden layer,
classification accuracy dropped to under 70%.
The reason for the high standard
deviation of the classification accuracy is unclear, but I hypothesize it has
to do with the relatively small number of training instances. Future work would
need to be performed with the soybean large dataset available from the UCI Machine
Learning Repository to see if these results remain consistent.
The results of the soybean runs suggest
that large numbers of relevant attributes can help a machine learning algorithm
create more accurate classifications.
Vote Data Set
The vote data set did not yield the
stellar performance of the soybean data set, but classification accuracy was
still solid at ~96% using one hidden layer and eight nodes per hidden layer.
These results are in line with what I expected because voting behavior should
provide a powerful predictor of whether a candidate is a Democrat or
Republican. I would have been surprised had I observed classification
accuracies that were lower since members of Congress tend to vote along party
lines on most issues.
Summary and Conclusions
My hypothesis was incorrect. In
some cases, simple neural networks with no hidden layers outperformed more
complex neural networks with 1+ hidden layers. However, in other cases, more
complex neural networks with multiple hidden layers outperformed the network
with no hidden layers. The reason why some data is more amenable to networks
with hidden layers instead of without hidden layers is unclear.
Other conclusions include the
following:
Higher amounts of data can lead to better
learning and better classification accuracy on new, unseen instances.
Large numbers of relevant attributes can help a neural
network create more accurate classifications.
Neural networks are powerful and can achieve
excellent results on both binary and multi-class classification problems.
Alpaydin, E. (2014). Introduction to Machine
Learning. Cambridge, Massachusetts: The MIT Press.
Fisher, R. (1988, July 01). Iris Data Set.
Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris
German, B. (1987, September 1). Glass
Identification Data Set. Retrieved from UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification
Ĭordanov, I., & Jain, L. C. (2013). Innovations
in Intelligent Machines –3 : Contemporary Achievements in Intelligent
Systems. Berlin: Springer.
Michalski, R. (1980). Learning by being told and
learning from examples: an experimental comparison of the two methodes of
knowledge acquisition in the context of developing an expert system for
soybean disease diagnosis. International Journal of Policy Analysis and
Information Systems, 4(2), 125-161.
Montavon, G. O. (2012). Neural Networks : Tricks
of the Trade. New York: Springer.
Schlimmer, J. (1987, 04 27). Congressional Voting
Records Data Set. Retrieved from Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin
(Original) Data Set. Retrieved from Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25
Logistic regression, contrary to
the name, is a classification algorithm. Unlike linear regression which outputs
a continuous value (e.g. house price) for the prediction, Logistic Regression
transforms the output into a probability value (i.e. a number between 0 and 1)
using what is known as the logistic sigmoid function. This function is also
known as the squashing function since it maps a line — that can run from
negative infinity to positive infinity along the y-axis — to a number between
0 and 1.
The function is called the
sigmoid function because it is s-shaped. Here is what the sigmoid function
looks like in mathematical notation:
where:
h(z) is the predicted probability of a given instance (i.e. example) being in the positive class…that is the class represented as 1 in a data set. For example, in an e-mail classification data set, this would be the probability that a given e-mail instance is spam (If h(z) = 0.73, for example, that would mean that the instance has a 73% chance of being spam).
1- h(z) is the probability of an instance being in the negative class, the class represented as 0 (e.g. not spam). h(z) is always a number between 0 and 1. Going back to the example in the bullet point above, this would mean that the instance has a 27% change of being not spam.
z is the input (e.g. a weighted sum of the attributes of a given instance)
z is commonly expressed as the
dot product, w · x, where w is a 1-dimensional vector containing
the weights for each attribute, and x is a vector containing the values of each
attribute for a specific instance of the data set (i.e. example).
Often the dot product, w · x,
is written as matrix multiplication. In that case, z = wTx where T
means transpose of the single dimensional weight vector w. The symbol Ɵ
is often used in place of w.
So substituting w · x
into the sigmoid equation, we getthe following equation:
where
w is a 1-dimensional vector containing the
weights for each attribute.
The subscript w on hw means the
attributes x are weighted by the weight vector w.
hw(x) is the probability (a value
between 0 and 1) that an instance is a member of the positive class (i.e.
probability an e-mail is spam).
x is a vector containing the values of each
attribute for a specific instance of the data set.
w · x = w0x0 + w1x1
+ w2x2 + …. + wdxd (analogous to the equation of a line y
= mx + b from grade school)
d is the number of attributes in the data set
x0 = 1 by convention, for all
instances. This attribute has to be added by the programmer for all instances.
It is known formally as the “bias” term.
As is the case for many machine
learning algorithms, the starting point for Logistic Regression is to create a
trained model. One we have a trained model, we can use it to make predictions
on new, unseen instances.
Training Phase
Creating a trained model entails
determining the weight vector w. Once we have the weights, we can make
predictions on new unseen examples. All we need are the values of the
attributes of those examples (i.e. the x values), and we can weight the x
values with the values of w to compute the probabilities h(x) for that example
using the sigmoid function.
The rule for making predictions
using the sigmoid function is as follows:
If hw(x) ≥ 0.5, class = 1 (positive
class, e.g. spam)
If hw(x) < 0.5, class = 0
(negative class, e.g. not spam)
To determine the weights in
linear regression, the sum of the squared error was the cost function (where
error = actual values – predicted values by the line). The cost function
represents how wrong a prediction is. In linear regression, it represents how
wrong a line of best fit is on a set of observed training instances. The lower
the sum of the squared error, the better a line fits the training data, and, in
theory, the better the line will predict new, unseen instances.
Instead, the cost function in Logistic
Regression is called cross-entropy. Without getting too detailed into the
mathematics and notation of this particular equation, the cross-entropy equation
is the one that we want to minimize. Minimizing this equation will yield us a
sigmoid curve that best fits the training data and enables us to make the best
classification predictions possible for new, unseen test instances. A minimum
of the cost function is attained when the gradient of the cost function is
close to zero (i.e. the calculated weights stop changing). The formal term for the
gradient of the cost function getting close to zero is called convergence.
In order to minimize the cost
function, we need to find its gradient (i.e. derivative, slope, etc.) and
determine the values for the weight vector w that make its derivative as close
to 0 as possible. We cannot just set the gradient to 0 and then enter x-values
and calculate the weights directly. Instead, we have to use a method called
gradient descent in order to find the weights.
In the gradient descent algorithm
for Logistic Regression, we:
Start off with an empty weight vector (initialized to random values between -0.01 and 0.01). The size of the vector is equal to the number of attributes in the data set.
Initialize an empty weight change vector initialized to all zeros. The size of the vector is equal to the number of attributes in the data set.
For each training instance, one at a time.
a. Make a probability prediction by calculating the weighted sum of the attribute values and running that value through the sigmoid function.
b. We evaluate the gradient of the cost function by plugging in the actual (i.e. observed) class value and the predicted class value from bullet point 3a above.
c. The gradient value from 3b gets added to the weight change vector.
After we finish with the last training instance from 3, we multiply each value in the weight change vector by a learning rate (commonly 0.01).
The vector from 4 gets added to the empty weight vector to update the weights.
We then ask two questions
a. Have the weights continued to change (i.e. is the norm (i.e. magnitude) of the weight change vector less than a certain threshold like 0.001)?
b. Have we been through the data set less than 10,000 (or whatever we set the maximum iterations to) times?
c. If the answer is yes to both 6a and 6b, go back to step 2. Otherwise, we return the final weight vector, exiting the algorithm.
The gradient descent pseudocode
for Logistic Regression is provided in Figure 10.6 of Introduction to Machine Learning by Ethem Alpaydin (Alpaydin, 2014).
Testing Phase
Once training is completed, we
have the weights and can use these weights, attribute values, and the sigmoid
function to make predictions for the set of test instances.
Predictions for a given test instance are made using the aforementioned sigmoid function:
Where the rule for making
predictions using the sigmoid function is as follows:
If hw(x) ≥ 0.5, class = 1 (positive
class, e.g. spam)
If hw(x) < 0.5, class = 0 (negative
class, e.g. not spam)
Multi-class Logistic Regression
A Multi-class Logistic Regression
problem is a twist on the binary Logistic Regression method presented above.
Multi-class Logistic Regression can make predictions on both binary and
multi-class classification problems.
In order to make predictions for
multi-class datasets, we take the training set and create multiple separate
binary classification problems (one for each class in the data set). For each
of those training sets that we generated, we set the class values for one class
to 1 (representing the positive class), and we set all other classes to 0 (i.e.
the negative class).
In other words, if there are k
classes in a data set, k separate training sets are generated. In each of those
k separate training sets, one class is set to 1 and all other classes are set
to 0.
In Multi-class Logistic
Regression, the training phase entails creating k different weight vectors, one
for each class rather than just a single weight vector (which was the case in
binary Logistic Regression). Each weight vector will help to predict the
probability of an instance being a member of that class. Thus, in the testing
phase, when there is an unseen new instance, three different predictions need
to be made. This method is called the one-vs-all strategy, sometimes called one-vs-rest.
The rule for making predictions
for a given instance are as follows:
For each new test instance,
Make k separate probability predictions.
Pick the class that has the highest probability
(i.e. the class that is the most enthusiastic about that instance being a
member of its class)
Other multi-class Logistic
Regression algorithms include Softmax Regression and the one-vs-one strategy.
The one-vs-all strategy was selected due to its popularity as being the default
strategy used in practice for many of the well-known machine learning libraries
for Python (Rebala, Ravi, & Churiwala, 2019)
Video
Here is an excellent video on logistic regression that explains the whole process I described above, step-by-step.
The program then adds two
additional columns for the testing set.
N + 1: Predicted Class
N + 2: Prediction Correct? (1 if yes, 0 if no)
Breast Cancer Data Set
This breast cancer data set
contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg,
1992).
Modification of Attribute Values
The actual class value was changed to “Benign” or
“Malignant.”
I transformed the attributes into binary numbers so that the
algorithms could process the data properly and efficiently. If attribute value
was greater than 5, the value was changed to 1, otherwise it was 0.
Missing Data
There were 16 missing attribute values, each denoted with a “?”. I
chose a random number between 1 and 10 (inclusive) to fill in the data.
Glass Data Set
This glass data set contains 214
instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to
identify the type of glass.
Modification of Attribute Values
If attribute values were greater than the median of the attribute,
value was changed to 1, otherwise it was set to 0.
Missing Data
There are no missing values in this data set.
Iris Data Set
This data set contains 3 classes
of 50 instances each (150 instances in total), where each class refers to a
different type of iris plant (Fisher, 1988).
Modification of Attribute Values
If attribute values were greater than the median of the attribute,
value was changed to 1, otherwise it was set to 0.
Missing Data
There were no missing attribute values.
Soybean Data Set (small)
This soybean (small) data set
contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to
determine the disease type.
Modification of Attribute Values
If attribute values were greater than the median of the attribute,
value was changed to 1, otherwise it was set to 0.
Missing Data
There are no missing values in this data set.
Vote Data Set
This data set includes votes for
each of the U.S. House of Representatives Congressmen (435 instances) on the 16
key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to
identify the representative as either a Democrat or Republican.
267 Democrats
168 Republicans
Modification of Attribute Values
I did the following modifications:
Changed all “y” to 1 and all “n” to 0.
Missing Data
Missing values were denoted as “?”. To fill in
those missing values, I chose random number, either 0 (“No”) or 1 (“Yes”).
Description of Any Tuning Process Applied
Some tuning was performed in this
project. The learning rate was set to 0.01 by convention. A higher learning
rate (0.5) resulted in poor results for the norm of the gradient (>1).
The stopping criteria for
gradient descent was as follows:
Maximum iterations = 10,000
Euclidean norm of weight change vector <
0.001
When I tried max iterations at
100, the Euclidean norm of the weight change vector returned high values (>
0.2) which indicated that I needed to set a higher max iterations value in
order to have a higher chance of convergence (i.e. weights stop changing) based
on the norm stopping criteria.
Here is the driver code. This is where the main method is located:
import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
import five_fold_stratified_cv
import logistic_regression
# File name: logistic_regression_driver.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Driver of the logistic_regression.py program
# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class
# The logistic_regression.py program then adds 2 additional columns
# for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)
ALGORITHM_NAME = "Logistic Regression"
SEPARATOR = "," # Separator for the data set (e.g. "\t" for tab data)
def main():
print("Welcome to the " + ALGORITHM_NAME + " Program!")
print()
# Directory where data set is located
data_path = input("Enter the path to your input file: ")
#data_path = "iris.txt"
# Read the full text file and store records in a Pandas dataframe
pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)
# Show functioning of the program
trace_runs_file = input("Enter the name of your trace runs file: ")
#trace_runs_file = "iris_logistic_regression_trace_runs.txt"
# Open a new file to save trace runs
outfile_tr = open(trace_runs_file,"w")
# Testing statistics
test_stats_file = input("Enter the name of your test statistics file: ")
#test_stats_file = "iris_logistic_regression_test_stats.txt"
# Open a test_stats_file
outfile_ts = open(test_stats_file,"w")
# The number of folds in the cross-validation
NO_OF_FOLDS = 5
# Generate the five stratified folds
fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
pd_data_set)
training_dataset = None
test_dataset = None
# Create an empty array of length 5 to store the accuracy_statistics
# (classification accuracy)
accuracy_statistics = np.zeros(NO_OF_FOLDS)
# Run Logistic Regression the designated number of times as indicated by the
# number of folds
for experiment in range(0, NO_OF_FOLDS):
print()
print("Running Experiment " + str(experiment + 1) + " ...")
print()
outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
outfile_tr.write("\n")
# Each fold will have a chance to be the test data set
if experiment == 0:
test_dataset = fold0
training_dataset = pd.concat([
fold1, fold2, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 1:
test_dataset = fold1
training_dataset = pd.concat([
fold0, fold2, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 2:
test_dataset = fold2
training_dataset = pd.concat([
fold0, fold1, fold3, fold4], ignore_index=True, sort=False)
elif experiment == 3:
test_dataset = fold3
training_dataset = pd.concat([
fold0, fold1, fold2, fold4], ignore_index=True, sort=False)
else:
test_dataset = fold4
training_dataset = pd.concat([
fold0, fold1, fold2, fold3], ignore_index=True, sort=False)
accuracy, predictions, weights_for_each_class, no_of_instances_test = (
logistic_regression.logistic_regression(training_dataset,test_dataset))
# Print the trace runs of each experiment
print("Accuracy:")
print(str(accuracy * 100) + "%")
print()
print("Classifications:")
print(predictions)
print()
print("Learned Model:")
print(weights_for_each_class)
print()
print("Number of Test Instances:")
print(str(no_of_instances_test))
print()
outfile_tr.write("Accuracy:")
outfile_tr.write(str(accuracy * 100) + "%\n\n")
outfile_tr.write("Classifications:\n")
outfile_tr.write(str(predictions) + "\n\n")
outfile_tr.write("Learned Model:\n")
outfile_tr.write(str(weights_for_each_class) + "\n\n")
outfile_tr.write("Number of Test Instances:")
outfile_tr.write(str(no_of_instances_test) + "\n\n")
# Store the accuracy in the accuracy_statistics array
accuracy_statistics[experiment] = accuracy
outfile_tr.write("Experiments Completed.\n")
print("Experiments Completed.\n")
# Write to a file
outfile_ts.write("----------------------------------------------------------\n")
outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
outfile_ts.write("----------------------------------------------------------\n")
outfile_ts.write("Data Set : " + data_path + "\n")
outfile_ts.write("\n")
outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
outfile_ts.write(np.array2string(
accuracy_statistics, precision=2, separator=',',
suppress_small=True))
outfile_ts.write("\n")
outfile_ts.write("\n")
accuracy = np.mean(accuracy_statistics)
accuracy *= 100
outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
# Print to the console
print()
print("----------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics")
print("----------------------------------------------------------")
print("Data Set : " + data_path)
print()
print()
print("Accuracy Statistics for All 5 Experiments:")
print(accuracy_statistics)
print()
print()
print("Classification Accuracy : " + str(accuracy) + "%")
print()
# Close the files
outfile_tr.close()
outfile_ts.close()
main()
Here is the code for logistic regression:
import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
# File name: logistic_regression.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Multi-class logistic regression using one-vs-all.
# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class
# This program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)
def sigmoid(z):
"""
Parameters:
z: A real number
Returns:
1.0/(1 + np.exp(-z))
"""
return 1.0/(1 + np.exp(-z))
def gradient_descent(training_set):
"""
Gradient descent for logistic regression. Follows method presented
in the textbook Introduction to Machine Learning 3rd Edition by
Ethem Alpaydin (pg. 252)
Parameters:
training_set: The training instances as a Numpy array
Returns:
weights: The vector of weights, commonly called w or THETA
"""
no_of_columns_training_set = training_set.shape[1]
no_of_rows_training_set = training_set.shape[0]
# Extract the attributes from the training set.
# x is still a 2d array
x = training_set[:,:(no_of_columns_training_set - 1)]
no_of_attributes = x.shape[1]
# Extract the classes from the training set.
# actual_class is a 1d array.
actual_class = training_set[:,(no_of_columns_training_set - 1)]
# Set a learning rate
LEARNING_RATE = 0.01
# Set the maximum number of iterations
MAX_ITER = 10000
# Set the iteration variable to 0
iter = 0
# Set a flag to determine if we have exceeded the maximum number of
# iterations
exceeded_max_iter = False
# Set the tolerance. When the euclidean norm of the gradient vector
# (i.e. magnitude of the changes in the weights) gets below this value,
# stop iterating through the while loop
GRAD_TOLERANCE = 0.001
norm_of_gradient = None
# Set a flag to determine if we have reached the minimum of the
# cost (i.e. error) function.
converged = False
# Create the weights vector with random floats between -0.01 and 0.01
# The number of weights is equal to the number of attributes
weights = np.random.uniform(-0.01,0.01,(no_of_attributes))
changes_in_weights = None
# Keep running the loop below until convergence on the minimum of the
# cost function or we exceed the max number of iterations
while(not(converged) and not(exceeded_max_iter)):
# Initialize a weight change vector that stores the changes in
# the weights at each iteration
changes_in_weights = np.zeros(no_of_attributes)
# For each training instance
for inst in range(0, no_of_rows_training_set):
# Calculate weighted sum of the attributes for
# this instance
output = np.dot(weights, x[inst,:])
# Calculate the sigmoid of the weighted sum
# This y is the probability that this instance belongs
# to the positive class
y = sigmoid(output)
# Calculate difference
difference = (actual_class[inst] - y)
# Multiply the difference by the attribute vector
product = np.multiply(x[inst,:], difference)
# For each attribute, update the weight changes
# i.e. the gradient vector
changes_in_weights = np.add(changes_in_weights,product)
# Calculate the step size
step_size = np.multiply(changes_in_weights, LEARNING_RATE)
# Update the weights vector
weights = np.add(weights, step_size)
# Test to see if we have converged on the minimum of the error
# function
norm_of_gradient = np.linalg.norm(changes_in_weights)
if (norm_of_gradient < GRAD_TOLERANCE):
converged = True
# Update the number of iterations
iter += 1
# If we have exceeded the maximum number of iterations
if (iter > MAX_ITER):
exceeded_max_iter = True
#For debugging purposes
#print("Number of Iterations: " + str(iter - 1))
#print("Norm of the gradient: " + str(norm_of_gradient))
#print(changes_in_weights)
#print()
return weights
def logistic_regression(training_set, test_set):
"""
Multi-class one-vs-all logistic regression
Parameters:
training_set: The training instances as a Pandas dataframe
test_set: The test instances as a Pandas dataframe
Returns:
accuracy: Classification accuracy as a decimal
predictions: Classifications of all the test instances as a
Pandas dataframe
weights_for_each_class: The weight vectors for each class (one-vs-all)
no_of_instances_test: The number of test instances
"""
# Remove the instance ID column
training_set = training_set.drop(
training_set.columns[[0]], axis=1)
test_set = test_set.drop(
test_set.columns[[0]], axis=1)
# Make a list of the unique classes
list_of_unique_classes = pd.unique(training_set["Actual Class"])
# Replace all the class values with numbers, starting from 0
# in both the test and training sets.
for cl in range(0, len(list_of_unique_classes)):
training_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)
test_set["Actual Class"].replace(
list_of_unique_classes[cl], cl ,inplace=True)
# Insert a column of 1s in column 0 of both the training
# and test sets. This is the bias and helps with gradient
# descent. (i.e. X0 = 1 for all instances)
training_set.insert(0, "Bias", 1)
test_set.insert(0, "Bias", 1)
# Convert dataframes to numpy arrays
np_training_set = training_set.values
np_test_set = test_set.values
# Add 2 additional columns to the testing dataframe
test_set = test_set.reindex(
columns=[*test_set.columns.tolist(
), 'Predicted Class', 'Prediction Correct?'])
############################# Training Phase ##############################
no_of_columns_training_set = np_training_set.shape[1]
no_of_rows_training_set = np_training_set.shape[0]
# Create and store a training set for each unique class
# to create separate binary classification
# problems
trainingsets = []
for cl in range(0, len(list_of_unique_classes)):
# Create a copy of the training set
temp = np.copy(np_training_set)
# This class becomes the positive class 1
# and all other classes become the negative class 0
for row in range(0, no_of_rows_training_set):
if (temp[row, (no_of_columns_training_set - 1)]) == cl:
temp[row, (no_of_columns_training_set - 1)] = 1
else:
temp[row, (no_of_columns_training_set - 1)] = 0
# Add the new training set to the trainingsets list
trainingsets.append(temp)
# Calculate and store the weights for the training set
# of each class. Execute gradient descent on each training set
# in order to calculate the weights
weights_for_each_class = []
for cl in range(0, len(list_of_unique_classes)):
weights_for_this_class = gradient_descent(trainingsets[cl])
weights_for_each_class.append(weights_for_this_class)
# Used for debugging
#print(weights_for_each_class[0])
#print()
#print(weights_for_each_class[1])
#print()
#print(weights_for_each_class[2])
########################### End of Training Phase #########################
############################# Testing Phase ###############################
no_of_columns_test_set = np_test_set.shape[1]
no_of_rows_test_set = np_test_set.shape[0]
# Extract the attributes from the test set.
# x is still a 2d array
x = np_test_set[:,:(no_of_columns_test_set - 1)]
no_of_attributes = x.shape[1]
# Extract the classes from the test set.
# actual_class is a 1d array.
actual_class = np_test_set[:,(no_of_columns_test_set - 1)]
# Go through each row (instance) of the test data
for inst in range(0, no_of_rows_test_set):
# Create a scorecard that keeps track of the probabilities of this
# instance being a part of each class
scorecard = []
# Calculate and store the probability for each class in the scorecard
for cl in range(0, len(list_of_unique_classes)):
# Calculate weighted sum of the attributes for
# this instance
output = np.dot(weights_for_each_class[cl], x[inst,:])
# Calculate the sigmoid of the weighted sum
# This is the probability that this instance belongs
# to the positive class
this_probability = sigmoid(output)
scorecard.append(this_probability)
most_likely_class = scorecard.index(max(scorecard))
# Store the value of the most likely class in the "Predicted Class"
# column of the test_set data frame
test_set.loc[inst, "Predicted Class"] = most_likely_class
# Update the 'Prediction Correct?' column of the test_set data frame
# 1 if correct, else 0
if test_set.loc[inst, "Actual Class"] == test_set.loc[
inst, "Predicted Class"]:
test_set.loc[inst, "Prediction Correct?"] = 1
else:
test_set.loc[inst, "Prediction Correct?"] = 0
# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (test_set["Prediction Correct?"].sum())/(len(test_set.index))
# Store the revamped dataframe
predictions = test_set
# Replace all the class values with the name of the class
for cl in range(0, len(list_of_unique_classes)):
predictions["Actual Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
predictions["Predicted Class"].replace(
cl, list_of_unique_classes[cl] ,inplace=True)
# Replace 1 with Yes and 0 with No in the 'Prediction
# Correct?' column
predictions['Prediction Correct?'] = predictions[
'Prediction Correct?'].map({1: "Yes", 0: "No"})
# Reformat the weights_for_each_class list of arrays
weights_for_each_class = pd.DataFrame(np.row_stack(weights_for_each_class))
# Rename the row names
for cl in range(0, len(list_of_unique_classes)):
row_name = str(list_of_unique_classes[cl] + " weights")
weights_for_each_class.rename(index={cl:row_name}, inplace=True)
# Get a list of the names of the attributes
training_set_names = list(training_set.columns.values)
training_set_names.pop() # Remove 'Actual Class'
# Rename the column names
for col in range(0, len(training_set_names)):
col_name = str(training_set_names[col])
weights_for_each_class.rename(columns={col:col_name}, inplace=True)
# Record the number of test instances
no_of_instances_test = len(test_set.index)
# Return statement
return accuracy, predictions, weights_for_each_class, no_of_instances_test
Here is the code for five-fold stratified cross-validation:
import pandas as pd # Import Pandas library
import numpy as np # Import Numpy library
# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure
# that the proportion of each class in each group is roughly equal to its
# proportion in the entire data set.
# Required Data Set Format for Disrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class
def get_five_folds(instances):
"""
Parameters:
instances: A Pandas data frame containing the instances
Returns:
fold0, fold1, fold2, fold3, fold4
Five folds whose class frequency distributions are
each representative of the entire original data set (i.e. Five-Fold
Stratified Cross Validation)
"""
# Shuffle the data set randomly
instances = instances.sample(frac=1).reset_index(drop=True)
# Record the number of columns in the data set
no_of_columns = len(instances.columns) # number of columns
# Record the number of rows in the data set
no_of_rows = len(instances.index) # number of rows
# Create five empty folds (i.e. Panda Dataframes: fold0 through fold4)
fold0 = pd.DataFrame(columns=(instances.columns))
fold1 = pd.DataFrame(columns=(instances.columns))
fold2 = pd.DataFrame(columns=(instances.columns))
fold3 = pd.DataFrame(columns=(instances.columns))
fold4 = pd.DataFrame(columns=(instances.columns))
# Record the column of the Actual Class
actual_class_column = no_of_columns - 1
# Generate an array containing the unique
# Actual Class values
unique_class_list_df = instances.iloc[:,actual_class_column]
unique_class_list_df = unique_class_list_df.sort_values()
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df
unique_class_list_np_size = unique_class_list_np.size
# For each unique class in the unique Actual Class array
for unique_class_list_np_idx in range(0, unique_class_list_np_size):
# Initialize the counter to 0
counter = 0
# Go through each row of the data set and find instances that
# are part of this unique class. Distribute them among one
# of five folds
for row in range(0, no_of_rows):
# If the value of the unique class is equal to the actual
# class in the original data set on this row
if unique_class_list_np[unique_class_list_np_idx] == (
instances.iloc[row,actual_class_column]):
# Allocate instance to fold0
if counter == 0:
# Extract data for the new row
new_row = instances.iloc[row,:]
# Append that entire instance to fold
fold0.loc[len(fold0)] = new_row
# Increase the counter by 1
counter += 1
# Allocate instance to fold1
elif counter == 1:
# Extract data for the new row
new_row = instances.iloc[row,:]
# Append that entire instance to fold
fold1.loc[len(fold1)] = new_row
# Increase the counter by 1
counter += 1
# Allocate instance to fold2
elif counter == 2:
# Extract data for the new row
new_row = instances.iloc[row,:]
# Append that entire instance to fold
fold2.loc[len(fold2)] = new_row
# Increase the counter by 1
counter += 1
# Allocate instance to fold3
elif counter == 3:
# Extract data for the new row
new_row = instances.iloc[row,:]
# Append that entire instance to fold
fold3.loc[len(fold3)] = new_row
# Increase the counter by 1
counter += 1
# Allocate instance to fold4
else:
# Extract data for the new row
new_row = instances.iloc[row,:]
# Append that entire instance to fold
fold4.loc[len(fold4)] = new_row
# Reset counter to 0
counter = 0
return fold0, fold1, fold2, fold3, fold4
I hypothesize that performance
was high on this algorithm because of the large number of instances (699 in
total). This data set had the highest number of instances out of all the data
sets.
These results also suggest that
the amount of training data has a direct impact on performance. Higher amounts
of data can lead to better learning and better classification accuracy on new,
unseen instances.
Glass Data Set
I hypothesize that the poor
performance on the glass data set is due to the high numbers of classes
combined with a relatively smaller data set.
Iris Data Set
Classification accuracy on the iris data set was satisfactory. This data set was small, and more training data would be needed to see if accuracy could be improved by giving the algorithm more data to learn the underlying relationship between the attributes and the flower types.
Soybean Data Set (small)
I hypothesize that the large
numbers of attributes in the soybean data set (35) helped balance the
relatively small number of training instances. These results suggest that large
numbers of relevant attributes can help a machine learning algorithm create
more accurate classifications.
Vote Data Set
The results show that classification algorithms like Logistic Regression can have outstanding performance on large data sets that are binary classification problems.
Summary and Conclusions
Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
Large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.
Classification algorithms like Logistic Regression can achieve excellent classification accuracy on binary classification problems, but performance on multi-class classification algorithms can yield mixed results.
Alpaydin, E. (2014). Introduction to Machine
Learning. Cambridge, Massachusetts: The MIT Press.
Fisher, R. (1988, July 01). Iris Data Set.
Retrieved from Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/iris
German, B. (1987, September 1). Glass
Identification Data Set. Retrieved from UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification
Kelleher, J. D., Namee, B., & Arcy, A. (2015). Fundamentals
of Machine Learning for Predictive Data Analytics. Cambridge,
Massachusetts: The MIT Press.
Michalski, R. (1980). Learning by being told and
learning from examples: an experimental comparison of the two methodes of
knowledge acquisition in the context of developing an expert system for
soybean disease diagnosis. International Journal of Policy Analysis and
Information Systems, 4(2), 125-161.
Rebala, G., Ravi, A., & Churiwala, S. (2019). An
Introduction to Machine Learning. Switzerland: Springer.
Schlimmer, J. (1987, 04 27). Congressional Voting
Records Data Set. Retrieved from Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin
(Original) Data Set. Retrieved from Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%25
Y. Ng, A., & Jordan, M. (2001). On Discriminative
vs. Generative Classifiers: A Comparison of Logistic Regression and Naive
Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural
Information Processing Systems: Natural and Synthetic , 841-848.