Artificial Feedforward Neural Network With Backpropagation From Scratch

In this post, I will walk you through how to build an artificial feedforward neural network trained with backpropagation, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to evaluate the performance of an artificial feedforward neural network trained with backpropagation and to compare the performance using no hidden layers, one hidden layer, and two hidden layers. Five different data sets from the UCI Machine Learning Repository are used to compare performance: Breast Cancer, Glass, Iris, Soybean (small), and Vote.

We will use our neural network to do the following:

  • Predict if someone has breast cancer
  • Identify glass type
  • Identify flower species
  • Determine soybean disease type
  • Classify a representative as either a Democrat or Republican based on their voting patterns

I hypothesize that the neural networks with no hidden layers will outperform the networks with two hidden layers. My hypothesis is based on the notion that the simplest solutions are often the best solutions (i.e. Occam’s Razor).

The classification accuracy of the algorithms on the data sets will be evaluated as follows, using five-fold stratified cross-validation:

  • Accuracy = (number of correct predictions)/(total number of predictions)



What is an Artificial Feedforward Neural Network Trained with Backpropagation?


Background

An artificial feedforward neural network (also known as a multilayer perceptron) trained with backpropagation is a long-established machine learning technique that was developed so that machines could mimic the way the brain processes information. Neural networks were the focus of a lot of machine learning research during the 1980s and early 1990s but declined in popularity during the late 1990s.

Since 2010, neural networks have experienced a resurgence in popularity due to improvements in computing speed and the availability of massive amounts of data with which to train large-scale neural networks. In the real world, neural networks have been used to recognize speech, caption images, and even help self-driving cars learn how to park autonomously.

The Brain as Inspiration for Artificial Neural Networks


In order to understand neural networks, it helps to first take a look at the basic architecture of the human brain. The brain has approximately 10^11 neurons (Alpaydin, 2014). Neurons are cells inside the brain that process information.

Each neuron contains a number of input wires called dendrites. Each neuron also has one output wire called an axon. The axon is used to send messages to other neurons. The axon of a sending neuron is connected to the dendrites of the receiving neuron via a synapse.

So, in short, a neuron receives inputs from dendrites, performs a computation, and sends the output to other neurons via the axon. This process is how information flows through the brain. The messages sent between neurons are in the form of electric pulses.

An artificial neural network, the kind used in machine learning, is a simplified model of the biological neural network described above. It typically consists of an input layer, zero or more hidden layers, and an output layer.


Each layer of the neural network is made up of nodes (analogous to neurons in the brain). Nodes of one layer are connected to nodes in another layer by connection weights, which are typically just floating-point numbers (e.g. 0.23342341). These numbers represent the strength of the connection between two nodes.

The job of a node in a hidden layer is to:

  1. Receive input values from each node in a preceding layer
  2. Compute a weighted sum of those input values
  3. Send that weighted sum through some activation function (e.g. logistic sigmoid function or hyperbolic tangent function)
  4. Send the result of the computation in #3 to each node in the next layer of the neural network.

Thus, the output from the nodes in a given layer becomes the input for all nodes in the next layer.

The output layer of a network does steps 1-3 above. However, the result of the computation from step #3 is a class prediction instead of an input to another layer (since the output layer is the final layer).

Here is a diagram of the process I explained above:

Here is a diagram showing a single layer neural network:

b stands for the bias term. This is a constant. It is like the b in the equation for a line, y = mx + b. It enables the model to have flexibility because, without that bias term, you cannot as easily adapt the weighted sum of inputs (i.e. mx) to fit the data (i.e. in the example of a simple line, the line cannot move up and down the y-axis without that b term).

w in the diagram above stands for the weights, and x stands for the input values.
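
To make this concrete, here is a minimal sketch (not the full implementation presented later in this post) of how a single node combines its inputs, weights, and bias term:

import math

def node_output(weights, bias, inputs):
    """Compute the output of a single node: sigmoid(w · x + b)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # logistic sigmoid activation

# Example with made-up weights, bias, and input values
print(node_output([0.2, -0.5, 0.1], 0.4, [1.0, 0.3, 0.7]))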

Here is a similar diagram, but now it is a two-layer neural network instead of single layer.

And here is one last way to look at the same thing I explained above:

[Diagram: a single artificial neuron showing the inputs, weights, weighted sum, activation function f(S), and output]

Note that the yellow circles on the left represent the input values. w represents the weights. The sigma inside the box means that we calculate the weighted sum of the input values. We run that weighted sum through the activation function f(S) (e.g. the sigmoid function), and the result is the output, which is passed on to the nodes in the following layer.

Neural networks that contain many layers, for example more than 100, are called deep neural networks. Deep neural networks are the cornerstone of the rapidly growing field known as deep learning.

Training Phase

The objective during the training phase of a neural network is to determine all the connection weights. At the start of training, the weights of the network are initialized to small random values close to 0. After this step, training proceeds to the two main phases of the algorithm: forward propagation and backpropagation.

Forward Propagation

During the forward propagation phase of a neural network, we process one instance (i.e. one set of inputs) at a time. Hidden layers extract important features contained in the input data by computing a weighted sum of the inputs and running the result through the logistic sigmoid activation function. This output relays to nodes in the next hidden layer where the data is transformed yet again. This process continues until the data reaches the output layer.

The output of the output layer is a predicted class value, which in this project is represented as a one-hot encoded class prediction vector. Each index of the vector corresponds to one class. For example, if a 1 is in the 0 index of the vector (and a 0 is in all other indices of the vector), the class prediction is class 0. Because each output node produces a value between 0 and 1, the output vector can also be interpreted as the probability that an instance belongs to each class.
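
For example, here is a small sketch (with made-up numbers) of how a raw output vector could be turned into a class prediction by picking the index with the highest value:

# Hypothetical output vector from a network with three output nodes (one per class)
outputs = [0.91, 0.05, 0.08]

# The predicted class is the index of the largest output value
predicted_class = outputs.index(max(outputs))
print(predicted_class)  # prints 0, i.e. class 0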

Backpropagation

After the input signal produced by a training instance propagates through the network one layer at a time to the output layer, the backpropagation phase commences. An error value is calculated at the output layer. This error corresponds to the difference between the class predicted by the network and the actual (i.e. true) class of the training instance.

The prediction error is then propagated backward from the output layer toward the input layer. Blame for the error is assigned to each node in each layer, and the weights of each node are then updated accordingly using stochastic gradient descent as the weight optimization procedure, with the goal of making more accurate class predictions for the next instance that flows through the neural network.
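
As a rough sketch of the update rule for a single output node (the numbers below are made up for illustration; the full implementation later in this post follows the same logic):

# Simplified sketch of the weight update for a single output node (made-up values)
learning_rate = 0.1
inputs = [0.5, 0.2, 0.9]        # outputs coming from the previous layer
weights = [0.3, -0.1, 0.4]      # one weight per input
bias_weight = 0.05
output = 0.62                   # this node's output from forward propagation
target = 1.0                    # the true value for this node (from the one-hot class vector)

error = target - output                          # prediction error at this node
delta = error * output * (1.0 - output)          # error times the sigmoid derivative
for j in range(len(inputs)):
    weights[j] += learning_rate * delta * inputs[j]   # update each incoming weight
bias_weight += learning_rate * delta                  # update the bias weight
print(weights, bias_weight)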

Note that weights of the neural network are adjusted on a training instance by training instance basis. This online learning method is the preferred one for classification problems of large size (Ĭordanov & Jain, 2013).

The forward propagation and backpropagation phases continue for a certain number of epochs. A single epoch finishes when each training instance has been processed exactly once.

Testing Phase

Once the neural network has been trained, it can be used to make predictions on new, unseen test instances. Test instances flow through the network one-by-one, and the resulting output (which is a vector of class probabilities) determines the classification. 

Helpful Video

Below is a video by Andrew Ng, a professor at Stanford University, that explains neural networks and is helpful for getting your head around the math. The video gets pretty complicated in some spots (especially where he starts writing all sorts of mathematical notation and derivatives). My advice is to look up anything he explains that isn’t clear. Take it slow as you are learning about neural networks. There is no rush. This stuff isn’t easy to understand on your first encounter with it. Over time, the fog will begin to lift, and you will be able to understand how it all works.


Artificial Feedforward Neural Network Trained with Backpropagation Algorithm Design

The feedforward neural network with backpropagation was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithm. I used five-fold stratified cross-validation to evaluate the performance of the models.

Required Data Set Format for Feedforward Neural Network Trained with Backpropagation

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  • N: Actual Class

The program then adds two additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

Modification of Attribute Values

  • The actual class value was changed to “Benign” or “Malignant.”
  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.
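
As a sketch of how this imputation could be done with Pandas (the column name and values below are hypothetical, not the actual ones from the data file):

import random
import pandas as pd

# Hypothetical sketch: replace each "?" with a random integer between 1 and 10
df = pd.DataFrame({"Clump_Thickness": ["5", "?", "3", "?"]})  # made-up column and values
df["Clump_Thickness"] = df["Clump_Thickness"].apply(
    lambda v: random.randint(1, 10) if v == "?" else int(v))
print(df)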

Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There are no missing values in this data set.

Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.

Missing Data

There were no missing attribute values.

Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

Modification of Attribute Values

  • Attribute values were normalized to be in the range 0 to 1.
  • Class values were vectorized using one-hot encoding.
  • Attributes whose values were all the same were removed.

Missing Data

There are no missing values in this data set.

Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

  • 267 Democrats
  • 168 Republicans

Modification of Attribute Values

  • Changed all “y” values to 1 and all “n” values to 0.
  • Class values were vectorized using one-hot encoding.

Missing Data

Missing values were denoted as “?”. To fill in those missing values, I chose a random number, either 0 (“No”) or 1 (“Yes”).

Stochastic Gradient Descent

I used stochastic gradient descent for optimizing the weights.

In normal (batch) gradient descent, we need to calculate the partial derivative of the cost function with respect to each weight. Computing each of these partial derivatives requires summing a term over every training instance. This means that, if we have a lot of attributes and a large data set, gradient descent is slow. For this reason, stochastic gradient descent was chosen, since the weights are updated after each training instance (as opposed to after all training instances).
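
To make the contrast concrete, here is a toy sketch (made-up data, a single weight, and a squared-error cost) showing the difference between batch gradient descent and stochastic gradient descent:

# Contrast between batch gradient descent and stochastic gradient descent,
# sketched for a single weight on a toy squared-error problem (made-up data).
training_set = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, y) pairs, y is roughly 2x
learning_rate = 0.01

# Batch gradient descent: accumulate the gradient over ALL instances, then update once
w_batch = 0.0
for epoch in range(100):
    gradient = sum(-2 * x * (y - w_batch * x) for x, y in training_set)
    w_batch -= learning_rate * gradient

# Stochastic gradient descent: update the weight after EACH instance
w_sgd = 0.0
for epoch in range(100):
    for x, y in training_set:
        gradient = -2 * x * (y - w_sgd * x)
        w_sgd -= learning_rate * gradient

print(w_batch, w_sgd)   # both should end up close to 2.0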

Here is a good video that explains stochastic gradient descent.

Logistic (Sigmoid) Activation Function

The logistic (sigmoid) activation function was used for the nodes in the neural network.

Description of Any Tuning Process Applied

Learning Rate

Some tuning was performed in this project. The learning rate was set to 0.1, which is different from the 0.01 value that is often used for multi-layer feedforward neural networks (Montavon, 2012). Lower values resulted in much longer training times and did not result in large improvements in classification accuracy.

Epochs

The number of epochs for the main runs of the algorithm on the data sets was set to 1000. Other values were tested, but the number of epochs did not have a large impact on classification accuracy.

Number of Nodes per Hidden Layer

In order to tune the number of nodes per hidden layer, I used a constant learning rate and constant number of epochs. I then calculated the classification accuracy for each data set for a set number of nodes per hidden layer. I performed this process using networks with one hidden layer and networks with two hidden layers. The results of this tuning process are below.

[Chart: mean classification accuracy across all data sets vs. number of nodes per hidden layer, for networks with one and two hidden layers]

Note that the mean classification accuracy across all data sets when one hidden layer was used for the neural network reached a peak at eight nodes per hidden layer. This value of eight nodes per hidden layer was used for the actual runs on the data sets.

For two hidden layers, the peak mean classification accuracy was attained at five nodes per hidden layer. Thus, when the algorithm was run on the data sets for two hidden layers, I used five nodes per hidden layer for each data set to compare the classification accuracy across the data sets.


Artificial Feedforward Neural Network Trained with Backpropagation Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the full code for the neural network. This is all you need to run the program:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from random import shuffle # Import shuffle() method from the random module
from random import seed # Import seed() method from the random module
from random import random # Import random() method from the random module
from collections import Counter # Used for counting
from math import exp # Import exp() function from the math module

# File name: neural_network.py
# Author: Addison Sears-Collins
# Date created: 7/30/2019
# Python version: 3.7
# Description: An artificial feedforward neural network trained 
#   with backpropagation (also called multilayer perceptron)

# Required Data Set Format
# Columns (0 through N)
# 0: Attribute 0
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# 2 additional columns are added for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct?

ALGORITHM_NAME = "Feedforward Neural Network With Backpropagation"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def normalize(dataset):
    """
    Normalize the attribute values so that they are between 0 and 1, inclusive
    :param pandas_dataframe dataset: The original dataset as a Pandas dataframe
    :return: normalized_dataset
    :rtype: Pandas dataframe
    """
    # Generate a list of the column names 
    column_names = list(dataset) 

    # For every column except the actual class column
    for col in range(0, len(column_names) - 1):  
        temp = dataset[column_names[col]] # Go column by column
        minimum = temp.min() # Get the minimum of the column
        maximum = temp.max() # Get the maximum of the column

        # Normalize all values in the column so that they
        # are between 0 and 1.
        # x_norm = (x_i - min(x))/(max(x) - min(x))
        dataset[column_names[col]] = dataset[column_names[col]] - minimum
        dataset[column_names[col]] = dataset[column_names[col]] / (
            maximum - minimum)

    normalized_dataset = dataset

    return normalized_dataset

def get_five_stratified_folds(dataset):
    """
    Implementation of five-fold stratified cross-validation. Divide the data
    set into five random folds. Make sure that the proportion of each class 
    in each fold is roughly equal to its proportion in the entire data set.
    :param pandas_dataframe dataset: The original dataset as a Pandas dataframe
    :return: five_folds
    :rtype: list of folds where each fold is a list of instances(i.e. examples)
    """
    # Create five empty folds
    five_folds = list()
    fold0 = list()
    fold1 = list()
    fold2 = list()
    fold3 = list()
    fold4 = list()

    # Get the index of the class column (the last column)
    class_column = len(dataset[0]) - 1

    # Shuffle the data randomly
    shuffle(dataset)

    # Generate a list of the unique class values and their counts
    classes = list()  # Create an empty list named 'classes'

    # For each instance in the dataset, append the value of the class
    # to the end of the classes list
    for instance in dataset:
        classes.append(instance[class_column])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in dataset:

            # If we have a match
            if uniqueclass == instance[class_column]:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    shuffle(fold0)
    shuffle(fold1)
    shuffle(fold2)
    shuffle(fold3)
    shuffle(fold4)

    # Add the five stratified folds to the list
    five_folds.append(fold0)
    five_folds.append(fold1)
    five_folds.append(fold2)
    five_folds.append(fold3)
    five_folds.append(fold4)

    return five_folds

def initialize_neural_net(
    no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs):
    """
    Generates a new neural network that is ready to be trained.
    Network (list of layers): 0+ hidden layers, and output layer
    Input Layer (list of attribute values): A row from the training set 
    Hidden Layer (list of dictionaries): A set of nodes (i.e. neurons)
    Output Layer (list of dictionaries): A set of nodes, one node per class
    Node (dictionary): Contains a set of weights, one weight for each input 
      to the layer containing that node + an additional weight for the bias.
      Each node is represented as a dictionary that stores key-value pairs
      Each key corresponds to a property of that node (e.g. weights).
      Weights will be initialized to small random values between 0 and 1.
    :param int no_inputs: Number of inputs (i.e. attributes)
    :param int no_hidden_layers: Number of hidden layers (0 or more)
    :param int no_nodes_per_hidden_layer: Number of nodes per hidden layer
    :param int no_outputs: Number of outputs (one output node per class)
    :return: network
    :rtype:list (i.e. list of layers: hidden layers, output layer)
    """

    # Create an empty list
    network = list()

    # Create the hidden layers
    hidden_layer = list()
    hl_counter = 0

    # Create the output layer
    output_layer = list()

    # If this neural network contains hidden layers
    if no_hidden_layers > 0:

        # Build one hidden layer at a time
        for layer in range(no_hidden_layers):

            # Reset to an empty hidden layer
            hidden_layer = list()

            # If this is the first hidden layer
            if hl_counter == 0:

                # Build one node at a time
                for node in range(no_nodes_per_hidden_layer):

                    initial_weights = list()
                    
                    # Each node in the hidden layer has no_inputs + 1 weights, 
                    # initialized to a random number in the range [0.0, 1.0)
                    for i in range(no_inputs + 1):
                        initial_weights.append(random())

                    # Add the node to the first hidden layer
                    hidden_layer.append({'weights':initial_weights})

                # Finished building the first hidden layer
                hl_counter += 1

                # Add this first hidden layer to the front of the neural 
                # network
                network.append(hidden_layer)

            # If this is not the first hidden layer
            else:

                # Build one node at a time
                for node in range(no_nodes_per_hidden_layer):

                    initial_weights = list()
                    
                    # Each node in the hidden layer has 
                    # no_nodes_per_hidden_layer + 1 weights, initialized to 
                    # a random number in the range [0.0, 1.0)
                    for i in range(no_nodes_per_hidden_layer + 1):
                        initial_weights.append(random())

                    hidden_layer.append({'weights':initial_weights})

                # Add this newly built hidden layer to the neural network
                network.append(hidden_layer)

        # Build the output layer
        for outputnode in range(no_outputs):

            initial_weights = list()
                    
            # Each node in the output layer has no_nodes_per_hidden_layer 
            # + 1 weights, initialized to a random number in 
            # the range [0.0, 1.0)
            for i in range(no_nodes_per_hidden_layer + 1):
                initial_weights.append(random())

            # Add this output node to the output layer
            output_layer.append({'weights':initial_weights})

        # Add the output layer to the neural network
        network.append(output_layer)
    
    # A neural network has no hidden layers
    else:

        # Build the output layer
        for outputnode in range(no_outputs):
        
            initial_weights = list()
                    
            # Each node in the output layer has no_inputs + 1 weights, 
            # initialized to a random number in the range [0.0, 1.0)
            for i in range(no_inputs + 1):
                initial_weights.append(random())

            # Add this output node to the output layer
            output_layer.append({'weights':initial_weights})

        network.append(output_layer)

    # Finished building the initial neural network
    return network

def weighted_sum_of_inputs(weights, inputs):
    """
    Calculates the weighted sum of inputs plus the bias
    :param list weights: A list of weights. Each node has a list of weights.
    :param list inputs: A list of input values. These can be a single row
        of attribute values or the output from nodes from the previous layer
    :return: weighted_sum
    :rtype: float
    """
    # We assume that the last weight is the bias value
    # The bias value is a special weight that does not multiply with an input
    # value (or we could assume its corresponding input value is always 1)
    # The bias is similar to the intercept constant b in y = mx + b. It enables
    # a (e.g. sigmoid) curve to be shifted to create a better fit
    # to the data. Without the bias term b, the line always goes through the 
    # origin (0,0) and cannot adapt as well to the data.
    # In y = mx + b, we assume b * x_0 where x_0 = 1

    # Initiate the weighted sum with the bias term. Assume the last weight is
    # the bias term
    weighted_sum = weights[-1]

    for index in range(len(weights) - 1):
        weighted_sum += weights[index] * inputs[index]

    return weighted_sum

def sigmoid(weighted_sum_of_inputs_plus_bias):
    """
    Run the weighted sum of the inputs + bias through
    the sigmoid activation function.
    :param float weighted_sum_of_inputs_plus_bias: Node summation term
    :return: sigmoid(weighted_sum_of_inputs_plus_bias)
    """
    return 1.0 / (1.0 + exp(-weighted_sum_of_inputs_plus_bias))

def forward_propagate(network, instance):
    """
    Instances move forward through the neural network from one layer
    to the next layer. At each layer, the outputs are calculated for each 
    node. These outputs are the inputs for the nodes in the next layer.
    The last set of outputs is the output for the nodes in the output 
    layer.
    :param list network: List of layers: 0+ hidden layers, 1 output layer
    :param list instance (a single training/test instance from the data set)
    :return: outputs
    :rtype: list
    """
    inputs = instance

    # For each layer in the neural network
    for layer in network:

        # These will store the outputs for this layer
        new_inputs = list()

        # For each node in this layer
        for node in layer:

            # Calculate the weighted sum + bias term
            weighted_sum = weighted_sum_of_inputs(node['weights'], inputs)

            # Run the weighted sum through the activation function
            # and store the result in this node's dictionary.
            # Now the node's dictionary has two keys, weights and output.
            node['output'] = sigmoid(weighted_sum)

            # Used for debugging
            #print(node)

            # Add the output of the node to the new_inputs list
            new_inputs.append(node['output'])

        # Update the inputs list
        inputs = new_inputs

    # We have reached the output layer
    outputs = inputs

    return outputs

def sigmoid_derivative(output):
    """
    The derivative of the sigmoid activation function with respect 
    to the weighted summation term of the node.
    Formally (after a lot of calculus), this derivative is:
        derivative = sigmoid(weighted_sum_of_inputs_plus_bias) * 
        (1 - sigmoid(weighted_sum_of_inputs_plus_bias))
                   = node_output * (1 - node_output)
    This method is used during the backpropagation phase. 
    :param float output: Output of a node (generated during the forward
        propagation phase)
    :return: sigmoid_der
    :rtype: float
    """
    return output * (1.0 - output)

def back_propagate(network, actual):
    """
    In backpropagation, the error is computed between the predicted output by 
    the network and the actual output as determined by the data set. This error 
    propagates backwards from the output layer to the first hidden layer. The 
    weights in each layer are updated along the way in response to the error. 
    The goal is to reduce the prediction error for the next training instance 
    that forward propagates through the network.
    :param network list: The neural network
    :param actual list: A list of the actual output from the data set
    """
    # Iterate in reverse order (i.e. starts from the output layer)
    for i in reversed(range(len(network))):

        # Work one layer at a time
        layer = network[i]

        # Keep track of the errors for the nodes in this layer
        errors = list()

        # If this is a hidden layer
        if i != len(network) - 1:

            # For each node_j in this hidden layer
            for j in range(len(layer)):

                # Reset the error value
                error = 0.0

                # Calculate the weighted error. 
                # The error values come from the error (i.e. delta) calculated
                # at each node in the layer just to the "right" of this layer. 
                # This error is weighted by the weight connections between the 
                # node in this hidden layer and the nodes in the layer just 
                # to the "right" of this layer.
                for node in network[i + 1]:
                    error += (node['weights'][j] * node['delta'])

                # Add the weighted error for node_j to the
                # errors list
                errors.append(error)
        
        # If this is the output layer
        else:

            # For each node in the output layer
            for j in range(len(layer)):
                
                # Store this node (i.e. dictionary)
                node = layer[j]

                # Actual - Predicted = Error
                errors.append(actual[j] - node['output'])

        # Calculate the delta for each node_j in this layer
        for j in range(len(layer)):
            node = layer[j]

            # Add an item to the node's dictionary with the 
            # key as delta.
            node['delta'] = errors[j] * sigmoid_derivative(node['output'])

def update_weights(network, instance, learning_rate):
    """
    After the deltas (errors) have been calculated for each node in 
    each layer of the neural network, the weights can be updated.
    new_weight = old_weight + learning_rate * delta * input_value
    :param list network: List of layers: 0+ hidden layers, 1 output layer
    :param list instance: A single training/test instance from the data set
    :param float learning_rate: Controls step size in the stochastic gradient
        descent procedure.
    """
    # For each layer in the network
    for layer_index in range(len(network)):

        # Extract all the attribute values, excluding the class value
        inputs = instance[:-1]

        # If this is not the first hidden layer
        if layer_index != 0:

            # Go through each node in the previous layer and extract the
            # output from that node. The output from the previous layer
            # is the input to this layer.
            inputs = [node['output'] for node in network[layer_index - 1]]

        # For each node in this layer
        for node in network[layer_index]:

            # Go through each input value
            for j in range(len(inputs)):
                
                # Update the weights
                node['weights'][j] += learning_rate * node['delta'] * inputs[j]
          
            # Updating the bias weight 
            node['weights'][-1] += learning_rate * node['delta']

def train_neural_net(
    network, training_set, learning_rate, no_epochs, no_outputs):
    """
    Train a neural network that has already been initialized.
    Training is done using stochastic gradient descent where the weights
    are updated one training instance at a time rather than after the
    entire training set (as is the case with gradient descent).
    :param list network: The neural network, which is a list of layers
    :param list training_set: A list of training instances (i.e. examples)
    :param float learning_rate: Controls step size of gradient descent
    :param int no_epochs: How many passes we will make through training set
    :param int no_outputs: The number of output nodes (equal to # of classes)
    """
    # Go through the entire training set a fixed number of times (i.e. epochs)
    for epoch in range(no_epochs):
   
        # Update the weights one instance at a time
        for instance in training_set:

            # Forward propagate the training instance through the network
            # and produce the output, which is a list.
            outputs = forward_propagate(network, instance)

            # Vectorize the output using one hot encoding. 
            # Create a list called actual_output that is the same length 
            # as the number of outputs. Put a 1 in the place of the actual 
            # class.
            actual_output = [0 for i in range(no_outputs)]
            actual_output[int(instance[-1])] = 1
            
            back_propagate(network, actual_output)
            update_weights(network, instance, learning_rate)

def predict_class(network, instance):
    """
    Make a class prediction given a trained neural network and
    an instance from the test data set.
    :param list network: The neural network, which is a list of layers
    :param list instance: A single training/test instance from the data set
    :return class_prediction
    :rtype int
    """
    outputs = forward_propagate(network, instance)

    # Return the index that has the highest probability. This index
    # is the class value. Assume class values begin at 0 and go
    # upwards by 1 (i.e. 0, 1, 2, ...)
    class_prediction = outputs.index(max(outputs))
    
    return class_prediction

def calculate_accuracy(actual, predicted):
    """
    Calculates the accuracy percentages
    :param list actual: Actual class values
    :param list predicted: predicted class values
    :return: classification_accuracy
    :rtype: float (as a percentage)
    """
    number_of_correct_predictions = 0
    for index in range(len(actual)):
        if actual[index] == predicted[index]:
            number_of_correct_predictions += 1
    
    classification_accuracy = (
        number_of_correct_predictions / float(len(actual))) * 100.0
    return classification_accuracy

def get_test_set_predictions(
    training_set, test_set, learning_rate, no_epochs, 
    no_hidden_layers, no_nodes_per_hidden_layer):
    """
    This method is the workhorse. 
    A new neural network is initialized.
    The network is trained on the training set.
    The trained neural network is used to generate predictions on the
    test data set.
    :param list training_set
    :param list test_set
    :param float learning_rate
    :param int no_epochs
    :param int no_hidden_layers
    :param int no_nodes_per_hidden_layer
    :return network, class_predictions
    :rtype list, list
    """
    # Get the number of attribute values
    no_inputs = len(training_set[0]) - 1

    # Calculate the number of unique classes
    no_outputs = len(set([instance[-1] for instance in training_set]))
    
    # Build a new neural network
    network = initialize_neural_net(
        no_inputs, no_hidden_layers, no_nodes_per_hidden_layer, no_outputs)

    train_neural_net(
        network, training_set, learning_rate, no_epochs, no_outputs)
    
    # Store the class predictions for each test instance
    class_predictions = list()
    for instance in test_set:
        cl_prediction = predict_class(network, instance)
        class_predictions.append(cl_prediction)

    # Return the learned model as well as the class predictions
    return network, class_predictions

###############################################################

def main():
    """
    The main method of the program
    """
    LEARNING_RATE = 0.1 # Used for stochastic gradient descent procedure
    NO_EPOCHS = 1000 # Epoch is one complete pass through training data
    NO_HIDDEN_LAYERS = 1 # Number of hidden layers
    NO_NODES_PER_HIDDEN_LAYER = 8 # Number of nodes per hidden layer

    # Welcome message
    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Directory where data set is located
    #data_path = input("Enter the path to your input file: ") 
    data_path = "vote.txt"

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Show functioning of the program
    #trace_runs_file = input("Enter the name of your trace runs file: ") 
    trace_runs_file = "vote_nn_trace_runs.txt"

    ## Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    #test_stats_file = input("Enter the name of your test statistics file: ") 
    test_stats_file = "vote_nn_test_stats.txt"

    ## Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Generate a list of the column names 
    column_names = list(pd_data_set) 

    # The input layer in the neural network 
    # will have one node for each attribute value
    no_of_inputs = len(column_names) - 1

    # Make a list of the unique classes
    list_of_unique_classes = pd.unique(pd_data_set["Actual Class"])

    # The output layer in the neural network 
    # will have one node for each class value
    no_of_outputs = len(list_of_unique_classes)

    # Replace all the class values with numbers, starting from 0
    # in the Pandas dataframe.
    for cl in range(0, len(list_of_unique_classes)):
        pd_data_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)

    # Normalize the attribute values so that they are all between 0 
    # and 1, inclusive
    normalized_pd_data_set = normalize(pd_data_set)

    # Convert normalized Pandas dataframe into a list
    dataset_as_list = normalized_pd_data_set.values.tolist()

    # Set the seed for random number generator
    seed(1)

    # Get a list of 5 stratified folds because we are doing
    # five-fold stratified cross-validation
    fv_folds = get_five_stratified_folds(dataset_as_list)
    
    # Keep track of the scores for each of the five experiments
    scores = list()
    
    experiment_counter = 0
    for fold in fv_folds:
        
        print()
        print("Running Experiment " + str(experiment_counter) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(
            experiment_counter) + " ...\n")
        outfile_tr.write("\n")

        # Get all the folds and store them in the training set
        training_set = list(fv_folds)

        # Four folds make up the training set
        training_set.remove(fold)        

        # Combine all the folds so that all we have is a list
        # of training instances
        training_set = sum(training_set, [])
        
        # Initialize a test set
        test_set = list()
        
        # For each instance in this test fold
        for instance in fold:
            
            # Create a copy and store it
            copy_of_instance = list(instance)
            test_set.append(copy_of_instance)
        
        # Get the trained neural network and the predicted values
        # for each test instance
        neural_net, predicted_values = get_test_set_predictions(
            training_set, test_set,LEARNING_RATE,NO_EPOCHS,
            NO_HIDDEN_LAYERS,NO_NODES_PER_HIDDEN_LAYER)
        actual_values = [instance[-1] for instance in fold]
        accuracy = calculate_accuracy(actual_values, predicted_values)
        scores.append(accuracy)

        # Print the learned model
        print("Experiment " + str(
            experiment_counter) + " Trained Neural Network")
        print()
        for layer in neural_net:
            print(layer)
        print()
        outfile_tr.write("Experiment " + str(
            experiment_counter) + " Trained Neural Network")
        outfile_tr.write("\n")
        outfile_tr.write("\n")
        for layer in neural_net:
            outfile_tr.write(str(layer))
            outfile_tr.write("\n")
        outfile_tr.write("\n\n")

        # Print the classifications on the test instances
        print("Experiment " + str(
            experiment_counter) + " Classifications on Test Instances")
        print()
        outfile_tr.write("Experiment " + str(
            experiment_counter) + " Classifications on Test Instances")
        outfile_tr.write("\n\n")
        test_df = pd.DataFrame(test_set, columns=column_names)

        # Add 2 additional columns to the testing dataframe
        test_df = test_df.reindex(
        columns=[*test_df.columns.tolist(
        ), 'Predicted Class', 'Prediction Correct?'])

        # Add the predicted values to the "Predicted Class" column
        # Indicate if the prediction was correct or not.
        for pre_val_index in range(len(predicted_values)):
            test_df.loc[pre_val_index, "Predicted Class"] = predicted_values[
                pre_val_index]
            if test_df.loc[pre_val_index, "Actual Class"] == test_df.loc[
                pre_val_index, "Predicted Class"]:
                test_df.loc[pre_val_index, "Prediction Correct?"] = "Yes"
            else:
                test_df.loc[pre_val_index, "Prediction Correct?"] = "No"

        # Replace all the class values with the name of the class
        for cl in range(0, len(list_of_unique_classes)):
            test_df["Actual Class"].replace(
                cl, list_of_unique_classes[cl] ,inplace=True)
            test_df["Predicted Class"].replace(
                cl, list_of_unique_classes[cl] ,inplace=True)

        # Print out the test data frame
        print(test_df)   
        print()
        print()
        outfile_tr.write(str(test_df))   
        outfile_tr.write("\n\n")

        # Go to the next experiment
        experiment_counter += 1
    
    print("Experiments Completed.\n")
    outfile_tr.write("Experiments Completed.\n\n")

    # Print the test stats   
    print("------------------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("------------------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print("Learning Rate: " + str(LEARNING_RATE))
    print("Number of Epochs: " + str(NO_EPOCHS))
    print("Number of Hidden Layers: " + str(NO_HIDDEN_LAYERS))
    print("Number of Nodes Per Hidden Layer: " + str(
        NO_NODES_PER_HIDDEN_LAYER))
    print()
    print("Accuracy Statistics for All 5 Experiments: %s" % scores)
    print()
    print()
    print("Classification Accuracy: %.3f%%" % (
        sum(scores)/float(len(scores))))

    outfile_ts.write(
        "------------------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write(
        "------------------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path +"\n\n")
    outfile_ts.write("Learning Rate: " + str(LEARNING_RATE) + "\n")
    outfile_ts.write("Number of Epochs: " + str(NO_EPOCHS) + "\n")
    outfile_ts.write("Number of Hidden Layers: " + str(
        NO_HIDDEN_LAYERS) + "\n")
    outfile_ts.write("Number of Nodes Per Hidden Layer: " + str(
        NO_NODES_PER_HIDDEN_LAYER) + "\n")
    outfile_ts.write(
        "Accuracy Statistics for All 5 Experiments: %s" % str(scores))
    outfile_ts.write("\n\n")
    outfile_ts.write("Classification Accuracy: %.3f%%" % (
        sum(scores)/float(len(scores))))

    ## Close the files
    outfile_tr.close()
    outfile_ts.close()

main()


Artificial Feedforward Neural Network Trained with Backpropagation Output

Here are the trace runs:

Here are the results:

[Table: full classification accuracy results for all five data sets]

Here are the test statistics for each data set:

Analysis

Breast Cancer Data Set

The breast cancer data set results were in line with what I expected. The simpler model, the one with no hidden layers, ended up generating the highest classification accuracy. Classification accuracy was just short of 97%. In other words, the neural network that had no hidden layers successfully classified a patient as either malignant or benign with an almost 97% accuracy.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data (699 instances in this case) can lead to better learning and better classification accuracy on new, unseen instances.

Glass Data Set

The performance of the neural network on the glass data set was the worst out of all of the data sets. The ability of the network to correctly identify the type of glass given the attribute values never exceeded 70%.

I hypothesize that the poor performance on the glass data set is due to the high number of classes combined with a relatively small data set.

Iris Data Set

Classification accuracy was superb on the iris dataset, attaining a classification accuracy around 97%. The results of the iris dataset were surprising given that the more complicated neural network with two hidden layers and five nodes per hidden layer outperformed the simpler neural network that had no hidden layers. In this case, it appears that the iris dataset benefited from the increasing layers of abstraction provided by a higher number of layers.

Soybean Data Set (small)

Performance on the soybean data set was stellar and was the highest of all of the data sets but also had the largest standard deviation for the classification accuracy. Note that classification accuracy reached a peak of 100% using one hidden layer and eight nodes for the hidden layer. However, when I added an additional hidden layer, classification accuracy dropped to under 70%.

The reason for the high standard deviation of the classification accuracy is unclear, but I hypothesize it has to do with the relatively small number of training instances. Future work would need to be performed with the soybean large dataset available from the UCI Machine Learning Repository to see if these results remain consistent.

The results of the soybean runs suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

Vote Data Set

The vote data set did not yield the stellar performance of the soybean data set, but classification accuracy was still solid at ~96% using one hidden layer and eight nodes per hidden layer. These results are in line with what I expected because voting behavior should provide a powerful predictor of whether a representative is a Democrat or a Republican. I would have been surprised had I observed lower classification accuracies, since members of Congress tend to vote along party lines on most issues.

Summary and Conclusions

My hypothesis was incorrect. In some cases, simple neural networks with no hidden layers outperformed more complex neural networks with 1+ hidden layers. However, in other cases, more complex neural networks with multiple hidden layers outperformed the network with no hidden layers. The reason why some data is more amenable to networks with hidden layers instead of without hidden layers is unclear.

Other conclusions include the following:

  • Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
  • Large numbers of relevant attributes can help a neural network create more accurate classifications.
  • Neural networks are powerful and can achieve excellent results on both binary and multi-class classification problems.


References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Ĭordanov, I., & Jain, L. C. (2013). Innovations in Intelligent Machines –3 : Contemporary Achievements in Intelligent Systems. Berlin: Springer.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Montavon, G. O. (2012). Neural Networks: Tricks of the Trade. New York: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29


Logistic Regression Algorithm From Scratch

In this post, I will walk you through the Logistic Regression algorithm step-by-step.

  • We will develop the code for the algorithm from scratch using Python.
  • We will run the algorithm on real-world data sets from the UCI Machine Learning Repository.


What is Logistic Regression?

Logistic regression, contrary to the name, is a classification algorithm. Unlike linear regression, which outputs a continuous value (e.g. house price) for the prediction, Logistic Regression transforms the output into a probability value (i.e. a number between 0 and 1) using what is known as the logistic sigmoid function. This function is also known as the squashing function, since it maps any real-valued input (which can run from negative infinity to positive infinity) to a number between 0 and 1.

Here is what the graph of the sigmoid function looks like:

[Graph: the s-shaped logistic sigmoid curve]
Source: (Kelleher, Namee, & Arcy, 2015)

The function is called the sigmoid function because it is s-shaped. Here is what the sigmoid function looks like in mathematical notation:

h(z) = 1 / (1 + e^(-z))

where:

  • h(z) is the predicted probability of a given instance (i.e. example) being in the positive class…that is the class represented as 1 in a data set. For example, in an e-mail classification data set, this would be the probability that a given e-mail instance is spam (If h(z) = 0.73, for example, that would mean that the instance has a 73% chance of being spam).
  • 1 - h(z) is the probability of an instance being in the negative class, the class represented as 0 (e.g. not spam). h(z) is always a number between 0 and 1. Going back to the example in the bullet point above, this would mean that the instance has a 27% chance of not being spam.
  • z is the input (e.g. a weighted sum of the attributes of a given instance)
  • e is Euler’s number

z is commonly expressed as the dot product, w · x, where w is a 1-dimensional vector containing the weights for each attribute, and x is a vector containing the values of each attribute for a specific instance of the data set (i.e. example).

Often the dot product, w · x, is written as matrix multiplication. In that case, z = wᵀx, where ᵀ means the transpose of the one-dimensional weight vector w. The symbol Ɵ is often used in place of w.

So substituting w · x into the sigmoid equation, we get the following equation:

hw(x) = 1 / (1 + e^(-w · x))

where

  • w is a 1-dimensional vector containing the weights for each attribute.
  • The subscript w on hw means the attributes x are weighted by the weight vector w.
  • hw(x) is the probability (a value between 0 and 1) that an instance is a member of the positive class (i.e. probability an e-mail is spam).
  • x is a vector containing the values of each attribute for a specific instance of the data set.
  • w · x = w0x0 + w1x1 + w2x2 + … + wdxd (analogous to the equation of a line y = mx + b from grade school)
    • d is the number of attributes in the data set
    • x0 = 1 by convention, for all instances. This attribute has to be added by the programmer for all instances. It is known formally as the “bias” term.

As is the case for many machine learning algorithms, the starting point for Logistic Regression is to create a trained model. Once we have a trained model, we can use it to make predictions on new, unseen instances.

Training Phase

Creating a trained model entails determining the weight vector w. Once we have the weights, we can make predictions on new unseen examples. All we need are the values of the attributes of those examples (i.e. the x values), and we can weight the x values with the values of w to compute the probabilities h(x) for that example using the sigmoid function.

The rule for making predictions using the sigmoid function is as follows:

  • If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
  • If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)
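
Here is a minimal sketch of that decision rule (the weights and attribute values below are made up for illustration):

import math

def predict(weights, x):
    """Return 1 if the sigmoid of the weighted sum is >= 0.5, otherwise 0."""
    z = sum(w * xi for w, xi in zip(weights, x))   # w · x, where x[0] = 1 is the bias input
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0

# Example with made-up weights and one instance (first value is the bias input x0 = 1)
print(predict([-0.3, 0.8, 1.2], [1.0, 0.25, 0.4]))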

In linear regression, the weights are determined by minimizing the sum of the squared error, which serves as the cost function (where error = actual value – value predicted by the line). The cost function represents how wrong a prediction is. In linear regression, it represents how wrong a line of best fit is on a set of observed training instances. The lower the sum of the squared error, the better a line fits the training data, and, in theory, the better the line will predict new, unseen instances.

In Logistic Regression, the cost function is instead called cross-entropy. Without getting too detailed into the mathematics and notation of this particular equation, the cross-entropy equation is the one that we want to minimize. Minimizing this equation will yield a sigmoid curve that best fits the training data and enables us to make the best classification predictions possible for new, unseen test instances. A minimum of the cost function is attained when the gradient of the cost function is close to zero (i.e. the calculated weights stop changing). The formal term for the gradient of the cost function getting close to zero is convergence.
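
For reference, a minimal sketch of the binary cross-entropy cost over a set of training instances might look like this (an illustration, not the exact notation used in the textbook):

import math

def cross_entropy(actual_classes, predicted_probabilities):
    """Average cross-entropy cost over a set of training instances."""
    total = 0.0
    for y, h in zip(actual_classes, predicted_probabilities):
        # y is the actual class (0 or 1); h is the predicted probability of class 1
        total += -(y * math.log(h) + (1 - y) * math.log(1 - h))
    return total / len(actual_classes)

# Made-up example: two instances whose predictions are fairly close to the true classes
print(cross_entropy([1, 0], [0.9, 0.2]))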

In order to minimize the cost function, we need to find its gradient (i.e. derivative, slope, etc.) and determine the values for the weight vector w that make its derivative as close to 0 as possible. We cannot just set the gradient to 0 and then enter x-values and calculate the weights directly. Instead, we have to use a method called gradient descent in order to find the weights.

In the gradient descent algorithm for Logistic Regression, we:

  1. Start off with a weight vector initialized to small random values between -0.01 and 0.01. The size of the vector is equal to the number of attributes in the data set.
  2. Initialize a weight change vector to all zeros. The size of this vector is also equal to the number of attributes in the data set.
  3. For each training instance, one at a time:
    • a. Make a probability prediction by calculating the weighted sum of the attribute values and running that value through the sigmoid function.
    • b. Evaluate the gradient of the cost function by plugging in the actual (i.e. observed) class value and the predicted class value from bullet point 3a above.
    • c. Add the gradient value from 3b to the weight change vector.
  4. After we finish with the last training instance from step 3, multiply each value in the weight change vector by a learning rate (commonly 0.01).
  5. Add the vector from step 4 to the weight vector to update the weights.
  6. We then ask two questions:
    • a. Are the weights still changing (i.e. is the norm (i.e. magnitude) of the weight change vector greater than a certain threshold, like 0.001)?
    • b. Have we been through the data set fewer than 10,000 (or whatever we set the maximum number of iterations to) times?
    • c. If the answer is yes to both 6a and 6b, go back to step 2. Otherwise, we return the final weight vector, exiting the algorithm.

The gradient descent pseudocode for Logistic Regression is provided in Figure 10.6 of Introduction to Machine Learning by Ethem Alpaydin (Alpaydin, 2014).
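
Here is a rough Python sketch of steps 1 through 6 above (simplified, with made-up data; it is not the full implementation):

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(training_set, learning_rate=0.01,
                              max_iterations=10000, threshold=0.001):
    # Each instance is (attribute_values, class_label), where attribute_values[0] = 1
    # is the bias term added by the programmer.
    no_attributes = len(training_set[0][0])

    # Step 1: weight vector initialized to small random values
    weights = [random.uniform(-0.01, 0.01) for _ in range(no_attributes)]

    for iteration in range(max_iterations):                  # step 6b
        # Step 2: weight change vector initialized to all zeros
        weight_change = [0.0] * no_attributes

        # Step 3: go through the training instances one at a time
        for x, y in training_set:
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))   # step 3a
            for j in range(no_attributes):
                weight_change[j] += (y - prediction) * x[j]                  # steps 3b, 3c

        # Step 4: scale the accumulated changes by the learning rate
        weight_change = [learning_rate * wc for wc in weight_change]

        # Step 5: update the weights
        weights = [w + wc for w, wc in zip(weights, weight_change)]

        # Step 6a: stop if the weights have essentially stopped changing
        if math.sqrt(sum(wc * wc for wc in weight_change)) < threshold:
            break

    return weights

# Made-up example: x = [bias, attribute]; the class tends to be 1 when the attribute is large
data = [([1.0, 0.2], 0), ([1.0, 0.4], 0), ([1.0, 0.7], 1), ([1.0, 0.9], 1)]
print(train_logistic_regression(data))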

Testing Phase

Once training is completed, we have the weights and can use these weights, attribute values, and the sigmoid function to make predictions for the set of test instances.

Predictions for a given test instance are made using the aforementioned sigmoid function:

sigmoid-curve-2-1

Where the rule for making predictions using the sigmoid function is as follows:

  • If hw(x) ≥ 0.5, class = 1 (positive class, e.g. spam)
  • If hw(x) < 0.5, class = 0 (negative class, e.g. not spam)
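A minimal sketch of that decision rule, assuming weights is the learned weight vector and x is the attribute vector (including the bias term) of a single test instance:

import numpy as np

def predict(weights, x):
    # Probability of the positive class via the sigmoid function
    probability = 1.0 / (1.0 + np.exp(-np.dot(weights, x)))
    # Threshold at 0.5: class 1 (positive) or class 0 (negative)
    return 1 if probability >= 0.5 else 0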

Multi-class Logistic Regression

A Multi-class Logistic Regression problem is a twist on the binary Logistic Regression method presented above. Multi-class Logistic Regression can make predictions on both binary and multi-class classification problems.

In order to make predictions for multi-class datasets, we take the training set and create multiple separate binary classification problems (one for each class in the data set). For each of those training sets that we generated, we set the class values for one class to 1 (representing the positive class), and we set all other classes to 0 (i.e. the negative class).

In other words, if there are k classes in a data set, k separate training sets are generated. In each of those k separate training sets, one class is set to 1 and all other classes are set to 0.

In Multi-class Logistic Regression, the training phase entails creating k different weight vectors, one for each class, rather than just a single weight vector (which was the case in binary Logistic Regression). Each weight vector helps to predict the probability of an instance being a member of that class. Thus, in the testing phase, when there is a new, unseen instance, k different predictions need to be made, one per class. This method is called the one-vs-all strategy, sometimes called one-vs-rest.

The rule for making predictions for a given instance is as follows:

  • For each new test instance,
    • Make k separate probability predictions.
    • Pick the class that has the highest probability (i.e. the class that is the most enthusiastic about that instance being a member of its class)

Other multi-class Logistic Regression approaches include Softmax Regression and the one-vs-one strategy. The one-vs-all strategy was selected because it is the default strategy used in practice by many of the well-known machine learning libraries for Python (Rebala, Ravi, & Churiwala, 2019).
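As a rough sketch of one-vs-all (the full version appears in the code further below), where train_fn stands in for any binary training routine, such as the gradient descent procedure described earlier:

import numpy as np

def one_vs_all_sketch(x_train, y_train, x_test, unique_classes, train_fn):
    # Train one binary classifier per class: that class becomes 1, all others 0
    weights_per_class = []
    for cl in unique_classes:
        binary_labels = np.where(y_train == cl, 1, 0)
        weights_per_class.append(train_fn(x_train, binary_labels))
    # For a test instance, pick the class with the highest sigmoid probability
    scores = [1.0 / (1.0 + np.exp(-np.dot(w, x_test)))
              for w in weights_per_class]
    return unique_classes[int(np.argmax(scores))]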

Video

Here is an excellent video on logistic regression that explains the whole process I described above, step-by-step.

Return to Table of Contents

Logistic Regression Algorithm Design

The Logistic Regression algorithm was implemented from scratch. The Breast Cancer, Glass, Iris, Soybean (small), and Vote data sets were preprocessed to meet the input requirements of the algorithms. I used five-fold stratified cross-validation to evaluate the performance of the models.

Required Data Set Format for Logistic Regression

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  • ...
  • N: Actual Class

The program then adds two additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Breast Cancer Data Set

This breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign (Wolberg, 1992).

Modification of Attribute Values

The actual class value was changed to “Benign” or “Malignant.”

I transformed the attributes into binary numbers so that the algorithms could process the data properly and efficiently. If an attribute value was greater than 5, the value was changed to 1; otherwise it was changed to 0.
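As an illustrative sketch of that thresholding with Pandas (the column name here is used purely for illustration):

import pandas as pd

# df is the breast cancer dataframe; 'Clump Thickness' is one example
# attribute column name used only for illustration
df['Clump Thickness'] = (df['Clump Thickness'] > 5).astype(int)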

Missing Data

There were 16 missing attribute values, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in each missing value.

Glass Data Set

This glass data set contains 214 instances, 10 attributes, and 7 classes (German, 1987). The purpose of the data set is to identify the type of glass.

Modification of Attribute Values

If an attribute value was greater than the median of that attribute, the value was changed to 1; otherwise it was set to 0.
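A sketch of that median-based thresholding for a single attribute column (the column name 'RI' is used here only as an example):

import pandas as pd

# df is the glass dataframe; 'RI' (refractive index) is an example column
df['RI'] = (df['RI'] > df['RI'].median()).astype(int)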

Missing Data

There are no missing values in this data set.

Iris Data Set

This data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris plant (Fisher, 1988).

Modification of Attribute Values

If an attribute value was greater than the median of that attribute, the value was changed to 1; otherwise it was set to 0.

Missing Data

There were no missing attribute values.

Soybean Data Set (small)

This soybean (small) data set contains 47 instances, 35 attributes, and 4 classes (Michalski, 1980). The purpose of the data set is to determine the disease type.

Modification of Attribute Values

If an attribute value was greater than the median of that attribute, the value was changed to 1; otherwise it was set to 0.

Missing Data

There are no missing values in this data set.

Vote Data Set

This data set includes votes for each of the U.S. House of Representatives Congressmen (435 instances) on the 16 key votes identified by the Congressional Quarterly Almanac (Schlimmer, 1987). The purpose of the data set is to identify the representative as either a Democrat or Republican.

  • 267 Democrats
  • 168 Republicans

Modification of Attribute Values

I made the following modifications:

  • Changed all “y” to 1 and all “n” to 0.

Missing Data

Missing values were denoted as “?”. To fill in each missing value, I chose a random number, either 0 (“no”) or 1 (“yes”).
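A sketch of those two preprocessing steps for a single vote column, assuming the column holds the strings 'y', 'n', and '?':

import numpy as np
import pandas as pd

def preprocess_vote_column(col):
    # Map 'y' to 1 and 'n' to 0; '?' becomes NaN at this point
    col = col.map({'y': 1, 'n': 0})
    # Fill each missing value with a random 0 or 1
    return col.apply(lambda v: np.random.randint(0, 2) if pd.isna(v) else v)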

Description of Any Tuning Process Applied

Some tuning was performed in this project. The learning rate was set to 0.01 by convention. A higher learning rate (0.5) produced a large norm of the gradient (> 1), indicating that the weights were not settling.

The stopping criteria for gradient descent were as follows:

  • Maximum iterations = 10,000
  • Euclidean norm of weight change vector < 0.001

When I tried a maximum of 100 iterations, the Euclidean norm of the weight change vector returned high values (> 0.2), which indicated that I needed to set a higher maximum iterations value in order to have a better chance of convergence (i.e. the weights stop changing) based on the norm stopping criterion.

Return to Table of Contents

Logistic Regression Algorithm in Python, Coded From Scratch

Here are the preprocessed data sets:

Here is the driver code. This is where the main method is located:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
import five_fold_stratified_cv
import logistic_regression

# File name: logistic_regression_driver.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Driver of the logistic_regression.py program

# Required Data Set Format for Discrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# The logistic_regression.py program then adds 2 additional columns 
# for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

ALGORITHM_NAME = "Logistic Regression"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " Program!")
    print()

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 
    #data_path = "iris.txt"

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 
    #trace_runs_file = "iris_logistic_regression_trace_runs.txt"

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 
    #test_stats_file = "iris_logistic_regression_test_stats.txt"

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = five_fold_stratified_cv.get_five_folds(
        pd_data_set)

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run Logistic Regression the designated number of times as indicated by the 
    # number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = pd.concat([
               fold1, fold2, fold3, fold4], ignore_index=True, sort=False)                
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = pd.concat([
               fold0, fold2, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = pd.concat([
               fold0, fold1, fold3, fold4], ignore_index=True, sort=False) 
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold4], ignore_index=True, sort=False) 
        else:
            test_dataset = fold4
            training_dataset = pd.concat([
               fold0, fold1, fold2, fold3], ignore_index=True, sort=False) 
        
        accuracy, predictions, weights_for_each_class, no_of_instances_test = (
        logistic_regression.logistic_regression(training_dataset,test_dataset))

        # Print the trace runs of each experiment
        print("Accuracy:")
        print(str(accuracy * 100) + "%")
        print()
        print("Classifications:")
        print(predictions)
        print()
        print("Learned Model:")
        print(weights_for_each_class)
        print()
        print("Number of Test Instances:")
        print(str(no_of_instances_test))
        print() 

        outfile_tr.write("Accuracy:")
        outfile_tr.write(str(accuracy * 100) + "%\n\n")
        outfile_tr.write("Classifications:\n")
        outfile_tr.write(str(predictions) + "\n\n")
        outfile_tr.write("Learned Model:\n")
        outfile_tr.write(str(weights_for_each_class) + "\n\n")
        outfile_tr.write("Number of Test Instances:")
        outfile_tr.write(str(no_of_instances_test) + "\n\n")

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.\n")

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")
    outfile_ts.write("\n")
    accuracy = np.mean(accuracy_statistics)
    accuracy *= 100
    outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
   
    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    print()
    print("Classification Accuracy : " + str(accuracy) + "%")
    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Here is the code for logistic regression:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
 
# File name: logistic_regression.py
# Author: Addison Sears-Collins
# Date created: 7/19/2019
# Python version: 3.7
# Description: Multi-class logistic regression using one-vs-all. 
 
# Required Data Set Format for Discrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
 
# This program then adds 2 additional columns for the test set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

def sigmoid(z):
    """
    Parameters:
        z: A real number
    Returns: 
        1.0/(1 + np.exp(-z))
    """
    return 1.0/(1 + np.exp(-z))

def gradient_descent(training_set):
    """
    Gradient descent for logistic regression. Follows method presented
    in the textbook Introduction to Machine Learning 3rd Edition by 	
    Ethem Alpaydin (pg. 252)

    Parameters:
      training_set: The training instances as a Numpy array
    Returns:
      weights: The vector of weights, commonly called w or THETA
    """   

    no_of_columns_training_set = training_set.shape[1]
    no_of_rows_training_set = training_set.shape[0]

    # Extract the attributes from the training set.
    # x is still a 2d array
    x = training_set[:,:(no_of_columns_training_set - 1)]
    no_of_attributes = x.shape[1]

    # Extract the classes from the training set.
    # actual_class is a 1d array.
    actual_class = training_set[:,(no_of_columns_training_set - 1)]

    # Set a learning rate
    LEARNING_RATE = 0.01

    # Set the maximum number of iterations
    MAX_ITER = 10000

    # Set the iteration variable to 0
    iter = 0

    # Set a flag to determine if we have exceeded the maximum number of
    # iterations
    exceeded_max_iter = False

    # Set the tolerance. When the euclidean norm of the gradient vector 
    # (i.e. magnitude of the changes in the weights) gets below this value, 
    # stop iterating through the while loop
    GRAD_TOLERANCE = 0.001
    norm_of_gradient = None

    # Set a flag to determine if we have reached the minimum of the 
    # cost (i.e. error) function.
    converged = False

    # Create the weights vector with random floats between -0.01 and 0.01
    # The number of weights is equal to the number of attributes
    weights = np.random.uniform(-0.01,0.01,(no_of_attributes))
    changes_in_weights = None

    # Keep running the loop below until convergence on the minimum of the 
    # cost function or we exceed the max number of iterations
    while(not(converged) and not(exceeded_max_iter)):
        
        # Initialize a weight change vector that stores the changes in 
        # the weights at each iteration
        changes_in_weights = np.zeros(no_of_attributes)

        # For each training instance
        for inst in range(0, no_of_rows_training_set):

            # Calculate weighted sum of the attributes for
            # this instance
            output = np.dot(weights, x[inst,:])
                
            # Calculate the sigmoid of the weighted sum
            # This y is the probability that this instance belongs
            # to the positive class
            y =  sigmoid(output)

            # Calculate difference
            difference = (actual_class[inst] - y)

            # Multiply the difference by the attribute vector
            product = np.multiply(x[inst,:], difference)

            # For each attribute, update the weight changes 
            # i.e. the gradient vector
            changes_in_weights = np.add(changes_in_weights,product)
        
        # Calculate the step size
        step_size = np.multiply(changes_in_weights, LEARNING_RATE)

        # Update the weights vector
        weights = np.add(weights, step_size)

        # Test to see if we have converged on the minimum of the error
        # function
        norm_of_gradient = np.linalg.norm(changes_in_weights)

        if (norm_of_gradient < GRAD_TOLERANCE):
            converged = True

        # Update the number of iterations
        iter += 1

        # If we have exceeded the maximum number of iterations
        if (iter > MAX_ITER):
            exceeded_max_iter = True

    #For debugging purposes
    #print("Number of Iterations: " + str(iter - 1))
    #print("Norm of the gradient: " + str(norm_of_gradient))
    #print(changes_in_weights)
    #print()
    return weights


def logistic_regression(training_set, test_set):
    """
    Multi-class one-vs-all logistic regression
    Parameters:
      training_set: The training instances as a Pandas dataframe
      test_set: The test instances as a Pandas dataframe
    Returns:
      accuracy: Classification accuracy as a decimal
      predictions: Classifications of all the test instances as a 
        Pandas dataframe
      weights_for_each_class: The weight vectors for each class (one-vs-all)
      no_of_instances_test: The number of test instances
    """   

    # Remove the instance ID column
    training_set = training_set.drop(
        training_set.columns[[0]], axis=1)
    test_set = test_set.drop(
        test_set.columns[[0]], axis=1)

    # Make a list of the unique classes
    list_of_unique_classes = pd.unique(training_set["Actual Class"])

    # Replace all the class values with numbers, starting from 0
    # in both the test and training sets.
    for cl in range(0, len(list_of_unique_classes)):
        training_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)
        test_set["Actual Class"].replace(
            list_of_unique_classes[cl], cl ,inplace=True)

    # Insert a column of 1s in column 0 of both the training
    # and test sets. This is the bias and helps with gradient
    # descent. (i.e. X0 = 1 for all instances)
    training_set.insert(0, "Bias", 1)
    test_set.insert(0, "Bias", 1)

    # Convert dataframes to numpy arrays
    np_training_set = training_set.values
    np_test_set = test_set.values

    # Add 2 additional columns to the testing dataframe
    test_set = test_set.reindex(
        columns=[*test_set.columns.tolist(
        ), 'Predicted Class', 'Prediction Correct?'])

    ############################# Training Phase ##############################

    no_of_columns_training_set = np_training_set.shape[1]
    no_of_rows_training_set = np_training_set.shape[0]

    # Create and store a training set for each unique class
    # to create separate binary classification
    # problems
    trainingsets = []
    for cl in range(0, len(list_of_unique_classes)):

        # Create a copy of the training set
        temp = np.copy(np_training_set)

        # This class becomes the positive class 1
        # and all other classes become the negative class 0
        for row in range(0, no_of_rows_training_set):
            if (temp[row, (no_of_columns_training_set - 1)]) == cl:
                temp[row, (no_of_columns_training_set - 1)] = 1
            else:
                temp[row, (no_of_columns_training_set - 1)] = 0
        
        # Add the new training set to the trainingsets list
        trainingsets.append(temp)

    # Calculate and store the weights for the training set
    # of each class. Execute gradient descent on each training set
    # in order to calculate the weights
    weights_for_each_class = []

    for cl in range(0, len(list_of_unique_classes)):
        weights_for_this_class = gradient_descent(trainingsets[cl])
        weights_for_each_class.append(weights_for_this_class)

    # Used for debugging
    #print(weights_for_each_class[0])
    #print()
    #print(weights_for_each_class[1])
    #print()
    #print(weights_for_each_class[2])

    ########################### End of Training Phase #########################

    ############################# Testing Phase ###############################

    no_of_columns_test_set = np_test_set.shape[1]
    no_of_rows_test_set = np_test_set.shape[0]

    # Extract the attributes from the test set.
    # x is still a 2d array
    x = np_test_set[:,:(no_of_columns_test_set - 1)]
    no_of_attributes = x.shape[1]

    # Extract the classes from the test set.
    # actual_class is a 1d array.
    actual_class = np_test_set[:,(no_of_columns_test_set - 1)]

    # Go through each row (instance) of the test data
    for inst in range(0,  no_of_rows_test_set):

        # Create a scorecard that keeps track of the probabilities of this
        # instance being a part of each class
        scorecard = []

        # Calculate and store the probability for each class in the scorecard
        for cl in range(0, len(list_of_unique_classes)):

            # Calculate weighted sum of the attributes for
            # this instance
            output = np.dot(weights_for_each_class[cl], x[inst,:])

            # Calculate the sigmoid of the weighted sum
            # This is the probability that this instance belongs
            # to the positive class
            this_probability = sigmoid(output)

            scorecard.append(this_probability)

        most_likely_class = scorecard.index(max(scorecard))

        # Store the value of the most likely class in the "Predicted Class" 
        # column of the test_set data frame
        test_set.loc[inst, "Predicted Class"] = most_likely_class

        # Update the 'Prediction Correct?' column of the test_set data frame
        # 1 if correct, else 0
        if test_set.loc[inst, "Actual Class"] == test_set.loc[
            inst, "Predicted Class"]:
            test_set.loc[inst, "Prediction Correct?"] = 1
        else:
            test_set.loc[inst, "Prediction Correct?"] = 0

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (test_set["Prediction Correct?"].sum())/(len(test_set.index))

    # Store the revamped dataframe
    predictions = test_set

    # Replace all the class values with the name of the class
    for cl in range(0, len(list_of_unique_classes)):
        predictions["Actual Class"].replace(
            cl, list_of_unique_classes[cl] ,inplace=True)
        predictions["Predicted Class"].replace(
            cl, list_of_unique_classes[cl] ,inplace=True)

    # Replace 1 with Yes and 0 with No in the 'Prediction 
    # Correct?' column
    predictions['Prediction Correct?'] = predictions[
        'Prediction Correct?'].map({1: "Yes", 0: "No"})

    # Reformat the weights_for_each_class list of arrays
    weights_for_each_class = pd.DataFrame(np.row_stack(weights_for_each_class))
 
    # Rename the row names
    for cl in range(0, len(list_of_unique_classes)):
        row_name = str(list_of_unique_classes[cl] + " weights")        
        weights_for_each_class.rename(index={cl:row_name}, inplace=True)

    # Get a list of the names of the attributes
    training_set_names = list(training_set.columns.values)
    training_set_names.pop() # Remove 'Actual Class'

    # Rename the column names
    for col in range(0, len(training_set_names)):
        col_name = str(training_set_names[col])        
        weights_for_each_class.rename(columns={col:col_name}, inplace=True)

    # Record the number of test instances
    no_of_instances_test = len(test_set.index)

    # Return statement
    return accuracy, predictions, weights_for_each_class, no_of_instances_test

Here is the code for five-fold stratified cross-validation:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/17/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Discrete Class Values
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

def get_five_folds(instances):
    """
    Parameters:
        instances: A Pandas data frame containing the instances
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Shuffle the data set randomly
    instances = instances.sample(frac=1).reset_index(drop=True)

    # Record the number of columns in the data set
    no_of_columns = len(instances.columns) # number of columns

    # Record the number of rows in the data set
    no_of_rows = len(instances.index) # number of rows

    # Create five empty folds (i.e. Panda Dataframes: fold0 through fold4)
    fold0 = pd.DataFrame(columns=(instances.columns))
    fold1 = pd.DataFrame(columns=(instances.columns))
    fold2 = pd.DataFrame(columns=(instances.columns))
    fold3 = pd.DataFrame(columns=(instances.columns))
    fold4 = pd.DataFrame(columns=(instances.columns))

    # Record the column of the Actual Class
    actual_class_column = no_of_columns - 1

    # Generate an array containing the unique 
    # Actual Class values
    unique_class_list_df = instances.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    unique_class_list_np_size = unique_class_list_np.size

    # For each unique class in the unique Actual Class array
    for unique_class_list_np_idx in range(0, unique_class_list_np_size):

        # Initialize the counter to 0
        counter = 0

        # Go through each row of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for row in range(0, no_of_rows):

            # If the value of the unique class is equal to the actual
            # class in the original data set on this row
            if unique_class_list_np[unique_class_list_np_idx] == (
                instances.iloc[row,actual_class_column]):

                    # Allocate instance to fold0
                    if counter == 0:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold0.loc[len(fold0)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold1
                    elif counter == 1:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold1.loc[len(fold1)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold2
                    elif counter == 2:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold2.loc[len(fold2)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold3
                    elif counter == 3:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold3.loc[len(fold3)] = new_row
                                    
                        # Increase the counter by 1
                        counter += 1

                    # Allocate instance to fold4
                    else:

                        # Extract data for the new row
                        new_row = instances.iloc[row,:]

                        # Append that entire instance to fold
                        fold4.loc[len(fold4)] = new_row
                                    
                        # Reset counter to 0
                        counter = 0
        
    return fold0, fold1, fold2, fold3, fold4

Return to Table of Contents

Logistic Regression Output

Here are the trace runs:

Here are the results:

logistic-regression-results

Here are the test statistics for each data set:

Analysis

Breast Cancer Data Set

I hypothesize that performance was high on this data set because of the large number of instances (699 in total). This data set had the highest number of instances out of all the data sets.

These results also suggest that the amount of training data has a direct impact on performance. Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.

Glass Data Set

I hypothesize that the poor performance on the glass data set is due to the high number of classes combined with a relatively small data set.

Iris Data Set

Classification accuracy on the iris data set was satisfactory. This data set was small, and more training data would be needed to see if accuracy could be improved by giving the algorithm more data to learn the underlying relationship between the attributes and the flower types.

Soybean Data Set (small)

I hypothesize that the large number of attributes in the soybean data set (35) helped balance the relatively small number of training instances. These results suggest that large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.

Vote Data Set

The results show that classification algorithms like Logistic Regression can have outstanding performance on large data sets that are binary classification problems.

Summary and Conclusions

  • Higher amounts of data can lead to better learning and better classification accuracy on new, unseen instances.
  • Large numbers of relevant attributes can help a machine learning algorithm create more accurate classifications.
  • Classification algorithms like Logistic Regression can achieve excellent classification accuracy on binary classification problems, but performance on multi-class classification problems can yield mixed results.

Return to Table of Contents

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Fisher, R. (1988, July 01). Iris Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris

German, B. (1987, September 1). Glass Identification Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Kelleher, J. D., Mac Namee, B., & D'Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. Cambridge, Massachusetts: The MIT Press.

Michalski, R. (1980). Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 125-161.

Rebala, G., Ravi, A., & Churiwala, S. (2019). An Introduction to Machine Learning. Switzerland: Springer.

Schlimmer, J. (1987, 04 27). Congressional Voting Records Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wolberg, W. (1992, 07 15). Breast Cancer Wisconsin (Original) Data Set. Retrieved from Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

Ng, A. Y., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 841-848.

Return to Table of Contents

The Difference Between Generative and Discriminative Classifiers

In this post, I will explain the difference between generative classifiers and discriminative classifiers.

Let us suppose we have a class that we want to predict H (hypothesis) and a set of attributes E (evidence). The goal of classification is to create a model based on E and H that can predict the class H given a set of new, unseen attributes E. However, both classifier types, generative and discriminative, go about this classification process differently.

Generative Classifiers

no-junk-mail

Classification algorithms such as Naïve Bayes are known as generative classifiers. Generative classifiers take in training data and create probability estimates. Specifically, they estimate the following:

  • P(H): The probability of the hypothesis (e.g. spam or not spam). This value is the class prior probability (e.g. probability an e-mail is spam before taking any evidence into account).
  • P(E|H): The probability of the evidence given the hypothesis (e.g. probability an e-mail contains the phrase “Buy Now” given that an e-mail is spam). This value is known as the likelihood.

Once the probability estimates above have been computed, the model then uses Bayes Rule to make predictions, choosing the most likely class, based on which class maximizes the expression P(E|H) * P(H).
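As a toy sketch of that decision rule (the probability tables and helper names here are hypothetical placeholders, not part of any particular library), a generative classifier picks the class that maximizes P(E|H) * P(H):

def predict_generative(evidence, class_priors, likelihoods):
    # class_priors: {class: P(H)}, the prior probability of each class
    # likelihoods:  {class: function mapping evidence -> P(E|H)}
    scores = {h: likelihoods[h](evidence) * class_priors[h]
              for h in class_priors}
    # Choose the most likely class
    return max(scores, key=scores.get)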

Discriminative Classifiers

coffee-beans-separate

Rather than estimate likelihoods, discriminative classifiers like Logistic Regression estimate P(H|E) directly. A decision boundary is learned that draws a dividing line/plane between instances of one class and instances of another class. New, unseen instances are classified based on which side of the line/plane they fall. In this way, a direct mapping is generated from attributes E to class labels H.

An Example Using an Analogy

cat-rabbit

Here is an analogy that demonstrates the difference between generative and discriminative classifiers. Suppose we live in a world in which there are only two classes of animals, cats and rabbits. We want to build a robot that can automatically classify a new animal as either a cat or a rabbit. How would we train this robot using a discriminative algorithm like Logistic Regression?

With a discriminative algorithm, we would feed the model a set of training data containing instances of cats and instances of rabbits. The discriminative algorithm would try to find a straight line/plane (a decision boundary) that separates instances of cats from instances of rabbits. This line would be created by examining the differences in the attributes (e.g. herbivore vs. carnivore, long oval ears vs. small triangular ears, hopping vs. walking, etc.)

Once the training step is complete, the discriminative algorithm is ready to classify new, unseen animals. It looks at a new, unseen animal and checks which side of the decision boundary the animal falls on. The animal is then classified based on that side of the decision boundary.

In contrast, a generative learning algorithm like Naïve Bayes will take in training data and develop a model of what a cat and rabbit should look like. Once trained, a new, unseen animal is compared to the model of a cat and the model of a rabbit. It is then classified based on whether it looks more like the cat instances the model was trained on or the rabbit instances the model was trained on.

Past research has shown that discriminative classifiers like Logistic Regression generally perform better on classification tasks than generative classifiers like Naïve Bayes (Ng & Jordan, 2001).

As a final note, generative classifiers are called generative because we can use the probabilistic information of the data to generate more instances. In other words, given a class y, you can generate its respective attributes x.

References

Ng, A. Y., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 841-848.

How to Set Up Anaconda for Windows 10

In this post, I will show you how to set up Anaconda. Anaconda is a free, open-source distribution of Python (and R). The goal of Anaconda is to be a free “one-stop-shop” for all your Python data science and machine learning needs. It contains the key packages you need to build cool machine learning projects.

Requirements

Here are the requirements:

  • Set up Anaconda.
  • Set up Jupyter Notebook.
  • Install important libraries.
  • Learn basic Anaconda commands.

Directions

Install Anaconda

Go to the Anaconda website and click “Download.”

setting-up-anaconda-1

Choose the latest version of Python. In my case, that is Python 3.7. Click “Download” to download the Anaconda installer for your operating system type (i.e. Windows, macOS, or Linux). 

setting-up-anaconda-2

Follow the instructions to install the program:

setting-up-anaconda-3
setting-up-anaconda-4
setting-up-anaconda-5
setting-up-anaconda-6

Verify Anaconda is installed by searching for “Anaconda Navigator” on your computer.

Open Anaconda Navigator.

setting-up-anaconda-7

Follow the instructions here for creating a “Hello World” program. You can use Spyder, Jupyter Notebooks, or the Anaconda Prompt (terminal). If you use Jupyter Notebooks, you will need to open the notebooks in Firefox, Google Chrome or another web browser.

Check to make sure that you have IPython installed. Use the following command (in an Anaconda Prompt window) to check:

where ipython
setting-up-anaconda-8

Make sure that you have pip installed. Pip is the package management system for Python.

where pip
setting-up-anaconda-9

Make sure that you have conda installed. Conda is Anaconda’s package management system.

where conda
setting-up-anaconda-10

Install Some Libraries

Install OpenCV

To install OpenCV, use the following command in the Anaconda Prompt:

pip install opencv-contrib-python

Type the following command to get a list of packages and make sure opencv-contrib-python is installed.

conda list

Install dlib

Install cmake.

pip install cmake

Install the C++ library called dlib:

pip install dlib
setting-up-anaconda-11

Type the following command and take a look at the list to see if dlib is successfully installed:

conda list

Install Tesseract

Go to Tesseract at UB Mannheim.

Download the Tesseract for your system.

Set it up by following the prompts.

setting-up-anaconda-12

Once Tesseract OCR is downloaded, find it on your system.

Copy the path of the folder where it is installed. In my case, that is:

C:\Program Files\Tesseract-OCR

Search for “Environment Variables” on your computer.

setting-up-anaconda-13

Under “System Variables,” click “Path,” and then click Edit.

Add the path: C:\Program Files\Tesseract-OCR

setting-up-anaconda-14

Click OK a few times to close all windows.

Open up the Anaconda Prompt.

Type this command to see if tesseract is installed on your system.

where tesseract

Now, install the Python bindings for Tesseract using the following commands:

pip install tesseract
pip install pytesseract

Install TensorFlow

Type the following command in the Anaconda Prompt:

pip install tensorflow

Install TensorFlow hub using this command:

pip install tensorflow-hub

Now install tflearn.

pip install tflearn

Now, install the Keras neural network library.

pip install keras

Install the Rest of the Libraries

Type the following commands to install the rest of the libraries:

pip install pillow
pip install SimpleITK

Learn Basic Anaconda Commands

Changing Directories

If you ever want to change directories to the D drive instead of the C drive, open Anaconda Prompt on your computer and type the following commands, in order:

D:
cd D:\XXXX\XXXX\XXXX\XXXX

where D:\XXXX\XXXX\XXXX\XXXX is the file path.

Listing Directory Contents

Type the dir command to list the contents of a directory.

dir

Creating a Jupyter Notebook

Open Anaconda Prompt on your computer and type the following command:

jupyter notebook

Converting a Jupyter Notebook into Python Files

If you want to convert a Jupyter Notebook into Python files, change to that directory and type the following command:

jupyter nbconvert --to script *.ipynb

Congratulations if you made it this far! You have all the libraries installed that you need to do fundamental image processing and computer vision work in Python.

Artificial General Intelligence and Deep Learning’s Long-Run Viability

Learning and true intelligence are more than classification and regression (which is what deep learning is really good at). Deep learning is deficient in perhaps one of the most important types of intelligence: emotional intelligence. Deep learning cannot negotiate trade agreements between countries, resolve conflict between a couple going through a divorce, or craft legislation to reduce student debt.

The long-run viability of deep learning as well as progress towards machines that are more intelligent will depend both on improvements in computing power as well as our ability to understand and subsequently quantify general intelligence. 

Concerning the quantification of general intelligence, this really is the inherent goal of not just artificial intelligence, but of computer science as a whole. We want a machine that can automate tasks that humans do and to be as independent as possible while doing it. We want to teach a computer to think the way a human thinks, to understand the way a human understands, and to infer the way a human infers. At present, the state-of-the-art is a long way from achieving this goal. 

Deep learning is only as good as the data fed to it: garbage in, garbage out. Deep learning has a lot of limitations and is a long way from the kind of artificial general intelligence envisioned in those science fiction movies.

At this point, the state of the art in deep learning is identifying images, detecting fraud, processing audio, and serving advertisements. The only thing that we really know for sure about deep learning at this point is that it performs some tasks at a very high level, and we aren’t 100% sure why. We understand the probability and the algorithms behind the representations we are building, but we don’t understand what the network is actually learning the way we would understand a classification tree or a run of Naive Bayes.

Understanding what abstractions the network is learning is essential to move forward in the field. If at some point the underlying theory is proven to be invalid, then all the emerging techniques built on it will be rendered useless.

In any case, even if some theoretical basis is proven, it may or may not show a strong relationship that warrants continued development in these fields.

Deep Learning Needs Generalists Not Just Specialists

statue_davinci_leonardo_315924
The Renaissance Man, Leonardo da Vinci

My prediction is that the real breakthroughs on the route to artificial general intelligence will come from the T-shaped generalists, those professionals who have the breadth and depth to see the unseen interconnectedness. 

In order for deep learning and AI to be viable over the long run, researchers in this field will need to be able to think outside the box and be well-versed in the fundamentals of multiple disciplines, not just their own silo. Mathematicians need to learn a little bit about computer science, neuroscience, and psychology. Computer scientists need to learn a little bit about neuroscience and psychology. Neuroscientists need to cross over and learn a little bit about computer science as well as mathematics. 

While learning about other fields is no guarantee that a researcher will make a breakthrough, staying in a narrow area and not venturing out is a guarantee of not making a breakthrough. As they say, to a hammer, everything looks like a nail.

As a final note, I’m keeping my eye out on quantum computing, and, more importantly, quantum machine learning and quantum neural networks. Quantum machine learning entails executing machine learning algorithms on a quantum computer. It will be interesting to see how getting away from the 1s and 0s of a standard computer will impact the field, and if it will usher in a new era of artificial intelligence. Only time will tell.

The Limitations of Deep Learning

Despite the explosion in the popularity of deep learning, we are still a long way from the kind of artificial general intelligence that is the inherent long-term goal of computer science. Let us take a look at the limitations of deep learning. I’m going to pull from a paper written by Professor Gary Marcus of New York University on this topic. The paper is entitled: Deep Learning: A Critical Appraisal.

Deep Learning Methods Require Lots and Lots of Data

child_playing_with_horseshoe

Humans do not need hundreds, thousands, or even millions of examples of horseshoe crabs in order to learn what a horseshoe crab is. In fact, a 3 year old child could look at one example of a horseshoe crab, be told it is a horseshoe crab, and immediately identify other horseshoe crabs, even if those new horseshoe crabs do not look exactly like that first horseshoe crab example. 

Deep learning models, on the other hand, need truckloads of examples of horseshoe crabs in order to distinguish horseshoe crabs from, say, spiders or other similar-looking creatures.

Deep Learning Is Not Deep

shallow_water

The word “deep” in deep learning just refers to the structure of the mathematical model built during the training phase of the algorithm. The learning is not really deep, in the true sense of the word. A deep learning algorithm does not understand the why behind its output. 

Deep Learning Does Not Quickly Adapt When Things Change

blue-school-bus

Deep learning works well when test data looks like training data. However, it cannot easily cope with novel situations, such as when the domain changes from the one that it was trained on.

For example, if a deep neural network learns that school buses are always yellow, but, all of a sudden, school buses become blue, deep learning models will need to be retrained. A five year old would have no problem recognizing the vehicle as a blue school bus.

Deep Learning Does Not Do Well With Inference

semantics

Deep learning algorithms cannot easily tell the difference between phrases like: “John promised Mary to leave” and “John promised to leave Mary.”

Deep Learning Is Not Transparent

black-box

Part of the confusion and lack of clarity of deep learning is that it is difficult to understand for the general public. For example, there are many formulas and recalculations that are completed in order to create the neural network that forms the learning model for our deep learning program. 

These neural networks could also be biased depending on how the programmer sets up the formulas…for example, weighting some features more than others instead of considering the collaborative effect of all the features.

It also does not help that the layers of the neural network are called “hidden layers” which makes deep learning and its methodologies sound even more mysterious.  

Also, the mathematics….ahh the mathematics. Try explaining the whole sigmoid and gradient descent thing to a child. Better yet, imagine a doctor explaining to a patient suffering from cancer that the cancer diagnosis was determined based on the output of an artificial feedforward neural network trained with backpropagation. How would a patient respond to this? How would a health insurance company respond?

Neural networks are essentially black boxes that look like magic to the untrained eye, containing hundreds if not millions of parameters. In fields like medicine and finance, humans want to know exactly (and simply) why a particular decision was made. You cannot just say, “because my deep neural network said so.”

Deep Learning Cannot Take Full Advantage of the Five Basic Senses

five_senses
The Five Senses: Look, Sound, Smell, Taste, Feel

Humans have five basic senses: vision, hearing, smell, taste, touch. When I learn what a rooster is, I can see it, hear it, smell it, taste it, and touch it. Deep learning at this stage focuses mainly on the vision part and lacks the other four senses. Those other senses can be important.

For example, imagine a driverless car. A human could hear a train coming long before it sees the train. The driver would then stop. A driverless car on the other hand would need to see the train first before making the decision to stop.

Humans use all five senses to learn, and these five senses play an important role in getting a bigger picture of recognizing objects and understanding what makes a rooster a rooster, or a train, a train.

Deep Learning Is Not Yet Able to Know Something With 100% Certainty

trash_recycle_bin_recycle

I could look at a trash can inside my kitchen, know it is a trash can, and tell you with 100% certainty that it is a trash can. A deep neural network on the other hand, no matter how many parameters or data it has been trained on, works in the world of probabilities and numbers. It will never have 100% confidence that the trash can is a trash can. 

A trained deep neural network, for example, might be 96.3% sure it is a trash can but not 100% sure. There is always that minuscule probability it could be something else, like say a tree stump or a recycle bin.

I’m in front of my laptop now. It is 100% my laptop. A deep learning algorithm, on the other hand, might say it is 99.3% sure that the object currently in front of me is my laptop. It would need a human to validate that the object is, in fact, a laptop.

This limitation has to do with the way a neural network algorithm “learns.”

The reality is that a program does not look at a dog and intuitively know that it is a dog like a human does. Instead, deep learning is more like a statistical analysis of patterns observed in the sample data points. So a deep learning program might identify an image as a dog based only on the shapes in the image and the statistical patterns indicating that those shapes likely mean the image is a dog.

But the fact is that the deep learning program cannot be 100% sure that an image is a dog; for example if there is a weird picture of a fox or bear that looks a lot like a dog. Due to this uncertainty (the fact that the deep learning algorithm cannot be 100% sure), it is difficult for humans to trust deep learning especially when applied to critical, possibly life-threatening applications.  

For example, if a deep learning algorithm cannot always correctly identify that an object is an obstacle when processing the images from a driverless car, then it would be concerning for people to trust this program to drive for us.

Furthermore, there may be an inherent bias in how the deep learning algorithm identifies patterns. For example, if a programmer decided that a road is only definable by separating lines for lanes, then the algorithm may be biased to using lines for identifying roads. This would mean that the algorithm misses identifying roads that don’t have clear lane markings or even dirt roads. Such a bias means that a deep learning program could miss important identifying features/patterns. 

dirt_road_road_journey
Is this a road or a walking path to a beach? Humans would easily be able to recognize this as a road due to the tire tracks. Computers in driverless cars would need to have seen something similar to this in order to recognize it as a road.

In short, it is difficult for a deep learning algorithm to account for every possible example and every variation, which means that it would not be 100% correct.     

Significance of the Misunderstanding of Deep Learning

In my previous post on deep learning, I posted about some of the potential fallout that could occur due to the misunderstanding of what deep learning is really learning. I mentioned some possible significant consequences:

  • Wasted resources as venture capitalists throw money at anything that has to do with deep learning. 
  • Wasted resources as non-expert government agencies fund any research project that has the term “deep learning” in it.
  • A boat load of computer science graduates around the world that, all of a sudden, have found their “passion” in deep learning.
  • Disappointed companies as deep learning does not have the expected impact on their bottom line.
  • Another AI winter.

Let’s look at the last bullet point. Rodney Brooks, former MIT professor and co-founder of iRobot and Rethink Robotics, predicts that we will enter a new AI winter in 2020.

AI winter is a period of reduced funding and interest in artificial intelligence research that comes at the tail end of an AI hype cycle. Each AI hype cycle begins with some major breakthrough. Then for the next 5-10 years after that breakthrough, all sorts of papers get written on AI, companies that are doing “X + [insert some new hot, AI technology]” get funded, computer science students around the world change their career paths, and the media goes into a feeding frenzy about how the new breakthrough will change the world.

Executives at big companies around the world then shout out quotations like this: “AI is more profound than … electricity or fire” –  Sundar Pichai (CEO of Google). 

Experts in AI then chime in, “This time is different!” 

When you hear comments like this, ask yourself “what are they selling?”.

Deep learning is a tool like any other tool…like a wrench for a car mechanic or a serrated knife for a master chef. Deep learning can currently solve specific problems really well but others not so well. 

Machine learning, the field that encompasses deep learning, is about automating the process of finding relationships based on empirical data. It is a powerful tool that has an enormous amount of potential, but it is not a panacea and is still a long way away from replacing the human brain. 

I do agree with Mr. Pichai that when we have true artificial general intelligence that such a breakthrough would be as profound as electricity or fire. We are not there yet. Much more work needs to be done (and that is a great thing for us scientists and engineers). The future is bright.

How Can We Help Others Gain a Better Understanding of What These Models Are Learning?

The example that is often marketed to explain deep learning is a neural network that first takes the inputs and learns lines, curves, and other shapes. Each successive layer abstracts and combines the data more and more until we see letters and fully formed images. This method of explaining deep learning seems like a good example of how we might gain an understanding of what exactly these algorithms actually learn.

I think that one reason why people might not trust deep learning is that they don’t understand how it works, and even when they do, we cannot see the hidden layers in the neural network. When we look at the neural network behind deep learning, we have multiple layers, including hidden layers composed of neurons whose values we cannot directly inspect.

neural_network
Neural Network

With deep learning, we allow the program to learn and distinguish the key features from our sample data set. The problem is that we may not be easily able to understand the features that are distinguished and how they might be related to each other as defined by the hidden layers in the algorithm’s learning model.

I think that one strategy for improving, or at least understanding, what deep learning is doing is to unpack these abstracted layers in order to hand-tune the results into something that is more relevant.

What are Deep Learning Methods Really Learning?

It is not exactly clear what deep learning methods are really learning. Sure, they are highly effective and are learning something, but I’m still trying to get my head around exactly what they are learning.

Consider your run-of-the-mill deep neural network. “Learning” is nothing more than an optimization procedure. We are trying to produce an optimized mathematical formula that takes in a set of training examples and then can, as accurately as possible, map the inputs (i.e. attributes, features, etc.) of those examples to the outputs (i.e. class, target variable, etc.). We then use this formula to classify a new set of examples.

gradient_descent_png
Gradient Descent. Is this really all there is to learning?
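
To illustrate the "learning is just optimization" point, here is a minimal NumPy sketch that "learns" a mapping from inputs to outputs by running gradient descent on a toy linear model. The data, learning rate, and iteration count are made up purely for illustration; this is not the network built elsewhere in this post.

```python
import numpy as np

# Toy data: inputs x and targets y that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# "Learning" here is nothing more than minimizing mean squared error
w, b = 0.0, 0.0          # model parameters, arbitrary starting point
learning_rate = 0.01

for _ in range(5000):
    y_pred = w * x + b                 # forward pass: map inputs to outputs
    error = y_pred - y
    grad_w = 2 * np.mean(error * x)    # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)        # gradient of MSE with respect to b
    w -= learning_rate * grad_w        # step downhill
    b -= learning_rate * grad_b

print(f"learned mapping: y ≈ {w:.2f}x + {b:.2f}")
```

The "model" at the end is just a formula whose parameters were nudged downhill until the error was small. That is the whole of what "learning" means in this setting.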

At its core, deep learning is about input-process-output. It is not true learning in the sense that we humans learn. True learning entails understanding, and understanding is nonexistent in deep learning.

You can memorize a book, chapter-by-chapter, word-for-word; but that doesn’t mean you are learning. You still would not understand the plot. Similarly, in deep learning there is no understanding. Deep learning “memorizes” a mapping between inputs and outputs without any real understanding of the why behind those relationships. And in my view, the why is a huge part of learning. True learning (in the human sense of the word) without understanding is not learning. Perhaps then we should call deep learning something different? Deep optimization perhaps??? Guess that didn’t sound as marketable and sexy as deep learning.

If you look out in nature — the human brain or the brain of any living organism — nothing out there learns in a way that even remotely resembles backpropagation. Neural networks are about classification error, but real learning — the way humans learn — is deeper than that (pun intended). 

A neural network, for example, has a completely different concept of what it is to be a dog. That concept could involve where certain groups of pixels are placed and may have nothing to do with the actual structure of the animal. Where a human would see legs, arms, torso, etc., a deep learning algorithm may abstract a completely different set of things. This has led to a rise in adversarial attacks, where an attacker works out which input representation gives the highest probability of an image being classified a certain way and then inserts noise that causes the image to be misclassified.

Another point to consider is that neural networks generate something. It may be a relationship that we did not previously understand, but it may also just be nonsense that happens to work. The abstractions may result in a representational form that is, at its most basic, complete nonsense. If there is no real understanding of the abstractions the algorithm makes, then it is hard to confirm that it is actually doing anything meaningful.

neural_network
A Basic Neural Network

The significance of this lack of understanding of what deep learning is doing is yet to be seen, but here are just a few of the consequences if the hype goes unchecked:

  • Wasted resources as venture capitalists throw money at anything that has to do with deep learning. 
  • Wasted resources as non-expert government agencies fund any research project that has the term “deep learning” in it.
  • A boatload of computer science graduates around the world who, all of a sudden, have found their "passion" in deep learning.
  • Disappointed companies as deep learning does not have the expected impact on their bottom line.
  • Another AI winter.

Remember, there is the marketing element in there too. Using anthropomorphic terms like machine “learning” and deep “learning” is a much better sell to a general audience than machine mathematical optimization or deep optimization. Researchers gotta sell their ideas too!

Bottom Line: Artificial intelligence is not yet intelligent, and deep learning is not yet deep (yay! we still have work to do!)…nor is it learning in the true sense of the word. Deep learning certainly will continue to have an enormous impact on the world, but there needs to be more awareness and discussion of not just the enormous potential of deep learning but also its limitations so non-technical stakeholders can make more informed decisions.

Why Deep Learning Has Received So Much Attention Lately

Deep learning has been receiving an enormous amount of interest over the last seven years in the academic and business communities. Let’s take a look at the definition of deep learning, and then we will take a look at how this field has become so popular so quickly.

What is Deep Learning

Deep learning is a machine learning technique in which we teach a computer how to make predictions. Predictions are made by mapping a set of inputs to a set of outputs. 

Input Data —–> Deep Learning Algorithm (i.e. Process) —–> Output Data

For example, let’s say our input data into a deep learning algorithm is a set of photos. We want to be able to automatically tag each photo as either being dogs or elephants.

dogs_playing_on_beach
Dogs
elephant_flock_baby_elephant
Elephants

Input Data (lots of images containing dogs and elephants) —–> Deep Learning Algorithm —–> Classification of Each Image (i.e. Dogs or Elephants)

The “learning” part of the term deep learning entails looking at a bunch (hundreds, thousands, even millions+) of photos of elephants and dogs to develop a mathematical model of what both animals look like. Once the deep learning algorithm has been trained to recognize dogs and elephants, it can then be used to classify new photos as either dogs or elephants.

Most deep learning algorithms use neural network architectures as the structure of the underlying mathematical model. For this reason, deep learning methods are commonly called deep neural networks. 

Neural networks consist of layers and interconnected nodes. The first layer is the input layer. This layer might consist of, for example, thousands of matrices of pixels that represent photos of dogs or elephants. Each layer after the input layer transforms the data slightly so that the data is more abstract and complete than in the previous layer.

The layer after the input layer (i.e. second layer), for example, might contain nodes that recognize simple shapes like circles and edges (that at this point look nothing like a dog or elephant). The third layer contains nodes that recognize more complex shapes that look like a dog’s body parts (e.g. nose, eye, ear, etc.). Then the final layer, the output layer, outputs the classification of a photo as being either a dog or elephant.

neural-network
A basic multi-layer neural network architecture. The first layer on the left is the input layer. The two inner layers of nodes (neurons) are the hidden layers. The fourth layer on the right is the output layer that outputs the classification. In this case, the network expects four different classes in the data set (e.g. dogs, elephants, cows, horses).

Forbes Magazine has a good image showing the basic deep neural network structure I described above.

The “deep” part of deep learning refers to the number of hidden layers in the neural network. Standard neural networks have two or three (like in my example above) hidden layers, but deep neural networks can have 100+ layers. 

In short, a deep neural network is one that has several hidden layers, with the idea that these layers learn different levels of abstraction of the input attributes, thereby allowing the network to solve more complex problems, such as face recognition, object tracking, and so on.
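
To make the "deep just means more hidden layers" point concrete, here is a minimal NumPy sketch of a forward pass. The layer sizes are arbitrary and the weights are random placeholders (a trained network would have learned values); adding depth is nothing more than stacking additional matrix-multiply-plus-nonlinearity steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Arbitrary layer sizes: 784 inputs (e.g. flattened pixels), two hidden layers, 4 output classes
layer_sizes = [784, 64, 32, 4]

# Random weights and biases stand in for whatever values training would actually produce
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

x = rng.random(784)                                # one made-up input example

activation = x
for W, b in zip(weights[:-1], biases[:-1]):
    activation = relu(activation @ W + b)          # each hidden layer re-represents the previous one
scores = activation @ weights[-1] + biases[-1]     # raw output scores, one per class

print(scores.shape)                                # (4,)
```

A "deeper" network would simply have more entries in layer_sizes; nothing else about the forward pass changes.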

Origin of the Deep Learning Revolution: AlexNet

This post at Medium.com shows the graphs of the percentage of selected arXiv publications with either “deep”, “adversarial” or “convolutional” in the title. Note how the graph was virtually all 0s prior to 2010. It then took off like a rocket in 2012. What happened in 2012?

In 2010 and 2011, Fei-Fei Li held the ImageNet competition, an annual machine learning contest. Contest participants were given millions of images to use to train their models. These images were pre-labeled with one of ~1,000 different categories (e.g. leopard, cherry, mushroom, etc.). The objective of the contest was to correctly classify examples that were not in the training set. 

During those first two years of the competition in 2010 and 2011, the winning teams had a classification accuracy of 72%. None of the winners of those competitions used deep learning methods. Then in 2012, a team from the University of Toronto led by Alex Krizhevsky won the competition with a classification accuracy of 84%. The second place contestant had a classification accuracy of 74%. The team from the University of Toronto used deep learning methods combined with the computational power of graphical processing units (GPUs) to completely blow the competition out of the water.

The results were remarkable and gave birth to the deep learning era that continues to this day.

Why Deep Learning Has Received So Much Attention Lately

With traditional machine learning approaches, you would have to design a feature extraction algorithm, which generally involves a lot of heavy mathematics (complex design), may not be very efficient, and may not perform well (i.e. the accuracy may not be suitable for real-world applications). After doing all of that, you would also have to design a whole classification model to classify your input given the extracted features (i.e. attributes).

That’s a lot of work!

Enter Deep Learning…

  • With deep neural networks, we can perform feature extraction and classification in one shot, which means we only need to design one model.
  • The availability of large amounts of labeled data as well as GPUs, which can process data in parallel at high speeds, enables these models to be much faster than previous methods.
  • Using the back-propagation algorithm, a well-designed loss function, and millions of parameters, these deep networks are able to learn highly complex features (which had to traditionally be hand designed)…i.e. no more complex design!
  • Deep neural networks have become fairly easy to implement with high-level open source libraries such as Keras, Pytorch, and TensorFlow (see the short sketch after this list).
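
To give a sense of how little code a high-level library requires, here is a rough Keras sketch of a dog-vs.-elephant image classifier. The image dimensions, layer sizes, and the train_images/train_labels variables are hypothetical placeholders for illustration, not part of this post's from-scratch implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup: 64x64 RGB photos, two classes (dog vs. elephant)
model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability the photo is a dog
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=10)   # placeholders for a labeled training set
```

Feature extraction (the convolutional layers) and classification (the dense layers) live in one model, which is exactly the "one shot" point in the list above.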

Deep Learning has made many new applications practically feasible. We wouldn’t have been able to make good language translators pre-deep learning, because we simply had no technique at the time that would perform well enough or at a high enough speed for a real-world application.  Deep learning techniques have been applied to not just image recognition, but automatic speech recognition, natural language processing, drug discovery, customer relationship management, robotics, self-driving cars, and more.

Why Most Machine Learning Books Suck

Good teaching is work. Great teaching is a lot of work. Mediocre teaching is no work at all.

Addison Sears-Collins (2019)

Most of the “introductory” machine learning books (textbooks, especially) suck. While the subject matter in these books is supposed to be introductory, the way in which concepts are explained is not introductory at all. Simple concepts are often explained with such complex jargon that the underlying ideas get totally lost. If you want to see mental masturbation at its finest, just pick up any of the popular machine learning textbooks used in machine learning courses at colleges and universities around the world.

So many of these machine learning textbooks spend pages and pages explaining machine learning without actually doing machine learning with step-by-step fully worked examples (I presume because the publisher wants to limit the page count). It is kind of like learning how to play tennis by having someone explain to you how to play tennis vs. getting out there on the court and actually playing!

While reading these books, I always find myself asking these questions:

  • Why write an introductory book if the explanations are so convoluted that only an expert can understand them?
  • Why ask end-of-chapter questions and provide no answer key? 
  • Why provide no step-by-step practice with real-world examples?

Feedback and deliberate practice are a critical part of building confidence as you learn a new skill. Most of the "introduction" to machine learning books lose sight of this concept. Authors need to realize that you need to start with teeny tiny baby steps when you are writing for beginners.

Subscript and Superscript Soup

Take a look at this excerpt from a popular Introduction to Machine Learning textbook where the author attempts to explain logistic regression.

subscript-soup
Source: Introduction to Machine Learning

Wtf? 

This example hits on one of my biggest annoyances about machine learning books. The algorithms and mathematical equations contain so many variables, subscripts, and superscripts that you need to make a glossary on a separate sheet of paper just to keep track of it all because it would be utterly impossible to retain everything in your head while trying to decipher the points the author is trying to make. This kind of practice gets in the way of learning as well as your ability to see the big picture. 

Most of these "introduction" to machine learning books are written for people who already have a deep expertise in the subject. The subscript and superscript soup present in these books is as understandable to me as Ancient Greek.

ancient-greek-writing

Too Much Focus on Irrelevant Minutiae

Take a look at this page taken from another popular Machine Learning book where the author introduces the K-Means clustering algorithm.

What the heck is this guy talking about? I had to double check the preface to see if this book was written for beginners or experts (i.e. it was written for beginners). To the average undergrad who wants to go out and get a job applying machine learning to real-world problems, how important is it to know all these mathematical details upfront? 

Answer: Not very important. Rarely will you ever need to know these details, and if you do, just look them up. Most textbooks focus way too much on the minutiae (as if they are writing an academic research paper) and not enough on how to apply this knowledge to solve real-world problems.

At no point do these authors convey:

  • Why does knowing this matter? 
  • What value does knowing this have for a real-world application, or for my ability to get and keep a job in this field? Most beginners don't want to go on to get PhDs and do research. Some do, but most readers do not.

Teaching to the Tools Instead of the Problem

Machine learning textbooks need to teach to the problem, not to the tools and the underlying mathematics. They need to first tell you why something is important. The easiest way to do this is to explain a concept by starting with a real-world problem.

Most textbooks, instead, do it the other way around. They teach you the intricate details of the tool in isolation, teach you some proof, and then (in rare cases), apply the concept to a problem (a problem which almost never resembles the types of problems you would face in the real world at a job).

Authors need to show readers how machine learning fits into the big picture (i.e. the other tools a beginner might already be familiar with, like statistics or data analysis). At the end of the day, machine learning is just a tool. It is not a panacea. 

It is just a tool, just like the World Wide Web is a tool…just like a programming language like Python is a tool…or just like Microsoft Excel is a tool. 

Machine Learning is to an engineer what a wrench is to a mechanic. A mechanic does not need to know how to build a wrench in order to use it. He just needs to know how to use it to solve a problem. The vast majority of people will not need to know the underlying proofs and mathematics at such a detailed level in order to succeed in the workplace (unless you become a researcher, in that case, you will be trying to build a better wrench).

tool_wall_tool_storage_0
Machine Learning is a tool in the toolbox

Machine learning textbooks need to start with a real-world problem, then work through that problem, step-by-step, and explain the mathematics on an as-needed basis, not just throw it on the page for the sake of pretending to be mathematically rigorous.

If you decide to deep dive into research later on in your career, THEN pick up the mathematics on an as-needed basis. 

I don’t need to know how an internal combustion engine is built to be an awesome driver, and most companies will hire you based on your problem-solving ability, not on your ability to write proofs of off-the-shelf algorithms…so problem solving is what books need to focus on.

No Common Language

It does not help that there is no common language in machine learning. There are sometimes a dozen different ways to say the same thing (feature, attribute, predictor, x-variables, independent variables, etc.) or (target, class, response variable, y-variable, hypothesis). These multiple ways of saying the same thing make it totally confusing for a beginner, and rarely do authors point out the fact that there are a myriad of ways of saying the same thing.

Author Has No Training on How to Teach

elementary_school_aboard

Many of these authors are accustomed to writing for an expert audience (via academic research papers) and have not properly developed the skill of breaking down complex subject matter into easy-to-understand, digestible bite-sized pieces that would be understandable by a competent beginner. 

Elementary and high school teachers must get a degree or diploma in teaching. They have to endure hundreds of hours of instruction on how to teach and how to deal with different learning styles. Most machine learning textbook authors have not had this training. 

I find that the best teachers of a subject are often students who are one step removed from having learned the subject. It is fresh in their minds, and they still remember what it is like to be a beginner.

Using Words Like “Basically”, “Simple”, and “Easy”

A favorite of machine learning textbook authors is to assume that the reader already understands a "simple" concept that's necessary to understand the new topic. This is frustrating for someone new.

So many of these textbooks use terms like “simple,” “straightforward,” “easy,” “obvious,” and “basically,” which hurts a beginner’s confidence if he is not easily able to grasp the material. This is a huge pet peeve of mine. Words like these have no place in a book claiming to be an “introduction.” 

You’re trying to climb the machine learning mountain. The author is already at the top of the mountain. The author forgot how hard the struggle was to get to the top. The author forgot what it is like to even take the first step because it has been so long. It all seems easy after you’ve been there, done that. 

It would be like Michael Jordan teaching someone how to play basketball. Some things are so intuitive and second nature to him that, although he is the best basketball player that ever lived, he might not be the best one to teach it because he is so far removed from what it was like as a beginner, learning the basics, when even the smallest steps are difficult and not second nature. 

Similarly, you might know how to ride a bicycle really well. However, try to teach someone else how to ride a bicycle. You will notice that teaching someone how to ride a bike is different from being a really good bike rider. In order to teach someone how to ride a bike, you have to take what you know and break it down into teeny tiny parts. The ability to do this well is a skill in and of itself…one that takes time and practice to get good at.

bicycle-2

When you learn something, and especially if it is something you've spent decades immersed in, it becomes intuition to you. Most machine learning books are especially bad about this. They just assume that you understand exactly what is going on. They do not explain; it is just "if this, then that." They speak in generalities and hand-wavy language when the beginner needs step-by-step detail. They use machine learning and math to explain machine learning and math.

They do not realize that in order to teach someone something, it is best to tie the new concept to a concept that the beginner might already be familiar with.

Again, the best teachers in my experience are those that are one step removed from learning a subject as they have the knowledge fresh in their mind and remember clearly what it is like to be a beginner.

Show Me Don’t Tell Me

Authors need to stop introducing a new concept by explaining it. Instead, they need to use the following teaching aids:

  • Analogies: Connect the current knowledge to previous knowledge that most beginners would have.
  • Pictures and Diagrams: Draw a picture to help me visualize the concept.
  • Real-world Examples: Why does this concept matter? Show me a real-world example of this concept in practice, solving an actual problem. Tell me a story.
  • Layman’s Terms: Explain a term in basic plain language. Act as if you are explaining a concept to a five-year-old child.

Here is What Well Written Textbooks Look Like

Here is what well-written textbooks for beginners should look like. Consider these textbooks among the GOATs (i.e. greatest of all time) of textbooks:

The books above are an absolute joy to learn from. They will make you rethink the way introductory textbooks should be written. 

For you “Introduction” to machine learning authors out there, take note.

Further Reading

I encourage you to check out Jason Brownlee’s Post, “Why Machine Learning Does Not Have to Be So Hard”, where he calls out universities and traditional courses for teaching machine learning incorrectly. He also outlines a rough process for getting started.

Also, check out the first response in this post about how to learn difficult subjects.