Difference Between Recursion and Iteration

If you are learning algorithms or data structures, you might have come across the terms recursion and iteration. Knowing the difference between the two can be confusing, so I will explain both terms using a real-world example.

Eating My Morning Bowl of Oatmeal

eating-oatmeal

Everyday before I go to work, I take a shower, get dressed, and then head over to the kitchen to fix myself a big bowl of oatmeal. Breakfast is my favorite meal of the day and the most important. Many days I feel like skipping breakfast so that I can go on with my day. However, on those rare occasions when I decide to skip breakfast, I have usually regretted it. I spend the morning hungry and cannot fully concentrate on the tasks at hand.

The problem of eating my morning bowl of oatmeal can be implemented both iteratively and recursively. Iteration is the act of repeating a process. So, in the case of oatmeal, an iterative way to solve the problem of eating my bowl of oatmeal would be to do the following:

  1. Take one bite of the oatmeal
  2. Repeat Step 1 until the bowl is empty.

Recursion, on the other hand, is when a method calls itself. It involves breaking a problem into smaller versions of the same problem until you reach a trivial stopping case.

Let’s consider the problem of eating a bowl of oatmeal. I can solve the problem of eating the entire bowl of oatmeal recursively by breaking it into two subproblems as follows:

  • Take one bite of oatmeal
  • Eat the rest of the bowl of oatmeal

Each step of this recursive procedure gets smaller and smaller until all I am left with is one more spoonful of oatmeal. Once I finish that last spoonful, I am done, and I can go on about my day.

Here is how I would implement the iterative and recursive procedures for eating a bowl of oatmeal in Java pseudocode:

Iterative Implementation

public static main void main(String[] args) {

    while (!bowl.empty()) {
        bowl.takeOneBite(); 
        // Keep taking bites over and over 
        // again until the bowl of oatmeal 
        // is empty
    }
}

Recursive Implementation

public static void eatOatmeal(Bowl bowl) {

    if (bowl.empty()) {  
    // This is our stopping case. 
    // We stop taking bites once 
    // the bowl is empty

        return;  
    }
    else {
        bowl.takeOneBite(); 
        eatOatmeal(bowl); // Recursive call
    }
}

K-Nearest Neighbors Algorithm From Scratch | Machine Learning

In this post, I will walk you through the k-nearest neighbors algorithm (k-NN classification and k-NN regression), step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the image segmentation data set from the UCI Machine Learning Repository. The purpose of the data set is to classify the instances into seven different outdoor images (e.g. sky, foliage, cement, window, path, grass, etc.) based on pixel data. This data set gives a taste of computer vision mixed with machine learning.

Without further ado, let’s get started!

Table of Contents

What is the K-Nearest Neighbors Algorithm?

The K-Nearest Neighbors Algorithm is a supervised learning algorithm (the target variable we want to predict is not available) that is used for both classification and regression problems. Let’s see how it works for both types of problems.

K-Nearest Neighbors Classification

For the k-NN classification algorithm, we have N training instances. Let us suppose that these training instances fall into one of two classes, A and B.

1-knn

We want to classify a new instance C. We do this by identifying the k nearest neighbors of C and classifying C based on the class that had the most nearest neighbors to C (out of a total of k nearest neighbors).

For example, let k = 3. The three nearest neighbors of C are A, A, and B. Since A had the most nearest neighbors, C belongs to class A.

How do we select k?

A common method for choosing k is to take the square root of N, the number of instances and subtract 1 if that number is odd. However, in this project, we will tune the value for k and record the results to see which value for k achieved optimal performance as measured by either classification accuracy or mean squared error.

How do we determine the k nearest neighbors?

A common method for measuring distance is to use the Euclidean distance formula. That is, given points p =(p1,p2,…,pn) and q = (q1,q2,…,qn), the distance d between p and q is:

d(p,q) = √((q1 – p1)2 + (q2-p2)2 + …+(qn – pn)2)

This is a version of the Pythagorean Theorem.

In short, the algorithm for k-NN classification is as follows. For each test instance, we:

  1. Compute the distance to every training instance
  2. Select the k closest instances and their class
  3. Output the class that occurs most frequently among the k closest instances

Special notes on k-NN classification:

  • Select an odd value for k for two-class problems in order to avoid ties.
  • Complexity of the algorithm depends on finding the nearest neighbor for each training instance.
  • k-NN needs labeled data sets.

k-NN is known as a lazy learning algorithm because a model is not built until the time a test instance arrives. Thus, there is no prior training on the training data as would be the case with an eager learning algorithm.

Also k-NN is a non-parametric algorithm because it does not assume any particular form of the data set (unlike algorithms like linear regression, logistic regression, etc.). The only assumption is that Euclidean distance can be consistently calculated between points.

K-Nearest Neighbors Regression

k-NN regression is a minor modification of k-NN classification. In k-NN regression, we are trying to predict a real number instead of a class. Instead of classifying a test instance based on the most frequently occurring class among the k nearest neighbors, we take the average of the target variable of the k nearest neighbors.

For example, let’s suppose we wanted to predict the age of someone given their weight. We have the following data:

2-knn

The question is: how old is someone given they weigh 175 pounds?

Suppose k = 3. We find the three nearest neighbors to 175. They are:

  • (170,50)
  • (180,52)
  • (156,43)

We take the average of the target variable:

Average = (50 + 52 + 43) / 3 = 48.3

This is our answer.

In short, the algorithm for k-NN regression is as follows. For each test instance, we:

  1. Compute the distance to every training instance
  2. Select the k closest instances and the values of their target variables
  3. Output the mean of the values of the target variables

Return to Table of Contents

K-Nearest Neighbors Algorithm Design

The first thing we need to do is preprocess the image segmentation data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class

Modification of Attribute Values

The image segmentation data set contains 19 attributes, 210 instances, and 7 classes. This data set was created from a database of seven outdoor images. Below are some of the modifications I made.

Classes were made numerical:

  • BRICKFACE = 1.0
  • SKY = 2.0
  • FOLIAGE = 3.0
  • CEMENT = 4.0
  • WINDOW = 5.0
  • PATH = 6.0
  • GRASS = 7.0

Region-pixel-count was removed since it was 9 for all instances.

I also normalized the attributes so that they are all between 0 and 1.

Here is what the first few columns of your data set should look like after all that preprocessing:

dataset-image-segment

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

K-Nearest Neighbors Algorithm in Python, Coded From Scratch

Here is the full code for the k-nearest neighbors algorithm (Note that I used five-fold stratified cross-validation to produce the final classification accuracy statistics). You might want to copy and paste it into a document since it is pretty large and hard to see on a single web page:

import numpy as np # Import Numpy library

# File name: knn.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Implementation of the k-nearest neighbors algorithm (k-NN)
# from scratch

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Regression Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

class Knn:

    # Constructor
    #   Parameters:
    #     k: k value for k-NN
    #     problem_type: ('r' for regression or 'c' for classification)
    def __init__(self, k, problem_type):
        self.__k = k
        self.__problem_type = problem_type

    # Parameters:
    #   training_set: The folds used for training (2d numpy array)
    #   test_instance: The test instance we need to find the neighbors for 
    #                  (numpy array)
    # Returns: 
    #   The k most similar neighbors from the training set for a given 
    #   test instance. It will be a 2D numpy array where the first column 
    #   will hold the Actual class value of the training instance and the 
    #   second column will store the distance to the test instance...
    #   (actual class value, distance). 
    def get_neighbors(self, training_set, test_instance):

        # Record the number of training instances in the training set
        no_of_training_instances = np.size(training_set,0)

        # Record the number of columns in the training set
        no_of_training_columns = np.size(training_set,1)

        # Record the column index of the actual class of the training_set
        actual_class_column = None
        # If classification problem
        if self.__problem_type == "c":
            actual_class_column = no_of_training_columns - 1
        # If regression problem
        else:
            actual_class_column = no_of_training_columns - 2

        # Create an empty 2D array called actual_class_and_distance. This 
        # array should be the same length as the number of training instances. 
        # The first column will hold the Actual Class value of the training 
        # instance, and the second column will store the distance to the 
        # test instance...(actual class value, distance). 
        actual_class_and_distance = np.zeros((no_of_training_instances, 2))

        neighbors = None

        # For each row (training instance) in the training set
        for row in range(0, no_of_training_instances):

            # Record the actual class value in the 
            # actual_class_and_distance array (column 0)
            actual_class_and_distance[row,0] = training_set[
                row,actual_class_column]

            # Initialize a temporary training instance copied from this 
            # training instance
            temp_training_instance = np.copy(training_set[row,:])

            # Initialize a temporary test instance copied from this 
            # test_instance
            temp_test_instance = np.copy(test_instance)

            # If this is a classification problem
            if self.__problem_type == "c":
            
                # Update temporary training instance with Instance ID 
                # and the Actual Class pieces removed
                temp_training_instance = np.delete(temp_training_instance,[
                    0,actual_class_column])

                # Update temporary test instance with the Instance ID 
                # and the Actual Class pieces removed
                temp_test_instance = np.delete(temp_test_instance,[
                    0,actual_class_column])

            # If this is a regression problem
            else:

                # Update temporary training instance with Instance ID, Actual 
                # Class, and Stratification Bin pieces removed
                temp_training_instance = np.delete(temp_training_instance,[
                    0,actual_class_column,actual_class_column+1])

                # Update temporary test instance with the Instance ID, Actual
                # Class, and Stratification Bin pieces removed
                temp_test_instance = np.delete(temp_test_instance,[
                    0,actual_class_column,actual_class_column+1])

            # Calculate the euclidean distance from the temporary test
            # instance to the temporary training instance
            distance = np.linalg.norm(
                temp_test_instance - temp_training_instance)

            # Record the distance in the actual_class_and_distance 
            # array (column 1)
            actual_class_and_distance[row,1] = distance

        # Sort the actual_class_and_distance 2D array by the 
        # distance column (column 1) in ascending order.
        actual_class_and_distance = actual_class_and_distance[
                actual_class_and_distance[:,1].argsort()]

        k = self.__k

        # Extract the first k rows of the actual_class_and_distance array
        neighbors = actual_class_and_distance[:k,:]

        return neighbors

    # Generate a prediction based on the most frequent class or averaged 
    # target variable value of the neighbors
    # Parameters: 
    #   neighbors - 1D array (actual_class_value)
    # Returns:
    #   predicted_class_value
    def make_prediction(self, neighbors):
    
        prediction = None

        # If this is a classification problem
        if self.__problem_type == "c":
            
            #  Prediction is the most frequent value in column 0 of
            #  the neighbors array
            neighborsint = neighbors.astype(int)
            prediction = np.bincount(neighborsint).argmax()
            
        # If this is a regression problem
        else:

            # Prediction is the average of the neighbors array
            prediction = np.mean(neighbors)

        return prediction

    # Parameters: 
    #   actual_class_array
    #   predicted_class_array
    # Returns: 
    #   accuracy: Either classification accuracy (for 
    #   classification problems) or 
    #   mean squared error (for regression problems)
    def get_accuracy(self, actual_class_array, predicted_class_array):

        # Initialize accuracy variable
        accuracy = None

        # Initialize decision variable
        decision = None

        counter = None

        actual_class_array_size = actual_class_array.size

        # If this is a classification problem
        if self.__problem_type == "c":
            
            counter = 0

            # For each element in the actual class array
            for row in range(0,actual_class_array_size):

                # If actual class value is equal to value in predicted
                # class array
                if actual_class_array[row] == predicted_class_array[row]:
                    decision = "correct"
                    counter += 1
                else:
                    decision = "incorrect"

            classification_accuracy = counter / (actual_class_array_size)

            accuracy = classification_accuracy

        # If this is a regression problem
        else:

            # Initialize an empty squared error array. 
            # Needs to be same length as number of test instances
            squared_error_array = np.empty(actual_class_array_size)

            squared_error = None

            # For each element in the actual class array
            for row in range(0,actual_class_array_size):

                # Calculate the squared error
                squared_error = (abs((actual_class_array[
                        row] - predicted_class_array[row]))) 
                squared_error *= squared_error
                squared_error_array[row] = squared_error

            mean_squared_error = np.mean(squared_error_array)
            
            accuracy = mean_squared_error
        
        return accuracy

Here is the driver program that calls and executes the code above:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from five_fold_stratified_cv import FiveFoldStratCv
from knn import Knn

# File name: knn_driver.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Driver for the knn.py program 
# (K-Nearest Neighbors)

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Regression Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

################# INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-Nearest Neighbor"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
##############################################################################

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " program!")
    print()
    print("Running " + ALGORITHM_NAME + ". Please wait...")
    print()

    # k value that will be tuned
    k = eval(input("Enter a value for k: ") )

    # "c" for classification or "r" for regression
    problem = input("Press  \"c\" for classification or \"r\" for regression: ") 

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Convert dataframe into a Numpy array
    np_data_set = pd_data_set.to_numpy(copy=True)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Create an object of class FiveFoldStratCv
    fivefolds1 = FiveFoldStratCv(np_data_set,problem)

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Create an object of class Knn
    knn1 = Knn(k,problem) 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = fivefolds1.get_five_folds()

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy for classification problems or mean squared
    # error for regression problems)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run k-NN the designated number of times as indicated by the number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = np.concatenate((
                fold1, fold2, fold3, fold4), axis=0)
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = np.concatenate((
                fold0, fold2, fold3, fold4), axis=0)
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = np.concatenate((
                fold0, fold1, fold3, fold4), axis=0)
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold4), axis=0)
        else:
            test_dataset = fold4
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold3), axis=0)

        # Actual class column index of the test dataset
        actual_class_column = None           
     
        # If classification problem
        if problem == "c":
            actual_class_column = np.size(test_dataset,1) - 1
        # If regression problem
        else:
            actual_class_column = np.size(test_dataset,1) - 2

        # Create an array of the actual_class_values of the test instances
        actual_class_values = test_dataset[:,actual_class_column]

        no_of_test_instances = np.size(test_dataset,0)

        # Make an empty array called predicted_class_values which will 
        # store the predicted class values. It should be the same length 
        # as the number of test instances
        predicted_class_values = np.zeros(no_of_test_instances)

        # For each row in the test data set
        for row in range(0, no_of_test_instances):
  
            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the k nearest neighbors for each test instance
            this_instance = test_dataset[row,:]
            neighbor_array = knn1.get_neighbors(training_dataset,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)
  
            # Record the prediction in the predicted_class_values array
            predicted_class_values[row] = prediction

        # Calculate the classification accuracy of the predictions 
        # (k-NN classification) or the mean squared error (k-NN regression)
        accuracy = knn1.get_accuracy(actual_class_values,predicted_class_values)

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

        # If classification problem
        if problem == "c":
            temp_acc = accuracy * 100
            outfile_tr.write("Classification Accuracy: " + str(temp_acc) + "%\n")
            outfile_tr.write("\n")
            print("Classification Accuracy: " + str(temp_acc) + "%\n")
        # If regression problem
        else:
            outfile_tr.write("Mean Squared Error: " + str(accuracy) + "\n")
            outfile_tr.write("\n")
            print("Mean Squared Error: " + str(accuracy) + "\n")

    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.")
    print()

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")

    # Write the relevant stats to a file
    outfile_ts.write("\n")

    if problem == "c":
        outfile_ts.write("Problem Type : Classification" + "\n")
    else:
        outfile_ts.write("Problem Type : Regression" + "\n")

    outfile_ts.write("\n")
    outfile_ts.write("Value for k : " + str(k) + "\n")
    outfile_ts.write("\n")

    accuracy = np.mean(accuracy_statistics)

    if problem == "c":
        accuracy *= 100
        outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
    else: 
        outfile_ts.write("Mean Squared Error : " + str(accuracy) + "\n")

    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    # Write the relevant stats to a file
    print()

    if problem == "c":
        print("Problem Type : Classification")
    else:
        print("Problem Type : Regression")

    print()
    print("Value for k : " + str(k))
    print()
    if problem == "c":
        print("Classification Accuracy : " + str(accuracy) + "%")
    else: 
        print("Mean Squared Error : " + str(accuracy))

    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Return to Table of Contents

Output Statistics of the K-Nearest Neighbors Algorithm on the Image Segmentation Data Set

This section shows the results for the runs of the k-nearest neighbors algorithm on the image segmentation data set. I used a k-value of 4, but you can feel free to change this and see what accuracy value you get.

Test Statistics

knn_stats

Trace Runs

Here are the trace runs of the algorithm:

trace-runs-k-nearest-neighbors

Return to Table of Contents

Condensed K-Nearest Neighbors Algorithm

The time and space complexity of the regular k-nearest neighbors algorithm described above is directly proportional to the number of instances in the training set. This could potentially present a problem with massively large data sets. This is where the condensed k-nearest neighbors algorithm (ck-NN) comes in handy.

The idea behind ck-NN is to reduce the size of the training set by selecting the smallest subset of the training data set that results in no loss of classification accuracy. By systematically removing ineffective instances, we reduce the computation time as well as the storage requirement.

Here is how condensed k-NN works:

We start with an empty bin called STORE.

1. Place the first instance in STORE

2. Check whether the second instance can be classified correctly by 1-nearest neighbor using the instance in STORE as the reference.

3. Repeat step 2 for all other instances in the data set.

4. Repeat step 2 on the data set, doing continuous passes over the data set until either

OR

5. Use the instances in STORE as the input for the k-NN classification algorithm.

Here is the code in Python for ck-NN:

import numpy as np # Import Numpy library
from knn import Knn

# File name: cknn.py
# Author: Addison Sears-Collins
# Date created: 6/21/2019
# Python version: 3.7
# Description: Implementation of the condensed k-nearest neighbors 
# algorithm (k-NN) from scratch

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

class CondensedKnn:

    # Constructor
    #   Parameters:
    #     training_set: The training set that we need to prune
    #     problem_type: ('c' for classification)
    def __init__(self, training_set, problem_type="c"):
        self.__training_set = training_set
        self.__problem_type = problem_type

    # Parameters:
    #   None.
    # Returns: 
    #   A training set that has irrelevant instances removed 
    def get_trainingset(self):
        
        # Initialize a condensed training set. Copy it from the actual 
        # training set
        condensed_training_set = np.copy(self.__training_set)

        # Record the number of instances in the condensed training_set
        no_of_training_instances = np.size(condensed_training_set,0)

        # Record the number of columns in the condensed training set
        no_of_training_columns = np.size(condensed_training_set,1)

        # Print the initial number of instances in the condensed training set
        print("\nBefore condensing: " + str(
            no_of_training_instances) + " training instances\n")

        # Record the column index of the actual class of the training_set
        actual_class_column = no_of_training_columns - 1

        # Initialize an array named store with the first instance of the 
        # condensed training_set
        store = np.copy(condensed_training_set[0,:])

        # Create a 2D array
        new_row = np.copy(condensed_training_set[0,:])
        store = np.vstack([store,new_row])

        # For the second instance to the last instance in the condensed 
        # training_set
        row = 1
        while row < no_of_training_instances:
            
            # Record the actual class value
            actual_class_value = condensed_training_set[
                row,actual_class_column]

            # Create an object of class Knn
            knn1 = Knn(1,self.__problem_type) 

            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the nearest neighbor for each instance
            this_instance = condensed_training_set[row,:]
            neighbor_array = knn1.get_neighbors(store,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)

            # If actual class value is not equal to the prediction
            # Append that instance to the store array
            # Remove this instance from the condensed training_set
            if actual_class_value != prediction:
                new_row = np.copy(this_instance)
                store = np.vstack([store,new_row])

                condensed_training_set = np.delete(
                    condensed_training_set, row, 0)
                no_of_training_instances -= 1
            
            row += 1

        # Declare the stopping criteria. We stop when either one complete
        # pass is made through the condensed training set with no more 
        # transfers of instances to store or there are no more instances 
        # remaining in the condensed training data set
  
        no_more_transfers_to_store = False
        no_more_instances_left = None
  
        # Update the number of instances in the condensed training_set
        no_of_training_instances = np.size(condensed_training_set,0)
        
        if no_of_training_instances > 0:
            no_more_instances_left = False
        else:
            no_more_instances_left = True

        while not(no_more_transfers_to_store) and not(no_more_instances_left):
            # Reset the number of transfers_made to 0
            transfers_made = 0

            # For the second instance to the last instance in the condensed 
            # training_set
            row = 0
            while row < no_of_training_instances:

                # Record the actual class value
                actual_class_value = condensed_training_set[
                    row,actual_class_column]

                # Create an object of class Knn
                knn1 = Knn(1,self.__problem_type) 

                # Neighbor array is a 2D array containing the neighbors 
                # (actual class value, distance) 
                # Get the nearest neighbor for each instance
                this_instance = condensed_training_set[row,:]
                neighbor_array = knn1.get_neighbors(store,this_instance)

                # Extract the actual class values
                neighbors_arr = neighbor_array[:,0]
  
                # Predicted class value stored in the variable prediction
                prediction = knn1.make_prediction(neighbors_arr)

                # If actual class value is not equal to the prediction
                # Append that instance to the store array
                # Remove this instance from the condensed training_set
                if actual_class_value != prediction:
                    new_row = np.copy(this_instance)
                    store = np.vstack([store,new_row])

                    condensed_training_set = np.delete(
                        condensed_training_set, row, 0)
                    no_of_training_instances -= 1
                    transfers_made += 1
            
                row += 1
        
            # Update the number of instances in the condensed training_set
            no_of_training_instances = np.size(condensed_training_set,0)

            if no_of_training_instances > 0:
                no_more_instances_left = False
            else:
                no_more_instances_left = True

            if transfers_made > 0:
                no_more_transfers_to_store = False
            else: 
                no_more_transfers_to_store = True

        # Delete row 0 from the store
        store = np.delete(store,0,0)

        # Print the final number of instances in the store
        print("After condensing: " + str(
            np.size(store,0)) + " training instances\n")
        return store

Here is the code (Python) for the driver that executes the program above. It uses the regular k-NN code I presented earlier in this post as well as the five-fold stratified cross-validation code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library
from five_fold_stratified_cv import FiveFoldStratCv
from knn import Knn
from cknn import CondensedKnn

# File name: cknn_driver.py
# Author: Addison Sears-Collins
# Date created: 6/21/2019
# Python version: 3.7
# Description: Driver for the cknn.py program 
# (Condensed K-Nearest Neighbors)

# Required Data Set Format for Classification Problems:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

################# INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "Condensed K-Nearest Neighbor"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
##############################################################################

def main():

    print("Welcome to the " +  ALGORITHM_NAME + " program!")
    print()
    print("Running " + ALGORITHM_NAME + ". Please wait...")
    print()

    # k value that will be tuned
    k = eval(input("Enter a value for k: ") )

    # "c" for classification
    problem = input("Press  \"c\" for classification: ") 

    # Directory where data set is located
    data_path = input("Enter the path to your input file: ") 

    # Read the full text file and store records in a Pandas dataframe
    pd_data_set = pd.read_csv(data_path, sep=SEPARATOR)

    # Convert dataframe into a Numpy array
    np_data_set = pd_data_set.to_numpy(copy=True)

    # Show functioning of the program
    trace_runs_file = input("Enter the name of your trace runs file: ") 

    # Open a new file to save trace runs
    outfile_tr = open(trace_runs_file,"w") 

    # Testing statistics
    test_stats_file = input("Enter the name of your test statistics file: ") 

    # Open a test_stats_file 
    outfile_ts = open(test_stats_file,"w")

    # Create an object of class FiveFoldStratCv
    fivefolds1 = FiveFoldStratCv(np_data_set,problem)

    # The number of folds in the cross-validation
    NO_OF_FOLDS = 5 

    # Create an object of class Knn
    knn1 = Knn(k,problem) 

    # Generate the five stratified folds
    fold0, fold1, fold2, fold3, fold4 = fivefolds1.get_five_folds()

    training_dataset = None
    test_dataset = None

    # Create an empty array of length 5 to store the accuracy_statistics 
    # (classification accuracy for classification problems or mean squared
    # error for regression problems)
    accuracy_statistics = np.zeros(NO_OF_FOLDS)

    # Run k-NN the designated number of times as indicated by the number of folds
    for experiment in range(0, NO_OF_FOLDS):

        print()
        print("Running Experiment " + str(experiment + 1) + " ...")
        print()
        outfile_tr.write("Running Experiment " + str(experiment + 1) + " ...\n")
        outfile_tr.write("\n")

        # Each fold will have a chance to be the test data set
        if experiment == 0:
            test_dataset = fold0
            training_dataset = np.concatenate((
                fold1, fold2, fold3, fold4), axis=0)
        elif experiment == 1:
            test_dataset = fold1
            training_dataset = np.concatenate((
                fold0, fold2, fold3, fold4), axis=0)
        elif experiment == 2:
            test_dataset = fold2
            training_dataset = np.concatenate((
                fold0, fold1, fold3, fold4), axis=0)
        elif experiment == 3:
            test_dataset = fold3
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold4), axis=0)
        else:
            test_dataset = fold4
            training_dataset = np.concatenate((
                fold0, fold1, fold2, fold3), axis=0)
        
        # Create an object of class CondensedKnn
        cknn1 = CondensedKnn(training_dataset,problem)

        # Get a new, smaller training set with fewer instances
        training_dataset = cknn1.get_trainingset()

        # Actual class column index of the test dataset
        actual_class_column = np.size(test_dataset,1) - 1

        # Create an array of the actual_class_values of the test instances
        actual_class_values = test_dataset[:,actual_class_column]

        no_of_test_instances = np.size(test_dataset,0)

        # Make an empty array called predicted_class_values which will 
        # store the predicted class values. It should be the same length 
        # as the number of test instances
        predicted_class_values = np.zeros(no_of_test_instances)

        # For each row in the test data set
        for row in range(0, no_of_test_instances):
  
            # Neighbor array is a 2D array containing the neighbors 
            # (actual class value, distance) 
            # Get the k nearest neighbors for each test instance
            this_instance = test_dataset[row,:]
            neighbor_array = knn1.get_neighbors(training_dataset,this_instance)

            # Extract the actual class values
            neighbors_arr = neighbor_array[:,0]
  
            # Predicted class value stored in the variable prediction
            prediction = knn1.make_prediction(neighbors_arr)
  
            # Record the prediction in the predicted_class_values array
            predicted_class_values[row] = prediction

        # Calculate the classification accuracy of the predictions 
        # (k-NN classification) or the mean squared error (k-NN regression)
        accuracy = knn1.get_accuracy(actual_class_values,predicted_class_values)

        # Store the accuracy in the accuracy_statistics array
        accuracy_statistics[experiment] = accuracy

        # Stats
        temp_acc = accuracy * 100
        outfile_tr.write("Classification Accuracy: " + str(temp_acc) + "%\n")
        outfile_tr.write("\n")
        print("Classification Accuracy: " + str(temp_acc) + "%\n")


    outfile_tr.write("Experiments Completed.\n")
    print("Experiments Completed.")
    print()

    # Write to a file
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
    outfile_ts.write("----------------------------------------------------------\n")
    outfile_ts.write("Data Set : " + data_path + "\n")
    outfile_ts.write("\n")
    outfile_ts.write("Accuracy Statistics for All 5 Experiments:")
    outfile_ts.write(np.array2string(
        accuracy_statistics, precision=2, separator=',',
        suppress_small=True))
    outfile_ts.write("\n")

    # Write the relevant stats to a file
    outfile_ts.write("\n")

    outfile_ts.write("Problem Type : Classification" + "\n")

    outfile_ts.write("\n")
    outfile_ts.write("Value for k : " + str(k) + "\n")
    outfile_ts.write("\n")

    accuracy = np.mean(accuracy_statistics)

    accuracy *= 100
    outfile_ts.write("Classification Accuracy : " + str(accuracy) + "%\n")
   
    # Print to the console
    print()
    print("----------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics")
    print("----------------------------------------------------------")
    print("Data Set : " + data_path)
    print()
    print()
    print("Accuracy Statistics for All 5 Experiments:")
    print(accuracy_statistics)
    print()
    # Write the relevant stats to a file
    print()

    print("Problem Type : Classification")
    print()
    print("Value for k : " + str(k))
    print()
    print("Classification Accuracy : " + str(accuracy) + "%")
    print()

    # Close the files
    outfile_tr.close()
    outfile_ts.close()

main()

Here are the results for those runs as well as a comparison to regular k-NN:

cknn-stats
knn-cknn-comparison

Classification accuracy was greater than 80% for both the k-NN and ck-NN runs. The classification accuracy for the k-NN runs was about 4 percentage points greater than ck-NN. However, the variability of the classification accuracy for the five experiments in the ck-NN runs was greater than for the k-NN runs.

In a real-world setting for classification, I would select the k-NN algorithm over ck-NN because its performance is more consistent. In addition, while ck-NN might result in improved run times for the actual k-NN runs, there is still the prior overhead associated with having to prune the training set of irrelevant instances.

Return to Table of Contents

Five-Fold Stratified Cross-Validation

In a lot of my machine learning projects, you might have noticed that I use a technique called five-fold stratified cross-validation. The purpose of cross-validation is to test the effectiveness of a machine learning algorithm. You don’t just want to spend all that time building your model only to find out that it only works well on the training data but works terribly on data it has never seen before. Cross-validation is the process that helps combat that risk.

The basic idea is that you shuffle your data randomly and then divide it into five equally-sized subsets. Ideally, you would like to have the same number of instances to be in each class in each of the five partitions.

For example, if you have a data set of 100 points where 1/3 of the data is in one class and 2/3 of the data is in another class, you will create five partitions of 20 instances each. Then for each of these partitions, 1/3 of the instances (~6 or 7 points) should be from the one class, and the remaining points should be in the other class. This is the “stratified” part of five-fold stratified cross-validation.

You then run five experiments where you train on four of the partitions (80% of the data) and test on the remaining partition (20% of the data). You rotate through the partitions so that each one serves as a test set exactly once. Then you average the performance on the five experiments when you report the results.

Let’s take a look at this process visually:

five-fold-stratified-cross-validation
  • Divide the data set into five random groups of equal size. Make sure that the proportion of each class in each group is roughly equal to its proportion in the entire data set.
  • Use four groups for training and one group for testing.
  • Calculate the classification accuracy.
  • Repeat the procedure four more times, rotating the test set so that each group serves as a test set exactly once.
  • Compute the average classification accuracy (or mean squared error) for the five runs.

Note that, if the target variable is continuous instead of a class, we use mean squared error instead of classification accuracy as the loss function.

Implementation 1 (Using Numpy in Python)

Here is the code for five-fold stratified cross-validation using the Numpy Python library.

import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Disrete Class Values
# Classification:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Continuous Class Values:
# Regression:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

class FiveFoldStratCv:

    # Constructor
    # Parameters: 
    #   np_dataset: The entire original data set as a numpy array
    #   problem_type: 'r' for regression and 'c' for classification 
    def __init__(self, np_dataset, problem_type):
        self.__np_dataset = np_dataset
        self.__problem_type = problem_type
  
    # Returns: 
    #   fold0, fold1, fold2, fold3, fold4
    #   Five folds whose class frequency distributions are 
    #   each representative of the entire original data set (i.e. Five-Fold 
    #   Stratified Cross Validation)
    def get_five_folds(self):

        # Record the number of columns in the data set
        no_of_columns = np.size(self.__np_dataset,1)

        # Record the number of rows in the data set
        no_of_rows = np.size(self.__np_dataset,0)

        # Create five empty folds (i.e. numpy arrays: fold0 through fold4)
        fold0 = np.arange(1)
        fold1 = np.arange(1)
        fold2 = np.arange(1)
        fold3 = np.arange(1)
        fold4 = np.arange(1)

        # Shuffle the data set randomly
        np.random.shuffle(self.__np_dataset)

        # Generate folds for classification problem
        if self.__problem_type == "c":

            # Record the column of the Actual Class
            actual_class_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Actual Class values
            unique_class_arr = np.unique(self.__np_dataset[
                :,actual_class_column])

            unique_class_arr_size = unique_class_arr.size

            # For each unique class in the unique Actual Class array
            for unique_class_arr_idx in range(0, unique_class_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique class. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique class is equal to the actual
                    # class in the original data set on this row
                    if unique_class_arr[unique_class_arr_idx] == (
                        self.__np_dataset[row,actual_class_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0

        # If this is a regression problem
        else:
            # Record the column of the Stratification Bin
            strat_bin_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Stratification Bin values
            unique_bin_arr = np.unique(self.__np_dataset[
                :,strat_bin_column])

            unique_bin_arr_size = unique_bin_arr.size

            # For each unique bin in the unique Stratification Bin array
            for unique_bin_arr_idx in range(0, unique_bin_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique bin. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique bin is equal to the actual
                    # bin in the original data set on this row
                    if unique_bin_arr[unique_bin_arr_idx] == (
                        self.__np_dataset[row,strat_bin_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0
        
        return fold0, fold1, fold2, fold3, fold4

Implementation 2 (Using the Counter subclass)

Here is another implementation using the Counter subclass.

import random
from collections import Counter # Used for counting

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/7/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

def get_five_folds(instances):
    """
    Parameters:
        instances: A list of dictionaries where each dictionary is an instance. 
            Each dictionary contains attribute:value pairs 
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Create five empty folds
    fold0 = []
    fold1 = []
    fold2 = []
    fold3 = []
    fold4 = []

    # Shuffle the data randomly
    random.shuffle(instances)

    # Generate a list of the unique class values and their counts
    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in instances:

            # If we have a match
            if uniqueclass == instance['Class']:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    random.shuffle(fold0)
    random.shuffle(fold1)
    random.shuffle(fold2)
    random.shuffle(fold3)
    random.shuffle(fold4)

    # Return the folds
    return  fold0, fold1, fold2, fold3, fold4

Difference Between Maximum Likelihood and Maximum a Posteriori Estimation

In this blog post, we will take a look at the difference between maximum likelihood estimation and maximum a posteriori estimation. Both are methods that attempt to estimate unknown values for parameters.

What is Maximum Likelihood Estimation?

I will explain the term maximum likelihood estimation by using a real-world example. 

Suppose you went out to the store and bought a six-sided die. You have no prior information about the type of die that you have. It could be a standard die which has a uniform prior probability distribution (i.e. each face has 1/6 chance of being face-up on any given roll), or you could have a weighted die where some numbers are more likely to appear than others. How would you calculate the probability of getting each number for a given roll of the die?

One way to calculate the probabilities is to roll the die 10,000 times and keep track of how many times (i.e. frequency in the table below) each face appeared. Your table might look something like this:

1-roll-dice-table

What you see above is the basis of maximum likelihood estimation. In maximum likelihood estimation, you estimate the parameters by maximizing the “likelihood function.” Stated more simply, you choose the value of the parameters that were most likely to have generated the data that was observed in the table above. 

So, what is the problem with maximum likelihood estimation? Well, as you saw above, we did not incorporate any prior knowledge (i.e. prior belief information) into our calculation. What if we knew that this was a weighted die where the probabilities of each face were as follows:

2-probabilities-dice

Maximum likelihood estimation does not take this into account. It assumes a uniform prior probability distribution. That is, it assumes that the probabilities of each face of the die are equal before we begin rolling the die 500 times. The problem with this assumption is that you would need to have a huge dataset (i.e. have to roll the die many times, perhaps until your arm gets too tired to continue rolling!) to get values for the appropriate probabilities that are close to what they should be.

More formally, maximum likelihood estimation looks like this:

Parameter = argmax P(Observed Data | Parameter)

where P = probability and | = given.

So what do we do? Maximum a posteriori (MAP) estimation to the rescue!

What is Maximum a Posteriori (MAP) Estimation?

Maximum a Posteriori (MAP) Estimation is similar to Maximum Likelihood Estimation (MLE) with a couple major differences. MAP takes prior probability information into account. 

For example, if we knew that the die in our example above was a weighted die with the probabilities noted in the table in the previous section, MAP estimation factors this information into the parameter estimation.

More formally, MAP estimation looks like this:

Parameter = argmax P(Observed Data | Parameter)P(Parameter)

where P = probability and | = given.

Contrast that equation above to the MLE equation from the previous section. Note that the likelihood is now weighted by the prior probability. This distinction is important. 

Imagine in MLE if we did 500 rolls, and one of the faces appeared only one time (i.e. 1/500). MLE would generate incorrect parameter values. Having that extra nonzero prior probability factor makes sure that the model does not overfit to the observed data in the way that MLE does.

Why Is Maximum a Posteriori (MAP) Estimation the Ideal Estimation Method for Smaller Datasets?

It is ideal because it takes into account prior knowledge of an event. MLE does not and is prone to overfitting. For this reason, MAP is considered a regularization of MLE. Adding the prior probability information reduces the overdependence on the observed data for parameter estimation.

Latent Variables and Latent Variable Models

In this post, I will explore the idea of latent variable models, where a latent variable is a “hidden” variable (i.e., one we cannot observe directly but that plays a part in the overall model).

What is a Latent Variable?

A latent variable is a variable that is not directly observed but is inferred from other variables we can measure directly.

A common example of a latent variable is quality of life. You cannot measure quality of life directly.

Real-world Example of Latent Variables

A few years ago, my wife and I decided to move cities to a place which would be able to offer us a better quality of life. I created a spreadsheet in Excel containing all the factors that influence the quality of life in a particular location. They are:

  1. Days per year of sunshine
  2. Unemployment rate
  3. Crime rate
  4. Average household income
  5. State income tax rate
  6. Cost of living
  7. Population density
  8. Median home price
  9. Average annual temperature
  10. Number of professional sports teams
  11. Average number of days of snow per year
  12. Population
  13. Average inches of rain per year
  14. Number of universities
  15. Distance to an international airport

If I wanted to feed this into a machine learning algorithm, I could then develop a mathematical model for quality of life.

The advantage of using the latent variable model with quality of life is that I can use a fewer number of features, which reduces the dimensionality of what would be an enormous data set. All those factors I mentioned are correlated in some way to the latent variable quality of life. So instead of using 15 columns of data, I can use just 1. The data then becomes not only easier to understand but useful if I want to use a quality of life score as a feature for any type of classification or regression problem because it captures the underlying phenomenon we are interested in.

Exploring the Use of Latent Variables Across Different Model Types

test_testing_exam_sat

One particularly useful framework for exploring the use of latent variables across different model types is the idea of path analysis. In path analysis, one develops a model that shows the relationships between variables that we can readily measure and variables that we cannot readily measure. These variables that we cannot readily measure are called latent variables.

By convention, variables that we can readily measure will be placed inside rectangles or boxes. Variables that we cannot readily measure, the latent variables, are typically located inside ovals.

Let’s look at an example now of how we can model latent variables using the latent variable of childhood intelligence.

1-sat-score

As you can see in this diagram we have four different boxes.

We have the oval at the bottom which is representative of the intelligence of a child. This is the latent variable. This is the variable that we want to predict.

Then up above we have two boxes. One box includes the father’s SAT score. The other box is the mother’s SAT score. And then finally we have another box which represents the average SAT scores of students in the state that the child lives in.

Note that in this particular path model, we draw arrows showing relationships between two variables. The arrow between the father and the mother is indicative of correlation. We are inferring in this particular model a correlation between the SAT score of the father and the SAT score of the mother, and we are also inferring a relationship between both of their SAT scores and the intelligence of a child.

We are also showing a correlation relationship between the average SAT score of the students in the state and the child’s intelligence.

But notice that there are no arrows that are drawn from either the father or the mother to the average SAT scores in the state. The lack of arrows is indicative of a lack of correlation between those two variables.

In short, I believe at the outset of a machine learning problem, we can draw different path models to begin to model the relationships and to develop an appropriate framework for selecting a model to tackle a particular machine learning problem.

Getting back to the idea of path modeling that I described earlier, we can use this diagram to not only explore the relationship between observed variables and latent variables, but we could use it as a springboard to actually develop quantifiable relationships between observed and latent variables…in other words, the numerical coefficients that we can use as inputs into a particular mathematical model.

Let’s take a look at an example, perhaps the simplest example we can think of…the equation for a line of best fit:

y = mx + b. 

You might recall the equation y = mx + b from grade school. In this particular equation, y is the dependent variable, b is the intercept, m is the slope, and x is the independent variable.

In a linear regression model we typically see an equation of the form:

Y = (BETA) * X + EPSILON 

where EPSILON is the “residual”, the difference between actual and predicted values for a given value for x.

What would a path diagram look like for this example?

Let’s draw it.

2-xyz-epsilon

I think this is a very cool way of being able to very easily take a very large data set and begin to think of ways of how we can condense this data set into a data set which would have lower storage requirements. By adding weightings to the arrows and by following the actual arrows that show the correlations between two variables whether observed or unobserved, we can then use this as an outline to start pruning irrelevant features from the data set. The next step would be able to use the subset of features to develop a hopefully useful predictive model with one of the common machine learning models available.

Factor Analysis

One framework for generating latent variable models is the idea of factor analysis.

In factor analysis, we want to determine if the variations in the measured, observed variables are primarily the result of variations in the underlying latent variables. We can then use this information, the interdependencies between the latent variables and the observed variables, to eliminate some of the features in our data set and therefore reduce the storage requirements for a particular machine learning problem. We do this by estimating the covariance (the extent to which two variables move in tandem with each other) between the observed variables and the unobserved latent variables.

Why is this promising? It is promising because we can create models of things we want to measure but are not easily able to measure. And in doing that, we can tackle the representation bias inherent in many decisions made by human beings.

Real-World Example of Factor Analysis

job_interview_colleagues_business

Let’s look at a concrete example to see how all of this fits together and how it would be implemented in the real world.

Imagine we are a software engineering team at a big company trying to hire employees that have “outstanding” technical ability. “Technical ability,” like the terms altruism, generosity, health, creativity, aggressiveness, and innovativeness, are all terms that are not easily quantifiable or measured. So let us consider the term technical ability a latent variable.

We can then consider some observed variables as part of the candidate evaluation process:

  • Number of lines of code written
  • Kaggle scores
  • Hacker rank
  • Grade point average
  • SAT score
  • GRE score
  • Number of years of experience
3-factor_analysis

These would be all of the observed variables that we believe are connected in some way to the underlying factor known as technical ability.

At the end of the day, technical ability is what we would like to discover, but since we can’t measure that directly , we use data from some other variables in order to back in to that technical ability. And if you see on each of these arrows, we could add additional weightings to our model to be able to determine which of the particular observed variables we can safely remove from the model and still achieve a result that is desirable that gets us a candidate that has the technical ability we want.

So you can see by having some kind of model developed by factor analysis, we can help eliminate some of the representation bias that might be evident in a hiring system.

At the end of the day, people who hire teams of software engineers have their own inherent bias about who to hire and who not to hire. These biases might be based on irrelevant factors such as the hiring manager’s experience with people who looked similar to the candidate, how the candidate speaks, etc.

By evaluating the results of a model that is based on mathematics, we can then compare the results of the model to individual opinions in the team to try to get a more appropriate result and hire the best candidate…all the while alleviating any concerns of representative bias that might exist in the hiring process.

K-Means Clustering and the Local Search Problem

One of the disadvantages of k-means is that it is a local search (optimization) procedure. In order to explain what that means, I will do a quick overview of:

  1. How k-means works.
  2. The main disadvantage of local search procedures like k-means
  3. A real-world example of k-means and the local search problem

How K-means Works 

The K-means algorithm is used to group unlabeled data sets into clusters based on similar attributes. It involves four main steps. 

  1. Select k: Determine the number of clusters k you want to group your data set into. 
  2. Select k Random Instances: You choose, at random, k instances from the data set. These instances are the initial cluster centroids. 
  3. Cluster Assignment: This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it. 
  4. Move Centroid: You move each of the k centroids to the average of all the instances that were assigned to that cluster centroid. 

Notice how k-means is a local search procedure. What this means is that at each step the optimal solution for each instance is the one that is the most local. Local search entails defining a neighborhood, searching for that neighborhood, and then proceeding based on the assumption that local is better…the neighboring centroid that is the closest is the best. 

Why is Local Search a Disadvantage? 

The problem with methods like k-means that make continuous local improvements to the solution is that they are at risk of “missing the forest for the trees.” They might be so focused on what is local, that they miss the obvious big picture.

france_aerial_landscape_river
Ahh, a nice river with a forest next to it. Too bad k-means clustering can’t see this…
forest_path_forest_away_1
This is the only thing k-means can see because it searches locally for the optimal solution at each step of the algorithm.

To explain this point, I will present an example from the real world. 

Real-world Example 

Suppose we had data on the voting behavior, income level, and age for 10,000 people in Los Angeles, CA. We want to group these people into clusters based on similarities of these attributes. Without looking at the unlabeled data set, we think there should be three clusters (Republican, Democrat, and Independent). We decide to run k-means on the data set and select k=3. 

Here is the unlabeled data set:

1-clustering

Here are the results after running k-means:

2-colored-clusters

The algorithm has defined three clusters, each with its own centroid. 

What is the problem with this solution? The problem is that our solution is stuck in local minima. It is obvious to the human eye that there are two clusters not three (blue = Democrats and red/green = Republicans). The red and green dots are really a single group. However, since k-means is a local search procedure, it optimizes locally.

We could run k-means 10,000 times, and the centroids would not move. The real centroid for those green and red points is somewhere between the green and red dots. However, k-means doesn’t know that. It just cares about finding the optimal local solution. 

Globally, we can see k-means gave us incorrect clustering because it was so focused on what was good in the neighborhoods and could not see the big picture. An analogy is this graph below: 

3-local-search

Since k-means is so sensitive to the initial choice of centroids as well as the value for k, it helps to have some prior knowledge of the data set, This knowledge will be invaluable to making sure the results of k-means make sense in the context of the problem.

Would local searching still be a potential issue if we had chosen k=2 in this case?

Yep! It would still be an issue depending on the choice of initial centroids. As I mentioned earlier, k-means is sensitive to the initial choice of centroids as well as the value for k. Choose bad starting centroids that gets the algorithm stuck in bad local optima, and you could run k-means 10,000 times, and the centroids would not budge. 

K-means Clustering Algorithm From Scratch | Machine Learning

In this post, I will walk you through the k-means clustering algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

Table of Contents

What is the K-means Clustering Algorithm?

The K-means algorithm is an unsupervised clustering algorithm (the target variable we want to predict is not available) that is used to group unlabeled data set instances into clusters based on similar attributes.

Step 1: Choose a Value for K and Select the Initial Centroids

The first step of K-means is to determine the valve for k, the number of clusters you want to group your data set into. We then randomly select k cluster centroids (i.e. k random instances) from the data set, where k is some positive integer.

Step 2: Cluster Assignment

The next step is the cluster assignment step. This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it.

Step 3: Move Centroid

The next part is the move centroid step. We move each of the k centroids to the average of all the instances that were assigned to that cluster centroid.

Step 4: Repeat Steps 2 and 3 Until Centroids Stop Changing

We keep running the cluster assignment and move centroid steps until the cluster centroids stop changing. At that point, the data is deemed clustered. 

Andrew Ng, a professor of machine learning at Stanford University, does a good job of showing you visually how k-means clustering works:

Return to Table of Contents

K-means Clustering Algorithm Design

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class (used for classification accuracy calculation)

The program then adds 4 additional columns.

  • N + 1: Cluster
  • N + 2: Silhouette Coefficient
  • N + 3: Predicted Class
  • N + 4: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

iris-data-set

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set.

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

binarization-kmeans

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

K-means Clustering Algorithm in Python, Coded From Scratch

K-means appears to be particularly sensitive to the starting centroids. The starting centroids for the k clusters were chosen at random. When these centroids started out poor, the algorithm took longer to converge to a solution. Future work would be to fine-tune the initial centroid selection process. 

Here is the full code for k-means clustering:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: kmeans.py
# Author: Addison Sears-Collins
# Date created: 6/12/2019
# Python version: 3.7
# Description: Implementation of K-means clustering algorithm from scratch.
# K-means algorithm is a clustering algorithm that is used to group 
# unlabeled data set instances into clusters based on similar attributes.

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class (used for classification accuracy calculation)

# This program then adds 4 additional columns.
# N + 1: Cluster
# N + 2: Silhouette Coefficient
# N + 3: Predicted Class
# N + 4: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-means"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_kmeans_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_kmeans_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_kmeans_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Read the full text file and store records in a Pandas dataframe
pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Copy the dataframe into a new dataframe so we don't mess up the
# original data
pd_data_set = pd_full_data_set.copy() 

# Calculate the number of instances, columns, and attributes in the
# training data set. Assumes 1 column for the instance ID and 1 column
# for the class. Record the index of the column that contains 
# the actual class
no_of_instances = len(pd_data_set.index) # number of rows
no_of_columns = len(pd_data_set.columns) # number of columns
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Store class values in a column and then create a list of unique
# classes and store in a dataframe and a Numpy array
unique_class_list_df = pd_data_set.iloc[:,actual_class_column]
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

# Record the number of unique classes in the data set
num_unique_classes = len(unique_class_list_df)

# Record the value for K, the number of clusters
K = num_unique_classes

# Remove the Instance and the Actual Class Column to create an unlabled
# data set
instance_id_colname = pd_data_set.columns[0]
class_column_colname = pd_data_set.columns[actual_class_column]
pd_data_set = pd_data_set.drop(columns = [ # Each row is a different instance
        instance_id_colname, class_column_colname]) 

# Convert dataframe into a Numpy array
np_data_set = pd_data_set.to_numpy(copy=True)

# Randomly select k instances from the data set. 
# These will be the cluster centroids for the first iteration
# of the algorithm.
centroids = np_data_set[np.random.choice(np_data_set.shape[
    0], size=K, replace=False), :]


##################### Cluster Assignment Step ################################
# Go through each instance and assign that instance to the closest 
# centroid (based on Euclidean distance).

# Initialize an array which will contain the cluster assignments for each
# instance.
cluster_assignments = np.empty(no_of_instances)

# Goes True if new centroids are the same as the old centroids
centroids_the_same = False

# Sets the maximum number of iterations
max_iterations = 300

while max_iterations > 0 and not(centroids_the_same):
    # Go through each data point and assign it to the nearest centroid
    for row in range(0, no_of_instances):
    
        this_instance = np_data_set[row]

        # Calculate the Euclidean distance of each instance in the data set
        # from each of the centroids
        # Find the centroid with the minimum distance and assign the instance
        # to that centroid.
        # Record that centroid in the cluster assignments array.
    
        # Reset the minimum distance to infinity
        min_distance = float("inf")

        for row_centroid in range(0, K):
            this_centroid = centroids[row_centroid]
        
            # Calculate the Euclidean distance from this instance to the
            # centroid
            distance = np.linalg.norm(this_instance - this_centroid)

            # If we have a centroid that is closer to this instance,
            # update the cluster assignment for this instance.
            if distance < min_distance:
                cluster_assignments[row] = row_centroid
                min_distance = distance # Update the minimum distance

    # Print after each cluster assignment has completed
    print("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:")
    print(cluster_assignments)
    print()
    print("Now calculating the new centroids...")
    print()

    outfile3.write("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:"+ "\n")
    outfile3.write(str(cluster_assignments))
    outfile3.write("\n")
    outfile3.write("\n")
    outfile3.write("Now calculating the new centroids..." + "\n")
    outfile3.write("\n")


    ##################### Move Centroid Step ################################
    # Calculate the centroids of the clusters by computing the average
    # of the attribute values of the instances in each cluster
    # For each row in the centroids 2D array

    # Store the old centroids
    old_centroids = centroids.copy()

    for row_centroid in range(0, K):

        # For each column of each row of the centroids 2D array
        for col_centroid in range(0, no_of_attributes):

            # Reset the running sum and the counter
            running_sum = 0.0
            count = 0.0
            average = None

            for row in range(0, no_of_instances):

                # If this instance belongs to this cluster
                if(row_centroid == cluster_assignments[row]):
                
                    # Add this value to the running sum
                    running_sum += np_data_set[row,col_centroid]

                    # Increment the counter
                    count += 1
        
                    if (count > 0):
                        # Calculate the average
                        average = running_sum / count

            # Update the centroids array with this average
            centroids[row_centroid,col_centroid] = average
    
    # Print to after each cluster assignment has completed
    print("New centroids have been created. Here they are:")
    print(centroids)
    print()

    outfile3.write("New centroids have been created. Here they are:" + "\n")
    outfile3.write(str(centroids))
    outfile3.write("\n")
    outfile3.write("\n")

    # Check if cluster centroids are the same
    centroids_the_same = np.array_equal(old_centroids,centroids)

    if centroids_the_same:
        print(
        "Cluster membership is unchanged. Stopping criteria has been met.")
        outfile3.write("Cluster membership is unchanged. ")
        outfile3.write("Stopping criteria has been met." + "\n")
        outfile3.write("\n")

    # Update the number of iterations
    max_iterations -= 1

# Record the actual class column name
actual_class_col_name = pd_full_data_set.columns[len(
    pd_full_data_set.columns) - 1]

# Add 4 additional columns to the original data frame
pd_full_data_set = pd_full_data_set.reindex(
      columns=[*pd_full_data_set.columns.tolist(
      ), 'Cluster', 'Silhouette Coefficient', 'Predicted Class', (
      'Prediction Correct?')])

# Add the final cluster assignments to the Pandas dataframe
pd_full_data_set['Cluster'] = cluster_assignments

outfile3.write("Calculating the Silhouette Coefficients. Please wait..." + "\n")
outfile3.write("\n")
print()
print("Calculating the Silhouette Coefficients. Please wait...")
print()
################## Calculate the Silhouette Coefficients ######################
# Rewards clusterings that have good cohesion and good separation. Varies 
# between 1 and -1. -1 means bad clustering, 1 means great clustering.

# 1. For each instance calculate the average distance to all other instances 
# in that cluster. This is a.
# 2. (Find the average distance to all the instances in the nearest neighbor 
# cluster). For each instance and any cluster that does not contain the 
# instance calculate the average distance to all
# of the points in that other cluster. Then return the minimum such value
# over all of the clusters. This is b.
# 3. For each instance calculate the Silhouette Coefficient s where
# s = (b-a)/max(a,b)
# Store the value in the data frame

silhouette_column = actual_class_column + 2

# Go through one instance at a time
for row in range(0, no_of_instances):

    this_instance = np_data_set[row]
    this_cluster = cluster_assignments[row]

    a = None
    running_sum = 0.0
    counter = 0.0

    # Calculate the average distance to all other instances 
    # in this cluster. This is a.
    # Go through one instance at a time
    for row_2 in range(0, no_of_instances):

        # If the other instance is in the same cluster as this instance
        if this_cluster == cluster_assignments[row_2]:

            # Calculate the distance
            distance = np.linalg.norm(this_instance - np_data_set[row_2])

            # Add the distance to the running sum
            running_sum += distance
            counter += 1

    # Calculate the value for a
    if counter > 0:
        a = running_sum / counter

    # For each instance and any cluster that does not contain the 
    # instance calculate the average distance to all
    # of the points in that other cluster. Then return the minimum such value
    # over all of the clusters. This is b.
    b = float("inf") 
    
    for clstr in range(0, K):

        running_sum = 0.0
        counter = 0.0

        # Must be other clusters, not the one this instance is in
        if clstr != this_cluster:

            # Calculate the average distance to instances in that 
            # other cluster
            for row_3 in range(0, no_of_instances):

                if cluster_assignments[row_3] == clstr:

                    # Calculate the distance
                    distance = np.linalg.norm(this_instance - np_data_set[
                        row_3])

                    # Add the distance to the running sum
                    running_sum += distance
                    counter += 1
        
            if counter > 0:
                avg_distance_to_cluster = running_sum / counter
        
            # Update b if we have a new minimum
            if avg_distance_to_cluster < b:
                b = avg_distance_to_cluster

    # Calculate the Silhouette Coefficient s where s = (b-a)/max(a,b)
    s = (b - a) / max(a,b)

    # Store the Silhouette Coefficient in the Pandas data frame
    pd_full_data_set.iloc[row,silhouette_column] = s

#################### Predict the Class #######################################
# For each cluster, determine the predominant class and assign that 
# class to the cluster. Then determine if the prediction was correct.
# Create a data frame that maps clusters to actual classes
class_mappings = pd.DataFrame(index=range(K),columns=range(1))

for clstr in range(0, K):

    # Select rows whose column equals that cluster value
    temp_df = pd_full_data_set.loc[pd_full_data_set['Cluster'] == clstr]
    
    # Select the predominant class
    class_mappings.iloc[clstr,0] = temp_df.mode()[actual_class_col_name][0]

cluster_column = actual_class_column + 1
pred_class_column = actual_class_column + 3
pred_correct_column = actual_class_column + 4

# Assign the relevant class to each instance
# See if prediction was correct
for row in range(0, no_of_instances):

    # Go through each of the clusters to check if the instance is a member
    # of that cluster
    for clstr in range(0, K):
        if clstr == pd_full_data_set.iloc[row,cluster_column]:

            # Assign the relevant class to this instance
            pd_full_data_set.iloc[
                row,pred_class_column] = class_mappings.iloc[clstr,0]

    # If the prediction was correct
    if pd_full_data_set.iloc[row,pred_class_column] == pd_full_data_set.iloc[
        row,actual_class_column]:
        pd_full_data_set.iloc[row,pred_correct_column] = 1
    else: # If incorrect prediction
        pd_full_data_set.iloc[row,pred_correct_column] = 0

# Write dataframe to a file
pd_full_data_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Print data frame to the console
print()
print()
print("Data Set")
print(pd_full_data_set)
print()
print()

################### Summary Statistics #######################################
# Calculate the average Silhouette Coefficient for the data set
# Calculate the accuracy of the clustering-based classifier

# Open a new file to save the summary statistics
outfile1 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile1.write("\n")
outfile1.write("Number of Instances : " + str(no_of_instances) + "\n")
outfile1.write("\n")
outfile1.write("Value for k : " + str(K) + "\n")

# Calculate average Silhouette Coefficient for the data set
silhouette_coefficient = pd_full_data_set.loc[
    :,"Silhouette Coefficient"].mean()

# Write the Silhouette Coefficient to the file
outfile1.write("Silhouette Coefficient : " + str(
    silhouette_coefficient) + "\n")
      
# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (pd_full_data_set.iloc[
        :,pred_correct_column].sum())/no_of_instances

accuracy *= 100

# Write accuracy to the file
outfile1.write("Accuracy : " + str(accuracy) + "%\n")

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Instances : " + str(no_of_instances))
print("Value for k : " + str(K))

# Print the Silhouette Coefficient to the console
print("Silhouette Coefficient : " + str(
    silhouette_coefficient))

# Print accuracy to the console
print("Accuracy : " + str(accuracy) + "%")

# Close the files
outfile1.close()
outfile3.close()

Return to Table of Contents

Output Statistics of K-means Clustering on the Iris Data Set

This section shows the results for the runs of K-means Clustering on the iris data set.

Test Statistics

test-stats-kmeans

Trace Runs

Here are the trace runs of the algorithm.

Output

Here is the output of the algorithm.

Return to Table of Contents

Stepwise Forward Selection Algorithm From Scratch

In this post, I will walk you through the Stepwise Forward Selection algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python and use it for feature selection for the Naive Bayes algorithm we previously developed. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

Table of Contents

What is the Stepwise Forward Selection Algorithm?

One of the fundamental ideas in machine learning is the Curse of Dimensionality – as the number of attributes in the training set grows, you need exponentially more instances to learn the underlying concept. And the more instances you have, the more computational time is required to process the data.

In other words, adding attributes improves performance, to a point. Beyond that point, you need to have exponentially more instances in order for performance to improve. This phenomenon is commonly known as the Hughes Effect (or Hughes Phenomenon).

curse-of-dimensionality

Image Source: Hughes, G. (1968). On the Mean Accuracy Of Statistical Pattern Recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.

In the above image, n = number of attributes and m = number of instances.

One technique for combatting the Curse of Dimensionality is known as Stepwise Forward Selection (SFS). SFS involves selecting only the most relevant attributes for learning and discarding the rest. The metric that determines what attribute is “most relevant” is determined by the programmer. In this project, I will use the classification accuracy of Naïve Bayes.

Because SFS is just a method to create an optimal attribute subset, it needs to be wrapped with another algorithm (in this case Naïve Bayes) that does the actual learning. For this reason, SFS is known as a wrapper method.

The goal of SFS is to reduce the number of features that a machine learning algorithm needs to examine by eliminating unnecessary features and finding an optimal attribute subset.

Suppose we have attributes A1 through Ak which are used to describe a data set D.

A = <A1,…Ak>

SFS begins with an empty attribute set:

A0 = <>

It then selects an attribute from A, one at a time, until the performance (e.g. as measured by classification accuracy) of the target learning algorithm (e.g. Naïve Bayes) on a subset of the training data fails to improve performance. 

Return to Table of Contents

Stepwise Forward Selection Implementation

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class

The program I developed (code is below) then adds 2 additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

preprocessing-data

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set. Here is what an iris flower looks like in case you are interested:

iris-flower

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

binarized-data

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

Stepwise Forward Selection Algorithm in Python, Coded From Scratch

Naïve Bayes was wrapped with SFS for feature selection. Classification accuracy was used to select the optimal attribute at each step. Once the optimal attribute subset was obtained, Naïve Bayes was run on the complete training set (67% of instances) and testing set (33% of instances), and classification accuracy was calculated.

Here is the code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: sfs_naive_bayes.py
# Author: Addison Sears-Collins
# Date created: 6/10/2019
# Python version: 3.7
# Description: Implementation of Naive Bayes which uses Stepwise Forward 
# Selection (SFS) for feature selection. This code works for multi-class 
# classification problems (e.g. democrat/republican/independent)
# Calculate P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0) and
# P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1) and
# P(E1|CL2)P(E2|CL2)P(E3|CL2)...P(E#|CL2) * P(CL2), etc. and predict the
# class with the maximum result. 
# E is an attribute, and CL means class.
# Only need class prior probability and likelihoods to make a prediction
# (i.e. the numerator of Bayes formula) since denominators are same for both 
# the P(CL0|E1,E2,E3...)*P(CL0) and P(CL1|E1,E2,E3...)*P(CL1), etc. cases 
# where P means "probability of" and | means "given".

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class 

# This program then adds 2 additional columns for the testing set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "SFS-Wrapped Naive Bayes"
DATA_PATH = "glass_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "glass_dataset_naive_bayes_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "glass_dataset_naive_bayes_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "glass_dataset_naive_bayes_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
TRAINING_DATA_PRCT = 0.67 # % of data set used for training. Default 0.67
testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing
SEED = 99  # SEED for the random number generator. Default: 99
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Naive Bayes algorithm. Accepts a Pandas dataframeas its parameter
def naive_bayes(pd_data):
    # Create a training dataframe by sampling random instances from original data.
    # random_state guarantees that the pseudo-random number generator generates 
    # the same sequence of random numbers each time.
    pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED)

    # Create a testing dataframe. Dropping the training data from the original
    # dataframe ensures training and testing dataframes have different instances
    pd_testing_data = pd_data.drop(pd_training_data.index)

    ######COMMENT OUT THESE TWO LINES BEFORE RUNNING ################
    #pd_training_data = pd_data.iloc[:20] # Used for testing only, gets 1st 20 rows
    #pd_testing_data = pd_data.iloc[20:22,:] # Used for testing only, rows 20 &amp; 21

    # Calculate the number of instances, columns, and attributes in the
    # training data set. Assumes 1 column for the instance ID and 1 column
    # for the class. Record the index of the column that contains 
    # the actual class
    no_of_instances_train = len(pd_training_data.index) # number of rows
    no_of_columns_train = len(pd_training_data.columns) # number of columns
    no_of_attributes = no_of_columns_train - 2
    actual_class_column = no_of_columns_train - 1

    # Store class values in a column, sort them, then create a list of unique
    # classes and store in a dataframe and a Numpy array
    unique_class_list_df = pd_training_data.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    # Record the number of unique classes in the data set
    num_unique_classes = len(unique_class_list_df)

    # Record the frequency counts of each class in a Numpy array
    freq_cnt_class = pd_training_data.iloc[:,actual_class_column].value_counts(
        sort=True)

    # Record the frequency percentages of each class in a Numpy array
    # This is a list of the class prior probabilities
    class_prior_probs = pd_training_data.iloc[:,actual_class_column].value_counts(
        normalize=True, sort=True)

    # Add 2 additional columns to the testing dataframe
    pd_testing_data = pd_testing_data.reindex(
                      columns=[*pd_testing_data.columns.tolist(
                      ), 'Predicted Class', 'Prediction Correct?'])

    # Calculate the number of instances and columns in the
    # testing data set. Record the index of the column that contains the 
    # predicted class and prediction correctness (1 if yes; 0 if no)
    no_of_instances_test = len(pd_testing_data.index) # number of rows
    no_of_columns_test = len(pd_testing_data.columns) # number of columns
    predicted_class_column = no_of_columns_test - 2
    prediction_correct_column = no_of_columns_test - 1

    ######################### Training Phase of the Model ########################
    # Create a an empty dictionary
    my_dict = {}

    # Calculate the likelihood tables for each attribute. If an attribute has
    # four levels, there are (# of unique classes x 4) different probabilities 
    # that need to be calculated for that attribute.
    # Start on the first attribute and make your way through all the attributes
    for col in range(1, no_of_attributes + 1):

        # Record the name of this column 
        colname = pd_training_data.columns[col]

        # Create a dataframe containing the unique values in the column
        unique_attribute_values_df = pd_training_data[colname].drop_duplicates()
        # Create a Numpy array containing the unique values in the column
        unique_attribute_values_np = pd_training_data[colname].unique()
    
        # Calculate likelihood of the attribute given each unique class value
        for class_index in range (0, num_unique_classes):
        
            # For each unique attribute value, calculate the likelihoods 
            # for each class
            for attr_val in range (0, unique_attribute_values_np.size) :
                running_sum = 0

                # Calculate N(unique attribute value &amp;&amp; class value)
                # Where N means "number of" and &amp;&amp; means "and"
                # Go through each row of the training set
                for row in range(0, no_of_instances_train):
                    if (pd_training_data.iloc[row,col] == (
                        unique_attribute_values_df.iloc[attr_val])) and (
                        pd_training_data.iloc[row, actual_class_column] == (
                        unique_class_list_df.iloc[class_index])):
                            running_sum += 1

                # With N(unique attribute value &amp;&amp; class value) as the numerator
                # we now need to divide by the total number of times the class
                # appeared in the data set
                try:
                    denominator = freq_cnt_class[class_index]
                except:
                    denominator = 1.0
            
                likelihood = running_sum / denominator
            
                # Add a new likelihood to the dictionary
                # Format of search key is 
                # <attribute_name><attribute_value><class_value>
                search_key = str(colname) + str(unique_attribute_values_df.iloc[
                             attr_val]) + str(unique_class_list_df.iloc[
                             class_index])
                my_dict[search_key] = likelihood
 
    # Print the likelihood table to the console
    # print(pd.DataFrame.from_dict(my_dict, orient='index'))

    ################# End of Training Phase of the Naive Bayes Model ########

    ################# Testing Phase of the Naive Bayes Model ################

    # Proceed one instance at a time and calculate the prediction
    for row in range(0, no_of_instances_test):

        # Initialize the prediction outcome
        predicted_class = unique_class_list_df.iloc[0]
        max_numerator_of_bayes = 0.0

        # Calculate the Bayes equation numerator for each test instance
        # That is: P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0),
        # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1)...
        for class_index in range (0, num_unique_classes):

            # Reset the running product with the class
            # prior probability, P(CL)
            try:
                running_product = class_prior_probs[class_index]
            except:
                running_product = 0.0000001 # Class not found in data set
        
            # Calculation of P(CL) * P(E1|CL) * P(E2|CL) * P(E3|CL)...
            # Format of search key is 
            # <attribute_name><attribute_value><class_value>
            # Record each search key value
            for col in range(1, no_of_attributes + 1):
                attribute_name = pd_testing_data.columns[col]
                attribute_value = pd_testing_data.iloc[row,col]
                class_value = unique_class_list_df.iloc[class_index]

                # Set the search key
                key = str(attribute_name) + str(
                          attribute_value) + str(class_value)

                # Update the running product
                try:
                    running_product *= my_dict[key]
                except:
                    running_product *= 0

            # Record the prediction if we have a new max
            # Bayes numerator
            if running_product > max_numerator_of_bayes:
                max_numerator_of_bayes = running_product
                predicted_class = unique_class_list_df.iloc[
                             class_index] # New predicted class

        # Store the prediction in the dataframe
        pd_testing_data.iloc[row,predicted_class_column] = predicted_class
    
        # Store if the prediction was correct
        if predicted_class == pd_testing_data.iloc[row,actual_class_column]:
            pd_testing_data.iloc[row,prediction_correct_column] = 1
        else: 
            pd_testing_data.iloc[row,prediction_correct_column] = 0

    print("-------------------------------------------------------")
    print("Learned Model Predictions on Testing Data Set")
    print("-------------------------------------------------------")

    # Print the revamped dataframe
    print(pd_testing_data)

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (pd_testing_data.iloc[
        :,prediction_correct_column].sum())/no_of_instances_test
    # Return classification accuracy
    return accuracy
    ####################### End Testing Phase ######################################    

# Stepwise forward selection method accepts a Pandas dataframe as a parameter
# and then returns the optimal dataframe from that dataframe.
def sfs(pd_data):

    # Record the name of the column that contains the instance IDs and the class
    instance_id_colname = pd_data.columns[0]
    no_of_columns = len(pd_data.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = pd_data.columns[class_column_index]

    # Record the number of available attributes
    no_of_available_attributes = no_of_columns - 2

    # Create a dataframe containing the available attributes by removing
    # the Instance and the Class Column
    available_attributes_df = pd_data.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Create an empty optimal attribute dataframe containing only the
    # Instance and the Class Columns
    optimal_attributes_df = pd_data[[instance_id_colname,class_column_colname]]

    # Set the base performance to a really low number
    base_performance = -9999.0

    # Check whether adding a new attribute to the optimal attributes dataframe
    # improves performance
    # While there are still available attributes left
    while no_of_available_attributes > 0: 
        # Set the best performance to a low number
        best_performance = -9999.0

        # Initialize the best attribute variable to a placeholder
        best_attribute = "Placeholder"

        # For all attributes in the available attribute data frame
        for col in range(0, len(available_attributes_df.columns)):

            # Record the name of this attribute
            this_attr = available_attributes_df.columns[col]
        
            # Create a new dataframe with this attribute inserted
            temp_opt_attr_df = optimal_attributes_df.copy()
            temp_opt_attr_df.insert(
                loc=1,column=this_attr,value=(available_attributes_df[this_attr]))

            # Run Naive Bayes on this new dataframe and return the 
            # classification accuracy
            current_performance = naive_bayes(temp_opt_attr_df)

            # Find the new attribute that yielded the greatest
            # classification accuracy
            if current_performance > best_performance:
                best_performance = current_performance
                best_attribute = this_attr

        # Did adding another feature lead to improvement?
        if best_performance > base_performance:
            base_performance = best_performance

            # Add the best attribute to the optimal attribute data frame
            optimal_attributes_df.insert(
                loc=1,column=best_attribute,value=(
                available_attributes_df[best_attribute]))

            # Remove the best attribute from the available attribute data frame
            available_attributes_df = available_attributes_df.drop(
                columns = [best_attribute]) 

            # Print the best attribute to the console
            print()
            print(str(best_attribute) + " added to the optimal attribute subset")
            print()

            # Write to file
            outfile3.write(str(
                best_attribute) + (
                " added to the optimal attribute subset") + "\n")
           
            # Decrement the number of available attributes by 1
            no_of_available_attributes -= 1

            # Print number of attributes remaining to the console
            print()
            print(str(no_of_available_attributes) + " attributes remaining")
            print()
            print()

            # Write to file
            outfile3.write(str(no_of_available_attributes) + (
                " attributes remaining") + "\n" + "\n")
           
        else:
            print()
            print("Performance did not improve this round.")
            print("End of Stepwise Forward Selection.")
            print()
            outfile3.write("Performance did not improve this round." + "\n")
            outfile3.write("End of Stepwise Forward Selection." + "\n")
            break

    # Return the optimal attribute set
    return optimal_attributes_df


def main():
    # Read the full text file and store records in a Pandas dataframe
    pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

    # Used for feature selection of large data sets only
    #pd_full_data_set = pd_full_data_set.sample(frac=0.10, random_state=SEED)
    
    # Runs SFS on the full data set to find the optimal attribute data frame
    optimal_attribute_set = sfs(pd_full_data_set)

    # Write dataframe to a file
    optimal_attribute_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

    # Open a new file to save the summary statistics
    outfile2 = open(TEST_STATS_FILE,"w") 

    # Write to a file
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write("Data Set : " + DATA_PATH + "\n")

    # Write the relevant stats to a file
    outfile2.write("\n")
    outfile2.write("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)) + "\n")
       
    # Run Naive Bayes on the optimal attribute set
    accuracy = naive_bayes(optimal_attribute_set)
    accuracy *= 100

    # Write accuracy to the file
    outfile2.write("Accuracy : " + str(accuracy) + "%\n")

    # Print statistics to console
    print()
    print()
    print("-------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics (Testing)")
    print("-------------------------------------------------------")
    print("Data Set : " + DATA_PATH)

    # Print the relevant stats to the console
    print()
    print("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)))

    # Print accuracy to the console
    print()
    print("Accuracy : " + str(accuracy) + "%")
    print()

    # Create a dataframe containing the optimal attributes by removing
    # the Instance and the Class Column
    instance_id_colname = optimal_attribute_set.columns[0]
    no_of_columns = len(optimal_attribute_set.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = optimal_attribute_set.columns[class_column_index]
    optimal_attribute_set = optimal_attribute_set.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Write a list of the optimal attribute set to a file
    outfile2.write("Optimal Attribute Subset : " + str(list(
        optimal_attribute_set)))

    # Print a list of the optimal attribute set
    print("Optimal Attribute Subset : " + str(list(optimal_attribute_set)))
    print()
    print()   
    
    # Close the files
    outfile2.close()
    outfile3.close()

main() # Call the main function

Return to Table of Contents

Output Statistics of Stepwise Forward Selection on the Iris Data Set

This section shows the results for the runs of Stepwise Forward Selection wrapped with Naïve Bayes on the iris data set.

Test Statistics

sfs-test-stats-1

Trace Runs

stepwise-forward-selection-trace-runs

Here is a link to the output file.

Return to Table of Contents

Winnow2 Algorithm From Scratch | Machine Learning

In this post, I will walk you through the Winnow2 machine learning algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the breast cancer data set from the UCI Machine Learning Repository. Our goal is to be able to predict if a patient has breast cancer or not based on ten different attributes. Without further ado, let’s get started!

Table of Contents

What is the Winnow2 Algorithm?

The Winnow2 algorithm was invented by Nick Littlestone in the late 1980s. Winnow2 is an example of supervised learning. Supervised learning is the most common type of machine learning.

In supervised learning, you feed the machine learning algorithm data that contains a set of attributes (also known as x, predictors, inputs, features, etc.) and the corresponding classification (also known as label, output, target variable, correct answer, etc.). Each instance (also known as data point, example, etc.) helps the Learner (the prediction mathematical model we are trying to build) learn the association between the attributes and the class. The job of the Learner is to create a model that enables it to use just the values of the attributes to predict the class.

For example, consider how a supervised learning algorithm would approach the breast cancer data set that has 699 instances. Each instance is a different medical patient case seen by a doctor.

winnow2-algorithm-1

For each instance, we have:

  • 9 attributes: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.
  • 2 classes: malignant (breast cancer detected) or benign (breast cancer not detected)

The end goal of a classification supervised learning algorithm like Winnow2 is to develop a model that can accurately predict if a new patient has breast cancer or not based on his or her attribute values. And in order to do that, the Learner must be trained on the existing data set in order to learn the association between the 9 attributes and the class.

In Winnow2, we need to preprocess a data set so that both the attributes and the class are binary values, 0 (zero) or 1 (one). For example, in the example I presented above, the class value for each instance is either 0 (benign – breast cancer not detected) or 1 (malignant – breast cancer detected).

Return to Table of Contents

Algorithm Steps

The Winnow2 algorithm continuously updates as each new instance arrives (i.e. an online learning algorithm). Here is how Winnow2 works on a high level:

Step 1: Learner (prediction model) receives an instance (data point):

Since Winnow2 can only work with 0s and 1s, the attribute values need to have already been preprocessed to be either 0 or 1. For example, using the breast cancer data example, attributes like clump thickness need to first be binarized so that they are either 0 or 1. One-hot encoding is the method used in this project in order to perform this binarization.

Step 2: Learner makes a prediction of the class the instance belongs to based on the values of the attributes in the instance

For example, does the Learner believe the patient belongs to the breast cancer class based on the values of the attributes? The Learner predicts 0 if no (benign) or 1 if yes (malignant).

Step 3: Is the Prediction Correct?

The Learner is told the correct class after it makes a prediction. If the learner is correct, nothing happens. Learning (i.e. amending the mathematical model) only occurs when the Learner gets an incorrect answer.

There are two ways for the prediction to be incorrect:

winnow2-algorithm-5

In order to “learn”, a process called “Promotion” (to be explained in the next section) takes place when the Learner incorrectly predicts 0. This situation is a “false negative” result. A false negative occurs when the Learner predicts a person does not have a disease or condition when the person actually does have it.

“Demotion” takes place when the Learner incorrectly predicts 1. This situation is a “false positive” result. A false positive (also known as “false alarm”) occurs when the Learner predicts a person has a specific disease or condition when the person actually does not have it.

Promotion and Demotion: Building the Prediction Model One Instance at a Time

To understand the promotion and demotion activities of Winnow2, we need to first examine the Learner’s mathematical model, the tool being used to predict whether an instance belongs to the benign (0) class or malignant (1) class.

When the Learner receives an instance (“Step 1” from the previous section), it runs the attributes through the following weighted sum:

winnow2-algorithm-6

where:

  • d is the total number of attributes
  • wi is the weighting of the ith attribute  
  • xi is the value of the ith attribute in binary format
  • f(x) is the weighted sum (e.g. w1x1 + w2x2 + w3x3 + …. + wdxd)

The Learner then predicts an instance’s class (e.g. 1 for malignant and 0 for benign in our breast cancer example) as follows where:

hofx
  • h(x) is the predicted class
  • Ɵ is a constant threshold (commonly set to 0.5)

As mentioned before, if the learner makes an incorrect prediction, either promotion or demotion occurs. In both promotion and demotion, the weights wi are modified by using a constant parameter α which is any value greater than 1 (commonly set to 2).

Initially all the weights wi for each attribute are set to 1. They are then adjusted as follows:

winnow2-algorithm-2

Return to Table of Contents

Winnow2 Example Using Real Data

Let’s take a look at the Winnow2 algorithm using an example with real data. We have three different attributes (which have already been converted from real numbers into binary form) and a class variable (which has also been converted into binary form). Our goal is to develop a model that can use just those three attributes to predict if a medical patient has breast cancer or not.

  • 3 attributes: clump thickness (x1), uniformity of cell size(x2), uniformity of cell shape (x3),
  • 2 classes (labels): malignant (breast cancer detected) or benign (breast cancer not detected)

Remember:

  • All weights w for each attribute are initially set to 1. This is known as the “weight vector.”
  • Ɵ = 0.5. This is our threshold.
  • α = 2
  • d = 3 because there are 3 attributes.

Here is the original data set with the attributes (inputs) and the class (output):

winnow2-algorithm-3

Here is what we have after we run Winnow2. We proceeded row-by-row in the original data set (one instance at a time), which generated this table:

winnow2-algorithm-4

Return to Table of Contents

Winnow2 Implementation

I implemented Winnow2 from scratch in Python and then ran the algorithm on a real-world data set, the breast cancer data set from the UCI Machine Learning Repository.

Preprocessing the Example Data Set – Breast Cancer

The breast cancer data set contains 699 instances, 10 attributes, and a class – malignant or benign. I transformed the attributes and classes into binary numbers (making them discrete is fine too, but they cannot be continuous) so that the algorithms could process the data properly and efficiently. If attribute value was greater than 5, the value was changed to 1, otherwise it was 0. I also changed Benign from 2 to 0 and Malignant from 4 to 1.

There were 16 missing attribute values in the data set, each denoted with a “?”. I chose a random number between 1 and 10 (inclusive) to fill in the data.

Finally, for each run, data sets were split into 67% for training and 33% for testing. Summary statistics were then calculated.

Here is the format the input data needs to take in the raw text/csv file. Attributes can be numerical or text, but in this example the were all numerical:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1 (in binary)
  • 2: Attribute 2 (in binary)
  • 3: Attribute 3 (in binary)
  • N: Actual Class (in binary)

The program then adds 8 additional columns:

  • N + 1: Weighted Sum (of the attributes)
  • N + 2: Predicted Class (in binary)…Weighted Sum > 0? (1 if yes; 0 if no)
  • N + 3: True Positive (1 if yes; O if no)
  • N + 4: False Positive (1 if yes; 0 if no)
  • N + 5: False Negative (1 if yes; 0 if no)
  • N + 6: True Negative (1 if yes; 0 if no)
  • N + 7: Promote (1 if yes; 0 if no)  [for training set only]
  • N + 8: Demote (1 if yes; 0 if no) [for training set only]

Here is a link to the input file after all that preprocessing was performed.

Here is a link to the output file after the algorithm below was run on the input file.

Here is a link to the final weights after the algorithm below was run on the input file.

Return to Table of Contents

Winnow2 Algorithm in Python, Coded From Scratch

Here is the code:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: winnow2.py
# Author: Addison Sears-Collins
# Date created: 5/31/2019
# Python version: 3.7
# Description: Implementation of the Winnow2 machine learning
# algorithm invented by Nick Littlestone. Used for 2-class classificaiton
# problems (e.g. cancer/no cancer....spam/not spam, etc...)
# Nick Littlestone (1988). "Learning Quickly When Irrelevant Attributes 
# Abound: A New Linear-threshold Algorithm", Machine Learning 285–318(2)

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 (in binary)
# 2: Attribute 2 (in binary)
# 3: Attribute 3 (in binary)
# ...
# N: Actual Class (in binary)

# This program then adds 8 additional columns:
# N + 1: Weighted Sum (of the attributes)
# N + 2: Predicted Class (in binary)...Weighted Sum > 0? (1 if yes; 0 if no)
# N + 3: True Positive (1 if yes; O if no)
# N + 4: False Positive (1 if yes; 0 if no)
# N + 5: False Negative (1 if yes; 0 if no)
# N + 6: True Negative (1 if yes; 0 if no)
# N + 7: Promote (1 if yes; 0 if no) [for training set only]
# N + 8: Demote (1 if yes; 0 if no)  [for training set only]

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "Winnow2"
THETA = 0.5   # This is the threshold constant for the Winnow2 algorithm
ALPHA = 2.0    # This is the adjustment constant for promotion &amp; demotion
DATA_PATH = "iris_setosa_dataset.txt"  # Directory where data set is located
TRAIN_WEIGHTS_FILE = "iris_setosa_winnow2_train_weights.txt" # Weights of learned model
TRAIN_OUT_FILE = "iris_setosa_winnow2_train_out.txt" # Training phase of the model
TEST_STATS_FILE = "iris_setosa_winnow2_test_stats.txt" # Testing statistics
TEST_OUT_FILE = "iris_setosa_winnow2_test_out.txt" # Testing phase of the model
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
CLASS_IF_ONE = "Iris Setosa" # If Class value is 1 (e.g. Malignant, Spam, etc.)
CLASS_IF_ZERO = "Not Iris Setosa"  # If Class value is 0 (e.g. Benign, Not Spam, etc.)
TRAINING_DATA_PRCT = 0.67 # % of data set used for training
testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing
SEED = 99  # SEED for the random number generator. Default: 99
#############################################################################

# Read a text file and store records in a Pandas dataframe
pd_data = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Create a training dataframe by sampling random instances from original data.
# random_state guarantees that the pseudo-random number generator generates 
# the same sequence of random numbers each time.
pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED)

# Create a testing dataframe. Dropping the training data from the original
# dataframe ensures training and testing dataframes have different instances
pd_testing_data = pd_data.drop(pd_training_data.index)

# Convert training dataframes to Numpy arrays
np_training_data = pd_training_data.values
np_testing_data = pd_testing_data.values

#np_training_data = pd_data.values # Used for testing only
#np_testing_data = pd_data.values  # Used for testing only

################ Begin Training Phase #####################################

# Calculate the number of instances, columns, and attributes in the data set
# Assumes 1 column for the instance ID and 1 column for the class
# Record the index of the column that contains the actual class
no_of_instances = np_training_data.shape[0]
no_of_columns = np_training_data.shape[1]
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Initialize the weight vector. Initialize all weights to 1.
# First column of weight vector is not used (i.e. Instance ID)
weights = np.ones(no_of_attributes + 1)

# Create a new array that has 8 columns, initialized to 99 for each value
extra_columns_train = np.full((no_of_instances, 8),99)

# Add extra columns to the training data set
np_training_data = np.append(np_training_data, extra_columns_train, axis=1)

# Make sure it is an array of floats
np_training_data = np_training_data.astype(float)

# Build the learning model one instance at a time
for row in range(0, no_of_instances):

    # Set the weighted sum to 0
    weighted_sum = 0

    # Calculate the weighted sum of the attributes
    for col in range(1, no_of_attributes + 1):
        weighted_sum += (weights[col] * np_training_data[row,col])

    # Record the weighted sum into column N + 1, the column just to the right
    # of the actual class column
    np_training_data[row, actual_class_column + 1] = weighted_sum

    # Set the predicted class to 99
    predicted_class = 99

    # Learner's prediction: Is the weighted sum > THETA?
    if weighted_sum > THETA:
        predicted_class = 1
    else:
        predicted_class = 0

    # Record the predicted class into column N + 2
    np_training_data[row, actual_class_column + 2] = predicted_class

    # Record the actual class into a variable
    actual_class = np_training_data[row, actual_class_column]

    # Initialize the prediction outcomes
    # These variables are standard inputs into a "Confusion Matrix"
    true_positive = 0   # Predicted class = 1; Actual class = 1 (hit)
    false_positive = 0  # Predicted class = 1; Actual class = 0 (false alarm)
    false_negative = 0  # Predicted class = 0; Actual class = 1 (miss)
    true_negative = 0   # Predicted class = 0; Actual class = 0 

    # Determine the outcome of the Learner's prediction
    if predicted_class == 1 and actual_class == 1:
        true_positive = 1
    elif predicted_class == 1 and actual_class == 0:
        false_positive = 1
    elif predicted_class == 0 and actual_class == 1:
        false_negative = 1
    else:
        true_negative = 1

    # Record the outcome of the Learner's prediction
    np_training_data[row, actual_class_column + 3] = true_positive
    np_training_data[row, actual_class_column + 4] = false_positive
    np_training_data[row, actual_class_column + 5] = false_negative
    np_training_data[row, actual_class_column + 6] = true_negative

    # Set the promote and demote variables to 0
    promote = 0
    demote = 0

    # Promote if false negative
    if false_negative == 1:
        promote = 1
   
    # Demote if false positive
    if false_positive == 1:
        demote = 1

    # Record if either a promotion or demotion is needed
    np_training_data[row, actual_class_column + 7] = promote
    np_training_data[row, actual_class_column + 8] = demote

    # Run through each attribute and see if it is equal to 1
    # If attribute is 1, we need to either demote or promote (adjust the
    # corresponding weight by ALPHA).
    if demote == 1:
        for col in range(1, no_of_attributes + 1):
            if(np_training_data[row,col] == 1):
                weights[col] /= ALPHA
    if promote == 1:
        for col in range(1, no_of_attributes + 1):
            if(np_training_data[row,col] == 1):
                weights[col] *= ALPHA

# Open a new file to save the weights
outfile1 = open(TRAIN_WEIGHTS_FILE,"w") 

# Write the weights of the Learned model to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(" " + ALGORITHM_NAME + " Training Weights\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")
outfile1.write("\n----------------------------\n")
outfile1.write("Weights of the Learned Model\n")
outfile1.write("----------------------------\n")
for col in range(1, no_of_attributes + 1):
    colname = pd_training_data.columns[col]
    s = str(weights[col])
    outfile1.write(colname + " : " + s + "\n")

# Write the relevant constants used in the model to a file
outfile1.write("\n")
outfile1.write("\n")
outfile1.write("-----------\n")
outfile1.write("Constants\n")
outfile1.write("-----------\n")
s = str(THETA)
outfile1.write("THETA = " + s + "\n")
s = str(ALPHA)
outfile1.write("ALPHA = " + s + "\n")

# Close the weights file
outfile1.close()

# Print the weights of the Learned model
print("----------------------------------------------------------")
print(" " + ALGORITHM_NAME + " Results")
print("----------------------------------------------------------")
print("Data Set : " + DATA_PATH)
print()
print()
print("----------------------------")
print("Weights of the Learned Model")
print("----------------------------")
for col in range(1, no_of_attributes + 1):
    colname = pd_training_data.columns[col]
    s = str(weights[col])
    print(colname + " : " + s)

# Print the relevant constants used in the model
print()
print()
print("-----------")
print("Constants")
print("-----------")
s = str(THETA)
print("THETA = " + s)
s = str(ALPHA)
print("ALPHA = " + s)
print()

# Print the learned model runs in binary form
print("-------------------------------------------------------")
print("Learned Model Runs of the Training Data Set (in binary)")
print("-------------------------------------------------------")
print(np_training_data)
print()
print()

# Convert Numpy array to a dataframe
df = pd.DataFrame(data=np_training_data)

# Replace 0s and 1s in the attribute columns with False and True
for col in range(1, no_of_attributes + 1):
    df[[col]] = df[[col]].replace([0,1],["False","True"])

# Replace values in Actual Class column with more descriptive values
df[[actual_class_column]] = df[[actual_class_column]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Replace values in Predicted Class column with more descriptive values
df[[actual_class_column + 2]] = df[[actual_class_column + 2]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Change prediction outcomes to more descriptive values
for col in range(actual_class_column + 3,actual_class_column + 9):
    df[[col]] = df[[col]].replace([0,1],["No","Yes"])

# Rename the columns
df.rename(columns={actual_class_column + 1 : "Weighted Sum" }, inplace = True)
df.rename(columns={actual_class_column + 2 : "Predicted Class" }, inplace = True)
df.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)
df.rename(columns={actual_class_column + 7 : "Promote" }, inplace = True)
df.rename(columns={actual_class_column + 8 : "Demote" }, inplace = True)

# Change remaining columns names from position numbers to descriptive names
for pos in range(0,actual_class_column + 1):
    df.rename(columns={pos : pd_data.columns[pos] }, inplace = True)

print("-------------------------------------------------------")
print("Learned Model Runs of the Training Data Set (readable) ")
print("-------------------------------------------------------")
# Print the revamped dataframe
print(df)

# Write revamped dataframe to a file
df.to_csv(TRAIN_OUT_FILE, sep=",", header=True)
################ End Training Phase #####################################

################ Begin Testing Phase ######################################

# Calculate the number of instances, columns, and attributes in the data set
# Assumes 1 column for the instance ID and 1 column for the class
# Record the index of the column that contains the actual class
no_of_instances = np_testing_data.shape[0]
no_of_columns = np_testing_data.shape[1]
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Create a new array that has 6 columns, initialized to 99 for each value
extra_columns_test = np.full((no_of_instances, 6),99)

# Add extra columns to the testing data set
np_testing_data = np.append(np_testing_data, extra_columns_test, axis=1)

# Make sure it is an array of floats
np_testing_data = np_testing_data.astype(float)

# Test the learning model one instance at a time
for row in range(0, no_of_instances):

    # Set the weighted sum to 0
    weighted_sum = 0

    # Calculate the weighted sum of the attributes
    for col in range(1, no_of_attributes + 1):
        weighted_sum += (weights[col] * np_testing_data[row,col])

    # Record the weighted sum into column N + 1, the column just to the right
    # of the actual class column
    np_testing_data[row, actual_class_column + 1] = weighted_sum

    # Set the predicted class to 99
    predicted_class = 99

    # Learner's prediction: Is the weighted sum > THETA?
    if weighted_sum > THETA:
        predicted_class = 1
    else:
        predicted_class = 0

    # Record the predicted class into column N + 2
    np_testing_data[row, actual_class_column + 2] = predicted_class

    # Record the actual class into a variable
    actual_class = np_testing_data[row, actual_class_column]

    # Initialize the prediction outcomes
    # These variables are standard inputs into a "Confusion Matrix"
    true_positive = 0   # Predicted class = 1; Actual class = 1 (hit)
    false_positive = 0  # Predicted class = 1; Actual class = 0 (false alarm)
    false_negative = 0  # Predicted class = 0; Actual class = 1 (miss)
    true_negative = 0   # Predicted class = 0; Actual class = 0 

    # Determine the outcome of the Learner's prediction
    if predicted_class == 1 and actual_class == 1:
        true_positive = 1
    elif predicted_class == 1 and actual_class == 0:
        false_positive = 1
    elif predicted_class == 0 and actual_class == 1:
        false_negative = 1
    else:
        true_negative = 1

    # Record the outcome of the Learner's prediction
    np_testing_data[row, actual_class_column + 3] = true_positive
    np_testing_data[row, actual_class_column + 4] = false_positive
    np_testing_data[row, actual_class_column + 5] = false_negative
    np_testing_data[row, actual_class_column + 6] = true_negative

# Convert Numpy array to a dataframe
df = pd.DataFrame(data=np_testing_data)

# Replace 0s and 1s in the attribute columns with False and True
for col in range(1, no_of_attributes + 1):
    df[[col]] = df[[col]].replace([0,1],["False","True"])

# Replace values in Actual Class column with more descriptive values
df[[actual_class_column]] = df[[actual_class_column]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Replace values in Predicted Class column with more descriptive values
df[[actual_class_column + 2]] = df[[actual_class_column + 2]].replace([0,1],[CLASS_IF_ZERO,CLASS_IF_ONE])

# Change prediction outcomes to more descriptive values
for col in range(actual_class_column + 3,actual_class_column + 7):
    df[[col]] = df[[col]].replace([0,1],["No","Yes"])

# Rename the columns
df.rename(columns={actual_class_column + 1 : "Weighted Sum" }, inplace = True)
df.rename(columns={actual_class_column + 2 : "Predicted Class" }, inplace = True)
df.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)

df_numerical = pd.DataFrame(data=np_testing_data) # Keep the values in this dataframe numerical
df_numerical.rename(columns={actual_class_column + 3 : "True Positive" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 4 : "False Positive" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 5 : "False Negative" }, inplace = True)
df_numerical.rename(columns={actual_class_column + 6 : "True Negative" }, inplace = True)

# Change remaining columns names from position numbers to descriptive names
for pos in range(0,actual_class_column + 1):
    df.rename(columns={pos : pd_data.columns[pos] }, inplace = True)

print("-------------------------------------------------------")
print("Learned Model Predictions on Testing Data Set")
print("-------------------------------------------------------")
# Print the revamped dataframe
print(df)

# Write revamped dataframe to a file
df.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Open a new file to save the summary statistics
outfile2 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile2.write("----------------------------------------------------------\n")
outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile2.write("----------------------------------------------------------\n")
outfile2.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile2.write("\n")
outfile2.write("Number of Test Instances : " + 
    str(np_testing_data.shape[0])+ "\n")

tp = df_numerical["True Positive"].sum()
s = str(int(tp))
outfile2.write("True Positives : " + s + "\n")

fp = df_numerical["False Positive"].sum()
s = str(int(fp))
outfile2.write("False Positives : " + s + "\n")

fn = df_numerical["False Negative"].sum()
s = str(int(fn))
outfile2.write("False Negatives : " + s + "\n")

tn = df_numerical["True Negative"].sum()
s = str(int(tn))
outfile2.write("True Negatives : " + s + "\n")

accuracy = (tp + tn)/(tp + tn + fp + fn)
accuracy *= 100
s = str(accuracy)
outfile2.write("Accuracy : " + s + "%\n")

specificity = (tn)/(tn + fp)
specificity *= 100
s = str(specificity)
outfile2.write("Specificity : " + s + "%\n")

precision = (tp)/(tp + fp)
precision *= 100
s = str(precision)
outfile2.write("Precision : " + s + "%\n")

recall = (tp)/(tp + fn)
recall *= 100
s = str(recall)
outfile2.write("Recall : " + s + "%\n")

neg_pred_value = (tn)/(tn + fn)
neg_pred_value *= 100
s = str(neg_pred_value)
outfile2.write("Negative Predictive Value : " + s + "%\n")

miss_rate = (fn)/(fn + tp)
miss_rate *= 100
s = str(miss_rate)
outfile2.write("Miss Rate  : " + s + "%\n")

fall_out = (fp)/(fp + tn)
fall_out *= 100
s = str(fall_out)
outfile2.write("Fall-Out : " + s + "%\n")

false_discovery_rate = (fp)/(fp + tp)
false_discovery_rate *= 100
s = str(false_discovery_rate)
outfile2.write("False Discovery Rate : " + s + "%\n")

false_omission_rate = (fn)/(fn + tn)
false_omission_rate *= 100
s = str(false_omission_rate)
outfile2.write("False Omission Rate  : " + s + "%\n")

f1_score = (2 * tp)/((2 * tp) + fp + fn)
s = str(f1_score)
outfile2.write("F1 Score: " + s)

# Close the weights file
outfile2.close()

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Test Instances : " + 
    str(np_testing_data.shape[0]))

s = str(int(tp))
print("True Positives : " + s)

s = str(int(fp))
print("False Positives : " + s)

s = str(int(fn))
print("False Negatives : " + s)

s = str(int(tn))
print("True Negatives : " + s)

s = str(accuracy)
print("Accuracy : " + s + "%")

s = str(specificity)
print("Specificity : " + s + "%")

s = str(precision)
print("Precision : " + s + "%")

s = str(recall)
print("Recall : " + s + "%")

s = str(neg_pred_value)
print("Negative Predictive Value : " + s + "%")

s = str(miss_rate)
print("Miss Rate  : " + s + "%")

s = str(fall_out)
print("Fall-Out : " + s + "%")

s = str(false_discovery_rate)
print("False Discovery Rate : " + s + "%")

s = str(false_omission_rate)
print("False Omission Rate  : " + s + "%")

s = str(f1_score)
print("F1 Score: " + s)


###################### End Testing Phase ######################################

Return to Table of Contents

Output Statistics of Winnow2 on the Breast Cancer Data Set

Here is a link to a screenshot of the summary statistics:

winnow2_summary_stats_screenshot

Return to Table of Contents

What Is a Maximum Spanning Tree?

In this post, I will explain the concept of a maximum spanning tree.

What is a Spanning Tree?

Let’s say we have a graph G with three nodes, A, B, and C. Each node represents an attribute. For example, for a classification problem for breast cancer, A = clump size, B = blood pressure, C = body weight.

Graph G:

maximum-spanning-tree-1

A spanning tree is a subset of the graph G that includes all of the attributes with the minimum number of edges (that would have to be 2 because a tree with just one edge would only connect at most 2 attributes). In the graph above, there are three spanning trees. All spanning trees in this graph G must have the same number of attributes (3 in total) and edges (2 in total).

Spanning Tree 1:

Spanning Tree 2:

Spanning Tree 3:

What is a Maximum Spanning Graph?

OK, so we have our spanning trees. Now, imagine that each edge has a weight. This weight would be some number. Weighted graphs look like this:

The graph above could has three spanning trees, subsets of the graph G that include all of the attributes with the minimum number of edges.

Which one of those spanning graphs is the “maximum spanning graph?”…the one that, when you add up the weights of each edge of the spanning graph, delivers the greatest result. The answer to that is our maximum spanning tree.

Here is the maximum spanning tree:

Since the Attribute Designated as the Root Is Arbitrary, Is It Safe to Assume That This Choice Does Not Affect the Model Effectiveness?

Yes, it is safe to assume that. The graph doesn’t change, and Kruskal’s algorithm, the algorithm for finding the maximum spanning tree in a graph doesn’t care what the root is…it just wants to find the largest edge at each step that doesn’t produce a cycle.

The number of maximum spanning trees in a graph G remains constant. Whether you start at C, B, and E, doesn’t matter. The graph is what it is…unless of course you decide to add a new attribute…but then it would be a different graph with a whole other set of spanning trees.