K-Means Clustering and the Local Search Problem

One of the disadvantages of k-means is that it is a local search (optimization) procedure. In order to explain what that means, I will do a quick overview of:

  1. How k-means works.
  2. The main disadvantage of local search procedures like k-means.
  3. A real-world example of k-means and the local search problem.

How K-means Works 

The K-means algorithm is used to group the instances of an unlabeled data set into clusters based on similar attributes. It involves four main steps. 

  1. Select k: Determine the number of clusters k you want to group your data set into. 
  2. Select k Random Instances: You choose, at random, k instances from the data set. These instances are the initial cluster centroids. 
  3. Cluster Assignment: This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it. 
  4. Move Centroid: You move each of the k centroids to the average of all the instances that were assigned to that cluster centroid. 

Notice that k-means is a local search procedure: at each step, each instance is assigned to whichever centroid happens to be closest to it right now. Local search entails defining a neighborhood, searching within that neighborhood, and then proceeding on the assumption that local is better, i.e. that the closest centroid is the best one.
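
To make that greedy, local decision concrete, here is a minimal NumPy sketch of the cluster assignment step. The arrays points and centroids are illustrative placeholders with made-up values, not part of the code later on this page:

import numpy as np

# Toy data: four 2-D points and two current centroids (illustrative values)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = np.array([[1.0, 1.5], [8.5, 8.0]])

# For each point, compute the Euclidean distance to every centroid and
# greedily pick whichever centroid happens to be closest right now.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = np.argmin(distances, axis=1)

print(assignments)  # [0 0 1 1]

Each point only ever looks at its immediate neighborhood, i.e. its distances to the current centroids. Nothing in this step asks whether a completely different arrangement of centroids would be better overall.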

Why is Local Search a Disadvantage? 

The problem with methods like k-means that make continuous local improvements to the solution is that they are at risk of “missing the forest for the trees.” They might be so focused on what is local that they miss the obvious big picture.

[Image: aerial view of a river with a forest next to it]
Ahh, a nice river with a forest next to it. Too bad k-means clustering can’t see this…
[Image: a path winding through the forest]
This is the only thing k-means can see because it searches locally for the optimal solution at each step of the algorithm.

To explain this point, I will present an example from the real world. 

Real-world Example 

Suppose we had data on the voting behavior, income level, and age for 10,000 people in Los Angeles, CA. We want to group these people into clusters based on similarities of these attributes. Without looking at the unlabeled data set, we think there should be three clusters (Republican, Democrat, and Independent). We decide to run k-means on the data set and select k=3. 

Here is the unlabeled data set:

[Image: the unlabeled data set]

Here are the results after running k-means:

[Image: the data set after k-means, with three colored clusters]

The algorithm has defined three clusters, each with its own centroid. 

What is the problem with this solution? The problem is that our solution is stuck in a local minimum. It is obvious to the human eye that there are two clusters, not three (blue = Democrats and red/green = Republicans). The red and green dots are really a single group. However, since k-means is a local search procedure, it optimizes locally.

We could run another 10,000 iterations of k-means from this point, and the centroids would not move. The real centroid for the green and red points lies somewhere between the green and red dots, but k-means doesn’t know that. It just cares about finding the optimal local solution. 

Globally, we can see that k-means gave us an incorrect clustering because it was so focused on what was good within each neighborhood that it could not see the big picture. An analogy is the graph below: 

[Image: local search analogy graph showing a local optimum versus the global optimum]

Since k-means is so sensitive to the initial choice of centroids as well as the value for k, it helps to have some prior knowledge of the data set. This knowledge will be invaluable for making sure the results of k-means make sense in the context of the problem.

Would local searching still be a potential issue if we had chosen k=2 in this case?

Yep! It would still be an issue, depending on the choice of initial centroids. As I mentioned earlier, k-means is sensitive to the initial choice of centroids as well as the value for k. Choose bad starting centroids that get the algorithm stuck in a bad local optimum, and you could run 10,000 more iterations without the centroids budging. 
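
A common way to see, and partially work around, this sensitivity is to run k-means several times from different random starting centroids and keep the run with the lowest within-cluster sum of squares. The sketch below uses scikit-learn purely for illustration (it is not the from-scratch code later on this page), with synthetic blob data standing in for the voting data set:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two "real" groups in the voting example
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=42)

# Run k-means from several different random initializations.
# n_init=1 means each run uses exactly one starting configuration.
for seed in range(5):
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    print("seed =", seed, " within-cluster sum of squares =", round(km.inertia_, 1))

# Keeping the run with the lowest inertia (or simply setting n_init=10)
# reduces the chance of getting stuck with a bad set of starting centroids.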

K-means Clustering Algorithm From Scratch | Machine Learning

In this post, I will walk you through the k-means clustering algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!


What is the K-means Clustering Algorithm?

The K-means algorithm is an unsupervised clustering algorithm (the target variable we want to predict is not available) that is used to group unlabeled data set instances into clusters based on similar attributes.

Step 1: Choose a Value for K and Select the Initial Centroids

The first step of K-means is to determine the value for k, the number of clusters you want to group your data set into. We then randomly select k cluster centroids (i.e. k random instances) from the data set, where k is some positive integer.

Step 2: Cluster Assignment

The next step is the cluster assignment step. This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it.

Step 3: Move Centroid

The next part is the move centroid step. We move each of the k centroids to the average of all the instances that were assigned to that cluster centroid.

Step 4: Repeat Steps 2 and 3 Until Centroids Stop Changing

We keep running the cluster assignment and move centroid steps until the cluster centroids stop changing. At that point, the data is deemed clustered. 
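
Condensed into a bare-bones NumPy sketch (not the full implementation further down, which also handles file output, silhouette coefficients, and classification), the whole loop looks roughly like this:

import numpy as np

def kmeans_sketch(data, k, max_iterations=300, seed=0):
    """Bare-bones k-means: random init, assign, move, repeat until stable."""
    rng = np.random.default_rng(seed)

    # Steps 1-2: pick k random instances as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)

    for _ in range(max_iterations):
        # Step 2 (cluster assignment): nearest centroid by Euclidean distance
        distances = np.linalg.norm(
            data[:, None, :] - centroids[None, :, :], axis=2)
        assignments = np.argmin(distances, axis=1)

        # Step 3 (move centroid): mean of the instances in each cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[assignments == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, assignments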

Andrew Ng, a professor of machine learning at Stanford University, does a good job of showing visually how k-means clustering works in his machine learning course videos.


K-means Clustering Algorithm Design

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of the algorithm. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class (used for classification accuracy calculation)

The program then adds 4 additional columns.

  • N + 1: Cluster
  • N + 2: Silhouette Coefficient
  • N + 3: Predicted Class
  • N + 4: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

[Image: iris data set formatted with the required columns]

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set.

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1; otherwise, set the value to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)
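
If you would rather do the binarization in pandas than in Excel, a minimal sketch of the same rule looks like this (the file names are placeholders for wherever you keep the raw and binarized data):

import pandas as pd

# Placeholder file names; point these at your own raw and output files
df = pd.read_csv("iris_dataset_raw.csv")

# Binarize every attribute column (everything except the first column, the
# Instance ID, and the last column, the Actual Class): values greater than
# the column mean become 1, everything else becomes 0.
attribute_cols = df.columns[1:-1]
df[attribute_cols] = (df[attribute_cols] > df[attribute_cols].mean()).astype(int)

df.to_csv("iris_dataset.txt", index=False)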

Here is how the data looks after the binarization:

[Image: iris data set after binarization]

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.


K-means Clustering Algorithm in Python, Coded From Scratch

K-means appears to be particularly sensitive to the starting centroids. The starting centroids for the k clusters were chosen at random, and when these centroids started out poor, the algorithm took longer to converge to a solution. Future work would be to fine-tune the initial centroid selection process. 
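
One common way to do that fine-tuning is k-means++-style seeding, which spreads the starting centroids out by choosing each new centroid with probability proportional to its squared distance from the centroids already chosen. The code below does not do this (it simply picks k random instances), so the following is just a minimal NumPy sketch of the idea:

import numpy as np

def kmeans_plus_plus_init(data, k, seed=0):
    """Pick k starting centroids, spread out k-means++ style."""
    rng = np.random.default_rng(seed)

    # First centroid: one instance chosen uniformly at random
    centroids = [data[rng.integers(len(data))]]

    for _ in range(k - 1):
        # Squared distance from each instance to its nearest chosen centroid
        diffs = data[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)

        # Choose the next centroid with probability proportional to d2,
        # so far-away instances are more likely to be picked
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])

    return np.array(centroids)

Swapping this in for the plain random selection step would be a natural next experiment.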

Here is the full code for k-means clustering. Don’t be scared by how long this code is. I include a bunch of comments at each step so that you know what is going on. Just copy and paste it into your favorite IDE and run it!

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: kmeans.py
# Author: Addison Sears-Collins
# Date created: 6/12/2019
# Python version: 3.7
# Description: Implementation of K-means clustering algorithm from scratch.
# K-means algorithm is a clustering algorithm that is used to group 
# unlabeled data set instances into clusters based on similar attributes.

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class (used for classification accuracy calculation)

# This program then adds 4 additional columns.
# N + 1: Cluster
# N + 2: Silhouette Coefficient
# N + 3: Predicted Class
# N + 4: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-means"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_kmeans_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_kmeans_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_kmeans_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Read the full text file and store records in a Pandas dataframe
pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Copy the dataframe into a new dataframe so we don't mess up the
# original data
pd_data_set = pd_full_data_set.copy() 

# Calculate the number of instances, columns, and attributes in the
# training data set. Assumes 1 column for the instance ID and 1 column
# for the class. Record the index of the column that contains 
# the actual class
no_of_instances = len(pd_data_set.index) # number of rows
no_of_columns = len(pd_data_set.columns) # number of columns
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Store class values in a column and then create a list of unique
# classes and store in a dataframe and a Numpy array
unique_class_list_df = pd_data_set.iloc[:,actual_class_column]
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

# Record the number of unique classes in the data set
num_unique_classes = len(unique_class_list_df)

# Record the value for K, the number of clusters
K = num_unique_classes

# Remove the Instance and the Actual Class Column to create an unlabeled
# data set
instance_id_colname = pd_data_set.columns[0]
class_column_colname = pd_data_set.columns[actual_class_column]
pd_data_set = pd_data_set.drop(columns = [ # Each row is a different instance
        instance_id_colname, class_column_colname]) 

# Convert dataframe into a Numpy array of floats (converting to float keeps
# the centroid averages from being truncated when the attributes are integers,
# e.g. after binarization)
np_data_set = pd_data_set.to_numpy(copy=True).astype(float)

# Randomly select k instances from the data set. 
# These will be the cluster centroids for the first iteration
# of the algorithm.
centroids = np_data_set[np.random.choice(np_data_set.shape[
    0], size=K, replace=False), :]


##################### Cluster Assignment Step ################################
# Go through each instance and assign that instance to the closest 
# centroid (based on Euclidean distance).

# Initialize an array which will contain the cluster assignments for each
# instance.
cluster_assignments = np.empty(no_of_instances)

# Set to True when the new centroids are the same as the old centroids
centroids_the_same = False

# Sets the maximum number of iterations
max_iterations = 300

while max_iterations > 0 and not(centroids_the_same):
    # Go through each data point and assign it to the nearest centroid
    for row in range(0, no_of_instances):
    
        this_instance = np_data_set[row]

        # Calculate the Euclidean distance of each instance in the data set
        # from each of the centroids
        # Find the centroid with the minimum distance and assign the instance
        # to that centroid.
        # Record that centroid in the cluster assignments array.
    
        # Reset the minimum distance to infinity
        min_distance = float("inf")

        for row_centroid in range(0, K):
            this_centroid = centroids[row_centroid]
        
            # Calculate the Euclidean distance from this instance to the
            # centroid
            distance = np.linalg.norm(this_instance - this_centroid)

            # If we have a centroid that is closer to this instance,
            # update the cluster assignment for this instance.
            if distance < min_distance:
                cluster_assignments[row] = row_centroid
                min_distance = distance # Update the minimum distance

    # Print after each cluster assignment has completed
    print("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:")
    print(cluster_assignments)
    print()
    print("Now calculating the new centroids...")
    print()

    outfile3.write("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:"+ "\n")
    outfile3.write(str(cluster_assignments))
    outfile3.write("\n")
    outfile3.write("\n")
    outfile3.write("Now calculating the new centroids..." + "\n")
    outfile3.write("\n")


    ##################### Move Centroid Step ################################
    # Calculate the centroids of the clusters by computing the average
    # of the attribute values of the instances in each cluster
    # For each row in the centroids 2D array

    # Store the old centroids
    old_centroids = centroids.copy()

    for row_centroid in range(0, K):

        # For each column of each row of the centroids 2D array
        for col_centroid in range(0, no_of_attributes):

            # Reset the running sum and the counter
            running_sum = 0.0
            count = 0.0

            for row in range(0, no_of_instances):

                # If this instance belongs to this cluster
                if(row_centroid == cluster_assignments[row]):

                    # Add this value to the running sum
                    running_sum += np_data_set[row,col_centroid]

                    # Increment the counter
                    count += 1

            # Update the centroid with the average attribute value of the
            # cluster's members. If the cluster is empty, leave its old
            # centroid value in place rather than writing None into the array.
            if count > 0:
                centroids[row_centroid,col_centroid] = running_sum / count
    
    # Print after the new centroids have been computed
    print("New centroids have been created. Here they are:")
    print(centroids)
    print()

    outfile3.write("New centroids have been created. Here they are:" + "\n")
    outfile3.write(str(centroids))
    outfile3.write("\n")
    outfile3.write("\n")

    # Check if cluster centroids are the same
    centroids_the_same = np.array_equal(old_centroids,centroids)

    if centroids_the_same:
        print(
        "Cluster membership is unchanged. Stopping criteria has been met.")
        outfile3.write("Cluster membership is unchanged. ")
        outfile3.write("Stopping criteria has been met." + "\n")
        outfile3.write("\n")

    # Update the number of iterations
    max_iterations -= 1

# Record the actual class column name
actual_class_col_name = pd_full_data_set.columns[len(
    pd_full_data_set.columns) - 1]

# Add 4 additional columns to the original data frame
pd_full_data_set = pd_full_data_set.reindex(
      columns=[*pd_full_data_set.columns.tolist(
      ), 'Cluster', 'Silhouette Coefficient', 'Predicted Class', (
      'Prediction Correct?')])

# Add the final cluster assignments to the Pandas dataframe
pd_full_data_set['Cluster'] = cluster_assignments

outfile3.write("Calculating the Silhouette Coefficients. Please wait..." + "\n")
outfile3.write("\n")
print()
print("Calculating the Silhouette Coefficients. Please wait...")
print()
################## Calculate the Silhouette Coefficients ######################
# Rewards clusterings that have good cohesion and good separation. Varies 
# between 1 and -1. -1 means bad clustering, 1 means great clustering.

# 1. For each instance calculate the average distance to all other instances 
# in that cluster. This is a.
# 2. (Find the average distance to all the instances in the nearest neighbor 
# cluster). For each instance and any cluster that does not contain the 
# instance calculate the average distance to all
# of the points in that other cluster. Then return the minimum such value
# over all of the clusters. This is b.
# 3. For each instance calculate the Silhouette Coefficient s where
# s = (b-a)/max(a,b)
# Store the value in the data frame

silhouette_column = actual_class_column + 2

# Go through one instance at a time
for row in range(0, no_of_instances):

    this_instance = np_data_set[row]
    this_cluster = cluster_assignments[row]

    a = 0.0
    running_sum = 0.0
    counter = 0.0

    # Calculate the average distance to all other instances 
    # in this cluster. This is a.
    # Go through one instance at a time
    for row_2 in range(0, no_of_instances):

        # Skip the instance itself; a only averages over the *other*
        # instances in the same cluster
        if row_2 == row:
            continue

        # If the other instance is in the same cluster as this instance
        if this_cluster == cluster_assignments[row_2]:

            # Calculate the distance
            distance = np.linalg.norm(this_instance - np_data_set[row_2])

            # Add the distance to the running sum
            running_sum += distance
            counter += 1

    # Calculate the value for a (stays 0 if this instance is alone
    # in its cluster)
    if counter > 0:
        a = running_sum / counter

    # For each instance and any cluster that does not contain the 
    # instance calculate the average distance to all
    # of the points in that other cluster. Then return the minimum such value
    # over all of the clusters. This is b.
    b = float("inf") 
    
    for clstr in range(0, K):

        running_sum = 0.0
        counter = 0.0

        # Must be other clusters, not the one this instance is in
        if clstr != this_cluster:

            # Calculate the average distance to instances in that 
            # other cluster
            for row_3 in range(0, no_of_instances):

                if cluster_assignments[row_3] == clstr:

                    # Calculate the distance
                    distance = np.linalg.norm(this_instance - np_data_set[
                        row_3])

                    # Add the distance to the running sum
                    running_sum += distance
                    counter += 1
        
            if counter > 0:
                avg_distance_to_cluster = running_sum / counter

                # Update b if we have a new minimum (only checked when this
                # other cluster actually has members)
                if avg_distance_to_cluster < b:
                    b = avg_distance_to_cluster

    # Calculate the Silhouette Coefficient s where s = (b-a)/max(a,b)
    s = (b - a) / max(a,b)

    # Store the Silhouette Coefficient in the Pandas data frame
    pd_full_data_set.iloc[row,silhouette_column] = s

#################### Predict the Class #######################################
# For each cluster, determine the predominant class and assign that 
# class to the cluster. Then determine if the prediction was correct.
# Create a data frame that maps clusters to actual classes
class_mappings = pd.DataFrame(index=range(K),columns=range(1))

for clstr in range(0, K):

    # Select rows whose column equals that cluster value
    temp_df = pd_full_data_set.loc[pd_full_data_set['Cluster'] == clstr]
    
    # Select the predominant class
    class_mappings.iloc[clstr,0] = temp_df.mode()[actual_class_col_name][0]

cluster_column = actual_class_column + 1
pred_class_column = actual_class_column + 3
pred_correct_column = actual_class_column + 4

# Assign the relevant class to each instance
# See if prediction was correct
for row in range(0, no_of_instances):

    # Go through each of the clusters to check if the instance is a member
    # of that cluster
    for clstr in range(0, K):
        if clstr == pd_full_data_set.iloc[row,cluster_column]:

            # Assign the relevant class to this instance
            pd_full_data_set.iloc[
                row,pred_class_column] = class_mappings.iloc[clstr,0]

    # If the prediction was correct
    if pd_full_data_set.iloc[row,pred_class_column] == pd_full_data_set.iloc[
        row,actual_class_column]:
        pd_full_data_set.iloc[row,pred_correct_column] = 1
    else: # If incorrect prediction
        pd_full_data_set.iloc[row,pred_correct_column] = 0

# Write dataframe to a file
pd_full_data_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Print data frame to the console
print()
print()
print("Data Set")
print(pd_full_data_set)
print()
print()

################### Summary Statistics #######################################
# Calculate the average Silhouette Coefficient for the data set
# Calculate the accuracy of the clustering-based classifier

# Open a new file to save the summary statistics
outfile1 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile1.write("\n")
outfile1.write("Number of Instances : " + str(no_of_instances) + "\n")
outfile1.write("\n")
outfile1.write("Value for k : " + str(K) + "\n")

# Calculate average Silhouette Coefficient for the data set
silhouette_coefficient = pd_full_data_set.loc[
    :,"Silhouette Coefficient"].mean()

# Write the Silhouette Coefficient to the file
outfile1.write("Silhouette Coefficient : " + str(
    silhouette_coefficient) + "\n")
      
# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (pd_full_data_set.iloc[
        :,pred_correct_column].sum())/no_of_instances

accuracy *= 100

# Write accuracy to the file
outfile1.write("Accuracy : " + str(accuracy) + "%\n")

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Instances : " + str(no_of_instances))
print("Value for k : " + str(K))

# Print the Silhouette Coefficient to the console
print("Silhouette Coefficient : " + str(
    silhouette_coefficient))

# Print accuracy to the console
print("Accuracy : " + str(accuracy) + "%")

# Close the files
outfile1.close()
outfile3.close()


Output Statistics of K-means Clustering on the Iris Data Set

This section shows the results for the runs of K-means Clustering on the iris data set.

Test Statistics

[Image: k-means test statistics]

Trace Runs

Here are the trace runs of the algorithm.

Output

Here is the output of the algorithm.


Stepwise Forward Selection Algorithm From Scratch

In this post, I will walk you through the Stepwise Forward Selection algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python and use it for feature selection for the Naive Bayes algorithm we previously developed. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!


What is the Stepwise Forward Selection Algorithm?

One of the fundamental ideas in machine learning is the Curse of Dimensionality – as the number of attributes in the training set grows, you need exponentially more instances to learn the underlying concept. And the more instances you have, the more computational time is required to process the data.

In other words, adding attributes improves performance, to a point. Beyond that point, you need to have exponentially more instances in order for performance to improve. This phenomenon is commonly known as the Hughes Effect (or Hughes Phenomenon).

[Image: graph of mean accuracy versus number of attributes for different numbers of instances (the Hughes Effect)]

Image Source: Hughes, G. (1968). On the Mean Accuracy Of Statistical Pattern Recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.

In the above image, n = number of attributes and m = number of instances.

One technique for combating the Curse of Dimensionality is Stepwise Forward Selection (SFS). SFS involves selecting only the most relevant attributes for learning and discarding the rest. The metric that decides which attribute is “most relevant” is chosen by the programmer. In this project, I will use the classification accuracy of Naïve Bayes.

Because SFS is just a method to create an optimal attribute subset, it needs to be wrapped with another algorithm (in this case Naïve Bayes) that does the actual learning. For this reason, SFS is known as a wrapper method.

The goal of SFS is to reduce the number of features that a machine learning algorithm needs to examine by eliminating unnecessary features and finding an optimal attribute subset.

Suppose we have attributes A1 through Ak which are used to describe a data set D.

A = <A1, …, Ak>

SFS begins with an empty attribute set:

A0 = <>

It then selects attributes from A, one at a time, and adds them to the subset until adding another attribute fails to improve the performance (e.g. as measured by classification accuracy) of the target learning algorithm (e.g. Naïve Bayes) on the training data. 
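
Stripped of all the file handling, the greedy loop can be sketched in a few lines of Python. Here, score is a hypothetical callback (not part of the full code below) that trains the wrapped learner, e.g. Naïve Bayes, on a given attribute subset and returns its classification accuracy:

from typing import Callable, List

def stepwise_forward_selection(
        attributes: List[str],
        score: Callable[[List[str]], float]) -> List[str]:
    """Greedy SFS: grow the attribute subset while the score keeps improving."""
    selected: List[str] = []
    remaining = list(attributes)
    base_performance = float("-inf")

    while remaining:
        # Try adding each remaining attribute and remember the best one
        best_attr, best_performance = None, float("-inf")
        for attr in remaining:
            performance = score(selected + [attr])
            if performance > best_performance:
                best_attr, best_performance = attr, performance

        # Stop as soon as adding another attribute no longer helps
        if best_performance <= base_performance:
            break
        selected.append(best_attr)
        remaining.remove(best_attr)
        base_performance = best_performance

    return selected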


Stepwise Forward Selection Implementation

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class

The program I developed (code is below) then adds 2 additional columns for the testing set.

  • N + 1: Predicted Class
  • N + 2: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

[Image: iris data set formatted with the required columns]

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set. Here is what an iris flower looks like in case you are interested:

[Image: an iris flower]

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1; otherwise, set the value to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

[Image: iris data set after binarization]

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.


Stepwise Forward Selection Algorithm in Python, Coded From Scratch

Naïve Bayes was wrapped with SFS for feature selection. Classification accuracy was used to select the optimal attribute at each step. Once the optimal attribute subset was obtained, Naïve Bayes was run on the complete training set (67% of instances) and testing set (33% of instances), and classification accuracy was calculated.

Here is the code. Just copy and paste it into your favorite IDE. Don’t be scared by how long the code is. I use a lot of comments so that you know what is going on at each step:

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: sfs_naive_bayes.py
# Author: Addison Sears-Collins
# Date created: 6/10/2019
# Python version: 3.7
# Description: Implementation of Naive Bayes which uses Stepwise Forward 
# Selection (SFS) for feature selection. This code works for multi-class 
# classification problems (e.g. democrat/republican/independent)
# Calculate P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0) and
# P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1) and
# P(E1|CL2)P(E2|CL2)P(E3|CL2)...P(E#|CL2) * P(CL2), etc. and predict the
# class with the maximum result. 
# E is an attribute, and CL means class.
# Only need class prior probability and likelihoods to make a prediction
# (i.e. the numerator of Bayes formula) since denominators are same for both 
# the P(CL0|E1,E2,E3...)*P(CL0) and P(CL1|E1,E2,E3...)*P(CL1), etc. cases 
# where P means "probability of" and | means "given".

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class 

# This program then adds 2 additional columns for the testing set.
# N + 1: Predicted Class
# N + 2: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "SFS-Wrapped Naive Bayes"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_naive_bayes_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_naive_bayes_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_naive_bayes_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
TRAINING_DATA_PRCT = 0.67 # % of data set used for training. Default 0.67
testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing
SEED = 99  # SEED for the random number generator. Default: 99
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Naive Bayes algorithm. Accepts a Pandas dataframe as its parameter
def naive_bayes(pd_data):
    # Create a training dataframe by sampling random instances from original data.
    # random_state guarantees that the pseudo-random number generator generates 
    # the same sequence of random numbers each time.
    pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED)

    # Create a testing dataframe. Dropping the training data from the original
    # dataframe ensures training and testing dataframes have different instances
    pd_testing_data = pd_data.drop(pd_training_data.index)

    ######COMMENT OUT THESE TWO LINES BEFORE RUNNING ################
    #pd_training_data = pd_data.iloc[:20] # Used for testing only, gets 1st 20 rows
    #pd_testing_data = pd_data.iloc[20:22,:] # Used for testing only, rows 20 & 21

    # Calculate the number of instances, columns, and attributes in the
    # training data set. Assumes 1 column for the instance ID and 1 column
    # for the class. Record the index of the column that contains 
    # the actual class
    no_of_instances_train = len(pd_training_data.index) # number of rows
    no_of_columns_train = len(pd_training_data.columns) # number of columns
    no_of_attributes = no_of_columns_train - 2
    actual_class_column = no_of_columns_train - 1

    # Store class values in a column, sort them, then create a list of unique
    # classes and store in a dataframe and a Numpy array
    unique_class_list_df = pd_training_data.iloc[:,actual_class_column]
    unique_class_list_df = unique_class_list_df.sort_values()
    unique_class_list_np = unique_class_list_df.unique() #Numpy array
    unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

    # Record the number of unique classes in the data set
    num_unique_classes = len(unique_class_list_df)

    # Record the frequency counts of each class in a Numpy array
    freq_cnt_class = pd_training_data.iloc[:,actual_class_column].value_counts(
        sort=True)

    # Record the frequency percentages of each class in a Numpy array
    # This is a list of the class prior probabilities
    class_prior_probs = pd_training_data.iloc[:,actual_class_column].value_counts(
        normalize=True, sort=True)

    # Add 2 additional columns to the testing dataframe
    pd_testing_data = pd_testing_data.reindex(
                      columns=[*pd_testing_data.columns.tolist(
                      ), 'Predicted Class', 'Prediction Correct?'])

    # Calculate the number of instances and columns in the
    # testing data set. Record the index of the column that contains the 
    # predicted class and prediction correctness (1 if yes; 0 if no)
    no_of_instances_test = len(pd_testing_data.index) # number of rows
    no_of_columns_test = len(pd_testing_data.columns) # number of columns
    predicted_class_column = no_of_columns_test - 2
    prediction_correct_column = no_of_columns_test - 1

    ######################### Training Phase of the Model ########################
    # Create an empty dictionary
    my_dict = {}

    # Calculate the likelihood tables for each attribute. If an attribute has
    # four levels, there are (# of unique classes x 4) different probabilities 
    # that need to be calculated for that attribute.
    # Start on the first attribute and make your way through all the attributes
    for col in range(1, no_of_attributes + 1):

        # Record the name of this column 
        colname = pd_training_data.columns[col]

        # Create a dataframe containing the unique values in the column
        unique_attribute_values_df = pd_training_data[colname].drop_duplicates()
        # Create a Numpy array containing the unique values in the column
        unique_attribute_values_np = pd_training_data[colname].unique()
    
        # Calculate likelihood of the attribute given each unique class value
        for class_index in range (0, num_unique_classes):
        
            # For each unique attribute value, calculate the likelihoods 
            # for each class
            for attr_val in range (0, unique_attribute_values_np.size) :
                running_sum = 0

                # Calculate N(unique attribute value && class value)
                # Where N means "number of" and && means "and"
                # Go through each row of the training set
                for row in range(0, no_of_instances_train):
                    if (pd_training_data.iloc[row,col] == (
                        unique_attribute_values_df.iloc[attr_val])) and (
                        pd_training_data.iloc[row, actual_class_column] == (
                        unique_class_list_df.iloc[class_index])):
                            running_sum += 1

                # With N(unique attribute value && class value) as the numerator
                # we now need to divide by the total number of times the class
                # appeared in the data set (look the count up by class label so
                # it matches this class rather than a frequency-sorted position)
                try:
                    denominator = freq_cnt_class[
                        unique_class_list_df.iloc[class_index]]
                except:
                    denominator = 1.0
            
                likelihood = running_sum / denominator
            
                # Add a new likelihood to the dictionary
                # Format of search key is 
                # <attribute_name><attribute_value><class_value>
                search_key = str(colname) + str(unique_attribute_values_df.iloc[
                             attr_val]) + str(unique_class_list_df.iloc[
                             class_index])
                my_dict[search_key] = likelihood
 
    # Print the likelihood table to the console
    # print(pd.DataFrame.from_dict(my_dict, orient='index'))

    ################# End of Training Phase of the Naive Bayes Model ########

    ################# Testing Phase of the Naive Bayes Model ################

    # Proceed one instance at a time and calculate the prediction
    for row in range(0, no_of_instances_test):

        # Initialize the prediction outcome
        predicted_class = unique_class_list_df.iloc[0]
        max_numerator_of_bayes = 0.0

        # Calculate the Bayes equation numerator for each test instance
        # That is: P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0),
        # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1)...
        for class_index in range (0, num_unique_classes):

            # Reset the running product with the class
            # prior probability, P(CL), looked up by class label so it
            # matches this class rather than a frequency-sorted position
            try:
                running_product = class_prior_probs[
                    unique_class_list_df.iloc[class_index]]
            except:
                running_product = 0.0000001 # Class not found in data set
        
            # Calculation of P(CL) * P(E1|CL) * P(E2|CL) * P(E3|CL)...
            # Format of search key is 
            # <attribute_name><attribute_value><class_value>
            # Record each search key value
            for col in range(1, no_of_attributes + 1):
                attribute_name = pd_testing_data.columns[col]
                attribute_value = pd_testing_data.iloc[row,col]
                class_value = unique_class_list_df.iloc[class_index]

                # Set the search key
                key = str(attribute_name) + str(
                          attribute_value) + str(class_value)

                # Update the running product
                try:
                    running_product *= my_dict[key]
                except:
                    running_product *= 0

            # Record the prediction if we have a new max
            # Bayes numerator
            if running_product > max_numerator_of_bayes:
                max_numerator_of_bayes = running_product
                predicted_class = unique_class_list_df.iloc[
                             class_index] # New predicted class

        # Store the prediction in the dataframe
        pd_testing_data.iloc[row,predicted_class_column] = predicted_class
    
        # Store if the prediction was correct
        if predicted_class == pd_testing_data.iloc[row,actual_class_column]:
            pd_testing_data.iloc[row,prediction_correct_column] = 1
        else: 
            pd_testing_data.iloc[row,prediction_correct_column] = 0

    print("-------------------------------------------------------")
    print("Learned Model Predictions on Testing Data Set")
    print("-------------------------------------------------------")

    # Print the revamped dataframe
    print(pd_testing_data)

    # accuracy = (total correct predictions)/(total number of predictions)
    accuracy = (pd_testing_data.iloc[
        :,prediction_correct_column].sum())/no_of_instances_test
    # Return classification accuracy
    return accuracy
    ####################### End Testing Phase ######################################    

# Stepwise forward selection method accepts a Pandas dataframe as a parameter
# and then returns the optimal dataframe from that dataframe.
def sfs(pd_data):

    # Record the name of the column that contains the instance IDs and the class
    instance_id_colname = pd_data.columns[0]
    no_of_columns = len(pd_data.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = pd_data.columns[class_column_index]

    # Record the number of available attributes
    no_of_available_attributes = no_of_columns - 2

    # Create a dataframe containing the available attributes by removing
    # the Instance and the Class Column
    available_attributes_df = pd_data.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Create an empty optimal attribute dataframe containing only the
    # Instance and the Class Columns (copy it so the inserts below do not
    # trigger chained-assignment warnings on a slice of pd_data)
    optimal_attributes_df = pd_data[[
        instance_id_colname,class_column_colname]].copy()

    # Set the base performance to a really low number
    base_performance = -9999.0

    # Check whether adding a new attribute to the optimal attributes dataframe
    # improves performance
    # While there are still available attributes left
    while no_of_available_attributes > 0: 
        # Set the best performance to a low number
        best_performance = -9999.0

        # Initialize the best attribute variable to a placeholder
        best_attribute = "Placeholder"

        # For all attributes in the available attribute data frame
        for col in range(0, len(available_attributes_df.columns)):

            # Record the name of this attribute
            this_attr = available_attributes_df.columns[col]
        
            # Create a new dataframe with this attribute inserted
            temp_opt_attr_df = optimal_attributes_df.copy()
            temp_opt_attr_df.insert(
                loc=1,column=this_attr,value=(available_attributes_df[this_attr]))

            # Run Naive Bayes on this new dataframe and return the 
            # classification accuracy
            current_performance = naive_bayes(temp_opt_attr_df)

            # Find the new attribute that yielded the greatest
            # classification accuracy
            if current_performance > best_performance:
                best_performance = current_performance
                best_attribute = this_attr

        # Did adding another feature lead to improvement?
        if best_performance > base_performance:
            base_performance = best_performance

            # Add the best attribute to the optimal attribute data frame
            optimal_attributes_df.insert(
                loc=1,column=best_attribute,value=(
                available_attributes_df[best_attribute]))

            # Remove the best attribute from the available attribute data frame
            available_attributes_df = available_attributes_df.drop(
                columns = [best_attribute]) 

            # Print the best attribute to the console
            print()
            print(str(best_attribute) + " added to the optimal attribute subset")
            print()

            # Write to file
            outfile3.write(str(
                best_attribute) + (
                " added to the optimal attribute subset") + "\n")
           
            # Decrement the number of available attributes by 1
            no_of_available_attributes -= 1

            # Print number of attributes remaining to the console
            print()
            print(str(no_of_available_attributes) + " attributes remaining")
            print()
            print()

            # Write to file
            outfile3.write(str(no_of_available_attributes) + (
                " attributes remaining") + "\n" + "\n")
           
        else:
            print()
            print("Performance did not improve this round.")
            print("End of Stepwise Forward Selection.")
            print()
            outfile3.write("Performance did not improve this round." + "\n")
            outfile3.write("End of Stepwise Forward Selection." + "\n")
            break

    # Return the optimal attribute set
    return optimal_attributes_df


def main():
    # Read the full text file and store records in a Pandas dataframe
    pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

    # Used for feature selection of large data sets only
    #pd_full_data_set = pd_full_data_set.sample(frac=0.10, random_state=SEED)
    
    # Runs SFS on the full data set to find the optimal attribute data frame
    optimal_attribute_set = sfs(pd_full_data_set)

    # Write dataframe to a file
    optimal_attribute_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

    # Open a new file to save the summary statistics
    outfile2 = open(TEST_STATS_FILE,"w") 

    # Write to a file
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
    outfile2.write("----------------------------------------------------------\n")
    outfile2.write("Data Set : " + DATA_PATH + "\n")

    # Write the relevant stats to a file
    outfile2.write("\n")
    outfile2.write("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)) + "\n")
       
    # Run Naive Bayes on the optimal attribute set
    accuracy = naive_bayes(optimal_attribute_set)
    accuracy *= 100

    # Write accuracy to the file
    outfile2.write("Accuracy : " + str(accuracy) + "%\n")

    # Print statistics to console
    print()
    print()
    print("-------------------------------------------------------")
    print(ALGORITHM_NAME + " Summary Statistics (Testing)")
    print("-------------------------------------------------------")
    print("Data Set : " + DATA_PATH)

    # Print the relevant stats to the console
    print()
    print("Number of Test Instances : " + 
        str(round(len(optimal_attribute_set.index) * testing_data_prct)))

    # Print accuracy to the console
    print()
    print("Accuracy : " + str(accuracy) + "%")
    print()

    # Create a dataframe containing the optimal attributes by removing
    # the Instance and the Class Column
    instance_id_colname = optimal_attribute_set.columns[0]
    no_of_columns = len(optimal_attribute_set.columns) # number of columns
    class_column_index = no_of_columns - 1
    class_column_colname = optimal_attribute_set.columns[class_column_index]
    optimal_attribute_set = optimal_attribute_set.drop(columns = [
        instance_id_colname, class_column_colname]) 

    # Write a list of the optimal attribute set to a file
    outfile2.write("Optimal Attribute Subset : " + str(list(
        optimal_attribute_set)))

    # Print a list of the optimal attribute set
    print("Optimal Attribute Subset : " + str(list(optimal_attribute_set)))
    print()
    print()   
    
    # Close the files
    outfile2.close()
    outfile3.close()

main() # Call the main function


Output Statistics of Stepwise Forward Selection on the Iris Data Set

This section shows the results for the runs of Stepwise Forward Selection wrapped with Naïve Bayes on the iris data set.

Test Statistics

[Image: SFS test statistics]

Trace Runs

[Image: SFS trace run output]

Here is a link to the output file.
