K-means Clustering Algorithm From Scratch | Machine Learning

grapes_grapevine_agriculture_winery_5

In this post, I will walk you through the k-means clustering algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

Table of Contents

What is the K-means Clustering Algorithm?

The K-means algorithm is an unsupervised clustering algorithm (the target variable we want to predict is not available) that is used to group unlabeled data set instances into clusters based on similar attributes.

Step 1: Choose a Value for K and Select the Initial Centroids

The first step of K-means is to determine the valve for k, the number of clusters you want to group your data set into. We then randomly select k cluster centroids (i.e. k random instances) from the data set, where k is some positive integer.

Step 2: Cluster Assignment

The next step is the cluster assignment step. This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it.

Step 3: Move Centroid

The next part is the move centroid step. We move each of the k centroids to the average of all the instances that were assigned to that cluster centroid.

Step 4: Repeat Steps 2 and 3 Until Centroids Stop Changing

We keep running the cluster assignment and move centroid steps until the cluster centroids stop changing. At that point, the data is deemed clustered. 

Andrew Ng, a professor of machine learning at Stanford University, does a good job of showing you visually how k-means clustering works:

Return to Table of Contents

K-means Clustering Algorithm Design

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

  • 0: Instance ID
  • 1: Attribute 1
  • 2: Attribute 2
  • 3: Attribute 3
  •  …
  • N: Actual Class (used for classification accuracy calculation)

The program then adds 4 additional columns.

  • N + 1: Cluster
  • N + 2: Silhouette Coefficient
  • N + 3: Predicted Class
  • N + 4: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

iris-data-set

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set.

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:

=IF(B2>AVERAGE(B$2:B$151),1,0)

Here is how the data looks after the binarization:

binarization-kmeans

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

Return to Table of Contents

K-means Clustering Algorithm in Python, Coded From Scratch

K-means appears to be particularly sensitive to the starting centroids. The starting centroids for the k clusters were chosen at random. When these centroids started out poor, the algorithm took longer to converge to a solution. Future work would be to fine-tune the initial centroid selection process. 

Here is the full code for k-means clustering. Don’t be scared at how long this code is. I include a bunch of comments at each step so that you know what is going on. Just copy and paste this into your favorite IDE and run it!

import pandas as pd # Import Pandas library 
import numpy as np # Import Numpy library

# File name: kmeans.py
# Author: Addison Sears-Collins
# Date created: 6/12/2019
# Python version: 3.7
# Description: Implementation of K-means clustering algorithm from scratch.
# K-means algorithm is a clustering algorithm that is used to group 
# unlabeled data set instances into clusters based on similar attributes.

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class (used for classification accuracy calculation)

# This program then adds 4 additional columns.
# N + 1: Cluster
# N + 2: Silhouette Coefficient
# N + 3: Predicted Class
# N + 4: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-means"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_kmeans_test_stats.txt"#Testing statistics
TEST_OUT_FILE = "iris_dataset_kmeans_test_out.txt" # Testing output
# Show functioning of the program
TRACE_RUNS_FILE  = "iris_dataset_kmeans_trace_runs.txt" 
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE,"w") 

# Read the full text file and store records in a Pandas dataframe
pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Copy the dataframe into a new dataframe so we don't mess up the
# original data
pd_data_set = pd_full_data_set.copy() 

# Calculate the number of instances, columns, and attributes in the
# training data set. Assumes 1 column for the instance ID and 1 column
# for the class. Record the index of the column that contains 
# the actual class
no_of_instances = len(pd_data_set.index) # number of rows
no_of_columns = len(pd_data_set.columns) # number of columns
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Store class values in a column and then create a list of unique
# classes and store in a dataframe and a Numpy array
unique_class_list_df = pd_data_set.iloc[:,actual_class_column]
unique_class_list_np = unique_class_list_df.unique() #Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df

# Record the number of unique classes in the data set
num_unique_classes = len(unique_class_list_df)

# Record the value for K, the number of clusters
K = num_unique_classes

# Remove the Instance and the Actual Class Column to create an unlabled
# data set
instance_id_colname = pd_data_set.columns[0]
class_column_colname = pd_data_set.columns[actual_class_column]
pd_data_set = pd_data_set.drop(columns = [ # Each row is a different instance
        instance_id_colname, class_column_colname]) 

# Convert dataframe into a Numpy array
np_data_set = pd_data_set.to_numpy(copy=True)

# Randomly select k instances from the data set. 
# These will be the cluster centroids for the first iteration
# of the algorithm.
centroids = np_data_set[np.random.choice(np_data_set.shape[
    0], size=K, replace=False), :]


##################### Cluster Assignment Step ################################
# Go through each instance and assign that instance to the closest 
# centroid (based on Euclidean distance).

# Initialize an array which will contain the cluster assignments for each
# instance.
cluster_assignments = np.empty(no_of_instances)

# Goes True if new centroids are the same as the old centroids
centroids_the_same = False

# Sets the maximum number of iterations
max_iterations = 300

while max_iterations > 0 and not(centroids_the_same):
    # Go through each data point and assign it to the nearest centroid
    for row in range(0, no_of_instances):
    
        this_instance = np_data_set[row]

        # Calculate the Euclidean distance of each instance in the data set
        # from each of the centroids
        # Find the centroid with the minimum distance and assign the instance
        # to that centroid.
        # Record that centroid in the cluster assignments array.
    
        # Reset the minimum distance to infinity
        min_distance = float("inf")

        for row_centroid in range(0, K):
            this_centroid = centroids[row_centroid]
        
            # Calculate the Euclidean distance from this instance to the
            # centroid
            distance = np.linalg.norm(this_instance - this_centroid)

            # If we have a centroid that is closer to this instance,
            # update the cluster assignment for this instance.
            if distance < min_distance:
                cluster_assignments[row] = row_centroid
                min_distance = distance # Update the minimum distance

    # Print after each cluster assignment has completed
    print("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:")
    print(cluster_assignments)
    print()
    print("Now calculating the new centroids...")
    print()

    outfile3.write("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:"+ "\n")
    outfile3.write(str(cluster_assignments))
    outfile3.write("\n")
    outfile3.write("\n")
    outfile3.write("Now calculating the new centroids..." + "\n")
    outfile3.write("\n")


    ##################### Move Centroid Step ################################
    # Calculate the centroids of the clusters by computing the average
    # of the attribute values of the instances in each cluster
    # For each row in the centroids 2D array

    # Store the old centroids
    old_centroids = centroids.copy()

    for row_centroid in range(0, K):

        # For each column of each row of the centroids 2D array
        for col_centroid in range(0, no_of_attributes):

            # Reset the running sum and the counter
            running_sum = 0.0
            count = 0.0
            average = None

            for row in range(0, no_of_instances):

                # If this instance belongs to this cluster
                if(row_centroid == cluster_assignments[row]):
                
                    # Add this value to the running sum
                    running_sum += np_data_set[row,col_centroid]

                    # Increment the counter
                    count += 1
        
                    if (count > 0):
                        # Calculate the average
                        average = running_sum / count

            # Update the centroids array with this average
            centroids[row_centroid,col_centroid] = average
    
    # Print to after each cluster assignment has completed
    print("New centroids have been created. Here they are:")
    print(centroids)
    print()

    outfile3.write("New centroids have been created. Here they are:" + "\n")
    outfile3.write(str(centroids))
    outfile3.write("\n")
    outfile3.write("\n")

    # Check if cluster centroids are the same
    centroids_the_same = np.array_equal(old_centroids,centroids)

    if centroids_the_same:
        print(
        "Cluster membership is unchanged. Stopping criteria has been met.")
        outfile3.write("Cluster membership is unchanged. ")
        outfile3.write("Stopping criteria has been met." + "\n")
        outfile3.write("\n")

    # Update the number of iterations
    max_iterations -= 1

# Record the actual class column name
actual_class_col_name = pd_full_data_set.columns[len(
    pd_full_data_set.columns) - 1]

# Add 4 additional columns to the original data frame
pd_full_data_set = pd_full_data_set.reindex(
      columns=[*pd_full_data_set.columns.tolist(
      ), 'Cluster', 'Silhouette Coefficient', 'Predicted Class', (
      'Prediction Correct?')])

# Add the final cluster assignments to the Pandas dataframe
pd_full_data_set['Cluster'] = cluster_assignments

outfile3.write("Calculating the Silhouette Coefficients. Please wait..." + "\n")
outfile3.write("\n")
print()
print("Calculating the Silhouette Coefficients. Please wait...")
print()
################## Calculate the Silhouette Coefficients ######################
# Rewards clusterings that have good cohesion and good separation. Varies 
# between 1 and -1. -1 means bad clustering, 1 means great clustering.

# 1. For each instance calculate the average distance to all other instances 
# in that cluster. This is a.
# 2. (Find the average distance to all the instances in the nearest neighbor 
# cluster). For each instance and any cluster that does not contain the 
# instance calculate the average distance to all
# of the points in that other cluster. Then return the minimum such value
# over all of the clusters. This is b.
# 3. For each instance calculate the Silhouette Coefficient s where
# s = (b-a)/max(a,b)
# Store the value in the data frame

silhouette_column = actual_class_column + 2

# Go through one instance at a time
for row in range(0, no_of_instances):

    this_instance = np_data_set[row]
    this_cluster = cluster_assignments[row]

    a = None
    running_sum = 0.0
    counter = 0.0

    # Calculate the average distance to all other instances 
    # in this cluster. This is a.
    # Go through one instance at a time
    for row_2 in range(0, no_of_instances):

        # If the other instance is in the same cluster as this instance
        if this_cluster == cluster_assignments[row_2]:

            # Calculate the distance
            distance = np.linalg.norm(this_instance - np_data_set[row_2])

            # Add the distance to the running sum
            running_sum += distance
            counter += 1

    # Calculate the value for a
    if counter > 0:
        a = running_sum / counter

    # For each instance and any cluster that does not contain the 
    # instance calculate the average distance to all
    # of the points in that other cluster. Then return the minimum such value
    # over all of the clusters. This is b.
    b = float("inf") 
    
    for clstr in range(0, K):

        running_sum = 0.0
        counter = 0.0

        # Must be other clusters, not the one this instance is in
        if clstr != this_cluster:

            # Calculate the average distance to instances in that 
            # other cluster
            for row_3 in range(0, no_of_instances):

                if cluster_assignments[row_3] == clstr:

                    # Calculate the distance
                    distance = np.linalg.norm(this_instance - np_data_set[
                        row_3])

                    # Add the distance to the running sum
                    running_sum += distance
                    counter += 1
        
            if counter > 0:
                avg_distance_to_cluster = running_sum / counter
        
            # Update b if we have a new minimum
            if avg_distance_to_cluster < b:
                b = avg_distance_to_cluster

    # Calculate the Silhouette Coefficient s where s = (b-a)/max(a,b)
    s = (b - a) / max(a,b)

    # Store the Silhouette Coefficient in the Pandas data frame
    pd_full_data_set.iloc[row,silhouette_column] = s

#################### Predict the Class #######################################
# For each cluster, determine the predominant class and assign that 
# class to the cluster. Then determine if the prediction was correct.
# Create a data frame that maps clusters to actual classes
class_mappings = pd.DataFrame(index=range(K),columns=range(1))

for clstr in range(0, K):

    # Select rows whose column equals that cluster value
    temp_df = pd_full_data_set.loc[pd_full_data_set['Cluster'] == clstr]
    
    # Select the predominant class
    class_mappings.iloc[clstr,0] = temp_df.mode()[actual_class_col_name][0]

cluster_column = actual_class_column + 1
pred_class_column = actual_class_column + 3
pred_correct_column = actual_class_column + 4

# Assign the relevant class to each instance
# See if prediction was correct
for row in range(0, no_of_instances):

    # Go through each of the clusters to check if the instance is a member
    # of that cluster
    for clstr in range(0, K):
        if clstr == pd_full_data_set.iloc[row,cluster_column]:

            # Assign the relevant class to this instance
            pd_full_data_set.iloc[
                row,pred_class_column] = class_mappings.iloc[clstr,0]

    # If the prediction was correct
    if pd_full_data_set.iloc[row,pred_class_column] == pd_full_data_set.iloc[
        row,actual_class_column]:
        pd_full_data_set.iloc[row,pred_correct_column] = 1
    else: # If incorrect prediction
        pd_full_data_set.iloc[row,pred_correct_column] = 0

# Write dataframe to a file
pd_full_data_set.to_csv(TEST_OUT_FILE, sep=",", header=True)

# Print data frame to the console
print()
print()
print("Data Set")
print(pd_full_data_set)
print()
print()

################### Summary Statistics #######################################
# Calculate the average Silhouette Coefficient for the data set
# Calculate the accuracy of the clustering-based classifier

# Open a new file to save the summary statistics
outfile1 = open(TEST_STATS_FILE,"w") 

# Write to a file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to a file
outfile1.write("\n")
outfile1.write("Number of Instances : " + str(no_of_instances) + "\n")
outfile1.write("\n")
outfile1.write("Value for k : " + str(K) + "\n")

# Calculate average Silhouette Coefficient for the data set
silhouette_coefficient = pd_full_data_set.loc[
    :,"Silhouette Coefficient"].mean()

# Write the Silhouette Coefficient to the file
outfile1.write("Silhouette Coefficient : " + str(
    silhouette_coefficient) + "\n")
      
# accuracy = (total correct predictions)/(total number of predictions)
accuracy = (pd_full_data_set.iloc[
        :,pred_correct_column].sum())/no_of_instances

accuracy *= 100

# Write accuracy to the file
outfile1.write("Accuracy : " + str(accuracy) + "%\n")

# Print statistics to console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Instances : " + str(no_of_instances))
print("Value for k : " + str(K))

# Print the Silhouette Coefficient to the console
print("Silhouette Coefficient : " + str(
    silhouette_coefficient))

# Print accuracy to the console
print("Accuracy : " + str(accuracy) + "%")

# Close the files
outfile1.close()
outfile3.close()

Return to Table of Contents

Output Statistics of K-means Clustering on the Iris Data Set

This section shows the results for the runs of K-means Clustering on the iris data set.

Test Statistics

test-stats-kmeans

Trace Runs

Here are the trace runs of the algorithm.

Output

Here is the output of the algorithm.

Return to Table of Contents