## Difference Between Supervised and Unsupervised Learning

In this post, I will explain the difference between supervised and unsupervised learning.

## What is Supervised Learning?

Imagine you have a computer. The computer is really good at doing math and making complex calculations. You want to “train” your computer to predict the price for any home in the United States. So you search around the Internet and find a dataset. The dataset contains the following information for 100,000 houses that sold during the last 30 days in various cities across the United States:

1. Square footage
2. Number of bathrooms
3. Number of bedrooms
4. Number of garages
5. Year the house was constructed
6. Average size of the house’s windows
7. Sale price

Variables 1 through 6 above are the dataset’s features (also known as input variables, attributes, independent variables, etc.). Variable 7, the house sale price, is the output variable (or target variable) that we want our computer to get good at predicting.

So the question is…how do we train our computer to be a good house price predictor?  We need to write a software program. The program needs to take as input the 100,000-house dataset that I mentioned earlier. This program needs to then find a mathematical relationship between variables 1-6 (features) and variable 7 (output variable). Once the program has found a relationship between the features (input) and the output, it can predict the sale price of a house it has never seen before.
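To make this concrete, here is a toy, made-up slice of such a dataset in pandas (the numbers and column names are invented for illustration):

```python
import pandas as pd

# A made-up, three-house sample of the kind of dataset described above.
houses = pd.DataFrame({
    "square_footage":  [1500, 2400, 1100],
    "bathrooms":       [2, 3, 1],
    "bedrooms":        [3, 4, 2],
    "garages":         [1, 2, 0],
    "year_built":      [1995, 2010, 1978],
    "avg_window_size": [12.0, 15.5, 9.0],    # features (variables 1-6)
    "sale_price":      [250000, 410000, 175000],  # target (variable 7)
})

features = houses.drop(columns=["sale_price"])  # the inputs
target = houses["sale_price"]                   # the output to predict
print(features.shape, target.shape)
```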

Let’s take a look at an analogy. Supervised learning is like baking a cake.

Suppose you had a cake machine that was able to cook many different types of cake from the same set of ingredients. All you have to do as the cake chef is to just throw the ingredients into the machine, and the cake machine will automatically make the cake.

The “features,” the inputs to the cake machine, are the following ingredients:

1. Butter
2. Sugar
3. Vanilla Extract
4. Flour
5. Chocolate
6. Eggs
7. Salt

The output is the type of cake:

1. Vanilla Cake
2. Pound Cake
3. Chocolate Cake
4. Dark Chocolate Cake

Different amounts of each ingredient will produce different types of cake. How does the cake machine know what type of cake to produce given a set of ingredients?

Fortunately, a machine learning engineer has written a software program (containing a supervised learning algorithm) that is running inside the cake machine. The program was pre-trained on a dataset containing 1 million cakes. Each entry (i.e. example or training instance) in this gigantic dataset contained two key pieces of data:

1. How much of each ingredient was used in the making of that given cake
2. The type of cake that was produced

During the training phase of the program, the program found a fairly accurate mathematical relationship between the amount of each ingredient and the cake type. And now, when the cake machine is provided with a new set of ingredients by the chef, it automatically “knows” what type of cake to produce. Pretty cool, huh?

What I described above is called supervised learning. It is called supervised learning because the input data (which the supervised learning algorithm used to train) is already labeled with the “correct” answers (e.g. the type of cake in our example above, or the sale price values for those 100,000 homes from the earlier example in this post). We supervised the learning algorithm by “telling” it what the output (cake type) should be for 1 million different sets of input values (ingredients).

The fundamental idea of a supervised learning algorithm is to learn a mathematical relationship between inputs and outputs so that it can predict the output value given an entirely new set of input values.

Let’s take a look at a common supervised learning algorithm: linear regression. The goal of linear regression is to find a line that best fits the relationship between input and output. For example, the learning algorithm for linear regression could be trained on square footage and sale price data for 100,000 homes. It would learn the mathematical relationship (e.g. a straight line in the form y = mx + b) between square footage and the sale price of a home.

With this relationship (i.e. line of best fit) in hand, the algorithm can now easily predict the sale price of any home just by being provided with the home’s square footage value.

For example, let’s say we wanted to find the price of a house that is 2,000 square feet. We feed 2,000 ft² into the algorithm, and it predicts a sale price of $500,000.
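Here is a minimal sketch of that idea using NumPy’s least-squares line fitting. The data is synthetic and was constructed to lie exactly on the line price = 200 × sqft + 100,000, so the prediction for a 2,000 ft² home comes out to about $500,000:

```python
import numpy as np

# Made-up training data: square footage vs. sale price for a few homes.
# These numbers were generated to lie on the line price = 200*sqft + 100000.
sqft = np.array([1000, 1500, 1800, 2500, 3000], dtype=float)
price = 200.0 * sqft + 100000.0

# Fit a straight line y = m*x + b (degree-1 least squares).
m, b = np.polyfit(sqft, price, 1)

# Predict the price of a 2,000 square-foot home.
predicted = m * 2000 + b
print(round(predicted))  # approximately 500000 for this synthetic data
```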

As you can imagine, before we make any predictions using a supervised learning algorithm, it is imperative to train it on a lot of data. Lots and lots of data. The more the merrier.

In the example above, to get that best fit line, we want to feed the algorithm as many examples of houses as possible during training. The more data it has, the better its predictions will be when given new data. This, my friends, is supervised learning, and it is incredibly powerful. In fact, supervised learning is the bread and butter of most of the state-of-the-art machine learning techniques today, such as deep learning.

Now, let’s look at unsupervised learning.

## What is Unsupervised Learning?

Let’s suppose you have the following dataset for a group of 13 people. For each person, we have the following features:

• Height (in inches)
• Hair Length (in inches)

Let’s plot the data to see what it looks like:

In contrast to supervised learning, in this case there is no output value that we are trying to predict (e.g. house sale price, cake type, etc.). All we have are features (inputs) with no corresponding output variables. In machine learning jargon, we say that the data points are unlabeled.

So instead of trying to force the dataset to fit a straight line or some sort of predetermined mathematical model, we let an unsupervised learning algorithm find a pattern all on its own. And here is what we get:

Aha! It looks like the algorithm found a pattern in the data: the points fall into two distinct clusters. What these clusters mean, we do not know, because the data points are unlabeled. However, given that the attributes are height and hair length, we suspect that the blue dots are males and the red dots are females.

What I have described above is known as unsupervised learning. It is called unsupervised because the input dataset is unlabeled. There is no output variable we are trying to predict. There is no prior mathematical model we are trying to fit the data to. All we want to do is let the algorithm find some sort of structure or pattern in the data. We let the data speak for itself.

Any time you are given a dataset and want to group similar data points into clusters, you’re going to want to use an unsupervised learning algorithm.

The K-means clustering algorithm is used to group unlabeled data set instances into clusters based on similar attributes. It has a number of advantages over other types of machine learning models, including linear models such as logistic regression and Naive Bayes.

# Unlabeled Data Sets

A lot of real-world data comes unlabeled, without any assigned class. The benefit of an algorithm like K-means clustering is that it does not require us to know in advance how the instances in a data set should be grouped.

For example, consider the problem of trying to group viewers of Netflix into clusters based on similar viewing behavior. We know that there are clusters, but we do not know what those clusters are. Linear models will not help us at all with these sorts of issues.

# Nonlinearly Separable Data

Consider the data set below containing three concentric circles. It is nonlinearly separable: there is no straight line or plane we could draw on the graph that cleanly discriminates among the red, blue, and green classes. However, if we convert the coordinate system from Cartesian to polar coordinates, K-means clustering can use the information about the radius to create concentric clusters.
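We can sketch that idea in NumPy with synthetic concentric-ring data (made up for illustration). To keep the example short, the true ring radii are handed to a one-dimensional nearest-center assignment; this shows that the polar radius alone separates the rings, which is exactly why a K-means run on the radius works here:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic concentric-circles data: three rings at radii 1, 3, and 5,
# with a small amount of noise.
angles = rng.uniform(0, 2 * np.pi, size=300)
radii = np.repeat([1.0, 3.0, 5.0], 100) + rng.normal(0, 0.1, size=300)
x, y = radii * np.cos(angles), radii * np.sin(angles)

# Convert Cartesian -> polar: the radius alone separates the rings.
r = np.sqrt(x**2 + y**2)

# Assign each point to the nearest ring radius (a K-means run on r
# would learn these centers; here we supply them to keep the sketch short).
centers = np.array([1.0, 3.0, 5.0])
labels = np.argmin(np.abs(r[:, None] - centers[None, :]), axis=1)
print(np.bincount(labels))  # 100 points recovered per ring
```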

# Simplicity

The meat of the K-means clustering algorithm is just two steps, the cluster assignment step and the move centroid step. If we’re looking for an unsupervised learning algorithm that is easy to implement and can handle large data sets, K-means clustering is a good starting point.

# Availability

Most of the popular machine learning packages contain an implementation of K-means clustering.
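For example, scikit-learn exposes it as `sklearn.cluster.KMeans`. A minimal sketch on made-up 2-D data (two well-separated blobs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs of three points each.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index for each instance
print(kmeans.cluster_centers_)  # one centroid per cluster
```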

# Speed

Based on my experience using K-means clustering, the algorithm does its work quickly, even for really big data sets.

## K-means Clustering Algorithm From Scratch

In this post, I will walk you through the k-means clustering algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!

# What is the K-means Clustering Algorithm?

The K-means algorithm is an unsupervised clustering algorithm (the target variable we want to predict is not available) that is used to group unlabeled data set instances into clusters based on similar attributes.

## Step 1: Choose a Value for K and Select the Initial Centroids

The first step of K-means is to determine the value for k, the number of clusters you want to group your data set into, where k is some positive integer. We then randomly select k cluster centroids (i.e. k random instances) from the data set.

## Step 2: Cluster Assignment

The next step is the cluster assignment step. This step involves going through each of the instances and assigning that instance to the cluster centroid closest to it.
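In NumPy, the assignment step can be sketched as follows (toy data; each instance goes to the centroid with the smallest Euclidean distance):

```python
import numpy as np

# Four toy instances and two centroids.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Euclidean distance from every instance to every centroid,
# then assign each instance to its nearest centroid.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
assignments = np.argmin(distances, axis=1)
print(assignments)  # [0 0 1 1]
```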

## Step 3: Move Centroid

The next part is the move centroid step. We move each of the k centroids to the average of all the instances that were assigned to that cluster centroid.
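A sketch of the move centroid step on the same kind of toy data:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
assignments = np.array([0, 0, 1, 1])
k = 2

# Each new centroid is the mean of the instances assigned to it.
new_centroids = np.array([X[assignments == c].mean(axis=0) for c in range(k)])
print(new_centroids)  # [[0.1 0.05], [5.05 4.95]]
```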

## Step 4: Repeat Steps 2 and 3 Until Centroids Stop Changing

We keep running the cluster assignment and move centroid steps until the cluster centroids stop changing (or until a maximum number of iterations is reached). At that point, the data is deemed clustered.
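Putting steps 1 through 4 together, the whole algorithm can be sketched in a few lines of NumPy (toy data; this is a simplified version of the full implementation later in this post):

```python
import numpy as np

def kmeans(X, k, max_iterations=300, seed=0):
    """Plain K-means: random initial centroids, then alternate the
    cluster assignment and move centroid steps until nothing changes."""
    rng = np.random.default_rng(seed)
    # Step 1: k random instances become the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iterations):
        # Step 2: assign every instance to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Step 3: move each centroid to the mean of its cluster
        # (empty clusters keep their old centroid).
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the two blobs end up in different clusters
```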

Andrew Ng, a professor of machine learning at Stanford University, does a good job of showing you visually how k-means clustering works:

# K-means Clustering Algorithm Design

The first thing we need to do is preprocess the iris data set so that it meets the input requirements of the algorithm. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:

Columns (0 through N)

• 0: Instance ID
• 1: Attribute 1
• 2: Attribute 2
• 3: Attribute 3
•  …
• N: Actual Class (used for classification accuracy calculation)

• N + 1: Cluster
• N + 2: Silhouette Coefficient
• N + 3: Predicted Class
• N + 4: Prediction Correct? (1 if yes, 0 if no)

Here is what your data set should look like:

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set.

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. Go through each attribute, one by one: values greater than the mean of that attribute are changed to 1, and all other values are set to 0. Here is the general Excel formula:

`=IF(B2>AVERAGE(B$2:B$151),1,0)`
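The same binarization can be done in one line with pandas, for readers who prefer to do it in Python (the column names and values below are made up for illustration):

```python
import pandas as pd

# Made-up attribute values for illustration.
df = pd.DataFrame({"sepal_length": [4.9, 5.8, 7.1, 6.3],
                   "sepal_width":  [3.0, 2.7, 3.0, 3.3]})

# Values strictly greater than the column mean become 1, everything else 0,
# mirroring the Excel formula =IF(B2>AVERAGE(B$2:B$151),1,0).
binarized = (df > df.mean()).astype(int)
print(binarized)
```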

Here is how the data looks after the binarization:

Save the file as a csv file (comma-delimited), and load it into the program below (Python).

Here is a link to the final data set I used.

# K-means Clustering Algorithm in Python, Coded From Scratch

K-means appears to be particularly sensitive to the starting centroids. The starting centroids for the k clusters were chosen at random. When these centroids started out poor, the algorithm took longer to converge to a solution. Future work would be to fine-tune the initial centroid selection process.

Here is the full code for k-means clustering. Don’t be scared at how long this code is. I include a bunch of comments at each step so that you know what is going on. Just copy and paste this into your favorite IDE and run it!

```python
import pandas as pd  # Import Pandas library
import numpy as np   # Import Numpy library

# File name: kmeans.py
# Date created: 6/12/2019
# Python version: 3.7
# Description: Implementation of the K-means clustering algorithm from
# scratch. K-means is a clustering algorithm that is used to group
# unlabeled data set instances into clusters based on similar attributes.

# Required Data Set Format:
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1
# 2: Attribute 2
# 3: Attribute 3
# ...
# N: Actual Class (used for classification accuracy calculation)

# N + 1: Cluster
# N + 2: Silhouette Coefficient
# N + 3: Predicted Class
# N + 4: Prediction Correct? (1 if yes, 0 if no)

################ INPUT YOUR OWN VALUES IN THIS SECTION ######################
ALGORITHM_NAME = "K-means"
DATA_PATH = "iris_dataset.txt"  # Directory where data set is located
TEST_STATS_FILE = "iris_dataset_kmeans_test_stats.txt"  # Testing statistics
TEST_OUT_FILE = "iris_dataset_kmeans_test_out.txt"  # Testing output
# Show functioning of the program
TRACE_RUNS_FILE = "iris_dataset_kmeans_trace_runs.txt"
SEPARATOR = ","  # Separator for the data set (e.g. "\t" for tab data)
#############################################################################

# Open a new file to save trace runs
outfile3 = open(TRACE_RUNS_FILE, "w")

# Read the full text file and store records in a Pandas dataframe
pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR)

# Copy the dataframe into a new dataframe so we don't mess up the
# original data
pd_data_set = pd_full_data_set.copy()

# Calculate the number of instances, columns, and attributes in the
# training data set. Assumes 1 column for the instance ID and 1 column
# for the class. Record the index of the column that contains
# the actual class
no_of_instances = len(pd_data_set.index)    # number of rows
no_of_columns = len(pd_data_set.columns)    # number of columns
no_of_attributes = no_of_columns - 2
actual_class_column = no_of_columns - 1

# Store the class values in a column and then create a list of unique
# classes, stored in a dataframe and a Numpy array
unique_class_list_df = pd_data_set.iloc[:, actual_class_column]
unique_class_list_np = unique_class_list_df.unique()           # Numpy array
unique_class_list_df = unique_class_list_df.drop_duplicates()  # Pandas df

# Record the number of unique classes in the data set
num_unique_classes = len(unique_class_list_df)

# Record the value for K, the number of clusters
K = num_unique_classes

# Remove the Instance ID and the Actual Class columns to create an
# unlabeled data set
instance_id_colname = pd_data_set.columns[0]
class_column_colname = pd_data_set.columns[actual_class_column]
pd_data_set = pd_data_set.drop(
    columns=[instance_id_colname, class_column_colname])

# Convert the dataframe into a Numpy array of floats (so that centroid
# averages are not truncated). Each row is a different instance.
np_data_set = pd_data_set.to_numpy(copy=True).astype(float)

# Randomly select k instances from the data set.
# These will be the cluster centroids for the first iteration
# of the algorithm.
centroids = np_data_set[np.random.choice(
    np_data_set.shape[0], size=K, replace=False), :]

# Initialize an array which will contain the cluster assignment for each
# instance.
cluster_assignments = np.zeros(no_of_instances, dtype=int)

# Goes True if the new centroids are the same as the old centroids
centroids_the_same = False

# Sets the maximum number of iterations
max_iterations = 300

while max_iterations > 0 and not centroids_the_same:

    ##################### Cluster Assignment Step #############################
    # Go through each instance and assign that instance to the closest
    # centroid (based on Euclidean distance). Record that centroid in the
    # cluster assignments array.
    for row in range(0, no_of_instances):

        this_instance = np_data_set[row]

        # Reset the minimum distance to infinity
        min_distance = float("inf")

        for row_centroid in range(0, K):
            this_centroid = centroids[row_centroid]

            # Calculate the Euclidean distance from this instance to the
            # centroid
            distance = np.linalg.norm(this_instance - this_centroid)

            # If we have a centroid that is closer to this instance,
            # update the cluster assignment for this instance.
            if distance < min_distance:
                cluster_assignments[row] = row_centroid
                min_distance = distance  # Update the minimum distance

    # Print after the cluster assignment step has completed
    print("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:")
    print(cluster_assignments)
    print()
    print("Now calculating the new centroids...")
    print()

    outfile3.write("Cluster assignments completed for all " + str(
        no_of_instances) + " instances. Here they are:" + "\n")
    outfile3.write(str(cluster_assignments))
    outfile3.write("\n\n")
    outfile3.write("Now calculating the new centroids..." + "\n")
    outfile3.write("\n")

    ##################### Move Centroid Step ##################################
    # Calculate the centroid of each cluster by computing the average
    # of the attribute values of the instances in that cluster.

    # Store the old centroids
    old_centroids = centroids.copy()

    # For each row (cluster) of the centroids 2D array
    for row_centroid in range(0, K):

        # For each column (attribute) of that centroid
        for col_centroid in range(0, no_of_attributes):

            # Reset the running sum and the counter
            running_sum = 0.0
            count = 0.0

            for row in range(0, no_of_instances):

                # If this instance belongs to this cluster
                if row_centroid == cluster_assignments[row]:

                    # Add this value to the running sum
                    running_sum += np_data_set[row, col_centroid]

                    # Increment the counter
                    count += 1

            if count > 0:
                # Update the centroids array with the average
                centroids[row_centroid, col_centroid] = running_sum / count

    # Print after the move centroid step has completed
    print("New centroids have been created. Here they are:")
    print(centroids)
    print()

    outfile3.write("New centroids have been created. Here they are:" + "\n")
    outfile3.write(str(centroids))
    outfile3.write("\n\n")

    # Check if the cluster centroids are the same as in the last iteration
    centroids_the_same = np.array_equal(old_centroids, centroids)

    if centroids_the_same:
        print(
            "Cluster membership is unchanged. Stopping criteria has been met.")
        outfile3.write("Cluster membership is unchanged. ")
        outfile3.write("Stopping criteria has been met." + "\n\n")

    # Update the number of remaining iterations
    max_iterations -= 1

# Record the actual class column name
actual_class_col_name = pd_full_data_set.columns[
    len(pd_full_data_set.columns) - 1]

# Append the result columns to the dataframe
pd_full_data_set = pd_full_data_set.reindex(
    columns=[*pd_full_data_set.columns.tolist(), 'Cluster',
             'Silhouette Coefficient', 'Predicted Class',
             'Prediction Correct?'])

# Add the final cluster assignments to the Pandas dataframe
pd_full_data_set['Cluster'] = cluster_assignments

outfile3.write("Calculating the Silhouette Coefficients. Please wait..." + "\n")
outfile3.write("\n")
print()
print("Calculating the Silhouette Coefficients. Please wait...")
print()

################## Calculate the Silhouette Coefficients ######################
# Rewards clusterings that have good cohesion and good separation. Varies
# between -1 and 1. -1 means bad clustering, 1 means great clustering.
#
# 1. For each instance, calculate the average distance to all other
#    instances in its cluster. This is a.
# 2. For each instance and each cluster that does not contain the
#    instance, calculate the average distance to all of the points in
#    that other cluster. Take the minimum such value over all of the
#    clusters. This is b (the average distance to the nearest
#    neighboring cluster).
# 3. For each instance, calculate the Silhouette Coefficient s where
#    s = (b - a) / max(a, b), and store the value in the data frame.

silhouette_column = actual_class_column + 2

# Go through one instance at a time
for row in range(0, no_of_instances):

    this_instance = np_data_set[row]
    this_cluster = cluster_assignments[row]

    running_sum = 0.0
    counter = 0.0

    # Calculate the average distance to all other instances
    # in this cluster. This is a.
    for row_2 in range(0, no_of_instances):

        # Skip the instance itself
        if row_2 == row:
            continue

        # If the other instance is in the same cluster as this instance
        if this_cluster == cluster_assignments[row_2]:

            # Calculate the distance and add it to the running sum
            distance = np.linalg.norm(this_instance - np_data_set[row_2])
            running_sum += distance
            counter += 1

    # Calculate the value for a (0 if this instance is alone in its cluster)
    a = running_sum / counter if counter > 0 else 0.0

    # For each cluster that does not contain the instance, calculate the
    # average distance to all of the points in that other cluster. Then
    # take the minimum such value over all of the clusters. This is b.
    b = float("inf")

    for clstr in range(0, K):

        # Must be another cluster, not the one this instance is in
        if clstr != this_cluster:

            running_sum = 0.0
            counter = 0.0

            # Calculate the average distance to the instances in that
            # other cluster
            for row_3 in range(0, no_of_instances):

                if cluster_assignments[row_3] == clstr:

                    # Calculate the distance and add it to the running sum
                    distance = np.linalg.norm(
                        this_instance - np_data_set[row_3])
                    running_sum += distance
                    counter += 1

            if counter > 0:
                avg_distance_to_cluster = running_sum / counter

                # Update b if we have a new minimum
                if avg_distance_to_cluster < b:
                    b = avg_distance_to_cluster

    # Calculate the Silhouette Coefficient s where s = (b - a) / max(a, b)
    s = (b - a) / max(a, b)

    # Store the Silhouette Coefficient in the Pandas data frame
    pd_full_data_set.iloc[row, silhouette_column] = s

#################### Predict the Class ########################################
# For each cluster, determine the predominant class and assign that
# class to the cluster. Then determine if the prediction was correct.
# Create a data frame that maps clusters to actual classes
class_mappings = pd.DataFrame(index=range(K), columns=range(1))

for clstr in range(0, K):

    # Select the rows whose 'Cluster' column equals that cluster value
    temp_df = pd_full_data_set.loc[pd_full_data_set['Cluster'] == clstr]

    # Select the predominant class in that cluster
    class_mappings.iloc[clstr, 0] = temp_df[actual_class_col_name].mode()[0]

cluster_column = actual_class_column + 1
pred_class_column = actual_class_column + 3
pred_correct_column = actual_class_column + 4

# Assign the relevant class to each instance and see if the
# prediction was correct
for row in range(0, no_of_instances):

    # Look up the class mapped to this instance's cluster
    clstr = int(pd_full_data_set.iloc[row, cluster_column])
    pd_full_data_set.iloc[row, pred_class_column] = class_mappings.iloc[
        clstr, 0]

    # Record whether the prediction was correct (1) or incorrect (0)
    if pd_full_data_set.iloc[row, pred_class_column] == pd_full_data_set.iloc[
            row, actual_class_column]:
        pd_full_data_set.iloc[row, pred_correct_column] = 1
    else:
        pd_full_data_set.iloc[row, pred_correct_column] = 0

# Write the dataframe to a file
pd_full_data_set.to_csv(TEST_OUT_FILE, sep=SEPARATOR, index=False)

# Print the data frame to the console
print()
print()
print("Data Set")
print(pd_full_data_set)
print()
print()

################### Summary Statistics ########################################
# Calculate the average Silhouette Coefficient for the data set
# Calculate the accuracy of the clustering-based classifier

# Open a new file to save the summary statistics
outfile1 = open(TEST_STATS_FILE, "w")

# Write the header to the file
outfile1.write("----------------------------------------------------------\n")
outfile1.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n")
outfile1.write("----------------------------------------------------------\n")
outfile1.write("Data Set : " + DATA_PATH + "\n")

# Write the relevant stats to the file
outfile1.write("\n")
outfile1.write("Number of Instances : " + str(no_of_instances) + "\n")
outfile1.write("\n")
outfile1.write("Value for k : " + str(K) + "\n")

# Calculate the average Silhouette Coefficient for the data set
silhouette_coefficient = pd_full_data_set.loc[
    :, "Silhouette Coefficient"].mean()

# Write the Silhouette Coefficient to the file
outfile1.write("Silhouette Coefficient : " + str(
    silhouette_coefficient) + "\n")

# accuracy = (total correct predictions) / (total number of predictions)
accuracy = pd_full_data_set.iloc[
    :, pred_correct_column].sum() / no_of_instances
accuracy *= 100

# Write the accuracy to the file
outfile1.write("Accuracy : " + str(accuracy) + "%\n")

# Print the statistics to the console
print()
print()
print("-------------------------------------------------------")
print(ALGORITHM_NAME + " Summary Statistics (Testing)")
print("-------------------------------------------------------")
print("Data Set : " + DATA_PATH)

# Print the relevant stats to the console
print()
print("Number of Instances : " + str(no_of_instances))
print("Value for k : " + str(K))

# Print the Silhouette Coefficient to the console
print("Silhouette Coefficient : " + str(silhouette_coefficient))

# Print the accuracy to the console
print("Accuracy : " + str(accuracy) + "%")

# Close the files
outfile1.close()
outfile3.close()
```

# Output Statistics of K-means Clustering on the Iris Data Set

This section shows the results for the runs of K-means Clustering on the iris data set.

## Trace Runs

Here are the trace runs of the algorithm.

## Output

Here is the output of the algorithm.