In this post, I will walk you through the Stepwise Forward Selection algorithm, step-by-step. We will develop the code for the algorithm from scratch using Python and use it for feature selection for the Naive Bayes algorithm we previously developed. We will then run the algorithm on a real-world data set, the iris data set (flower classification) from the UCI Machine Learning Repository. Without further ado, let’s get started!
Table of Contents
- What is the Stepwise Forward Selection Algorithm?
- Stepwise Forward Selection Implementation
- Stepwise Forward Selection Algorithm in Python, Coded From Scratch
- Output Statistics of Stepwise Forward Selection on the Iris Data Set
What is the Stepwise Forward Selection Algorithm?
One of the fundamental ideas in machine learning is the Curse of Dimensionality – as the number of attributes in the training set grows, you need exponentially more instances to learn the underlying concept. And the more instances you have, the more computational time is required to process the data.
In other words, adding attributes improves performance, to a point. Beyond that point, you need to have exponentially more instances in order for performance to improve. This phenomenon is commonly known as the Hughes Effect (or Hughes Phenomenon).

Image Source: Hughes, G. (1968). On the Mean Accuracy Of Statistical Pattern Recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.
In the above image, n = number of attributes and m = number of instances.
One technique for combatting the Curse of Dimensionality is known as Stepwise Forward Selection (SFS). SFS involves selecting only the most relevant attributes for learning and discarding the rest. The metric that determines what attribute is “most relevant” is determined by the programmer. In this project, I will use the classification accuracy of Naïve Bayes.
Because SFS is just a method to create an optimal attribute subset, it needs to be wrapped with another algorithm (in this case Naïve Bayes) that does the actual learning. For this reason, SFS is known as a wrapper method.
The goal of SFS is to reduce the number of features that a machine learning algorithm needs to examine by eliminating unnecessary features and finding an optimal attribute subset.
Suppose we have attributes A1 through Ak which are used to describe a data set D.
A = <A1,…Ak>
SFS begins with an empty attribute set:
A0 = <>
It then selects an attribute from A, one at a time, until the performance (e.g. as measured by classification accuracy) of the target learning algorithm (e.g. Naïve Bayes) on a subset of the training data fails to improve performance.
Stepwise Forward Selection Implementation
The first thing we need to do is preprocess the iris data set so that it meets the input requirements of both algorithms. Below is the required data format. I downloaded the data into Microsoft Excel and then made sure I had the following columns:
Columns (0 through N)
- 0: Instance ID
- 1: Attribute 1
- 2: Attribute 2
- 3: Attribute 3
- …
- N: Actual Class
The program I developed (code is below) then adds 2 additional columns for the testing set.
- N + 1: Predicted Class
- N + 2: Prediction Correct? (1 if yes, 0 if no)
Here is what your data set should look like:

The iris data set contains 3 classes of 50 instances each (150 instances in total), where each class refers to a different type of iris flower. There are 4 attributes in the data set. Here is what an iris flower looks like in case you are interested:

Binarization is optional, but I prefer to do it since it speeds up the attribute selection process. You need to go through each attribute, one by one. Attribute values greater than the mean of the attribute need to be changed to 1, otherwise set it to 0. Here is the general Excel formula:
=IF(B2>AVERAGE(B$2:B$151),1,0)
Here is how the data looks after the binarization:

Save the file as a csv file (comma-delimited), and load it into the program below (Python).
Here is a link to the final data set I used.
Stepwise Forward Selection Algorithm in Python, Coded From Scratch
Naïve Bayes was wrapped with SFS for feature selection. Classification accuracy was used to select the optimal attribute at each step. Once the optimal attribute subset was obtained, Naïve Bayes was run on the complete training set (67% of instances) and testing set (33% of instances), and classification accuracy was calculated.
Here is the code. Just copy and paste it into your favorite IDE. Don’t be scared at how long the code is. I use a lot of comments so that you know what is going on at each step:
import pandas as pd # Import Pandas library import numpy as np # Import Numpy library # File name: sfs_naive_bayes.py # Author: Addison Sears-Collins # Date created: 6/10/2019 # Python version: 3.7 # Description: Implementation of Naive Bayes which uses Stepwise Forward # Selection (SFS) for feature selection. This code works for multi-class # classification problems (e.g. democrat/republican/independent) # Calculate P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0) and # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1) and # P(E1|CL2)P(E2|CL2)P(E3|CL2)...P(E#|CL2) * P(CL2), etc. and predict the # class with the maximum result. # E is an attribute, and CL means class. # Only need class prior probability and likelihoods to make a prediction # (i.e. the numerator of Bayes formula) since denominators are same for both # the P(CL0|E1,E2,E3...)*P(CL0) and P(CL1|E1,E2,E3...)*P(CL1), etc. cases # where P means "probability of" and | means "given". # Required Data Set Format: # Columns (0 through N) # 0: Instance ID # 1: Attribute 1 # 2: Attribute 2 # 3: Attribute 3 # ... # N: Actual Class # This program then adds 2 additional columns for the testing set. # N + 1: Predicted Class # N + 2: Prediction Correct? (1 if yes, 0 if no) ################ INPUT YOUR OWN VALUES IN THIS SECTION ###################### ALGORITHM_NAME = "SFS-Wrapped Naive Bayes" DATA_PATH = "iris_dataset.txt" # Directory where data set is located TEST_STATS_FILE = "iris_dataset_naive_bayes_test_stats.txt"#Testing statistics TEST_OUT_FILE = "iris_dataset_naive_bayes_test_out.txt" # Testing output # Show functioning of the program TRACE_RUNS_FILE = "iris_dataset_naive_bayes_trace_runs.txt" SEPARATOR = "," # Separator for the data set (e.g. "\t" for tab data) TRAINING_DATA_PRCT = 0.67 # % of data set used for training. Default 0.67 testing_data_prct = 1 - TRAINING_DATA_PRCT # % of data set used for testing SEED = 99 # SEED for the random number generator. Default: 99 ############################################################################# # Open a new file to save trace runs outfile3 = open(TRACE_RUNS_FILE,"w") # Naive Bayes algorithm. Accepts a Pandas dataframeas its parameter def naive_bayes(pd_data): # Create a training dataframe by sampling random instances from original data. # random_state guarantees that the pseudo-random number generator generates # the same sequence of random numbers each time. pd_training_data = pd_data.sample(frac=TRAINING_DATA_PRCT, random_state=SEED) # Create a testing dataframe. Dropping the training data from the original # dataframe ensures training and testing dataframes have different instances pd_testing_data = pd_data.drop(pd_training_data.index) ######COMMENT OUT THESE TWO LINES BEFORE RUNNING ################ #pd_training_data = pd_data.iloc[:20] # Used for testing only, gets 1st 20 rows #pd_testing_data = pd_data.iloc[20:22,:] # Used for testing only, rows 20 & 21 # Calculate the number of instances, columns, and attributes in the # training data set. Assumes 1 column for the instance ID and 1 column # for the class. Record the index of the column that contains # the actual class no_of_instances_train = len(pd_training_data.index) # number of rows no_of_columns_train = len(pd_training_data.columns) # number of columns no_of_attributes = no_of_columns_train - 2 actual_class_column = no_of_columns_train - 1 # Store class values in a column, sort them, then create a list of unique # classes and store in a dataframe and a Numpy array unique_class_list_df = pd_training_data.iloc[:,actual_class_column] unique_class_list_df = unique_class_list_df.sort_values() unique_class_list_np = unique_class_list_df.unique() #Numpy array unique_class_list_df = unique_class_list_df.drop_duplicates()#Pandas df # Record the number of unique classes in the data set num_unique_classes = len(unique_class_list_df) # Record the frequency counts of each class in a Numpy array freq_cnt_class = pd_training_data.iloc[:,actual_class_column].value_counts( sort=True) # Record the frequency percentages of each class in a Numpy array # This is a list of the class prior probabilities class_prior_probs = pd_training_data.iloc[:,actual_class_column].value_counts( normalize=True, sort=True) # Add 2 additional columns to the testing dataframe pd_testing_data = pd_testing_data.reindex( columns=[*pd_testing_data.columns.tolist( ), 'Predicted Class', 'Prediction Correct?']) # Calculate the number of instances and columns in the # testing data set. Record the index of the column that contains the # predicted class and prediction correctness (1 if yes; 0 if no) no_of_instances_test = len(pd_testing_data.index) # number of rows no_of_columns_test = len(pd_testing_data.columns) # number of columns predicted_class_column = no_of_columns_test - 2 prediction_correct_column = no_of_columns_test - 1 ######################### Training Phase of the Model ######################## # Create a an empty dictionary my_dict = {} # Calculate the likelihood tables for each attribute. If an attribute has # four levels, there are (# of unique classes x 4) different probabilities # that need to be calculated for that attribute. # Start on the first attribute and make your way through all the attributes for col in range(1, no_of_attributes + 1): # Record the name of this column colname = pd_training_data.columns[col] # Create a dataframe containing the unique values in the column unique_attribute_values_df = pd_training_data[colname].drop_duplicates() # Create a Numpy array containing the unique values in the column unique_attribute_values_np = pd_training_data[colname].unique() # Calculate likelihood of the attribute given each unique class value for class_index in range (0, num_unique_classes): # For each unique attribute value, calculate the likelihoods # for each class for attr_val in range (0, unique_attribute_values_np.size) : running_sum = 0 # Calculate N(unique attribute value && class value) # Where N means "number of" and && means "and" # Go through each row of the training set for row in range(0, no_of_instances_train): if (pd_training_data.iloc[row,col] == ( unique_attribute_values_df.iloc[attr_val])) and ( pd_training_data.iloc[row, actual_class_column] == ( unique_class_list_df.iloc[class_index])): running_sum += 1 # With N(unique attribute value && class value) as the numerator # we now need to divide by the total number of times the class # appeared in the data set try: denominator = freq_cnt_class[class_index] except: denominator = 1.0 likelihood = running_sum / denominator # Add a new likelihood to the dictionary # Format of search key is # <attribute_name><attribute_value><class_value> search_key = str(colname) + str(unique_attribute_values_df.iloc[ attr_val]) + str(unique_class_list_df.iloc[ class_index]) my_dict[search_key] = likelihood # Print the likelihood table to the console # print(pd.DataFrame.from_dict(my_dict, orient='index')) ################# End of Training Phase of the Naive Bayes Model ######## ################# Testing Phase of the Naive Bayes Model ################ # Proceed one instance at a time and calculate the prediction for row in range(0, no_of_instances_test): # Initialize the prediction outcome predicted_class = unique_class_list_df.iloc[0] max_numerator_of_bayes = 0.0 # Calculate the Bayes equation numerator for each test instance # That is: P(E1|CL0)P(E2|CL0)P(E3|CL0)...P(E#|CL0) * P(CL0), # P(E1|CL1)P(E2|CL1)P(E3|CL1)...P(E#|CL1) * P(CL1)... for class_index in range (0, num_unique_classes): # Reset the running product with the class # prior probability, P(CL) try: running_product = class_prior_probs[class_index] except: running_product = 0.0000001 # Class not found in data set # Calculation of P(CL) * P(E1|CL) * P(E2|CL) * P(E3|CL)... # Format of search key is # <attribute_name><attribute_value><class_value> # Record each search key value for col in range(1, no_of_attributes + 1): attribute_name = pd_testing_data.columns[col] attribute_value = pd_testing_data.iloc[row,col] class_value = unique_class_list_df.iloc[class_index] # Set the search key key = str(attribute_name) + str( attribute_value) + str(class_value) # Update the running product try: running_product *= my_dict[key] except: running_product *= 0 # Record the prediction if we have a new max # Bayes numerator if running_product > max_numerator_of_bayes: max_numerator_of_bayes = running_product predicted_class = unique_class_list_df.iloc[ class_index] # New predicted class # Store the prediction in the dataframe pd_testing_data.iloc[row,predicted_class_column] = predicted_class # Store if the prediction was correct if predicted_class == pd_testing_data.iloc[row,actual_class_column]: pd_testing_data.iloc[row,prediction_correct_column] = 1 else: pd_testing_data.iloc[row,prediction_correct_column] = 0 print("-------------------------------------------------------") print("Learned Model Predictions on Testing Data Set") print("-------------------------------------------------------") # Print the revamped dataframe print(pd_testing_data) # accuracy = (total correct predictions)/(total number of predictions) accuracy = (pd_testing_data.iloc[ :,prediction_correct_column].sum())/no_of_instances_test # Return classification accuracy return accuracy ####################### End Testing Phase ###################################### # Stepwise forward selection method accepts a Pandas dataframe as a parameter # and then returns the optimal dataframe from that dataframe. def sfs(pd_data): # Record the name of the column that contains the instance IDs and the class instance_id_colname = pd_data.columns[0] no_of_columns = len(pd_data.columns) # number of columns class_column_index = no_of_columns - 1 class_column_colname = pd_data.columns[class_column_index] # Record the number of available attributes no_of_available_attributes = no_of_columns - 2 # Create a dataframe containing the available attributes by removing # the Instance and the Class Column available_attributes_df = pd_data.drop(columns = [ instance_id_colname, class_column_colname]) # Create an empty optimal attribute dataframe containing only the # Instance and the Class Columns optimal_attributes_df = pd_data[[instance_id_colname,class_column_colname]] # Set the base performance to a really low number base_performance = -9999.0 # Check whether adding a new attribute to the optimal attributes dataframe # improves performance # While there are still available attributes left while no_of_available_attributes > 0: # Set the best performance to a low number best_performance = -9999.0 # Initialize the best attribute variable to a placeholder best_attribute = "Placeholder" # For all attributes in the available attribute data frame for col in range(0, len(available_attributes_df.columns)): # Record the name of this attribute this_attr = available_attributes_df.columns[col] # Create a new dataframe with this attribute inserted temp_opt_attr_df = optimal_attributes_df.copy() temp_opt_attr_df.insert( loc=1,column=this_attr,value=(available_attributes_df[this_attr])) # Run Naive Bayes on this new dataframe and return the # classification accuracy current_performance = naive_bayes(temp_opt_attr_df) # Find the new attribute that yielded the greatest # classification accuracy if current_performance > best_performance: best_performance = current_performance best_attribute = this_attr # Did adding another feature lead to improvement? if best_performance > base_performance: base_performance = best_performance # Add the best attribute to the optimal attribute data frame optimal_attributes_df.insert( loc=1,column=best_attribute,value=( available_attributes_df[best_attribute])) # Remove the best attribute from the available attribute data frame available_attributes_df = available_attributes_df.drop( columns = [best_attribute]) # Print the best attribute to the console print() print(str(best_attribute) + " added to the optimal attribute subset") print() # Write to file outfile3.write(str( best_attribute) + ( " added to the optimal attribute subset") + "\n") # Decrement the number of available attributes by 1 no_of_available_attributes -= 1 # Print number of attributes remaining to the console print() print(str(no_of_available_attributes) + " attributes remaining") print() print() # Write to file outfile3.write(str(no_of_available_attributes) + ( " attributes remaining") + "\n" + "\n") else: print() print("Performance did not improve this round.") print("End of Stepwise Forward Selection.") print() outfile3.write("Performance did not improve this round." + "\n") outfile3.write("End of Stepwise Forward Selection." + "\n") break # Return the optimal attribute set return optimal_attributes_df def main(): # Read the full text file and store records in a Pandas dataframe pd_full_data_set = pd.read_csv(DATA_PATH, sep=SEPARATOR) # Used for feature selection of large data sets only #pd_full_data_set = pd_full_data_set.sample(frac=0.10, random_state=SEED) # Runs SFS on the full data set to find the optimal attribute data frame optimal_attribute_set = sfs(pd_full_data_set) # Write dataframe to a file optimal_attribute_set.to_csv(TEST_OUT_FILE, sep=",", header=True) # Open a new file to save the summary statistics outfile2 = open(TEST_STATS_FILE,"w") # Write to a file outfile2.write("----------------------------------------------------------\n") outfile2.write(ALGORITHM_NAME + " Summary Statistics (Testing)\n") outfile2.write("----------------------------------------------------------\n") outfile2.write("Data Set : " + DATA_PATH + "\n") # Write the relevant stats to a file outfile2.write("\n") outfile2.write("Number of Test Instances : " + str(round(len(optimal_attribute_set.index) * testing_data_prct)) + "\n") # Run Naive Bayes on the optimal attribute set accuracy = naive_bayes(optimal_attribute_set) accuracy *= 100 # Write accuracy to the file outfile2.write("Accuracy : " + str(accuracy) + "%\n") # Print statistics to console print() print() print("-------------------------------------------------------") print(ALGORITHM_NAME + " Summary Statistics (Testing)") print("-------------------------------------------------------") print("Data Set : " + DATA_PATH) # Print the relevant stats to the console print() print("Number of Test Instances : " + str(round(len(optimal_attribute_set.index) * testing_data_prct))) # Print accuracy to the console print() print("Accuracy : " + str(accuracy) + "%") print() # Create a dataframe containing the optimal attributes by removing # the Instance and the Class Column instance_id_colname = optimal_attribute_set.columns[0] no_of_columns = len(optimal_attribute_set.columns) # number of columns class_column_index = no_of_columns - 1 class_column_colname = optimal_attribute_set.columns[class_column_index] optimal_attribute_set = optimal_attribute_set.drop(columns = [ instance_id_colname, class_column_colname]) # Write a list of the optimal attribute set to a file outfile2.write("Optimal Attribute Subset : " + str(list( optimal_attribute_set))) # Print a list of the optimal attribute set print("Optimal Attribute Subset : " + str(list(optimal_attribute_set))) print() print() # Close the files outfile2.close() outfile3.close() main() # Call the main function
Output Statistics of Stepwise Forward Selection on the Iris Data Set
This section shows the results for the runs of Stepwise Forward Selection wrapped with Naïve Bayes on the iris data set.
Test Statistics

Trace Runs
