Five-Fold Stratified Cross-Validation

In a lot of my machine learning projects, you might have noticed that I use a technique called five-fold stratified cross-validation. The purpose of cross-validation is to estimate how well a machine learning algorithm performs on data it has never seen before. You don’t want to spend all that time building your model only to find out that it works well on the training data but terribly on new data. Cross-validation helps you catch that problem (overfitting) before you rely on the model.

The basic idea is that you shuffle your data randomly and then divide it into five equally-sized subsets. Ideally, each class should make up the same proportion of every partition as it does of the whole data set.

For example, if you have a data set of 100 points where 1/3 of the data is in one class and 2/3 of the data is in another class, you will create five partitions of 20 instances each. Then for each of these partitions, 1/3 of the instances (~6 or 7 points) should be from the one class, and the remaining points should be in the other class. This is the “stratified” part of five-fold stratified cross-validation.
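
To make the arithmetic concrete, here is a minimal sketch that deals out each class as evenly as possible across five folds. The helper name per_fold_counts and the 33/67 split are my assumptions for illustration (100 points cannot be divided into exact thirds).

```python
def per_fold_counts(class_size, k=5):
    """Split class_size instances across k folds as evenly as possible."""
    base, extra = divmod(class_size, k)
    # The first `extra` folds receive one additional instance
    return [base + 1 if i < extra else base for i in range(k)]

# Assumed split of 100 points: roughly 1/3 in class A and 2/3 in class B
class_sizes = {"A": 33, "B": 67}

counts = {label: per_fold_counts(size) for label, size in class_sizes.items()}
fold_sizes = [sum(counts[label][i] for label in counts) for i in range(5)]

print(counts["A"])   # [7, 7, 7, 6, 6]
print(counts["B"])   # [14, 14, 13, 13, 13]
print(fold_sizes)    # [21, 21, 20, 19, 19]
```

Each fold ends up with 6 or 7 points from class A. Notice that when every class is dealt out starting from the first fold, the fold sizes come out close to 20 rather than exactly 20.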

You then run five experiments where you train on four of the partitions (80% of the data) and test on the remaining partition (20% of the data). You rotate through the partitions so that each one serves as a test set exactly once. Then you average the performance on the five experiments when you report the results.
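
The rotation can be sketched in a few lines. This is only an illustration under some assumptions: the folds hold nothing but class labels, and the "model" is a stand-in that always predicts the most common training label. The names majority_class and cross_validate are hypothetical.

```python
from collections import Counter

def majority_class(train_labels):
    """Stand-in 'model': always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_validate(folds):
    """Train on all folds but one, test on the held-out fold, and rotate."""
    scores = []
    for test_idx in range(len(folds)):
        test = folds[test_idx]
        # Merge the remaining folds into one training set
        train = [label for i, fold in enumerate(folds)
                 if i != test_idx for label in fold]
        prediction = majority_class(train)
        accuracy = sum(label == prediction for label in test) / len(test)
        scores.append(accuracy)
    # Report the mean performance over all runs
    return sum(scores) / len(scores), scores

# Five toy folds of class labels only (hypothetical data: 1/3 'a', 2/3 'b')
folds = [["a", "b", "b"]] * 5
mean_accuracy, per_fold = cross_validate(folds)
```

With a real model, you would replace majority_class with your own training and prediction steps. The loop that rotates the test fold stays the same.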

Let’s take a look at this process visually:

[Figure: five-fold stratified cross-validation]
  • Divide the data set into five random groups of equal size. Make sure that the proportion of each class in each group is roughly equal to its proportion in the entire data set.
  • Use four groups for training and one group for testing.
  • Calculate the classification accuracy.
  • Repeat the procedure four more times, rotating the test set so that each group serves as a test set exactly once.
  • Compute the average classification accuracy (or mean squared error) for the five runs.

Note that, if the target variable is continuous instead of a class, we use mean squared error instead of classification accuracy as the performance metric.
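
For reference, here is a minimal sketch of the two metrics in plain Python. The function names are my own, and the sample values are made up for illustration.

```python
def classification_accuracy(actual, predicted):
    """Fraction of predictions that match the actual class."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Three of the four class predictions are correct
acc = classification_accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Squared errors are 1, 0 and 4, so the mean is 5/3
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Higher is better for accuracy, while lower is better for mean squared error, so keep that in mind when you compare averaged results.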

Implementation 1 (Using NumPy in Python)

Here is the code for five-fold stratified cross-validation using the NumPy Python library. Just copy and paste it into your favorite IDE. Don’t be scared by how long the code is. I include a lot of comments so that you know what is going on.

import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Discrete Class Values
# Classification:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Continuous Class Values:
# Regression:
# Must be all numerical
# Columns (0 through N + 1)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

class FiveFoldStratCv:

    # Constructor
    # Parameters: 
    #   np_dataset: The entire original data set as a numpy array
    #   problem_type: 'r' for regression and 'c' for classification 
    def __init__(self, np_dataset, problem_type):
        self.__np_dataset = np_dataset
        self.__problem_type = problem_type
  
    # Returns: 
    #   fold0, fold1, fold2, fold3, fold4
    #   Five folds whose class frequency distributions are 
    #   each representative of the entire original data set (i.e. Five-Fold 
    #   Stratified Cross Validation)
    def get_five_folds(self):

        # Record the number of columns in the data set
        no_of_columns = np.size(self.__np_dataset,1)

        # Record the number of rows in the data set
        no_of_rows = np.size(self.__np_dataset,0)

        # Create five placeholder folds (size-1 numpy arrays: fold0 through
        # fold4). A fold whose size is still 1 has not received a row yet.
        fold0 = np.arange(1)
        fold1 = np.arange(1)
        fold2 = np.arange(1)
        fold3 = np.arange(1)
        fold4 = np.arange(1)

        # Shuffle the data set randomly. Note that this shuffles the
        # caller's array in place.
        np.random.shuffle(self.__np_dataset)

        # Generate folds for classification problem
        if self.__problem_type == "c":

            # Record the column of the Actual Class
            actual_class_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Actual Class values
            unique_class_arr = np.unique(self.__np_dataset[
                :,actual_class_column])

            unique_class_arr_size = unique_class_arr.size

            # For each unique class in the unique Actual Class array
            for unique_class_arr_idx in range(0, unique_class_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique class. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique class is equal to the actual
                    # class in the original data set on this row
                    if unique_class_arr[unique_class_arr_idx] == (
                        self.__np_dataset[row,actual_class_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0

        # If this is a regression problem
        else:
            # Record the column of the Stratification Bin
            strat_bin_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Stratification Bin values
            unique_bin_arr = np.unique(self.__np_dataset[
                :,strat_bin_column])

            unique_bin_arr_size = unique_bin_arr.size

            # For each unique bin in the unique Stratification Bin array
            for unique_bin_arr_idx in range(0, unique_bin_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique bin. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique bin is equal to the actual
                    # bin in the original data set on this row
                    if unique_bin_arr[unique_bin_arr_idx] == (
                        self.__np_dataset[row,strat_bin_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0
        
        return fold0, fold1, fold2, fold3, fold4
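
If the length of the listing above gets in the way, the same round-robin idea can be sketched much more compactly. This is not a drop-in replacement, just an illustration under a few assumptions: the function names are hypothetical, the input array is copied instead of shuffled in place, and the stratification column is always the last one (the Actual Class for classification, the Stratification Bin for regression). The listing above does not say how the bins are produced, so the quantile binning here is only one reasonable choice.

```python
import numpy as np

def add_quantile_bins(data, n_bins=4):
    """Append a Stratification Bin column based on quantiles of the last column.

    Assumption: quantile binning is one possible way to build the bins.
    """
    target = data[:, -1]
    # Interior quantile cut points, e.g. the 25th, 50th and 75th percentiles
    edges = np.quantile(target, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(target, edges)  # bin labels 0 .. n_bins - 1
    return np.column_stack([data, bins])

def stratified_folds(data, k=5, seed=None):
    """Deal the rows of each class (or bin) in the last column round-robin."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]  # a copy; input is untouched
    fold_rows = [[] for _ in range(k)]
    for value in np.unique(shuffled[:, -1]):
        members = shuffled[shuffled[:, -1] == value]
        # position % k cycles through fold 0, 1, ..., k - 1 and starts over
        for position, row in enumerate(members):
            fold_rows[position % k].append(row)
    return [np.array(rows) for rows in fold_rows]

# Toy classification data: ID, one attribute, class (10 rows of 0, 20 of 1)
ids = np.arange(30)
data = np.column_stack([ids, ids * 0.5, np.repeat([0, 1], [10, 20])])
folds = stratified_folds(data, seed=0)

# Toy regression data: ID and a continuous target, then a bin column appended
regression_data = np.column_stack([np.arange(20), np.arange(20) * 1.0])
binned = add_quantile_bins(regression_data)
```

In the toy classification data, every fold receives two rows of class 0 and four rows of class 1, which matches the 1/3 to 2/3 split of the full data set.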

Implementation 2 (Using the Counter subclass)

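Here is another implementation. This one works on a list of dictionaries and uses Counter (a dict subclass from Python’s collections module) to find the unique class values.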

import random
from collections import Counter # Used for counting

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/7/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

def get_five_folds(instances):
    """
    Parameters:
        instances: A list of dictionaries where each dictionary is an instance. 
            Each dictionary contains attribute:value pairs 
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Create five empty folds
    fold0 = []
    fold1 = []
    fold2 = []
    fold3 = []
    fold4 = []

    # Shuffle the data randomly. Note that this shuffles the caller's
    # list in place.
    random.shuffle(instances)

    # Generate a list of the unique class values and their counts
    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in instances:

            # If we have a match
            if uniqueclass == instance['Class']:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    random.shuffle(fold0)
    random.shuffle(fold1)
    random.shuffle(fold2)
    random.shuffle(fold3)
    random.shuffle(fold4)

    # Return the folds
    return fold0, fold1, fold2, fold3, fold4
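
Once you have the folds, it is worth checking that the stratification worked. Here is a minimal sketch that uses Counter to compute the class proportions in each fold. The helper name class_proportions and the two tiny folds are hypothetical; in practice you would pass in the folds returned by get_five_folds.

```python
from collections import Counter

def class_proportions(fold):
    """Return {class value: share of the fold} for a list of instance dicts."""
    counts = Counter(instance['Class'] for instance in fold)
    return {label: count / len(fold) for label, count in counts.items()}

# Hypothetical folds, shaped like the output of get_five_folds
folds = [
    [{'Class': 'yes'}, {'Class': 'no'}, {'Class': 'no'}],
    [{'Class': 'no'}, {'Class': 'yes'}, {'Class': 'no'}],
]
proportions = [class_proportions(fold) for fold in folds]
```

If the proportions differ a lot from fold to fold, or from the proportions in the full data set, something went wrong in the allocation.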