Five-Fold Stratified Cross-Validation

In a lot of my machine learning projects, you might have noticed that I use a technique called five-fold stratified cross-validation. The purpose of cross-validation is to test the effectiveness of a machine learning algorithm. You don’t just want to spend all that time building your model only to find out that it only works well on the training data but works terribly on data it has never seen before. Cross-validation is the process that helps combat that risk.

The basic idea is that you shuffle your data randomly and then divide it into five equally-sized subsets. Ideally, you would like to have the same number of instances to be in each class in each of the five partitions.

For example, if you have a data set of 100 points where 1/3 of the data is in one class and 2/3 of the data is in another class, you will create five partitions of 20 instances each. Then for each of these partitions, 1/3 of the instances (~6 or 7 points) should be from the one class, and the remaining points should be in the other class. This is the “stratified” part of five-fold stratified cross-validation.

You then run five experiments where you train on four of the partitions (80% of the data) and test on the remaining partition (20% of the data). You rotate through the partitions so that each one serves as a test set exactly once. Then you average the performance on the five experiments when you report the results.

Let’s take a look at this process visually:

five-fold-stratified-cross-validation
  • Divide the data set into five random groups of equal size. Make sure that the proportion of each class in each group is roughly equal to its proportion in the entire data set.
  • Use four groups for training and one group for testing.
  • Calculate the classification accuracy.
  • Repeat the procedure four more times, rotating the test set so that each group serves as a test set exactly once.
  • Compute the average classification accuracy (or mean squared error) for the five runs.

Note that, if the target variable is continuous instead of a class, we use mean squared error instead of classification accuracy as the loss function.

Implementation 1 (Using Numpy in Python)

Here is the code for five-fold stratified cross-validation using the Numpy Python library. Just copy and paste it into your favorite IDE. Don’t be scared at how long the code is. I include a lot of comments so that you know what is going on.

import numpy as np # Import Numpy library

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 6/20/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Disrete Class Values
# Classification:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class

# Required Data Set Format for Continuous Class Values:
# Regression:
# Must be all numerical
# Columns (0 through N)
# 0: Instance ID
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Actual Class
# N + 1: Stratification Bin

class FiveFoldStratCv:

    # Constructor
    # Parameters: 
    #   np_dataset: The entire original data set as a numpy array
    #   problem_type: 'r' for regression and 'c' for classification 
    def __init__(self, np_dataset, problem_type):
        self.__np_dataset = np_dataset
        self.__problem_type = problem_type
  
    # Returns: 
    #   fold0, fold1, fold2, fold3, fold4
    #   Five folds whose class frequency distributions are 
    #   each representative of the entire original data set (i.e. Five-Fold 
    #   Stratified Cross Validation)
    def get_five_folds(self):

        # Record the number of columns in the data set
        no_of_columns = np.size(self.__np_dataset,1)

        # Record the number of rows in the data set
        no_of_rows = np.size(self.__np_dataset,0)

        # Create five empty folds (i.e. numpy arrays: fold0 through fold4)
        fold0 = np.arange(1)
        fold1 = np.arange(1)
        fold2 = np.arange(1)
        fold3 = np.arange(1)
        fold4 = np.arange(1)

        # Shuffle the data set randomly
        np.random.shuffle(self.__np_dataset)

        # Generate folds for classification problem
        if self.__problem_type == "c":

            # Record the column of the Actual Class
            actual_class_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Actual Class values
            unique_class_arr = np.unique(self.__np_dataset[
                :,actual_class_column])

            unique_class_arr_size = unique_class_arr.size

            # For each unique class in the unique Actual Class array
            for unique_class_arr_idx in range(0, unique_class_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique class. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique class is equal to the actual
                    # class in the original data set on this row
                    if unique_class_arr[unique_class_arr_idx] == (
                        self.__np_dataset[row,actual_class_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0

        # If this is a regression problem
        else:
            # Record the column of the Stratification Bin
            strat_bin_column = no_of_columns - 1

            # Generate an array containing the unique 
            # Stratification Bin values
            unique_bin_arr = np.unique(self.__np_dataset[
                :,strat_bin_column])

            unique_bin_arr_size = unique_bin_arr.size

            # For each unique bin in the unique Stratification Bin array
            for unique_bin_arr_idx in range(0, unique_bin_arr_size):

                # Initialize the counter to 0
                counter = 0

                # Go through each row of the data set and find instances that
                # are part of this unique bin. Distribute them among one
                # of five folds
                for row in range(0, no_of_rows):

                    # If the value of the unique bin is equal to the actual
                    # bin in the original data set on this row
                    if unique_bin_arr[unique_bin_arr_idx] == (
                        self.__np_dataset[row,strat_bin_column]):

                            # Allocate instance to fold0
                            if counter == 0:

                                # If fold has not yet been created
                                if np.size(fold0) == 1:

                                    fold0 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold0 = np.vstack([fold0,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold1
                            elif counter == 1:

                                # If fold has not yet been created
                                if np.size(fold1) == 1:

                                    fold1 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold1 = np.vstack([fold1,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold2
                            elif counter == 2:

                                # If fold has not yet been created
                                if np.size(fold2) == 1:

                                    fold2 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold2 = np.vstack([fold2,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold3
                            elif counter == 3:

                                # If fold has not yet been created
                                if np.size(fold3) == 1:

                                    fold3 = self.__np_dataset[row,:]

                                    # Increase the counter by 1
                                    counter += 1

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold3 = np.vstack([fold3,new_row])
                                    
                                    # Increase the counter by 1
                                    counter += 1

                            # Allocate instance to fold4
                            else:

                                # If fold has not yet been created
                                if np.size(fold4) == 1:

                                    fold4 = self.__np_dataset[row,:]

                                    # Reset counter to 0
                                    counter = 0

                                # Append this instance to the fold
                                else:

                                    # Extract data for the new row
                                    new_row = self.__np_dataset[row,:]

                                    # Append that entire instance to fold
                                    fold4 = np.vstack([fold4,new_row])
                                    
                                    # Reset counter to 0
                                    counter = 0
        
        return fold0, fold1, fold2, fold3, fold4

Implementation 2 (Using the Counter subclass)

Here is another implementation using the Counter subclass.

import random
from collections import Counter # Used for counting

# File name: five_fold_stratified_cv.py
# Author: Addison Sears-Collins
# Date created: 7/7/2019
# Python version: 3.7
# Description: Implementation of five-fold stratified cross-validation
# Divide the data set into five random groups. Make sure 
# that the proportion of each class in each group is roughly equal to its 
# proportion in the entire data set.

# Required Data Set Format for Classification Problems:
# Columns (0 through N)
# 0: Class
# 1: Attribute 1 
# 2: Attribute 2
# 3: Attribute 3 
# ...
# N: Attribute N

def get_five_folds(instances):
    """
    Parameters:
        instances: A list of dictionaries where each dictionary is an instance. 
            Each dictionary contains attribute:value pairs 
    Returns: 
        fold0, fold1, fold2, fold3, fold4
        Five folds whose class frequency distributions are 
        each representative of the entire original data set (i.e. Five-Fold 
        Stratified Cross Validation)
    """
    # Create five empty folds
    fold0 = []
    fold1 = []
    fold2 = []
    fold3 = []
    fold4 = []

    # Shuffle the data randomly
    random.shuffle(instances)

    # Generate a list of the unique class values and their counts
    classes = []  # Create an empty list named 'classes'

    # For each instance in the list of instances, append the value of the class
    # to the end of the classes list
    for instance in instances:
        classes.append(instance['Class'])

    # Create a list of the unique classes
    unique_classes = list(Counter(classes).keys())

    # For each unique class in the unique class list
    for uniqueclass in unique_classes:

        # Initialize the counter to 0
        counter = 0
        
        # Go through each instance of the data set and find instances that
        # are part of this unique class. Distribute them among one
        # of five folds
        for instance in instances:

            # If we have a match
            if uniqueclass == instance['Class']:

                # Allocate instance to fold0
                if counter == 0:

                    # Append this instance to the fold
                    fold0.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold1
                elif counter == 1:

                    # Append this instance to the fold
                    fold1.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold2
                elif counter == 2:

                    # Append this instance to the fold
                    fold2.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold3
                elif counter == 3:

                    # Append this instance to the fold
                    fold3.append(instance)

                    # Increase the counter by 1
                    counter += 1

                # Allocate instance to fold4
                else:

                    # Append this instance to the fold
                    fold4.append(instance)

                    # Reset the counter to 0
                    counter = 0

    # Shuffle the folds
    random.shuffle(fold0)
    random.shuffle(fold1)
    random.shuffle(fold2)
    random.shuffle(fold3)
    random.shuffle(fold4)

    # Return the folds
    return  fold0, fold1, fold2, fold3, fold4

Difference Between Maximum Likelihood and Maximum a Posteriori Estimation

In this blog post, we will take a look at the difference between maximum likelihood estimation and maximum a posteriori estimation. Both are methods that attempt to estimate unknown values for parameters.

What is Maximum Likelihood Estimation?

I will explain the term maximum likelihood estimation by using a real-world example. 

Suppose you went out to the store and bought a six-sided die. You have no prior information about the type of die that you have. It could be a standard die which has a uniform prior probability distribution (i.e. each face has 1/6 chance of being face-up on any given roll), or you could have a weighted die where some numbers are more likely to appear than others. How would you calculate the probability of getting each number for a given roll of the die?

One way to calculate the probabilities is to roll the die 10,000 times and keep track of how many times (i.e. frequency in the table below) each face appeared. Your table might look something like this:

1-roll-dice-table

What you see above is the basis of maximum likelihood estimation. In maximum likelihood estimation, you estimate the parameters by maximizing the “likelihood function.” Stated more simply, you choose the value of the parameters that were most likely to have generated the data that was observed in the table above. 

So, what is the problem with maximum likelihood estimation? Well, as you saw above, we did not incorporate any prior knowledge (i.e. prior belief information) into our calculation. What if we knew that this was a weighted die where the probabilities of each face were as follows:

2-probabilities-dice

Maximum likelihood estimation does not take this into account. It assumes a uniform prior probability distribution. That is, it assumes that the probabilities of each face of the die are equal before we begin rolling the die 500 times. The problem with this assumption is that you would need to have a huge dataset (i.e. have to roll the die many times, perhaps until your arm gets too tired to continue rolling!) to get values for the appropriate probabilities that are close to what they should be.

More formally, maximum likelihood estimation looks like this:

Parameter = argmax P(Observed Data | Parameter)

where P = probability and | = given.

So what do we do? Maximum a posteriori (MAP) estimation to the rescue!

What is Maximum a Posteriori (MAP) Estimation?

Maximum a Posteriori (MAP) Estimation is similar to Maximum Likelihood Estimation (MLE) with a couple major differences. MAP takes prior probability information into account. 

For example, if we knew that the die in our example above was a weighted die with the probabilities noted in the table in the previous section, MAP estimation factors this information into the parameter estimation.

More formally, MAP estimation looks like this:

Parameter = argmax P(Observed Data | Parameter)P(Parameter)

where P = probability and | = given.

Contrast that equation above to the MLE equation from the previous section. Note that the likelihood is now weighted by the prior probability. This distinction is important. 

Imagine in MLE if we did 500 rolls, and one of the faces appeared only one time (i.e. 1/500). MLE would generate incorrect parameter values. Having that extra nonzero prior probability factor makes sure that the model does not overfit to the observed data in the way that MLE does.

Why Is Maximum a Posteriori (MAP) Estimation the Ideal Estimation Method for Smaller Datasets?

It is ideal because it takes into account prior knowledge of an event. MLE does not and is prone to overfitting. For this reason, MAP is considered a regularization of MLE. Adding the prior probability information reduces the overdependence on the observed data for parameter estimation.

Latent Variables and Latent Variable Models

In this post, I will explore the idea of latent variable models, where a latent variable is a “hidden” variable (i.e., one we cannot observe directly but that plays a part in the overall model).

What is a Latent Variable?

A latent variable is a variable that is not directly observed but is inferred from other variables we can measure directly.

A common example of a latent variable is quality of life. You cannot measure quality of life directly.

Real-world Example of Latent Variables

A few years ago, my wife and I decided to move cities to a place which would be able to offer us a better quality of life. I created a spreadsheet in Excel containing all the factors that influence the quality of life in a particular location. They are:

  1. Days per year of sunshine
  2. Unemployment rate
  3. Crime rate
  4. Average household income
  5. State income tax rate
  6. Cost of living
  7. Population density
  8. Median home price
  9. Average annual temperature
  10. Number of professional sports teams
  11. Average number of days of snow per year
  12. Population
  13. Average inches of rain per year
  14. Number of universities
  15. Distance to an international airport

If I wanted to feed this into a machine learning algorithm, I could then develop a mathematical model for quality of life.

The advantage of using the latent variable model with quality of life is that I can use a fewer number of features, which reduces the dimensionality of what would be an enormous data set. All those factors I mentioned are correlated in some way to the latent variable quality of life. So instead of using 15 columns of data, I can use just 1. The data then becomes not only easier to understand but useful if I want to use a quality of life score as a feature for any type of classification or regression problem because it captures the underlying phenomenon we are interested in.

Exploring the Use of Latent Variables Across Different Model Types

test_testing_exam_sat

One particularly useful framework for exploring the use of latent variables across different model types is the idea of path analysis. In path analysis, one develops a model that shows the relationships between variables that we can readily measure and variables that we cannot readily measure. These variables that we cannot readily measure are called latent variables.

By convention, variables that we can readily measure will be placed inside rectangles or boxes. Variables that we cannot readily measure, the latent variables, are typically located inside ovals.

Let’s look at an example now of how we can model latent variables using the latent variable of childhood intelligence.

1-sat-score

As you can see in this diagram we have four different boxes.

We have the oval at the bottom which is representative of the intelligence of a child. This is the latent variable. This is the variable that we want to predict.

Then up above we have two boxes. One box includes the father’s SAT score. The other box is the mother’s SAT score. And then finally we have another box which represents the average SAT scores of students in the state that the child lives in.

Note that in this particular path model, we draw arrows showing relationships between two variables. The arrow between the father and the mother is indicative of correlation. We are inferring in this particular model a correlation between the SAT score of the father and the SAT score of the mother, and we are also inferring a relationship between both of their SAT scores and the intelligence of a child.

We are also showing a correlation relationship between the average SAT score of the students in the state and the child’s intelligence.

But notice that there are no arrows that are drawn from either the father or the mother to the average SAT scores in the state. The lack of arrows is indicative of a lack of correlation between those two variables.

In short, I believe at the outset of a machine learning problem, we can draw different path models to begin to model the relationships and to develop an appropriate framework for selecting a model to tackle a particular machine learning problem.

Getting back to the idea of path modeling that I described earlier, we can use this diagram to not only explore the relationship between observed variables and latent variables, but we could use it as a springboard to actually develop quantifiable relationships between observed and latent variables…in other words, the numerical coefficients that we can use as inputs into a particular mathematical model.

Let’s take a look at an example, perhaps the simplest example we can think of…the equation for a line of best fit:

y = mx + b. 

You might recall the equation y = mx + b from grade school. In this particular equation, y is the dependent variable, b is the intercept, m is the slope, and x is the independent variable.

In a linear regression model we typically see an equation of the form:

Y = (BETA) * X + EPSILON 

where EPSILON is the “residual”, the difference between actual and predicted values for a given value for x.

What would a path diagram look like for this example?

Let’s draw it.

2-xyz-epsilon

I think this is a very cool way of being able to very easily take a very large data set and begin to think of ways of how we can condense this data set into a data set which would have lower storage requirements. By adding weightings to the arrows and by following the actual arrows that show the correlations between two variables whether observed or unobserved, we can then use this as an outline to start pruning irrelevant features from the data set. The next step would be able to use the subset of features to develop a hopefully useful predictive model with one of the common machine learning models available.

Factor Analysis

One framework for generating latent variable models is the idea of factor analysis.

In factor analysis, we want to determine if the variations in the measured, observed variables are primarily the result of variations in the underlying latent variables. We can then use this information, the interdependencies between the latent variables and the observed variables, to eliminate some of the features in our data set and therefore reduce the storage requirements for a particular machine learning problem. We do this by estimating the covariance (the extent to which two variables move in tandem with each other) between the observed variables and the unobserved latent variables.

Why is this promising? It is promising because we can create models of things we want to measure but are not easily able to measure. And in doing that, we can tackle the representation bias inherent in many decisions made by human beings.

Real-World Example of Factor Analysis

job_interview_colleagues_business

Let’s look at a concrete example to see how all of this fits together and how it would be implemented in the real world.

Imagine we are a software engineering team at a big company trying to hire employees that have “outstanding” technical ability. “Technical ability,” like the terms altruism, generosity, health, creativity, aggressiveness, and innovativeness, are all terms that are not easily quantifiable or measured. So let us consider the term technical ability a latent variable.

We can then consider some observed variables as part of the candidate evaluation process:

  • Number of lines of code written
  • Kaggle scores
  • Hacker rank
  • Grade point average
  • SAT score
  • GRE score
  • Number of years of experience
3-factor_analysis

These would be all of the observed variables that we believe are connected in some way to the underlying factor known as technical ability.

At the end of the day, technical ability is what we would like to discover, but since we can’t measure that directly , we use data from some other variables in order to back in to that technical ability. And if you see on each of these arrows, we could add additional weightings to our model to be able to determine which of the particular observed variables we can safely remove from the model and still achieve a result that is desirable that gets us a candidate that has the technical ability we want.

So you can see by having some kind of model developed by factor analysis, we can help eliminate some of the representation bias that might be evident in a hiring system.

At the end of the day, people who hire teams of software engineers have their own inherent bias about who to hire and who not to hire. These biases might be based on irrelevant factors such as the hiring manager’s experience with people who looked similar to the candidate, how the candidate speaks, etc.

By evaluating the results of a model that is based on mathematics, we can then compare the results of the model to individual opinions in the team to try to get a more appropriate result and hire the best candidate…all the while alleviating any concerns of representative bias that might exist in the hiring process.