Human Pose Estimation Using Deep Learning in OpenCV

In this tutorial, we will implement human pose estimation. Pose estimation means estimating the position and orientation of objects (in this case humans) relative to the camera. By the end of this tutorial, you will be able to generate the following output:

Real-World Applications

Human pose estimation has a number of real-world applications:

Robotic task learning: Enabling robots to acquire new skills by imitating the actions of a human teacher
Virtual reality applications
Augmented reality applications (overlaying graphics on top of physical objects)
Sign language understanding
Recognizing human poses (sitting, standing, running, etc.)

Let’s get started!

Prerequisites

Python 3.7 or higher

Installation and Setup

We need to make sure we have all the software packages installed. Check to see if you have OpenCV installed on your machine. If you are using Anaconda, you can type:

conda install -c conda-forge opencv

Alternatively, you can type:

pip install opencv-python

Make sure you have NumPy, a scientific computing library for Python, installed.

If you’re using Anaconda, you can type:

conda install numpy

Alternatively, you can type:

pip install numpy

Find Some Videos

The first thing we need to do is find some videos to serve as our test cases.

We want to download videos that contain humans. The video files should be in mp4 format and 1920 x 1080 in dimensions.

I found some good candidates on Pixabay.com and Dreamstime.com.

Take your videos and put them inside a directory on your computer.

Download the Protobuf File

Inside the same directory as your videos, download the protobuf file on this page. It is named graph_opt.pb. This file contains the weights of the neural network. The neural network is what we will use to determine the human’s position and orientation (i.e. pose).

Brief Description of OpenPose

We will use the OpenPose application along with OpenCV to do what we need to do in this project. OpenPose is an open source real-time 2D pose estimation application for people in video and images. It was developed by students and faculty members at Carnegie Mellon University.

You can learn the theory and details of how OpenPose works in this paper and at GeeksforGeeks.

Write the Code

Here is the code. Make sure you put the code in the same directory on your computer where you put the other files.

The only lines you need to change are:

Line 14 (name of the input file in mp4 format)
Line 15 (input file size)
Line 18 (output file name)

# Project: Human Pose Estimation Using Deep Learning in OpenCV
# Author: Addison Sears-Collins
# Date created: February 25, 2021
# Description: A program that takes a video with a human as input and outputs
# an annotated version of the video with the human's position and orientation.

# Reference: https://github.com/quanhua92/human-pose-estimation-opencv

# Import the important libraries
import cv2 as cv # Computer vision library
import numpy as np # Scientific computing library

# Make sure the video file is in the same directory as your code
filename = 'dancing32.mp4'
file_size = (1920,1080) # Assumes 1920x1080 mp4 as the input video file

# We want to save the output to a video file
output_filename = 'dancing32_output.mp4'
output_frames_per_second = 20.0 

BODY_PARTS = { "Nose": 0, "Neck": 1, "RShoulder": 2, "RElbow": 3, "RWrist": 4,
               "LShoulder": 5, "LElbow": 6, "LWrist": 7, "RHip": 8, "RKnee": 9,
               "RAnkle": 10, "LHip": 11, "LKnee": 12, "LAnkle": 13, "REye": 14,
               "LEye": 15, "REar": 16, "LEar": 17, "Background": 18 }

POSE_PAIRS = [ ["Neck", "RShoulder"], ["Neck", "LShoulder"], ["RShoulder", "RElbow"],
               ["RElbow", "RWrist"], ["LShoulder", "LElbow"], ["LElbow", "LWrist"],
               ["Neck", "RHip"], ["RHip", "RKnee"], ["RKnee", "RAnkle"], ["Neck", "LHip"],
               ["LHip", "LKnee"], ["LKnee", "LAnkle"], ["Neck", "Nose"], ["Nose", "REye"],
               ["REye", "REar"], ["Nose", "LEye"], ["LEye", "LEar"] ]

# Width and height of training set
inWidth = 368
inHeight = 368

net = cv.dnn.readNetFromTensorflow("graph_opt.pb")

cap = cv.VideoCapture(filename)

# Create a VideoWriter object so we can save the video output
fourcc = cv.VideoWriter_fourcc(*'mp4v')
result = cv.VideoWriter(output_filename,  
                         fourcc, 
                         output_frames_per_second, 
                         file_size) 
# Process the video
while cap.isOpened():
    hasFrame, frame = cap.read()
    if not hasFrame:
        # No window is ever shown, so there is no key press to wait for
        break

    frameWidth = frame.shape[1]
    frameHeight = frame.shape[0]
    
    net.setInput(cv.dnn.blobFromImage(frame, 1.0, (inWidth, inHeight), (127.5, 127.5, 127.5), swapRB=True, crop=False))
    out = net.forward()
    out = out[:, :19, :, :]  # MobileNet output [1, 57, -1, -1], we only need the first 19 elements

    assert(len(BODY_PARTS) == out.shape[1])

    points = []
    for i in range(len(BODY_PARTS)):
        # Slice the heatmap of the corresponding body part.
        heatMap = out[0, i, :, :]

        # The original approach finds all the local maxima. To keep this sample
        # simple, we find only the global maximum, which means only one pose
        # can be detected per frame.
        _, conf, _, point = cv.minMaxLoc(heatMap)
        x = (frameWidth * point[0]) / out.shape[3]
        y = (frameHeight * point[1]) / out.shape[2]
        # Add a point if its confidence is higher than the threshold.
        # Feel free to adjust this confidence value.
        points.append((int(x), int(y)) if conf > 0.2 else None)

    for pair in POSE_PAIRS:
        partFrom = pair[0]
        partTo = pair[1]
        assert(partFrom in BODY_PARTS)
        assert(partTo in BODY_PARTS)

        idFrom = BODY_PARTS[partFrom]
        idTo = BODY_PARTS[partTo]

        if points[idFrom] and points[idTo]:
            cv.line(frame, points[idFrom], points[idTo], (0, 255, 0), 3)
            cv.ellipse(frame, points[idFrom], (3, 3), 0, 0, 360, (255, 0, 0), cv.FILLED)
            cv.ellipse(frame, points[idTo], (3, 3), 0, 0, 360, (255, 0, 0), cv.FILLED)

    t, _ = net.getPerfProfile()
    freq = cv.getTickFrequency() / 1000
    cv.putText(frame, '%.2fms' % (t / freq), (10, 20), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0))

    # Write the frame to the output video file
    result.write(frame)
		
# Stop when the video is finished
cap.release()
	
# Release the video recording
result.release()
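
To see the keypoint step of the program in isolation, here is a minimal sketch that does the same thing with NumPy alone: it finds the global peak of one heatmap, checks it against the confidence threshold, and scales the peak from heatmap coordinates to frame coordinates. The function name heatmap_to_point and the synthetic 46 x 46 heatmap are assumptions made for illustration. They are not part of OpenCV or of the program above.

```python
import numpy as np

def heatmap_to_point(heatmap, frame_width, frame_height, threshold=0.2):
    """Return the (x, y) frame coordinates of the heatmap peak, or None."""
    # Row and column of the global maximum (one person per frame is assumed)
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    confidence = heatmap[row, col]
    if confidence <= threshold:
        return None
    # Scale from heatmap resolution up to frame resolution
    x = (frame_width * col) / heatmap.shape[1]
    y = (frame_height * row) / heatmap.shape[0]
    return int(x), int(y)

# Synthetic heatmap (hypothetical size) with one confident peak
demo = np.zeros((46, 46), dtype=np.float32)
demo[10, 23] = 0.9

print(heatmap_to_point(demo, 1920, 1080))  # (960, 234)
```

Because only the single highest peak is kept, a second person in the frame would be ignored, which is the limitation noted in the code comments above.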

Run the Code

To run the code, type:

python openpose.py

Video Output

Here is the output I got:

Further Work

If you would like to do a deep dive into pose estimation, check out the official GitHub for the OpenPose project here.

That’s it. Keep building!

Image Feature Detection, Description, and Matching in OpenCV

In this tutorial, we will implement various image feature detection (a.k.a. feature extraction) and description algorithms using OpenCV, the computer vision library for Python. I’ll explain what a feature is later in this post.

We will also look at an example of how to match features between two images. This process is called feature matching.

Real-World Applications

Object Detection
Object Tracking
Object Classification

Let’s get started!

Prerequisites

Python 3.7 or higher

What is a Feature?

Do you remember when you were a kid, and you played with puzzles? The objective was to put the puzzle pieces together. When the puzzle was all assembled, you would be able to see the big picture, which was usually some person, place, thing, or combination of all three.

What enabled you to successfully complete the puzzle? Each puzzle piece contained some clues…perhaps an edge, a corner, a particular color pattern, etc. You used these clues to assemble the puzzle.

The “clues” in the example I gave above are image features. A feature in computer vision is a region of interest in an image that is unique and easy to recognize. Features include things like points, edges, blobs, and corners.

For example, suppose you saw this feature.

You see some shapes, edges, and corners. These features are clues to what this object might be.

Now, let’s say we also have this feature.

Can you recognize what this object is?

Many Americans and people who have traveled to New York City would guess that this is the Statue of Liberty. And in fact, it is.

With just two features, you were able to identify this object. Computers follow a similar process when you run a feature detection algorithm to perform object recognition.

The Python computer vision library OpenCV has a number of algorithms to detect features in an image. We will explore these algorithms in this tutorial.

Installation and Setup

Before we get started, let’s make sure we have all the software packages installed. Check to see if you have OpenCV installed on your machine. If you are using Anaconda, you can type:

conda install -c conda-forge opencv

Alternatively, you can type:

pip install opencv-python

Install NumPy, the scientific computing library.

pip install numpy

Install Matplotlib, the plotting library.

pip install matplotlib

Find an Image File

Find an image of any size. Here is mine:

Difference Between a Feature Detector and a Feature Descriptor

Before we get started developing our program, let’s take a look at some definitions.

The algorithms for features fall into two categories: feature detectors and feature descriptors.

A feature detector finds regions of interest in an image. The input into a feature detector is an image, and the output is the pixel coordinates of the significant areas in the image.

A feature descriptor encodes that feature into a numerical “fingerprint”. Feature description makes a feature uniquely identifiable from other features in the image.

We can then use the numerical fingerprint to identify the feature even if the image undergoes some type of distortion.

Feature Detection Algorithms

Harris Corner Detection

A corner is an area of an image that has a large variation in pixel color intensity values in all directions. One popular algorithm for detecting corners in an image is called the Harris Corner Detector.

Here is some basic code for the Harris Corner Detector. I named my file harris_corner_detector.py.

# Code Source: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_features_harris/py_features_harris.html

import cv2
import numpy as np

filename = 'random-shapes-small.jpg'
img = cv2.imread(filename)
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

gray = np.float32(gray)
dst = cv2.cornerHarris(gray,2,3,0.04)

#result is dilated for marking the corners, not important
dst = cv2.dilate(dst,None)

# Threshold for an optimal value, it may vary depending on the image.
img[dst>0.01*dst.max()]=[0,0,255]

cv2.imshow('dst',img)
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()

Here is my image before:

Here is my image after:

For a more detailed example, check out my post “Detect the Corners of Objects Using Harris Corner Detector.”
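
To make the idea of "large variation in all directions" concrete, here is a rough NumPy sketch of the Harris response R = det(M) - k * trace(M)^2, where M collects the squared image gradients in a small window. This is a simplification under stated assumptions: it uses central differences and a plain 3 x 3 window instead of the Sobel filter and block size that cv2.cornerHarris uses, so the numbers will not match OpenCV. The helper names box_sum and harris_response are made up for this sketch.

```python
import numpy as np

def box_sum(a, radius=1):
    """Sum each pixel's (2*radius+1)^2 neighborhood, zero-padded at the border."""
    padded = np.pad(a, radius)
    out = np.zeros_like(a, dtype=np.float64)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(gray, k=0.04):
    """Harris corner response for every pixel of a grayscale image."""
    gray = gray.astype(np.float64)
    iy, ix = np.gradient(gray)  # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix)
    syy = box_sum(iy * iy)
    sxy = box_sum(ix * iy)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Synthetic test image: a white square on a black background
square = np.zeros((20, 20))
square[5:15, 5:15] = 1.0
response = harris_response(square)

print(response[5, 5])    # corner of the square: positive
print(response[10, 5])   # middle of an edge: negative
print(response[10, 10])  # flat interior: zero
```

The sign pattern is the useful part: intensity changes in both directions at a corner, in one direction along an edge, and in no direction in a flat region.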

Shi-Tomasi Corner Detector and Good Features to Track

Another corner detection algorithm is called Shi-Tomasi. Let’s run this algorithm on the same image and see what we get. Here is the code. I named the file shi_tomasi_corner_detect.py.

# Code Source: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_shi_tomasi/py_shi_tomasi.html

import numpy as np
import cv2
from matplotlib import pyplot as plt

img = cv2.imread('random-shapes-small.jpg')
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

# Find the top 20 corners
corners = cv2.goodFeaturesToTrack(gray,20,0.01,10)
corners = corners.astype(int)  # np.int0 was removed in NumPy 2.0

for i in corners:
    x,y = i.ravel()
    cv2.circle(img,(int(x),int(y)),3,255,-1)

cv2.imshow('Shi-Tomasi', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Here is the image after running the program:

Scale-Invariant Feature Transform (SIFT)

When we rotate an image or change its size, how can we make sure the features don’t change? The methods I’ve used above aren’t good at handling this scenario.

For example, consider these three images below of the Statue of Liberty in New York City. You know that this is the Statue of Liberty regardless of changes in the angle, color, or rotation of the statue in the photo. However, computers have a tough time with this task.

OpenCV has an algorithm called SIFT that is able to detect features in an image regardless of changes to its size or orientation. This property of SIFT gives it an advantage over other feature detection algorithms which fail when you make transformations to an image.

Here is an example of code that uses SIFT:

# Code source: https://docs.opencv.org/master/da/df5/tutorial_py_sift_intro.html
import numpy as np
import cv2 as cv

# Read the image
img = cv.imread('chessboard.jpg')

# Convert to grayscale
gray = cv.cvtColor(img,cv.COLOR_BGR2GRAY)

# Find the features (i.e. keypoints) and feature descriptors in the image
sift = cv.SIFT_create()
kp, des = sift.detectAndCompute(gray,None)

# Draw circles to indicate the location of features and the feature's orientation
img=cv.drawKeypoints(gray,kp,img,flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

# Save the image
cv.imwrite('sift_with_features_chessboard.jpg',img)

Here is the before:

Here is the after. Each of those circles indicates the size of that feature. The line inside the circle indicates the orientation of the feature:

Speeded-up robust features (SURF)

SURF is a faster version of SIFT. It is another way to find features in an image.

Here is the code:

# Code Source: https://docs.opencv.org/master/df/dd2/tutorial_py_surf_intro.html

import numpy as np
import cv2 as cv

# Read the image
img = cv.imread('chessboard.jpg')

# Find the features (i.e. keypoints) and feature descriptors in the image
# Note: SURF is part of the opencv-contrib package and needs a build with
# the non-free algorithms enabled
surf = cv.xfeatures2d.SURF_create(400)
kp, des = surf.detectAndCompute(img,None)

# Draw circles to indicate the location of features and the feature's orientation
img=cv.drawKeypoints(img,kp,None,flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

# Save the image
cv.imwrite('surf_with_features_chessboard.jpg',img)

Features from Accelerated Segment Test (FAST)

A lot of the feature detection algorithms we have looked at so far work well in different applications. However, they aren’t fast enough for some robotics use cases (e.g. SLAM).

The FAST algorithm, implemented here, is a really fast algorithm for detecting corners in an image.
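
To give a feel for why FAST is quick, here is a small sketch of its core segment test: a pixel counts as a corner if at least n contiguous pixels on a 16-pixel circle around it are all brighter, or all darker, than the center by more than a threshold. This is an illustrative sketch and not OpenCV's implementation. The function name is_fast_corner, the default threshold of 20, and n = 9 are assumptions chosen for the demo, and the pixel must be at least 3 pixels away from the image border.

```python
import numpy as np

# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in clockwise order
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(gray, x, y, threshold=20, n=9):
    """Segment test for the pixel at column x, row y."""
    center = int(gray[y, x])
    ring = [int(gray[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (1, -1):  # brighter test first, then darker test
        flags = [sign * (p - center) > threshold for p in ring]
        # Walk the ring twice so runs that wrap past the last index are counted
        run = best = 0
        for flag in flags + flags:
            run = run + 1 if flag else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

# Synthetic images: a bright quadrant (corner), a bright half (edge), and a flat image
corner = np.zeros((15, 15), dtype=np.uint8)
corner[7:, 7:] = 200
edge = np.zeros((15, 15), dtype=np.uint8)
edge[:, 7:] = 200
flat = np.full((15, 15), 100, dtype=np.uint8)

print(is_fast_corner(corner, 7, 7))  # True
print(is_fast_corner(edge, 7, 7))    # False
print(is_fast_corner(flat, 7, 7))    # False
```

The test only compares 16 pixels against the center, with no gradients or filtering, which is where the speed comes from.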

Blob Detectors With LoG, DoG, and DoH

A blob is another type of feature in an image. A blob is a region in an image with similar pixel intensity values. Another definition you will hear is that a blob is a light on dark or a dark on light area of an image.

Three popular blob detection algorithms are Laplacian of Gaussian (LoG), Difference of Gaussian (DoG), and Determinant of Hessian (DoH).
Basic implementations of these blob detectors are at this page on the scikit-image website. Scikit-image is an image processing library for Python.

Feature Descriptor Algorithms

Histogram of Oriented Gradients

The HoG algorithm breaks an image down into small sections and calculates the gradient and orientation in each section. This information is then gathered into bins to compute histograms. These histograms give an image numerical “fingerprints” that make it uniquely identifiable.

A basic implementation of HoG is at this page.
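
As a rough illustration of the binning step, the sketch below computes the orientation histogram for a single cell using NumPy. It is deliberately simplified: real HoG implementations also interpolate between neighboring bins and normalize over blocks of cells, which this sketch skips. The function name cell_histogram and the choice of 9 bins over 0 to 180 degrees are assumptions for the example.

```python
import numpy as np

def cell_histogram(cell, bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    cell = cell.astype(np.float64)
    gy, gx = np.gradient(cell)          # gradients along rows and columns
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation: fold all angles into the range [0, 180)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0),
                           weights=magnitude)
    return hist

# An 8 x 8 cell whose brightness increases from left to right
ramp = np.tile(np.arange(8), (8, 1))

print(cell_histogram(ramp))    # all the weight lands in the first bin (0 to 20 degrees)
print(cell_histogram(ramp.T))  # all the weight lands in the bin containing 90 degrees
```

Concatenating these per-cell histograms over the whole image gives the numerical fingerprint described above.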

Binary Robust Independent Elementary Features (BRIEF)

BRIEF is a fast, efficient alternative to SIFT. A sample implementation of BRIEF is here at the OpenCV website.

Oriented FAST and Rotated BRIEF (ORB)

SIFT was patented for many years, and SURF is still a patented algorithm. ORB was created in 2011 as a free alternative to these algorithms. It combines the FAST and BRIEF algorithms. You can find a basic example of ORB at the OpenCV website.

Feature Matching Example

You can use ORB to locate features in an image and then match them with features in another image.

For example, consider this Whole Foods logo. This logo will be our training image.

I want to locate this Whole Foods logo inside this image below. This image below is our query image.

Here is the code you need to run. My file is called feature_matching_orb.py.

import numpy as np 
import cv2 
from matplotlib import pyplot as plt
	
# Read the training and query images
query_img = cv2.imread('query_image.jpg') 
train_img = cv2.imread('training_image.jpg') 

# Convert the images to grayscale 
query_img_gray = cv2.cvtColor(query_img,cv2.COLOR_BGR2GRAY) 
train_img_gray = cv2.cvtColor(train_img, cv2.COLOR_BGR2GRAY) 

# Initialize the ORB detector algorithm 
orb = cv2.ORB_create() 

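# Detect keypoints (features) and calculate the descriptors
query_keypoints, query_descriptors = orb.detectAndCompute(query_img_gray,None) 
train_keypoints, train_descriptors = orb.detectAndCompute(train_img_gray,None) 

# Match the keypoints. ORB descriptors are binary, so compare them with the
# Hamming distance, then sort so that the best matches come first
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True) 
matches = matcher.match(query_descriptors,train_descriptors) 
matches = sorted(matches, key=lambda m: m.distance)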

# Draw the keypoint matches on the output image
output_img = cv2.drawMatches(query_img, query_keypoints, 
train_img, train_keypoints, matches[:20],None) 

output_img = cv2.resize(output_img, (1200,650)) 

# Save the final image 
cv2.imwrite("feature_matching_result.jpg", output_img) 

# Close OpenCV upon keypress
cv2.waitKey(0)
cv2.destroyAllWindows()

Here is the result:

If you want to dive deeper into feature matching algorithms (Homography, RANSAC, Brute-Force Matcher, FLANN, etc.), check out the official tutorials on the OpenCV website. This page and this page have some basic examples.
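
If you are curious what a brute-force matcher does with binary descriptors such as ORB's, here is a minimal NumPy sketch. It computes the Hamming distance (the number of differing bits) between every pair of descriptors and keeps only mutual nearest neighbors. The function names and the tiny 2-byte descriptors are made up for the demo (real ORB descriptors are 32 bytes long), and this is not how OpenCV implements its matcher internally.

```python
import numpy as np

def hamming_distances(query_desc, train_desc):
    """Pairwise Hamming distances between rows of two uint8 descriptor arrays."""
    # XOR every query row with every train row, then count the differing bits
    xor = query_desc[:, None, :] ^ train_desc[None, :, :]
    return np.unpackbits(xor, axis=2).sum(axis=2)

def match_cross_check(query_desc, train_desc):
    """Return (query_idx, train_idx, distance) for mutual nearest neighbors."""
    dist = hamming_distances(query_desc, train_desc)
    best_train = dist.argmin(axis=1)  # nearest train descriptor per query
    best_query = dist.argmin(axis=0)  # nearest query descriptor per train
    matches = [(q, int(t), int(dist[q, t]))
               for q, t in enumerate(best_train) if best_query[t] == q]
    # Smallest distance (best match) first
    return sorted(matches, key=lambda m: m[2])

# Hypothetical 2-byte descriptors
query = np.array([[0b11110000, 0], [0b00001111, 255]], dtype=np.uint8)
train = np.array([[0b00001111, 254], [0b11110001, 0], [255, 255]], dtype=np.uint8)

print(match_cross_check(query, train))  # [(0, 1, 1), (1, 0, 1)]
```

Each match differs by a single bit, while the third training descriptor is left unmatched because no query descriptor picks it as its nearest neighbor.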

That’s it. Keep building!

Difference Between Histogram Equalization and Histogram Matching

In this post, I will explain the difference between histogram equalization and histogram matching. If you are in a hurry, here is the short answer: while the goal of histogram equalization is to produce an output image that has a flattened histogram, the goal of histogram matching is to take an input image and generate an output image that is based upon the shape of a specific (or reference) histogram.

Let’s take a look at the long answer by first examining the definition of a histogram (continued after the Table of Contents).

What is a Histogram?
Histogram Equalization
–How Histogram Equalization Works
–Example of Histogram Equalization
Histogram Matching
–Example of Histogram Matching

What is a Histogram?

In image processing, a histogram shows the number of pixels (or voxels in the case of a 3D image) for each intensity value in a given image.

Example histogram (image source: Wikimedia Commons)

A histogram is a statistical representation of an image. It doesn’t show any information about where the pixels are located in the image. Therefore, two different images can have equivalent histograms. For example, the two images below are different but have identical histograms because both are 50% white (grayscale value of 255) and 50% black (grayscale value of 0).
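
Here is a small sketch of that point using NumPy. It builds two different 4 x 4 images, each half black and half white, and shows that their histograms are identical. The helper name gray_histogram and the two tiny test images are assumptions made for the example.

```python
import numpy as np

def gray_histogram(img, levels=256):
    """Count how many pixels take each intensity value 0..levels-1."""
    return np.bincount(img.ravel(), minlength=levels)

# Image 1: left half black, right half white
split = np.zeros((4, 4), dtype=np.uint8)
split[:, 2:] = 255

# Image 2: checkerboard of black and white pixels
checker = np.zeros((4, 4), dtype=np.uint8)
checker[::2, ::2] = 255
checker[1::2, 1::2] = 255

print(np.array_equal(split, checker))  # False: the images differ
print(np.array_equal(gray_histogram(split), gray_histogram(checker)))  # True
```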


Histogram Equalization

In histogram equalization (also known as histogram flattening), the goal is to improve contrast in images that look washed out or that have a background and foreground that are either both bright or both dark. Histogram equalization makes the details in such images easier to see.

Low contrast images typically have histograms that are concentrated within a tight range of values. Histogram equalization can improve the contrast in these images by spreading out the histogram so that the intensity values are distributed uniformly over a larger intensity range. Ideally, the histogram of the output image will be perfectly flat.

The two images below are two examples of what the histogram for an input image might look like before and after it goes through histogram equalization.

Histogram before and after equalization, example 1 (image source: Wikimedia Commons)

Histogram before and after equalization, example 2 (image source: Wikimedia Commons)

Histogram equalization is useful in a number of real-world use cases, such as x-rays, thermal imagery, and satellite photos.

Here is some Python code you can use to perform histogram equalization:

# Author: Addison Sears-Collins
# https://automaticaddison.com
# Description: Increase the contrast of an image
# using histogram equalization

import cv2 # Computer vision library
import numpy as np # Scientific computing library

# Read the image
img = cv2.imread('before.jpg',0)

# Perform histogram equalization
equ = cv2.equalizeHist(img)

# Stack images side-by-side
after = np.hstack((img,equ)) 

# Save the output image
cv2.imwrite('after.jpg',after)

Here is the input:

Here is the output generated by the program:

Original image (left), enhanced image (right)


How Histogram Equalization Works

The process for histogram equalization is as follows:

Step 1: Obtain the histogram.

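For example, if the image is grayscale with 256 distinct intensity levels i (where i = 0 [black], 1, 2, …. 253, 254, 255 [white]), the probability that a pixel chosen at random will have an intensity level i is as follows:

$h(i) = \frac{n_i}{N}$

where n_i is the number of pixels with intensity level i, and N is the total number of pixels in the image. The values h(i) form the normalized histogram.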

Step 2: Obtain the cumulative distribution function CDF.

The cumulative distribution function H(j) is defined as the probability H of a randomly selected pixel taking one of the intensity values from 0 through j (inclusive). Therefore, given our normalized histogram h(i) from above, we have the following formula:

$H(j) = \sum_{i=0}^{j} h(i)$

The sum of all the components in the normalized histogram is equal to 1. Therefore, for our 256-level example:

$H(255) = \sum_{i=0}^{255} h(i) = 1$

Step 3: Calculate the transformation T to map the old intensity values to new intensity values.

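The transformation is:

$T(j) = \lfloor (K - 1) \, H(j) \rfloor$

where K represents the total number of possible intensity values (e.g. 256), j is the old intensity value, H(j) is the cumulative distribution function, and T(j) is the new intensity value.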

Step 4: Given the new mappings of intensity values, we can use a lookup table to transform each pixel in the input image to a new intensity.

The result of this transformation is a new histogram which corresponds to a new output image.

Special note on transformation functions:

The formula I used for histogram equalization is a common one, but other transformation functions are possible. Different transformation functions will yield different output histograms.
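
The four steps above can be sketched in a few lines of NumPy. This is an illustration of the arithmetic, not a replacement for cv2.equalizeHist, which may round differently. The function names are assumptions, and the 3-bit test image uses made-up pixel counts that are not taken from the worked example in this post.

```python
import numpy as np

def equalization_lookup(img, K=256):
    """Return the lookup table T, where T[j] is the new value for old value j."""
    # Step 1: histogram (pixel count per intensity level)
    counts = np.bincount(img.ravel(), minlength=K)
    # Step 2: cumulative counts; dividing by img.size would give the CDF
    cumulative = np.cumsum(counts)
    # Step 3: floor((K - 1) * CDF), done in integers to avoid rounding error
    return ((K - 1) * cumulative) // img.size

def equalize(img, K=256):
    # Step 4: apply the lookup table to every pixel
    return equalization_lookup(img, K)[img].astype(np.uint8)

# Hypothetical 3-bit, 8 x 8 image built from made-up pixel counts per gray level
counts = [8, 10, 10, 2, 12, 16, 4, 2]
img = np.repeat(np.arange(8, dtype=np.uint8), counts).reshape(8, 8)

print(equalization_lookup(img, K=8))  # [0 1 3 3 4 6 6 7]
equalized = equalize(img, K=8)
```

Reading the lookup table left to right, gray level 2 becomes 3, gray level 5 becomes 6, and so on. Levels that share an output value are merged in the new histogram.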


Example of Histogram Equalization

Let us suppose we have a 3-bit, 8 x 8 grayscale image. The grayscale range is 2³= 8 intensity values (i.e. gray levels) because the image is 3 bits. We label these intensity values 0 through 7. Below is the histogram of this image.

Now, we calculate the cumulative distribution function and perform the transformation.

The two yellow columns above are our lookup table. We use these two columns to generate the output image. For example, we map all pixels that had a gray level of 3 to 1. We map all pixels that had a gray level of 6 to 5, etc. The resulting histogram looks like this:


Histogram Matching

While the goal of histogram equalization is to produce an output image that has a flattened histogram, the goal of histogram matching is to take an input image and generate an output image that is based upon the shape of a specific (or reference) histogram. Histogram matching is also known as histogram specification. You can consider histogram equalization as a special case of histogram matching in which we want to force an image to have a uniform histogram (rather than just any shape as is the case for histogram matching).

Let us suppose we have two images, an input image and a specified image. We want to use histogram matching to force the input image to have a histogram that is the shape of the histogram of the specified image. The first few steps are similar to histogram equalization, except we are performing histogram equalization on two images (the input image and the specified image).

Step 1: Obtain the histogram for both the input image and the specified image (same method as in histogram equalization).

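For example, if both images are grayscale with 256 distinct intensity levels i (where i = 0 [black], 1, 2, …. 253, 254, 255 [white]), the probability that a pixel chosen at random will have an intensity level i is as follows (computed separately for each image):

$h(i) = \frac{n_i}{N}$

where n_i is the number of pixels with intensity level i, and N is the total number of pixels in that image.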

Step 2: Obtain the cumulative distribution function CDF for both the input image and the specified image (same method as in histogram equalization).

Step 3: Calculate the transformation T to map the old intensity values to new intensity values for both the input image and specified image (same method as in histogram equalization).

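For each image, the transformation is:

$T(j) = \lfloor (K - 1) \, H(j) \rfloor$

where K represents the total number of possible intensity values (e.g. 256), j is the old intensity value, H(j) is the cumulative distribution function of that image, and T(j) is the new intensity value.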

Step 4: Use the transformed intensity values for both the input image and specified image to map the intensity values of the input image to new values.

We go through each available intensity value j one at a time, doing the following steps:

See what the transformed intensity value is for the input image given the intensity value j. Let us call this T_input(j).
We then find the intensity value k whose T_specified(k) is closest to T_input(j), and we map j to k. For example, if j = 4:

we map all intensity values of 4 in the input image to 1.

Here is another example. Let us suppose that:

Therefore, we map all intensity values of 5 in the input image to 2.

After we have gone through all available intensity values and performed all the mappings, we have our output image which has a histogram that will approximately match the shape of the unequalized specified histogram.
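
Here is a minimal NumPy sketch of the matching procedure described above. It equalizes both images, then maps each input level j to the level k whose transformed value in the specified image is closest. Two assumptions are baked in: ties are resolved toward the lowest level, and the two 3-bit test images are built from made-up pixel counts rather than the data in this post. The function names are also hypothetical.

```python
import numpy as np

def equalization_lookup(img, K):
    """T[j] = floor((K - 1) * CDF(j)), computed with integer arithmetic."""
    counts = np.bincount(img.ravel(), minlength=K)
    return ((K - 1) * np.cumsum(counts)) // img.size

def matching_lookup(input_img, specified_img, K=256):
    """Return a table that maps each input level j to a matched level k."""
    t_input = equalization_lookup(input_img, K)
    t_spec = equalization_lookup(specified_img, K)
    # Distance between T_input(j) (rows) and T_specified(k) (columns)
    diff = np.abs(t_input[:, None] - t_spec[None, :])
    # argmin returns the first minimum, so ties resolve to the lowest k
    return diff.argmin(axis=1)

def match_histogram(input_img, specified_img, K=256):
    table = matching_lookup(input_img, specified_img, K)
    return table[input_img].astype(np.uint8)

def image_from_counts(counts):
    """Build an 8 x 8, 3-bit image with the given number of pixels per level."""
    return np.repeat(np.arange(8, dtype=np.uint8), counts).reshape(8, 8)

# Hypothetical pixel counts per gray level (each list sums to 64)
input_img = image_from_counts([8, 10, 10, 2, 12, 16, 4, 2])
specified_img = image_from_counts([2, 2, 4, 8, 16, 16, 8, 8])

print(matching_lookup(input_img, specified_img, K=8))  # [0 3 4 4 4 6 6 7]
matched = match_histogram(input_img, specified_img, K=8)
```

In this made-up example, input level 4 has a transformed value of 4, which sits exactly between the specified values 3 (level 4) and 5 (level 5), so the lower level wins.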


Example of Histogram Matching

Let us take a look at an example. For convenience, I am reposting the unequalized and equalized histogram from the histogram equalization example.

Here is the histogram of the original image.

Now, we equalize the original input image to get the following table and histogram.

Now, let us suppose we have the following specified histogram. We want to get the original image to have a histogram that is shaped like the specified histogram.

We equalize the specified histogram, yielding the following table.

Using the two yellow columns above to map the old intensity values for the pixels to new intensity values, we get the following histogram after equalization:

Now, we need to use the transformed intensity values for both the input image and specified image to map the intensity values of the input image to new values. To do that, all we need are the FLOOR((K – 1) * CDF) values for both the original image and the specified image.

We go through each available intensity value j one at a time, doing the following steps:

See what the transformed intensity value is for the input image given the intensity value j. Call this T_input(j).
We then find the intensity value k whose T_specified(k) is closest to T_input(j), and we map j to k.

For example, when the gray level is 4, the transformed value for the original image is 2. A transformed value of 2 in the specified image corresponds to a gray level of 1. Therefore, we map 4 to 1.

When the gray level is 5, the transformed value for the original image is 3. The specified image has no transformed value of 3. The closest value is 2 (by convention, we go to the next lowest level), which corresponds to a gray level of 1. Therefore, we map 5 to 1.

Here is the final mapping.

To finish the histogram matching process, we have to replace the values in the original image with the map values. The final matched histogram is shown below:

Therefore, the histogram matching process got us from the original image histogram below to that matched histogram above. Notice the matched histogram has a similar shape to the original specified histogram.
