In this tutorial, we will go through how to detect objects in a video stream using OpenCV. We will use MobileNet SSD, a special type of convolutional neural network architecture.

Our output will look like this:

Table of Contents

Real-World Applications

Object Detection
Object Tracking
Object Classification
Autonomous Vehicles
Self-Driving Cars

Let’s get started!

Prerequisites

Python 3.7 or higher

Installation and Setup

We now need to make sure we have all the software packages installed. Check to see if you have OpenCV installed on your machine. If you are using Anaconda, you can type:

conda install -c conda-forge opencv

Alternatively, you can type:

pip install opencv-python

Download the Required Files

Download all the video files and other neural network-related files at this link. Place the files inside a directory on your computer.

Code

In the same folder where your image file is, open a new Python file called object_detection_mobile_ssd.py.

Here is the full code for the system. The only things you’ll need to change in this code is the name of your desired input video file on line 10 and the name of your desired output file on line 14.

# Project: How to Detect Objects in Video Using MobileNet SSD in OpenCV
# Author: Addison Sears-Collins
# Date created: March 1, 2021
# Description: Object detection using OpenCV

import cv2 # Computer vision library
import numpy as np # Scientific computing library 

# Make sure the video file is in the same directory as your code
filename = 'edmonton_canada.mp4'
file_size = (1920,1080) # Assumes 1920x1080 mp4

# We want to save the output to a video file
output_filename = 'edmonton_canada_obj_detect_mobssd.mp4'
output_frames_per_second = 20.0 

RESIZED_DIMENSIONS = (300, 300) # Dimensions that SSD was trained on. 
IMG_NORM_RATIO = 0.007843 # In grayscale a pixel can range between 0 and 255

# Load the pre-trained neural network
neural_network = cv2.dnn.readNetFromCaffe('MobileNetSSD_deploy.prototxt.txt', 
        'MobileNetSSD_deploy.caffemodel')

# List of categories and classes
categories = { 0: 'background', 1: 'aeroplane', 2: 'bicycle', 3: 'bird', 
               4: 'boat', 5: 'bottle', 6: 'bus', 7: 'car', 8: 'cat', 
               9: 'chair', 10: 'cow', 11: 'diningtable', 12: 'dog', 
              13: 'horse', 14: 'motorbike', 15: 'person', 
              16: 'pottedplant', 17: 'sheep', 18: 'sofa', 
              19: 'train', 20: 'tvmonitor'}

classes =  ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", 
            "bus", "car", "cat", "chair", "cow", 
           "diningtable",  "dog", "horse", "motorbike", "person", 
           "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
					 
# Create the bounding boxes
bbox_colors = np.random.uniform(255, 0, size=(len(categories), 3))
	
def main():

  # Load a video
  cap = cv2.VideoCapture(filename)

  # Create a VideoWriter object so we can save the video output
  fourcc = cv2.VideoWriter_fourcc(*'mp4v')
  result = cv2.VideoWriter(output_filename,  
                           fourcc, 
                           output_frames_per_second, 
                           file_size) 
	
  # Process the video
  while cap.isOpened():
		
    # Capture one frame at a time
    success, frame = cap.read() 

    # Do we have a video frame? If true, proceed.
    if success:
		
      # Capture the frame's height and width
      (h, w) = frame.shape[:2]

      # Create a blob. A blob is a group of connected pixels in a binary 
      # frame that share some common property (e.g. grayscale value)
      # Preprocess the frame to prepare it for deep learning classification
      frame_blob = cv2.dnn.blobFromImage(cv2.resize(frame, RESIZED_DIMENSIONS), 
                     IMG_NORM_RATIO, RESIZED_DIMENSIONS, 127.5)
	
      # Set the input for the neural network
      neural_network.setInput(frame_blob)

      # Predict the objects in the image
      neural_network_output = neural_network.forward()

      # Put the bounding boxes around the detected objects
      for i in np.arange(0, neural_network_output.shape[2]):
			
        confidence = neural_network_output[0, 0, i, 2]
    
        # Confidence must be at least 30%		
        if confidence > 0.30:
				
          idx = int(neural_network_output[0, 0, i, 1])

          bounding_box = neural_network_output[0, 0, i, 3:7] * np.array(
            [w, h, w, h])

          (startX, startY, endX, endY) = bounding_box.astype("int")

          label = "{}: {:.2f}%".format(classes[idx], confidence * 100) 
        
          cv2.rectangle(frame, (startX, startY), (
            endX, endY), bbox_colors[idx], 2)     
						
          y = startY - 15 if startY - 15 > 15 else startY + 15     

          cv2.putText(frame, label, (startX, y),cv2.FONT_HERSHEY_SIMPLEX, 
            0.5, bbox_colors[idx], 2)
		
      # We now need to resize the frame so its dimensions
      # are equivalent to the dimensions of the original frame
      frame = cv2.resize(frame, file_size, interpolation=cv2.INTER_NEAREST)

			# Write the frame to the output video file
      result.write(frame)
		
    # No more video frames left
    else:
      break
			
  # Stop when the video is finished
  cap.release()
	
  # Release the video recording
  result.release()

main()

To run the code, type the following command:

python object_detection_mobile_ssd.py

You will see the video output that is at the top of this tutorial.

That’s it. Keep building!