How to Detect Objects in Video Using MobileNet SSD in OpenCV


In this tutorial, we will go through how to detect objects in a video stream using OpenCV. We will use MobileNet SSD, a special type of convolutional neural network architecture

Our output will look like this:

Real-World Applications

  • Object Detection
  • Object Tracking
  • Object Classification
  • Autonomous Vehicles
  • Self-Driving Cars

Let’s get started!


Installation and Setup

We now need to make sure we have all the software packages installed. Check to see if you have OpenCV installed on your machine. If you are using Anaconda, you can type:

conda install -c conda-forge opencv

Alternatively, you can type:

pip install opencv-python

Download the Required Files

Download all the video files and other neural network-related files at this link. Place the files inside a directory on your computer.


In the same folder where your image file is, open a new Python file called

Here is the full code for the system. The only things you’ll need to change in this code is the name of your desired input video file on line 10 and the name of your desired output file on line 14.

# Project: How to Detect Objects in Video Using MobileNet SSD in OpenCV
# Author: Addison Sears-Collins
# Date created: March 1, 2021
# Description: Object detection using OpenCV

import cv2 # Computer vision library
import numpy as np # Scientific computing library 

# Make sure the video file is in the same directory as your code
filename = 'edmonton_canada.mp4'
file_size = (1920,1080) # Assumes 1920x1080 mp4

# We want to save the output to a video file
output_filename = 'edmonton_canada_obj_detect_mobssd.mp4'
output_frames_per_second = 20.0 

RESIZED_DIMENSIONS = (300, 300) # Dimensions that SSD was trained on. 
IMG_NORM_RATIO = 0.007843 # In grayscale a pixel can range between 0 and 255

# Load the pre-trained neural network
neural_network = cv2.dnn.readNetFromCaffe('MobileNetSSD_deploy.prototxt.txt', 

# List of categories and classes
categories = { 0: 'background', 1: 'aeroplane', 2: 'bicycle', 3: 'bird', 
               4: 'boat', 5: 'bottle', 6: 'bus', 7: 'car', 8: 'cat', 
               9: 'chair', 10: 'cow', 11: 'diningtable', 12: 'dog', 
              13: 'horse', 14: 'motorbike', 15: 'person', 
              16: 'pottedplant', 17: 'sheep', 18: 'sofa', 
              19: 'train', 20: 'tvmonitor'}

classes =  ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", 
            "bus", "car", "cat", "chair", "cow", 
           "diningtable",  "dog", "horse", "motorbike", "person", 
           "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
# Create the bounding boxes
bbox_colors = np.random.uniform(255, 0, size=(len(categories), 3))
def main():

  # Load a video
  cap = cv2.VideoCapture(filename)

  # Create a VideoWriter object so we can save the video output
  fourcc = cv2.VideoWriter_fourcc(*'mp4v')
  result = cv2.VideoWriter(output_filename,  
  # Process the video
  while cap.isOpened():
    # Capture one frame at a time
    success, frame = 

    # Do we have a video frame? If true, proceed.
    if success:
      # Capture the frame's height and width
      (h, w) = frame.shape[:2]

      # Create a blob. A blob is a group of connected pixels in a binary 
      # frame that share some common property (e.g. grayscale value)
      # Preprocess the frame to prepare it for deep learning classification
      frame_blob = cv2.dnn.blobFromImage(cv2.resize(frame, RESIZED_DIMENSIONS), 
                     IMG_NORM_RATIO, RESIZED_DIMENSIONS, 127.5)
      # Set the input for the neural network

      # Predict the objects in the image
      neural_network_output = neural_network.forward()

      # Put the bounding boxes around the detected objects
      for i in np.arange(0, neural_network_output.shape[2]):
        confidence = neural_network_output[0, 0, i, 2]
        # Confidence must be at least 30%		
        if confidence > 0.30:
          idx = int(neural_network_output[0, 0, i, 1])

          bounding_box = neural_network_output[0, 0, i, 3:7] * np.array(
            [w, h, w, h])

          (startX, startY, endX, endY) = bounding_box.astype("int")

          label = "{}: {:.2f}%".format(classes[idx], confidence * 100) 
          cv2.rectangle(frame, (startX, startY), (
            endX, endY), bbox_colors[idx], 2)     
          y = startY - 15 if startY - 15 > 15 else startY + 15     

          cv2.putText(frame, label, (startX, y),cv2.FONT_HERSHEY_SIMPLEX, 
            0.5, bbox_colors[idx], 2)
      # We now need to resize the frame so its dimensions
      # are equivalent to the dimensions of the original frame
      frame = cv2.resize(frame, file_size, interpolation=cv2.INTER_NEAREST)

			# Write the frame to the output video file
    # No more video frames left
  # Stop when the video is finished
  # Release the video recording


To run the code, type the following command:


You will see the video output that is at the top of this tutorial.

That’s it. Keep building!