Real-Time Object Recognition Using a Webcam and Deep Learning

In this tutorial, we will develop a program that can recognize objects in a real-time video stream on a built-in laptop webcam using deep learning.

object-detection-recognition-video-demo

Object recognition involves two main tasks:

  1. Object Detection (Where are the objects?): Locate objects in a photo or video frame
  2. Image Classification (What are the objects?): Predict the type of each object in a photo or video frame

Humans can do both tasks effortlessly, but they remain difficult for computers.

Computers require a lot of processing power to take full advantage of the state-of-the-art algorithms that enable object recognition in real time. However, in recent years, the technology has matured, and real-time object recognition is now possible with only a laptop computer and a webcam.

Real-time object recognition systems are currently being used in a number of real-world applications, including the following:

  • Self-driving cars: detection of pedestrians, cars, traffic lights, bicycles, motorcycles, trees, sidewalks, etc.
  • Surveillance: catching thieves, counting people, identifying suspicious behavior, child detection.
  • Traffic monitoring: identifying traffic jams, catching drivers that are breaking the speed limit.
  • Security: face detection, identity verification on a smartphone.
  • Robotics: robotic surgery, agriculture, household chores, warehouses, autonomous delivery.
  • Sports: ball tracking in baseball, golf, and football.
  • Agriculture: disease detection in fruits.
  • Food: food identification.

There are a lot of steps in this tutorial. Have fun, be patient, and be persistent. Don’t give up! If something doesn’t work the first time around, try again. You will learn a lot more by fighting through to the end of this project. Stay relentless!

By the end of this tutorial, you will have the rock-solid confidence to detect and recognize objects in real time on your laptop’s GPU (Graphics Processing Unit) using deep learning.

Let’s get started!

Table of Contents

You Will Need

Install TensorFlow CPU

We need to get all the required software set up on our computer. I will be following this really helpful tutorial.

Open an Anaconda command prompt terminal.

1-open-command-promptJPG

Type the command below to create a virtual environment named tensorflow_cpu that has Python 3.6 installed. 

conda create -n tensorflow_cpu pip python=3.6

Press y and then ENTER.

A virtual environment is like an independent Python workspace which has its own set of libraries and Python version installed. For example, you might have a project that needs to run using an older version of Python, like Python 2.7. You might have another project that requires Python 3.7. You can create separate virtual environments for these projects.

Now, let’s activate the virtual environment by using this command:

conda activate tensorflow_cpu
2-activate-virtual-environmetJPG

Type the following command to install TensorFlow CPU.

pip install --ignore-installed --upgrade tensorflow==1.9

Wait for Tensorflow CPU to finish installing. Once it is finished installing, launch Python by typing the following command:

python
3-launch-pythonJPG

Type:

import tensorflow as tf

Here is what my screen looks like now:

4-import-tensorflowJPG

Now type the following:

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()

You should see a message that says: “Your CPU supports instructions that this TensorFlow binary….”. Just ignore that. Your TensorFlow will still run fine.

Now run this command to complete the test of the installation:

print(sess.run(hello))
6-test-installationJPG

Press CTRL+Z. Then press ENTER to exit.

Type:

exit

That’s it for TensorFlow CPU. Now let’s install TensorFlow GPU.

Return to Table of Contents

Install TensorFlow GPU

Your system must have the following requirements:

  • Nvidia GPU (GTX 650 or newer…I’ll show you later how to find out what Nvidia GPU version is in your computer)
  • CUDA Toolkit v9.0 (we will install this later in this tutorial)
  • CuDNN v7.0.5 (we will install this later in this tutorial)
  • Anaconda with Python 3.7+

Here is a good tutorial that walks through the installation, but I’ll outline all the steps below.

Install CUDA Toolkit v9.0

The first thing we need to do is to install the CUDA Toolkit v9.0. Go to this link.

Select your operating system. In my case, I will select Windows, x86_64, Version 10, and exe (local).

7-select-target-platformJPG

Download the Base Installer as well as all the patches. I downloaded all these files to my Desktop. It will take a while to download, so just wait while your computer downloads everything.

8-base-installerJPG
9-patchesJPG

Open the folder where the downloads were saved to.

10-downloaded-to-desktopJPG

Double-click on the Base Installer program, the largest of the files that you downloaded from the website.

Click Yes to allow the program to make changes to your device.

Click OK to extract the files to your computer.

11-click-okJPG
12-extract-filesJPG

I saw this error window. Just click Continue.

13-error-message-continueJPG

Click Agree and Continue.

14-agree-and-continueJPG

If you saw that error window earlier… “…you may not be able to run CUDA applications with this driver…,” select the Custom (Advanced) install option and click Next. Otherwise, do the Express installation and follow all the prompts.

15-custom-advancedJPG

Uncheck the Driver components, PhysX, and Visual Studio Integration options. Then click Next.

Click Next.

16-installation-location-click-nextJPG

Wait for everything to install.

17-prepare-for-installationJPG

Click Close.

18-installer-finishedJPG

Delete  C:\Program Files\NVIDIA Corporation\Installer2.

19-delete-installer-2JPG

Double-click on Patch 1.

Click Yes to allow changes to your computer.

Click OK.

20-click-okJPG

Click Agree and Continue.

21-agree-and-continueJPG

Go to Custom (Advanced) and click Next.

22-custom-advancedJPG

Click Next.

23-click-nextJPG

Click Close.

The process is the same for Patch 2. Double-click on Patch 2 now.

Click Yes to allow changes to your computer.

Click OK.

Click Agree and Continue.

Go to Custom (Advanced) and click Next.

Click Next.

24-click-nextJPG

Click Close.

25-click-closeJPG

The process is the same for Patch 3. Double-click on Patch 3 now.

Click Yes to allow changes to your computer.

Click OK.

Click Agree and Continue.

Go to Custom (Advanced) and click Next.

Click Next.

Click Close.

The process is the same for Patch 4. Double-click on Patch 4 now.

Click Yes to allow changes to your computer.

Click OK.

Click Agree and Continue.

Go to Custom (Advanced) and click Next.

Click Next.

After you’ve installed Patch 4, your screen should look like this:

26-installed-patch-4JPG

Click Close.

To verify your CUDA installation, go to the command terminal on your computer, and type:

nvcc --version
26-verify-cuda-versionJPG

Return to Table of Contents

Install the NVIDIA CUDA Deep Neural Network library (cuDNN)

Now that we installed the CUDA 9.0 base installer and its four patches, we need to install the NVIDIA CUDA Deep Neural Network library (cuDNN). Official instructions for installing are on this page, but I’ll walk you through the process below.

Go to https://developer.nvidia.com/rdp/cudnn-download

Create a user profile if needed and log in.

27-become-a-memberJPG

Go to this page: https://developer.nvidia.com/rdp/cudnn-download

Agree to the terms of the cuDNN Software License Agreement.

28-agree-to-termsJPG

We have CUDA 9.0, so we need to click cuDNN v7.6.4 (September 27, 2019), for CUDA 9.0.

29-download-cudnnJPG

I have Windows 10, so I will download cuDNN Library for Windows 10.

30-cudnn-windows10JPG

In my case, the zip file downloaded to my Desktop. I will unzip that zip file now, which will create a new folder of the same name…just without the .zip part. These are your cuDNN files. We’ll come back to these in a second.

31-unzipJPG

Before we get going, let’s double check what GPU we have. If you are on a Windows machine, search for the “Device Manager.”

32-my-gpuJPG

Once you have the Device Manager open, you should see an option near the top for “Display Adapters.” Click the drop-down arrow next to that, and you should see the name of your GPU. Mine is NVIDIA GeForce GTX 1060.

33-nvidia-control-panelJPG

If you are on Windows, you can also check what NVIDIA graphics driver you have by right-clicking on your Desktop and clicking the NVIDIA Control Panel. My version is 430.86. This version fits the requirements for cuDNN.

OK, now that we have verified that our system meets the requirements, let’s navigate to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0, your CUDA Toolkit directory.

34-navigate-to-cuda-toolkit-directoryJPG

Now go to your cuDNN files, that new folder that was created when you did the unzipping. Inside that folder, you should see a folder named cuda. Click on it.

35-named-cudaJPG

Click bin.

36-click-binJPG

Copy cudnn64_7.dll to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin. Your computer might ask you to allow Administrative Privileges. Just click Continue when you see that prompt.

Now go back to your cuDNN files. Inside the cuda folder, click on include. You should see a file named cudnn.h.

37-click-includeJPG
38-cudnnhJPG

Copy that file to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include. Your computer might ask you to allow Administrative Privileges. Just click Continue when you see that prompt.

Now go back to your cuDNN files. Inside the cuda folder, click on lib -> x64. You should see a file named cudnn.lib. 

39-cudnn-libJPG

Copy that file to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64. Your computer might ask you to allow Administrative Privileges. Just click Continue when you see that prompt.

If you are using Windows, do a search on your computer for Environment Variables. An option should pop up to allow you to edit the Environment Variables on your computer.

Click on Environment Variables.

40-environment-variablesJPG

Make sure your CUDA_PATH variable is set to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0.

41-cuda-path-setJPG

I recommend restarting your computer now.

Return to Table of Contents

Install TensorFlow GPU

Now we need to install TensorFlow GPU. Open a new Anaconda terminal window. 

42-anaconda-terminal-windowJPG

Create a new Conda virtual environment named tensorflow_gpu by typing this command:

conda create -n tensorflow_gpu pip python=3.6

Type y and press Enter.

43-conda-virtual-environmentJPG

Activate the virtual environment.

conda activate tensorflow_gpu
44-activate-virtual-envJPG

Install TensorFlow GPU for Python.

pip install --ignore-installed --upgrade tensorflow-gpu==1.9

Wait for TensorFlow GPU to install.

Now let’s test the installation. Launch the Python interpreter.

python

Type this command.

import tensorflow as tf

If you don’t see an error, TensorFlow GPU is successfully installed.

45-test-installation-1JPG

Now type this:

hello = tf.constant('Hello, TensorFlow!')
46-now-type-thisJPG

And run this command. It might take a few minutes to run, so just wait until it finishes:

sess = tf.Session()
47-all-finishedJPG

Now type this command to complete the test of the installation:

print(sess.run(hello))
48-complete-the-testJPG

You can further confirm whether TensorFlow can access the GPU by typing the following into the Python interpreter (just copy and paste it into the terminal window while the Python interpreter is running).

tf.test.is_gpu_available(
    cuda_only=True,
    min_cuda_compute_capability=None
)
49-further-testJPG
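If you want a second check, here is a small optional sketch you can paste into the same Python interpreter; it assumes the TensorFlow 1.x install from this section and simply lists every device TensorFlow can see. A device whose type is 'GPU' confirms the GPU is visible.

# Optional check (assumes TensorFlow 1.x from this section): list visible devices.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)  # look for a device_type of 'GPU'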

To exit the Python interpreter, type:

exit()
50-exit-the-editorJPG

And press Enter.

Return to Table of Contents

Install TensorFlow Models

Now that we have everything set up, let’s install some useful libraries. I will show you the steps for doing this in my TensorFlow GPU virtual environment, but the steps are the same for the TensorFlow CPU virtual environment.

Open a new Anaconda terminal window. Let’s take a look at the list of virtual environments that we can activate.

conda env list
51-conda-env-listJPG

I’m going to activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

Install the libraries. Type this command:

conda install pillow lxml jupyter matplotlib opencv cython

Press y to proceed.

Once that is finished, you need to create a folder somewhere to hold the TensorFlow models (e.g. C:\Users\addis\Documents\TensorFlow). If you have a D drive, you can save it there instead.

In your Anaconda terminal window, move to the TensorFlow directory you just created. You will use the cd command to change to that directory. For example:

cd C:\Users\addis\Documents\TensorFlow

Go to the TensorFlow models page on GitHub: https://github.com/tensorflow/models.

Click the button to download the zip file of the repository. It is a large file, so it will take a while to download.

52-download-zipJPG

Move the zip folder to the TensorFlow directory you created earlier and extract the contents.

Rename the extracted folder to models instead of models-master. Your TensorFlow directory hierarchy should look like this:

TensorFlow

  • models
    • official
    • research
    • samples
    • tutorials

Return to Table of Contents

Install Protobuf

Now we need to install Protobuf, which is used by the TensorFlow Object Detection API to configure the training and model parameters.

Go to this page: https://github.com/protocolbuffers/protobuf/releases

Download the latest *-win32.zip release (assuming you are on a Windows machine).

53-download-latestJPG

Create a folder in C:\Program Files named Google Protobuf.

Extract the contents of the downloaded *-win32.zip into C:\Program Files\Google Protobuf.

54-extract-contentsJPG

Search for Environment Variables on your system. A window should pop up that says System Properties.

55-system-propertiesJPG

Click Environment Variables.

Go down to the Path variable and click Edit.

56-path-variable-editJPG

Click New.

57-click-newJPG

Add C:\Program Files\Google Protobuf\bin

You can also add it to the Path variable under System variables.

Click OK a few times to close out all the windows.

Open a new Anaconda terminal window.

I’m going to activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your \TensorFlow\models\research\ directory and run the following command:

for /f %i in ('dir /b object_detection\protos\*.proto') do protoc object_detection\protos\%i --python_out=.

Now go back to the Environment Variables on your system. Create a New Environment Variable named PYTHONPATH (if you don’t have one already). Replace C:\Python27amd64 if you don’t have Python installed there. Also, replace <your_path> with the path to your TensorFlow folder.

C:\Python27amd64;C:\<your_path>\TensorFlow\models\research\object_detection
58-pythonpathJPG

For example:

C:\Python27amd64;C:\Users\addis\Documents\TensorFlow\models\research\object_detection

Now add these two paths to your PYTHONPATH environment variable:

C:\<your_path>\TensorFlow\models\research\
C:\<your_path>\TensorFlow\models\research\slim

Return to Table of Contents

Install COCO API

Now, we are going to install the COCO API. You don’t need to worry about what this is at this stage. I’ll explain it later.

Download the Visual C++ 2015 Build Tools from here: https://go.microsoft.com/fwlink/?LinkId=691126

Choose the default installation.

59-visual-studioJPG

After it has installed, restart your computer.

60-setup-completedJPG

Open a new Anaconda terminal window.

I’m going to activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your \TensorFlow\models\research\ directory and run the following command to install pycocotools (everything below goes on one line):

pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
61-pycocotools-installedJPG

If it doesn’t work, install git: https://git-scm.com/download/win

Follow all the default settings for installing Git. You will have to click Next several times.

Once you have finished installing Git, run this command (everything goes on one line):

pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI

Return to Table of Contents

Test the Installation

Open a new Anaconda terminal window.

I’m going to activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your \TensorFlow\models\research\object_detection\builders directory and run the following command to test your installation.

python model_builder_test.py

You should see an OK message.

62-test-your-installationJPG

Return to Table of Contents

Install LabelImg

Now we will install LabelImg, a graphical image annotation tool for labeling object bounding boxes in images.

Open a new Anaconda/Command Prompt window.

Create a new virtual environment named labelImg by typing the following command:

conda create -n labelImg

Activate the virtual environment.

conda activate labelImg

Install pyqt.

conda install pyqt=5

Click y to proceed.

Go to your TensorFlow folder, and create a new folder named addons.

63-new-folder-named-addonsJPG

Change to that directory using the cd command.

Type the following command to clone the repository:

git clone https://github.com/tzutalin/labelImg.git

Wait while labelImg downloads.

You should now have a folder named addons\labelImg under your TensorFlow folder.

Type exit to exit the terminal.

Open a new terminal window.

Activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your TensorFlow\addons\labelImg directory.

Type the following commands, one right after the other.

conda install pyqt=5
conda install lxml
pyrcc5 -o libs/resources.py resources.qrc
exit

Test the LabelImg Installation

Open a new terminal window.

Activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your TensorFlow\addons\labelImg directory.

Type the following commands:

python labelImg.py

If you see this window, you have successfully installed LabelImg. Here is a tutorial on how to label your own images. Congratulations!

64-label-imgJPG

Return to Table of Contents

Recognize Objects Using Your WebCam

Approach

Note: This section gets really technical. If you know the basics of computer vision and deep learning, it will make sense. Otherwise, it will not. You can skip this section and head straight to the Implementation section if you are not interested in what is going on under the hood of the object recognition application we are developing.

In this project, we use OpenCV and TensorFlow to create a system capable of automatically recognizing objects in a live webcam feed. Each detected object is outlined with a bounding box labeled with the predicted object type as well as a detection score.

The detection score is the probability that a bounding box contains the object of a particular type (e.g. the confidence a model has that an object identified as a “backpack” is actually a backpack).
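When you write your own post-processing code, you will typically discard low-confidence detections. Here is a minimal, hypothetical sketch of that idea; the boxes, scores, classes, and the 0.5 threshold below are made-up values for illustration, not outputs from this project.

import numpy as np

# Hypothetical model outputs: three detections with normalized box coordinates,
# confidence scores, and COCO class IDs (all numbers are made up for illustration).
boxes = np.array([[0.1, 0.2, 0.5, 0.6], [0.3, 0.3, 0.9, 0.9], [0.0, 0.0, 0.2, 0.1]])
scores = np.array([0.92, 0.48, 0.11])
classes = np.array([1, 27, 62])

# Keep only the detections whose score clears an assumed 0.5 threshold.
keep = scores >= 0.5
print(boxes[keep], classes[keep], scores[keep])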

The particular SSD with Inception v2 model used in this project is the ssd_inception_v2_coco model. The ssd_inception_v2_coco model uses the Single Shot MultiBox Detector (SSD) for its architecture and the Inception v2 framework for feature extraction.

Single Shot MultiBox Detector (SSD)

Most state-of-the-art object detection methods involve the following stages:

  1. Hypothesize bounding boxes 
  2. Resample pixels or features for each box
  3. Apply a classifier

The Single Shot MultiBox Detector (SSD) eliminates the multi-stage process above and performs all object detection computations using just a single deep neural network.

Inception v2

Most state-of-the-art object detection methods based on convolutional neural networks at the time Inception v2 was invented achieved greater accuracy by adding increasingly more convolution layers or more neurons per layer. The problem with this approach is that it is computationally expensive and prone to overfitting. The Inception v2 architecture (as well as the Inception v3 architecture) was proposed in order to address these shortcomings.

Rather than stacking multiple kernel filter sizes sequentially within a convolutional neural network, the approach of the inception-based model is to perform a convolution on an input with multiple kernels all operating at the same layer of the network. By factorizing convolutions and using aggressive regularization, the authors were able to improve computational efficiency. Inception v2 factorizes the traditional 7 x 7 convolution into 3 x 3 convolutions.
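To see why this factorization matters, here is a quick back-of-the-envelope sketch; the channel count C is an assumed value chosen only for illustration. A stack of three 3 x 3 convolutions covers the same 7 x 7 receptive field as a single 7 x 7 convolution but uses noticeably fewer weights.

# Compare the weights in one 7x7 convolution vs. three stacked 3x3 convolutions.
# C is an assumed number of input/output channels, chosen only for illustration.
C = 64

weights_7x7 = 7 * 7 * C * C               # a single 7x7 layer
weights_3x3_stack = 3 * (3 * 3 * C * C)   # three 3x3 layers, same receptive field

print(weights_7x7, weights_3x3_stack)     # 200704 vs. 110592, roughly 45% fewer weights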

Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna (2015) demonstrated empirically in their landmark Inception v2 paper that factorizing convolutions and using aggressive dimensionality reduction can substantially lower computational cost while maintaining accuracy.

Data Set

The ssd_inception_v2_coco model used in this project is pretrained on the Common Objects in Context (COCO) data set, a large-scale data set that contains 1.5 million object instances and more than 200,000 labeled images. The COCO data set required 70,000 crowd worker hours to gather, annotate, and organize images of objects in natural environments.

Software Dependencies

The following libraries form the object recognition backbone of the application implemented in this project:

  • OpenCV, a library of programming functions for computer vision.
  • Pillow, a library for manipulating images.
  • Numpy, a library for scientific computing.
  • Matplotlib, a library for creating graphs and visualizations.
  • TensorFlow Object Detection API, an open source framework developed by Google that enables the development, training, and deployment of pre-trained object detection models.

Return to Table of Contents

Implementation

Now to the fun part: we will recognize objects using our computer’s webcam.

Copy the following program, and save it to your TensorFlow\models\research\object_detection directory as object_detection_test.py.

# Import all the key libraries
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
import cv2

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
from utils import label_map_util
from utils import visualization_utils as vis_util

# Define the video stream
cap = cv2.VideoCapture(0)  

# Which model are we downloading?
# The models are listed here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
MODEL_NAME = 'ssd_inception_v2_coco_2018_01_28'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Path to the frozen detection graph. 
# This is the actual model that is used for the object detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

# List of the strings that is used to add the correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

# Number of classes to detect
NUM_CLASSES = 90

# Download Model
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
    file_name = os.path.basename(file.name)
    if 'frozen_inference_graph.pb' in file_name:
        tar_file.extract(file, os.getcwd())

# Load a (frozen) Tensorflow model into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

# Loading label map
# Label maps map indices to category names, so that when our convolution network 
# predicts `5`, we know that this corresponds to `airplane`.  Here we use internal 
# utility functions, but anything that returns a dictionary mapping integers to 
# appropriate string labels would be fine
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)


# Helper code
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)
	
# Detection
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        while True:

            # Read frame from camera
            ret, image_np = cap.read()
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Extract image tensor
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Extract detection boxes
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Extract detection scores
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            # Extract detection classes
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            # Extract number of detections
            num_detections = detection_graph.get_tensor_by_name(
                'num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)

            # Display output
            cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))

            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break


print("We are finished! That was fun!")

Open a new terminal window.

Activate the TensorFlow GPU virtual environment.

conda activate tensorflow_gpu

cd into your TensorFlow\models\research\object_detection directory.

At the time of this writing, we need to use Numpy version 1.16.4. Type the following command to see what version of Numpy you have on your system.

pip show numpy

If it is not 1.16.4, execute the following commands:

pip uninstall numpy
pip install numpy==1.16.4

Now run your program:

python object_detection_test.py

In about 30 to 90 seconds, you should see your webcam power up and object recognition take action. That’s it! Congratulations for making it to the end of this tutorial!

object_detection_resultsJPG

Keep building!

Return to Table of Contents

How to Install TensorFlow 2 on Windows 10

In this post, I will show you how to install TensorFlow 2 on Windows 10. TensorFlow 2 is a free software library used for machine learning applications. It comes integrated with Keras, a neural-network library written in Python. If you want to work with neural networks and deep learning, TensorFlow 2 should be your software of choice because of its popularity both in academia and in industry. Let’s get started!

Table of Contents

You Will Need 

Directions

Install TensorFlow 2

Here are the official instructions for downloading TensorFlow 2, but I will walk you through the process step-by-step.

Open an Anaconda command prompt terminal.

1-open-promptJPG

Type the command below to create a virtual environment named tf_2 with the latest version of Python installed. A virtual environment is like an independent Python workspace which has its own set of libraries and Python version installed. For example, you might have a project that needs to run using an older version of Python, like Python 2.7. You might have another project that requires Python 3.7. You can create separate virtual environments for these projects.

conda create -n tf_2 python

Press y and then ENTER.

2-type-yJPG

Wait for the software to download.

3-activate-tensorflow-2JPG

Once the download is finished, activate the virtual environment using this command:

conda activate tf_2

Check which version of Python you have installed on your system. I have Python 3.8.0.

python --version
4-python-versionJPG

Choose a TensorFlow package. I’ll install TensorFlow CPU. Let’s type the following command:

5-choose-a-packageJPG
pip install --upgrade tensorflow

You might see this error:

ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)

ERROR: No matching distribution found for tensorflow

If you do, TensorFlow is not yet compatible with your newer version of Python, so you need to downgrade the Python version inside the virtual environment:

conda install python=3.6

Press y and then ENTER.

Check which version of Python you have installed on your system. I have Python 3.6.9 now.

python --version
6-downgrade-pythonJPG

Now install TensorFlow 2.

pip install --upgrade tensorflow

Wait for Tensorflow CPU to finish installing. Once it is finished installing, verify the installation by typing:

python -c "import tensorflow as tf; x = [[2.]]; print('tensorflow version', tf.__version__); print('hello, {}'.format(tf.matmul(x, x)))"

Here is the output:

9-voilaJPG

You should see your TensorFlow version in the output.

You might see this message:

“I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2”

Don’t worry, TensorFlow is working just fine. To get rid of that message, you can set an environment variable inside the virtual environment. Type the following command:

set TF_CPP_MIN_LOG_LEVEL=2

Now run this command:

python -c "import tensorflow as tf; x = [[2.]]; print('tensorflow version', tf.__version__); print('hello, {}'.format(tf.matmul(x, x)))"

Voila! Message gone. 

9-voilaJPG-1

Return to Table of Contents

Create a Basic Neural Network Using TensorFlow 2

To really see what TensorFlow 2 can do, let’s do the following:

  • Build a neural network that classifies images of clothing.
  • Train this neural network.
  • And, finally, evaluate the accuracy of the model.

We are going to roughly follow the TensorFlow beginner tutorial.

First, install the Matplotlib library.

pip install matplotlib

I’m now going to open up a text editor and type a Python program. I will save it to my D drive as fashion_mnist.py. Here is the code:

from __future__ import absolute_import, division, print_function, unicode_literals

# Import the key libraries
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

# Rename tf.keras.layers
layers = tf.keras.layers

# Print the TensorFlow version
print(tf.__version__)

# Load and prepare the MNIST dataset. 
# Convert the samples from integers to floating-point numbers:
mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Let's plot the data so we can see it
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal',
 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
plt.figure(figsize=(10,10))

for i in range(25):
 plt.subplot(5,5,i+1)
 plt.xticks([])
 plt.yticks([])
 plt.grid(False)
 plt.imshow(x_train[i], cmap=plt.cm.binary)
 plt.xlabel(class_names[y_train[i]])
plt.show()

Within your virtual environment in the Anaconda terminal, navigate to where you saved your code. I will type.

D:

Then:

cd D:\<YOUR_PATH>\install_tensorflow2

Type dir to see if the Python (.py) file is in that directory.

Now run the code:

python fashion_mnist.py

You should see this graphic pop up.

10-fashion-datasetJPG

In the terminal window, press CTRL+C on your keyboard to stop the code from running.

Let’s add to our code. Open up the Python file again in the text editor and type the following code. If you are new to neural networks, don’t worry what everything means at this stage.

from __future__ import absolute_import, division, print_function, unicode_literals

# Import the key libraries
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

# Rename tf.keras.layers
layers = tf.keras.layers

# Print the TensorFlow version
print(tf.__version__)

# Load and prepare the MNIST dataset. 
# Convert the samples from integers to floating-point numbers:
mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Let's plot the data so we can see it
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal',
 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
plt.figure(figsize=(10,10))

for i in range(25):
 plt.subplot(5,5,i+1)
 plt.xticks([])
 plt.yticks([])
 plt.grid(False)
 plt.imshow(x_train[i], cmap=plt.cm.binary)
 plt.xlabel(class_names[y_train[i]])
plt.show()

# Build the neural network layer-by-layer
model = tf.keras.Sequential()
model.add(layers.Flatten()) # Make the input layer one-dimensional
model.add(layers.Dense(64, activation='relu')) # Layer has 64 nodes; Uses ReLU
model.add(layers.Dense(64, activation='relu')) # Layer has 64 nodes; Uses ReLU
model.add(layers.Dense(10, activation='softmax')) # Output layer has 10 nodes (one per class); Uses Softmax

# Choose an optimizer and loss function for training:
model.compile(optimizer='adam',
 loss='sparse_categorical_crossentropy',
 metrics=['accuracy'])
 
# Train and evaluate the model's accuracy
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test,  y_test, verbose=2)

Run the code:

python fashion_mnist.py

When you see the plot of the clothes appear, just close that window so that the neural network can build and run.

Here is the output.

11-accuracyJPG

The accuracy of classifying the clothing items was 87.5%. Pretty cool huh! Congratulations! You’ve built and run your first neural network on TensorFlow 2.

To deactivate the virtual environment, type:

conda deactivate

Then to exit the terminal, type:

exit

At this stage, I encourage you to go through the TensorFlow tutorials to get more practice using this really powerful tool.

Return to Table of Contents

Difference Between Supervised and Unsupervised Learning

In this post, I will explain the difference between supervised and unsupervised learning.

Table of Contents

What is Supervised Learning?


Imagine you have a computer. The computer is really good at doing math and making complex calculations. You want to “train” your computer to predict the price for any home in the United States. So you search around the Internet and find a dataset. The dataset contains the following information  for 100,000 houses that sold during the last 30 days in various cities across the United States:

  1. Square footage
  2. Number of bathrooms
  3. Number of bedrooms
  4. Number of garages
  5. Year the house was constructed
  6. Average size of the house’s windows
  7. Sale price

Variables 1 through 6 above are the dataset’s features (also known as input variables, attributes, independent variables, etc.). Variable 7, the house sale price, is the output variable (or target variable) that we want our computer to get good at predicting. 

So the question is…how do we train our computer to be a good house price predictor?  We need to write a software program. The program needs to take as input the 100,000-house dataset that I mentioned earlier. This program needs to then find a mathematical relationship between variables 1-6 (features) and variable 7 (output variable). Once the program has found a relationship between the features (input) and the output, it can predict the sale price of a house it has never seen before.

1-ipo

Let’s take a look at an analogy. Supervised learning is like baking a cake. 

Suppose you had a cake machine that was able to cook many different types of cake from the same set of ingredients. All you have to do as the cake chef is to just throw the ingredients into the machine, and the cake machine will automatically make the cake.

The “features,” the inputs to the cake machine, are the following ingredients:

  1. Butter
  2. Sugar
  3. Vanilla Extract
  4. Flour
  5. Chocolate
  6. Eggs
  7. Salt

The output is the type of cake:

  1. Vanilla Cake
  2. Pound Cake
  3. Chocolate Cake
  4. Dark Chocolate Cake

Different amounts of each ingredient will produce different types of cake. How does the cake machine know what type of cake to produce given a set of ingredients? 

Fortunately, a machine learning engineer has written a software program (containing a supervised learning algorithm) that is running inside the cake machine. The program was pre-trained on a dataset containing 1 million cakes. Each entry (i.e. example or training instance) in this gigantic dataset contained two key pieces of data: 

  1. How much of each ingredient was used in the making of that given cake
  2. The type of cake that was produced

During the training phase of the program, the program found a fairly accurate mathematical relationship between the amount of each ingredient and the cake type. And now, when the cake machine is provided with a new set of ingredients by the chef, it automatically “knows” what type of cake to produce. Pretty cool huh!

2-ipo-2

What I described above is called supervised learning. It is called supervised learning because the input data (which the supervised learning algorithm used to train) is already labeled with the “correct” answers (e.g. the type of cake in our example above; or the sale price values for those 100,000 homes from the earlier example I presented in this post.). We supervised the learning algorithm by “telling” it what the output (cake type) should be for 1 million different sets of input values (ingredients). 

The fundamental idea of a supervised learning algorithm is to learn a mathematical relationship between inputs and outputs so that it can predict the output value given an entirely new set of input values. 

Let’s take a look at a common supervised learning algorithm: linear regression. The goal of linear regression is to find a line that best fits the relationship between input and output. For example, the learning algorithm for linear regression could be trained on square footage and sale price data for 100,000 homes. It would learn the mathematical relationship (e.g. a straight line in the form y = mx + b) between square footage and the sale price of a home. 

3-square-footage

With this relationship (i.e. line of best fit) in hand, the algorithm can now easily predict the sale price of any home just by being provided with the home’s square footage value. 

For example, let’s say we wanted to find the price of a house that is 2,000 ft². We feed 2,000 ft² into the algorithm. The algorithm predicts a sale price of $500,000.

4-price-square-footage
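Here is a minimal sketch of that prediction step; the square footage and sale price numbers below are synthetic values invented for illustration, not data from a real housing dataset.

import numpy as np

# Synthetic training data (made up for illustration): square footage vs. sale price.
square_feet = np.array([1000, 1500, 1800, 2400, 3000])
sale_price = np.array([250000, 375000, 450000, 600000, 750000])

# Fit a straight line y = m*x + b (ordinary least squares on a single feature).
m, b = np.polyfit(square_feet, sale_price, 1)

# Predict the sale price of a 2,000 square-foot home.
print(m * 2000 + b)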

As you can imagine, before we make any predictions using a supervised learning algorithm, it is imperative to train it on a lot of data. Lots and lots of data. The more the merrier.

In the example above, to get that best fit line, we want to feed it with as many examples of houses as possible during training. The more data it has, the better its predictions will be when given new data. This, my friends, is supervised learning, and it is incredibly powerful. In fact, supervised learning is the bread and butter of most of the state-of-the-art machine learning techniques today, such as deep learning.

Now, let’s look at unsupervised learning.

Return to Table of Contents

What is Unsupervised Learning?

Let’s suppose you have the following dataset for a group of 13 people. For each person, we have the following features:

  • Height (in inches)
  • Hair Length (in inches)

Let’s plot the data to see what it looks like:

5-height

In contrast to supervised learning, in this case there is no output value that we are trying to predict (e.g. house sale price, cake type, etc.). All we have are features (inputs) with no corresponding output variables. In machine learning jargon, we say that the data points are unlabeled.

So instead of trying to force the dataset to fit a straight line or some sort of predetermined mathematical model, we let an unsupervised learning algorithm find a pattern all on its own. And here is what we get:

6-hair-length-vs-height

Aha! It looks like the algorithm found some sort of pattern in the data. The data is clustered into two distinct groups. What these clusters mean, we do not know because the data points are unlabeled. However, given that the attributes are height and hair length, we suspect that the blue dots are males and the red dots are females.

What I have described above is known as unsupervised learning. It is called unsupervised because the input dataset is unlabeled. There is no output variable we are trying to predict. There is no prior mathematical model we are trying to fit the data to. All we want to do is let the algorithm find some sort of structure or pattern in the data. We let the data speak for itself. 

Any time you are given a dataset and want to group similar data points into clusters, you’re going to want to use an unsupervised learning algorithm.
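Here is a minimal sketch of this kind of clustering; the height and hair length values are synthetic, and scikit-learn is an assumed extra dependency rather than something installed earlier in this post.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: each row is [height in inches, hair length in inches].
people = np.array([
    [70, 2], [72, 1], [69, 3], [71, 2], [68, 2], [73, 1],
    [64, 14], [63, 12], [65, 16], [62, 13], [66, 15], [61, 11], [64, 12],
])

# Ask k-means for 2 clusters; the algorithm never sees any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(people)
print(kmeans.labels_)           # the cluster assigned to each person
print(kmeans.cluster_centers_)  # the two cluster centers it discovered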

Return to Table of Contents

Hierarchical Actions and Reinforcement Learning

One of the issues of reinforcement learning is how it handles hierarchical actions.

What are Hierarchical Actions?

In order to explain hierarchical actions, let us take a look at a real-world example. Consider the task of baking a sweet potato pie. The high-level action of making a sweet potato pie can be broken down into numerous low-level sub steps: cut the sweet potatoes, cook the sweet potatoes, add sugar, add flour, etc.

You will also notice that each of the low-level sub steps mentioned above can be broken down into even smaller steps. For example, the task of cutting a sweet potato can be broken down into the following steps: move right arm to the right, orient right arm above the sweet potato, bring arm down, etc.

Each of those sub steps of sub steps can then be further broken down into even smaller steps. For example, “moving right arm to the right” might involve thousands of different muscle contractions. Can you see where we are going here?

Reinforcement learning involves training a software agent to learn by experience through trial and error. A basic reinforcement learning algorithm would need to do a search over thousands of low-level actions in order to execute the task of making a sweet potato pie. Thus, reinforcement learning methods would quickly get inefficient for tasks that require a large number of low-level actions.

How to Solve the Hierarchical Action Problem

One way to solve the hierarchical action problem is to represent a high-level behavior (e.g. making a sweet potato pie) as a small sequence of high-level actions. 

For example, where the solution of making a sweet potato pie might entail 1000 low-level actions, we might condense these actions into 10 high-level actions. We could then have a single master policy that switches between each of the 10 sub-policies (one for each action) every N timesteps. The algorithm explained here is known as meta-learning shared hierarchies and is explained in more detail at OpenAI.com.
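Here is a rough sketch of that master policy/sub-policy structure; it is not the actual meta-learning shared hierarchies implementation, and the switching interval, sub-policy count, and random placeholder policies below are assumptions made purely for illustration.

import random

N_TIMESTEPS_PER_SWITCH = 10   # assumed switching interval
NUM_SUB_POLICIES = 10         # e.g. 10 high-level actions condensing ~1000 low-level ones

def master_policy(state):
    # Placeholder: a real master policy would be learned, not random.
    return random.randrange(NUM_SUB_POLICIES)

def sub_policy(index, state):
    # Placeholder: each sub-policy would map states to low-level actions.
    return "sub_policy_%d_action" % index

state = "start"
for t in range(30):
    if t % N_TIMESTEPS_PER_SWITCH == 0:
        active = master_policy(state)    # the master switches sub-policies every N steps
    action = sub_policy(active, state)   # the active sub-policy picks the low-level action
    print(t, action)
    # state = environment.step(action)   # environment interaction omitted in this sketch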

We could also integrate supervised learning techniques such as ID3 decision trees. Each sub-policy would be represented as a decision tree where the appropriate action taken is the output of the tree. The input would be a transformed version of the state and reward that was received from the environment. In essence, you would have decisions taken within decisions.

Partial Observability and Reinforcement Learning

In this post, I’m going to discuss how supervised learning can address the partial observability issue in reinforcement learning.

What is Partial Observability?

In a lot of the textbook examples of reinforcement learning, we assume that the agent, for example a robot, can perfectly observe the environment around it in order to extract relevant information about the current state. When this is the case, we say that the environment around the agent is fully observable.

However, in many cases, such as in the real world, the environment is not always fully observable. For example, there might be noisy sensors, missing information about the state, or outside interferences that prohibit an agent from being able to develop an accurate picture of the state of its surrounding environment. When this is the case, we say that the environment is partially observable.

Let us take a look at an example of partial observability using the classic cart-pole balancing task that is often found in discussions on reinforcement learning.

Below is a video demonstrating the cart-pole balancing task. The goal is to keep a pole from falling over by making small adjustments to the cart support underneath the pole.

In the video above, the agent learns to keep the pole balanced for 30 minutes after 600 trials. The state of the world consists of two parts:

  1. The pole angle
  2. The angular velocity

However, what happens if one of those parts is missing? For example, the pole angle reading might disappear. 

Also, what happens if the readings are noisy, where the pole angle and angular velocity measurements deviate significantly from the true value? 

In these cases, a reinforcement learning policy that depends only on the current observation xt (where x is the pole angle or angular velocity value at time t) will suffer in performance. This, in a nutshell, is the partial observability problem that is inherent in reinforcement learning techniques.

Addressing Partial Observability Using Long Short-Term Memory (LSTM) Networks

One strategy for addressing the partial observability problem (where information about the actual state of the environment is missing or noisy) is to use long short-term memory (LSTM) neural networks. In contrast to feedforward neural networks, which pass information one way from the input layer to the output layer, LSTMs have feedback connections. Past information persists from run to run of the network, giving the system a “memory.” This memory can then be used to make predictions about the current state of the environment.

The details of exactly how the memory described above is created are given in this paper written by Bram Baker of the Unit of Experimental and Theoretical Psychology at Leyden University. Baker showed that LSTM neural networks can help improve reinforcement learning policies by creating a “belief state.” This “belief state” is based on probabilities of reward, state transitions, and observations, given prior states and actions.

Thus, when the actual state (as measured by a robot’s sensor for example) is unavailable or super noisy, an agent can use belief state information generated by an LSTM to determine the appropriate action to take.
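As a rough illustration (not the implementation from Baker’s paper), here is a minimal sketch using tf.keras; the history length, observation size, and action count are assumed values chosen only to show the shape of the idea.

import tensorflow as tf

layers = tf.keras.layers

TIMESTEPS = 8    # assumed length of the recent observation history
OBS_DIM = 2      # e.g. a noisy pole angle and angular velocity
NUM_ACTIONS = 2  # e.g. push the cart left or right

# The LSTM summarizes the recent observation history into a hidden state that acts
# like a belief state, and the dense layer maps that belief state to action scores.
model = tf.keras.Sequential([
    layers.LSTM(32, input_shape=(TIMESTEPS, OBS_DIM)),
    layers.Dense(NUM_ACTIONS)
])

# One fake history of noisy observations, just to show the shapes involved.
history = tf.random.normal((1, TIMESTEPS, OBS_DIM))
print(model(history))  # one score per action, conditioned on the whole history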

Combining Deep Neural Networks With Reinforcement Learning for Improved Performance

The performance of reinforcement learning can be improved by incorporating supervised learning techniques. Let us take a look at a concrete example.

You all might be familiar with the Roomba robot created by iRobot. The Roomba robot is perhaps the most popular robot vacuum sold in the United States. 


The Roomba is completely autonomous, moving around the room with ease, cleaning up dust, pet hair, and dirt along the way. In order to do its job, the Roomba contains a number of sensors that enable it to perceive the current state of the environment (i.e. your house). 

Let us suppose that the Roomba is governed by a reinforcement learning policy. This learning policy could be improved if we have accurate readings of the current state of the environment. And one way to improve these readings is to incorporate computer vision.

Since reinforcement learning depends heavily on accurate readings of the current state of the environment, we could use deep neural networks (a supervised learning technique) to pre-train the robot so that it can perform common computer vision tasks such as recognizing objects, localizing objects, and classifying objects before we even start running the reinforcement learning algorithm. These “readings” would improve the state portion of the reinforcement learning loop.

Deep neural networks have already displayed remarkable accuracy for computer vision problems. We can use these techniques to enable the robot to get a more accurate reading of the current state of the environment so that it can then take the most appropriate actions towards maximizing cumulative reward.

Boltzmann Distribution and Epsilon Greedy Search

How Does the Boltzmann Distribution Fit Into the Discussion of Epsilon Greedy Search?

In order to answer your question, let us take a closer look at the definition of epsilon greedy search. With our knowledge of how that works, we can then see how the Boltzmann distribution fits into the discussion of epsilon greedy search.

What is Epsilon Greedy Search?

When you are training an agent (e.g. race car, robot, etc.) with an algorithm like Q-learning, you can either have the agent take a random action with probability ϵ or have the agent be greedy and take the action that corresponds to its policy with probability 1-ϵ (i.e. the action for a given state that has the highest Q-value). The former is known as exploration while the latter is called exploitation. In reinforcement learning, we have this constant dichotomy of:

  • exploration vs. exploitation
  • learn vs. earn
  • not greedy vs. greedy
  • Exploration: Try a new bar in your city.
  • Exploitation: Go to the same watering hole you have been going to for decades.
  • Exploration: Start a business.
  • Exploitation: Get a job.
  • Exploration: Try to make new friends.
  • Exploitation: Keep inviting over your college buddies.
  • Exploration: Download Tinder dating app.
  • Exploitation: Call the ex.
  • Exploration (with probability ϵ): Gather more information about the environment.
  • Exploitation (with probability 1-ϵ): Make a decision based on the best information (i.e. policy) that is currently available.

The epsilon greedy algorithm in which ϵ is 0.20 says that most of the time the agent will select the trusted action a, the one prescribed by its policy π(s) -> a. However, 20% of the time, the agent will choose a random action instead of following its policy. 

We often want to have the epsilon-greedy algorithm in place for a reinforcement learning problem because often what is best for the agent long term (e.g. trying something totally random that pays off in a big way down the road) might not be the best for the agent in the short term (e.g. sticking with the best option we already know).
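Here is a minimal sketch of epsilon greedy action selection; the Q-values are made-up numbers, and epsilon is set to 0.20 to match the example above.

import random

EPSILON = 0.20

def epsilon_greedy(q_values):
    # With probability epsilon, explore: pick a random action.
    if random.random() < EPSILON:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Made-up Q-values for one state with four possible actions.
q_for_state = [1.2, 0.4, 2.7, 0.9]
print(epsilon_greedy(q_for_state))  # usually 2, occasionally a random action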

What Does the Boltzmann Distribution Have to Do With Epsilon Greedy Search?

Notice in the epsilon greedy search section above, I said that 20% of the time the agent will choose a random action instead of following its policy. The problem with this is that it treats all actions equally when making a decision on what action to take. What happens though if some actions might look more promising than others? Plain old epsilon greedy search cannot handle a situation like this. The fact is that, in the real world, all actions are not created equal.

A common method is to use the Boltzmann distribution (also known as the Gibbs distribution). Rather than blindly accepting any random action when it comes time for the agent to explore the environment from a given state s, the agent selects an action a (from a set of actions A) with probability:

P(a | s) = exp(Q[s, a] / τ) / Σ exp(Q[s, a'] / τ)

where the sum in the denominator runs over every action a' in the set of actions A, and τ (tau) is the temperature parameter discussed below.

What this system is doing above is ranking and weighting all actions in the set of possible actions based on their Q-values. This system is often referred to as softmax selection rules.

Take a closer look at the equation above to see what we are doing here. A really high value of tau means that all actions are equally likely to be selected because we are diluting the impact of the Q-values for each action (by dividing by tau). However, as tau gets lower and lower, there will be greater differences in the selection probabilities for each action. The action with the highest Q[s,a] value is therefore much more likely to get selected. And when tau gets close to zero, the Boltzmann selection criteria I outlined above becomes indistinguishable from greedy search. For an extremely low value of tau, the agent will select the action with the highest Q-value and therefore never explore the environment via a random action.
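Here is a minimal sketch of Boltzmann (softmax) action selection based on the formula above; the Q-values and temperature values are made-up numbers for illustration.

import math
import random

def boltzmann_action(q_values, tau):
    # Weight each action by exp(Q / tau), then normalize into probabilities.
    weights = [math.exp(q / tau) for q in q_values]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    # Sample an action in proportion to its probability.
    action = random.choices(range(len(q_values)), weights=probabilities, k=1)[0]
    return action, probabilities

q_for_state = [1.2, 0.4, 2.7, 0.9]  # made-up Q-values for one state
for tau in (10.0, 1.0, 0.1):
    action, probabilities = boltzmann_action(q_for_state, tau)
    # High tau: nearly uniform probabilities. Low tau: nearly greedy.
    print(tau, [round(p, 3) for p in probabilities])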

Value Iteration vs. Q-Learning Algorithm in Python Step-By-Step

In this post, I will walk you through how to implement the value iteration and Q-learning reinforcement learning algorithms from scratch, step-by-step. We will not use any fancy machine learning libraries, only basic Python libraries like Pandas and Numpy.

Our end goal is to implement and compare the performance of the value iteration and Q-learning reinforcement learning algorithms on the racetrack problem (Barto, Bradtke, & Singh, 1995).

In the racetrack problem, the goal is to control the movement of a race car along a predefined racetrack so that the race car travels from the starting position to the finish line in a minimum amount of time. The race car must stay on the racetrack and avoid crashing into walls along the way. The racetrack problem is analogous to a time trial rather than a competitive race since there are no other cars on the track to avoid.

Note: Before you deep dive into a section below, I recommend you check out my introduction to reinforcement learning post so you can familiarize yourself with what reinforcement learning is and how it works. Once you skim over that blog post, the rest of this article will make a lot more sense. If you are already familiar with how reinforcement learning works, no need to read that post. Just keep reading this one.

Without further ado, let’s get started!

Table of Contents

Testable Hypotheses

The two reinforcement learning algorithms implemented in this project were value iteration and Q-learning. Both algorithms were tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to take from the starting position to the finish line was calculated for each algorithm-racetrack combination.

Using the implementations of value iteration and Q-learning, three hypotheses will be tested.

Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

I hypothesize that value iteration and Q-learning will both generate policies that will enable the race car to successfully finish the race on all racetracks tested (i.e. move from the starting position of the racetracks to the finish line).

Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

I hypothesize that value iteration will generate a learning policy faster than Q-learning because it has access to the transition and reward functions (explained in detail in the next section, “How Value Iteration Works”).

Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

I hypothesize that it will take longer for the car to finish the race for the crash scenario in which the race car needs to return to the original starting position each time it crashes into a wall. In other words, Bad Crash Version 1 (return to nearest open position) performance will be better than Bad Crash Version 2 (return to start) performance.

Return to Table of Contents

How Value Iteration Works

In the case of the value iteration reinforcement learning algorithm, the race car (i.e. agent) knows two things before taking a specific action (i.e. accelerate in x and/or y directions) (Alpaydin, 2014):

  1. The probabilities of ending up in other new states given a particular action is taken from a current state.
    • More formally, the race car knows the transition function.
    • As discussed in the previous section, the transition function takes as input the current state s and selected action a and outputs the probability of transitioning to some new state s’.
  2. The immediate reward (e.g. race car gets -1 reward each time it moves to a new state) that would be received for taking a particular action from a current state.
    • More formally, this is known as the reward function.
    • The reward function takes as input the current state s and selected action a and outputs the reward.

Value iteration is known as a model-based reinforcement learning technique because it tries to learn a model of the environment where the model has two components:

  1. Transition Function
    • The race car looks at each time it was in a particular state s and took a particular action a and ended up in a new state s’. It then updates the transition function (i.e. transition probabilities) according to these frequency counts.
  2. Reward Function
    • This function answers the question: “Each time the race car was in a particular state s and took a particular action a, what was the reward?”

In short, in value iteration, the race car knows the transition probabilities and reward function (i.e. the model of the environment) and uses that information to govern how the race car should act when in a particular state. Being able to look ahead and calculate the expected rewards of a potential action gives value iteration a distinct advantage over other reinforcement learning algorithms when the set of states is small in size.

Let us take a look at an example. Suppose the following:

  • The race car is in state s0, where s0 = <x0, y0, vx0, vy0>, corresponding to the x-position on the racetrack, y-position on the racetrack, velocity in the x direction, and velocity in the y direction, at timestep 0.
  • The race car takes action a0, where a0 = (ax0, ay0) = (change in velocity in x-direction, change in velocity in y-direction)
  • The race car then observes the new state, where the new state is s1, where s1 = <x1, y1, vx1, vy1>.
  • The race car receives a reward r0 for taking action a0 in state s0.

Putting these variables together, we get the following expression, which is known as an experience tuple. Tuples are just like lists except the data inside of them cannot be changed.

Experience Tuple = <s0, a0, s1, r0>

What the experience tuple above says is that if the race car is in state s0 and takes action a0, the race car will observe a new state s1 and will receive reward r0.

Then at the next time step, we generate another experience tuple that is represented as follows.

Experience Tuple = <s1, a1, s2, r1>

This process of collecting experience tuples as the race car explores the race track (i.e. environment) happens repeatedly.

Because value iteration is a model based approach, it builds a model of the transition function T[s, a, s’] and reward function R[s,a,s’] using the experience tuples.

  • The transition function can be built and updated by adding up how many times the race car was in state s and took a particular action a and then observed a new state s’. Recall that T[s, a, s’] stands for the probability the race car finds itself in a new state s’ (at the next timestep) given that it was in state s and took action a.
  • The reward function can be built by examining how many times the race car was in state s and took a particular action a and received a reward r. From that information, the average reward for that particular scenario can be calculated (a rough code sketch of this bookkeeping follows this list).
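
To make this concrete, here is a rough sketch (not the final racetrack code) of how those frequency counts could be accumulated from experience tuples. All of the variable and function names below are hypothetical:

from collections import defaultdict

# Hypothetical counters for building the model from experience tuples <s, a, s', r>
transition_counts = defaultdict(int)   # counts of (s, a, s')
visit_counts = defaultdict(int)        # counts of (s, a)
reward_sums = defaultdict(float)       # running sum of rewards received for (s, a)

def record_experience(s, a, s_prime, r):
    """Update the frequency counts with one experience tuple."""
    transition_counts[(s, a, s_prime)] += 1
    visit_counts[(s, a)] += 1
    reward_sums[(s, a)] += r

def T(s, a, s_prime):
    """Estimated probability of ending up in s' after taking action a in state s."""
    if visit_counts[(s, a)] == 0:
        return 0.0
    return transition_counts[(s, a, s_prime)] / visit_counts[(s, a)]

def R(s, a):
    """Estimated average reward for taking action a in state s."""
    if visit_counts[(s, a)] == 0:
        return 0.0
    return reward_sums[(s, a)] / visit_counts[(s, a)]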

Once these models are built, the race car can then use value iteration to determine the optimal values of each state (hence the name value iteration). In some texts, values are referred to as utilities (Russell, Norvig, & Davis, 2010).

What are optimal values?

Each state s in the environment (denoted as <xt, yt, vxt, vyt > in this racetrack project) has a value V(s). Different actions can be taken in a given state s. The optimal values of each state s are based on the action a that generates the best expected cumulative reward.

Expected cumulative reward is defined as the immediate reward the race car would receive for taking action a in state s, plus the discounted reward it would receive if it always chose the best action (the one that maximizes total reward) from the resulting state onward to the terminal state (i.e. the finish line of the racetrack).

V*(s) = best possible (immediate reward + discounted future reward)

where the * means “optimal.”

The reason why those future rewards are discounted (typically by a number in the range [0,1), known as gamma γ) is because rewards received far into the future are worth less than rewards received immediately. For all we know, the race car might have a gas-powered engine, and there is always the risk of running out of gas. After all, would you rather receive $10 today or $10 ten years from now? $10 received today is worth more (e.g. you can invest it to generate even more cash). Future rewards are more uncertain so that is why we incorporate the discount rate γ.
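
As a quick, made-up illustration of discounting in code, suppose the race car receives a reward of -1 at each of the next four timesteps and γ = 0.9 (both numbers are assumptions for this example only):

GAMMA = 0.9                 # discount rate (assumed value for this illustration)
rewards = [-1, -1, -1, -1]  # hypothetical reward of -1 per move

# Discounted return: r0 + gamma*r1 + gamma^2*r2 + gamma^3*r3
discounted_return = sum((GAMMA ** t) * r for t, r in enumerate(rewards))
print(discounted_return)    # about -3.44, versus -4.0 without discounting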

It is common in control applications to see state-action pairs denoted as Q*(s, a) instead of V*(s) (Alpaydin, 2014). Q*(s,a) is the [optimal] expected cumulative reward when the race car takes an action a in state s and always takes the best action after that timestep. All of these values are typically stored in a table. The table maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state] 

Note that Q(s,a) above is not labeled Q*(s,a). Only once value iteration is done can we label it Q*(s,a) because it takes time to find the optimal Q values for each state-action pair.

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep of the race, the race car observes the state (i.e. position and velocity) and decides what action to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state. The car then takes that action.

The pseudocode for calculating the optimal value for each state-action pair (denoted as Q*(s,a)) in the environment is below. This algorithm is called the value iteration algorithm because it iterates over all the state-action pairs in the environment and gives them a value based on the cumulative expected reward of that state-action pair (Alpaydin, 2014; Russell, Norvig, & Davis, 2010; Sutton & Barto, 2018):

Value Iteration Algorithm Pseudocode

Inputs

  • States:
    • List of all possible values of x
    • List of all possible values of y
    • List of all possible values of vx
    • List of all possible values of vy
  • Actions:
    • List of all the possible values of ax
    • List of all possible values of ay
  • Model:
    • Model = Transition Model T[s, a, s’] + Reward Model R[s, a, s’]
    • Where Model is a single table with the following row entries
      • [s, a, s’, probability of ending up in a new state s’ given state s and action a, immediate reward for ending up in new state s’ given state s and action a]
      • = [s, a, s’, T(s, a, s’), R(s, a, s’)]
      • Note that the reward will be -1 for each state except the finish line states (i.e. absorbing or terminal states), which will have a reward of 0.
  • Discount Rate:
    • γ
      • Where 0 ≤ γ < 1
      • If γ = 0, rewards received immediately are the only thing that counts.
      • As γ gets close to 1, rewards received further in the future count more.
  • Error Threshold
    • θ
      • θ is a small number that helps us to determine when we have optimal values for each state-action pair (also known as convergence).
      • θ helps us know when the values of each state-action pair, denoted as Q(s,a), stabilize (i.e. stop changing a lot from run to run of the loop).

Process

  • Create a table V(s) that will store the optimal Q-value for each state. This table will help us determine when we should stop the algorithm and return the output. Initialize all the values of V(s) to arbitrary values, except the terminal state (i.e. finish line state) that has a value of 0.
  • Initialize all Q(s,a) to arbitrary values, except the terminal state (i.e. finish line states) that has a value of 0.
  • Initialize a Boolean variable called done that is a flag to indicate when we are done training (or set a fixed number of training iterations using a for loop).
  • While (Not Done)
    • Initialize a value called Δ to 0. When this value gets below the error threshold θ, we exit the main loop and return the output.
    • For all possible states of the environment
      • v := V(s)            // Extract and store the most recent value of the state
      • For all possible actions the race car (agent) can take
        • Q(s,a) = (expected immediate reward given the state s and an action a) + (discount rate) * [summation over all new states s’ of (probability of ending up in a new state s’ given state s and action a) * (value of the new state s’)]
        • More formally:
          • Q(s,a) = R(s,a) + γ * Σs’ T(s, a, s’) * V(s’)
          • where R(s,a) = the expected immediate reward when taking action a from state s
      • V(s) := maximum value of Q(s,a) across all the different actions that the race car can take from state s
      • Δ := maximum(Δ, |v – V(s)|)
    • Is Δ < θ? If yes, we are done (because the values of V(s) have converged, i.e. stabilized); otherwise we are not done.

Output

The output of the value iteration algorithm is the value table, the table that maps state-action pairs to the optimal Q-values (i.e. Q*(s,a)).

Each row in the table corresponds to a state-action-value combination. So in this racetrack problem, we have the following entries into the value table:

[x, y, vx, vy, ax, ay, Q(s,a)] = [x-coordinate, y-coordinate, velocity in x-direction, velocity in y-direction, acceleration in x-direction, acceleration in y-direction, value when taking that action in that state] 

Policy Determination

With the optimal Q values for each state-action pair, the race car can calculate the best action to take given a state. The best action to take given a state is the one with the highest Q value.

At this stage, the race car is ready to start the engine and leave the starting position.

At each timestep t of the race, the race car observes the state st (i.e. position and velocity) and decides what action at to apply by looking at the value table. It finds the action that corresponds to the highest Q value for that state (the action that will generate the highest expected cumulative reward). The car then takes that action at*, where at* means the optimal action to take at time t.

More formally, the optimal policy for a given state s is π*(s), where:

For each state s do

π*(s) := argmax_a(Q*(s,a))

Value Iteration Summary

To sum up, at a high level, the complete implementation of an autonomous race car using value iteration has two main steps:

  • Step 1: During the training phase, calculate the value of each state-action pair.
  • Step 2: At each timestep of the time trial, given a current state s, select the action a where the state-action pair value (i.e. Q*(s,a)) is the highest. A toy example of both steps in code is sketched after this list.
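
To make those two steps concrete, here is a minimal sketch of value iteration on a tiny, made-up MDP with two states and two actions. The states, transition probabilities, and rewards below are invented purely for illustration; the full racetrack implementation appears later in this post:

import random

# Toy MDP: states 0 and 1, actions 'stay' and 'go'. State 1 is the terminal state.
# T[(s, a)] is a list of (probability, next_state) pairs; R gives the immediate reward.
STATES = [0, 1]
ACTIONS = ['stay', 'go']
TERMINAL = 1
T = {
    (0, 'stay'): [(1.0, 0)],
    (0, 'go'):   [(0.8, 1), (0.2, 0)],
    (1, 'stay'): [(1.0, 1)],
    (1, 'go'):   [(1.0, 1)],
}
R = {s: (0.0 if s == TERMINAL else -1.0) for s in STATES}

GAMMA = 0.9    # discount rate
THETA = 0.001  # error threshold for convergence

# Step 1 (training): initialize V(s) arbitrarily (terminal state to 0) and iterate
V = {s: (0.0 if s == TERMINAL else random.random()) for s in STATES}
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

while True:
    delta = 0.0
    for s in STATES:
        v = V[s]
        for a in ACTIONS:
            # Bellman backup: immediate reward + discounted expected value of s'
            Q[(s, a)] = R[s] + GAMMA * sum(p * V[s2] for p, s2 in T[(s, a)])
        V[s] = max(Q[(s, a)] for a in ACTIONS)
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:
        break

# Step 2 (racing): the greedy policy picks the action with the highest Q-value
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)   # {0: 'go', 1: 'stay'}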

Return to Table of Contents

How Q-Learning Works

Overview

In most real-world cases, it is not computationally efficient to build a complete model (i.e. transition and reward functions) of the environment beforehand. In these cases, the transition and reward functions are unknown. To solve this issue, model-free reinforcement learning algorithms like Q-learning were invented (Sutton & Barto, 2018).

In the case of model-free learning, the race car (i.e. agent) has to interact with the environment and learn through trial-and-error. It cannot look ahead to inspect what the value would be if it were to take a particular action from some state. Instead, in Q-learning, the race car builds a table of Q-values denoted as Q(s,a) as it takes an action a from a state s and then receives a reward r. The race car uses the Q-table at each timestep to determine the best action to take given what it has learned in previous timesteps.

One can consider Q[s,a] as a two-dimensional table that has the states s as the rows and the available actions a as the columns. The value in each cell of this table (i.e. the Q-value) represents the value of taking action a in state s.

Just like in value iteration, the value has two components: the immediate reward and the discounted future reward.

More formally,

Q[s,a] = immediate reward + discounted future reward

Where immediate reward is the reward you get when taking action a from state s, and discounted future reward is the reward you get for taking actions in the future, discounted by some discount rate γ. The reason why future rewards are discounted is because rewards received immediately are more certain than rewards received far into the future (e.g. $1 received today is worth more than $1 received fifty years from now).

So, how does the race car use the table Q[s,a] to determine what action to take at each timestep of the time trial (i.e. what is the policy given a state…π(s))?

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmax_a(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all possible actions given a state and selects the action that has the highest Q-value.
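
As a tiny sketch of that lookup, using a hypothetical dictionary-based Q table keyed by (state, action) pairs (the numbers are made up):

# Hypothetical Q table: maps (state, action) pairs to Q-values
Q = {
    ((2, 2, 1, 0), (0, 0)):  -5.1,
    ((2, 2, 1, 0), (1, 1)):  -3.7,
    ((2, 2, 1, 0), (-1, 0)): -6.2,
}

def greedy_policy(state, actions, Q):
    """pi(s) = the action a that maximizes Q[s, a]."""
    return max(actions, key=lambda a: Q[(state, a)])

actions = [(0, 0), (1, 1), (-1, 0)]
print(greedy_policy((2, 2, 1, 0), actions, Q))  # (1, 1), the action with the highest Q-value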

After Q-learning is run a number of times, an optimal policy will eventually be found. The optimal policy is denoted as π*(s). Similarly the optimal Q-value is Q*[s,a].

How the Q[s,a] Table Is Built

Building the Q[s,a] table can be broken down into the following steps:

  1. Choose the environment we want to train on.
    • In this race car problem, the racetrack environment is what we want the race car to explore and train on.
    • The racetrack provides the x and y position data that constitutes a portion of the state s that will be observed at each timestep by the racecar (agent).
  2. Starting from the starting position, the race car moves through the environment, observing the state s at a given timestep.
  3. The race car consults the policy π(s), the function that takes as input the current state and outputs an action a.
  4. The action a taken by the race car changes the state of the environment from s to s’. The environment also generates a reward r that is passed to the race car.
  5. Steps 2-4 generate a complete experience tuple (i.e. a tuple is a list that does not change) of the form <s, a, s’, r> = <state, action, new state, reward>.
  6. The experience tuple in 5 is used to update the Q[s,a] table.
  7. Repeat steps 3-6 above until the race car gets to the terminal state (i.e. the finish line of the racetrack).

With the Q[s,a] table built, the race car can now test the policy. Testing the policy means starting at the starting position and consulting the Q[s,a] table at each timestep all the way until the race car reaches the finish line. We count how many timesteps it took for the race car to reach the finish line. If the time it took the race car to complete the time trial is not improving, we are done. Otherwise, we make the race car go back through the training process and test its policy again with an updated Q[s,a] table.

We expect that eventually, the performance of the race car will reach a plateau.

How the Q[s,a] Table Is Updated

With those high level steps, let us take a closer look now at how the Q[s,a] table is updated in step 6 in the previous section.

Each time the race car takes an action, the environment transitions to a new state s’ and gives the race car a reward r. This information is used by the race car to update its Q[s,a] table.

The update rule consists of two parts, where Q’[s,a] is the new updated Q-value for an action a taken at state s.

Q’[s,a] = Q[s,a] + α * (improved estimate of Q[s,a] – Q[s,a])

where α is the learning rate.

The learning rate is a number 0 < α ≤ 1 (commonly 0.2). Thus, the new updated Q-value is a blend of the old Q-value and the improved estimate of the Q value for a given state-action pair. The higher the learning rate, the more new information is considered when updating the Q-value. A learning rate of 1 means that the new updated Q-value is only considering the new information.
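
As a quick numeric illustration with made-up numbers: with α = 0.2, an old Q-value of 10, and an improved estimate of 20, the updated Q-value is a blend of 80% old value and 20% new information:

alpha = 0.2
old_q = 10.0               # hypothetical old Q-value
improved_estimate = 20.0   # hypothetical improved estimate of Q[s,a]

new_q = old_q + alpha * (improved_estimate - old_q)
print(new_q)               # 12.0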

The equation above needs to be expanded further.

Q’[s,a] = Q[s,a] + α * (immediate reward + discounted future reward – Q[s,a])

Q’[s,a] = Q[s,a] + α * (r + γ * future reward – Q[s,a])

Where γ is the discount rate, typically a number in the range [0,1). It means that rewards received by the race car (agent) far in the future are worth less than an immediate reward.

Continuing with the expansion of the equation, we have:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, optimal action a’ at new state s’] – Q[s,a])

Note in the above equation that we assume that the race car reaches state s’ and takes the best action from there, where best action is the action a’ that has the highest Q-value for that given new state s’.

More formally the update rule is as follows:

Q’[s,a] = Q[s,a] + α * (r + γ * Q[s’, argmax_a’(Q[s’,a’])] – Q[s,a])

Where argmax_a’(Q[s’,a’]) returns the action a’ that has the highest Q value for a given state s’.
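
Here is a minimal sketch of that update rule in Python, using a hypothetical dictionary-based Q table keyed by (state, action) pairs (this is not the final racetrack code, which appears later in this post):

ALPHA = 0.2   # learning rate
GAMMA = 0.9   # discount rate

def q_update(Q, s, a, r, s_prime, actions):
    """Apply one Q-learning update for the experience tuple <s, a, s', r>."""
    # Value of the best action available from the new state s'
    best_next = max(Q[(s_prime, a_prime)] for a_prime in actions)
    # Q'[s,a] = Q[s,a] + alpha * (r + gamma * best_next - Q[s,a])
    Q[(s, a)] = Q[(s, a)] + ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# Hypothetical usage with two made-up states and two actions
actions = [(0, 0), (1, 1)]
Q = {(s, a): 0.0 for s in ['s0', 's1'] for a in actions}
q_update(Q, 's0', (1, 1), -1, 's1', actions)
print(Q[('s0', (1, 1))])   # -0.2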

How an Action Is Selected at Each Step

The performance of Q-learning improves the more the race car explores different states and takes different actions in those states. In other words, the more state-action pairs the race car explores, the better the Q[s,a] table it is able to build.

One strategy for forcing the race car to explore as much of the state-action space as possible is to add randomness into the Q-learning procedure. This is called exploration. More specifically, at the step of Q-learning where the race car selects an action based on the Q[s,a] table:

  • There is a 20% chance ε that the race car will take some random action, and an 80% chance the race car will take the action with the highest Q-value (as determined by the Q[s,a] table). The latter is known as exploitation.
  • If the race car does some random action, the action is chosen randomly from the set of possible actions the race car can perform.

I chose 20% as the starting probability (i.e. ε) of doing some random action, but it could be another number (e.g. 30%). As the race car gains more experience and the Q[s,a] table gets better and better, we want the race car to take fewer random actions and follow its policy more often. For this reason, it is common to reduce ε with each successive iteration, until we get to a point where the race car stops exploring random actions.
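
A minimal sketch of this ε-greedy action selection, with a decaying ε, might look like the following (the decay factor of 0.9 is just an assumption for illustration):

import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (best action)."""
    if random.random() < epsilon:
        return random.choice(actions)                     # exploration
    return max(actions, key=lambda a: Q[(state, a)])      # exploitation

# Reduce epsilon after each training iteration so the car explores less over time
epsilon = 0.2
for iteration in range(10):
    # ... run one training iteration using epsilon_greedy_action(...) ...
    epsilon *= 0.9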

Q-Learning Algorithm Pseudocode

Below is the Q-learning algorithm pseudocode at a high level (Alpaydin, 2014; Sutton & Barto, 2018).

Inputs

  • Learning rate α, where 0 < α ≤ 1 (e.g. 0.2)
  • Discount rate γ (e.g. 0.9)
  • Exploration probability ε, corresponding to the probability that the race car will take a random action given a state (e.g. 0.2)
  • Reduction of the exploration probability, corresponding to how much we want to decrease ε at each training iteration (defined as a complete trip from the starting position to the finish line terminal state) of the Q-learning process (e.g. 0.5)
  • Number of training iterations (e.g. 1,000,000)
  • States:
    • List of all possible values of x
    • List of all possible values of y
    • List of all possible values of vx
    • List of all possible values of vy
  • Actions:
    • List of all the possible values of ax
    • List of all possible values of ay

Process

  • Initialize the Q[s,a] table to small random values, except for the terminal state which will have a value of 0.
  • For a fixed number of training iterations
    • Initialize the state to the starting position of the race track
    • Initialize a Boolean variable to see if the race car has crossed the finish line (i.e. reached the terminal state)
    • While the race car has not reached the finish line
      • Select the action a using the Q[s,a] table that corresponds to the highest Q value given state s
      • Take action a
      • Observe the reward r and the new state s’
      • Update the Q[s,a] table:
        • Q[s,a] := Q[s,a] + α * (r + γ * Q[s’, argmaxa’(Q[s’,a’])] – Q[s,a])
      • s := s’
      • Check if we have crossed the finish line

Output

The algorithm returns the Q[s,a] table, which is used to determine the optimal policy.
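
Tying the pseudocode together, here is a compact sketch of the training loop. The env_step and is_terminal arguments are hypothetical placeholders for the environment dynamics (how the racetrack responds to an action); this is only a sketch, not the final racetrack code that appears later in this post:

import random

def train_q_learning(Q, actions, start_state, is_terminal, env_step,
                     n_iterations=1000, alpha=0.2, gamma=0.9, epsilon=0.2,
                     epsilon_decay=0.99):
    """Sketch of the Q-learning training process.

    Q is a dictionary mapping (state, action) pairs to Q-values. env_step(s, a)
    is assumed to return the new state s' and the reward r."""
    for _ in range(n_iterations):
        s = start_state
        while not is_terminal(s):
            # Epsilon-greedy action selection: explore sometimes, exploit otherwise
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            # The environment returns the new state s' and the reward r
            s_prime, r = env_step(s, a)
            # Q-learning update rule
            best_next = max(Q[(s_prime, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_prime
        # Reduce the exploration probability after each training iteration
        epsilon *= epsilon_decay
    return Q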

Policy Determination

Each time the race car observes state s, it consults the Q[s,a] table. It takes a look to see which action generated the highest Q value given that observed state s. That action is the one that it takes.

More formally,

π(s) = argmax_a(Q[s,a])

argmax is a function that returns the action a that maximizes the expression Q[s,a]. In other words, the race car looks across all possible actions given a state and selects the action that has the highest Q-value.

Helpful Video

Here is a good video where Professor Tucker Balch from Georgia Tech goes through Q-learning, step-by-step:

Return to Table of Contents

Experimental Design and Tuning Process

Experimental Design

The Q-learning and value iteration algorithms were implemented for the racetrack problem and then tested on two different racetracks: an R-shaped racetrack and an L-shaped racetrack. The number of timesteps the race car needed to find the finish line was calculated for each algorithm-racetrack combination. The number of training iterations was also monitored. Ten experiments were performed for each algorithm-racetrack combination to create data for graphing a learning curve.

At each timestep t of the racetrack problem, we have the following variables (below). We assume the racetrack is a Cartesian grid with an x-axis and a y-axis, and that the system is governed by the standard laws of kinematics:

  • State = <xt, yt, vxt, vyt> = variables that describe the current situation (i.e. environment)
    • xt = x coordinate of the location of the car at timestep t
    • yt = y coordinate of the location of the car at timestep t
    • vxt = velocity in the x direction at time step t, where vxt = xt – xt-1
      • Maximum vxt = 5
      • Minimum vxt = -5
    • vyt = velocity in the y direction at timestep t, where vyt = yt – yt-1
      • Maximum vyt = 5
      • Minimum vyt = -5
  • Action = <axt, ayt> = control variables = what the race car (i.e. agent) can do to influence the state
    • axt = acceleration in the x direction at timestep t, where axt = vxt – vxt-1
      • -1 (decrease velocity in x-direction by 1)
      • 0 (keep same velocity), or
      • 1 (increase velocity in x-direction by 1)
    • ayt = acceleration in the y direction at timestep t, where ayt = vyt – vyt-1
      • -1 (decrease velocity in y-direction by 1)
      • 0 (keep same velocity)
      • 1 (increase velocity in y-direction by 1)

Example Calculation

At t = 0, the race car observes the current state. Location is (2,2) and velocity is (1,0).

  • State = <xt, yt, vxt, vyt> = <x0, y0, vx0, vy0> = <2, 2, 1, 0>
  • This means the race car is moving east one grid square per timestep because the x-component of the velocity is 1, and the y-component is 0.

After observing the state, the race car selects an action. It accelerates with acceleration (1,1).

  • Action=(ax0, ay0)=(1, 1) = (increase velocity in x-direction by 1, increase velocity in y-direction by 1)

At t = 1, the race car observes the state.

  • Velocity is now (vx0 + ax0, vy0 + ay0) = (1 + 1, 0 + 1) = (vx1, vy1) = (2, 1)
  • Position (i.e. x and y coordinates) is now (x0 + vx1, y0 + vy1) = (2 + 2, 2 + 1) = (x1, y1) = (4, 3)
  • Thus, putting it all together, the new state is now <x1, y1, vx1, vy1> = <4, 3, 2, 1> (see the code sketch after this list)
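
Here is a small sketch of that kinematics update in code, clamping the velocity to the limits mentioned earlier; running it reproduces the numbers in this example:

MIN_VELOCITY, MAX_VELOCITY = -5, 5

def next_state(state, action):
    """Apply an acceleration (ax, ay) to a state (x, y, vx, vy)."""
    x, y, vx, vy = state
    ax, ay = action
    # Update the velocity first, clamped to the allowed range
    vx = max(MIN_VELOCITY, min(MAX_VELOCITY, vx + ax))
    vy = max(MIN_VELOCITY, min(MAX_VELOCITY, vy + ay))
    # Then update the position using the new velocity
    return (x + vx, y + vy, vx, vy)

print(next_state((2, 2, 1, 0), (1, 1)))   # (4, 3, 2, 1)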

Acceleration is Not Guaranteed

There is another twist with the racetrack problem. At any timestep t, when the race car attempts to accelerate, there is a 20% chance that the attempt will fail. In other words, each time the race car selects an action, there is a:

  • 20% chance that the attempt fails, and velocity remains unchanged at the next timestep; i.e. (axt, ayt) = (0, 0)
  • 80% chance velocity changes at the next timestep; (axt, ayt) = (selected acceleration in the x-direction, selected acceleration in the y-direction). A short code sketch of this follows this list.
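
In code, that stochastic acceleration is just a coin flip before the kinematics update. This sketch reuses the next_state helper from the previous snippet:

import random

P_FAIL = 0.20  # probability that the attempted acceleration fails

def noisy_next_state(state, action):
    """With probability P_FAIL the acceleration fails and the velocity is unchanged."""
    if random.random() < P_FAIL:
        action = (0, 0)   # attempt failed: no change in velocity
    return next_state(state, action)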

Must Avoid Crashing Into Walls

The race car must stay on the track. Crashing into walls is bad. Two versions of a “bad crash” will be implemented in this project:

  • Bad Crash Version 1: If the car crashes into a wall,
    • The car is placed at the nearest position on the track to the place where the car crashed.
    • The velocity is set to (0, 0).
  • Bad Crash Version 2: If the car crashes into a wall,
    • The car must go back to the original starting position.
    • The velocity is set to (0, 0).

In this project, the performance of both bad crash versions will be compared side by side for the two reinforcement learning algorithms implemented.

Reward Function

Because the goal is for the race car to go from the starting position to the finish line in the minimum number of timesteps, we assign a reward of -1 for each move the car makes. Because the reward is negative, this means that the race car incurs a cost (i.e. penalty) of 1 each time the race car moves to another position. The goal is to get to the finish line in as few moves (i.e. time steps) as possible in order to minimize the total cost (i.e. maximize the total reward).

The finish line is the final state (often referred to as terminal or absorbing state). This state has a reward of 0. There is no requirement to stop right on the finish position (i.e. finish line). It is sufficient to cross it.
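
In code form, this reward scheme is just a couple of lines. The sketch below assumes finish line squares are marked with the character 'F', as in the racetrack files used later in this post:

def reward(environment, y, x):
    """Return 0 if the car is on a finish line ('F') square, otherwise -1 per move."""
    return 0 if environment[y][x] == 'F' else -1

# Tiny made-up track: start, open track, finish
track = [['S', '.', 'F']]
print(reward(track, 0, 1), reward(track, 0, 2))   # -1 0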

Description of Any Tuning Process Applied

In order to compare the performance of the algorithms across the different crash scenarios, the principal hyperparameters were kept the same for all runs. These hyperparameters are listed below.

Learning Rate

The learning rate α was set to 0.2.

Discount Rate

The discount rate γ was set to 0.9.

Exploration Probability

The exploration probability ε was set to 0.2 as dictated by the problem statement.

Number of Training Iterations

The number of training iterations was varied in order to compute the learning curve (see the Results section below).

Return to Table of Contents

Input Files

Here are the tracks used in this project (S = Starting States, F = Finish States, # = Wall, . = Track):

Value Iteration Code in Python

Here is the full code for value iteration. Don’t be scared by how long this code is. I included a lot of comments so that you know what is going on at each step of the code. This and the racetrack text files above are all you need to run the program (just copy and paste into your favorite IDE!):

import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: value_iteration.py
# Author: Addison Sears-Collins
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the value iteration reinforcement learning
# algorithm for the racetrack problem. 
# The racetrack problem is described in full detail in: 
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using 
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and 
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning : 
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Value_Iteration"
FILENAME = "R-track.txt"
THIS_TRACK = "R_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
                 # we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a 
                            # according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 10 # A single training iteration runs through all
                            # possible states s
NO_RACES = 10 # How many times the race car does a single time trial from 
              # starting position to the finish line
FRAME_TIME = 0.3 # How many seconds between frames printed to the console
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
MAX_TRAIN_ITER = 50 # Maximum number of training iterations

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# All actions A the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
           (-1,0) , (0,0),  (1,0),
           (-1,1) , (0,1),  (1,1)]

def read_environment(filename):
    """
    This method reads in the environment (i.e. racetrack)
    :param str filename
    :return environment: list of lists
    :rtype list
    """
    # Open the file named filename in READ mode.
    # Make the file an object named 'file'
    with open(filename, 'r') as file:

        # Read until end of file using readline() 
        # readline() then returns a list of the lines
        # of the input file
        environment_data = file.readlines()
    
    # The with statement automatically closes the file

    # Initialize an empty list
    environment = []

    # Adds a counter to each line in the environment_data list,
    # i is the index of each line and line is the actual contents.
    # enumerate() helps get not only the line of the environment but also 
    # the index of that line (i.e. row)
    for i,line in enumerate(environment_data):
        # Ignore the first line that contains the dimensions of the racetrack
        if i > 0:
            # Remove leading or trailing whitespace if applicable
            line = line.strip()

            # If the line is empty, ignore it
            if line == "": continue

            # Creates a list of lines. Within each line is a list of 
            # individual characters
            # The stuff inside append(stuff) first creates a new_list = []
            # It then appends all the values in a given line to that new 
            # list (e.g. new_list.append(all values inside the line))
            # Then we append that new list to the environment list.
            # Therefore, environment is a list of lists.
            environment.append([x for x in line])

    # Return the environment (i.e. a list of lists/lines)
    return environment

def print_environment(environment, car_position = [0,0]):
    """
    This method reads in the environment and current 
    (y,x) position of the car and prints the environment to the console
    :param list environment
    :param list car_position 
    """
    # Store value of current grid square
    temp = environment[car_position[0]][car_position[1]]

    # Move the car to current grid square
    environment[car_position[0]][car_position[1]] = "X"

    # Delay 
    time.sleep(FRAME_TIME)

    # Clear the printed output
    clear()

    # For each line in the environment
    for line in environment: 

        # Initialize a string
        text = ""

        # Add each character to create a line
        for character in line: 
            text += character

        # Print the line of the environment
        print(text)

    # Restore the value of the current grid square
    environment[car_position[0]][car_position[1]] = temp

def clear():
    """
    This method clears the print output
    """    
    os.system( 'cls' )

def get_random_start_position(environment):
    """
    This method reads in the environment and selects a random
    starting position on the racetrack (y, x). Note that 
    (0,0) corresponds to the upper left corner of the racetrack.
    :param list environment: list of lines
    :return random starting coordinate (y,x) on the racetrack
    :rtype tuple
    """
    # Collect all possible starting positions on the racetrack
    starting_positions = []

    # For each row in the environment
    for y,row in enumerate(environment):

        # For each column in each row of the environment
        for x,col in enumerate(row):

            # If we are at the starting position
            if col == START:

                # Add the coordinate to the list of available
                # starting positions in the environment
                starting_positions += [(y,x)]

    # Random shuffle the list of starting positions
    shuffle(starting_positions)

    # Select a starting position
    return starting_positions[0]

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
    """
    Get the new velocity values
    :param tuple old_vel: (vy, vx)
    :param tuple accel: (ay, ax)
    :param int min_vel: Minimum velocity of the car
    :param int max_vel: Maximum velocity of the car
    :return new velocities in y and x directions
    """
    new_y = old_vel[0] + accel[0] 
    new_x = old_vel[1] + accel[1]
    if new_x < min_vel: new_x = min_vel
    if new_x > max_vel: new_x = max_vel
    if new_y < min_vel: new_y = min_vel
    if new_y > max_vel: new_y = max_vel
    
    # Return the new velocities
    return new_y, new_x

def get_new_position(old_loc, vel, environment):
    """
    Get a new position using the old position and the velocity
    :param tuple old_loc: (y,x) position of the car
    :param tuple vel: (vy,vx) velocity of the car
    :param list environment
    :return y+vy, x+vx: (new_y,new_x)
    """
    y,x = old_loc[0], old_loc[1]
    vy, vx = vel[0], vel[1]

    # new_y = y+vy, new_x = x + vx    
    return y+vy, x+vx

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
        0), open = [TRACK, START, GOAL]):
    """
    Locate the nearest open cell in order to handle crash scenario.
    Distance is calculated as the Manhattan distance.
    Start from the crash grid square and expand outward from there with
    a radius of 1, 2, 3, etc. Forms a diamond search pattern.
    
    For example, a Manhattan distance of 2 would look like the following:     
            .
           ...
          ..#..
           ... 
            .   
    If velocity is provided, search in opposite direction of velocity so that
    there is no movement over walls
    :param list environment
    :param int y_crash: y coordinate where crash happened
    :param int x_crash: x coordinate where crash happened
    :param int vy: velocity in y direction when crash occurred
    :param int vx: velocity in x direction when crash occurred
    :param list of strings open: Contains environment types
    :return tuple of the nearest open y and x position on the racetrack
    """ 
    # Record number of rows (lines) and columns in the environment
    rows = len(environment)
    cols = len(environment[0])    
   
    # Add expanded coverage for searching for nearest open cell
    max_radius = max(rows,cols)

    # Generate a search radius for each scenario
    for radius in range(max_radius):

        # If car is not moving in y direction
        if vy == 0: 
            y_off_range = range(-radius, radius + 1)
        # If the velocity in y-direction is negative
        elif vy < 0:
            # Search in the positive direction
            y_off_range = range(0, radius + 1)
        else:
            # Otherwise search in the negative direction
            y_off_range = range(-radius, 1)

        # For each value in the search radius range of y
        for y_offset in y_off_range:

            # Start near to crash site and work outwards from there
            y = y_crash + y_offset
            x_radius = radius - abs(y_offset)

            # If car is not moving in x direction
            if vx == 0:
                x_range = range(x_crash - x_radius, x_crash + x_radius + 1)
            # If the velocity in x-direction is negative
            elif vx < 0:
                x_range = range(x_crash, x_crash + x_radius + 1)
            # If the velocity in x-direction is positive
            else:
                x_range = range(x_crash - x_radius, x_crash + 1)

            # For each value in the search radius range of x
            for x in x_range:
                # We can't go outside the environment(racetrack) boundary
                if y < 0 or y >= rows: continue
                if x < 0 or x >= cols: continue

                # If we find an open cell, return that (y,x) open cell
                if environment[y][x] in open: 
                    return(y,x)        
    
    # No open grid squares found on the racetrack
    return

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
    False),bad_crash = False):
    """
    This method generates the new state s' (position and velocity) from the old
    state s and the action a taken by the race car. It also takes as parameters
    the two different crash scenarios (i.e. go to nearest
    open position on the race track or go back to start)
    :param int old_y: The old y position of the car
    :param int old_x: The old x position of the car
    :param int old_vy: The old y velocity of the car
    :param int old_vx: The old x velocity of the car
    :param tuple accel: (ay,ax) - acceleration in y and x directions
    :param list environment: The racetrack
    :param boolean deterministic: True if we always follow the policy
    :param boolean bad_crash: True if we return to start after crash
    :return s' where s' = new_y, new_x, new_vy, and new_vx
    :rtype tuple of int
    """ 
    # This method is deterministic if the same output is returned given
    # the same input information
    if not deterministic:

        # If action fails (car fails to take the prescribed action a)
        if random() > PROB_ACCELER_SUCCESS: 
            #print("Car failed to accelerate!")
            accel = (0,0)
 
    # Using the old velocity values and the new acceleration values,
    # get the new velocity value
    new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

    # Using the new velocity values, update with the new position
    temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),( 
                                     environment))

    # Find the nearest open cell on the racetrack to this new position
    new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy, 
                                     new_vx)
    # If a crash happens (i.e. new position is not equal to the nearest
    # open position on the racetrack)
    if new_y != temp_y or new_x != temp_x:

        # If this is a crash in which we would have to return to the
        # starting position of the racetrack and we are not yet
        # on the finish line
        if bad_crash and environment[new_y][new_x] != GOAL:

            # Return to the start of the race track
            new_y, new_x = get_random_start_position(environment)
        
        # Velocity of the race car is set to 0.
        new_vy, new_vx = 0,0

    # Return the new state
    return new_y, new_x, new_vy, new_vx

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
    """
    This method returns the policy pi(s) based on the action taken in each state
    that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
    best action that the race car should take in each state is the one that 
    maximizes the value of Q. (* means optimal)
    :param int cols: Number of columns in the environment
    :param int rows: Number of rows (i.e. lines) in the environment
    :param list vel_range: Range of the velocity of the race car
    :param list Q: The Q table, indexed as Q[y][x][vy][vx][ai]
    :param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
    :return pi : the policy
    :rtype: dictionary: key is the state tuple, value is the 
       action tuple (ay,ax)
    """
    # Create an empty dictionary called pi
    pi = {}

    # For each state s in the environment
    for y in range(rows): 
        for x in range(cols):
            for vy in vel_range:
                for vx in vel_range:
                    # Store the best action for each state...the one that
                    # maximizes the value of Q.
                    # argmax looks across all actions given a state and 
                    # returns the index ai of the maximum Q value
                    pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]        
                    
    # Return pi(s)
    return(pi)

def value_iteration(environment, bad_crash = False, reward = (
        0.0), no_training_iter = NO_TRAINING_ITERATIONS):
    """
    This method is the value iteration algorithm.
    :param list environment
    :param boolean bad_crash
    :param int reward of the terminal states (i.e. finish line)
    :param int no_training_iter
    :return policy pi(s) which maps a given state to an optimal action
    :rtype dictionary
    """
    # Calculate the number of rows and columns of the environment
    rows = len(environment)
    cols = len(environment[0])

    # Create a table V(s) that will store the optimal Q-value for each state. 
    # This table will help us determine when we should stop the algorithm and 
    # return the output. Initialize all the values of V(s) to arbitrary values, 
    # except the terminal state (i.e. finish line state) that has a value of 0.
    # values[y][x][vy][vx] 
    # Read from left to right, we create a list of vx values. Then for each
    # vy value we assign the list of vx values. Then for each x value, we assign
    # the list of vy values (which contain a list of vx values), etc.
    # This is called list comprehension.
    values = [[[[random() for _ in vel_range] for _ in vel_range] for _ in (
        line)] for line in environment]

    # Set the finish line states to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:                 
                        values[y][x][vy][vx] = reward

    # Initialize all Q(s,a) to arbitrary values, except the terminal state 
    # (i.e. finish line states) that has a value of 0.
    # Q[y][x][vy][vx][ai]
    Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
        vel_range)] for _ in line] for line in environment]

    # Set finish line state-action pairs to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:   
                        for ai, a in enumerate(actions):                        
                            Q[y][x][vy][vx][ai] = reward

    # This is where we train the agent (i.e. race car). Training entails 
    # optimizing the values in the tables of V(s) and Q(s,a)
    for t in range(no_training_iter):

        # Keep track of the old V(s) values so we know if we reach stopping 
        # criterion
        values_prev = deepcopy(values)

        # When this value gets below the error threshold, we stop training.
        # This is the maximum change of V(s)
        delta = 0.0

        # For all the possible states s in S
        for y in range(rows):
            for x in range(cols):
                for vy in vel_range:
                    for vx in vel_range:
                        
                        # If this grid square is a wall (a crash state)
                        if environment[y][x] == WALL:

                            # Wall states have a negative value
                            # I set some arbitrary negative value since
                            # we want to train the car not to hit walls
                            values[y][x][vy][vx] = -9.9

                            # Go back to the beginning
                            # of this inner for loop so that we set
                            # all the other wall states to a negative value
                            continue

                        # For each action a in the set of possible actions A
                        for ai, a in enumerate(actions):

                            # The reward is -1 for every state except
                            # for the finish line states
                            if environment[y][x] == GOAL:
                                r = reward
                            else:
                                r = -1

                            # Get the new state s'. s' is based on the current 
                            # state s and the current action a
                            new_y, new_x, new_vy, new_vx = act(
                                y,x,vy,vx,a,environment,deterministic = True, 
                                bad_crash = bad_crash)

                            # V(s'): value of the new state when taking action
                            # a from state s. This is the one step look ahead.
                            value_of_new_state = values_prev[new_y][new_x][
                                new_vy][new_vx]

                            # Get the new state s'. s' is based on the current 
                            # state s and the action (0,0)
                            new_y, new_x, new_vy, new_vx = act(
                                y,x,vy,vx,(0,0),environment,deterministic = (
                                    True), bad_crash = bad_crash)

                            # V(s'): value of the new state when taking action
                            # (0,0) from state s. This is the value if for some
                            # reason the race car attempts to accelerate but 
                            # fails
                            value_of_new_state_if_action_fails = values_prev[
                                new_y][new_x][new_vy][new_vx]

                            # Expected value of the new state s'
                            # Note that each state-action pair has a unique 
                            # value for s'
                            expected_value = (
                                PROB_ACCELER_SUCCESS * value_of_new_state) + (
                                PROB_ACCELER_FAILURE * (
                                    value_of_new_state_if_action_fails))

                            # Update the Q-value in Q[s,a]
                            # immediate reward + discounted future value
                            Q[y][x][vy][vx][ai] = r + (
                                DISC_RATE * expected_value)

                        # Get the action with the highest Q value
                        argMaxQ = np.argmax(Q[y][x][vy][vx])

                        # Update V(s)
                        values[y][x][vy][vx] = Q[y][x][vy][vx][argMaxQ]

        # Make sure the values of the terminal (finish line) states stay at 0
        for y in range(rows):
            for x in range(cols):
                # Terminal state has a value of 0
                if environment[y][x] == GOAL:
                    for vy in vel_range:
                        for vx in vel_range:                 
                            values[y][x][vy][vx] = reward

        # See if the V(s) values are stabilizing
        # Finds the maximum change of any of the states. Delta is a float.
        delta = max([max([max([max([abs(values[y][x][vy][vx] - values_prev[y][
            x][vy][vx]) for vx in vel_range]) for vy in (
            vel_range)]) for x in range(cols)]) for y in range(rows)])

        # If the values of each state are stabilizing, return the policy
        # and exit this method.
        if delta < ERROR_THRES:
            return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

    return(get_policy_from_Q(cols, rows, vel_range, Q, actions))

def do_time_trial(environment, policy, bad_crash = False, animate = True, 
                  max_steps = MAX_STEPS):
    """
    Race car will do a time trial on the race track according to the policy.   
    :param list environment
    :param dictionary policy: A dictionary containing the best action for a 
        given state. The key is the state y,x,vy,vx and value is the action 
        (ax,ay) acceleration
    :param boolean bad_crash: The crash scenario. If true, race car returns to
        starting position after crashes
    :param boolean animate: If true, show the car on the racetrack at each 
        timestep
    :return i: Total steps to complete race (i.e. from starting position to 
        finish line)
    :rtype int

    """
    # Copy the environment
    environment_display = deepcopy(environment)

    # Get a starting position on the race track
    starting_pos = get_random_start_position(environment)
    y,x = starting_pos
    vy,vx = 0,0  # We initialize velocity to 0

    # Keep track if we get stuck
    stop_clock = 0    

    # Begin time trial
    for i in range(max_steps):        

        # Show the race car on the racetrack
        if animate: 
            print_environment(environment_display, car_position = [y,x])
        
        # Get the best action given the current state
        a = policy[(y,x,vy,vx)]

        # If we are at the finish line, stop the time trial
        if environment[y][x] == GOAL: 
            return i 
        
        # Take action and get new a new state s'
        y,x,vy,vx = act(y, x, vy, vx, a, environment, bad_crash = bad_crash)

        # Determine if the car gets stuck
        if vy == 0 and vx == 0:
            stop_clock += 1
        else:
            stop_clock = 0

        # We have gotten stuck as the car has not been moving for 5 timesteps
        if stop_clock == 5:
            return max_steps
        
    # Program has timed out
    return max_steps

def main():
    """
    The main method of the program
    """    
    print("Welcome to the Racetrack Control Program!")
    print("Powered by the " + ALGORITHM_NAME + 
          " Reinforcement Learning Algorithm\n")
    print("Track: " + THIS_TRACK)
    print()
    print("What happens to the car if it crashes into a wall?")
    option_1 = """1. Starts from the nearest position on the track to the 
        place where it crashed."""
    option_2 = """2. Returns back to the original starting position."""
    print(option_1)
    print(option_2)
    crash_scenario = int(input("Crash scenario (1 or 2): "))
    no_training_iter = int(input(
        "Enter the initial number of training iterations (e.g. 5): "))
    print("\nThe race car is training. Please wait...")

    # Directory where the racetrack is located
    #racetrack_name = input("Enter the path to your input file: ") 
    racetrack_name = FILENAME
    racetrack = read_environment(racetrack_name)

    # How many times the race car will do a single time trial
    races = NO_RACES

    while(no_training_iter < MAX_TRAIN_ITER):
    
        # Keep track of the total number of steps
        total_steps = 0

        # Record the crash scenario
        bad_crash = False
        if crash_scenario == 1:
            bad_crash = False
        else:
            bad_crash = True

        # Retrieve the policy
        policy = value_iteration(racetrack, bad_crash = bad_crash,
                             no_training_iter=no_training_iter) 
 
        for each_race in range(races):
            total_steps += do_time_trial(racetrack, policy, bad_crash = (
            bad_crash), animate = True)

        print("Number of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            print("Bad Crash Scenario: " + option_1)
        else:
            print("Bad Crash Scenario: " + option_2)
        print("Average Number of Steps the Race Car Needs to Take Before " + 
              "Finding the Finish Line: " + str(total_steps/races) + " steps\n")
        print("\nThe race car is training. Please wait...")

        # Delay 
        time.sleep(FRAME_TIME + 4)

        # Testing statistics
        test_stats_file = THIS_TRACK 
        test_stats_file += "_"
        test_stats_file += ALGORITHM_NAME + "_iter"
        test_stats_file += str(no_training_iter)+ "_cr" 
        test_stats_file += str(crash_scenario) + "_stats.txt" 

        ## Open a test_stats_file 
        outfile_ts = open(test_stats_file,"w")
        
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write("Track: ") 
        outfile_ts.write(THIS_TRACK)
        outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
        else:
            outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
        outfile_ts.write("Average Number of Steps the Race Car Took " + 
              "Before Finding the Finish Line: " + str(total_steps/races) + 
              " steps\n")

        # Show functioning of the program
        trace_runs_file = THIS_TRACK 
        trace_runs_file += "_"
        trace_runs_file += ALGORITHM_NAME + "_iter"
        trace_runs_file += str(no_training_iter) + "_cr"
        trace_runs_file += str(crash_scenario) + "_trace.txt"

        if no_training_iter <= 5:
            ## Open a new file to save trace runs
            outfile_tr = open(trace_runs_file,"w") 
        
            # Print trace runs that demonstrate proper functioning of the code
            outfile_tr.write(str(policy))  
            
            outfile_tr.close()

        ## Close the files        
        outfile_ts.close()

        no_training_iter += 5

main()

Q-Learning Code in Python

And here is the code for Q-Learning. Again, don’t be scared by how long this code is. I included a lot of comments so that you know what is going on at each step of the code (just copy and paste this into your favorite IDE!):

import os # Library enables the use of operating system dependent functionality
import time # Library handles time-related tasks
from random import shuffle # Import shuffle() method from the random module
from random import random # Import random() method from the random module
from copy import deepcopy # Enable deep copying
import numpy as np # Import Numpy library

# File name: q_learning.py
# Author: Addison Sears-Collins
# Date created: 8/14/2019
# Python version: 3.7
# Description: Implementation of the Q-learning reinforcement learning
# algorithm for the racetrack problem
# The racetrack problem is described in full detail in: 
# Barto, A. G., Bradtke, S. J., Singh, S. P. (1995). Learning to Act Using 
#   Real-time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138.
# and 
# Sutton, Richard S., and Andrew G. Barto. Reinforcement learning : 
#   An Introduction. Cambridge, Massachusetts: The MIT Press, 2018. Print.
#   (modified version of Exercise 5.12 on pg. 111)

# Define constants
ALGORITHM_NAME = "Q_Learning"
FILENAME = "L-track.txt"
THIS_TRACK = "L_track"
START = 'S'
GOAL = 'F'
WALL = '#'
TRACK = '.'
MAX_VELOCITY = 5
MIN_VELOCITY = -5
DISC_RATE = 0.9  # Discount rate, also known as gamma. Determines by how much
                 # we discount the value of a future state s'
ERROR_THRES = 0.001 # Determine when Q-values stabilize (i.e.theta)
PROB_ACCELER_FAILURE = 0.20 # Probability car will try to take action a 
                            # according to policy pi(s) = a and fail.
PROB_ACCELER_SUCCESS = 1 - PROB_ACCELER_FAILURE
NO_TRAINING_ITERATIONS = 500000 # A single training iteration runs through all
                            # possible states s
TRAIN_ITER_LENGTH = 10 # The maximum length of each training iteration
MAX_TRAIN_ITER = 10000000 # Maximum number of training iterations
NO_RACES = 10 # How many times the race car does a single time trial from 
              # starting position to the finish line
LEARNING_RATE = 0.25 # If applicable, also known as alpha
MAX_STEPS = 500 # Maximum number of steps the car can take during time trial
FRAME_TIME = 0.1 # How many seconds between frames printed to the console

# Range of the velocity of the race car in both y and x directions
vel_range = range(MIN_VELOCITY, MAX_VELOCITY + 1)

# Actions the race car can take
# (acceleration in y direction, acceleration in x direction)
actions = [(-1,-1), (0,-1), (1,-1),
           (-1,0) , (0,0),  (1,0),
           (-1,1) , (0,1),  (1,1)]

def read_environment(filename):
    """
    This method reads in the environment (i.e. racetrack)
    :param str filename
    :return environment: list of lists
    :rtype list
    """
    # Open the file named filename in READ mode and read all of its lines
    # into a list using readlines(). The with statement automatically
    # closes the file when the block ends.
    with open(filename, 'r') as file:
        environment_data = file.readlines()

    # Initialize an empty list
    environment = []

    # Adds a counter to each line in the environment_data list,
    # i is the index of each line and line is the actual contents.
    # enumerate() helps get not only the line of the environment but also 
    # the index of that line (i.e. row)
    for i,line in enumerate(environment_data):
        # Ignore the first line that contains the dimensions of the racetrack
        if i > 0:
            # Remove leading or trailing whitespace if applicable
            line = line.strip()

            # If the line is empty, ignore it
            if line == "": continue

            # The list comprehension [x for x in line] creates a list of the
            # individual characters in the line. We then append that list to
            # the environment list. Therefore, environment is a list of lists.
            environment.append([x for x in line])

    # Return the environment (i.e. a list of lists/lines)
    return environment

def print_environment(environment, car_position = [0,0]):
    """
    This method reads in the environment and current 
    (y,x) position of the car and prints the environment to the console
    :param list environment
    :param list car_position 
    """
    # Store value of current grid square
    temp = environment[car_position[0]][car_position[1]]

    # Move the car to current grid square
    environment[car_position[0]][car_position[1]] = "X"

    # Delay 
    time.sleep(FRAME_TIME)

    # Clear the printed output
    clear()

    # For each line in the environment
    for line in environment: 

        # Initialize a string
        text = ""

        # Add each character to create a line
        for character in line: 
            text += character

        # Print the line of the environment
        print(text)

    # Restore value of current grid square
    environment[car_position[0]][car_position[1]] = temp

def clear():
    """
    This method clears the print output
    """    
    os.system( 'cls' )

def get_random_start_position(environment):
    """
    This method reads in the environment and selects a random
    starting position on the racetrack (y, x). Note that 
    (0,0) corresponds to the upper left corner of the racetrack.
    :param list environment: list of lines
    :return random starting coordinate (y,x) on the racetrack
    :rtype tuple
    """
    # Collect all possible starting positions on the racetrack
    starting_positions = []

    # For each row in the environment
    for y,row in enumerate(environment):

        # For each column in each row of the environment
        for x,col in enumerate(row):

            # If we are at the starting position
            if col == START:

                # Add the coordinate to the list of available
                # starting positions in the environment
                starting_positions += [(y,x)]

    # Randomly shuffle the list of starting positions
    shuffle(starting_positions)

    # Select a starting position
    return starting_positions[0]

def act(old_y, old_x, old_vy, old_vx, accel, environment, deterministic=(
    False),bad_crash = False):
    """
    This method generates the new state s' (position and velocity) from the old
    state s and the action a taken by the race car. It also takes as parameters
    the two different crash scenarios (i.e. go to nearest
    open position on the race track or go back to start)
    :param int old_y: The old y position of the car
    :param int old_x: The old x position of the car
    :param int old_vy: The old y velocity of the car
    :param int old_vx: The old x velocity of the car
    :param tuple accel: (ay,ax) - acceleration in y and x directions
    :param list environment: The racetrack
    :param boolean deterministic: True if we always follow the policy
    :param boolean bad_crash: True if we return to start after crash
    :return s' = (new_y, new_x, new_vy, new_vx)
    :rtype tuple of int
    """ 
    # This method is deterministic if the same output is returned given
    # the same input information
    if not deterministic:

        # If action fails (car fails to take the prescribed action a)
        if random() > PROB_ACCELER_SUCCESS: 
            #print("Car failed to accelerate!")
            accel = (0,0)
 
    # Using the old velocity values and the new acceleration values,
    # get the new velocity value
    new_vy, new_vx = get_new_velocity((old_vy,old_vx), accel)

    # Using the new velocity values, update with the new position
    temp_y, temp_x = get_new_position((old_y,old_x), (new_vy, new_vx),( 
                                     environment))

    # Find the nearest open cell on the racetrack to this new position
    new_y, new_x = get_nearest_open_cell(environment, temp_y, temp_x, new_vy, 
                                     new_vx)
    # If a crash happened (i.e. the new position is not equal to the nearest
    # open position on the racetrack)
    if new_y != temp_y or new_x != temp_x:

        # If this is a crash in which we would have to return to the
        # starting position of the racetrack and we are not yet
        # on the finish line
        if bad_crash and environment[new_y][new_x] != GOAL:

            # Return to the start of the race track
            new_y, new_x = get_random_start_position(environment)
        
        # Velocity of the race car is set to 0.
        new_vy, new_vx = 0,0

    # Return the new state
    return new_y, new_x, new_vy, new_vx

def get_new_position(old_loc, vel, environment):
    """
    Get a new position using the old position and the velocity
    :param tuple old_loc: (y,x) position of the car
    :param tuple vel: (vy,vx) velocity of the car
    :param list environment
    :return y+vy, x+vx: (new_y,new_x)
    """
    y,x = old_loc[0], old_loc[1]
    vy, vx = vel[0], vel[1]

    # new_y = y+vy, new_x = x + vx    
    return y+vy, x+vx

def get_new_velocity(old_vel,accel,min_vel=MIN_VELOCITY,max_vel=MAX_VELOCITY):
    """
    Get the new velocity values
    :param tuple old_vel: (vy, vx)
    :param tuple accel: (ay, ax)
    :param int min_vel: Minimum velocity of the car
    :param int max_vel: Maximum velocity of the car
    :return new velocities in y and x directions
    """
    new_y = old_vel[0] + accel[0] 
    new_x = old_vel[1] + accel[1]
    if new_x < min_vel: new_x = min_vel
    if new_x > max_vel: new_x = max_vel
    if new_y < min_vel: new_y = min_vel
    if new_y > max_vel: new_y = max_vel
    
    # Return the new velocities
    return new_y, new_x

def get_nearest_open_cell(environment, y_crash, x_crash, vy = 0, vx = (
        0), open = [TRACK, START, GOAL]):
    """
    Locate the nearest open cell in order to handle crash scenario.
    Distance is calculated as the Manhattan distance.
    Start from the crash grid square and expand outward from there with
    a radius of 1, 2, 3, etc. Forms a diamond search pattern.
    
    For example, a Manhattan distance of 2 would look like the following:     
            .
           ...
          ..#..
           ... 
            .   
    If velocity is provided, search in opposite direction of velocity so that
    there is no movement over walls
    :param list environment
    :param int y_crash: y coordinate where crash happened
    :param int x_crash: x coordinate where crash happened
    :param int vy: velocity in y direction when crash occurred
    :param int vx: velocity in x direction when crash occurred
    :param list of strings open: Contains environment types
    :return tuple of the nearest open y and x position on the racetrack
    """ 
    # Record number of rows (lines) and columns in the environment
    rows = len(environment)
    cols = len(environment[0])    
   
    # Add expanded coverage for searching for nearest open cell
    max_radius = max(rows,cols)

    # Generate a search radius for each scenario
    for radius in range(max_radius):

        # If car is not moving in y direction
        if vy == 0: 
            y_off_range = range(-radius, radius + 1)
        # If the velocity in y-direction is negative
        elif vy < 0:
            # Search in the positive direction
            y_off_range = range(0, radius + 1)
        else:
            # Otherwise search in the negative direction
            y_off_range = range(-radius, 1)

        # For each value in the search radius range of y
        for y_offset in y_off_range:

            # Start near to crash site and work outwards from there
            y = y_crash + y_offset
            x_radius = radius - abs(y_offset)

            # If car is not moving in x direction
            if vx == 0:
                x_range = range(x_crash - x_radius, x_crash + x_radius + 1)
            # If the velocity in x-direction is negative
            elif vx < 0:
                x_range = range(x_crash, x_crash + x_radius + 1)
            # If the velocity in x-direction is positive
            else:
                x_range = range(x_crash - x_radius, x_crash + 1)

            # For each value in the search radius range of x
            for x in x_range:
                # We can't go outside the environment(racetrack) boundary
                if y < 0 or y >= rows: continue
                if x < 0 or x >= cols: continue

                # If we find an open cell, return that (y,x) open cell
                if environment[y][x] in open: 
                    return(y,x)        
    
    # No open grid squares found on the racetrack
    return

def get_policy_from_Q(cols, rows, vel_range, Q, actions):
    """
    This method returns the policy pi(s) based on the action taken in each state
    that maximizes the value of Q in the table Q[s,a]. This is pi*(s)...the
    best action that the race car should take in each state is the one that 
    maximizes the value of Q. (* means optimal)
    :param int cols: Number of columns in the environment
    :param int rows: Number of rows (i.e. lines) in the environment
    :param list vel_range: Range of the velocity of the race car 
    :param list of tuples actions: actions = [(ay,ax),(ay,ax)....]
    :return pi : the policy
    :rtype: dictionary: key is the state tuple, value is the 
       action tuple (ay,ax)
    """
    # Create an empty dictionary called pi
    pi = {}

    # For each state s in the environment
    for y in range(rows): 
        for x in range(cols):
            for vy in vel_range:
                for vx in vel_range:
                    # Store the best action for each state...the one that
                    # maximizes the value of Q.
                    # argmax looks across all actions given a state and 
                    # returns the index ai of the maximum Q value
                    pi[(y,x,vy,vx)] = actions[np.argmax(Q[y][x][vy][vx])]        
                    
    # Return pi(s)
    return(pi)

def q_learning(environment, bad_crash = False, reward = 0.0, no_training_iter =( 
    NO_TRAINING_ITERATIONS), train_iter_length = TRAIN_ITER_LENGTH):
    """ 
    Return a policy pi that maps states to actions
    Each episode uses a different initial state. This forces the agent to fully
    explore the environment to create a more informed Q[s,a] table.
    :param list environment
    :param boolean bad_crash
    :param int reward of the terminal states (i.e. finish line)
    :param int no_training_iter
    :param int train_iter_length
    :return policy pi(s) which maps a given state to an optimal action
    :rtype dictionary
    """
    rows = len(environment)
    cols = len(environment[0])    

    # Initialize all Q(s,a) to arbitrary values, except the terminal state 
    # (i.e. finish line states) that has a value of 0.
    # Q[y][x][vy][vx][ai]
    Q = [[[[[random() for _ in actions] for _ in vel_range] for _ in (
        vel_range)] for _ in line] for line in environment]

    # Set finish line state-action pairs to 0
    for y in range(rows):
        for x in range(cols):
            # Terminal state has a value of 0
            if environment[y][x] == GOAL:
                for vy in vel_range:
                    for vx in vel_range:   
                        for ai, a in enumerate(actions):                        
                            Q[y][x][vy][vx][ai] = reward

    # We run many training iterations for different initial states in order
    # to explore the environment as much as possible
    for iter in range(no_training_iter):
               
        # Reset all the terminal states to the value of the goal
        for y in range(rows):
            for x in range(cols): 
                if environment[y][x] == GOAL: 
                    Q[y][x] = [[[reward for _ in actions] for _ in (
                        vel_range)] for _ in vel_range] 
        
        # Select a random initial state
        # from anywhere on the racetrack
        y = np.random.choice(range(rows))
        x = np.random.choice(range(cols))
        vy = np.random.choice(vel_range) 
        vx = np.random.choice(vel_range) 

        # Do a certain number of iterations for each episode
        for t in range(train_iter_length):
            if environment[y][x] == GOAL: 
                break
            if environment[y][x] == WALL: 
                break

            # Choose the best action for the state s
            a = np.argmax(Q[y][x][vy][vx])           
            
            # Act and then observe a new state s'
            new_y, new_x, new_vy, new_vx = act(y, x, vy, vx, actions[
                a], environment, bad_crash = bad_crash) 
            r = -1

            # Update the Q table based on the immediate reward received from
            # taking action a in state s plus the discounted future reward
            Q[y][x][vy][vx][a] = ((1 - LEARNING_RATE)*Q[y][x][vy][vx][a] + 
                LEARNING_RATE*(r + DISC_RATE*max(Q[new_y][new_x][
                    new_vy][new_vx])))

            # The new state s' now becomes s
            y, x, vy, vx = new_y, new_x, new_vy, new_vx

    return(get_policy_from_Q(cols, rows, vel_range, Q, actions))


def do_time_trial(environment, policy, bad_crash = False, animate = True, 
                  max_steps = MAX_STEPS):
    """
    Race car will do a time trial on the race track according to the policy.   
    :param list environment
    :param dictionary policy: A dictionary containing the best action for a 
        given state. The key is the state y,x,vy,vx and value is the action 
        (ax,ay) acceleration
    :param boolean bad_crash: The crash scenario. If true, race car returns to
        starting position after crashes
    :param boolean animate: If true, show the car on the racetrack at each 
        timestep
    :return i: Total steps to complete race (i.e. from starting position to 
        finish line)
    :rtype int

    """
    # Copy the environment
    environment_display = deepcopy(environment)

    # Get a starting position on the race track
    starting_pos = get_random_start_position(environment)
    y,x = starting_pos
    vy,vx = 0,0  # We initialize velocity to 0

    # Keep track if we get stuck
    stop_clock = 0    

    # Begin time trial
    for i in range(max_steps):        

        # Show the race car on the racetrack
        if animate: 
            print_environment(environment_display, car_position = [y,x])
        
        # Get the best action given the current state
        a = policy[(y,x,vy,vx)]

        # If we are at the finish line, stop the time trial
        if environment[y][x] == GOAL: 
            return i 
        
        # Take action and get new a new state s'
        y,x,vy,vx = act(y, x, vy, vx, a, environment, bad_crash = bad_crash)

        # Determine if the car gets stuck
        if vy == 0 and vx == 0:
            stop_clock += 1
        else:
            stop_clock = 0

        # We have gotten stuck as the car has not been moving for 5 timesteps
        if stop_clock == 5:
            return max_steps
        
    # Program has timed out
    return max_steps

def main():
    """
    The main method of the program
    """    
    print("Welcome to the Racetrack Control Program!")
    print("Powered by the " + ALGORITHM_NAME + 
          " Reinforcement Learning Algorithm\n")
    print("Track: " + THIS_TRACK)
    print()
    print("What happens to the car if it crashes into a wall?")
    option_1 = """1. Starts from the nearest position on the track to the 
        place where it crashed."""
    option_2 = """2. Returns back to the original starting position."""
    print(option_1)
    print(option_2)
    crash_scenario = int(input("Crash scenario (1 or 2): "))
    no_training_iter = int(input(
        "Enter the initial number of training iterations (e.g. 500000): "))
    print("\nThe race car is training. Please wait...")

    # Directory where the racetrack is located
    #racetrack_name = input("Enter the path to your input file: ") 
    racetrack_name = FILENAME
    racetrack = read_environment(racetrack_name)

    # How many times the race car will do a single time trial
    races = NO_RACES

    while(no_training_iter <= MAX_TRAIN_ITER):
    
        # Keep track of the total number of steps
        total_steps = 0

        # Record the crash scenario
        bad_crash = False
        if crash_scenario == 1:
            bad_crash = False
        else:
            bad_crash = True

        # Retrieve the policy
        policy =  q_learning(racetrack, bad_crash = (
            bad_crash),no_training_iter=no_training_iter)
 
        for each_race in range(races):
            total_steps += do_time_trial(racetrack, policy, bad_crash = (
            bad_crash), animate = False)

        print("Number of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            print("Bad Crash Scenario: " + option_1)
        else:
            print("Bad Crash Scenario: " + option_2)
        print("Average Number of Steps the Race Car Needs to Take Before " + 
              "Finding the Finish Line: " + str(total_steps/races) + " steps\n")
        print("\nThe race car is training. Please wait...")

        # Delay 
        time.sleep(FRAME_TIME + 4)

        # Testing statistics
        test_stats_file = THIS_TRACK 
        test_stats_file += "_"
        test_stats_file += ALGORITHM_NAME + "_iter"
        test_stats_file += str(no_training_iter)+ "_cr" 
        test_stats_file += str(crash_scenario) + "_stats.txt" 

        ## Open a test_stats_file 
        outfile_ts = open(test_stats_file,"w")
        
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write(ALGORITHM_NAME + " Summary Statistics\n")
        outfile_ts.write(
            "------------------------------------------------------------------\n")
        outfile_ts.write("Track: ") 
        outfile_ts.write(THIS_TRACK)
        outfile_ts.write("\nNumber of Training Iterations: " + str(no_training_iter))
        if crash_scenario == 1:
            outfile_ts.write("\nBad Crash Scenario: " + option_1 + "\n")
        else:
            outfile_ts.write("Bad Crash Scenario: " + option_2 + "\n")
        outfile_ts.write("Average Number of Steps the Race Car Took " + 
              "Before Finding the Finish Line: " + str(total_steps/races) + 
              " steps\n")

        # Show functioning of the program
        trace_runs_file = THIS_TRACK 
        trace_runs_file += "_"
        trace_runs_file += ALGORITHM_NAME + "_iter"
        trace_runs_file += str(no_training_iter) + "_cr"
        trace_runs_file += str(crash_scenario) + "_trace.txt"

        if no_training_iter <= 5:
            ## Open a new file to save trace runs
            outfile_tr = open(trace_runs_file,"w") 
        
            # Print trace runs that demonstrate proper functioning of the code
            outfile_tr.write(str(policy))  
            
            outfile_tr.close()

        ## Close the files        
        outfile_ts.close()

        if no_training_iter == 500000:
            no_training_iter += 500000
        else:
            no_training_iter += 1000000
           
main()

Return to Table of Contents

Experimental Output and Analysis

Here are the results:

value-iteration-1
value-iteration-2
q-learning-1
q-learning-2
q-learning-3

Analysis

Hypothesis 1: Both Algorithms Will Enable the Car to Finish the Race

Both the value iteration and Q-learning algorithms were successfully able to finish the race for both crash scenarios and for both the R-track and the L-track. The race car completed the R-track in fewer time steps than the L-track, which was expected.

The crash policy in which the car had to return to the starting position after it crashed, rather than restart from the nearest open position, negatively impacted performance for both algorithms. Time trial performance was superior for the value iteration algorithm compared to the Q-learning algorithm. This result was expected given that the value iteration algorithm had access to the transition and reward probabilities during the training phase. In other words, value iteration gets a complete model of the entire track, whereas Q-learning has to explore the environment on its own in order to learn.

Hypothesis 2: Value Iteration Will Learn Faster Than Q-Learning

My hypothesis was correct. Value iteration generated a good policy faster than Q-learning because it had access to the transition and reward functions. Q-learning required more than 6,000,000 training iterations on the R-track to achieve the time trial performance that the value iteration algorithm was able to produce in only 45 training iterations (holding all other hyperparameters constant). And whereas the Q-learning algorithm took 2-3 hours of training to produce good time trial results, value iteration never took more than 10 minutes for both the R-track and the L-track under both of the bad crash scenarios.

Hypothesis 3: Bad Crash Version 1 Will Outperform Bad Crash Version 2

The results show that my hypothesis was correct. It took longer for the car to finish the race in the crash scenario in which the race car had to return to the original starting position each time it crashed into a wall. In other words, Bad Crash Version 1 (restart from the nearest open position) performance was better than Bad Crash Version 2 (return to the original starting position) performance. These results are what I expected, since always returning to the starting position after a crash hampers the ability of the race car to continually explore new states.

Summary and Conclusions

My hypotheses were correct. Here are the main conclusions:

  • Value iteration and Q-learning are powerful reinforcement learning algorithms that can enable an agent to learn autonomously.
  • Value iteration led to faster learning than the Q-learning algorithm.
  • A crash policy in which the race car always returns to the starting position after a crash negatively impacts performance.

Return to Table of Contents

Video Demonstration

Here is a video demonstration of the algorithms in action.

References

Alpaydin, E. (2014). Introduction to Machine Learning. Cambridge, Massachusetts: The MIT Press.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 81-138.

Russell, S. J., Norvig, P., & Davis, E. (2010). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, Massachusetts: The MIT Press.

Return to Table of Contents

How Reinforcement Learning Works

Let’s examine what reinforcement learning is all about by looking at an example.

We have a race car that drives all by itself (i.e. autonomous race car). The race car will move around in an environment. Formally, the race car is known in reinforcement learning language as the agent. The agent is the entity in a reinforcement learning problem that is learning and making decisions. An agent in this example is a race car, but in other applications it could be a robot, self-driving car, autonomous helicopter, etc.

reinforcement-learning-first

The race car will interact with the environment. The environment in this example is the racetrack. It is represented as the little cloud in the image above. Formally, the environment is the world in which the agent learns and makes decisions. In this example, it is a race track, but in other applications it could be a chessboard, bedroom, warehouse, stock market, factory, etc. 

In order to interact with the environment, the race car has to take actions (i.e. accelerate, decelerate, turn). Each time the race car takes an action it will change the environment. Formally, actions are the set of things that an agent can do. In a board game like chess, for example, the set of actions could be move-left, move-right, move-up, and move-down.

reinforcement-learning-2

Each time the environment changes, the race car will sense the environment. In this example, sensing the environment means taking an observation of its current position on the racetrack and its velocity (i.e. speed). The position and velocity variables are what we call the current state of the environment. The state is what describes the current environment. In this example, position and velocity constitute the state, but in other applications the state could be any kind of observable value (e.g. roll, pitch, yaw, angle of rotation, force, etc.).

reinforcement-learning-3

After the race car observes the new state, it has to process that state in order to decide what to do next. That decision-making process is called the policy of the agent. You can think of the policy as a function. It takes as inputs the observed state (denoted as s) and outputs an action (denoted as a). The policy is commonly represented as the Greek letter π. So, the function is formally π(s) = a.

reinforcement-learning-4
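
To make the idea of a policy concrete, here is a minimal sketch of a policy stored as a Python lookup table. The specific states and actions are made up for illustration and are not taken from the project code:

# A policy pi implemented as a lookup table (dictionary).
# Key:   state  (y position, x position, y velocity, x velocity)
# Value: action (acceleration in y, acceleration in x)
pi = {
    (3, 2, 0, 1): (0, 1),   # at (3,2) moving right: keep accelerating right
    (3, 3, 0, 2): (1, 0),   # at (3,3) moving right quickly: start turning
}

def policy(state):
    """pi(s) = a : look up the action for the observed state."""
    return pi[state]

print(policy((3, 2, 0, 1)))   # prints (0, 1)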

In robotics, the cycle I described above is often referred to as the sense-think-act cycle.

Diving into greater detail on how this learning cycle works: once the policy of the race car (i.e. the agent) outputs an action, the race car takes that action. That action causes the environment to transition to a new state. That transition is often represented as the letter T (i.e. the transition function). The transition function takes as input the previous state and the new action. The output of the transition function is a new state.

reinforcement-learning-5
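
As a rough sketch (a simplified version of the act() function in the code above, with no wall or crash handling), the transition function takes the old state and the chosen action and returns the new state. The small probability that the acceleration fails is what makes the transition stochastic:

from random import random

PROB_ACCELER_FAILURE = 0.20   # same constant used in the project code

def transition(state, action):
    """T(s, a) -> s' : simplified stochastic transition for the race car."""
    y, x, vy, vx = state
    ay, ax = action

    # With some probability the car fails to accelerate (action becomes (0,0))
    if random() < PROB_ACCELER_FAILURE:
        ay, ax = 0, 0

    # Update the velocity first, then the position
    new_vy, new_vx = vy + ay, vx + ax
    new_y, new_x = y + new_vy, x + new_vx
    return (new_y, new_x, new_vy, new_vx)

print(transition((3, 2, 0, 1), (0, 1)))   # usually (3, 4, 0, 2)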

The robot then observes the new state, examines its policy, and executes a new action. This cycle repeats over and over again.

So the question now is, how is the policy π(s) determined? In other words, how does the race car know what action to take (i.e. accelerate, decelerate, turn) when it observes a particular state (i.e. position on the racetrack and velocity)?

In order to answer those questions, we need to add another piece to the diagram. That piece is known as the reward. The reward is what helps the race car develop a policy (i.e. a plan of what action to take in each state). Let us draw the reward piece on the diagram and then take a closer look at what this is.

reinforcement-learning-6

Every time the race car is in a particular state and takes an action, there is a reward that the race car receives from the environment for taking that action in that particular state. That reward can be either a positive reward or a negative reward. Think of the reward as the feedback that the race car (i.e. agent) receives from the environment for taking an action. The reward tells the race car how well it is doing.

It is helpful to imagine that there is a pocket on the side of the race car that stores “money.” Every time the race car takes an action from a particular state, the environment either adds money to that pocket (positive reward) or removes money from that pocket (negative reward). 

The actual value for the reward (i.e. r in the diagram above) depends on what is known as the reward function. We can denote the reward function as R. The reward function takes as inputs a state, action, and new state and outputs the reward.

R[s, a, s’] -> r

The reward is commonly an integer (e.g. +1 for a positive reward; -1 for a negative reward). In this project, the reward for each move is -1 because we want to teach the race car to move from the starting position to the finish line in the minimum number of timesteps. By penalizing the race car for every move it makes, we ensure that it gradually learns to avoid penalties and to finish the race as quickly as possible.

reinforcement-learning-7
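
A minimal sketch of this reward function for the racetrack problem, following the same convention as the code above (0 for reaching the finish line, -1 for every other move):

GOAL = 'F'   # symbol used for the finish line in the racetrack files

def reward_function(state, action, new_state, environment):
    """R[s, a, s'] -> r : -1 per move, 0 once the car reaches the finish line."""
    new_y, new_x = new_state[0], new_state[1]
    if environment[new_y][new_x] == GOAL:
        return 0
    return -1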

Inside the race car, there is an algorithm that keeps track of what the rewards were when it took an action in a particular state. This algorithm helps the race car improve over time. Improvement in this case means learning a policy that enables it to take actions (given a particular state) that maximize its total reward. 

We will represent this algorithm as Q, where Q is used by the race car in order to develop an optimal policy, a policy that enables the race car to maximize its total reward.

reinforcement-learning-8
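
In the Q-learning implementation above, Q is stored as a table of values Q[s, a]. After each step, the entry for the state-action pair the car just visited is nudged toward the reward it received plus the discounted value of the best action available in the new state:

Q[s, a] <- (1 - LEARNING_RATE) * Q[s, a] + LEARNING_RATE * (r + DISC_RATE * max over a' of Q[s', a'])

This is exactly the update performed inside the training loop of q_learning(), with LEARNING_RATE = 0.25 (alpha) and DISC_RATE = 0.9 (gamma).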

Reinforcement Learning Summary

Summarizing the steps in the previous section, on a high level reinforcement learning has the following steps:

  1. The agent (i.e. race car) takes an observation of the environment. This observation is known as the state.
  2. Based on this state, the race car uses its policy to determine what action to take, where policy is typically a lookup table that matches states to particular actions.
  3. Based on the state and the action taken by the agent, the environment transitions to a new state and generates a reward.
  4. The reward is passed from the environment to the agent.
  5. The agent keeps track of the rewards it receives for the actions it takes from each state and uses that information to update its policy. Over time, the goal of the agent is to learn the best policy, the one that maximizes its total reward (a minimal code sketch of this loop follows this list).
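
Putting the five steps together, here is a minimal sketch of the agent-environment loop in Python. The env object and its reset() and step() methods are hypothetical placeholders used only to illustrate the cycle; they are not part of the project code.

def run_episode(env, policy, max_steps=500):
    """One episode of the sense-think-act cycle described above."""
    state = env.reset()                          # 1. observe the initial state
    total_reward = 0
    for _ in range(max_steps):
        action = policy[state]                   # 2. policy maps state -> action
        state, reward, done = env.step(action)   # 3. environment transitions
        total_reward += reward                   # 4. reward passed to the agent
        if done:                                 # reached the finish line
            break
    return total_reward                          # 5. used to improve the policy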

Markov Decision Problem

reinforcement-learning-9

The description of reinforcement learning I outlined in the previous section is known as a Markov decision problem. A Markov decision problem has the following components:

  • Set of states S: All the different values of s that can be observed by the race car (e.g. position on the racetrack, velocity). In this project, the set of states is all the different values that <x_t, y_t, vx_t, vy_t> can be.
  • Set of actions A: All the different values of a that are the actions performed by the race car on the environment (e.g. accelerate, decelerate, turn). In this project, the set of actions is all the different values that <ax_t, ay_t> can be.
  • Transition Function T[s, a, s’]
    • Where T[s, a, s’] stands for the probability the race car finds itself in a new state s’ (at the next timestep) given that it was in state s and took action a.
    • In probability notation, this is formally written as: 
      • T[s, a, s’] = P[new state = s’ | state = s AND action = a], where P means probability and | means “given” or “conditioned upon.”
  • Reward Function R[s,a,s’]
    • Given a particular state s and a particular action a that changes the state from s to s’, the reward function determines what the reward will be (e.g. -1 reward for all portions of the racetrack except for the finish line, which has a reward of 0).

The goal of a reinforcement learning algorithm is to find the policy (denoted as π(s)) that can tell the race car the best action to take in any particular state, where “best” in this context means the action that will maximize the total reward the race car receives from the environment. The end result is the best sequence of actions the race car can take to maximize its reward (thereby finishing the race in the minimum number of timesteps). The sequence of actions the race car takes from the starting position to the finish line is known as an episode or trial (Alpaydin, 2014). Recall that the policy is typically a lookup table that matches states (input) to actions (output).

reinforcement-learning-11

There are a number of algorithms that can be used to enable the race car to learn a policy that will enable it to achieve its goal of moving from the starting position of the racetrack to the finish line as quickly as possible. The two algorithms implemented in this project were value iteration and Q-learning. These algorithms will be discussed in detail in future posts.

Helpful Video

This video series by Dr. Tucker Balch, a professor at Georgia Tech, provides an excellent introduction to reinforcement learning.

How Does Quickprop Address the Vanishing Gradient Problem in Cascade Correlation?

In order to answer this question, let us break it down piece-by-piece.

What is Quickprop?

Normal backpropagation when training an artificial feedforward neural network can take a long time to converge on the optimal set of weights. Updating the weights in the first layers of a neural network requires errors that are propagated backwards from the output layer. This procedure might work well for small networks, but when there are a lot of hidden layers, standard backpropagation can really slow down the training process.

Scott E. Fahlman, a Professor Emeritus at Carnegie Mellon University in the School of Computer Science, decided in the late 1980s to speed up backpropagation by inventing a new algorithm called Quickprop which was inspired by Newton’s method, a popular root-finding algorithm.

The idea behind Quickprop is to take the biggest steps possible without overstepping the solution (Kirk, 2015). Instead of the error information propagating backwards from the output layer (as is the case with standard backpropagation), Quickprop needs just the current and previous values of the weights, as well as the error derivative that was calculated during the most recent epoch. This twist on backpropagation means that the weight update process is entirely local with respect to each node.

In Quickprop, we assume that the shape of the cost function for a given weight when near its optimal value is a paraboloid. With this assumption we can then follow a method similar to Newton’s method for converging on the optimal solution.

paraboloid_elipticky_png
A paraboloid
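
Concretely, Quickprop fits a parabola through the current and previous error derivatives of each weight and jumps straight to that parabola's minimum: delta_w(t) = S(t) / (S(t-1) - S(t)) * delta_w(t-1), where S(t) is the error derivative dE/dw at epoch t. Below is a minimal one-weight sketch; the quadratic loss E(w) = (w - 3)^2 is just an illustration, and the safeguards in the full algorithm (such as a maximum growth factor) are omitted:

def dE_dw(w):
    return 2.0 * (w - 3.0)        # derivative of the toy loss E(w) = (w - 3)^2

w = 0.0
prev_slope = dE_dw(w)
delta_w = -0.1 * prev_slope       # first step: plain gradient descent
w += delta_w

for epoch in range(5):
    slope = dE_dw(w)
    if prev_slope == slope:       # at the parabola's minimum; avoid dividing by zero
        break
    delta_w = slope / (prev_slope - slope) * delta_w   # Quickprop step
    w += delta_w
    prev_slope = slope
    print(round(w, 4))            # prints 3.0, the minimum, after one Quickprop step

Because the toy loss really is a parabola, a single Quickprop step lands on the minimum; on a real cost surface the same aggressive jumps are what can make the method chaotic, as noted below.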

The downside of Quickprop is that it can get chaotic if the surface of the cost function has a lot of local minima (Kirk, 2015).

Title Image Source: Wikimedia Commons

What is the Vanishing Gradient Problem?

The vanishing gradient problem is as follows: as a neural network gains more and more layers, the gradients of the cost function with respect to the weights at each node get smaller and smaller, approaching zero. Teeny tiny gradients make the network increasingly difficult to train.

The vanishing gradient problem is caused by the nature of backpropagation as well as the activation function (e.g. the logistic sigmoid function). Backpropagation computes gradients using the chain rule. And since the derivative of the sigmoid function is never larger than 0.25, at each step of training you are multiplying a chain of small numbers together in order to calculate the gradients.

Consistently multiplying smaller and smaller gradient values produces smaller and smaller weight update values since the magnitude of the weight updates (i.e. step size) is proportional to the learning rate and the gradients. 
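
A tiny numerical sketch of the effect (the number of layers is arbitrary): even in the best case, where every sigmoid derivative in the chain-rule product equals its maximum value of 0.25, the product collapses toward zero after only a handful of layers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

factor = sigmoid_derivative(0.0)  # 0.25, the best possible case
for n_layers in (2, 5, 10, 20):
    # One derivative factor per layer in the chain-rule product
    print(n_layers, factor ** n_layers)
# 2   0.0625
# 5   ~0.00098
# 10  ~9.5e-07
# 20  ~9.1e-13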

Thus, because the gradients become vanishingly small, the algorithm takes longer and longer to converge on an optimal set of weights. And since the weights only make teeny tiny changes during each step of the training phase, the weights do not update effectively, and the neural network loses accuracy.

What is Cascade Correlation?

Cascade correlation is a method that addresses the question of how many hidden layers and how many hidden nodes we need. Instead of starting with a fixed network with a set number of hidden layers and nodes per hidden layer, the cascade correlation method starts with a minimal network and then trains and adds one hidden layer at a time. 

How Does Quickprop Address the Vanishing Gradient Problem in Cascade Correlation?

Some scientists have combined the Cascade correlation learning architecture with the Quickprop weight update rule (Abrahart, 2004). When combined in this fashion, Cascade correlation prevents the vanishing gradient problem from arising because the gradient is never actually propagated from the output all the way back to the input through many layers.

As noted by Marquez et al. (2018), “such training of deep networks in a cascade, directly circumvents the well-known vanishing gradient problem by ensuring that the output is always adjacent to the layer being trained.” This is because of the way the weights are locked as new layers are inserted, and the inputs are still tied directly to the outputs.

So to answer the question posed in the title, it’s really the cascade correlation architecture that has the greatest effect on reducing the vanishing gradient problem even though there is some benefit from the Quickprop algorithm.

References

Abrahart, Robert J., Pauline E. Kneale, and Linda M. See. Neural Networks for Hydrological Modelling. Leiden, New York: A.A. Balkema Publishers, 2004. Print.

Marquez, Enrique S., Jonathon S. Hare, and Mahesan Niranjan. Deep Cascade Learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11): 5475-5485, 2018.

Kirk, Matthew, et al. Thoughtful Machine Learning: A Test-Driven Approach. Sebastopol, California: O’Reilly, 2015. Print.