How Do Neural Networks Make Predictions?

Neural networks are the workhorses of the rapidly growing field known as deep learning. They are used in all sorts of applications where a prediction of some kind is desired. Here are some examples:

  • Predicting the type of objects in an image or video
  • Sales forecasting
  • Speech recognition
  • Medical diagnosis
  • Risk management
  • and countless other applications… 

In this post, I will explain how neural networks make those predictions by boiling these structures down to their fundamental parts and then building up from there.

Create Your First Neural Network

Imagine you run a business that provides short online courses for working professionals. Some of your courses are free, but your best courses require the students to pay for a subscription. 

You want to create a neural network to predict if a free student is likely to upgrade to a paid subscription. Let’s create the most basic neural network you can make.


OK, so there is our neural network. To implement this neural network on a computer, we need to translate this diagram into a software program. Let’s do that now using Python, one of the most popular languages for machine learning.

# Declare a variable named weight and 
# initialize it with a value
weight = 0.075

# Create a function called neural_network that 
# takes as inputs the input data (number of 
# free courses a student has taken during the 
# last 30 days) and the weight of the connection. 
# The function returns the prediction.

def neural_network(input, weight):

  # The input data multiplied by the weight 
  # equals the prediction
  prediction = input * weight

  # This is the output
  return prediction

So we currently have five students, all of whom are free students. The number of free courses these users have taken during the last 30 days is 12, 3, 5, 6, and 8. Let’s code this in Python as a list.

number_of_free_courses_taken = [12, 3, 5, 6, 8]

Let’s make a prediction for the first student, the one who has taken 12 free courses over the last 30 days.


Now let’s put the diagram into code.

# Extract the first value of the list...12...
# and store it in a variable named first_student_input
first_student_input = number_of_free_courses_taken[0]

# Call the neural_network function and store the 
# prediction result in a variable
first_student_prediction = neural_network(
                         first_student_input, weight)

# Print the prediction
print(first_student_prediction)

OK. We have finished the code. Let’s see how it looks all together.

weight = 0.075

def neural_network(input, weight):

  prediction = input * weight

  return prediction

number_of_free_courses_taken = [
                        12, 3, 5, 6, 8]

first_student_input = number_of_free_courses_taken[0]

first_student_prediction = neural_network(
                       first_student_input, weight)

print(first_student_prediction)


Open a Jupyter Notebook and run the code above, or run the code inside your favorite Python IDE.

Here is what I got:


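What did you get? Did you get 0.9? Python may display it as 0.8999999999999999 because of floating-point rounding, which is the same answer for our purposes. If so, congratulations!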

Let’s see what is happening when we run our code. We called the neural_network function. Inside that function, the input is multiplied by the weight, and the result is returned. In this case, the input is 12, and the weight is 0.075. The result is 0.9.


0.9 is stored in the first_student_prediction variable.


And this, my friend, is the most basic building block of a neural network. A neural network in its simplest form consists of one or more weights which you can multiply by input data to make a prediction.
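To see the same building block applied to more than one student, here is a minimal sketch that reuses the weight and the list from this tutorial and makes one prediction per student. The helper name predict_all is made up for this illustration.

```python
# Same weight and input data as in the tutorial
weight = 0.075
number_of_free_courses_taken = [12, 3, 5, 6, 8]

def neural_network(input, weight):
    # The input data multiplied by the weight equals the prediction
    return input * weight

def predict_all(inputs, weight):
    # Hypothetical helper: one prediction per student
    return [neural_network(value, weight) for value in inputs]

all_predictions = predict_all(number_of_free_courses_taken, weight)

for courses, prediction in zip(number_of_free_courses_taken, all_predictions):
    # Round for display only, since floating-point results are rarely exact
    print(courses, "free courses ->", round(prediction, 3))
```

The weight is shared by every student. Only the input changes from one prediction to the next.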

Let’s take a look at some questions you might have at this stage.

What kind of input data can go into a neural network?

Real numbers that can be measured or calculated somewhere in the real world. Yesterday’s high temperature, a medical patient’s blood pressure reading, previous year’s rainfall, or average annual rainfall are all valid inputs into a neural network. Negative numbers are totally acceptable as well.

A good rule of thumb is, if you can quantify it, you can use it as an input into a neural network. It is best to choose input data that you think will be relevant to the prediction you want to make.

For example, if you are trying to create a neural network to predict if a patient has breast cancer or not, how many fingers a person has is probably not going to be all that relevant. However, how many days per month a patient exercises is likely to be a relevant piece of input data that you would want to feed into your neural network.

What does a neural network predict?

A neural network outputs some real number. In some neural network implementations, we can do some fancy mathematics to limit the output to some real number between 0 and 1. Why would we want to do that? Well in some applications we might want to output probabilities. Let me explain.

Suppose you want to predict the probability that tomorrow will be sunny. The input into a neural network to make such a prediction could be today’s high temperature. 


If the output is some number like 0.30, we can interpret this as a 30% chance of the weather being sunny tomorrow given today’s high temperature. Pretty cool huh!
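One common choice for limiting the output to a number between 0 and 1 is the sigmoid function. The text above does not name a specific function, so treat sigmoid as my assumption. The weight and the temperature below are made-up values chosen only for illustration.

```python
import math

def sigmoid(x):
    # Squash any real number into the open interval (0, 1).
    # Fine for moderate values of x; very large negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(input_value, weight):
    # Same building block as before (input * weight), followed by the squashing step
    raw_output = input_value * weight
    return sigmoid(raw_output)

hypothetical_weight = -0.03        # made up, not a trained value
todays_high_temperature = 28.0     # made up, in degrees Celsius

probability_sunny = predict_probability(todays_high_temperature, hypothetical_weight)
print(round(probability_sunny, 2))
```

Whatever the raw output is, the value returned by sigmoid can be read as a probability.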

We don’t have to limit the output to between 0 and 1. For example, let’s say we have a neural network designed to predict the price of a house given the house’s area in square feet. Such a network might tell us, “given the house’s area in square feet, the predicted price of the house is $432,000.”

What happens if the neural network’s predictions are incorrect?

The neural network will adjust its weights so that it makes a more accurate prediction the next time. Recall that the weights are multiplied by the input values to make a prediction.

What is a neural network really learning?

A neural network is learning the best possible set of weights. “Best” in the context of neural networks means the weights that minimize the prediction error.
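Here is a small sketch of what “minimize the prediction error” can mean in code. I am assuming squared error as the measure, which is a common choice but not one the text specifies. The inputs, targets, and candidate weights are all made up.

```python
def squared_error(prediction, target):
    # Squaring makes the error positive whether the prediction is too high or too low
    return (prediction - target) ** 2

def total_error(weight, inputs, targets):
    # Sum of the squared errors over every data point
    return sum(squared_error(x * weight, t) for x, t in zip(inputs, targets))

inputs = [1.0, 2.0, 3.0]
targets = [0.5, 1.0, 1.5]              # made up: these follow target = 0.5 * input
candidate_weights = [0.1, 0.3, 0.5, 0.7]

# The "best" weight is the candidate with the smallest total error
best_weight = min(candidate_weights,
                  key=lambda w: total_error(w, inputs, targets))

for w in candidate_weights:
    print(w, round(total_error(w, inputs, targets), 3))
print("best weight:", best_weight)
```

A real network does not try a fixed list of candidates. It adjusts the weights step by step, but the goal is the same: the smallest error.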

Remember, the core math operation in a neural network is multiplication, where the simplest neural network is:

Input Value * Weight = Prediction

How does the neural network find the best set of weights?

Short answer: Trial and error

Long answer: A neural network starts out with random numbers for weights. It then takes in a single input data point, makes a prediction, and then sees if its prediction was either too high or too low. The neural network then adjusts its weight(s) accordingly so that the next time it sees the same input data point, it makes a more accurate prediction.

Once the weights are adjusted, the neural network is fed the next data point, and so on. A neural network gets better and better each time it makes a prediction. It “learns” from its mistakes one data point at a time.
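The loop described above can be sketched for a network with a single weight. The update rule below (nudge the weight against the error, scaled by the input and a small learning rate) is one standard way to do the adjustment and is my assumption, since the text does not spell out a rule. The data, the starting weight, and the learning rate are made up.

```python
def train_single_weight(inputs, targets, weight=0.5,
                        learning_rate=0.01, epochs=50):
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            # Make a prediction for one data point
            prediction = x * weight
            # Positive error: prediction too high. Negative: too low.
            error = prediction - target
            # Adjust the weight so the next prediction for x is closer
            weight -= learning_rate * error * x
    return weight

inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]    # made up: these follow target = 2 * input

learned_weight = train_single_weight(inputs, targets)
print(round(learned_weight, 4))
```

A real network would start from a random weight. A fixed starting value is used here only so the result is repeatable.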

Do you notice something here?

Standard neural networks have no memory. They are fed an input data point, make a prediction, see how close the prediction was to reality, adjust the weights accordingly, and then move on to the next data point. At each step of the learning process of a neural network, it has no memory of the most recent prediction it made.

Standard neural networks focus on one input data point at a time. For example, in our subscriber prediction neural network we built earlier in this tutorial, if we feed our neural network number_of_free_courses_taken[1], it will have no clue what it predicted when number_of_free_courses_taken[0] was the input value.

There are some networks that do have short-term memories. One well-known example is the long short-term memory (LSTM) network.

How the Canny Edge Detector Works

In this post, I will explain how the Canny Edge Detector works. The Canny Edge Detector is a popular edge detection algorithm developed by John F. Canny in 1986. The goal of the Canny Edge Detector is to:

  • Minimize Error: Edges that are detected by the algorithm as edges should be real edges and not noise.
  • Good Localization: Minimize the distance between detected edge pixels and real edge pixels.
  • Minimal Responses to Single Edges: Each real edge in the image should be marked only once, and image noise should not create additional false edges.

How the Canny Edge Detector Works

The Canny Edge Detector Process is as follows:

  1. Gaussian Filter: Smooth the input image with a Gaussian filter to remove noise (using a discrete Gaussian kernel).
  2. Calculate Intensity Gradients: Identify the areas in the image with the strongest intensity gradients (using a Sobel, Prewitt, or Roberts kernel).
  3. Non-maximum Suppression: Apply non-maximum suppression to thin out the edges. We want to remove unwanted pixels that might not be part of an edge.
  4. Thresholding with Hysteresis:  Hysteresis or double thresholding involves:
    • Accepting pixels as edges if the intensity gradient value exceeds an upper threshold.
    • Rejecting pixels as edges if the intensity gradient value is below a lower threshold.
    • If a pixel is between the two thresholds, accept it only if it is adjacent to a pixel that is above the upper threshold.

Mathematical Formulation of the Canny Edge Detector

More formally, in step 1 of the Canny Edge Detector, we smooth an image by convolving the image with a Gaussian kernel. An example calculation of the convolution operation is shown in the Sobel Operator discussion. Below is an example 5×5 Gaussian kernel that can be used.


We must go through each 5×5 region in the image and apply the convolution operation between that 5×5 portion of the input image (with the pixel of interest as the center cell, or anchor) and the 5×5 kernel above. The element-wise products are then summed to give us the new intensity value for that pixel.
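As a rough sketch of the smoothing step, the code below builds a normalized 5×5 Gaussian kernel and applies it to one pixel. The value sigma = 1.0 and the function names are my own choices for illustration, so the numbers will not necessarily match the example kernel this post refers to.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    # Sample the 2D Gaussian at integer offsets from the center cell
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    # Normalize so the weights sum to 1 and overall brightness is preserved
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]

def smooth_pixel(image, row, col, kernel):
    # Multiply each kernel cell by the pixel under it and sum the products.
    # The pixel must be at least half a kernel width away from the border.
    half = len(kernel) // 2
    result = 0.0
    for kr, kernel_row in enumerate(kernel):
        for kc, k in enumerate(kernel_row):
            result += k * image[row + kr - half][col + kc - half]
    return result

kernel = gaussian_kernel(size=5, sigma=1.0)
flat_image = [[100] * 5 for _ in range(5)]   # made-up uniform gray image
smoothed_center = smooth_pixel(flat_image, 2, 2, kernel)
print(round(smoothed_center, 6))
```

Because the Gaussian kernel is symmetric, flipping it (as true convolution requires) makes no difference here. A uniform image stays uniform after smoothing, which is a handy sanity check.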

After smoothing the image using the Gaussian kernel, we then calculate the intensity gradients. A common method is to use the Sobel Operator.

Here are the two kernels used in the Sobel algorithm:


The gradient approximations at pixel (x,y), given a 3×3 portion of the source image I, are calculated as follows:

Gx = x-direction kernel * (3x3 portion of image I with (x,y) as the center cell)
Gy = y-direction kernel * (3x3 portion of image I with (x,y) as the center cell)

The * above is not normal matrix multiplication. It denotes the convolution operation.

We then combine the values above to calculate the magnitude of the gradient:

magnitude(G) = square_root(Gx^2 + Gy^2)

The direction of the gradient Ɵ is:

Ɵ = atan(Gy / Gx)

where atan is the arctangent operator.
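The gradient formulas above can be sketched for a single 3×3 patch as follows. The kernels are the commonly cited Sobel kernels, and sign conventions vary between sources. I apply them as an element-wise multiply-and-sum. Strict convolution would flip the kernel first, which only changes the sign of Gx and Gy. I also use atan2 instead of atan so that Gx = 0 does not cause a division by zero. The patch values are made up.

```python
import math

# Commonly cited Sobel kernels (sign conventions differ between sources)
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]

def apply_kernel(kernel, patch):
    # Element-wise multiply the 3x3 kernel with the 3x3 patch and sum
    return sum(kernel[r][c] * patch[r][c]
               for r in range(3) for c in range(3))

def sobel_gradient(patch):
    gx = apply_kernel(SOBEL_X, patch)
    gy = apply_kernel(SOBEL_Y, patch)
    magnitude = math.sqrt(gx ** 2 + gy ** 2)
    # atan2 handles gx == 0 and returns an angle in radians
    direction = math.atan2(gy, gx)
    return gx, gy, magnitude, direction

# Made-up patch: dark on the left, bright on the right (a vertical edge)
patch = [[0, 0, 255],
         [0, 0, 255],
         [0, 0, 255]]

gx, gy, magnitude, direction = sobel_gradient(patch)
print(gx, gy, magnitude, direction)
```

For this patch the intensity changes only from left to right, so all of the gradient ends up in Gx and the direction points along the x-axis.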

Once we have the gradient magnitude and direction, we perform non-maximum suppression by scanning the entire image to get rid of pixels that might not be part of an edge. Non-maximum suppression works by finding pixels that are local maxima in the direction of the gradient (gradient direction is perpendicular to edges).

Suppose, for example, we have three neighboring pixels a, b, and c, where a and c lie along the gradient direction of b. If the gradient magnitude at b is larger than at both a and c, pixel b is kept as an edge pixel. Otherwise, b is not a local maximum and is set to 0 (i.e. black), meaning it is not an edge pixel.

a ——> b <edge> ——> c
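The a, b, c comparison can be sketched on a single row of gradient magnitudes. To keep it short, I assume the gradient direction is horizontal for every pixel, so each pixel is compared with its left and right neighbors. A full implementation would pick the two neighbors according to each pixel's own gradient direction. The magnitudes are made up.

```python
def suppress_non_maximum(a, b, c):
    # Keep b only if it is strictly larger than both neighbors
    if b > a and b > c:
        return b
    return 0

def thin_row(magnitudes):
    # Border pixels have only one neighbor, so they are left at 0
    thinned = [0] * len(magnitudes)
    for i in range(1, len(magnitudes) - 1):
        thinned[i] = suppress_non_maximum(
            magnitudes[i - 1], magnitudes[i], magnitudes[i + 1])
    return thinned

row = [5, 40, 90, 55, 10]   # made-up gradient magnitudes across a blurry edge
thinned = thin_row(row)
print(thinned)
```

The wide ramp of responses collapses to a single pixel, which is what thinning the edge means.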

Non-maximum suppression is not perfect because some edges might actually be noise and not real edges. To solve this, the Canny Edge Detector goes one step further and applies thresholding to remove the weakest edges and keep the strongest ones. Edge pixels that fall between the two thresholds are kept only if they are connected to strong edge pixels.
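Below is a minimal sketch of the double-threshold step on a tiny grid of gradient magnitudes. I treat “connected” as 8-neighbor adjacency and follow chains of in-between pixels outward from the strong ones, which is a common variant. The grid and both thresholds are made up.

```python
from collections import deque

def hysteresis_threshold(magnitudes, low, high):
    rows, cols = len(magnitudes), len(magnitudes[0])
    edges = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Pixels above the upper threshold are accepted right away
    for r in range(rows):
        for c in range(cols):
            if magnitudes[r][c] > high:
                edges[r][c] = True
                queue.append((r, c))

    # Grow outward: accept in-between pixels that touch an accepted pixel.
    # Pixels below the lower threshold are never accepted.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc]
                        and magnitudes[nr][nc] >= low):
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges

gradient_magnitudes = [
    [10, 60, 120],
    [10, 10, 10],
    [60, 10, 10],
]
edge_map = hysteresis_threshold(gradient_magnitudes, low=50, high=100)
for row in edge_map:
    print(row)
```

Both pixels with magnitude 60 sit between the thresholds, but only the one next to the strong pixel (120) is kept. The isolated one in the bottom-left corner is rejected.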

Canny Edge Detector Code

This tutorial has the Python code for the Canny Edge Detector.


In this discussion, we covered the Canny Edge Detector. The Canny Edge Detector is just one of many edge detection algorithms.

The most common edge detection algorithms fall into the following categories:

  • Gradient Operators
    • Roberts Cross Operator
    • Sobel Operator
    • Prewitt Operator
  • Canny Edge Detector
  • Laplacian of Gaussian
  • Haralick Operator

Which edge detection algorithm you choose depends on what you are trying to achieve with your application.

Keep building!

How the Laplacian of Gaussian Filter Works

In this post, I will explain how the Laplacian of Gaussian (LoG) filter works. Laplacian of Gaussian is a popular edge detection algorithm.

Edge detection is an important part of image processing and computer vision applications. It is used to detect objects, locate boundaries, and extract features. Edge detection is about identifying sudden, local changes in the intensity values of the pixels in an image.


How LoG Works

Edge detection algorithms like the Sobel Operator work on the first derivative of an image. In other words, if we have a graph of the intensity values for each pixel in an image, the Sobel Operator takes a look at where the slope of the graph of the intensity reaches a peak, and that peak is marked as an edge.

For our 10×1 pixel image, the blue curve below is a plot of the intensity values, and the orange curve is the plot of the first derivative of the blue curve. In layman’s terms, the orange curve is a plot of the slope.


The orange curve peaks in the middle, so we know that is likely an edge. When we look at the original source image, we confirm that yes, it is an edge.

One limitation with the approach above is that the first derivative of an image might be subject to a lot of noise. Local peaks in the slope of the intensity values might be due to shadows or tiny color changes that are not edges at all.

An alternative to using the first derivative of an image is to use the second derivative, which is the slope of the first derivative curve (i.e. that orange curve above). Such a curve looks something like this (see the gray curve below):


An edge occurs where the graph of the second derivative crosses zero. This second derivative-based method is called the Laplacian algorithm.
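The first-derivative and second-derivative views can be compared on a made-up 10×1 image. The intensity values are invented for this sketch and are not taken from the plots in this post. Derivatives are approximated by differences between neighboring pixels.

```python
def differences(values):
    # Discrete derivative: difference between each pixel and the next one
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]

def zero_crossings(values):
    # Indices where the sign flips between one value and the next
    return [i for i in range(len(values) - 1)
            if values[i] * values[i + 1] < 0]

# Made-up 10x1 image: dark on the left, bright on the right
intensities = [10, 10, 10, 12, 60, 200, 248, 250, 250, 250]

first_derivative = differences(intensities)          # the slope
second_derivative = differences(first_derivative)    # the slope of the slope

steepest_index = first_derivative.index(max(first_derivative))
crossings = zero_crossings(second_derivative)

print(first_derivative)
print(second_derivative)
print(steepest_index, crossings)
```

The first derivative peaks between pixels 4 and 5, and the second derivative changes sign at that same place, so both views point to the same edge.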

The Laplacian algorithm is also subject to noise. For example, consider a photo of a cat.

A cat hair or whisker might register as an edge because it is an area of a sharp change in intensity. However, it is not an edge. It is just noise. To solve this problem, a Gaussian smoothing filter is commonly applied to an image to reduce noise before the Laplacian is applied. This method is called the Laplacian of Gaussian (LoG).

We also set a threshold value to distinguish noise from edges. If the change in the second derivative across a zero crossing exceeds this threshold, the pixel is treated as part of an edge.

Mathematical Formulation of LoG

More formally, given a pixel (x, y), the Laplacian L(x,y) of an image with intensity values I can be written mathematically as follows:
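For reference, the standard definition of the Laplacian is the sum of the second partial derivatives in the x and y directions. It is written here with I for the image intensity:

```latex
L(x, y) = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}
```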


Just like in the case of the Sobel Operator, we cannot calculate the second derivative directly because pixels in an image are discrete. We need to approximate it using the convolution operator. The two most common kernels are:
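Two 3×3 approximations of the Laplacian that are commonly cited are sketched below. One uses the four direct neighbors and the other uses all eight. Some sources flip the signs, so treat the exact values as an assumption on my part. The patches are made up.

```python
# Commonly cited discrete Laplacian kernels (signs vary between sources)
LAPLACIAN_4 = [[ 0, -1,  0],
               [-1,  4, -1],
               [ 0, -1,  0]]
LAPLACIAN_8 = [[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]]

def apply_kernel(kernel, patch):
    # Element-wise multiply the 3x3 kernel with the 3x3 patch and sum.
    # Both kernels are symmetric, so flipping them for convolution changes nothing.
    return sum(kernel[r][c] * patch[r][c]
               for r in range(3) for c in range(3))

flat_patch = [[50, 50, 50],
              [50, 50, 50],
              [50, 50, 50]]
spot_patch = [[50, 50, 50],
              [50, 90, 50],
              [50, 50, 50]]

print(apply_kernel(LAPLACIAN_4, flat_patch), apply_kernel(LAPLACIAN_8, flat_patch))
print(apply_kernel(LAPLACIAN_4, spot_patch), apply_kernel(LAPLACIAN_8, spot_patch))
```

Each kernel sums to zero, so a region of constant intensity gives a response of 0. Only changes in intensity produce a nonzero result.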


Calculating just the Laplacian will result in a lot of noise, so we need to convolve a Gaussian smoothing filter with the Laplacian filter to reduce noise prior to computing the second derivatives. The equation that combines both of these filters is called the Laplacian of Gaussian and is as follows:
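For reference, the usual closed form of the Laplacian of Gaussian is given below. It is centered on zero, σ is the standard deviation of the Gaussian, and the sign and scaling conventions vary between sources:

```latex
LoG(x, y) = -\frac{1}{\pi \sigma^4}
            \left[ 1 - \frac{x^2 + y^2}{2\sigma^2} \right]
            e^{-\frac{x^2 + y^2}{2\sigma^2}}
```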


The above equation is continuous, so we need to discretize it so that we can use it on discrete pixels in an image.

Here is an example of a LoG approximation kernel where σ = 1.4. This is just an example of one convolution kernel that can be used. There are others that would work as well.
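One way to obtain such a kernel is to sample the continuous LoG function at integer pixel offsets. The sketch below does that for a 9×9 kernel with σ = 1.4. The kernel size and the function names are my assumptions. Published example kernels are usually also rescaled and rounded to integers, so the values here will not match them exactly.

```python
import math

def log_value(x, y, sigma):
    # Continuous Laplacian of Gaussian evaluated at offset (x, y)
    r2 = x * x + y * y
    s2 = sigma * sigma
    return (-1.0 / (math.pi * s2 * s2)) \
        * (1.0 - r2 / (2.0 * s2)) \
        * math.exp(-r2 / (2.0 * s2))

def log_kernel(size=9, sigma=1.4):
    # Sample the function on a size x size grid centered on (0, 0)
    half = size // 2
    return [[log_value(x, y, sigma) for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

kernel = log_kernel(size=9, sigma=1.4)
center = kernel[4][4]

for row in kernel:
    print(" ".join(f"{value:8.4f}" for value in row))
```

With this sign convention the center of the kernel is negative and it is surrounded by a ring of positive values. Some sources negate the whole kernel, which does not change where the zero crossings fall.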


This LoG kernel is convolved with a grayscale input image to detect the zero crossings of the second derivative. We set a threshold for these zero crossings and retain only those zero crossings that exceed the threshold. Strong zero crossings are ones that have a big difference between the positive maximum and the negative minimum on either side of the zero crossing. Weak zero crossings are most likely noise, so they are ignored due to the thresholding we apply.
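The idea of keeping only strong zero crossings can be sketched in one dimension. I measure the strength of a crossing as the difference between the two neighboring values on either side of the sign change, which is a simplification of comparing the positive maximum with the negative minimum. The response values and the threshold are made up.

```python
def strong_zero_crossings(response, threshold):
    strong = []
    for i in range(len(response) - 1):
        left, right = response[i], response[i + 1]
        # A zero crossing is a sign change between neighboring values
        if left * right < 0:
            # Strength: how far apart the two sides of the crossing are
            if abs(left - right) > threshold:
                strong.append(i)
    return strong

# Made-up filter response along one row of an image
response = [3, -2, 5, 92, -92, -4, 1]

all_crossings = strong_zero_crossings(response, threshold=0)
kept_crossings = strong_zero_crossings(response, threshold=50)
print(all_crossings, kept_crossings)
```

The small sign flips caused by noise are detected as zero crossings but fall below the threshold. Only the large swing from 92 to -92 is kept as an edge.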

Laplacian of Gaussian Code

This tutorial has the Python code for the Laplacian of Gaussian.