
The Essential Guide to Pytorch Loss Functions

13 min read

Jul 13, 2022

Deval Shah

Humans evolve by learning from their past mistakes. 

Similarly, deep learning training uses a feedback mechanism called loss functions to evaluate mistakes and improve learning trajectories.

In this article, we will go in-depth about the loss functions and their implementation in the PyTorch framework.

Here’s what we’ll cover:

  • What are Loss Functions

  • How to set up PyTorch and define loss functions

  • Loss Functions in PyTorch

  • How to define a custom PyTorch loss function

  • How to monitor PyTorch loss functions

  • PyTorch Loss Functions: Summary

What are Loss Functions

Before diving into the PyTorch specifics, let’s quickly recap the basics of loss functions and their characteristics.

Loss functions measure how close a predicted value is to the actual value. When our model makes predictions that are very close to the actual values on our training and testing dataset, it means we have a pretty robust model.

Loss functions guide the model training process towards correct predictions. A loss function is a mathematical function or expression used to measure a model's performance on a dataset.

Pro tip: Looking for a quick recap of activation functions? Check out Types of Neural Networks Activation Functions.

The objective of the learning process is to minimize the error given by the loss function and thereby improve the model after every training iteration.

Different loss functions suit different problems, each carefully crafted by researchers to ensure stable gradient flow during training.

How to set up PyTorch and define loss functions

Here’s what you need to do before getting hands-on experience with PyTorch.

First, you must set up PyTorch to test and run your code. 

We can do this using these amazing tools:

  • Google Colab

  • Anaconda

Google Colab

Google Colab is helpful if you prefer to run your PyTorch code in your web browser. It comes with all major frameworks preinstalled out of the box, so you can run PyTorch loss functions without any setup.

Anaconda

Another option is to install the PyTorch framework on a local machine using the Anaconda package installer.

With Anaconda, it's easy to get and manage Python, Jupyter Notebook, and other commonly used scientific computing and data science packages, like PyTorch.

Let me walk you through the installation steps:

  1. Download and install Anaconda (choose the latest Python version).

  2. Go to PyTorch's site and find the "Get Started Locally" section.

  3. Specify the appropriate configuration options for your particular environment.

  4. Run the presented command in the terminal to install PyTorch.

PyTorch has two fundamental modules, torch and torch.nn, that provide the starter functions required to construct your loss functions, such as creating a tensor.

The torch library provides excellent flexibility and support for tensor operations on the GPU. It has a wide range of functionalities to train different neural network models.

The torch.nn module provides building blocks like layers, loss functions, and containers that are essential to building and training a model.

You can read more about the torch.nn here.

Once you have PyTorch up and running, here’s how you can add loss functions in PyTorch.

The torch.nn module in PyTorch has predefined, ready-to-use loss functions out of the box that you can use to train your neural network.

Let’s do a simple code walk-through that will guide you on how to add a loss function in PyTorch using the torch.nn module.

  • First, import the required modules from the PyTorch library:
    import torch
    import torch.nn as nn

  • Now, it's time to define a loss function variable. Here, we will use cross-entropy loss, for example, but you can use any loss function from the library:
    loss_fn = nn.CrossEntropyLoss()

Remember, you can also write a custom loss function definition based on your application and use it instead of using an inbuilt PyTorch loss function.

While writing the forward pass for training the neural network, you can use the above loss function by passing the model's predictions and the targets to get a scalar loss value, which is then used for backpropagation.
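
As a rough illustration, here is a minimal, self-contained sketch of a single training step using nn.CrossEntropyLoss; the tiny linear model and the random batch are just stand-ins for your own network and data:

import torch
import torch.nn as nn

model = nn.Linear(10, 5)                # toy model: 10 input features, 5 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 10)             # a batch of 8 samples
labels = torch.randint(0, 5, (8,))      # class indices in [0, 5)

predictions = model(inputs)             # forward pass produces raw logits
loss = loss_fn(predictions, labels)     # scalar loss value

optimizer.zero_grad()                   # clear gradients from the previous step
loss.backward()                         # backpropagate the loss
optimizer.step()                        # update the model parameters
print(loss.item())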

Here’s a link to the PyTorch documentation that lists all the predefined loss functions that you can use directly in your neural network training code. 

I would recommend you refer to the source code for these loss functions to get a better understanding of how loss functions are defined internally in PyTorch.

To learn more about the standard coding practices around writing a neural network training pipeline, I would strongly recommend referring to the PyTorch Image Models open-source framework for training computer vision models.

Pro tip: Check out V7 Model Training to learn more.  

Loss Functions in PyTorch

There are three types of loss functions in PyTorch:

Regression loss functions deal with continuous values, which can take any value between two limits, such as when predicting the GDP per capita of a country given its rate of population growth, urbanization, historical GDP trends, etc.

Classification loss functions deal with discrete values, like the task of classifying an object with a confidence value. For instance, image classification into two labels: cat and dog.

Ranking loss functions predict the relative distances between values. An example would be face verification, where we want to know which face images belong to a particular face. We can do so by ranking faces that do not belong to the original face-holder via their degree of relative approximation to the target face scan.

Now, let’s discuss each of PyTorch’s loss functions in more detail:

  • L1 loss function (Mean Absolute Error Loss)

  • Mean Squared Error Loss

  • Negative Log-Likelihood Loss

  • Cross-Entropy Loss

  • Binary Cross Entropy Loss

  • Hinge Embedding Loss

  • Margin Ranking Loss

  • Triplet Margin Loss

  • Kullback-Leibler divergence

L1 loss function (Mean Absolute Error Loss)

The L1 loss function computes the mean absolute error between each value in the predicted and target tensor. It computes the sum of all the values returned from each absolute difference computation and takes the average of this sum value to obtain the Mean Absolute Error (MAE). 

L1 Loss
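
For reference, with the default reduction='mean', the loss for a prediction tensor x and target tensor y with n elements is:

$$\text{MAE}(x, y) = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - y_i \rvert$$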

When to use?

  • Regression

  • Specifically in those cases where target variables contain outliers. 

  • It is robust for handling noise.

Syntax 

torch.nn.L1Loss

PyTorch Code Implementation

import torch
import torch.nn as nn

# reduction specifies how to reduce the element-wise losses:
# 'mean' (default) averages the output, 'sum' sums it, and 'none' applies no reduction
loss_fn = nn.L1Loss(reduction='mean')
input = torch.randn(4, 5, requires_grad=True)
target = torch.randn(4, 5)
output = loss_fn(input, target)
print(output)  # tensor(0.8467, grad_fn=<...>)

Smooth L1 Loss

The smooth L1 loss function combines the benefits of MSE loss and MAE loss through a heuristic value beta. It uses a squared term if the absolute error falls below one and an absolute term otherwise. It is less sensitive to outliers than the mean square error loss and, in some cases, prevents exploding gradients. 

In mean square error loss, we square the difference, which can result in a value much larger than the original. These large values can cause exploding gradients. Smooth L1 avoids this: for differences greater than 1 (the default beta), the error is not squared.

Smooth L1 Loss
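
For reference, with threshold beta (default beta = 1), the element-wise term is:

$$l_i = \begin{cases} 0.5\,(x_i - y_i)^2 / \beta & \text{if } \lvert x_i - y_i \rvert < \beta \\ \lvert x_i - y_i \rvert - 0.5\,\beta & \text{otherwise} \end{cases}$$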

When to use?

  • Regression.

  • When the features have large values.

  • Well suited for most problems.

Syntax 

torch.nn.SmoothL1Loss

PyTorch Code Implementation

import torch
import torch.nn as nn

loss = nn.SmoothL1Loss()
input = torch.randn(4, 5, requires_grad=True)
target = torch.randn(4, 5)
output = loss(input, target)
print(output)  # tensor(1.4590, grad_fn=<...>)

Mean Squared Error Loss (MSE)

The Mean Square Error shares similarities with the Mean Absolute Error. It computes the squared difference between values in the prediction tensor and those in the target tensor.

By doing so, relatively significant differences are penalized more, while relatively minor differences are penalized less. MSE is considered less robust at handling outliers and noise than MAE.

MSE Loss
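
For reference, with the default reduction='mean':

$$\text{MSE}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2$$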

When to use it?

  • Regression problems.

  • The numerical values of the features are not large.

  • The problem is not very high-dimensional.

Syntax

torch.nn.MSELoss

PyTorch Code Implementation

import torch.nn as nn
import torch

# The reduction parameters explained for the L1 loss apply here as well.
loss = nn.MSELoss(reduction='mean')
input = torch.randn(4, 5, requires_grad=True)
target = torch.randn(4, 5)
output = loss(input, target)
print(output)  # tensor(1.4590, grad_fn=<...>)

Cross-Entropy Loss

Cross-entropy as a loss function is used to learn the probability distribution of the data. While other loss functions like squared loss penalize wrong predictions, cross-entropy gives a more significant penalty when incorrect predictions are predicted with high confidence. 

What differentiates it from negative log-likelihood loss in PyTorch is mainly the interface: nn.CrossEntropyLoss combines LogSoftmax and NLLLoss in a single class, so it expects raw, unnormalized logits, whereas nn.NLLLoss expects log-probabilities as its input.

Cross Entropy Loss
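
For reference, in its basic unweighted form, for a logits vector x and a target class index c, nn.CrossEntropyLoss computes:

$$\ell(x, c) = -\log\frac{\exp(x_c)}{\sum_{j}\exp(x_j)} = -x_c + \log\sum_{j}\exp(x_j)$$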

When to use?

  • Classification tasks

  • Making a confident model, i.e., a model that not only predicts accurately but also does so with high probability.

  • For higher precision/recall values.

Syntax

torch.nn.CrossEntropyLoss

PyTorch Code Implementation

import torch
import torch.nn as nn
loss = nn.CrossEntropyLoss()
input = torch.randn(4, 5, requires_grad=True)
target = torch.empty(4, dtype=torch.long).random_(5)
# The reduction parameters explained for the L1 loss apply here as well.
output = loss(input, target)
output.backward()
print(output)  # tensor(1.6370, grad_fn=<...>)

Negative Log-Likelihood Loss

It maximizes the overall probability of the data. It penalizes the model when it predicts the correct class with smaller probabilities and rewards it when the prediction is made with a higher probability. The logarithm does the penalizing part here.

The smaller the predicted probability of the correct class, the larger the corresponding negative log value, and hence the higher the loss. The negative sign is used because the probabilities lie in the range [0, 1], so their logarithms are negative; negating them makes the loss value positive.

NLL Loss
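
For reference, given a vector of log-probabilities x (e.g. the output of LogSoftmax) and a target class index c, the per-sample loss is simply:

$$\ell(x, c) = -x_c$$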

When to use?

  • Classification.

  • Smaller, quicker training runs.

  • Simple tasks.

Syntax

torch.nn.NLLLoss

PyTorch Code Implementation

import torch
import torch.nn as nn
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 4])
output = loss(m(input), target)
output.backward()
# 2D loss example (used, for example, with image inputs)
N, C = 5, 4
loss = nn.NLLLoss()
# input is of size N x C x height x width
data = torch.randn(N, 16, 10, 10)
conv = nn.Conv2d(16, C, (3, 3))
m = nn.LogSoftmax(dim=1)
# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(conv(data)), target)
print(output)  # tensor(1.5017, grad_fn=<...>)

Binary Cross-Entropy Loss

Binary Cross-Entropy loss is a particular class of Cross-Entropy losses used for the unique problem of classifying data points into only two classes. Labels for this type of problem are usually binary, and our goal is to push the model to predict a number close to zero for a zero label and a number close to one for one label.

Usually, when using BCE loss for binary classification problems, the neural network's output is passed through a Sigmoid layer to squash it into the (0, 1) range so that it can be interpreted as a probability.

Binary Cross Entropy Loss
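
For reference, for a predicted probability x in (0, 1) and a binary target y:

$$\ell(x, y) = -\left[\,y \log x + (1 - y)\log(1 - x)\,\right]$$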

When to use?

  • Binary Classification tasks

Syntax

torch.nn.BCELoss

PyTorch Code Implementation

import torch.nn as nn
import torch
m = nn.Sigmoid()
loss = nn.BCELoss()
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(input), target)
print(output)  # tensor(0.4555, grad_fn=<...>)

Binary Cross-Entropy Loss with Logits

It combines a Sigmoid layer and the BCELoss in one single class. By fusing the two operations, it takes advantage of the log-sum-exp trick for numerical stability, making it more numerically stable than a plain Sigmoid followed by a BCELoss.

BCE With Logits

Syntax

torch.nn.BCEWithLogitsLoss

PyTorch Code Implementation

import torch
import torch.nn as nn

target = torch.ones([10, 64], dtype=torch.float32)  # 64 classes, batch size = 10
output = torch.full([10, 64], 1.5)  # A prediction (logit)
pos_weight = torch.ones([64])  # All weights are equal to 1
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(output, target)  # -log(sigmoid(1.5))
print(loss) #tensor(0.2014)

Hinge Embedding Loss

Hinge Embedding Loss measures the loss given an input tensor x and a labels tensor y containing values 1 or -1. It is used for measuring whether two inputs are similar or dissimilar.

Hinge Embedding Loss
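
For reference, with the default margin Δ = 1, the element-wise term is:

$$l_i = \begin{cases} x_i & \text{if } y_i = 1 \\ \max\{0,\ \Delta - x_i\} & \text{if } y_i = -1 \end{cases}$$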

When to use?

  • Learning nonlinear embeddings

  • Semi-supervised learning

  • Where the similarity or dissimilarity of two inputs is to be measured.

Syntax

torch.nn.HingeEmbeddingLoss

PyTorch Code Implementation

import torch
import torch.nn as nn

input = torch.randn(4, 5, requires_grad=True)
# Note: in practice, the target tensor should contain only 1 and -1 values.
target = torch.randn(4, 5)
loss = nn.HingeEmbeddingLoss()
output = loss(input, target)
output.backward()
print('input: ', input)
print('target: ', target)
print('output: ', output)

"""
input:  tensor([[ 1.1145, -0.4745, -0.1327,  0.5509, -0.3407],
        [-1.3486,  0.8496, -0.0060, -1.2351,  0.5980],
        [ 0.3332, -2.5689, -0.3113,  0.9949,  0.7497],
        [ 1.0508, -0.2495, -0.4236, -2.1967, -0.7984]], requires_grad=True)
target:  tensor([[ 0.6027,  1.0252,  1.7027,  1.0782,  0.1277],
        [-0.7123, -0.7314, -0.0545,  0.5189,  0.5584],
        [ 0.4742,  2.0229,  0.7183, -0.6529, -1.2161],
        [-1.7511, -0.1510, -0.4265,  0.5902, -0.4493]])
output:  tensor(1.0083, grad_fn=<...>)
"""

Margin Ranking Loss

Margin Ranking loss belongs to the ranking losses whose main objective, unlike other loss functions, is to measure the relative distance between a set of inputs in a dataset. 

The Margin Ranking loss function takes two inputs and a label containing only 1 or -1.

If the label is 1, then it is assumed that the first input should have a higher ranking than the second input, and if the label is -1, it is assumed that the second input should have a higher ranking than the first input.

Margin Ranking Loss
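
For reference, for inputs x1 and x2 and a label y in {1, -1} (margin defaults to 0):

$$\ell(x_1, x_2, y) = \max\{0,\ -y\,(x_1 - x_2) + \text{margin}\}$$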

When to use?

  • GANs.

  • Ranking tasks.

Syntax

torch.nn.MarginRankingLoss

PyTorch Code Implementation

import torch.nn as nn
import torch
loss = nn.MarginRankingLoss()
input1 = torch.randn(4, requires_grad=True)
input2 = torch.randn(4, requires_grad=True)
target = torch.randn(4).sign()
output = loss(input1, input2, target)
print('input1: ', input1)
print('input2: ', input2)
print('output: ', output)

#input1:  tensor([-1.6942,  0.7510, -1.8400,  0.3829], requires_grad=True)
#input2:  tensor([-0.0612, -0.4592,  0.3321,  0.6844], requires_grad=True)
#output:  tensor(0.3026, grad_fn=<...>)

Triplet Margin Loss

The Triplet loss function is widely used to evaluate the similarity of inputs. It uses a triplet generated from the training data, comprising three sample points: an anchor, a positive, and a negative. The anchor and positive samples belong to the same class but are different data points, whereas the negative sample belongs to a different class.

The objective is to learn to minimize the distance between the anchor and positive data points and to maximize the distance between the anchor and negative data points by at least a margin. You can think of it as learning a separation boundary (margin) between each class cluster.

Triplet Margin Loss Function
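
For reference, for an anchor a, positive p, and negative n, with d the pairwise distance (p-norm, p = 2 by default):

$$\ell(a, p, n) = \max\{\,d(a, p) - d(a, n) + \text{margin},\ 0\,\}$$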

When to use?

  • In ranking tasks like face matching, search retrieval, etc.

  •  Embedding learning

Syntax

torch.nn.TripletMarginLoss

PyTorch Code Implementation

import torch.nn as nn
import torch
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(200, 128, requires_grad=True)
positive = torch.randn(200, 128, requires_grad=True)
negative = torch.randn(200, 128, requires_grad=True)
output = triplet_loss(anchor, positive, negative)
print(output)  # tensor(1.2055, grad_fn=<...>)

Cosine Embedding loss

The criterion measures similarity by computing the cosine distance between two data points in space. The cosine distance corresponds to the angle between the two points, which means that the smaller the angle, the closer the inputs and the more similar they are.

Cosine Embedding Loss
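
For reference, for inputs x1 and x2 and a label y (margin defaults to 0):

$$\ell(x_1, x_2, y) = \begin{cases} 1 - \cos(x_1, x_2) & \text{if } y = 1 \\ \max\{0,\ \cos(x_1, x_2) - \text{margin}\} & \text{if } y = -1 \end{cases}$$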

When to use?

  • Learning nonlinear embeddings

  • Semi-supervised learning

  • Where the similarity or dissimilarity of two inputs is to be measured.

Syntax

torch.nn.CosineEmbeddingLoss

PyTorch Code Implementation

import torch.nn as nn
import torch
 
loss = nn.CosineEmbeddingLoss()
input1 = torch.randn(3, 6, requires_grad=True)
input2 = torch.randn(3, 6, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
print('input1: ', input1)
print('input2: ', input2)
print('output: ', output)
 
"""
input1:  tensor([[ 0.9912, -0.9039,  0.8332,  0.5011,  0.3535,  0.4681],
       [-0.3198, -0.4781, -0.7364,  1.3372, -0.8918, -0.2378],
       [ 0.3939,  1.2550,  2.1777, -0.1647, -0.6127, -0.8228]],
      requires_grad=True)
input2:  tensor([[-0.0759, -0.7073, -0.1205,  0.8391, -0.8668, -1.3887],
       [-1.5071, -0.9096, -0.5043, -1.0851,  1.5451,  0.0149],
       [ 0.8178, -0.0502, -0.2965, -0.4258, -0.0321, -0.6708]],
      requires_grad=True)
output:  tensor(0.6474, grad_fn=<...>)
"""

Kullback-Leibler divergence

KL divergence measures how two probability distributions are different from each other.

Given two distributions, P and Q, Kullback Leibler Divergence (KLD) loss measures how much information is lost when P (assumed to be the true distribution) is replaced with Q. 

By measuring how much information is lost when we use Q to approximate P, we can obtain the similarity between P and Q and drive our algorithm to produce a distribution very close to the true distribution, P. 

The information loss when Q is used to approximate P is not the same as when P is used to approximate Q; thus, KL Divergence is not symmetric.

KLD Loss
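
For reference, for a true distribution P and an approximating distribution Q:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

Note that nn.KLDivLoss expects the input to be log-probabilities (log Q) and the target to be probabilities (P) when log_target=False.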

When to use?

  • Classification

  • For plain classification, the same result can be achieved with cross-entropy at a lower computational cost, so cross-entropy is usually preferred there.

Syntax

torch.nn.KLDivLoss

PyTorch Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

# input should contain log-probabilities and target should contain probabilities;
# reduction='batchmean' matches the mathematical definition of KL divergence
loss = nn.KLDivLoss(reduction='batchmean', log_target=False)
input = F.log_softmax(torch.randn(4, 6, requires_grad=True), dim=1)
target = F.softmax(torch.randn(4, 6), dim=1)
output = loss(input, target)

print('output: ', output)  # a small non-negative scalar tensor

How to define a custom PyTorch loss function

Writing a custom PyTorch loss function is simple.

The standard way of doing it is to write a class definition per loss function, subclassing nn.Module. The class will have two main methods.

__init__

  • The __init__ method defines the member variables required for the loss function.

  • In an embedding loss, for example, a margin variable defines the required separation between positive and negative samples in high-dimensional space.

forward

  • The forward method runs the loss function over the predictions and targets and returns a scalar value that is used by backpropagation.

  • In an embedding loss, for example, the distance between the two samples is calculated first (e.g., as a sum of squared errors), and the margin is then used to enforce a minimum separation, as in the sketch below.
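
As an illustration, here is a minimal sketch of such a class: a hypothetical contrastive-style embedding loss where __init__ stores the margin and forward computes a distance and applies it (the class name and exact formula are illustrative, not a PyTorch built-in):

import torch
import torch.nn as nn

class EmbeddingMarginLoss(nn.Module):
    # Pulls similar pairs together and pushes dissimilar pairs at least `margin` apart.
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin  # minimum separation between dissimilar samples

    def forward(self, emb1, emb2, label):
        # label = 1 for similar pairs, 0 for dissimilar pairs
        distance = torch.sum((emb1 - emb2) ** 2, dim=1)  # squared-error distance per pair
        positive_term = label * distance
        negative_term = (1 - label) * torch.clamp(self.margin - distance, min=0)
        return (positive_term + negative_term).mean()  # scalar for backpropagation

loss_fn = EmbeddingMarginLoss(margin=1.0)
emb1 = torch.randn(8, 16, requires_grad=True)
emb2 = torch.randn(8, 16, requires_grad=True)
label = torch.randint(0, 2, (8,)).float()
output = loss_fn(emb1, emb2, label)
output.backward()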

Here’s a link to the Kaggle discussion thread that is a great resource to learn more about the right way of defining loss functions.

Pro tip: Learn more by reading The Essential Guide to Neural Network Architectures.

How to monitor PyTorch loss functions

Once you have done the hard work of writing your loss function and training code for the neural network, monitoring the loss values is essential.

Let’s see how to do it!

Let’s take the example of the FashionMNIST classification task. The objective of the task is to classify images into ten clothing classes.

FashionMNIST dataset samples

A neural network training pipeline consists mainly of the following low-level components:

  • Dataset Class to load, filter and visualize data.

  • Network Architecture Class to define the network.

  • Training function to train the model for every epoch using loss values.

  • Loss validation and visualization function to monitor the loss values.

  • Inference function to test the trained model.

Pro Tip: Read this introductory Guide to Image Classification and start building your own classifiers with V7.

Please refer to the Google Colab link to train and test the Fashion MNIST dataset on your own.

Let’s go through what losses we need to monitor while training and how we can visualize them to track the training progress.

Training a model requires the calculation of 2 types of losses:

  • train_loss - The training loss value indicates how much the model has learned from the training data. Ideally, the training loss should decrease with every iteration.

  • validation_loss - The validation loss value indicates the model’s performance on data it hasn’t seen before. It shows whether the model is overfitting or underfitting the training data.

Pro tip: Want to train reliable models? Check out our guide to Overfitting vs. Underfitting: What's the Difference?

I am using the code snippet from the Google Colab link.

When you plot the train_losses and valid_losses values on a graph, you should expect an exponential decrease over the number of epochs.

Ideally, the training loss should be slightly lower than, but close to, the validation loss, indicating that the model has learned from the training data while still generalizing to unseen data. That way, when the model is used in production, it does not fail under real-world constraints.
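
As a rough sketch, assuming your training loop appends one value per epoch to train_losses and valid_losses lists (as in the Colab notebook), you can plot them like this:

import matplotlib.pyplot as plt

# train_losses and valid_losses are assumed to be lists with one loss value per epoch
epochs = range(1, len(train_losses) + 1)
plt.plot(epochs, train_losses, label='train loss')
plt.plot(epochs, valid_losses, label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()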

Pro Tip: Read this article for more insight on Train-Validation-Test sets.

PyTorch Loss Functions: Summary

  • PyTorch has predefined loss functions that you can use to train almost any neural network architecture.

  • The loss function guides the model training to convergence.

  • Choosing the correct loss function is crucial to the model performance. 

  • Loss values should be monitored visually to track the model learning progress.

Next steps

Label videos with V7.

Rewind less, achieve more.

Try our free tier or talk to one of our experts.
