Video Classification: Methods, Use Cases, Tutorial

Mar 14, 2023

What are the different techniques used for video classification? What are its greatest challenges and how to overcome them? And how to build a video classifier? Let's find out.

Rohit Kundu

With the rapid growth of video content on the internet, video classification has become an important task in computer vision. From identifying a person's actions in a surveillance video to understanding a video's content for retrieval and recommendation, video classification has a wide range of applications across many industries.

In this article, we will explore the different techniques used for video classification. We will also discuss the challenges that come with video classification and how to overcome them. Whether you are a computer vision researcher, a machine learning engineer, or just someone interested in the topic, this article will come in handy.

In this article, we’ll cover:

  • What is video classification?

  • Video classification datasets

  • Video classification methods

  • Video classification use cases

  • How to build a video classifier with V7?

Mockups of videos annotated on V7 Darwin platform

Video annotation

Complete video labeling projects 10x faster

Tackle any video format frame by frame. Use AI models to track motion and follow objects appearing in and out-of-view.

What is video classification?

Video classification is a rapidly growing field in computer vision and machine learning. Much like image classification, video classification involves sorting videos into respective categories, such as actions being performed (e.g., dancing, walking, running) or behavioral emotions (e.g., cheerful, sad, surprised).

Video classification overview

However, while images are classified based on the spatial content (e.g., a picture of a person vs. a picture of a dog), videos need to be classified based on both their spatial and temporal (time-domain) content—since two videos can contain the same person (same spatial information) performing different actions (difference in temporal content).
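
In practice, this difference shows up directly in the input a model consumes. Here is a minimal sketch (using PyTorch, which is just one possible framework choice) contrasting the input an image classifier sees with the input a video classifier sees: the video adds a time dimension, which 3D convolutions can exploit to capture motion.

```python
import torch
import torch.nn as nn

# An image classifier sees one frame: (batch, channels, height, width)
image_batch = torch.randn(8, 3, 224, 224)

# A video classifier sees a stack of frames: (batch, channels, time, height, width)
video_batch = torch.randn(8, 3, 16, 224, 224)  # 16-frame clips

# 2D convolutions capture only spatial patterns within a single frame
spatial_conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# 3D convolutions also slide over time, so they can pick up motion patterns
spatiotemporal_conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

print(spatial_conv(image_batch).shape)         # torch.Size([8, 64, 224, 224])
print(spatiotemporal_conv(video_batch).shape)  # torch.Size([8, 64, 16, 224, 224])
```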

One of the key challenges in video classification is the sheer volume of data that must be processed. A video typically consists of a large number of frames, each containing a wealth of information. To make matters even more complicated, videos can also be shot in different lighting conditions, from different angles, and at different frame rates. Here are the most common video classification challenges:

  • Scalability: As the volume of video data continues to grow, it is becoming increasingly challenging to process and classify all of it in a reasonable amount of time. This is particularly problematic for deep learning approaches, which require large amounts of data and computational resources to train.

  • Generalization: Many video classification algorithms are designed to work on a specific dataset or task but may not generalize well to other datasets or tasks.

  • Video annotation: Labeling large amounts of video data is a time-consuming and labor-intensive task. This can make it difficult to obtain large, high-quality datasets for training and evaluating video classification algorithms—especially supervised learning models.

  • Privacy and security: The use of video data raises a number of privacy and security concerns, particularly when it comes to surveillance and facial recognition. There are also concerns about the potential misuse of video data, such as in the case of DeepFake videos.

  • Video quality: Videos can come in various qualities and formats, which can be difficult to handle and may adversely affect the performance of the video classifier.

To overcome these challenges, a variety of techniques and algorithms have been developed for video classification, most of which use deep learning for automation.

Video Classification Datasets

Open-sourced video classification datasets provide researchers with labeled video data that can be used to train and evaluate video classification models. Several open datasets exist for video classification, which helps researchers standardize their models and compare them against the existing state-of-the-art.

Some of the most widely used datasets for video classification are:

  • UCF101: One of the most widely used video classification datasets, UCF101 consists of 13,320 videos from 101 different action classes, such as walking, jogging, and playing soccer. The dataset is commonly used for evaluating the performance of video classification algorithms across a wide range of action recognition tasks. UCF101 also has several smaller subsets that contain fewer or more specific action classes, such as UCF50, UCF11 (YouTube Action), and UCF Aerial Action.

  • HMDB51: This dataset contains 6849 videos from 51 different action classes. This dataset is similar to UCF101, but it has a smaller number of classes and videos. Therefore, it’s more suitable for less complex models (such as fine-tuning student models in Knowledge Distillation pipelines).

  • Kinetics: Kinetics is another popular video classification dataset consisting of over 400,000 videos from 600 human action classes. The videos in the Kinetics dataset are taken from YouTube and other sources and are labeled by human annotators.

  • YouTube-8M: YouTube-8M is another large-scale dataset that includes 8 million YouTube video URLs and associated labels from a vocabulary of 4716 classes. Unlike the previous ones, this dataset is not specific to action recognition but encompasses a wide range of video classification tasks.

  • Sports-1M: This sports action recognition dataset contains 1 million videos from 487 classes of sports, such as basketball, soccer, and ice hockey.

In addition to the datasets that are publicly available, researchers also create their own datasets for specific research projects. This can be especially useful when the available datasets do not have the necessary classes or enough samples for a particular research problem.

It is important to note that the quality and size of the dataset can greatly affect the performance of the video classification algorithms. Datasets that are larger and have more diverse classes tend to result in better performance, as they provide the model with more diverse examples and improve the generalization capabilities of the model.
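
If you want to experiment with one of these benchmarks, torchvision ships a UCF101 dataset wrapper. The sketch below is a rough starting point and assumes you have already downloaded the UCF101 videos and the official train/test split files; the paths shown are placeholders.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import UCF101

# Placeholder paths -- point these at your local copy of the UCF101 videos
# and the official train/test split (annotation) files.
dataset = UCF101(
    root="data/UCF101/videos",
    annotation_path="data/UCF101/splits",
    frames_per_clip=16,     # frames per training clip
    step_between_clips=8,   # stride between consecutive clips
    train=True,
)

def collate_fn(batch):
    # UCF101 yields (video, audio, label); we drop the audio track here
    videos, _, labels = zip(*batch)
    return torch.stack(videos), torch.tensor(labels)

loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
clips, labels = next(iter(loader))
print(clips.shape)  # e.g. torch.Size([4, 16, 240, 320, 3]): frames, height, width, channels
```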

Pro tip: Check out V7’s collection of 500+ open datasets

Video classification methods

Over the years, several different types of deep learning-based algorithms have been developed for video classification. Most of them use supervised learning, a framework that uses videos and their labels to train a neural network (Convolutional Neural Networks or Recurrent Neural Networks). However, recent methods focus on reducing the reliance on labeled data, since labels are difficult and expensive to collect.

Although low-supervision methods sound perfect on paper, they usually come at the price of accuracy. So, the choice between supervised and low-supervision methods depends on the use case. Sensitive applications, such as surveillance and healthcare, value accuracy over computational burden. On the other hand, sports or general-purpose action recognition (such as entertainment) can work with comparatively lower-accuracy models, though such domains still need efficient storage solutions, since they often have large amounts of unlabeled data readily available.

Let us look into both supervised and low supervision methods for video classification next.

Supervised

Supervised learning is a type of machine learning in which a model is trained to make predictions based on labeled data. A set of labeled data is split into train, validation, and test sets for evaluating a model. Then, the trained model is used on unlabeled data to check the model’s performance qualitatively. In the context of video classification, supervised learning can be used to train a model to recognize specific objects, actions, or scenes in a video.

Early attempts at supervised learning-based video classification include this paper, which extensively studied CNN models for the task at a time when CNNs were gaining popularity for image recognition.

However, CNNs require very long training times to effectively optimize their millions of parameters. This difficulty grows when the connectivity of the architecture is extended temporally—the network must process not just one image but several frames of video at a time.

The authors of the paper modified the CNN architectures to contain two separate streams of processing: a “context” stream that learns features on low-resolution frames, and a high-resolution “fovea” stream that only operates on the middle portion of the frame, which yielded a 2-4x boost in runtime performance.

The authors studied several approaches to fusing information across the temporal domain (figure below)—the fusion can be done early in the network by modifying the first layer convolutional filters to extend in time, or it can be done late by placing two separate single-frame networks some distance in time apart and fusing their outputs later in the processing.

Approaches to fusing information over the temporal domain

Source
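
To make the two fusion strategies more concrete, here is a toy PyTorch sketch of early and late fusion. It is not the architecture from the paper, just a minimal illustration of where the temporal information is combined.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Fuse time in the first layer: the filters see all T frames at once."""
    def __init__(self, num_frames=10, num_classes=487):
        super().__init__()
        # Stack frames along the channel axis, so the first conv gets 3 * T channels
        self.features = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        return self.classifier(self.features(clip.reshape(b, t * c, h, w)))

class LateFusion(nn.Module):
    """Run a single-frame network on two frames far apart in time, then fuse."""
    def __init__(self, num_classes=487):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(2 * 64, num_classes)

    def forward(self, clip):                 # clip: (B, T, 3, H, W)
        first, last = clip[:, 0], clip[:, -1]
        fused = torch.cat([self.frame_net(first), self.frame_net(last)], dim=1)
        return self.classifier(fused)

clip = torch.randn(2, 10, 3, 170, 170)
print(EarlyFusion()(clip).shape, LateFusion()(clip).shape)  # (2, 487) each
```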

Additionally, the authors studied Multi-Resolution CNNs (figure below) to speed up CNN training from 5 clips per second to 20 clips per second while retaining performance.

Multi-resolution CNN architecture

Source

Here are some qualitative results obtained by the authors with their model:

Predictions on Sports-1M test data

Source

Semi-supervised

Semi-supervised learning is a machine learning paradigm in which a model is trained on both labeled and unlabeled data. In such cases, the percentage of labeled data in the dataset is usually small (e.g., 5% or 10%), while a large amount of unlabeled data is readily available.

VideoSSL is a framework that attempts video classification using semi-supervised learning. Given a small fraction of the annotated training samples, VideoSSL leverages two supervisory signals extracted from the unlabeled data to enhance classifier performance:

  1. Pseudo-labels of the unlabeled 3D video clips

  2. Appearance cues of objects of interest, distilled by the prediction of a 2D image classifier CNN on a random video frame.

The overview of the framework is shown below.

Framework of the VideoSSL semi-supervised learning approach

Source
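
To give a flavor of the first supervisory signal, here is a generic confidence-thresholded pseudo-labeling sketch in PyTorch. It illustrates the general idea of training on the model's own confident predictions rather than VideoSSL's exact objective.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, unlabeled_clips, threshold=0.95):
    """Train on the model's own confident predictions for unlabeled clips.

    A generic sketch of confidence-based pseudo-labeling, not the exact
    VideoSSL training objective.
    """
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_clips), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = confidence >= threshold           # keep only confident predictions

    logits = model(unlabeled_clips)               # second pass, with gradients
    per_clip_loss = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (per_clip_loss * mask.float()).mean()
```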

Here are a few examples of qualitative results obtained by VideoSSL:

Results obtained by VideoSSL. Red color indicates wrong predictions.

Weakly-supervised

Weakly-supervised video classification is a type of machine learning task that involves training a model to classify videos based on weak labels, i.e., the exact classification labels are not available, but some other helpful information might be. This is in contrast to fully supervised video classification, where the model is trained with a large amount of labeled data, including the class labels of each frame or segment of the video.

An example of such a framework is the UntrimmedNet model that addresses the action recognition problem from untrimmed raw videos. Here, the weak labels are the video-level annotations (say, bowling, running, etc.), and the goal is to perform temporal annotation, i.e., find timestamps between which the actual action occurs in the video.

UntrimmedNet is composed of two components: a classification module and a selection module. UntrimmedNet starts by generating clip proposals that may contain action instances, using uniform or shot-based sampling; these proposals are fed into a deep network for feature extraction. Based on these clip-level representations, the classification module predicts the classification scores for each clip proposal, while the selection module selects or ranks those clip proposals.

The results of the two modules are fused with an attention-weighted sum of products to produce the temporal annotation. The workflow of the UntrimmedNet model is shown below.

Source
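
Here is a minimal sketch of that fusion step in PyTorch, assuming the per-clip scores from the two modules have already been computed. It illustrates the general soft-selection idea rather than UntrimmedNet's exact formulation.

```python
import torch
import torch.nn.functional as F

def video_level_scores(clip_class_scores, clip_selection_scores):
    """Fuse per-clip predictions into a single video-level prediction.

    clip_class_scores:     (num_clips, num_classes) from the classification module
    clip_selection_scores: (num_clips, 1) from the selection module

    Soft selection: softmax the selection scores over the clip proposals and
    use them as attention weights over the per-clip class scores.
    """
    attention = F.softmax(clip_selection_scores, dim=0)    # (num_clips, 1)
    return (attention * clip_class_scores).sum(dim=0)      # (num_classes,)

# Example: 5 clip proposals, 20 action classes
scores = video_level_scores(torch.randn(5, 20), torch.randn(5, 1))
print(scores.shape)  # torch.Size([20])
```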

Self-supervised

Self-supervised learning is an approach to unsupervised machine learning in which the model generates its own supervisory signal from the training data, for example, pseudo-labels. The pseudo-labels predicted with high confidence are used as the ground truth for the next iteration of the training algorithm. In this way, the entire training set ends up labeled without manual annotation, and the trained network is then evaluated on a test set.

An example of this approach is the Self-Supervised Video Transformer (SVT) framework that uses a Vision Transformer (ViT) in a self-supervised knowledge distillation setting. In short, knowledge distillation involves training a task-specific simpler model (student network) with the aid of the learned semantics by a complex model (teacher network) trained for general-purpose tasks.

SVT trains student and teacher models with a similarity objective that matches the feature representations along the spatial and temporal dimensions by space and time attention mechanisms. The authors achieve this by creating positive spatiotemporal views that differ in spatial sizes and are sampled at different time frames from a single video. The workflow of the SVT framework is shown below.

Source

During training, the teacher video transformer parameters are updated as an exponential moving average of the student video transformer. Both of these networks process different spatiotemporal views of the same video, and SVT’s objective function is designed to predict one view from the other in the feature space. This allows SVT to learn robust features that are invariant to spatiotemporal changes in videos while generating discriminative features across videos.
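
The exponential moving average (EMA) update itself is only a couple of lines. Below is a minimal PyTorch sketch of the rule described above; the momentum value is illustrative rather than SVT's actual schedule.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    """Exponential-moving-average update of the teacher from the student.

    A minimal sketch of the update rule described above; the momentum value
    here is illustrative, not the schedule used by SVT.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```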

Unlike contrastive learning methods, SVT does not require mining negative samples and converges efficiently in only about 20 training epochs while achieving state-of-the-art results (as shown below).

SVT results

Video classification use cases

Video classification has a wide range of applications in various fields, including entertainment, security, surveillance, healthcare, and education. Let’s take a quick look into some of the most common use cases.

Human Activity Recognition

Video classification for human activity recognition involves analyzing videos of people to identify and classify the actions they are performing. This can include activities such as walking, running, jumping, and even more complex activities like playing sports or dancing.

The action transformer model is an example of a framework that addresses the action recognition problem. The authors hypothesized that context information helps boost model performance—for example, detecting that a person is speaking is easier when another person is present in the video.

The action transformer framework consists of distinct “base” and “head” networks. The base network is a 3D CNN for generating features and region proposals for the people present in the video. The head network is a modified transformer architecture to classify the actions of the people of interest using the feature maps from the base network. The schematic overview of the model is shown below.

base network architecture

Source
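
As a rough illustration of the base/head split, the sketch below stubs out the 3D CNN base as precomputed per-region features and uses a small transformer encoder as the head. It is a hypothetical simplification, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Transformer head that attends over a person token plus context tokens.

    The 3D CNN base is stubbed out: we assume one feature vector per person
    proposal and per context region has already been extracted.
    """
    def __init__(self, feature_dim=256, num_classes=80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feature_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, tokens):          # tokens: (B, num_tokens, feature_dim)
        encoded = self.encoder(tokens)
        return self.classifier(encoded[:, 0])  # token 0 is the person of interest

head = ActionHead()
print(head(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 80])
```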

Here are some results obtained by the action transformer model.

Source

Healthcare

Video classification can be used in the healthcare sector to automatically analyze medical videos, such as endoscopy or surgery footage, to detect and classify abnormalities or to aid in diagnosis and treatment planning.

For example, this paper uses 3D CNN architectures for automated interpretation of ultrasound videos of the heart. The authors find that two-stream CNNs work best for the problem, where one stream encodes the spatial information of every frame of the video and the second stream encodes the temporal information.

Source

An example of the two-stream model's output is shown below, where the saliency maps derived from the network streams highlight the features of the video that contribute most to the model's predictions. The spatial stream is influenced by the anatomical borders of the significant structures, while the movement of the pulmonary valves influences the temporal stream.

Source
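
The general two-stream recipe is straightforward to prototype. Below is a toy PyTorch sketch, not the paper's architecture: a spatial stream sees a single RGB frame, a temporal stream sees a stack of frames (or optical-flow fields), and the two predictions are averaged.

```python
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    """Generic two-stream sketch: spatial stream on one RGB frame, temporal
    stream on a stack of frames, predictions averaged at the end."""

    def __init__(self, num_frames=16, num_classes=4):
        super().__init__()
        def make_stream(in_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
            )
        self.spatial_stream = make_stream(in_channels=3)            # one RGB frame
        self.temporal_stream = make_stream(in_channels=num_frames)  # stacked frames

    def forward(self, rgb_frame, frame_stack):
        return (self.spatial_stream(rgb_frame) + self.temporal_stream(frame_stack)) / 2

model = TwoStreamClassifier()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 4])
```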

Autonomous vehicles

A deep learning model trained for video classification can be used to detect and track other vehicles on the road, allowing the autonomous vehicle to make decisions about when to change lanes or merge onto a highway. Similarly, the model can be used to identify pedestrians and other objects in the vehicle's path, allowing the vehicle to take appropriate actions to avoid collisions.

The FCN-LSTM model is an example of a system that combines a Fully Convolutional Network (FCN) for visual feature encoding with a Long Short-Term Memory (LSTM) network for temporal feature encoding.

The FCN-LSTM model is able to jointly train motion prediction and pixel-level supervised tasks. Semantic segmentation is used as a side task following a "privileged" information learning paradigm for an additional performance boost in motion planning. A schematic overview of the model is shown below.

FCN-LSTM model

Source
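
The underlying pattern, a convolutional encoder applied per frame followed by an LSTM over time, can be sketched in a few lines of PyTorch. This is a generic illustration rather than the FCN-LSTM architecture itself.

```python
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    """Convolutional encoder per frame, LSTM over time, linear prediction head."""

    def __init__(self, feature_dim=128, hidden_dim=64, num_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feature_dim),
        )
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        features = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        outputs, _ = self.lstm(features)
        return self.head(outputs[:, -1])         # predict from the last time step

print(ConvLSTMClassifier()(torch.randn(2, 8, 3, 112, 112)).shape)  # torch.Size([2, 4])
```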

Here are some examples of the predictions made by the FCN-LSTM network.

predictions made by the FCN-LSTM network

Source

Embodied AI (robotics)

Video classification can be used to classify objects, scenes, and actions in the robot's environment in real time and to issue navigation instructions accordingly.

For example, this paper uses a Recurrent Neural Network (RNN) architecture and a CNN for feature extraction from video frames for classifying videos and generating natural text commands for robotic manipulation.

The framework splits the video into frames and extracts CNN features from each of them. Next, two RNN layers are used to learn the relationship between the visual features and the output command. However, unlike typical video captioning problems, which describe the output sentence in a natural language form, the authors used a grammar-free form to describe the output command. The schematic workflow of the model is shown below.

rnn and cnn for feature extraction

Source

Here are some examples of output commands obtained by the model given an input video sequence.

output commands

GT= Ground Truth. Source

Next, the authors built a robotic framework that allows their robot, “WALK-MAN,” to perform various manipulation tasks by just “watching” an input video, since the RNN architecture described above is embedded within the robot's system.

The translation module generates an output command sentence for each task presented by an input video. Based on this command, the robot uses its vision system to find relevant objects and plan actions. Two examples of this are shown below.

manipulation tasks performed by walk-man

Source

Video recommendation

Video classification is used to classify the contents of the video and target the appropriate audience based on their interests, demographics, and browsing history—an application that we all recognize from YouTube.

An example of such a video recommendation system is the framework proposed in this paper by Google Research. The authors introduced a content-based video recommendation system that utilizes raw (untrimmed) video and audio data (multimodal learning). They frame the task as video similarity learning and use deep neural networks to learn a compact representation of a video that preserves its semantics from its visual and audio signals.

The authors encode all videos into an embedding space, where similar (recommendable) videos are located close to each other, and show that the learned embeddings generalize beyond simple visual and audio similarity and are able to capture complex semantic relationships.
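
Once every video lives in such an embedding space, the retrieval step reduces to nearest-neighbor search. Here is a small PyTorch sketch of that step; the embeddings are random placeholders standing in for the learned ones.

```python
import torch
import torch.nn.functional as F

def recommend(query_embedding, catalog_embeddings, top_k=5):
    """Return the indices of the top_k most similar videos in the catalog.

    Cosine-similarity nearest neighbors in the learned embedding space.
    """
    query = F.normalize(query_embedding, dim=0)
    catalog = F.normalize(catalog_embeddings, dim=1)
    similarity = catalog @ query                 # (num_videos,)
    return similarity.topk(top_k).indices

catalog = torch.randn(1000, 128)                 # 1,000 videos, 128-dim embeddings
print(recommend(torch.randn(128), catalog))
```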

neural network models for fine-tuning video features

Source

The authors evaluated two types of architectures for the recommendation system, early fusion and late fusion, and obtained their best performance with the late fusion technique.

Here are a few example results:

Video-to-video recommendation with the YouTube-8M dataset

Source

Other applications

Video classification has other far-reaching applications. It is used on an everyday basis in entertainment—for movie or music recommendations in streaming services, automated video editing, or player performance analysis in sports.

Video classification can also be used to analyze educational videos, such as lectures or tutorials, to detect key concepts or generate content summaries. It can likewise automatically detect and classify suspicious or abnormal activities in surveillance footage, such as fighting, loitering, or theft, helping security personnel quickly identify potential threats and respond accordingly.

AI video object tracking example on V7 Darwin platform

Data labeling

Speed up your data labeling 10x with AI

Annotate videos, medical imaging, and documents faster. Use AI-assisted labeling tools, follow motion, and segment objects automatically.

How to Build a Video Classifier with V7

V7 allows you to train custom video classifiers and other computer vision models with ease. You can streamline your processes by managing your data annotation and machine learning pipeline in one place. The platform comes with a web-based interface, and you can test and train your models in the cloud.

Let’s create a classifier that can recognize tennis, soccer, football, and basketball footage.

To build your video classification model alongside the tutorial, sign up for a free V7 account for students and researchers. Verify your academic status and get started today.

Step 1: Collect and Upload Video Clips to the V7 Platform

The first step is to upload your video clips to the V7 platform.

Go to the Datasets tab and add a new dataset. After naming your new dataset, drag and drop your files.

While uploading your videos, you can determine the FPS (frames per second) rate.

video frame rate setting

One of the unique features of V7 is that it allows you to use the original frame rate of your videos for data annotation. However, to keep the data manageable, consider extracting only a few frames per second instead. This is also a great way to improve the variability of your dataset without increasing its size: it is better to use 10 different clips at 3 FPS than to extract 30 FPS from a single video.
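
If you prefer to downsample footage yourself before uploading, here is a small OpenCV sketch that keeps roughly a target number of frames per second. The file name is a placeholder, and V7's upload dialog can handle frame-rate selection for you either way.

```python
import cv2

def extract_frames(video_path, target_fps=3):
    """Sample a video down to roughly `target_fps` frames per second."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(native_fps / target_fps))

    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames

# "soccer_clip.mp4" is a placeholder file name
frames = extract_frames("soccer_clip.mp4", target_fps=3)
print(f"Kept {len(frames)} frames")
```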

Click Continue to proceed.

Step 2: Define a Set of Unique Classes You Want to Recognize

In the next step, we can add our own classes. Make sure to select Tag as your class type.

It is important to use class names that will be easy to remember and understand. In our example, the classes are soccer, football, tennis, and basketball.

Classification tags settings

Tags are image-level annotations, which means that our model will use whole frames as its training data. This is a standard practice for training basic classification models. If we wanted a more advanced model (for object detection or activity recognition), segmentation masks or bounding boxes would be a better choice.  

Continue with the default workflow settings to complete the dataset setup. 

Step 3: Assign the Right Tags to Your Video Clips

Now, you can browse your videos and apply the tags created in the previous step. Pick a video and apply the right tag to it.

Assigning a tag to a video clip

It is important to note that assigning tags this way applies them to the whole clip. However, you can also apply a tag to only a specific portion of the video. To do this, open your clip and use the timeline to adjust the span of your tag label.

Timeline adjustment

Choosing what frames should be tagged is particularly useful if you work with longer footage. For example, if you are using a recording of a whole soccer match, you probably don’t want to use a commercial break as your training data.

After you are done, select all your files and change their status (Stage) to Complete.

Step 4: Train the Video Classification Model Using V7

Once all your videos are tagged, you can go to the Models panel.

training models in v7

Click the Train a Model button, choose Classification as your model type, and select tags that you want to use for training.

video classification model training

Configure your model and schedule it for training. It should take several minutes to an hour, depending on your dataset's complexity.

Step 5: Upload a New Clip and Test Your Model

Open the Models tab again and make sure that your new model is running.

running the model

The last step is to test if the model works correctly.

To do that, go to Workflows and replace the Annotate stage with a new Model stage. Then, connect your model.

You should go from something like this:

setting up workflow

To a workflow design that looks like this:

adding model stage to workflow

Now, you can add new items to your dataset.

sending new items to model

Select the new files and send them to the model stage. The model will classify the videos and assign the correct tags along with confidence scores.

confidence score

Final Words

Video classification is a rapidly growing field in computer vision, with a wide range of applications in various industries, from healthcare to entertainment. The use of deep learning techniques has greatly improved the performance of video classification models, allowing them to achieve high accuracy and robustness.

However, the field still faces several challenges, such as the need for large amounts of labeled data and the difficulty of dealing with temporal information in videos, which is where the low-supervision methods we discussed come into play. The field is expected to keep growing, with new innovations that require less data to train (reducing the computational burden) while delivering results more accurate than humans themselves can provide.

Next steps

Label videos with V7.

Rewind less, achieve more.

Try our free tier or talk to one of our experts.
