20+ Open Source Computer Vision Datasets

Aug 4, 2021

What is the best place to find computer vision datasets? Check out this list of 20+ curated image and video datasets and start annotating data and training your models today.

Alberto Rizzoli

Co-founder & CEO

AI is driven by data—not code.

This bold statement could have sounded outlandish a few years back, but not anymore. However—

There is still one problem.

Quality training data can be really hard to access. It might take you days or weeks to find a suitable dataset for your computer vision tasks.

But, worry not.

In this article, we've put together a comprehensive list of quality computer vision datasets that you can access for free.

Have a look.

COVID-19 X-Ray Dataset (V7)

COVID-19 X-Ray Dataset is V7’s original dataset containing 6500 images of AP/PA chest X-Rays with pixel-level polygonal lung segmentations. There are 517 cases of COVID-19 amongst these. 

Each image contains:

  • Two "Lung" segmentation masks

  • A tag for the type of pneumonia (viral, bacterial, fungal, healthy/none)

  • If the patient has COVID-19, additional tags stating age, sex, temperature, location, intubation status, ICU admission, and patient outcome.

Lung annotations are polygons following pixel-level boundaries. You can export them in COCO, VOC, or Darwin JSON formats. Each annotation file contains a URL to the original full resolution image and a reduced size thumbnail.
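
As a rough illustration of working with a COCO-style export, the sketch below groups polygon annotations by image and computes each polygon's area with the shoelace formula. The sample record is made up, and the field names (`images`, `categories`, `annotations`, `segmentation` as flat `[x1, y1, x2, y2, ...]` lists) follow the general COCO convention; an actual export may carry additional fields.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written sample in the general COCO layout.
SAMPLE = json.loads("""
{
  "images": [{"id": 1, "file_name": "xray_0001.jpg", "width": 100, "height": 100}],
  "categories": [{"id": 1, "name": "Lung"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "segmentation": [[10, 10, 40, 10, 40, 50, 10, 50]]},
    {"id": 11, "image_id": 1, "category_id": 1,
     "segmentation": [[60, 10, 90, 10, 90, 50, 60, 50]]}
  ]
}
""")


def polygon_area(flat_points):
    """Shoelace area of a polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_points[0::2], flat_points[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def masks_per_image(coco):
    """Map each file name to a list of (category name, polygon area) pairs."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {img["id"]: img["file_name"] for img in coco["images"]}
    result = defaultdict(list)
    for ann in coco["annotations"]:
        for polygon in ann["segmentation"]:
            result[files[ann["image_id"]]].append(
                (names[ann["category_id"]], polygon_area(polygon))
            )
    return dict(result)


lung_masks = masks_per_image(SAMPLE)
print(lung_masks)
```

Each image in the sample ends up with two "Lung" entries, matching the two masks per image described above.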

For more details, check out: COVID-19 X-Ray dataset (GitHub)

CIFAR-10 & CIFAR-100

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

CIFAR-10 contains 60,000 32x32 color images in 10 classes (animals and real-life objects), with 6,000 images per class. The dataset has 50,000 training images and 10,000 test images. The classes are mutually exclusive, with no overlap between them.

CIFAR-100 consists of 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. 
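
The split figures quoted for both datasets can be sanity-checked with a few lines of arithmetic. The `DatasetSpec` helper below is hypothetical and written only for this check; the balanced per-class split for CIFAR-10 is an assumption derived from the totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Per-class image counts as quoted in the text (helper name is hypothetical)."""
    name: str
    num_classes: int
    train_per_class: int
    test_per_class: int

    @property
    def train_total(self) -> int:
        return self.num_classes * self.train_per_class

    @property
    def test_total(self) -> int:
        return self.num_classes * self.test_per_class

    @property
    def total(self) -> int:
        return self.train_total + self.test_total


# CIFAR-10: 6,000 images per class and 50,000 train / 10,000 test overall,
# i.e. 5,000 train and 1,000 test per class if the split is balanced (assumed).
cifar10 = DatasetSpec("CIFAR-10", 10, 5000, 1000)
# CIFAR-100: 500 training and 100 testing images per class.
cifar100 = DatasetSpec("CIFAR-100", 100, 500, 100)

for spec in (cifar10, cifar100):
    print(spec.name, spec.train_total, spec.test_total, spec.total)
```

Under these figures both datasets come to 60,000 images, split 50,000 / 10,000.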

ImageNet

ImageNet is one of the most popular image databases with more than 14 million hand-annotated images.

The database is organized according to the WordNet hierarchy (currently only the nouns), in which each node is depicted by hundreds or thousands of images. Object-level annotations provide a bounding box around the visible part of the indicated object.

Kinetics-700

Kinetics-700 is a large video dataset consisting of 650,000 clips covering 700 human action classes. 

The videos include human-object interactions, such as playing instruments, and human-human interactions, such as hugging. Each action class has at least 600 video clips. Each clip lasts about 10 seconds and is annotated with a single action class.

MNIST

MNIST is a large database of handwritten single digits containing 60,000 training images and 10,000 testing images. 

It was released in 1999 and is used for classification tasks.

LSUN 

LSUN (Large-scale Scene Understanding) contains close to one million labeled images for each of 10 scene categories and 20 object categories.

For training data, each category contains between roughly 120,000 and 3,000,000 images. The validation data includes 300 images per category, and the test data has 1,000 images per category.

Pro tip: Check out The Train, Validation, and Test Sets: How to Split Your Machine Learning Data to learn more.

IMDB-Wiki

IMDB-Wiki is one of the largest publicly available datasets of human faces labeled with gender, age, and name.

It contains 523,051 images in total, with 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia.

MS COCO 

The MS COCO (Microsoft Common Objects in Context) dataset consists of 328K images. It contains annotations for object detection, keypoint detection, panoptic segmentation, stuff segmentation, captioning, and dense human pose estimation.

Labeled Faces in the Wild

Labeled Faces in the Wild is a database of more than 13,000 face photographs designed for facial recognition tasks. Each face has been labeled with the person’s name.

Cityscapes

Cityscapes is a database containing a diverse set of stereo video sequences recorded in street scenes from 50 different cities. The images were captured over time in various light conditions and weather. 

The Cityscapes dataset includes semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. It provides fine pixel-level annotations for 5,000 frames and coarse annotations for a further 20,000 frames.

LabelMe-12-50k

LabelMe-12-50k is a dataset that contains 50,000 JPEG images (40,000 for training and 10,000 for testing) with 12 classes. The images are extracted from LabelMe.

Classes include objects such as a car, a person, a tree, or a keyboard. 50% of the images in the training and testing set show a centered object, while the remaining 50% show a randomly selected region of a randomly selected image ("clutter").

This dataset can be used for object recognition.

Places

The Places dataset consists of 2.5 million images, each with a category label, across 205 scene categories. There are more than 5,000 images per category. It is used to train CNNs for scene recognition tasks.

Places2 (365-Standard)

Places2 (365-Standard) is another dataset contributed by MIT. It has 1.8 million images from 365 scene categories. The validation set contains 50 images per category and the testing set contains 900 images per category. Places2 can be used for scene recognition and for learning generic deep scene features for visual recognition.

VisualGenome

VisualGenome is a large dataset and knowledge base of 108,077 images annotated with objects, attributes, and the relationships between them.

Stanford Dogs 

The Stanford Dogs dataset was built using images and annotations (class labels, bounding boxes) from ImageNet. It contains 20,580 images of 120 dog breeds from around the world, with one category per breed.

Stanford Cars 

The Stanford Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class split roughly 50-50.

You have to download the images and their class labels and bounding boxes separately.

Cat Dataset 

The CAT dataset includes over 9,000 cat images with annotated facial features. Each image has the cat’s head annotated with nine points: two for the eyes, one for the mouth, and six for the ears.
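
The nine-point layout can be turned into named landmarks with a small parser. This is a minimal sketch that assumes each annotation is a single line of the form `N x1 y1 ... xN yN`; the point order, the names in `POINT_NAMES`, and the sample line are assumptions for illustration, so check them against the files you download.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]

# Assumed point order; the text only states two eye points, one mouth point,
# and six ear points.
POINT_NAMES = [
    "left_eye", "right_eye", "mouth",
    "left_ear_1", "left_ear_2", "left_ear_3",
    "right_ear_1", "right_ear_2", "right_ear_3",
]


def parse_landmarks(line: str) -> Dict[str, Point]:
    """Parse 'N x1 y1 ... xN yN' into named (x, y) points."""
    values: List[int] = [int(token) for token in line.split()]
    count, coords = values[0], values[1:]
    if count != len(POINT_NAMES) or len(coords) != 2 * count:
        raise ValueError(f"expected {len(POINT_NAMES)} points, got {count}")
    pairs = zip(coords[0::2], coords[1::2])
    return dict(zip(POINT_NAMES, pairs))


def eye_distance(points: Dict[str, Point]) -> float:
    """Euclidean distance between the eyes, a common normalisation factor."""
    (x1, y1), (x2, y2) = points["left_eye"], points["right_eye"]
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5


# Made-up annotation line, not taken from the dataset.
sample = "9 100 100 160 100 130 150 80 60 70 20 100 40 160 40 190 20 180 60"
landmarks = parse_landmarks(sample)
print(landmarks["mouth"], eye_distance(landmarks))
```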

CelebFaces 

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations and five landmark locations. The images cover 10,177 unique identities.

The dataset can be used as training and test sets for face detection, face attribute recognition, and landmark (or facial part) localization.

Face Mask Detection

The Face Mask Detection dataset contains 853 images belonging to 3 classes, along with their bounding boxes in the PASCAL VOC format. The classes are “with mask”, “without mask”, and “mask worn incorrectly”.
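
Bounding boxes in PASCAL VOC format are stored as one XML file per image, which the standard library can read. The sketch below parses a hand-written sample; the file name and the label strings (`with_mask`, `without_mask`) are assumptions and may not match the dataset's actual spelling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the usual PASCAL VOC layout; the file name and the
# class label strings are assumptions, not copied from the dataset.
SAMPLE_XML = """
<annotation>
  <filename>example.png</filename>
  <size><width>400</width><height>300</height><depth>3</depth></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>50</xmin><ymin>40</ymin><xmax>110</xmax><ymax>120</ymax></bndbox>
  </object>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>200</xmin><ymin>60</ymin><xmax>250</xmax><ymax>130</ymax></bndbox>
  </object>
</annotation>
""".strip()


def parse_voc(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


boxes = parse_voc(SAMPLE_XML)
class_counts = Counter(label for label, _ in boxes)
print(boxes)
print(class_counts)
```

Counting labels this way across all annotation files is a quick check for class imbalance before training.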

Fire and Smoke Dataset

Fire and Smoke is a dataset with more than 7000 unique images in HD resolution. 

It consists of early fire and smoke images captured with mobile phones in real-world scenarios, under a wide variety of lighting and weather conditions. This dataset can be used for fire and smoke recognition, detection, and anomaly detection.

It also contains various domestic scenes, including garbage and field crop burning, as well as domestic cooking, etc.

FloodNet Dataset

The FloodNet dataset consists of high-resolution UAS imagery with detailed semantic annotation of the damage caused by hurricanes.

The data was collected with small UAS platforms (DJI Mavic Pro quadcopters) after Hurricane Harvey. The whole dataset has 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets.
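
A 60/20/20 split like this one can be reproduced on your own data with a seeded shuffle. The sketch below is one simple way to do it, not the procedure the FloodNet authors used; the function name, the seed, and the integer IDs standing in for image files are all illustrative.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle deterministically and cut into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Stand-in IDs for the 2,343 images; real file names would go here.
image_ids = list(range(2343))
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

With 2,343 items this yields 1,405 training, 468 validation, and 470 test images, since the rounding remainder goes to the test set.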

P.S. The FloodNet dataset was annotated using V7.

Next steps

Label videos with V7.

Rewind less, achieve more.

Try our free tier or talk to one of our experts.
