Datasets

Curated lists

EEG datasets
A list of public EEG datasets. The list is not exhaustive. If you find something new, or have explored any unfiltered link in depth, please update the repository.

DeepMind open datasets
List of open datasets from DeepMind.

Dataset List
A list of the biggest datasets for machine learning from across the web.

Awesome Public Datasets
A list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses.

50 Best Free Datasets for Machine Learning

Microsoft Research Open Data
Be aware of the license before using any dataset.

OpenDataSoft
List of 2600+ Open Data portals around the world.

Awesome Linguistics Resources for Spanish (2015)
Curated list of Linguistic Resources for doing Spanish NLP & CL.

Review of existing facial expression databases that are often used in social psychology

Figure Eight data for everyone

Images

MAAD-Face: A Massively Annotated Attribute Dataset for Face Images (Dec 2020)
Face dataset with 123.9M attribute annotations from 47 soft-biometric attributes.

WiderPerson (Oct 2019)
A diverse dataset for dense pedestrian detection in the wild: 13,382 images with about 400K annotations covering various kinds of occlusions.

1mFakeFaces other link (Jun 2019)
1 million images of fake faces generated with StyleGAN

Diversity in Faces Dataset (Feb 2019)
Annotations of 1 million human facial images

Flickr-Faces-HQ Dataset (FFHQ) (Feb 2019)
70k face images from Flickr

DeepFashion2 [GitHub] (Jan 2019)
Four tasks: clothes detection, pose estimation, segmentation, and retrieval. 801K clothing items, each with rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks and masks.

HDR+ Burst Photography Dataset (Jan 2019)
3,640 bursts of full-resolution raw images, made up of 28,461 individual images, along with HDR+ intermediate and final results for comparison.

CheXpert (Jan 2019)
Dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients.

Tencent ML (Jan 2019)
OpenImages + ImageNet datasets

FFHQ (Jan 2019)
High-quality faces dataset from Flickr, collected by NVIDIA. 70,000 images at 1024x1024.

NoCaps (Dec 2018)
166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets

DeepLesion (Jul 2018)
32,000 annotated lesions identified on CT images from 4,400 unique patients.

SCUT-FBP5500 Dataset (Jan 2018)
5500 frontal faces with diverse properties (male/female, Asian/Caucasian, ages) and diverse labels (facial landmarks, beauty scores on a 5-point scale, beauty score distribution)

OpenImages (2017)
80M images (Creative Commons license), 4764 trainable classes, 600 trainable classes with bounding boxes

Large-scale Fashion (DeepFashion) Database (2016)
800k fashion images, 50 categories, 1000 descriptive attributes

CelebFaces Attributes Dataset (CelebA) (2015)
200k face images, 5 landmark locations, 40 binary attributes

The Child Affective Facial Expression (CAFE) set (Jul 2014)
Photographs taken of 2- to 8-year-old children posing for 6 emotional facial expressions—sadness, happiness, surprise, anger, disgust, and fear—plus a neutral face.

Text

Code Search Challenge (Sep 2019)
Code annotated by GitHub for evaluating code search.

The Natural Questions Dataset (Jan 2019)
Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing.

OPUS (Dec 2018)
Collection of translated texts from the web

Toxic Comment Classification (Jun 2018)
150k Wikipedia comments with the labels: toxic, severe_toxic, obscene, threat, insult and identity_hate.

General Language Understanding Evaluation (GLUE) (Apr 2018)
Nine sentence- or sentence-pair language understanding tasks built on established existing datasets

Dataset for generating TL;DR (Feb 2018)
3M posts from the Reddit corpus (the content and summary), suitable for abstractive summarization. An average length of 211 words for content, and 25 words for the summary.

SentEval (Dec 2017)
Evaluation toolkit for sentence embeddings

Full Reddit Submission Corpus now available (2006 thru August 2015) (2016)
200 million submission objects with score data, author, title, self_text, media tags and all other attributes available via the Reddit API

Reddit comments (Jul 2015)
1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API

The Stanford Sentiment Treebank (2013)
sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences

European Parliament Proceedings Parallel Corpus 1996-2011 (2005)
Translations of millions of sentences across 21 European languages.

Audio

XLSR, speech to vec (Dec 2020)
MLS: Multilingual LibriSpeech (8 languages, 50.7k hours), CommonVoice (36 languages, 3.6k hours), Babel (17 languages, 1.7k hours)

ToyADMOS (Aug 2019)
540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds collected with four microphones at a 48kHz sampling rate

Million Song dataset (Jan 2011)

Video

FaceForensics (Sep 2019)
DeepFake videos. The DeepFakeDetection dataset contains over 363 original sequences from 28 paid actors in 16 different scenes as well as over 3000 manipulated videos using DeepFakes.

YouTube-8M (Jan 2019)
Large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities, 3 labels per video on average.

AVA (Jan 2019)
Two datasets: the first associates speaking activity with a visible face, resulting in 3.65 million frames labeled across ~39K face tracks; the second densely annotates audio-based speech activity in videos and explicitly labels 3 background noise conditions, resulting in ~46K labeled segments spanning 45 hours of data.

3D

Replica Dataset (Jun 2019)
High quality reconstructions of a variety of indoor spaces (dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, planar segmentation as well as semantic class and instance segmentation).

Miscellanea

Google Research Football (Dec 2019)
Soccer game environment for reinforcement learning, by Google

MIMIC-CXR Database (Sep 2019)
Chest radiographs in DICOM format with free-text radiology reports. The dataset contains 377,110 images corresponding to 227,835 radiographic studies performed.

bsuite - Behaviour Suite for Reinforcement Learning (Aug 2019)
Collection of carefully-designed experiments that investigate core capabilities of reinforcement learning

Robotrix (Jan 2019)
512 sequences of actions taking place across 16 simulated rooms, rendered in high definition via the Unreal Engine. A robot avatar uses its hands to interact with the objects.

Academic torrents
Torrents for different academic resources, including some datasets.

Sussex-Huawei Locomotion Dataset (Jun 2018)
300 hours of 8 different activities

OpenAI Gym (2016)
Toolkit for developing and comparing reinforcement learning algorithms

How to add a paper / dataset:

Check in which category the paper fits
Check in which subcategory the paper fits (create a new one if needed)
Add the title, link, the month and year it was published, a link to the code if it exists, and the contribution of the paper. Papers should be sorted with the most recent first in each category.

Examples:

Title of the paper [code] (Jun 2018)
A couple of lines describing the main contribution of the paper. Do not copy the abstract or write more than 2 lines in order to keep the wiki tidy.

Title of the paper (Jan 2018)
A couple of lines describing the main contribution of the paper. Do not copy the abstract or write more than 2 lines in order to keep the wiki tidy.
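The "most recent first" rule can be checked mechanically. The following is a minimal sketch, assuming each title line ends with a date in the form `(Mon YYYY)` or `(YYYY)` as in the examples above. The helper names `entry_date` and `sort_most_recent_first` are hypothetical and are not part of any tooling in this wiki.

```python
import re
from datetime import date

# Month abbreviations as they appear in title lines, e.g. "(Jun 2018)".
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches a trailing "(Mon YYYY)" or "(YYYY)" at the end of a title line.
DATE_RE = re.compile(r"\((?:([A-Z][a-z]{2}) )?(\d{4})\)\s*$")


def entry_date(title):
    """Return the date parsed from the end of a title line, or None."""
    match = DATE_RE.search(title)
    if match is None:
        return None
    month_name, year = match.groups()
    if month_name is None:
        # Assumption: a year-only entry is treated as January of that year.
        return date(int(year), 1, 1)
    if month_name not in MONTHS:
        return None
    return date(int(year), MONTHS[month_name], 1)


def sort_most_recent_first(titles):
    """Sort title lines newest first; undated titles keep their order at the end."""
    dated = [t for t in titles if entry_date(t) is not None]
    undated = [t for t in titles if entry_date(t) is None]
    return sorted(dated, key=entry_date, reverse=True) + undated


if __name__ == "__main__":
    titles = [
        "Title of the paper (Jan 2018)",
        "Title of the paper [code] (Jun 2018)",
        "Older entry (2016)",
    ]
    for line in sort_most_recent_first(titles):
        print(line)
```

Placing undated entries last is a design choice of this sketch, not a rule stated in the wiki; a category containing such entries would still need a manual decision.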
