Datasets
EEG datasets
A list of all public EEG-datasets. This list of EEG-resources is not exhaustive. If you find something new, or have explored any unfiltered link in-depth, please update the repository.
DeepMind open datasets
List of open datasets from DeepMind
Dataset List
A list of the biggest datasets for machine learning from across the web.
Awesome Public Datasets
A list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses.
50 Best Free Datasets for Machine Learning
Microsoft Research Open Data
Be aware of the license
OpenDataSoft
List of 2600+ open data portals around the world
Awesome Linguistics Resources for Spanish (2015)
Curated list of Linguistic Resources for doing Spanish NLP & CL.
Review of existing facial expression databases that are often used in social psychology
Figure Eight data for everyone
MAAD-Face: A Massively Annotated Attribute Dataset for Face Images (Dec 2020)
Face dataset with 123.9M attribute annotations from 47 soft-biometric attributes.
WiderPerson (Oct 2019)
A Diverse Dataset for Dense Pedestrian Detection in the Wild. 13,382 images labeled with about 400K annotations covering various kinds of occlusions
1mFakeFaces other link (Jun 2019)
1 million images of fake faces generated with StyleGAN
Diversity in Faces Dataset (Feb 2019)
Annotations of 1 million human facial images
Flickr-Faces-HQ Dataset (FFHQ) (Feb 2019)
70k face images from Flickr
DeepFashion2 [GitHub] (Jan 2019)
Four tasks: clothes detection, pose estimation, segmentation, and retrieval. 801K clothing items, where each item has rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks and masks.
HDR+ Burst Photography Dataset (Jan 2019)
3,640 bursts of full-resolution raw images, made up of 28,461 individual images, along with HDR+ intermediate and final results for comparison.
CheXpert (Jan 2019)
Dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients.
Tencent ML (Jan 2019)
OpenImages + ImageNet datasets
FFHQ (Jan 2019)
High-quality faces dataset from Flickr collected by NVIDIA. 70,000 images at 1024x1024
NoCaps (Dec 2018)
166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets
DeepLesion (Jul 2018)
32,000 annotated lesions identified on CT images from 4,400 unique patients.
SCUT-FBP5500 Dataset (Jan 2018)
5500 frontal faces with diverse properties (male/female, Asian/Caucasian, ages) and diverse labels (facial landmarks, beauty scores in 5 scales, beauty score distribution)
OpenImages (2017)
80M Images (creative commons license), 4764 trainable classes, 600 trainable classes with bounding boxes
Large-scale Fashion (DeepFashion) Database (2016)
800k fashion images, 50 categories, 1000 descriptive attributes
CelebFaces Attributes Dataset (CelebA) (2015)
200k face images, 5 landmark locations, 40 binary attributes
The Child Affective Facial Expression (CAFE) set (Jul 2014)
Photographs taken of 2- to 8-year-old children posing for 6 emotional facial expressions—sadness, happiness, surprise, anger, disgust, and fear—plus a neutral face.
Code Search Challenge (Sep 2019)
Code annotated by GitHub for performing search over code.
The Natural Questions Dataset (Jan 2019)
Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing.
OPUS (Dec 2018)
Collection of translated texts from the web
Toxic Comment Classification (Jun 2018)
150k Wikipedia comments with the labels: toxic, severe_toxic, obscene, threat, insult and identity_hate.
General Language Understanding Evaluation (GLUE) (Apr 2018)
Nine sentence- or sentence-pair language understanding tasks built on established existing datasets
Dataset for generating TL;DR (Feb 2018)
3M posts from the Reddit corpus (the content and summary), suitable for abstractive summarization. An average length of 211 words for content, and 25 words for the summary.
SentEval (Dec 2017)
Evaluation toolkit for sentence embeddings
Full Reddit Submission Corpus now available (2006 thru August 2015) (2016)
200 million submission objects with score data, author, title, self_text, media tags and all other attributes available via the Reddit API
Reddit comments (Jul 2015)
1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API
The Stanford Sentiment Treebank (2013)
Sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences
European Parliament Proceedings Parallel Corpus 1996-2011 (2005)
Translations of millions of sentences across 21 European languages.
XLSR, speech to vec (Dec 2020)
MLS: Multilingual LibriSpeech (8 languages, 50.7k hours), CommonVoice (36 languages, 3.6k hours), Babel (17 languages, 1.7k hours)
ToyADMOS (Aug 2019)
540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds collected with four microphones at a 48kHz sampling rate
Million Song dataset (Jan 2011)
FaceForensics (Sep 2019)
DeepFake videos. The DeepFakeDetection dataset contains over 363 original sequences from 28 paid actors in 16 different scenes as well as over 3000 manipulated videos using DeepFakes.
YouTube-8M (Jan 2019)
Large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities, 3 labels per video on average.
AVA (Jan 2019)
Two datasets: one associates speaking activity with a visible face, resulting in 3.65 million frames labeled across ~39K face tracks; the other densely annotates audio-based speech activity in videos and explicitly labels 3 background noise conditions, resulting in ~46K labeled segments spanning 45 hours of data.
Replica Dataset (Jun 2019)
High quality reconstructions of a variety of indoor spaces (dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, planar segmentation as well as semantic class and instance segmentation).
Google Research Football (Dec 2019)
Soccer game environment for reinforcement learning by Google
MIMIC-CXR Database (Sep 2019)
Chest radiographs in DICOM format with free-text radiology reports. The dataset contains 377,110 images corresponding to 227,835 radiographic studies performed.
bsuite - Behaviour Suite for Reinforcement Learning (Aug 2019)
Collection of carefully-designed experiments that investigate core capabilities of reinforcement learning
Robotrix (Jan 2019)
512 sequences of actions taking place across 16 simulated rooms, rendered in high definition via the Unreal Engine. A robot avatar uses its hands to interact with the objects.
Academic torrents
Torrents for different academic resources, including some datasets.
Sussex-Huawei Locomotion Dataset (Jun 2018)
300 hours of 8 different activities
OpenAI Gym (2016)
Toolkit for developing and comparing reinforcement learning algorithms
- Check in which category the paper fits
- Check in which subcategory the paper fits (create a new one if needed)
- Add the title, link, the month and year it was published, a link to the code if it exists and the contribution of the paper. Papers should be sorted most recent first within each category. Example:
Title of the paper [code] (Jun 2018)
A couple of lines describing the main contribution of the paper. Do not copy the abstract or write more than 2 lines in order to keep the wiki tidy.
Title of the paper (Jan 2018)
A couple of lines describing the main contribution of the paper. Do not copy the abstract or write more than 2 lines in order to keep the wiki tidy.
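The ordering rule above (most recent first within each category) can be checked with a small script. The following is only a sketch: it assumes title lines end with a date in parentheses such as `(Jun 2018)` or `(2017)`, as in the entries on this page. The names `TITLE_RE`, `entry_date` and `sort_entries` are hypothetical helpers, not part of any wiki tooling.

```python
import re

# Three-letter month abbreviations as used in the entries, e.g. "Jun 2018".
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical pattern: a title followed by "(Mon YYYY)" or "(YYYY)" at the end of the line.
TITLE_RE = re.compile(
    r"^(?P<title>.+?)\s*\((?:(?P<month>[A-Z][a-z]{2})\s+)?(?P<year>\d{4})\)\s*$"
)


def entry_date(title_line):
    """Return (year, month) for a title line, or None if no date is found."""
    match = TITLE_RE.match(title_line)
    if match is None:
        return None
    month = match.group("month")
    if month is None:
        month_number = 1  # assumption: year-only entries are treated as January
    elif month in MONTHS:
        month_number = MONTHS.index(month) + 1
    else:
        return None
    return (int(match.group("year")), month_number)


def sort_entries(title_lines):
    """Sort title lines most recent first; undated lines go last in their original order."""
    dated = [(entry_date(line), line) for line in title_lines]
    with_date = [pair for pair in dated if pair[0] is not None]
    without_date = [line for date, line in dated if date is None]
    # Python's sort is stable, so entries with the same date keep their relative order.
    with_date.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in with_date] + without_date
```

Running `sort_entries` on the title lines of one category and comparing the result with the current order shows whether a new entry was added in the right place. Treating year-only entries as January is an arbitrary choice made for this sketch.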