The raw data and scripts to collect them, as used and mirrored for Stanford Computational Journalism data tutorials at http://tutorials.compjour.org.
This repo is currently kind of hacked together. It will eventually become a more formalized framework for fetching and packaging data in different formats and at different stages of cleaning, so that the datasets can be used as practice for both data gathering and analysis.
The data-holding directory contains the downloaded files and some of their compiled versions. Some of the bigger files have been split into smaller files so that they fit more gracefully into version control, but I haven't yet written the compilation scripts to re-assemble them. The ultimate goal is to have scripts that produce downloadable links to easy-to-use CSVs and SQLite databases for class exercises (as soon as I finish learning SQLAlchemy).
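
Since the re-assembly scripts don't exist yet, here is a minimal sketch of what one could look like, using only the Python 3 standard library (`csv` and `sqlite3`, not SQLAlchemy). It assumes things this repo does not guarantee: that each split part is a CSV with its own header row, and that sorting the part filenames gives the right order. The function names and the `demo.part-*.csv` pattern are hypothetical.

```python
import csv
import sqlite3


def reassemble_parts(part_paths, dest_path):
    """Concatenate CSV parts into one file, keeping only the first header.

    Assumes every part starts with the same header row (an assumption,
    not something the repo's split files are known to follow).
    """
    with open(dest_path, "w", newline="", encoding="utf-8") as dest:
        writer = csv.writer(dest)
        for index, part in enumerate(sorted(part_paths)):
            with open(part, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)  # None if the part is empty
                if index == 0 and header is not None:
                    writer.writerow(header)
                writer.writerows(reader)  # remaining rows are data
    return dest_path


def _quote(identifier):
    """Quote a table or column name for use in a SQLite statement."""
    return '"{}"'.format(identifier.replace('"', '""'))


def csv_to_sqlite(csv_path, db_path, table_name):
    """Load a CSV into a SQLite table, storing every column as TEXT."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        columns = ", ".join(_quote(name) + " TEXT" for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(str(db_path))
        try:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS {} ({})".format(
                    _quote(table_name), columns
                )
            )
            conn.executemany(
                "INSERT INTO {} VALUES ({})".format(
                    _quote(table_name), placeholders
                ),
                reader,
            )
            conn.commit()
        finally:
            conn.close()
```

For example, `reassemble_parts(glob.glob("demo.part-*.csv"), "demo.csv")` followed by `csv_to_sqlite("demo.csv", "demo.sqlite", "demo")` would produce one CSV and one single-table database. Typing every column as TEXT keeps the sketch short; a real script would probably want per-dataset column types.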
The scripts directory contains the (mostly Python 3) scripts for fetching the data. I've been writing them as I go, so each subfolder/project is a bit different depending on my mood at that moment and whether I've learned from mistakes in fetching the other datasets.
- Congress legislator data from the unitedstates project. The compilation script creates 3 separate CSVs: legislator information, term info, and social media accounts.
- Congress vote data for the 114th Congress via the GovTrack.us rsync server. The compilation script converts GovTrack's JSON into two flat CSV files for easier import into SQL: member-votes.csv and roll-call-votes.csv.
- Congress Twitter data. Using the social media account info from the unitedstates project (which lists official accounts, not campaign accounts), I've collected the most recent tweets from each member of Congress, their profile data, the user_ids of the accounts they follow, and the profiles of the users most followed by members of Congress. However, most of the collecting code lives in my Ruby datajanitor repo; I've simply moved the resulting data files to this repo's compiled directory.
- San Francisco Police Department incident data
- Dallas Police Department datasets: incidents, arrests, and charges
- Los Angeles Police Department incidents from 2013 and 2014
- Southern Nevada (e.g. Las Vegas) restaurant inspections database
- New York restaurant inspections data, which is a flat table containing all restaurants, inspections, violations, and violation codes.
- San Francisco restaurant inspections
- Social Security Administration baby names, nationwide and per-state.
- The metadata for The Museum of Modern Art collection
- U.S. House member and staff expenditures, as collected and cleaned by the Sunlight Foundation
- UK baby names
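
To illustrate the kind of flattening described for the Congress vote data above (nested JSON into member-votes.csv and roll-call-votes.csv), here is a rough sketch. It is not the repo's actual compilation script. The JSON field names (`vote_id`, `votes`, `id`, and so on), the chosen CSV columns, and the sample record are all assumptions made up for illustration; check them against the real files before relying on them.

```python
import csv
import json

# Columns for the two flat files; these column choices are assumptions.
ROLL_CALL_FIELDS = ["vote_id", "chamber", "date", "question", "result"]
MEMBER_VOTE_FIELDS = ["vote_id", "member_id", "vote"]

# A made-up record shaped the way this sketch expects; not real data.
SAMPLE_VOTE = json.loads("""
{
  "vote_id": "h1-114.2015",
  "chamber": "h",
  "date": "2015-01-06",
  "question": "Example question",
  "result": "Passed",
  "votes": {
    "Yea": [{"id": "X000001"}, {"id": "X000002"}],
    "Nay": [{"id": "X000003"}]
  }
}
""")


def flatten_vote(vote):
    """Split one nested vote record into a roll-call row and member rows."""
    roll_call = {field: vote.get(field, "") for field in ROLL_CALL_FIELDS}
    member_votes = []
    for choice, voters in vote.get("votes", {}).items():
        for voter in voters:
            if not isinstance(voter, dict):
                continue  # skip entries that don't look like member records
            member_votes.append({
                "vote_id": vote.get("vote_id", ""),
                "member_id": voter.get("id", ""),
                "vote": choice,
            })
    return roll_call, member_votes


def write_vote_csvs(votes, roll_call_path, member_votes_path):
    """Write one row per roll call and one row per member per roll call."""
    with open(roll_call_path, "w", newline="", encoding="utf-8") as rc_file, \
            open(member_votes_path, "w", newline="", encoding="utf-8") as mv_file:
        rc_writer = csv.DictWriter(rc_file, fieldnames=ROLL_CALL_FIELDS)
        mv_writer = csv.DictWriter(mv_file, fieldnames=MEMBER_VOTE_FIELDS)
        rc_writer.writeheader()
        mv_writer.writeheader()
        for vote in votes:
            roll_call, member_votes = flatten_vote(vote)
            rc_writer.writerow(roll_call)
            mv_writer.writerows(member_votes)
```

Calling `write_vote_csvs([SAMPLE_VOTE], "roll-call-votes.csv", "member-votes.csv")` writes both files. Keeping `vote_id` in each member row is what lets the two tables be joined back together in SQL.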
Todo:
- Iowa State Salaries
- Florida State Salaries
- Death row data from TX, FL, CA
- Stop and frisk, 2012, 2013, 2014