soodoku/biocong

Congressional Biographies

97th to 104th Congress

We extract the text from the PDFs (downloaded from Google Books, where they are freely available) and then parse it.

Scripts

  1. parse
  2. clean

105th to 115th Congress

We scrape congressional biographies for the 105th to the 115th Congress from the Congressional Directory. We download the biographical files, e.g., https://www.govinfo.gov/content/pkg/CDIR-2018-10-29/html/CDIR-2018-10-29-STATISTICALINFORMATION-2.htm, and parse them to extract information such as birthdate, number of children, education, etc.
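
As a rough illustration of the parsing idea, here is a minimal sketch that pulls a few fields out of one biography string with regular expressions. It is not the repository's parser. The sample text is invented, the `key: value;` layout is an assumption about how entries are punctuated, and `parse_bio` and `count_children` are hypothetical helper names.

```python
import re
from typing import Optional

# Invented sample text, loosely shaped like a directory entry; not real data.
SAMPLE_BIO = (
    "JANE Q. EXAMPLE, Democrat, of Springfield; born in Springfield, IL, "
    "March 4, 1960; education: B.A., Example University, 1982; "
    "professional: attorney; married: John Example; "
    "children: Ann, Ben, and Cal; committees: Agriculture; "
    "elected to the 105th Congress."
)

# Assumption: the birthplace is followed by a "Month D, YYYY" date.
BORN_RE = re.compile(r"born in ([^;]+?), ([A-Z][a-z]+ \d{1,2}, \d{4})")
# Assumption: labelled fields run up to the next semicolon.
FIELD_RE = re.compile(
    r"(education|professional|married|children|committees):\s*([^;]+)"
)


def parse_bio(text: str) -> dict:
    """Extract birthplace, birthdate, and labelled fields from one entry."""
    record = {}
    match = BORN_RE.search(text)
    if match:
        record["born_in"] = match.group(1).strip()
        record["birthdate"] = match.group(2)
    for key, value in FIELD_RE.findall(text):
        record[key] = value.strip()
    return record


def count_children(children: Optional[str]) -> Optional[int]:
    """Count names in a list such as 'Ann, Ben, and Cal'."""
    if not children:
        return None
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", children)
    return len([p for p in parts if p.strip()])


if __name__ == "__main__":
    bio = parse_bio(SAMPLE_BIO)
    print(bio["birthdate"], count_children(bio.get("children")))
```

Real entries vary far more than this sample (missing fields, counts instead of names, different punctuation), so patterns like these would need to be checked against the actual files.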

Scripts

  1. Scrapes the Congressional Directory scrapes the directory and produces biocong.csv, biocong-browsepath.csv, and the html files (tar.gz).
  2. Download Congressional Biographies Using the API provides a script for downloading the data through the API. (It produces incomplete data, so we don't use this script.)
  3. Parse iterates through biocong-browsepath.csv, parses the html files (tar.gz), and produces biocong-parsed.csv.
  4. Clean takes biocong-parsed.csv and produces biocong-cleaned.csv.
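
The parse step (iterate over an index CSV, read each HTML file out of the tar.gz) could look roughly like the sketch below. It is an illustration under assumptions, not the repository's script: we assume the index has a column holding each file's name inside the archive (here `htmlfile`, borrowed from the column list of the final dataset), and `html_to_text` and `parse_archive` are hypothetical names.

```python
import csv
import tarfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def parse_archive(index_csv, archive, file_column="htmlfile"):
    """Yield one dict per index row, with the text of its HTML file added.

    index_csv is a text file object for the index CSV; archive is a binary
    file object for the tar.gz. Rows whose file is missing from the archive
    are skipped.
    """
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        members = {m.name: m for m in tar.getmembers() if m.isfile()}
        for row in csv.DictReader(index_csv):
            member = members.get(row[file_column])
            if member is None:
                continue
            raw = tar.extractfile(member).read()
            html = raw.decode("utf-8", errors="replace")
            yield {**row, "text": html_to_text(html)}
```

Reading members straight from the archive avoids unpacking thousands of small files to disk. The archive name and the exact layout of paths inside it are not given here, so both would need to be matched to the real files.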

Data

The final dataset, biocong-cleaned.csv, has the following columns:

'level', 'docCount', 'browsePath', 'title', 'lastpage', 'granuleid', 'packageid', 'pdffile', 'pdf', 'text',
 'agencyLevel', 'nodeStatus', 'textfile', 'htmlfile', 'browseline1', 'processingcode', 'nodetype', 'index.1', 
 'publishdate', 'part', 'forGpo', 'hasChildren', 'hasParents', 'rootNode', 'documentResults', 'hasDocumentResults',
 'collectionCode', 'searchPath', 'isContentArea', 'pageSize', 'pageNumber', 'count', 'digitizedFR', 'section',
 'firstpage', 'congress', 'biography', 'name', 'party', 'location', 'born_in', 'birthdate', 'education', 'professional', 
 'married', 'children', 'committees', 'url', 'n_children'
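
For analysis, most users will probably want only the biographical columns at the end of this list. The sketch below loads the file with the standard library and keeps those columns. The selection is our guess at what counts as biographical, the way `n_children` is stored (integer, float, or blank) is an assumption, and `load_bios` is a hypothetical helper.

```python
import csv

# Columns from the list above that appear to hold parsed biographical fields.
BIO_COLUMNS = [
    "congress", "biography", "name", "party", "location", "born_in",
    "birthdate", "education", "professional", "married", "children",
    "committees", "url", "n_children",
]


def load_bios(csv_file):
    """Read the cleaned CSV and keep only the biographical columns."""
    rows = []
    for row in csv.DictReader(csv_file):
        bio = {col: row.get(col, "") for col in BIO_COLUMNS}
        raw = (bio["n_children"] or "").strip()
        try:
            # Assumption: the count may be written as "3" or "3.0".
            bio["n_children"] = int(float(raw)) if raw else None
        except (ValueError, OverflowError):
            bio["n_children"] = None
        rows.append(bio)
    return rows
```

Usage, assuming the file sits in the working directory:

```python
with open("biocong-cleaned.csv", newline="", encoding="utf-8") as f:
    bios = load_bios(f)
```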

About

Biographical data on members of Congress (105th to 115th).
