Conversation
ackramer approved these changes (Feb 16, 2017)
ackramer left a comment:
Just curious, how does this relate to the current scraper.py? Will parts of that still be required or is this a replacement?
LGTM
)
parser.add_argument(
    '--loglevel',
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',
choices=[logging.INFO, logging.ERROR, ...] and then default=logging.INFO. Then you can get rid of the LOGLEVELS dict. :)
Attributes:
    DEFAULT_LOG_FORMAT (:obj:`str`): A template string for log lines.
    LOGLEVELS (:obj:`dict`): A lookup of loglevel name to the loglevel code.
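For illustration, here is a minimal sketch of the pattern suggested above, assuming a plain argparse setup; the `_level` helper and the surrounding scaffolding are hypothetical, not code from this PR:

```python
import argparse
import logging


def _level(name):
    # Map a level name like "INFO" to its numeric code; raising ValueError
    # lets argparse emit a clean "invalid value" error for unknown names.
    try:
        return getattr(logging, name.upper())
    except AttributeError:
        raise ValueError(name)


parser = argparse.ArgumentParser()
parser.add_argument(
    '--loglevel',
    type=_level,
    choices=[logging.DEBUG, logging.INFO, logging.WARN, logging.ERROR, logging.CRITICAL],
    default=logging.INFO,
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',
)
args = parser.parse_args()
logging.basicConfig(level=args.loglevel)
```

With a converter like this, argparse validates the value against choices and the parsed result is already a numeric level, so a separate LOGLEVELS lookup dict becomes unnecessary.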
Author
@ackramer It's a replacement for the scraper, but it's not currently compatible with the parser and solr importer.
This adds a new entry point called crec_stager to the data pipeline. The existing pipeline operates entirely off of local disk (all three parts of the ETL pipeline).
The crec_stager downloads the previous day's CREC zip from gpo.gov to local disk, extracts all .htm files, then uploads each one to S3 using the key format <CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>.zip. It's designed to be run either locally or as an AWS lambda job.
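For reference, a rough sketch of the key layout described above; the function names and the boto3 call shown here are illustrative assumptions, not the PR's actual code:

```python
import datetime
import os

import boto3


def s3_key(prefix, date, filename):
    """Build a key of the form <CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>."""
    return '{0}/{1}/{2}'.format(prefix, date.strftime('%Y/%m/%d'), os.path.basename(filename))


def upload(local_path, bucket, prefix, date):
    """Upload one extracted file to S3 under the dated prefix."""
    boto3.client('s3').upload_file(local_path, bucket, s3_key(prefix, date, local_path))
```

So a file staged for Feb 15, 2017 would land under something like <CUSTOM_PREFIX>/2017/02/15/ in the target bucket.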
To run locally:
That will upload the data from yesterday to our good ol' test bukkit: use-this-bucket-to-test-your-bullshit. Check out the docstrings for more details.
I also deployed it to chartbeat's aws account under the lambda job lambda_test and set up a scheduled event trigger for once per day.
Eventually we'll need to convert the parser and solr importer to work off of the staged files in S3 instead of local disk, but this is at least a starting point to make this ETL pipeline a little more robust.
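For context, a scheduled-event Lambda entry point generally has the shape sketched below; the handler name, the inline comments, and the return value are assumptions rather than the deployed lambda_test code:

```python
import datetime


def handler(event, context):
    """Invoked once per day by the scheduled event trigger."""
    # The stager targets the previous day's CREC zip.
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    # Here the real job would download the zip from gpo.gov, extract the
    # .htm files, and upload them under <CUSTOM_PREFIX>/YYYY/MM/DD/.
    return {'staged_date': yesterday.isoformat()}
```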
@rmangi @ackramer @jeiranj