Conversation
ackramer approved these changes (Feb 16, 2017)
ackramer left a comment:
Just curious, how does this relate to the current scraper.py? Will parts of that still be required or is this a replacement?
LGTM
)
parser.add_argument(
    '--loglevel',
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',
choices=[logging.INFO, logging.ERROR, ...] and then default=logging.INFO. Then you can get rid of the LOGLEVELS dict. :)
Attributes:
    DEFAULT_LOG_FORMAT (:obj:`str`): A template string for log lines.
    LOGLEVELS (:obj:`dict`): A lookup of loglevel name to the loglevel code.
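For illustration, here is a minimal sketch of the pattern suggested above, assuming a plain argparse setup; the `_level` helper and the surrounding scaffolding are hypothetical, not code from this PR:

```python
import argparse
import logging


def _level(name):
    # Map a level name like "INFO" to its numeric code; raising ValueError
    # lets argparse emit a clean "invalid value" error for unknown names.
    try:
        return getattr(logging, name.upper())
    except AttributeError:
        raise ValueError(name)


parser = argparse.ArgumentParser()
parser.add_argument(
    '--loglevel',
    type=_level,
    choices=[logging.DEBUG, logging.INFO, logging.WARN, logging.ERROR, logging.CRITICAL],
    default=logging.INFO,
    help='Log level, one of INFO, ERROR, WARN, DEBUG or CRITICAL.',
)
args = parser.parse_args()
logging.basicConfig(level=args.loglevel)
```

With a converter like this, argparse validates the value against choices and the parsed result is already a numeric level, so a separate LOGLEVELS lookup dict becomes unnecessary.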
Author
@ackramer It's a replacement for the scraper, but it's not currently compatible with the parser and solr importer.
This adds a new entry point called crec_stager to the data pipeline. The existing pipeline operates entirely off of local disk (all three parts of the ETL pipeline).
The crec_stager downloads the previous day's CREC zip from gpo.gov to local disk, extracts all .htm files, then uploads each one to S3 using the key format <CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>.zip. It's designed to be run either locally or as an AWS lambda job.
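For reference, a rough sketch of the key layout described above; the function names and the boto3 call shown here are illustrative assumptions, not the PR's actual code:

```python
import datetime
import os

import boto3


def s3_key(prefix, date, filename):
    """Build a key of the form <CUSTOM_PREFIX>/YYYY/MM/DD/<CREC_FILENAME>."""
    return '{0}/{1}/{2}'.format(prefix, date.strftime('%Y/%m/%d'), os.path.basename(filename))


def upload(local_path, bucket, prefix, date):
    """Upload one extracted file to S3 under the dated prefix."""
    boto3.client('s3').upload_file(local_path, bucket, s3_key(prefix, date, local_path))
```

So a file staged for Feb 15, 2017 would land under something like <CUSTOM_PREFIX>/2017/02/15/ in the target bucket.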
To run locally:
That will upload the data from yesterday to our good ol' test bukkit: use-this-bucket-to-test-your-bullshit. Check out the docstrings for more details.
I also deployed it to chartbeat's aws account under the lambda job lambda_test and set up a scheduled event trigger for once per day.
Eventually we'll need to convert the parser and solr importer to work off of the staged files in S3 instead of local disk, but this is at least a starting point to make this ETL pipeline a little more robust.
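For context, a scheduled-event Lambda entry point generally has the shape sketched below; the handler name, the inline comments, and the return value are assumptions rather than the deployed lambda_test code:

```python
import datetime


def handler(event, context):
    """Invoked once per day by the scheduled event trigger."""
    # The stager targets the previous day's CREC zip.
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    # Here the real job would download the zip from gpo.gov, extract the
    # .htm files, and upload them under <CUSTOM_PREFIX>/YYYY/MM/DD/.
    return {'staged_date': yesterday.isoformat()}
```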
@rmangi @ackramer @jeiranj