Description
We cut down substantially on rerun times, but just opening, parsing, and closing each JSON file still takes a long time, even when the check that follows turns out to be a no-op. It takes something like 90 seconds for 300 articles, even if every article is only read and then ignored (spacy never runs, the zip is never saved).
In the future we might want to keep a list of article names with version numbers in the root of the zip, like the list of processed zips we already keep. Then we wouldn't have to open the articles at all (which takes a surprisingly long time).
For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.
{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
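A rough sketch of how such an index could be used, to make the idea concrete. The file name `manifest.json` and the helpers `read_manifest`, `needs_update`, and `articles_to_process` are hypothetical and not part of the current code. The update rule (reprocess when either version differs or the article is missing from the index) is only one possible condition.

```python
import json
import zipfile

# Hypothetical name for the index file kept in the root of the zip.
MANIFEST_NAME = "manifest.json"


def read_manifest(zf):
    """Return the index as a dict, or an empty dict if the zip has none yet."""
    try:
        with zf.open(MANIFEST_NAME) as fh:
            return json.load(fh)
    except KeyError:
        # ZipFile.open raises KeyError when the member does not exist.
        return {}


def needs_update(entry, ppversion, hashversion):
    """Assumed rule: reprocess if the article is unlisted or a version differs."""
    if entry is None:
        return True
    return (
        entry.get("ppversion") != ppversion
        or entry.get("hashversion") != hashversion
    )


def articles_to_process(zip_path, ppversion, hashversion):
    """List article names that need work, without opening any article JSON."""
    with zipfile.ZipFile(zip_path) as zf:
        manifest = read_manifest(zf)
        names = [
            n for n in zf.namelist()
            if n.endswith(".json") and n != MANIFEST_NAME
        ]
    return [
        n for n in names
        if needs_update(manifest.get(n), ppversion, hashversion)
    ]
```

With this approach only one small JSON file is parsed per zip; the individual articles are opened only if they show up in the returned list. The `processed` timestamp is not used in this sketch, but could be added to `needs_update` as another condition.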