make json preprocess version numbers available external to jsons

We cut down substantially on times for reruns, but just opening, parsing, and closing each JSON is still taking a long time, even if the second check of it is a no-op. This is something like 90 seconds for 300 articles, even if every article is only read and then ignored (spacy never runs, zip never saved).

In the future we might want to move to keeping a list of article names with version numbers in the root of the zip -- like we are keeping a list of processed zips. Then we don't have to open them (which takes a surprisingly long time).

For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.

```
{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
```
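A minimal sketch of how such a root-level file could drive the reprocess decision, assuming the index is stored as `index.json` in the root of the zip and that the current preprocess and hash versions are known constants. The file name, the constants, and all function names here are hypothetical, not existing code in this project. Only the index is read and parsed; the article JSONs themselves are never opened.

```python
import json
import zipfile

# Hypothetical names and values, chosen for illustration only.
INDEX_NAME = "index.json"
CURRENT_PPVERSION = "0.1"
CURRENT_HASHVERSION = "0.1"


def load_index(zip_path):
    """Return the root-level index as a dict, or {} if the zip has none."""
    with zipfile.ZipFile(zip_path) as zf:
        try:
            with zf.open(INDEX_NAME) as fh:
                return json.load(fh)
        except KeyError:
            # ZipFile.open raises KeyError when the member does not exist.
            return {}


def needs_reprocess(entry):
    """Decide from an index entry alone whether an article is stale."""
    if entry is None:
        # Article is not listed in the index, so it was never processed.
        return True
    return (
        entry.get("ppversion") != CURRENT_PPVERSION
        or entry.get("hashversion") != CURRENT_HASHVERSION
    )


def articles_to_reprocess(zip_path):
    """List article members that are missing from the index or out of date."""
    index = load_index(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        # namelist() reads only the zip directory, not the member contents.
        names = [
            n for n in zf.namelist()
            if n.endswith(".json") and n != INDEX_NAME
        ]
    return [n for n in names if needs_reprocess(index.get(n))]
```

A condition on the `processed` timestamp could be added to `needs_reprocess` in the same way. The index would need to be rewritten whenever an article is reprocessed, otherwise it drifts out of sync with the JSONs.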
