
make json preprocess version numbers available external to jsons #2

@jeremydouglass

We cut down substantially on rerun times, but just opening, parsing, and closing each JSON still takes a long time -- even when checking it a second time is a no-op. This is something like 90 seconds for 300 articles, even if every article is only read and then ignored (spacy never runs, zip never saved).

In the future we might want to move to keeping a list of article names with version numbers in the root of the zip -- like we are keeping a list of processed zips. Then we don't have to open them (which takes a surprisingly long time).
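As a rough sketch of the idea, a single small file at the root of the zip could be read without touching any article. The file name `versions.json` and the helper `read_manifest` are hypothetical, chosen only for illustration; nothing like this exists in the project yet.

```python
import json
import zipfile

# Hypothetical name for the root-level list; not an existing project file.
MANIFEST_NAME = "versions.json"


def read_manifest(zip_source):
    """Return the root-level manifest as a dict, or {} if the zip has none.

    zip_source may be a path or a file-like object. Only the manifest
    member is decompressed and parsed; the article JSONs are never opened.
    """
    with zipfile.ZipFile(zip_source) as zf:
        try:
            raw = zf.read(MANIFEST_NAME)
        except KeyError:
            # ZipFile.read raises KeyError when the member is missing.
            return {}
    return json.loads(raw.decode("utf-8"))
```

Returning an empty dict for older zips means every article would simply be treated as unprocessed, which seems like the safe default.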

For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.

{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
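The conditions could then be written against that dict alone. The following is only a sketch: `needs_reprocess`, `mark_processed`, and the two `CURRENT_*` constants are hypothetical names, and the timestamp format is inferred from the `"processed"` values in the example above.

```python
from datetime import datetime

# Assumed current versions; in practice these would come from the pipeline.
CURRENT_PPVERSION = "0.1"
CURRENT_HASHVERSION = "0.1"


def needs_reprocess(name, manifest,
                    ppversion=CURRENT_PPVERSION,
                    hashversion=CURRENT_HASHVERSION):
    """Return True if the article is missing from the manifest or stale."""
    entry = manifest.get(name)
    if entry is None:
        return True  # never processed
    return (entry.get("ppversion") != ppversion
            or entry.get("hashversion") != hashversion)


def mark_processed(name, manifest,
                   ppversion=CURRENT_PPVERSION,
                   hashversion=CURRENT_HASHVERSION,
                   now=None):
    """Record the versions and a YYYYMMDDHHMMSS timestamp for an article."""
    now = now or datetime.now()
    manifest[name] = {
        "ppversion": ppversion,
        "hashversion": hashversion,
        "processed": now.strftime("%Y%m%d%H%M%S"),
    }
```

With something like this, a rerun would loop over the article names, skip those where `needs_reprocess` is false, and only open the JSONs that actually need work.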
