Optimize METS handling #7

@krvoigt

Current situation

The more files a workspace contains, the larger its METS file becomes and the longer processing takes. This is especially noticeable when information has to be looked up in different places of the XML, such as getting the page ID (which is in the mets:structMap[@TYPE=LOGICAL]) for a mets:file (which is in mets:fileSec/mets:fileGrp).

Access to the METS file is exclusive. There can be no workspace-wide parallelization because different processes will overwrite the METS file rather than append to it, causing data loss.

How it should be

The performance penalty of dealing with larger workspaces should grow linearly or logarithmically with workspace size, not polynomially. Caching should be used to keep relevant information (such as pageID to file mappings) in memory.
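
To illustrate the caching idea, here is a minimal sketch that walks the structMap once and keeps the pageID to file mappings in dictionaries, so each later lookup is a constant-time dictionary access instead of a new search through the XML. It uses only Python's standard library. The function name `build_page_index`, the sample document, and the IDs in it are made up for illustration and are not part of any existing API. The structMap type is a parameter because this issue names LOGICAL, while some METS profiles keep page divs in a PHYSICAL structMap.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}


def build_page_index(root, structmap_type="LOGICAL"):
    """Walk the chosen structMap once and return two lookup tables.

    Hypothetical helper: returns (file_to_page, page_to_files).
    """
    file_to_page = {}
    page_to_files = defaultdict(list)
    for struct_map in root.findall("mets:structMap", NS):
        if struct_map.get("TYPE") != structmap_type:
            continue
        # iter() visits every nested mets:div exactly once
        for div in struct_map.iter(f"{{{METS_NS}}}div"):
            page_id = div.get("ID")
            for fptr in div.findall("mets:fptr", NS):
                file_id = fptr.get("FILEID")
                file_to_page[file_id] = page_id
                page_to_files[page_id].append(file_id)
    return file_to_page, page_to_files


# Tiny made-up METS document, only for demonstration
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001"/>
      <mets:file ID="IMG_0002"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="document">
      <mets:div TYPE="page" ID="PAGE_0001"><mets:fptr FILEID="IMG_0001"/></mets:div>
      <mets:div TYPE="page" ID="PAGE_0002"><mets:fptr FILEID="IMG_0002"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

root = ET.fromstring(SAMPLE)
file_to_page, page_to_files = build_page_index(root)
```

Building the index costs one pass over the structMap. The index has to be rebuilt or updated whenever files are added or removed, which is the usual trade-off of a cache.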

A synchronization mechanism should be implemented that allows multiple processes to add to the METS file without data loss.
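
One possible shape for such a mechanism, sketched here under assumptions, is an advisory file lock around a read-modify-write cycle: each writer takes the lock, re-reads the current METS file, adds its entry, and atomically replaces the file. This sketch relies on `fcntl.flock`, so it assumes a POSIX system and a filesystem where advisory locks work reliably. The names `locked` and `add_file` and the `.lock` / `.tmp` suffixes are hypothetical, not an existing API.

```python
import fcntl
import os
import xml.etree.ElementTree as ET
from contextlib import contextmanager

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)  # keep the mets: prefix when writing


@contextmanager
def locked(mets_path):
    """Hold an exclusive advisory lock on a sidecar lock file (POSIX only)."""
    with open(mets_path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def add_file(mets_path, file_grp, file_id):
    """Add a mets:file entry without losing entries written by others."""
    with locked(mets_path):
        # Re-read under the lock so concurrent additions are preserved
        tree = ET.parse(mets_path)
        file_sec = tree.getroot().find(f"{{{METS_NS}}}fileSec")
        grp = next((g for g in file_sec if g.get("USE") == file_grp), None)
        if grp is None:
            grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", USE=file_grp)
        ET.SubElement(grp, f"{{{METS_NS}}}file", ID=file_id)
        # Write to a temporary file, then swap it in atomically
        tmp_path = mets_path + ".tmp"
        tree.write(tmp_path, encoding="utf-8", xml_declaration=True)
        os.replace(tmp_path, mets_path)
```

The lock is taken on a separate sidecar file rather than on the METS file itself, because `os.replace` swaps in a new file and a lock held on the old one would no longer protect anything. Re-parsing the whole file on every write is simple but slow for large workspaces, so a real implementation would likely batch additions or combine this with the in-memory cache.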
