Description
Current situation
The more files a workspace contains, the larger its METS file becomes and the longer processing takes. This is especially true when information has to be looked up in different places of the XML, such as finding the page ID (which is in the `mets:structMap[@TYPE="PHYSICAL"]`) for a `mets:file` (which is in `mets:fileSec/mets:fileGrp`).
Access to the METS file is exclusive: there can be no workspace-wide parallelization, because concurrent processes overwrite the METS file rather than append to it, causing data loss.
How it should be
The performance penalty of dealing with larger workspaces should grow linearly or logarithmically with the number of files, not polynomially (quadratically or worse). Caching should be used to keep relevant information (such as page ID to file mappings) in memory.
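As a rough illustration of the caching idea, the sketch below builds the file-to-page mapping once, in a single pass over the structMap, so that each later lookup is a dictionary access instead of a fresh XPath search. It assumes pages are listed as `mets:div[@TYPE="page"]` elements with `mets:fptr` children in the physical structMap; the function name `build_page_index` and the sample IDs are made up for the example and are not part of any existing API.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Tiny hypothetical METS document, just enough structure for the example.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001"/>
      <mets:file ID="IMG_0002"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG">
      <mets:file ID="SEG_0001"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="SEG_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="PHYS_0002">
        <mets:fptr FILEID="IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def build_page_index(root):
    """Walk the structMap once and return a dict mapping file ID to page ID."""
    index = {}
    pages = root.iterfind(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']", NS
    )
    for page in pages:
        page_id = page.get("ID")
        for fptr in page.iterfind("mets:fptr", NS):
            index[fptr.get("FILEID")] = page_id
    return index


root = ET.fromstring(SAMPLE_METS)
page_of = build_page_index(root)  # built once, reused for every lookup
print(page_of["SEG_0001"])
```

Building the index is linear in the number of `mets:fptr` elements. Searching the structMap separately for every file is what makes the total cost quadratic. The cache has to be updated or rebuilt whenever files are added or removed.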
A synchronization mechanism should be implemented that allows multiple processes to add to the METS file without data loss.
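One possible shape for such a mechanism is an advisory file lock around each read-modify-write cycle. The sketch below is only an illustration under several assumptions: it is POSIX-only (it uses `fcntl.flock`), every writer is assumed to cooperate by going through the same helper, and the names `exclusive_lock`, `add_file` and `EMPTY_METS` are hypothetical. Other designs, such as a single server process that owns the METS file, would address the same problem.

```python
import fcntl
import os
import xml.etree.ElementTree as ET
from contextlib import contextmanager

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Minimal hypothetical starting document for experiments.
EMPTY_METS = (
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    "<mets:fileSec/></mets:mets>"
)


@contextmanager
def exclusive_lock(mets_path):
    """Hold an exclusive advisory lock on a sidecar lock file."""
    with open(mets_path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def add_file(mets_path, file_grp, file_id):
    """Re-read the METS under the lock, append one mets:file, write it back."""
    with exclusive_lock(mets_path):
        # Parse inside the lock so changes made by other writers are seen.
        tree = ET.parse(mets_path)
        file_sec = tree.getroot().find(f"{{{METS_NS}}}fileSec")
        grp = next((g for g in file_sec if g.get("USE") == file_grp), None)
        if grp is None:
            grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", USE=file_grp)
        ET.SubElement(grp, f"{{{METS_NS}}}file", ID=file_id)
        # Write to a temporary file, then rename, so readers never see a
        # half-written METS file.
        tmp_path = mets_path + ".tmp"
        tree.write(tmp_path, encoding="utf-8", xml_declaration=True)
        os.replace(tmp_path, mets_path)
```

The key point is that the file is parsed again inside the lock, so every writer appends to the latest state rather than overwriting it with a stale in-memory copy. The lock lives on a separate `.lock` file because `os.replace` swaps the METS file itself out from under any lock held on it.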