call-number-generation

This code retrieves data from a VuFind Solr index, trains an AI model, and generates the first two letters (the LCC class and subclass) of the call number for records that do not have one.

Prerequisites

A multi-valued callnumber-label field is expected in VuFind. This field is single-valued before VuFind 11.0, but the biblio schema can be updated as follows to use this software with earlier versions:

 <field name="callnumber-label" type="string" indexed="true" stored="true" multiValued="true"/>

This software requires tensorflow[and-cuda] (CUDA for GPU use) and keras. They can be installed with pip in a Python virtual environment:

python -m venv ./.env
source ./.env/bin/activate
pip install "tensorflow[and-cuda]" keras
deactivate

Enhancing the data requires the following additional packages: ollama-python, argostranslate, fast-langdetect, python-iso639.

To use:

Do a full import without generating call numbers

Extract Solr data

  • mkdir solr_data (or create a symlink with the same name)
  • Define the following environment variables: SOLR_USERNAME, SOLR_PASSWORD, SOLR_HOST (see the sketch after this list for how they are used)
  • ./extract_from_solr.py
  • Unset the variables
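
For reference, the extraction roughly amounts to paging through Solr with those credentials. Below is a minimal sketch, assuming a biblio core and an illustrative field list; extract_from_solr.py is authoritative.

  import json
  import os

  import requests

  # Assumption: the environment variables above provide the host and basic auth.
  solr_url = f"https://{os.environ['SOLR_HOST']}/solr/biblio/select"
  auth = (os.environ["SOLR_USERNAME"], os.environ["SOLR_PASSWORD"])

  start, rows = 0, 1000
  while True:
      response = requests.get(solr_url, auth=auth, params={
          "q": "*:*",
          "fl": "id,title,topic,language,format,callnumber-label",
          "start": start,
          "rows": rows,
          "wt": "json",
      })
      docs = response.json()["response"]["docs"]
      if not docs:
          break
      # Write each page of results into solr_data/ for the later steps.
      with open(f"solr_data/batch_{start}.json", "w") as fh:
          json.dump(docs, fh)
      start += rows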

Optional (very long) step: enhance dataset

This will translate all titles in non-English or unidentified languages into English. It can also generate topics for records without them (although this did not improve results in an initial test). Most of the translation is done with argos-translate, which is very fast. Titles in languages argos does not support, as well as topic generation, are handled with ollama using llama3.1; this part is very slow.

This step can take a very long time. It can be skipped entirely, or used only to translate titles for the main languages with argos. Using an LLM for many entries requires a GPU with enough VRAM for the chosen LLM. Argos can use either the GPU or the CPU.

Check options at the beginning of the script.

  • mkdir enhanced_data
  • ./translation_and_keywords.py
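
For illustration, the per-title translation logic boils down to something like the sketch below. The helper, field handling, and language list are assumptions, and the ollama/llama3.1 fallback and topic generation are not shown; translation_and_keywords.py is authoritative.

  import argostranslate.translate
  from fast_langdetect import detect

  # Assumption: the corresponding argos language packages are already installed.
  ARGOS_LANGUAGES = {"fr", "de", "es", "it", "pt"}

  def translate_title(title: str) -> str:
      # fast-langdetect expects single-line text; returns e.g. {"lang": "fr", "score": 0.97}
      lang = detect(title.replace("\n", " "))["lang"]
      if lang == "en":
          return title
      if lang in ARGOS_LANGUAGES:
          return argostranslate.translate.translate(title, lang, "en")
      # Languages argos does not support would fall back to ollama (llama3.1).
      return title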

Sort and shuffle documents

This step will separate documents with call numbers from records without them. It will also shuffle the documents (this is important for training). When the data-enhancement step is skipped, USE_ENHANCED_DATA should be set to False.

  • mkdir shuffled_documents_with_call_numbers shuffled_documents_without_call_numbers
  • ./sort_and_shuffle_documents.py
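
Roughly, this step amounts to the following (file layout and field names are assumptions; sort_and_shuffle_documents.py is authoritative):

  import glob
  import json
  import random

  documents = []
  for path in glob.glob("solr_data/*.json"):  # or enhanced_data/ when USE_ENHANCED_DATA is True
      with open(path) as fh:
          documents.extend(json.load(fh))

  # Split on the presence of a call number, then shuffle each group.
  with_cn = [d for d in documents if d.get("callnumber-label")]
  without_cn = [d for d in documents if not d.get("callnumber-label")]

  random.shuffle(with_cn)  # shuffling matters for training
  random.shuffle(without_cn)

  with open("shuffled_documents_with_call_numbers/documents.json", "w") as fh:
      json.dump(with_cn, fh)
  with open("shuffled_documents_without_call_numbers/documents.json", "w") as fh:
      json.dump(without_cn, fh)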

Train the model

Training uses the following document properties: title, topic, language, and format, simply concatenating the strings. Embeddings are generated from the resulting texts, and then the following layers are applied (a minimal sketch follows the list below):

  • Dropout
  • Bidirectional LSTM
  • 1 dense layer
  • 1 dense layer for classification output
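
A minimal Keras sketch of that stack is below. Layer sizes, vocabulary size, and the number of output classes are illustrative assumptions, and the tokenization that feeds the embedding layer is not shown; train.py is authoritative.

  from tensorflow import keras
  from tensorflow.keras import layers

  VOCAB_SIZE = 20000   # assumption
  NUM_CLASSES = 400    # assumption: one per LCC class/subclass label

  model = keras.Sequential([
      layers.Embedding(VOCAB_SIZE, 128),          # embeddings from the concatenated text
      layers.Dropout(0.3),
      layers.Bidirectional(layers.LSTM(64)),
      layers.Dense(64, activation="relu"),
      layers.Dense(NUM_CLASSES, activation="softmax"),  # classification output
  ])
  model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])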

Check the constants at the beginning of the script. TRAIN_TEST_RATIO can be increased to generate the final models: the reported accuracy will not be precise without many documents in the test set, so this is not something to do while trying different parameters, but the resulting model will be better.

Execution:

  • mkdir keras_models
  • ./train.py

Use the model

This is just to try the resulting model after training.

  • ./use_model.py (output is not written to a log file automatically; feel free to redirect it, for instance to ./logs/)

Find model mistakes

This will use documents with call numbers that were not used for training and look for examples where the generated class and subclass do not match the original call numbers.

  • ./find_model_mistakes.py

Generate call numbers for records without any LCC call number

This will output a file of id / call number pairs when the model can determine the call number class and subclass with a certain confidence level.

  • ./generate.py

Output: call_numbers.csv
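
The confidence filter amounts to something like this (the threshold, label list, and helper are assumptions; generate.py is authoritative):

  import csv

  import numpy as np

  CONFIDENCE_THRESHOLD = 0.9  # assumption

  def write_confident_predictions(record_ids, probability_matrix, labels):
      """probability_matrix holds one row of class probabilities per record."""
      with open("call_numbers.csv", "w", newline="") as fh:
          writer = csv.writer(fh)
          for record_id, probabilities in zip(record_ids, probability_matrix):
              best = int(np.argmax(probabilities))
              # Only keep id / call-number pairs the model is confident about.
              if probabilities[best] >= CONFIDENCE_THRESHOLD:
                  writer.writerow([record_id, labels[best]])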

Write results into Solr

Implementation for this is specific to MSU. See add_generated_call_numbers.py.

  • This should be done in the catalog's solr_solr container with the add_generated_call_numbers.py script from the catalog repo.
