The referenceClassify tool from govscienceuseR takes tagged reference data produced by anystyle.io and improves its accuracy by disambiguating the publishing journal or agency and by classifying references into general groupings. See the referenceClassify vignette for an example using California Groundwater Sustainability Plan documents.
Run the following line in your console to install the package from GitHub:
devtools::install_github("govscienceuseR/referenceClassify")
This package expects users to begin with a data table of (potential) references that they would like to classify. It provides the following functionality.
Three functions, journal_match(), agency_match(), and conference_match(), look for exact matches against indices of journal, agency, and conference names. Each function returns a logical vector indicating, for each potential reference, whether there is an exact match in the corresponding index. The default journal and conference indices come from Scimago; the agency index was constructed by the package authors and covers US federal agencies along with a partial list of state-level agencies. Each function can append additional data to its index to tailor the tool. For example, if the documents are from the EU, or concern county-level agencies within a state, a new index of those agencies can be added.
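The matching idea can be sketched in base R. This is a minimal illustration, not the package's implementation: `exact_match_sketch()`, its `extra` argument, and the tiny `default_index` are hypothetical names and data invented for this example, and the lowercase/whitespace normalization is an assumption rather than documented behavior of journal_match().

```r
# Hypothetical stand-in for a default journal index (not the package's data).
default_index <- c("Water Resources Research", "Journal of Hydrology")

# Illustrative helper: returns one logical per potential reference,
# TRUE when the name exactly matches an index entry.
# `extra` mimics appending user-supplied names to the index.
exact_match_sketch <- function(x, index, extra = NULL) {
  full_index <- c(index, extra)
  # Assumption: compare after trimming whitespace and lowercasing.
  normalize <- function(s) tolower(trimws(s))
  normalize(x) %in% normalize(full_index)
}

containers <- c("Journal of Hydrology", "J. Hydrol.", "Ground Water Monthly")

# Default index only: the abbreviation and the unlisted title do not match.
exact_match_sketch(containers, default_index)
# [1]  TRUE FALSE FALSE

# Appending a custom entry extends what counts as a match.
exact_match_sketch(containers, default_index, extra = "Ground Water Monthly")
# [1]  TRUE FALSE  TRUE
```

Note that the abbreviated name stays unmatched in both calls, which is the gap the disambiguation step is meant to close.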
This should maybe go into the reference_clean() function instead?
The journal_disambig() function takes one argument: a column containing (potential) journal names. If you are working through the govscienceuseR workflow, this column is named 'container'. The column is first matched against an official index of journal abbreviations; the remaining unmatched names are then run through a series of regular expressions that detect typical or observed journal abbreviations. The output is a new vector in which each name matched to an abbreviation has been replaced with its complete journal or agency name. After running journal_disambig(), it can be useful to rerun the reference matching functions (Step 1) to see how many exact matches the disambiguation gained.
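The two-pass approach described here (index lookup first, regular expressions for whatever remains) can be sketched as follows. Everything in this block is an assumption made for illustration: `disambig_sketch()`, `abbrev_index`, and `regex_rules` are hypothetical, and the actual index and patterns used by journal_disambig() are not shown in this README.

```r
# Hypothetical abbreviation index (abbreviation -> full name).
abbrev_index <- c(
  "J. Hydrol."         = "Journal of Hydrology",
  "Water Resour. Res." = "Water Resources Research"
)

# Hypothetical regex rules for abbreviations missing from the index.
regex_rules <- c(
  "^Environ\\.? Sci\\.? Technol\\.?$" = "Environmental Science & Technology"
)

disambig_sketch <- function(container) {
  out <- container

  # Pass 1: exact lookup against the abbreviation index.
  hit <- match(container, names(abbrev_index))
  matched <- !is.na(hit)
  out[matched] <- unname(abbrev_index[hit[matched]])

  # Pass 2: regular expressions, applied only to still-unmatched names.
  for (pattern in names(regex_rules)) {
    todo <- !matched & grepl(pattern, out)
    out[todo] <- regex_rules[[pattern]]
    matched <- matched | todo
  }
  out
}

container <- c("J. Hydrol.", "Environ Sci Technol", "Nature")
disambig_sketch(container)
# [1] "Journal of Hydrology"  "Environmental Science & Technology"  "Nature"
```

Names that match neither the index nor a pattern pass through unchanged, so the result has the same length and order as the input and can replace the original column before the exact-match functions are rerun.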
The regex_classify() function uses regular expressions to sort matched items into classes...
The keras_classify() function feeds the data into a keras model trained on data from environmental impact assessments. The input is a data.table with the following columns: author, title, publisher, and journal. These columns are used as inputs to a single-feature, multi-class model that predicts the likelihood that a reference is to an agency, a journal, a conference, or is not a reference.
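To make "single-feature, multi-class" concrete, here is a rough sketch of the data shapes involved. It does not call keras or the package. It assumes, without confirmation from this README, that the four columns are concatenated into one text string per reference; `combine_fields()` is a hypothetical helper, a plain data.frame is used to avoid dependencies, and the probability matrix holds made-up placeholder values standing in for model output.

```r
# Toy input with the four expected columns (values invented for illustration).
refs <- data.frame(
  author    = c("Smith, J.", NA),
  title     = c("Groundwater recharge in alluvial basins",
                "Annual water quality report"),
  publisher = c(NA, "California Department of Water Resources"),
  journal   = c("Journal of Hydrology", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the columns into one text feature per row,
# treating missing fields as empty strings (an assumption).
combine_fields <- function(df, cols = c("author", "title", "publisher", "journal")) {
  parts <- lapply(df[cols], function(x) ifelse(is.na(x), "", x))
  trimws(gsub("\\s+", " ", do.call(paste, unname(parts))))
}

text_feature <- combine_fields(refs)

# A multi-class model yields one probability per class for each reference.
# These numbers are placeholders, not real predictions.
classes <- c("agency", "journal", "conference", "not_reference")
probs <- matrix(
  c(0.05, 0.90, 0.03, 0.02,
    0.80, 0.10, 0.05, 0.05),
  nrow = 2, byrow = TRUE, dimnames = list(NULL, classes)
)

# Pick the most likely class per reference.
predicted <- classes[max.col(probs, ties.method = "first")]
predicted
# [1] "journal" "agency"
```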