diff --git a/README.md b/README.md index 3bd4f14..3a14186 100644 --- a/README.md +++ b/README.md @@ -1,225 +1,366 @@ # fairscape-cli + A utility for packaging objects and validating metadata for FAIRSCAPE. ---- -**Documentation**: [https://fairscape.github.io/fairscape-cli/](https://fairscape.github.io/fairscape-cli/) --- -## Features +## **Documentation**: [https://fairscape.github.io/fairscape-cli/](https://fairscape.github.io/fairscape-cli/) -fairscape-cli provides a Command Line Interface (CLI) that allows the client side to create: +## Features -* [RO-Crate](https://www.researchobject.org/ro-crate/) - a light-weight approach to packaging research data with their metadata. The CLI allows users to: - * Create Research Object Crates (RO-Crates) - * Add (transfer) digital objects to the RO-Crate - * Register metadata of the objects - * Describe the schema of tabular dataset objects as metadata and perform validation. +fairscape-cli provides a Command Line Interface (CLI) to create, manage, and publish scientific data packages: + +- **RO-Crate Management:** Create and manipulate [RO-Crate](https://www.researchobject.org/ro-crate/) packages locally. + - Initialize RO-Crates in new or existing directories. + - Add data, software, and computation metadata. + - Copy files into the crate structure alongside metadata registration. +- **Schema Handling:** Define, infer, and validate data schemas (Tabular, HDF5). + - Create schema definition files. + - Add properties with constraints. + - Infer schemas directly from data files. + - Validate data files against specified schemas. + - Register schemas within RO-Crates. +- **Data Import:** Fetch data from external sources and convert them into RO-Crates. + - Import NCBI BioProjects. + - Convert Portable Encapsulated Projects (PEPs) to RO-Crates. +- **Build Artifacts:** Generate derived outputs from RO-Crates. + - Create detailed HTML datasheets summarizing crate contents.
+ - Generate provenance evidence graphs (JSON and HTML). +- **Release Management:** Organize multiple related RO-Crates into a cohesive release package. + - Initialize a release structure. + - Automatically link sub-crates and propagate metadata. + - Build a top-level datasheet for the release. +- **Publishing:** Publish RO-Crate metadata to external repositories. + - Upload RO-Crate directories or zip files to Fairscape. + - Create datasets on Dataverse instances. + - Mint or update DOIs on DataCite. ## Requirements Python 3.8+ ## Installation + ```console $ pip install fairscape-cli ``` -## Minimal example +## Command Overview + +The CLI is organized into several top-level commands: + + rocrate: Core local RO-Crate manipulation (create, add files/metadata). + + schema: Operations on data schemas (create, infer, add properties, add to crate). -### Basic commands + validate: Validate data against schemas. -* Show all commands, arguments, and options + import: Fetch external data into RO-Crate format (e.g., bioproject, pep). + + build: Generate outputs from RO-Crates (e.g., datasheet, evidence-graph). + + release: Manage multi-part RO-Crate releases (e.g., create, build). + + publish: Publish RO-Crates to repositories (e.g., fairscape, dataverse, doi). 
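+
+These command groups are designed to compose. The sketch below strings several of them together into a single workflow; the paths and names are illustrative placeholders, and the elided options (`...`) are spelled out in the full examples elsewhere in this README:
+
+```console
+$ fairscape-cli rocrate create --name "My Crate" ... ./my_crate
+$ fairscape-cli rocrate add dataset --name "Raw Data" ... ./my_crate
+$ fairscape-cli build datasheet ./my_crate
+$ fairscape-cli publish fairscape --rocrate ./my_crate ...
+```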
+ +Use --help for details on any command or subcommand: ```console $ fairscape-cli --help +$ fairscape-cli rocrate --help +$ fairscape-cli rocrate add --help +$ fairscape-cli schema create --help ``` -* Create an RO-Crate in a specified directory +## Examples + +### Creating an RO-Crate + +Create an RO-Crate in a specified directory: ```console $ fairscape-cli rocrate create \ - --name "test rocrate" \ - --description "Example RO Crate for Tests" \ - --organization-name "UVA" \ - --project-name "B2AI" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - "./test_rocrate" + --name "My Analysis Crate" \ + --description "RO-Crate containing analysis scripts and results" \ + --organization-name "My Org" \ + --project-name "My Project" \ + --keywords "analysis" \ + --keywords "python" \ + --author "Jane Doe" \ + --version "1.1.0" \ + ./my_analysis_crate ``` -* Create an RO-Crate in the current working directory +Initialize an RO-Crate in the current working directory: ```console +# Navigate to an empty directory first if desired +# mkdir my_analysis_crate && cd my_analysis_crate + $ fairscape-cli rocrate init \ - --name "test rocrate" \ - --description "Example RO Crate for Tests" \ - --organization-name "UVA" \ - --project-name "B2AI" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" + --name "My Analysis Crate" \ + --description "RO-Crate containing analysis scripts and results" \ + --organization-name "My Org" \ + --project-name "My Project" \ + --keywords "analysis" \ + --keywords "python" ``` -* Add a dataset to the RO-Crate +### Adding Content and Metadata to an RO-Crate + +These commands support adding both the file and its metadata (add) or just the metadata (register). 
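+
+In both cases the object is recorded as an entity in the crate's `ro-crate-metadata.json`. As a rough illustration only (abridged, following general RO-Crate conventions rather than fairscape-cli's exact output, with a hypothetical identifier), a registered dataset entry looks like:
+
+```json
+{
+  "@id": "ark:59852/dataset-raw-measurements-xxxx",
+  "@type": "Dataset",
+  "name": "Raw Measurements",
+  "description": "Raw sensor measurements from Experiment A."
+}
+```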
+ +Add a dataset file and its metadata: ```console $ fairscape-cli rocrate add dataset \ - --name "AP-MS embeddings" \ - --author "Krogan lab (https://kroganlab.ucsf.edu/krogan-lab)" \ - --version "1.0" \ - --date-published "2021-04-23" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - --data-format "CSV" \ - --source-filepath "./tests/data/APMS_embedding_MUSIC.csv" \ - --destination-filepath "./test_rocrate/APMS_embedding_MUSIC.csv" \ - "./test_rocrate" + --name "Raw Measurements" \ + --author "John Smith" \ + --version "1.0" \ + --date-published "2023-10-27" \ + --description "Raw sensor measurements from Experiment A." \ + --keywords "raw-data" \ + --keywords "sensors" \ + --data-format "csv" \ + --source-filepath "./source_data/measurements.csv" \ + --destination-filepath "data/measurements.csv" \ + ./my_analysis_crate ``` -* Add a software to the RO-Crate +Add a software script file and its metadata: ```console $ fairscape-cli rocrate add software \ - --name "calibrate pairwise distance" \ - --author "Qin, Y." \ - --version "1.0" \ - --description "script written in python to calibrate pairwise distance." \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - --file-format "py" \ - --source-filepath "./tests/data/calibrate_pairwise_distance.py" \ - --destination-filepath "./test_rocrate/calibrate_pairwise_distance.py" \ - --date-modified "2021-04-23" \ - "./test_rocrate" + --name "Analysis Script" \ + --author "Jane Doe" \ + --version "1.1.0" \ + --description "Python script for processing raw measurements." 
\ + --keywords "analysis" \ + --keywords "python" \ + --file-format "py" \ + --source-filepath "./scripts/process_data.py" \ + --destination-filepath "scripts/process_data.py" \ + ./my_analysis_crate ``` -* Register a computation to the RO-Crate +Register computation metadata (metadata only): ```console +# Assuming the script and dataset were added previously and have GUIDs: +# Dataset GUID: ark:59852/dataset-raw-measurements-xxxx +# Software GUID: ark:59852/software-analysis-script-yyyy + $ fairscape-cli rocrate register computation \ - --name "calibrate pairwise distance" \ - --run-by "Qin, Y." \ - --date-created "2021-05-23" \ - --description "Average the predicted proximities" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - "./test_rocrate" + --name "Data Processing Run" \ + --run-by "Jane Doe" \ + --date-created "2023-10-27T14:30:00Z" \ + --description "Execution of the analysis script on the raw measurements." \ + --keywords "processing" \ + --used-dataset "ark:59852/dataset-raw-measurements-xxxx" \ + --used-software "ark:59852/software-analysis-script-yyyy" \ + --generated "ark:59852/dataset-processed-results-zzzz" \ + ./my_analysis_crate + +# Note: You would typically register the generated dataset ('processed-results') separately. +``` + +Register dataset metadata (metadata only, file assumed present or external): + +```console +$ fairscape-cli rocrate register dataset \ + --name "Processed Results" \ + --guid "ark:59852/dataset-processed-results-zzzz" \ + --author "Jane Doe" \ + --version "1.0" \ + --description "Processed results from the analysis script." 
\ + --keywords "results" \ + --data-format "csv" \ + --filepath "results/processed.csv" \ + --generated-by "ark:59852/computation-data-processing-run-wwww" \ + ./my_analysis_crate ``` -* Create a schema +### Schema Management + +Create a tabular schema definition file: ```console -$ fairscape-cli schema create-tabular \ - --name 'APMS Embedding Schema' \ - --description 'Tabular format for APMS music embeddings from PPI networks from the music pipeline from the B2AI Cellmaps for AI project' \ +$ fairscape-cli schema create \ + --name 'Measurement Schema' \ + --description 'Schema for raw sensor measurements' \ + --schema-type tabular \ --separator ',' \ - --header False \ - ./schema_apms_music_embedding.json + --header true \ + ./measurement_schema.json ``` -* Add a string property +Add properties to the tabular schema file: ```console +# Add a string property (column 0) $ fairscape-cli schema add-property string \ - --name 'Experiment Identifier' \ + --name 'Timestamp' \ --index 0 \ - --description 'Identifier for the APMS experiment responsible for generating the raw PPI used to create this embedding vector' \ - --pattern '^APMS_[0-9]*$' \ - ./schema_apms_music_embedding.json + --description 'Measurement time (ISO8601)' \ + ./measurement_schema.json + +# Add a number property (column 1) +$ fairscape-cli schema add-property number \ + --name 'Value' \ + --index 1 \ + --description 'Sensor reading' \ + --minimum 0 \ + ./measurement_schema.json ``` -* Add annother string property +Infer a schema from an existing data file: ```console -$ fairscape-cli schema add-property string \ - --name 'Gene Symbol' \ - --index 1 \ - --description 'Gene Symbol for the APMS bait protien' \ - --pattern '^[A-Za-z0-9\-]*$' \ - --value-url 'http://edamontology.org/data_1026' \ - ./schema_apms_music_embedding.json +$ fairscape-cli schema infer \ + --name "Inferred Results Schema" \ + --description "Schema inferred from processed results" \ + ./my_analysis_crate/results/processed.csv \ + 
./processed_schema.json ``` -* Add an array property +Add an existing schema file to an RO-Crate: ```console -$ fairscape-cli schema add-property array \ - --name 'MUSIC APMS Embedding' \ - --index '2::' \ - --description 'Embedding Vector values for genes determined by running node2vec on APMS PPI networks. Vector has 1024 values for each bait protien' \ - --items-datatype 'number' \ - --unique-items False \ - --min-items 1024 \ - --max-items 1024 \ - ./schema_apms_music_embedding.json +$ fairscape-cli schema add-to-crate \ + ./measurement_schema.json \ + ./my_analysis_crate ``` -* Show a successful validation of the schema against the dataset +### Validation + +Validate a data file against a schema file: ```console -$ fairscape-cli schema validate \ - --data ./examples/schemas/MUSIC_embedding/APMS_embedding_MUSIC.csv \ - --schema ./examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json +# Successful validation +$ fairscape-cli validate schema \ + --schema-path ./measurement_schema.json \ + --data-path ./my_analysis_crate/data/measurements.csv + +# Example failure +$ fairscape-cli validate schema \ + --schema-path ./measurement_schema.json \ + --data-path ./source_data/measurements_invalid.csv ``` -* Show an unsuccessful validation of the schema against the dataset +### Importing Data + +Import an NCBI BioProject into a new RO-Crate: ```console -$ fairscape-cli schema validate \ - --data examples/schemas/MUSIC_embedding/APMS_embedding_corrupted.csv \ - --schema examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json +$ fairscape-cli import bioproject \ + --accession PRJNA123456 \ + --author "Importer Name" \ + --output-dir ./bioproject_prjna123456_crate \ + --crate-name "Imported BioProject PRJNA123456" ``` -* Validate using default schemas +Convert a PEP project to an RO-Crate: ```console -# validate imageloader files -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/imageloader/samplescopy.csv" \ - --schema 
"ark:59852/schema-cm4ai-imageloader-samplescopy" - -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/imageloader/uniquecopy.csv" \ - --schema "ark:59852/schema-cm4ai-imageloader-uniquecopy" - -# validate image embedding outputs -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/image_embedding/image_emd.tsv" \ - --schema "ark:59852/schema-cm4ai-image-embedding-image-emd" - -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/image_embedding/labels_prob.tsv" \ - --schema "ark:59852/schema-cm4ai-image-embedding-labels-prob" +$ fairscape-cli import pep \ + ./path/to/my_pep_project \ + --output-path ./my_pep_rocrate \ + --crate-name "My PEP Project Crate" +``` + +### Building Outputs -# validate apsm loader input -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_gene_node_attributes.tsv" \ - --schema "ark:59852/schema-cm4ai-apmsloader-gene-node-attributes" +Generate an HTML datasheet for an RO-Crate: -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_edgelist.tsv" \ - --schema "ark:59852/schema-cm4ai-apmsloader-ppi-edgelist" +```console +$ fairscape-cli build datasheet ./my_analysis_crate +# Output will be ./my_analysis_crate/ro-crate-datasheet.html by default +``` -# validate apms embedding -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/apms_embedding/ppi_emd.tsv" \ - --schema "ark:59852/schema-cm4ai-apms-embedding" +Generate a provenance graph for a specific item within the crate: -# validate coembedding -$ fairscape-cli schema validate \ - --data "examples/schemas/cm4ai-rocrates/coembedding/coembedding_emd.tsv" \ - --schema "ark:59852/schema-cm4ai-coembedding" +```console +# Assuming 'ark:59852/dataset-processed-results-zzzz' is the item of interest +$ fairscape-cli build evidence-graph \ + ./my_analysis_crate \ + ark:59852/dataset-processed-results-zzzz \ + --output-json 
./my_analysis_crate/prov/results_prov.json \ + --output-html ./my_analysis_crate/prov/results_prov.html ``` -## Contribution +### Release Management + +Create the structure for a multi-part release: + +```console +$ fairscape-cli release create \ + --name "My Big Release Q4 2023" \ + --description "Combined release of Experiment A and Experiment B crates" \ + --organization-name "My Org" \ + --project-name "Overall Project" \ + --keywords "release" \ + --keywords "experiment-a" \ + --keywords "experiment-b" \ + --version "2.0" \ + --author "Release Manager" \ + --publisher "My Org Publishing" \ + ./my_big_release + +# Manually copy or move your individual RO-Crate directories (e.g., experiment_a_crate, experiment_b_crate) +# into the ./my_big_release directory now. +``` + +Build the release (link sub-crates, update metadata, generate datasheet): + +```console +$ fairscape-cli release build ./my_big_release +``` -If you'd like to request a feature or report a bug, please create a [GitHub Issue](https://github.com/fairscape/fairscape-cli/issues) using one of the templates provided. 
+### Publishing
+
+Upload an RO-Crate to Fairscape:
+
+```console
+# Ensure FAIRSCAPE_USERNAME and FAIRSCAPE_PASSWORD are set as environment variables or use options
+$ fairscape-cli publish fairscape \
+  --rocrate ./my_analysis_crate
+
+# Works with either directories or zip files
+$ fairscape-cli publish fairscape \
+  --rocrate ./my_analysis_crate.zip \
+  --username "your_username" \
+  --password "your_password" \
+  --api-url https://fairscape.example.edu/api
+```
+
+Publish RO-Crate metadata to Dataverse:
+
+```console
+# Ensure DATAVERSE_API_TOKEN is set as an environment variable or use --token
+$ fairscape-cli publish dataverse \
+  --rocrate ./my_analysis_crate/ro-crate-metadata.json \
+  --url https://my.dataverse.instance.edu \
+  --collection my_collection_alias \
+  --token "your_dataverse_api_token"
+```
+
+Mint a DOI using DataCite:
+
+```console
+# Ensure DATACITE_USERNAME and DATACITE_PASSWORD are set or use options
+$ fairscape-cli publish doi \
+  --rocrate ./my_analysis_crate/ro-crate-metadata.json \
+  --prefix 10.1234 \
+  --username MYORG.MYREPO \
+  --password "your_datacite_password" \
+  --event publish # or 'register' for draft
+```
+
+## Contribution
+
+If you'd like to request a feature or report a bug, please create a [GitHub Issue](https://github.com/fairscape/fairscape-cli/issues) using one of the templates provided.
 
 ## License
diff --git a/docs/commands/build.md b/docs/commands/build.md
new file mode 100644
index 0000000..deda41c
--- /dev/null
+++ b/docs/commands/build.md
@@ -0,0 +1,102 @@
+# Build Commands
+
+This document provides detailed information about the build commands available in fairscape-cli.
+
+## Overview
+
+The `build` command group provides operations for generating derived artifacts from RO-Crates. These artifacts include datasheets, visualizations, and evidence graphs that make the RO-Crate content more accessible and understandable.
+ +```bash +fairscape-cli build [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`datasheet`](#datasheet) - Generate an HTML datasheet for an RO-Crate +- [`evidence-graph`](#evidence-graph) - Generate a provenance graph for a specific ARK identifier + +## Command Details + +### `datasheet` + +Generate an HTML datasheet for an RO-Crate, providing a human-readable summary of its content. + +```bash +fairscape-cli build datasheet [OPTIONS] ROCRATE_PATH +``` + +**Arguments:** + +- `ROCRATE_PATH` - Path to the RO-Crate directory or metadata file [required] + +**Options:** + +- `--output PATH` - Output HTML file path (defaults to ro-crate-datasheet.html in crate directory) +- `--template-dir PATH` - Custom template directory +- `--published` - Indicate if the crate is considered published (may affect template rendering) + +**Example:** + +```bash +fairscape-cli build datasheet ./my_rocrate +``` + +This command: + +1. Reads the RO-Crate metadata +2. Processes any subcrates +3. Generates a comprehensive HTML datasheet +4. Saves the datasheet in the specified location (or default location) + +The datasheet includes: + +- General metadata (title, authors, description) +- Datasets included in the crate +- Software included in the crate +- Computations documented in the crate +- Provenance relationships between elements +- References to external resources +- Information about subcrates (if any) + +### `evidence-graph` + +Generate a provenance graph for a specific ARK identifier within an RO-Crate. 
+ +```bash +fairscape-cli build evidence-graph [OPTIONS] ROCRATE_PATH ARK_ID +``` + +**Arguments:** + +- `ROCRATE_PATH` - Path to the RO-Crate directory or metadata file [required] +- `ARK_ID` - ARK identifier for which to build the evidence graph [required] + +**Options:** + +- `--output-file PATH` - Path to save the JSON evidence graph (defaults to provenance-graph.json in the RO-Crate directory) + +**Example:** + +```bash +fairscape-cli build evidence-graph \ + ./my_rocrate \ + ark:59852/dataset-output-dataset-xDNPTmwoHl +``` + +This command: + +1. Reads the RO-Crate metadata +2. Identifies all relationships involving the specified ARK identifier +3. Builds a graph representing the provenance of the entity +4. Generates both JSON and HTML visualizations of the graph +5. Updates the RO-Crate metadata to reference the evidence graph + +The evidence graph shows: + +- Inputs used to create the entity +- Software used in the computations +- Computations that generated or used the entity +- Derived datasets or outputs +- All relevant metadata for each node in the graph + +The HTML visualization provides an interactive graph that can be viewed in a web browser, making it easy to explore the provenance of datasets, software, and computations in the RO-Crate. diff --git a/docs/commands/import.md b/docs/commands/import.md new file mode 100644 index 0000000..e21ba60 --- /dev/null +++ b/docs/commands/import.md @@ -0,0 +1,100 @@ +# Import Commands + +This document provides detailed information about the import commands available in fairscape-cli. + +## Overview + +The `import` command group provides operations for importing external data into RO-Crate format. These commands fetch data from external repositories and convert them to well-structured RO-Crates with appropriate metadata. 
+ +```bash +fairscape-cli import [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`bioproject`](#bioproject) - Import data from an NCBI BioProject +- [`pep`](#pep) - Import a Portable Encapsulated Project (PEP) + +## Command Details + +### `bioproject` + +Import data from an NCBI BioProject into an RO-Crate. + +```bash +fairscape-cli import bioproject [OPTIONS] +``` + +**Options:** + +- `--accession TEXT` - NCBI BioProject accession (e.g., PRJNA12345) [required] +- `--output-dir PATH` - Directory to create the RO-Crate in [required] +- `--author TEXT` - Author name to associate with generated metadata [required] +- `--api-key TEXT` - NCBI API key (optional) +- `--name TEXT` - Override the default RO-Crate name +- `--description TEXT` - Override the default RO-Crate description +- `--keywords TEXT` - Override the default RO-Crate keywords (can be used multiple times) +- `--license TEXT` - Override the default RO-Crate license URL +- `--version TEXT` - Override the default RO-Crate version +- `--organization-name TEXT` - Set the organization name for the RO-Crate +- `--project-name TEXT` - Set the project name for the RO-Crate + +**Example:** + +```bash +fairscape-cli import bioproject \ + --accession "PRJDB2884" \ + --output-dir "./bioproject_crate" \ + --author "Jane Smith" \ + --keywords "genomics" \ + --keywords "sequencing" +``` + +This command: + +1. Fetches metadata from the NCBI BioProject database +2. Creates an RO-Crate with the BioProject metadata +3. Registers datasets, samples, and other relevant data from the BioProject +4. Outputs the ARK identifier of the created RO-Crate + +### `pep` + +Import a Portable Encapsulated Project (PEP) into an RO-Crate. 
+ +```bash +fairscape-cli import pep [OPTIONS] PEP_PATH +``` + +**Arguments:** + +- `PEP_PATH` - Path to the PEP directory or config file [required] + +**Options:** + +- `--output-path PATH` - Path for the generated RO-Crate (defaults to PEP directory) +- `--name TEXT` - Name for the RO-Crate (overrides PEP metadata) +- `--description TEXT` - Description (overrides PEP metadata) +- `--author TEXT` - Author (overrides PEP metadata) +- `--organization-name TEXT` - Organization name +- `--project-name TEXT` - Project name +- `--keywords TEXT` - Keywords (overrides PEP metadata, can be used multiple times) +- `--license TEXT` - License URL (default: "https://creativecommons.org/licenses/by/4.0/") +- `--date-published TEXT` - Publication date +- `--version TEXT` - Version string (default: "1.0") + +**Example:** + +```bash +fairscape-cli import pep \ + ./my_pep_project \ + --output-path ./pep_rocrate \ + --author "John Doe" \ + --organization-name "University Example" \ + --project-name "My PEP Project" +``` + +This command: + +1. Reads the PEP project configuration +2. Creates an RO-Crate with metadata from the PEP +3. Outputs the ARK identifier of the created RO-Crate diff --git a/docs/commands/publish.md b/docs/commands/publish.md new file mode 100644 index 0000000..779b5f3 --- /dev/null +++ b/docs/commands/publish.md @@ -0,0 +1,168 @@ +# Publish Commands + +This document provides detailed information about the publish commands available in fairscape-cli. + +## Overview + +The `publish` command group provides operations for publishing RO-Crates to external repositories and registering persistent identifiers. These commands help make your research data FAIR (Findable, Accessible, Interoperable, and Reusable) by connecting it to wider research data ecosystems. 
+ +```bash +fairscape-cli publish [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`fairscape`](#fairscape) - Upload RO-Crate directory or zip file to Fairscape +- [`dataverse`](#dataverse) - Publish RO-Crate metadata as a new dataset to Dataverse +- [`doi`](#doi) - Mint or update a DOI on DataCite using RO-Crate metadata + +## Command Details + +### `fairscape` + +Upload an RO-Crate directory or zip file to a Fairscape repository. + +```bash +fairscape-cli publish fairscape [OPTIONS] +``` + +**Options:** + +- `--rocrate PATH` - Path to the RO-Crate directory or zip file [required] +- `--username TEXT` - Fairscape username (can also be set via FAIRSCAPE_USERNAME env var) [required] +- `--password TEXT` - Fairscape password (can also be set via FAIRSCAPE_PASSWORD env var) [required] +- `--api-url TEXT` - Fairscape API URL (default: "https://fairscape.net/api") + +**Example:** + +```bash +fairscape-cli publish fairscape \ + --rocrate ./my_rocrate \ + --username "your_username" \ + --password "your_password" \ + --api-url "https://fairscape.example.org/api" +``` + +This command: + +1. Authenticates with the Fairscape repository +2. Uploads the RO-Crate directory or zip file +3. Registers the metadata in the repository +4. Returns a URL to access the published RO-Crate + +### `dataverse` + +Publish RO-Crate metadata as a new dataset to a Dataverse repository. 
+
+```bash
+fairscape-cli publish dataverse [OPTIONS]
+```
+
+**Options:**
+
+- `--rocrate PATH` - Path to the ro-crate-metadata.json file [required]
+- `--url TEXT` - Base URL of the target Dataverse instance (e.g., "https://dataverse.example.edu") [required]
+- `--collection TEXT` - Alias of the target Dataverse collection to publish into [required]
+- `--token TEXT` - Dataverse API token (can also be set via DATAVERSE_API_TOKEN env var) [required]
+- `--authors-csv PATH` - Optional CSV file with author details (name, affiliation, orcid). Requires "name" column header.
+
+**Example:**
+
+```bash
+fairscape-cli publish dataverse \
+  --rocrate ./my_rocrate/ro-crate-metadata.json \
+  --url "https://dataverse.example.edu" \
+  --collection "my_collection" \
+  --token "your_dataverse_api_token"
+```
+
+This command:
+
+1. Reads the RO-Crate metadata
+2. Transforms it into Dataverse dataset metadata
+3. Creates a new dataset in the specified Dataverse collection
+4. Returns the DOI of the created dataset
+
+### `doi`
+
+Mint or update a DOI on DataCite using RO-Crate metadata.
+
+```bash
+fairscape-cli publish doi [OPTIONS]
+```
+
+**Options:**
+
+- `--rocrate PATH` - Path to the ro-crate-metadata.json file [required]
+- `--prefix TEXT` - Your DataCite DOI prefix (e.g., "10.1234") [required]
+- `--username TEXT` - DataCite API username (repository ID, e.g., "MEMBER.REPO") (can use DATACITE_USERNAME env var) [required]
+- `--password TEXT` - DataCite API password (can use DATACITE_PASSWORD env var) [required]
+- `--api-url TEXT` - DataCite API URL (default: "https://api.datacite.org", use "https://api.test.datacite.org" for testing)
+- `--event TEXT` - DOI event type: 'publish' (make public), 'register' (create draft), 'hide' (make findable but hide metadata) [default: "publish"]
+
+**Example:**
+
+```bash
+fairscape-cli publish doi \
+  --rocrate ./my_rocrate/ro-crate-metadata.json \
+  --prefix "10.1234" \
+  --username "MYORG.MYREPO" \
+  --password "your_datacite_password" \
+  --event "publish"
+```
+
+This command:
+
+1. Reads the RO-Crate metadata
+2. Transforms it into DataCite metadata
+3. Mints or updates a DOI on DataCite
+4. Returns the DOI URL
+
+## Working with DOIs
+
+When working with DOIs, keep in mind:
+
+1. **DOI States**:
+
+   - `register`: Creates a draft DOI that is not yet publicly resolvable
+   - `publish`: Makes the DOI and its metadata public and resolvable
+   - `hide`: Makes the DOI resolvable but hides its metadata
+
+2. **Testing**: Use the test DataCite API URL before working with the production system:
+
+   ```bash
+   --api-url "https://api.test.datacite.org"
+   ```
+
+3. **Updating**: To update an existing DOI, ensure the RO-Crate metadata contains the DOI in the `identifier` field.
+
+## Integrating with Dataverse
+
+After minting a DOI, you can update your RO-Crate metadata with the DOI and then publish to Dataverse:
+
+```bash
+# First mint a DOI
+fairscape-cli publish doi --rocrate ./my_rocrate/ro-crate-metadata.json ...
+
+# Then update your RO-Crate with the DOI
+# (This would typically be done programmatically)
+
+# Then publish to Dataverse
+fairscape-cli publish dataverse --rocrate ./my_rocrate/ro-crate-metadata.json ...
+```
+
+This workflow ensures your research data is both persistently identified and accessible through established research data repositories.
diff --git a/docs/commands/release.md b/docs/commands/release.md
new file mode 100644
index 0000000..5cfcfe0
--- /dev/null
+++ b/docs/commands/release.md
@@ -0,0 +1,117 @@
+# Release Commands
+
+This document provides detailed information about the release commands available in fairscape-cli.
+
+## Overview
+
+The `release` command group provides operations for creating and managing release packages that combine multiple RO-Crates. This allows you to organize related RO-Crates into a cohesive collection with unified metadata and documentation.
+
+```bash
+fairscape-cli release [COMMAND] [OPTIONS]
+```
+
+## Available Commands
+
+- [`build`](#build) - Build a release RO-Crate from a directory containing multiple RO-Crates
+
+## Command Details
+
+### `build`
+
+Build a release RO-Crate in a directory, scanning for and linking existing sub-RO-Crates. This creates a parent RO-Crate that references and contextualizes the sub-crates.
+ +```bash +fairscape-cli release build [OPTIONS] RELEASE_DIRECTORY +``` + +**Arguments:** + +- `RELEASE_DIRECTORY` - Directory where the release RO-Crate will be built [required] + +**Options:** + +- `--guid TEXT` - GUID for the parent release RO-Crate (generated if not provided) +- `--name TEXT` - Name for the parent release RO-Crate [required] +- `--organization-name TEXT` - Organization name associated with the release [required] +- `--project-name TEXT` - Project name associated with the release [required] +- `--description TEXT` - Description of the release RO-Crate [required] +- `--keywords TEXT` - Keywords for the release RO-Crate (can be used multiple times) [required] +- `--license TEXT` - License URL for the release (default: "https://creativecommons.org/licenses/by/4.0/") +- `--date-published TEXT` - Publication date (ISO format, defaults to current date) +- `--author TEXT` - Author(s) of the release (defaults to combined authors from subcrates) +- `--version TEXT` - Version of the release (default: "1.0") +- `--associated-publication TEXT` - Associated publications for the release (can be used multiple times) +- `--conditions-of-access TEXT` - Conditions of access for the release +- `--copyright-notice TEXT` - Copyright notice for the release +- `--doi TEXT` - DOI identifier for the release +- `--publisher TEXT` - Publisher of the release +- `--principal-investigator TEXT` - Principal investigator for the release +- `--contact-email TEXT` - Contact email for the release +- `--confidentiality-level TEXT` - Confidentiality level for the release +- `--citation TEXT` - Citation for the release +- `--funder TEXT` - Funder of the release +- `--usage-info TEXT` - Usage information for the release +- `--content-size TEXT` - Content size of the release +- `--completeness TEXT` - Completeness information for the release +- `--maintenance-plan TEXT` - Maintenance plan for the release +- `--intended-use TEXT` - Intended use of the release +- `--limitations TEXT` - 
Limitations of the release +- `--prohibited-uses TEXT` - Prohibited uses of the release +- `--potential-sources-of-bias TEXT` - Potential sources of bias in the release +- `--human-subject TEXT` - Human subject involvement information +- `--ethical-review TEXT` - Ethical review information +- `--additional-properties TEXT` - JSON string with additional property values +- `--custom-properties TEXT` - JSON string with additional properties for the parent crate + +**Example:** + +```bash +fairscape-cli release build ./my_release \ + --guid "ark:59852/example-release-2023" \ + --name "SRA Genomic Data Example Release - 2023" \ + --organization-name "Example Research Institute" \ + --project-name "Genomic Data Analysis Project" \ + --description "This dataset contains genomic data from multiple sources prepared as AI-ready datasets in RO-Crate format." \ + --keywords "Genomics" \ + --keywords "SRA" \ + --keywords "RNA-seq" \ + --license "https://creativecommons.org/licenses/by/4.0/" \ + --publisher "University Example Dataverse" \ + --principal-investigator "Dr. Example PI" \ + --contact-email "example@example.org" \ + --confidentiality-level "HL7 Unrestricted" \ + --funder "Example Agency" \ + --citation "Example Research Institute (2023). Genomic Data Example Release." +``` + +This command: + +1. Creates a new parent RO-Crate in the specified directory +2. Scans the directory for existing RO-Crates to include as subcrates +3. Links the subcrates to the parent crate +4. Combines metadata from subcrates and the provided options +5. Outputs the ARK identifier of the created release RO-Crate + +## Release Workflow + +A typical release workflow involves: + +1. **Create individual RO-Crates** for specific datasets, software, and computations +2. **Place these RO-Crates** in a common directory structure +3. **Build a release** using the `release build` command to create a parent RO-Crate +4. **Generate a datasheet** using the `build datasheet` command +5. 
**Publish the release** using the `publish` commands + +The parent release RO-Crate provides context and relationships between the individual RO-Crates, making it easier to understand and work with complex datasets that span multiple files, processes, and research objects. + +## Metadata Inheritance + +When building a release, metadata is handled in the following ways: + +- **Author information** is combined from all subcrates unless explicitly provided +- **Keywords** include both the specified keywords and those from subcrates +- **Version** defaults to "1.0" unless specified +- **License** defaults to CC-BY 4.0 unless specified +- **Publication date** defaults to the current date unless specified + +All other metadata must be explicitly provided through the command options. diff --git a/docs/commands/rocrate.md b/docs/commands/rocrate.md new file mode 100644 index 0000000..2b3c6a3 --- /dev/null +++ b/docs/commands/rocrate.md @@ -0,0 +1,335 @@ +# RO-Crate Commands + +This document provides detailed information about the RO-Crate commands available in fairscape-cli. + +## Overview + +The `rocrate` command group provides operations for creating and manipulating Research Object Crates (RO-Crates). RO-Crates are a lightweight approach to packaging research data with their metadata, making them more FAIR (Findable, Accessible, Interoperable, and Reusable). 
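Concretely, the packaging described above centers on a single `ro-crate-metadata.json` file: a JSON-LD document whose `@graph` lists the crate itself plus every registered dataset, software, and computation. The sketch below builds a deliberately simplified version of that structure to illustrate the shape; the exact fields fairscape-cli writes (contexts, ARK identifiers, EVI properties) differ, and the identifier shown is hypothetical.

```python
import json

# Simplified, illustrative ro-crate-metadata.json structure.
# The real file written by fairscape-cli carries additional
# contexts and properties; only the overall shape is shown here.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "test rocrate",
            "description": "Example RO Crate for Tests",
            "keywords": ["b2ai", "cm4ai"],
            "hasPart": [],
        }
    ],
}

# Registering an object appends its record to @graph and links it
# from the root entity via hasPart.
dataset = {
    "@id": "ark:59852/example-dataset",  # hypothetical identifier
    "@type": "Dataset",
    "name": "AP-MS embeddings",
}
crate["@graph"].append(dataset)
crate["@graph"][0]["hasPart"].append({"@id": dataset["@id"]})

print(json.dumps(crate, indent=2))
```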
+ +```bash +fairscape-cli rocrate [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`create`](#create) - Create a new RO-Crate in a specified directory +- [`init`](#init) - Initialize an RO-Crate in the current working directory +- [`register`](#register) - Add metadata to an existing RO-Crate + - [`dataset`](#register-dataset) - Register dataset metadata + - [`software`](#register-software) - Register software metadata + - [`computation`](#register-computation) - Register computation metadata + - [`subrocrate`](#register-subrocrate) - Register a new RO-Crate within an existing RO-Crate +- [`add`](#add) - Add a file to the RO-Crate and register its metadata + - [`dataset`](#add-dataset) - Add a dataset file and its metadata + - [`software`](#add-software) - Add a software file and its metadata + +## Command Details + +### `create` + +Create a new RO-Crate in a specified directory. + +```bash +fairscape-cli rocrate create [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the RO-Crate +- `--name TEXT` - Name of the RO-Crate [required] +- `--organization-name TEXT` - Name of the organization [required] +- `--project-name TEXT` - Name of the project [required] +- `--description TEXT` - Description of the RO-Crate [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--license TEXT` - License URL (default: "https://creativecommons.org/licenses/by/4.0/") +- `--date-published TEXT` - Publication date (ISO format) +- `--author TEXT` - Author name (default: "Unknown") +- `--version TEXT` - Version number (default: "1.0") +- `--associated-publication TEXT` - Associated publication +- `--conditions-of-access TEXT` - Conditions of access +- `--copyright-notice TEXT` - Copyright notice +- `--custom-properties TEXT` - JSON string with additional properties to include + +**Example:** + +```bash +fairscape-cli rocrate create \ + --name "test rocrate" \ + --description "Example RO Crate for 
Tests" \ + --organization-name "UVA" \ + --project-name "B2AI" \ + --keywords "b2ai" \ + --keywords "cm4ai" \ + "./test_rocrate" +``` + +### `init` + +Initialize an RO-Crate in the current working directory. + +```bash +fairscape-cli rocrate init [OPTIONS] +``` + +**Options:** +The same options as for the `create` command are available. The difference is that `init` creates the RO-Crate in the current working directory. + +**Example:** + +```bash +fairscape-cli rocrate init \ + --name "test rocrate" \ + --description "Example RO Crate for Tests" \ + --organization-name "UVA" \ + --project-name "B2AI" \ + --keywords "b2ai" \ + --keywords "cm4ai" +``` + +### `register` + +Add metadata to an existing RO-Crate. This command has several subcommands depending on the type of metadata to register. + +#### `register dataset` + +Register dataset metadata with an existing RO-Crate. + +```bash +fairscape-cli rocrate register dataset [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the dataset +- `--name TEXT` - Name of the dataset [required] +- `--author TEXT` - Author of the dataset [required] +- `--version TEXT` - Version of the dataset [required] +- `--description TEXT` - Description of the dataset [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--data-format TEXT` - Format of the dataset (e.g., csv, json) [required] +- `--filepath TEXT` - Path to the dataset file [required] +- `--url TEXT` - URL reference for the dataset +- `--date-published TEXT` - Publication date of the dataset (ISO format) +- `--schema TEXT` - Schema identifier for the dataset +- `--used-by TEXT` - Identifiers of computations that use this dataset (can be specified multiple times) +- `--derived-from TEXT` - Identifiers of datasets this one is derived from (can be specified multiple times) +- `--generated-by TEXT` - Identifiers of computations that generated this dataset (can be specified multiple times) +- 
`--summary-statistics-filepath TEXT` - Path to summary statistics file +- `--associated-publication TEXT` - Associated publication identifier +- `--additional-documentation TEXT` - Additional documentation +- `--custom-properties TEXT` - JSON string with additional properties to include + +**Example:** + +```bash +fairscape-cli rocrate register dataset \ + --name "AP-MS embeddings" \ + --author "Krogan lab" \ + --version "1.0" \ + --date-published "2023-04-23" \ + --description "APMS embeddings for each protein" \ + --keywords "proteomics" \ + --data-format "CSV" \ + --filepath "./test_rocrate/embeddings.csv" \ + "./test_rocrate" +``` + +#### `register software` + +Register software metadata with an existing RO-Crate. + +```bash +fairscape-cli rocrate register software [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the software +- `--name TEXT` - Name of the software [required] +- `--author TEXT` - Author of the software [required] +- `--version TEXT` - Version of the software [required] +- `--description TEXT` - Description of the software [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--file-format TEXT` - Format of the software (e.g., py, js) [required] +- `--url TEXT` - URL reference for the software +- `--date-modified TEXT` - Last modification date of the software (ISO format) +- `--filepath TEXT` - Path to the software file (relative to crate root) +- `--used-by-computation TEXT` - Identifiers of computations that use this software (can be specified multiple times) +- `--associated-publication TEXT` - Associated publication identifier +- `--additional-documentation TEXT` - Additional documentation +- `--custom-properties TEXT` - JSON string with additional properties + +**Example:** + +```bash +fairscape-cli rocrate register software \ + --name "calibrate pairwise distance" \ + --author "Qin, Y." 
\ + --version "1.0" \ + --description "script written in python to calibrate pairwise distance." \ + --keywords "b2ai" \ + --file-format "py" \ + --filepath "./test_rocrate/calibrate_pairwise_distance.py" \ + --date-modified "2023-04-23" \ + "./test_rocrate" +``` + +#### `register computation` + +Register computation metadata with an existing RO-Crate. + +```bash +fairscape-cli rocrate register computation [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the computation +- `--name TEXT` - Name of the computation [required] +- `--run-by TEXT` - Person or entity that ran the computation [required] +- `--command TEXT` - Command used to run the computation (string or JSON list) +- `--date-created TEXT` - Date the computation was run (ISO format) [required] +- `--description TEXT` - Description of the computation [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--used-software TEXT` - Software identifiers used by this computation (can be specified multiple times) +- `--used-dataset TEXT` - Dataset identifiers used by this computation (can be specified multiple times) +- `--generated TEXT` - Dataset/Software identifiers generated by this computation (can be specified multiple times) +- `--associated-publication TEXT` - Associated publication identifier +- `--additional-documentation TEXT` - Additional documentation +- `--custom-properties TEXT` - JSON string with additional properties + +**Example:** + +```bash +fairscape-cli rocrate register computation \ + --name "calibrate pairwise distance" \ + --run-by "Qin, Y." \ + --date-created "2023-05-23" \ + --description "Average the predicted proximities" \ + --keywords "b2ai" \ + --keywords "cm4ai" \ + "./test_rocrate" +``` + +#### `register subrocrate` + +Register a new RO-Crate within an existing RO-Crate directory. 
+ +```bash +fairscape-cli rocrate register subrocrate [OPTIONS] ROCRATE_PATH SUBROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the sub-crate +- `--name TEXT` - Name of the sub-crate [required] +- `--organization-name TEXT` - Name of the organization [required] +- `--project-name TEXT` - Name of the project [required] +- `--description TEXT` - Description of the sub-crate [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--author TEXT` - Author name (default: "Unknown") +- `--version TEXT` - Version number (default: "1.0") +- `--license TEXT` - License URL (default: "https://creativecommons.org/licenses/by/4.0/") + +**Example:** + +```bash +fairscape-cli rocrate register subrocrate \ + --name "Sub-Crate Example" \ + --organization-name "UVA" \ + --project-name "B2AI" \ + --description "A sub-crate within the main RO-Crate" \ + --keywords "sub-crate" \ + "./test_rocrate" "./test_rocrate/sub_crate" +``` + +### `add` + +Add a file to the RO-Crate and register its metadata. This command has several subcommands depending on the type of file to add. + +#### `add dataset` + +Add a dataset file and its metadata to an RO-Crate. 
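The difference from `register dataset` is that `add` also transfers the file into the crate directory; conceptually the operation is a file copy followed by the same metadata registration that `register` performs alone. The helper below is a simplified, hypothetical sketch of that two-step idea, not the actual fairscape-cli internals.

```python
import json
import shutil
from pathlib import Path

def add_dataset(crate_dir: str, source: str, destination: str,
                metadata: dict) -> None:
    """Illustrative only: copy a file into the crate, then record its metadata."""
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source, dest)  # the "add" step: transfer the object

    # The "register" step: append the object's record to the crate graph.
    meta_file = Path(crate_dir) / "ro-crate-metadata.json"
    doc = json.loads(meta_file.read_text())
    doc["@graph"].append(metadata)
    meta_file.write_text(json.dumps(doc, indent=2))
```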
+ +```bash +fairscape-cli rocrate add dataset [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the dataset +- `--name TEXT` - Name of the dataset [required] +- `--url TEXT` - URL reference for the dataset +- `--author TEXT` - Author of the dataset [required] +- `--version TEXT` - Version of the dataset [required] +- `--date-published TEXT` - Publication date of the dataset (ISO format) [required] +- `--description TEXT` - Description of the dataset [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--data-format TEXT` - Format of the dataset (e.g., csv, json) [required] +- `--source-filepath TEXT` - Path to the source dataset file [required] +- `--destination-filepath TEXT` - Path where the dataset file will be copied in the RO-Crate [required] +- `--summary-statistics-source TEXT` - Path to source summary statistics file +- `--summary-statistics-destination TEXT` - Path where summary statistics file will be copied +- `--used-by TEXT` - Identifiers of computations that use this dataset (can be specified multiple times) +- `--derived-from TEXT` - Identifiers of datasets this one is derived from (can be specified multiple times) +- `--generated-by TEXT` - Identifiers of computations that generated this dataset (can be specified multiple times) +- `--schema TEXT` - Schema identifier for the dataset +- `--associated-publication TEXT` - Associated publication identifier +- `--additional-documentation TEXT` - Additional documentation + +**Example:** + +```bash +fairscape-cli rocrate add dataset \ + --name "AP-MS embeddings" \ + --author "Krogan lab" \ + --version "1.0" \ + --date-published "2023-04-23" \ + --description "APMS embeddings for each protein" \ + --keywords "proteomics" \ + --data-format "CSV" \ + --source-filepath "./data/embeddings.csv" \ + --destination-filepath "./test_rocrate/embeddings.csv" \ + "./test_rocrate" +``` + +#### `add software` + +Add a software file and 
its metadata to an RO-Crate. + +```bash +fairscape-cli rocrate add software [OPTIONS] ROCRATE_PATH +``` + +**Options:** + +- `--guid TEXT` - Optional custom identifier for the software +- `--name TEXT` - Name of the software [required] +- `--author TEXT` - Author of the software [required] +- `--version TEXT` - Version of the software [required] +- `--description TEXT` - Description of the software [required] +- `--keywords TEXT` - Keywords (can be specified multiple times) [required] +- `--file-format TEXT` - Format of the software (e.g., py, js) [required] +- `--url TEXT` - URL reference for the software +- `--source-filepath TEXT` - Path to the source software file [required] +- `--destination-filepath TEXT` - Path where the software file will be copied in the RO-Crate [required] +- `--date-modified TEXT` - Last modification date of the software (ISO format) [required] +- `--used-by-computation TEXT` - Identifiers of computations that use this software (can be specified multiple times) +- `--associated-publication TEXT` - Associated publication identifier +- `--additional-documentation TEXT` - Additional documentation + +**Example:** + +```bash +fairscape-cli rocrate add software \ + --name "calibrate pairwise distance" \ + --author "Qin, Y." \ + --version "1.0" \ + --description "script written in python to calibrate pairwise distance." \ + --keywords "b2ai" \ + --file-format "py" \ + --source-filepath "./scripts/calibrate_pairwise_distance.py" \ + --destination-filepath "./test_rocrate/calibrate_pairwise_distance.py" \ + --date-modified "2023-04-23" \ + "./test_rocrate" +``` diff --git a/docs/commands/schema.md b/docs/commands/schema.md new file mode 100644 index 0000000..971dcf2 --- /dev/null +++ b/docs/commands/schema.md @@ -0,0 +1,246 @@ +# Schema Commands + +This document provides detailed information about the schema commands available in fairscape-cli. 
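As a rough mental model, a tabular schema produced by these commands is a JSON document that records the file layout (separator, header) together with typed, indexed column properties and their constraints. The sketch below is a hypothetical, simplified example of such a document; the exact field names fairscape-cli emits may differ.

```python
import json

# Hypothetical, simplified tabular schema document. Real fairscape-cli
# schemas may use different field names and extra constraint keywords.
schema = {
    "name": "APMS Embedding Schema",
    "description": "Tabular format for APMS MuSIC embeddings",
    "separator": ",",
    "header": False,
    "properties": {
        "Gene Symbol": {
            "type": "string",
            "index": 1,
            "pattern": "^[A-Za-z0-9\\-]*$",
        },
        "Embedding": {
            "type": "array",
            "index": "2::",  # slice syntax: column 2 through the last column
            "items": {"type": "number"},
            "minItems": 1024,
            "maxItems": 1024,
        },
    },
}

print(json.dumps(schema, indent=2))
```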
+ +## Overview + +The `schema` command group provides operations for creating, modifying, and working with data schemas. Schemas describe the structure and constraints of datasets, enabling validation and improved interoperability. + +```bash +fairscape-cli schema [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`create-tabular`](#create-tabular) - Create a new tabular schema definition +- [`add-property`](#add-property) - Add a property to an existing schema + - [`string`](#add-property-string) - Add a string property + - [`number`](#add-property-number) - Add a number property + - [`integer`](#add-property-integer) - Add an integer property + - [`boolean`](#add-property-boolean) - Add a boolean property + - [`array`](#add-property-array) - Add an array property +- [`infer`](#infer) - Infer a schema from a data file +- [`add-to-crate`](#add-to-crate) - Add a schema to an RO-Crate + +## Command Details + +### `create-tabular` + +Create a new tabular schema definition. + +```bash +fairscape-cli schema create-tabular [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the schema [required] +- `--description TEXT` - Description of the schema [required] +- `--guid TEXT` - Optional custom identifier for the schema +- `--separator TEXT` - Field separator character (e.g., `,` for CSV) [required] +- `--header BOOLEAN` - Whether the data file has a header row (default: False) + +**Example:** + +```bash +fairscape-cli schema create-tabular \ + --name 'APMS Embedding Schema' \ + --description 'Tabular format for APMS music embeddings' \ + --separator ',' \ + --header False \ + ./schema_apms_music_embedding.json +``` + +### `add-property` + +This command group allows you to add different types of properties to an existing schema. + +#### `add-property string` + +Add a string property to a schema. 
+ +```bash +fairscape-cli schema add-property string [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the property [required] +- `--index INTEGER` - Column index in the data (0-based) [required] +- `--description TEXT` - Description of the property [required] +- `--value-url TEXT` - URL to a vocabulary term +- `--pattern TEXT` - Regular expression pattern for validation + +**Example:** + +```bash +fairscape-cli schema add-property string \ + --name 'Gene Symbol' \ + --index 1 \ + --description 'Gene Symbol for the APMS bait protein' \ + --pattern '^[A-Za-z0-9\-]*$' \ + --value-url 'http://edamontology.org/data_1026' \ + ./schema_apms_music_embedding.json +``` + +#### `add-property number` + +Add a number property to a schema. + +```bash +fairscape-cli schema add-property number [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the property [required] +- `--index INTEGER` - Column index in the data (0-based) [required] +- `--description TEXT` - Description of the property [required] +- `--maximum FLOAT` - Maximum allowed value +- `--minimum FLOAT` - Minimum allowed value +- `--value-url TEXT` - URL to a vocabulary term + +**Example:** + +```bash +fairscape-cli schema add-property number \ + --name 'Measurement' \ + --index 2 \ + --description 'Sensor reading in units of X' \ + --minimum 0.0 \ + --maximum 100.0 \ + ./schema_sensor_data.json +``` + +#### `add-property integer` + +Add an integer property to a schema. 
+ +```bash +fairscape-cli schema add-property integer [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the property [required] +- `--index INTEGER` - Column index in the data (0-based) [required] +- `--description TEXT` - Description of the property [required] +- `--maximum INTEGER` - Maximum allowed value +- `--minimum INTEGER` - Minimum allowed value +- `--value-url TEXT` - URL to a vocabulary term + +**Example:** + +```bash +fairscape-cli schema add-property integer \ + --name 'Count' \ + --index 3 \ + --description 'Count of observations' \ + --minimum 0 \ + ./schema_count_data.json +``` + +#### `add-property boolean` + +Add a boolean property to a schema. + +```bash +fairscape-cli schema add-property boolean [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the property [required] +- `--index INTEGER` - Column index in the data (0-based) [required] +- `--description TEXT` - Description of the property [required] +- `--value-url TEXT` - URL to a vocabulary term + +**Example:** + +```bash +fairscape-cli schema add-property boolean \ + --name 'IsValid' \ + --index 4 \ + --description 'Whether the observation is valid' \ + ./schema_validation_data.json +``` + +#### `add-property array` + +Add an array property to a schema. 
+ +```bash +fairscape-cli schema add-property array [OPTIONS] SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name of the property [required] +- `--index TEXT` - Column index or range in the data (e.g., "5" or "2::") [required] +- `--description TEXT` - Description of the property [required] +- `--value-url TEXT` - URL to a vocabulary term +- `--items-datatype TEXT` - Datatype of items in the array (`string`, `number`, `integer`, `boolean`) [required] +- `--min-items INTEGER` - Minimum number of items in the array +- `--max-items INTEGER` - Maximum number of items in the array +- `--unique-items BOOLEAN` - Whether items must be unique + +**Example:** + +```bash +fairscape-cli schema add-property array \ + --name 'MUSIC APMS Embedding' \ + --index '2::' \ + --description 'Embedding Vector values' \ + --items-datatype 'number' \ + --unique-items False \ + --min-items 1024 \ + --max-items 1024 \ + ./schema_apms_music_embedding.json +``` + +### `infer` + +Infer a schema from a data file. + +```bash +fairscape-cli schema infer [OPTIONS] INPUT_FILE SCHEMA_FILE +``` + +**Options:** + +- `--name TEXT` - Name for the schema [required] +- `--description TEXT` - Description for the schema [required] +- `--guid TEXT` - Optional custom identifier for the schema +- `--rocrate-path PATH` - Optional path to an RO-Crate to append the schema to + +**Example:** + +```bash +fairscape-cli schema infer \ + --name 'Output Dataset Schema' \ + --description 'Inferred schema for output data' \ + --rocrate-path ./my_rocrate \ + ./my_rocrate/output.csv \ + ./my_rocrate/output_schema.json +``` + +### `add-to-crate` + +Add a schema to an RO-Crate. 
+ +```bash +fairscape-cli schema add-to-crate ROCRATE_PATH SCHEMA_FILE +``` + +**Arguments:** + +- `ROCRATE_PATH` - Path to the RO-Crate to add the schema to +- `SCHEMA_FILE` - Path to the schema file + +**Example:** + +```bash +fairscape-cli schema add-to-crate \ + ./my_rocrate \ + ./schema_apms_music_embedding.json +``` diff --git a/docs/commands/validate.md b/docs/commands/validate.md new file mode 100644 index 0000000..9cf73e2 --- /dev/null +++ b/docs/commands/validate.md @@ -0,0 +1,88 @@ +# Validation Commands + +This document provides detailed information about the validation commands available in fairscape-cli. + +## Overview + +The `validate` command group provides operations for validating data against schemas. This ensures that datasets conform to their expected structure and constraints. + +```bash +fairscape-cli validate [COMMAND] [OPTIONS] +``` + +## Available Commands + +- [`schema`](#schema) - Validate a dataset against a schema definition + +## Command Details + +### `schema` + +Validate a dataset against a schema definition. 
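Conceptually, tabular validation parses each cell and then checks the parsed value against the schema's constraints (type, pattern, numeric range); failures in the first step surface as parsing errors and failures in the second as validation errors. The stdlib-only function below sketches that per-cell check for string and number properties; it is illustrative, not the actual implementation.

```python
import re

def check_value(raw: str, prop: dict):
    """Illustrative per-cell check for string/number properties:
    parse first, then validate constraints. Returns the error
    category, or None if the value conforms."""
    # Parsing step: failure here corresponds to a ParsingError.
    if prop["type"] == "number":
        try:
            value = float(raw)
        except ValueError:
            return "ParsingError"
    else:
        value = raw

    # Validation step: failure here corresponds to a ValidationError.
    if prop["type"] == "string" and "pattern" in prop:
        if not re.fullmatch(prop["pattern"], value):
            return "ValidationError"
    if "minimum" in prop and value < prop["minimum"]:
        return "ValidationError"
    if "maximum" in prop and value > prop["maximum"]:
        return "ValidationError"
    return None
```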
+ +```bash +fairscape-cli validate schema [OPTIONS] +``` + +**Options:** + +- `--schema TEXT` - Path to the schema file or ARK identifier [required] +- `--data TEXT` - Path to the data file to validate [required] + +**Example:** + +```bash +fairscape-cli validate schema \ + --schema ./schema_apms_music_embedding.json \ + --data ./APMS_embedding_MUSIC.csv +``` + +When validation succeeds, you'll see: + +``` +Validation Success +``` + +If validation fails, you'll see a table of errors: + +``` ++-----+-----------------+----------------+-------------------------------------------------------+ +| row | error_type | failed_keyword | message | ++-----+-----------------+----------------+-------------------------------------------------------+ +| 3 | ParsingError | None | ValueError: Failed to Parse Attribute embed for Row 3 | +| 4 | ParsingError | None | ValueError: Failed to Parse Attribute embed for Row 4 | +| 0 | ValidationError | pattern | 'APMS_A' does not match '^APMS_[0-9]*$' | ++-----+-----------------+----------------+-------------------------------------------------------+ +``` + +## Error Types + +Errors are categorized into two main types: + +1. **ParsingError**: Occurs when the data cannot be parsed according to the schema structure. This often happens when: + + - The number of columns doesn't match the schema + - A value cannot be converted to the expected datatype + +2. 
**ValidationError**: Occurs when the data can be parsed but fails validation constraints like: + - String values not matching the specified pattern + - Numeric values outside the min/max range + - Array length not within specified bounds + +## Working with Different File Types + +The validation command automatically detects the file type based on its extension: + +- **CSV/TSV files**: Tabular validation with field separators +- **Parquet files**: Tabular validation with columnar storage +- **HDF5 files**: Hierarchical validation with nested structures + +## Using ARK Identifiers for Schemas + +Instead of providing a file path, you can reference a schema by its ARK identifier if it's registered in a FAIRSCAPE repository: + +```bash +fairscape-cli validate schema \ + --schema "ark:59852/schema-cm4ai-image-embedding-image-emd" \ + --data "examples/schemas/cm4ai-rocrates/image_embedding/image_emd.tsv" +``` diff --git a/docs/getting-started.md b/docs/getting-started.md deleted file mode 100644 index 5ace5f2..0000000 --- a/docs/getting-started.md +++ /dev/null @@ -1,409 +0,0 @@ - ---- -## RO-Crate ---- - -To perform any RO-Crate operation, simply use the `rocrate` sub-command within the `fairscape-cli` root command. - ---- - -### Create RO-Crate -To create an RO-Crate, you have the option to use either the `create` or `init` sub-commands. With `create`, you can specify the destination directory using the `ROCRATE_PATH` argument, whereas `init` creates the RO-Crate in the current working directory. Both sub-commands require five parameters: `name`, `description`, `keywords`, `organization-name`, and `project-name`, as well as an optional `guid` parameter. To view all available options and arguments, simply enter the command `fairscape-cli rocrate create --help` to display a comprehensive list. 
- - -``` bash -Usage: fairscape-cli rocrate create [OPTIONS] ROCRATE_PATH - - Create an ROCrate in a new path specified by the rocrate-path argument - -Options: - --guid TEXT - --name TEXT [required] - --organization-name TEXT [required] - --project-name TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --help Show this message and exit. -``` - -To create an RO-Crate with minimal metadata, use the following command. This will generate a unique identifier and create a `ro-crate-metadata.json` file at the specified `ROCRATE_PATH` location. - -``` bash -fairscape-cli rocrate create \ - --name "test rocrate" \ - --description "Example RO Crate for Tests" \ - --organization-name "UVA" \ - --project-name "B2AI" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - "./test_rocrate" -``` - -Alternatively, use the `fairscape-cli rocrate init` command to create the same RO-Crate in the current working directory. - -``` bash -fairscape-cli rocrate init \ - --name "test rocrate" \ - --description "Example RO Crate for Tests" \ - --organization-name "UVA" \ - --project-name "B2AI" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" -``` - ---- - -### Add object and metadata -In the FAIRSCAPE ecosystem, datasets and software are treated as objects that can be added to an RO-Crate using the `add` sub-command. This command fetches the object and transfers it to the crate. Enter the command `fairscape-cli rocrate add --help` to display the list of objects to add. - -``` bash -Usage: fairscape-cli rocrate add [OPTIONS] COMMAND [ARGS]... - - Add (transfer) object to RO-Crate and register object metadata. - -Options: - --help Show this message and exit. - -Commands: - dataset Add a Dataset file and its metadata to the RO-Crate. - software Add a Software and its corresponding metadata. 
-``` - ---- - -#### Dataset object -The sub-command below, labeled as `add dataset`, utilizes necessary options to add a dataset object to the crate and populate corresponding metadata in the `ro-crate-metadata.json` file. An identifier is generated to uniquely represent the dataset. It requires eight parameters including `name`, `author`, `version`, `date-published`, `description`, `data-format`, `source-filepath`, and `destination-filepath`. Additional parameters are optional. The dataset metadata is then added to the `ro-crate-metadata.json`, and the dataset object is transferred to the specified location in `ROCRATE_PATH`. Enter `fairscape-cli rocrate add dataset --help` to show its use: - -``` bash -Usage: fairscape-cli rocrate add dataset [OPTIONS] ROCRATE_PATH - - Add a Dataset file and its metadata to the RO-Crate. - -Options: - --guid TEXT - --name TEXT [required] - --url TEXT - --author TEXT [required] - --version TEXT [required] - --date-published TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --data-format TEXT [required] - --source-filepath TEXT [required] - --destination-filepath TEXT [required] - --used-by TEXT - --derived-from TEXT - --schema TEXT - --associated-publication TEXT - --additional-documentation TEXT - --help Show this message and exit. -``` - -The example below utilizes necessary options to add a dataset object to the crate and populate corresponding metadata in the `ro-crate-metadata.json` file. - -``` bash -fairscape-cli rocrate add dataset \ - --name "AP-MS embeddings" \ - --author "Krogan lab (https://kroganlab.ucsf.edu/krogan-lab)" \ - --version "1.0" \ - --date-published "2021-04-23" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." 
\ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - --data-format "CSV" \ - --source-filepath "./tests/data/APMS_embedding_MUSIC.csv" \ - --destination-filepath "./test_rocrate/APMS_embedding_MUSIC.csv" \ - "./test_rocrate" -``` - -The example below performs the same operation utilizing both required and optional parameters: - -``` bash -fairscape-cli rocrate add dataset \ - --guid "ark:5982/UVA/B2AI/example_rocrate/AP-MS_embeddings-Dataset" \ - --name "AP-MS embeddings" \ - --url "https://github.com/idekerlab/MuSIC/blob/master/Examples/APMS_embedding.MuSIC.csv" \ - --author "Krogan lab (https://kroganlab.ucsf.edu/krogan-lab)" \ - --version "1.0" \ - --date-published "2021-04-23" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - --data-format "CSV" \ - --source-filepath "./tests/data/APMS_embedding_MUSIC.csv" \ - --destination-filepath "./test_rocrate/APMS_embedding_MUSIC.csv" \ - --used-by "create labeled training & test sets random_forest_samples.py" \ - --derived-from "node2vec predict" \ - --associated-publication "Qin, Y. et al. A multi-scale map of cell structure fusing protein images and interactions" \ - --additional-documentation "https://idekerlab.ucsd.edu/music/" \ - "./test_rocrate" -``` - -One of the features offered by `fairscape-cli` is the ability to annotate certain types of dataset objects with schema-level metadata. The examples in [Schema Metadata](schema-metadata.md) demonstrate how to describe the schema of a dataset object as metadata. This feature includes a mechanism to validate the metadata against the object. - ---- - -#### Software object -To add a software object, use the `software` sub-command, which requires eight parameters, namely `name`, `author`, `version`, `description`, `file-format`, `source-filepath`, `destination-filepath`, and `date-modified`. 
Five additional parameters are optional. Metadata about the software is added to the `ro-crate-metadata.json` file, and the software object is sent to the location specified by `ROCRATE_PATH`. Enter `fairscape-cli rocrate add software --help` to show its use: - -``` bash -Usage: fairscape-cli rocrate add software [OPTIONS] ROCRATE_PATH - - Add a Software and its corresponding metadata. - -Options: - --guid TEXT - --name TEXT [required] - --author TEXT [required] - --version TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --file-format TEXT [required] - --url TEXT - --source-filepath TEXT [required] - --destination-filepath TEXT [required] - --date-modified TEXT [required] - --used-by-computation TEXT - --associated-publication TEXT - --additional-documentation TEXT - --help Show this message and exit. -``` - -The example below uses the required options to add a software object to the crate and populate the associated metadata within the metadata file `ro-crate-metadata.json`. An automatic identifier is generated to uniquely represent the software. - -``` bash -fairscape-cli rocrate add software \ - --name "calibrate pairwise distance" \ - --author "Qin, Y." \ - --version "1.0" \ - --description "script written in python to calibrate pairwise distance." \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - --file-format "py" \ - --source-filepath "./tests/data/calibrate_pairwise_distance.py" \ - --destination-filepath "./test_rocrate/calibrate_pairwise_distance.py" \ - --date-modified "2021-04-23" \ - "./test_rocrate" -``` - -The same operation can be performed using both required and optional parameters with the following command. - -``` bash -fairscape-cli rocrate add software \ - --guid "ark:5982/UVA/B2AI/example_rocrate/calibrate_pairwise_distance-Software" \ - --name "calibrate pairwise distance" \ - --author "Qin, Y." 
\ - --version "1.0" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." \ - --keywords "b2ai" \ - --keywords "U2OS" \ - --file-format "py" \ - --url "https://github.com/idekerlab/MuSIC/blob/master/calibrate_pairwise_distance.py" \ - --source-filepath "./tests/data/calibrate_pairwise_distance.py" \ - --destination-filepath "./test_rocrate/calibrate_pairwise_distance.py" \ - --date-modified "2021-06-20" \ - --used-by-computation "ARK:compute_standard_proximities.1/f9aa5f3f-665a-4ab9-8879-8d0d52f05265" \ - --associated-publication "Qin, Y. et al. A multi-scale map of cell structure fusing protein images and interactions. Nature 600, 536–542 2021" \ - --additional-documentation "https://idekerlab.ucsd.edu/music/" \ - "./test_rocrate" -``` - ---- - -### Register metadata -Registering metadata adds the metadata of an object (dataset, object) or an activity (computation) to the `ro-crate-metadata.json`. Before the execution of the `register` sub-command, objects are required to be present in the path specified by the `--filepath` option, hence, no transfer of objects takes place during the execution. There is no similar requirement to specify a path for registering a computation as an activity. - - -Enter `fairscape-cli rocrate register --help` to show its use: - - -``` bash -Usage: fairscape-cli rocrate register [OPTIONS] COMMAND [ARGS]... - - Add a metadata record to the RO-Crate for a Dataset, Software, or - Computation - -Options: - --help Show this message and exit. - -Commands: - computation Register a Computation with the specified RO-Crate - dataset Register Dataset object metadata with the specified RO-Crate - software Register a Software metadata record to the specified ROCrate -``` - ---- - -#### Computation metadata -To register a computation, use the `register computation` sub-command. 
In the FAIRSCAPE ecosystem, computation is considered an activity, unlike datasets and software that are treated as objects. This sub-command requires five mandatory parameters: `name`, `run-by`, `date-created`, `description`, and `keywords`, as well as five optional parameters. Once executed, metadata about the computation is added to `ro-crate-metadata.json` in the `ROCRATE_PATH` location. - -To view all available options and arguments for registering a computation, enter `fairscape-cli rocrate register computation --help`: - -``` bash -Usage: fairscape-cli rocrate register computation [OPTIONS] ROCRATE_PATH - - Register a Computation with the specified RO-Crate - -Options: - --guid TEXT - --name TEXT [required] - --run-by TEXT [required] - --command TEXT - --date-created TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --used-software TEXT - --used-dataset TEXT - --generated TEXT - --help Show this message and exit. -``` - -The `register computation` sub-command can also be used to populate the metadata of a computation within `ro-crate-metadata.json` using only the necessary options. Additionally, a unique identifier is generated automatically to represent the computation. - -``` bash -fairscape-cli rocrate register computation \ - --name "calibrate pairwise distance" \ - --run-by "Qin, Y." \ - --date-created "2021-05-23" \ - --description "Average the predicted proximities" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --keywords "U2OS" \ - "./test_rocrate" -``` - -The same operation can be performed using both required and optional parameters with the following command. - -``` bash -fairscape-cli rocrate register computation \ - --guid "ark:5982/UVA/B2AI/test_rocrate/calibrate_pairwise_distance-Computation" \ - --name "calibrate pairwise distance" \ - --run-by "Qin, Y." 
\ - --command "some command" \ - --date-created "2021-05-23" \ - --description "Average the predicted proximities" \ - --keywords "b2ai" \ - --keywords "clustering" \ - --used-software "random_forest_output (https://github.com/idekerlab/MuSIC/blob/master/random_forest_output.py)" \ - --used-dataset "IF_emd_1_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_1.pkl" \ - --used-dataset "IF_emd_2_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_1.pkl" \ - --used-dataset "IF_emd_1_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_2.pkl" \ - --used-dataset "IF_emd_2_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_2.pkl" \ - --used-dataset """Fold 1 proximities: IF_emd_1_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_3.pkl""" \ - --used-dataset "IF_emd_2_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_3.pkl" \ - --used-dataset """Fold 1 proximities: IF_emd_1_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_4.pkl""" \ - --used-dataset "IF_emd_2_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_4.pkl" \ - --used-dataset """Fold 1 proximities: IF_emd_1_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_5.pkl""" \ - --used-dataset "IF_emd_2_APMS_emd_1.RF_maxDep_30_nEst_1000.fold_5.pkl" \ - --generated "averages of predicted protein proximities (https://github.com/idekerlab/MuSIC/blob/master/Examples/MuSIC_predicted_proximity.txt)" \ - "./test_rocrate" -``` - ---- - -#### Dataset metadata -To register a dataset, use the `register dataset` sub-command and include the `filepath` option to specify the source file path. This command adds metadata about the dataset to `ro-crate-metadata.json` in the `ROCRATE_PATH` directory. 
- -To view all available options and arguments for registering a dataset, enter `fairscape-cli rocrate register dataset --help`: - -``` bash hl_lines="12" -Usage: fairscape-cli rocrate register dataset [OPTIONS] ROCRATE_PATH - - Register Dataset object metadata with the specified RO-Crate - -Options: - --guid TEXT - --name TEXT [required] - --url TEXT - --author TEXT [required] - --version TEXT [required] - --date-published TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --data-format TEXT [required] - --filepath TEXT [required] - --used-by TEXT - --derived-from TEXT - --schema TEXT - --associated-publication TEXT - --additional-documentation TEXT - --help Show this message and exit. -``` - -Execute the following command to use all available options and argument for registering a dataset: - -``` bash -fairscape-cli rocrate register dataset \ - --guid "ark:5982/UVA/B2AI/example_rocrate/AP-MS_embeddings-Dataset" \ - --name "AP-MS embeddings" \ - --url "https://github.com/idekerlab/MuSIC/blob/master/Examples/APMS_embedding.MuSIC.csv" \ - --author "Krogan lab (https://kroganlab.ucsf.edu/krogan-lab)" \ - --version "1.0" \ - --date-published "2021-04-23" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." \ - --keywords "apms" \ - --keywords "b2ai" \ - --keywords "cm4ai" \ - --data-format "CSV" \ - --filepath "./test_rocrate/APMS_embedding_MUSIC.csv" \ - --used-by "create labeled training & test sets random_forest_samples.py" \ - --derived-from "node2vec predict" \ - --associated-publication "Qin, Y. et al. A multi-scale map of cell structure fusing protein images and interactions" \ - --additional-documentation "https://idekerlab.ucsd.edu/music/" \ - "./test_rocrate" -``` - ---- - -#### Software metadata -Furthermore, to register software, you can make use of the `register software` sub-command. 
This sub-command necessitates the inclusion of the `filepath` option, which specifies the source file path. Upon execution, this command will append metadata about the software to the `ro-crate-metadata.json` file in the `ROCRATE_PATH` directory. - -To view all available options and arguments for registering a software, enter `fairscape-cli rocrate register software --help`: - -``` bash hl_lines="12" -Usage: fairscape-cli rocrate register software [OPTIONS] ROCRATE_PATH - - Register a Software metadata record to the specified ROCrate - -Options: - --guid TEXT - --name TEXT [required] - --author TEXT [required] - --version TEXT [required] - --description TEXT [required] - --keywords TEXT [required] - --file-format TEXT [required] - --url TEXT - --date-modified TEXT - --filepath TEXT - --used-by-computation TEXT - --associated-publication TEXT - --additional-documentation TEXT - --help Show this message and exit. -``` - -Execute the following command to use all available options and argument for registering a software: - -``` bash -fairscape-cli rocrate register software \ - --guid "ark:5982/UVA/B2AI/example_rocrate/calibrate_pairwise_distance-Software" \ - --name "calibrate pairwise distance" \ - --author "Qin, Y." \ - --version "1.0" \ - --description "Affinity purification mass spectrometer (APMS) embeddings for each protein in the study, generated by node2vec predict." \ - --keywords "b2ai" \ - --keywords "U20S" \ - --file-format "py" \ - --url "https://github.com/idekerlab/MuSIC/blob/master/calibrate_pairwise_distance.py" \ - --filepath "./test_rocrate/calibrate_pairwise_distance.py" \ - --date-modified "2021-06-20" \ - --used-by-computation "ARK:compute_standard_proximities.1/f9aa5f3f-665a-4ab9-8879-8d0d52f05265" \ - --associated-publication "Qin, Y. et al. A multi-scale map of cell structure fusing protein images and interactions. 
Nature 600, 536–542 2021" \ - --additional-documentation "https://idekerlab.ucsd.edu/music/" \ - "./test_rocrate" -``` diff --git a/docs/index.md b/docs/index.md index 3df8772..91c664c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,13 +1,74 @@ # fairscape-cli ----------------- + A utility for packaging objects and validating metadata for FAIRSCAPE. ## Features -fairscape-cli provides a Command Line Interface (CLI) that allows the client side to create: +fairscape-cli provides a Command Line Interface (CLI) that allows the client side to create, manage, and publish scientific data packages: + +- **RO-Crate Management:** Create and manipulate [RO-Crate](https://www.researchobject.org/ro-crate/) packages locally. + - Initialize RO-Crates in new or existing directories. + - Add data, software, and computation metadata. + - Copy files into the crate structure alongside metadata registration. +- **Schema Handling:** Define, infer, and validate data schemas (Tabular, HDF5). + - Create schema definition files. + - Add properties with constraints. + - Infer schemas directly from data files. + - Register schemas within RO-Crates. +- **Data Import:** Fetch data from external sources and convert them into RO-Crates. + - Import NCBI BioProjects. + - Convert Portable Encapsulated Projects (PEPs) to RO-Crates. +- **Build Artifacts:** Generate derived outputs from RO-Crates. + - Create detailed HTML datasheets summarizing crate contents. + - Generate provenance evidence graphs (JSON and HTML). +- **Release Management:** Organize multiple related RO-Crates into a cohesive release package. + - Initialize a release structure. + - Automatically link sub-crates and propagate metadata. + - Build a top-level datasheet for the release. +- **Publishing:** Publish RO-Crate metadata to external repositories. + - Upload RO-Crate directories or zip files to Fairscape. + - Create datasets on Dataverse instances. + - Mint or update DOIs on DataCite. 
+ +## Requirements + +Python 3.8+ + +## Installation + +```bash +pip install fairscape-cli +``` + +## Command Overview + +The CLI is organized into several top-level commands: + +| Command | Description | +| ------------ | ------------------------------------------------------------------------- | +| **rocrate** | Core local RO-Crate manipulation (create, add files/metadata). | +| **schema** | Operations on data schemas (create, infer, add properties, add to crate). | +| **validate** | Validate data against schemas. | +| **import** | Fetch external data into RO-Crate format (e.g., bioproject, pep). | +| **build** | Generate outputs from RO-Crates (e.g., datasheet, evidence-graph). | +| **release** | Manage multi-part RO-Crate releases (e.g., create, build). | +| **publish** | Publish RO-Crates to repositories (e.g., fairscape, dataverse, doi). | + +Use `--help` for details on any command or subcommand: + +```bash +fairscape-cli --help +fairscape-cli rocrate --help +fairscape-cli rocrate create --help +fairscape-cli schema create-tabular --help +``` + +### Learn More + +For more detailed examples and a complete workflow demonstration, see the [Complete Workflow Demo](workflows/complete-demo.md). + +## Documentation -* [RO-Crate](https://www.researchobject.org/ro-crate/) - a light-weight approach to packaging research data with their metadata. The CLI allows users to: - * Create Research Object Crates (RO-Crates) - * Add (transfer) digital objects to the RO-Crate - * Register metadata of the objects - * Describe the schema of tabular dataset objects as metadata and perform validation. 
\ No newline at end of file +- [Installation](setup.md) +- [Command Reference](commands/rocrate.md) +- [Complete Workflow Demo](workflows/complete-demo.md) diff --git a/docs/workflows/complete-demo.md b/docs/workflows/complete-demo.md new file mode 100644 index 0000000..0266cd2 --- /dev/null +++ b/docs/workflows/complete-demo.md @@ -0,0 +1,357 @@ +# Fairscape-CLI Complete Workflow Demo + +This document demonstrates a complete workflow for using fairscape-cli to create, manage, and publish research data packages with proper metadata. The workflow follows these key steps: + +1. Build a crate with local files and computation +2. Create schemas and validate data +3. Build a crate from external repository data +4. Generate evidence graphs +5. Build a unified release crate with rich metadata + +## Prerequisites + +Before starting this workflow, make sure you have: + +- fairscape-cli installed + +## Step 1: Build a Crate with Local Files and Computation + +We'll start by creating a small data processing example using local files. This demonstrates the full research object lifecycle from input to output. + +### 1.1 Create Input File and Processing Script + +First, let's create a directory structure and generate sample files for our computation: + +```bash +# Create the base directory +mkdir -p ./simple-computation + +# Create sample input.csv with Python +python -c "import pandas as pd; pd.DataFrame({'value1': [10, 20, 30, 40, 50], 'value2': [5, 15, 25, 35, 45]}).to_csv('./simple-computation/input.csv', index=False)" + +# Create sample software.py +cat > ./simple-computation/software.py << 'EOF' +import pandas as pd +import sys + +def process_data(input_file, output_file): + # Read input data + data = pd.read_csv(input_file) + + # Process data (calculate sum and product) + data['sum'] = data['value1'] + data['value2'] + data['product'] = data['value1'] * data['value2'] + + # Save results + data.to_csv(output_file, index=False) + print(f"Processing complete. 
Results saved to {output_file}") + +if __name__ == "__main__": + if len(sys.argv) >= 3: + process_data(sys.argv[1], sys.argv[2]) + else: + print("Usage: python software.py input.csv output.csv") +EOF +``` + +### 1.2 Create and Register RO-Crate + +Now, let's create the RO-Crate and register our input dataset and software: + +```bash +# Create the RO-Crate +fairscape-cli rocrate create \ + --name 'Simple Computation Example' \ + --organization-name 'Example Organization' \ + --project-name 'Data Processing Demo' \ + --date-published '2025-04-16' \ + --description 'A simple demonstration of data processing in an RO-Crate' \ + --keywords 'computation,demo,rocrate' \ + './simple-computation' + +# Register the input dataset +fairscape-cli rocrate register dataset \ + './simple-computation' \ + --name 'Input Dataset' \ + --author 'Example Author' \ + --version '1.0' \ + --date-published '2025-04-16' \ + --description 'Input data for computation example' \ + --keywords 'data,input' \ + --data-format 'csv' \ + --filepath './simple-computation/input.csv' + +# Register the software +fairscape-cli rocrate register software \ + './simple-computation' \ + --name 'Data Processing Software' \ + --author 'Example Developer' \ + --version '1.0' \ + --description 'Software that computes sum and product of two columns' \ + --keywords 'software,processing' \ + --file-format 'py' \ + --filepath './simple-computation/software.py' \ + --date-modified '2025-04-16' +``` + +### 1.3 Infer and Validate Input Data Against Schema + +Let's create a schema for our input data and validate against it: + +```bash +# Create the tabular schema +fairscape-cli schema create-tabular \ + --name 'Input Dataset Schema' \ + --description 'Schema for the input data used in the computation example' \ + --separator ',' \ + './simple-computation/input_schema.json' + +# Add properties to the schema +fairscape-cli schema add-property integer \ + --name 'value1' \ + --index 0 \ + --description 'Column value1' \ + 
'./simple-computation/input_schema.json' + +fairscape-cli schema add-property integer \ + --name 'value2' \ + --index 1 \ + --description 'Column value2' \ + './simple-computation/input_schema.json' + +# Register the schema with the RO-Crate +fairscape-cli schema add-to-crate \ + './simple-computation' \ + './simple-computation/input_schema.json' + +# Validate the input data against the schema +fairscape-cli validate schema \ + --schema './simple-computation/input_schema.json' \ + --data './simple-computation/input.csv' +``` + +### 1.4 Run and Register the Computation + +Execute the software and register the computation activity: + +```bash +# Run the software to generate output +python ./simple-computation/software.py \ + ./simple-computation/input.csv \ + ./simple-computation/output.csv + +# Register the computation +fairscape-cli rocrate register computation \ + './simple-computation' \ + --name 'Data Processing Computation' \ + --run-by 'Example Researcher' \ + --date-created '2025-04-16' \ + --description 'Computation that generates sum and product of input values' \ + --keywords 'computation,processing' \ + --used-software 'ark:59852/software-data-processing-software-XXXX' \ + --used-dataset 'ark:59852/dataset-input-dataset-XXXX' \ + --command 'python software.py input.csv output.csv' +``` + +Note: Replace the ARK identifiers with the actual values returned by your previous commands. 
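If you'd rather look the identifiers up than copy them by hand, every registered ARK is recorded in the crate's `ro-crate-metadata.json`. The snippet below is an illustrative plain-Python helper (not part of fairscape-cli) that resolves an entity's `@id` by its `name`; the sample identifiers are placeholders standing in for the ones your own run produced:

```python
def find_ark_by_name(metadata, entity_name):
    """Return the @id of the first @graph entity whose name matches, else None."""
    for entity in metadata.get("@graph", []):
        if entity.get("name") == entity_name:
            return entity.get("@id")
    return None

# Stand-in for: json.load(open("./simple-computation/ro-crate-metadata.json"))
metadata = {
    "@graph": [
        {"@id": "ark:59852/software-data-processing-software-0000", "name": "Data Processing Software"},
        {"@id": "ark:59852/dataset-input-dataset-0000", "name": "Input Dataset"},
    ]
}

print(find_ark_by_name(metadata, "Input Dataset"))  # → ark:59852/dataset-input-dataset-0000
```

The same lookup works for software and computation entities, since all of them appear as entries in the crate's `@graph` array.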
+
+### 1.5 Register Output and Infer Schema
+
+Register the output dataset and infer its schema:
+
+```bash
+# Register the output dataset with explicit --generated-by parameter
+fairscape-cli rocrate register dataset \
+    './simple-computation' \
+    --name 'Output Dataset' \
+    --author 'Example Author' \
+    --version '1.0' \
+    --date-published '2025-04-16' \
+    --description 'Output data from computation example' \
+    --keywords 'data,output' \
+    --data-format 'csv' \
+    --filepath './simple-computation/output.csv' \
+    --generated-by 'ark:59852/computation-data-processing-computation-XXXX'
+
+# Infer the schema and add it to the RO-Crate
+fairscape-cli schema infer \
+    --name 'Output Dataset Schema' \
+    --description 'Schema for the output data produced by the computation example' \
+    --rocrate-path './simple-computation' \
+    './simple-computation/output.csv' \
+    './simple-computation/output_schema.json'
+
+# Validate the output data against the inferred schema
+fairscape-cli validate schema \
+    --schema './simple-computation/output_schema.json' \
+    --data './simple-computation/output.csv'
+```
+
+### 1.6 Generate a Provenance Graph for the Main Output
+
+Create a visual representation of the data provenance:
+
+```bash
+# Generate evidence graph for the output dataset
+fairscape-cli build evidence-graph \
+    './simple-computation' \
+    'ark:59852/dataset-output-dataset-XXXX'
+```
+
+This will create both JSON and HTML visualizations of the data provenance in the RO-Crate.
+
+## Step 2: Build a Crate from External Repository Data
+
+Now let's demonstrate how to pull data from an external repository and create a new RO-Crate.
+
+### 2.1 Pull Data from an External Repository
+
+```bash
+# Pull data from a BioProject
+fairscape-cli import bioproject \
+    --accession "PRJDB2884" \
+    --api-key "" \
+    --output-dir "./sra-crate" \
+    --author "Justin, Max"
+```
+
+This command fetches metadata from NCBI's BioProject database and creates a complete RO-Crate with that information. 
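For orientation before defining the schema in the next step: a FASTQ record is four positional lines (header, sequence, separator, quality scores), and the per-line regex constraints that schema encodes can be sketched in plain Python. The record below is invented, and these patterns are only an approximation of what fairscape-cli's validator enforces:

```python
import re

# Approximate per-line constraints for a 4-line FASTQ record
PATTERNS = [
    re.compile(r"^@.*"),        # header: starts with @
    re.compile(r"^[ATCGN]+$"),  # sequence: nucleotides only
    re.compile(r"^\+.*"),       # separator: starts with +
    re.compile(r".+"),          # quality scores: non-empty
]

def valid_fastq_record(lines):
    """True when a record has exactly 4 lines and each line matches its pattern."""
    return len(lines) == 4 and all(p.match(line) for p, line in zip(PATTERNS, lines))

record = [
    "@SRR000001.1 example read",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC",
]
print(valid_fastq_record(record))  # → True
```

In the workflow itself this check is delegated to `fairscape-cli validate schema`; the sketch just makes the positional structure explicit.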
+ +### 2.2 Create Schemas for External Data + +Let's create a schema for FASTQ sequence data: + +```bash +# Create a tabular schema for FASTQ format +fairscape-cli schema create-tabular \ + --name 'fastq_data' \ + --description 'FASTQ sequence data schema' \ + --separator '\n' \ + --header 'false' \ + './sra-crate/fastq_schema.json' + +# Add the header property to the schema +fairscape-cli schema add-property string \ + --name 'header' \ + --index '0' \ + --description 'The header line starting with @' \ + --pattern '^@.*' \ + './sra-crate/fastq_schema.json' + +# Add the sequence property to the schema +fairscape-cli schema add-property string \ + --name 'sequence' \ + --index '1' \ + --description 'The nucleotide sequence' \ + --pattern '^[ATCGN]+$' \ + './sra-crate/fastq_schema.json' + +# Add the plus sign line property to the schema +fairscape-cli schema add-property string \ + --name 'plus' \ + --index '2' \ + --description 'The plus sign line' \ + --pattern '^\+.*' \ + './sra-crate/fastq_schema.json' + +# Add the quality scores property to the schema +fairscape-cli schema add-property string \ + --name 'quality_scores' \ + --index '3' \ + --description 'The quality scores in Phred+33 encoding' \ + './sra-crate/fastq_schema.json' + +# Register the schema with the RO-Crate +fairscape-cli schema add-to-crate \ + './sra-crate' \ + './sra-crate/fastq_schema.json' +``` + +### 2.3 Generate Evidence Graph for External Data + +Find a key dataset in the crate and generate its evidence graph: + +```bash +# First, get the ID of a main dataset in the crate +DATASET_ID=$(grep -o "ark:59852/dataset-[a-zA-Z0-9-]*" ./sra-crate/ro-crate-metadata.json | head -1) + +# Generate evidence graph for the dataset +fairscape-cli build evidence-graph \ + './sra-crate' \ + "$DATASET_ID" \ + --output-file './sra-crate/provenance-graph.json' +``` + +## Step 3: Build a Unified Release Crate + +Now, let's build a release crate that combines our local computation and the external data: + 
+```bash +# Create a release RO-Crate +fairscape-cli release build ./ \ + --guid "ark:59852/example-release-for-demo" \ + --name "SRA Genomic Data Example Release - 2025" \ + --organization-name "Example Research Institute" \ + --project-name "Genomic Data Analysis Project" \ + --description "This comprehensive dataset contains genomic data from multiple sources, including Japanese flounder (PRJDB2884) and human RNA-seq data (PRJEB86838) from the Sequence Read Archive (SRA). All data has been processed and prepared as AI-ready datasets in RO-Crate format, with appropriate metadata and provenance information to ensure FAIR data principles compliance." \ + --keywords "Genomics" \ + --keywords "SRA" \ + --keywords "RNA-seq" \ + --keywords "Sequence Read Archive" \ + --keywords "Bioinformatics" \ + --license "https://creativecommons.org/licenses/by/4.0/" \ + --version "1.0" \ + --publisher "University of Virginia Dataverse" \ + --principal-investigator "Dr. Example PI" \ + --copyright-notice "Copyright (c) 2025 The Regents of the University of California except where otherwise noted." \ + --conditions-of-access "Attribution is required to the copyright holders and the authors." \ + --contact-email "example@example.org" \ + --confidentiality-level "HL7 Unrestricted" \ + --funder "Example Agency" \ + --usage-info "This dataset is intended for research purposes in genomics, bioinformatics, and related fields." \ + --content-size "2.45 GB" \ + --citation "Example Research Institute (2025). SRA Genomic Data Example Release." \ + --associated-publication "Smith et al. (2025). Novel approaches to genomic data analysis using SRA datasets." \ + --completeness "These data contain complete processed datasets from the specified SRA projects." \ + --maintenance-plan "This dataset will be periodically updated with corrections or additional annotations." \ + --intended-use "This dataset is intended for genomic research and educational purposes." 
\
+    --limitations "While comprehensive quality control has been performed, researchers should be aware of inherent limitations." \
+    --potential-sources-of-bias "Original sample collection methods may introduce biases." \
+    --prohibited-uses "Commercial redistribution without attribution is prohibited." \
+    --human-subject "No"
+
+# Generate a datasheet for the release
+fairscape-cli build datasheet ./
+```
+
+This builds a unified release that bundles both of our individual RO-Crates and then generates a comprehensive datasheet for the release.
+
+## Step 4: Publishing RO-Crates
+
+Once you've created your RO-Crates and assembled them into a release, you can publish them to repositories for broader access and assign persistent identifiers.
+
+### 4.1 Publish to Fairscape
+
+For repositories supporting the Fairscape API:
+
+```bash
+fairscape-cli publish fairscape \
+    --rocrate "./" \
+    --username "your_username" \
+    --password "your_password" \
+    --api-url "https://fairscape.net/api"
+```
+
+## Conclusion
+
+This workflow demonstrates the complete process of creating, managing, combining, and publishing research data packages using fairscape-cli. By following these steps, you can:
+
+1. Create well-structured RO-Crates with proper metadata
+2. Register data, software, and computations with appropriate relationships
+3. Define and validate data schemas
+4. Pull data from external repositories
+5. Generate provenance visualizations
+6. Build comprehensive release packages with rich metadata
+7. Publish your data to Fairscape
+
+These capabilities enable FAIR (Findable, Accessible, Interoperable, Reusable) data sharing practices for scientific research, making your data discoverable, properly cited, and reusable by the broader community. 
diff --git a/mkdocs.yml b/mkdocs.yml index 2230220..ac605da 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,5 +1,5 @@ # Project Information -site_name: FAIRSACPE CLI +site_name: FAIRSCAPE CLI site_description: A utility for packaging objects and validating metadata for FAIRSCAPE. site_url: https://github.com/fairscape/fairscape-cli @@ -14,8 +14,16 @@ copyright: Copyright © 2023 THE RECTOR AND VISITORS OF THE UNIVERSITY OF VI nav: - Home: index.md - Setup: setup.md - - Getting Started: getting-started.md - - Schema Metadata: schema-metadata.md + - Examples: + - Complete Demo: workflows/complete-demo.md + - Commands: + - RO-Crate: commands/rocrate.md + - Schema: commands/schema.md + - Validation: commands/validate.md + - Import: commands/import.md + - Build: commands/build.md + - Release: commands/release.md + - Publishing: commands/publish.md theme: name: readthedocs @@ -31,7 +39,6 @@ markdown_extensions: - md_in_html - toc: permalink: true - # Python Markdown Extensions - pymdownx.arithmatex: generic: true @@ -40,8 +47,8 @@ markdown_extensions: - pymdownx.caret - pymdownx.details - pymdownx.emoji: - #emoji_index: !!python/name:materialx.emoji.twemoji - #emoji_generator: !!python/name:materialx.emoji.to_svg + # emoji_index: !!python/name:materialx.emoji.twemoji + # emoji_generator: !!python/name:materialx.emoji.to_svg - pymdownx.highlight - pymdownx.inlinehilite - pymdownx.keys diff --git a/src/fairscape_cli/__main__.py b/src/fairscape_cli/__main__.py index 140bbfc..ac191e5 100644 --- a/src/fairscape_cli/__main__.py +++ b/src/fairscape_cli/__main__.py @@ -1,8 +1,13 @@ import click -from fairscape_cli.rocrate import rocrate -from fairscape_cli.schema import schema -#from fairscape_cli.client import client +# Import command groups from their new locations +from fairscape_cli.commands.rocrate_commands import rocrate_group +from fairscape_cli.commands.import_commands import import_group +from fairscape_cli.commands.build_commands import build_group +from 
fairscape_cli.commands.publish_commands import publish_group +from fairscape_cli.commands.release_commands import release_group +from fairscape_cli.commands.schema_commands import schema +from fairscape_cli.commands.validate_commands import validate_group @click.group() def cli(): @@ -11,15 +16,14 @@ def cli(): """ pass - -# ROCrate Subcommands -cli.add_command(rocrate.rocrate) - -# Schema Subcommands -cli.add_command(schema.schema) - -# Fairscape Client Commands -# cli.add_command(client) +# Add the new top-level command groups +cli.add_command(rocrate_group, name='rocrate') +cli.add_command(import_group, name='import') +cli.add_command(build_group, name='build') +cli.add_command(publish_group, name='publish') +cli.add_command(release_group, name='release') +cli.add_command(schema, name='schema') +cli.add_command(validate_group, name='validate') if __name__ == "__main__": cli() \ No newline at end of file diff --git a/src/fairscape_cli/commands/build_commands.py b/src/fairscape_cli/commands/build_commands.py new file mode 100644 index 0000000..3172176 --- /dev/null +++ b/src/fairscape_cli/commands/build_commands.py @@ -0,0 +1,152 @@ +import click +import pathlib +import os +import traceback +from pathlib import Path +import json +from typing import Optional + +from fairscape_cli.datasheet_builder.rocrate.datasheet_generator import DatasheetGenerator +from fairscape_cli.datasheet_builder.evidence_graph.graph_builder import generate_evidence_graph_from_rocrate +from fairscape_cli.datasheet_builder.evidence_graph.html_builder import generate_evidence_graph_html + +@click.group('build') +def build_group(): + """Build derived artifacts from RO-Crates (datasheets, previews, graphs).""" + pass + +@build_group.command('datasheet') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.option('--output', required=False, type=click.Path(path_type=pathlib.Path), help="Output HTML file path (defaults to ro-crate-datasheet.html in crate 
dir).")
+@click.option('--template-dir', required=False, type=click.Path(exists=True, path_type=pathlib.Path), help="Custom template directory.")
+@click.option('--published', is_flag=True, default=False, help="Indicate if the crate is considered published (may affect template rendering).")
+@click.pass_context
+def build_datasheet(ctx, rocrate_path, output, template_dir, published):
+    """Generate an HTML datasheet for an RO-Crate."""
+
+    if rocrate_path.is_dir():
+        metadata_file = rocrate_path / "ro-crate-metadata.json"
+        crate_dir = rocrate_path
+    elif rocrate_path.name == "ro-crate-metadata.json":
+        metadata_file = rocrate_path
+        crate_dir = rocrate_path.parent
+    else:
+        click.echo("ERROR: Input path must be an RO-Crate directory or a ro-crate-metadata.json file.", err=True)
+        ctx.exit(1)
+
+    if not metadata_file.exists():
+        click.echo(f"ERROR: Metadata file not found: {metadata_file}", err=True)
+        ctx.exit(1)
+
+    output_path = output if output else crate_dir / "ro-crate-datasheet.html"
+
+    # Fall back to the bundled templates only when no custom directory was given
+    if template_dir is None:
+        package_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+        template_dir = Path(os.path.join(package_dir, 'datasheet_builder', 'templates'))
+
+    click.echo(f"Generating datasheet for {metadata_file}")
+    click.echo(f"Outputting to: {output_path}")
+
+    try:
+        generator = DatasheetGenerator(
+            json_path=str(metadata_file),
+            template_dir=str(template_dir),
+            published=published
+        )
+
+        generator.process_subcrates()
+
+        final_output_path = generator.save_datasheet(str(output_path))
+        click.echo(f"Datasheet generated successfully: {final_output_path}")
+    except Exception as e:
+        click.echo(f"Error generating datasheet: {str(e)}", err=True)
+        traceback.print_exc()
+        ctx.exit(1)
+
+@build_group.command('evidence-graph')
+@click.argument('rocrate-path', type=click.Path(exists=True, path_type=Path))
+@click.argument('ark-id', type=str)
+@click.option('--output-file', required=False, type=click.Path(path_type=Path), help="Path to save the JSON evidence graph (defaults 
to provenance-graph.json in the RO-Crate directory)") +@click.pass_context +def generate_evidence_graph( + ctx, + rocrate_path: Path, + ark_id: str, + output_file: Optional[Path], +): + """ + Generate an evidence graph from an RO-Crate for a specific ARK identifier. + + ROCRATE_PATH can be either a directory containing ro-crate-metadata.json or the metadata file itself. + ARK_ID is the ARK identifier for which to build the evidence graph. + """ + # Determine RO-Crate metadata file path + if rocrate_path.is_dir(): + metadata_file = rocrate_path / "ro-crate-metadata.json" + if not metadata_file.exists(): + click.echo(f"ERROR: ro-crate-metadata.json not found in {rocrate_path}") + ctx.exit(1) + else: + metadata_file = rocrate_path + + # Determine output paths + crate_dir = metadata_file.parent + if not output_file: + output_file = crate_dir / "provenance-graph.json" + + # Generate the evidence graph + try: + click.echo(f"Generating evidence graph for {ark_id} from {metadata_file}...") + evidence_graph = generate_evidence_graph_from_rocrate( + rocrate_path=metadata_file, + output_path=output_file, + node_id=ark_id + ) + click.echo(f"Evidence graph saved to {output_file}") + + try: + html_output_path = output_file.with_suffix('.html') + click.echo("Generating visualization...") + result = generate_evidence_graph_html(str(output_file), str(html_output_path)) + + if result: + click.echo(f"Visualization saved to {html_output_path}") + else: + click.echo("ERROR: Failed to generate visualization") + except ImportError: + click.echo("WARNING: generate_evidence_graph_html module not found, skipping visualization") + click.echo("To generate visualizations, please install the visualization module.") + except Exception as e: + click.echo(f"ERROR generating visualization: {str(e)}") + + try: + with open(metadata_file, 'r') as f: + metadata = json.load(f) + + # The root dataset is conventionally the second entity in @graph + i = 0 + for entity in metadata.get('@graph', []): + if i == 1: + entity['hasEvidenceGraph'] = { + "@id": 
str(html_output_path) + } + break + i += 1 + + # Write the updated metadata back to the file + with open(metadata_file, 'w') as f: + json.dump(metadata, f, indent=2) + + click.echo(f"Added hasEvidenceGraph reference to {ark_id} in RO-Crate metadata") + except Exception as e: + click.echo(f"WARNING: Failed to add hasEvidenceGraph reference: {str(e)}") + + except Exception as e: + click.echo(f"ERROR: {str(e)}") + ctx.exit(1) + +# Placeholder for explicit preview generation +# @build_group.command('preview') +# @click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +# def build_preview(ctx, rocrate_path): +# """Generate an HTML preview for a specific RO-Crate.""" +# # Implementation using PreviewGenerator +# pass \ No newline at end of file diff --git a/src/fairscape_cli/commands/import_commands.py b/src/fairscape_cli/commands/import_commands.py new file mode 100644 index 0000000..6bd8a0f --- /dev/null +++ b/src/fairscape_cli/commands/import_commands.py @@ -0,0 +1,134 @@ +import click +import pathlib +from typing import List, Optional + +from fairscape_cli.data_fetcher.GenomicData import GenomicData +from fairscape_cli.models.pep import PEPtoROCrateMapper + + +@click.group('import') +def import_group(): + """Import external data or projects into RO-Crate format.""" + pass + +@import_group.command('bioproject') +@click.option('--accession', required=True, type=str, help='NCBI BioProject accession (e.g., PRJNA12345).') +@click.option('--output-dir', required=True, type=click.Path(file_okay=False, dir_okay=True, writable=True, path_type=pathlib.Path), help='Directory to create the RO-Crate in.') +@click.option('--author', required=True, type=str, help='Author name to associate with generated metadata.') +@click.option('--api-key', required=False, type=str, default=None, help='NCBI API key (optional).') +@click.option('--name', required=False, type=str, help='Override the default RO-Crate name.') +@click.option('--description', 
required=False, type=str, help='Override the default RO-Crate description.') +@click.option('--keywords', required=False, multiple=True, type=str, help='Override the default RO-Crate keywords (can be used multiple times).') +@click.option('--license', required=False, type=str, help='Override the default RO-Crate license URL.') +@click.option('--version', required=False, type=str, help='Override the default RO-Crate version.') +@click.option('--organization-name', required=False, type=str, help='Set the organization name for the RO-Crate.') +@click.option('--project-name', required=False, type=str, help='Set the project name for the RO-Crate.') +@click.pass_context +def pull_bioproject( + ctx, + accession: str, + output_dir: pathlib.Path, + author: str, + api_key: Optional[str], + name: Optional[str], + description: Optional[str], + keywords: Optional[List[str]], + license: Optional[str], + version: Optional[str], + organization_name: Optional[str], + project_name: Optional[str] +): + """Pulls NCBI BioProject data and converts it into an RO-Crate.""" + + click.echo(f"Pulling BioProject {accession}...") + + cache_details_path = output_dir + + try: + # Step 1: Fetch data using GenomicData.from_api + genomic_data_instance = GenomicData.from_api( + accession=accession, + api_key=api_key if api_key else "", + details_dir=str(cache_details_path) + ) + click.echo("Successfully fetched data from NCBI.") + + guid = genomic_data_instance.to_rocrate( + output_dir=str(output_dir), + author=author, + crate_name=name, + crate_description=description, + crate_keywords=list(keywords) if keywords else None, + crate_license=license, + crate_version=version, + organization_name=organization_name, + project_name=project_name + ) + + click.echo(f"Successfully created RO-Crate.") + click.echo(f"{guid}") + + except ValueError as e: + click.echo(f"ERROR: {e}", err=True) + ctx.exit(1) + except FileNotFoundError as e: + click.echo(f"ERROR: File not found during RO-Crate generation - {e}", 
err=True) + ctx.exit(1) + except Exception as e: + click.echo(f"An error occurred: {e}", err=True) + import traceback + traceback.print_exc() + ctx.exit(1) + + +@import_group.command('pep') +@click.argument('pep-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.option('--output-path', required=False, type=click.Path(path_type=pathlib.Path), help='Path for RO-Crate (defaults to PEP directory)') +@click.option('--name', required=False, type=str, help='Name for the RO-Crate (overrides PEP metadata)') +@click.option('--description', required=False, type=str, help='Description (overrides PEP metadata)') +@click.option('--author', required=False, type=str, help='Author (overrides PEP metadata)') +@click.option('--organization-name', required=False, type=str, help='Organization name') +@click.option('--project-name', required=False, type=str, help='Project name') +@click.option('--keywords', required=False, multiple=True, type=str, help='Keywords (overrides PEP metadata)') +@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/", help='License URL') +@click.option('--date-published', required=False, type=str, help='Publication date') +@click.option('--version', required=False, type=str, default="1.0", help='Version string') +@click.pass_context +def from_pep( + ctx, + pep_path: pathlib.Path, + output_path: Optional[pathlib.Path], + name: Optional[str], + description: Optional[str], + author: Optional[str], + organization_name: Optional[str], + project_name: Optional[str], + keywords: Optional[List[str]], + license: Optional[str], + date_published: Optional[str], + version: str +): + """Convert a Portable Encapsulated Project (PEP) to an RO-Crate. 
+ + PEP-PATH: Path to the PEP directory or config file + """ + try: + mapper = PEPtoROCrateMapper(pep_path) + rocrate_id = mapper.create_rocrate( + output_path=output_path, + name=name, + description=description, + author=author, + organization_name=organization_name, + project_name=project_name, + keywords=keywords, + license=license, + date_published=date_published, + version=version + ) + click.echo(rocrate_id) + except Exception as exc: + click.echo(f"ERROR: {str(exc)}") + ctx.exit(code=1) \ No newline at end of file diff --git a/src/fairscape_cli/commands/publish_commands.py b/src/fairscape_cli/commands/publish_commands.py new file mode 100644 index 0000000..c5671c1 --- /dev/null +++ b/src/fairscape_cli/commands/publish_commands.py @@ -0,0 +1,44 @@ +import click +from pathlib import Path +from typing import Optional + +from fairscape_cli.publish.publish_tools import DataversePublisher, DataCitePublisher, FairscapePublisher + +@click.group('publish') +def publish_group(): + """Publish RO-Crates to external repositories.""" + pass + +@publish_group.command('fairscape') +@click.option('--rocrate', required=True, type=click.Path(exists=True, file_okay=True, dir_okay=True, path_type=Path), help='Path to the RO-Crate directory or zip file.') +@click.option('--username', required=True, envvar='FAIRSCAPE_USERNAME', help='Fairscape username (can also be set via FAIRSCAPE_USERNAME env var).') +@click.option('--password', required=True, envvar='FAIRSCAPE_PASSWORD', help='Fairscape password (can also be set via FAIRSCAPE_PASSWORD env var).') +@click.option('--api-url', default='https://fairscape.net/api', help='Fairscape API URL (default: https://fairscape.net/api).') +def publish_fairscape(rocrate: Path, username: str, password: str, api_url: str): + """Upload RO-Crate directory or zip file to Fairscape.""" + publisher = FairscapePublisher(base_url=api_url) + publisher.publish(rocrate_path=rocrate, username=username, password=password) + 
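The publish commands above resolve credentials from an explicit flag or, failing that, from an environment variable via Click's `envvar` support. A minimal stdlib sketch of that precedence, reusing the `FAIRSCAPE_USERNAME` variable named above (the `resolve_option` helper is illustrative, not part of the CLI):

```python
import os

def resolve_option(explicit, env_name):
    # An explicitly passed flag always wins; otherwise fall back to the environment.
    if explicit is not None:
        return explicit
    return os.environ.get(env_name)

os.environ["FAIRSCAPE_USERNAME"] = "alice"
print(resolve_option(None, "FAIRSCAPE_USERNAME"))   # env-var fallback -> alice
print(resolve_option("bob", "FAIRSCAPE_USERNAME"))  # explicit flag wins -> bob
```

Because the options are also marked `required=True`, Click only errors out when neither the flag nor the environment variable supplies a value.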
+@publish_group.command('dataverse') +@click.option('--rocrate', required=True, type=click.Path(exists=True, dir_okay=False, path_type=Path), help='Path to the ro-crate-metadata.json file.') +@click.option('--url', required=True, help='Base URL of the target Dataverse instance (e.g., https://dataverse.example.edu).') +@click.option('--collection', required=True, help='Alias of the target Dataverse collection to publish into.') +@click.option('--token', required=True, envvar='DATAVERSE_API_TOKEN', help='Dataverse API token (can also be set via DATAVERSE_API_TOKEN env var).') +@click.option('--authors-csv', type=click.Path(exists=True, dir_okay=False, path_type=Path), help='Optional CSV file with author details (name, affiliation, orcid). Requires "name" column header.') +def publish_dataverse(rocrate: Path, url: str, collection: str, token: str, authors_csv: Optional[Path]): + """Publish RO-Crate metadata as a new dataset to Dataverse.""" + publisher = DataversePublisher(base_url=url, collection_alias=collection) + publisher.publish(rocrate_path=rocrate, api_token=token, authors_csv_path=str(authors_csv) if authors_csv else None) + +@publish_group.command('doi') +@click.option('--rocrate', required=True, type=click.Path(exists=True, dir_okay=False, path_type=Path), help='Path to the ro-crate-metadata.json file.') +@click.option('--prefix', required=True, help='Your DataCite DOI prefix (e.g., 10.1234).') +@click.option('--username', required=True, envvar='DATACITE_USERNAME', help='DataCite API username (repository ID, e.g., MEMBER.REPO) (can use DATACITE_USERNAME env var).') +@click.option('--password', required=True, envvar='DATACITE_PASSWORD', help='DataCite API password (can use DATACITE_PASSWORD env var).') +@click.option('--api-url', default='https://api.datacite.org', help='DataCite API URL (default: https://api.datacite.org, use https://api.test.datacite.org for testing).') +@click.option('--event', type=click.Choice(['publish', 'register', 'hide'], 
case_sensitive=False), default='publish', help="DOI event type: 'publish' (make the DOI findable), 'register' (move a draft DOI to the registered state), 'hide' (move a findable DOI back to registered, hiding it from public search).") +def publish_doi(rocrate: Path, prefix: str, username: str, password: str, api_url: str, event: str): + """Mint or update a DOI on DataCite using RO-Crate metadata.""" + repository_id = username + publisher = DataCitePublisher(prefix=prefix, repository_id=repository_id, api_url=api_url) + publisher.publish(rocrate_path=rocrate, username=username, password=password, event=event) \ No newline at end of file diff --git a/src/fairscape_cli/commands/release_commands.py b/src/fairscape_cli/commands/release_commands.py new file mode 100644 index 0000000..34b28a8 --- /dev/null +++ b/src/fairscape_cli/commands/release_commands.py @@ -0,0 +1,243 @@ +import click +import pathlib +import json +from datetime import datetime +from typing import List, Optional +import os +from pathlib import Path + +from fairscape_cli.models import ( + GenerateROCrate, + LinkSubcrates, + collect_subcrate_metadata +) + +from fairscape_cli.datasheet_builder.rocrate.datasheet_generator import DatasheetGenerator + +@click.group('release') +def release_group(): + """Manage multi-part RO-Crate releases. 
+ """ + pass + +@release_group.command('build') +@click.argument('release-directory', type=click.Path(exists=False, path_type=pathlib.Path, file_okay=False, dir_okay=True)) +@click.option('--guid', required=False, type=str, default="", show_default=False, help="GUID for the parent release RO-Crate (generated if not provided).") +@click.option('--name', required=True, type=str, help="Name for the parent release RO-Crate.") +@click.option('--organization-name', required=True, type=str, help="Organization name associated with the release.") +@click.option('--project-name', required=True, type=str, help="Project name associated with the release.") +@click.option('--description', required=True, type=str, help="Description of the release RO-Crate.") +@click.option('--keywords', required=True, multiple=True, type=str, help="Keywords for the release RO-Crate.") +@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/", help="License URL for the release.") +@click.option('--date-published', required=False, type=str, help="Publication date (ISO format, defaults to now).") +@click.option('--author', required=False, type=str, default=None, help="Author(s) of the release.") +@click.option('--version', required=False, type=str, default="1.0", help="Version of the release.") +@click.option('--associated-publication', required=False, multiple=True, type=str, help="Associated publications for the release.") +@click.option('--conditions-of-access', required=False, type=str, help="Conditions of access for the release.") +@click.option('--copyright-notice', required=False, type=str, help="Copyright notice for the release.") +@click.option('--doi', required=False, type=str, help="DOI identifier for the release.") +@click.option('--publisher', required=False, type=str, help="Publisher of the release.") +@click.option('--principal-investigator', required=False, type=str, help="Principal investigator for the release.") 
+@click.option('--contact-email', required=False, type=str, help="Contact email for the release.") +@click.option('--confidentiality-level', required=False, type=str, help="Confidentiality level for the release.") +@click.option('--citation', required=False, type=str, help="Citation for the release.") +@click.option('--funder', required=False, type=str, help="Funder of the release.") +@click.option('--usage-info', required=False, type=str, help="Usage information for the release.") +@click.option('--content-size', required=False, type=str, help="Content size of the release.") +@click.option('--completeness', required=False, type=str, help="Completeness information for the release.") +@click.option('--maintenance-plan', required=False, type=str, help="Maintenance plan for the release.") +@click.option('--intended-use', required=False, type=str, help="Intended use of the release.") +@click.option('--limitations', required=False, type=str, help="Limitations of the release.") +@click.option('--prohibited-uses', required=False, type=str, help="Prohibited uses of the release.") +@click.option('--potential-sources-of-bias', required=False, type=str, help="Potential sources of bias in the release.") +@click.option('--human-subject', required=False, type=str, help="Human subject involvement information.") +@click.option('--ethical-review', required=False, type=str, help="Ethical review information.") +@click.option('--additional-properties', required=False, type=str, help="JSON string with additional property values.") +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties for the parent crate.') +@click.pass_context +def build_release( + ctx, + release_directory: pathlib.Path, + guid: str, + name: str, + organization_name: str, + project_name: str, + description: str, + keywords: List[str], + license: str, + date_published: Optional[str], + author: Optional[str], + version: str, + associated_publication: Optional[List[str]], + 
conditions_of_access: Optional[str], + copyright_notice: Optional[str], + doi: Optional[str], + publisher: Optional[str], + principal_investigator: Optional[str], + contact_email: Optional[str], + confidentiality_level: Optional[str], + citation: Optional[str], + funder: Optional[str], + usage_info: Optional[str], + content_size: Optional[str], + completeness: Optional[str], + maintenance_plan: Optional[str], + intended_use: Optional[str], + limitations: Optional[str], + prohibited_uses: Optional[str], + potential_sources_of_bias: Optional[str], + human_subject: Optional[str], + ethical_review: Optional[str], + additional_properties: Optional[str], + custom_properties: Optional[str], +): + """ + Create a 'release' RO-Crate in RELEASE_DIRECTORY, scanning for and linking existing sub-RO-Crates. + """ + click.echo(f"Starting release process in: {release_directory.resolve()}") + + if not release_directory.exists(): + release_directory.mkdir(parents=True, exist_ok=True) + + + subcrate_metadata = collect_subcrate_metadata(release_directory) + + if author is None: + combined_authors = subcrate_metadata['authors'] + if combined_authors: + author = ", ".join(combined_authors) + else: + author = "Unknown" + + combined_keywords = list(keywords) + for keyword in subcrate_metadata['keywords']: + if keyword not in combined_keywords: + combined_keywords.append(keyword) + + parent_params = { + "guid": guid, + "name": name, + "organizationName": organization_name, + "projectName": project_name, + "description": description, + "keywords": combined_keywords, + "license": license, + "datePublished": date_published or datetime.now().isoformat(), + "author": author, + "version": version, + "associatedPublication": associated_publication if associated_publication else None, + "conditionsOfAccess": conditions_of_access, + "copyrightNotice": copyright_notice, + "path": release_directory + } + + if doi: + parent_params["identifier"] = doi + if publisher: + parent_params["publisher"] = 
publisher + if principal_investigator: + parent_params["principalInvestigator"] = principal_investigator + if contact_email: + parent_params["contactEmail"] = contact_email + if confidentiality_level: + parent_params["confidentialityLevel"] = confidentiality_level + if citation: + parent_params["citation"] = citation + if funder: + parent_params["funder"] = funder + if usage_info: + parent_params["usageInfo"] = usage_info + if content_size: + parent_params["contentSize"] = content_size + if ethical_review: + parent_params["ethicalReview"] = ethical_review + + additional_props = [] + if completeness: + additional_props.append({ + "@type": "PropertyValue", + "name": "Completeness", + "value": completeness + }) + if maintenance_plan: + additional_props.append({ + "@type": "PropertyValue", + "name": "Maintenance Plan", + "value": maintenance_plan + }) + if intended_use: + additional_props.append({ + "@type": "PropertyValue", + "name": "Intended Use", + "value": intended_use + }) + if limitations: + additional_props.append({ + "@type": "PropertyValue", + "name": "Limitations", + "value": limitations + }) + if prohibited_uses: + additional_props.append({ + "@type": "PropertyValue", + "name": "Prohibited Uses", + "value": prohibited_uses + }) + if potential_sources_of_bias: + additional_props.append({ + "@type": "PropertyValue", + "name": "Potential Sources of Bias", + "value": potential_sources_of_bias + }) + if human_subject: + additional_props.append({ + "@type": "PropertyValue", + "name": "Human Subject", + "value": human_subject + }) + + if additional_properties: + try: + add_props = json.loads(additional_properties) + if isinstance(add_props, list): + additional_props.extend(add_props) + else: + click.echo("ERROR: additional-properties must be a JSON array") + ctx.exit(1) + except json.JSONDecodeError: + click.echo("ERROR: Invalid JSON in --additional-properties") + ctx.exit(1) + + if additional_props: + parent_params["additionalProperty"] = additional_props + + if 
custom_properties: + try: + custom_props_dict = json.loads(custom_properties) + if not isinstance(custom_props_dict, dict): + raise ValueError("Custom properties must be a JSON object") + parent_params.update(custom_props_dict) + except json.JSONDecodeError: + click.echo("ERROR: Invalid JSON in --custom-properties") + ctx.exit(1) + except ValueError as e: + click.echo(f"ERROR: {e}") + ctx.exit(1) + + try: + parent_crate_root_dict = GenerateROCrate(**parent_params) + parent_crate_guid = parent_crate_root_dict['@id'] + click.echo(f"Initialized parent RO-Crate: {parent_crate_guid}") + except Exception as e: + click.echo(f"ERROR: Failed to initialize parent RO-Crate: {e}") + ctx.exit(1) + + linked_ids = LinkSubcrates(parent_crate_path=release_directory) + if linked_ids: + click.echo(f"Successfully linked {len(linked_ids)} sub-crate(s):") + for sub_id in linked_ids: + click.echo(f" - {sub_id}") + else: + click.echo("No valid sub-crates were found or linked.") + + click.echo(f"Release process finished successfully for: {parent_crate_guid}") \ No newline at end of file diff --git a/src/fairscape_cli/commands/rocrate_commands.py b/src/fairscape_cli/commands/rocrate_commands.py new file mode 100644 index 0000000..b54dd98 --- /dev/null +++ b/src/fairscape_cli/commands/rocrate_commands.py @@ -0,0 +1,689 @@ +import click +import pathlib +import json +from typing import List, Optional, Union +from pydantic import ValidationError +from datetime import datetime + + +from fairscape_cli.models.rocrate import ( + GenerateROCrate, ReadROCrateMetadata, AppendCrate, CopyToROCrate, ROCrate +) +from fairscape_cli.models.dataset import GenerateDataset +from fairscape_cli.models.software import GenerateSoftware +from fairscape_cli.models.computation import GenerateComputation + +from fairscape_cli.models.utils import FileNotInCrateException +from fairscape_cli.config import NAAN +from fairscape_cli.models import generateSummaryStatsElements +from fairscape_cli.models.guid_utils import 
GenerateDatetimeSquid + + +@click.group('rocrate') +def rocrate_group(): + """Core operations for local RO-Crate manipulation.""" + pass + +@rocrate_group.command('init') +@click.option('--guid', required=False, type=str, default="", show_default=False) +@click.option('--name', required=True, type=str) +@click.option('--organization-name', required=True, type=str) +@click.option('--project-name', required=True, type=str) +@click.option('--description', required=True, type=str) +@click.option('--keywords', required=True, multiple=True, type=str) +@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/") +@click.option('--date-published', required=False, type=str) +@click.option('--author', required=False, type=str, default="Unknown") +@click.option('--version', required=False, type=str, default="1.0") +@click.option('--associated-publication', required=False, type=str) +@click.option('--conditions-of-access', required=False, type=str) +@click.option('--copyright-notice', required=False, type=str) +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties to include') +def init( + guid, name, organization_name, project_name, description, keywords, license, + date_published, author, version, associated_publication, conditions_of_access, + copyright_notice, custom_properties +): + """Initialize an RO-Crate in the current working directory.""" + params = { + "guid": guid, "name": name, "organizationName": organization_name, + "projectName": project_name, "description": description, "keywords": list(keywords), + "license": license, "datePublished": date_published, "author": author, + "version": version, "associatedPublication": associated_publication, + "conditionsOfAccess": conditions_of_access, "copyrightNotice": copyright_notice, + "path": pathlib.Path.cwd() + } + if custom_properties: + try: + custom_props = json.loads(custom_properties) + if not 
isinstance(custom_props, dict): raise ValueError("Custom properties must be a JSON object") + params.update(custom_props) + except Exception as e: + click.echo(f"ERROR processing custom properties: {e}", err=True) + return + + # Filter None values before passing + filtered_params = {k: v for k, v in params.items() if v is not None} + passed_crate = GenerateROCrate(**filtered_params) + click.echo(passed_crate.get("@id")) + + +@rocrate_group.command('create') +@click.argument('rocrate-path', type=click.Path(exists=False, path_type=pathlib.Path)) +@click.option('--guid', required=False, type=str, default="", show_default=False) +@click.option('--name', required=True, type=str) +@click.option('--organization-name', required=True, type=str) +@click.option('--project-name', required=True, type=str) +@click.option('--description', required=True, type=str) +@click.option('--keywords', required=True, multiple=True, type=str) +@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/") +@click.option('--date-published', required=False, type=str) +@click.option('--author', required=False, type=str, default="Unknown") +@click.option('--version', required=False, type=str, default="1.0") +@click.option('--associated-publication', required=False, type=str) +@click.option('--conditions-of-access', required=False, type=str) +@click.option('--copyright-notice', required=False, type=str) +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties to include') +def create( + rocrate_path, guid, name, organization_name, project_name, description, keywords, + license, date_published, author, version, associated_publication, + conditions_of_access, copyright_notice, custom_properties +): + """Create an RO-Crate in the specified path.""" + params = { + "guid": guid, "name": name, "organizationName": organization_name, + "projectName": project_name, "description": description, "keywords": 
list(keywords), + "license": license, "datePublished": date_published, "author": author, + "version": version, "associatedPublication": associated_publication, + "conditionsOfAccess": conditions_of_access, "copyrightNotice": copyright_notice, + "path": rocrate_path + } + if custom_properties: + try: + custom_props = json.loads(custom_properties) + if not isinstance(custom_props, dict): raise ValueError("Custom properties must be a JSON object") + params.update(custom_props) + except Exception as e: + click.echo(f"ERROR processing custom properties: {e}", err=True) + return + + # Filter None values before passing + filtered_params = {k: v for k, v in params.items() if v is not None} + passed_crate = GenerateROCrate(**filtered_params) + click.echo(passed_crate.get("@id")) + + +@rocrate_group.group('register') +def register(): + """Add a metadata record to the RO-Crate for a Dataset, Software, or Computation (metadata only).""" + pass + +@register.command('software') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.option('--guid', type=str, required=False, default=None, help='Identifier for the software (generated if not provided)') +@click.option('--name', required=True, help='Name of the software') +@click.option('--author', required=True, help='Author of the software') +@click.option('--version', required=True, help='Version of the software') +@click.option('--description', required=True, help='Description of the software') +@click.option('--keywords', required=True, multiple=True, help='Keywords for the software') +@click.option('--file-format', required=True, help='Format of the software (e.g., py, js)') +@click.option('--url', required=False, help='URL reference for the software') +@click.option('--date-modified', required=False, help='Last modification date of the software (ISO format)') +@click.option('--filepath', required=False, help='Path to the software file (relative to crate root)') 
+@click.option('--used-by-computation', required=False, multiple=True, help='Identifiers of computations that use this software') +@click.option('--associated-publication', required=False, help='Associated publication identifier') +@click.option('--additional-documentation', required=False, help='Additional documentation') +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties.') +@click.pass_context +def registerSoftware( + ctx, + rocrate_path: pathlib.Path, + guid: Optional[str], + name: str, + author: str, + version: str, + description: str, + keywords: List[str], + file_format: str, + url: Optional[str], + date_modified: Optional[str], + filepath: Optional[str], + used_by_computation: Optional[List[str]], + associated_publication: Optional[str], + additional_documentation: Optional[str], + custom_properties: Optional[str], +): + """Register Software metadata with the specified RO-Crate.""" + try: + ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {exc}", err=True) + ctx.exit(code=1) + + params = { + "guid": guid, "name": name, "author": author, "version": version, + "description": description, "keywords": list(keywords), "fileFormat": file_format, + "url": url, "dateModified": date_modified, "filepath": filepath, + "usedByComputation": list(used_by_computation) if used_by_computation else [], + "associatedPublication": associated_publication, + "additionalDocumentation": additional_documentation, + "cratePath": rocrate_path + } + + if custom_properties: + try: + custom_props = json.loads(custom_properties) + if not isinstance(custom_props, dict): raise ValueError("Custom properties must be a JSON object") + params.update(custom_props) + except Exception as e: + click.echo(f"ERROR processing custom properties: {e}", err=True) + ctx.exit(code=1) + + # Filter None values before passing + filtered_params = {k: v for k, v in params.items() if v is not None} + + 
try: + software_instance = GenerateSoftware(**filtered_params) + AppendCrate(cratePath=rocrate_path, elements=[software_instance]) + click.echo(software_instance.guid) + except FileNotInCrateException as e: + click.echo(f"ERROR: {e}", err=True) + ctx.exit(code=1) + except ValidationError as e: + click.echo(f"ERROR: Software Validation Failure\n{e}", err=True) + ctx.exit(code=1) + except Exception as exc: + click.echo(f"ERROR: {exc}", err=True) + ctx.exit(code=1) + + +@register.command('dataset') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.option('--guid', type=str, required=False, default=None, help='Identifier for the dataset (generated if not provided)') +@click.option('--name', required=True, help='Name of the dataset') +@click.option('--author', required=True, help='Author of the dataset') +@click.option('--version', required=True, help='Version of the dataset') +@click.option('--description', required=True, help='Description of the dataset') +@click.option('--keywords', required=True, multiple=True, help='Keywords for the dataset') +@click.option('--data-format', required=True, help='Format of the dataset (e.g., csv, json)') +@click.option('--filepath', required=True, help='Path to the dataset file') +@click.option('--url', required=False, help='URL reference for the dataset') +@click.option('--date-published', required=False, help='Publication date of the dataset (ISO format)') +@click.option('--schema', required=False, help='Schema identifier for the dataset') +@click.option('--used-by', required=False, multiple=True, help='Identifiers of computations that use this dataset') +@click.option('--derived-from', required=False, multiple=True, help='Identifiers of datasets this one is derived from') +@click.option('--generated-by', required=False, multiple=True, help='Identifiers of computations that generated this dataset') +@click.option('--summary-statistics-filepath', required=False, 
type=click.Path(exists=True), help='Path to summary statistics file') +@click.option('--associated-publication', required=False, help='Associated publication identifier') +@click.option('--additional-documentation', required=False, help='Additional documentation') +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties to include') +@click.pass_context +def registerDataset( + ctx, + rocrate_path: pathlib.Path, + guid: str, + name: str, + author: str, + version: str, + description: str, + keywords: List[str], + data_format: str, + filepath: str, + url: Optional[str] = None, + date_published: Optional[str] = None, + schema: Optional[str] = None, + used_by: Optional[List[str]] = None, + derived_from: Optional[List[str]] = None, + generated_by: Optional[List[str]] = None, + summary_statistics_filepath: Optional[str] = None, + associated_publication: Optional[str] = None, + additional_documentation: Optional[str] = None, + custom_properties: Optional[str] = None, +): + """Register Dataset object metadata with the specified RO-Crate. + + This command registers a dataset with the specified RO-Crate. It provides + common options directly, but also supports custom properties through the + --custom-properties option. + + Examples: + fairscape rocrate register dataset ./my-crate --name "My Dataset" --author "John Doe" ... + + # With custom properties: + fairscape rocrate register dataset ./my-crate --name "My Dataset" ... 
--custom-properties '{"publisher": "Acme Corp", "license": "CC-BY-4.0"}' + """ + + try: + ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {str(exc)}") + ctx.exit(code=1) + + try: + custom_props = {} + if custom_properties: + try: + custom_props = json.loads(custom_properties) + if not isinstance(custom_props, dict): + raise ValueError("Custom properties must be a JSON object") + except json.JSONDecodeError: + click.echo("ERROR: Invalid JSON in custom-properties") + ctx.exit(code=1) + + params = { + "guid": guid, + "name": name, + "author": author, + "description": description, + "keywords": keywords, + "version": version, + "format": data_format, + "filepath": filepath, + "cratePath": rocrate_path, + } + + if url: + params["url"] = url + if date_published: + params["datePublished"] = date_published + if schema: + params["schema"] = schema + if used_by: + params["usedBy"] = used_by + if derived_from: + params["derivedFrom"] = derived_from + if generated_by: + params["generatedBy"] = generated_by + if associated_publication: + params["associatedPublication"] = associated_publication + if additional_documentation: + params["additionalDocumentation"] = additional_documentation + + params.update(custom_props) + + summary_stats_guid = None + elements = [] + + if summary_statistics_filepath: + summary_stats_guid, summary_stats_instance, computation_instance = generateSummaryStatsElements( + name=name, + author=author, + keywords=keywords, + date_published=date_published or "", + version=version, + associated_publication=associated_publication, + additional_documentation=additional_documentation, + schema=schema, + dataset_guid=guid or "", + summary_statistics_filepath=summary_statistics_filepath, + crate_path=rocrate_path + ) + elements.extend([computation_instance, summary_stats_instance]) + params["summary_stats_guid"] = summary_stats_guid + + dataset_instance = GenerateDataset(**params) + + elements.insert(0, 
dataset_instance) + AppendCrate(cratePath=rocrate_path, elements=elements) + click.echo(dataset_instance.guid) + + except FileNotInCrateException as e: + click.echo(f"ERROR: {str(e)}") + ctx.exit(code=1) + + except ValidationError as e: + click.echo("Dataset Validation Error") + click.echo(e) + ctx.exit(code=1) + + except Exception as exc: + click.echo(f"ERROR: {str(exc)}") + ctx.exit(code=1) + + +@register.command('computation') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.option('--guid', type=str, required=False, default=None, help='Identifier for the computation (generated if not provided)') +@click.option('--name', required=True, help='Name of the computation') +@click.option('--run-by', required=True, help='Person or entity that ran the computation') +@click.option('--command', required=False, help='Command used to run the computation (string or JSON list)') +@click.option('--date-created', required=True, help='Date the computation was run (ISO format)') +@click.option('--description', required=True, help='Description of the computation') +@click.option('--keywords', required=True, multiple=True, help='Keywords for the computation') +@click.option('--used-software', required=False, multiple=True, help='Software identifiers used by this computation') +@click.option('--used-dataset', required=False, multiple=True, help='Dataset identifiers used by this computation') +@click.option('--generated', required=False, multiple=True, help='Dataset/Software identifiers generated by this computation') +@click.option('--associated-publication', required=False, help='Associated publication identifier') +@click.option('--additional-documentation', required=False, help='Additional documentation') +@click.option('--custom-properties', required=False, type=str, help='JSON string with additional properties.') +@click.pass_context +def computation( + ctx, + rocrate_path: pathlib.Path, + guid: Optional[str], + name: str, + run_by: 
str, + command: Optional[str], + date_created: str, + description: str, + keywords: List[str], + used_software: Optional[List[str]], + used_dataset: Optional[List[str]], + generated: Optional[List[str]], + associated_publication: Optional[str], + additional_documentation: Optional[str], + custom_properties: Optional[str], +): + """Register Computation metadata with the specified RO-Crate.""" + try: + ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {exc}", err=True) + ctx.exit(code=1) + + params = { + "guid": guid, "name": name, "runBy": run_by, "command": command, + "dateCreated": date_created, "description": description, "keywords": list(keywords), + "usedSoftware": list(used_software) if used_software else [], + "usedDataset": list(used_dataset) if used_dataset else [], + "generated": list(generated) if generated else [], + "associatedPublication": associated_publication, + "additionalDocumentation": additional_documentation + } + + if custom_properties: + try: + custom_props = json.loads(custom_properties) + if not isinstance(custom_props, dict): raise ValueError("Custom properties must be a JSON object") + params.update(custom_props) + except Exception as e: + click.echo(f"ERROR processing custom properties: {e}", err=True) + ctx.exit(code=1) + + # Filter None values before passing + filtered_params = {k: v for k, v in params.items() if v is not None} + + try: + computationInstance = GenerateComputation(**filtered_params) + AppendCrate(cratePath=rocrate_path, elements=[computationInstance]) + click.echo(computationInstance.guid) + except ValidationError as e: + click.echo(f"ERROR: Computation Validation Failure\n{e}", err=True) + ctx.exit(code=1) + except Exception as exc: + click.echo(f"ERROR: {exc}", err=True) + ctx.exit(code=1) + + +@register.command('subrocrate') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.argument('subrocrate-path', 
type=click.Path(path_type=pathlib.Path)) +@click.option('--guid', required=False, type=str, default="", show_default=False) +@click.option('--name', required=True, type=str) +@click.option('--organization-name', required=True, type=str) +@click.option('--project-name', required=True, type=str) +@click.option('--description', required=True, type=str) +@click.option('--keywords', required=True, multiple=True, type=str) +@click.option('--author', required=False, type=str, default="Unknown") +@click.option('--version', required=False, type=str, default="1.0") +@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/") +@click.pass_context +def subrocrate( + ctx, + rocrate_path: pathlib.Path, + subrocrate_path: pathlib.Path, + guid: str, + name: str, + organization_name: str, + project_name: str, + description: str, + keywords: List[str], + author: str, + version: str, + license: str +): + """Register a new RO-Crate within an existing RO-Crate directory. 
+
+    ROCRATE_PATH: Path to the parent RO-Crate
+    SUBROCRATE_PATH: Relative path within the parent RO-Crate where the subcrate should be created
+    """
+    try:
+        metadata = ReadROCrateMetadata(rocrate_path)
+        root_metadata = metadata['@graph'][1].model_dump(by_alias=True)
+
+        parent_author = root_metadata.get('author', author or "Unknown")
+        parent_version = root_metadata.get('version', version or "1.0")
+        parent_license = root_metadata.get('license', license)
+
+        parent_crate = ROCrate(
+            guid=root_metadata['@id'],
+            metadataType=root_metadata.get('@type', ["Dataset", "https://w3id.org/EVI#ROCrate"]),
+            name=root_metadata['name'],
+            description=root_metadata['description'],
+            keywords=root_metadata['keywords'],
+            author=parent_author,
+            version=parent_version,
+            license=parent_license,
+            isPartOf=root_metadata.get('isPartOf', []),
+            hasPart=root_metadata.get('hasPart', []),
+            path=rocrate_path
+        )
+
+        subcrate_id = parent_crate.create_subcrate(
+            subcrate_path=subrocrate_path,
+            guid=guid,
+            name=name,
+            description=description,
+            keywords=keywords,
+            organization_name=organization_name,
+            project_name=project_name,
+            author=author or parent_author,
+            version=version or parent_version,
+            license=license or parent_license
+        )
+
+        click.echo(subcrate_id)
+
+    except Exception as exc:
+        click.echo(f"ERROR: {str(exc)}")
+        ctx.exit(code=1)
+
+@rocrate_group.group('add')
+def add():
+    """Add a file to the RO-Crate and register its metadata."""
+    pass
+
+@add.command('software')
+@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path))
+@click.option('--guid', type=str, required=False, default=None)
+@click.option('--name', required=True)
+@click.option('--author', required=True)
+@click.option('--version', required=True)
+@click.option('--description', required=True)
+@click.option('--keywords', required=True, multiple=True)
+@click.option('--file-format', required=True)
+@click.option('--url', required=False)
+@click.option('--source-filepath', required=True)
+@click.option('--destination-filepath', required=True)
+@click.option('--date-modified', required=True)
+@click.option('--used-by-computation', required=False, multiple=True)
+@click.option('--associated-publication', required=False)
+@click.option('--additional-documentation', required=False)
+@click.pass_context
+def software(
+    ctx,
+    rocrate_path: pathlib.Path,
+    guid,
+    name,
+    author,
+    version,
+    description,
+    keywords,
+    file_format,
+    url,
+    source_filepath,
+    destination_filepath,
+    date_modified,
+    used_by_computation,
+    associated_publication,
+    additional_documentation
+):
+    """Copy a Software file into the RO-Crate and register its metadata.
+
+    Example:
+        fairscape rocrate add software ./my-crate --name "analysis.py" --author "Jane Doe" ...
+    """
+    try:
+        crateInstance = ReadROCrateMetadata(rocrate_path)
+    except Exception as exc:
+        click.echo(f"ERROR Reading ROCrate: {str(exc)}")
+        ctx.exit(code=1)
+
+    try:
+        # copy the file into the RO-Crate before registering its metadata
+        CopyToROCrate(source_filepath, destination_filepath)
+
+        software_instance = GenerateSoftware(
+            guid=guid,
+            url=url,
+            name=name,
+            version=version,
+            keywords=keywords,
+            fileFormat=file_format,
+            description=description,
+            author=author,
+            associatedPublication=associated_publication,
+            additionalDocumentation=additional_documentation,
+            dateModified=date_modified,
+            usedByComputation=used_by_computation,
+            filepath=destination_filepath,
+            cratePath=rocrate_path
+        )
+
+        AppendCrate(cratePath=rocrate_path, elements=[software_instance])
+        click.echo(software_instance.guid)
+
+    except ValidationError as e:
+        click.echo("Software Validation Error")
+        click.echo(e)
+        ctx.exit(code=1)
+
+    except Exception as exc:
+        click.echo(f"ERROR: {str(exc)}")
+        ctx.exit(code=1)
+
+    # TODO add to cache
+
+
+@add.command('dataset')
+@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path))
+@click.option('--guid', type=str, required=False, default=None)
+@click.option('--name', required=True)
+@click.option('--url', required=False)
+@click.option('--author', required=True)
+@click.option('--version', required=True)
+@click.option('--date-published',
required=True) +@click.option('--description', required=True) +@click.option('--keywords', required=True, multiple=True) +@click.option('--data-format', required=True) +@click.option('--source-filepath', required=True) +@click.option('--destination-filepath', required=True) +@click.option('--summary-statistics-source', required=False, type=click.Path(exists=True)) +@click.option('--summary-statistics-destination', required=False, type=click.Path()) +@click.option('--used-by', required=False, multiple=True) +@click.option('--derived-from', required=False, multiple=True) +@click.option('--generated-by', required=False, multiple=True) +@click.option('--schema', required=False, type=str) +@click.option('--associated-publication', required=False) +@click.option('--additional-documentation', required=False) +@click.pass_context +def dataset( + ctx, + rocrate_path: pathlib.Path, + guid, + name, + url, + author, + version, + date_published, + description, + keywords, + data_format, + source_filepath, + destination_filepath, + summary_statistics_source, + summary_statistics_destination, + used_by, + derived_from, + generated_by, + schema, + associated_publication, + additional_documentation, +): + """Add a Dataset file and its metadata to the RO-Crate.""" + try: + crateInstance = ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {str(exc)}") + ctx.exit(code=1) + + try: + # Copy main dataset file + CopyToROCrate(source_filepath, destination_filepath) + + # Generate main dataset GUID + sq_dataset = GenerateDatetimeSquid() + dataset_guid = guid if guid else f"ark:{NAAN}/dataset-{name.lower().replace(' ', '-')}-{sq_dataset}" + + summary_stats_guid = None + elements = [] + + # Handle summary statistics if provided + if summary_statistics_source and summary_statistics_destination: + # Copy summary statistics file + CopyToROCrate(summary_statistics_source, summary_statistics_destination) + + # Generate summary statistics elements + 
summary_stats_guid, summary_stats_instance, computation_instance = generateSummaryStatsElements( + name=name, + author=author, + keywords=keywords, + date_published=date_published, + version=version, + associated_publication=associated_publication, + additional_documentation=additional_documentation, + schema=schema, + dataset_guid=dataset_guid, + summary_statistics_filepath=summary_statistics_destination, + crate_path=rocrate_path + ) + elements.extend([computation_instance, summary_stats_instance]) + + # Generate main dataset + dataset_instance = GenerateDataset( + guid=dataset_guid, + url=url, + author=author, + name=name, + description=description, + keywords=keywords, + datePublished=date_published, + version=version, + associatedPublication=associated_publication, + additionalDocumentation=additional_documentation, + dataFormat=data_format, + schema=schema, + derivedFrom=derived_from, + generatedBy=generated_by, + usedBy=used_by, + filepath=destination_filepath, + cratePath=rocrate_path, + summary_stats_guid=summary_stats_guid + ) + + elements.insert(0, dataset_instance) + AppendCrate(cratePath=rocrate_path, elements=elements) + click.echo(dataset_instance.guid) + + except ValidationError as e: + click.echo("Dataset Validation Error") + click.echo(e) + ctx.exit(code=1) + + except Exception as exc: + click.echo(f"ERROR: {str(exc)}") + ctx.exit(code=1) \ No newline at end of file diff --git a/src/fairscape_cli/schema/schema.py b/src/fairscape_cli/commands/schema_commands.py similarity index 80% rename from src/fairscape_cli/schema/schema.py rename to src/fairscape_cli/commands/schema_commands.py index a09cf2b..cc498cb 100644 --- a/src/fairscape_cli/schema/schema.py +++ b/src/fairscape_cli/commands/schema_commands.py @@ -10,6 +10,8 @@ Type ) +from fairscape_cli.models import ReadROCrateMetadata, AppendCrate + from fairscape_cli.models.schema.tabular import ( TabularValidationSchema, HDF5ValidationSchema, @@ -300,22 +302,24 @@ def validate(ctx, schema, data): for 
validation_failure in metadata_error.errors(): click.echo(f"property: {validation_failure.get('loc')} \tmsg: {validation_failure.get('msg')}") ctx.exit(1) - except Exception as e: - click.echo(f"Error during validation: {str(e)}") - ctx.exit(1) @schema.command('infer') @click.option('--name', required=True, type=str) @click.option('--description', required=True, type=str) @click.option('--guid', required=False, type=str, default="", show_default=False) @click.argument('input_file', type=click.Path(exists=True)) +@click.option('--rocrate-path', required=False, type=click.Path(exists=True, path_type=pathlib.Path), help='Optional path to an RO-Crate to append the schema to') @click.argument('schema_file', type=str) @click.pass_context -def infer_schema(ctx, name, description, guid, input_file, schema_file): - """Infer a schema from a file (CSV, TSV, Parquet, or HDF5).""" +def infer_schema_rocrate(ctx, name, description, guid, input_file, rocrate_path, schema_file): + """Infer a schema from a file and optionally append it to an RO-Crate. 
+ + INPUT_FILE: File to infer schema from (CSV, TSV, Parquet, or HDF5) + SCHEMA_FILE: Path to save the schema file + """ try: + # Determine schema type and infer schema schema_class = determine_schema_type(input_file) - schema_model = schema_class.infer_from_file( input_file, name, @@ -323,15 +327,77 @@ def infer_schema(ctx, name, description, guid, input_file, schema_file): ) if guid: schema_model.guid = guid - + WriteSchema(schema_model, schema_file) ext = pathlib.Path(input_file).suffix.lower()[1:] click.echo(f"Inferred Schema from {ext} file: {str(schema_file)}") - + + # If RO-Crate path is provided, append the schema to it + if rocrate_path: + + # Read the RO-Crate to verify it exists and is valid + try: + ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {str(exc)}") + ctx.exit(code=1) + + # Append to RO-Crate + AppendCrate(cratePath=rocrate_path, elements=[schema_model]) + click.echo(f"Added Schema to RO-Crate with ID: {schema_model.guid}") + except ValueError as e: click.echo(f"Error with file type: {str(e)}") ctx.exit(code=1) except Exception as e: click.echo(f"Error inferring schema: {str(e)}") + ctx.exit(code=1) + + +@schema.command('add-to-crate') +@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) +@click.argument('schema-file', type=click.Path(exists=True)) +@click.pass_context +def register_schema( + ctx, + rocrate_path: pathlib.Path, + schema_file: str, +): + """Register a JSON Schema with the specified RO-Crate. 
+ + ROCRATE-PATH: Path to the RO-Crate to add the schema to + SCHEMA-FILE: Path to the schema JSON file + """ + try: + + try: + ReadROCrateMetadata(rocrate_path) + except Exception as exc: + click.echo(f"ERROR Reading ROCrate: {str(exc)}") + ctx.exit(code=1) + + # Read schema file + with open(schema_file, 'r') as f: + schema_data = json.load(f) + + try: + schema_model = TabularValidationSchema.from_dict(schema_data) + click.echo(f"Loaded schema as TabularValidationSchema") + except Exception as tabular_error: + # If that fails, try HDF5ValidationSchema + try: + schema_model = HDF5ValidationSchema.from_dict(schema_data) + click.echo(f"Loaded schema as HDF5ValidationSchema") + except Exception as hdf5_error: + click.echo(f"ERROR: Could not recognize schema format") + click.echo(f"TabularValidationSchema error: {str(tabular_error)}") + click.echo(f"HDF5ValidationSchema error: {str(hdf5_error)}") + ctx.exit(code=1) + + AppendCrate(cratePath=rocrate_path, elements=[schema_model]) + click.echo(f"Schema registered with ID: {schema_model.guid}") + + except Exception as exc: + click.echo(f"ERROR: {str(exc)}") ctx.exit(code=1) \ No newline at end of file diff --git a/src/fairscape_cli/commands/validate_commands.py b/src/fairscape_cli/commands/validate_commands.py new file mode 100644 index 0000000..1238d86 --- /dev/null +++ b/src/fairscape_cli/commands/validate_commands.py @@ -0,0 +1,85 @@ +import click +import pathlib +import json +from prettytable import PrettyTable +from pydantic import ValidationError + +from fairscape_cli.models.schema.tabular import ( + TabularValidationSchema, HDF5ValidationSchema +) +from fairscape_cli.commands.schema_commands import determine_schema_type + +@click.group('validate') +def validate_group(): + """Validate data against schemas or RO-Crate structure.""" + pass + +@validate_group.command('schema') +@click.option('--schema', type=str, required=True) +@click.option('--data', type=str, required=True) +@click.pass_context +def validate(ctx, 
schema, data):
+    """Execute validation of a Schema against the provided data."""
+    if 'ark' not in schema:
+        schema_path = pathlib.Path(schema)
+        if not schema_path.exists():
+            click.echo(f"ERROR: Schema file at path {schema} does not exist")
+            ctx.exit(1)
+
+    data_path = pathlib.Path(data)
+    if not data_path.exists():
+        click.echo(f"ERROR: Data file at path {data} does not exist")
+        ctx.exit(1)
+
+    try:
+        with open(schema) as f:
+            schema_json = json.load(f)
+
+        schema_class = determine_schema_type(data)
+        validation_schema = schema_class.from_dict(schema_json)
+
+        validation_errors = validation_schema.validate_file(data)
+
+        if len(validation_errors) != 0:
+            error_table = PrettyTable()
+            if isinstance(validation_schema, HDF5ValidationSchema):
+                error_table.field_names = ['path', 'error_type', 'failed_keyword', 'message']
+            else:
+                error_table.field_names = ['row', 'error_type', 'failed_keyword', 'message']
+
+            for err in validation_errors:
+                if isinstance(validation_schema, HDF5ValidationSchema):
+                    error_table.add_row([
+                        err.path,
+                        err.type,
+                        err.failed_keyword,
+                        str(err.message)
+                    ])
+                else:
+                    error_table.add_row([
+                        err.row,
+                        err.type,
+                        err.failed_keyword,
+                        str(err.message)
+                    ])
+
+            click.echo(error_table)
+            ctx.exit(1)
+        else:
+            click.echo('Validation Success')
+            ctx.exit(0)
+
+    except click.exceptions.Exit:
+        # let ctx.exit() propagate instead of being swallowed by the generic handler
+        raise
+    except ValidationError as metadata_error:
+        click.echo("Error with schema definition")
+        for validation_failure in metadata_error.errors():
+            click.echo(f"property: {validation_failure.get('loc')} \tmsg: {validation_failure.get('msg')}")
+        ctx.exit(1)
+    except Exception as e:
+        click.echo(f"Error during validation: {str(e)}")
+        ctx.exit(1)
+
+# Placeholder for future RO-Crate structural validation
+# @validate_group.command('crate')
+# @click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path))
+# def validate_crate(ctx, rocrate_path):
+#     """Validate the structure and metadata of an RO-Crate."""
+#     # Implementation using RO-Crate-py or custom checks
+#     click.echo(f"Validating RO-Crate at {rocrate_path} (Not implemented yet)")
+#     pass
\ No newline at
end of file diff --git a/src/fairscape_cli/data_fetcher/GenomicData.py b/src/fairscape_cli/data_fetcher/GenomicData.py new file mode 100644 index 0000000..91f10ec --- /dev/null +++ b/src/fairscape_cli/data_fetcher/GenomicData.py @@ -0,0 +1,505 @@ +from pydantic import BaseModel +import json +import pathlib +from datetime import datetime +import os +from typing import List, Optional, Dict, Any + +import requests +import xml.etree.ElementTree as ET + +# Official Fairscape Models +from fairscape_models.sample import Sample +from fairscape_models.experiment import Experiment +from fairscape_models.instrument import Instrument +from fairscape_models.dataset import Dataset + +# Internal models for parsing NCBI structure before conversion +class InternalSample(BaseModel): + accession: str + title: Optional[str] = None + scientific_name: Optional[str] = None + taxon_id: Optional[str] = None + attributes: Dict[str, str] = {} + study_accession: Optional[str] = None + study_center_name: Optional[str] = None + study_title: Optional[str] = None + study_abstract: Optional[str] = None + study_description: Optional[str] = None + +class InternalExperiment(BaseModel): + accession: str + title: Optional[str] = None + study_ref: Optional[str] = None + sample_ref: Optional[str] = None + library_name: Optional[str] = None + library_strategy: Optional[str] = None + library_source: Optional[str] = None + library_selection: Optional[str] = None + library_layout: Optional[str] = None + nominal_length: Optional[str] = None + platform_type: Optional[str] = None + instrument_model: Optional[str] = None + +# Keep original Project, Output models as they represent NCBI structure +class Project(BaseModel): + id: str + accession: str + archive: Optional[str] = None + organism_name: Optional[str] = None + title: str + description: Optional[str] = None + release_date: Optional[str] = None + target_capture: Optional[str] = None + target_material: Optional[str] = None + target_sample_scope: 
Optional[str] = None + organism_species: Optional[str] = None + organism_taxID: Optional[str] = None + organism_supergroup: Optional[str] = None + method: Optional[str] = None + data_types: List[str] = [] + project_data_type: Optional[str] = None + submitted_date: Optional[str] = None + organization_role: Optional[str] = None + organization_type: Optional[str] = None + organization_name: Optional[str] = None + access: Optional[str] = None + +class OutputFile(BaseModel): + filename: str + size: Optional[int] = None + date: Optional[str] = None + url: str + md5: Optional[str] = None + +class Output(BaseModel): + accession: str + title: Optional[str] = None + experiment_ref: Optional[str] = None + total_spots: Optional[int] = None + total_bases: Optional[int] = None + size: Optional[int] = None + published: Optional[str] = None + files: List[OutputFile] = [] + nreads: Optional[int] = None + nspots: Optional[int] = None + a_count: Optional[int] = None + c_count: Optional[int] = None + g_count: Optional[int] = None + t_count: Optional[int] = None + n_count: Optional[int] = None + +# Wrapper classes using Internal types for parsing +class Samples(BaseModel): + items: List[InternalSample] + +class Experiments(BaseModel): + items: List[InternalExperiment] + +class Outputs(BaseModel): + items: List[Output] + +from fairscape_cli.data_fetcher.cell_line_api import get_cell_line_entity +from fairscape_cli.data_fetcher.bioproject_fetcher import fetch_bioproject_data + +from fairscape_cli.models.rocrate import GenerateROCrate, AppendCrate +from fairscape_cli.models.dataset import GenerateDataset +from fairscape_cli.models.experiment import GenerateExperiment +from fairscape_cli.models.instrument import GenerateInstrument +from fairscape_cli.models.sample import GenerateSample + + +class GenomicData(BaseModel): + project: Project + samples: Samples + experiments: Experiments + outputs: Outputs + + def to_rocrate( + self, + output_dir: str, + author: str = "Unknown", + crate_name: 
Optional[str] = None,
+        crate_description: Optional[str] = None,
+        crate_keywords: Optional[List[str]] = None,
+        crate_license: Optional[str] = None,
+        crate_version: Optional[str] = None,
+        organization_name: Optional[str] = None,
+        project_name: Optional[str] = None,
+        **kwargs
+    ) -> str:
+        output_path = pathlib.Path(output_dir)
+        output_path.mkdir(parents=True, exist_ok=True)
+
+        bioproject = self.project
+
+        default_crate_name = crate_name if crate_name else bioproject.title
+        default_crate_description = crate_description if crate_description else bioproject.description
+
+        default_crate_keywords = ["bioproject", "bioinformatics"]
+        if bioproject.organism_name:
+            default_crate_keywords.append(bioproject.organism_name)
+        default_crate_keywords.extend(bioproject.data_types)
+        if bioproject.project_data_type:
+            default_crate_keywords.append(bioproject.project_data_type)
+        final_crate_keywords = crate_keywords if crate_keywords is not None else default_crate_keywords
+
+        rocrate_params = {
+            "path": output_path,
+            "guid": "",
+            "name": default_crate_name,
+            "description": default_crate_description,
+            "keywords": final_crate_keywords,
+            "license": crate_license if crate_license else "https://creativecommons.org/publicdomain/zero/1.0/",
+            "author": author,
+            "version": crate_version if crate_version else "1.0",
+            "organizationName": organization_name if organization_name else bioproject.organization_name,
+            "projectName": project_name if project_name else None,
+            "datePublished": datetime.now().isoformat(),
+            "associatedPublication": "",
+            "isPartOf": [],
+            "hasPart": [],
+            "sameAs": f"https://www.ncbi.nlm.nih.gov/bioproject/{bioproject.accession}"
+        }
+        rocrate_params.update(kwargs)
+        rocrate_params = {k: v for k, v in rocrate_params.items() if v is not None}
+
+        crate_root_dict = GenerateROCrate(**rocrate_params)
+        crate_root_guid = crate_root_dict['@id']
+
+        all_elements_to_append = []
+        id_mapping = {}
+        instrument_guids = {}
+
experiment_fairscape_objects: Dict[str, Experiment] = {}
+
+        cell_line_entities_to_add = {}
+        for sample_spec in self.samples.items:
+            accession = sample_spec.accession
+            cell_line = None
+            for attr_name in ['cell_line', 'cell line', 'cell_line_name']:
+                if attr_name in sample_spec.attributes:
+                    cell_line = sample_spec.attributes[attr_name]
+                    break
+
+            cell_line_guid = None
+            if cell_line and cell_line not in cell_line_entities_to_add:
+                try:
+                    cell_line_entity = get_cell_line_entity(cell_line)
+                    if cell_line_entity:
+                        cell_line_guid = cell_line_entity["@id"]
+                        cell_line_entities_to_add[cell_line] = cell_line_entity
+                except Exception as e:
+                    print(f"Warning: Failed to get cell line entity for {cell_line}: {e}")
+            elif cell_line in cell_line_entities_to_add:
+                cell_line_guid = cell_line_entities_to_add[cell_line]["@id"]
+
+            # drop None entries (scientific_name may be missing) so keywords stays a List[str]
+            sample_keywords = [kw for kw in ["biosample", sample_spec.scientific_name] if kw]
+            if cell_line:
+                sample_keywords.append(cell_line)
+
+            # Map InternalSample fields to GenerateSample arguments
+            sample_params = {
+                "guid": None,
+                "name": sample_spec.title or f"BioSample {accession}",
+                "author": author,
+                "description": sample_spec.title or f"BioSample {accession} from project {bioproject.accession}",
+                "keywords": sample_keywords,
+                "version": "1.0",
+                "contentUrl": f"https://www.ncbi.nlm.nih.gov/biosample/{accession}",
+                "cellLineReference": cell_line_guid if cell_line_guid else None,
+                "additionalProperty": [
+                    {"@type": "PropertyValue", "name": "NCBI BioSample Accession", "value": accession},
+                    {"@type": "PropertyValue", "name": "NCBI Taxon ID", "value": sample_spec.taxon_id},
+                    {"@type": "PropertyValue", "name": "Scientific Name", "value": sample_spec.scientific_name},
+                ]
+            }
+            sample_params = {k: v for k, v in sample_params.items() if v is not None}
+
+            generated_sample: Sample = GenerateSample(**sample_params)
+            all_elements_to_append.append(generated_sample)
+            id_mapping[accession] = generated_sample.guid
+
+
all_elements_to_append.extend(cell_line_entities_to_add.values())
+
+
+        for experiment_spec in self.experiments.items:
+            exp_accession = experiment_spec.accession
+            platform = experiment_spec.platform_type
+            model = experiment_spec.instrument_model
+            instrument_key = f"{platform}_{model}"
+
+            if instrument_key not in instrument_guids:
+                # Map InternalExperiment fields to GenerateInstrument arguments
+                instrument_params = {
+                    "guid": None,
+                    "name": model,
+                    "manufacturer": platform,
+                    "model": model,
+                    "description": f"{model} instrument ({platform}) used for sequencing",
+                    "usedByExperiment": []
+                }
+
+                generated_instrument: Instrument = GenerateInstrument(**instrument_params)
+                all_elements_to_append.append(generated_instrument)
+                instrument_guids[instrument_key] = generated_instrument.guid
+
+            instrument_guid = instrument_guids[instrument_key]
+
+            sample_guid = id_mapping.get(experiment_spec.sample_ref)
+            used_samples_list = [{"@id": sample_guid}] if sample_guid else []
+
+            # drop None entries (library fields may be missing) so keywords stays a List[str]
+            exp_keywords = [kw for kw in ["experiment", experiment_spec.library_strategy, experiment_spec.library_source] if kw]
+
+            # Map InternalExperiment fields to GenerateExperiment arguments
+            exp_params = {
+                "guid": None,
+                "name": experiment_spec.title or f"Experiment {exp_accession}",
+                "experimentType": experiment_spec.library_strategy,
+                "runBy": author,
+                "description": experiment_spec.title or f"Sequencing experiment {exp_accession}",
+                "datePerformed": datetime.now().isoformat(),
+                "keywords": exp_keywords,
+                "usedInstrument": [{"@id": instrument_guid}] if instrument_guid else [],
+                "usedSample": used_samples_list,
+                "generated": [],
+                "additionalProperty": [
+                    {"@type": "PropertyValue", "name": "NCBI SRA Experiment Accession", "value": exp_accession},
+                    {"@type": "PropertyValue", "name": "Library Name", "value": experiment_spec.library_name},
+                    {"@type": "PropertyValue", "name": "Library Strategy", "value": experiment_spec.library_strategy},
+                    {"@type": "PropertyValue", "name": "Library Source", "value":
experiment_spec.library_source}, + {"@type": "PropertyValue", "name": "Library Selection", "value": experiment_spec.library_selection}, + {"@type": "PropertyValue", "name": "Library Layout", "value": experiment_spec.library_layout}, + {"@type": "PropertyValue", "name": "Nominal Length", "value": experiment_spec.nominal_length}, + ] + } + exp_params = {k: v for k, v in exp_params.items() if v is not None and v != ""} + + generated_experiment: Experiment = GenerateExperiment(**exp_params) + all_elements_to_append.append(generated_experiment) + id_mapping[exp_accession] = generated_experiment.guid + experiment_fairscape_objects[exp_accession] = generated_experiment # Store official model + + generated_output_datasets = [] + for output_spec in self.outputs.items: + run_accession = output_spec.accession + experiment_guid = id_mapping.get(output_spec.experiment_ref) + + dataset_params = { + "guid": None, + "name": output_spec.title or f"SRA Run {run_accession}", + "author": author, + "description": f"Sequencing run {run_accession} from experiment {output_spec.experiment_ref}", + "keywords": ["SRA Run", "sequencing data"], + "datePublished": output_spec.published if output_spec.published else datetime.now().isoformat(), + "version": "1.0", + "format": "sra", + "generatedBy": experiment_guid if experiment_guid else [], + "contentUrl": f"https://www.ncbi.nlm.nih.gov/sra/{run_accession}", + "additionalProperty": [ + {"@type": "PropertyValue", "name": "NCBI SRA Run Accession", "value": run_accession}, + {"@type": "PropertyValue", "name": "Total Spots", "value": str(output_spec.total_spots)}, + {"@type": "PropertyValue", "name": "Total Bases", "value": str(output_spec.total_bases)}, + {"@type": "PropertyValue", "name": "Size (bytes)", "value": str(output_spec.size)}, + ] + } + dataset_params = {k: v for k, v in dataset_params.items() if v is not None and v != ""} + + generated_dataset: Dataset = GenerateDataset(**dataset_params) + 
generated_output_datasets.append(generated_dataset) + id_mapping[run_accession] = generated_dataset.guid + + if output_spec.experiment_ref in experiment_fairscape_objects: + exp_obj = experiment_fairscape_objects[output_spec.experiment_ref] + if not exp_obj.generated: + exp_obj.generated = [] + exp_obj.generated.append({"@id": generated_dataset.guid}) + + + all_elements_to_append.extend(generated_output_datasets) + + + models_to_append = [elem for elem in all_elements_to_append if hasattr(elem, 'model_dump')] + dicts_to_append = [elem for elem in all_elements_to_append if not hasattr(elem, 'model_dump')] + + if models_to_append: + AppendCrate(cratePath=output_path, elements=models_to_append) + + if dicts_to_append: + metadata_file_path = output_path / "ro-crate-metadata.json" + with metadata_file_path.open("r+") as f: + crate_json = json.load(f) + existing_ids = {item.get("@id") for item in crate_json["@graph"]} + root_dataset_node = crate_json["@graph"][1] + + for entity_dict in dicts_to_append: + entity_id = entity_dict.get("@id") + if entity_id and entity_id not in existing_ids: + crate_json["@graph"].append(entity_dict) + existing_ids.add(entity_id) + if not any(part.get("@id") == entity_id for part in root_dataset_node.get("hasPart",[])): + if "hasPart" not in root_dataset_node: root_dataset_node["hasPart"] = [] + root_dataset_node["hasPart"].append({"@id": entity_id}) + + f.seek(0) + json.dump(crate_json, f, indent=2) + f.truncate() + + + metadata_file_path = output_path / "ro-crate-metadata.json" + with metadata_file_path.open("r+") as f: + crate_json_final = json.load(f) + graph = crate_json_final["@graph"] + updated = False + for i, item in enumerate(graph): + item_id = item.get("@id") + matching_exp_obj = next((exp for acc, exp in experiment_fairscape_objects.items() if exp.guid == item_id and exp.generated), None) + if matching_exp_obj: + graph[i]["generated"] = [gen for gen in matching_exp_obj.generated] + updated = True + + if updated: + f.seek(0) + 
json.dump(crate_json_final, f, indent=2) + f.truncate() + + + return crate_root_guid + + @classmethod + def from_api(cls, accession: str, api_key: str = "", details_dir: str = "details") -> 'GenomicData': + data = fetch_bioproject_data(accession, api_key=api_key, details_dir=details_dir) + if not data: + raise ValueError(f"Failed to fetch data for BioProject: {accession}") + return cls.from_json(data) + + @classmethod + def from_json(cls, data: dict) -> 'GenomicData': + sample_to_study_map = {} + for experiment in data.get("experiments", []): + sample_ref = experiment.get("sample_ref") or experiment.get("title", "") + study_ref = experiment.get("study_ref") or experiment.get("title", "") + sample_to_study_map[sample_ref] = study_ref + + studies_map = {} + if data.get("studies"): + studies_map = {study.get("accession") or study.get("title", ""): study for study in data.get("studies", [])} + + project_data = data.get("bioproject", {}) + project_type = project_data.get("project_type", {}) + target = project_type.get("target", {}) + organism = target.get("organism", {}) + submission = project_data.get("submission", {}) + organization = submission.get("organization", {}) + + # Create Project instance (using its own definition) + project = Project( + id=project_data.get("id", ""), + accession=project_data.get("accession") or project_data.get("title", ""), + archive=project_data.get("archive", ""), + organism_name=project_data.get("organism_name", ""), + title=project_data.get("title", ""), + description=project_data.get("description", ""), + release_date=project_data.get("release_date", ""), + target_capture=target.get("capture", ""), + target_material=target.get("material", ""), + target_sample_scope=target.get("sample_scope", ""), + organism_species=organism.get("species", ""), + organism_taxID=organism.get("taxID", ""), + organism_supergroup=organism.get("supergroup", ""), + method=project_type.get("method", ""), + data_types=project_type.get("data_types", []), + 
project_data_type=project_type.get("project_data_type", ""), + submitted_date=submission.get("submitted", ""), + organization_role=organization.get("role", ""), + organization_type=organization.get("type", ""), + organization_name=organization.get("name", ""), + access=submission.get("access", "") + ) + + # Parse into InternalSample instances + internal_samples = [] + for biosample in data.get("biosamples", []): + sample_ref = biosample.get("accession") or biosample.get("title") or biosample.get("scientific_name", "") + study_ref = sample_to_study_map.get(sample_ref) + study_data = studies_map.get(study_ref, {}) + + internal_sample = InternalSample( + accession=biosample.get("accession") or biosample.get("title") or biosample.get("scientific_name", ""), + title=biosample.get("title", ""), + scientific_name=biosample.get("scientific_name", ""), + taxon_id=biosample.get("taxon_id", ""), + attributes=biosample.get("attributes", {}), + study_accession=study_ref, + study_center_name=study_data.get("center_name"), + study_title=study_data.get("title"), + study_abstract=study_data.get("abstract"), + study_description=study_data.get("description") + ) + internal_samples.append(internal_sample) + + # Parse into InternalExperiment instances + internal_experiments = [] + for exp in data.get("experiments", []): + design = exp.get("design", {}) + platform = exp.get("platform", {}) + + internal_experiment = InternalExperiment( + accession=exp.get("accession") or exp.get("title", ""), + title=exp.get("title", ""), + study_ref=exp.get("study_ref") or exp.get("title", ""), + sample_ref=exp.get("sample_ref") or exp.get("title", ""), + library_name=design.get("library_name", ""), + library_strategy=design.get("library_strategy", ""), + library_source=design.get("library_source", ""), + library_selection=design.get("library_selection", ""), + library_layout=design.get("library_layout", ""), + nominal_length=design.get("nominal_length", ""), + platform_type=platform.get("type", ""), + 
instrument_model=platform.get("instrument_model", "") + ) + internal_experiments.append(internal_experiment) + + # Parse into Output instances (using its own definition) + outputs_list = [] + for run in data.get("runs", []): + base_composition = run.get("base_composition", {}) + + output_files = [ + OutputFile( + filename=file.get("filename") or file.get("url", "").split("/")[-1] or "", + size=file.get("size", 0), + date=file.get("date", ""), + url=file.get("url", ""), + md5=file.get("md5", "") + ) + for file in run.get("files", []) + ] + + output = Output( + accession=run.get("accession") or run.get("title", ""), + title=run.get("title", ""), + experiment_ref=run.get("experiment_ref") or run.get("title", ""), + total_spots=run.get("total_spots", 0), + total_bases=run.get("total_bases", 0), + size=run.get("size", 0), + published=run.get("published", ""), + files=output_files, + nreads=run.get("nreads", 0), + nspots=run.get("nspots", 0), + a_count=base_composition.get("A", 0), + c_count=base_composition.get("C", 0), + g_count=base_composition.get("G", 0), + t_count=base_composition.get("T", 0), + n_count=base_composition.get("N", 0) + ) + outputs_list.append(output) + + # Create the final GenomicData instance using wrappers containing internal types + return cls( + project=project, + samples=Samples(items=internal_samples), + experiments=Experiments(items=internal_experiments), + outputs=Outputs(items=outputs_list) + ) + diff --git a/src/fairscape_cli/rocrate/__init__.py b/src/fairscape_cli/data_fetcher/__init__.py similarity index 100% rename from src/fairscape_cli/rocrate/__init__.py rename to src/fairscape_cli/data_fetcher/__init__.py diff --git a/src/fairscape_cli/data_fetcher/bioproject_fetcher.py b/src/fairscape_cli/data_fetcher/bioproject_fetcher.py new file mode 100644 index 0000000..5969e1b --- /dev/null +++ b/src/fairscape_cli/data_fetcher/bioproject_fetcher.py @@ -0,0 +1,548 @@ +import requests +import json +import xml.etree.ElementTree as ET +import os 
+import argparse + +def fetch_bioproject_data(bioproject_accession, api_key, details_dir="details"): + + + # First fetch the BioProject data + search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" + search_params = { + "db": "bioproject", + "term": bioproject_accession, + "retmode": "json", + "api_key": api_key + } + + response = requests.get(search_url, params=search_params) + + try: + search_results = response.json() + + if "esearchresult" not in search_results or "idlist" not in search_results["esearchresult"] or len(search_results["esearchresult"]["idlist"]) == 0: + return None + + bioproject_id = search_results["esearchresult"]["idlist"][0] + except json.JSONDecodeError: + return None + + bioproject_fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" + bioproject_fetch_params = { + "db": "bioproject", + "id": bioproject_id, + "retmode": "xml", + "api_key": api_key + } + + response = requests.get(bioproject_fetch_url, params=bioproject_fetch_params) + response_text = response.text + + bioproject_metadata = parse_bioproject_xml(response_text) + if not bioproject_metadata: + bioproject_metadata = { + "id": bioproject_id, + "accession": bioproject_accession + } + + # Initialize result structure + result = { + "bioproject": bioproject_metadata, + "biosamples": [], + "studies": [], + "experiments": [], + "runs": [] + } + + # First try via SRA linkage + sra_data = fetch_sra_data(bioproject_id, api_key) + if sra_data and (sra_data.get("biosamples") or sra_data.get("studies") or sra_data.get("experiments") or sra_data.get("runs")): + result["biosamples"] = sra_data.get("biosamples", []) + result["studies"] = sra_data.get("studies", []) + result["experiments"] = sra_data.get("experiments", []) + result["runs"] = sra_data.get("runs", []) + else: + # If no SRA data, try to get BioSamples directly + biosamples = fetch_biosamples_for_bioproject(bioproject_id, api_key) + if biosamples: + result["biosamples"] = biosamples + + return 
result + +def fetch_sra_data(bioproject_id, api_key): + """Fetch SRA data linked to a BioProject""" + link_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi" + sra_link_params = { + "dbfrom": "bioproject", + "db": "sra", + "id": bioproject_id, + "retmode": "json", + "api_key": api_key + } + + response = requests.get(link_url, params=sra_link_params) + + try: + sra_link_results = response.json() + except json.JSONDecodeError: + return None + + sra_ids = [] + if "linksets" in sra_link_results and len(sra_link_results["linksets"]) > 0: + linkset = sra_link_results["linksets"][0] + if "linksetdbs" in linkset: + for linksetdb in linkset["linksetdbs"]: + if linksetdb["linkname"] == "bioproject_sra": + sra_ids = linksetdb["links"] + break + + if sra_ids: + batch_size = 50 + all_roots = [] + + for i in range(0, len(sra_ids), batch_size): + batch = sra_ids[i:i+batch_size] + + sra_fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" + sra_fetch_params = { + "db": "sra", + "id": ",".join(batch), + "rettype": "xml", + "api_key": api_key + } + + response = requests.get(sra_fetch_url, params=sra_fetch_params) + response_text = response.text + + if response_text.strip(): + try: + batch_root = ET.fromstring(response_text) + all_roots.append(batch_root) + except ET.ParseError: + pass + + if all_roots: + if len(all_roots) > 1: + combined_root = ET.Element("EXPERIMENT_PACKAGE_SET") + + for root in all_roots: + for exp_package in root.findall(".//EXPERIMENT_PACKAGE"): + combined_root.append(exp_package) + + return parse_experiment_packages(combined_root) + else: + return parse_experiment_packages(all_roots[0]) + + return None + +def fetch_biosamples_for_bioproject(bioproject_id, api_key): + """Fetch BioSamples directly linked to a BioProject""" + # First, get the BioSample IDs linked to this BioProject + link_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi" + biosample_link_params = { + "dbfrom": "bioproject", + "db": "biosample", + 
"id": bioproject_id, + "retmode": "json", + "api_key": api_key + } + + response = requests.get(link_url, params=biosample_link_params) + + try: + link_results = response.json() + except json.JSONDecodeError: + return [] + + biosample_ids = [] + if "linksets" in link_results and len(link_results["linksets"]) > 0: + linkset = link_results["linksets"][0] + if "linksetdbs" in linkset: + for linksetdb in linkset["linksetdbs"]: + if linksetdb["linkname"] == "bioproject_biosample": + biosample_ids = linksetdb["links"] + break + + if not biosample_ids: + return [] + + # Now fetch the BioSample details + biosamples = [] + batch_size = 50 + + for i in range(0, len(biosample_ids), batch_size): + batch = biosample_ids[i:i+batch_size] + + fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" + fetch_params = { + "db": "biosample", + "id": ",".join(batch), + "rettype": "xml", + "api_key": api_key + } + + response = requests.get(fetch_url, params=fetch_params) + if response.status_code != 200: + continue + + try: + root = ET.fromstring(response.text) + + for biosample in root.findall(".//BioSample"): + sample_data = { + "accession": biosample.get("accession", ""), + "title": "", + "scientific_name": "", + "taxon_id": "", + "attributes": {} + } + + # Get title from Description/Title + title_elem = biosample.find(".//Description/Title") + if title_elem is not None and title_elem.text: + sample_data["title"] = title_elem.text + + # Get organism information + organism = biosample.find(".//Description/Organism") + if organism is not None: + sample_data["scientific_name"] = organism.get("taxonomy_name", "") + sample_data["taxon_id"] = organism.get("taxonomy_id", "") + + # Get attributes + for attribute in biosample.findall(".//Attributes/Attribute"): + attr_name = attribute.get("attribute_name", "") + if attr_name and attribute.text: + sample_data["attributes"][attr_name] = attribute.text + + biosamples.append(sample_data) + + except ET.ParseError: + pass + + return 
biosamples
+
+def parse_bioproject_xml(xml_text):
+    try:
+        root = ET.fromstring(xml_text)
+
+        bioproject_data = {}
+
+        doc_summary = root.find(".//DocumentSummary")
+        # Use explicit None checks: ElementTree elements are falsy when they
+        # have no children, so truthiness tests can silently misfire.
+        if doc_summary is None:
+            return None
+
+        uid = doc_summary.get("uid", "")
+        if uid:
+            bioproject_data["id"] = uid
+
+        project = doc_summary.find(".//Project")
+        if project is not None:
+            archive_id = project.find(".//ProjectID/ArchiveID")
+            if archive_id is not None:
+                bioproject_data["accession"] = archive_id.get("accession", "")
+                bioproject_data["archive"] = archive_id.get("archive", "")
+                if not uid and archive_id.get("id"):
+                    bioproject_data["id"] = archive_id.get("id")
+
+            project_descr = project.find(".//ProjectDescr")
+            if project_descr is not None:
+                name_elem = project_descr.find("Name")
+                if name_elem is not None and name_elem.text:
+                    bioproject_data["organism_name"] = name_elem.text
+
+                title_elem = project_descr.find("Title")
+                if title_elem is not None and title_elem.text:
+                    bioproject_data["title"] = title_elem.text
+
+                desc_elem = project_descr.find("Description")
+                if desc_elem is not None and desc_elem.text:
+                    bioproject_data["description"] = desc_elem.text
+
+                release_date_elem = project_descr.find("ProjectReleaseDate")
+                if release_date_elem is not None and release_date_elem.text:
+                    bioproject_data["release_date"] = release_date_elem.text
+
+                relevance_elem = project_descr.find("Relevance")
+                if relevance_elem is not None:
+                    relevance = {}
+                    for rel_elem in relevance_elem:
+                        if rel_elem.text:
+                            relevance[rel_elem.tag] = rel_elem.text
+                    if relevance:
+                        bioproject_data["relevance"] = relevance
+
+            project_type = project.find(".//ProjectType")
+            if project_type is not None:
+                project_type_submission = project_type.find("ProjectTypeSubmission")
+                if project_type_submission is not None:
+                    type_data = {}
+
+                    target_elem = project_type_submission.find("Target")
+                    if target_elem is not None:
+                        target_data = {
+                            "capture": target_elem.get("capture", ""),
+                            "material": target_elem.get("material", ""),
"sample_scope": target_elem.get("sample_scope", "") + } + + organism_elem = target_elem.find("Organism") + if organism_elem is not None: + organism_data = { + "species": organism_elem.get("species", ""), + "taxID": organism_elem.get("taxID", ""), + "name": "", + "supergroup": "" + } + + org_name_elem = organism_elem.find("OrganismName") + if org_name_elem is not None and org_name_elem.text: + organism_data["name"] = org_name_elem.text + + supergroup_elem = organism_elem.find("Supergroup") + if supergroup_elem is not None and supergroup_elem.text: + organism_data["supergroup"] = supergroup_elem.text + + target_data["organism"] = organism_data + + type_data["target"] = target_data + + method_elem = project_type_submission.find("Method") + if method_elem is not None: + type_data["method"] = method_elem.get("method_type", "") + + objectives_elem = project_type_submission.find("Objectives") + if objectives_elem is not None: + data_types = [] + for data_elem in objectives_elem.findall("Data"): + data_type = data_elem.get("data_type", "") + if data_type: + data_types.append(data_type) + + if data_types: + type_data["data_types"] = data_types + + data_type_set_elem = project_type_submission.find("ProjectDataTypeSet") + if data_type_set_elem is not None: + data_type_elem = data_type_set_elem.find("DataType") + if data_type_elem is not None and data_type_elem.text: + type_data["project_data_type"] = data_type_elem.text + + if type_data: + bioproject_data["project_type"] = type_data + + submission = doc_summary.find(".//Submission") + if submission is not None: + submission_data = { + "submitted": submission.get("submitted", "") + } + + desc = submission.find("Description") + if desc is not None: + org_elem = desc.find("Organization") + if org_elem is not None: + org_data = { + "role": org_elem.get("role", ""), + "type": org_elem.get("type", ""), + "name": "" + } + + name_elem = org_elem.find("Name") + if name_elem is not None and name_elem.text: + org_data["name"] = 
name_elem.text + + submission_data["organization"] = org_data + + access_elem = desc.find("Access") + if access_elem is not None and access_elem.text: + submission_data["access"] = access_elem.text + + if submission_data: + bioproject_data["submission"] = submission_data + + return bioproject_data + + except ET.ParseError: + return None + +def parse_experiment_packages(root): + data = { + "biosamples": [], + "studies": [], + "experiments": [], + "runs": [] + } + + for exp_package in root.findall(".//EXPERIMENT_PACKAGE"): + study = exp_package.find(".//STUDY") + if study is not None and study.get("accession") not in [s.get("accession") for s in data["studies"]]: + study_data = { + "accession": study.get("accession", ""), + "center_name": study.get("center_name", ""), + "title": "", + "abstract": "", + "description": "" + } + + descriptor = study.find(".//DESCRIPTOR") + if descriptor is not None: + study_data["title"] = descriptor.findtext(".//STUDY_TITLE", "") + study_data["abstract"] = descriptor.findtext(".//STUDY_ABSTRACT", "") + study_data["description"] = descriptor.findtext(".//STUDY_DESCRIPTION", "") + + data["studies"].append(study_data) + + sample = exp_package.find(".//SAMPLE") + if sample is not None and sample.get("accession") not in [s.get("accession") for s in data["biosamples"]]: + sample_data = { + "accession": sample.get("accession", ""), + "title": sample.findtext(".//TITLE", ""), + "scientific_name": sample.findtext(".//SCIENTIFIC_NAME", ""), + "taxon_id": sample.findtext(".//TAXON_ID", ""), + "attributes": {} + } + + for attr in sample.findall(".//SAMPLE_ATTRIBUTE"): + tag = attr.findtext(".//TAG", "") + value = attr.findtext(".//VALUE", "") + if tag and tag != "N. 
A.": + sample_data["attributes"][tag] = value + + data["biosamples"].append(sample_data) + + experiment = exp_package.find(".//EXPERIMENT") + if experiment is not None: + exp_data = { + "accession": experiment.get("accession", ""), + "title": experiment.findtext(".//TITLE", ""), + "study_ref": "", + "sample_ref": "", + "design": {}, + "platform": {} + } + + study_ref_elem = experiment.find(".//STUDY_REF") + if study_ref_elem is not None: + exp_data["study_ref"] = study_ref_elem.get("accession", "") + + sample_desc_elem = experiment.find(".//SAMPLE_DESCRIPTOR") + if sample_desc_elem is not None: + exp_data["sample_ref"] = sample_desc_elem.get("accession", "") + + design_elem = experiment.find(".//DESIGN") + if design_elem is not None: + library_elem = design_elem.find(".//LIBRARY_DESCRIPTOR") + if library_elem is not None: + exp_data["design"]["library_name"] = library_elem.findtext(".//LIBRARY_NAME", "") + exp_data["design"]["library_strategy"] = library_elem.findtext(".//LIBRARY_STRATEGY", "") + exp_data["design"]["library_source"] = library_elem.findtext(".//LIBRARY_SOURCE", "") + exp_data["design"]["library_selection"] = library_elem.findtext(".//LIBRARY_SELECTION", "") + + layout_elem = library_elem.find(".//LIBRARY_LAYOUT") + if layout_elem is not None and len(layout_elem) > 0: + layout_type = layout_elem[0].tag + exp_data["design"]["library_layout"] = layout_type + if layout_type == "PAIRED": + exp_data["design"]["nominal_length"] = layout_elem[0].get("NOMINAL_LENGTH", "") + + platform_elem = experiment.find(".//PLATFORM") + if platform_elem is not None and len(platform_elem) > 0: + platform_type = platform_elem[0].tag + exp_data["platform"]["type"] = platform_type + + instrument_model = platform_elem.find(f".//{platform_type}/INSTRUMENT_MODEL") + if instrument_model is not None: + exp_data["platform"]["instrument_model"] = instrument_model.text + + data["experiments"].append(exp_data) + + for run in exp_package.findall(".//RUN"): + run_data = { + 
"accession": run.get("accession", ""), + "title": run.findtext(".//TITLE", ""), + "experiment_ref": "", + "total_spots": run.get("total_spots", ""), + "total_bases": run.get("total_bases", ""), + "size": run.get("size", ""), + "published": run.get("published", ""), + "files": [] + } + + exp_ref = run.find(".//EXPERIMENT_REF") + if exp_ref is not None: + run_data["experiment_ref"] = exp_ref.get("accession", "") + + for file_elem in run.findall(".//SRAFile"): + file_data = { + "filename": file_elem.get("filename", ""), + "size": file_elem.get("size", ""), + "date": file_elem.get("date", ""), + "url": file_elem.get("url", ""), + "md5": file_elem.get("md5", "") + } + run_data["files"].append(file_data) + + stats_elem = run.find(".//Statistics") + if stats_elem is not None: + run_data["nreads"] = stats_elem.get("nreads", "") + run_data["nspots"] = stats_elem.get("nspots", "") + + reads = [] + for read_elem in stats_elem.findall(".//Read"): + read_data = { + "index": read_elem.get("index", ""), + "count": read_elem.get("count", ""), + "average": read_elem.get("average", ""), + "stdev": read_elem.get("stdev", "") + } + reads.append(read_data) + + if reads: + run_data["reads"] = reads + + bases_elem = run.find(".//Bases") + if bases_elem is not None: + run_data["base_count"] = bases_elem.get("count", "") + + bases = {} + for base_elem in bases_elem.findall(".//Base"): + value = base_elem.get("value", "") + count = base_elem.get("count", "") + if value and count: + bases[value] = count + + if bases: + run_data["base_composition"] = bases + + data["runs"].append(run_data) + + return data + +def main(): + parser = argparse.ArgumentParser(description="Fetch and parse BioProject, BioSample, and SRA data") + parser.add_argument("bioproject", help="BioProject accession number") + parser.add_argument("--api-key", default="b5842d8d17966b13241247e793b879532d07", help="NCBI API key") + parser.add_argument("--output-dir", default=".", help="Output directory") + 
parser.add_argument("--details-dir", default="details", help="Details directory") + + args = parser.parse_args() + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + details_dir = os.path.join(args.output_dir, args.details_dir, args.bioproject) + if not os.path.exists(details_dir): + os.makedirs(details_dir) + + os.chdir(args.output_dir) + + data = fetch_bioproject_data(args.bioproject, args.api_key, details_dir) + + if data: + output_file = f"{args.bioproject}_metadata.json" + with open(output_file, "w") as f: + json.dump(data, f, indent=2) + print(f"Data written to {output_file}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/src/fairscape_cli/data_fetcher/cell_line_api.py b/src/fairscape_cli/data_fetcher/cell_line_api.py new file mode 100644 index 0000000..593c204 --- /dev/null +++ b/src/fairscape_cli/data_fetcher/cell_line_api.py @@ -0,0 +1,201 @@ +#!/usr/bin/env python3 +import sys +import json +import requests +from bs4 import BeautifulSoup + +def search_cellosaurus(cell_line): + """ + Search Cellosaurus for a cell line and return the first result's URL and accession. 
+ + Args: + cell_line (str): The cell line to search for + + Returns: + tuple: (URL, accession) for the first result, or (None, None) if no results + """ + url = f"https://www.cellosaurus.org/search?query={cell_line}" + + try: + response = requests.get(url) + response.raise_for_status() + + soup = BeautifulSoup(response.text, 'html.parser') + + # Find the first row in the results table + results_table = soup.find('table', class_='type-1') + + if results_table: + first_row = results_table.find('tr') + + if first_row: + accession_cell = first_row.find('td') + if accession_cell: + accession_link = accession_cell.find('a') + if accession_link: + accession = accession_link.text + url = f"https://www.cellosaurus.org/{accession}" + return url, accession + + return None, None + + except requests.exceptions.RequestException as e: + print(f"Error: {e}", file=sys.stderr) + return None, None + +def get_cell_line_metadata(url): + """ + Retrieve and extract metadata for a cell line from its detail page. 
+ + Args: + url (str): The URL of the cell line detail page + + Returns: + dict: A dictionary containing the cell line metadata + """ + try: + response = requests.get(url) + response.raise_for_status() + + soup = BeautifulSoup(response.text, 'html.parser') + + # Initialize metadata dictionary + metadata = {} + + # Find the main table containing cell line data + table = soup.find('table', {'class': 'type-2'}) + + if not table: + return metadata + + # Extract key-value pairs from the table + rows = table.find_all('tr') + for row in rows: + header = row.find('th') + data = row.find('td') + + if header and data: + key = header.text.strip() + value = data.text.strip() + + # Skip large sections to keep the output manageable + if key in ["Publications", "Gene expression databases"]: + continue + + # For certain fields, split multiple lines + if key in ["Synonyms", "Comments"]: + value = [v.strip() for v in value.split('\n') if v.strip()] + + metadata[key] = value + + return metadata + + except requests.exceptions.RequestException as e: + print(f"Error: {e}", file=sys.stderr) + return {} + +def format_structured_json(metadata, url, accession): + """ + Format the metadata into a structured JSON format following the schema template. 
+
+    Args:
+        metadata (dict): The metadata dictionary
+        url (str): The URL of the cell line
+        accession (str): The accession number
+
+    Returns:
+        dict: A dictionary in the specified format
+    """
+    # Extract cell line name
+    cell_name = metadata.get("Cell line name", "")
+
+    # Create ark ID from accession
+    ark_id = f"ark:59852/cell-line-{cell_name.replace(' ', '-')}"
+
+    # Extract synonyms/alternate names
+    alternate_names = []
+    if "Synonyms" in metadata and metadata["Synonyms"]:
+        if isinstance(metadata["Synonyms"], list):
+            raw_synonyms = metadata["Synonyms"][0]
+        else:
+            raw_synonyms = metadata["Synonyms"]
+
+        # Split by semicolons if present
+        alternate_names = [name.strip() for name in raw_synonyms.split(';')]
+
+    # Extract RRID if available
+    rrid = ""
+    if "Resource Identification Initiative" in metadata:
+        rrid_text = metadata["Resource Identification Initiative"]
+        if "RRID:" in rrid_text:
+            start_idx = rrid_text.find("RRID:")
+            end_idx = rrid_text.find(")", start_idx) if ")" in rrid_text[start_idx:] else len(rrid_text)
+            rrid = rrid_text[start_idx:end_idx]
+
+    # Extract species/organism information
+    species = "Unknown"
+    ncbi_taxid = ""
+    if "Species of origin" in metadata:
+        species_text = metadata["Species of origin"]
+        species = species_text.split("(")[0].strip()
+
+        # Try to extract NCBI taxonomy ID
+        if "NCBI Taxonomy:" in species_text:
+            taxid_start = species_text.find("NCBI Taxonomy:") + len("NCBI Taxonomy:")
+            taxid_end = species_text.find(")", taxid_start) if ")" in species_text[taxid_start:] else len(species_text)
+            ncbi_taxid = species_text[taxid_start:taxid_end].strip()
+
+    # Create the structured JSON
+    structured_json = {
+        "@id": ark_id,
+        "@type": "BioChemEntity",
+        "name": f"{cell_name} Cell Line",
+        "identifier": [
+            {
+                "@type": "PropertyValue",
+                # Use "value" (not the JSON-LD keyword "@value", which cannot
+                # appear alongside other properties in a node object), matching
+                # the PropertyValue entries emitted elsewhere in this module.
+                "value": f"RRID:{accession}" if not rrid else rrid,
+                "name": "RRID"
+            }
+        ],
+        "url": url,
+        "alternateName": alternate_names,
+        "organism": {
+            "@id": 
f"ark:59852/organism-{species.lower().replace(' ', '-')}", + "name": species, + "identifier": [] + }, + "EVI:usedBy": [] + } + + # Add NCBI taxonomy ID if available + if ncbi_taxid: + structured_json["organism"]["identifier"].append({ + "@type": "PropertyValue", + "name": "NCBI Taxonomy Browser", + "value": f"NCBI:txid{ncbi_taxid}", + "url": f"https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id={ncbi_taxid}" + }) + + return structured_json + +def get_cell_line_entity(cell_line_name): + """ + Lookup a cell line in Cellosaurus and return its entity representation. + + Args: + cell_line_name (str): The name of the cell line to look up + + Returns: + dict or None: The cell line entity as a dictionary, or None if not found + """ + url, accession = search_cellosaurus(cell_line_name) + + if url and accession: + # Get metadata from the cell line detail page + metadata = get_cell_line_metadata(url) + + # Format in structured JSON + structured_data = format_structured_json(metadata, url, accession) + return structured_data + + return None \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/__init__.py b/src/fairscape_cli/datasheet_builder/__init__.py new file mode 100644 index 0000000..97e0706 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/__init__.py @@ -0,0 +1,7 @@ +""" +Datasheet Builder module for RO-Crate metadata visualization and documentation. 
+""" + +from fairscape_cli.datasheet_builder.rocrate.datasheet_generator import DatasheetGenerator + +__all__ = ['DatasheetGenerator'] \ No newline at end of file diff --git a/src/fairscape_cli/schema/__init__.py b/src/fairscape_cli/datasheet_builder/evidence_graph/__init__.py similarity index 100% rename from src/fairscape_cli/schema/__init__.py rename to src/fairscape_cli/datasheet_builder/evidence_graph/__init__.py diff --git a/src/fairscape_cli/datasheet_builder/evidence_graph/graph_builder.py b/src/fairscape_cli/datasheet_builder/evidence_graph/graph_builder.py new file mode 100644 index 0000000..62842c8 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/evidence_graph/graph_builder.py @@ -0,0 +1,242 @@ +from typing import Dict, List, Optional, Any, Union, Set +import json +import pathlib +from pathlib import Path +import click + +class EvidenceNode: + def __init__(self, id: str, type: str): + self.id = id + self.type = type + # For Computation nodes + self.usedSoftware: Optional[List[str]] = None + self.usedDataset: Optional[List[str]] = None + self.usedSample: Optional[List[str]] = None + self.usedInstrument: Optional[List[str]] = None + # For Dataset/Sample/Instrument nodes + self.generatedBy: Optional[str] = None + +class EvidenceGraphJSON: + def __init__(self, guid: str, owner: str, description: str, name: str = "Evidence Graph"): + self.metadataType = "evi:EvidenceGraph" + self.guid = guid + self.owner = owner + self.description = description + self.name = name + self.graph = None + + def build_graph(self, node_id: str, json_data: Dict[str, Any]): + processed = set() + self.graph = self._build_graph_recursive(node_id, json_data, processed) + + def _build_graph_recursive(self, node_id: str, json_data: Dict[str, Any], processed: Set[str]) -> Dict: + if node_id in processed: + return {"@id": node_id} + + # Find node in json data graph + node = None + for entity in json_data.get("@graph", []): + if entity.get("@id") == node_id: + node = entity + break + 
+ if not node: + return {"@id": node_id} + + processed.add(node_id) + result = self._build_base_node(node) + + # Determine node type based on @type + node_type = None + if isinstance(node.get("@type"), list): + for type_entry in node["@type"]: + if "Dataset" in type_entry: + node_type = "Dataset" + break + elif "Computation" in type_entry: + node_type = "Computation" + break + elif "Sample" in type_entry: + node_type = "Sample" + break + elif "Instrument" in type_entry: + node_type = "Instrument" + break + elif "Experiment" in type_entry: + node_type = "Experiment" + break + elif isinstance(node.get("@type"), str): + type_str = node.get("@type") + if "Dataset" in type_str: + node_type = "Dataset" + elif "Computation" in type_str: + node_type = "Computation" + elif "Sample" in type_str: + node_type = "Sample" + elif "Instrument" in type_str: + node_type = "Instrument" + elif "Experiment" in type_str: + node_type = "Experiment" + + if node_type in ["Dataset", "Sample", "Instrument"]: + if "generatedBy" in node: + result["generatedBy"] = self._build_computation_node(node, json_data, processed) + elif node_type in ["Computation", "Experiment"]: + if "usedDataset" in node: + result["usedDataset"] = self._build_used_resources(node["usedDataset"], json_data, processed) + if "usedSoftware" in node: + result["usedSoftware"] = self._build_software_reference(node["usedSoftware"], json_data) + if "usedSample" in node: + result["usedSample"] = self._build_used_resources(node["usedSample"], json_data, processed) + if "usedInstrument" in node: + result["usedInstrument"] = self._build_used_resources(node["usedInstrument"], json_data, processed) + + return result + + def _build_base_node(self, node: Dict) -> Dict: + return { + "@id": node["@id"], + "@type": node.get("@type"), + "name": node.get("name"), + "description": node.get("description") + } + + def _build_computation_node(self, parent_node: Dict, json_data: Dict[str, Any], processed: Set[str]) -> Dict: + # If generatedBy is 
an empty list, don't add anything + if isinstance(parent_node["generatedBy"], list) and len(parent_node["generatedBy"]) == 0: + return {} + + comp_id = None + if isinstance(parent_node["generatedBy"], list): + if parent_node["generatedBy"] and isinstance(parent_node["generatedBy"][0], dict): + comp_id = parent_node["generatedBy"][0].get("@id") + elif isinstance(parent_node["generatedBy"], dict): + comp_id = parent_node["generatedBy"].get("@id") + + if not comp_id: + return {"@id": "unknown-computation"} + + comp = None + for entity in json_data.get("@graph", []): + if entity.get("@id") == comp_id: + comp = entity + break + + if not comp: + return {"@id": comp_id} + + return self._build_graph_recursive(comp_id, json_data, processed) + + def _build_used_resources(self, used_resources: Union[Dict, List], json_data: Dict[str, Any], processed: Set[str]) -> List: + if isinstance(used_resources, dict): + resource_id = used_resources.get("@id") + if resource_id: + return [self._build_graph_recursive(resource_id, json_data, processed)] + return [] + + if isinstance(used_resources, list): + resources = [] + for resource in used_resources: + if isinstance(resource, dict) and resource.get("@id"): + resources.append(self._build_graph_recursive(resource.get("@id"), json_data, processed)) + return resources + + return [] + + def _build_software_reference(self, used_software: Union[Dict, List], json_data: Dict[str, Any]) -> Union[Dict, List[Dict]]: + if isinstance(used_software, list): + if not used_software: + return [] + + software_refs = [] + for sw in used_software: + if isinstance(sw, dict) and sw.get("@id"): + software_id = sw.get("@id") + software = None + for entity in json_data.get("@graph", []): + if entity.get("@id") == software_id: + software = entity + break + + if software: + software_refs.append(self._build_base_node(software)) + else: + software_refs.append({"@id": software_id}) + + return software_refs + + elif isinstance(used_software, dict) and 
used_software.get("@id"): + software_id = used_software.get("@id") + software = None + for entity in json_data.get("@graph", []): + if entity.get("@id") == software_id: + software = entity + break + + if software: + return self._build_base_node(software) + return {"@id": software_id} + + return {"@id": "unknown-software"} + + def to_dict(self): + return { + "@type": self.metadataType, + "@id": self.guid, + "owner": self.owner, + "description": self.description, + "name": self.name, + "@graph": self.graph + } + + def save_to_file(self, file_path: Union[str, pathlib.Path]): + with open(file_path, 'w') as f: + json.dump(self.to_dict(), f, indent=2) + +def generate_evidence_graph_from_rocrate( + rocrate_path: Union[str, pathlib.Path], + output_path: Optional[Union[str, pathlib.Path]] = None, + node_id: str = "" +) -> Dict[str, Any]: + """ + Generate an evidence graph from an RO-Crate JSON file for a specific node ID + + Args: + rocrate_path: Path to the RO-Crate metadata JSON file + output_path: Optional path to save the evidence graph JSON + node_id: ID of the node to start building the graph from + + Returns: + The generated evidence graph as a dictionary + """ + # Load RO-Crate data + with open(rocrate_path, 'r') as f: + rocrate_data = json.load(f) + + # Find the root entity + root_entity = None + for entity in rocrate_data.get("@graph", []): + if entity.get("@id") == node_id: + root_entity = entity + break + + if not root_entity: + raise ValueError(f"Could not find entity with ID {node_id} in RO-Crate") + + # Create evidence graph + graph_id = f"{node_id}-evidence-graph" + graph = EvidenceGraphJSON( + guid=graph_id, + owner=node_id, + description=f"Evidence graph for {root_entity.get('name', 'Unknown')}", + name=f"Evidence Graph - {root_entity.get('name', 'Unknown')}" + ) + + # Build the graph + graph.build_graph(node_id, rocrate_data) + + # Save to file if output path provided + if output_path: + graph.save_to_file(output_path) + + return graph.to_dict() diff 
--git a/src/fairscape_cli/datasheet_builder/evidence_graph/html_builder.py b/src/fairscape_cli/datasheet_builder/evidence_graph/html_builder.py new file mode 100644 index 0000000..9c32626 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/evidence_graph/html_builder.py @@ -0,0 +1,1031 @@ +import json +import os +import argparse +from pathlib import Path + +def generate_evidence_graph_html(rocrate_path, output_path=None): + """ + Generate a standalone HTML file containing an interactive React + visualization of the evidence graph extracted from an RO-Crate. + Includes panning functionality and adjusted node spacing. + Automatically expands the first two levels of the graph. + + Args: + rocrate_path: Path to the RO-Crate metadata.json file + output_path: Path where the HTML output should be saved (default: same directory as input with .html extension) + + Returns: + Path to the generated HTML file + """ + try: + with open(rocrate_path, 'r', encoding='utf-8') as f: + rocrate_data = json.load(f) + except FileNotFoundError: + print(f"Error: RO-Crate file not found at {rocrate_path}") + return None + except json.JSONDecodeError: + print(f"Error: Could not parse JSON from {rocrate_path}") + return None + except Exception as e: + print(f"An unexpected error occurred while reading the RO-Crate: {e}") + return None + + if output_path is None: + output_path = Path(rocrate_path).with_suffix('.html') + else: + output_path = Path(output_path) + + output_path.parent.mkdir(parents=True, exist_ok=True) + + # Note: Double curly braces {{ }} are used for literal braces in the f-string. + html_content = f""" + + + + + Evidence Graph Visualization + + + + + + +
+ + + + + """ + + try: + with open(output_path, 'w', encoding='utf-8') as f: + f.write(html_content) + print(f"Evidence graph visualization saved to: {output_path}") + return str(output_path) + except IOError as e: + print(f"Error writing HTML file to {output_path}: {e}") + return None + except Exception as e: + print(f"An unexpected error occurred while writing the HTML file: {e}") + return None + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='Generate an evidence graph visualization from an RO-Crate using React (without React Flow)') + parser.add_argument('rocrate_path', help='Path to the RO-Crate metadata.json file (or equivalent .jsonld)') + parser.add_argument('--output', '-o', help='Output HTML file path (default: -evidence-graph.html)') + + args = parser.parse_args() + + crate_path = Path(args.rocrate_path) + if not crate_path.is_file(): + print(f"Error: Input file not found at '{args.rocrate_path}'") + else: + output_file = args.output + if not output_file: + output_file = crate_path.parent / f"{crate_path.stem}-evidence-graph.html" + + + generate_evidence_graph_html(str(crate_path), str(output_file)) \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/generate_datasheet.py b/src/fairscape_cli/datasheet_builder/generate_datasheet.py new file mode 100644 index 0000000..8282bf6 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/generate_datasheet.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +import argparse +import sys +import os +import traceback +from rocrate.datasheet_generator import DatasheetGenerator + +def main(): + parser = argparse.ArgumentParser( + description='Generate RO-Crate datasheets and previews with support for sub-crates' + ) + parser.add_argument( + '--input', + required=True, + help='Path to top-level ro-crate-metadata.json file' + ) + args = parser.parse_args() + + try: + input_dir = os.path.dirname(args.input) + template_dir = "./templates" + output_path = 
os.path.join(input_dir, "ro-crate-datasheet.html") + + generator = DatasheetGenerator( + json_path=args.input, + template_dir=template_dir + ) + + generator.process_subcrates() + + final_output_path = generator.save_datasheet(output_path) + print(f"Datasheet generated successfully: {final_output_path}") + except Exception as e: + print(f"Error: {str(e)}") + traceback.print_exc() + return 1 + + return 0 + +if __name__ == "__main__": + sys.exit(main()) \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/__init__.py b/src/fairscape_cli/datasheet_builder/rocrate/__init__.py new file mode 100644 index 0000000..8633f91 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/__init__.py @@ -0,0 +1,19 @@ +from .base import ROCrateProcessor +from .template_engine import TemplateEngine +from .section_generators import ( + OverviewSectionGenerator, + UseCasesSectionGenerator, + DistributionSectionGenerator, + SubcratesSectionGenerator +) +from .datasheet_generator import DatasheetGenerator + +__all__ = [ + 'ROCrateProcessor', + 'TemplateEngine', + 'OverviewSectionGenerator', + 'UseCasesSectionGenerator', + 'DistributionSectionGenerator', + 'SubcratesSectionGenerator', + 'DatasheetGenerator' +] \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/base.py b/src/fairscape_cli/datasheet_builder/rocrate/base.py new file mode 100644 index 0000000..809d193 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/base.py @@ -0,0 +1,253 @@ +from collections import Counter +import json +import os +from fairscape_cli.datasheet_builder.rocrate import prov + + +class ROCrateProcessor: + """Base class for processing RO-Crate data""" + + def __init__(self, json_data=None, json_path=None, published=False): + """Initialize with either JSON data or a path to a JSON file""" + if json_data: + self.json_data = json_data + elif json_path: + with open(json_path, 'r', encoding='utf-8') as f: + self.json_data = 
json.load(f)
+        else:
+            raise ValueError("Either json_data or json_path must be provided")
+
+        self.graph = self.json_data.get("@graph", [])
+        self.root = self.find_root_node()
+        self.published = published
+
+    def find_root_node(self):
+        """Find the root node in the RO-Crate graph"""
+        for item in self.graph:
+            if "@type" in item:
+                if isinstance(item["@type"], list) and "Dataset" in item["@type"] and "https://w3id.org/EVI#ROCrate" in item["@type"]:
+                    return item
+                elif item["@type"] == "Dataset" or "ROCrate" in item["@type"]:
+                    return item
+
+        for item in self.graph:
+            if "@id" in item and not item["@id"].endswith("metadata.json"):
+                return item
+
+        # Fall back to the second graph entry (the first is usually the
+        # metadata descriptor), guarding against graphs with fewer entries.
+        if len(self.graph) > 1:
+            return self.graph[1]
+        return self.graph[0] if self.graph else {}
+
+    def find_subcrates(self):
+        """Find sub-crates referenced in the RO-Crate"""
+        subcrates = []
+
+        for item in self.graph:
+            if "@type" in item and item != self.root:
+                item_types = item["@type"] if isinstance(item["@type"], list) else [item["@type"]]
+
+                if ("Dataset" in item_types and "https://w3id.org/EVI#ROCrate" in item_types) or "ROCrate" in item_types:
+                    if "ro-crate-metadata" in item:
+                        subcrates.append({
+                            "id": item.get("@id", ""),
+                            "name": item.get("name", "Unnamed Sub-Crate"),
+                            "description": item.get("description", ""),
+                            "metadata_path": item.get("ro-crate-metadata", "")
+                        })
+
+        return subcrates
+
+    def categorize_items(self):
+        """Categorize graph items into files, software, instruments, samples, experiments, computations, schemas, and other"""
+        files = []
+        software = []
+        instruments = []
+        samples = []
+        experiments = []
+        computations = []
+        schemas = []
+        other = []
+
+        for item in self.graph:
+            if "@type" not in item:
+                continue
+
+            item_types = item["@type"] if isinstance(item["@type"], list) else [item["@type"]]
+
+            if item == self.root or (item.get("@id", "").endswith("metadata.json")):
+                continue
+
+            # Skip items that are identified as subcrates
+            if "ro-crate-metadata" in item:
+                continue
+
+            # Categorize by type
+            if "Dataset" in item_types or "EVI:Dataset" in
item_types or "https://w3id.org/EVI#Dataset" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Dataset" or item.get("additionalType") == "Dataset": + files.append(item) + elif "SoftwareSourceCode" in item_types or "EVI:Software" in item_types or "Software" in item_types or "https://w3id.org/EVI#Software" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Software" or item.get("additionalType") == "Software": + software.append(item) + elif "Instrument" in item_types or "https://w3id.org/EVI#Instrument" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Instrument": + instruments.append(item) + elif "Sample" in item_types or "https://w3id.org/EVI#Sample" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Sample": + samples.append(item) + elif "Experiment" in item_types or "https://w3id.org/EVI#Experiment" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Experiment": + experiments.append(item) + elif "Computation" in item_types or "https://w3id.org/EVI#Computation" in item_types or item.get("metadataType") == "https://w3id.org/EVI#Computation" or item.get("additionalType") == "Computation": + computations.append(item) + elif "Schema" in item_types or "EVI:Schema" in item_types or "https://w3id.org/EVI#Schema" in item_types: + schemas.append(item) + else: + other.append(item) + + return files, software, instruments, samples, experiments, computations, schemas, other + + def get_formats_summary(self, items): + """Get a summary of formats in a list of items""" + formats = [item.get("format", "unknown") for item in items] + format_counter = Counter(formats) + return format_counter + + def get_access_summary(self, items): + """Get a summary of content URL types (available, embargoed, etc.)""" + access_types = [] + for item in items: + content_url = item.get("contentUrl", "") + if not content_url: + access_types.append("No link") + elif content_url == "Embargoed": + 
access_types.append("Embargoed") + else: + access_types.append("Available") + + access_counter = Counter(access_types) + return access_counter + + def get_date_range(self, items): + """Get the date range for a list of items""" + dates = [] + for item in items: + date = item.get("datePublished", "") + if not date: + date = item.get("dateModified", "") + if not date: + date = item.get("dateCreated", "") + + if date: + dates.append(date) + + if not dates: + return "Unknown" + + return f"{min(dates)} to {max(dates)}" + + def get_property_value(self, property_name, additional_properties=None): + """Get a property value from root or from additionalProperty if present""" + + if property_name in self.root: + return self.root[property_name] + + if additional_properties is None: + additional_properties = self.root.get("additionalProperty", []) + + for prop in additional_properties: + if prop.get("name") == property_name or prop.get("propertyID") == property_name: + return prop.get("value", "") + + return "" + + def extract_cell_line_info(self, samples): + """Extract cell line information from samples""" + cell_lines = {} + + for sample in samples: + # Check if sample has a direct cell line reference or derivedFrom + ref_id = None + if "cellLineReference" in sample and isinstance(sample["cellLineReference"], dict) and "@id" in sample["cellLineReference"]: + ref_id = sample["cellLineReference"]["@id"] + elif "derivedFrom" in sample and isinstance(sample["derivedFrom"], dict) and "@id" in sample["derivedFrom"]: + ref_id = sample["derivedFrom"]["@id"] + + if ref_id: + # Find the entity in the graph + for item in self.graph: + if item.get("@id") == ref_id: + cell_info = { + "name": item.get("name", "Unknown"), + "identifier": "", + "organism_name": "Unknown" + } + + # Get CVCL identifier + identifiers = item.get("identifier", []) + if isinstance(identifiers, list): + for id_obj in identifiers: + if isinstance(id_obj, dict) and "@value" in id_obj and "CVCL" in id_obj["@value"]: + 
cell_info["identifier"] = id_obj["@value"].split(":")[-1] + break + + # Get organism name (if directly nested) + organism = item.get("organism", {}) + if isinstance(organism, dict) and "name" in organism: + cell_info["organism_name"] = organism["name"] + + cell_lines[ref_id] = cell_info + break + + return cell_lines + + def extract_sample_species(self, samples): + """Extract species information from samples by checking cell line references""" + species = {} + + for sample in samples: + scientific_name = "Unknown" + + # Check if sample has a cell line reference + cell_line_ref = sample.get("cellLineReference", {}) + if cell_line_ref and isinstance(cell_line_ref, dict) and "@id" in cell_line_ref: + cell_line_id = cell_line_ref.get("@id", "") + + # Find the cell line in the graph + for item in self.graph: + if item.get("@id") == cell_line_id: + # Get organism information from the cell line + organism = item.get("organism", {}) + if organism and isinstance(organism, dict): + org_name = organism.get("name") + if org_name: + scientific_name = org_name + break + + if scientific_name == "Unknown": + additional_properties = sample.get("additionalProperty", []) + for prop in additional_properties: + if prop.get("propertyID") == "scientific_name" and prop.get("value") != "N. 
A.": + scientific_name = prop.get("value", "") + break + + if scientific_name not in species: + species[scientific_name] = 1 + else: + species[scientific_name] += 1 + + return species + + def extract_experiment_types(self, experiments): + """Extract experiment types""" + experiment_types = {} + + for experiment in experiments: + exp_type = experiment.get("experimentType", "Unknown") + + if exp_type not in experiment_types: + experiment_types[exp_type] = 1 + else: + experiment_types[exp_type] += 1 + + return experiment_types + + def get_dataset_format(self, dataset_id): + """Get the format of a dataset by its ID""" + for item in self.graph: + if item.get("@id") == dataset_id: + return item.get("format", "unknown") + + return "unknown" \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/datasheet_generator.py b/src/fairscape_cli/datasheet_builder/rocrate/datasheet_generator.py new file mode 100644 index 0000000..018f1a1 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/datasheet_generator.py @@ -0,0 +1,126 @@ +import os +from .base import ROCrateProcessor +from .template_engine import TemplateEngine +from .section_generators import ( + OverviewSectionGenerator, + UseCasesSectionGenerator, + DistributionSectionGenerator, + SubcratesSectionGenerator +) +from .preview_generator import PreviewGenerator + + +class DatasheetGenerator: + """Main class for generating RO-Crate datasheets""" + + def __init__(self, json_data=None, json_path=None, template_dir=None, published = False): + """Initialize with JSON data or a path to a JSON file""" + + self.processor = ROCrateProcessor(json_data=json_data, json_path=json_path, published=published) + + self.template_engine = TemplateEngine(template_dir=template_dir) + + self.overview_generator = OverviewSectionGenerator(self.template_engine, self.processor) + self.use_cases_generator = UseCasesSectionGenerator(self.template_engine, self.processor) + self.distribution_generator = 
DistributionSectionGenerator(self.template_engine, self.processor) + self.subcrates_generator = SubcratesSectionGenerator(self.template_engine, self.processor) + + if json_path: + self.base_dir = os.path.dirname(os.path.abspath(json_path)) + else: + self.base_dir = os.getcwd() + + def generate_datasheet(self): + """Generate the complete datasheet""" + overview_section = self.overview_generator.generate() + use_cases_section = self.use_cases_generator.generate() + subcrates_section = self.subcrates_generator.generate(base_dir=self.base_dir) + distribution_section = self.distribution_generator.generate() + + files, software, instruments, samples, experiments, computations, schemas, other = self.processor.categorize_items() + files_count = len(files) + software_count = len(software) + instruments_count = len(instruments) + samples_count = len(samples) + experiments_count = len(experiments) + computations_count = len(computations) + schemas_count = len(schemas) + other_count = len(other) + + cell_lines = self.processor.extract_cell_line_info(samples) + species = self.processor.extract_sample_species(samples) + experiment_types = self.processor.extract_experiment_types(experiments) + + subcrates = self.processor.find_subcrates() + subcrate_count = len(subcrates) + + context = { + 'title': self.processor.root.get("name", "Untitled RO-Crate"), + 'version': self.processor.root.get("version", ""), + 'overview_section': overview_section, + 'use_cases_section': use_cases_section, + 'subcrates_section': subcrates_section, + 'distribution_section': distribution_section, + 'files_count': files_count, + 'software_count': software_count, + 'instruments_count': instruments_count, + 'samples_count': samples_count, + 'experiments_count': experiments_count, + 'computations_count': computations_count, + 'schemas_count': schemas_count, + 'other_count': other_count, + 'cell_lines': cell_lines, + 'species': species, + 'experiment_types': experiment_types, + 'subcrate_count': 
subcrate_count + } + + return self.template_engine.render('base.html', **context) + + def save_datasheet(self, output_path=None): + """Generate and save the datasheet to a file""" + if output_path is None: + output_path = os.path.join(self.base_dir, "ro-crate-datasheet.html") + + datasheet_html = self.generate_datasheet() + with open(output_path, 'w', encoding='utf-8') as f: + f.write(datasheet_html) + + return output_path + + def process_subcrates(self): + """Process all subcrates and generate HTML preview files for each.""" + subcrates = self.processor.find_subcrates() + + processed_count = 0 + for subcrate_info in subcrates: + metadata_path = subcrate_info.get("metadata_path", "") + if not metadata_path: + print(f"Skipping subcrate '{subcrate_info.get('name', subcrate_info.get('id'))}' due to missing 'ro-crate-metadata' path.") + continue + + full_path = os.path.normpath(os.path.join(self.base_dir, metadata_path)) + + if not os.path.exists(full_path): + print(f"Warning: Subcrate metadata file not found at {full_path}. Skipping.") + continue + + try: + subcrate_dir = os.path.dirname(full_path) + output_path = os.path.join(subcrate_dir, "ro-crate-preview.html") + + subcrate_processor = ROCrateProcessor(json_path=full_path) + preview_gen = PreviewGenerator( + processor=subcrate_processor, + template_engine=self.template_engine, + base_dir=subcrate_dir + ) + saved_path = preview_gen.save_preview_html(output_path) + + processed_count += 1 + except Exception as e: + import traceback + print(f"Error processing subcrate {subcrate_info.get('name', '')} at {full_path}: {str(e)}") + #traceback.print_exc() + + print(f"Finished processing subcrates. 
Generated {processed_count} preview files.") \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/preview_generator.py b/src/fairscape_cli/datasheet_builder/rocrate/preview_generator.py new file mode 100644 index 0000000..7114960 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/preview_generator.py @@ -0,0 +1,224 @@ +import os +import json +from datetime import datetime +from .base import ROCrateProcessor +from .template_engine import TemplateEngine + +class PreviewGenerator: + DEFAULT_TEMPLATE = 'preview.html' + DESCRIPTION_TRUNCATE_LENGTH = 100 + + def __init__(self, processor: ROCrateProcessor, template_engine: TemplateEngine, base_dir: str): + self.processor = processor + self.template_engine = template_engine + self.base_dir = base_dir + + def _prepare_item_data(self, items): + prepared_items = [] + if not items: + return [] + for item in items: + if not isinstance(item, dict): + continue + + name = item.get("name", item.get("@id", "Unnamed Item")) + description = item.get("description", "") + if description is None: + description = "" + + description_display = description + if len(description) > self.DESCRIPTION_TRUNCATE_LENGTH: + description_display = description[:self.DESCRIPTION_TRUNCATE_LENGTH] + "..." 
+
+            date = item.get("datePublished", "") or \
+                   item.get("dateModified", "") or \
+                   item.get("dateCreated", "") or \
+                   item.get("date", "")
+
+            content_url = item.get("contentUrl", "")
+            if isinstance(content_url, list):
+                content_url = content_url[0] if content_url else ""
+
+            content_status = "Not specified"
+            if content_url == "Embargoed":
+                content_status = "Embargoed"
+            elif content_url:
+                link_target = content_url
+                if not link_target.startswith(('http:', 'https:', 'ftp:', '/')):
+                    if self.base_dir and link_target:
+                        try:
+                            abs_link_path = os.path.normpath(os.path.join(self.base_dir, link_target))
+                            link_target = os.path.relpath(abs_link_path, self.base_dir)
+                        except ValueError:
+                            link_target = content_url
+                    else:
+                        link_target = content_url
+
+                # Render the resolved link target as a download link in the preview HTML.
+                content_status = f'<a href="{link_target}" target="_blank">Access / Download</a>'
+
+            item_type = item.get("@type", "Unknown")
+            if isinstance(item_type, list):
+                specific_types = [t for t in item_type if t and not t.startswith(('http', 'https'))]
+                item_type = specific_types[0] if specific_types else ", ".join(filter(None, item_type))
+            elif item_type is None:
+                item_type = "Unknown"
+
+            manufacturer_raw = item.get("manufacturer")
+            manufacturer_name = "N/A"
+            if isinstance(manufacturer_raw, dict):
+                manufacturer_name = manufacturer_raw.get("name", "N/A")
+            elif isinstance(manufacturer_raw, str):
+                manufacturer_name = manufacturer_raw
+
+            schema_properties = None
+            if item_type == "EVI:Schema" or "Schema" in str(item_type):
+                props = item.get("properties", {})
+                if props and isinstance(props, dict):
+                    schema_properties = {}
+                    for prop_name, prop_details in props.items():
+                        if isinstance(prop_details, dict):
+                            schema_properties[prop_name] = {
+                                "type": prop_details.get("type", "Unknown"),
+                                "description": prop_details.get("description", ""),
+                                "index": prop_details.get("index", "N/A")
+                            }
+
+            prepared_items.append({
+                "name": name,
+                "description": description,
+                "description_display": description_display,
+                "date": date or "N/A",
+                "content_status":
content_status, + "id": item.get("@id", ""), + "type": item_type, + "identifier": item.get("identifier", item.get("@id", "")), + "experimentType": item.get("experimentType", "N/A"), + "manufacturer": manufacturer_name, + "schema_properties": schema_properties + }) + return prepared_items + + def generate_preview_html(self): + root = self.processor.root + if not root: + return "Error: Could not find root dataset node in RO-Crate." + + title = root.get("name", "Untitled RO-Crate") + id_value = root.get("@id", "") + version = root.get("version", "") + description = root.get("description", "") + doi = root.get("identifier", "") + license_value = root.get("license", "") + + release_date = root.get("datePublished", "") + created_date = root.get("dateCreated", "") + updated_date = root.get("dateModified", "") + + authors_raw = root.get("author", []) + authors = "" + author_names = [] + if isinstance(authors_raw, list): + for a in authors_raw: + if isinstance(a, dict): + author_names.append(a.get("name", "Unknown Author")) + elif isinstance(a, str): + author_names.append(a) + elif isinstance(authors_raw, dict): + authors = authors_raw.get("name", "Unknown Author") + elif isinstance(authors_raw, str): + authors = authors_raw + if author_names: + authors = ", ".join(filter(None, author_names)) + + publisher_raw = root.get("publisher", "") + publisher = "" + if isinstance(publisher_raw, dict): + publisher = publisher_raw.get("name", "") + elif isinstance(publisher_raw, str): + publisher = publisher_raw + + keywords_raw = root.get("keywords", []) + keywords = [] + if isinstance(keywords_raw, list): + keywords = [str(kw) for kw in keywords_raw if kw] + elif isinstance(keywords_raw, str): + keywords = [kw.strip() for kw in keywords_raw.split(',') if kw.strip()] + + related_pubs_list = root.get("relatedPublications", []) + associated_pubs_list = root.get("associatedPublication", []) + + if not isinstance(related_pubs_list, list): + related_pubs_list = [related_pubs_list] if 
related_pubs_list else [] + if not isinstance(associated_pubs_list, list): + associated_pubs_list = [associated_pubs_list] if associated_pubs_list else [] + + combined_pubs_raw = related_pubs_list + associated_pubs_list + related_publications = [] + seen_pubs = set() + + for pub in combined_pubs_raw: + pub_text = "" + if isinstance(pub, dict): + pub_text = pub.get("name", pub.get("@id", "")) + elif isinstance(pub, str): + pub_text = pub + + if not pub_text: continue + + if pub_text not in seen_pubs: + related_publications.append(pub_text) + seen_pubs.add(pub_text) + + files, software, instruments, samples, experiments, computations, schemas, other = self.processor.categorize_items() + + datasets = files if isinstance(files, list) else [] + software_list = software if isinstance(software, list) else [] + instruments_list = instruments if isinstance(instruments, list) else [] + samples_list = samples if isinstance(samples, list) else [] + experiments_list = experiments if isinstance(experiments, list) else [] + computations_list = computations if isinstance(computations, list) else [] + schemas_list = schemas if isinstance(schemas, list) else [] + other_list = other if isinstance(other, list) else [] + + context = { + 'title': title or "Untitled RO-Crate", + 'id_value': id_value or "N/A", + 'version': version or "N/A", + 'description': description or "No description provided.", + 'doi': doi or "", + 'license_value': license_value or "", + 'release_date': release_date or "", + 'created_date': created_date or "", + 'updated_date': updated_date or "", + 'authors': authors or "N/A", + 'publisher': publisher or "N/A", + 'principal_investigator': root.get("principalInvestigator", ""), + 'contact_email': root.get("contactEmail", ""), + 'confidentiality_level': root.get("confidentialityLevel", ""), + 'keywords': keywords, + 'citation': root.get("citation", ""), + 'related_publications': related_publications, + 'datasets': self._prepare_item_data(datasets), + 'software': 
self._prepare_item_data(software_list), + 'computations': self._prepare_item_data(computations_list), + 'samples': self._prepare_item_data(samples_list), + 'experiments': self._prepare_item_data(experiments_list), + 'instruments': self._prepare_item_data(instruments_list), + 'schemas': self._prepare_item_data(schemas_list), + 'other_items': self._prepare_item_data(other_list) + } + + return self.template_engine.render(self.DEFAULT_TEMPLATE, **context) + + def save_preview_html(self, output_path=None): + if output_path is None: + output_path = os.path.join(self.base_dir, "ro-crate-preview.html") + + # Guard against a bare filename, where dirname() returns "" and makedirs would fail + os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True) + + html_content = self.generate_preview_html() + + with open(output_path, 'w', encoding='utf-8') as f: + f.write(html_content) + + return output_path \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/prov.py b/src/fairscape_cli/datasheet_builder/rocrate/prov.py new file mode 100644 index 0000000..275978f --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/prov.py @@ -0,0 +1,120 @@ +from .base import ROCrateProcessor + +def get_dataset_format(processor, dataset_id): + """ + Get the format of a dataset by its ID + + Args: + processor: ROCrateProcessor instance + dataset_id: The ID of the dataset + + Returns: + str: The format of the dataset, or "unknown" if not found + """ + for item in processor.graph: + if item.get("@id") == dataset_id: + return item.get("format", "unknown") + + # If not found in current crate, look for it in subcrates + subcrates = processor.find_subcrates() + for subcrate_info in subcrates: + metadata_path = subcrate_info.get("metadata_path", "") + if metadata_path: + try: + subcrate_processor = ROCrateProcessor(json_path=metadata_path) + for item in subcrate_processor.graph: + if item.get("@id") == dataset_id: + return item.get("format", "unknown") + except Exception: + pass + + return "unknown" + +def summarize_computation_io_formats(processor): + """ + Summarize the input and output
formats for computations + + Args: + processor: ROCrateProcessor instance + + Returns: + dict: Dictionary with computation patterns as keys and counts as values + """ + _, _, _, _, _, computations, _, _ = processor.categorize_items() + + patterns = {} + + for computation in computations: + input_formats = [] + output_formats = [] + + # Get input formats + input_datasets = computation.get("usedDataset", []) + if input_datasets: + if isinstance(input_datasets, list): + for dataset in input_datasets: + if isinstance(dataset, dict) and "@id" in dataset: + input_format = get_dataset_format(processor, dataset["@id"]) + if input_format not in input_formats: + input_formats.append(input_format) + elif isinstance(dataset, str): + input_format = get_dataset_format(processor, dataset) + if input_format not in input_formats: + input_formats.append(input_format) + elif isinstance(input_datasets, dict) and "@id" in input_datasets: + input_format = get_dataset_format(processor, input_datasets["@id"]) + input_formats.append(input_format) + elif isinstance(input_datasets, str): + input_format = get_dataset_format(processor, input_datasets) + input_formats.append(input_format) + + # Get output formats + output_datasets = computation.get("generated", []) + if output_datasets: + if isinstance(output_datasets, list): + for dataset in output_datasets: + if isinstance(dataset, dict) and "@id" in dataset: + output_format = get_dataset_format(processor, dataset["@id"]) + if output_format not in output_formats: + output_formats.append(output_format) + elif isinstance(dataset, str): + output_format = get_dataset_format(processor, dataset) + if output_format not in output_formats: + output_formats.append(output_format) + elif isinstance(output_datasets, dict) and "@id" in output_datasets: + output_format = get_dataset_format(processor, output_datasets["@id"]) + output_formats.append(output_format) + elif isinstance(output_datasets, str): + output_format = get_dataset_format(processor, 
output_datasets) + output_formats.append(output_format) + + # Create a pattern string + if input_formats and output_formats: + input_str = ", ".join(sorted(input_formats)) + output_str = ", ".join(sorted(output_formats)) + pattern = f"{input_str} → {output_str}" + + if pattern in patterns: + patterns[pattern] += 1 + else: + patterns[pattern] = 1 + + return patterns + +def get_computation_summary(processor): + """ + Get a summary of computation transformations + + Args: + processor: ROCrateProcessor instance + + Returns: + list: List of transformation pattern strings + """ + patterns = summarize_computation_io_formats(processor) + summary = [] + + for pattern, count in patterns.items(): + summary.append(pattern) + + return summary \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/section_generators.py b/src/fairscape_cli/datasheet_builder/rocrate/section_generators.py new file mode 100644 index 0000000..03c3277 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/section_generators.py @@ -0,0 +1,401 @@ +from .base import ROCrateProcessor +import os + +class SectionGenerator: + def __init__(self, template_engine, processor=None): + self.template_engine = template_engine + self.processor = processor + + def generate(self, template_name, **context): + return self.template_engine.render(template_name, **context) + + +class OverviewSectionGenerator(SectionGenerator): + def generate(self, processor=None): + if processor: + self.processor = processor + + if not self.processor: + raise ValueError("Processor is required to generate the overview section") + + root = self.processor.root + additional_properties = root.get("additionalProperty", []) + + context = { + 'title': root.get("name", "Untitled RO-Crate"), + 'description': root.get("description", ""), + 'id_value': root.get("@id", ""), + 'doi': root.get("identifier", ""), + 'license_value': root.get("license", ""), + 'release_date': root.get("datePublished", ""), + 
'created_date': root.get("dateCreated", ""), + 'updated_date': root.get("dateModified", ""), + 'authors': root.get("author", ""), + 'publisher': root.get("publisher", ""), + 'principal_investigator': root.get("principalInvestigator", ""), + 'contact_email': root.get("contactEmail", ""), + 'copyright': root.get("copyrightNotice", ""), + 'terms_of_use': root.get("conditionsOfAccess", ""), + 'confidentiality_level': root.get("confidentialityLevel", ""), + 'citation': root.get("citation", ""), + 'version': root.get("version", ""), + 'content_size': root.get("contentSize", ""), + 'human_subject': self.processor.get_property_value("Human Subject", additional_properties), + 'human_subject_research': self.processor.get_property_value("Human Subject Research", additional_properties) or "No", + 'human_subject_exemptions': self.processor.get_property_value("Human Subjects Exemptions", additional_properties) or "N/A", + 'deidentified_samples': self.processor.get_property_value("De-identified Samples", additional_properties) or "Yes", + 'fda_regulated': self.processor.get_property_value("FDA Regulated", additional_properties) or "No", + 'irb': self.processor.get_property_value("IRB", additional_properties) or "N/A", + 'irb_protocol_id': self.processor.get_property_value("IRB Protocol ID", additional_properties) or "N/A", + 'data_governance': self.processor.get_property_value("Data Governance Committee", additional_properties) or "", + 'completeness': self.processor.get_property_value("Completeness", additional_properties), + 'funding': root.get("funder", ""), + 'keywords': root.get("keywords", []), + "published": self.processor.published + } + + related_publications = root.get("associatedPublication", []) + if related_publications and isinstance(related_publications, list): + context['related_publications'] = related_publications + else: + context['related_publications'] = [] + + return self.template_engine.render('sections/overview.html', **context) + +class 
UseCasesSectionGenerator(SectionGenerator): + def generate(self, processor=None): + if processor: + self.processor = processor + + if not self.processor: + raise ValueError("Processor is required to generate the use cases section") + + root = self.processor.root + additional_properties = root.get("additionalProperty", []) + + context = { + 'intended_uses': self.processor.get_property_value("Intended Use", additional_properties), + 'limitations': self.processor.get_property_value("Limitations", additional_properties), + 'prohibited_uses': self.processor.get_property_value("Prohibited Uses", additional_properties), + 'maintenance_plan': self.processor.get_property_value("Maintenance Plan", additional_properties), + 'potential_bias': self.processor.get_property_value("Potential Sources of Bias", additional_properties) + } + + return self.template_engine.render('sections/use_cases.html', **context) + + +class DistributionSectionGenerator(SectionGenerator): + def generate(self, processor=None): + if processor: + self.processor = processor + + if not self.processor: + raise ValueError("Processor is required to generate the distribution section") + + root = self.processor.root + + context = { + 'license_value': root.get("license", ""), + 'publisher': root.get("publisher", ""), + 'host': root.get("distributionHost", ""), + 'doi': root.get("doi", ""), + 'release_date': root.get("datePublished", ""), + 'version': root.get("version", "") + } + + return self.template_engine.render('sections/distribution.html', **context) + + +def get_directory_size(directory): + total_size = 0 + for dirpath, dirnames, filenames in os.walk(directory): + for filename in filenames: + file_path = os.path.join(dirpath, filename) + if not os.path.islink(file_path): + total_size += os.path.getsize(file_path) + return total_size + + +def format_size(size_in_bytes): + for unit in ['B', 'KB', 'MB', 'GB', 'TB']: + if size_in_bytes < 1024.0: + return f"{size_in_bytes:.2f} {unit}" + size_in_bytes /= 1024.0 
+ return f"{size_in_bytes:.2f} PB" + + +class SubcratesSectionGenerator(SectionGenerator): + def generate(self, processor=None, base_dir=None): + if processor: + self.processor = processor + + if not self.processor: + raise ValueError("Processor is required to generate the subcrates section") + + subcrates = self.processor.find_subcrates() + + processed_subcrates = [] + subcrate_processors = {} + hasPart_mapping = {} + + for subcrate_ref in self.processor.root.get("hasPart", []): + if isinstance(subcrate_ref, dict) and "@id" in subcrate_ref: + subcrate_id = subcrate_ref["@id"] + hasPart_mapping[subcrate_id] = {} + + for subcrate_info in subcrates: + metadata_path = subcrate_info.get("metadata_path", "") + if not metadata_path or not base_dir: + continue + + full_path = os.path.join(base_dir, metadata_path) + if not os.path.exists(full_path): + continue + + try: + subcrate_processor = ROCrateProcessor(json_path=full_path) + subcrate_id = subcrate_processor.root.get("@id", subcrate_info.get("id", "")) + + subcrate_processors[subcrate_id] = subcrate_processor + subcrate_dir = os.path.dirname(full_path) + + files, software, instruments, samples, experiments, computations, schemas, other = subcrate_processor.categorize_items() + if isinstance(subcrate_processor.root.get("author", ""), list): + authors = ", ".join(subcrate_processor.root.get("author", "")) + else: + authors = subcrate_processor.root.get("author", "") + + additional_properties = subcrate_processor.root.get("additionalProperty", []) + + subcrate = { + 'name': subcrate_processor.root.get("name", subcrate_info.get("name", "Unnamed Sub-Crate")), + 'id': subcrate_id, + 'description': subcrate_processor.root.get("description", subcrate_info.get("description", "")), + 'authors': authors, + 'keywords': subcrate_processor.root.get("keywords", []), + 'metadata_path': metadata_path, + } + + + size = subcrate_processor.root.get("contentSize", "") + if not size and os.path.exists(subcrate_dir): + try: + dir_size = 
get_directory_size(subcrate_dir) + size = format_size(dir_size) + except Exception: + size = "Unknown" + + subcrate["size"] = size + + subcrate['doi'] = subcrate_processor.root.get("identifier", self.processor.root.get("identifier", "")) + subcrate['date'] = subcrate_processor.root.get("datePublished", self.processor.root.get("datePublished", "")) + subcrate['contact'] = subcrate_processor.root.get("contactEmail", self.processor.root.get("contactEmail", "")) + subcrate['published'] = self.processor.published + + # Get copyright, license, and terms of use + subcrate['copyright'] = subcrate_processor.root.get("copyrightNotice", "Copyright (c) 2025 The Regents of the University of California") + subcrate['license'] = subcrate_processor.root.get("license", "https://creativecommons.org/licenses/by-nc-sa/4.0/") + subcrate['terms_of_use'] = subcrate_processor.root.get("conditionsOfAccess", "Attribution is required to the copyright holders and the authors. Any publications referencing this data or derived products should cite the related article as well as directly citing this data collection.") + + subcrate['confidentiality'] = subcrate_processor.root.get("confidentialityLevel", self.processor.root.get("confidentialityLevel", "")) + subcrate['funder'] = subcrate_processor.root.get("funder", self.processor.root.get("funder", "")) + subcrate['md5'] = subcrate_processor.root.get("MD5", "") + # hasEvidenceGraph may be absent or a bare string; indexing the default "" with ["@id"] raised TypeError + evidence_graph = subcrate_processor.root.get("hasEvidenceGraph") or {} + subcrate['evidence'] = evidence_graph.get("@id", "") if isinstance(evidence_graph, dict) else evidence_graph + + subcrate['files'] = files + subcrate['files_count'] = len(files) + subcrate['software'] = software + subcrate['software_count'] = len(software) + subcrate['instruments'] = instruments + subcrate['instruments_count'] = len(instruments) + subcrate['samples'] = samples + subcrate['samples_count'] = len(samples) + subcrate['experiments'] = experiments + subcrate['experiments_count'] = len(experiments) + subcrate['computations'] = computations + subcrate['computations_count'] = len(computations) + 
subcrate['schemas'] = schemas + subcrate['schemas_count'] = len(schemas) + subcrate['other'] = other + subcrate['other_count'] = len(other) + + subcrate['file_formats'] = subcrate_processor.get_formats_summary(files) + subcrate['software_formats'] = subcrate_processor.get_formats_summary(software) + subcrate['file_access'] = subcrate_processor.get_access_summary(files) + subcrate['software_access'] = subcrate_processor.get_access_summary(software) + + patterns, external_datasets = self.extract_computation_patterns(subcrate_processor, computations, subcrate_processors) + subcrate['computation_patterns'] = patterns + + external_datasets_by_format = {} + for dataset in external_datasets: + fmt = dataset["format"] + subcrate_name = dataset["subcrate"] + + if subcrate_name: + key = f"{subcrate_name}, {fmt}" + if key in external_datasets_by_format: + external_datasets_by_format[key] += 1 + else: + external_datasets_by_format[key] = 1 + + subcrate['input_datasets'] = external_datasets_by_format + subcrate['input_datasets_count'] = len(external_datasets) + subcrate['inputs_count'] = subcrate['samples_count'] + subcrate['input_datasets_count'] + + subcrate['experiment_patterns'] = self.extract_experiment_patterns(subcrate_processor, experiments) + + # Extract cell line information including CVCL identifier + subcrate['cell_lines'] = subcrate_processor.extract_cell_line_info(samples) + subcrate['species'] = subcrate_processor.extract_sample_species(samples) + subcrate['experiment_types'] = subcrate_processor.extract_experiment_types(experiments) + + related_pubs = subcrate_processor.root.get("relatedPublications", []) + if not related_pubs: + associated_pub = subcrate_processor.root.get("associatedPublication", "") + if associated_pub: + if isinstance(associated_pub, str): + related_pubs = [associated_pub] + elif isinstance(associated_pub, list): + related_pubs = associated_pub + elif self.processor.root.get("relatedPublications", []): + related_pubs = 
self.processor.root.get("relatedPublications", []) + elif self.processor.root.get("associatedPublication", ""): + associated_pub = self.processor.root.get("associatedPublication", "") + if associated_pub and isinstance(associated_pub, str): + related_pubs = [associated_pub] + elif isinstance(associated_pub, list): + related_pubs = associated_pub + + subcrate['related_publications'] = related_pubs + processed_subcrates.append(subcrate) + + except Exception as e: + print(f"Error processing subcrate {subcrate_info.get('name', 'Unnamed Sub-Crate')}: {e}") + continue + + context = { + 'subcrates': processed_subcrates, + 'subcrate_count': len(processed_subcrates) + } + + return self.template_engine.render('sections/subcrates.html', **context) + + + def extract_experiment_patterns(self, processor, experiments): + patterns = {} + + for experiment in experiments: + input_type = "Sample" + output_formats = [] + + output_datasets = experiment.get("generated", []) + if output_datasets: + if isinstance(output_datasets, list): + for dataset in output_datasets: + if isinstance(dataset, dict) and "@id" in dataset: + format_value = processor.get_dataset_format(dataset["@id"]) + if format_value != "unknown" and format_value not in output_formats: + output_formats.append(format_value) + elif isinstance(dataset, str): + format_value = processor.get_dataset_format(dataset) + if format_value != "unknown" and format_value not in output_formats: + output_formats.append(format_value) + + if output_formats: + output_str = ", ".join(sorted(output_formats)) + pattern = f"{input_type} → {output_str}" + + if pattern in patterns: + patterns[pattern] += 1 + else: + patterns[pattern] = 1 + + return list(patterns.keys()) + + def extract_computation_patterns(self, processor, computations, subcrate_processors=None): + patterns = {} + external_datasets = [] + + for computation in computations: + input_formats = [] + output_formats = [] + + # Process inputs + input_datasets_raw = 
computation.get("usedDataset", []) + if input_datasets_raw: + if isinstance(input_datasets_raw, list): + for dataset in input_datasets_raw: + if isinstance(dataset, dict) and "@id" in dataset: + dataset_id = dataset["@id"] + elif isinstance(dataset, str): + dataset_id = dataset + else: + continue + + format_value = processor.get_dataset_format(dataset_id) + if format_value != "unknown": + if format_value not in input_formats: + input_formats.append(format_value) + elif subcrate_processors: + for subcrate_id, subcrate_proc in subcrate_processors.items(): + if subcrate_proc: + format_value = subcrate_proc.get_dataset_format(dataset_id) + if format_value != "unknown": + subcrate_name = subcrate_proc.root.get("name", "Unknown") + + display_fmt = f"{subcrate_name} ({format_value})" + if display_fmt not in input_formats: + input_formats.append(display_fmt) + + external_datasets.append({ + "id": dataset_id, + "format": format_value, + "subcrate": subcrate_name + }) + break + elif isinstance(input_datasets_raw, dict) and "@id" in input_datasets_raw: + dataset_id = input_datasets_raw["@id"] + format_value = processor.get_dataset_format(dataset_id) + if format_value != "unknown": + input_formats.append(format_value) + elif subcrate_processors: + for subcrate_id, subcrate_proc in subcrate_processors.items(): + if subcrate_proc: + format_value = subcrate_proc.get_dataset_format(dataset_id) + if format_value != "unknown": + subcrate_name = subcrate_proc.root.get("name", "Unknown") + display_fmt = f"{subcrate_name} ({format_value})" + input_formats.append(display_fmt) + external_datasets.append({ + "id": dataset_id, + "format": format_value, + "subcrate": subcrate_name + }) + break + + output_datasets = computation.get("generated", []) + if output_datasets: + if isinstance(output_datasets, list): + for dataset in output_datasets: + if isinstance(dataset, dict) and "@id" in dataset: + format_value = processor.get_dataset_format(dataset["@id"]) + if format_value != "unknown" and 
format_value not in output_formats: + output_formats.append(format_value) + elif isinstance(output_datasets, dict) and "@id" in output_datasets: + format_value = processor.get_dataset_format(output_datasets["@id"]) + if format_value != "unknown": + output_formats.append(format_value) + + # Create a pattern string + if input_formats and output_formats: + input_str = ", ".join(sorted(input_formats)) + output_str = ", ".join(sorted(output_formats)) + pattern = f"{input_str} → {output_str}" + + if pattern in patterns: + patterns[pattern] += 1 + else: + patterns[pattern] = 1 + + return list(patterns.keys()), external_datasets \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/template_engine.py b/src/fairscape_cli/datasheet_builder/rocrate/template_engine.py new file mode 100644 index 0000000..ad6be1f --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/rocrate/template_engine.py @@ -0,0 +1,29 @@ +import os +from jinja2 import Environment, FileSystemLoader + + +class TemplateEngine: + """Template engine for rendering HTML templates""" + + def __init__(self, template_dir=None): + """Initialize the template engine with a template directory""" + if template_dir is None: + template_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'templates') + + os.makedirs(template_dir, exist_ok=True) + + self.env = Environment( + loader=FileSystemLoader(template_dir), + trim_blocks=True, + lstrip_blocks=True + ) + + def render(self, template_name, **context): + """Render a template with the given context""" + template = self.env.get_template(template_name) + return template.render(**context) + + def render_string(self, template_string, **context): + """Render a template string with the given context""" + template = self.env.from_string(template_string) + return template.render(**context) \ No newline at end of file diff --git a/src/fairscape_cli/datasheet_builder/rocrate/utilities/__init__.py 
b/src/fairscape_cli/datasheet_builder/rocrate/utilities/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/src/fairscape_cli/datasheet_builder/templates/base.html b/src/fairscape_cli/datasheet_builder/templates/base.html new file mode 100644 index 0000000..0ca80b4 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/base.html @@ -0,0 +1,446 @@ + + + + + + {{ title }} - RO-Crate Datasheet + + + +
+
+

{{ title }}

+
Version: {{ version }}
+
+ + {{ overview_section | safe }} {{ use_cases_section | safe }} + +
+

Composition (Datasets {{ subcrate_count }})

+
+ + {{ subcrates_section | safe }} {{ distribution_section | safe }} +
+
+ Datasheet Provenance: This datasheet and associated metadata were + generated by the FAIRSCAPE AI-readiness platform (Al Manir et al., 2024, + bioRxiv 2024.12.23.629818; + https://doi.org/10.1101/2024.12.23.629818).
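The base template above receives each section as pre-rendered HTML (`{{ overview_section | safe }}`, `{{ subcrates_section | safe }}`, …) and stitches them into one page. A minimal sketch of why `| safe` is needed once autoescaping is on — the template text and names below are illustrative, not the shipped `base.html`:

```python
from jinja2 import Environment, DictLoader

# Stand-in for the datasheet assembly: each section is rendered to HTML
# first, then injected into the base page. With autoescaping enabled, the
# pre-rendered fragments must pass through `| safe` or they would be
# HTML-escaped into visible markup.
env = Environment(
    loader=DictLoader({
        "base.html": "<h1>{{ title }}</h1>{{ overview_section | safe }}"
    }),
    autoescape=True,
)

overview_html = "<section><p>3 datasets, 2 computations</p></section>"
page = env.get_template("base.html").render(
    title="Example Release",
    overview_section=overview_html,
)
print(page)
```

Without `| safe`, `overview_html` would render as escaped text (`&lt;section&gt;…`) rather than markup, which is why the section generators in this diff can safely return raw HTML strings.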
+ + diff --git a/src/fairscape_cli/datasheet_builder/templates/preview.html b/src/fairscape_cli/datasheet_builder/templates/preview.html new file mode 100644 index 0000000..f8e186f --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/preview.html @@ -0,0 +1,637 @@ + + + + + + {{ title }} - RO-Crate Preview + + + +
+
+

{{ title }}

+ {% if version %} +
Version: {{ version }}
+ {% endif %} +
+ +
+

RO-Crate Summary

+
+
ROCrate ID
+
{{ id_value }}
+
+ {% if doi %} +
+
DOI
+
+ {{ doi }} +
+
+ {% endif %} {% if release_date %} +
+
Release Date
+
{{ release_date }}
+
+ {% endif %} {% if created_date %} +
+
Date Created
+
{{ created_date }}
+
+ {% endif %} {% if updated_date %} +
+
Date Modified
+
{{ updated_date }}
+
+ {% endif %} {% if description %} +
+
Description
+
{{ description }}
+
+ {% endif %} {% if authors %} +
+
Authors
+
{{ authors }}
+
+ {% endif %} {% if publisher %} +
+
Publisher
+
{{ publisher }}
+
+ {% endif %} {% if principal_investigator %} +
+
Principal Investigator
+
+ {{ principal_investigator }} +
+
+ {% endif %} {% if contact_email %} +
+
Contact Email
+
+ {{ contact_email }} +
+
+ {% endif %} {% if license_value %} +
+
License
+ +
+ {% endif %} {% if confidentiality_level %} +
+
Confidentiality Level
+
+ {{ confidentiality_level }} +
+
+ {% endif %} {% if keywords %} +
+
Keywords
+
+ {% if keywords is string %}{{ keywords }}{% else %}{{ keywords|join(', ') }}{% endif %} +
+
+ {% endif %} {% if citation %} +
+
Citation
+
{{ citation }}
+
+ {% endif %} {% if related_publications %} +
+
Related Publications
+ +
+ {% endif %} +
+ +
+ {% if datasets %} +
+ Datasets {{ datasets|length }} +
+ {% endif %} {% if software %} +
+ Software {{ software|length }} +
+ {% endif %} {% if computations %} +
+ Computations {{ computations|length }} +
+ {% endif %} {% if samples %} +
+ Samples {{ samples|length }} +
+ {% endif %} {% if experiments %} +
+ Experiments {{ experiments|length }} +
+ {% endif %} {% if instruments %} +
+ Instruments {{ instruments|length }} +
+ {% endif %} {% if schemas %} +
+ Schemas {{ schemas|length }} +
+ {% endif %} {% if other_items %} +
+ Other {{ other_items|length }} +
+ {% endif %} +
+ + {% macro render_table(items, tab_id, is_active, headers, + date_field='date') %} +
+ {% if items %} + + + + {% for header in headers %} + + {% endfor %} + + + + {% for item in items %} + + + + + + + {% endfor %} + +
{{ header }}
{{ item.name }} + {{ item.description_display }} + {{ item.content_status | safe }}{{ item[date_field] }}
+ {% else %} +

No {{ tab_id }} found in this RO-Crate.

+ {% endif %} +
+ {% endmacro %} {% macro render_schema_table(items, tab_id, is_active) %} +
+ {% if items %} + + + + + + + + + + + {% for item in items %} + + + + + + + {% if item.schema_properties %} + + + + {% endif %} {% endfor %} + +
NameDescriptionAccessProperties
{{ item.name }} + {{ item.description_display }} + {{ item.content_status | safe }} + {% if item.schema_properties %} + Show Properties + {% else %} No properties found {% endif %} +
+ {% else %} +

No {{ tab_id }} found in this RO-Crate.

+ {% endif %} +
+ {% endmacro %} {% macro render_other_table(items, tab_id, is_active) %} +
+ {% if items %} + + + + + + + + + + + {% for item in items %} + + + + + + + {% endfor %} + +
NameDescriptionAccess@id
{{ item.name }} + {{ item.description_display }} + {{ item.content_status | safe }}{{ item.id }}
+ {% else %} +

No {{ tab_id }} found in this RO-Crate.

+ {% endif %} +
+ {% endmacro %} {{ render_table(datasets, 'datasets', datasets, ['Name', + 'Description', 'Access', 'Release Date'], 'date') }} {{ + render_table(software, 'software', not datasets and software, ['Name', + 'Description', 'Access', 'Release Date'], 'date') }} {{ + render_table(computations, 'computations', not datasets and not software + and computations, ['Name', 'Description', 'Access', 'Date Created'], + 'date') }} {{ render_table(samples, 'samples', not datasets and not + software and not computations and samples, ['Name', 'Description', + 'Identifier', 'Date Created'], 'date') }} {{ render_table(experiments, + 'experiments', not datasets and not software and not computations and not + samples and experiments, ['Name', 'Description', 'Type', 'Date Created'], + 'date') }} {{ render_table(instruments, 'instruments', not datasets and + not software and not computations and not samples and not experiments and + instruments, ['Name', 'Description', 'Manufacturer', 'Date Created'], + 'date') }} {{ render_schema_table(schemas, 'schemas', not datasets and not + software and not computations and not samples and not experiments and not + instruments and schemas) }} {{ render_other_table(other_items, 'other', + not datasets and not software and not computations and not samples and not + experiments and not instruments and not schemas and other_items) }} +
+ + + + diff --git a/src/fairscape_cli/datasheet_builder/templates/sections/distribution.html b/src/fairscape_cli/datasheet_builder/templates/sections/distribution.html new file mode 100644 index 0000000..6e5cafc --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/sections/distribution.html @@ -0,0 +1,40 @@ +
+

Distribution Information

+
+
+ {% if publisher %} +
+
Publisher:
+
{{ publisher }}
+
+ {% endif %} {% if host %} +
+
Distribution Host:
+
{{ host }}
+
+ {% endif %} {% if license_value %} +
+
License:
+ +
+ {% endif %} {% if doi %} +
+
DOI:
+
+ {{ doi }} +
+
+ {% endif %} {% if release_date %} +
+
Release Date:
+
{{ release_date }}
+
+ {% endif %} {% if version %} +
+
Version:
+
{{ version }}
+
+ {% endif %} +
diff --git a/src/fairscape_cli/datasheet_builder/templates/sections/overview.html b/src/fairscape_cli/datasheet_builder/templates/sections/overview.html new file mode 100644 index 0000000..3c160d0 --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/sections/overview.html @@ -0,0 +1,129 @@ +
+

Release Overview

+
+
ROCrate ID
+
+ {% if published %} + {{ id_value }} + {% else %} {{ id_value }} {% endif %} +
+
+
+
DOI
+
+ {% if doi %} + {{ doi }} + {% else %} None {% endif %} +
+
+
+
Release Date
+
{{ release_date }}
+
+
+
Size
+
{{ content_size }}
+
+
+
Description
+
{{ description }}
+
+
+
Authors
+
{{ authors }}
+
+
+
Publisher
+
{{ publisher }}
+
+
+
Principal Investigator
+
+ {{ principal_investigator }}; {{ contact_email }} +
+
+ {% if data_governance %} +
+
Data Governance Committee
+
{{ data_governance }}
+
+ {% endif %} +
+
Copyright
+ +
+
+
License
+
+ {% if license_value %} + {{ license_value }} + {% else %} Not specified {% endif %} +
+
+
+
Terms of Use
+
{{ terms_of_use }}
+
+
+
HL7 Confidentiality Level
+
+ {{ confidentiality_level }} +
+
+
+
Keywords
+
+ {% if keywords is string %} {{ keywords }} {% else %} {{ keywords|join(', ') }} {% endif %}
+
+
+
Cite As
+
{{ citation }}
+
+ +
+
Funding
+
{{ funding }}
+
+
+
Completeness
+
{{ completeness }}
+
+
+
Related Publications
+ +
+ +
+

Human Subjects & Regulatory

+
+
+ Human Subjects Research: {{ human_subject_research }} +
+
+ Human Subjects Exemptions: {{ human_subject_exemptions }} +
+
+ De-identified Samples: {{ deidentified_samples }} +
+
+ FDA Regulated: {{ fda_regulated }} +
+
+ IRB: {{ irb }} +
+
+ IRB Protocol ID: {{ irb_protocol_id }} +
+
+
+
diff --git a/src/fairscape_cli/datasheet_builder/templates/sections/subcrates.html b/src/fairscape_cli/datasheet_builder/templates/sections/subcrates.html new file mode 100644 index 0000000..fc979cf --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/sections/subcrates.html @@ -0,0 +1,260 @@ +
+ {% if subcrates %} {% for subcrate in subcrates %} +
+

{{ subcrate.name }}

+ + +
+

Content Summary

+
+
+
+ 📊 + Files ({{ subcrate.files_count }}) +
+ {% if subcrate.file_formats or subcrate.file_access %} +
+ {% if subcrate.file_formats %} +
+ Formats: + {% for fmt, count in subcrate.file_formats.items() %}{{ fmt }} + ({{ count }}){% if not loop.last %}, {% endif %}{% endfor + %} +
+ {% endif %} {% if subcrate.file_access %} +
+ Access: + {% for acc, count in subcrate.file_access.items() %}{{ acc }} + ({{ count }}){% if not loop.last %}, {% endif %}{% endfor + %} +
+ {% endif %} +
+ {% endif %} +
+ +
+
+ 💻 + Software & Instruments ({{ subcrate.software_count + + subcrate.instruments_count }}) +
+
+ {% if subcrate.software_count > 0 or subcrate.instruments_count > 0 + %} +
+ Software: + {{ subcrate.software_count }} +
+
+ Instruments: + {{ subcrate.instruments_count }} +
+ {% endif %} +
+
+ +
+
+ 🧪 + Inputs ({{ subcrate.inputs_count }}) +
+
+ {% if subcrate.samples_count > 0 %} +
+ Derived From: + {% if subcrate.cell_lines %} {% for line_id, cell_info in + subcrate.cell_lines.items() %} + {{ cell_info.name }}. {{ cell_info.organism_name }}. ({{ + cell_info.identifier }}) + {% endfor %} {% else %} + Not specified + {% endif %} +
+ {% endif %} {% if subcrate.input_datasets %} +
+ Datasets: + {{ subcrate.input_datasets_count }} +
+
+ + {% for fmt, count in subcrate.input_datasets.items() %}{{ fmt }} + ({{ count }}){% if not loop.last %}, + {% endif %}{% endfor %} + +
+ {% endif %} +
+
+
+
+ ⚙️ + Other Components +
+
+
+ Experiments: + {{ subcrate.experiments_count }} +
+ {% if subcrate.experiment_types %} +
+ {% for type, count in subcrate.experiment_types.items() %}{{ + type }} ({{ count }}){% if not loop.last %}, {% endif %}{% + endfor %} +
+ {% endif %} {% if subcrate.experiment_patterns %} +
+ {% for pattern in subcrate.experiment_patterns %}{{ pattern + }}{% if not loop.last %}, {% endif %}{% endfor %} +
+ {% endif %} +
+ Computations: + {{ subcrate.computations_count }} +
+ {% if subcrate.computation_patterns %} +
+ {% for pattern in subcrate.computation_patterns %}{{ pattern + }}{% if not loop.last %}, {% endif %}{% endfor %} +
+ {% endif %} +
+ Schemas: + {{ subcrate.schemas_count }} +
+
+ Other: + {{ subcrate.other_count }} +
+
+
+
+
+ +
+ {% endfor %} {% else %} +

No subcrates found.

+ {% endif %} +
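The subcrates template above reads a long list of per-subcrate summary fields (`files_count`, `file_formats`, `software_count`, and so on). A minimal sketch of the context entry it expects, with field names copied from the template and all values invented for illustration:

```python
# Hypothetical context entry for one subcrate; the real dict is assembled
# by the datasheet builder. Keys mirror the variables the template reads.
subcrate = {
    "name": "Example Analysis Crate",
    "files_count": 12,
    "file_formats": {"csv": 8, "tiff": 4},        # format -> count
    "file_access": {"open": 10, "embargoed": 2},  # access level -> count
    "software_count": 2,
    "instruments_count": 1,
    "inputs_count": 3,
    "samples_count": 0,
    "cell_lines": {},            # line_id -> {name, organism_name, identifier}
    "input_datasets": {"csv": 3},
    "input_datasets_count": 3,
    "experiments_count": 1,
    "experiment_types": {"imaging": 1},
    "experiment_patterns": [],
    "computations_count": 2,
    "computation_patterns": ["clustering"],
    "schemas_count": 1,
    "other_count": 0,
}

# The "Software & Instruments" heading displays the sum of two counters,
# computed inline by the template expression
# {{ subcrate.software_count + subcrate.instruments_count }}.
software_and_instruments = subcrate["software_count"] + subcrate["instruments_count"]
assert software_and_instruments == 3
```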
diff --git a/src/fairscape_cli/datasheet_builder/templates/sections/use_cases.html b/src/fairscape_cli/datasheet_builder/templates/sections/use_cases.html new file mode 100644 index 0000000..1e24cae --- /dev/null +++ b/src/fairscape_cli/datasheet_builder/templates/sections/use_cases.html @@ -0,0 +1,31 @@ +
+

Use Cases and Limitations

+
+
+ {% if intended_uses %} +
+
Intended Uses:
+
{{ intended_uses }}
+
+ {% endif %} {% if limitations %} +
+
Limitations:
+
{{ limitations }}
+
+ {% endif %} {% if prohibited_uses %} +
+
Prohibited Uses:
+
{{ prohibited_uses }}
+
+ {% endif %} {% if potential_bias %} +
+
Potential Sources of Bias:
+
{{ potential_bias }}
+
+ {% endif %} {% if maintenance_plan %} +
+
Maintenance Plan:
+
{{ maintenance_plan }}
+
+ {% endif %} +
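Each block in the use-cases template renders only when its context value is truthy. A hedged sketch of that behavior in plain Python, with the section labels copied from the template:

```python
# Sections of use_cases.html, in template order: (context key, rendered label).
SECTIONS = [
    ("intended_uses", "Intended Uses:"),
    ("limitations", "Limitations:"),
    ("prohibited_uses", "Prohibited Uses:"),
    ("potential_bias", "Potential Sources of Bias:"),
    ("maintenance_plan", "Maintenance Plan:"),
]

def rendered_sections(context):
    """Return the labels the template would emit for this context.

    Mirrors the {% if key %} guards: empty strings and missing keys
    suppress their section entirely.
    """
    return [label for key, label in SECTIONS if context.get(key)]

# Only the non-empty field produces a section.
labels = rendered_sections({"intended_uses": "Research only.", "limitations": ""})
assert labels == ["Intended Uses:"]
```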
diff --git a/src/fairscape_cli/models/__init__.py b/src/fairscape_cli/models/__init__.py index e65d747..3c87a28 100644 --- a/src/fairscape_cli/models/__init__.py +++ b/src/fairscape_cli/models/__init__.py @@ -5,17 +5,20 @@ registerOutputs ) from fairscape_cli.models.software import Software, GenerateSoftware + from fairscape_cli.models.computation import Computation, GenerateComputation from fairscape_cli.models.rocrate import ( ROCrate, - ROCrateMetadata, GenerateROCrate, ReadROCrateMetadata, AppendCrate, CopyToROCrate, - UpdateCrate + UpdateCrate, + LinkSubcrates, + collect_subcrate_metadata ) from fairscape_cli.models.bagit import BagIt +from fairscape_cli.models.pep import PEPtoROCrateMapper __all__ = [ 'Dataset', @@ -33,5 +36,8 @@ 'AppendCrate', 'CopyToROCrate', 'UpdateCrate', - 'BagIt' + 'BagIt', + 'PEPtoROCrateMapper', + 'LinkSubcrates', + 'collect_subcrate_metadata' ] diff --git a/src/fairscape_cli/models/computation.py b/src/fairscape_cli/models/computation.py index 5ead8c5..6bbcb1f 100644 --- a/src/fairscape_cli/models/computation.py +++ b/src/fairscape_cli/models/computation.py @@ -1,81 +1,49 @@ -import re -from datetime import datetime -from typing import Optional, List, Union, Dict -from pydantic import Field, AnyUrl, BaseModel +from fairscape_models.computation import Computation from fairscape_cli.config import NAAN -from fairscape_cli.models.base import FairscapeBaseModel from fairscape_cli.models.guid_utils import GenerateDatetimeSquid - -class ArkPointer(BaseModel): - ark: str = Field( - alias="@id", - validation_alias="@id" - ) - -class Computation(FairscapeBaseModel): - guid: Optional[str] = Field( - default=None, - alias="@id", - validation_alias="@id" - ) - metadataType: str = Field( - default="https://w3id.org/EVI#Computation", - alias="@type", - validation_alias="@type" - ) - runBy: str - dateCreated: str - description: str = Field(min_length=10) - associatedPublication: Optional[str] = Field(default=None) - additionalDocumentation: 
Optional[str] = Field(default=None) - command: Optional[Union[List[str], str]] = Field(default="") - usedSoftware: Optional[List[ArkPointer]] = Field(default_factory=list) - usedDataset: Optional[List[ArkPointer]] = Field(default_factory=list) - generated: Optional[List[ArkPointer]] = Field(default_factory=list) +from typing import Dict, Any, Optional, List, Union def GenerateComputation( - guid: str, - name: str, - runBy: str, - command: Optional[Union[str, List[str]]], - dateCreated: str, - description: str, - keywords: List[str], - usedSoftware: List[str], - usedDataset: List[str], - generated: Optional[List[str]] = None + guid: Optional[str] = None, + name: Optional[str] = None, + **kwargs ) -> Computation: - """ Generate a Computation model class from command line arguments """ - sq = GenerateDatetimeSquid() - guid = f"ark:{NAAN}/computation-{name.lower().replace(' ', '-')}-{sq}" + """ + Generate a Computation instance with flexible parameters.
Used for GUID generation if provided. + **kwargs: Additional parameters to pass to the Computation model. + + Returns: + A validated Computation instance + """ + # Generate GUID if not provided + if not guid and name: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/computation-{name.lower().replace(' ', '-')}-{sq}" + elif not guid: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/computation-{sq}" + + computationMetadata = { + "@id": guid, + "name": name, + "@type": "https://w3id.org/EVI#Computation" + } + + for key, value in kwargs.items(): + if key in ["usedSoftware", "usedDataset", "generated"] and value: + if isinstance(value, str): + computationMetadata[key] = [{"@id": value.strip("\n")}] + elif (isinstance(value, list) or isinstance(value, tuple)) and len(value) > 0: + computationMetadata[key] = [{"@id": item.strip("\n")} for item in value] + elif value is not None: + computationMetadata[key] = value - return computation_model \ No newline at end of file + return Computation.model_validate(computationMetadata) \ No newline at end of file diff --git a/src/fairscape_cli/models/dataset.py b/src/fairscape_cli/models/dataset.py index 895530d..5ce6ad2 100644 --- a/src/fairscape_cli/models/dataset.py +++ b/src/fairscape_cli/models/dataset.py @@ -1,126 +1,68 @@ -# Standard library imports -import pathlib -from datetime import datetime -from typing import Optional, List, Union, Dict, Tuple, Set - -from pydantic import ( - BaseModel, - constr, - Field, - AnyUrl, - field_serializer -) - -from fairscape_cli.models.base import FairscapeBaseModel -from fairscape_cli.models.guid_utils import GenerateDatetimeSquid +from fairscape_models.dataset import Dataset from fairscape_cli.config import NAAN - - -class Dataset(FairscapeBaseModel): - guid: Optional[str] = Field(alias="@id", default=None) - metadataType: Optional[str] = Field(alias="@type", default="https://w3id.org/EVI#Dataset") - author: str = Field(max_length=64) - datePublished: Optional[str] = Field() - 
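The rewritten GenerateComputation above (and the analogous Generate* helpers for datasets, experiments, and instruments below) normalizes linkage kwargs such as `usedSoftware`, `usedDataset`, and `generated` into lists of `{"@id": ...}` pointers before Pydantic validation. A standalone sketch of that normalization rule (the ark identifiers here are invented examples, not real fairscape GUIDs):

```python
def normalize_refs(value):
    """Mirror the kwargs handling in GenerateComputation: a bare string or a
    non-empty list/tuple of strings becomes a list of @id pointer dicts;
    anything else passes through unchanged."""
    if isinstance(value, str):
        # Single identifier, possibly with a trailing newline from file input.
        return [{"@id": value.strip("\n")}]
    if isinstance(value, (list, tuple)) and len(value) > 0:
        return [{"@id": item.strip("\n")} for item in value]
    return value

# A newline-terminated string becomes a one-element pointer list.
assert normalize_refs("ark:99999/software-x\n") == [{"@id": "ark:99999/software-x"}]

# Lists and tuples of identifiers map element-wise.
assert normalize_refs(("ark:99999/a", "ark:99999/b")) == [
    {"@id": "ark:99999/a"},
    {"@id": "ark:99999/b"},
]
```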
version: str - description: str = Field(min_length=10) - keywords: List[str] = Field(...) - associatedPublication: Optional[str] = Field(default=None) - additionalDocumentation: Optional[str] = Field(default=None) - fileFormat: str = Field(alias="format") - dataSchema: Optional[str] = Field(alias="schema", default=None) - generatedBy: Optional[List[str]] = Field(default=[]) - derivedFrom: Optional[List[str]] = Field(default=[]) - usedBy: Optional[List[str]] = Field(default=[]) - contentUrl: Optional[str] = Field(default=None) - hasSummaryStatistics: Optional[Union[str, List[str]]] = Field(default=None) - - #@field_serializer('datePublished') - #def serialize_date_published(self, datePublished: datetime): - # return datePublished.timestamp() - - +from fairscape_cli.models.guid_utils import GenerateDatetimeSquid +from fairscape_cli.models.utils import setRelativeFilepath +import pathlib +from datetime import datetime +from typing import Dict, Any, Optional, List, Tuple, Set def GenerateDataset( - guid: Optional[str], - url: Optional[str], - author: str, - description: str, - name: str, - keywords: List[str], - datePublished: str, - version: str, - associatedPublication: Optional[str], - additionalDocumentation: Optional[str], - dataFormat: str, - schema: Optional[str], - derivedFrom: Optional[List[str]], - usedBy: Optional[List[str]], - generatedBy: Optional[List[str]], - filepath: Optional[str], - cratePath, - summary_stats_guid: Optional[str] = None - ): - - if not guid: + guid: Optional[str] = None, + name: Optional[str] = None, + filepath: Optional[str] = None, + cratePath: Optional[str] = None, + **kwargs +) -> Dataset: + """ + Generate a Dataset instance with flexible parameters. + + This function creates a Dataset instance with minimal required parameters and + allows for any additional parameters to be passed through to the Dataset model. + Validation is handled by the Dataset model itself. + + Args: + guid: Optional identifier. If not provided, one will be generated.
+ name: Optional name for the dataset. Used for GUID generation if provided. + filepath: Optional path to the dataset file. + cratePath: Optional path to the RO-Crate containing the dataset. + **kwargs: Additional parameters to pass to the Dataset model. + + Returns: + A validated Dataset instance + """ + # Generate GUID if not provided + if not guid and name: sq = GenerateDatetimeSquid() guid = f"ark:{NAAN}/dataset-{name.lower().replace(' ', '-')}-{sq}" + elif not guid: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/dataset-{sq}" datasetMetadata = { - "@id": guid, - "@type": "https://w3id.org/EVI#Dataset", - "url": url, - "author": author, - "name": name, - "description": description, - "keywords": keywords, - "datePublished": datePublished, - "version": version, - "associatedPublication": associatedPublication, - "additionalDocumentation": additionalDocumentation, - "format": dataFormat, - "schema": {"@id":schema}, - "derivedFrom": [{"@id":derived.strip("\n")} for derived in derivedFrom], - "usedBy": [{"@id":used.strip("\n")} for used in usedBy], - "generatedBy": [{"@id":gen.strip("\n")} for gen in generatedBy], - "hasSummaryStatistics": {"@id":summary_stats_guid} - } - - datasetMetadata['contentUrl'] = setRelativeFilepath(cratePath, filepath) - datasetInstance = Dataset.model_validate(datasetMetadata) - return datasetInstance - - -def setRelativeFilepath(cratePath, filePath): - ''' Modify the filepath specified in metadata s.t. 
- ''' - - if filePath is None: - return None - - # if filepath is a url - if 'http' in filePath: - return filePath - - # if a relative file uri to the crate - if 'file:///' in filePath: - # TODO: search within crate to determine file is relative to crate - # filePath = filePath.replace("file:///", "") - - return filePath - - # set relative filepath - # if filepath is a path that exists - if 'ro-crate-metadata.json' in str(cratePath): - rocratePath = pathlib.Path(cratePath).parent.absolute() - else: - rocratePath = pathlib.Path(cratePath).absolute() - - - # if relative filepath - datasetPath = pathlib.Path(filePath).absolute() - relativePath = datasetPath.relative_to(rocratePath) - return f"file:///{str(relativePath)}" - + "@id": guid, + "name": name, + "@type": "https://w3id.org/EVI#Dataset" + } + + if filepath and cratePath: + datasetMetadata['contentUrl'] = setRelativeFilepath(cratePath, filepath) + elif filepath: + datasetMetadata['contentUrl'] = filepath + + for key, value in kwargs.items(): + if key in ["schema", "dataSchema"] and value: + datasetMetadata["schema"] = {"@id": value} + elif key == "hasSummaryStatistics" and value: + datasetMetadata["hasSummaryStatistics"] = {"@id": value} + elif key in ["derivedFrom", "usedBy", "generatedBy"] and value: + if isinstance(value, str): + datasetMetadata[key] = [{"@id": value.strip("\n")}] + elif (isinstance(value, list) or isinstance(value, tuple)) and len(value) > 0: + datasetMetadata[key] = [{"@id": item.strip("\n")} for item in value] + elif value is not None: + datasetMetadata[key] = value + + return Dataset.model_validate(datasetMetadata) from fairscape_cli.models.computation import GenerateComputation, Computation def generateSummaryStatsElements( @@ -178,27 +120,30 @@ def generateSummaryStatsElements( generated=[summary_stats_guid] ) - # Create summary statistics dataset - summary_stats_instance = GenerateDataset( - guid=summary_stats_guid, - url=None, - author=author, - name=f"{name} - Summary Statistics", - 
description=f"Summary statistics for dataset: {name}", - keywords=keywords, - datePublished=date_published, - version=version, - associatedPublication=associated_publication, - additionalDocumentation=additional_documentation, - dataFormat='pdf', - schema=schema, - derivedFrom=[], - generatedBy=[computation_guid], - usedBy=[], - filepath=summary_statistics_filepath, - cratePath=crate_path, - summary_stats_guid=None - ) + # Create summary statistics dataset with only non-empty fields + stats_dataset_params = { + "guid": summary_stats_guid, + "author": author, + "name": f"{name} - Summary Statistics", + "description": f"Summary statistics for dataset: {name}", + "keywords": keywords, + "datePublished": date_published, + "version": version, + "dataFormat": "pdf", + "generatedBy": [computation_guid], + "filepath": summary_statistics_filepath, + "cratePath": crate_path + } + + # Add optional fields only if they have values + if associated_publication: + stats_dataset_params["associatedPublication"] = associated_publication + if additional_documentation: + stats_dataset_params["additionalDocumentation"] = additional_documentation + if schema: + stats_dataset_params["schema"] = schema + + summary_stats_instance = GenerateDataset(**stats_dataset_params) return summary_stats_guid, summary_stats_instance, computation_instance @@ -212,24 +157,24 @@ def registerOutputs( output_instances = [] for file_path in new_files: file_path_str = str(file_path) - output_instance = GenerateDataset( - guid=None, - name=f"Statistics Output - {file_path.name}", - author=author, # Use the original author - description=f"Statistical analysis output for {dataset_id}", - keywords=["statistics"], - datePublished=datetime.now().isoformat(), - version="1.0", - dataFormat=file_path.suffix[1:], - filepath=file_path_str, - cratePath=str(file_path.parent), - url=None, - associatedPublication=None, - additionalDocumentation=None, - schema=None, - derivedFrom=[], - usedBy=[], - 
generatedBy=[computation_id] - ) + + # Create dataset with only non-empty fields + output_params = { + "guid": None, + "name": f"Statistics Output - {file_path.name}", + "author": author, + "description": f"Statistical analysis output for {dataset_id}", + "keywords": ["statistics"], + "datePublished": datetime.now().isoformat(), + "version": "1.0", + "dataFormat": file_path.suffix[1:] if file_path.suffix else "unknown", + "filepath": file_path_str, + "cratePath": str(file_path.parent) + } + + if computation_id: + output_params["generatedBy"] = [computation_id] + + output_instance = GenerateDataset(**output_params) output_instances.append(output_instance) return output_instances \ No newline at end of file diff --git a/src/fairscape_cli/models/experiment.py b/src/fairscape_cli/models/experiment.py new file mode 100644 index 0000000..1a9dc1e --- /dev/null +++ b/src/fairscape_cli/models/experiment.py @@ -0,0 +1,53 @@ +from fairscape_models.experiment import Experiment +from fairscape_cli.config import NAAN +from fairscape_cli.models.guid_utils import GenerateDatetimeSquid +import pathlib +from typing import Dict, Any, Optional, List, Tuple + +def GenerateExperiment( + guid: Optional[str] = None, + name: Optional[str] = None, + **kwargs +) -> Experiment: + """ + Generate an Experiment instance with flexible parameters. + + This function creates an Experiment instance with minimal required parameters and + allows for any additional parameters to be passed through to the Experiment model. + Validation is handled by the Experiment model itself. + + Args: + guid: Optional identifier. If not provided, one will be generated. + name: Optional name for the experiment. Used for GUID generation if provided. + **kwargs: Additional parameters to pass to the Experiment model. 
+ + Returns: + A validated Experiment instance + """ + # Generate GUID if not provided + if not guid and name: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/experiment-{name.lower().replace(' ', '-')}-{sq}" + elif not guid: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/experiment-{sq}" + + experimentMetadata = { + "@id": guid, + "name": name, + "@type": "https://w3id.org/EVI#Experiment" + } + + for key, value in kwargs.items(): + if key in ["usedInstrument", "usedSample", "generated","usedStain","usedTreatment"] and value: + if isinstance(value, str): + experimentMetadata[key] = [{"@id": value.strip("\n")}] + elif (isinstance(value, list) or isinstance(value, tuple)) and len(value) > 0: + if isinstance(value[0], str): + experimentMetadata[key] = [{"@id": item.strip("\n")} for item in value] + else: + experimentMetadata[key] = [item for item in value] + elif value is not None: + experimentMetadata[key] = value + + return Experiment.model_validate(experimentMetadata) \ No newline at end of file diff --git a/src/fairscape_cli/models/instrument.py b/src/fairscape_cli/models/instrument.py new file mode 100644 index 0000000..259b3b7 --- /dev/null +++ b/src/fairscape_cli/models/instrument.py @@ -0,0 +1,60 @@ +from fairscape_models.instrument import Instrument +from fairscape_cli.config import NAAN +from fairscape_cli.models.guid_utils import GenerateDatetimeSquid +from fairscape_cli.models.utils import setRelativeFilepath +import pathlib +from typing import Dict, Any, Optional, List, Tuple + +def GenerateInstrument( + guid: Optional[str] = None, + name: Optional[str] = None, + filepath: Optional[str] = None, + cratePath: Optional[str] = None, + **kwargs +) -> Instrument: + """ + Generate an Instrument instance with flexible parameters. + + This function creates an Instrument instance with minimal required parameters and + allows for any additional parameters to be passed through to the Instrument model. + Validation is handled by the Instrument model itself. 
+ + Args: + guid: Optional identifier. If not provided, one will be generated. + name: Optional name for the instrument. Used for GUID generation if provided. + filepath: Optional path to the instrument documentation file. + cratePath: Optional path to the RO-Crate containing the instrument. + **kwargs: Additional parameters to pass to the Instrument model. + + Returns: + A validated Instrument instance + """ + # Generate GUID if not provided + if not guid and name: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/instrument-{name.lower().replace(' ', '-')}-{sq}" + elif not guid: + sq = GenerateDatetimeSquid() + guid = f"ark:{NAAN}/instrument-{sq}" + + instrumentMetadata = { + "@id": guid, + "name": name, + "@type": "https://w3id.org/EVI#Instrument" + } + + if filepath and cratePath: + instrumentMetadata['contentUrl'] = setRelativeFilepath(cratePath, filepath) + elif filepath: + instrumentMetadata['contentUrl'] = filepath + + for key, value in kwargs.items(): + if key in ["usedByExperiment"] and value: + if isinstance(value, str): + instrumentMetadata[key] = [{"@id": value.strip("\n")}] + elif (isinstance(value, list) or isinstance(value, tuple)) and len(value) > 0: + instrumentMetadata[key] = [{"@id": item.strip("\n")} for item in value] + elif value is not None: + instrumentMetadata[key] = value + + return Instrument.model_validate(instrumentMetadata) diff --git a/src/fairscape_cli/models/pep.py b/src/fairscape_cli/models/pep.py new file mode 100644 index 0000000..140c2b2 --- /dev/null +++ b/src/fairscape_cli/models/pep.py @@ -0,0 +1,250 @@ +import os +import pathlib +import yaml +import pandas as pd +from typing import List, Dict, Optional, Union, Any +from datetime import datetime + +from fairscape_cli.models import ( + GenerateROCrate, + GenerateDataset, + AppendCrate, + CopyToROCrate +) +from fairscape_cli.models.guid_utils import GenerateDatetimeSquid +from fairscape_cli.models.schema.tabular import TabularValidationSchema +from fairscape_cli.config 
import NAAN + + +class PEPtoROCrateMapper: + + def __init__(self, pep_path: Union[str, pathlib.Path]): + self.pep_path = pathlib.Path(pep_path) + + if self.pep_path.is_dir(): + yaml_files = list(self.pep_path.glob("*.yaml")) + list(self.pep_path.glob("*.yml")) + + if not yaml_files: + raise FileNotFoundError(f"No YAML files found in {self.pep_path}") + + config_files = [f for f in yaml_files if "config" in f.name.lower()] + + if config_files: + self.config_path = config_files[0] + else: + self.config_path = yaml_files[0] + self.config = self._load_yaml(self.config_path) + else: + raise FileNotFoundError(f"PEP path {self.pep_path} is not a directory") + + if "pep_version" not in self.config: + raise ValueError("Invalid PEP configuration: missing pep_version") + + def _load_yaml(self, path: pathlib.Path) -> Dict: + with open(path, "r") as f: + return yaml.safe_load(f) + + def _resolve_path(self, path_str: str) -> pathlib.Path: + path = pathlib.Path(path_str) + if path.is_absolute(): + return path + return self.pep_path / path + + def _extract_metadata_from_pep(self) -> Dict[str, Any]: + metadata = {} + + if "name" in self.config: + metadata["name"] = self.config["name"] + elif "project_name" in self.config: + metadata["name"] = self.config["project_name"] + + if 'description' not in metadata and "description" in self.config: + metadata["description"] = self.config["description"] + + if "experiment_metadata" in self.config: + exp_meta = self.config["experiment_metadata"] + + if "series_title" in exp_meta: + metadata["name"] = exp_meta["series_title"] + + if "series_summary" in exp_meta: + metadata["description"] = exp_meta["series_summary"] + + if "series_contributor" in exp_meta and 'author' not in metadata: + metadata["author"] = exp_meta["series_contributor"] + + if "author" not in metadata and "series_contact_name" in exp_meta: + metadata["author"] = exp_meta["series_contact_name"] + + if "series_submission_date" in exp_meta: + metadata["datePublished"] =
exp_meta["series_submission_date"] + elif "series_last_update_date" in exp_meta: + metadata["datePublished"] = exp_meta["series_last_update_date"] + + return metadata + + def create_rocrate(self, + output_path: Optional[Union[str, pathlib.Path]] = None, + name: Optional[str] = None, + description: Optional[str] = None, + author: Optional[str] = None, + organization_name: Optional[str] = None, + project_name: Optional[str] = None, + keywords: Optional[List[str]] = None, + license: Optional[str] = None, + date_published: Optional[str] = None, + version: str = "1.0") -> str: + if output_path is None: + output_path = self.pep_path + else: + output_path = pathlib.Path(output_path) + output_path.mkdir(parents=True, exist_ok=True) + + pep_metadata = self._extract_metadata_from_pep() + + final_metadata = { + "name": name or pep_metadata.get("name"), + "description": description or pep_metadata.get("description"), + "author": author or pep_metadata.get("author"), + "keywords": keywords or pep_metadata.get("keywords", []), + "datePublished": date_published or pep_metadata.get("datePublished", datetime.now().isoformat()), + "license": license or "https://creativecommons.org/licenses/by/4.0/", + "version": version + } + + required_fields = ["name", "description", "author"] + missing_fields = [field for field in required_fields if not final_metadata.get(field)] + + if missing_fields: + raise ValueError( + f"Missing required metadata: {', '.join(missing_fields)}. " + "Please provide these values as arguments or ensure they are in the PEP config." 
+ ) + + if not final_metadata["keywords"]: + final_metadata["keywords"] = ["pep", final_metadata["name"]] + + crate = GenerateROCrate( + path=output_path, + guid="", + name=final_metadata["name"], + description=final_metadata["description"], + keywords=final_metadata["keywords"], + organizationName=organization_name, + projectName=project_name or self.config.get("name"), + license=final_metadata["license"], + datePublished=final_metadata["datePublished"] + ) + + rocrate_id = crate["@id"] + + if "sample_table" in self.config: + self._add_sample_path_to_rocrate(output_path, rocrate_id, final_metadata) + + if "subsample_table" in self.config: + self._add_subsample_paths_to_rocrate(output_path, rocrate_id, final_metadata) + + return rocrate_id + + def _add_sample_path_to_rocrate(self, + output_path: pathlib.Path, + rocrate_id: str, + metadata: Dict[str, Any]) -> None: + source_path = self._resolve_path(self.config["sample_table"]) + + rel_path = os.path.basename(source_path) + + dataset_name = f"Samples Data: {rel_path}" + sq_dataset = GenerateDatetimeSquid() + dataset_guid = f"ark:{NAAN}/dataset-samples-{sq_dataset}" + + schema = None + try: + schema = TabularValidationSchema.infer_from_file( + str(source_path), + f"Schema for {dataset_name}", + f"Automatically inferred schema for {dataset_name}" + ) + AppendCrate(output_path, [schema]) + except Exception as e: + print(f"Warning: Could not infer schema for {source_path}: {str(e)}") + + dataset = GenerateDataset( + guid=dataset_guid, + name=dataset_name, + description=f"Sample table from PEP project: {metadata['name']}", + author=metadata["author"], + keywords=metadata["keywords"], + datePublished=metadata["datePublished"], + version=metadata.get("version", "1.0"), + dataFormat="csv", + filepath=str(source_path), + cratePath=output_path, + url="", + associatedPublication="", + additionalDocumentation="", + schema=schema.guid if schema else "", + derivedFrom=[], + usedBy=[], + generatedBy=[] + ) + + 
AppendCrate(output_path, [dataset]) + + def _add_subsample_paths_to_rocrate(self, + output_path: pathlib.Path, + rocrate_id: str, + metadata: Dict[str, Any]) -> None: + subsample_tables = self.config["subsample_table"] + + if isinstance(subsample_tables, list): + for index, table_path in enumerate(subsample_tables): + self._register_subsample_path(output_path, rocrate_id, metadata, table_path, index) + else: + self._register_subsample_path(output_path, rocrate_id, metadata, subsample_tables, 0) + + def _register_subsample_path(self, + output_path: pathlib.Path, + rocrate_id: str, + metadata: Dict[str, Any], + table_path: str, + index: int) -> None: + source_path = self._resolve_path(table_path) + + rel_path = os.path.basename(source_path) + + dataset_name = f"Subsamples Data {index+1}: {rel_path}" + sq_dataset = GenerateDatetimeSquid() + dataset_guid = f"ark:{NAAN}/dataset-subsamples-{index}-{sq_dataset}" + + schema = None + try: + schema = TabularValidationSchema.infer_from_file( + str(source_path), + f"Schema for {dataset_name}", + f"Automatically inferred schema for {dataset_name}" + ) + AppendCrate(output_path, [schema]) + except Exception as e: + print(f"Warning: Could not infer schema for {source_path}: {str(e)}") + + dataset = GenerateDataset( + guid=dataset_guid, + name=dataset_name, + description=f"Subsample table from PEP project: {metadata['name']}", + author=metadata["author"], + keywords=metadata["keywords"], + datePublished=metadata["datePublished"], + version=metadata.get("version", "1.0"), + dataFormat="csv", + filepath=str(source_path), + cratePath=output_path, + url="", + associatedPublication="", + additionalDocumentation="", + schema=schema.guid if schema else "", + derivedFrom=[], + usedBy=[], + generatedBy=[] + ) + + AppendCrate(output_path, [dataset]) \ No newline at end of file diff --git a/src/fairscape_cli/models/rocrate.py b/src/fairscape_cli/models/rocrate.py index a3622e9..be66f7d 100644 --- a/src/fairscape_cli/models/rocrate.py +++ b/src/fairscape_cli/models/rocrate.py @@
-11,134 +11,73 @@ from fairscape_cli.models.computation import Computation from fairscape_cli.models.guid_utils import GenerateDatetimeSquid -class ROCrateMetadataDescriptor(BaseModel): - model_config = ConfigDict(populate_by_name=True) - - id: str = Field(default="ro-crate-metadata.json", alias="@id") - type: Literal["CreativeWork"] = Field(alias="@type") - conformsTo: Dict = Field(default={ - "@id": "https://w3id.org/ro/crate/1.2-DRAFT" - }) - about: Dict[str, str] +from datetime import datetime +import pathlib +import json +from typing import List, Optional, Dict, Any -class ROCrateMetadata(BaseModel): - model_config = ConfigDict( - populate_by_name=True, - extra='forbid' - ) - - context: Dict[str, str] = Field( - default={ - "EVI": "https://w3id.org/EVI#", - "@vocab": "https://schema.org/" - }, - alias="@context" - ) - graph: List[Dict] = Field(alias="@graph") - - @model_validator(mode='after') - def validate_metadata(self) -> 'ROCrateMetadata': - self.validate_metadata_descriptor() - self.validate_graph_elements() - return self - - def validate_metadata_descriptor(self): - # Check for metadata descriptor - descriptors = [item for item in self.graph - if item.get("@id") == "ro-crate-metadata.json"] - if not descriptors: - raise ValueError("Missing required metadata descriptor in @graph") - - descriptor = descriptors[0] - # Validate descriptor - ROCrateMetadataDescriptor(**descriptor) - - # Validate about reference exists in graph - about_id = descriptor.get("about", {}).get("@id") - if not about_id: - raise ValueError("Metadata descriptor missing root node in about.@id") - - # Check root exists - root_items = [item for item in self.graph if item.get("@id") == about_id] - if not root_items: - raise ValueError(f"Root id {about_id} referenced in about.@id not found in @graph") - - def validate_graph_elements(self): - """Validate each element in @graph is flat and has an id""" - for item in self.graph: - if "@id" not in item or "@type" not in item: - raise 
ValueError("All @graph elements must have @id and @type properties")
-
-    # Validate nested objects only contain @id
-    for key, value in item.items():
-        if isinstance(value, dict):
-            allowed_keys = {"@id"}
-            if set(value.keys()) - allowed_keys:
-                raise ValueError(f"Nested object under '{key}' can only contain '@id' property")
+from fairscape_cli.config import NAAN, DEFAULT_CONTEXT
+from fairscape_cli.models.guid_utils import GenerateDatetimeSquid
+from fairscape_models.rocrate import ROCrateV1_2, ROCrateMetadataElem, ROCrateMetadataFileElem
 
 def GenerateROCrate(
     path: pathlib.Path,
     guid: str,
     name: str,
-    description: str,
-    keywords: List[str],
-    organizationName: str = None,
-    projectName: str = None,
-    license: str = "https://creativecommons.org/licenses/by/4.0/",
-    datePublished: str = None,
+    **kwargs
 ):
-    # Generate GUID if not provided
-    sq = GenerateDatetimeSquid()
-    guid = f"ark:{NAAN}/rocrate-{name.lower().replace(' ', '-')}-{sq}/"
-
-    if datePublished is None:
-        datePublished = datetime.now().isoformat()
+    if not guid:
+        sq = GenerateDatetimeSquid()
+        guid = f"ark:{NAAN}/rocrate-{name.lower().replace(' ', '-')}-{sq}/"
+
+    metadata_descriptor = ROCrateMetadataFileElem.model_validate({
+        "@id": "ro-crate-metadata.json",
+        "@type": "CreativeWork",
+        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2-DRAFT"},
+        "about": {"@id": guid}
+    })
 
-    # Create root dataset entity
-    root_dataset = {
+    root_metadata = {
         "@id": guid,
         "@type": ["Dataset", "https://w3id.org/EVI#ROCrate"],
         "name": name,
-        "keywords": keywords,
-        "description": description,
-        "license": license,
-        "datePublished": datePublished,
-        "hasPart": [],
-        "isPartOf": []
+        "hasPart": []
     }
-
-    if organizationName:
-        organization_guid = f"ark:{NAAN}/organization-{organizationName.lower().replace(' ', '-')}-{GenerateDatetimeSquid()}"
-        root_dataset['isPartOf'] = [{
-            "@id": organization_guid
-        }]
-
-    if projectName:
-        project_guid = f"ark:{NAAN}/project-{projectName.lower().replace(' ', '-')}-{GenerateDatetimeSquid()}"
-        root_dataset['isPartOf'].append({
-            "@id": project_guid
-        })
-
-    metadata_descriptor = {
-        "@id": "ro-crate-metadata.json",
-        "@type": "CreativeWork",
-        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2-DRAFT"},
-        "about": {"@id": guid}
-    }
-
-    # Create full RO-Crate structure
-    rocrate_metadata = {
-        "@context": DEFAULT_CONTEXT,
-        "@graph": [
-            metadata_descriptor,
+
+    if "organizationName" in kwargs:
+        organization_guid = f"ark:{NAAN}/organization-{kwargs['organizationName'].lower().replace(' ', '-')}-{GenerateDatetimeSquid()}"
+        root_metadata["isPartOf"] = [{"@id": organization_guid}]
+        del kwargs["organizationName"]
+
+    if "projectName" in kwargs:
+        project_guid = f"ark:{NAAN}/project-{kwargs['projectName'].lower().replace(' ', '-')}-{GenerateDatetimeSquid()}"
+        if "isPartOf" not in root_metadata:
+            root_metadata["isPartOf"] = []
+        root_metadata["isPartOf"].append({"@id": project_guid})
+        del kwargs["projectName"]
+
+    if "license" in kwargs:
+        root_metadata["license"] = kwargs["license"]
+        del kwargs["license"]
+    else:
+        root_metadata["license"] = "https://creativecommons.org/licenses/by/4.0/"
+
+    for key, value in kwargs.items():
+        if value is not None:
+            root_metadata[key] = value
+
+    root_dataset = ROCrateMetadataElem(**root_metadata)
+
+    rocrate = ROCrateV1_2(**{
+        "@context": DEFAULT_CONTEXT,
+        "@graph": [
+            metadata_descriptor,
             root_dataset
-        ]
-    }
+        ]}
+    )
 
-    # Validate the structure
-    ROCrateMetadata(**rocrate_metadata)
+    rocrate_dict = rocrate.model_dump(by_alias=True, exclude_none=True)
 
-    # Write to file
     if 'ro-crate-metadata.json' in str(path):
         roCrateMetadataPath = path
         if not path.parent.exists():
@@ -149,16 +88,19 @@ def GenerateROCrate(
         path.mkdir(parents=True, exist_ok=True)
 
     with roCrateMetadataPath.open(mode="w") as metadataFile:
-        json.dump(rocrate_metadata, metadataFile, indent=2)
-
-    return rocrate_metadata["@graph"][1]
+        json.dump(rocrate_dict, metadataFile, indent=2)
 
-class ROCrate(BaseModel):
+    return root_dataset.model_dump(by_alias=True, exclude_none=True)
+
+class ROCrate(ROCrateMetadataElem):
     model_config = ConfigDict(populate_by_name=True)
     guid: Optional[str] = Field(alias="@id", default=None)
     name: str = Field(max_length=200)
     description: str = Field(min_length=5)
+    author: Optional[str] = None
+    datePublished: Optional[datetime] = None
+    license: Optional[str] = None
+    version: Optional[str] = None
     keywords: List[str]
     projectName: Optional[str] = None
     organizationName: Optional[str] = None
@@ -181,59 +123,52 @@ def create_subcrate(
     keywords: List[str],
     organization_name: Optional[str] = None,
     project_name: Optional[str] = None,
-    guid: Optional[str] = None
+    guid: Optional[str] = None,
+    author: Optional[str] = None,
+    version: Optional[str] = None,
+    license: Optional[str] = None
 ) -> str:
-    """Create a new subcrate within this RO-Crate.
-
-    Args:
-        subcrate_path: Relative path within this crate where subcrate should be created
-        name: Name of the subcrate
-        description: Description of the subcrate
-        keywords: List of keywords for the subcrate
-        organization_name: Optional organization name
-        project_name: Optional project name
-        guid: Optional GUID for the subcrate
-
-    Returns:
-        str: The GUID of the created subcrate
-
-    Raises:
-        Exception: If there are errors creating or linking the subcrate
-    """
-    # Get parent (this crate's) metadata
     parent_metadata_path = self.path / 'ro-crate-metadata.json'
     with parent_metadata_path.open('r') as f:
         parent_metadata = json.load(f)
     parent_id = parent_metadata['@graph'][1]['@id']
 
-    # Create full path for subcrate
+    if author is None:
+        author = getattr(self, 'author', "Unknown")
+    if version is None:
+        version = getattr(self, 'version', "1.0")
+    if license is None:
+        license = getattr(self, 'license', "https://creativecommons.org/licenses/by/4.0/")
+
     full_subcrate_path = self.path / subcrate_path
 
-    # Create subcrate
     subcrate = GenerateROCrate(
         path=full_subcrate_path,
         guid=guid,
         name=name,
        description=description,
         keywords=keywords,
+        author=author,
+        version=version,
+        license=license,
         organizationName=organization_name,
-        projectName=project_name
+        projectName=project_name,
+        datePublished=datetime.now().isoformat(),
+        isPartOf=[{"@id": parent_id}],
+        hasPart=[]
     )
 
-    # Update subcrate to reference parent
     subcrate_metadata_path = full_subcrate_path / 'ro-crate-metadata.json'
     with subcrate_metadata_path.open('r+') as f:
         subcrate_metadata = json.load(f)
         root_dataset = subcrate_metadata['@graph'][1]
-        # Add isPartOf reference to parent
         root_dataset['isPartOf'] = [{"@id": parent_id}]
         f.seek(0)
         f.truncate()
         json.dump(subcrate_metadata, f, indent=2)
 
-    # Update parent crate with subcrate reference
     with parent_metadata_path.open('r+') as f:
         parent_metadata = json.load(f)
         root_dataset = parent_metadata['@graph'][1]
@@ -241,28 +176,40 @@ def create_subcrate(
         if 'hasPart' not in root_dataset:
             root_dataset['hasPart'] = []
 
-        # Create subcrate reference with metadata
         subcrate_ref = {
             "@id": subcrate['@id'],
             "@type": ["Dataset", "https://w3id.org/EVI#ROCrate"],
             "name": name,
             "description": description,
             "keywords": keywords,
-            "contentUrl": f"file:///{str(subcrate_path / 'ro-crate-metadata.json')}"
+            "author": author,
+            "version": version,
+            "license": license,
+            "isPartOf": [{"@id": parent_id}],
+            "hasPart": [],
+            "contentUrl": f"file:///{str(subcrate_path / 'ro-crate-metadata.json')}",
+            "datePublished": datetime.now().isoformat()
         }
 
-        # Add subcrate reference to parent's graph
         parent_metadata['@graph'].append(subcrate_ref)
 
-        # Add reference to hasPart
         if not any(part.get('@id') == subcrate['@id'] for part in root_dataset['hasPart']):
             root_dataset['hasPart'].append({"@id": subcrate['@id']})
 
-        # Validate and save
-        ROCrateMetadata(**parent_metadata)
+        if 'version' not in root_dataset:
+            root_dataset['version'] = getattr(self, 'version', "1.0")
+        if 'author' not in root_dataset:
+            root_dataset['author'] = getattr(self, 'author', "Unknown")
+        if 'license' not in root_dataset:
+            root_dataset['license'] = getattr(self, 'license', "https://creativecommons.org/licenses/by/4.0/")
+        if 'isPartOf' not in root_dataset:
+            root_dataset['isPartOf'] = []
+
+        rocrate = ROCrateV1_2.model_validate(parent_metadata)
+
         f.seek(0)
         f.truncate()
-        json.dump(parent_metadata, f, indent=2)
+        json.dump(rocrate.model_dump(by_alias=True), f, indent=2)
 
     return subcrate['@id']
@@ -314,7 +261,7 @@ def initCrate(self):
         }
 
         # Validate the structure
-        ROCrateMetadata(**rocrate_metadata)
+        ROCrateV1_2(**rocrate_metadata)
 
         # Write to file
         with ro_crate_metadata_path.open(mode="w") as metadata_file:
@@ -338,7 +285,7 @@ def registerObject(self, model: Union[Dataset, Software, Computation]):
         root_dataset['hasPart'].append({"@id": model_data["@id"]})
 
         # Validate updated structure
-        ROCrateMetadata(**rocrate_metadata)
+        ROCrateV1_2(**rocrate_metadata)
 
         # Write back to file
         rocrate_metadata_file.seek(0)
@@ -363,8 +310,7 @@ def ReadROCrateMetadata(cratePath: pathlib.Path) -> Dict[str, Any]:
     with metadata_path.open("r") as metadata_file:
         crate_metadata = json.load(metadata_file)
 
-    # Validate the structure
-    ROCrateMetadata(**crate_metadata)
+    validated_graph = ROCrateV1_2.validate_metadata_graph(crate_metadata)
     return crate_metadata
 
 def AppendCrate(
@@ -380,7 +326,7 @@ def AppendCrate(
     with cratePath.open("r+") as rocrate_metadata_file:
         rocrate_metadata = json.load(rocrate_metadata_file)
 
-        # Add elements to @graph and references to root dataset
+
         root_dataset = rocrate_metadata['@graph'][1]  # Second element after descriptor
         if 'hasPart' not in root_dataset:
             root_dataset['hasPart'] = []
@@ -391,7 +337,7 @@ def AppendCrate(
             root_dataset['hasPart'].append({"@id": element_data["@id"]})
 
         # Validate updated structure
-        ROCrateMetadata(**rocrate_metadata)
+        ROCrateV1_2(**rocrate_metadata)
 
         # Write back to file
         rocrate_metadata_file.seek(0)
@@ -439,9 +385,190 @@ def UpdateCrate(
             break
 
         # Validate updated structure
-        ROCrateMetadata(**rocrate_metadata)
+        ROCrateV1_2(**rocrate_metadata)
 
         # Write back to file
         rocrate_metadata_file.seek(0)
         rocrate_metadata_file.truncate()
-        json.dump(rocrate_metadata, rocrate_metadata_file, indent=2)
\ No newline at end of file
+        json.dump(rocrate_metadata, rocrate_metadata_file, indent=2)
+
+def LinkSubcrates(parent_crate_path: pathlib.Path) -> List[str]:
+    parent_metadata_file = parent_crate_path / 'ro-crate-metadata.json'
+    if not parent_metadata_file.is_file():
+        raise FileNotFoundError(f"Parent metadata file not found: {parent_metadata_file}")
+
+    # Always load as JSON
+    with parent_metadata_file.open('r') as f:
+        parent_metadata = json.load(f)
+
+    # Find parent root dataset
+    parent_root_id = None
+    parent_root_dataset = None
+
+    # First find the ID from the metadata descriptor
+    for item in parent_metadata.get('@graph', []):
+        if item.get('@id') == 'ro-crate-metadata.json' and 'about' in item:
+            parent_root_id = item['about'].get('@id')
+            break
+
+    # Then find the root dataset with that ID
+    if parent_root_id:
+        for item in parent_metadata.get('@graph', []):
+            if item.get('@id') == parent_root_id:
+                parent_root_dataset = item
+                break
+
+    if not parent_root_dataset:
+        raise ValueError("Could not determine the root dataset of the parent RO-Crate")
+
+    # Fields that can be propagated from parent to subcrates
+    transferable_fields = [
+        "publisher", "principalInvestigator", "copyrightNotice",
+        "conditionsOfAccess", "contactEmail", "confidentialityLevel",
+        "citation", "funder", "usageInfo", "contentSize", "additionalProperty"
+    ]
+
+    # Collect transferable data, checking for empty values
+    transferable_data = {}
+    for field in transferable_fields:
+        if field in parent_root_dataset:
+            value = parent_root_dataset[field]
+            # Skip empty values
+            if value is None or (isinstance(value, str) and value.strip() == "") or (isinstance(value, list) and len(value) == 0):
+                continue
+            transferable_data[field] = value
+
+    sub_crate_references = []
+    linked_sub_crate_ids = []
+
+    # Find all subcrates
+    for dir_item in parent_crate_path.iterdir():
+        if dir_item.is_dir():
+            subcrate_metadata_file = dir_item / 'ro-crate-metadata.json'
+            if subcrate_metadata_file.is_file():
+                # Always load as JSON
+                with subcrate_metadata_file.open('r') as f:
+                    subcrate_metadata = json.load(f)
+
+                # Find subcrate root element
+                subcrate_root_id = None
+
+                # First find the ID from the metadata descriptor
+                for item in subcrate_metadata.get('@graph', []):
+                    if item.get('@id') == 'ro-crate-metadata.json' and 'about' in item:
+                        subcrate_root_id = item['about'].get('@id')
+                        break
+
+                # Find the root dataset with that ID
+                subcrate_root = None
+                if subcrate_root_id:
+                    for index, item in enumerate(subcrate_metadata.get('@graph', [])):
+                        if item.get('@id') == subcrate_root_id:
+                            subcrate_root = item
+                            subcrate_root_index = index
+                            break
+
+                if not subcrate_root:
+                    continue
+
+                # Apply transferable fields to subcrate root if they don't exist or are empty
+                modified = False
+                for field, value in transferable_data.items():
+                    if field not in subcrate_root or subcrate_root[field] is None or \
+                       (isinstance(subcrate_root[field], str) and subcrate_root[field].strip() == "") or \
+                       (isinstance(subcrate_root[field], list) and len(subcrate_root[field]) == 0):
+                        subcrate_root[field] = value
+                        modified = True
+
+                # Save changes to subcrate if modified
+                if modified:
+                    subcrate_metadata['@graph'][subcrate_root_index] = subcrate_root
+                    with subcrate_metadata_file.open('w') as f:
+                        json.dump(subcrate_metadata, f, indent=2)
+
+                # Create reference for parent crate
+                reference_dict = dict(subcrate_root)
+                relative_path = (dir_item.relative_to(parent_crate_path) / 'ro-crate-metadata.json').as_posix()
+                reference_dict['ro-crate-metadata'] = relative_path
+
+                sub_crate_references.append(reference_dict)
+                linked_sub_crate_ids.append(subcrate_root_id)
+
+    # Update parent crate with references to subcrates
+    if sub_crate_references:
+        parent_root_dataset.setdefault('hasPart', [])
+        existing_haspart_ids = {part.get('@id') for part in parent_root_dataset['hasPart']}
+
+        # Add new references to graph
+        existing_graph_ids = {item.get('@id') for item in parent_metadata['@graph']}
+        for ref in sub_crate_references:
+            if ref['@id'] not in existing_graph_ids:
+                parent_metadata['@graph'].append(ref)
+
+        # Add new hasPart relations
+        for sub_id in linked_sub_crate_ids:
+            if sub_id not in existing_haspart_ids:
+                parent_root_dataset['hasPart'].append({'@id': sub_id})
+
+        # Write back to parent metadata file
+        with parent_metadata_file.open('w') as f:
+            json.dump(parent_metadata, f, indent=2)
+    else:
+        print("No valid sub-crates found to link.")
+
+    return linked_sub_crate_ids
+
+def collect_subcrate_metadata(parent_crate_path: pathlib.Path) -> dict:
+    """
+    Collects author and keyword metadata from all subcrates in the parent crate.
+    Returns a dictionary with 'authors' (list of unique authors) and 'keywords' (list of unique keywords).
+    """
+    parent_crate_path = pathlib.Path(parent_crate_path)
+    authors = set()
+    keywords = set()
+    processed_files = set()
+
+    def process_directory(directory):
+        for path in directory.glob('**/ro-crate-metadata.json'):
+            if path.is_file() and str(path) not in processed_files:
+                processed_files.add(str(path))
+
+                try:
+                    subcrate_metadata = ReadROCrateMetadata(path)
+                    root_dataset = None
+                    if '@graph' in subcrate_metadata and len(subcrate_metadata['@graph']) > 1:
+                        root_dataset = subcrate_metadata['@graph'][1]
+
+                    if root_dataset:
+                        # ReadROCrateMetadata returns plain dicts, so access fields with .get()
+                        author_value = root_dataset.get('author')
+                        if author_value:
+                            if isinstance(author_value, str):
+                                for author in [a.strip() for a in author_value.split(',')]:
+                                    if author:
+                                        authors.add(author)
+                            elif isinstance(author_value, (tuple, list)):
+                                for author in author_value:
+                                    if isinstance(author, str):
+                                        authors.add(author)
+
+                        keyword_values = root_dataset.get('keywords')
+                        if keyword_values:
+                            if isinstance(keyword_values, (list, tuple)):
+                                for keyword in keyword_values:
+                                    if keyword:
+                                        keywords.add(keyword)
+                            elif isinstance(keyword_values, str):
+                                for keyword in [k.strip() for k in keyword_values.split(',')]:
+                                    if keyword:
+                                        keywords.add(keyword)
+                except Exception as e:
+                    print(f"Error reading subcrate metadata {path}: {e}")
+                    continue
+
+    for dir_item in parent_crate_path.iterdir():
+        if dir_item.is_dir():
+            process_directory(dir_item)
+
+    return {
+        'authors': sorted(list(authors)),
+        'keywords': sorted(list(keywords))
+    }
\ No newline at end of file
diff --git a/src/fairscape_cli/models/sample.py b/src/fairscape_cli/models/sample.py
new file mode 100644
index 0000000..488b221
--- /dev/null
+++ b/src/fairscape_cli/models/sample.py
@@ -0,0 +1,65 @@
+from fairscape_models.sample import Sample
+from fairscape_cli.config import NAAN
+from fairscape_cli.models.guid_utils import GenerateDatetimeSquid
+from fairscape_cli.models.utils import setRelativeFilepath
+import pathlib
+from typing import Dict, Any, Optional, List, Tuple
+
+def GenerateSample(
+    guid: Optional[str] = None,
+    name: Optional[str] = None,
+    filepath: Optional[str] = None,
+    cratePath: Optional[str] = None,
+    **kwargs
+) -> Sample:
+    """
+    Generate a Sample instance with flexible parameters.
+
+    This function creates a Sample instance with minimal required parameters and
+    allows for any additional parameters to be passed through to the Sample model.
+    Validation is handled by the Sample model itself.
+
+    Args:
+        guid: Optional identifier. If not provided, one will be generated.
+        name: Optional name for the sample. Used for GUID generation if provided.
+        filepath: Optional path to the sample file.
+        cratePath: Optional path to the RO-Crate containing the sample.
+        **kwargs: Additional parameters to pass to the Sample model.
+
+    Returns:
+        A validated Sample instance
+    """
+    # Generate GUID if not provided
+    if not guid and name:
+        sq = GenerateDatetimeSquid()
+        guid = f"ark:{NAAN}/sample-{name.lower().replace(' ', '-')}-{sq}"
+    elif not guid:
+        sq = GenerateDatetimeSquid()
+        guid = f"ark:{NAAN}/sample-{sq}"
+
+    sampleMetadata = {
+        "@id": guid,
+        "name": name,
+        "@type": "https://w3id.org/EVI#Sample"
+    }
+
+    if filepath and cratePath:
+        sampleMetadata['contentUrl'] = setRelativeFilepath(cratePath, filepath)
+    elif filepath:
+        sampleMetadata['contentUrl'] = filepath
+
+    for key, value in kwargs.items():
+        if key == "cellLineReference" and value:
+            # String and object values are both stored as an identifier reference
+            sampleMetadata["cellLineReference"] = {"@id": value}
+        elif key == "generatedBy" and value:
+            if isinstance(value, str):
+                sampleMetadata[key] = [{"@id": value.strip("\n")}]
+            elif isinstance(value, (list, tuple)) and len(value) > 0:
+                sampleMetadata[key] = [{"@id": item.strip("\n")} for item in value]
+        elif value is not None:
+            sampleMetadata[key] = value
+
+    return Sample.model_validate(sampleMetadata)
\ No newline at end of file
diff --git a/src/fairscape_cli/models/software.py b/src/fairscape_cli/models/software.py
index 453c350..53ff609 100644
--- a/src/fairscape_cli/models/software.py
+++ b/src/fairscape_cli/models/software.py
@@ -1,98 +1,62 @@
-import pathlib
-from datetime import datetime
-from typing import Optional, Union, Dict, List
-
-from pydantic import Field, AnyUrl, ConfigDict
-
+from fairscape_models.software import Software
 from fairscape_cli.config import NAAN
-from fairscape_cli.models.base import FairscapeBaseModel
 from fairscape_cli.models.guid_utils import GenerateDatetimeSquid
-
-
-class Software(FairscapeBaseModel):
-    guid: Optional[str] = Field( alias='@id', default=None)
-    metadataType: Optional[str] = Field(alias="@type", default="https://w3id.org/EVI#Software")
-    author: str = Field(min_length=4, max_length=64)
-    dateModified: str
-    version: str
-    description: str = Field(min_length=10)
-    associatedPublication: Optional[str] = Field(default=None)
-    additionalDocumentation: Optional[str] = Field(default=None)
-    fileFormat: str = Field(title="fileFormat", alias="format")
-    usedByComputation: Optional[List[str]]
-    contentUrl: Optional[str] = Field(default=None)
-
-
-def GenerateSoftware(
-    guid,
-    name,
-    author,
-    version,
-    description,
-    keywords,
-    fileFormat,
-    url,
-    dateModified,
-    filepath,
-    usedByComputation,
-    associatedPublication,
-    additionalDocumentation,
-    cratePath
+from typing import Dict, Any, Optional, List
+import pathlib
+from fairscape_cli.models.utils import setRelativeFilepath
+from fairscape_cli.models.utils import FileNotInCrateException
+
+def GenerateSoftware(
+    guid: Optional[str] = None,
+    name: Optional[str] = None,
+    filepath: Optional[str] = None,
+    cratePath: Optional[str] = None,
+    **kwargs
 ) -> Software:
-    """ Generate a Software Model Class """
-
-    sq = GenerateDatetimeSquid()
-    guid = f"ark:{NAAN}/software-{name.lower().replace(' ', '-')}-{sq}"
-
+    """
+    Generate a Software instance with flexible parameters.
+
+    This function creates a Software instance with minimal required parameters and
+    allows for any additional parameters to be passed through to the Software model.
+    Validation is handled by the Software model itself.
+
+    Args:
+        guid: Optional identifier. If not provided, one will be generated.
+        name: Optional name for the software. Used for GUID generation if provided.
+        filepath: Optional path to the software file.
+        cratePath: Optional path to the RO-Crate containing the software.
+        **kwargs: Additional parameters to pass to the Software model.
+
+    Returns:
+        A validated Software instance
+    """
+    if not guid and name:
+        sq = GenerateDatetimeSquid()
+        guid = f"ark:{NAAN}/software-{name.lower().replace(' ', '-')}-{sq}"
+    elif not guid:
+        sq = GenerateDatetimeSquid()
+        guid = f"ark:{NAAN}/software-{sq}"
+
     softwareMetadata = {
-        "@id": guid,
-        "@type": "https://w3id.org/EVI#Software",
-        "url": url,
-        "name": name,
-        "author": author,
-        "dateModified": dateModified,
-        "description": description,
-        "keywords": keywords,
-        "version": version,
-        "associatedPublication": associatedPublication,
-        "additionalDocumentation": additionalDocumentation,
-        "format": fileFormat,
-        # sanitize new line characters for multiple inputs
-        "usedByComputation": [
-            {"@id":computation.strip("\n")} for computation in usedByComputation
-        ],
-    }
-
-    if filepath is not None:
-
-        # if filepath is a url
-        if 'http' in filepath:
-            softwareMetadata['contentUrl'] = filepath
-
-        # if filepath is a path that exists
-        else:
-            if 'ro-crate-metadata.json' in str(cratePath):
-                rocratePath = pathlib.Path(cratePath).parent.absolute()
-            else:
-                rocratePath = pathlib.Path(cratePath).absolute()
-
-            softwarePath = pathlib.Path(filepath).absolute()
-
-            if softwarePath.exists():
-                try:
-                    relativePath = softwarePath.relative_to(rocratePath)
-                    softwareMetadata['contentUrl'] = f"file:///{str(relativePath)}"
-                except:
-                    raise FileNotInCrateException(cratePath=cratePath, filePath=softwarePath)
-            else:
-                raise Exception(f"Software File Does Not Exist: {str(softwarePath)}")
-
-
-    # validate metadata
-    softwareModel = Software.model_validate(softwareMetadata)
-
-
-    return softwareModel
-
-
+        "@id": guid,
+        "name": name,
+        "@type": "https://w3id.org/EVI#Software"
+    }
+
+    if filepath and cratePath:
+        softwareMetadata['contentUrl'] = setRelativeFilepath(cratePath, filepath)
+    elif filepath:
+        softwareMetadata['contentUrl'] = filepath
+
+    for key, value in kwargs.items():
+        if key == "usedByComputation" and value:
+            if isinstance(value, str):
+                softwareMetadata[key] = [{"@id": value.strip("\n")}]
+            elif isinstance(value, (list, tuple)) and len(value) > 0:
+                softwareMetadata[key] = [{"@id": item.strip("\n")} for item in value]
+        elif key == "fileFormat":
+            softwareMetadata["format"] = value
+        elif value is not None:
+            softwareMetadata[key] = value
+
+    return Software.model_validate(softwareMetadata)
\ No newline at end of file
diff --git a/src/fairscape_cli/models/utils.py b/src/fairscape_cli/models/utils.py
index dfe270a..6ae5172 100644
--- a/src/fairscape_cli/models/utils.py
+++ b/src/fairscape_cli/models/utils.py
@@ -1,11 +1,32 @@
 from pathlib import Path
 from typing import Set, Dict, List, Optional, Tuple
 import subprocess
+import pathlib
 
 from pydantic import ValidationError
 
 from fairscape_cli.models.base import FairscapeBaseModel
 
+def setRelativeFilepath(cratePath, filePath):
+    '''Modify the filepath specified in metadata to be relative to the crate'''
+    if filePath is None:
+        return None
+
+    if 'http' in filePath:
+        return filePath
+
+    if 'file:///' in filePath:
+        return filePath
+
+    if 'ro-crate-metadata.json' in str(cratePath):
+        rocratePath = pathlib.Path(cratePath).parent.absolute()
+    else:
+        rocratePath = pathlib.Path(cratePath).absolute()
+
+    datasetPath = pathlib.Path(filePath).absolute()
+    relativePath = datasetPath.relative_to(rocratePath)
+    return f"file:///{str(relativePath)}"
+
 def InstantiateModel(ctx, metadata: dict, modelInstance):
     try:
         modelInstance.model_validate(metadata)
diff --git a/src/fairscape_cli/publish/__init__.py b/src/fairscape_cli/publish/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/src/fairscape_cli/publish/publish_tools.py b/src/fairscape_cli/publish/publish_tools.py
new file mode 100644
index 0000000..13d759f
--- /dev/null
+++ b/src/fairscape_cli/publish/publish_tools.py
@@ -0,0 +1,464 @@
+import click
+import json
+import sys
+import requests
+from datetime import datetime
+from pathlib import Path
+import csv
+from abc import ABC, abstractmethod
+from typing import Dict, Any, Optional
+
+def _load_authors_info(authors_csv_path: Optional[str]) -> Dict[str, Dict[str, str]]:
+    authors_info = {}
+    if authors_csv_path:
+        try:
+            with open(authors_csv_path, 'r', newline='', encoding='utf-8') as f:
+                reader = csv.DictReader(f)
+                for row in reader:
+                    name = row.get('name', '').strip().lower()
+                    if name:
+                        authors_info[name] = {
+                            'affiliation': row.get('affiliation', ''),
+                            'orcid': row.get('orcid', '')
+                        }
+        except FileNotFoundError:
+            click.echo(f"Warning: Authors CSV file not found at {authors_csv_path}", err=True)
+        except Exception as e:
+            click.echo(f"Warning: Error loading authors CSV '{authors_csv_path}': {e}", err=True)
+    return authors_info
+
+def _read_rocrate_root(rocrate_path: Path) -> Optional[Dict]:
+    try:
+        with open(rocrate_path, 'r', encoding='utf-8') as f:
+            rocrate_data = json.load(f)
+    except FileNotFoundError:
+        click.echo(f"Error: RO-Crate metadata file not found at {rocrate_path}", err=True)
+        return None
+    except json.JSONDecodeError:
+        click.echo(f"Error: Invalid JSON in RO-Crate metadata file {rocrate_path}", err=True)
+        return None
+    except Exception as e:
+        click.echo(f"Error reading RO-Crate metadata '{rocrate_path}': {e}", err=True)
+        return None
+
+    root_node = None
+    graph = rocrate_data.get('@graph', [])
+    metadata_node = next((item for item in graph if item.get('@id') == 'ro-crate-metadata.json'), None)
+    if metadata_node and 'about' in metadata_node:
+        about_id = metadata_node['about'].get('@id')
+        root_node = next((item for item in graph if item.get('@id') == about_id), None)
+
+    if not root_node:
+        for item in graph:
+            item_type = item.get('@type', [])
+            if not isinstance(item_type, list):
+                item_type = [item_type]
+            if 'Dataset' in item_type and item.get('@id') != 'ro-crate-metadata.json':
+                if 'https://w3id.org/EVI#ROCrate' in item_type:
+                    root_node = item
+                    break
+                if root_node is None:
+                    root_node = item
+
+    if not root_node:
+        click.echo("Error: Could not find root dataset node in RO-Crate graph.", err=True)
+        return None
+
+    return root_node
+
+
+class Publisher(ABC):
+    @abstractmethod
+    def publish(self, rocrate_path: Path, **kwargs):
+        pass
+
+class FairscapePublisher(Publisher):
+    def __init__(self, base_url: str = "https://fairscape.net/api"):
+        self.base_url = base_url.rstrip('/')
+
+    def _zip_directory(self, directory_path: Path) -> Path:
+        import tempfile
+        import zipfile
+        import os
+
+        click.echo(f"Zipping directory '{directory_path}'...")
+
+        temp_zip_file = tempfile.NamedTemporaryFile(delete=False, suffix='.zip')
+        temp_zip_path = Path(temp_zip_file.name)
+        temp_zip_file.close()
+
+        with zipfile.ZipFile(temp_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
+            for root, _, files in os.walk(directory_path):
+                for file in files:
+                    file_path = os.path.join(root, file)
+                    arcname = os.path.relpath(file_path, directory_path)
+                    zipf.write(file_path, arcname)
+
+        click.echo(f"Directory zipped successfully: {temp_zip_path}")
+        return temp_zip_path
+
+    def _get_auth_token(self, username: str, password: str) -> str:
+        url = f"{self.base_url}/login"
+
+        data = {
+            "username": username,
+            "password": password
+        }
+
+        try:
+            response = requests.post(url, data=data)
+            response.raise_for_status()
+
+            token_data = response.json()
+            if "access_token" in token_data:
+                return token_data["access_token"]
+            else:
+                click.echo("Error: Authentication response didn't contain access token", err=True)
+                click.echo(f"Response: {response.text}", err=True)
+                sys.exit(1)
+
+        except requests.exceptions.RequestException as e:
+            click.echo(f"Error authenticating with Fairscape API: {e}", err=True)
+            sys.exit(1)
+
+    def publish(self, rocrate_path: Path, username: str, password: str):
+        click.echo(f"Publishing to Fairscape ({self.base_url})...")
+
+        # Get authentication token
+        token = self._get_auth_token(username, password)
+
+        # Check if path is a directory, if so zip it
+        if rocrate_path.is_dir():
+            click.echo("Input path is a directory, zipping it first...")
+            zip_path = self._zip_directory(rocrate_path)
+            try:
+                self._upload_zip(zip_path, token)
+            finally:
+                # Clean up temporary zip file
+                if zip_path.exists():
+                    zip_path.unlink()
+        elif rocrate_path.suffix.lower() == '.zip':
+            self._upload_zip(rocrate_path, token)
+        else:
+            click.echo("Error: Input path must be a directory or a zip file", err=True)
+            sys.exit(1)
+
+    def _upload_zip(self, zip_path: Path, token: str):
+        url = f"{self.base_url}/rocrate/upload-async"
+        headers = {"Authorization": f"Bearer {token}"}
+
+        try:
+            with open(zip_path, 'rb') as f:
+                files = {'crate': (zip_path.name, f, 'application/zip')}
+                click.echo("Uploading zip file to Fairscape...")
+
+                response = requests.post(url, headers=headers, files=files)
+
+                if response.status_code in (200, 201, 202):
+                    result = response.json()
+                    transaction_id = result.get('transactionFolder', 'N/A')
+                    click.echo("Successfully initiated upload to Fairscape!")
+                    click.echo(f"Transaction ID: {transaction_id}")
+                    click.echo("Check Fairscape dashboard for upload status.")
+                    return transaction_id
+                else:
+                    click.echo(f"Error uploading dataset. Status: {response.status_code}", err=True)
+                    click.echo(f"Response: {response.text}", err=True)
+                    sys.exit(1)
+
+        except requests.exceptions.RequestException as e:
+            click.echo(f"Error connecting to Fairscape API at {url}: {e}", err=True)
+            sys.exit(1)
+        except Exception as e:
+            click.echo(f"An unexpected error occurred during Fairscape upload: {e}", err=True)
+            sys.exit(1)
+
+class DataversePublisher(Publisher):
+    def __init__(self, base_url: str, collection_alias: str):
+        self.base_url = base_url.rstrip('/')
+        self.collection_alias = collection_alias
+
+    def _transform_metadata(self, root_node: Dict, authors_info: Dict) -> Dict:
+        license_map = {
+            "https://creativecommons.org/licenses/by/4.0/": {"name": "CC BY 4.0", "uri": "https://creativecommons.org/licenses/by/4.0/"},
+            "https://creativecommons.org/licenses/by-nc-sa/4.0/": {"name": "CC BY-NC-SA 4.0", "uri": "https://creativecommons.org/licenses/by-nc-sa/4.0/"},
+            "https://creativecommons.org/publicdomain/zero/1.0/": {"name": "CC0 1.0", "uri": "http://creativecommons.org/publicdomain/zero/1.0"}
+        }
+        default_license_info = license_map["https://creativecommons.org/licenses/by/4.0/"]
+        license_url = root_node.get('license', list(license_map.keys())[0])
+        license_info = license_map.get(license_url, default_license_info)
+
+        authors_raw = root_node.get('author', [])
+        author_list = []
+        if isinstance(authors_raw, str):
+            delimiters = [',', ';']
+            for d in delimiters:
+                if d in authors_raw:
+                    author_list = [a.strip() for a in authors_raw.split(d) if a.strip()]
+                    break
+            if not author_list:
+                author_list = [authors_raw.strip()] if authors_raw.strip() else []
+        elif isinstance(authors_raw, list):
+            for item in authors_raw:
+                if isinstance(item, str):
+                    author_list.append(item.strip())
+                elif isinstance(item, dict) and 'name' in item:
+                    author_list.append(item['name'].strip())
+
+        author_entries = []
+        for name in author_list:
+            author_name_lower = name.lower()
+            author_details = authors_info.get(author_name_lower, {})
+            entry = {
+                "authorName": {"typeName": "authorName", "multiple": False, "typeClass": "primitive", "value": name},
+                "authorAffiliation": {"typeName": "authorAffiliation", "multiple": False, "typeClass": "primitive", "value": author_details.get('affiliation', '')}
+            }
+            orcid = author_details.get('orcid', '')
+            if orcid:
+                if len(orcid) == 19 and orcid[4] == '-' and orcid[9] == '-' and orcid[14] == '-':
+                    entry["authorIdentifierScheme"] = {"typeName": "authorIdentifierScheme", "multiple": False, "typeClass": "controlledVocabulary", "value": "ORCID"}
+                    entry["authorIdentifier"] = {"typeName": "authorIdentifier", "multiple": False, "typeClass": "primitive", "value": orcid}
+                else:
+                    click.echo(f"Warning: Invalid ORCID format '{orcid}' for author '{name}'. Skipping ORCID.", err=True)
+
+            author_entries.append(entry)
+
+        if not author_entries:
+            author_entries.append({
+                "authorName": {"typeName": "authorName", "multiple": False, "typeClass": "primitive", "value": "Unknown"},
+                "authorAffiliation": {"typeName": "authorAffiliation", "multiple": False, "typeClass": "primitive", "value": ""}
+            })
+
+        keywords_raw = root_node.get('keywords', [])
+        keyword_list = []
+        if isinstance(keywords_raw, str):
+            keyword_list = [k.strip() for k in keywords_raw.split(',') if k.strip()]
+        elif isinstance(keywords_raw, list):
+            keyword_list = [str(k).strip() for k in keywords_raw if str(k).strip()]
+
+        keyword_entries = [{"keywordValue": {"typeName": "keywordValue", "multiple": False, "typeClass": "primitive", "value": kw}} for kw in keyword_list]
+
+        contact_name = root_node.get("principalInvestigator", author_list[0] if author_list else "Unknown")
+        contact_email = root_node.get("contactEmail", "placeholder@example.com")
+
+        pub_date_raw = root_node.get("datePublished", datetime.today().strftime('%Y-%m-%d'))
+        try:
+            dt_obj = datetime.fromisoformat(pub_date_raw.split('T')[0])
+            pub_date = dt_obj.strftime('%Y-%m-%d')
+        except ValueError:
+            try:
+                dt_obj = datetime.strptime(pub_date_raw, '%m/%d/%Y')
+                pub_date = dt_obj.strftime('%Y-%m-%d')
+            except ValueError:
+                pub_date = datetime.today().strftime('%Y-%m-%d')
+
+        dv_metadata = {
+            "datasetVersion": {
+                "license": license_info,
+                "metadataBlocks": {
+                    "citation": {
+                        "displayName": "Citation Metadata",
+                        "fields": [
+                            {"typeName": "title", "multiple": False, "typeClass": "primitive", "value": root_node.get("name", "Untitled Dataset")},
+                            {"typeName": "author", "multiple": True, "typeClass": "compound", "value": author_entries},
+                            {"typeName": "datasetContact", "multiple": True, "typeClass": "compound", "value": [
+                                {"datasetContactName": {"typeName": "datasetContactName", "multiple": False, "typeClass": "primitive", "value": contact_name},
+                                 "datasetContactEmail": {"typeName": "datasetContactEmail", "multiple": False, "typeClass": "primitive", "value": contact_email}}
+                            ]},
+                            {"typeName": "dsDescription", "multiple": True, "typeClass": "compound", "value": [
+                                {"dsDescriptionValue": {"typeName": "dsDescriptionValue", "multiple": False, "typeClass": "primitive", "value": root_node.get("description", "")}}
+                            ]},
+                            {"typeName": "subject", "multiple": True, "typeClass": "controlledVocabulary", "value": root_node.get("subject", ["Other"])},
+                            {"typeName": "keyword", "multiple": True, "typeClass": "compound", "value": keyword_entries},
+                            {"typeName": "notesText", "multiple": False, "typeClass": "primitive", "value": f"RO-Crate Source: {root_node.get('@id', 'N/A')}"},
+                            {"typeName": "distributionDate", "multiple": False, "typeClass": "primitive", "value": pub_date},
+                            {"typeName": "dateOfDeposit", "multiple": False, "typeClass": "primitive", "value": datetime.today().strftime('%Y-%m-%d')}
+                        ]
+                    }
+                }
+            }
+        }
+        return dv_metadata
+
+    def publish(self, rocrate_path: Path, api_token: str, authors_csv_path: Optional[str]):
+        click.echo(f"Publishing RO-Crate '{rocrate_path.name}' to Dataverse ({self.base_url})...")
+        root_node = _read_rocrate_root(rocrate_path)
+        if not root_node:
+            sys.exit(1)
+
+        authors_info = _load_authors_info(authors_csv_path)
+        dataverse_metadata = self._transform_metadata(root_node, authors_info)
+
+        url = f"{self.base_url}/api/dataverses/{self.collection_alias}/datasets"
+        headers = {"X-Dataverse-key": api_token, "Content-Type": "application/json"}
+
+        try:
+            response = requests.post(url, headers=headers, json=dataverse_metadata)
+            response.raise_for_status()
+
+            if response.status_code == 201:
+                result = response.json()
+                persistent_id = result.get('data', {}).get('persistentId', 'N/A')
+                click.echo("Successfully created dataset on Dataverse!")
+                click.echo(f"Persistent Identifier: {persistent_id}")
+                dataset_id = result.get('data', {}).get('id', '')
+                if dataset_id:
+                    click.echo(f"Dataverse Dataset URL: {self.base_url}/dataset.xhtml?persistentId={persistent_id}&version=DRAFT")
+                return persistent_id
+            else:
+                click.echo(f"Error creating dataset. Status: {response.status_code}", err=True)
+                click.echo(f"Response: {response.text}", err=True)
+                sys.exit(1)
+
+        except requests.exceptions.RequestException as e:
+            click.echo(f"Error connecting to Dataverse API at {url}: {e}", err=True)
+            sys.exit(1)
+        except Exception as e:
+            click.echo(f"An unexpected error occurred during Dataverse publishing: {e}", err=True)
+            sys.exit(1)
+
+
+class DataCitePublisher(Publisher):
+    def __init__(self, prefix: str, repository_id: str, api_url: str):
+        self.prefix = prefix
+        self.repository_id = repository_id
+        self.api_url = api_url.rstrip('/')
+
+    def _transform_metadata(self, root_node: Dict) -> Dict:
+        authors_raw = root_node.get('author', [])
+        creator_list = []
+        if isinstance(authors_raw, str):
+            delimiters = [',', ';']
+            names = []
+            for d in delimiters:
+                if d in authors_raw:
+                    names = [a.strip() for a in authors_raw.split(d) if a.strip()]
+                    break
+            if not names:
+                names = [authors_raw.strip()] if authors_raw.strip() else []
+            creator_list = [{"name": name} for name in names]
+
+        elif isinstance(authors_raw, list):
+            for item in authors_raw:
+                if isinstance(item, str):
+                    creator_list.append({"name": item.strip()})
+                elif isinstance(item, dict):
+                    entry = {}
+                    if 'name' in item:
+                        entry['name'] = item['name']
+                    if 'affiliation' in item:
+                        entry['affiliation'] = [item['affiliation']]
+                    if 'orcid' in item:
+                        entry['nameIdentifiers'] = [{
+                            "nameIdentifier": item['orcid'],
+                            "nameIdentifierScheme": "ORCID",
+                            "schemeUri": "https://orcid.org"
+                        }]
+                    if entry.get("name"):
+                        creator_list.append(entry)
+
+        if not creator_list:
+            creator_list = [{"name": "Unknown"}]
+
+        keywords_raw = root_node.get('keywords', [])
+        keyword_list = []
+        if isinstance(keywords_raw, str):
+            keyword_list = [k.strip() for k in keywords_raw.split(',') if k.strip()]
+        elif isinstance(keywords_raw, list):
+            keyword_list = [str(k).strip() for k in keywords_raw if str(k).strip()]
+        subject_list = [{"subject": kw} for kw in keyword_list]
+
+        pub_date_raw = root_node.get("datePublished", datetime.today().strftime('%Y-%m-%d'))
+        pub_year = datetime.today().year
+        dates_list = []
+        try:
+            dt_obj = datetime.fromisoformat(pub_date_raw.split('T')[0])
+            pub_year = dt_obj.year
+            dates_list.append({"date": dt_obj.strftime('%Y-%m-%d'), "dateType": "Issued"})
+        except ValueError:
+            try:
+                dt_obj = datetime.strptime(pub_date_raw, '%m/%d/%Y')
+                pub_year = dt_obj.year
+                dates_list.append({"date": dt_obj.strftime('%Y-%m-%d'), "dateType": "Issued"})
+            except ValueError:
+                dates_list.append({"date": datetime.today().strftime('%Y-%m-%d'), "dateType": "Issued"})
+
+        rights_list = []
+        license_url = root_node.get('license')
+        if license_url:
+            license_name = license_url.split('/')[-2] if license_url.count('/') > 3 else license_url
+            rights_list.append({
+                "rights": license_name.upper().replace('-', ' '),
+                "rightsUri": license_url
+            })
+
+        datacite_payload = {
+            "data": {
+                "type": "dois",
+                "attributes": {
+                    "prefix": self.prefix,
+                    "publisher": root_node.get("publisher", "Unknown"),
+                    "publicationYear": pub_year,
+                    "titles": [{"title": 
root_node.get("name", "Untitled Dataset")}], + "creators": creator_list, + "types": {"resourceTypeGeneral": "Dataset"}, + "subjects": subject_list if subject_list else None, + "contributors": [], + "dates": dates_list if dates_list else None, + "language": "en", + "alternateIdentifiers": [], + "relatedIdentifiers": [], + "sizes": [], + "formats": [], + "version": root_node.get("version"), + "rightsList": rights_list if rights_list else None, + "descriptions": [{"description": root_node.get("description", ""), "descriptionType": "Abstract"}] if root_node.get("description") else None, + "geoLocations": [], + "fundingReferences": [], + "url": root_node.get("url") or root_node.get('@id'), + "schemaVersion": "http://datacite.org/schema/kernel-4" + } + } + } + + for key, value in list(datacite_payload["data"]["attributes"].items()): + if value is None or (isinstance(value, list) and not value): + del datacite_payload["data"]["attributes"][key] + + return datacite_payload + + def publish(self, rocrate_path: Path, username: str, password: str, event: str = 'publish'): + click.echo(f"Publishing RO-Crate '{rocrate_path.name}' to DataCite ({self.api_url})...") + root_node = _read_rocrate_root(rocrate_path) + if not root_node: + sys.exit(1) + + datacite_metadata = self._transform_metadata(root_node) + datacite_metadata["data"]["attributes"]["event"] = event + + url = f"{self.api_url}/dois" + headers = {"Content-Type": "application/vnd.api+json"} + auth = (username, password) + + try: + response = requests.post(url, headers=headers, json=datacite_metadata, auth=auth) + response.raise_for_status() + + if response.status_code == 201: + result = response.json() + doi = result.get('data', {}).get('id', 'N/A') + doi_url = result.get('data', {}).get('attributes', {}).get('url', '') + click.echo(f"Successfully created DOI on DataCite!") + click.echo(f"DOI: {doi}") + click.echo(f"URL: https://doi.org/{doi}") + if doi_url: click.echo(f"Landing Page: {doi_url}") + return doi + else: + 
click.echo(f"Error creating DOI. Status: {response.status_code}", err=True) + click.echo(f"Response: {response.text}", err=True) + sys.exit(1) + + except requests.exceptions.RequestException as e: + click.echo(f"Error connecting to DataCite API at {url}: {e}", err=True) + sys.exit(1) + except Exception as e: + click.echo(f"An unexpected error occurred during DataCite publishing: {e}", err=True) + sys.exit(1) \ No newline at end of file diff --git a/src/fairscape_cli/rocrate/rocrate.py b/src/fairscape_cli/rocrate/rocrate.py deleted file mode 100644 index 7a16af6..0000000 --- a/src/fairscape_cli/rocrate/rocrate.py +++ /dev/null @@ -1,720 +0,0 @@ -import click -import pathlib -import shutil -import json -from datetime import datetime -from typing import List, Optional, Union - -from pydantic import ValidationError - -from fairscape_cli.config import NAAN -from fairscape_cli.models.guid_utils import GenerateDatetimeSquid -from fairscape_cli.models.utils import ( - FileNotInCrateException, - getDirectoryContents, - getEntityFromCrate, - run_command -) -from fairscape_cli.models import ( - # Core models - Dataset, - Software, - Computation, - ROCrate, - ROCrateMetadata, - BagIt, - - # Generator functions - GenerateDataset, - GenerateSoftware, - GenerateComputation, - GenerateROCrate, - - # RO Crate operations - ReadROCrateMetadata, - AppendCrate, - CopyToROCrate, - UpdateCrate, - - # Additional utilities - generateSummaryStatsElements, - registerOutputs -) - - - -# Click Commands -# RO Crate -@click.group('rocrate') -def rocrate(): - """Invoke operations on Research Object Crate (RO-CRate). 
- """ - pass - - -@rocrate.command('init') -@click.option('--guid', required=False, type=str, default="", show_default=False) -@click.option('--name', required=True, type=str) -@click.option('--organization-name', required=True, type=str) -@click.option('--project-name', required=True, type=str) -@click.option('--description', required=True, type=str) -@click.option('--keywords', required=True, multiple=True, type=str) -@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/") -@click.option('--date-published', required=False, type=str) -def init( - guid, - name, - organization_name, - project_name, - description, - keywords, - license, - date_published -): - """ Initialize a rocrate in the current working directory by instantiating a ro-crate-metadata.json file. - """ - passed_crate = GenerateROCrate( - guid=guid, - name=name, - organizationName=organization_name, - projectName=project_name, - description=description, - keywords=keywords, - license=license, - datePublished=date_published, - path=pathlib.Path.cwd(), - ) - click.echo(passed_crate.get("@id")) - -@rocrate.command('create') -@click.option('--guid', required=False, type=str, default="", show_default=False) -@click.option('--name', required=True, type=str) -@click.option('--organization-name', required=True, type=str) -@click.option('--project-name', required=True, type=str) -@click.option('--description', required=True, type=str) -@click.option('--keywords', required=True, multiple=True, type=str) -@click.option('--license', required=False, type=str, default="https://creativecommons.org/licenses/by/4.0/") -@click.option('--date-published', required=False, type=str) -@click.argument('rocrate-path', type=click.Path(exists=False, path_type=pathlib.Path)) -def create( - rocrate_path, - guid, - name, - organization_name, - project_name, - description, - keywords, - license, - date_published -): - '''Create an ROCrate in a new path specified by the 
rocrate-path argument - ''' - passed_crate = GenerateROCrate( - guid=guid, - name=name, - organizationName=organization_name, - projectName=project_name, - description=description, - keywords=keywords, - license=license, - datePublished=date_published, - path=rocrate_path - ) - click.echo(passed_crate.get("@id")) - - - - -########################## -# RO Crate register subcommands -########################## -@rocrate.group('register') -def register(): - """ Add a metadata record to the RO-Crate for a Dataset, Software, or Computation - """ - pass - -@register.command('software') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--guid', type=str, required=False, default=None) -@click.option('--name', required=True) -@click.option('--author', required=True) -@click.option('--version', required=True) -@click.option('--description', required = True) -@click.option('--keywords', required=True, multiple=True) -@click.option('--file-format', required = True) -@click.option('--url', required = False) -@click.option('--date-modified', required=False) -@click.option('--filepath', required=False) -@click.option('--used-by-computation', required=False, multiple=True) -@click.option('--associated-publication', required=False) -@click.option('--additional-documentation', required=False) -@click.pass_context -def registerSoftware( - ctx, - rocrate_path: pathlib.Path, - guid, - name, - author, - version, - description, - keywords, - file_format, - url, - date_modified, - filepath, - used_by_computation, - associated_publication, - additional_documentation -): - """Register a Software metadata record to the specified ROCrate - """ - try: - crateInstance = ReadROCrateMetadata(rocrate_path) - except Exception as exc: - click.echo(f"ERROR Reading ROCrate: {str(exc)}") - ctx.exit(code=1) - - try: - software_instance = GenerateSoftware( - guid= guid, - url= url, - name=name, - version=version, - keywords=keywords, - 
fileFormat=file_format, - description=description, - author= author, - associatedPublication=associated_publication, - additionalDocumentation=additional_documentation, - dateModified=date_modified, - usedByComputation=used_by_computation, - filepath=filepath, - cratePath =rocrate_path - ) - - AppendCrate(cratePath = rocrate_path, elements=[software_instance]) - click.echo(software_instance.guid) - - except FileNotInCrateException as e: - click.echo(f"ERROR: {str(e)}") - ctx.exit(code=1) - - except ValidationError as e: - click.echo("ERROR: Software Validation Failure") - click.echo(e) - ctx.exit(code=1) - - except Exception as exc: - click.echo(f"ERROR: {str(exc)}") - ctx.exit(code=1) - - -@register.command('dataset') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--guid', type=str, required=False, default=None) -@click.option('--name', required=True) -@click.option('--url', required=False) -@click.option('--author', required=True) -@click.option('--version', required=True) -@click.option('--date-published', required=True) -@click.option('--description', required=True) -@click.option('--keywords', required=True, multiple=True) -@click.option('--data-format', required=True) -@click.option('--filepath', required=True) -@click.option('--summary-statistics-filepath', required=False, type=click.Path(exists=True)) -@click.option('--used-by', required=False, multiple=True) -@click.option('--derived-from', required=False, multiple=True) -@click.option('--generated-by', required=False, multiple=True) -@click.option('--schema', required=False, type=str) -@click.option('--associated-publication', required=False) -@click.option('--additional-documentation', required=False) -@click.pass_context -def registerDataset( - ctx, - rocrate_path: pathlib.Path, - guid: str, - name: str, - url: str, - author: str, - version: str, - date_published: str, - description: str, - keywords: List[str], - data_format: str, - filepath: str, 
- summary_statistics_filepath: Optional[str], - used_by: Optional[List[str]], - derived_from: Optional[List[str]], - generated_by: Optional[List[str]], - schema: str, - associated_publication: Optional[str], - additional_documentation: Optional[List[str]], -): - """Register Dataset object metadata with the specified RO-Crate""" - try: - crate_instance = ReadROCrateMetadata(rocrate_path) - except Exception as exc: - click.echo(f"ERROR Reading ROCrate: {str(exc)}") - ctx.exit(code=1) - - try: - # Generate main dataset GUID - sq_dataset = GenerateDatetimeSquid() - dataset_guid = guid if guid else f"ark:{NAAN}/dataset-{name.lower().replace(' ', '-')}-{sq_dataset}" - - summary_stats_guid = None - elements = [] - - # Handle summary statistics if provided - if summary_statistics_filepath: - summary_stats_guid, summary_stats_instance, computation_instance = generateSummaryStatsElements( - name=name, - author=author, - keywords=keywords, - date_published=date_published, - version=version, - associated_publication=associated_publication, - additional_documentation=additional_documentation, - schema=schema, - dataset_guid=dataset_guid, - summary_statistics_filepath=summary_statistics_filepath, - crate_path=rocrate_path - ) - elements.extend([computation_instance, summary_stats_instance]) - - # Generate main dataset - dataset_instance = GenerateDataset( - guid=dataset_guid, - url=url, - author=author, - name=name, - description=description, - keywords=keywords, - datePublished=date_published, - version=version, - associatedPublication=associated_publication, - additionalDocumentation=additional_documentation, - dataFormat=data_format, - schema=schema, - derivedFrom=derived_from, - generatedBy=generated_by, - usedBy=used_by, - filepath=filepath, - cratePath=rocrate_path, - summary_stats_guid=summary_stats_guid - ) - - elements.insert(0, dataset_instance) - AppendCrate(cratePath=rocrate_path, elements=elements) - click.echo(dataset_instance.guid) - - except 
FileNotInCrateException as e: - click.echo(f"ERROR: {str(e)}") - ctx.exit(code=1) - - except ValidationError as e: - click.echo("Dataset Validation Error") - click.echo(e) - ctx.exit(code=1) - - except Exception as exc: - click.echo(f"ERROR: {str(exc)}") - ctx.exit(code=1) - - -@register.command('computation') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--guid', type=str, required=False, default=None) -@click.option('--name', required=True) -@click.option('--run-by', required=True) -@click.option('--command', required=False) -@click.option('--date-created', required=True) -@click.option('--description', required=True) -@click.option('--keywords', required=True, multiple=True) -@click.option('--used-software', required=False, multiple=True) -@click.option('--used-dataset', required=False, multiple=True) -@click.option('--generated', required=False, multiple=True) -@click.pass_context -def computation( - ctx, - rocrate_path: pathlib.Path, - guid: str, - name: str, - run_by: str, - command: Optional[Union[str, List[str]]], - date_created: str, - description: str, - keywords: List[str], - used_software, - used_dataset, - generated -): - """Register a Computation with the specified RO-Crate - """ - try: - crateInstance = ReadROCrateMetadata(rocrate_path) - except Exception as exc: - click.echo(f"ERROR Reading ROCrate: {str(exc)}") - ctx.exit(code=1) - - - try: - computationInstance = GenerateComputation( - guid=guid, - name=name, - runBy=run_by, - command= command, - dateCreated= date_created, - description= description, - keywords= keywords, - usedSoftware= used_software, - usedDataset= used_dataset, - generated= generated - ) - - AppendCrate(cratePath=rocrate_path, elements=[computationInstance]) - click.echo(computationInstance.guid) - - except ValidationError as e: - click.echo("Computation Validation Error") - click.echo(e) - ctx.exit(code=1) - -@register.command('subrocrate') 
-@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.argument('subrocrate-path', type=click.Path(path_type=pathlib.Path)) -@click.option('--guid', required=False, type=str, default="", show_default=False) -@click.option('--name', required=True, type=str) -@click.option('--organization-name', required=True, type=str) -@click.option('--project-name', required=True, type=str) -@click.option('--description', required=True, type=str) -@click.option('--keywords', required=True, multiple=True, type=str) -@click.pass_context -def subrocrate( - ctx, - rocrate_path: pathlib.Path, - subrocrate_path: pathlib.Path, - guid: str, - name: str, - organization_name: str, - project_name: str, - description: str, - keywords: List[str] -): - """Register a new RO-Crate within an existing RO-Crate directory. - - ROCRATE_PATH: Path to the parent RO-Crate - SUBCRATE_PATH: Relative path within the parent RO-Crate where the subcrate should be created - """ - try: - # Load existing crate - metadata = ReadROCrateMetadata(rocrate_path) - parent_crate = ROCrate( - guid=metadata['@graph'][1]['@id'], - name=metadata['@graph'][1]['name'], - description=metadata['@graph'][1]['description'], - keywords=metadata['@graph'][1]['keywords'], - path=rocrate_path - ) - - # Create subcrate using the new method - subcrate_id = parent_crate.create_subcrate( - subcrate_path=subrocrate_path, - guid=guid, - name=name, - description=description, - keywords=keywords, - organization_name=organization_name, - project_name=project_name - ) - - click.echo(subcrate_id) - - except Exception as exc: - click.echo(f"ERROR: {str(exc)}") - ctx.exit(code=1) - -# RO Crate add subcommands -@rocrate.group('add') -def add(): - """Add (transfer) object to RO-Crate and register object metadata.""" - pass - - -@add.command('software') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--guid', type=str, required=False, default=None) 
-@click.option('--name', required=True) -@click.option('--author', required=True) -@click.option('--version', required=True) -@click.option('--description', required = True) -@click.option('--keywords', required=True, multiple=True) -@click.option('--file-format', required = True) -@click.option('--url', required = False) -@click.option('--source-filepath', required=True) -@click.option('--destination-filepath', required=True) -@click.option('--date-modified', required=True) -@click.option('--used-by-computation', required=False, multiple=True) -@click.option('--associated-publication', required=False) -@click.option('--additional-documentation', required=False) -@click.pass_context -def software( - ctx, - rocrate_path: pathlib.Path, - guid, - name, - author, - version, - description, - keywords, - file_format, - url, - source_filepath, - destination_filepath, - date_modified, - used_by_computation, - associated_publication, - additional_documentation -): - """Add a Software and its corresponding metadata. 
- """ - try: - crateInstance = ReadROCrateMetadata(rocrate_path) - except Exception as exc: - click.echo(f"ERROR Reading ROCrate: {str(exc)}") - ctx.exit(code=1) - - - try: - CopyToROCrate(source_filepath, destination_filepath) - - software_instance = GenerateSoftware( - guid=guid, - url= url, - name=name, - version=version, - keywords=keywords, - fileFormat=file_format, - description=description, - author= author, - associatedPublication=associated_publication, - additionalDocumentation=additional_documentation, - dateModified=date_modified, - usedByComputation=used_by_computation, - filepath=destination_filepath, - cratePath =rocrate_path - ) - - AppendCrate(cratePath = rocrate_path, elements=[software_instance]) - # copy file to rocrate - click.echo(software_instance.guid) - - except ValidationError as e: - click.echo("Software Validation Error") - click.echo(e) - ctx.exit(code=1) - - # TODO add to cache - - -@add.command('dataset') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--guid', type=str, required=False, default=None) -@click.option('--name', required=True) -@click.option('--url', required=False) -@click.option('--author', required=True) -@click.option('--version', required=True) -@click.option('--date-published', required=True) -@click.option('--description', required=True) -@click.option('--keywords', required=True, multiple=True) -@click.option('--data-format', required=True) -@click.option('--source-filepath', required=True) -@click.option('--destination-filepath', required=True) -@click.option('--summary-statistics-source', required=False, type=click.Path(exists=True)) -@click.option('--summary-statistics-destination', required=False, type=click.Path()) -@click.option('--used-by', required=False, multiple=True) -@click.option('--derived-from', required=False, multiple=True) -@click.option('--generated-by', required=False, multiple=True) -@click.option('--schema', required=False, type=str) 
-@click.option('--associated-publication', required=False) -@click.option('--additional-documentation', required=False) -@click.pass_context -def dataset( - ctx, - rocrate_path: pathlib.Path, - guid, - name, - url, - author, - version, - date_published, - description, - keywords, - data_format, - source_filepath, - destination_filepath, - summary_statistics_source, - summary_statistics_destination, - used_by, - derived_from, - generated_by, - schema, - associated_publication, - additional_documentation, -): - """Add a Dataset file and its metadata to the RO-Crate.""" - try: - crateInstance = ReadROCrateMetadata(rocrate_path) - except Exception as exc: - click.echo(f"ERROR Reading ROCrate: {str(exc)}") - ctx.exit(code=1) - - try: - # Copy main dataset file - CopyToROCrate(source_filepath, destination_filepath) - - # Generate main dataset GUID - sq_dataset = GenerateDatetimeSquid() - dataset_guid = guid if guid else f"ark:{NAAN}/dataset-{name.lower().replace(' ', '-')}-{sq_dataset}" - - summary_stats_guid = None - elements = [] - - # Handle summary statistics if provided - if summary_statistics_source and summary_statistics_destination: - # Copy summary statistics file - CopyToROCrate(summary_statistics_source, summary_statistics_destination) - - # Generate summary statistics elements - summary_stats_guid, summary_stats_instance, computation_instance = generateSummaryStatsElements( - name=name, - author=author, - keywords=keywords, - date_published=date_published, - version=version, - associated_publication=associated_publication, - additional_documentation=additional_documentation, - schema=schema, - dataset_guid=dataset_guid, - summary_statistics_filepath=summary_statistics_destination, - crate_path=rocrate_path - ) - elements.extend([computation_instance, summary_stats_instance]) - - # Generate main dataset - dataset_instance = GenerateDataset( - guid=dataset_guid, - url=url, - author=author, - name=name, - description=description, - keywords=keywords, - 
datePublished=date_published, - version=version, - associatedPublication=associated_publication, - additionalDocumentation=additional_documentation, - dataFormat=data_format, - schema=schema, - derivedFrom=derived_from, - generatedBy=generated_by, - usedBy=used_by, - filepath=destination_filepath, - cratePath=rocrate_path, - summary_stats_guid=summary_stats_guid - ) - - elements.insert(0, dataset_instance) - AppendCrate(cratePath=rocrate_path, elements=elements) - click.echo(dataset_instance.guid) - - except ValidationError as e: - click.echo("Dataset Validation Error") - click.echo(e) - ctx.exit(code=1) - - except Exception as exc: - click.echo(f"ERROR: {str(exc)}") - ctx.exit(code=1) - -################# -# Summary Statistics -################# -@rocrate.command('compute-statistics') -@click.argument('rocrate-path', type=click.Path(exists=True, path_type=pathlib.Path)) -@click.option('--dataset-id', required=True, help='ID of dataset to compute statistics for') -@click.option('--software-id', required=True, help='ID of software to run') -@click.option('--command', required=True, help='Python command to execute (e.g. 
python)') -@click.pass_context -def compute_statistics( - ctx, - rocrate_path: pathlib.Path, - dataset_id: str, - software_id: str, - command: str -): - """Compute statistics for a dataset using specified software""" - crate_instance = ReadROCrateMetadata(rocrate_path) - initial_files = getDirectoryContents(rocrate_path) - - # Get original dataset info - dataset_info = getEntityFromCrate(crate_instance, dataset_id) - software_info = getEntityFromCrate(crate_instance, software_id) - if not dataset_info or not software_info: - raise ValueError(f"Dataset or software not found in crate") - - # Get original dataset author - original_author = dataset_info.get("author", "Unknown") - dataset_path = dataset_info.get("contentUrl", "").replace("file:///", "") - software_path = software_info.get("contentUrl", "").replace("file:///", "") - - if not dataset_path or not software_path: - raise ValueError("Dataset or software path not found") - - full_command = f"{command} {software_path} {dataset_path} {rocrate_path}" - success, stdout, stderr = run_command(full_command) - if not success: - raise RuntimeError(f"Command failed: {stderr}") - - final_files = getDirectoryContents(rocrate_path) - new_files = final_files - initial_files - if not new_files: - raise RuntimeError("No output files generated") - - computation_instance = GenerateComputation( - guid=None, - name=f"Statistics Computation for {dataset_id}", - runBy="Fairscape-CLI", - command=full_command, - dateCreated=datetime.now().isoformat(), - description=f"Generated statistics\nstdout:\n{stdout}\nstderr:\n{stderr}", - keywords=["statistics"], - usedSoftware=[software_id], - usedDataset=[dataset_id], - generated=[] - ) - - output_instances = registerOutputs( - new_files=new_files, - computation_id=computation_instance.guid, - dataset_id=dataset_id, - author=original_author - ) - - stats_output = [out.guid for out in output_instances] - computation_instance.generated = stats_output - - if stats_output: - # Update the 
original dataset metadata - dataset_info["hasSummaryStatistics"] = stats_output - # Generate a new Dataset instance with updated metadata - updated_dataset = Dataset.model_validate(dataset_info) - - # Update the dataset in the crate and append new elements - UpdateCrate(cratePath=rocrate_path, element=updated_dataset) - AppendCrate( - cratePath=rocrate_path, - elements=[computation_instance] + output_instances - ) - - click.echo(computation_instance.guid) \ No newline at end of file diff --git a/src/fairscape_cli/rocrate/utils.py b/src/fairscape_cli/rocrate/utils.py deleted file mode 100644 index 53a0b41..0000000 --- a/src/fairscape_cli/rocrate/utils.py +++ /dev/null @@ -1,10 +0,0 @@ -class DestinationCrateError(Exception): - def __init__(self, crate_path, destination_path): - self.message = "\n".join([ - "Destination Filepath isnt inside the Crate", - f"ROCrate Path: {str(crate_path)}", - f"DestinationPath: {str(destination_path)}" - ]) - - - diff --git a/tests/test_rocrate_api.py b/tests/test_rocrate_api.py index cfda2b9..d462c1d 100644 --- a/tests/test_rocrate_api.py +++ b/tests/test_rocrate_api.py @@ -48,6 +48,10 @@ def test_api(self): rocrate_metadata = { "guid": "ark:59853/UVA/B2AI/rocrate_test", "name": 'test rocrate', + "author": "Fake Person", + "version": "0.1", + "datePublished": "2024-01-01", + "license": "CC-BY-4.0", "organizationName": "UVA", "projectName": "B2AI", "description": "Testing ROCrate Model", @@ -114,7 +118,7 @@ def test_api(self): "author": "Lundberg Lab", "datePublished": "2024-10-22", "version": "0.7alpha", - "dataFormat": "jpg", + "format": "jpg", "generatedBy": [], "derivedFrom": [], "usedBy": [], diff --git a/tests/test_rocrate_commands.py b/tests/test_rocrate_commands.py index 051ffb8..f0df0e0 100644 --- a/tests/test_rocrate_commands.py +++ b/tests/test_rocrate_commands.py @@ -30,6 +30,7 @@ def test_cli_workflow(self): '--name', 'Top Level Crate', '--organization-name', 'Test Org', '--project-name', 'Test Project', + 
'--date-published', '2025-01-22', '--description', 'Top level test crate', '--keywords', 'test,top-level', '.'
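
Note on the added publisher code above: both the Dataverse and DataCite paths normalize `datePublished` with an ISO-first, then `MM/DD/YYYY`, then today's-date fallback, and the Dataverse path accepts an ORCID only when it has the `XXXX-XXXX-XXXX-XXXX` shape. The sketch below re-implements that logic standalone for illustration; the helper names `normalize_date` and `looks_like_orcid` are hypothetical and not part of the package.

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Normalize a date string to YYYY-MM-DD, mirroring the publishers' fallback chain."""
    try:
        # ISO 8601 first; drop any time component before parsing
        return datetime.fromisoformat(raw.split('T')[0]).strftime('%Y-%m-%d')
    except ValueError:
        try:
            # Fall back to US-style MM/DD/YYYY
            return datetime.strptime(raw, '%m/%d/%Y').strftime('%Y-%m-%d')
        except ValueError:
            # Last resort: today's date
            return datetime.today().strftime('%Y-%m-%d')

def looks_like_orcid(orcid: str) -> bool:
    """Structural check (19 chars, hyphens at positions 4, 9, 14) used before attaching an identifier."""
    return len(orcid) == 19 and orcid[4] == orcid[9] == orcid[14] == '-'

print(normalize_date('2024-10-22T14:30:00'))        # 2024-10-22
print(looks_like_orcid('0000-0002-1825-0097'))      # True
```

This is only a shape check, not a checksum validation; an ORCID with a bad check digit would still pass, which matches the behavior of the patched code.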