Enforce data structure and data type consistency for JSON metadata #1421
Implements JSON Schema for titles
KellyStathis left a comment
Hi @svogt0511, I noted a couple of things from reading through this but haven't tested yet. Happy to take a closer look when I'm back next week!
"if": {
  "properties": {
    "relationType": {
      "enum": ["HasMetadata", "IsMetadataFor"]
    }
  }
},
"then": {
  "properties": {
    "relatedMetadataScheme": {
      "type": "string"
    },
    "schemeUri": {
      "type": "string"
    },
    "schemeType": {
      "type": "string"
    }
  }
},
"else": {
  "not": {
    "anyOf": [
      { "required": ["relatedMetadataScheme"] },
      { "required": ["schemeUri"] },
      { "required": ["schemeType"] }
    ]
  }
},
@svogt0511 This restriction is in the schema documentation—that 12.c, 12.d, and 12.e should only be used when 12.b = "HasMetadata" or "IsMetadataFor": https://datacite-metadata-schema.readthedocs.io/en/4.6/properties/relatedidentifier/
However, I don't think it is in the XSD as XSD doesn't support conditional requirements like this.
In the interest of maintaining parity between the XSD and the JSON Schema, could we remove this?
Agreeing with Kelly—let's remove this conditional since it's not imposed by the XSD.
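A minimal sketch of what the relatedIdentifier item schema might look like once the conditional is removed, assuming the three properties are simply declared next to the others. This is not the PR's actual file: `relationType` is shown as a plain string for brevity and may in practice reference a controlled vocabulary. Declaring the properties at this level also matters if the object uses `"additionalProperties": false`, because that keyword only considers `properties` and `patternProperties` in the same schema object and does not see properties introduced inside `then`.

```json
{
  "$comment": "Sketch only; the real related_identifier schema may declare more properties.",
  "type": "object",
  "properties": {
    "relatedIdentifier": { "type": "string" },
    "relatedIdentifierType": {
      "$ref": "controlled_vocabularies/related_identifier_type.json"
    },
    "relationType": { "type": "string" },
    "relatedMetadataScheme": { "type": "string" },
    "schemeUri": { "type": "string" },
    "schemeType": { "type": "string" }
  },
  "additionalProperties": false
}
```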
I think removing this logic will fix this, but I am having an issue including relatedMetadataScheme even when the relationType is HasMetadata. This was my JSON test:
"relatedIdentifiers": [
{
"schemeUri": "https://github.com/citation-style-language/schema/raw/master/csl-data.json",
"relationType": "HasMetadata",
"relatedIdentifier": "https://data.datacite.org/application/citeproc+json/10.5072/example-full",
"relatedIdentifierType": "URL",
"relatedMetadataScheme": "citeproc+json"
},
Which failed with this error:
{
"errors": [
{
"source": "related_identifiers",
"title": "Object property at `/0/relatedMetadataScheme` is a disallowed additional property",
"uid": "10.1111/742r-wc63"
}
]
}
"if": {
  "properties": {
    "relationType": {
      "enum": [ "HasMetadata", "IsMetadataFor" ]
    }
  }
},
"then": {
  "properties": {
    "relatedItemIdentifier": {
      "type": "object",
      "properties": {
        "relatedItemIdentifier": {
          "type": "string"
        },
        "relatedItemIdentifierType": {
          "$ref": "controlled_vocabularies/related_identifier_type.json"
        },
        "relatedMetadataScheme": {
          "type": "string"
        },
        "schemeURI": {
          "type": "string"
        },
        "schemeType": {
          "type": "string"
        }
      },
      "additionalProperties": false
    }
  }
},
"else": {
  "properties": {
    "relatedItemIdentifier": {
      "type": "object",
      "properties": {
        "relatedItemIdentifier": {
          "type": "string"
        },
        "relatedItemIdentifierType": {
          "$ref": "controlled_vocabularies/related_identifier_type.json"
        }
      },
      "additionalProperties": false
    }
  }
},
@svogt0511 If we update RelatedIdentifier to remove this restriction (see earlier comment), we should also remove it from RelatedItem here.
  }
},
"dependentRequired": {
  "affiliationIdentifier": ["nameIdentifierScheme"]
@svogt0511 Should this line be as follows?
"nameIdentifier": ["nameIdentifierScheme"]
As opposed to "affiliationIdentifier": ["nameIdentifierScheme"]?
Also, I think nameIdentifierScheme is simply required rather than having a dependency on nameIdentifier:
<xs:complexType name="nameIdentifier">
<xs:annotation>
<xs:documentation>Uniquely identifies a creator or contributor, according to various identifier schemes.</xs:documentation>
</xs:annotation>
<xs:simpleContent>
<xs:extension base="nonemptycontentStringType">
<xs:attribute name="nameIdentifierScheme" type="xs:string" use="required"/>
<xs:attribute name="schemeURI" type="xs:anyURI" use="optional"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
nameIdentifier also cannot be null since it has nonemptycontentStringType.
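A rough sketch of a nameIdentifiers item that follows the XSD excerpt above, assuming `nonemptycontentStringType` amounts to a minimum length of 1. It makes `nameIdentifierScheme` unconditionally required instead of using `dependentRequired`. For comparison, `"dependentRequired": { "nameIdentifier": ["nameIdentifierScheme"] }` would only demand the scheme when `nameIdentifier` is present.

```json
{
  "$comment": "Sketch only; whether nameIdentifier itself should be listed in required is left open.",
  "type": "object",
  "properties": {
    "nameIdentifier": { "type": "string", "minLength": 1 },
    "nameIdentifierScheme": { "type": "string" },
    "schemeUri": { "type": "string" }
  },
  "required": ["nameIdentifierScheme"]
}
```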
codycooperross left a comment
Took a pass on the JSON Schemas aside from relatedItem—these look great! I made some comments where there might be inconsistencies with the XSD or the current JSON representation.
Will get to relatedItem soon!
    "type": ["string", "null"]
  }
},
"dependentRequired": {
While logical, this is inconsistent with the XSD definition, which does not suggest dependent values between affiliationIdentifier and affiliationIdentifierScheme:
<xs:complexType name="affiliation">
<xs:annotation>
<xs:documentation>Uniquely identifies an affiliation, according to various identifier schemes.</xs:documentation>
</xs:annotation>
<xs:simpleContent>
<xs:extension base="nonemptycontentStringType">
<xs:attribute name="affiliationIdentifier" type="xs:string" use="optional"/>
<xs:attribute name="affiliationIdentifierScheme" type="xs:string" use="optional"/>
<xs:attribute name="schemeURI" type="xs:anyURI" use="optional"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
Yes, I agree with this. Although this logic is described in the Schema documentation at https://datacite-metadata-schema.readthedocs.io/en/4.6/properties/creator/#b-affiliationidentifierscheme ("If affiliationIdentifier is used, affiliationIdentifierScheme is mandatory."), it isn't part of the XSD, so we should omit it from the JSON Schema validation.
{
  "title": "publicationYear",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": [ "integer", "string" ],
Is this necessary with oneOf below?
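To illustrate the question: if each `oneOf` branch states its own `type`, a top-level `type` adds nothing. The sketch below is only an assumption about the intended shape; the four-digit bounds and the pattern are placeholders, not values taken from this PR. In draft 2020-12, `minimum` and `maximum` are inclusive bounds.

```json
{
  "title": "publicationYear",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$comment": "Sketch only; the bounds and the pattern are assumed, not taken from the PR.",
  "oneOf": [
    { "type": "integer", "minimum": 1000, "maximum": 9999 },
    { "type": "string", "pattern": "^[0-9]{4}$" }
  ]
}
```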
      "inclusiveMinimum": true,
      "inclusiveMaximum": true
    }
  ],
Should publicationYear be required by this schema?
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "array",
"items": {
  "$ref": "size.json"
I don't think it makes a difference, since sizes and formats are both just arrays of strings, but the formats schema points to size.json rather than format.json.
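The corrected reference would presumably look something like this, assuming a `format.json` file sits alongside `size.json`:

```json
{
  "$comment": "Sketch only; assumes format.json exists next to size.json.",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "array",
  "items": {
    "$ref": "format.json"
  }
}
```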
{
  "title": "Version",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": [ "number", "string" ]
Per the XSD, this seems like it should just be a string rather than a number or a string:
<xs:element name="version" type="xs:string" minOccurs="0">
<xs:annotation>
<xs:documentation>Version number of the resource. If the primary resource has changed the version number increases.</xs:documentation>
<xs:documentation>Register a new identifier for a major version change. Individual stewards need to determine which are major vs. minor versions. May be used in conjunction with properties 11 and 12 (AlternateIdentifier and RelatedIdentifier) to indicate various information updates. May be used in conjunction with property 17 (Description) to indicate the nature and file/record range of version.</xs:documentation>
</xs:annotation>
</xs:element>
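A sketch of the string-only variant suggested here. Note that under this schema a numeric JSON value such as `1.0` would be rejected and would need to be sent as `"1.0"`; whether existing clients send numbers is not established in this thread.

```json
{
  "title": "Version",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "string"
}
```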
    "$ref": "geo_location_polygon.json"
  }
},
"required": [
Are these meant to be in the top-level geoLocation schema? They don't seem to point to existing attributes.
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
  "funderName": {
It looks like funderName cannot be null or empty:
<xs:element name="funderName" minOccurs="1" maxOccurs="1">
<xs:annotation>
<xs:documentation>Name of the funding provider.</xs:documentation>
</xs:annotation>
<xs:simpleType>
<xs:restriction base="nonemptycontentStringType"/>
</xs:simpleType>
</xs:element>
Should this be "type": ["string"]?
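One way this could be expressed, as a sketch: `minOccurs="1"` maps to `required`, and the non-empty restriction is approximated with `minLength`, assuming `nonemptycontentStringType` amounts to a minimum length of 1. The other funding reference properties are left out here.

```json
{
  "$comment": "Sketch only; minLength mirrors an assumed non-empty restriction in the XSD.",
  "type": "object",
  "properties": {
    "funderName": { "type": "string", "minLength": 1 }
  },
  "required": ["funderName"]
}
```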
"title": "RelatedItems",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "array",
"minItems": 0,
There is also a clear pattern of "minItems": 0 across the array schemas. Are these necessary, given that omitting minItems behaves the same as 0?
"dateType": {
  "$ref": "controlled_vocabularies/date_type.json"
},
"dateInformation": {
I believe dateInformation can be null. It is in this example: https://support.datacite.org/docs/api-create-dois#request-payload-1
"schemeUri": {
  "type": "string"
},
"schemeType": {
I believe schemeType can be null. It is in this example: https://support.datacite.org/docs/api-create-dois#request-payload-1
"schemeUri": {
  "type": "string"
},
"resourceTypeGeneral": {
This can also be null, per this example: https://support.datacite.org/docs/api-create-dois#request-payload-1
Since this is a pattern, I'm wondering if we should systematically allow null values for optional properties. What do you think, @codycooperross?
Since the production API currently accepts null values for many fields in my testing, yes, the JSON Schema should permit null values for all fields that are not xs:string or nonemptycontentStringType in the XSD.
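A minimal sketch of how null could be permitted, under the assumption that the affected properties are either plain strings or references to controlled-vocabulary files. A type union covers the simple case; `anyOf` with a null branch covers a `$ref`. The two properties are grouped into one object purely for illustration, and the `resource_type_general.json` path is a guess modeled on the other vocabulary references in this PR.

```json
{
  "$comment": "Sketch only; the grouping of properties and the vocabulary file name are assumptions.",
  "type": "object",
  "properties": {
    "dateInformation": { "type": ["string", "null"] },
    "resourceTypeGeneral": {
      "anyOf": [
        { "$ref": "controlled_vocabularies/resource_type_general.json" },
        { "type": "null" }
      ]
    }
  }
}
```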
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
  "name": {
This is an interesting one. I was testing with metadata pulled from some recently created/updated DOIs, to see if this validation would have impacted them.
I found this DOI which in the JSON has a contributor givenName and familyName, but not name: https://api.datacite.org/dois/10.34804/supra.2021092825
"contributors": [
{
"nameType": "Personal",
"givenName": "Jacopo",
"familyName": "Torrisi",
"affiliation": [],
"contributorType": "DataManager",
"nameIdentifiers": [
{
"nameIdentifier": "",
"nameIdentifierScheme": "ORCID"
}
]
}
],
Using this metadata to create a DOI on staging failed with this error:
{
"errors": [
{
"source": "contributors",
"title": "Object at `/0` is missing required properties: name",
"uid": "10.1111/742r-wc63"
}
]
}
From my understanding of the XSD, contributorName is required even if givenName and familyName are provided. And the corresponding XML for this DOI does have a contributorName:
<contributors>
<contributor contributorType="DataManager">
<contributorName nameType="Personal">Torrisi, Jacopo</contributorName>
<givenName>Jacopo</givenName>
<familyName>Torrisi</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI=""/>
<affiliation affiliationIdentifierScheme="ROR"/>
</contributor>
</contributors>
How is that contributorName being generated for the XML? I am just thinking through the potential impact on users who currently provide contributor.givenName and contributor.familyName but not contributor.name, if we introduce this.
Tagging @codycooperross for input into this as well :)
When converting JSON to XML, we currently generate contributorName and creatorName from the familyName and givenName metadata when those are present. See the description here: https://docs.google.com/spreadsheets/d/1Hy0KXWPxqNx-Pfh-nNFxbsUFXXVYsO8O2sDIytXQv7U/edit?gid=1806954511#gid=1806954511&range=2:2
For the sake of backwards compatibility with existing request patterns and scoping this PR, let's remove the name requirement on creator and contributor for now. Currently invalid JSON metadata, i.e. metadata that contains no name or familyName metadata, will continue to fail when validated against the XSD.
Purpose
closes: https://github.com/datacite/product-backlog/issues/325
Approach
See #1341 for the approach
Open Questions and Pre-Merge TODOs
Learning
Types of changes
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Reviewer, please remember our guidelines: