Add spec section on requirements for importing/exporting CDT values #2

Merged: hartig merged 20 commits into main from spec-import-export on Jun 4, 2024

Conversation

@kasei (Collaborator) commented Apr 2, 2024

I'm a bit unclear on the xref metadata in respec (maybe that needs populating with all the new references?).

@kasei kasei requested a review from hartig April 2, 2024 23:36
@hartig (Contributor) left a comment

I think it is a good start!! Yet, I have many comments about inconsistent or confusing wording of the text, as well as some more substantive issues around the two lists of bullet points at the beginning of the two normative sections (i.e., Sections 12.2 and 12.3).

kasei and others added 2 commits April 11, 2024 14:50
Co-authored-by: Olaf Hartig <ohartig@amazon.com>
Co-authored-by: Olaf Hartig <ohartig@amazon.com>
hartig added a commit to hartig/jena that referenced this pull request May 26, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
hartig added a commit to hartig/jena that referenced this pull request May 27, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
Comment on lines 3322 to 3324
However, using a Turtle parser not aware of the <a href="#list-datatype">cdt:List datatype</a> will result in an object literal containing the substring "_:eve"; that is, the literal will keep the exact lexical form as given in the Turtle snippet above (<a href="#load-renaming-blank-example"></a>).
Allowing multiple such composite values, from multiple sources, to be inserted into the same RDF Dataset (e.g., via two SPARQL `LOAD` updates) presents a challenge:
when a system interprets the literals obtained from its Turtle parser, the <a>term lists</a> that it creates by applying the <a href="#list-datatype-lex-to-value-mapping">cdt:List lexical-to-value mapping</a> will contain the same blank nodes (because, during the mapping process for each cdt:List literal, the <i><a>bnl2bn</a></i> function is applied to the same blank node identifier, `_:eve`).
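
To make the scenario concrete, here is a minimal hypothetical illustration (the ex: resource names, the literal contents, and the cdt: prefix IRI below are placeholders invented for this sketch, not taken from the spec):

```turtle
# Two independently authored Turtle documents that happen to use the same
# blank node label _:eve inside a cdt:List literal.
# The cdt: prefix IRI is a placeholder for the spec's actual namespace IRI.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

# Source A
ex:a ex:p "[_:eve, 1]"^^cdt:List .

# Source B, loaded later via a second SPARQL LOAD into the same dataset
ex:b ex:p "[_:eve, 2]"^^cdt:List .
```

If a cdt:List-aware system then applies the lexical-to-value mapping to both literals, the two resulting term lists contain the very same blank node, even though the two sources presumably meant unrelated resources.
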
hartig (Contributor):

While this text focuses on the case of a system that is aware of our CDT datatypes but uses a Turtle parser that is not (i.e., the system interprets the CDT literals after receiving them from the Turtle parser), there is also another issue that we don't mention here; namely for systems that do not support the cdt:List datatype at all. In the case of the example outlined in the text here, such a system would throw all the cdt:List literals from different sources together, without renaming _:eve in any of these literals. Hence, piping data through such a system and then into a cdt:List-aware system may cause differences in query results, compared to loading the data directly into the cdt:List-aware system. In other words, a system that is not aware of the cdt:List datatype may accidentally change the meaning of cdt:List literals collected from multiple sources (by making these literals all contain the same blank nodes). Do you agree that this is a problem?

kasei (Collaborator, Author):

I'm not sure I'd necessarily think of it as a "problem," but it is something users of CDTs need to be aware of.

Seems like you could probably think of somewhat similar examples of OWL-capable and non-OWL plain RDF stores where you add inconsistent assertions in the latter, and try to import into the former, and suddenly have bad data. That's not really a problem with OWL so much as a use case that can't work as intended when systems of varying capabilities are involved. (Maybe not the best comparison, but I think instructive…)

hartig (Contributor):

So, you are suggesting we add a Note to call out this issue?

Even if users of CDTs are aware of it, I am afraid that there is nothing they will be able to do to avoid the problem if they use a combination of systems that are and are not CDT-aware. That's somewhat unsatisfying, but I also don't see another way to address this issue (except for disallowing blank nodes in CDT literals).

I should also mention that this issue was one of the concerns raised in discussions of the approach at ESWC.

kasei (Collaborator, Author):

Yeah, understood. Possible paths forward are to have bnode support be optional, with clear documentation that supporting them would introduce incompatibility with non-CDT-supporting systems; or to just document the issue with strong warnings about using bnodes if you need interop between systems…

@hartig (Contributor) left a comment

I went over all outdated comments and added those that are still unresolved into the current version of the changes in this PR.

Comment on lines 3463 to 3477
<p>
Conforming implementations MUST process <a>cdt:List literals</a> and <a>cdt:Map literals</a> during export, replacing, in their lexical form, any substring _:<var>id</var>
matching the <a data-cite="TURTLE#grammar-production-BLANK_NODE_LABEL">`BLANK_NODE_LABEL`</a> production of the grammar
with a string "_:"+<var>b</var> where <var>b</var> is a blank node identifier, such that
</p>

<ul>
<li>within the scope of the document being serialized, all occurrences of <var>id</var> used as a blank node identifier (both inside and outside of composite values) are replaced with the same <var>b</var> value, and</li>
<li>within the scope of the document being serialized, two distinct <var>id</var> blank node identifiers are replaced with distinct <var>b</var> values, and</li>
<li>
if the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of cdt:List and cdt:Map literals (e.g., `_:b1` syntax in N-Triples and in Turtle, `rdf:nodeID` in RDF/XML),
and the blank node <var>B</var> is contained in the data to be serialized both inside and outside of composite value literals,
then the serializer for this document format MUST serialize <var>B</var> using the blank node identifier <var>b</var> (both inside and outside of composite values)
</li>
</ul>
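
As a reading aid for the requirement quoted above, here is a hypothetical before/after sketch, assuming a system that knows which blank node a label inside a composite literal denotes (the prefix IRIs, names, and the fresh label b0 are all made up for illustration):

```turtle
# Before export (shown in Turtle syntax for readability): the blank node
# labeled _:eve occurs both as a subject and inside the lexical form of a
# cdt:List literal. The cdt: prefix IRI is a placeholder.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:eve ex:name "Eve" .
ex:s  ex:p    "[_:eve, 1]"^^cdt:List .
```

```turtle
# One conforming serialization: the serializer picks a single fresh
# identifier, here b0, and uses it consistently both inside and outside
# the composite literal.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:b0 ex:name "Eve" .
ex:s ex:p    "[_:b0, 1]"^^cdt:List .
```
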
hartig (Contributor):

I am copying the following three points from a now-outdated comment as they are still unresolved.

  1. While I generally agree with the new third bullet point that you have added here, this point feels somewhat detached from the context in which you are listing these bullet points. I mean, the context is that there is some substring "_:id" in the lexical form of some cdt:List or cdt:Map literal to be exported, but this bullet point does not talk at all about this substring / about id.
  2. Related to the previous point, within the new third bullet point you seem to assume systems that represent CDT literals internally based on their value form (how would such a system otherwise know that it is the same blank node B that is both inside and outside of CDT literals?). So, the issue that I see here is that it is not clear how the requirement of this bullet point can apply to systems that represent CDT literals internally based on their lexical form.
  3. Another potential problem regarding this new third bullet point: While the point focuses on cases in which the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of CDT literals, I wonder what a system should do for formats that do not have such a mechanism. Assume the dataset in a system contains the triple (b, :p, l) where b is a blank node and l is a cdt:List literal whose value (i.e., term list) contains b as well. If the dataset with this triple needs to be serialized into a format that does not provide such an aforementioned mechanism, then there is no way to serialize the triple without losing the connection between the blank node in the subject and the blank node inside the object literal, right? (See the sketch after this list.)
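
A minimal sketch of the dataset described in point 3 (names and prefix IRIs are made up; Turtle is used here only to have a concrete syntax):

```turtle
# The blank node _:b occurs as the subject of the triple and, by value,
# inside the cdt:List literal in the object position.
# The cdt: prefix IRI is a placeholder.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:b ex:p "[_:b, 42]"^^cdt:List .

# In a hypothetical format without explicit blank node labels, the subject
# could only be written as an anonymous node (e.g., []), so the connection
# to the blank node inside the literal would be lost.
```
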

kasei (Collaborator, Author):

Regarding point 3, yes, you'd lose the connection between the blank nodes. I don't know of any serialization formats that lack the ability to identify the bnodes, but I was trying to write this in a way that would be accepting of domain-specific serializations that might. For example, if you enforce restrictions on the data such that bnodes never appear outside of CDTs; or if your data model only uses bnodes in constrained ways such that they would never need to be named in the serialization (e.g. using them just as nodes to model data that could entirely be captured by Turtle-like [ :p1 [ :p2 :v ]] syntax). Maybe that's going too far in trying to support something there's no real evidence of?

@hartig (Contributor) commented May 31, 2024:

Now that you say it, I think yes, that is going too far. So, to address my third point here, I suggest changing the beginning of the third bullet point from "if the format..." to "assuming that the format...", and then perhaps adding a Note that briefly mentions that we are not aware of formats in which blank nodes cannot be named, and that if there were one, it might lead to the problem described in my third point here.

hartig (Contributor):

I have implemented the proposal of my previous comment in commit fdbf337

This addresses point 3 above. Points 1 and 2 remain.

@kasei kasei force-pushed the spec-import-export branch from c869df2 to fdbf337 on June 1, 2024 at 18:07
@hartig (Contributor) commented Jun 3, 2024

While the section about export requirements still needs more work, the one about import requirements is ready now. I would like to have at least the latter in the main branch before creating the SEP (at sparql-dev) and announcing the approach further, which I really would like to do now to leverage the current momentum.

Therefore, would you be okay if I merge Section 12.1 (motivation) and 12.2 (import requirements) now, with a TODO in the export requirements section, and then create another PR for the continued work on the latter?

@kasei (Collaborator, Author) commented Jun 3, 2024

> Therefore, would you be okay if I merge Section 12.1 (motivation) and 12.2 (import requirements) now, with a TODO in the export requirements section, and then create another PR for the continued work on the latter?

Yes. Go ahead and merge, and we can address any remaining issues with the import section in follow-up PRs. Thanks.

@hartig hartig merged commit ebb74e4 into main Jun 4, 2024
@hartig hartig deleted the spec-import-export branch June 4, 2024 07:13
@hartig hartig mentioned this pull request Jun 4, 2024
@hartig (Contributor) commented Jun 4, 2024

I have created #9 as the follow-up PR to continue working on the export requirements.

@hartig hartig mentioned this pull request Jun 4, 2024
hartig added a commit to hartig/jena that referenced this pull request Jun 4, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4