Add spec section on requirements for importing/exporting CDT values #2

Merged: hartig merged 20 commits into main from spec-import-export on Jun 4, 2024

Conversation

@kasei (Collaborator) commented Apr 2, 2024

I'm a bit unclear on the xref metadata in respec (maybe that needs populating with all the new references?).

@kasei kasei requested a review from hartig April 2, 2024 23:36
@hartig (Contributor) left a comment

I think it is a good start!! Yet, I have many comments about inconsistent or confusing wording of the text, as well as some more substantive issues around the two lists of bullet points at the beginning of the two normative sections (i.e., Sections 12.2 and 12.3).

kasei and others added 2 commits April 11, 2024 14:50
Co-authored-by: Olaf Hartig <ohartig@amazon.com>
Co-authored-by: Olaf Hartig <ohartig@amazon.com>
hartig added a commit to hartig/jena that referenced this pull request May 26, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
hartig added a commit to hartig/jena that referenced this pull request May 27, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
Comment on lines 3322 to 3324
However, using a Turtle parser not aware of the <a href="#list-datatype">cdt:List datatype</a> will result in an object literal containing the substring "_:eve"; that is, the literal will keep the exact lexical form as given in the Turtle snippet above (<a href="#load-renaming-blank-example"></a>).
Allowing multiple such composite values, from multiple sources, to be inserted into the same RDF Dataset (e.g., via two SPARQL `LOAD` updates) presents a challenge:
when a system interprets the literals obtained from its Turtle parser, the <a>term lists</a> that it creates by applying the <a href="#list-datatype-lex-to-value-mapping">cdt:List lexical-to-value mapping</a> will contain the same blank nodes (because, during the mapping process for each cdt:List literal, the <i><a>bnl2bn</a></i> function is applied to the same blank node identifier, `_:eve`).
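
To make the scenario concrete, here is a minimal hypothetical illustration (the ex: resource names, the literal contents, and the cdt: prefix IRI below are placeholders invented for this sketch, not taken from the spec):

```turtle
# Two independently authored Turtle documents that happen to use the same
# blank node label _:eve inside a cdt:List literal.
# The cdt: prefix IRI is a placeholder for the spec's actual namespace IRI.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

# Source A
ex:a ex:p "[_:eve, 1]"^^cdt:List .

# Source B, loaded later via a second SPARQL LOAD into the same dataset
ex:b ex:p "[_:eve, 2]"^^cdt:List .
```

If a cdt:List-aware system then applies the lexical-to-value mapping to both literals, the two resulting term lists contain the very same blank node, even though the two sources presumably meant unrelated resources.
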
hartig (Contributor):

While this text focuses on the case of a system that is aware of our CDT datatypes but uses a Turtle parser that is not (i.e., the system interprets the CDT literals after receiving them from the Turtle parser), there is also another issue that we don't mention here; namely for systems that do not support the cdt:List datatype at all. In the case of the example outlined in the text here, such a system would throw all the cdt:List literals from different sources together, without renaming _:eve in any of these literals. Hence, piping data through such a system and then into a cdt:List-aware system may cause differences in query results, compared to loading the data directly into the cdt:List-aware system. In other words, a system that is not aware of the cdt:List datatype may accidentally change the meaning of cdt:List literals collected from multiple sources (by making these literals all contain the same blank nodes). Do you agree that this is a problem?

kasei (Collaborator, Author):

I'm not sure I'd necessarily think of it as a "problem," but it is something users of CDTs need to be aware of.

Seems like you could probably think of somewhat similar examples of OWL-capable and non-OWL plain RDF stores where you add inconsistent assertions in the latter, and try to import into the former, and suddenly have bad data. That's not really a problem with OWL so much as a use case that can't work as intended when systems of varying capabilities are involved. (Maybe not the best comparison, but I think instructive…)

hartig (Contributor):

So, you are suggesting we add a Note to call out this issue?

Even if users of CDTs are aware of it, I am afraid that there is nothing they will be able to do to avoid the problem if they use a combination of systems that are and are not CDT-aware. That's somewhat unsatisfying, but I also don't see another way to address this issue (except for disallowing blank nodes in CDT literals).

I should also mention that this issue was one of the concerns raised in discussions of the approach at ESWC.

kasei (Collaborator, Author):

Yeah, understood. Possible paths forward are to have bnode support be optional, with clear documentation that supporting them would introduce incompatibility with non-CDT-supporting systems; or to just document the issue with strong warnings about using bnodes if you need interop between systems…

@hartig (Contributor) left a comment

I went over all outdated comments and added those that are still unresolved into the current version of the changes in this PR.

Comment on lines 3463 to 3477
<p>
Conforming implementations MUST process <a>cdt:List literals</a> and <a>cdt:Map literals</a> during export, replacing, in their lexical form, any substring _:<var>id</var>
matching the <a data-cite="TURTLE#grammar-production-BLANK_NODE_LABEL">`BLANK_NODE_LABEL`</a> production of the grammar
with a string "_:"+<var>b</var> where <var>b</var> is a blank node identifier, such that
</p>

<ul>
<li>within the scope of the document being serialized, all occurrences of <var>id</var> used as a blank node identifier (both inside and outside of composite values) are replaced with the same <var>b</var> value, and</li>
<li>within the scope of the document being serialized, two distinct <var>id</var> blank node identifiers are replaced with distinct <var>b</var> values, and</li>
<li>
if the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of cdt:List and cdt:Map literals (e.g., `_:b1` syntax in N-Triples and in Turtle, `rdf:nodeID` in RDF/XML),
and the blank node <var>B</var> is contained in the data to be serialized both inside and outside of composite value literals,
then the serializer for this document format MUST serialize <var>B</var> using the blank node identifier <var>b</var> (both inside and outside of composite values)
</li>
</ul>
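
As a reading aid for the requirement quoted above, here is a hypothetical before/after sketch, assuming a system that knows which blank node a label inside a composite literal denotes (the prefix IRIs, names, and the fresh label b0 are all made up for illustration):

```turtle
# Before export (shown in Turtle syntax for readability): the blank node
# labeled _:eve occurs both as a subject and inside the lexical form of a
# cdt:List literal. The cdt: prefix IRI is a placeholder.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:eve ex:name "Eve" .
ex:s  ex:p    "[_:eve, 1]"^^cdt:List .
```

```turtle
# One conforming serialization: the serializer picks a single fresh
# identifier, here b0, and uses it consistently both inside and outside
# the composite literal.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:b0 ex:name "Eve" .
ex:s ex:p    "[_:b0, 1]"^^cdt:List .
```
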
hartig (Contributor):

I am copying the following three points from a now-outdated comment as they are still unresolved.

  1. While I generally agree with the new third bullet point that you have added here, this point feels somewhat detached from the context in which you are listing these bullet points. I mean, the context is that there is some substring "_:id" in the lexical form of some cdt:List or cdt:Map literal to be exported, but this bullet point does not talk at all about this substring / about id.
  2. Related to the previous point, within the new third bullet point you seem to assume systems that represent CDT literals internally based on their value form (how would such a system otherwise know that it is the same blank node B that is both inside and outside of CDT literals?). So, the issue that I see here is that it is not clear how the requirement of this bullet point can apply to systems that represent CDT literals internally based on their lexical form.
  3. Another potential problem regarding this new third bullet point: While the point focuses on cases in which the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of CDT literals, I wonder what a system should do for formats that do not have such a mechanism. Assume the dataset in a system contains the triple (b, :p, l) where b is a blank node and l is a cdt:List literal whose value (i.e., term list) contains b as well. If the dataset with this triple needs to be serialized into a format that does not provide such an aforementioned mechanism, then there is no way to serialize the triple without losing the connection between the blank node in the subject and the blank node inside the object literal, right? (See the sketch after this list.)
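
A minimal sketch of the dataset described in point 3 (names and prefix IRIs are made up; Turtle is used here only to have a concrete syntax):

```turtle
# The blank node _:b occurs as the subject of the triple and, by value,
# inside the cdt:List literal in the object position.
# The cdt: prefix IRI is a placeholder.
@prefix ex:  <http://example.org/> .
@prefix cdt: <http://example.org/cdt#> .

_:b ex:p "[_:b, 42]"^^cdt:List .

# In a hypothetical format without explicit blank node labels, the subject
# could only be written as an anonymous node (e.g., []), so the connection
# to the blank node inside the literal would be lost.
```
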

kasei (Collaborator, Author):

Regarding point 3, yes, you'd lose the connection between the blank nodes. I don't know of any serialization formats that lack the ability to identify the bnodes, but I was trying to write this in a way that would be accepting of domain-specific serializations that might. For example, if you enforce restrictions on the data such that bnodes never appear outside of CDTs; or if your data model only uses bnodes in constrained ways such that they would never need to be named in the serialization (e.g. using them just as nodes to model data that could entirely be captured by Turtle-like [ :p1 [ :p2 :v ]] syntax). Maybe that's going too far in trying to support something there's no real evidence of?

@hartig (Contributor) commented May 31, 2024:

Now that you say it, I think yes, that is going too far. So, to address my third point here, I suggest changing the beginning of the third bullet point from "if the format..." to "assuming that the format...", and then perhaps adding a Note that briefly mentions that we are not aware of formats in which blank nodes cannot be named, and that if there were one, it might lead to the problem described in my third point here.

hartig (Contributor):

I have implemented the proposal of my previous comment in commit fdbf337

This addresses point 3 above. Points 1 and 2 remain.

@kasei kasei force-pushed the spec-import-export branch from c869df2 to fdbf337 on June 1, 2024 at 18:07
@hartig (Contributor) commented Jun 3, 2024

While the section about export requirements still needs more work, the one about import requirements is ready now. I would like to have at least the latter in the main branch before creating the SEP (at sparql-dev) and announcing the approach further, which I really would like to do now to leverage the current momentum.

Therefore, would you be okay if I merge Section 12.1 (motivation) and 12.2 (import requirements) now, with a TODO in the export requirements section, and then create another PR for the continued work on the latter?

@kasei (Collaborator, Author) commented Jun 3, 2024

> Therefore, would you be okay if I merge Section 12.1 (motivation) and 12.2 (import requirements) now, with a TODO in the export requirements section, and then create another PR for the continued work on the latter?

Yes. Go ahead and merge, and we can address any remaining issues with the import section in follow-up PRs. Thanks.

@hartig hartig merged commit ebb74e4 into main Jun 4, 2024
@hartig hartig deleted the spec-import-export branch June 4, 2024 07:13
@hartig hartig mentioned this pull request Jun 4, 2024
@hartig (Contributor) commented Jun 4, 2024

I have created #9 as the follow-up PR to continue working on the export requirements.

@hartig hartig mentioned this pull request Jun 4, 2024
hartig added a commit to hartig/jena that referenced this pull request Jun 4, 2024
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4