Add spec section on requirements for importing/exporting CDT values #2
Conversation
…rom SPARQL systems/datasets.
hartig
left a comment
I think it is a good start!! Yet, I have many comments about inconsistent or confusing wording of the text, as well as some more substantive issues around the two lists of bullet points at the beginning of the two normative sections (i.e., Section 12.2 and 12.3).
Co-authored-by: Olaf Hartig <ohartig@amazon.com>
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
spec/editors_draft.html
Outdated
However, using a Turtle parser not aware of the <a href="#list-datatype">cdt:List datatype</a> will result in an object literal containing the substring "_:eve"; that is, the literal will keep the exact lexical form as given in the Turtle snippet above (<a href="#load-renaming-blank-example"></a>).
Allowing multiple such composite values, from multiple sources, to be inserted into the same RDF Dataset (e.g., via two SPARQL `LOAD` updates) presents a challenge:
when a system interprets the literals obtained from its Turtle parser, the <a>term lists</a> that it creates by applying the <a href="#list-datatype-lex-to-value-mapping">cdt:List lexical-to-value mapping</a> will contain the same blank nodes (because, during the mapping process for each cdt:List literal, the <i><a>bnl2bn</a></i> function is applied to the same blank node identifier, `_:eve`).
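To make the conflation described above concrete, here is a minimal, non-normative Python sketch; `interpret_list_literal` and the dictionary standing in for *bnl2bn* are simplifications for illustration, not the spec's actual lexical-to-value mapping:

```python
import itertools
import re

_fresh = itertools.count()

def interpret_list_literal(lexical_form, bnl2bn):
    # Highly simplified stand-in for the cdt:List lexical-to-value mapping:
    # every `_:label` substring in the lexical form becomes a blank node,
    # obtained from the shared bnl2bn mapping.
    values = []
    for label in re.findall(r'_:[A-Za-z0-9]+', lexical_form):
        if label not in bnl2bn:
            bnl2bn[label] = f"bnode{next(_fresh)}"
        values.append(bnl2bn[label])
    return values

# Two literals obtained from two different Turtle files; both still contain the
# substring `_:eve` because the parser that produced them was not CDT-aware.
lit_from_file_a = '[_:eve, 1]'
lit_from_file_b = '[_:eve, 2]'

bnl2bn = {}  # one shared mapping, as when both literals are interpreted by the same system
value_a = interpret_list_literal(lit_from_file_a, bnl2bn)
value_b = interpret_list_literal(lit_from_file_b, bnl2bn)

print(value_a[0] == value_b[0])  # True: the two sources now share one blank node
```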
While this text focuses on the case of a system that is aware of our CDT datatypes but uses a Turtle parser that is not (i.e., the system interprets the CDT literals after receiving them from the Turtle parser), there is also another issue that we don't mention here; namely for systems that do not support the cdt:List datatype at all. In the case of the example outlined in the text here, such a system would throw all the cdt:List literals from different sources together, without renaming _:eve in any of these literals. Hence, piping data through such a system and then into a cdt:List-aware system may cause differences in query results, compared to loading the data directly into the cdt:List-aware system. In other words, a system that is not aware of the cdt:List datatype may accidentally change the meaning of cdt:List literals collected from multiple sources (by making these literals all contain the same blank nodes). Do you agree that this is a problem?
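A complementary sketch of this ordering effect: assuming (as in the commit message above) that a CDT-aware importer scopes blank node identifiers to the file they came from, loading the files directly keeps the two `_:eve` occurrences apart, whereas an intermediary that ignores the datatype passes the labels through unchanged. The helper name, the `_:f…` label scheme, and the cdt:List syntax shown are illustrative only:

```python
import itertools
import re

_counter = itertools.count()
LABEL = re.compile(r'_:[A-Za-z0-9]+')

def rename_labels_per_file(turtle_text):
    # Rename every blank node label consistently within one file, using labels
    # that are fresh across all files imported so far (file-scoped identifiers).
    mapping = {}
    def fresh(match):
        label = match.group()
        if label not in mapping:
            mapping[label] = f"_:f{next(_counter)}"
        return mapping[label]
    return LABEL.sub(fresh, turtle_text)

file_a = ':s :p "[_:eve, 1]"^^cdt:List .'
file_b = ':s :p "[_:eve, 2]"^^cdt:List .'

# Loading each file directly into a CDT-aware store keeps the two _:eve apart:
print(rename_labels_per_file(file_a))  # :s :p "[_:f0, 1]"^^cdt:List .
print(rename_labels_per_file(file_b))  # :s :p "[_:f1, 2]"^^cdt:List .

# A store that ignores the datatype passes both files through unchanged, so a
# CDT-aware system that later imports the combined data sees a single _:eve and
# the two literals end up denoting the same blank node.
```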
I'm not sure I'd necessarily think of it as a "problem," but it is something users of CDTs need to be aware of.
Seems like you could probably think of somewhat similar examples of OWL-capable and non-OWL plain RDF stores where you add inconsistent assertions in the latter, try to import into the former, and suddenly have bad data. That's not really a problem with OWL so much as a use case that can't work as intended when systems of varying capabilities are involved. (Maybe not the best comparison, but I think instructive…)
So, you are suggesting we add a Note to call out this issue?
Even if users of CDTs are aware of it, I am afraid there is nothing they can do to avoid the problem if they combine systems that are and are not CDT-aware. That's somewhat unsatisfying, but I also don't see another way to address this issue (except for disallowing blank nodes in CDT literals).
I should also mention that this issue was one of the concerns raised in discussions of the approach at ESWC.
Yeah, understood. Possible paths forward are to make bnode support optional, with clear documentation that supporting bnodes would introduce incompatibility with non-CDT-supporting systems; or to just document the issue, with strong warnings about using bnodes if you need interop between systems…
hartig
left a comment
I went over all outdated comments and added those that are still unresolved into the current version of the changes in this PR.
spec/editors_draft.html
Outdated
<p>
Conforming implementations MUST process <a>cdt:List literals</a> and <a>cdt:Map literals</a> during export, replacing, in their lexical form, any substring _:<var>id</var>
matching the <a data-cite="TURTLE#grammar-production-BLANK_NODE_LABEL">`BLANK_NODE_LABEL`</a> production of the grammar
with a string "_:"+<var>b</var> where <var>b</var> is a blank node identifier, such that
</p>
<ul>
  <li>within the scope of the document being serialized, all occurrences of <var>id</var> used as a blank node identifier (both inside and outside of composite values) are replaced with the same <var>b</var> value, and</li>
  <li>within the scope of the document being serialized, two distinct <var>id</var> blank node identifiers are replaced with distinct <var>b</var> values, and</li>
  <li>
    if the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of cdt:List and cdt:Map literals (e.g., `_:b1` syntax in N-Triples and in Turtle, `rdf:nodeID` in RDF/XML),
    and the blank node <var>B</var> is contained in the data to be serialized both inside and outside of composite value literals,
    then the serializer for this document format MUST serialize <var>B</var> using the blank node identifier <var>b</var> (both inside and outside of composite values)
  </li>
</ul>
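As a rough, non-normative illustration of the renaming these bullet points call for, here is a Python sketch; the `ExportScope` class, the simplified `BLANK_NODE_LABEL` regex (the actual production is more permissive), and the `_:b0`-style identifiers are assumptions made for the example:

```python
import itertools
import re

# Simplified approximation of the BLANK_NODE_LABEL production (the real grammar
# allows more characters, e.g. dots and additional Unicode ranges).
BLANK_NODE_LABEL = re.compile(r'_:[A-Za-z][A-Za-z0-9]*')

class ExportScope:
    # One instance per serialized document, so that the renaming is consistent
    # across all cdt:List/cdt:Map literals in that document (and, for formats
    # that can name blank nodes, reusable for the rest of the document too).
    def __init__(self):
        self._mapping = {}               # original id -> replacement "_:" + b
        self._fresh = itertools.count()

    def rename(self, blank_node_id):
        # Same id -> same b; distinct ids -> distinct b values.
        if blank_node_id not in self._mapping:
            self._mapping[blank_node_id] = f"_:b{next(self._fresh)}"
        return self._mapping[blank_node_id]

    def rewrite_lexical_form(self, lexical_form):
        return BLANK_NODE_LABEL.sub(lambda m: self.rename(m.group()), lexical_form)

scope = ExportScope()
print(scope.rewrite_lexical_form('[_:eve, _:adam]'))  # [_:b0, _:b1]
print(scope.rewrite_lexical_form('[_:eve]'))          # [_:b0]  (consistent within the document)
```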
I am copying the following three points from a now-outdated comment as they are still unresolved.
- While I generally agree with the new third bullet point that you have added here, this point feels somewhat detached from the context in which you are listing these bullet points. I mean, the context is that there is some substring "_:id" in the lexical form of some cdt:List or cdt:Map literal to be exported, but this bullet point does not talk at all about this substring / about id.
- Related to the previous point, within the new third bullet point you seem to assume systems that represent CDT literals internally based on their value form (how would such a system otherwise know that it is the same blank node B that is both inside and outside of CDT literals?). So, the issue that I see here is that it is not clear how the requirement of this bullet point can apply to systems that represent CDT literals internally based on their lexical form.
- Another potential problem regarding this new third bullet point: While the point focuses on cases in which the format of the document being serialized has a mechanism for explicitly identifying blank nodes outside of CDT literals, I wonder what a system should do for formats that do not have such a mechanism. Assume the dataset in a system contains the triple (b, :p, l) where b is a blank node and l is a cdt:List literal whose value (i.e., term list) contains b as well. If the dataset with this triple needs to be serialized into a format that does not provide such an aforementioned mechanism, then there is no way to serialize the triple without losing the connection between the blank node in the subject and the blank node inside the object literal, right? (See the sketch below.)
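For concreteness, the scenario in that third point could look like the following; the prefix, datatype IRI, and list syntax are placeholders, not the spec's exact definitions:

```python
# Hypothetical Turtle for the third point: the blank node _:b occurs both as the
# subject of the triple and inside the cdt:List literal. Turtle can name blank
# nodes, so the connection is preserved here.
turtle_with_named_bnode = """
@prefix : <http://example.org/> .

_:b  :p  "[_:b, 42]"^^<http://example.org/cdt/List> .
"""

# In a format without a mechanism for naming blank nodes outside of literals,
# the subject could only be written as an anonymous node, so the fact that it
# is the same node as the _:b inside the literal could no longer be expressed.
```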
Regarding point 3, yes, you'd lose the connection between the blank nodes. I don't know of any serialization formats that lack the ability to identify the bnodes, but I was trying to write this in a way that would be accepting of domain-specific serializations that might. For example, if you enforce restrictions on the data such that bnodes never appear outside of CDTs; or if your data model only uses bnodes in constrained ways such that they would never need to be named in the serialization (e.g. using them just as nodes to model data that could entirely be captured by Turtle-like [ :p1 [ :p2 :v ]] syntax). Maybe that's going too far in trying to support something there's no real evidence of?
Now that you say it, I think yes that is going too far. So, to address my third point here, I suggest to change the beginning of the third bullet point from "if the format..." to "assuming that the format...", and then perhaps add a Note that briefly mentions that we are not aware of formats in which blank nodes cannot be named and if there was one, it may lead to the problem described in my third point here.
I have implemented the proposal of my previous comment in commit fdbf337
This addresses point 3 above. Points 1 and 2 remain.
…e earlier version of the example
…s in Sec.12.3 - as per #2 (comment)
While the section about export requirements still needs more work, the one about import requirements is ready now. I would like to have at least the latter in the main branch before creating the SEP (at sparql-dev) and announcing the approach further, which I really would like to do now to leverage the current momentum. Therefore, would you be okay if I merge Section 12.1 (motivation) and 12.2 (import requirements) now, with a TODO in the export requirements section, and then create another PR for the continued work on the latter?
Yes. Go ahead and merge, and we can work through any remaining issues with the import section in follow-up PRs. Thanks.
I have created #9 as the follow-up PR to continue working on the export requirements.
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
I'm a bit unclear on the `xref` metadata in respec (maybe that needs populating with all the new references?).