Mistakes with IRI regular expressions #32

@dertseha

The regular-expression code in rdf/iri contains several mistakes:

  • iqueryRE uses ipath instead of iquery, which leaves iquery unused. Once iquery is actually used, further mistakes within it come to light:
    • iprivate contains an invalid regexp sequence: \x{F0000]-\x{FFFFD} should be \x{F0000}-\x{FFFFD}.
    • iquery wrongly uses "/?" as a sequence. It should be a choice, as in [\/\?].
  • iuserinfo is missing the colon character that the RFC allows. As a result, the IRI "https://user:pwd@example.com" cannot be parsed.
  • The h16 regular expression should allow 1 to 4 hex digits as per the RFC, rather than requiring exactly 4 hex digits.

As a side note, the example "http://r&#xE9;sum&#xE9;.example.org", used for testing normalization, is not a proper IRI string. According to chapter 1.4 of the RFC, the &#xE9; sequence is how non-US-ASCII characters are represented within a US-ASCII-only RFC text.
Taken literally, the first # makes the remainder a fragment, which is then invalid because of the second #.

I found these issues while extracting the package into a separate library, handling all the TODOs (which ended up as a large rework), and feeding in many samples from the RFC, especially those about resolving relative IRIs. See https://github.com/contomap/iri.
My rework is incompatible with your use here (different type & behaviour), which is why I am only collecting the mistakes I found in this issue.
