Peek ahead parser feature - more powerful lookahead capability #75

@mbeckerle

This feature is intended to be a more general solution to the problem of look-ahead when parsing. It is more powerful than the feature proposed in #66. That proposal, for a simple fixed-distance look-ahead capability, is too limited for many use cases.

We have use cases where a choice decision in the schema is driven by a field that appears later in the data, but not at a fixed offset. Rather, the field is in a later optional part of the message, and the distance to that optional part is also variable because of intervening optional parts. Getting access to this later field essentially requires the full generality of DFDL to describe how to access it.

This situation can be called the deep discriminator problem.

The proposal follows. This was pulled from the initial emails to the DFDL Workgroup discussing this item.

The property is dfdlx:peek, with values "yes" or "no". It is a property on a sequence model group, and it is compatible with a hidden group ref. Other properties are disallowed on a sequence with dfdl:hiddenGroupRef, but this one could be allowed on hidden sequences. That's actually a primary use case for this.

If "no", no behavior change (what things do now).

If "yes", the parse happens, including set variable assignments, infoset creation, etc. Then, at the end of the sequence, if the parse is successful, the position in the data stream is reset to where it was at the start of the peek sequence. If the parse fails, you backtrack to the enclosing point of uncertainty (PoU) as normal, and everything about the peek (when inside the PoU) is discarded.

This allows you to learn by parsing something in the data more than once. The first parse discovers something, which goes into the parser infoset (hidden or not) and into single-assignment DFDL variables. The second parse can then make use of what was learned.

This is sort of like backtracking at a PoU, but you don't undo anything except the position in the data stream.
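
The behavior described above can be sketched outside DFDL. This is only a minimal model in Python, not how any DFDL processor is implemented: `ParseState`, `parse_sequence`, and `point_of_uncertainty` are hypothetical names, the stream is byte-granular, and a plain dict stands in for DFDL variables. It shows that a successful peek restores only the position, while a failure inside an enclosing PoU discards everything the peek did.

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Raised when a parse step cannot consume the data it needs."""


@dataclass
class ParseState:
    data: bytes
    pos: int = 0                                    # current byte position in the stream
    infoset: dict = field(default_factory=dict)     # parsed values, hidden or not
    variables: dict = field(default_factory=dict)   # stand-in for DFDL variables

    def read_u8(self, name):
        """Parse one unsigned byte into the infoset under `name`."""
        if self.pos >= len(self.data):
            raise ParseError(f"no data left for {name}")
        value = self.data[self.pos]
        self.pos += 1
        self.infoset[name] = value
        return value


def parse_sequence(state, body, peek=False):
    """Run `body`; with peek=True, restore only the stream position on success."""
    start = state.pos
    body(state)            # a ParseError propagates to the enclosing PoU
    if peek:
        state.pos = start  # infoset entries and variable assignments are kept


def point_of_uncertainty(state, body):
    """On failure, undo everything done inside, including any peek."""
    saved = (state.pos, dict(state.infoset), dict(state.variables))
    try:
        body(state)
        return True
    except ParseError:
        state.pos, state.infoset, state.variables = saved
        return False


# Successful peek: knowledge is kept, position is rewound.
def hdr(s):
    s.variables["kind"] = s.read_u8("kind")

state = ParseState(bytes([7, 9]))
parse_sequence(state, hdr, peek=True)

# Failed peek inside a PoU: nothing from the peek survives.
def two_bytes(s):
    s.read_u8("a")
    s.read_u8("b")

short = ParseState(bytes([7]))
ok = point_of_uncertainty(short, lambda s: parse_sequence(s, two_bytes, peek=True))
```

After the first call, `state.pos` is back at 0 while `state.infoset` and `state.variables` still hold the value that was read. In the second case `ok` is false and `short` is exactly as it was before the PoU was entered.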

On unparsing, all data written while unparsing the infoset for a sequence with dfdlx:peek="yes" is discarded. Or maybe we can just say the infoset corresponding to a sequence with dfdlx:peek="yes" is not unparsed at all.

The example below is motivated by the fact that many data format specifications describe a number of records or messages, and they describe each message with a common set of initial fields. DFDL schema authors seem to want to follow this pattern of description so as to maintain the correspondence between the DFDL schema and the specification's layout of each record. However, the choice of record type in the schema is based, in a uniform way, on the initial fields of each record. This peek feature can be used to parse these initial fields once, use their values to guide a choice using dfdl:choiceDispatchKey expressions, and then parse the entire record. This style seems redundant, but it makes the DFDL expression paths to fields the natural ones, which match the record layouts in the specification document.

<group name="hdr">
   <!-- used to peek ahead 
          does not exactly match the header fields of the message as
          it skips over things it doesn't need.
    --> 
   <sequence>
      <element name="category" type="xs:unsignedByte" dfdl:trailingSkip="1"/>
       <!-- 
          notice this is a variable-length id string between category and subcategory fields 
          which we don't need, except to find the position of the subcategory field. 
        --> 
      <element name="idLen" type="xs:unsignedByte"/>
      <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
      <element name="subcategory" type="xs:unsignedByte" dfdl:leadingSkip="4"/>
   </sequence>
</group>

<!--
The main message element.
Uses hdr to look ahead for fields needed to do the choice by dispatch
--> 
<element name="message">
   <complexType>
       <sequence>
          <sequence dfdl:hiddenGroupRef="hdr" dfdlx:peek="yes"/>
          <choice dfdl:choiceDispatchKey="{ fn:concat('X', string(hdr/category), '.', string(hdr/subcategory)) }">
              <element name="pos" dfdl:choiceBranchKey="X1.1" type="tns:positionReport"/>
               ... other message types ...
              <element name="txt" dfdl:choiceBranchKey="X11.7" type="tns:chatMessage"/>
           </choice>
        </sequence>
    </complexType>
</element>

<!--
These are the message layouts and they redundantly express the header fields that are common
to all messages, so as to exactly match the layouts in the original specification document which have
this same redundancy, and express in prose that the first 7 fields are common to all messages.

This style misses an opportunity to create a reusable group for the sequence of the first 7 fields
as a way of capturing that the specification says the first 7 fields are always the same. 

But this is, nevertheless, what some schema authors seem to prefer. 
--> 
<complexType name="positionReport">
    <sequence>
        <element name="category" type="xs:unsignedByte"/>
        <element name="priority" type="tns:bit"/> 
        <element name="specialProcessingIndicator" type="tns:bits7"/> 
        <element name="idLen" type="xs:unsignedByte"/>
        <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
        <element name="msgLength" type="xs:unsignedInt"/>
        <element name="subcategory" type="xs:unsignedByte"/>
        <element name="latitude" type="xs:double"/>
        <element name="longitude" type="xs:double"/>
        <element name="altitude" type="xs:double"/>
        <element name="course" type="xs:double"/>
        <element name="speed" type="xs:double"/>
     </sequence>
</complexType>

<complexType name="chatMessage">
    <sequence>
        <element name="category" type="xs:unsignedByte"/>
        <element name="priority" type="tns:bit"/> 
        <element name="specialProcessingIndicator" type="tns:bits7"/> 
        <element name="idLen" type="xs:unsignedByte"/>
        <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
        <element name="msgLength" type="xs:unsignedInt"/>
        <element name="subcategory" type="xs:unsignedByte"/>
        <element name="text" type="xs:string" minOccurs="1" maxOccurs="7"/>
     </sequence>
</complexType>
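
To make the example concrete, here is a rough Python walk-through of what the hdr peek would compute on sample data. The function name `peek_dispatch_key`, the sample bytes, and the `BRANCHES` table are all made up for illustration. It also assumes the skips are measured in bytes and that the id string uses one byte per character, neither of which the schema above states.

```python
def peek_dispatch_key(data: bytes, start: int = 0) -> tuple[str, int]:
    """Mimic the hdr group: read category and subcategory, then rewind."""
    pos = start
    category = data[pos]
    pos += 1
    pos += 1             # dfdl:trailingSkip="1": priority + specialProcessingIndicator
    id_len = data[pos]
    pos += 1
    pos += id_len        # the id string, needed only to locate what follows
    pos += 4             # dfdl:leadingSkip="4": msgLength
    subcategory = data[pos]
    key = f"X{category}.{subcategory}"
    return key, start    # peek: the position goes back to where it started


# Hypothetical dispatch table mirroring the dfdl:choiceBranchKey values.
BRANCHES = {"X1.1": "pos", "X11.7": "txt"}

# category=1, flags byte, idLen=3, id="abc", msgLength=40, subcategory=1
position_msg = bytes([1, 0x80, 3]) + b"abc" + (40).to_bytes(4, "big") + bytes([1])
# category=11, flags byte, idLen=0, empty id, msgLength=0, subcategory=7
chat_msg = bytes([11, 0, 0]) + bytes(4) + bytes([7])

pos_key, pos_after = peek_dispatch_key(position_msg)
chat_key, chat_after = peek_dispatch_key(chat_msg)
```

The two messages have their subcategory at different offsets (10 and 7) because the id lengths differ, which is the situation a fixed-distance look-ahead cannot handle. In both cases the position returned is 0, so the chosen branch parses the whole record again from its first field.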

Implementations could put a limit on how far ahead you can peek. But a minimum of, say, 512 bytes, or maybe a bit bigger, makes sense, I think. That would be enough for every use case I have.
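
One way such a limit might be enforced is sketched below in Python. The names `BoundedPeek` and `PeekLimitExceeded` are hypothetical, and whether exceeding the limit should be a processing error or something else is not decided by this proposal. The 512 is just the figure suggested above.

```python
PEEK_LIMIT_BYTES = 512  # hypothetical implementation-defined limit


class PeekLimitExceeded(Exception):
    """Raised when a peek tries to look further ahead than allowed."""


class BoundedPeek:
    """Reader that refuses to move more than `limit` bytes past the peek start."""

    def __init__(self, data: bytes, start: int, limit: int = PEEK_LIMIT_BYTES):
        self.data = data
        self.start = start
        self.pos = start
        self.limit = limit

    def read(self, n: int) -> bytes:
        end = self.pos + n
        if end - self.start > self.limit:
            raise PeekLimitExceeded(
                f"peek would reach {end - self.start} bytes ahead, limit is {self.limit}"
            )
        if end > len(self.data):
            raise EOFError("not enough data to peek")
        chunk = self.data[self.pos:end]
        self.pos = end
        return chunk


peek = BoundedPeek(bytes(1024), start=100)
first = peek.read(512)   # exactly at the limit, still allowed
try:
    peek.read(1)         # one byte past the limit
    exceeded = False
except PeekLimitExceeded:
    exceeded = True
```

The limit is measured from where the peek began, not from the start of the data, so it bounds how much of the stream has to stay buffered for the rewind.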

I believe current restrictions in DFDL to ensure forward progress when parsing are sufficient to make it impossible to delay parsing forever with this. I.e., parsing can take a long time, but it still has to terminate (at least in theory, if there is enough memory for a big infoset).

I think this dfdlx:peek has some nice properties.

  • Pro: This is the most important thing: No specialized constructs for looking ahead. Just use DFDL to learn about the data, save it in variables or a piece of infoset that you can navigate with expressions to utilize the knowledge.

  • Pro: Composition properties are good. Nothing new to learn. I can think of no impact on backtracking or any other aspects.

  • Pro: Pretty cheap to implement, so long as the amount you can peek ahead is reasonably bounded.

  • Pro: Synergistic with existing things like newVariableInstance and hidden groups to capture learning from a peek ahead.

  • Con: Huge generality. Peeking ahead with a sequence with really rich sub-structure, PoUs and backtracking inside it, etc. That's all enabled by this feature, but none of the use cases I have need anything like that level of generality. This is one of those things where the stuff people will invent with it is unanticipated.

Labels: DFDL 2.0 (for issues associated with DFDL v2.0, the next major revision), experimental