Peek ahead parser feature - more powerful lookahead capability

This feature is intended to be a more general solution to the problem of look-ahead when parsing. It is more powerful than the feature proposed in https://github.com/OpenGridForum/DFDL/issues/66 . That proposal, a simple fixed-distance look-ahead capability, is too limited for many use cases.

We have use cases where a choice decision in the schema is driven by a field that appears later in the data, but not at a fixed offset. Rather, the field is in a later optional part of the message, and the distance to that optional part also varies because of intervening optional parts. Getting access to this later field essentially requires the full generality of DFDL to describe how to reach it.

This situation can be called the _deep discriminator problem_.

The proposal follows. This was pulled from the initial emails to the DFDL Workgroup discussing this item. 

Property `dfdlx:peek="yes/no"`. Property on a sequence model group. Compatible with a hidden group ref. Unlike other properties, which are disallowed on a sequence with dfdl:hiddenGroupRef, this could be allowed on hidden sequences. That's actually a primary use case for this. 

If "no", no behavior change (what things do now). 

If "yes", the parse happens, including set variable assignments, infoset creation, etc. Then at the end of the sequence if the parse is successful the position in the data stream is reset to where it was at the start of the peek sequence. If the parse fails you backtrack to the enclosing PoU as normal and everything about the peek (when inside the PoU) is discarded. 

This allows you to learn by parsing something in the data more than once. The first parse discovers information, which goes into the parser infoset (hidden or not) and into single assignments to DFDL variables. The second parse can then make use of what was learned.

This is sort of like backtracking at a PoU, but you don't undo anything except the position in the data stream. 
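A rough Python model of these semantics, as I read them, might look like the following. All names here (`ParseState`, `parse_sequence`, `point_of_uncertainty`) are hypothetical and exist only for illustration; this is not how any DFDL implementation is structured.

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Raised when a parse step fails."""


@dataclass
class ParseState:
    data: bytes
    pos: int = 0                                    # byte position in the data stream
    infoset: dict = field(default_factory=dict)     # elements created so far
    variables: dict = field(default_factory=dict)   # variable assignments

    def read_u8(self, name):
        """Parse one unsigned byte into the infoset and advance the position."""
        if self.pos >= len(self.data):
            raise ParseError(f"out of data while parsing {name}")
        self.infoset[name] = self.data[self.pos]
        self.pos += 1
        return self.infoset[name]


def parse_sequence(state, body, peek=False):
    """Run body(state); with peek=True, restore only the position on success."""
    start = state.pos
    body(state)            # a ParseError propagates to the enclosing PoU
    if peek:
        state.pos = start  # infoset and variables are deliberately kept


def point_of_uncertainty(state, body):
    """On failure, discard everything done inside, including any peek."""
    saved = (state.pos, dict(state.infoset), dict(state.variables))
    try:
        body(state)
        return True
    except ParseError:
        state.pos, state.infoset, state.variables = saved
        return False


def header(st):
    st.read_u8("category")
    st.variables["kind"] = st.read_u8("subcategory")


# Successful peek: position is reset, learned values survive.
ok_state = ParseState(bytes([1, 7, 42]))
parse_sequence(ok_state, header, peek=True)

# Failed peek inside a PoU: everything is rolled back.
bad_state = ParseState(bytes([1]))
succeeded = point_of_uncertainty(
    bad_state, lambda st: parse_sequence(st, header, peek=True)
)
```

After the successful peek, `ok_state.pos` is back at 0 while the infoset and the variable still hold what was parsed. After the failed one, `bad_state` is exactly as it was before the PoU.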

On unparsing, all data written while unparsing the infoset for a sequence with dfdlx:peek="yes" is discarded. Or maybe we can just say the infoset corresponding to a sequence with dfdlx:peek="yes" is not unparsed at all. 

This example is motivated by the fact that many data format specifications describe a number of records or messages, each with a common set of initial fields. DFDL schema authors seem to want to follow this pattern of description so that the DFDL schema corresponds to the specification's layout of each record. However, the choice of record type in the schema is driven, in a uniform way, by those initial fields. The peek feature can be used to parse the initial fields once, use their values to guide a choice via a `dfdl:choiceDispatchKey` expression, and then parse the entire record. This style looks redundant, but it keeps the DFDL expression paths to fields natural, matching the record layouts in the specification document.

```
<group name="hdr">
    
   <sequence>
      <element name="category" type="xs:unsignedByte" dfdl:trailingSkip="1"/>
        
      <element name="idLen" type="xs:unsignedByte"/>
      <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
      <element name="subcategory" type="xs:unsignedByte" dfdl:leadingSkip="4"/>
   </sequence>
</group>

 
<element name="message">
   <complexType>
       <sequence>
          <sequence dfdl:hiddenGroupRef="hdr" dfdlx:peek="yes"/>
          <choice dfdl:choiceDispatchKey="{ fn:concat('X', string(hdr/category), '.', string(hdr/subcategory)) }">
              <element name="pos" dfdl:choiceBranchKey="X1.1" type="tns:positionReport"/>
               ... other message types ...
              <element name="txt" dfdl:choiceBranchKey="X11.7" type="tns:chatMessage"/>
           </choice>
        </sequence>
    </complexType>
</element>

 
<complexType name="positionReport">
    <sequence>
        <element name="category" type="xs:unsignedByte"/>
        <element name="priority" type="tns:bit"/> 
        <element name="specialProcessingIndicator" type="tns:bits7"/> 
        <element name="idLen" type="xs:unsignedByte"/>
        <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
        <element name="msgLength" type="xs:unsignedInt"/>
        <element name="subcategory" type="xs:unsignedByte"/>
        <element name="latitude" type="xs:double"/>
        <element name="longitude" type="xs:double"/>
        <element name="altitude" type="xs:double"/>
        <element name="course" type="xs:double"/>
        <element name="speed" type="xs:double"/>
     </sequence>
</complexType>

<complexType name="chatMessage">
    <sequence>
        <element name="category" type="xs:unsignedByte"/>
        <element name="priority" type="tns:bit"/> 
        <element name="specialProcessingIndicator" type="tns:bits7"/> 
        <element name="idLen" type="xs:unsignedByte"/>
        <element name="id" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{ ../idLen }"/>
        <element name="msgLength" type="xs:unsignedInt"/>
        <element name="subcategory" type="xs:unsignedByte"/>
        <element name="text" type="xs:string" minOccurs="1" maxOccurs="7"/>
     </sequence>
</complexType>
```
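To make the intended behavior of the `hdr` peek concrete, here is a small Python sketch that walks the same fields over a sample byte string and builds the dispatch key. The function names, the sample bytes, and the ASCII encoding of `id` are my assumptions for illustration; the schema above does not state an encoding.

```python
def peek_header(data, pos):
    """Parse the hdr fields starting at pos and return (fields, pos).

    The returned position equals the starting position, which models the
    reset at the end of a dfdlx:peek="yes" sequence.
    """
    p = pos
    category = data[p]
    p += 1
    p += 1                               # trailingSkip="1": priority + specialProcessingIndicator
    id_len = data[p]
    p += 1
    ident = data[p:p + id_len].decode("ascii")   # encoding is an assumption
    p += id_len
    p += 4                               # leadingSkip="4": msgLength
    subcategory = data[p]
    fields = {"category": category, "idLen": id_len,
              "id": ident, "subcategory": subcategory}
    return fields, pos


def dispatch_key(fields):
    """Mirror of the choiceDispatchKey expression: 'X' + category + '.' + subcategory."""
    return "X" + str(fields["category"]) + "." + str(fields["subcategory"])


# Hypothetical chat message: category 11, flags byte, id "abc",
# msgLength 20, subcategory 7, then the text payload.
message = (bytes([11, 0x80, 3]) + b"abc"
           + (20).to_bytes(4, "big") + bytes([7]) + b"hello")

fields, pos = peek_header(message, 0)
key = dispatch_key(fields)
print(key, pos)   # prints: X11.7 0
```

The key selects the `txt` branch, and because the position is still 0 the `chatMessage` type then parses the whole record, including the header fields, from the start.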

Implementations could put a limit on how far ahead you can peek. But a minimum of say, 512 bytes or maybe a bit bigger makes sense I think. That would be enough for every use case I have. 

I believe current restrictions in DFDL to ensure forward progress when parsing are sufficient to make it impossible to delay parsing forever with this. I.e., parsing can take a long time, but it still has to terminate (at least in theory, if there is enough memory for a big infoset). 

I think this dfdlx:peek has some nice properties.

- Pro: This is the most important thing: No specialized constructs for looking ahead. Just use DFDL to learn about the data, save it in variables or a piece of infoset that you can navigate with expressions to utilize the knowledge. 

- Pro: Composition properties are good. Nothing new to learn. I can think of no impact on backtracking or any other aspects. 

- Pro: Pretty cheap to implement, so long as the amount you can peek ahead is reasonably bounded.

- Pro: Synergistic with existing things like newVariableInstance and hidden groups to capture learning from a peek ahead. 

- Con: Huge generality. Peeking ahead with a sequence that has really rich sub-structure, PoUs and backtracking inside it, etc. is all enabled by this feature, but none of the use cases I have need anything like that level of generality. This is one of those things where what people will invent with it is unanticipated.
