
XMLLexer

do- edited this page Nov 16, 2021 · 55 revisions

Description

XMLLexer is a transform

  • from a Readable stream representing the content of an XML (HTML and so on) document;
  • to an object mode stream of strings representing its lexemes,

where by a lexeme we mean either

  • a text node
  • or a tag delimited by angle brackets,
    • including comments, processing instructions and all the odd <! constructs inherited from SGML
      • and yes, <![CDATA[ sections belong here too.

Usage

const lexer = XMLLexer ({...options})

someInputStream.pipe (lexer).pipe (someObjectOutputStream)

// or

lexer.on  ('data', s => console.log (s))
lexer.end ('<root>...</root>')

Options

| Name | Default | Description |
| --- | --- | --- |
| `maxLength` | `1 << 20` | Maximum lexeme length (in bytes or characters) |

Lexeme types

XMLLexer splits the incoming stream into strings matching the following patterns:

| Template | Description | Note |
| --- | --- | --- |
| `<? ... ?>` | Processing instruction | or prolog `<?xml ... ?>` |
| `<!-- ... -->` | Comment | |
| `<! ... [ ... [ ... ]]>` | Conditional, complex doctype | or `<![CDATA[ ... ]]>` |
| `<! ... [ ... ]>` | Medium doctype | |
| `<! ... >` | Doctype, atomic | |
| `< ... >` | Element tag (opening, closing or both) | |
| `...` | Text node | |
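
To make the patterns above concrete, here is a minimal sketch that names the pattern a single, already split lexeme matches. The function `lexemeKind` and the labels it returns are hypothetical and not part of XMLLexer; the checks simply follow the table from top to bottom.

```javascript
// Hypothetical helper (not part of XMLLexer): tells which template
// from the table above an already split lexeme matches.
function lexemeKind (s) {
  if (!s.startsWith ('<'))   return 'text'
  if (s.startsWith ('<?'))   return 'processing instruction'
  if (s.startsWith ('<!--')) return 'comment'
  if (s.startsWith ('<!')) {
    // the number of closing square brackets before '>' tells the variants apart
    if (s.endsWith (']]>'))  return 'conditional, complex doctype or CDATA'
    if (s.endsWith (']>'))   return 'medium doctype'
    return 'atomic doctype'
  }
  return 'element tag'
}

console.log (lexemeKind ('<![CDATA[ 1 < 2 ]]>'))
```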

Implementation

Members

| Name | Type | Description |
| --- | --- | --- |
| `bodyString` | String | What's left unparsed |
| `state` | int | Current state |
| `start` | int | Start of the current lexeme in the buffer |
| `awaited` | Buffer | Enclosing sequence |

Constants

States (state values)

XMLLexer is basically a finite state machine with the following states:

| Constant | Body from `start` | Description |
| --- | --- | --- |
| `ST_TEXT` | unknown or anything but `<` | Find the nearest `<`, slice the text |
| `ST_LT` | `<` | Look ahead for the next byte to switch to `ST_LT_X` (if `!`) or `ST_TAG` (otherwise) |
| `ST_LT_X` | `<!` | Look ahead for the next byte to switch to `ST_TAG` with awaited `-->` (if `-`) or `ST_TAG_X` (otherwise) |
| `ST_TAG` | `<...` | Scan for `>`, find the one preceded by the awaited sequence, slice the tag |
| `ST_TAG_X` | `<!...` | Scan through all bytes, adjust awaited whenever `[` occurs, watch for `>`, finally slice the tag |

The following transitions are possible:

| From | To | Awaited | Explanation |
| --- | --- | --- | --- |
| `ST_TEXT` | `ST_LT` | | `<` occurred, buffered text is given out (if any) |
| `ST_TEXT` | `ST_TEXT` | | some other char occurred |
| `ST_LT` | `ST_LT_X` | | `!` occurred |
| `ST_LT` | `ST_TAG` | `?>` | `?` occurred |
| `ST_LT` | `ST_TAG` | `>` | neither `!` nor `?`, normal tag |
| `ST_LT_X` | `ST_TAG` | `-->` | `-` occurred, must be a comment (unless broken) |
| `ST_LT_X` | `ST_TAG_X` | `>` / `]>` / `]]>` | counting `[`s |
| `ST_TAG` | `ST_TEXT` | | `>` occurred and the previous characters match: the tag is given out |
| `ST_TAG_X` | `ST_TEXT` | | `>` occurred and the previous characters match: the tag is given out |
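
The states and transitions above can be sketched as a small loop. This is a simplified illustration, not the actual implementation: it assumes the whole document is available as one string (no chunks, no Buffer, no `maxLength` check), and the names `lex` and `giveOutTag` are invented for this example. Only the `ST_*` constant names come from the tables.

```javascript
const ST_TEXT = 0, ST_LT = 1, ST_LT_X = 2, ST_TAG = 3, ST_TAG_X = 4

// Hypothetical, non-streaming sketch of the state machine described above
function lex (src) {
  const out = []
  let state = ST_TEXT, start = 0, awaited = '', pos = 0

  // emit the tag ending at pos and go back to ST_TEXT
  const giveOutTag = () => {
    out.push (src.slice (start, pos + 1))
    start = pos + 1
    state = ST_TEXT
  }

  while (pos < src.length) {
    const c = src [pos]
    switch (state) {
      case ST_TEXT:
        if (c === '<') {
          if (pos > start) out.push (src.slice (start, pos)) // buffered text, if any
          start = pos
          state = ST_LT
        }
        pos ++
        break
      case ST_LT:   // body is '<': look at the next char, consume it only if '!'
        if (c === '!') { state = ST_LT_X; pos ++ }
        else { awaited = c === '?' ? '?' : ''; state = ST_TAG }
        break
      case ST_LT_X: // body is '<!': comment or some other SGML construct
        if (c === '-') { awaited = '--'; state = ST_TAG }
        else { awaited = ''; state = ST_TAG_X }
        break
      case ST_TAG:  // '>' closes the tag only when preceded by awaited
        if (c === '>' && src.endsWith (awaited, pos)) giveOutTag ()
        pos ++
        break
      case ST_TAG_X: // same, but each '[' lengthens awaited, up to ']]'
        if (c === '[') awaited = awaited === '' ? ']' : ']]'
        else if (c === '>' && src.endsWith (awaited, pos)) giveOutTag ()
        pos ++
        break
    }
  }

  // trailing text; an unterminated tag is silently dropped in this sketch
  if (state === ST_TEXT && start < src.length) out.push (src.slice (start))
  return out
}

console.log (lex ('<?xml version="1.0"?><!-- c --><a>t<![CDATA[x > y]]></a>'))
```

Note that the `>` inside the CDATA section does not end the lexeme, because at that point the awaited sequence is `]]`. A streaming lexer would instead keep an unterminated tag in its buffer until the next chunk arrives.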

Enclosing sequences (awaited values)

Possible enclosing sequences (with the trailing > chopped off) are stored as Buffer instances:

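| Name | Value | Description |
| --- | --- | --- |
| `CL_DEAFULT` | `''` | default |
| `CL_PI` | `?` | for `<?` |
| `CL_COMMENT` | `--` | for `<!-` |
| `CL_SQ_1` | `]` | if `[` occurred in `ST_TAG_X` state and `CL_DEAFULT` awaited |
| `CL_SQ_2` | `]]` | if `[` occurred in `ST_TAG_X` state and `CL_SQ_1` awaited |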
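
As an illustration of how such sequences can be checked, the sketch below compares the bytes immediately preceding a `>` with the awaited Buffer. It assumes the body is held as a Buffer and is passed in explicitly; the real `isClosing (pos)` is a method and may work differently. The constant names and values are copied from the table above.

```javascript
// Values copied from the table above ('>' itself is not stored)
const CL_DEAFULT = Buffer.from ('')
const CL_PI      = Buffer.from ('?')
const CL_COMMENT = Buffer.from ('--')
const CL_SQ_1    = Buffer.from (']')
const CL_SQ_2    = Buffer.from (']]')

// Hypothetical stand-alone variant of isClosing:
// do the bytes of `body` right before `pos` match `awaited`?
function isClosing (body, pos, awaited) {
  const from = pos - awaited.length
  if (from < 0) return false
  return body.subarray (from, pos).equals (awaited)
}

const comment = Buffer.from ('<!-- x -->')
console.log (isClosing (comment, comment.indexOf ('>'), CL_COMMENT))
```

With an empty awaited sequence the comparison is always true, which matches the normal tag case where any `>` closes the lexeme.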

Methods

| Name, Params | Type | Description |
| --- | --- | --- |
| `append (chunk)` | | Append this chunk to the body |
| `slice ()` | String | Find a lexeme at `start`, move `start` and return the lexeme found, or null |
| `trim ()` | | Replace the buffer with its slice from `start` |
| `indexOfLt ()` | int | `indexOf ('<', this.start)` |
| `indexOfGt (fromPos)` | int | `indexOf ('>', fromPos)` |
| `charCodeAt (pos)` | int | As in String |
| `isClosing (pos)` | boolean | Do the bytes preceding `pos` match `awaited`? |
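
How `append`, `slice` and `trim` might cooperate across chunk boundaries can be shown with a deliberately reduced stand-in built on Node's `stream.Transform`. The class `MiniLexer` is hypothetical: it only knows text nodes and plain `<...>` tags, keeps its body as a string, and has none of the states described above. It is meant to illustrate the division of labour between the three methods, not to reproduce XMLLexer.

```javascript
const {Transform} = require ('stream')
const {StringDecoder} = require ('string_decoder')

// Hypothetical, reduced stand-in: text nodes and plain <...> tags only
class MiniLexer extends Transform {

  constructor (options = {}) {
    super ({...options, readableObjectMode: true})
    this.body = ''
    this.start = 0
    this.decoder = new StringDecoder ('utf8') // keeps split multi-byte chars intact
  }

  // append this chunk to the body
  append (chunk) {
    this.body += this.decoder.write (chunk)
  }

  // find a lexeme at start, move start, return the lexeme or null
  slice () {
    const {body, start} = this
    if (start >= body.length) return null
    if (body [start] === '<') {
      const gt = body.indexOf ('>', start)
      if (gt < 0) return null          // tag not complete yet
      this.start = gt + 1
      return body.slice (start, gt + 1)
    }
    const lt = body.indexOf ('<', start)
    if (lt < 0) return null            // text may continue in the next chunk
    this.start = lt
    return body.slice (start, lt)
  }

  // replace the body with its slice from start
  trim () {
    this.body = this.body.slice (this.start)
    this.start = 0
  }

  _transform (chunk, encoding, callback) {
    this.append (chunk)
    for (let s; (s = this.slice ()) !== null;) this.push (s)
    this.trim ()
    callback ()
  }

  _flush (callback) {
    this.body += this.decoder.end ()
    if (this.body.length > 0) this.push (this.body) // whatever is left
    callback ()
  }

}

const lexer = new MiniLexer ()
lexer.on ('data', s => console.log (s))
lexer.write ('<root>te')
lexer.end ('xt</root>')
```

The text node split between the two `write`/`end` calls is only given out once the following `<` arrives, which is why `slice ()` returns null rather than a partial lexeme.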
