XMLLexer

Description

XMLLexer is an asynchronous XML parser of the lowest level, lower than SAX. It only splits incoming text into fragments (lexemes):

- text nodes
- tags of all kinds, delimited by angle brackets
  - including comments, processing instructions and all the weird `<!` stuff inherited from SGML
    - and, yes, `<![CDATA[` belongs here.

In application code, XMLLexer can be used:

- directly (which makes sense for huge XML files with primitive structure);
- as a data source for SAXEventEmitter.

Technically, it is a transform:

- from a Readable stream representing the content of an XML (HTML and so on) document;
- to an object mode stream of strings (primitive ones, not wrapped).
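
To illustrate that shape (bytes in, primitive strings out), here is a minimal sketch. `TinyLexer` is a hypothetical name, not part of the library, and it only understands `<` and `>`, so comments, CDATA sections and doctypes are beyond it. It is meant to show the stream plumbing, not the real parsing rules.

```javascript
const { Transform } = require('node:stream')
const { StringDecoder } = require('node:string_decoder')

// Hypothetical, simplified stand-in for XMLLexer (assumption: only `<` and `>`
// matter, which is NOT true for comments, CDATA or doctypes)
class TinyLexer extends Transform {
  constructor (options = {}) {
    // the writable side accepts bytes, the readable side emits primitive strings
    super({ readableObjectMode: true })
    this.decoder = new StringDecoder(options.encoding || 'utf8')
    this.body = '' // what is left unparsed
  }

  _transform (chunk, encoding, callback) {
    this.body += typeof chunk === 'string' ? chunk : this.decoder.write(chunk)
    this.parse()
    callback()
  }

  _flush (callback) {
    this.body += this.decoder.end()
    if (this.body.length > 0) this.push(this.body) // the rest, if any
    this.body = ''
    callback()
  }

  parse () {
    for (;;) {
      if (this.body.startsWith('<')) {
        const gt = this.body.indexOf('>')
        if (gt < 0) return // incomplete tag: wait for the next chunk
        this.push(this.body.slice(0, gt + 1))
        this.body = this.body.slice(gt + 1)
      } else {
        const lt = this.body.indexOf('<')
        if (lt < 0) return // the text node may continue in the next chunk
        this.push(this.body.slice(0, lt))
        this.body = this.body.slice(lt)
      }
    }
  }
}
```

Note that a lexeme split across two incoming chunks is still emitted as one string, because the unparsed remainder is kept between `_transform` calls.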

Usage

const lexer = new XMLLexer ({...options})

someInputStream.pipe (lexer)
for await (const tag of lexer) if (/* the tag is OK */) {
  // do something with the tag
}

// or

someInputStream.pipe (lexer).pipe (someObjectOutputStream)

// or

lexer.on  ('data', s => console.log (s))
lexer.end ('<root>...</root>')

Options

Name	Default	Description
maxLength	1 << 20	Maximum lexeme length, characters
encoding	utf8	Encoding for output lexemes, used with binary input

Lexeme types

XMLLexer splits the incoming stream into strings matching the following patterns:

Template	Description	Note
`<?` ... `?>`	Processing instruction	or prolog `<?xml` ... `?>`
`<!--` ... `-->`	Comment
`<!` ... `[` ... `[` ... `]]>`	Conditional, complex doctype	or `<![CDATA[` ... `]]>`
`<!` ... `[` ... `]>`	Medium doctype
`<!` ... `>`	Doctype, atomic
`<` ... `>`	Element tag (opening, closing or both)
...	Text node
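
Application code that consumes the lexemes can tell these categories apart by prefix alone. The helper below is a hypothetical sketch (neither `lexemeType` nor its return values are part of XMLLexer); it checks the prefixes in the same order as the table above, most specific first.

```javascript
// Hypothetical helper, not part of XMLLexer: names the kind of lexeme
// by its leading characters. The returned labels are made up for this example.
function lexemeType (s) {
  if (!s.startsWith('<')) return 'text'
  if (s.startsWith('<?')) return 'processing instruction'
  if (s.startsWith('<!--')) return 'comment'
  if (s.startsWith('<![CDATA[')) return 'cdata'
  if (s.startsWith('<!')) return 'doctype or other declaration'
  return 'element tag'
}
```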

Implementation

Members

Name	Type	Description
body	string	What's left unparsed
beforeBody	BigInt	Number of chars or bytes already consumed before the current `body` contents
state	int	Current state
position	int	Last position analyzed for `[` in state `ST_TAG_X`
awaited	Buffer	Enclosing sequence

Constants

States (`state` values)

XMLLexer is basically a finite state machine with the following states:

Constant	Body from `start`	Description
ST_TEXT	unknown or anything but `<`	Find the nearest `<`, slice the text
ST_LT	`<`	Look ahead at the next byte to switch to `ST_LT_X` (if `!`) or `ST_TAG` (otherwise)
ST_LT_X	`<!`	Look ahead at the next byte to switch to `ST_TAG` with awaited `-->` (if `-`) or `ST_TAG_X` (otherwise)
ST_TAG	`<...`	Scan for `>`, find the one preceded by the `awaited` sequence, slice the tag
ST_TAG_X	`<!...`	Scan through all bytes, adjust `awaited` when `[` occurs, watch for `>`, finally slice the tag

The following transitions are possible:

From	To	Awaited	Explanation
ST_TEXT	ST_LT		`<` occurred, buffered text is given out (if any)
ST_TEXT	ST_TEXT		some other char occurred
ST_LT	ST_LT_X		`!` occurred
ST_LT	ST_TAG	`?>`	`?` occurred, processing instruction
ST_LT	ST_TAG	`>`	neither `!` nor `?`, normal tag
ST_LT_X	ST_TAG	`-->`	`-` occurred, must be a comment (unless broken)
ST_LT_X	ST_TAG_X	`>` / `]>` / `]]>`	counting `[`s
ST_TAG	ST_TEXT		`>` occurred, previous characters match, giving out the tag
ST_TAG_X	ST_TEXT		`>` occurred, previous characters match, giving out the tag
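
The transitions above can be sketched as a small synchronous function. This is an illustration under simplifying assumptions, not the actual implementation: it works on one complete string rather than on stream chunks, keeps `awaited` as a string rather than a Buffer, and ignores `maxLength`. The function name `lex` is made up for the example.

```javascript
const ST_TEXT = 0, ST_LT = 1, ST_LT_X = 2, ST_TAG = 3, ST_TAG_X = 4

// Simplified sketch of the state machine (assumption: whole input in memory)
function lex (body) {
  const out = []
  let state = ST_TEXT, awaited = '', start = 0, pos = 0

  // push the slice from `start` to `pos` inclusive, then move `start` past `pos`
  const publishTo = p => {
    if (p >= start) out.push(body.slice(start, p + 1))
    start = p + 1
  }
  // do the characters right before `p` match `awaited`?
  const isClosing = p => body.slice(p - awaited.length, p) === awaited

  while (pos < body.length) {
    const c = body[pos]
    switch (state) {
      case ST_TEXT:
        if (c === '<') { publishTo(pos - 1); state = ST_LT }
        pos++
        break
      case ST_LT: // `c` is the character right after `<`
        if (c === '!') { state = ST_LT_X; pos++ }
        else if (c === '?') { state = ST_TAG; awaited = '?'; pos++ }
        else { state = ST_TAG; awaited = '' } // re-examine `c` in ST_TAG
        break
      case ST_LT_X: // `c` is the character right after `<!`
        if (c === '-') { state = ST_TAG; awaited = '--'; pos++ }
        else { state = ST_TAG_X; awaited = '' } // re-examine `c` in ST_TAG_X
        break
      case ST_TAG:
        if (c === '>' && isClosing(pos)) { publishTo(pos); state = ST_TEXT }
        pos++
        break
      case ST_TAG_X:
        if (c === '[') {
          if (awaited === '') awaited = ']'
          else if (awaited === ']') awaited = ']]'
        } else if (c === '>' && isClosing(pos)) { publishTo(pos); state = ST_TEXT }
        pos++
        break
    }
  }
  if (start < body.length) out.push(body.slice(start)) // what _flush would publish
  return out
}
```

The `>` characters inside the comment and the CDATA section of an input such as `<a><!-- x > y --><![CDATA[1 > 0]]></a>` do not end the lexeme, because they are not preceded by the awaited sequence.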

Enclosing sequences (`awaited` values)

Possible enclosing sequences (with the trailing `>` chopped) are stored as Buffer instances:

Name	Value	Description
CL_DEAFULT	''	default
CL_PI	`?`	for `<?`
CL_COMMENT	`--`	for `<!-`
CL_SQ_1	`]`	if `[` occurred in ST_TAG_X state and CL_DEAFULT awaited
CL_SQ_2	`]]`	if `[` occurred in ST_TAG_X state and CL_SQ_1 awaited
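
Keeping these sequences as Buffers lets the check for a closing `>` compare raw bytes directly. The following is a guess at what such a check could look like, written as a standalone function; the real `isClosing` is a method taking only `pos`, so the extra `body` and `awaited` parameters here are assumptions made to keep the example self-contained.

```javascript
// Constant names and values follow the table above
const CL_PI      = Buffer.from('?')
const CL_COMMENT = Buffer.from('--')
const CL_SQ_2    = Buffer.from(']]')

// Hypothetical standalone version of the check: is the `>` found at index
// `pos` of `body` immediately preceded by the `awaited` bytes?
function isClosing (body, pos, awaited) {
  const from = pos - awaited.length
  if (from < 0) return false
  return body.subarray(from, pos).equals(awaited)
}
```

With an empty `awaited` Buffer the comparison is between two empty slices, so any `>` closes the tag, which matches the default case.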

Methods

Name, Params	Type	Description
setState (state, awaited)		set the `state` and `awaited` members
isClosing (pos)	boolean	do bytes preceding `pos` match `awaited`
publishTo (pos)		`push` the slice from `start` to `pos` into the output stream, then move `start` past `pos`
parse ()		scan the body from `start`, publish lexemes found, move `start` after the last one
checkMaxLength ()		Throws an error if `body.size () > maxLength`
_transform (chunk, encoding, callback)		`append` the incoming chunk, `parse` the body, then `trim` it, reset `start` to `0` and invoke `callback`
_flush (callback)		publish the rest of the `body`, invoke `callback`
getPosition ()	BigInt	Position of the current lexeme in the incoming stream (for error reporting)

Comparison to `XMLIterator`

XMLLexer and XMLIterator are both low level XML parsers splitting angle bracket delimited text into syntactically atomic tokens. But:

Name	Proto	XML Source	Pro	Contra
`XMLLexer`	`Transform`	`Readable`	limited memory footprint with any XML size	asynchronous by nature
`XMLIterator`	`Iterable`	`String`	can be used in synchronous `for ... of`, e. g. in object constructors	limited size XML only

So, XMLLexer vs. XMLIterator is basically like fs.createReadStream vs. fs.readFileSync.
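
To make the synchronous side of that comparison concrete, here is a minimal sketch of an iterable over an in-memory string. It is not the XMLIterator code: `lexemes` and `Doc` are hypothetical names, and the regular expression is a simplification that does not handle doctypes with an internal subset.

```javascript
// Hypothetical synchronous counterpart (NOT the actual XMLIterator):
// a generator yielding lexemes from a string that is fully in memory
function* lexemes (xml) {
  // alternatives: comment, CDATA, processing instruction, any other tag, text
  const re = /<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?\]\]>|<\?[\s\S]*?\?>|<[^>]*>|[^<]+/g
  for (const m of xml.matchAll(re)) yield m[0]
}

// Being synchronous, it can be consumed inside a constructor
class Doc {
  constructor (xml) {
    this.tags = [...lexemes(xml)].filter(s => s.startsWith('<'))
  }
}
```

The whole document has to fit in a string before iteration starts, which is the trade-off against the stream-based approach.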
