XMLLexer

Description

XMLLexer is a transform

from a Readable stream representing the content of an XML (HTML and so on) document;
to an object mode stream of strings representing its lexemes,

where by a lexeme we mean either

a text node
or a tag delimited by angle brackets
- including comments, processing instructions and all the odd `<!` constructs inherited from SGML
  - and yes, `<![CDATA[` belongs here too.

Usage

const lexer = XMLLexer ({...options})

someInputStream.pipe (lexer).pipe (someObjectOutputStream)

// or

lexer.on  ('data', s => console.log (s))
lexer.end ('<root>...</root>')

Options

Name	Default	Description
maxLength	1 << 20	Maximum lexeme length (in bytes or characters)

Lexeme types

XMLLexer splits the incoming stream into strings matching the following patterns:

Template	Description	Note
`<?` ... `?>`	Processing instruction	or prolog `<?xml` ... `?>`
`<!--` ... `-->`	Comment
`<!` ... `[` ... `[` ... `]]>`	Conditional, complex doctype	or `<![CDATA[` ... `]]>`
`<!` ... `[` ... `]>`	Medium doctype
`<!` ... `>`	Doctype, atomic
`<` ... `>`	Element tag (opening, closing or both)
...	Text node
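
The templates above can be told apart by their leading characters alone. The following is a minimal sketch of that idea, not part of XMLLexer: `classifyLexeme` and the labels it returns are hypothetical names introduced only for this illustration.

```javascript
// Hypothetical helper (not part of XMLLexer): names the template
// a lexeme matches, looking only at how the string starts.
function classifyLexeme (s) {
  if (!s.startsWith ('<'))         return 'text'
  if (s.startsWith ('<?'))         return 'pi'          // processing instruction or prolog
  if (s.startsWith ('<!--'))       return 'comment'
  if (s.startsWith ('<![CDATA['))  return 'cdata'
  if (s.startsWith ('<!'))         return 'declaration' // doctype and other <! constructs
  return 'tag'                                          // opening, closing or both
}

console.log (classifyLexeme ('<!DOCTYPE html>'))
```

The order of the checks matters: the more specific prefixes (`<!--`, `<![CDATA[`) have to be tested before the generic `<!`.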

Implementation

Members

Name	Type	Description
bodyString	String	What's left unparsed
state	int	Current state
start	int	Start of the current lexeme in buffer
awaited	Buffer	Enclosing sequence

Constants

States (`state` values)

XMLLexer is basically a finite state machine with the following states:

Constant	Body from `start`	Description
ST_TEXT	unknown or anything but `<`	Find the nearest `<`, slice the text
ST_LT	`<`	Look ahead for the next byte to switch to `ST_LT_X` (if `!`) or `ST_TAG` (otherwise)
ST_LT_X	`<!`	Look ahead for the next byte to switch to `ST_TAG` with awaited `-->` (if `-`) or `ST_TAG_X` (otherwise)
ST_TAG	`<...`	Scan for `>`, find the one preceded by the `awaited` sequence, slice the tag
ST_TAG_X	`<!...`	Scan through all bytes, adjust `awaited` when `[` occurs, watch for `>`, finally slice the tag

The following transitions are possible:

From	To	Awaited	Explanation
ST_TEXT	ST_LT		`<` occurred, buffered text is given out (if any)
ST_TEXT	ST_TEXT		some other char occurred
ST_LT	ST_LT_X		`!` occurred
ST_LT	ST_TAG	`?>`	`?` occurred
ST_LT	ST_TAG	`>`	neither `!` nor `?`, normal tag
ST_LT_X	ST_TAG	`-->`	`-` occurred, must be a comment (unless broken)
ST_LT_X	ST_TAG_X	`>` / `]>` / `]]>`	counting `[`s
ST_TAG	ST_TEXT		`>` occurred, previous characters match -- giving out the tag
ST_TAG_X	ST_TEXT		`>` occurred, previous characters match -- giving out the tag
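
The state machine described above can be sketched over a complete in-memory string. This is a simplified illustration under stated assumptions, not the actual implementation: `lexAll` is a hypothetical name, `awaited` is kept as a plain string rather than a Buffer, the numeric state values are arbitrary, and an unfinished tag at the end of the input is silently dropped (a streaming lexer would wait for the next chunk instead).

```javascript
// State names follow the table above; the numeric values are arbitrary.
const ST_TEXT = 0, ST_LT = 1, ST_LT_X = 2, ST_TAG = 3, ST_TAG_X = 4

// Hypothetical, non-streaming sketch: splits a whole string into lexemes.
function lexAll (src) {
  const out = []
  let state = ST_TEXT, start = 0, awaited = ''

  for (let i = 0; i < src.length; i ++) {
    const c = src [i]

    if (state === ST_TEXT) {
      if (c !== '<') continue                          // ST_TEXT -> ST_TEXT
      if (i > start) out.push (src.slice (start, i))   // give out buffered text, if any
      start = i
      state = ST_LT
      continue
    }

    if (state === ST_LT) {
      if (c === '!') { state = ST_LT_X; continue }
      awaited = c === '?' ? '?' : ''                   // `?>` for PI, plain `>` otherwise
      state = ST_TAG
      if (c === '?') continue
    }
    else if (state === ST_LT_X) {
      if (c === '-') { awaited = '--'; state = ST_TAG; continue }
      awaited = ''
      state = ST_TAG_X                                 // c is examined below: it may be `[`
    }

    // Here the state is ST_TAG or ST_TAG_X.
    if (state === ST_TAG_X && c === '[' && awaited.length < 2) awaited += ']'
    if (c !== '>') continue
    if (!src.slice (start, i).endsWith (awaited)) continue  // not the closing `>` yet

    out.push (src.slice (start, i + 1))                // give out the tag
    start = i + 1
    state = ST_TEXT
  }

  if (state === ST_TEXT && start < src.length) out.push (src.slice (start))
  return out
}

console.log (lexAll ('<?xml version="1.0"?><!-- a > b --><r>hi<![CDATA[x > y]]></r>'))
```

Note how the `>` inside the comment and inside the CDATA section does not end the lexeme: it is not preceded by the awaited `--` or `]]`.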

Enclosing sequences (`awaited` values)

Possible enclosing sequences (with the trailing `>` chopped off) are stored as Buffer instances:

Name	Value	Description
CL_DEAFULT	''	default
CL_PI	`?`	for `<?`
CL_COMMENT	`--`	for `<!-`
CL_SQ_1	`]`	if `[` occurred in ST_TAG_X state and CL_DEAFULT awaited
CL_SQ_2	`]]`	if `[` occurred in ST_TAG_X state and CL_SQ_1 awaited
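
Storing the sequences without the final `>` means that, once a `>` is found, only the bytes just before it need to be compared. A minimal sketch of such a check, assuming the body is a Buffer: the standalone `isClosing (body, pos, awaited)` signature is an assumption made for illustration (the method listed under Methods takes only `pos`).

```javascript
const CL_PI      = Buffer.from ('?')
const CL_COMMENT = Buffer.from ('--')

// Hypothetical standalone variant: do the bytes right before `pos`
// (the position of a `>`) match the awaited enclosing sequence?
function isClosing (body, pos, awaited) {
  const from = pos - awaited.length
  if (from < 0) return false
  return body.subarray (from, pos).equals (awaited)
}

const body = Buffer.from ('<!-- a > b -->')
console.log (isClosing (body, body.indexOf ('>'), CL_COMMENT))     // the inner `>`
console.log (isClosing (body, body.lastIndexOf ('>'), CL_COMMENT)) // the final `>`
```

With an empty awaited sequence the comparison is trivially true, so any `>` closes a normal tag.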

Methods

Name, Params	Type	Description
append (chunk)		append this chunk to the body
slice ()	String	find a lexeme at `start`, move `start` and return the lexeme found or null
trim ()		replace the buffer with its slice from `start`
indexOfLt ()	int	indexOf ('<', this.start)
indexOfGt (fromPos)	int	indexOf ('>', fromPos)
charCodeAt (pos)	int	As in String
isClosing (pos)	boolean	whether the bytes preceding `pos` match `awaited`
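
How `append`, `slice` and `trim` cooperate across chunk boundaries can be sketched as follows. This is a deliberately reduced illustration, not the real class: `TinyLexerBuffer` and `feed` are hypothetical names, the body is a plain string, and only text nodes and ordinary `<...>` tags are recognized (no `awaited` handling).

```javascript
// Hypothetical, reduced model of the buffer handling described above.
class TinyLexerBuffer {

  constructor () { this.body = ''; this.start = 0 }

  append (chunk) { this.body += chunk }

  indexOfLt () { return this.body.indexOf ('<', this.start) }

  indexOfGt (fromPos) { return this.body.indexOf ('>', fromPos) }

  // Returns the lexeme at `start` and moves `start`, or null if it is incomplete.
  slice () {
    const {body, start} = this
    if (start >= body.length) return null
    let end
    if (body [start] === '<') {
      const gt = this.indexOfGt (start)
      if (gt < 0) return null          // the tag continues in a later chunk
      end = gt + 1
    }
    else {
      end = this.indexOfLt ()
      if (end < 0) return null         // the text may continue in a later chunk
    }
    this.start = end
    return body.slice (start, end)
  }

  // Drops everything already given out.
  trim () { this.body = this.body.slice (this.start); this.start = 0 }

}

// Hypothetical driver: what a transform could do for each incoming chunk.
function feed (buf, chunk) {
  const out = []
  buf.append (chunk)
  let s
  while ((s = buf.slice ()) !== null) out.push (s)
  buf.trim ()
  return out
}

const buf = new TinyLexerBuffer ()
console.log (feed (buf, '<a>he'))   // [ '<a>' ]
console.log (feed (buf, 'llo</'))   // [ 'hello' ]
console.log (feed (buf, 'a>'))      // [ '</a>' ]
```

The text `he` is held back after the first chunk because it cannot be known yet whether the text node is complete; it is given out as `hello` only once the next `<` arrives.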
