XMLLexer
do- edited this page Oct 16, 2022 · 55 revisions
XMLLexer is a lowest-level asynchronous XML parser, lower than SAX. It only splits incoming text into fragments (lexemes):
- text nodes
- pointy braces delimited tags of all kinds
  - including comments, processing instructions and all weird `<!` stuff inherited from SGML
  - and, yes, `<![CDATA[` belongs here.

In application code, XMLLexer can be used:
- directly (which makes sense for huge XML files with primitive structure);
- as a data source for SAXEventEmitter.

Technically, it is a Transform:
- from a Readable stream representing the content of an XML (HTML and so on) document;
- to an object mode stream of strings (primitive ones, not wrapped).
const lexer = new XMLLexer ({...options})
someInputStream.pipe (lexer)
for await (const tag of lexer) if (/* the tag is OK */) {
// do something with the tag
}
// or
someInputStream.pipe (lexer).pipe (someObjectOutputStream)
// or
lexer.on ('data', s => console.log (s))
lexer.end ('<root>...</root>')

The constructor accepts the following options:

| Name | Default | Description |
|---|---|---|
| maxLength | 1 << 20 | Maximum lexeme length, characters |
| encoding | utf8 | Encoding for output lexemes, used with binary input |

XMLLexer splits the incoming stream into strings matching the following patterns:

| Template | Description | Note |
|---|---|---|
| `<? ... ?>` | Processing instruction | or prolog `<?xml ... ?>` |
| `<!-- ... -->` | Comment | |
| `<! ... [ ... [ ... ]]>` | Conditional, complex doctype | or `<![CDATA[ ... ]]>` |
| `<! ... [ ... ]>` | Medium doctype | |
| `<! ... >` | Doctype, atomic | |
| `< ... >` | Element tag (opening, closing or both) | |
| ... | Text node | |

The instance has the following properties:

| Name | Type | Description |
|---|---|---|
| body | string | What's left unparsed |
| beforeBody | BigInt | Number of chars or bytes already passed before the current body contents |
| state | int | Current state |
| position | int | Last position analyzed for `[` in state ST_TAG_X |
| awaited | Buffer | Enclosing sequence |

XMLLexer is basically a finite state machine with the following states:

| Constant | Body from start | Description |
|---|---|---|
| ST_TEXT | unknown or anything but `<` | Find the nearest `<`, slice the text |
| ST_LT | `<` | Look ahead for the next byte to switch to ST_LT_X (if `!`) or ST_TAG (otherwise) |
| ST_LT_X | `<!` | Look ahead for the next byte to switch to ST_TAG with awaited `-->` (if `-`) or ST_TAG_X (otherwise) |
| ST_TAG | `<...` | Scan for `>`, find the one preceded by the awaited sequence, slice the tag |
| ST_TAG_X | `<!...` | Scan through all bytes, adjust awaited when `[` occurs, watch for `>`, finally slice the tag |

The following transitions are possible:

| From | To | Awaited | Explanation |
|---|---|---|---|
| ST_TEXT | ST_LT | | `<` occurred, buffered text is given out (if any) |
| ST_TEXT | ST_TEXT | | some other char occurred |
| ST_LT | ST_LT_X | | `!` occurred |
| ST_LT | ST_TAG | `?>` | `?` occurred, processing instruction |
| ST_LT | ST_TAG | `>` | neither `!` nor `?`, normal tag |
| ST_LT_X | ST_TAG | `-->` | `-` occurred, must be a comment (unless broken) |
| ST_LT_X | ST_TAG_X | `>` / `]>` / `]]>` | counting `[`s |
| ST_TAG | ST_TEXT | | `>` occurred, previous characters match: giving out the tag |
| ST_TAG_X | ST_TEXT | | `>` occurred, previous characters match: giving out the tag |
Possible enclosing sequences (> chopped) are stored as Buffer instances:

| Name | Value | Description |
|---|---|---|
| CL_DEAFULT | `''` | default |
| CL_PI | `?` | for `<?` |
| CL_COMMENT | `--` | for `<!-` |
| CL_SQ_1 | `]` | if `[` occurred in ST_TAG_X state and CL_DEAFULT awaited |
| CL_SQ_2 | `]]` | if `[` occurred in ST_TAG_X state and CL_SQ_1 awaited |

The following methods are available:

| Name, Params | Type | Description |
|---|---|---|
| setState (state, awaited) | | set the state and awaited members |
| isClosing (pos) | boolean | do bytes preceding pos match awaited |
| publishTo (pos) | | push the slice from start to pos into the output stream, then start after pos |
| parse () | | scan the body from start, publish lexemes found, move start after the last one |
| checkMaxLength () | | throws an error if body.size () > maxLength |
| _transform (chunk, encoding, callback) | | append the incoming chunk, parse the body, then trim it, reset start to 0 and invoke callback |
| _flush (callback) | | publish the rest of the body, invoke callback |
| getPosition () | BigInt | Position of the current lexeme in the incoming stream (for error reporting) |

XMLLexer and XMLIterator are both low-level XML parsers that split pointy bracket delimited text into syntactically atomic tokens. But:

| Name | Proto | XML Source | Pro | Contra |
|---|---|---|---|---|
| XMLLexer | Transform | Readable | limited memory footprint with any XML size | asynchronous by nature |
| XMLIterator | Iterable | String | can be used in synchronous for ... of, e.g. in object constructors | limited size XML only |

So, XMLLexer vs. XMLIterator is basically like fs.createReadStream vs. fs.readFileSync.