XMLLexer
do- edited this page Nov 16, 2021
·
55 revisions
XMLLexer is a transform
- from a Readable stream representing the content of an XML (HTML and so on) document;
- to an object mode stream of strings representing its lexemes,

where by a lexeme we mean either
- a text node
- or some tag delimited by pointy brackets
  - including comments, processing instructions and all the weird `<!` stuff inherited from SGML
  - and, yes, `<![CDATA[` belongs here.
const lexer = XMLLexer ({...options})
someInputStream.pipe (lexer).pipe (someObjectOutputStream)
// or
lexer.on ('data', s => console.log (s))
lexer.end ('<root>...</root>')

| Name | Default | Description |
|---|---|---|
| maxLength | 1 << 20 | Maximum lexeme length (in bytes or characters) |
XMLLexer splits the incoming stream into strings matching the following patterns:
| Template | Description | Note |
|---|---|---|
| `<? ... ?>` | Processing instruction | or prolog `<?xml ... ?>` |
| `<!-- ... -->` | Comment | |
| `<! ... [ ... [ ... ]]>` | Conditional, complex doctype | or `<![CDATA[ ... ]]>` |
| `<! ... [ ... ]>` | Medium doctype | |
| `<! ... >` | Doctype, atomic | |
| `< ... >` | Element tag (opening, closing or both) | |
| `...` | Text node | |
| Name | Type | Description |
|---|---|---|
| bodyString | String | What's left unparsed |
| state | int | Current state |
| start | int | Start of the current lexeme in buffer |
| awaited | Buffer | Enclosing sequence |
XMLLexer is basically a finite state machine with the following states:
| Constant | Body from `start` | Description |
|---|---|---|
| ST_TEXT | unknown or anything but `<` | Find the nearest `<`, slice the text |
| ST_LT | `<` | Look ahead for the next byte to switch to ST_LT_X (if `!`) or ST_TAG (otherwise) |
| ST_LT_X | `<!` | Look ahead for the next byte to switch to ST_TAG with awaited `-->` (if `-`) or ST_TAG_X (otherwise) |
| ST_TAG | `<...` | Scan for `>`, find the one preceded by the awaited sequence, slice the tag |
| ST_TAG_X | `<!...` | Scan through all bytes, adjust awaited when `[` occurs, watch for `>`, finally slice the tag |
The following transitions are possible:
| From | To | Awaited | Explanation |
|---|---|---|---|
| ST_TEXT | ST_LT | | `<` occurred, buffered text is given out (if any) |
| ST_TEXT | ST_TEXT | | some other char occurred |
| ST_LT | ST_LT_X | | `!` occurred |
| ST_LT | ST_TAG | `?>` | `?` occurred, processing instruction |
| ST_LT | ST_TAG | `>` | neither `!` nor `?`, normal tag |
| ST_LT_X | ST_TAG | `-->` | `-` occurred, must be a comment (unless broken) |
| ST_LT_X | ST_TAG_X | `>` / `]>` / `]]>` | counting `[`s |
| ST_TAG | ST_TEXT | | `>` occurred and the previous characters match the awaited sequence: giving out the tag |
| ST_TAG_X | ST_TEXT | | `>` occurred and the previous characters match the awaited sequence: giving out the tag |
Possible enclosing sequences (with the trailing `>` chopped) are stored as Buffer instances:
| Name | Value | Description |
|---|---|---|
| CL_DEAFULT | `''` | default |
| CL_PI | `?` | for `<?` |
| CL_COMMENT | `--` | for `<!-` |
| CL_SQ_1 | `]` | if `[` occurred in ST_TAG state and CL_DEAFULT awaited |
| CL_SQ_2 | `]]` | if `[` occurred in ST_TAG state and CL_SQ_1 awaited |
| Name, Params | Type | Description |
|---|---|---|
| append (chunk) | | append this chunk to the body |
| slice () | String | find a lexeme at `start`, move `start` and return the lexeme found or null |
| trim () | | replace the buffer with its slice from `start` |
| indexOfLt () | int | indexOf ('<', this.start) |
| indexOfGt (fromPos) | int | indexOf ('>', fromPos) |
| charCodeAt (pos) | int | As in String |
| isClosing (pos) | boolean | do bytes preceding `pos` match `awaited` |