Adding specified repetitions to Invisible XML #326

Open
ndw wants to merge 1 commit into invisibleXML:master from ndw:repetitions

@ndw
Contributor

@ndw ndw commented Nov 13, 2025

Invisible XML currently has two styles of repetition, * meaning “zero or more” and + meaning “one or more”. These are extended to ** and ++ where a separator is introduced.

"a"*, zero or more “a” characters.
"a"+, one or more “a” characters.
"a" ** ",", zero or more “a” characters separated by a “,” character, and
"a" ++ ",", one or more “a” characters separated by a “,” character.

This proposal adds the ability to have specified repetitions: at least “m” (m≥0) occurrences, and at most “n” (n≥m, n>0) occurrences. Stipulate that it is an error if “n” is less than “m” or if “n” is zero.

A specified repetition is introduced with angle brackets: <m,n>. Repetition with a separator uses doubled angle brackets: <<m,n>>.

"a"<3> (equivalently, "a"<3,3>), exactly three “a” characters.
"a"<3,5>, at least three “a” characters and at most five.
"a"<3,*>, three or more “a” characters.
"a" <<3>> "," (equivalently, "a" <<3,3>> ","), exactly three “a” characters separated by a “,” character.
"a" <<3,5>> ",", at least three “a” characters and at most five, separated by the “,” character, and
"a" <<3,*>> ",", three or more “a” characters separated by the “,” character.

It is trivially the case that <0,*> is the same as *, <1,*> is the same as +, <<0,*>> is the same as **, and <<1,*>> is the same as ++, but there doesn’t seem to be a compelling reason to attempt to forbid these expressions.
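The examples above can be checked against ordinary regular expressions. The following Python sketch is illustrative only and is not part of the proposal: the function name `repetition_regex` is hypothetical, and treating a zero minimum with a separator as "the whole construct is optional" is an assumption.

```python
import re


def repetition_regex(atom, m, n=None, sep=None):
    """Build a Python regex for atom<m,n> or atom <<m,n>> sep.

    n=None means "exactly m"; n="*" means "unbounded".
    (Hypothetical helper, for illustration only.)
    """
    if n is None:
        n = m
    # The proposal stipulates an error if n < m or n == 0.
    if n != "*" and (n < m or n == 0):
        raise ValueError("max must be positive and not less than min")
    a = f"(?:{re.escape(atom)})"
    if sep is None:
        hi = "" if n == "*" else str(n)
        return f"{a}{{{m},{hi}}}"
    s = re.escape(sep)
    # One atom, then (separator, atom) pairs for the remaining occurrences.
    lo_rest = max(m - 1, 0)
    hi_rest = "" if n == "*" else str(n - 1)
    body = f"{a}(?:{s}{a}){{{lo_rest},{hi_rest}}}"
    # Assumption: a minimum of zero makes the whole construct optional.
    return body if m > 0 else f"(?:{body})?"


if __name__ == "__main__":
    pattern = repetition_regex("a", 3, 5, sep=",")
    for candidate in ["a,a", "a,a,a", "a,a,a,a,a", "a,a,a,a,a,a"]:
        print(candidate, bool(re.fullmatch(pattern, candidate)))
```

With these definitions, `repetition_regex("a", 0, "*")` behaves like `"a"*` and `repetition_regex("a", 1, "*", sep=",")` behaves like `"a" ++ ","`, matching the equivalences noted above.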

This proposal can be implemented with surprisingly few changes to the spec. A few small changes to the grammar:

Grammar changes

The following new grammar rules are added:

repeatn: factor, (-"<", s, min, s, (-",", s, max, s)?, -">", s;
                  -"<<", s, min, s, (-",", s, max, s)?, -">>", s, sep).
-number: ["0"-"9"]+ .
@min: number .
@max: number | "*" .

(It would be possible to constrain max to be greater than zero in the grammar, ("0"*, ["1"-"9"], ["0"-"9"]*) | "*", but it’s impractical to express the n≥m constraint, so I don’t think it’s worth the added complexity.)
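Because the grammar cannot express n≥m, an implementation would presumably check the bounds after parsing. A minimal Python sketch of that post-parse check follows. It is not taken from any implementation: the names `SUFFIX` and `parse_suffix` are hypothetical, and plain whitespace stands in for the ixml `s` nonterminal (which also admits comments).

```python
import re

# Mirrors the proposed rules: min is a number, max is a number or "*".
# Assumption: \s* approximates the ixml whitespace rule "s".
SUFFIX = re.compile(
    r"(?P<open><<|<)\s*(?P<min>[0-9]+)\s*"
    r"(?:,\s*(?P<max>[0-9]+|\*)\s*)?"
    r"(?P<close>>>|>)"
)


def parse_suffix(text):
    """Return (min, max, separated); max is None when unbounded."""
    match = SUFFIX.fullmatch(text.strip())
    # The opening and closing delimiters must both be single or both doubled.
    if match is None or len(match["open"]) != len(match["close"]):
        raise ValueError(f"not a repetition suffix: {text!r}")
    lo = int(match["min"])
    raw_max = match["max"]
    if raw_max is None:
        hi = lo            # <m> is shorthand for <m,m>
    elif raw_max == "*":
        hi = None          # unbounded
    else:
        hi = int(raw_max)
    # Constraints the grammar does not express: n > 0 and n >= m.
    if hi is not None and (hi == 0 or hi < lo):
        raise ValueError(f"invalid bounds in {text!r}")
    return lo, hi, match["open"] == "<<"


if __name__ == "__main__":
    for suffix in ["<3>", "<3,5>", "<<3,*>>"]:
        print(suffix, parse_suffix(suffix))
```

Note that under this reading `<0>` is rejected, since it is shorthand for `<0,0>` and a maximum of zero is stipulated to be an error.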

The rule for term is extended to include repeatn:

-term: factor;
       option;
       repeat0;
       repeat1;
       repeatn.

And a few new hints for implementors:

Hints for implementors

A specific repeat:

f<m> ⇒ f₁, f₂, f₃ , ..., fₘ

A specific range:

f<m,n> ⇒ f₁, f₂, f₃ , ..., fₘ, (fₘ₊₁, (fₘ₊₂, ..., (fₙ)?, ... )? )?

That is, f<3,7> is equivalent to f, f, f, (f, (f, (f, (f)?)?)?)?.

An unbounded range:

f<m,*> ⇒ f₁, f₂, f₃ , ..., fₘ, f*
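The three rewrites above can be sketched as a small Python function that emits the expansion as text. This is an illustration under assumptions, not the implementation behind this PR: the name `expand` is hypothetical, and `f` is assumed to be a single factor (a sequence would need its own parentheses).

```python
def expand(f, m, n=None):
    """Rewrite f<m>, f<m,n>, or f<m,*> using only sequence, '?', and '*'.

    n=None means "exactly m"; n="*" means "unbounded".
    (Hypothetical helper, for illustration only.)
    """
    if n is None:
        n = m
    required = [f] * m
    if n == "*":
        # f<m,*>  =>  f, ..., f, f*
        return ", ".join(required + [f"{f}*"])
    if n < m or n == 0:
        raise ValueError("max must be positive and not less than min")
    # Build the nested optionals from the inside out.
    optional = ""
    for _ in range(n - m):
        optional = f"({f}, {optional})?" if optional else f"({f})?"
    parts = required + ([optional] if optional else [])
    return ", ".join(parts)


if __name__ == "__main__":
    print(expand("f", 3))       # f, f, f
    print(expand("f", 3, 7))    # f, f, f, (f, (f, (f, (f)?)?)?)?
    print(expand("f", 3, "*"))  # f, f, f, f*
```

The nesting keeps the optional occurrences right-branching, so each additional `f` is only attempted once the previous one has matched.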

A specific repeat with a separator:

f <<m>> sep ⇒ f, (sep, f)<m-1>

A specific range with a separator:

f <<m,n>> sep ⇒ f, (sep, f)<m-1,n-1>

An unbounded repeat with a separator:

f <<m,*>> sep ⇒ f, (sep, f)<m-1,*>
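The separator rewrites can be sketched in Python in the same spirit. The hints as written appear to assume m≥1 and n≥2: with m=0 the lower bound m-1 would be negative, and with n=1 the inner maximum would be zero. How those cases are handled below (making the whole construct optional, and dropping the separator group) is an assumption, as is the hypothetical name `rewrite_separated`.

```python
def rewrite_separated(f, sep, m, n=None):
    """Rewrite f <<m,n>> sep in terms of the unseparated <...> form.

    n=None means "exactly m"; n="*" means "unbounded".
    (Hypothetical helper, for illustration only.)
    """
    if n is None:
        n = m
    if n != "*" and (n < m or n == 0):
        raise ValueError("max must be positive and not less than min")
    lo = max(m - 1, 0)
    hi = "*" if n == "*" else n - 1
    if hi == 0:
        # Assumption: at most one f, so no separator can ever appear.
        body = f
    elif hi == lo:
        body = f"{f}, ({sep}, {f})<{lo}>"
    else:
        body = f"{f}, ({sep}, {f})<{lo},{hi}>"
    # Assumption: a minimum of zero makes the whole construct optional.
    return body if m > 0 else f"({body})?"


if __name__ == "__main__":
    print(rewrite_separated("f", "sep", 3))       # f, (sep, f)<2>
    print(rewrite_separated("f", "sep", 3, 5))    # f, (sep, f)<2,4>
    print(rewrite_separated("f", "sep", 3, "*"))  # f, (sep, f)<2,*>
    print(rewrite_separated("f", "sep", 0, 3))    # (f, (sep, f)<0,2>)?
```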

That's it.

Fix #308

@ndw ndw force-pushed the repetitions branch 2 times, most recently from 2c466c3 to 7274c44, November 13, 2025 12:44
@ndw
Contributor Author

ndw commented Nov 13, 2025

The diff is now available, https://invisiblexml.org/pr/326/autodiff.html

@ndw
Contributor Author

ndw commented Nov 13, 2025

I'm pleased to report that it was not difficult to implement.

@nverwer
Contributor

nverwer commented Nov 20, 2025

Is there a reason for using < and > for repetitions? This seems to come from regular expressions, where repetition is specified like {3,5}. Of course, curly braces were not available in Invisible XML.

In regular expressions, a closing bracket } is needed, because the syntax allows (almost) any character at any position.
It seems that this is not the case in ixml, and we could do without a closing bracket.

For instance, a repetition could be specified as "a"#3, "a"#3,5, "a"#3,*, "a" ##3,5 ",".
This seems more in line with *, **, +, and ++.

Of course, I have my own idea what <...> could be used for, which is in my proposal for the ixml symposium. But what I suggest in that proposal does not conflict with using <...> for repetition, and is not the reason for asking this question.

Maybe there is a reason why the closing bracket is needed, but I have not found it.

@ndw
Contributor Author

ndw commented Nov 20, 2025

I don't think the closing > is necessary for parsing, but I think it's an aid to readability. If you've got a sequence of possibly one, possibly two things, then putting delimiters around the whole sequence makes it more visually clear, I think.

If we wanted to go with a single character and no closing delimiter, I don't think # would be the ideal choice because we already use it to introduce hex encoded terminals. (Like the closing >, I don't think that's an absolute barrier to parsability, but I think it could be confusing for users. It would mean that relatively small changes in the grammar could lead to different results.)

@nverwer
Contributor

nverwer commented Nov 20, 2025

Ouch, I forgot about the hex encoded terminals.
Thanks for explaining the reasoning behind having a closing >.

@ndw
Contributor Author

ndw commented Nov 26, 2025

On the CG call earlier this week, I said that I'd seen a reply by Liam but couldn't immediately work out where. I figured it out this morning; it was in a comment on a weblog post:

It's unfortunate {m,n} isn't available since `<<m,n>>` does not sit well in XML documents. Of course, curly braces don’t sit well in XSLT these days either. But at least they have familiarity given that’s what most regular expression syntaxes seem to use.

"x"@(1,3) might work? or even "x" x (1, 3) harking back to Perl :-) where the s1 x n operator repeats a string s1 n times, concatenating them.

Agree if it’s reasonably implementable it seems like an improvement either way.

I still sort of like the symmetry of the angle brackets, but I think @(m,n) could be made to work.

@LinguaCelta
Contributor

I'd vote against @ because it already has strong semantics in iXML (and elsewhere in the XML stack), as an indicator for an attribute. Using it for something totally unrelated to attributes feels like overload.

Could we consider

item&(3,5)

item&&(3,5)separator

?

I tentatively like this - the standard use for & is concatenation, which is semantically adjacent to repetition. So it feels like a fairly natural counterpart to +.

@ndw
Copy link
Contributor Author

ndw commented Dec 1, 2025

That's a good point about @, thank you!

FWIW, it was quick work to make this parse:

S: a | b .
a: "a" &(3) .
b: "b" &&(3,*) "c" .

To me it feels a bit noisier than

S: a | b .
a: "a"<3> .
b: "b"<<3,*>>"c" .

But then I've been staring at the angle bracket forms for longer and I don't feel strongly about what color we paint the bike shed.

@LinguaCelta
Contributor

Just a quick follow up - the mailing list has come up with a proposal to add repetition numbering to the * operator rather than introducing a new symbol. This could work like the & operator outlined above, except with * instead of &. I wouldn't personally vote for this option, but thought it worth documenting the proposal here.

There have also been suggestions for a different separator between the numbers in the repetition, including (3:5) and (3..5).

(The fact that iXML allows unnecessary alternatives for the rule separator and the alternation symbol (= and :, | and ;, respectively) is making life harder for us as we think about extending the language. In an ideal world, I might propose that we consider standardising these in iXML 2.0 to free up symbols for other uses.)

@ndw
Contributor Author

ndw commented Dec 3, 2025

If we're going to entertain a backwards-incompatible 2.0, I'd be in favor of clawing "{" and "}" back from comments, replacing them with some two character sequence, /*...*/, (*...*), (:...:), or even some flavor of {*...*}.

@LinguaCelta
Contributor

A summary of questions, and proposed answers, arising from discussion on the mailing list.

  • Should the syntax for repetitions consist only of paired delimiters (e.g. angle brackets) or of a repetition operator symbol plus some way of specifying number of repetitions?

"Delimiters only" might mean something like these possibilities:

"cat"<3,5>
"cat"<<3,5>>","
"cat"(3,5)
"cat"((3,5))","

"cat"/3,5/
"cat"//3,5//","

"Repetition operator" might mean something like one of these:

"cat"&(3,5)
"cat"&&(3,5)","

"cat"*[3,5]
"cat"**[3,5]","

  • If we choose to have an operator, should we reuse the * operator and just add some way to express the number of repetitions, or should we choose a new symbol as the operator?

  • If we choose a new symbol, proposals include &, x, %.

  • If we choose delimiters only, proposals include < >, «».

  • If we choose a repetition operator, we will still need to decide how to delimit the min/max numbers: obvious choices include (), [], <>.

  • What character should be used to separate the numbers expressing min and max repetitions? Proposals include ,, .., :.

@ndw
Contributor Author

ndw commented Dec 4, 2025

A summary of questions, and proposed answers, arising from discussion on the mailing list.

Thank you!

  • If we choose to have an operator, should we reuse the * operator and just add some way to express the number of repetitions, or should we choose a new symbol as the operator?

I think we should choose a new symbol. Even if it’s technically possible to reuse “*” or “+”, I think it would be confusing. Saying *(1,2) means * doesn't match zero occurrences. Saying +(0,2) means + does.

  • If we choose a new symbol, proposals include &, x, %.

I am strongly opposed to choosing any name character. After that, I vote "concur". ¹

  • If we choose delimiters only, proposals include < >, «».

Every working group I’ve been on, even those that initially accepted proposals for using symbols not on US ASCII keyboards, has, at the 11th hour or before, lost its nerve and picked ASCII characters.

  • If we choose a repetition operator, we will still need to decide how to delimit the min/max numbers: obvious choices include (), [], <>.

I’m sure I have a three sided coin around here somewhere…

It seems to me that one motivation for choosing an operator is discomfort with using < and > as delimiters in which case choosing to use them after the operator seems … odd. Of () and [], I think () have more familiarity and are less likely to be confusing than []. But that's just intuition.

  • What character should be used to separate the numbers expressing min and max repetitions? Proposals include ,, .., :.

Get out that coin again. I think I have a marginal preference for “,” or “:”, but I could live with any of them.


¹ A vote of "concur" sides with the majority. Two in favor, one opposed, one concur is recorded as three in favor, one opposed, as contrasted with "abstain" which leaves the vote two-to-one. (I don't know why I felt I had to explain that, but ...) In any event, when a working group comes down to attempting to decide by counting votes, things are on shaky ground.

Successfully merging this pull request may close these issues.

Not possible to specify an exact number of repetitions
