Commit 976e22d

Add comments from Unicode to script_run()
1 parent d15dd33 commit 976e22d

1 file changed: +20 -2 lines changed

regexec.c

Lines changed: 20 additions & 2 deletions
@@ -11773,9 +11773,27 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
  * parallel, table that gives the number of entries in each aux table.
  * These are all defined in charclass_invlists.inc */
 
-/* XXX Here are the additional things UTS 39 says could be done:
+/* XXX Here are the additional things UTS 39 (17.0
+ * https://unicode.org/reports/tr39/#Optional_Detection ) says could be
+ * done:
  *
- * Forbid sequences of the same nonspacing mark
+ * Check for unlikely sequences of combining marks:
+ * Forbid sequences of the same nonspacing mark.
+ * Forbid sequences of more than 4 nonspacing marks (gc=Mn or gc=Me).
+ * Forbid sequences of base character + nonspacing mark that look the
+ * same as or confusingly similar to the base character alone
+ * (because the nonspacing mark overlays a portion of the base
+ * character). An example is U+0069 LOWERCASE LETTER I + U+0307
+ * COMBINING DOT ABOVE.
+ * Add support for detecting two distinct sequences that have identical
+ * representations. The current data files only handle cases where a
+ * single code point is confusable with another code point or
+ * sequence. It does not handle cases like shri:
+ * The characters U+0BB6 TAMIL LETTER SHA and U+0BB8 TAMIL LETTER SA
+ * are normally quite distinct. However, they can both be used in the
+ * representation of the Tamil word shri. On some very common
+ * platforms, some sequences result in exactly the same visual
+ * appearance:
  *
  * Check to see that all the characters are in the sets of exemplar
  * characters for at least one language in the Unicode Common Locale Data
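
The first two combining-mark checks listed in the new comment (no repeated nonspacing mark, no run of more than 4 nonspacing marks) could be sketched roughly as follows. This is a standalone illustration, not code from regexec.c: the names is_nonspacing_mark(), marks_look_plausible() and MAX_MARK_RUN are hypothetical, "sequences of the same nonspacing mark" is read here as the same mark appearing twice in a row, and the mark predicate is a toy stand-in that only recognizes the Combining Diacritical Marks block rather than doing a real gc=Mn / gc=Me lookup.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: toy stand-in for a real gc=Mn / gc=Me property lookup.  It
 * only recognizes the Combining Diacritical Marks block (U+0300..U+036F);
 * a real implementation would consult the Unicode general category. */
static bool
is_nonspacing_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Longest run of consecutive nonspacing marks that is still accepted */
#define MAX_MARK_RUN 4

/* Hypothetical helper: returns false if the code point sequence contains
 * the same nonspacing mark twice in a row, or a run of more than
 * MAX_MARK_RUN nonspacing marks; returns true otherwise. */
static bool
marks_look_plausible(const uint32_t *cps, size_t len)
{
    size_t run = 0;     /* length of the current run of marks */

    for (size_t i = 0; i < len; i++) {
        if (!is_nonspacing_mark(cps[i])) {
            run = 0;    /* any other character ends the run */
            continue;
        }

        /* run > 0 guarantees cps[i-1] exists and is itself a mark */
        if (run > 0 && cps[i] == cps[i - 1]) {
            return false;
        }

        if (++run > MAX_MARK_RUN) {
            return false;
        }
    }

    return true;
}
```

The third check in the comment (a mark that overlays part of its base, such as U+0069 followed by U+0307) is not covered by this sketch; it would need confusability data rather than a simple count.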
