Improve ICU4C-based Intl.DateTimeFormat implementation#1636
Improve ICU4C-based Intl.DateTimeFormat implementation#1636robchu05 wants to merge 2 commits intofacebook:mainfrom
Conversation
- Add a flag in top-level CMakeLists.txt for enabling Intl formatRange() and
formatRangeToParts() APIs for non-Android and non-Apple platforms, i.e.
Linux / ICU4C-backed Intl implementation.
- Add the platform plumbing code for formatRange() and formatRangeToParts().
- Remove PlatformIntlShared.h/.cpp as the remaining logic is only used in
ICU4C-backed Intl.DateTimeFormat implementation and is now all inside
impl_icu/DateTimeFormat.cpp.
- Fix a bug in BCP47Parser isUnicodeVariantSubtag() so that the code accounts
for the "digit" porition in the "digit alphanum{3}" syntax definition.
- impl_icu/DateTimeFormat.h|.cpp is the implementation of Intl.DateTimeFormat
using ICU4C C APIs. No ICU4C C++ APIs are used for ABI stability.
- In PlatformIntlICU.cpp, remove DateTimeFormat implementation code and
redirect DateTimeFormat platform implementation calls to
impl_icu::DateTimeFormat.
- Edit the order of Intl.DateTimeFormat options list in Intl.cpp so that they
match the spec's option read ordering. A few test262 intl DateTimeFormat tests
check for option read order and would fail if the options in this list are not
ordered in the spec's option read ordering.
- Add reading and validating formatMatcher option in PlatformIntlApple.mm.
- Add new test files: date-time-format-additional.js,
date-time-format-hour-cycle.js, date-time-format-format-range.js. They tests
Intl.DateTimeFormat features that are not supported on Apple, and thus labeled
with "UNSUPPORTED: apple". The test runner will skip these when running on
Apple platform.
- Rename date-time-format-apple.js to date-time-format.js, which have common
test cases for both Apple and Linux / ICU4C-backed Intl.DateTimeFormat
implementation. Edit a few test cases and expectations to account for
differences in locale data and time zone id support (only use IANA time zone ids
as they are the only ones that the spec allows). Add more negative test cases
for invalid option values.
- Enable running more test262 Intl test cases by editing testsuite.py to no
longer skip tests that include testintl.js and tests that use 'const'
declaration.
- Update corresponding skip lists so to skip tests that are failing due to
unsupported features for Hermes in general, for Linux only, or for Apple only.
[Testing]
Builds succeed on Ubuntu 20.04, 24.04, and Mac. Run the following JS tests on
Ubuntu 20.04, 24.04, Mac, and all pass:
```
$ ./build/bin/hermes ./hermes/test/hermes/intl/intl.js
$ ./build/bin/hermes ./hermes/test/hermes/intl/get-canonical-locales.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/get-canonical-locales.js
$ ./build/bin/hermes ./hermes/test/hermes/intl/collator.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/collator.js
$ LC_ALL=fr_FR _HERMES_TEST_LOCALE=fr_FR ./build/bin/hermes ./hermes/test/hermes/intl/collator-resolved-options.js
$ TZ=GMT ./build/bin/hermes -O -target=HBC ./hermes/test/hermes/intl/date-time-format.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/date-time-format.js
$ TZ=US/Pacific ./build/bin/hermes -O -target=HBC hermes/test/hermes/intl/default-tz.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/default-tz.js
```
Run the following JS tests on Ubuntu 20.04, 24.04, and all pass:
```
$ ./build/bin/hermes hermes/test/hermes/intl/collator-resolved-options-validate-co-extension.js
$ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-additional.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/date-time-format-additional.js
$ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-hour-cycle.js
$ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-format-range.js
```
Run valgrind with above test runs and no memory leak detected.
Run Test262 test suite (pinned at commit hash
62626e083bd506124aac6c799464d76c2c42851b that the github workflow uses for test
stablity). All non-skipped test cases pass.
```
$ ./hermes/utils/testsuite/run_testsuite.py -b ./build/bin --test-intl -a ~/Downloads/test262-62626e0/test262/test/
```
|
Thanks for submitting this PR! Before we start reviewing it, could you make the changes to the static_h branch instead? All new features/improvements should happen on that branch now. Thanks! |
|
Hi @lavenzg, Thank you for letting me know about static_h branch. I see that Hermes team is working on getting React Native main branch and releases to switch over to the static_h branch: facebook/react-native#48531. Our team tracks RN main branch and releases as upstreams and Hermes is part of that. So we get upstream Hermes changes from Hermes main branch, until RN main branch and releases switch over to Hermes static_h branch. We would thus like to see this PR to be included in both Hermes main branch and static_h branch as we don't yet know when RN will switch over. What do you think about keeping this PR targeting main branch, and have it reviewed, comments addressed, and merged first, then subseqeuntly I'll work on a new PR for the static_h branch? thanks, Robert P.S. I took a look at the relevant code files on static_h branch, and bringing the changes in this PR to static_h should be straightforward as the relevant code is the pretty much the same between static_h and main branch, except for some details to be worked out in the automated testing configuration because that setup has some differences between the two branches. |
|
@robchu05 Sounds good. We'll start reviewing this. Thanks! |
|
Hi @lavenzg, Just want to see if you had a chance to start reviewing this yet. Regards, Robert |
|
Hi @lavenzg, Do you have any initial feedback to share? Best, Robert |
|
Hi @lavenzg, Hope you are doing well. Any update on the review for this PR? Regards, Robert |
|
@robchu05 Sorry for the delay! I'll get it done this week. |
lavenzg
left a comment
There was a problem hiding this comment.
Looks good. Left a few comments.
| "test262/test/intl402/supportedLocalesOf-taint-Array-2.js", | ||
| "test262/test/intl402/supportedLocalesOf-test-option-localeMatcher.js", | ||
| "test262/test/intl402/supportedLocalesOf-unicode-extensions-ignored.js", | ||
| # date-same-returns-single-date.js and fractionalSecondDigits.js fail on older |
There was a problem hiding this comment.
It depends on ICU APIs introduced in newer version? Or there're bugs fixed in new version? Do you know what's the minimum version that can pass this test?
There was a problem hiding this comment.
There should be bugs fixed in newer versions. There was some rounding issue in older ICU version that caused these tests to fail when comparing two date format results that are expected to be the same. I don't know what's the minimum version to pass. I know that with ICU 74.2, the tests pass, and with ICU 66.1, the tests fail.
There was a problem hiding this comment.
Hmm, when you port this to static_h branch, we may need to increase the ICU version. For now we search for a minimum of 65 when INTL is ON. We can worry about this later.
| @@ -46,7 +46,8 @@ bool isUnicodeRegionSubtag(const std::u16string &subtag) { | |||
| bool isUnicodeVariantSubtag(const std::u16string &subtag) { | |||
| // = (alphanum{5,8} | digit alphanum{3}); | |||
There was a problem hiding this comment.
Update the code comment accordingly.
There was a problem hiding this comment.
The existing code comment is correct. The previous code incorrectly missed the "digit" part in the "digit alphanum{3}" check.
| std::vector<Part> formatToParts(double jsTimeValue) noexcept; | ||
| #ifdef HERMES_INTL_FORMAT_RANGE | ||
| vm::CallResult<std::u16string> | ||
| formatRange(vm::Runtime &runtime, double startDate, double endDate) noexcept; |
There was a problem hiding this comment.
nit: I assume this is utc time? We could name it like "startUtcMs`.
There was a problem hiding this comment.
Done in rev 2. Renamed all startDate and endDate param names to startUtcMS and endUtcMs in these new formatRange() and formatRangeToParts() methods.
| * Checks if given string starts with given prefix. | ||
| * @param str string to check | ||
| * @param prefix prefix to check | ||
| * @return true if str starts with prefix, false otherwise. |
There was a problem hiding this comment.
| * @return true if str starts with prefix, false otherwise. | |
| * @return true if \p str starts with \p prefix, false otherwise. |
| std::u16string toUpperASCII(std::u16string str) { | ||
| constexpr std::uint16_t offset = 'a' - 'A'; | ||
| for (char16_t &c : str) { | ||
| if (c >= u'a' && c <= u'z') { |
There was a problem hiding this comment.
We should assert that \p c is indeed ASCII char to match the comment.
There was a problem hiding this comment.
I updated the method documentation to reflect that the method only converts the ASCII letters, but doesn't require the input string to be ASCII only.
This function is also called with time zone string option passed in by Intl.DateTimeFormat API caller that could contain non-ASCII, which is fine, as the result is then matched against valid time zone names, which are ASCII only, so it does not match as expected.
| relevantExtensionKeys, | ||
| isExtensionTypeSupported); | ||
|
|
||
| LocaleBCP47Object resolvedBCP47Locale = result.localeBcp47Object; |
There was a problem hiding this comment.
These objects aren't small. Coundn't we use reference here? Or at least we should move.
There was a problem hiding this comment.
The resolvedBCP47Locale instance can potentially get modified later in the code, so I prefer that it's a new copy, rather than modifying the one in result. And moving it out of result would leave result in an incomplete state, which isn't intuitive.
Regarding the size, LocaleBCP47Object contains multiple containers (maps, vectors) of short strings and have small number of entries in most cases, as they are just locale strings separated out into their categories. So its size is not expensive.
| if (extensionMap.find(constants::extension_key::ca) == extensionMap.end()) { | ||
| extensionMap.emplace(constants::extension_key::ca, resolvedCalendar_); | ||
| resolvedBCP47Locale.updateExtensionMap(extensionMap); | ||
| internalLocale = resolvedBCP47Locale.getCanonicalizedLocaleId(); | ||
| } |
There was a problem hiding this comment.
We don't need to search two times.
| if (extensionMap.find(constants::extension_key::ca) == extensionMap.end()) { | |
| extensionMap.emplace(constants::extension_key::ca, resolvedCalendar_); | |
| resolvedBCP47Locale.updateExtensionMap(extensionMap); | |
| internalLocale = resolvedBCP47Locale.getCanonicalizedLocaleId(); | |
| } | |
| auto success = extensionMap.emplace(constants::extension_key::ca, resolvedCalendar_).second; | |
| if (success) { | |
| resolvedBCP47Locale.updateExtensionMap(extensionMap); | |
| internalLocale = resolvedBCP47Locale.getCanonicalizedLocaleId(); | |
| } |
There was a problem hiding this comment.
Ah, yes, good suggestion. Done in rev 2 for this code block and below for the numbering system code block. I used try_emplace, which avoids creating a pair for the arguments if map already contains the key and no insertion takes place.
|
|
||
| LocaleBCP47Object resolvedBCP47Locale = result.localeBcp47Object; | ||
| Options resolvedOpt = result.resolvedOpts; | ||
| auto extensionMap = resolvedBCP47Locale.getExtensionMap(); |
There was a problem hiding this comment.
Below you call getExtensionMap() to update resolvedBCP47Locale, why can't we just make extensionMap takes reference so that it updates directly (note that getExtensionMap() already returns reference)?
There was a problem hiding this comment.
For the encapsulation that properties of LocaleBCP47Object should only be modified by its own class methods. getExtensionMap() returns a const reference to avoid a copy for callers that only need to read.
There was a problem hiding this comment.
For the encapsulation that properties of LocaleBCP47Object should only be modified by its own class methods
I don't quite remember, is this required by the spec? Otherwise we could just change the implementation. It feels very unnecessary to copy out the map and merge it back. Maybe one case is that multiple map references are taken and one update should not affect others, but I'm not sure if this is an actual case. Anyway, I'm fine keeping the current form, assuming that this won't appear on hot path.
| } | ||
|
|
||
| const std::unordered_set<std::u16string> &getAvailableNumberingSystems() { | ||
| // Intentionally leaked to avoid destruction order problems. |
There was a problem hiding this comment.
Could you explain a bit more? Couldn't we just move the lambda to a separate function and declare a static variable here? Why does it have to be allocated on heap?
There was a problem hiding this comment.
This idea is directly borrowed from existing code in https://github.com/facebook/hermes/blob/main/lib/Platform/Intl/PlatformIntlApple.mm#L37.
Here is an article that also explains the C++ static destruction order problem, which this solution deals with: https://ppwwyyxx.com/blog/2023/Safe-Static-Initialization-No-Destruction/ > Safe Destructions section. Basically, don't destroy to avoid the whole issue.
| const std::unordered_map<std::u16string, std::u16string> & | ||
| DateTimeFormat::getValidTimeZones() { | ||
| // Intentionally leaked to avoid destruction order problems. | ||
| static const auto *validTimeZones = [] { |
There was a problem hiding this comment.
Same as NumberingSystem, I think it'd be better to move the lambda body to a separate function (e.g., computeValidTimeZones(), and just save the result to a static local variable here. I don't see why it has to be on heap.
There was a problem hiding this comment.
See my reply in the NumberingSystem comment.
|
@lavenzg, thanks for the review and feedback. I plan to address them next week. |
Address code review feedback: - Add documentation - More precise variable names - Change map::find then ::emplace to ::try_emplace. [Testing] Builds succeed on Ubuntu 24.04. Run the following JS tests on Ubuntu 24.04 and all pass: ``` $ ./build/bin/hermes ./hermes/test/hermes/intl/intl.js $ ./build/bin/hermes ./hermes/test/hermes/intl/get-canonical-locales.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/get-canonical-locales.js $ ./build/bin/hermes ./hermes/test/hermes/intl/collator.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/collator.js $ LC_ALL=fr_FR _HERMES_TEST_LOCALE=fr_FR ./build/bin/hermes ./hermes/test/hermes/intl/collator-resolved-options.js $ TZ=GMT ./build/bin/hermes -O -target=HBC ./hermes/test/hermes/intl/date-time-format.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/date-time-format.js $ TZ=US/Pacific ./build/bin/hermes -O -target=HBC hermes/test/hermes/intl/default-tz.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/default-tz.js $ ./build/bin/hermes hermes/test/hermes/intl/collator-resolved-options-validate-co-extension.js $ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-additional.js | ./build/bin/FileCheck --match-full-lines ./hermes/test/hermes/intl/date-time-format-additional.js $ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-hour-cycle.js $ TZ=GMT ./build/bin/hermes hermes/test/hermes/intl/date-time-format-format-range.js ``` Run valgrind with above test runs and no memory leak detected. Run Test262 test suite (pinned at commit hash 62626e083bd506124aac6c799464d76c2c42851b that the github workflow uses for test stablity). All non-skipped test cases pass. ``` $ ./hermes/utils/testsuite/run_testsuite.py -b ./build/bin --test-intl -a ~/Downloads/test262-62626e0/test262/test/ ```
robchu05
left a comment
There was a problem hiding this comment.
Addressing review feedback. Rev 2 posted.
| std::vector<Part> formatToParts(double jsTimeValue) noexcept; | ||
| #ifdef HERMES_INTL_FORMAT_RANGE | ||
| vm::CallResult<std::u16string> | ||
| formatRange(vm::Runtime &runtime, double startDate, double endDate) noexcept; |
There was a problem hiding this comment.
Done in rev 2. Renamed all startDate and endDate param names to startUtcMS and endUtcMs in these new formatRange() and formatRangeToParts() methods.
| @@ -46,7 +46,8 @@ bool isUnicodeRegionSubtag(const std::u16string &subtag) { | |||
| bool isUnicodeVariantSubtag(const std::u16string &subtag) { | |||
| // = (alphanum{5,8} | digit alphanum{3}); | |||
There was a problem hiding this comment.
The existing code comment is correct. The previous code incorrectly missed the "digit" part in the "digit alphanum{3}" check.
| // 30. Let dataLocaleData be localeData.[[<dataLocale>]]. | ||
| // 31. Let matcher be ? GetOption(options, "formatMatcher", "string", « | ||
| // "basic", "best fit" », "best fit"). | ||
| // NOTE: Only best fit format matcher is implemented through use of NSDateFormatter. |
There was a problem hiding this comment.
The "basic" format matcher algorithm defined in the spec is not implemented. The Apple implementation uses the Apple Platform API to implement the "best fit" format matcher. And the "best fit" format matcher is used even when "basic" is specified.
"best fit" is defined in the spec to be a platform implementation that produces results at least as good as "basic", so it is fine to just always use the "best fit" format matcher.
| relevantExtensionKeys, | ||
| isExtensionTypeSupported); | ||
|
|
||
| LocaleBCP47Object resolvedBCP47Locale = result.localeBcp47Object; |
There was a problem hiding this comment.
The resolvedBCP47Locale instance can potentially get modified later in the code, so I prefer that it's a new copy, rather than modifying the one in result. And moving it out of result would leave result in an incomplete state, which isn't intuitive.
Regarding the size, LocaleBCP47Object contains multiple containers (maps, vectors) of short strings and have small number of entries in most cases, as they are just locale strings separated out into their categories. So its size is not expensive.
|
|
||
| LocaleBCP47Object resolvedBCP47Locale = result.localeBcp47Object; | ||
| Options resolvedOpt = result.resolvedOpts; | ||
| auto extensionMap = resolvedBCP47Locale.getExtensionMap(); |
There was a problem hiding this comment.
For the encapsulation that properties of LocaleBCP47Object should only be modified by its own class methods. getExtensionMap() returns a const reference to avoid a copy for callers that only need to read.
| static const auto *availableNumberingSystems = [] { | ||
| auto *numberingSystems = new std::unordered_set<std::u16string>(); | ||
| UErrorCode status = U_ZERO_ERROR; | ||
| std::unique_ptr<UEnumeration, decltype(&uenum_close)> names( |
There was a problem hiding this comment.
See my reply to the other comment about this.
| namespace platform_intl { | ||
| namespace impl_icu { | ||
|
|
||
| std::u16string getDefaultNumberingSystem(const std::string &localeICU); |
| {u"hourCycle", platform_intl::Option::Kind::String, 0}, | ||
| {u"timeZone", platform_intl::Option::Kind::String, 0}, | ||
| {u"formatMatcher", platform_intl::Option::Kind::String, 0}, | ||
| {u"weekday", platform_intl::Option::Kind::String, kDateRequired}, |
There was a problem hiding this comment.
Added comment in rev 2.
Note that this ordering isn't explicitly called out in the spec, and it's for all Intl APIs, not just for DateTimeFormat.
The reasons I can think of why there are test262 tests for the ordering is (1) in case there are side effects in the way the arguments are specified, say the argument is actually a function that returns a value and has side effects, and (2) if there are multiple invalid arguments resulting in exception thrown, then the ordering of resolving and checking the arguments is observable and ECMAScript wants conforming JS engines to behave the same in these regards.
| } | ||
|
|
||
| #ifdef HERMES_INTL_FORMAT_RANGE | ||
| CallResult<double> dateValue(Runtime &runtime, NativeArgs args, int argIndex) { |
| "test262/test/intl402/supportedLocalesOf-taint-Array-2.js", | ||
| "test262/test/intl402/supportedLocalesOf-test-option-localeMatcher.js", | ||
| "test262/test/intl402/supportedLocalesOf-unicode-extensions-ignored.js", | ||
| # date-same-returns-single-date.js and fractionalSecondDigits.js fail on older |
There was a problem hiding this comment.
There should be bugs fixed in newer versions. There was some rounding issue in older ICU version that caused these tests to fail when comparing two date format results that are expected to be the same. I don't know what's the minimum version to pass. I know that with ICU 74.2, the tests pass, and with ICU 66.1, the tests fail.
|
@lavenzg, please have a look at my replies to your feedback and the new revision I posted. I'm going to be on vacation between June 7 - 29. If there are additional feedback while I'm away, I'll address them after I'm back. I will also work on porting this to the static_h branch after my break. |
|
@lavenzg, I'm back in office. Have you had a chance to review rev 2? Thanks! |
LGTM. I will import it to our internal repository and run some extra tests. If no failure we will merge them. Thanks for your work on this (and your patience as my response sometimes are slow)! |
|
@lavenzg, update on the import status of this change? Thanks. |
|
@lavenzg, any update on merging this PR? Thank you. |
|
@lavenzg, are you able to merge this PR? Let me know if you need anything form me. |
|
Hi, unfortunately it seems like we won't be able to merge this PR. I apologize, Meta, for better or worse, has decided that it cannot dedicate the resources to maintain a Intl implementation, and we can't merge something that we can't maintain. This is not an ideal situation, but it is out the hands of the Hermes team. Instead, we are adding 1st class (and maintained) support for NodeAPI, which is well defined and ABI stable. This make it possible for Intl implementations to exist outside of the Hermes repository, while being guaranteed compatibility with Hermes by NodeAPI. |
Summary
Improve and refactor ICU4C-based Intl.DateTimeFormat implementation for non-Android / non-Apple platforms.
Test Plan
Builds succeed on Ubuntu 20.04, 24.04, and Mac. Run the following JS tests on Ubuntu 20.04, 24.04, Mac, and all pass:
Run the following JS tests on Ubuntu 20.04, 24.04, and all pass:
Run valgrind with above test runs and no memory leak detected.
Run Test262 test suite (pinned at commit hash
62626e083bd506124aac6c799464d76c2c42851b that the github workflow uses for test stablity). All non-skipped test cases pass.