chore: Add microbenchmark for casting string to numeric #2979

andygrove · 2025-12-23T20:03:42Z

Which issue does this PR close?

Part of #2955

Rationale for this change

Add benchmarks for casting strings to numeric types using both cast and try_cast. ANSI mode is not explicitly enabled because we would either have to ensure that all input values are valid (which would not be benchmarking the validation logic fully) or we would just be timing how long it takes to hit the first invalid value and throw an exception, which is not helpful. Testing try_cast instead allows us to test the overhead of the validation logic.
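To make the trade-off above concrete, here is a minimal Scala sketch of the two behaviours being contrasted. The helper names `tryCastToInt` and `ansiCastToInt` are hypothetical and rely on plain JVM parsing; they are not the Spark or Comet implementations, and the whitespace trimming is a simplifying assumption.

```scala
import scala.util.Try

object CastSemanticsSketch {
  // Hypothetical stand-in for TRY_CAST (and for CAST with ANSI mode off):
  // every row is validated, and invalid input becomes None (SQL NULL).
  def tryCastToInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Hypothetical stand-in for CAST with ANSI mode on:
  // the first invalid input throws NumberFormatException and ends the run.
  def ansiCastToInt(s: String): Int = s.trim.toInt

  def main(args: Array[String]): Unit = {
    val input = Seq("1", " 42 ", "abc", "7")

    // All four rows go through validation; the invalid one yields None.
    println(input.map(tryCastToInt))

    // Count the rows an ANSI-style cast gets through before it would throw.
    val processed = input.takeWhile(s => Try(ansiCastToInt(s)).isSuccess).size
    println(s"ANSI-style cast stopped after $processed valid rows")
  }
}
```

With mixed input, the ANSI-style path only ever measures the rows before the first failure, whereas the try-style path validates every row, which is the cost the benchmark aims to capture.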

What changes are included in this PR?

How are these changes tested?

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to BOOLEAN:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               127            147          20          8.3         120.7       1.0X
Comet (Scan)                                        122            137          17          8.6         116.6       1.0X
Comet (Scan + Exec)                                  85             96          11         12.3          81.3       1.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to BYTE:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             73          16         17.8          56.3       1.0X
Comet (Scan)                                         57             71          17         18.3          54.7       1.0X
Comet (Scan + Exec)                                  74             91          18         14.2          70.5       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to SHORT:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             78          20         17.9          55.8       1.0X
Comet (Scan)                                         56             70          19         18.7          53.4       1.0X
Comet (Scan + Exec)                                  73             98          22         14.3          70.0       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to INT:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                56             64          11         18.7          53.4       1.0X
Comet (Scan)                                         56             68          16         18.9          53.0       1.0X
Comet (Scan + Exec)                                  70             88          21         15.0          66.6       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to LONG:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             75          30         17.8          56.2       1.0X
Comet (Scan)                                         57             82          35         18.3          54.5       1.0X
Comet (Scan + Exec)                                  73             99          28         14.4          69.7       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to FLOAT:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                96            116          32         10.9          91.9       1.0X
Comet (Scan)                                         95            110          23         11.0          90.6       1.0X
Comet (Scan + Exec)                                  62             86          30         16.9          59.0       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to DOUBLE:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                98            113          26         10.7          93.8       1.0X
Comet (Scan)                                         99            110          24         10.6          93.9       1.0X
Comet (Scan + Exec)                                  62             93          31         16.9          59.2       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to DECIMAL(10,2):             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               180            194          25          5.8         171.2       1.0X
Comet (Scan)                                        179            185          13          5.8         171.1       1.0X
Comet (Scan + Exec)                                  95            148          30         11.1          90.4       1.9X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to BOOLEAN:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              4632           4658          36          0.2        4417.8       1.0X
Comet (Scan)                                       4578           4596          25          0.2        4366.1       1.0X
Comet (Scan + Exec)                                  81            101          25         13.0          76.9      57.4X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to BYTE:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              2555           2569          19          0.4        2436.5       1.0X
Comet (Scan)                                       2472           2482          13          0.4        2357.8       1.0X
Comet (Scan + Exec)                                  71             96          28         14.7          67.8      35.9X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to SHORT:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              2517           2522           7          0.4        2400.4       1.0X
Comet (Scan)                                       2482           2484           3          0.4        2367.4       1.0X
Comet (Scan + Exec)                                  71             88          28         14.8          67.7      35.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to INT:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              1383           1384           2          0.8        1318.7       1.0X
Comet (Scan)                                       1352           1354           3          0.8        1289.0       1.0X
Comet (Scan + Exec)                                  69             86          29         15.2          65.8      20.0X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to LONG:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              1369           1372           4          0.8        1305.3       1.0X
Comet (Scan)                                       1355           1399          63          0.8        1292.0       1.0X
Comet (Scan + Exec)                                  72            109          32         14.6          68.3      19.1X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to FLOAT:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                95            120          34         11.0          90.8       1.0X
Comet (Scan)                                         94            146          49         11.1          89.8       1.0X
Comet (Scan + Exec)                                  62            107          47         17.0          58.9       1.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to DOUBLE:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                96            108          12         10.9          91.5       1.0X
Comet (Scan)                                         98            128          37         10.7          93.1       1.0X
Comet (Scan + Exec)                                  61             94          40         17.1          58.5       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to DECIMAL(10,2):         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               180            218          35          5.8         171.3       1.0X
Comet (Scan)                                        178            222          44          5.9         170.0       1.0X
Comet (Scan + Exec)                                  93            126          41         11.3          88.8       1.9X
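The Relative column can be reproduced from the Per Row(ns) column. The sketch below assumes Relative is the baseline (Spark) per-row time divided by each case's per-row time; that matches the rows above but is inferred from the numbers rather than stated in this PR. `relative` and `formatRelative` are illustrative names.

```scala
import java.util.Locale

object RelativeSpeedupSketch {
  // Assumed definition: how many times faster a case is than the baseline.
  def relative(baselineNsPerRow: Double, caseNsPerRow: Double): Double =
    baselineNsPerRow / caseNsPerRow

  // One decimal place with an "X" suffix, independent of the default locale.
  def formatRelative(x: Double): String = "%.1fX".formatLocal(Locale.ROOT, x)

  def main(args: Array[String]): Unit = {
    // TRY_CAST String to BOOLEAN: Spark vs Comet (Scan + Exec).
    println(formatRelative(relative(4417.8, 76.9))) // 57.4X
    // CAST String to BYTE: Spark vs Comet (Scan + Exec).
    println(formatRelative(relative(56.3, 70.5))) // 0.8X
  }
}
```

A value below 1.0X, as in the CAST to BYTE, SHORT, INT, and LONG tables, means that case is slower than the Spark baseline.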

andygrove · 2025-12-23T21:01:12Z

@coderfender Could you review when you have time?

codecov-commenter · 2025-12-23T21:16:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.64%. Comparing base (f09f8af) to head (01b0772).
⚠️ Report is 798 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2979      +/-   ##
============================================
+ Coverage     56.12%   59.64%   +3.51%     
- Complexity      976     1375     +399     
============================================
  Files           119      167      +48     
  Lines         11743    15497    +3754     
  Branches       2251     2569     +318     
============================================
+ Hits           6591     9243    +2652     
- Misses         4012     4955     +943     
- Partials       1140     1299     +159

coderfender

Minor comments, but LGTM @andygrove

coderfender · 2025-12-23T23:12:17Z

spark/src/test/scala/org/apache/spark/sql/benchmark/CometCastStringToNumericBenchmark.scala

+ * `SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometCastStringToNumericBenchmark`
+ * Results will be written to "spark/benchmarks/CometCastStringToNumericBenchmark-**results.txt".
+ */
+// spotless:on


@andygrove nit: perhaps we don't need to turn off spotless, given that none of the other benchmarks do?
Example comment from CometCastBenchmark:

/**
 * Benchmark to measure Comet execution performance. To run this benchmark:
 * {{{
 *   SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometCastBenchmark
 * }}}
 *
 * Results will be written to "spark/benchmarks/CometCastBenchmark-**results.txt".
 */

coderfender · 2025-12-23T23:20:25Z

spark/src/test/scala/org/apache/spark/sql/benchmark/CometCastStringToNumericBenchmark.scala

+
+  private val castFunctions = Seq("CAST", "TRY_CAST")
+  private val targetTypes =
+    Seq("BOOLEAN", "BYTE", "SHORT", "INT", "LONG", "FLOAT", "DOUBLE", "DECIMAL(10,2)")


nit: one could argue that Boolean isn't necessarily numeric. Also, could we add some decimals with higher precision and scale?

https://spark.apache.org/docs/latest/sql-ref-datatypes.html
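As a rough sketch of what that suggestion could look like, the snippet below builds the benchmark query matrix with two extra decimal types. The first eight target types and the query template come from this PR; `DECIMAL(18,6)` and `DECIMAL(38,18)` are hypothetical additions chosen only for illustration, and `queries` is an illustrative helper, not code from the benchmark.

```scala
object CastQueryMatrixSketch {
  val castFunctions: Seq[String] = Seq("CAST", "TRY_CAST")

  val targetTypes: Seq[String] = Seq(
    // Types currently listed in the PR.
    "BOOLEAN", "BYTE", "SHORT", "INT", "LONG", "FLOAT", "DOUBLE", "DECIMAL(10,2)",
    // Hypothetical higher precision and scale additions (not in the PR).
    "DECIMAL(18,6)", "DECIMAL(38,18)")

  // One query per (cast function, target type) pair, using the PR's template.
  def queries(table: String): Seq[String] =
    for {
      castFunc <- castFunctions
      targetType <- targetTypes
    } yield s"SELECT $castFunc(c1 AS $targetType) FROM $table"

  def main(args: Array[String]): Unit =
    queries("parquetV1Table").foreach(println)
}
```

Each added type contributes two more benchmark cases, one for CAST and one for TRY_CAST, so the list is worth keeping short.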

coderfender · 2025-12-23T23:23:24Z

spark/src/test/scala/org/apache/spark/sql/benchmark/CometCastStringToNumericBenchmark.scala

+    s"SELECT $castFunc(c1 AS $targetType) FROM parquetV1Table",
+    Map(
+      SQLConf.ANSI_ENABLED.key -> "false",
+      CometConf.getExprAllowIncompatConfigKey(classOf[Cast]) -> "true"))


I was wondering if we could handle cast compatibility on a case-by-case basis. That would help us better evaluate custom implementations against DataFusion-supported operations, as well as unsupported cast ops. This can be done in a follow-up PR (I will file an issue once this is merged).

andygrove added 6 commits December 23, 2025 13:02

Add microbenchmark for casting string to numeric

5c785ec

improve

41ad69a

simplify

399a047

Save

480abcf

generate data once

ef5baaa

improve

01b0772

andygrove marked this pull request as ready for review December 23, 2025 21:00

coderfender approved these changes Dec 23, 2025
