@andygrove andygrove commented Dec 23, 2025

Which issue does this PR close?

Part of #2955

Rationale for this change

Add benchmarks for casting strings to numeric types using both `cast` and `try_cast`. ANSI mode is not explicitly enabled: with it on, we would either have to ensure that all input values are valid (which would not fully exercise the validation logic), or we would just be timing how long it takes to hit the first invalid value and throw an exception, which is not helpful. Testing `try_cast` instead lets us measure the overhead of the validation logic.
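
To make the trade-off concrete, here is a minimal sketch of the three behaviours being discussed. It is a toy model in plain Scala, not Spark's or Comet's parsing code: `CastModel`, `castToInt`, and the mode names are hypothetical, and `Option[Int]` stands in for a nullable result. It only assumes the documented semantics that a non-ANSI `CAST` and `TRY_CAST` yield NULL for an invalid string while an ANSI `CAST` raises an error.

```scala
import scala.util.Try

// Toy model of the three behaviours; NOT Spark's or Comet's parsing code.
object CastModel {
  sealed trait Mode
  case object Legacy extends Mode  // CAST with spark.sql.ansi.enabled=false
  case object Ansi extends Mode    // CAST with spark.sql.ansi.enabled=true
  case object TryCast extends Mode // TRY_CAST

  // None models SQL NULL. Real Spark parsing accepts more input forms than toInt.
  def castToInt(s: String, mode: Mode): Option[Int] = {
    val parsed = Try(s.trim.toInt).toOption
    mode match {
      case Ansi if parsed.isEmpty =>
        throw new NumberFormatException(s"invalid input: '$s'")
      case _ => parsed
    }
  }

  def main(args: Array[String]): Unit = {
    val column = Seq("1", "2", "oops", "4")
    // Legacy and TryCast validate every row and null out the bad one.
    println(column.map(castToInt(_, Legacy)))
    println(column.map(castToInt(_, TryCast)))
    // Ansi stops at the first invalid row, so the remaining rows are never cast.
    println(Try(column.map(castToInt(_, Ansi))))
  }
}
```

In this model the ANSI run never reaches the row after `"oops"`, which is why a benchmark with invalid input under ANSI mode would mostly measure time-to-first-exception.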

What changes are included in this PR?

How are these changes tested?

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to BOOLEAN:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               127            147          20          8.3         120.7       1.0X
Comet (Scan)                                        122            137          17          8.6         116.6       1.0X
Comet (Scan + Exec)                                  85             96          11         12.3          81.3       1.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to BYTE:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             73          16         17.8          56.3       1.0X
Comet (Scan)                                         57             71          17         18.3          54.7       1.0X
Comet (Scan + Exec)                                  74             91          18         14.2          70.5       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to SHORT:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             78          20         17.9          55.8       1.0X
Comet (Scan)                                         56             70          19         18.7          53.4       1.0X
Comet (Scan + Exec)                                  73             98          22         14.3          70.0       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to INT:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                56             64          11         18.7          53.4       1.0X
Comet (Scan)                                         56             68          16         18.9          53.0       1.0X
Comet (Scan + Exec)                                  70             88          21         15.0          66.6       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to LONG:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                59             75          30         17.8          56.2       1.0X
Comet (Scan)                                         57             82          35         18.3          54.5       1.0X
Comet (Scan + Exec)                                  73             99          28         14.4          69.7       0.8X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to FLOAT:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                96            116          32         10.9          91.9       1.0X
Comet (Scan)                                         95            110          23         11.0          90.6       1.0X
Comet (Scan + Exec)                                  62             86          30         16.9          59.0       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to DOUBLE:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                98            113          26         10.7          93.8       1.0X
Comet (Scan)                                         99            110          24         10.6          93.9       1.0X
Comet (Scan + Exec)                                  62             93          31         16.9          59.2       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
CAST String to DECIMAL(10,2):             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               180            194          25          5.8         171.2       1.0X
Comet (Scan)                                        179            185          13          5.8         171.1       1.0X
Comet (Scan + Exec)                                  95            148          30         11.1          90.4       1.9X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to BOOLEAN:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              4632           4658          36          0.2        4417.8       1.0X
Comet (Scan)                                       4578           4596          25          0.2        4366.1       1.0X
Comet (Scan + Exec)                                  81            101          25         13.0          76.9      57.4X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to BYTE:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              2555           2569          19          0.4        2436.5       1.0X
Comet (Scan)                                       2472           2482          13          0.4        2357.8       1.0X
Comet (Scan + Exec)                                  71             96          28         14.7          67.8      35.9X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to SHORT:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              2517           2522           7          0.4        2400.4       1.0X
Comet (Scan)                                       2482           2484           3          0.4        2367.4       1.0X
Comet (Scan + Exec)                                  71             88          28         14.8          67.7      35.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to INT:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              1383           1384           2          0.8        1318.7       1.0X
Comet (Scan)                                       1352           1354           3          0.8        1289.0       1.0X
Comet (Scan + Exec)                                  69             86          29         15.2          65.8      20.0X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to LONG:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                              1369           1372           4          0.8        1305.3       1.0X
Comet (Scan)                                       1355           1399          63          0.8        1292.0       1.0X
Comet (Scan + Exec)                                  72            109          32         14.6          68.3      19.1X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to FLOAT:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                95            120          34         11.0          90.8       1.0X
Comet (Scan)                                         94            146          49         11.1          89.8       1.0X
Comet (Scan + Exec)                                  62            107          47         17.0          58.9       1.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to DOUBLE:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                96            108          12         10.9          91.5       1.0X
Comet (Scan)                                         98            128          37         10.7          93.1       1.0X
Comet (Scan + Exec)                                  61             94          40         17.1          58.5       1.6X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
TRY_CAST String to DECIMAL(10,2):         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               180            218          35          5.8         171.3       1.0X
Comet (Scan)                                        178            222          44          5.9         170.0       1.0X
Comet (Scan + Exec)                                  93            126          41         11.3          88.8       1.9X
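
As a reading aid for the tables above, the `Relative` column appears to be the baseline (Spark) time divided by each case's time. The sketch below recomputes it from the rounded `Per Row(ns)` values; `RelativeSpeedup` and `relative` are hypothetical names, and because the inputs are already rounded, the last digit could in principle differ from what the benchmark harness prints from its unrounded timings.

```scala
object RelativeSpeedup {
  // Per Row(ns) values copied from the tables above: (case, Spark, Comet Scan + Exec).
  val perRowNs: Seq[(String, Double, Double)] = Seq(
    ("TRY_CAST String to BOOLEAN", 4417.8, 76.9),
    ("TRY_CAST String to BYTE", 2436.5, 67.8),
    ("TRY_CAST String to INT", 1318.7, 65.8),
    ("CAST String to BYTE", 56.3, 70.5))

  // Assumption: Relative = baseline time / case time, rounded to one decimal place.
  def relative(baselineNs: Double, caseNs: Double): Double =
    math.round(baselineNs / caseNs * 10) / 10.0

  def main(args: Array[String]): Unit =
    perRowNs.foreach { case (name, spark, comet) =>
      println(s"$name: ${relative(spark, comet)}X")
    }
}
```

For these four rows the recomputed values agree with the reported 57.4X, 35.9X, 20.0X, and 0.8X.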

@andygrove andygrove marked this pull request as ready for review December 23, 2025 21:00
@andygrove
Member Author

@coderfender Could you review when you have time?

codecov-commenter commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.64%. Comparing base (f09f8af) to head (01b0772).
⚠️ Report is 798 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2979      +/-   ##
============================================
+ Coverage     56.12%   59.64%   +3.51%     
- Complexity      976     1375     +399     
============================================
  Files           119      167      +48     
  Lines         11743    15497    +3754     
  Branches       2251     2569     +318     
============================================
+ Hits           6591     9243    +2652     
- Misses         4012     4955     +943     
- Partials       1140     1299     +159     


@coderfender coderfender left a comment

minor comments but LGTM @andygrove

* `SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometCastStringToNumericBenchmark`
* Results will be written to "spark/benchmarks/CometCastStringToNumericBenchmark-**results.txt".
*/
// spotless:on

@andygrove nit: perhaps we don't need to turn off spotless, given that none of the other benchmarks do?
Example comment from CometCastBenchmark:

/**
 * Benchmark to measure Comet execution performance. To run this benchmark:
 * {{{
 *   SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometCastBenchmark
 * }}}
 *
 * Results will be written to "spark/benchmarks/CometCastBenchmark-**results.txt".
 */


private val castFunctions = Seq("CAST", "TRY_CAST")
private val targetTypes =
Seq("BOOLEAN", "BYTE", "SHORT", "INT", "LONG", "FLOAT", "DOUBLE", "DECIMAL(10,2)")

nit: one could argue that Boolean isn't necessarily a numeric type. Also, could we add some decimals with higher precision and scale too?

https://spark.apache.org/docs/latest/sql-ref-datatypes.html
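
For context, the two lists in the diff expand into one query per (cast function, target type) pair. The sketch below reproduces that cross product in plain Scala, using the query template from the diff. `CastQueries`, `queries`, and the `widerDecimals` entries are hypothetical: the extra decimal types only illustrate what the suggestion above could look like and are not part of the PR.

```scala
object CastQueries {
  // Copied from the diff under review.
  val castFunctions = Seq("CAST", "TRY_CAST")
  val targetTypes =
    Seq("BOOLEAN", "BYTE", "SHORT", "INT", "LONG", "FLOAT", "DOUBLE", "DECIMAL(10,2)")

  // Hypothetical additions illustrating the suggestion; not part of the PR.
  val widerDecimals = Seq("DECIMAL(18,6)", "DECIMAL(38,10)")

  // One query per (cast function, target type), cast function varying slowest.
  def queries(types: Seq[String]): Seq[String] =
    for {
      castFunc <- castFunctions
      targetType <- types
    } yield s"SELECT $castFunc(c1 AS $targetType) FROM parquetV1Table"

  def main(args: Array[String]): Unit =
    queries(targetTypes ++ widerDecimals).foreach(println)
}
```

With the lists as they stand this yields 16 queries; each additional target type adds two more, one for `CAST` and one for `TRY_CAST`.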

s"SELECT $castFunc(c1 AS $targetType) FROM parquetV1Table",
Map(
SQLConf.ANSI_ENABLED.key -> "false",
CometConf.getExprAllowIncompatConfigKey(classOf[Cast]) -> "true"))

I was wondering if we could handle cast compatibility on a case-by-case basis, which would help us better evaluate custom implementations versus DataFusion-supported operations, along with unsupported cast ops. This can be done in a follow-up PR (I will file an issue once this is merged).
