
Conversation

@MCBoarder289 (Contributor) commented Dec 8, 2025

This PR addresses issue #1429, as well as some of the histogram issues in #1602. While trying to resolve the missing category in the charts when Spark is used as the backend, I discovered a few bugs in the logic computing summary stats, as well as in the aggregations for value counts of non-null records.

I approached this by creating a toy dataset of both strings and numbers, profiling it in both Spark and Pandas, and comparing the reports afterwards. The toy dataset includes nulls in both fields, as well as duplicate records.
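For illustration, here's a minimal sketch of that kind of setup; the column names and values are hypothetical, not the exact data from my tests:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Toy data: nulls, NaNs, and duplicate records on both a string
# and a numeric field (illustrative values only).
pdf = pd.DataFrame(
    {
        "category": ["a", "a", "b", None, "c", "c", None],
        "amount": [1.0, 1.0, 2.5, np.nan, 3.0, 3.0, None],
    }
)

# Build the equivalent Spark DataFrame explicitly so the numeric
# column contains both a true null and a NaN.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [
        ("a", 1.0), ("a", 1.0), ("b", 2.5), (None, float("nan")),
        ("c", 3.0), ("c", 3.0), (None, None),
    ],
    schema="category string, amount double",
)

# Profile pdf and sdf with ydata-profiling and diff the two reports.
```

Note that in pandas the None and np.nan in amount collapse into the same missing marker, while in Spark the null and the NaN stay distinct; that divergence is exactly what the fixes below deal with.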

Initial Pandas Profile Output

  • Summary Stats:
    Pandas_example_stats
  • Common Values:
    Pandas_example_common_values

Initial Spark Profile Output

  • Summary Stats:
    Spark_wrong_stats

  • Common Values:
    Spark_wrong_common_values

Issues and Root Causes
There are a couple of commits in here that address specific root causes of these discrepancies. Here are the summarized issues and their solutions:

  • Issue 1: pandas by default counts NaN values as null in summary stats, but Spark SQL does not, so we address that explicitly in one of the commits.

    • Resolution: Ensured the numeric_stats_spark() method explicitly filters out nulls and NaNs to match pandas' default behavior (see the sketch after this list).
  • Issue 2: Missing values were not being calculated properly because NaN is not null in Spark, so NaN records weren't counted as missing when they should have been.

    • Resolution: Added NaN filters to the n_missing computation in the describe_spark_counts() method.
  • Issue 3: Histogram counts and Common Values counts built from the summary["value_counts_without_nan"] Series were not summing counts correctly.

    • Resolution: Summing the count column (instead of performing a row count) and removing the limit(200) brings everything to parity with the Pandas output.
    • NOTE: Since we're pre-aggregating the data for the value counts, I don't think the limit(200) is necessary even with Spark. We're pulling the result down into a Pandas Series anyway, so if the data were too big, the process would fail explicitly instead of producing misleading reports. And if you're running Spark on big data, it's reasonable to assume you're on a high-memory compute instance.
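To make the first two fixes concrete, here's a minimal sketch of the kind of filtering involved, reusing the hypothetical sdf and amount column from the sketch above; this is illustrative, not the actual patch:

```python
from pyspark.sql import functions as F

amount = F.col("amount")

# Issue 1: pandas silently skips both None and NaN in min/max/stddev,
# but Spark treats NaN as an ordinary double value, so it must be
# filtered out explicitly before computing numeric stats.
valid = sdf.filter(amount.isNotNull() & ~F.isnan(amount))
stats = valid.agg(
    F.min(amount).alias("min"),
    F.max(amount).alias("max"),
    F.stddev(amount).alias("std"),
).first()

# Issue 2: NaN is not null in Spark, so n_missing must count both.
n_missing = sdf.filter(amount.isNull() | F.isnan(amount)).count()
```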

Fixed Spark Profile Output

  • Summary Stats:
    Spark_fixed_stats

  • Common Values:
    Spark_fixed_common_values

Concluding Thoughts
While there is still some very slight variation in the computed stats because of how Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are negligible, especially compared with the initial outputs, which are misleading without these fixes.

@fabclmnt - I definitely welcome any and all feedback on this approach! I'm happy to discuss further, and hope this is helpful to anyone using the Spark backend. I think this would knock out a few bug tickets overall.

In the Pandas implementation, numeric stats like min/max/stddev by default ignore null values.
This commit updates the Spark implementation to match that more closely.
We need to add the isnan() check because pandas' isnull() check counts NaN as null, but Spark does not.
The previous calculation of counts was actually counting an already-summarized DataFrame, so it wasn't capturing the correct counts for each instance of a value.

This is updated by summing the count value instead of performing a row count operation.
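To illustrate the distinction (again reusing the hypothetical sdf, not the actual patch): the value counts DataFrame already has one row per distinct value, so the record count is the sum of its count column, not its row count:

```python
from pyspark.sql import functions as F

# Pre-aggregated value counts: one row per distinct value.
value_counts = sdf.groupBy("category").count()

# Bug: counting rows of the aggregate yields the number of distinct
# values, not the number of underlying records.
n_distinct = value_counts.count()

# Fix: sum the "count" column to recover the true record count.
n_records = value_counts.agg(F.sum("count")).first()[0]
```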
@MCBoarder289 force-pushed the fix/1429-spark-freqtable-missing-vals branch from 1c1ae2c to d124bbb on December 8, 2025 at 21:02
@MCBoarder289 force-pushed the fix/1429-spark-freqtable-missing-vals branch from d124bbb to 4daf389 on December 10, 2025 at 22:07
Discovered this edge case with real data, and still need to fix the rendering of an empty histogram.
@MCBoarder289 (Contributor, Author) commented:

Closing in favor of #1800, which fixes multiple issues at once.
