fix: Spark freqtable and missing values #1798
This PR addresses issue #1429, as well as some of the histogram issues in #1602. While trying to resolve the missing-values category in the charts when Spark is used as a backend, I discovered a few bugs in the logic that computes summary stats, as well as in the aggregations for value counts of non-null records.
I approached this by creating a toy dataset of both strings and numbers, profiling it with both Spark and pandas, and comparing the reports afterwards. The toy dataset includes nulls in both fields, as well as duplicate records.
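For reference, the comparison setup looks roughly like the sketch below. The column names and values are illustrative stand-ins rather than the exact toy data from this PR, and it assumes ydata-profiling's Spark backend is installed (e.g. the pyspark extra) and accepts a pyspark DataFrame directly.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

# Toy data: a string field and a numeric field, both with missing values,
# plus duplicate records (illustrative stand-in for the dataset used here).
pdf = pd.DataFrame(
    {
        "category": ["a", "a", "b", None, "c", "a", None, "b"],
        "amount": [1.0, 1.0, 2.5, np.nan, 3.0, 1.0, np.nan, 2.5],
    }
)

# pandas backend
ProfileReport(pdf, title="Pandas Profile").to_file("pandas_profile.html")

# Spark backend: the same data converted to a pyspark DataFrame, where
# np.nan becomes NaN (not null) -- the source of the discrepancies below.
spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.createDataFrame(pdf)
ProfileReport(sdf, title="Spark Profile").to_file("spark_profile.html")
```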
Initial Pandas Profile Output
Initial Spark Profile Output

Summary Stats:
(screenshots)

Common Values:
(screenshots)
Issues and Root Causes
There are a couple of commits in here that address specific root causes of these discrepancies. Here are the summarized issues with their solutions (a rough sketch of the underlying logic follows the list):

Issue 1: pandas by default counts NaN values as null in summary stats, but Spark SQL does not, so one of the commits addresses that explicitly.
Solution: the numeric_stats_spark() method now explicitly filters out nulls and NaNs to match pandas' default behavior.

Issue 2: Missing values were not being calculated correctly because NaN in Spark is not null, so NaN records weren't counted as missing when they should have been.
Solution: addressed in the describe_spark_counts() method.

Issue 3: Histogram counts and Common Values counts built from the summary["value_counts_without_nan"] Series were not summing counts correctly.
Solution: .limit(200) makes everything line up to parity with the pandas output.
Fixed Spark Profile Output

Summary Stats:
(screenshots)

Common Values:
(screenshots)
Concluding Thoughts
While there is still some very slight variation in the computed stats because Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are ultimately negligible, especially compared to the initial outputs, where the differences are misleading without these fixes.
@fabclmnt - I definitely welcome any and all feedback on this approach! I'm happy to discuss further, and hope this is helpful to anyone using the Spark backend. I think this would knock out a few bug tickets overall.