@LArnoldt

The provided UKB data preparation script has some major issues:

(1) The performance issue is caused specifically by the `for i in range(len(fo_fields)):` loop, which repeatedly executes `df.select(...).where(...).withColumn(...).union(...)`. This launches ~1000 separate Spark jobs with repeated scans, repeated UDF execution, and a growing union lineage, which prevents effective query optimization and makes the runtime explode with the column count (see the sketch below).
(2) The resulting vocabulary size deviates from the original implementation.
(3) The events are not correctly tokenized.

I am providing a script that solves these issues while closely replicating the original data preparation script, example_ukb_to_bin.ipynb.
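
To make issue (1) concrete, here is a minimal sketch (not the actual replacement script) of the per-field union anti-pattern and a single-pass alternative based on SQL `stack()`. The input path, the `eid` column, and the way `fo_fields` is derived are placeholders, not code from the notebook.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical wide UKB export; the real notebook loads the UKB data differently.
df = spark.read.parquet("ukb_wide.parquet")

# Placeholder: in the real script, fo_fields holds the ~1000 first-occurrence fields.
fo_fields = [c for c in df.columns if c != "eid"]

# Anti-pattern (shape only): one Spark job per field, with a union lineage that
# grows linearly and defeats query optimization.
# events = None
# for f in fo_fields:
#     part = (df.select("eid", f)
#               .where(F.col(f).isNotNull())
#               .withColumn("field", F.lit(f))
#               .withColumnRenamed(f, "value"))
#     events = part if events is None else events.union(part)

# Single-pass alternative: unpivot all fields at once with stack(), so the whole
# long-format conversion is one query plan and one scan of the data. The cast to
# string is needed because stack() requires a common type across all fields.
stack_expr = "stack({n}, {args}) as (field, value)".format(
    n=len(fo_fields),
    args=", ".join(f"'{f}', cast(`{f}` as string)" for f in fo_fields),
)
events = df.selectExpr("eid", stack_expr).where(F.col("value").isNotNull())
```

Because the unpivot is expressed as a single `selectExpr`, Catalyst sees one query plan instead of ~1000 chained unions, so the runtime no longer scales with the number of fields in the way described above.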
