Functioning UKB RAP data prep script #35

LArnoldt · 2025-12-18T14:47:07Z

The data preparation script currently provided for the UKB RAP has several major issues:

(1) Performance: the `for i in range(len(fo_fields)):` loop repeatedly executes `df.select(...).where(...).withColumn(...).union(...)`. This results in ~1000 separate Spark jobs with repeated scans, repeated UDF execution, and a growing union lineage, which prevents effective query optimization and makes the runtime explode as the number of columns grows.
(2) The vocabulary number deviates from the original implementation.
(3) The events are not correctly tokenized.

This PR provides a script that solves these issues while otherwise replicating the original data preparation notebook, `example_ukb_to_bin.ipynb`.
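
To illustrate point (1), here is a minimal sketch of how a per-column `select`/`union` loop can be collapsed into a single wide-to-long pass using Spark SQL's `stack` generator. It is not the code from this PR. The helper names (`build_stack_expr`, `unpivot_fields`), the `eid` ID column, the output column names, and the example field names are assumptions made for illustration, and the sketch assumes all listed columns share one data type.

```python
def build_stack_expr(fields, name_col="field", value_col="event_date"):
    """Build a Spark SQL stack() expression that turns `fields` into rows.

    Hypothetical helper: emits one (column name, column value) pair per field,
    so Spark can unpivot all columns in a single projection.
    """
    if not fields:
        raise ValueError("fields must not be empty")
    # Each pair is a string literal with the column name, then the column itself.
    pairs = ", ".join(f"'{f}', `{f}`" for f in fields)
    return f"stack({len(fields)}, {pairs}) as ({name_col}, {value_col})"


def unpivot_fields(df, fields, id_col="eid", value_col="event_date"):
    """Wide-to-long in one pass instead of one select/where/union per column.

    `df` is assumed to be a pyspark.sql.DataFrame; only selectExpr() and
    where() are called on it. Rows with a null value are dropped, mirroring
    the per-column filter of the looped version.
    """
    expr = build_stack_expr(fields, value_col=value_col)
    long_df = df.selectExpr(f"`{id_col}`", expr)
    return long_df.where(f"`{value_col}` IS NOT NULL")


if __name__ == "__main__":
    # Placeholder field names, used only to show the generated expression.
    print(build_stack_expr(["p130000", "p130002"]))
```

Because the whole reshape is one projection plus one filter, Spark can plan it as a single scan, and the lineage no longer grows with the number of columns. On Spark 3.4 or later, `DataFrame.unpivot` should be able to serve the same purpose without building the expression by hand.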

Functioning UKB RAP data prep script

72fc136
