Implements normalise_table_keys and normalise_single_table_keys functions to convert primary and foreign keys to zero-based sequential indices while preserving relationships between tables.

Features:
- Auto-detection of foreign key relationships
- Explicit foreign key specification support
- Comprehensive error handling and validation
- Full test coverage with 15 test cases
- Documentation with examples and use cases

Fixes #15
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️
juaristi22
left a comment
Very nice, very clean. Are we thinking of using this on person or household ids for example?
Small note: I saw you are missing the changelog entry, and I think the reason that hasn't failed is that a changelog entry already exists in main (the versioning failed on my code that was last merged). I was hoping to fix it with my open Dataset migration PR, but I think it's best if I quickly do it on main already. You might want to add a new entry once that's fixed? Not blocking though (it wouldn't be the end of the world if there is no entry when we merge this anyway).
Implements key normalisation functionality to convert primary and foreign keys in related tables to zero-based sequential indices while preserving relationships.
When constructing microdata, we often work with datasets that have inconsistent key formats - some might use sequential IDs starting from 1, others might use large sparse integers like user IDs 101, 105, 103. This creates unnecessary complexity and memory overhead. By normalising all keys to a common zero-based sequential format, we can assume consistent key patterns across all datasets, simplifying downstream processing and reducing memory usage.
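To make the idea concrete, here is a minimal sketch of mapping sparse keys to zero-based sequential indices. It is not the code from this PR: the helper name `normalise_keys` is hypothetical, and assigning indices in order of first appearance is an assumption (the actual implementation may order keys differently, for example by sorted value).

```python
def normalise_keys(keys):
    """Map each distinct key to a zero-based index, in order of first appearance.

    Hypothetical helper for illustration only; returns the remapped keys
    together with the mapping that produced them.
    """
    key_map = {}
    for key in keys:
        # A new key receives the next free index; a repeated key keeps its index.
        key_map.setdefault(key, len(key_map))
    return [key_map[key] for key in keys], key_map


# Sparse user IDs like the ones mentioned above.
normalised, key_map = normalise_keys([101, 105, 103])
print(normalised)  # [0, 1, 2]
print(key_map)     # {101: 0, 105: 1, 103: 2}
```

Returning the mapping alongside the remapped keys is what lets other tables that reference these IDs be rewritten consistently.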
The implementation adds normalise_table_keys() for multi-table normalisation with relationship preservation and normalise_single_table_keys() for single table scenarios. Foreign key relationships are automatically detected based on column name matching, though explicit specification is also supported. The functionality handles edge cases like duplicate keys, missing columns, and invalid references with clear error messages.

Comprehensive test coverage includes 15 test cases covering normal operations, edge cases, and error handling. All existing tests continue to pass and code is formatted with black and isort.
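As a rough illustration of the multi-table behaviour described above, the sketch below remaps primary keys and any matching foreign key columns together. It does not reproduce the signature of normalise_table_keys(): the function `normalise_tables_sketch`, its arguments, and the plain-dict table representation are all assumptions made for the example. It also assumes primary key column names are unique across tables, so a column with the same name in another table can be treated as a foreign key.

```python
def normalise_tables_sketch(tables, primary_keys):
    """Remap primary keys and same-named foreign key columns to 0..n-1.

    tables:       {table_name: {column_name: [values]}}  (hypothetical layout)
    primary_keys: {table_name: primary_key_column_name}
    """
    # Build one old-key -> new-index mapping per primary key column.
    key_maps = {}
    for table_name, pk_column in primary_keys.items():
        if table_name not in tables:
            raise ValueError(f"unknown table: {table_name!r}")
        if pk_column not in tables[table_name]:
            raise ValueError(f"table {table_name!r} has no column {pk_column!r}")
        values = tables[table_name][pk_column]
        if len(set(values)) != len(values):
            raise ValueError(f"duplicate keys in {table_name}.{pk_column}")
        key_maps[pk_column] = {old: new for new, old in enumerate(values)}

    # Rewrite every column whose name matches a primary key column.
    result = {}
    for table_name, columns in tables.items():
        new_columns = {}
        for column, values in columns.items():
            mapping = key_maps.get(column)
            if mapping is None:
                new_columns[column] = list(values)  # not a key column: copy as-is
                continue
            unknown = [v for v in values if v not in mapping]
            if unknown:
                raise ValueError(
                    f"{table_name}.{column} references missing keys: {unknown}"
                )
            new_columns[column] = [mapping[v] for v in values]
        result[table_name] = new_columns
    return result


# Hypothetical example data.
tables = {
    "users": {"user_id": [101, 105, 103], "name": ["Ana", "Ben", "Cat"]},
    "orders": {"order_id": [201, 205, 207], "user_id": [105, 101, 105]},
}
normalised = normalise_tables_sketch(
    tables, {"users": "user_id", "orders": "order_id"}
)
print(normalised["users"]["user_id"])   # [0, 1, 2]
print(normalised["orders"]["user_id"])  # [1, 0, 1]
```

Because the same mapping is applied to the primary key column and to every matching foreign key column, each order still points at the same user after normalisation.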
Fixes #15