
Add key normalisation functionality #16

Merged: juaristi22 merged 3 commits into main from add-key-normalisation on Aug 7, 2025


@nikhilwoodruff nikhilwoodruff commented Jul 27, 2025

Implements key normalisation functionality to convert primary and foreign keys in related tables to zero-based sequential indices while preserving relationships.

When constructing microdata, we often work with datasets that have inconsistent key formats - some might use sequential IDs starting from 1, others might use large sparse integers like user IDs 101, 105, 103. This creates unnecessary complexity and memory overhead. By normalising all keys to a common zero-based sequential format, we can assume consistent key patterns across all datasets, simplifying downstream processing and reducing memory usage.
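The sparse-ID case above can be pictured with a few lines of plain Python. This is only a sketch of the general idea, not the code in this PR: `build_key_mapping` is a hypothetical helper, and assigning indices in order of first appearance is an assumption (the actual implementation may order keys differently).

```python
def build_key_mapping(keys):
    """Map each distinct key to a zero-based index (hypothetical helper).

    Assumption: indices follow the order in which keys first appear.
    """
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = len(mapping)
    return mapping


user_ids = [101, 105, 103]  # sparse IDs from the example above
mapping = build_key_mapping(user_ids)
normalised = [mapping[key] for key in user_ids]

print(mapping)     # {101: 0, 105: 1, 103: 2}
print(normalised)  # [0, 1, 2]
```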

The implementation adds normalise_table_keys() for multi-table normalisation with relationship preservation and normalise_single_table_keys() for single table scenarios. Foreign key relationships are automatically detected based on column name matching, though explicit specification is also supported. The functionality handles edge cases like duplicate keys, missing columns, and invalid references with clear error messages.
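To make the multi-table behaviour concrete, here is a minimal, dependency-free sketch of the technique described above. It is not the PR's implementation: `normalise_keys_sketch` and `detect_foreign_keys` are hypothetical names, tables are modelled as lists of dicts rather than whatever structure the library uses, the example data is invented, and "a column in another table with the same name as a primary key is a foreign key" is an assumed reading of the name-matching rule.

```python
def detect_foreign_keys(tables, primary_keys):
    """Guess foreign keys by column-name matching (assumed rule, see lead-in)."""
    detected = {}
    for name, rows in tables.items():
        columns = set().union(*(row.keys() for row in rows))
        for target, pk in primary_keys.items():
            if target != name and pk in columns:
                detected.setdefault(name, {})[pk] = target
    return detected


def normalise_keys_sketch(tables, primary_keys, foreign_keys=None):
    """Rewrite primary and foreign keys as zero-based sequential indices.

    tables:       {table name: list of row dicts}
    primary_keys: {table name: primary key column}
    foreign_keys: {table name: {fk column: referenced table}}, or None to
                  auto-detect by column name.
    """
    if foreign_keys is None:
        foreign_keys = detect_foreign_keys(tables, primary_keys)

    # Pass 1: build one old-key -> new-index mapping per table.
    mappings = {}
    for name, pk in primary_keys.items():
        mapping = {}
        for row in tables[name]:
            if pk not in row:
                raise KeyError(f"Table '{name}' is missing column '{pk}'")
            if row[pk] in mapping:
                raise ValueError(f"Duplicate key {row[pk]!r} in '{name}'")
            mapping[row[pk]] = len(mapping)
        mappings[name] = mapping

    # Pass 2: apply the mappings to primary and foreign key columns.
    result = {}
    for name, rows in tables.items():
        new_rows = []
        for row in rows:
            new_row = dict(row)
            if name in primary_keys:
                pk = primary_keys[name]
                new_row[pk] = mappings[name][row[pk]]
            for col, target in foreign_keys.get(name, {}).items():
                if row[col] not in mappings[target]:
                    raise ValueError(
                        f"'{name}.{col}' references unknown key {row[col]!r}"
                    )
                new_row[col] = mappings[target][row[col]]
            new_rows.append(new_row)
        result[name] = new_rows
    return result


# Invented example data: persons reference households by household_id.
tables = {
    "households": [{"household_id": 1001}, {"household_id": 1005}],
    "persons": [
        {"person_id": 101, "household_id": 1005},
        {"person_id": 105, "household_id": 1001},
        {"person_id": 103, "household_id": 1005},
    ],
}
primary_keys = {"households": "household_id", "persons": "person_id"}
result = normalise_keys_sketch(tables, primary_keys)
```

In this sketch every table's mapping is built before any column is rewritten, so a foreign key is always translated with the same mapping as the primary key it points to. That is what keeps the relationships intact.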

The test suite adds 15 test cases covering normal operations, edge cases, and error handling. All existing tests continue to pass, and the code is formatted with black and isort.

Fixes #15

Implements normalise_table_keys and normalise_single_table_keys functions
to convert primary and foreign keys to zero-based sequential indices while
preserving relationships between tables.

Features:
- Auto-detection of foreign key relationships
- Explicit foreign key specification support
- Comprehensive error handling and validation
- Full test coverage with 15 test cases
- Documentation with examples and use cases

Fixes #15
@nikhilwoodruff nikhilwoodruff self-assigned this Jul 27, 2025
codecov bot commented Jul 27, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️


@juaristi22 juaristi22 left a comment


Very nice, very clean. Are we thinking of using this on person or household ids for example?

Small note: I saw you're missing the changelog entry. I think the check hasn't failed because a changelog entry already exists in main (the versioning failed on my code that was last merged). I was hoping to fix that with my open Dataset migration PR, but I think it's best if I quickly do it on main directly. You might want to add a new entry once that's fixed? Not blocking though (it wouldn't be the end of the world if there's no entry when we merge this anyway).

@juaristi22 juaristi22 merged commit 7a533dc into main Aug 7, 2025
4 checks passed
