Convert _get_column_projection_values to use Field-IDs#2293
Convert _get_column_projection_values to use Field-IDs#2293Fokko merged 3 commits intoapache:mainfrom
_get_column_projection_values to use Field-IDs#2293Conversation
| ) | ||
|
|
||
| # Translate column names | ||
| translated_expr = translate_column_names(bound_expr, file_schema, case_sensitive=True, projected_field_values={2: None}) |
There was a problem hiding this comment.
A partition can be null as well 👍
| with transaction.update_snapshot().overwrite() as update: | ||
| update.append_data_file(unpartitioned_file) | ||
|
|
||
| schema = pa.schema([("other_field", pa.string()), ("partition_id", pa.int64())]) |
There was a problem hiding this comment.
The IdentityTransform returns the same type as the one in the table:
schema = Schema(
NestedField(1, "other_field", StringType(), required=False), NestedField(2, "partition_id", IntegerType(), required=False)
)| # In the order described by the "Column Projection" section of the Iceberg spec: | ||
| # https://iceberg.apache.org/spec/#column-projection | ||
| # Evaluate column projection first if it exists | ||
| if projected_field_value := self.projected_field_values.get(field.name): |
There was a problem hiding this comment.
Removing wallrus := here since the projected value can also be None
| if partition_value := accessors[partition_field.field_id].get(file.partition): | ||
| projected_missing_fields[field_id] = partition_value |
There was a problem hiding this comment.
I think it makes sense to fail here, rather than suppress the Error. In the case of an IndexError I think your table is corrupt.
There was a problem hiding this comment.
This was actually a bug. It always used the current spec, while we should use the spec that it was written with.
There was a problem hiding this comment.
good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672
now im worried about all the other places we use .spec()
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for cleaning this up
|
should we also address this comment in the PR? |
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM!
I think we can remove test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default test in tests/expressions/test_visitors.py since you added a new test in this PR
#2029 (comment)
| if partition_value := accessors[partition_field.field_id].get(file.partition): | ||
| projected_missing_fields[field_id] = partition_value |
There was a problem hiding this comment.
good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672
now im worried about all the other places we use .spec()
Yes we can do that here as well, however, I think it is an extreme edge-case. This is only related to when the partition-spec is set, but the actual fields in the struct are not set. This potentially can cause data-incorrectness in V1 tables (because there the partition struct doesn't have field-IDs). I'm pretty sure that any Iceberg client won't produce such data, but it could be in the case of a poorly written add-files script that ignores partitioning. Edit: Let's defer that to another PR 👍 |
|
Thanks @kevinjqliu for the review 🙌 |
|
I've created the follow-up PR here: #2295 |
# Rationale for this change This is a refactor of the `_get_column_projection_values` to rely on field-IDs rather than names. Field IDs will never change, while partitions and column names can be updated in a tables' lifetime. # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. -->
Rationale for this change
This is a refactor of the
_get_column_projection_valuesto rely on field-IDs rather than names. Field IDs will never change, while partitions and column names can be updated in a tables' lifetime.Are these changes tested?
Are there any user-facing changes?