-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix PartialEq behavior for UnionFields
#8937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix PartialEq behavior for UnionFields
#8937
Conversation
|
cc @alamb this should be fairly quick to review! |
arrow-schema/src/fields.rs
Outdated
| } | ||
| } | ||
|
|
||
| impl PartialEq for UnionFields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a possible way of getting the desired behavior. In our project, we do have some mildly large unions (10-ish currently, but maybe more in the future) that occur very frequently and I am a bit anxious on how the quadratic equality check will perform in practice, as the equality of a data type is often used when comparing fields, schemas, and logical plans.
Another approach would be sorting the slice when constructing the instance. Maybe this would be an equally simple change with better performance as this already a private field.
Note that I have no idea how this will actually perform as I have no benchmarks. Just my 2 cents.
Otherwise, this looks like a good change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I think this PR is focused on getting a correct implementation in place
If this equality check ever shows up as the bottleneck in a profile, I'd be totally for a follow-up perf PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can file a ticket
arrow-schema/src/fields.rs
Outdated
| other.iter().any(|b| { | ||
| a.0 == b.0 | ||
| && a.1.is_nullable() == b.1.is_nullable() | ||
| && a.1.data_type().equals_datatype(b.1.data_type()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the metadata?
(it might be more consistent to compare a.1 == b.1 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since DataType::equals_datatype
arrow-rs/arrow-schema/src/datatype.rs
Lines 676 to 688 in 08dcc0b
| DataType::Union(a_union_fields, a_union_mode), | |
| DataType::Union(b_union_fields, b_union_mode), | |
| ) => { | |
| a_union_mode == b_union_mode | |
| && a_union_fields.len() == b_union_fields.len() | |
| && a_union_fields.iter().all(|a| { | |
| b_union_fields.iter().any(|b| { | |
| a.0 == b.0 | |
| && a.1.is_nullable() == b.1.is_nullable() | |
| && a.1.data_type().equals_datatype(b.1.data_type()) | |
| }) | |
| }) | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I take that back. It seems DataType impls PartialEq as well. I'll choose to do the same
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew and @tobixdev -- I am not sure about this PR.
Specifically, I am not sure about (re)defining what equality means for Unions - I think it is important that Union equality is consistently defined across the crate
If we are going to do this, I think we should clearly document somewhere what "equal" means for DataType::Union and UnionArray and make sure the crate is consistent with this definition
Can you research what the current definition UnionArray is?
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
I agree that we should document what "equal" means for But isn't The specific match arm for unions is here: arrow-rs/arrow-schema/src/datatype.rs Lines 676 to 688 in 08dcc0b
|
629388f to
b43a742
Compare
Hi @alamb, apologies for the delay. Work was a bit busy this week, but I have some bandwidth to get this over the line now |
b5793e6 to
f4441b6
Compare
curious to hear your thoughts @alamb |
arrow-schema/src/fields.rs
Outdated
|
|
||
| impl Hash for UnionFields { | ||
| fn hash<H: std::hash::Hasher>(&self, state: &mut H) { | ||
| let mut v = self.0.iter().collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocating (the Vec) during a hash function is probably a pretty poor choice -- can we please try and find some different implementation that doesn't require a hash? For example a constant or hash the data types of the lowest and highest type_id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative might be to always sort the UnionFields on construction 🤔
arrow-schema/src/fields.rs
Outdated
|
|
||
| impl PartialEq for UnionFields { | ||
| fn eq(&self, other: &Self) -> bool { | ||
| self.len() == other.len() && self.iter().all(|a| other.iter().any(|b| a == b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a comment that says this is (and should remain) consistent with the definition in DataType::Union?
536715c to
7b1d689
Compare
01a23e3 to
b15c0f2
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the approach is to ensure the Arc<[]> inside UnionFields is always sorted, we need to include that documentation as part of its docstring for visibility. Also, I see only try_new is modified with this new behaviour, but there's a whole lot of other APIs there which construct UnionFIelds without going through try_new() so they would need to also be updated; for example, new(), from_fields(), the FromIterator implementation, etc.
Which issue does this PR close?
PartialEqbehavior forUnionFields#8936Rationale for this change
This PR introduces a custom PartialEq impl for
UnionFieldsthat uses set semantics when determining equalityCurrently,
UnionFieldsderives PartailEq, but its internal representation is a list of(id, data type)tuples. As a result, equality is order-dependent, even thoughUnionFieldsis conceptually a set of unique(id, data type)pairs