@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR introduces a custom PartialEq impl for UnionFields that uses set semantics when determining equality.

Currently, UnionFields derives PartialEq, but its internal representation is a list of (id, data type) tuples. As a result, equality is order-dependent, even though UnionFields is conceptually a set of unique (id, data type) pairs.
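
To illustrate the order dependence described above, here is a minimal sketch that uses simplified stand-in types rather than the real arrow-rs ones. `OrderedFields`, `SetFields`, `sample`, and the `(i8, String)` entries are hypothetical names chosen for illustration; the only thing taken from the PR is the idea of comparing a list of (id, data type) pairs as a set.

```rust
/// Hypothetical stand-in: derived equality compares entries position by position.
#[derive(Debug, PartialEq)]
struct OrderedFields(Vec<(i8, String)>);

/// Hypothetical stand-in: same data, but equality ignores declaration order.
#[derive(Debug)]
struct SetFields(Vec<(i8, String)>);

impl PartialEq for SetFields {
    fn eq(&self, other: &Self) -> bool {
        // Assumes entries are unique, so "same length" plus "every entry of
        // `self` appears in `other`" implies both sides hold the same set.
        self.0.len() == other.0.len()
            && self.0.iter().all(|a| other.0.iter().any(|b| a == b))
    }
}

/// Builds the same two entries, optionally in reversed declaration order.
fn sample(reversed: bool) -> Vec<(i8, String)> {
    let mut entries = vec![(0, "Int32".to_string()), (1, "Utf8".to_string())];
    if reversed {
        entries.reverse();
    }
    entries
}
```

With these definitions, `OrderedFields(sample(false))` and `OrderedFields(sample(true))` compare as different, while the `SetFields` equivalents compare as equal.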

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 1, 2025
@friendlymatthew
Contributor Author

cc @alamb this should be fairly quick to review!

}
}

impl PartialEq for UnionFields {
Contributor

@tobixdev tobixdev Dec 1, 2025

I think this is a possible way of getting the desired behavior. In our project, we have some mildly large unions (10-ish currently, but maybe more in the future) that occur very frequently, and I am a bit anxious about how the quadratic equality check will perform in practice, as the equality of a data type is often used when comparing fields, schemas, and logical plans.

Another approach would be sorting the slice when constructing the instance. Maybe this would be an equally simple change with better performance, since this is already a private field.

Note that I have no idea how this will actually perform as I have no benchmarks. Just my 2 cents.

Otherwise, this looks like a good change!
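
A minimal sketch of the sort-on-construction alternative mentioned above, under stated assumptions: `CanonicalFields` and its `(i8, String)` entries are hypothetical stand-ins, not the arrow-rs type, and type ids are assumed to be unique. Once the entries are stored in a canonical order, the derived `PartialEq` (a linear, position-by-position comparison) is order-independent with respect to the caller's input.

```rust
/// Hypothetical stand-in for a union's field list, stored sorted by type id.
#[derive(Debug, PartialEq, Eq, Hash)]
struct CanonicalFields(Vec<(i8, String)>);

impl CanonicalFields {
    /// Sorts once at construction time, paying O(n log n) up front so that
    /// every later equality check is a single linear pass.
    fn new(mut entries: Vec<(i8, String)>) -> Self {
        // Assumes unique type ids, so sorting by id alone yields one
        // canonical order for any given set of entries.
        entries.sort_by_key(|entry| entry.0);
        CanonicalFields(entries)
    }
}
```

A side effect of this design is that the derived `Hash` also becomes consistent with equality, because both see the same canonical order.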

Contributor Author

Hi, I think this PR is focused on getting a correct implementation in place

If this equality check ever shows up as the bottleneck in a profile, I'd be totally for a follow-up perf PR!

Contributor

maybe we can file a ticket

other.iter().any(|b| {
a.0 == b.0
&& a.1.is_nullable() == b.1.is_nullable()
&& a.1.data_type().equals_datatype(b.1.data_type())
Contributor

How about the metadata?

(it might be more consistent to compare a.1 == b.1 )

Contributor Author

@friendlymatthew friendlymatthew Dec 17, 2025

Since DataType::equals_datatype

DataType::Union(a_union_fields, a_union_mode),
DataType::Union(b_union_fields, b_union_mode),
) => {
a_union_mode == b_union_mode
&& a_union_fields.len() == b_union_fields.len()
&& a_union_fields.iter().all(|a| {
b_union_fields.iter().any(|b| {
a.0 == b.0
&& a.1.is_nullable() == b.1.is_nullable()
&& a.1.data_type().equals_datatype(b.1.data_type())
})
})
}
explicitly avoids the metadata eq check, I will elect to do the same here

Contributor Author

Actually, I take that back. It seems DataType impls PartialEq as well. I'll choose to do the same
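
The two notions of equality discussed in this thread can be contrasted with a small sketch. `SimpleField`, `structurally_equal`, `fully_equal`, and `field` are hypothetical names over simplified types, not arrow-rs APIs; the structural version mirrors the quoted match arm only in that it looks at type id, nullability, and data type and skips metadata.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified field; not the arrow-rs `Field` type.
#[derive(Debug, Clone, PartialEq)]
struct SimpleField {
    data_type: String,
    nullable: bool,
    metadata: BTreeMap<String, String>,
}

/// Structural comparison: type id, nullability, and data type only.
/// Metadata is deliberately ignored.
fn structurally_equal(a: &(i8, SimpleField), b: &(i8, SimpleField)) -> bool {
    a.0 == b.0 && a.1.nullable == b.1.nullable && a.1.data_type == b.1.data_type
}

/// Full comparison through the derived `PartialEq`, which includes metadata.
fn fully_equal(a: &(i8, SimpleField), b: &(i8, SimpleField)) -> bool {
    a == b
}

/// Builds a nullable "Int32" field carrying the given metadata pairs.
fn field(meta: &[(&str, &str)]) -> SimpleField {
    SimpleField {
        data_type: "Int32".to_string(),
        nullable: true,
        metadata: meta
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}
```

Two entries that differ only in metadata pass `structurally_equal` but fail `fully_equal`, which is the distinction the reviewers are weighing.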

Contributor

@alamb alamb left a comment

Thank you @friendlymatthew and @tobixdev -- I am not sure about this PR.

Specifically, I am not sure about (re)defining what equality means for Unions - I think it is important that Union equality is consistently defined across the crate

If we are going to do this, I think we should clearly document somewhere what "equal" means for DataType::Union and UnionArray and make sure the crate is consistent with this definition

Can you research what the current definition of equality for UnionArray is?

@alamb
Contributor

alamb commented Dec 10, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft December 10, 2025 17:26
@friendlymatthew
Contributor Author

If we are going to do this, I think we should clearly document somewhere what "equal" means for DataType::Union and UnionArray and make sure the crate is consistent with this definition

I agree that we should document what "equal" means for DataType::Union and UnionArray.

But isn't DataType::equals_datatype already the authoritative definition of equality for DataType::Union?

The specific match arm for unions is here:

DataType::Union(a_union_fields, a_union_mode),
DataType::Union(b_union_fields, b_union_mode),
) => {
a_union_mode == b_union_mode
&& a_union_fields.len() == b_union_fields.len()
&& a_union_fields.iter().all(|a| {
b_union_fields.iter().any(|b| {
a.0 == b.0
&& a.1.is_nullable() == b.1.is_nullable()
&& a.1.data_type().equals_datatype(b.1.data_type())
})
})
}

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/compare-union-fields-eq branch from 629388f to b43a742 Compare December 12, 2025 19:03
@friendlymatthew friendlymatthew marked this pull request as ready for review December 12, 2025 19:03
@friendlymatthew
Contributor Author

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

Hi @alamb, apologies for the delay. Work was a bit busy this week, but I have some bandwidth to get this over the line now

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/compare-union-fields-eq branch from b5793e6 to f4441b6 Compare December 16, 2025 15:38
@friendlymatthew
Contributor Author

curious to hear your thoughts @alamb


impl Hash for UnionFields {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
let mut v = self.0.iter().collect::<Vec<_>>();
Contributor

Allocating (the Vec) during a hash function is probably a pretty poor choice -- can we please try to find a different implementation that doesn't require an allocation? For example, a constant, or hashing the data types of the lowest and highest type_id?
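
One way to read the "lowest and highest type_id" suggestion is sketched below, under stated assumptions: `HashedFields` and `hash_of` are hypothetical names, the entries are simplified to `(i8, String)`, and type ids are assumed to be unique. The hash feeds only the length and the two extreme entries into the hasher, so it needs no intermediate `Vec` and does not depend on declaration order. It is weaker than hashing every entry (more collisions are possible), which is acceptable for `Hash` as long as equal values hash equally.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a union's field list.
struct HashedFields(Vec<(i8, String)>);

impl Hash for HashedFields {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_usize(self.0.len());
        // Find the entries with the smallest and largest type id by
        // iterating in place; nothing is collected or sorted.
        if let Some(lowest) = self.0.iter().min_by_key(|entry| entry.0) {
            lowest.hash(state);
        }
        if let Some(highest) = self.0.iter().max_by_key(|entry| entry.0) {
            highest.hash(state);
        }
    }
}

/// Helper for trying the sketch out: hashes any value to a `u64`.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Two `HashedFields` values holding the same entries in different orders produce the same `hash_of` result, which keeps the hash consistent with a set-based `PartialEq`.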

Contributor

Another alternative might be to always sort the UnionFields on construction 🤔


impl PartialEq for UnionFields {
fn eq(&self, other: &Self) -> bool {
self.len() == other.len() && self.iter().all(|a| other.iter().any(|b| a == b))
Contributor

Can you please add a comment that says this is (and should remain) consistent with the definition in DataType::Union?

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/compare-union-fields-eq branch 2 times, most recently from 536715c to 7b1d689 Compare December 17, 2025 21:27
@github-actions github-actions bot added the arrow-avro arrow-avro crate label Dec 24, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/compare-union-fields-eq branch from 01a23e3 to b15c0f2 Compare December 24, 2025 19:34
Contributor

@Jefffrey Jefffrey left a comment

If the approach is to ensure the Arc<[]> inside UnionFields is always sorted, we need to document that in its docstring for visibility. Also, I see only try_new is modified with this new behaviour, but there are many other APIs that construct UnionFields without going through try_new(), so they would also need to be updated; for example, new(), from_fields(), the FromIterator implementation, etc.
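
The point about covering every constructor can be sketched as follows. This is an illustration over a hypothetical `SortedFields` type with made-up signatures for `normalize` and `new`, not the actual arrow-rs API: each public entry point funnels through one private function that establishes the sorted invariant, and the invariant is stated in the type's doc comment.

```rust
use std::iter::FromIterator;

/// Hypothetical stand-in for a union's field list.
///
/// Invariant: entries are always stored sorted by type id, regardless of
/// which constructor produced the value.
#[derive(Debug, PartialEq, Eq, Hash)]
struct SortedFields(Vec<(i8, String)>);

impl SortedFields {
    /// The single place where the sorted invariant is established.
    fn normalize(mut entries: Vec<(i8, String)>) -> Self {
        entries.sort_by_key(|entry| entry.0);
        SortedFields(entries)
    }

    /// Builds from parallel lists of ids and type names (hypothetical signature).
    fn new(ids: Vec<i8>, types: Vec<String>) -> Self {
        Self::normalize(ids.into_iter().zip(types).collect())
    }
}

impl FromIterator<(i8, String)> for SortedFields {
    fn from_iter<I: IntoIterator<Item = (i8, String)>>(iter: I) -> Self {
        // Collecting directly into the inner Vec would bypass the invariant,
        // so this path also goes through `normalize`.
        Self::normalize(iter.into_iter().collect())
    }
}
```

Values built through `new` and through `collect()` then compare equal whenever they hold the same entries, because neither path can skip the sort.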

Development

Successfully merging this pull request may close these issues.

Incorrect PartialEq behavior for UnionFields

4 participants