
Conversation

@AndreaBozzo
Contributor

This PR adds a second example to ArrowReaderOptions::with_schema demonstrating how to preserve dictionary encoding when reading Parquet string columns.

Closes #9095
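
For reference, here is a minimal sketch of the pattern the new example documents. This is an illustration, not code from the PR: it assumes the crate's ArrowReaderOptions::with_schema and ParquetRecordBatchReaderBuilder::try_new_with_options APIs, and the column name and values are made up.

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use tempfile::tempfile;

// Write a Parquet file with a repetitive string column; the Arrow writer
// dictionary-encodes string columns by default.
let batch = RecordBatch::try_from_iter([(
    "city",
    Arc::new(StringArray::from(vec!["Rome", "Rome", "Milan", "Rome"])) as ArrayRef,
)])
.unwrap();
let file = tempfile().unwrap();
let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// Hint the reader to keep the column dictionary encoded instead of
// materializing plain Utf8 values.
let hint = Arc::new(Schema::new(vec![Field::new(
    "city",
    DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
    false,
)]));
let options = ArrowReaderOptions::new().with_schema(hint);
let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .unwrap();

// The batches come back with a Dictionary column.
let batch = reader.next().unwrap().unwrap();
assert!(matches!(batch.column(0).data_type(), DataType::Dictionary(_, _)));
```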

@github-actions bot added the parquet (Changes to the parquet crate) label on Jan 8, 2026
Contributor

@mhilton left a comment


Thanks for the added docs. This reads well and makes sense to me; someone with more understanding will have to decide if the code is correct (I assume it is).

@AndreaBozzo
Contributor Author

> Thanks for the added docs. This reads well and makes sense to me; someone with more understanding will have to decide if the code is correct (I assume it is).

Thank you for your time!

Contributor

@alamb left a comment


Thanks @AndreaBozzo -- this looks great

The only thing I want to resolve before merge is the Notes section. All the other comments are nice-to-haves / would be good to do as follow-on PRs

Also thanks @mhilton for the review

Comment on lines 567 to 570
/// **Note**: Dictionary encoding preservation works best when the batch size
/// is a divisor of the row group size and a single read does not span multiple
/// column chunks. If these conditions are not met, the reader may compute
/// a fresh dictionary from the decoded values.
Contributor


I think this is somewhat misleading because

  1. A single read never spans multiple column chunks
  2. The batch size is not related to the dictionary size, as far as I know

Dictionary encoding works best when:

  1. The original column was dictionary encoded (the default)
  2. There are a small number of distinct values
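
As a sketch of the first point (hedged; it assumes the `file` handle written in the doc example and the ColumnChunkMetaData::dictionary_page_offset accessor), you can confirm the writer actually produced a dictionary-encoded chunk by looking at the Parquet metadata:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// `file` is assumed to be the Parquet file written earlier in the example.
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
let column = builder.metadata().row_group(0).column(0);
// A dictionary page offset is only recorded when the chunk was dictionary encoded.
assert!(column.dictionary_page_offset().is_some());
```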

/// options
/// ).unwrap();
///
/// // Verify the schema shows Dictionary type
Contributor


I think checking the schema type and then the actual record batch type is redundant -- you could leave out this check and make the example more concise
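
For illustration, the batch-level check on its own could look like this (a sketch; it assumes the first column was read back with the Int32-keyed Dictionary hint):

```rust
use arrow_array::{types::Int32Type, DictionaryArray};

// `batch` is assumed to be a RecordBatch read with the schema hint applied.
let dict = batch
    .column(0)
    .as_any()
    .downcast_ref::<DictionaryArray<Int32Type>>()
    .expect("column should be a DictionaryArray");
// The distinct strings are stored once in values(); each row is just a key.
assert!(dict.values().len() <= dict.len());
```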

/// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
///
/// ```
/// use std::sync::Arc;
Contributor


It would be nice to prefix the setup of the example with # so it isn't rendered in the docs

So instead of

    /// use tempfile::tempfile;

Do

    /// # use tempfile::tempfile;

That still runs, but will not be shown
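
Put together, a doc example with hidden setup might look like this (a hypothetical item, assuming tempfile is available as a dev-dependency):

````rust
/// Hypothetical function whose doc example hides its setup lines.
///
/// ```
/// # use tempfile::tempfile;         // compiled and run by the doc test,
/// # let file = tempfile().unwrap(); // but hidden from the rendered docs
/// assert_eq!(file.metadata().unwrap().len(), 0); // only this line is rendered
/// ```
pub fn read_example() {}
````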

Contributor


However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples as a follow-on PR

/// use parquet::arrow::ArrowWriter;
///
/// // Write a Parquet file with string data
/// let file = tempfile().unwrap();
Contributor


I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in-memory Vec rather than a file

let mut file = Vec::new();
...
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
...
// now read from the "file"

Since what you have here follows the same pattern as the other examples, I think it is good. Maybe we can improve the examples as a follow-on PR
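
For reference, a sketch of that in-memory variant (hedged; it assumes Bytes is accepted by the reader builder, and uses a made-up integer column):

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Write into an in-memory buffer instead of a temporary file.
let batch = RecordBatch::try_from_iter([(
    "id",
    Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
)])
.unwrap();
let mut file = Vec::new();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// Now read from the "file"; Bytes gives the reader cheap, zero-copy slices.
let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(file))
    .unwrap()
    .build()
    .unwrap();
for batch in reader {
    assert_eq!(batch.unwrap().num_rows(), 3);
}
```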

- Remove redundant schema type check
- Update note with accurate dictionary encoding guidance
@AndreaBozzo
Contributor Author

AndreaBozzo commented Jan 9, 2026

Thank you for the thorough review and your time! I addressed your feedback in commit 1ac4d54.

If the follow-up about the fourth pattern is needed, I/we can handle that as well

Contributor

@alamb left a comment


Thanks @AndreaBozzo

@alamb
Contributor

alamb commented Jan 9, 2026

> If the follow-up about the fourth pattern is needed, I/we can handle that as well

I am not quite sure what you mean by this

@AndreaBozzo
Contributor Author

> If the follow-up about the fourth pattern is needed, I/we can handle that as well

> I am not quite sure what you mean by this

Sorry, I meant I can improve the examples if needed

@alamb
Contributor

alamb commented Jan 10, 2026

> If the follow-up about the fourth pattern is needed, I/we can handle that as well

> I am not quite sure what you mean by this

> Sorry, I meant I can improve the examples if needed

That would be great if you have the time -- specifically I think applying both these comments to all the examples would improve the readability 🙏

@alamb merged commit 7856970 into apache:main on Jan 10, 2026
16 checks passed
@alamb
Contributor

alamb commented Jan 10, 2026

Thanks again @AndreaBozzo

@AndreaBozzo deleted the ab/docs/dictionary-encoding-example branch on January 10, 2026 at 16:31
Development

Successfully merging this pull request may close these issues.

Document / Add an example of preserving dictionary encoding when reading parquet