docs(parquet): add example for preserving dictionary encoding #9116
Conversation
mhilton
left a comment
Thanks for the added docs. This reads well and makes sense to me; someone with more understanding will have to decide whether the code is correct (I assume it is).
Thank you for your time!
alamb
left a comment
Thanks @AndreaBozzo -- this looks great
The only thing I want to resolve before merge is the Notes section. All the other comments are nice-to-haves / would be good to do as follow-on PRs.
Also thanks @mhilton for the review
/// **Note**: Dictionary encoding preservation works best when the batch size
/// is a divisor of the row group size and a single read does not span multiple
/// column chunks. If these conditions are not met, the reader may compute
/// a fresh dictionary from the decoded values.
I think this is somewhat misleading because
- A single read never spans multiple column chunks
- the batch size is not related to the dictionary size, as far as I know
Dictionary encoding works best when:
- The original column was dictionary encoded (the default)
- There are a small number of distinct values
/// options
/// ).unwrap();
///
/// // Verify the schema shows Dictionary type
I think checking the schema type and then the actual record batch type is redundant -- you could leave out this check and make the example more concise
/// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
///
/// ```
/// use std::sync::Arc;
It would be nice to prefix the setup of the example with # so it isn't rendered in the docs
So instead of

```rust
/// use tempfile::tempfile;
```

do

```rust
/// # use tempfile::tempfile;
```

That still runs, but will not be shown in the rendered docs.
However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples in a follow-on PR.
/// use parquet::arrow::ArrowWriter;
///
/// // Write a Parquet file with string data
/// let file = tempfile().unwrap();
I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in memory Vec rather than a file
```rust
let mut file = Vec::new();
...
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
...
// now read from the "file"
```

Since what you have here follows the same pattern as the other examples, I think it is good. Maybe we can improve the examples in a follow-on PR.
- Remove redundant schema type check
- Update note with accurate dictionary encoding guidance
Thank you for the thorough review and your time! I addressed your feedback in commit 1ac4d54; if the follow-up about the fourth pattern is needed, I/we can handle that as well.
alamb
left a comment
Thanks @AndreaBozzo
I am not quite sure what you mean by this.

Sorry, I meant I can improve the examples if needed.

That would be great if you have the time -- specifically, I think applying both these comments to all the examples would improve the readability 🙏

Thanks again @AndreaBozzo
This PR adds a second example to `ArrowReaderOptions::with_schema` demonstrating how to preserve dictionary encoding when reading Parquet string columns.

Closes #9095