docs(parquet): add example for preserving dictionary encoding #9116
Conversation
mhilton
left a comment
Thanks for the added docs. This reads well and makes sense to me; someone with more understanding will have to decide whether the code is correct (I assume it is).
Thank you for your time!
alamb
left a comment
Thanks @AndreaBozzo -- this looks great
The only thing I want to resolve before merge is the Notes section. All the other comments are nice-to-haves / would be good to do as follow-on PRs.
Also thanks @mhilton for the review
/// **Note**: Dictionary encoding preservation works best when the batch size
/// is a divisor of the row group size and a single read does not span multiple
/// column chunks. If these conditions are not met, the reader may compute
/// a fresh dictionary from the decoded values.
I think this is somewhat misleading because
- A single read never spans multiple column chunks
- the batch size is not related to the dictionary size, as far as I know
Dictionary encoding works best when:
- The original column was dictionary encoded (the default)
- There are a small number of distinct values
/// options
/// ).unwrap();
///
/// // Verify the schema shows Dictionary type
I think checking the schema type and then the actual record batch type is redundant -- you could leave out this check and make the example more concise
/// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
///
/// ```
/// use std::sync::Arc;
It would be nice to prefix the setup of the example with # so it isn't rendered in the docs
So instead of

```rust
/// use tempfile::tempfile;
```

do

```rust
/// # use tempfile::tempfile;
```

That still runs, but will not be shown in the rendered docs.
However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples in a follow-on PR.
/// use parquet::arrow::ArrowWriter;
///
/// // Write a Parquet file with string data
/// let file = tempfile().unwrap();
I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in memory Vec rather than a file
```rust
let mut file = Vec::new();
...
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
...
// now read from the "file"
```

Since what you have here follows the same pattern as the other examples, I think it is good. Maybe we can improve the examples in a follow-on PR.
- Remove redundant schema type check
- Update note with accurate dictionary encoding guidance
Thank you for the thorough review and your time! I addressed your feedback in commit 1ac4d54; if the follow-up about the fourth pattern is needed, I/we can handle that as well.
alamb
left a comment
Thanks @AndreaBozzo
I am not quite sure what you mean by this.

Sorry, I meant I can improve the examples if needed.

That would be great if you have the time -- specifically, I think applying both these comments to all the examples would improve the readability 🙏

Thanks again @AndreaBozzo
This PR adds a second example to `ArrowReaderOptions::with_schema` demonstrating how to preserve dictionary encoding when reading Parquet string columns.

Closes #9095