-
-
Notifications
You must be signed in to change notification settings - Fork 4
feat(ole): support image extraction #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@DevExzh 需要构造一个很简单的带有像素图片的 doc,来进行单元测试么? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for extracting embedded images from OLE (.doc) documents. The implementation follows Apache POI's approach for handling picture data stored in the Data stream, with support for various image formats through Escher records (BSE and BLIP types).
Changes:
- New
imagemodule with image detection, extraction, and error handling - Extended
DocumentAPI to provide image data access via the Data stream - Added
is_dataflag toCharacterPropertiesfor proper picture vs. data distinction - Refactored image extraction to support delay-loaded BLIPs from data streams
- Unrelated OOXML refactoring: extracted
to_streammethod fromsave
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ole/doc/image.rs | New module implementing image detection, PictureFields parsing, and extraction logic |
| src/ole/doc/document.rs | Added Data stream support and image_data() method for lazy image loading |
| src/ole/doc/paragraph.rs | Extended Run with optional image field and accessor methods |
| src/ole/doc/parts/chp.rs | Added is_data flag to distinguish picture data from NilPICFAndBinData |
| src/ole/doc/parts/chp_bin_table.rs | Refactored to use centralized SPRM parser for consistency |
| src/ole/doc/mod.rs | Exported new Image and ImageError types |
| src/ole/escher/record.rs | Custom Debug implementation to avoid printing large binary data |
| src/images/extractor.rs | Added methods to extract images from individual Escher records with stream support |
| src/images/blip.rs | Added try_from_escher_record helper for BLIP extraction |
| src/ooxml/docx/package.rs | Refactored save to delegate to new to_stream method |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Ok(all_paragraphs) | ||
| } | ||
|
|
||
| // fn has_picture(&self, picture_offset: u32) -> bool {} |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out code should be removed or completed. If this function is not needed, delete the comment. If it's intended for future implementation, add a TODO comment explaining the purpose.
| // fn has_picture(&self, picture_offset: u32) -> bool {} | |
| // TODO: Implement `has_picture(&self, picture_offset: u32) -> bool` if/when | |
| // picture-detection by offset is needed, to check whether the document | |
| // contains an embedded image at the specified picture offset. |
src/ole/doc/image.rs
Outdated
| let pic_offset = props.pic_offset.unwrap_or(0); | ||
| return is_block_contains_image(data_buff, pic_offset); |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect handling of missing pic_offset: When props.pic_offset is None, this code defaults to offset 0 via unwrap_or(0). This could lead to false positives or incorrect image detection. Consider returning Ok(false) early if pic_offset is None, as an image run should always have a valid offset.
| let pic_offset = props.pic_offset.unwrap_or(0); | |
| return is_block_contains_image(data_buff, pic_offset); | |
| if let Some(pic_offset) = props.pic_offset { | |
| return is_block_contains_image(data_buff, pic_offset); | |
| } else { | |
| return Ok(false); | |
| } |
| /// This method extracts the image data from the WordDocument stream. | ||
| /// The data is returned as a `Cow` to minimize copying when possible. |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inaccurate documentation: The comments state "This method extracts the image data from the WordDocument stream" and "The data is returned as a Cow to minimize copying when possible", but the implementation actually uses the Data stream when available (via get_data_stream), and returns an ExtractedImage (not a Cow). Update the documentation to accurately describe the behavior.
| /// This method extracts the image data from the WordDocument stream. | |
| /// The data is returned as a `Cow` to minimize copying when possible. | |
| /// This method extracts the image data, using the Data stream when available | |
| /// (via `get_data_stream`) and falling back to the WordDocument stream otherwise. | |
| /// The data is returned as an `ExtractedImage`, which may contain borrowed or | |
| /// owned bytes to minimize copying when possible. |
| pub fn save<P: AsRef<Path>>(&mut self, path: P) -> Result<()> { | ||
| let mut file = std::fs::OpenOptions::new() | ||
| .write(true) | ||
| .create(true) | ||
| .truncate(true) | ||
| .open(path)?; | ||
| self.to_stream(&mut file) | ||
| } |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated change: The refactoring of save to use to_stream in the OOXML/DOCX package appears unrelated to OLE image extraction. While this is a good refactoring, consider whether it should be in a separate PR focused on OOXML improvements, or if there's a specific reason it's included in this image extraction PR.
不好意思刚看到,如果有单元测试的话,那再好不过了😊 像素图片如果能内嵌在 test 里的话最好,如果有点困难的话,放在根目录的 test-data/ 下吧 单元测试有些我写的没来得及改,有可能会失败,我打算之后批量修一版 |
|
我看了一下,如果可以麻烦修一下以下几个问题嘛
其他 LGTM,感谢您的贡献 🫰 |
fix: adviced comments and add static assert for PictureFields
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@DevExzh 不好意思,现在才看到,已经按要求进行了修改。 之前可能是考虑这个文件是放置在
|
|
合并了,感谢 |

Uh oh!
There was an error while loading. Please reload this page.