Skip to content

Conversation

@greenhat616
Copy link
Contributor

@greenhat616 greenhat616 commented Jan 15, 2026

  • basic image extraction implement
  • validate image data extractor implement
  • add unit tests?
Screenshot_2026-01-20_19-31-41

@greenhat616 greenhat616 marked this pull request as ready for review January 20, 2026 11:32
Copilot AI review requested due to automatic review settings January 20, 2026 11:32
@greenhat616
Copy link
Contributor Author

@DevExzh 需要构造一个很简单的带有像素图片的 doc,来进行单元测试么?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for extracting embedded images from OLE (.doc) documents. The implementation follows Apache POI's approach for handling picture data stored in the Data stream, with support for various image formats through Escher records (BSE and BLIP types).

Changes:

  • New image module with image detection, extraction, and error handling
  • Extended Document API to provide image data access via the Data stream
  • Added is_data flag to CharacterProperties for proper picture vs. data distinction
  • Refactored image extraction to support delay-loaded BLIPs from data streams
  • Unrelated OOXML refactoring: extracted to_stream method from save

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/ole/doc/image.rs New module implementing image detection, PictureFields parsing, and extraction logic
src/ole/doc/document.rs Added Data stream support and image_data() method for lazy image loading
src/ole/doc/paragraph.rs Extended Run with optional image field and accessor methods
src/ole/doc/parts/chp.rs Added is_data flag to distinguish picture data from NilPICFAndBinData
src/ole/doc/parts/chp_bin_table.rs Refactored to use centralized SPRM parser for consistency
src/ole/doc/mod.rs Exported new Image and ImageError types
src/ole/escher/record.rs Custom Debug implementation to avoid printing large binary data
src/images/extractor.rs Added methods to extract images from individual Escher records with stream support
src/images/blip.rs Added try_from_escher_record helper for BLIP extraction
src/ooxml/docx/package.rs Refactored save to delegate to new to_stream method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Ok(all_paragraphs)
}

// fn has_picture(&self, picture_offset: u32) -> bool {}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code should be removed or completed. If this function is not needed, delete the comment. If it's intended for future implementation, add a TODO comment explaining the purpose.

Suggested change
// fn has_picture(&self, picture_offset: u32) -> bool {}
// TODO: Implement `has_picture(&self, picture_offset: u32) -> bool` if/when
// picture-detection by offset is needed, to check whether the document
// contains an embedded image at the specified picture offset.

Copilot uses AI. Check for mistakes.
Comment on lines 90 to 91
let pic_offset = props.pic_offset.unwrap_or(0);
return is_block_contains_image(data_buff, pic_offset);
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect handling of missing pic_offset: When props.pic_offset is None, this code defaults to offset 0 via unwrap_or(0). This could lead to false positives or incorrect image detection. Consider returning Ok(false) early if pic_offset is None, as an image run should always have a valid offset.

Suggested change
let pic_offset = props.pic_offset.unwrap_or(0);
return is_block_contains_image(data_buff, pic_offset);
if let Some(pic_offset) = props.pic_offset {
return is_block_contains_image(data_buff, pic_offset);
} else {
return Ok(false);
}

Copilot uses AI. Check for mistakes.
Comment on lines +477 to +478
/// This method extracts the image data from the WordDocument stream.
/// The data is returned as a `Cow` to minimize copying when possible.
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inaccurate documentation: The comments state "This method extracts the image data from the WordDocument stream" and "The data is returned as a Cow to minimize copying when possible", but the implementation actually uses the Data stream when available (via get_data_stream), and returns an ExtractedImage (not a Cow). Update the documentation to accurately describe the behavior.

Suggested change
/// This method extracts the image data from the WordDocument stream.
/// The data is returned as a `Cow` to minimize copying when possible.
/// This method extracts the image data, using the Data stream when available
/// (via `get_data_stream`) and falling back to the WordDocument stream otherwise.
/// The data is returned as an `ExtractedImage`, which may contain borrowed or
/// owned bytes to minimize copying when possible.

Copilot uses AI. Check for mistakes.
Comment on lines 634 to +641
pub fn save<P: AsRef<Path>>(&mut self, path: P) -> Result<()> {
let mut file = std::fs::OpenOptions::new()
.write(true)
.create(true)
.truncate(true)
.open(path)?;
self.to_stream(&mut file)
}
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change: The refactoring of save to use to_stream in the OOXML/DOCX package appears unrelated to OLE image extraction. While this is a good refactoring, consider whether it should be in a separate PR focused on OOXML improvements, or if there's a specific reason it's included in this image extraction PR.

Copilot uses AI. Check for mistakes.
@DevExzh
Copy link
Owner

DevExzh commented Jan 20, 2026

@DevExzh 需要构造一个很简单的带有像素图片的 doc,来进行单元测试么?

不好意思刚看到,如果有单元测试的话,那再好不过了😊 像素图片如果能内嵌在 test 里的话最好,如果有点困难的话,放在根目录的 test-data/ 下吧

单元测试有些我写的没来得及改,有可能会失败,我打算之后批量修一版

@DevExzh
Copy link
Owner

DevExzh commented Jan 20, 2026

我看了一下,如果可以麻烦修一下以下几个问题嘛

  1. 函数命名符合英语语法
  2. 加点 boundary checks,以防 edge cases 导致整个逻辑 panic
  3. Run cargo clippy && cargo fmt, make the code compile

其他 LGTM,感谢您的贡献 🫰

greenhat616 and others added 4 commits January 21, 2026 15:48
fix: adviced comments and add static assert for PictureFields
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@greenhat616 greenhat616 requested a review from DevExzh January 22, 2026 05:26
@greenhat616
Copy link
Contributor Author

@DevExzh 不好意思,现在才看到,已经按要求进行了修改。
Screenshot_2026-01-29_18-07-38

之前可能是考虑这个文件是放置在 ppt 下的,没有做修复。

  • src/ole/ppt/writer/escher.rs

@DevExzh DevExzh merged commit 01ac1e8 into DevExzh:main Jan 29, 2026
1 check passed
@DevExzh
Copy link
Owner

DevExzh commented Jan 29, 2026

合并了,感谢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants