feat(ole): support image extraction #4

greenhat616 · 2026-01-15T12:04:58Z

Implement basic image extraction
Implement validation in the image data extractor
Add unit tests?

chore: up

greenhat616 · 2026-01-20T11:33:24Z

@DevExzh Should I construct a very simple doc containing a pixel image to use for unit tests?

Copilot

Pull request overview

This pull request adds support for extracting embedded images from OLE (.doc) documents. The implementation follows Apache POI's approach for handling picture data stored in the Data stream, with support for various image formats through Escher records (BSE and BLIP types).

Changes:

New image module with image detection, extraction, and error handling
Extended Document API to provide image data access via the Data stream
Added is_data flag to CharacterProperties for proper picture vs. data distinction
Refactored image extraction to support delay-loaded BLIPs from data streams
Unrelated OOXML refactoring: extracted to_stream method from save

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
src/ole/doc/image.rs	New module implementing image detection, PictureFields parsing, and extraction logic
src/ole/doc/document.rs	Added Data stream support and `image_data()` method for lazy image loading
src/ole/doc/paragraph.rs	Extended Run with optional image field and accessor methods
src/ole/doc/parts/chp.rs	Added `is_data` flag to distinguish picture data from NilPICFAndBinData
src/ole/doc/parts/chp_bin_table.rs	Refactored to use centralized SPRM parser for consistency
src/ole/doc/mod.rs	Exported new Image and ImageError types
src/ole/escher/record.rs	Custom Debug implementation to avoid printing large binary data
src/images/extractor.rs	Added methods to extract images from individual Escher records with stream support
src/images/blip.rs	Added `try_from_escher_record` helper for BLIP extraction
src/ooxml/docx/package.rs	Refactored `save` to delegate to new `to_stream` method
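
As an illustration of the `src/ole/escher/record.rs` row above, a custom `Debug` implementation that avoids dumping large binary payloads might look roughly like the sketch below. The `EscherRecord` struct and its field names here are hypothetical stand-ins, not the types from this PR; the point is only that the payload is summarized by its length.

```rust
use std::fmt;

/// Hypothetical stand-in for an Escher record; the fields are illustrative
/// and are not taken from the PR.
pub struct EscherRecord {
    pub record_type: u16,
    pub data: Vec<u8>,
}

impl fmt::Debug for EscherRecord {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the record type in hex and only the payload length,
        // so logging a record never prints the raw image bytes.
        f.debug_struct("EscherRecord")
            .field("record_type", &format_args!("{:#06x}", self.record_type))
            .field("data_len", &self.data.len())
            .finish()
    }
}
```

With this shape, formatting a record with `{:?}` stays on a single short line regardless of how large the embedded image is.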

src/ole/doc/image.rs

Copilot · 2026-01-20T11:37:54Z

src/ole/doc/document.rs

        Ok(all_paragraphs)
    }

+    // fn has_picture(&self, picture_offset: u32) -> bool {}


Commented-out code should be removed or completed. If this function is not needed, delete the comment. If it's intended for future implementation, add a TODO comment explaining the purpose.

Suggested change

-    // fn has_picture(&self, picture_offset: u32) -> bool {}
+    // TODO: Implement `has_picture(&self, picture_offset: u32) -> bool` if/when
+    //       picture-detection by offset is needed, to check whether the document
+    //       contains an embedded image at the specified picture offset.

src/images/extractor.rs

src/ole/doc/image.rs

Copilot · 2026-01-20T11:37:55Z

src/ole/doc/image.rs

+            let pic_offset = props.pic_offset.unwrap_or(0);
+            return is_block_contains_image(data_buff, pic_offset);


Incorrect handling of missing pic_offset: When props.pic_offset is None, this code defaults to offset 0 via unwrap_or(0). This could lead to false positives or incorrect image detection. Consider returning Ok(false) early if pic_offset is None, as an image run should always have a valid offset.

Suggested change

-            let pic_offset = props.pic_offset.unwrap_or(0);
-            return is_block_contains_image(data_buff, pic_offset);
+            if let Some(pic_offset) = props.pic_offset {
+                return is_block_contains_image(data_buff, pic_offset);
+            } else {
+                return Ok(false);
+            }
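
To make the concern in this comment concrete, here is a minimal, self-contained sketch of the early-return behavior. `CharProps`, `block_contains_image`, and `run_has_image` are hypothetical simplifications (the placeholder detector just treats any in-bounds offset as an image), so this shows only the control flow, not the PR's actual detection logic.

```rust
/// Hypothetical, simplified stand-in for the character properties in the PR.
pub struct CharProps {
    pub pic_offset: Option<u32>,
}

/// Placeholder detector (assumption): any in-bounds offset counts as an image.
fn block_contains_image(data: &[u8], pic_offset: u32) -> Result<bool, String> {
    Ok((pic_offset as usize) < data.len())
}

/// Returns `Ok(false)` when the run carries no picture offset, instead of
/// silently probing offset 0 as `unwrap_or(0)` would.
pub fn run_has_image(props: &CharProps, data: &[u8]) -> Result<bool, String> {
    let Some(pic_offset) = props.pic_offset else {
        return Ok(false);
    };
    block_contains_image(data, pic_offset)
}
```

Under this placeholder detector, `unwrap_or(0)` would report an image for every run on a non-empty stream, which is the false positive the comment warns about.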

Copilot · 2026-01-20T11:37:55Z

src/ole/doc/document.rs

+    /// This method extracts the image data from the WordDocument stream.
+    /// The data is returned as a `Cow` to minimize copying when possible.


Inaccurate documentation: The comments state "This method extracts the image data from the WordDocument stream" and "The data is returned as a Cow to minimize copying when possible", but the implementation actually uses the Data stream when available (via get_data_stream), and returns an ExtractedImage (not a Cow). Update the documentation to accurately describe the behavior.

Suggested change

-    /// This method extracts the image data from the WordDocument stream.
-    /// The data is returned as a `Cow` to minimize copying when possible.
+    /// This method extracts the image data, using the Data stream when available
+    /// (via `get_data_stream`) and falling back to the WordDocument stream otherwise.
+    /// The data is returned as an `ExtractedImage`, which may contain borrowed or
+    /// owned bytes to minimize copying when possible.

src/ole/doc/image.rs

Copilot · 2026-01-20T11:37:56Z

src/ooxml/docx/package.rs

    pub fn save<P: AsRef<Path>>(&mut self, path: P) -> Result<()> {
+        let mut file = std::fs::OpenOptions::new()
+            .write(true)
+            .create(true)
+            .truncate(true)
+            .open(path)?;
+        self.to_stream(&mut file)
+    }


Unrelated change: The refactoring of save to use to_stream in the OOXML/DOCX package appears unrelated to OLE image extraction. While this is a good refactoring, consider whether it should be in a separate PR focused on OOXML improvements, or if there's a specific reason it's included in this image extraction PR.
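
For readers unfamiliar with the pattern being discussed, the sketch below shows the general idea of having `save` delegate to a writer-generic `to_stream`. The `Package` struct here is a hypothetical placeholder holding raw bytes, and the `to_stream` signature is an assumption; the real method's signature is not shown in this thread, and a real ZIP writer may also require `Seek`.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical placeholder package that already holds its serialized bytes.
pub struct Package {
    pub bytes: Vec<u8>,
}

impl Package {
    /// Writes the package to any `Write` target (file, socket, in-memory buffer).
    pub fn to_stream<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        writer.write_all(&self.bytes)?;
        writer.flush()
    }

    /// Convenience wrapper: creates (or truncates) the file, then delegates.
    pub fn save<P: AsRef<Path>>(&self, path: P) -> io::Result<()> {
        let mut file = File::create(path)?;
        self.to_stream(&mut file)
    }
}
```

The benefit of this split is that callers and tests can serialize into a `Vec<u8>` without touching the filesystem.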

src/ole/doc/image.rs

DevExzh · 2026-01-20T17:17:10Z

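@DevExzh Should I construct a very simple doc containing a pixel image to use for unit tests?

Sorry, I only just saw this. Unit tests would be great 😊 Ideally the pixel image would be embedded directly in the test; if that is difficult, put it under test-data/ in the repository root.

Some of the unit tests I wrote have not been updated yet and may fail; I plan to fix them in one batch later.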

DevExzh · 2026-01-20T17:43:32Z

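I took a look. Could you please fix the following issues?

Make function names grammatically correct English
Add some boundary checks so that edge cases cannot make the whole logic panic
Run cargo clippy && cargo fmt, and make sure the code compiles

Everything else LGTM. Thank you for your contribution 🫰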
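
As one possible shape for the boundary checks requested above, the sketch below reads a little-endian `u32` from a byte buffer without ever indexing out of range. `read_u32_le` is a hypothetical helper introduced for illustration, not a function from this PR.

```rust
use std::convert::TryInto;

/// Reads a little-endian `u32` at `offset`, returning `None` instead of
/// panicking when the buffer is too short or the offset arithmetic overflows.
pub fn read_u32_le(data: &[u8], offset: usize) -> Option<u32> {
    // `checked_add` guards against overflow for attacker-controlled offsets.
    let end = offset.checked_add(4)?;
    // `get` returns `None` for out-of-range slices rather than panicking.
    let bytes: [u8; 4] = data.get(offset..end)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}
```

Callers can then map `None` to a parse error, so a truncated or malformed document yields an `Err` rather than a panic.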

fix: adviced comments and add static assert for PictureFields

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

src/images/extractor.rs

greenhat616 · 2026-01-29T10:08:27Z

@DevExzh Sorry, I only just saw this. I have made the requested changes.

I had probably assumed this file lived under ppt, so I did not fix it earlier.

src/ole/ppt/writer/escher.rs

DevExzh · 2026-01-29T12:27:50Z

Merged, thanks.

greenhat616 added 5 commits January 13, 2026 17:15

feat(docx): add to-stream writer for docx package

27853e9

feat(ole): support paragraph runs image extraction

dbaa8ee

fix(ole): use apache poi based chp parser

edc62a7

fix(ole): store data stream for accessing image data

1a7fc9d

refactor(ole): use builtin blip parser

533f226

chore: up

greenhat616 force-pushed the main branch from b54c29e to 533f226 Compare January 20, 2026 11:24

greenhat616 marked this pull request as ready for review January 20, 2026 11:32

Copilot AI review requested due to automatic review settings January 20, 2026 11:32

Copilot started reviewing on behalf of greenhat616 January 20, 2026 11:32

Copilot AI reviewed Jan 20, 2026

greenhat616 and others added 4 commits January 21, 2026 15:48

fix: adviced comments and add static assert for PictureFields

ea20e8f

fix: adviced comments and add static assert for PictureFields

tests: add document image extraction test

cc33640

Update src/images/extractor.rs

e5fbb30

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/ole/doc/image.rs

d94d9d7

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

greenhat616 requested a review from DevExzh January 22, 2026 05:26

DevExzh requested changes Jan 22, 2026

src/images/extractor.rs (outdated, resolved)

greenhat616 added 2 commits January 23, 2026 23:14

fix: make clippy happy

cb720cf

fix: make clippy happy again

061a324

DevExzh merged commit 01ac1e8 into DevExzh:main Jan 29, 2026
1 check passed
