A Rust library designed to facilitate the conversion of various document formats into markdown text.

TM9657/markitdown-rs
markitdown-rs

markitdown-rs is a Rust library that converts a wide range of document formats to Markdown. It is a Rust implementation of the original markitdown Python library.

Features

Document Formats

Microsoft Office (Modern)

  • Word (.docx, .dotx, .dotm)
  • Excel (.xlsx, .xltx, .xltm)
  • PowerPoint (.pptx, .potx, .potm)

Microsoft Office (Legacy)

  • Word 97-2003 (.doc)
  • Excel 97-2003 (.xls)
  • PowerPoint 97-2003 (.ppt)
  • Rich Text Format (.rtf)

OpenDocument Format

  • Text (.odt, .ott)
  • Spreadsheet (.ods, .ots)
  • Presentation (.odp, .otp)

Apple iWork

  • Pages (.pages)
  • Numbers (.numbers)
  • Keynote (.key)

Other Document Formats

  • PDF (.pdf)
    • Intelligent fallback mechanism: Automatically detects scanned PDFs, complex pages with diagrams, or pages with limited text and images
    • Uses text extraction by default for efficiency
    • Falls back to LLM-powered page rendering when:
      • Page has < 10 words (likely scanned)
      • Alphanumeric ratio is < 0.5 (OCR artifacts/garbage)
      • Unstructured content is < 50 characters
      • Page contains images and has < 350 words (the render gives the LLM full context)
    • Renders the entire page as a PNG for LLM processing when needed
  • EPUB (.epub)
  • Markdown (.md)
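
The PDF fallback criteria listed above can be read as a single predicate over per-page statistics. The following is a minimal sketch of that reading, not the crate's actual implementation: the `PageStats` struct, the `needs_llm_fallback` function, and the constant names are hypothetical, and it assumes that the alphanumeric ratio and the 50-character limit are both measured over non-whitespace characters.

```rust
/// Per-page data a PDF text extractor might collect.
/// Hypothetical type for illustration; not part of the markitdown API.
struct PageStats<'a> {
    text: &'a str,
    has_images: bool,
}

// Thresholds taken from the fallback criteria in the feature list.
const MIN_WORDS: usize = 10;
const MIN_ALNUM_RATIO: f64 = 0.5;
const MIN_CHARS: usize = 50;
const MAX_WORDS_WITH_IMAGES: usize = 350;

/// Returns true when the page should be rendered as an image for the LLM
/// instead of relying on plain text extraction.
fn needs_llm_fallback(page: &PageStats) -> bool {
    let words = page.text.split_whitespace().count();
    // Assumption: ratio and length are computed over non-whitespace characters.
    let visible: Vec<char> = page.text.chars().filter(|c| !c.is_whitespace()).collect();
    let alnum = visible.iter().filter(|c| c.is_alphanumeric()).count();
    // An empty page has no meaningful ratio; treat it as 0.0 so it falls back.
    let ratio = if visible.is_empty() {
        0.0
    } else {
        alnum as f64 / visible.len() as f64
    };

    words < MIN_WORDS
        || ratio < MIN_ALNUM_RATIO
        || visible.len() < MIN_CHARS
        || (page.has_images && words < MAX_WORDS_WITH_IMAGES)
}
```

Under these assumptions, a page of 60 ordinary words is handled by text extraction when it has no images, but is sent to the LLM when it does, because 60 is below the 350-word limit for pages with images.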

Data Formats

  • CSV (.csv)
  • Excel spreadsheets (.xlsx, .xls)
  • SQLite databases (.sqlite, .db)

Structured Data

  • XML (.xml)
  • RSS and Atom feeds (.rss, .atom)
  • HTML (.html, .htm)
  • Email (.eml, .msg)
  • vCard (.vcf)
  • iCalendar (.ics)
  • BibTeX (.bib)

Archive Formats

  • ZIP (.zip)
  • TAR (.tar, .tar.gz, .tar.bz2, .tar.xz)
  • GZIP (.gz)
  • BZIP2 (.bz2)
  • XZ (.xz)
  • ZSTD (.zst)
  • 7-Zip (.7z)

Media

  • Images (.jpg, .png, .gif, .bmp, .tiff, .webp)
    • With LLM integration for intelligent image descriptions
  • Audio (planned)

Other

  • Plain text (.txt)
  • Log files (.log)

Note: All formats support both file path and in-memory bytes conversion.

Usage

Command-Line

Installation

cargo install markitdown

Convert a File

markitdown path-to-file.pdf

Or use -o to specify the output file:

markitdown path-to-file.pdf -o document.md

Supported formats include Office documents (.docx, .xlsx, .pptx), legacy Office (.doc, .xls, .ppt), OpenDocument (.odt, .ods), Apple iWork (.pages, .numbers, .key), PDFs, EPUB, images, archives, and more. See the full list above.

Rust API

Installation

Add the following to your Cargo.toml:

[dependencies]
markitdown = "0.1.10"

Initialize MarkItDown

use markitdown::MarkItDown;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    Ok(())
}

Convert a File

use markitdown::{ConversionOptions, MarkItDown};
use object_store::local::LocalFileSystem;
use object_store::path::Path as ObjectPath;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    
    // Create a local file system object store
    let store = Arc::new(LocalFileSystem::new());
    
    // Convert file path string to ObjectStore Path
    let path = ObjectPath::from("path/to/file.xlsx");

    // Basic conversion - file type is auto-detected from extension
    let result = md.convert_with_store(store.clone(), &path, None).await?;
    println!("Converted Text: {}", result.to_markdown());

    // Convert legacy Office formats
    let doc_path = ObjectPath::from("document.doc");
    let result = md.convert_with_store(store.clone(), &doc_path, None).await?;

    // Convert archives (extracts and converts contents)
    let zip_path = ObjectPath::from("archive.zip");
    let result = md.convert_with_store(store.clone(), &zip_path, None).await?;

    // Or explicitly specify options
    let options = ConversionOptions::default()
        .with_extension(".xlsx")
        .with_extract_images(true);

    let result = md.convert_with_store(store, &path, Some(options)).await?;
    
    Ok(())
}

Important: The library uses object_store for file operations, not plain file paths. You must:

  1. Create an ObjectStore implementation (like LocalFileSystem for local files)
  2. Convert file path strings to object_store::path::Path using Path::from()
  3. Use convert_with_store() method with the store and path

For convenience, there's also a convert() method that accepts string paths and uses LocalFileSystem internally.

Convert with LLM for Image Descriptions

use markitdown::{ConversionOptions, MarkItDown, create_llm_client};
use rig::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    
    // Create an LLM client using any rig-core compatible provider
    // OpenAI example:
    let openai_client = openai::Client::from_env();
    let model = openai_client.completion_model("gpt-4o");
    let llm = create_llm_client(model);
    
    // Google Gemini example:
    // let gemini_client = gemini::Client::from_env();
    // let model = gemini_client.completion_model("gemini-2.0-flash");
    // let llm = create_llm_client(model);
    
    // Anthropic Claude example:
    // let anthropic_client = anthropic::Client::from_env();
    // let model = anthropic_client.completion_model("claude-sonnet-4-20250514");
    // let llm = create_llm_client(model);
    
    // Cohere example with custom endpoint:
    // let api_key = std::env::var("COHERE_API_KEY")?;
    // let mut builder = rig::providers::cohere::Client::builder(&api_key);
    // if let Some(endpoint) = custom_endpoint {
    //     builder = builder.base_url(endpoint);
    // }
    // let client = builder.build();
    // let model = client.completion_model("command-r-plus");
    // let llm = create_llm_client(model);

    let options = ConversionOptions::default()
        .with_extension(".jpg")
        .with_llm(llm);

    let result = md.convert("path/to/image.jpg", Some(options)).await?;
    println!("Image description: {}", result.to_markdown());
    
    Ok(())
}

Supported LLM Providers (via rig-core):

  • OpenAI (GPT-4, GPT-4o, etc.)
  • Google Gemini (gemini-2.0-flash, gemini-pro, etc.)
  • Anthropic Claude (claude-sonnet, claude-opus, etc.)
  • Cohere (command-r-plus, etc.)
  • Any custom provider implementing CompletionModel

Convert from Bytes

use markitdown::{ConversionOptions, MarkItDown};
use bytes::Bytes;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    
    let file_bytes = std::fs::read("path/to/file.pdf")?;

    // Auto-detect file type from bytes
    let result = md.convert_bytes(Bytes::from(file_bytes.clone()), None).await?;
    println!("Converted: {}", result.to_markdown());

    // Or specify options explicitly
    let options = ConversionOptions::default()
        .with_extension(".pdf");

    let result = md.convert_bytes(Bytes::from(file_bytes), Some(options)).await?;
    
    Ok(())
}

Working with the Output Structure

The conversion returns a Document struct that preserves the page/slide structure of the original file:

use markitdown::{MarkItDown, Document, Page, ContentBlock, ExtractedImage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    let result: Document = md.convert("presentation.pptx", None).await?;
    
    // Access document metadata
    if let Some(title) = &result.title {
        println!("Document: {}", title);
    }
    
    // Iterate through pages/slides
    for page in &result.pages {
        println!("Page {}", page.page_number);
        
        // Get page content as markdown
        let markdown = page.to_markdown();
        
        // Or access individual content blocks
        for block in &page.content {
            match block {
                ContentBlock::Text(text) => println!("Text: {}", text),
                ContentBlock::Heading { level, text } => println!("H{}: {}", level, text),
                ContentBlock::Image(img) => {
                    println!("Image: {} ({} bytes)", img.id, img.data.len());
                    if let Some(desc) = &img.description {
                        println!("  Description: {}", desc);
                    }
                }
                ContentBlock::Table { headers, rows } => {
                    println!("Table: {} cols, {} rows", headers.len(), rows.len());
                }
                // `..` skips fields this example does not read
                ContentBlock::List { items, .. } => {
                    println!("List ({} items)", items.len());
                }
                ContentBlock::Code { language, .. } => {
                    println!("Code block: {:?}", language);
                }
                ContentBlock::Quote(text) => println!("Quote: {}", text),
                ContentBlock::Markdown(md) => println!("Markdown: {}", md),
            }
        }
        
        // Get all images from this page
        let images: Vec<&ExtractedImage> = page.images();
        
        // Access rendered page image (for scanned PDFs, complex pages)
        if let Some(rendered) = &page.rendered_image {
            println!("Page rendered as image: {} bytes", rendered.data.len());
        }
    }
    
    // Convert entire document to markdown (with page separators)
    let full_markdown = result.to_markdown();
    
    // Get all images from the entire document
    let all_images = result.images();
    
    Ok(())
}

Output Structure:

  • Document - Complete document with optional title, pages, and metadata
    • Page - Single page/slide with page number and content blocks
      • ContentBlock - Individual content element (Text, Heading, Image, Table, List, Code, Quote, Markdown)
      • rendered_image - Optional full-page render (for scanned PDFs, slides with complex layouts)
    • ExtractedImage - Image data with id, bytes, MIME type, dimensions, alt text, and LLM description

This structure is ideal for:

  • Pagination-aware processing - Handle each page separately
  • Image extraction - Access embedded images with their metadata
  • Structured content - Work with tables, lists, headings programmatically
  • LLM pipelines - Pass individual pages or content blocks to AI models
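
As an illustration of working with structured content, the sketch below renders the `headers` and `rows` of a table block as a Markdown pipe table. The function `table_to_markdown` is a hypothetical helper, not part of the crate, and it assumes the two fields hold `Vec<String>` and `Vec<Vec<String>>`; the real field types may differ.

```rust
/// Renders a header row and data rows as a Markdown pipe table.
/// Hypothetical helper; the argument types are an assumption about
/// the `headers` and `rows` fields of a table block.
fn table_to_markdown(headers: &[String], rows: &[Vec<String>]) -> String {
    // A raw pipe inside a cell would end the cell early, so escape it.
    let cell = |s: &String| s.replace('|', "\\|");
    let line = |cells: Vec<String>| format!("| {} |", cells.join(" | "));

    let mut out = Vec::with_capacity(rows.len() + 2);
    out.push(line(headers.iter().map(cell).collect()));
    out.push(line(headers.iter().map(|_| "---".to_string()).collect()));
    for row in rows {
        // Truncate or pad each row to the header width so the table stays rectangular.
        let mut cells: Vec<String> = row.iter().take(headers.len()).map(cell).collect();
        cells.resize(headers.len(), String::new());
        out.push(line(cells));
    }
    out.join("\n")
}
```

Padding short rows with empty cells is a design choice made here for robustness; a stricter variant could return an error for ragged rows instead.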

Recent Improvements

Format Expansion

  • 40+ new formats including legacy Office (.doc, .xls, .ppt), OpenDocument (.odt, .ods, .odp), Apple iWork (.pages, .numbers, .key)
  • Archive support for ZIP, TAR, GZIP, BZIP2, XZ, ZSTD, and 7-Zip with automatic content extraction
  • Additional formats: EPUB, vCard, iCalendar, BibTeX, log files, SQLite databases, email files

Performance & Reliability

  • Static compilation for compression libraries (bzip2, xz2) for better portability
  • Improved file detection - prioritizes file extension over magic byte detection for legacy formats
  • Template support for Office formats (.dotx, .potx, .xltx)
  • LLM flexibility - works with any rig-core compatible model (OpenAI, Gemini, Claude, Cohere, custom providers)
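
Preferring the extension for legacy formats makes sense because .doc, .xls, and .ppt files all start with the same OLE2 compound file signature, so magic bytes alone cannot tell them apart. The sketch below illustrates that ordering; the `Format` enum and `detect_format` function are hypothetical and do not reflect the crate's internal detection code.

```rust
/// Formats this sketch can distinguish. Hypothetical enum for illustration.
#[derive(Debug, PartialEq)]
enum Format {
    Doc,
    Xls,
    Ppt,
    /// OLE2 container whose concrete type cannot be derived from the signature.
    OleUnknown,
    Zip,
    Pdf,
    Unknown,
}

/// OLE2 compound file signature shared by .doc, .xls and .ppt.
const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

fn detect_format(file_name: &str, bytes: &[u8]) -> Format {
    // Step 1: for legacy Office files the extension decides.
    let ext = file_name
        .rsplit_once('.')
        .map(|(_, e)| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("doc") => return Format::Doc,
        Some("xls") => return Format::Xls,
        Some("ppt") => return Format::Ppt,
        _ => {}
    }

    // Step 2: otherwise fall back to the leading magic bytes.
    if bytes.starts_with(b"%PDF") {
        Format::Pdf
    } else if bytes.starts_with(b"PK\x03\x04") {
        Format::Zip
    } else if bytes.starts_with(&OLE2_MAGIC) {
        Format::OleUnknown
    } else {
        Format::Unknown
    }
}
```

With this ordering, `report.doc` resolves to `Format::Doc` even though its leading bytes are identical to those of a legacy spreadsheet, while the same bytes without an extension can only be classified as a generic OLE2 container.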

Testing

  • Comprehensive test suite using real-world files from Apache Tika test corpus
  • Tests for all supported formats with both file and bytes conversion
  • In-memory test generation for compression formats

Register a Custom Converter

You can extend MarkItDown by implementing the DocumentConverter trait for your custom converters and registering them:

use markitdown::{DocumentConverter, Document, ConversionOptions, MarkItDown};
use markitdown::error::MarkitdownError;
use async_trait::async_trait;
use bytes::Bytes;
use std::sync::Arc;
use object_store::ObjectStore;

struct MyCustomConverter;

#[async_trait]
impl DocumentConverter for MyCustomConverter {
    async fn convert(
        &self,
        store: Arc<dyn ObjectStore>,
        path: &object_store::path::Path,
        options: Option<ConversionOptions>,
    ) -> Result<Document, MarkitdownError> {
        // Implement file conversion logic
        todo!()
    }

    async fn convert_bytes(
        &self,
        bytes: Bytes,
        options: Option<ConversionOptions>,
    ) -> Result<Document, MarkitdownError> {
        // Implement bytes conversion logic
        todo!()
    }
    
    fn supported_extensions(&self) -> &[&str] {
        &[".custom"]
    }
}

fn main() {
    // Register the converter so that ".custom" files are routed to it
    let mut md = MarkItDown::new();
    md.register_converter(Box::new(MyCustomConverter));
}
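
How registered converters are selected at conversion time is not described here. One plausible scheme is a lookup table keyed by the values from `supported_extensions()`. The sketch below shows that idea with a simplified, synchronous stand-in trait; `Converter`, `UpperCaseConverter`, and `Registry` are hypothetical names, and the rule that a later registration replaces an earlier one for the same extension is an assumption, not documented behavior.

```rust
use std::collections::HashMap;

/// Simplified, synchronous stand-in for a converter trait.
/// Hypothetical; the real `DocumentConverter` is async and returns a `Document`.
trait Converter {
    fn supported_extensions(&self) -> &[&str];
    fn convert_bytes(&self, bytes: &[u8]) -> String;
}

/// Toy converter that upper-cases its input.
struct UpperCaseConverter;

impl Converter for UpperCaseConverter {
    fn supported_extensions(&self) -> &[&str] {
        &[".custom"]
    }

    fn convert_bytes(&self, bytes: &[u8]) -> String {
        String::from_utf8_lossy(bytes).to_uppercase()
    }
}

/// Maps lower-cased extensions to registered converters.
#[derive(Default)]
struct Registry {
    converters: Vec<Box<dyn Converter>>,
    by_extension: HashMap<String, usize>,
}

impl Registry {
    /// Assumption: a later registration replaces an earlier one for the same extension.
    fn register(&mut self, converter: Box<dyn Converter>) {
        let index = self.converters.len();
        for ext in converter.supported_extensions() {
            self.by_extension.insert(ext.to_ascii_lowercase(), index);
        }
        self.converters.push(converter);
    }

    /// Returns None when no converter is registered for the extension.
    fn convert(&self, extension: &str, bytes: &[u8]) -> Option<String> {
        let index = self.by_extension.get(&extension.to_ascii_lowercase())?;
        Some(self.converters[*index].convert_bytes(bytes))
    }
}
```

In this sketch, registering `UpperCaseConverter` and calling `convert(".CUSTOM", b"hello")` dispatches to it regardless of the extension's case, while an unregistered extension yields `None`.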

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Acknowledgments

License

MarkItDown is licensed under the MIT License. See LICENSE for more details.
