@Egor-OSSRevival
Contributor

Pull Request Description

Overview

This PR adds mixed encoding support to enca, resolving issue #25, where files containing multiple encodings (e.g., GB2312 + UTF-8) could not be processed.

Features

  • Mixed Encoding Detection (-M / --mixed-encodings)
    Detects multiple encodings within one file and reports each segment with its offset and length.
  • Configurable Buffer Size (-B / --mixed-buffer-size)
    Default: 1024 bytes; valid range: 1–1048576. Smaller buffers give finer-grained detection; larger buffers run faster.
  • Error Handling (-I / --mixed-ignore-errors)
    Skips corrupted or unknown segments and falls back to the predominant encoding.
  • Mixed Encoding Conversion (-x with -M)
    Converts each segment individually while preserving file integrity.
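
To illustrate what per-segment conversion could look like, here is a minimal Python sketch. It is not the code in this PR (enca is written in C); the function name `convert_segments`, the `(offset, length, encoding)` tuple layout, and the `ignore_errors` parameter are all hypothetical and only loosely mirror the -x and -I options described above.

```python
from typing import List, Optional, Tuple

# Hypothetical segment layout: (byte offset, byte length, encoding name or None).
Segment = Tuple[int, int, Optional[str]]


def convert_segments(
    data: bytes,
    segments: List[Segment],
    target: str = "utf-8",
    ignore_errors: bool = False,
) -> bytes:
    """Decode each segment with its own encoding and re-encode it to `target`."""
    out = bytearray()
    for offset, length, encoding in segments:
        raw = data[offset:offset + length]
        try:
            if encoding is None:
                raise LookupError("segment has no detected encoding")
            out += raw.decode(encoding).encode(target)
        except (LookupError, UnicodeError):
            # Assumption: with ignore_errors set, a bad segment is skipped,
            # roughly as the -I option is described; otherwise the error propagates.
            if not ignore_errors:
                raise
    return bytes(out)
```

Converting segment by segment means a decoding failure stays local to one segment instead of corrupting the whole output. How the actual implementation treats skipped segments (dropping them, copying them verbatim, or re-decoding with the predominant encoding) is not shown here.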

Usage

# Detect mixed encodings
enca -L pl -M mixed_file.txt

# Convert to UTF-8
enca -L pl -M -x utf8 mixed_file.txt

# Fine-tuned with buffer and error handling
enca -L pl -M -B 256 -I -x utf8 mixed_file.txt

Implementation

  • Chunk-based analysis with segment merging
  • Predominant encoding fallback
  • Integrated with existing conversion system (iconv/recode/internal)
  • Verbose logging for detailed progress
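
The first two points above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the PR's C implementation: `toy_detect` is a hypothetical stand-in for enca's real detector (it simply tries strict decodes in a fixed order), and `find_segments` and `apply_fallback` are names invented for this example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical segment layout: (byte offset, byte length, encoding name or None).
Segment = Tuple[int, int, Optional[str]]


def toy_detect(chunk: bytes) -> Optional[str]:
    """Stand-in detector (assumption): first encoding that decodes strictly wins."""
    for name in ("ascii", "utf-8", "gb2312"):
        try:
            chunk.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def find_segments(
    data: bytes,
    chunk_size: int = 1024,
    detect: Callable[[bytes], Optional[str]] = toy_detect,
) -> List[Segment]:
    """Classify fixed-size chunks, merging adjacent chunks with the same result."""
    segments: List[Segment] = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        enc = detect(chunk)
        if segments and segments[-1][2] == enc:
            prev_offset, prev_length, _ = segments[-1]
            segments[-1] = (prev_offset, prev_length + len(chunk), enc)
        else:
            segments.append((offset, len(chunk), enc))
    return segments


def apply_fallback(segments: List[Segment]) -> List[Segment]:
    """Label unknown segments with the encoding that covers the most bytes."""
    totals: Counter = Counter()
    for _, length, enc in segments:
        if enc is not None:
            totals[enc] += length
    if not totals:
        return segments
    predominant = totals.most_common(1)[0][0]
    return [(o, n, e if e is not None else predominant) for o, n, e in segments]
```

A known limitation of this sketch is that a chunk boundary can fall in the middle of a multibyte character, which makes the chunks on either side look invalid. A real implementation would need to handle that case, and this example does not claim to show how the PR does so.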

Documentation
