spreadsheet_splitter is a Python command-line tool designed to split large Excel files (.xls or .xlsx) into smaller, manageable parts. It supports processing large files iteratively with low memory usage, making it ideal for handling files that exceed available RAM (e.g., >4 GB). The tool preserves specified title rows, logs processing details (including file profiling and memory usage), and supports resuming interrupted tasks.
- Split `.xls` and `.xlsx` files into two equal parts or smaller chunks.
- Process large files iteratively with configurable rows per iteration.
- Preserve title rows in non-large mode.
- Validate column counts and profile data types for `.xlsx` files (sketched below).
- Monitor memory usage during processing.
- Resume processing from the last checkpoint using a log file.
- Built with `xlrd`, `xlwt`, `openpyxl`, `pandas`, `pyexcelerate`, and `psutil`.
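For reference, the column-count validation and data-type profiling can be reproduced with a cheap single-row read in pandas. This is an illustrative sketch of the idea, not the tool's actual code; the file path is a placeholder:

```python
import pandas as pd

# Read just the first data row; enough to see the column count and dtypes
# without loading the whole sheet into memory.
df = pd.read_excel("/path/to/file.xlsx", nrows=1, engine="openpyxl")

profile = {
    "total_columns": df.shape[1],
    "data_types": {col: str(dtype) for col, dtype in df.dtypes.items()},
}
print(profile)
```

The `data_types` mapping here mirrors the `file_profile` section of `split_log.json` shown later.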
To build and run spreadsheet_splitter locally, ensure you have:

- **Python**: Version 3.13 or higher.
- **uv**: A Python package manager for managing dependencies and virtual environments. Install it with:

  ```sh
  # macOS/Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Windows
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```

  See the [uv installation docs](https://docs.astral.sh/uv/getting-started/installation/) for details.
- **Git**: To clone the repository.
- An Excel file (`.xls` or `.xlsx`) to process.
Follow these steps to build spreadsheet_splitter from source:

- **Clone the Repository**

  ```sh
  git clone https://github.com/parsaloi/spreadsheet_splitter.git
  cd spreadsheet_splitter
  ```

- **Set Up a Virtual Environment**

  Create and activate a virtual environment using `uv`:

  ```sh
  uv venv
  source .venv/bin/activate  # On Linux/macOS
  .venv\Scripts\activate     # On Windows
  ```

- **Install Dependencies**

  Install the required dependencies specified in `pyproject.toml`:

  ```sh
  uv sync
  ```

- **Verify Setup**

  Check that the `spreadsheet-splitter` command is available:

  ```sh
  uv run spreadsheet-splitter --help
  ```

  This should display the CLI help message with available arguments.
The spreadsheet_splitter tool provides two modes:
- Default Mode: Splits an Excel file into two equal parts, preserving title rows (see the sketch after this list).
- Large Mode: Splits large files into smaller chunks with a specified number of rows per iteration, ideal for low-memory environments.
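To make Default Mode concrete, the two-part split can be approximated in a few lines of pandas. This is a simplified sketch under assumed behavior (title rows copied into both halves, the first half taking the extra row on odd counts), not the tool's exact implementation:

```python
import pandas as pd

title_rows = 1
df = pd.read_excel("/path/to/file.xlsx", header=None, engine="openpyxl")

titles = df.iloc[:title_rows]  # title rows repeated in both parts
data = df.iloc[title_rows:]
half = (len(data) + 1) // 2    # assumption: part 1 gets the extra row

# Assumes the split_files/ directory already exists.
pd.concat([titles, data.iloc[:half]]).to_excel(
    "split_files/file_part1.xlsx", header=False, index=False)
pd.concat([titles, data.iloc[half:]]).to_excel(
    "split_files/file_part2.xlsx", header=False, index=False)
```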
- **Split a File into Two Parts**

  ```sh
  uv run spreadsheet-splitter /path/to/file.xlsx --output-dir split_files --title-rows 1
  ```

  - Splits `file.xlsx` into `split_files/file_part1.xlsx` and `split_files/file_part2.xlsx`.
  - Preserves 1 title row in each output file.

- **Split a Large File Iteratively** (see the sketch after this list)

  ```sh
  uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
  ```

  - Processes `very_large_file.xlsx` in chunks of 500 rows, up to 5 iterations (2500 rows total).
  - Reads only the first 22 columns.
  - Outputs files like `split_files/very_large_file_part001.xlsx`, `split_files/very_large_file_part002.xlsx`, etc.
  - Logs progress to `split_log.json`.

- **Resume Processing**

  ```sh
  uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --resume --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
  ```

  - Resumes from the last processed row recorded in `split_log.json`.
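Conceptually, `--large` mode reads a bounded window of rows on each pass instead of the whole sheet. Below is a minimal sketch of that pattern with pandas, using the same parameters as the example above. It is an approximation: among other things, `pd.read_excel` re-parses the file on every call, so a streaming reader such as openpyxl's read-only mode would scale better for truly huge files.

```python
import pandas as pd

rows_per_iteration = 500
max_iterations = 5
columns = 22

for i in range(max_iterations):
    chunk = pd.read_excel(
        "/path/to/very_large_file.xlsx",
        engine="openpyxl",
        header=None,
        skiprows=i * rows_per_iteration,  # skip already-processed rows
        nrows=rows_per_iteration,         # bound memory per iteration
        usecols=range(columns),           # only the first 22 columns
    )
    if chunk.empty:
        break  # reached the end of the sheet
    chunk.to_excel(f"split_files/very_large_file_part{i + 1:03d}.xlsx",
                   header=False, index=False)
```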
| Argument | Description | Default |
|---|---|---|
| `input_file` | Path to the input `.xls` or `.xlsx` file | Required |
| `-o, --output-dir` | Directory to save output files | `output` |
| `-t, --title-rows` | Number of top rows to treat as title rows (ignored in `--large` mode) | `1` |
| `--large` | Process large files iteratively with limited rows per iteration | `False` |
| `--rows-per-iteration` | Number of data rows to process per iteration | `500` |
| `--max-iterations` | Maximum number of iterations to process | `10` |
| `--resume` | Resume processing from the last row in the log file | `False` |
| `--log-file` | Path to the log file for tracking processed rows | `split_log.json` |
| `--columns` | Number of columns to process | All columns |
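The interface in the table maps naturally onto Python's `argparse`. The sketch below reproduces it for reference; the project's real entry point may be organized differently:

```python
import argparse

parser = argparse.ArgumentParser(prog="spreadsheet-splitter")
parser.add_argument("input_file", help="Path to the input .xls or .xlsx file")
parser.add_argument("-o", "--output-dir", default="output",
                    help="Directory to save output files")
parser.add_argument("-t", "--title-rows", type=int, default=1,
                    help="Top rows to treat as title rows (ignored with --large)")
parser.add_argument("--large", action="store_true",
                    help="Process large files iteratively")
parser.add_argument("--rows-per-iteration", type=int, default=500,
                    help="Data rows to process per iteration")
parser.add_argument("--max-iterations", type=int, default=10,
                    help="Maximum number of iterations")
parser.add_argument("--resume", action="store_true",
                    help="Resume from the last row in the log file")
parser.add_argument("--log-file", default="split_log.json",
                    help="Log file for tracking processed rows")
parser.add_argument("--columns", type=int, default=None,
                    help="Number of columns to process (default: all)")

args = parser.parse_args()
```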
The tool generates a `split_log.json` file to track processing details, including:

- **Input and Output**: Input file path and output directory.
- **Processed Ranges**: Start and end rows for each output file.
- **File Profile**: Total rows, columns, and data types (for `.xlsx`).
- **Memory Usage**: Memory before and after reading each chunk (for `.xlsx`).
- **Timestamp**: When the log was last updated.
Example `split_log.json`:

```json
{
  "input_path": "/path/to/very_large_file.xlsx",
  "output_dir": "split_files",
  "rows_per_iteration": 500,
  "processed_ranges": [
    {
      "file": "split_files/very_large_file_part001.xlsx",
      "start_row": 1,
      "end_row": 500,
      "iteration": 1
    }
  ],
  "last_row_processed": 499,
  "timestamp": "2025-05-27T11:12:46.172255",
  "columns": 22,
  "file_profile": {
    "total_rows": 1,
    "total_columns": 22,
    "data_types": {
      "ZONE": "float64",
      "SITE": "object",
      ...
    }
  },
  "memory_usage": [
    {
      "iteration": 1,
      "memory_before_mb": 256.55078125,
      "memory_after_mb": 259.6328125
    }
  ]
}
```
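Given this schema, the checkpoint and memory-usage entries are easy to reconstruct. Here is a sketch of how the `--resume` starting point and the MB figures might be produced with `json` and `psutil`; the helper names are made up for illustration:

```python
import json
import os

import psutil


def last_row_processed(log_path: str = "split_log.json") -> int:
    """Return the checkpoint from the log, or 0 if there is no log yet."""
    if not os.path.exists(log_path):
        return 0
    with open(log_path) as f:
        return json.load(f).get("last_row_processed", 0)


def rss_mb() -> float:
    """Resident memory of the current process in MB."""
    return psutil.Process().memory_info().rss / (1024 * 1024)


start_row = last_row_processed() + 1  # where --resume would pick up
memory_before_mb = rss_mb()
# ... read and write the next chunk starting at start_row ...
memory_after_mb = rss_mb()
```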
- **Column Count Error**: If you see `Specified columns (N) exceeds actual columns (M)`, verify the column count:

  ```python
  import pandas as pd

  df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=1, engine='openpyxl')
  print(df.shape[1], "columns")
  ```

  Use the correct number with `--columns`.

- **Empty File Warning**: If the tool warns that the input file has no rows, check the file content:

  ```python
  import pandas as pd

  df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=10, engine='openpyxl')
  print(df.shape[0], "rows")
  ```

- **Memory Issues**: If memory usage is high, reduce `--rows-per-iteration` (e.g., to 100). Check `split_log.json` for memory usage details.

- **Slow Processing**: Ensure the input file is on a fast local drive (e.g., SSD). For `.xlsx` files with many unique strings, consider converting to `.csv` first:

  ```python
  import pandas as pd

  df = pd.read_excel("/path/to/very_large_file.xlsx", engine='openpyxl')
  df.to_csv("very_large_file.csv", index=False)
  ```
- Package the tool for PyPI with cross-platform installation instructions.
- Add support for `.csv` input files.
- Enhance profiling with more detailed statistics (e.g., unique values per column; see the sketch below).
MIT License