MakeAIDatasets is an automated pipeline for creating high-quality AI datasets from PDF and EPUB sources. It handles text extraction, language filtering, and dataset preparation, and generates comprehensive metadata for quality control.
- Multi-format support: Process PDF (text & scanned) and EPUB files
- Intelligent text extraction: Native text extraction with OCR fallback
- Language filtering: FastText-powered English detection with confidence thresholds
- Metadata enrichment: Extract technical and contextual metadata
- Parallel processing: Utilize multi-core CPUs for efficient batch processing
- Dataset export: Create Hugging Face-compatible datasets with one command
- Cloud integration: Direct upload to Hugging Face Hub
1. Input Handling
   - Accepts PDF and EPUB files in the `/input` directory
   - Processes files in parallel using thread pooling
2. Content Extraction (see the first sketch after this list)
   - PDF: native text extraction with PyPDF2, with an OCR fallback using Tesseract
   - EPUB: structural parsing with ebooklib and BeautifulSoup
3. Text Processing (see the second sketch after this list)
   - Whitespace normalization
   - Short-line filtering
   - Language detection (English with >70% confidence)
   - Paragraph reconstruction
4. Metadata Generation
   - File characteristics (format, size)
   - Processing details (OCR usage, page count)
   - Content metrics (paragraph count, English ratio)
   - PDF metadata (title, author, creation date)
5. Output Creation
   - Cleaned text files in `/output/cleaned_texts`
   - JSON metadata in `/output/metadata`
   - Hugging Face dataset in `/hf_dataset`
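The extraction step with OCR fallback might look roughly like the sketch below. The function name and the per-page character threshold are illustrative assumptions, not the actual `pdf_processor.py` API; only the PyPDF2, pdf2image, and pytesseract calls mirror those libraries' documented interfaces.

```python
# Sketch only: native PDF text extraction with an OCR fallback for
# scanned documents. extract_pdf_text and min_chars_per_page are
# hypothetical names, not the real pdf_processor.py API.
from PyPDF2 import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_pdf_text(path: str, min_chars_per_page: int = 20) -> tuple[str, bool]:
    """Return (text, ocr_used) for the PDF at `path`."""
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    if sum(len(p) for p in pages) >= min_chars_per_page * len(pages):
        return "\n".join(pages), False  # native text layer was sufficient
    # Fallback: rasterize pages with Poppler, then OCR them with Tesseract
    images = convert_from_path(path)  # pass poppler_path=... if Poppler is not on PATH
    ocr_pages = [pytesseract.image_to_string(img, lang="eng") for img in images]
    return "\n".join(ocr_pages), True
```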
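Similarly, the text-processing step could be sketched as follows. The paragraph heuristic and the short-line threshold are assumptions, since `text_cleaner.py`'s internals are not shown here.

```python
# Sketch only: whitespace normalization, short-line filtering, and
# paragraph reconstruction. The default threshold is a made-up value.
import re

def clean_text(raw: str, min_line_chars: int = 10) -> list[str]:
    """Return cleaned paragraphs from raw extracted text."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw):  # blank lines delimit paragraphs
        lines = [re.sub(r"\s+", " ", ln).strip() for ln in block.splitlines()]
        lines = [ln for ln in lines if len(ln) >= min_line_chars]  # drop short lines
        if lines:
            paragraphs.append(" ".join(lines))  # rejoin hard-wrapped lines
    return paragraphs
```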
```text
MakeAIDatasets/
├── input/               # Source files directory
├── output/              # Processing results
│   ├── cleaned_texts/   # Processed text files
│   └── metadata/        # JSON metadata files
├── hf_dataset/          # Hugging Face dataset
├── src/                 # Application source code
│   ├── main.py          # Main processing script
│   ├── processors/      # Processing modules
│   │   ├── pdf_processor.py
│   │   ├── epub_processor.py
│   │   └── text_cleaner.py
│   └── utils/           # Utility functions
├── tests/               # Test cases
├── Dockerfile           # Container configuration
├── requirements.txt     # Python dependencies
├── .env.example         # Environment configuration
└── README.md            # Project documentation
```
```text
PyPDF2==3.0.0
ebooklib==1.0.0
beautifulsoup4==4.12.0
datasets==2.14.0
pdf2image==1.16.0
pytesseract==0.3.10
fasttext==0.9.2
requests==2.31.0
huggingface-hub==0.16.4
python-dotenv==1.0.0
tqdm==4.66.1
```
```bash
# Ubuntu/Debian
sudo apt install -y \
    tesseract-ocr \
    poppler-utils \
    libgl1 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    tesseract-ocr-eng
```

```dockerfile
FROM python:3.10-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    poppler-utils \
    libgl1 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    tesseract-ocr-eng \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create directories
RUN mkdir -p input output

# Run processing
CMD ["python", "src/main.py"]
```

- Place source files in the `/input` directory
- Run the main processing script: `python src/main.py`
- Follow the interactive prompts for dataset creation
```bash
# Process files without interaction
python src/main.py --auto

# Set custom directories
python src/main.py --input custom_input --output custom_output

# Enable debug logging
python src/main.py --log-level DEBUG
```

```bash
# Build Docker image
docker build -t book-processor .

# Run container with volume mounts
docker run -it --rm \
    -v $(pwd)/input:/app/input \
    -v $(pwd)/output:/app/output \
    book-processor

# With environment variables
docker run -it --rm \
    -e POPPLER_PATH=/custom/path \
    -e MIN_ENGLISH_CONFIDENCE=0.8 \
    -v $(pwd)/input:/app/input \
    -v $(pwd)/output:/app/output \
    book-processor
```

- Process all input files and generate a summary report: `python -m src.cli --process`
- Build the Hugging Face dataset: `python -m src.cli --build-dataset`
- Upload the dataset to the Hugging Face Hub: `python -m src.cli --build-dataset --upload-hf`
- Process with a different output format: `python -m src.cli --process --output-format json`
- Start the web interface: `python src/webapp.py`
- Go to `http://localhost:5000` in your browser and upload a file.
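A minimal upload endpoint for `src/webapp.py` might look like the sketch below. This is an assumption: Flask is not among the pinned requirements, and the real implementation may use a different framework.

```python
# Hypothetical sketch of a minimal webapp.py; Flask usage is assumed.
from pathlib import Path
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = Path("input")  # files here are picked up on the next pipeline run

@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "POST":
        file = request.files["file"]  # the uploaded PDF/EPUB
        UPLOAD_DIR.mkdir(exist_ok=True)
        file.save(UPLOAD_DIR / secure_filename(file.filename))
        return f"Saved {file.filename} to input/"
    return ('<form method="post" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>')

if __name__ == "__main__":
    app.run(port=5000)
```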
- Run all tests: `pytest`
- Multi-language support via the `--lang` parameter.
- After batch processing, check `output/summary_report.json` for summary statistics.
- Output can be exported in different formats: txt, json, csv.
- Logs can be written to both console and file (can be improved).
- Failed files are listed in `summary_report.json` (see the sketch after this list).
- Add tests for every function.
- CI/CD with GitHub Actions is recommended.
- Update README and examples regularly.
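The summary report can also be inspected programmatically, roughly as below. The `failed_files` key is an assumed schema; check an actual report for the real field names.

```python
# Illustrative only: list files that failed during batch processing.
import json
from pathlib import Path

report = json.loads(Path("output/summary_report.json").read_text(encoding="utf-8"))
for entry in report.get("failed_files", []):  # "failed_files" is an assumed key
    print("failed:", entry)
```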
Customize processing through environment variables (use a `.env` file or export them in your shell):

| Variable | Default | Description |
|---|---|---|
| `POPPLER_PATH` | System PATH | Custom Poppler binaries location |
| `TESSERACT_THREADS` | 4 | OCR processing threads |
| `MIN_ENGLISH_CONFIDENCE` | 0.7 | Language detection threshold |
| `HF_TOKEN` | - | Hugging Face API token |
| `LOGLEVEL` | INFO | Log verbosity |
| `MAX_WORKERS` | CPU cores | Thread pool size |
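Reading these variables and applying the language threshold might look like the following sketch. `lid.176.bin` is fastText's public language-identification model and must be downloaded separately; the helper name is illustrative.

```python
# Sketch: env-driven English filter. is_english is a hypothetical helper.
import os
import fasttext
from dotenv import load_dotenv

load_dotenv()  # picks up MIN_ENGLISH_CONFIDENCE etc. from a .env file
THRESHOLD = float(os.getenv("MIN_ENGLISH_CONFIDENCE", "0.7"))
model = fasttext.load_model("lid.176.bin")  # fastText's language-ID model

def is_english(paragraph: str) -> bool:
    # fastText rejects newlines in input, so flatten the paragraph first
    labels, probs = model.predict(paragraph.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= THRESHOLD
```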
Example output layout:

```text
output/
├── cleaned_texts/
│   ├── book1_cleaned.txt
│   └── book2_cleaned.txt
├── metadata/
│   ├── book1_metadata.json
│   └── book2_metadata.json
└── hf_dataset/
    ├── dataset_info.json
    ├── state.json
    └── data/
        └── train.arrow
```
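The dataset-export step could be sketched with the `datasets` library as below. Directory names follow the layout above; the `text` column and the pairing convention between cleaned files and metadata files are assumptions.

```python
# Sketch: build the Hugging Face dataset from cleaned texts + metadata.
import json
from pathlib import Path
from datasets import Dataset

texts, sources = [], []
for txt in sorted(Path("output/cleaned_texts").glob("*_cleaned.txt")):
    meta_file = Path("output/metadata") / txt.name.replace("_cleaned.txt", "_metadata.json")
    meta = json.loads(meta_file.read_text(encoding="utf-8"))
    texts.append(txt.read_text(encoding="utf-8"))
    sources.append(meta["source_file"])  # field shown in the sample metadata below

ds = Dataset.from_dict({"text": texts, "source_file": sources})
ds.save_to_disk("hf_dataset")  # writes dataset_info.json, state.json, Arrow data
# ds.push_to_hub("user/my-dataset")  # optional; requires Hub authentication
```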
Sample metadata:
```json
{
  "source_file": "deep_learning.pdf",
  "source_format": ".pdf",
  "paragraph_count": 1242,
  "character_count": 687412,
  "english_ratio": "1242/1248",
  "ocr_used": true,
  "title": "Deep Learning Textbook",
  "author": "Ian Goodfellow et al.",
  "creation_date": "D:20230115120000Z"
}
```

Planned improvements:
- EPUB chapter preservation
- Automatic quality scoring
- Kaggle API integration
- Docker image optimization
- PDF text/OCR hybrid mode
- Distributed processing with Celery
- AWS S3 integration
- Content deduplication
- Topic classification
- Readability metrics
- REST API interface
We welcome contributions! Here's how to help:
- Report issues and suggest features
- Submit pull requests:
  - Fork the repository
  - Create a feature branch (`feat/new-feature`)
  - Submit a PR with a detailed description
- Improve documentation
- Add test cases (see the sketch after this list):
  - Unit tests for all processing modules
  - Integration tests with sample books
  - Performance benchmarks
  - Error handling simulations
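A unit test for the cleaner might look like the sketch below; `normalize_whitespace` is an assumed function name, since `text_cleaner.py`'s API is not documented here.

```python
# Hypothetical pytest case; the imported function name is an assumption.
from src.processors.text_cleaner import normalize_whitespace

def test_normalize_whitespace_collapses_runs():
    assert normalize_whitespace("a  b\t c") == "a b c"
```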
For assistance, please:
- Check the Troubleshooting Guide
- Open a GitHub issue

