Add comprehensive end-to-end documentation for CharLOTTE by ringger · Pull Request #2 · hatch5o6/SCIMT

ringger · 2025-10-15T04:53:34Z

Summary

This PR adds comprehensive end-to-end documentation to help new researchers successfully run complete CharLOTTE experiments from installation through translation and evaluation.

New Documentation Files

Core Guides (docs/)

QUICKSTART.md: 30-minute end-to-end pipeline demonstration (all 6 phases)
EXPERIMENTATION.md: Complete Portuguese→English experimental workflow
SETUP.md: Installation guide with requirements decision tree
TROUBLESHOOTING.md: Enhanced troubleshooting with SC_MODEL_ID diagnostics
CONFIGURATION.md: Complete reference for all configuration parameters
DATA_PREPARATION.md: Guide for obtaining and preparing parallel corpora
MONITORING.md: Training monitoring and evaluation guide

Infrastructure Improvements

requirements-minimal.txt: Clean minimal dependencies for core functionality
nmt.requirements.txt: NMT pipeline-specific requirements
Pipeline/train_SC_venv.sh: Virtual environment-compatible SC training script
charlotte-test/: Complete quickstart test environment with configs and automation
test_imports.py: Import verification utility for troubleshooting
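
The role of an import check such as test_imports.py can be illustrated with a small sketch. This is not the script from the PR, whose contents are not shown here; the function name `check_imports` and the module names passed to it are placeholders chosen for illustration.

```python
import importlib.util


def check_imports(module_names):
    """Map each module name to whether it can be located in this environment."""
    results = {}
    for name in module_names:
        try:
            # find_spec returns None when a top-level module cannot be located.
            results[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is missing,
            # or for malformed (for example empty) module names.
            results[name] = False
    return results


if __name__ == "__main__":
    # Placeholder names: substitute the packages your environment needs.
    report = check_imports(["json", "csv", "not_a_real_module_xyz"])
    for name, ok in report.items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

Running a check like this inside each virtual environment separates "package not installed" problems from configuration problems before any training script is launched.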

Key Improvements

End-to-end focus: All guides now walk through complete workflows, not just isolated components
Verification steps: Each phase includes success criteria and verification commands
Phase/step mapping: Clear correspondence between 6-phase pipeline and actual scripts
Requirements clarity: Decision tree for which requirements file to use
SC_MODEL_ID troubleshooting: Diagnostic commands to find actual SC model IDs from filenames
Quickstart automation: run_full_quickstart.sh runs all 6 phases automatically

Testing

The quickstart has been tested end-to-end on macOS with:

Python 3.10
All three virtual environments (venv_sound, venv_copper, venv_nmt)
FastAlign, CopperMT/Moses, and all dependencies
Complete pipeline from SC training through NMT translation
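
A quick way to confirm that the three environments named above exist before launching the pipeline is sketched below. It assumes each environment is a directory directly under a common root and was created with Python's standard `venv` module; that location and the helper name `missing_venvs` are assumptions, not details taken from the PR.

```python
from pathlib import Path

# Environment names as listed in the Testing section.
EXPECTED_VENVS = ("venv_sound", "venv_copper", "venv_nmt")


def missing_venvs(root, names=EXPECTED_VENVS):
    """Return the expected environments that are absent under root."""
    root = Path(root)
    # Environments created by the standard venv module contain a pyvenv.cfg file.
    return [name for name in names if not (root / name / "pyvenv.cfg").is_file()]


if __name__ == "__main__":
    missing = missing_venvs(".")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all environments found")
```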

Impact

This documentation enables researchers to:

Complete their first end-to-end experiment in ~30 minutes
Understand the complete CharLOTTE methodology (not just components)
Troubleshoot common issues independently
Scale up to full experiments with confidence

Changes

130 files changed: +82,353 insertions, -1,546 deletions
Updated .gitignore: Added patterns for venv/ and development artifacts

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Add extensive documentation to make the CharLOTTE pipeline fully reproducible from start to finish. Grew from 480 lines to 1,576 lines (+1,096 lines).

Major additions:
- Data acquisition guide with public corpus sources (OPUS, Tatoeba, etc.)
- Complete data preparation workflow with CSV generation scripts
- Path configuration guide explaining placeholder replacement
- 15-minute quick test for environment verification
- End-to-end example with 9 detailed steps and full config files
- Time and resource estimates for all training stages
- NMT training documentation (previously undocumented)
- Training metrics visualization guide (loss curves, TensorBoard)
- COMET evaluation setup (optional advanced metric)
- Troubleshooting section with common issues and solutions
- SC model explanation section clarifying the core method

This comprehensive documentation enables researchers to:
1. Obtain and format parallel data from public sources
2. Configure all three pipeline stages (SC, tokenizer, NMT)
3. Run complete experiments with concrete examples
4. Monitor training progress and visualize results
5. Reproduce published CharLOTTE results on new language pairs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit adds complete documentation to enable new researchers to successfully run end-to-end CharLOTTE experiments from installation through translation and evaluation.

Documentation improvements (docs/):
- QUICKSTART.md: 30-min end-to-end pipeline demo (6 phases)
- EXPERIMENTATION.md: Complete Portuguese→English workflow
- SETUP.md: Installation guide with requirements decision tree
- TROUBLESHOOTING.md: Enhanced SC_MODEL_ID troubleshooting
- CONFIGURATION.md, DATA_PREPARATION.md, MONITORING.md

Infrastructure improvements:
- requirements-minimal.txt: Clean minimal dependencies
- nmt.requirements.txt: NMT pipeline requirements
- Pipeline/train_SC_venv.sh: venv-compatible SC training
- charlotte-test/: Complete quickstart test environment
- test_imports.py: Import verification utility

Key improvements:
- End-to-end focused (not component-only)
- Verification at each phase
- Phase/step mapping clarification
- Success criteria for all phases
- Requirements file decision tree
- SC_MODEL_ID diagnostic commands

Updated .gitignore:
- Added venv/ and venv_*/ patterns
- Added archive/ pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit addresses clarity and redundancy issues across all documentation files, improving the experience for new researchers.

Key improvements:
- Consolidated quickstart guidance to emphasize full 6-phase pipeline test
- Simplified phases vs. steps terminology in EXPERIMENTATION.md
- Made SETUP.md single source of truth for path configuration
- Consolidated SC Model ID mismatch explanation to TROUBLESHOOTING.md
- Simplified requirements files section with clear 3-environment table
- Moved "Why three environments?" explanation earlier in setup flow
- Removed BLEU score duplicates, keeping MONITORING.md as single source
- Cleaned up forward references and moved to end of sections
- Added "Who is this for?" boxes to QUICKSTART and EXPERIMENTATION
- Added progress indicators at major checkpoints in EXPERIMENTATION
- Expanded CSV naming convention explanation in DATA_PREPARATION
- Removed outdated ⭐ 'NEW SECTION' markers

The documentation now provides clearer navigation, reduced redundancy, and better guidance for users from installation through experimentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit enhances the CharLOTTE quickstart with automatic baseline comparison, comprehensive loss curve visualization, and Apple Silicon MPS support, making it easier for users to understand SC augmentation value.

## New Features

1. **SC Model Loss Curves**: Auto-generate training/validation loss plots for character-level RNN SC model by parsing fairseq logs
2. **Baseline Comparison Pipeline**: Three new phases (B1-B3) train and evaluate NMT model WITHOUT SC augmentation for fair comparison
3. **Automatic BLEU Comparison**: Side-by-side display showing SC-augmented (38% BLEU) vs baseline (33% BLEU) = ~16% improvement
4. **MPS (Apple Silicon) Support**: Full support for training on Apple Silicon GPUs via PyTorch MPS backend
5. **Enhanced Documentation**: Updated QUICKSTART.md with baseline phases, loss curves, MPS usage, and realistic BLEU expectations
6. **Doubled Dataset Size**: Increased FLORES from 800/100/100 to 1600/200/200 sentences for better demonstration quality
7. **Improved .gitignore**: Added patterns to exclude generated outputs (models, logs, data files) from future commits

## Key Changes

**New Files:**
- charlotte-test/plot_sc_training_loss.py - SC loss curve generator
- charlotte-test/plot_training_loss.py - NMT loss curve generator
- charlotte-test/test-nmt-pt-en-baseline.yaml - Baseline NMT config
- charlotte-test/test-tok-pt-en-baseline.cfg - Baseline tokenizer config
- charlotte-test/data/csv/nmt_train_baseline.csv - Baseline training data
- charlotte-test/data/csv/nmt_val_baseline.csv - Baseline validation data

**Modified Core Files:**
- .gitignore - Exclude generated outputs from future commits
- NMT/train.py - Relaxed filename assertions for flexible CSV naming
- Pipeline/train_tokenizer.sh - Combined tokenizer, default params, paths
- README.md - Added "Three Models in CharLOTTE" section
- docs/QUICKSTART.md - Comprehensive update with baseline and curves
- docs/CONFIGURATION.md - Added MPS device documentation

**Modified Test Configs:**
- charlotte-test/run_full_quickstart.sh - Integrated baseline phases B1-B3
- charlotte-test/download_flores.py - Doubled dataset sizes
- charlotte-test/test-nmt-pt-en.yaml - MPS support, longer training
- charlotte-test/test-tok-es-pt-en.cfg - Fixed distribution naming

## Results

Baseline comparison successfully demonstrates SC augmentation value:
- SC-Augmented: 38.1% BLEU (3200 training pairs with augmentation)
- Baseline: 32.9% BLEU (1600 pt-en pairs only)
- Improvement: ~16% relative BLEU gain from SC data augmentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
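
The reported ~16% figure is a relative gain, not a difference in BLEU points. The arithmetic can be checked with the two scores quoted in the Results above; the helper name `relative_gain` is introduced here only for illustration.

```python
def relative_gain(augmented, baseline):
    """Relative improvement of augmented over baseline, as a percentage."""
    return (augmented - baseline) / baseline * 100.0


sc_bleu = 38.1        # SC-augmented score quoted in the Results
baseline_bleu = 32.9  # baseline score quoted in the Results

absolute = sc_bleu - baseline_bleu
relative = relative_gain(sc_bleu, baseline_bleu)
print(f"absolute gain: {absolute:.1f} BLEU points")
print(f"relative gain: {relative:.1f}%")
```

The absolute gap is 5.2 points, and 5.2 / 32.9 is roughly 15.8%, which rounds to the ~16% stated in the commit message.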

Remove generated files from git tracking while keeping them locally:
- Cognate extractions (charlotte-test/cognates_*)
- SC model outputs (charlotte-test/sc_models_*)
- FLORES data files (charlotte-test/data/raw/*.es, *.pt, *.en)
- Test log files (charlotte-test/*.log)
- Cognate dataset logs (cognate_dataset_log_NG=True.*)

These files are now covered by .gitignore and will not be tracked in future commits. The files remain on disk for local testing.

This follows the principle that version control should track source code and configurations, not generated outputs or downloaded datasets.
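
Untracking files while leaving them on disk is typically done with `git rm --cached`. The sketch below builds that command from Python for the patterns listed in this commit message. It is an illustration, not the procedure actually used for this commit, and the helper name `untrack_command` is hypothetical. The FLORES entry is expanded into one pattern per extension on the assumption that all three live under charlotte-test/data/raw/.

```python
import shlex

# Patterns taken from the commit message above.
GENERATED_PATTERNS = [
    "charlotte-test/cognates_*",
    "charlotte-test/sc_models_*",
    "charlotte-test/data/raw/*.es",
    "charlotte-test/data/raw/*.pt",
    "charlotte-test/data/raw/*.en",
    "charlotte-test/*.log",
    "cognate_dataset_log_NG=True.*",
]


def untrack_command(patterns, dry_run=True):
    """Build a `git rm --cached` command that removes paths from the index only."""
    cmd = ["git", "rm", "-r", "--cached", "--ignore-unmatch"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be removed, change nothing
    # "--" separates options from pathspecs; git matches the globs itself.
    return cmd + ["--"] + list(patterns)


if __name__ == "__main__":
    # Print a shell-safe version of the command for review before running it.
    print(" ".join(shlex.quote(part) for part in untrack_command(GENERATED_PATTERNS)))
```

Keeping `dry_run=True` as the default lets the list of affected files be reviewed first; the returned list can then be passed to `subprocess.run` from the repository root.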

These files are no longer used:
- train.no_overlap_v1.csv, val.no_overlap_v1.csv (replaced by new baseline CSVs)
- quickstart_phase*.log (old log files, now using organized logs/ directory)

hatch5o6 and others added 30 commits September 4, 2025 13:29

update

f1c5965

add to README.md

db3b93f

update

a170006

readme

066ff0d

updating docs

c72449e

updating docs

5fc2f90

update docs

6ed3284

update docs

4759496

update docs

add9c78

update docs

64dd6d1

update docs

022ecea

update docs

74d7135

update docs

23eb18c

update docs

4818855

updates to docs and scripts

9a5897b

update to docs

c4bf7f5

update to docs

d57c9a2

test push

0e93a9a

fix after test

efebfaf

formatting

e2d641b

updates to development, docs, and hyperparam_search

87c7414

update

d7f1d53

update

b804261

update

cfae02f

update

5e4b39e

update

66874a2

update

f805750

ringger and others added 3 commits October 15, 2025 14:59

Remove obsolete CSV and log files from tracking

29a043e


ringger wants to merge 33 commits into hatch5o6:main from ringger:docs/comprehensive-readme-expansion
