Add comprehensive end-to-end documentation for CharLOTTE #2

Open
ringger wants to merge 33 commits into hatch5o6:main from ringger:docs/comprehensive-readme-expansion

ringger commented Oct 15, 2025

Summary

This PR adds comprehensive end-to-end documentation to help new researchers successfully run complete CharLOTTE experiments from installation through translation and evaluation.

New Documentation Files

Core Guides (docs/)

  • QUICKSTART.md: 30-minute end-to-end pipeline demonstration (all 6 phases)
  • EXPERIMENTATION.md: Complete Portuguese→English experimental workflow
  • SETUP.md: Installation guide with requirements decision tree
  • TROUBLESHOOTING.md: Enhanced troubleshooting with SC_MODEL_ID diagnostics
  • CONFIGURATION.md: Complete reference for all configuration parameters
  • DATA_PREPARATION.md: Guide for obtaining and preparing parallel corpora
  • MONITORING.md: Training monitoring and evaluation guide

Infrastructure Improvements

  • requirements-minimal.txt: Clean minimal dependencies for core functionality
  • nmt.requirements.txt: NMT pipeline-specific requirements
  • Pipeline/train_SC_venv.sh: Virtual environment-compatible SC training script
  • charlotte-test/: Complete quickstart test environment with configs and automation
  • test_imports.py: Import verification utility for troubleshooting
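
An import verification utility such as the one listed above can be quite small. The following is a minimal sketch of the idea, not the contents of test_imports.py: the function name `check_imports` and the module list are hypothetical and chosen only for illustration.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map name -> None on success, or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or any error raised at import time
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Hypothetical module list; the real utility may check different packages.
    report = check_imports(["json", "csv", "torch", "sentencepiece"])
    for name, error in report.items():
        status = "OK  " if error is None else "FAIL"
        detail = f" ({error})" if error else ""
        print(f"{status} {name}{detail}")
```

Running a check like this inside each virtual environment shows which packages are missing before a long training job starts.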

Key Improvements

  1. End-to-end focus: All guides now walk through complete workflows, not just isolated components
  2. Verification steps: Each phase includes success criteria and verification commands
  3. Phase/step mapping: Clear correspondence between 6-phase pipeline and actual scripts
  4. Requirements clarity: Decision tree for which requirements file to use
  5. SC_MODEL_ID troubleshooting: Diagnostic commands to find actual SC model IDs from filenames
  6. Quickstart automation: run_full_quickstart.sh runs all 6 phases automatically

Testing

The quickstart has been tested end-to-end on macOS with:

  • Python 3.10
  • All three virtual environments (venv_sound, venv_copper, venv_nmt)
  • FastAlign, CopperMT/Moses, and all dependencies
  • Complete pipeline from SC training through NMT translation

Impact

This documentation enables researchers to:

  • Complete their first end-to-end experiment in ~30 minutes
  • Understand the complete CharLOTTE methodology (not just components)
  • Troubleshoot common issues independently
  • Scale up to full experiments with confidence

Changes

  • 130 files changed: +82,353 insertions, -1,546 deletions
  • Updated .gitignore: Added patterns for venv/ and development artifacts

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

hatch5o6 and others added 30 commits September 4, 2025 13:29
Add extensive documentation to make the CharLOTTE pipeline fully reproducible
from start to finish. Grew from 480 lines to 1,576 lines (+1,096 lines).

Major additions:
- Data acquisition guide with public corpus sources (OPUS, Tatoeba, etc.)
- Complete data preparation workflow with CSV generation scripts
- Path configuration guide explaining placeholder replacement
- 15-minute quick test for environment verification
- End-to-end example with 9 detailed steps and full config files
- Time and resource estimates for all training stages
- NMT training documentation (previously undocumented)
- Training metrics visualization guide (loss curves, TensorBoard)
- COMET evaluation setup (optional advanced metric)
- Troubleshooting section with common issues and solutions
- SC model explanation section clarifying the core method

This comprehensive documentation enables researchers to:
1. Obtain and format parallel data from public sources
2. Configure all three pipeline stages (SC, tokenizer, NMT)
3. Run complete experiments with concrete examples
4. Monitor training progress and visualize results
5. Reproduce published CharLOTTE results on new language pairs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit adds complete documentation to enable new researchers to
successfully run end-to-end CharLOTTE experiments from installation
through translation and evaluation.

Documentation improvements (docs/):
- QUICKSTART.md: 30-min end-to-end pipeline demo (6 phases)
- EXPERIMENTATION.md: Complete Portuguese→English workflow
- SETUP.md: Installation guide with requirements decision tree
- TROUBLESHOOTING.md: Enhanced SC_MODEL_ID troubleshooting
- CONFIGURATION.md, DATA_PREPARATION.md, MONITORING.md

Infrastructure improvements:
- requirements-minimal.txt: Clean minimal dependencies
- nmt.requirements.txt: NMT pipeline requirements
- Pipeline/train_SC_venv.sh: venv-compatible SC training
- charlotte-test/: Complete quickstart test environment
- test_imports.py: Import verification utility

Key improvements:
- End-to-end focused (not component-only)
- Verification at each phase
- Phase/step mapping clarification
- Success criteria for all phases
- Requirements file decision tree
- SC_MODEL_ID diagnostic commands

Updated .gitignore:
- Added venv/ and venv_*/ patterns
- Added archive/ pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit addresses clarity and redundancy issues across all documentation
files, improving the experience for new researchers.

Key improvements:
- Consolidated quickstart guidance to emphasize full 6-phase pipeline test
- Simplified phases vs. steps terminology in EXPERIMENTATION.md
- Made SETUP.md single source of truth for path configuration
- Consolidated SC Model ID mismatch explanation to TROUBLESHOOTING.md
- Simplified requirements files section with clear 3-environment table
- Moved "Why three environments?" explanation earlier in setup flow
- Removed BLEU score duplicates, keeping MONITORING.md as single source
- Cleaned up forward references and moved to end of sections
- Added "Who is this for?" boxes to QUICKSTART and EXPERIMENTATION
- Added progress indicators at major checkpoints in EXPERIMENTATION
- Expanded CSV naming convention explanation in DATA_PREPARATION
- Removed outdated ⭐ 'NEW SECTION' markers

The documentation now provides clearer navigation, reduced redundancy, and
better guidance for users from installation through experimentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

ringger and others added 3 commits October 15, 2025 14:59
This commit enhances the CharLOTTE quickstart with automatic baseline
comparison, comprehensive loss curve visualization, and Apple Silicon MPS
support, making it easier for users to understand SC augmentation value.

## New Features

1. **SC Model Loss Curves**: Auto-generate training/validation loss plots
   for character-level RNN SC model by parsing fairseq logs

2. **Baseline Comparison Pipeline**: Three new phases (B1-B3) train and
   evaluate NMT model WITHOUT SC augmentation for fair comparison

3. **Automatic BLEU Comparison**: Side-by-side display showing SC-augmented
   (38% BLEU) vs. baseline (33% BLEU), a ~16% relative improvement

4. **MPS (Apple Silicon) Support**: Full support for training on Apple
   Silicon GPUs via PyTorch MPS backend

5. **Enhanced Documentation**: Updated QUICKSTART.md with baseline phases,
   loss curves, MPS usage, and realistic BLEU expectations

6. **Doubled Dataset Size**: Increased FLORES from 800/100/100 to
   1600/200/200 sentences for better demonstration quality

7. **Improved .gitignore**: Added patterns to exclude generated outputs
   (models, logs, data files) from future commits
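
As a rough illustration of the MPS support described in item 4, device selection in PyTorch could look like the sketch below. `pick_device` is a hypothetical helper, not a function from this PR, and the fallback order (CUDA, then MPS, then CPU) is an assumption; the actual logic in the NMT code and configs may differ.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing in older PyTorch releases, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

On an Apple Silicon machine with a recent PyTorch build this would be expected to select `mps`; elsewhere it falls back to CUDA or CPU.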

## Key Changes

**New Files:**
- charlotte-test/plot_sc_training_loss.py - SC loss curve generator
- charlotte-test/plot_training_loss.py - NMT loss curve generator
- charlotte-test/test-nmt-pt-en-baseline.yaml - Baseline NMT config
- charlotte-test/test-tok-pt-en-baseline.cfg - Baseline tokenizer config
- charlotte-test/data/csv/nmt_train_baseline.csv - Baseline training data
- charlotte-test/data/csv/nmt_val_baseline.csv - Baseline validation data
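
The loss-curve generators listed above work by parsing training logs. The sketch below shows one way the extraction step could look; it is not the code in plot_sc_training_loss.py. The log line shape (pipe-separated fields containing a `train` or `valid` tag, `epoch N`, and `loss X`) is an assumption about the log format, and `parse_losses` is a hypothetical name.

```python
import re

# Assumed line shape (not confirmed by the PR): pipe-separated fields that
# include a "train" or "valid" tag, "epoch N", and later a "loss X" field.
LINE_RE = re.compile(
    r"\|\s*(train|valid)\s*\|\s*epoch\s+(\d+)\b.*?\|\s*loss\s+([0-9.]+)"
)


def parse_losses(lines):
    """Collect per-epoch losses as {'train': {epoch: loss}, 'valid': {epoch: loss}}."""
    curves = {"train": {}, "valid": {}}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            split, epoch, loss = match.group(1), int(match.group(2)), float(match.group(3))
            curves[split][epoch] = loss  # a later line for the same epoch overwrites
    return curves


if __name__ == "__main__":
    # Made-up sample lines in the assumed format, for illustration only.
    sample = [
        "2025-10-15 14:00:00 | INFO | train | epoch 001 | loss 5.25 | nll_loss 5.1",
        "2025-10-15 14:00:05 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 4.75",
        "2025-10-15 14:01:00 | INFO | train | epoch 002 | loss 4.5 | nll_loss 4.4",
    ]
    print(parse_losses(sample))
```

The resulting dictionaries can be passed to any plotting library to draw the training and validation curves.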

**Modified Core Files:**
- .gitignore - Exclude generated outputs from future commits
- NMT/train.py - Relaxed filename assertions for flexible CSV naming
- Pipeline/train_tokenizer.sh - Combined tokenizer, default params, paths
- README.md - Added "Three Models in CharLOTTE" section
- docs/QUICKSTART.md - Comprehensive update with baseline and curves
- docs/CONFIGURATION.md - Added MPS device documentation

**Modified Test Configs:**
- charlotte-test/run_full_quickstart.sh - Integrated baseline phases B1-B3
- charlotte-test/download_flores.py - Doubled dataset sizes
- charlotte-test/test-nmt-pt-en.yaml - MPS support, longer training
- charlotte-test/test-tok-es-pt-en.cfg - Fixed distribution naming

## Results

Baseline comparison successfully demonstrates SC augmentation value:
- SC-Augmented: 38.1% BLEU (3200 training pairs with augmentation)
- Baseline: 32.9% BLEU (1600 pt-en pairs only)
- Improvement: ~16% relative BLEU gain (5.2 points absolute) from SC data augmentation
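
The relative gain quoted above follows directly from the two reported scores. A short worked calculation, with `relative_gain` as an illustrative helper name rather than anything from the PR:

```python
def relative_gain(augmented, baseline):
    """Relative improvement of `augmented` over `baseline`, in percent."""
    return (augmented - baseline) / baseline * 100.0


if __name__ == "__main__":
    sc_bleu, baseline_bleu = 38.1, 32.9  # scores reported in the Results section
    print(f"absolute gain: {sc_bleu - baseline_bleu:.1f} BLEU points")
    print(f"relative gain: {relative_gain(sc_bleu, baseline_bleu):.1f}%")
```

Note that the two systems were trained on different amounts of data (3200 vs. 1600 pairs), so the gain reflects the augmented data as a whole.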

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Remove generated files from git tracking while keeping them locally:
- Cognate extractions (charlotte-test/cognates_*)
- SC model outputs (charlotte-test/sc_models_*)
- FLORES data files (charlotte-test/data/raw/*.es, *.pt, *.en)
- Test log files (charlotte-test/*.log)
- Cognate dataset logs (cognate_dataset_log_NG=True.*)

These files are now covered by .gitignore and will not be tracked in
future commits. The files remain on disk for local testing.

This follows the principle that version control should track source code
and configurations, not generated outputs or downloaded datasets.

These files are no longer used:
- train.no_overlap_v1.csv, val.no_overlap_v1.csv (replaced by new baseline CSVs)
- quickstart_phase*.log (old log files, now using organized logs/ directory)