Add comprehensive end-to-end documentation for CharLOTTE #2
Open
ringger wants to merge 33 commits into hatch5o6:main from
Add extensive documentation to make the CharLOTTE pipeline fully reproducible from start to finish. Grew from 480 lines to 1,576 lines (+1,096 lines).

Major additions:
- Data acquisition guide with public corpus sources (OPUS, Tatoeba, etc.)
- Complete data preparation workflow with CSV generation scripts
- Path configuration guide explaining placeholder replacement
- 15-minute quick test for environment verification
- End-to-end example with 9 detailed steps and full config files
- Time and resource estimates for all training stages
- NMT training documentation (previously undocumented)
- Training metrics visualization guide (loss curves, TensorBoard)
- COMET evaluation setup (optional advanced metric)
- Troubleshooting section with common issues and solutions
- SC model explanation section clarifying the core method

This comprehensive documentation enables researchers to:
1. Obtain and format parallel data from public sources
2. Configure all three pipeline stages (SC, tokenizer, NMT)
3. Run complete experiments with concrete examples
4. Monitor training progress and visualize results
5. Reproduce published CharLOTTE results on new language pairs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds complete documentation to enable new researchers to successfully run end-to-end CharLOTTE experiments from installation through translation and evaluation.

Documentation improvements (docs/):
- QUICKSTART.md: 30-min end-to-end pipeline demo (6 phases)
- EXPERIMENTATION.md: Complete Portuguese→English workflow
- SETUP.md: Installation guide with requirements decision tree
- TROUBLESHOOTING.md: Enhanced SC_MODEL_ID troubleshooting
- CONFIGURATION.md, DATA_PREPARATION.md, MONITORING.md

Infrastructure improvements:
- requirements-minimal.txt: Clean minimal dependencies
- nmt.requirements.txt: NMT pipeline requirements
- Pipeline/train_SC_venv.sh: venv-compatible SC training
- charlotte-test/: Complete quickstart test environment
- test_imports.py: Import verification utility

Key improvements:
- End-to-end focused (not component-only)
- Verification at each phase
- Phase/step mapping clarification
- Success criteria for all phases
- Requirements file decision tree
- SC_MODEL_ID diagnostic commands

Updated .gitignore:
- Added venv/ and venv_*/ patterns
- Added archive/ pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
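The commit above lists `test_imports.py` as an import verification utility but its contents are not shown on this page. As a rough illustration only, a check of that kind might look like the sketch below. The function name `check_imports` and the module list are assumptions made for this example, not the contents of the actual file.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map each name to None (ok) or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # a broken install can raise more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Hypothetical list: replace with the packages named in the requirements file in use.
    modules = ["json", "csv", "torch", "sentencepiece"]
    for name, error in check_imports(modules).items():
        status = "OK  " if error is None else "FAIL"
        detail = f" ({error})" if error else ""
        print(f"{status} {name}{detail}")
```

Running a check like this once per virtual environment gives a quick pass/fail signal before starting a long training phase.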
This commit addresses clarity and redundancy issues across all documentation files, improving the experience for new researchers.

Key improvements:
- Consolidated quickstart guidance to emphasize full 6-phase pipeline test
- Simplified phases vs. steps terminology in EXPERIMENTATION.md
- Made SETUP.md single source of truth for path configuration
- Consolidated SC Model ID mismatch explanation to TROUBLESHOOTING.md
- Simplified requirements files section with clear 3-environment table
- Moved "Why three environments?" explanation earlier in setup flow
- Removed BLEU score duplicates, keeping MONITORING.md as single source
- Cleaned up forward references and moved to end of sections
- Added "Who is this for?" boxes to QUICKSTART and EXPERIMENTATION
- Added progress indicators at major checkpoints in EXPERIMENTATION
- Expanded CSV naming convention explanation in DATA_PREPARATION
- Removed outdated ⭐ 'NEW SECTION' markers

The documentation now provides clearer navigation, reduced redundancy, and better guidance for users from installation through experimentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit enhances the CharLOTTE quickstart with automatic baseline comparison, comprehensive loss curve visualization, and Apple Silicon MPS support, making it easier for users to understand the value of SC augmentation.

## New Features

1. **SC Model Loss Curves**: Auto-generate training/validation loss plots for the character-level RNN SC model by parsing fairseq logs
2. **Baseline Comparison Pipeline**: Three new phases (B1-B3) train and evaluate an NMT model WITHOUT SC augmentation for fair comparison
3. **Automatic BLEU Comparison**: Side-by-side display showing SC-augmented (38% BLEU) vs. baseline (33% BLEU), a ~16% relative improvement
4. **MPS (Apple Silicon) Support**: Full support for training on Apple Silicon GPUs via the PyTorch MPS backend
5. **Enhanced Documentation**: Updated QUICKSTART.md with baseline phases, loss curves, MPS usage, and realistic BLEU expectations
6. **Doubled Dataset Size**: Increased FLORES from 800/100/100 to 1600/200/200 sentences for better demonstration quality
7. **Improved .gitignore**: Added patterns to exclude generated outputs (models, logs, data files) from future commits

## Key Changes

**New Files:**
- charlotte-test/plot_sc_training_loss.py - SC loss curve generator
- charlotte-test/plot_training_loss.py - NMT loss curve generator
- charlotte-test/test-nmt-pt-en-baseline.yaml - Baseline NMT config
- charlotte-test/test-tok-pt-en-baseline.cfg - Baseline tokenizer config
- charlotte-test/data/csv/nmt_train_baseline.csv - Baseline training data
- charlotte-test/data/csv/nmt_val_baseline.csv - Baseline validation data

**Modified Core Files:**
- .gitignore - Exclude generated outputs from future commits
- NMT/train.py - Relaxed filename assertions for flexible CSV naming
- Pipeline/train_tokenizer.sh - Combined tokenizer, default params, paths
- README.md - Added "Three Models in CharLOTTE" section
- docs/QUICKSTART.md - Comprehensive update with baseline and curves
- docs/CONFIGURATION.md - Added MPS device documentation

**Modified Test Configs:**
- charlotte-test/run_full_quickstart.sh - Integrated baseline phases B1-B3
- charlotte-test/download_flores.py - Doubled dataset sizes
- charlotte-test/test-nmt-pt-en.yaml - MPS support, longer training
- charlotte-test/test-tok-es-pt-en.cfg - Fixed distribution naming

## Results

Baseline comparison successfully demonstrates SC augmentation value:
- SC-Augmented: 38.1% BLEU (3200 training pairs with augmentation)
- Baseline: 32.9% BLEU (1600 pt-en pairs only)
- Improvement: ~16% relative BLEU gain from SC data augmentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
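The "~16% relative BLEU gain" quoted in the results above follows directly from the two reported scores. The short sketch below redoes that arithmetic; `relative_gain` is a helper name introduced here for illustration and is not part of the repository.

```python
def relative_gain(augmented: float, baseline: float) -> float:
    """Relative improvement of `augmented` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (augmented - baseline) / baseline * 100.0


sc_bleu = 38.1        # SC-augmented BLEU reported in the commit message
baseline_bleu = 32.9  # baseline BLEU reported in the commit message

print(f"absolute gain: {sc_bleu - baseline_bleu:.1f} BLEU points")
print(f"relative gain: {relative_gain(sc_bleu, baseline_bleu):.1f}%")
```

Reporting the absolute difference (about 5 BLEU points) next to the relative figure avoids ambiguity about which kind of improvement is meant.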
Remove generated files from git tracking while keeping them locally:
- Cognate extractions (charlotte-test/cognates_*)
- SC model outputs (charlotte-test/sc_models_*)
- FLORES data files (charlotte-test/data/raw/*.es, *.pt, *.en)
- Test log files (charlotte-test/*.log)
- Cognate dataset logs (cognate_dataset_log_NG=True.*)

These files are now covered by .gitignore and will not be tracked in future commits. The files remain on disk for local testing.

This follows the principle that version control should track source code and configurations, not generated outputs or downloaded datasets.
These files are no longer used:
- train.no_overlap_v1.csv, val.no_overlap_v1.csv (replaced by new baseline CSVs)
- quickstart_phase*.log (old log files, now using organized logs/ directory)
Summary
This PR adds comprehensive end-to-end documentation to help new researchers successfully run complete CharLOTTE experiments from installation through translation and evaluation.
New Documentation Files
Core Guides (docs/)
Infrastructure Improvements
Key Improvements
run_full_quickstart.sh runs all 6 phases automatically
Testing
The quickstart has been tested end-to-end on macOS with:
Impact
This documentation enables researchers to:
Changes
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>