Closed
305 commits
db799df
add lightning
thien Jun 29, 2024
a8fe9d3
update test cases to run classifier
thien Jun 29, 2024
39af0e8
group data related libraries into datasets dir
thien Jun 29, 2024
d063f43
misc updates
thien Jun 29, 2024
2cf9344
ruff
thien Jun 29, 2024
ce75b53
update tests
thien Jun 29, 2024
bffee97
cleanup trainer test
thien Jun 29, 2024
d533858
refactor tokenizer save/load
thien Jul 1, 2024
5bdf2a5
enable training with lightning
thien Jul 1, 2024
cc92fe7
add tokenizer assets
thien Jul 1, 2024
0891bea
silence warn, improve throughput on tokenization
thien Jul 1, 2024
18871fa
change batch size based on compute
thien Jul 1, 2024
90fadaa
fix OOM
thien Jul 1, 2024
2070d93
add tensorboard
thien Jul 1, 2024
eee170c
add train metrics
thien Jul 1, 2024
6725870
fix classification metrics
thien Jul 1, 2024
118ca74
increase save frequency
thien Jul 1, 2024
18f6626
add hyperparams, typing
thien Jul 2, 2024
6eb12cf
add profiler
thien Jul 2, 2024
4a19875
initiate zero tensor on device
thien Jul 2, 2024
f39c04f
add perf metrics in checkpoint dir
thien Jul 2, 2024
96476c9
update hyperparam logging
thien Jul 3, 2024
6665468
update test cases
thien Jul 7, 2024
2b94830
fix to_tensor implementation with word tokenizer
thien Jul 7, 2024
aafcdc5
add constants; add support for word tokenizer
thien Jul 7, 2024
cb1131f
rename to models; add pytorch transformer implementation
thien Jul 12, 2024
f0a81ac
change imports to reflect path changes to models
thien Jul 12, 2024
1f0a069
add mps maxpool3d implementation
thien Jul 12, 2024
8856d39
cleanup mps implementation
thien Jul 12, 2024
1df4fc0
update gitignore
thien Jul 15, 2024
5cadc50
update requirements.txt
thien Jul 15, 2024
bfaf9f4
add sacrebleu; allow calculating bleu at eval during model training
thien Jul 15, 2024
40437c0
add manual maxpool/conv implementations
thien Jul 15, 2024
f6fa7fd
update test cases
thien Jul 15, 2024
ef2b982
change tensorboard path
thien Jul 15, 2024
210e698
add deprecated note
thien Jul 15, 2024
9a9ea8e
update docstring to include tensor shapes
thien Jul 15, 2024
f8b8eef
rm maxpool file since it's moved to tests
thien Jul 15, 2024
2a6c144
comment out print statements
thien Jul 15, 2024
0b5b4ff
add explicit args
thien Jul 15, 2024
c713a44
fix imports through __init__
thien Jul 15, 2024
f78a932
ignore models dir
thien Jul 15, 2024
cc3ce9f
add recovered moses based tokenizer
thien Jul 15, 2024
1de5874
add code to train spm
thien Jul 15, 2024
960eb99
add sentencepiece tokenizer, tests
thien Jul 15, 2024
11eda10
add support for spm in classifier
thien Jul 15, 2024
2684004
add tokenizer training
thien Jul 15, 2024
d13514d
change metric keys
thien Jul 15, 2024
e1d0386
initial migration to lightning
thien Jul 15, 2024
d09c732
change import refs based on new directories
thien Jul 15, 2024
e40d2f2
update lint rules for jaxtyping
thien Jul 15, 2024
44b6626
isort
thien Jul 15, 2024
579b002
move classifier code
thien Jul 15, 2024
3f485eb
update gitignore
thien Jul 15, 2024
7b0e010
move import path
thien Jul 16, 2024
041e634
restore tb path
thien Jul 16, 2024
fea325c
move data preprocessing path; change import
thien Jul 16, 2024
5b41030
add training code for pytorch vanilla based transformer
thien Jul 16, 2024
3be56c1
change loss function; update type hinting
thien Jul 16, 2024
497e80c
change preprocess import
thien Jul 16, 2024
bff4529
type hinting
thien Jul 16, 2024
ec0bef7
change import path
thien Jul 16, 2024
3e7ebd5
update tests
thien Jul 16, 2024
c800aa3
add type hinting; class inheritance
thien Jul 16, 2024
fcd52c2
update gitignore
thien Jul 16, 2024
7b98c69
move python bpe implementation
thien Jul 22, 2024
4cff18f
add masking function to ops
thien Jul 22, 2024
c1de748
comment out redundant variables
thien Jul 22, 2024
6c4a218
move bpe
thien Jul 22, 2024
58bdc88
isort; ruff
thien Jul 22, 2024
8ffa501
adjust transformer training; annotation
thien Jul 22, 2024
9700b79
update metrics on dataset
thien Jul 22, 2024
fd34415
split classifier into base class
thien Aug 1, 2024
edacbc2
fix naming on imports
thien Aug 1, 2024
629cf8a
split beam, sampling, decoders
thien Aug 1, 2024
bc4a50a
WIP: update decoder
thien Aug 1, 2024
2d3b256
update test cases
thien Aug 1, 2024
0c8abc2
add ignores
thien Aug 2, 2024
f3cbd58
update decoder tests
thien Aug 2, 2024
2d06e3a
even nicer implementation now since the embeddings are now a matmul
thien Aug 2, 2024
5a501f3
update test cases; fix unused imports
thien Aug 3, 2024
4ee9000
update rnn model; move transformer code
thien Aug 5, 2024
05c2988
update test case for rnn; remove unused imports
thien Aug 5, 2024
8b7e2d5
cleanup call
thien Aug 5, 2024
94adc03
add working implementation for rnn test case
thien Aug 5, 2024
85dedf1
use constants instead of string
thien Aug 5, 2024
4cb19f2
change constants
thien Aug 5, 2024
0716599
update nmt
thien Aug 5, 2024
3621580
add sequence length in hpm
thien Aug 5, 2024
461ba12
move around tests
thien Aug 5, 2024
03a21fb
add small tokenizer
thien Aug 5, 2024
5b0eb13
type hinting
thien Aug 5, 2024
e46ccf1
change batch size
thien Aug 5, 2024
acb6673
update recurrent implementation
thien Aug 5, 2024
ebdb5d3
various refactoring
thien Aug 6, 2024
d508629
add monolingual tokenizers
thien Aug 6, 2024
7d557c1
use einops for attention
thien Aug 8, 2024
285c355
added demo de en tokenizers
thien Aug 8, 2024
da06b80
add teacher forcing for rnn
thien Aug 8, 2024
b3b142c
add debug trainer run
thien Aug 8, 2024
7bca82f
initial learning rate was way too high
thien Aug 8, 2024
9bf1904
improved logging on NMTModule
thien Aug 8, 2024
9d26df5
fix docstring
thien Aug 8, 2024
b688cdb
reannotate rnn implementation for sanity checks
thien Aug 8, 2024
d76c332
update imports
thien Aug 8, 2024
80462c3
update tests for seq2seq
thien Aug 8, 2024
3af099d
update tokenizer detokenize task
thien Aug 8, 2024
e685044
update gitignore
thien Aug 8, 2024
90db0e1
ruff; misc formatting
thien Aug 8, 2024
c1a3a00
ruff
thien Aug 8, 2024
ffed5e3
rm prints
thien Aug 8, 2024
c5fbbcf
change default teacher-forcing value; subsampling on training data
thien Aug 8, 2024
2b39081
change config for main run
thien Aug 8, 2024
083f383
rm redundant files
thien Aug 8, 2024
acbb91f
add watcher for word_tokenizer max_len change
thien Aug 8, 2024
18fcf14
remove redundant files
thien Aug 8, 2024
e46bd4d
add log to optim
thien Aug 8, 2024
a878ea7
verbose hyperparams and model when training
thien Aug 8, 2024
9aedbb3
remove comments
thien Aug 8, 2024
e604ebc
update description on rnn
thien Aug 8, 2024
573c618
add test case for transformer with separate tokenizers for encoder an…
thien Aug 8, 2024
9c2139c
update comments, positions in PyTorchTransformerModel.__init__ to imp…
thien Aug 8, 2024
a1a91d0
simplify reading
thien Aug 9, 2024
ad1026d
make it causal and not learning the output representation
thien Aug 9, 2024
e984b07
find better learning rate
thien Aug 9, 2024
83c5a9a
scheduledoptim doesn't get accessed anyway; TODO implement through tr…
thien Aug 9, 2024
0e1c07e
find better learning rate for default model
thien Aug 9, 2024
6fbc527
split task name by model
thien Aug 9, 2024
ca897af
compile model if on cuda (doesn't work for mps)
thien Aug 9, 2024
925ce80
train for longer, reduce precision
thien Aug 9, 2024
476ccf0
change debug model training
thien Aug 9, 2024
c39faec
add import
thien Aug 9, 2024
55d10b0
remove specials for more accurate bleu scoring
thien Aug 9, 2024
635cd0a
various renaming
thien Aug 9, 2024
727ffa9
separate logging; change init for model training
thien Aug 9, 2024
69a520a
unify configs to make it easier to turn into defaults and argparse
thien Aug 9, 2024
1fef649
fix circular import
thien Aug 9, 2024
c47acf9
relative import
thien Aug 9, 2024
9140344
skip import that isn't used
thien Aug 9, 2024
9726997
save hyperparams to checkpoint
thien Aug 9, 2024
2a84efb
rm empty file
thien Aug 9, 2024
0aa1d99
rm commented out params
thien Aug 10, 2024
1a613f7
change defaults
thien Aug 10, 2024
c322a13
update tests
thien Aug 10, 2024
d5c047c
improve hashing
thien Aug 10, 2024
f38c83e
update comments, remove redundant files
thien Aug 10, 2024
64b4a31
restore logger; add preprocessing caching
thien Aug 11, 2024
9d96aa1
annotations; remove compile since the 2080ti doesn't support it
thien Aug 11, 2024
33cdd3d
add option to run eval every quarter of the epoch; save checkpoint
thien Aug 11, 2024
4a6686c
increase review_rate for prod
thien Aug 11, 2024
b47e3d2
add warmup steps; new optimizer
thien Aug 11, 2024
9670bfa
fix num_workers for prod
thien Aug 11, 2024
f0e1b9b
update filtering mechanism; add lid
thien Aug 16, 2024
8f09383
improve readability
thien Aug 16, 2024
4bf0726
clean up pretraining data filtering
thien Aug 16, 2024
2d0ca2e
refactor filter to split parallel and monolingual corpora
thien Aug 16, 2024
2943a50
cleanup filtering
thien Aug 16, 2024
1bdba31
deduplicate log logic
thien Aug 16, 2024
4185468
make verbose
thien Aug 16, 2024
6f5136b
update requirements.txt
thien Aug 16, 2024
cacc2ad
update gitignore to skip checkpoint files
thien Aug 17, 2024
ed8aae7
add sanity support for legacy transformer implementation
thien Aug 17, 2024
d2916cb
add support for (much larger) dataset, move to wmt15 for even evaluat…
thien Aug 17, 2024
b680212
change prod flag
thien Aug 17, 2024
f403f39
add wakatime (not sure if this is a good idea)
thien Aug 17, 2024
0d309a4
add lr as a param; fix seed
thien Aug 17, 2024
ee134de
add perplexity metric, histogram logging
thien Aug 18, 2024
7a8dca0
add learning rate monitor, gradient accumulation, gradient clipping
thien Aug 18, 2024
0b7d5cb
multithread map; fixed issue
thien Aug 18, 2024
42237ed
make logging a requirement if storing weight norm hists
thien Aug 18, 2024
a47f927
remove scheduler since we're using schedulefree
thien Aug 18, 2024
005b675
git commit -m "increase checkpoint, store all hyperparams"
thien Aug 18, 2024
1289c80
move teacher forcing location
thien Aug 18, 2024
4547728
remove redundant imports
thien Aug 18, 2024
0c05d46
drammatically change learning rate params
thien Aug 19, 2024
10dbc91
add loss to tqdm
thien Aug 19, 2024
e6fc2be
why didn't i add mixed precision training before
thien Aug 19, 2024
01bc823
fix logging diff
thien Aug 19, 2024
f045aa3
add test for sampling
thien Aug 21, 2024
9381981
fix word_tokenizer implementation to enforce batch
thien Aug 21, 2024
b476518
update sampling
thien Aug 22, 2024
498b7f3
move back to legacy scheduler
thien Aug 22, 2024
cf2729d
move to cosine
thien Aug 22, 2024
8a71cfe
make generator test easier
thien Aug 22, 2024
d897352
move to mixed for numerical stability
thien Aug 22, 2024
52186ac
split scheduler from model file
thien Aug 28, 2024
74e950c
add text batching function
thien Aug 30, 2024
8a77c91
update asset import
thien Aug 30, 2024
edadba4
add asset migration for cnn classifiers
thien Aug 30, 2024
63cbb19
update test sets for different classifiers
thien Aug 30, 2024
46f3f75
move nn module for cnn classifier
thien Aug 30, 2024
bf18373
add clarity for sampling
thien Aug 30, 2024
b91bd8b
add swiglu implementation
thien Sep 3, 2024
6429b61
initial commit for hf-based tokenizers
thien Sep 3, 2024
9eceb1b
remove redundant files
thien Sep 4, 2024
e72e5d1
rename hf vocabulary for consistency
thien Sep 4, 2024
a04a1a9
move classifier tests
thien Sep 4, 2024
2249570
initial commit for bert based classifier
thien Sep 4, 2024
f64ecaa
add hf tokenizer for test case
thien Sep 4, 2024
288dd70
update test case for peft
thien Sep 4, 2024
0cc18a6
add distilbert, checkpointing mechanism
thien Sep 4, 2024
3fe22a9
clean up trainer; add distilbert fine-tuning task for classifier trai…
thien Sep 5, 2024
a43e38f
update requirements.txt
thien Sep 5, 2024
6259d4d
update trainer config
thien Sep 5, 2024
72e848f
use constants instead of random strings to ensure consistency across tb
thien Sep 5, 2024
d150dd0
add publications dataset
thien Sep 5, 2024
d8b83ce
update beam search to work with torch transformer
thien Sep 10, 2024
80f8b82
add beam unit tests
thien Sep 10, 2024
6b40874
implement batch beam search; update test cases
thien Sep 11, 2024
f519540
Merge pull request #3 from thien/batch_beam
thien Sep 11, 2024
3be3e93
update test cases
thien Sep 11, 2024
12dab47
make get_logits optional argument incase we already compute it
thien Sep 11, 2024
46daf0a
simplify generate by calling it individually
thien Sep 11, 2024
4ee5465
refactor beam search implementation
thien Sep 11, 2024
6eed7f5
move legacy code
thien Sep 11, 2024
e91be4f
initial refactoring for classifier trainer
thien Sep 11, 2024
bb8e592
refactor paths
thien Sep 11, 2024
7852861
add test for nn.MultiHeadAttention equivalence to custom implementation
thien Sep 16, 2024
78b2ee7
add RoPE
thien Sep 16, 2024
6f44d1f
fix params; add test case for RotaryAttention
thien Sep 16, 2024
1c6e9a0
update test tokenizer file
thien Sep 16, 2024
0d806b5
add lora for marianmt
thien Sep 18, 2024
fa84015
add lr scheduler as optional
thien Sep 18, 2024
5d81b17
add style decoder tests; BERT
thien Sep 18, 2024
8804813
update models; test case
thien Sep 19, 2024
671bf47
update readme
thien Sep 25, 2024
278a85d
move trainer code
thien Sep 25, 2024
5007093
update gitignore
thien Sep 25, 2024
d1787af
update tests
thien Sep 25, 2024
0c171ea
adjust import due to trainer migration
thien Sep 25, 2024
964e076
adjust imports
thien Sep 25, 2024
393eeb2
remove redundant implementations for Beam; Optim; Translator
thien Sep 25, 2024
95ab429
add comment in attn
thien Sep 25, 2024
78821c0
fix test cases
thien Sep 25, 2024
b79848f
update type hinting
thien Sep 25, 2024
2e0106c
skip test since not using code
thien Sep 25, 2024
2e6fecb
mark test cases as either heavy/lightweight
thien Sep 25, 2024
1ff5558
add pyproject markers
thien Sep 25, 2024
460517f
add coverage run/report to pyproject
thien Sep 26, 2024
a7ca3c4
initial commit for CI tests
thien Sep 26, 2024
7f96ed4
fix yml
thien Sep 26, 2024
b2b74e9
update install
thien Sep 26, 2024
e430d93
use venv
thien Sep 26, 2024
a1bd033
remove editable
thien Sep 26, 2024
e144bbc
pin python version
thien Sep 26, 2024
fd5b652
test running pytest w/o uv
thien Sep 26, 2024
794a9c1
migrate to rye
thien Sep 26, 2024
ca1582b
modify test cases to enable github actions against rye
thien Sep 27, 2024
85ee9b2
restore tokenizer; modify test case based on tokenize training data
thien Sep 27, 2024
b6f5731
Merge pull request #5 from thien/rye
thien Sep 27, 2024
24 changes: 24 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,24 @@
name: Run Lightweight Tests

on: [push, pull_request]

jobs:
  lightweight-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    strategy:
      fail-fast: false
    steps:
      - uses: actions/checkout@v4
      - name: Install Rye
        uses: eifinger/setup-rye@v4
        with:
          version: 'latest'
      - name: Install Project
        run: rye sync
      - name: Download misc. Dependencies
        run: rye run python -m nltk.downloader cmudict averaged_perceptron_tagger
      - name: Run Lightweight Tests
        run: |
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          rye run tests-light
18 changes: 18 additions & 0 deletions .gitignore
@@ -24,4 +24,22 @@ datasets/machine_translation/global_voices/training/*
*.pyc
*.atok
base/models
base/results
hansard.36
.DS_Store
specialk/results
specialk/telegram.json
cache
*.ruff_cache
.vscode
.venv
.arrow
tensorboard_logs
/models/
.envrc
tb_logs
*.parquet
*.arrow
*.ts
*.ckpt
specialk/notebooks/lightning_logs/*
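To make the effect of the new ignore entries concrete, here is a minimal Python sketch that approximates how the slash-free patterns in this hunk match paths. It is an illustration only: it uses `fnmatch` from the standard library rather than Git's own matcher, it does not handle negation, anchored patterns such as `/models/`, or patterns containing a slash, and the example paths are hypothetical rather than taken from the repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A few slash-free patterns copied from the .gitignore hunk above.
PATTERNS = ["*.ckpt", "*.parquet", "*.arrow", ".venv", "tb_logs", "tensorboard_logs"]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Approximate check: a slash-free pattern may match any path component.

    This is NOT full gitignore semantics; it only covers the simple
    patterns listed in PATTERNS.
    """
    parts = PurePosixPath(path).parts
    return any(fnmatchcase(part, pattern) for part in parts for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the patterns.
    for candidate in ["checkpoints/last.ckpt", "tb_logs/run1/events", "specialk/models/rnn.py"]:
        print(candidate, is_ignored(candidate))
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which pattern, if any, matches a given path.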
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.9.7
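The pin above is usually consumed by tooling (Rye and pyenv both read `.python-version`), but a script can also check it directly. The sketch below is a hypothetical helper, not part of this pull request: `parse_pin` and `matches_pin` are illustrative names, and the pin string is passed in by hand rather than read from the file.

```python
import sys
from typing import Sequence, Tuple


def parse_pin(text: str) -> Tuple[int, ...]:
    """Turn a pin such as '3.9.7' into a tuple of ints, e.g. (3, 9, 7)."""
    return tuple(int(part) for part in text.strip().split("."))


def matches_pin(pin: str, version: Sequence[int] = sys.version_info) -> bool:
    """Return True if the leading components of `version` equal the pin."""
    wanted = parse_pin(pin)
    return tuple(version[: len(wanted)]) == wanted


if __name__ == "__main__":
    pin = "3.9.7"  # contents of .python-version in this diff
    running = sys.version.split()[0]
    print(f"running {running}, pinned {pin}, match={matches_pin(pin)}")
```

Comparing only as many components as the pin provides means a shorter pin such as `3.9` would accept any 3.9.x interpreter, which is a design choice of this sketch rather than documented behaviour of any tool.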
1 change: 1 addition & 0 deletions assets/tokenizer/de_small_word_moses

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions assets/tokenizer/en_small_word_moses

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions assets/tokenizer/fr_en_bpe

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions assets/tokenizer/fr_en_word_moses

Large diffs are not rendered by default.

Binary file added assets/tokenizer/sentencepiece/enfr.model
Binary file not shown.