Support encoder only model #41
Open
Resolves #10
This pull request adds comprehensive support for encoder-only (BERT-style) models to the F2LLM embedding training pipeline, alongside the existing decoder-only (LLM) support. The changes include new configuration options, updated tokenization and pooling logic, improved documentation, and a new smoke test script to ensure correct behavior for both encoder and decoder models.
Encoder Model Support and Pooling:
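For context, the three encoder pooling strategies named below, plus last-token pooling for decoders, can be sketched roughly as follows (the function name `pool` and its signature are illustrative, not the PR's actual API; it assumes standard Hugging Face-style `last_hidden_state`/`attention_mask` tensors):

```python
import torch

def pool(last_hidden_state, attention_mask, strategy="mean"):
    """Collapse per-token embeddings into one sentence embedding.

    strategy: "cls", "mean", or "cls_mean" for encoder models,
    "last" for decoder-only models (names assumed for illustration).
    """
    mask = attention_mask.unsqueeze(-1).float()          # (B, T, 1)
    if strategy == "cls":
        return last_hidden_state[:, 0]                   # first ([CLS]) token
    if strategy == "mean":
        summed = (last_hidden_state * mask).sum(dim=1)   # masked sum over tokens
        return summed / mask.sum(dim=1).clamp(min=1e-9)  # divide by true lengths
    if strategy == "cls_mean":
        return (pool(last_hidden_state, attention_mask, "cls")
                + pool(last_hidden_state, attention_mask, "mean")) / 2
    if strategy == "last":
        # index of each sequence's final non-padding token
        idx = attention_mask.sum(dim=1) - 1
        return last_hidden_state[torch.arange(last_hidden_state.size(0)), idx]
    raise ValueError(f"unknown pooling strategy: {strategy}")
```

The masking step matters for `mean`: padding positions must be zeroed out and the divisor must be the real sequence length, not the padded length.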
- Added a `model_arch` option to the config and command-line arguments. Pooling strategies for encoders (`cls`, `mean`, `cls_mean`) are now supported, with decoder models continuing to use last-token pooling.
- Added an example encoder config (`config_bert.json`) and updated both `README.md` files to document encoder support and usage.

Tokenization and Data Processing:
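Auto-detecting the model type from its Hugging Face config might look like the sketch below (the helper name `detect_arch` and the `force` parameter are assumptions standing in for the PR's force-architecture option):

```python
from types import SimpleNamespace

def detect_arch(config, force=None):
    """Guess whether a checkpoint is encoder-only or decoder-only from its
    Hugging Face config object; `force` ("encoder"/"decoder") overrides
    detection. Heuristic sketch only, not the PR's exact logic."""
    if force is not None:
        return force
    # Causal-LM architecture names or is_decoder=True indicate a decoder.
    archs = getattr(config, "architectures", None) or []
    if any("CausalLM" in a for a in archs) or getattr(config, "is_decoder", False):
        return "decoder"
    return "encoder"

# Example with stand-in config objects:
bert_cfg = SimpleNamespace(architectures=["BertForMaskedLM"], is_decoder=False)
llama_cfg = SimpleNamespace(architectures=["LlamaForCausalLM"])
```

In practice the detected architecture would then drive the special-token choices, e.g. appending an EOS token for decoders but relying on `[CLS]`/`[SEP]` insertion for encoders.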
- Added `tokenize_data_general.py`, which handles tokenization for both encoder and decoder models, with options to force the architecture and control EOS-token appending. Tokenization logic now auto-detects the model type and applies the appropriate special tokens and sequence-length constraints. (`F2LLM/tokenize_data_general.py`)

Infrastructure and Compatibility:
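A common shape for the pad-token fallback is to reuse the EOS token, since many decoder-only checkpoints ship without a pad token and batched padding fails otherwise (the helper name `ensure_pad_token` is illustrative; the exact precedence in `run.py` is an assumption):

```python
def ensure_pad_token(tokenizer):
    """Give the tokenizer a pad token if it lacks one.

    Sketch of a typical fallback: reuse EOS when available, otherwise
    register a new pad token (placeholder string assumed).
    """
    if tokenizer.pad_token is None:
        if tokenizer.eos_token is not None:
            tokenizer.pad_token = tokenizer.eos_token
        else:
            tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
    return tokenizer
```

Reusing EOS keeps the vocabulary unchanged, whereas adding a new token would require resizing the model's embedding matrix.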
- `flash-attn` is now conditionally installed only on Linux/x86_64, and additional dependencies (`scikit-learn`, `numpy`, `pandas`, `pytest`) are listed. (`F2LLM/requirements.txt`)
- A fallback `pad_token` is set when the tokenizer does not provide one, improving compatibility with various Hugging Face models. (`F2LLM/run.py`)

Testing and Validation:
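The kind of check such a smoke suite performs can be illustrated with one example: mean pooling must ignore padding positions entirely (this test is a sketch in the spirit of `smoke_encoder_decoder.py`, not a copy of the PR's actual tests):

```python
import torch

def test_mean_pooling_ignores_padding():
    """Padding embeddings (here an extreme 99.0) must not leak into the
    pooled vector when the attention mask marks them as padding."""
    hidden = torch.tensor([[[2.0, 4.0], [6.0, 8.0], [99.0, 99.0]]])
    mask = torch.tensor([[1, 1, 0]])
    m = mask.unsqueeze(-1).float()
    pooled = (hidden * m).sum(dim=1) / m.sum(dim=1)
    # Mean over the two real tokens only: [(2+6)/2, (4+8)/2]
    assert torch.allclose(pooled, torch.tensor([[4.0, 6.0]]))
```

Deliberately extreme padding values make a masking bug fail loudly rather than shift results by a rounding error.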
- Added `smoke_encoder_decoder.py`, a lightweight test suite that verifies encoder/decoder pooling and tokenization behavior, ensuring robustness across architectures. (`F2LLM/smoke_encoder_decoder.py`)

These changes make F2LLM more flexible and robust for embedding tasks using both encoder and decoder architectures, with clear configuration, improved data handling, and strong test coverage.