bbkx226 commented Dec 14, 2025

Resolves #10 ([Codefuse open-source training camp] Support for Encoder-only models)

This pull request adds comprehensive support for encoder-only (BERT-style) models to the F2LLM embedding training pipeline, alongside the existing decoder-only (LLM) support. The changes include new configuration options, updated tokenization and pooling logic, improved documentation, and a new smoke test script to ensure correct behavior for both encoder and decoder models.

Encoder Model Support and Pooling:

  • Added detection and explicit configuration for encoder-only models (e.g., BERT, RoBERTa) via model_arch in the config and command-line arguments. Pooling strategies for encoders (cls, mean, cls_mean) are now supported, while decoder models continue to use last-token pooling (a sketch of these strategies follows this list).
  • Introduced a new example config file for encoder models (config_bert.json) and updated the documentation in both README.md files to cover encoder support and usage instructions (an illustrative config sketch appears after the change list below).
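
A minimal sketch of how these pooling strategies could be implemented, assuming hidden states of shape (batch, seq_len, dim) and a binary attention_mask with right padding; the function name pool_embeddings and the simple averaging used for cls_mean are illustrative assumptions, not necessarily the exact code in model.py:

```python
import torch

def pool_embeddings(hidden_states, attention_mask, model_arch="decoder", pooling="mean"):
    """Collapse (batch, seq_len, dim) token states into one embedding per sequence."""
    if model_arch == "decoder":
        # Last-token pooling: pick the final non-padding position of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1                  # (batch,)
        return hidden_states[torch.arange(hidden_states.size(0)), last_idx]

    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)   # (batch, seq_len, 1)
    cls_vec = hidden_states[:, 0]                                 # [CLS] sits at position 0
    mean_vec = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    if pooling == "cls":
        return cls_vec
    if pooling == "mean":
        return mean_vec
    if pooling == "cls_mean":
        return (cls_vec + mean_vec) / 2
    raise ValueError(f"Unknown pooling strategy: {pooling}")
```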

Tokenization and Data Processing:

  • Added tokenize_data_general.py, which handles tokenization for both encoder and decoder models, with options to force the architecture and control EOS-token appending. Tokenization logic now auto-detects the model type and applies the appropriate special tokens and sequence-length constraints (F2LLM/tokenize_data_general.py; a sketch of the branching logic follows).
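
A rough sketch of the encoder/decoder branching, assuming a Hugging Face tokenizer; tokenize_example and its parameter names are hypothetical and stand in for the actual interface of tokenize_data_general.py:

```python
from transformers import AutoTokenizer

def tokenize_example(tokenizer, text, model_arch, max_len=512, append_eos=True):
    """Tokenize one text with architecture-appropriate special tokens and truncation."""
    if model_arch == "encoder":
        # BERT-style tokenizers insert [CLS] ... [SEP] on their own.
        return tokenizer(text, truncation=True, max_length=max_len)["input_ids"]
    # Decoder path: reserve one slot for EOS, which later serves as the pooled token.
    budget = max_len - 1 if append_eos else max_len
    ids = tokenizer(text, truncation=True, max_length=budget, add_special_tokens=False)["input_ids"]
    if append_eos and tokenizer.eos_token_id is not None:
        ids.append(tokenizer.eos_token_id)
    return ids

# Example usage (model name chosen for illustration):
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenize_example(tok, "hello world", model_arch="encoder")
```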

Infrastructure and Compatibility:

  • Improved requirements and installation: flash-attn is now conditionally installed only on Linux/x86_64, and additional dependencies (scikit-learn, numpy, pandas, pytest) are listed (F2LLM/requirements.txt; see the marker example below).
  • Tokenizer initialization in training now ensures a valid pad_token is set, improving compatibility with various Hugging Face models (F2LLM/run.py; a sketch of the fallback follows this list).
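
The conditional flash-attn install can be expressed with a standard PEP 508 environment marker in requirements.txt; a line along these lines (version pin omitted) would do it:

```
flash-attn; platform_system == "Linux" and platform_machine == "x86_64"
```

And a minimal sketch of the pad_token fallback, assuming a Hugging Face tokenizer; the fallback order (EOS first, then an explicit [PAD]) is an assumption about what run.py does:

```python
from transformers import AutoTokenizer

model_name_or_path = "Qwen/Qwen2-0.5B"  # hypothetical example model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token is None:
    if tokenizer.eos_token is not None:
        # Common for LLM tokenizers (e.g. Llama, Qwen): reuse EOS as PAD.
        tokenizer.pad_token = tokenizer.eos_token
    else:
        # Last resort: register an explicit [PAD] token.
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
```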

Testing and Validation:

  • Added smoke_encoder_decoder.py, a lightweight test suite that verifies encoder/decoder pooling and tokenization behavior, ensuring robustness across architectures (F2LLM/smoke_encoder_decoder.py; an example test is sketched below).
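
One illustrative test from such a suite, reusing the pool_embeddings sketch above (the test name and tolerance are hypothetical, not taken from the PR):

```python
import torch

def test_encoder_mean_pooling_ignores_padding():
    hidden = torch.randn(2, 4, 8)
    mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
    out = pool_embeddings(hidden, mask, model_arch="encoder", pooling="mean")
    # Mean over the 3 real tokens of sequence 0; padding must not contribute.
    expected = hidden[0, :3].mean(dim=0)
    assert torch.allclose(out[0], expected, atol=1e-6)
```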

These changes make F2LLM more flexible and robust for embedding tasks using both encoder and decoder architectures, with clear configuration, improved data handling, and smoke-test coverage for both paths.

- Updated README.md to include instructions for encoder-only model configuration.
- Enhanced arguments.py to define model architecture and pooling strategy.
- Created config_bert.json for encoder model training parameters (an illustrative sketch follows this list).
- Modified model.py to handle encoder-only architecture and pooling options.
- Added smoke tests for encoder/decoder pooling and tokenizer behaviors.
- Implemented tokenize_data_general.py for flexible tokenization based on model type.
- Updated requirements.txt with the new dependencies and the platform-conditional flash-attn install.
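
For reference, an encoder training config might look roughly like this; every key and value here is an illustrative assumption rather than the actual contents of config_bert.json:

```json
{
  "model_name_or_path": "bert-base-uncased",
  "model_arch": "encoder",
  "pooling": "cls_mean",
  "max_seq_length": 512,
  "learning_rate": 2e-5,
  "per_device_train_batch_size": 32
}
```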