bbkx226 commented Dec 14, 2025

Resolves #10 ([Codefuse open-source training camp] Support for Encoder-only models)

This pull request adds comprehensive support for encoder-only (BERT-style) models to the F2LLM embedding training pipeline, alongside the existing decoder-only (LLM) support. The changes include new configuration options, updated tokenization and pooling logic, improved documentation, and a new smoke test script to ensure correct behavior for both encoder and decoder models.

Encoder Model Support and Pooling:

  • Added detection and explicit configuration for encoder-only models (e.g., BERT, RoBERTa) via model_arch in the config and command-line arguments. Pooling strategies for encoders (cls, mean, cls_mean) are now supported, while decoder models continue to use last-token pooling (a sketch of these strategies follows this list).
  • Introduced a new example config file for encoder models (config_bert.json) and updated the documentation in both README.md files to cover encoder support and usage instructions (an illustrative config sketch appears after the change list below).
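
A minimal sketch of how these pooling strategies could be implemented, assuming hidden states of shape (batch, seq_len, dim) and a binary attention_mask with right padding; the function name pool_embeddings and the simple averaging used for cls_mean are illustrative assumptions, not necessarily the exact code in model.py:

```python
import torch

def pool_embeddings(hidden_states, attention_mask, model_arch="decoder", pooling="mean"):
    """Collapse (batch, seq_len, dim) token states into one embedding per sequence."""
    if model_arch == "decoder":
        # Last-token pooling: pick the final non-padding position of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1                  # (batch,)
        return hidden_states[torch.arange(hidden_states.size(0)), last_idx]

    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)   # (batch, seq_len, 1)
    cls_vec = hidden_states[:, 0]                                 # [CLS] sits at position 0
    mean_vec = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    if pooling == "cls":
        return cls_vec
    if pooling == "mean":
        return mean_vec
    if pooling == "cls_mean":
        return (cls_vec + mean_vec) / 2
    raise ValueError(f"Unknown pooling strategy: {pooling}")
```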

Tokenization and Data Processing:

  • Added tokenize_data_general.py, which handles tokenization for both encoder and decoder models, with options to force the architecture and control EOS-token appending. Tokenization logic now auto-detects the model type and applies the appropriate special tokens and sequence-length constraints (F2LLM/tokenize_data_general.py; a sketch of the branching logic follows).
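
A rough sketch of the encoder/decoder branching, assuming a Hugging Face tokenizer; tokenize_example and its parameter names are hypothetical and stand in for the actual interface of tokenize_data_general.py:

```python
from transformers import AutoTokenizer

def tokenize_example(tokenizer, text, model_arch, max_len=512, append_eos=True):
    """Tokenize one text with architecture-appropriate special tokens and truncation."""
    if model_arch == "encoder":
        # BERT-style tokenizers insert [CLS] ... [SEP] on their own.
        return tokenizer(text, truncation=True, max_length=max_len)["input_ids"]
    # Decoder path: reserve one slot for EOS, which later serves as the pooled token.
    budget = max_len - 1 if append_eos else max_len
    ids = tokenizer(text, truncation=True, max_length=budget, add_special_tokens=False)["input_ids"]
    if append_eos and tokenizer.eos_token_id is not None:
        ids.append(tokenizer.eos_token_id)
    return ids

# Example usage (model name chosen for illustration):
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenize_example(tok, "hello world", model_arch="encoder")
```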

Infrastructure and Compatibility:

  • Improved requirements and installation: flash-attn is now conditionally installed only on Linux/x86_64, and additional dependencies (scikit-learn, numpy, pandas, pytest) are listed (F2LLM/requirements.txt; see the marker example below).
  • Tokenizer initialization in training now ensures a valid pad_token is set, improving compatibility with various Hugging Face models (F2LLM/run.py; a sketch of the fallback follows this list).
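
The conditional flash-attn install can be expressed with a standard PEP 508 environment marker in requirements.txt; a line along these lines (version pin omitted) would do it:

```
flash-attn; platform_system == "Linux" and platform_machine == "x86_64"
```

And a minimal sketch of the pad_token fallback, assuming a Hugging Face tokenizer; the fallback order (EOS first, then an explicit [PAD]) is an assumption about what run.py does:

```python
from transformers import AutoTokenizer

model_name_or_path = "Qwen/Qwen2-0.5B"  # hypothetical example model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token is None:
    if tokenizer.eos_token is not None:
        # Common for LLM tokenizers (e.g. Llama, Qwen): reuse EOS as PAD.
        tokenizer.pad_token = tokenizer.eos_token
    else:
        # Last resort: register an explicit [PAD] token.
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
```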

Testing and Validation:

  • Added smoke_encoder_decoder.py, a lightweight test suite that verifies encoder/decoder pooling and tokenization behavior, ensuring robustness across architectures (F2LLM/smoke_encoder_decoder.py; an example test is sketched below).
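
One illustrative test from such a suite, reusing the pool_embeddings sketch above (the test name and tolerance are hypothetical, not taken from the PR):

```python
import torch

def test_encoder_mean_pooling_ignores_padding():
    hidden = torch.randn(2, 4, 8)
    mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
    out = pool_embeddings(hidden, mask, model_arch="encoder", pooling="mean")
    # Mean over the 3 real tokens of sequence 0; padding must not contribute.
    expected = hidden[0, :3].mean(dim=0)
    assert torch.allclose(out[0], expected, atol=1e-6)
```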

These changes make F2LLM more flexible and robust for embedding tasks using both encoder and decoder architectures, with clear configuration, improved data handling, and smoke-test coverage for both paths.

- Updated README.md to include instructions for encoder-only model configuration.
- Enhanced arguments.py to define model architecture and pooling strategy.
- Created config_bert.json for encoder model training parameters (an illustrative sketch follows this list).
- Modified model.py to handle encoder-only architecture and pooling options.
- Added smoke tests for encoder/decoder pooling and tokenizer behaviors.
- Implemented tokenize_data_general.py for flexible tokenization based on model type.
- Updated requirements.txt with the new dependencies and the platform-conditional flash-attn install.
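
For reference, an encoder training config might look roughly like this; every key and value here is an illustrative assumption rather than the actual contents of config_bert.json:

```json
{
  "model_name_or_path": "bert-base-uncased",
  "model_arch": "encoder",
  "pooling": "cls_mean",
  "max_seq_length": 512,
  "learning_rate": 2e-5,
  "per_device_train_batch_size": 32
}
```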