- Updated `README.md` to include instructions for encoder-only model configuration.
- Enhanced `arguments.py` to define model architecture and pooling strategy.
- Created `config_bert.json` for encoder model training parameters.
- Modified `model.py` to handle encoder-only architecture and pooling options.
- Added smoke tests for encoder/decoder pooling and tokenizer behaviors.
- Implemented `tokenize_data_general.py` for flexible tokenization based on model type.
- Updated `requirements.txt` to include necessary dependencies.
Resolves #10
This pull request adds comprehensive support for encoder-only (BERT-style) models to the F2LLM embedding training pipeline, alongside the existing decoder-only (LLM) support. The changes include new configuration options, updated tokenization and pooling logic, improved documentation, and a new smoke test script to ensure correct behavior for both encoder and decoder models.
Encoder Model Support and Pooling:
- Introduced a `model_arch` setting in the config and command-line arguments to select the architecture. Pooling strategies for encoders (`cls`, `mean`, `cls_mean`) are now supported, with decoder models continuing to use last-token pooling; the pooling behavior is sketched below.
- Added an encoder training configuration (`config_bert.json`) and updated both `README.md` files to reflect encoder support and usage instructions.
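For context, the pooling strategies reduce token-level hidden states to a single embedding roughly as follows. This is a minimal sketch, not the actual `model.py` code; in particular, treating `cls_mean` as the average of the `cls` and `mean` results is an assumption about the PR's semantics.

```python
import torch

def pool(last_hidden_state: torch.Tensor,
         attention_mask: torch.Tensor,
         strategy: str) -> torch.Tensor:
    """Reduce (batch, seq, dim) hidden states to (batch, dim) embeddings."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    if strategy == "cls":            # first token (BERT-style [CLS])
        return last_hidden_state[:, 0]
    if strategy == "mean":           # mask-aware mean over real tokens only
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts
    if strategy == "cls_mean":       # assumed here: average of cls and mean
        return (pool(last_hidden_state, attention_mask, "cls")
                + pool(last_hidden_state, attention_mask, "mean")) / 2
    if strategy == "last_token":     # decoder-only: last non-padded position
        lengths = attention_mask.sum(dim=1).long() - 1
        batch = torch.arange(last_hidden_state.size(0))
        return last_hidden_state[batch, lengths]
    raise ValueError(f"unknown pooling strategy: {strategy}")
```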
Tokenization and Data Processing:
- Introduced `tokenize_data_general.py`, which handles tokenization for both encoder and decoder models, with options to force the architecture and control EOS-token appending. The tokenization logic now auto-detects the model type and applies the appropriate special tokens and sequence-length constraints (`F2LLM/tokenize_data_general.py` R1-R91); a sketch of this detection follows below.
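A rough sketch of that auto-detection and EOS handling, under assumed names (`tokenize_example`, `force_arch`, `append_eos` are illustrative, not necessarily the identifiers in `tokenize_data_general.py`):

```python
from transformers import AutoConfig, AutoTokenizer

def tokenize_example(text, model_name_or_path, force_arch=None,
                     append_eos=True, max_length=512):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    config = AutoConfig.from_pretrained(model_name_or_path)

    # Auto-detect the architecture unless the caller forces one.
    arch = force_arch
    if arch is None:
        archs = config.architectures or []
        arch = "decoder" if any("CausalLM" in a for a in archs) else "encoder"

    if arch == "encoder":
        # BERT-style tokenizers insert [CLS]/[SEP] themselves.
        return tokenizer(text, truncation=True, max_length=max_length)["input_ids"]

    # Decoder path: no special tokens; reserve room for an optional EOS,
    # which last-token pooling relies on.
    ids = tokenizer(text, add_special_tokens=False, truncation=True,
                    max_length=max_length - int(append_eos))["input_ids"]
    if append_eos and tokenizer.eos_token_id is not None:
        ids.append(tokenizer.eos_token_id)
    return ids
```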
Infrastructure and Compatibility:
- `flash-attn` is now conditionally installed only on Linux/x86_64, and additional dependencies (`scikit-learn`, `numpy`, `pandas`, `pytest`) are listed (`F2LLM/requirements.txt` L4-R12).
- A fallback ensures a `pad_token` is set, improving compatibility with various Hugging Face models (`F2LLM/run.py` R85-R92); see the sketch after this list.
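The conditional install is typically expressed with a PEP 508 environment marker in `requirements.txt`, e.g. `flash-attn; platform_system == "Linux" and platform_machine == "x86_64"` (the exact marker used in the PR is an assumption). The `pad_token` fallback usually looks something like this sketch (the model name is only for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical model choice

# Decoder-only tokenizers often ship without a pad token; fall back to EOS
# (or register a new [PAD] token) so batched padding works.
if tokenizer.pad_token is None:
    if tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token
    else:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
```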
Testing and Validation:
- Added `smoke_encoder_decoder.py`, a lightweight test suite to verify encoder/decoder pooling and tokenization behaviors, ensuring robustness across architectures (`F2LLM/smoke_encoder_decoder.py` R1-R135). An illustrative example of the kind of check it performs appears at the end of this description.

These changes make F2LLM more flexible and robust for embedding tasks using both encoder and decoder architectures, with clear configuration, improved data handling, and strong test coverage.
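For illustration, a minimal sketch of the kind of assertion such a smoke test makes (not the actual test file; `mean_pool` and the test name are assumptions):

```python
import torch

def mean_pool(hidden, mask):
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)

def test_mean_pooling_ignores_padding():
    hidden = torch.randn(2, 4, 8)                 # (batch, seq, dim)
    mask = torch.tensor([[1, 1, 1, 0],
                         [1, 1, 0, 0]])
    out = mean_pool(hidden, mask)
    # Only the unmasked positions should contribute to the embedding.
    expected = hidden[0, :3].mean(dim=0)
    assert torch.allclose(out[0], expected, atol=1e-5)
```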