feat: add cli for tokenizer and training #1842
ArthurZucker
left a comment
Nice! We probably need to update the CI release build to make sure it exposes the binary as well.
    vocab_size,
    output,
} => {
    use tokenizers::models::bpe::{BpeTrainer, BPE};
We probably want to support all model flavors here, no? Unigram, etc.
Not sure what you mean, could you please explain more?
If it's that small and adds new dependencies, I feel like it should be its own crate. tokenizers is a library; it shouldn't be a CLI as well.
This is a valid point... I am not sure if we can define
We can do a new crate just for the CLI; that makes more sense. If this is highly requested, happy to do it!
This PR adds a CLI with two subcommands:
Tokenize: tokenizes a text using a given model
Train: trains a new tokenizer model
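As a rough illustration of the two-subcommand layout described above, here is a minimal, standard-library-only sketch. It is not the PR's actual implementation: the `Command` enum, the `parse_args` and `flag_value` helpers, the flag names (`--model`, `--text`, `--files`, `--vocab-size`, `--output`), and the fallback vocabulary size are all assumptions made for the example. Only the `vocab_size` and `output` field names come from the diff excerpt in the review.

```rust
use std::path::PathBuf;

/// Hypothetical subcommands mirroring the two described in the PR.
#[derive(Debug, PartialEq)]
enum Command {
    /// Tokenize `text` with the tokenizer stored at `model`.
    Tokenize { model: PathBuf, text: String },
    /// Train a new tokenizer on `files` and write it to `output`.
    Train { files: Vec<PathBuf>, vocab_size: usize, output: PathBuf },
}

/// Return the value that follows `flag`, e.g. `--output out.json`.
fn flag_value<'a>(args: &[&'a str], flag: &str) -> Option<&'a str> {
    args.iter()
        .position(|a| *a == flag)
        .and_then(|i| args.get(i + 1).copied())
}

/// Parse the arguments that follow the program name.
fn parse_args(args: &[&str]) -> Result<Command, String> {
    let (sub, rest) = args.split_first().ok_or("missing subcommand")?;
    match *sub {
        "tokenize" => {
            let model = flag_value(rest, "--model").ok_or("tokenize: --model is required")?;
            let text = flag_value(rest, "--text").ok_or("tokenize: --text is required")?;
            Ok(Command::Tokenize {
                model: PathBuf::from(model),
                text: text.to_string(),
            })
        }
        "train" => {
            // Assumed convention: training files are passed comma-separated.
            let files: Vec<PathBuf> = flag_value(rest, "--files")
                .ok_or("train: --files is required")?
                .split(',')
                .map(PathBuf::from)
                .collect();
            let output = flag_value(rest, "--output").ok_or("train: --output is required")?;
            let vocab_size = match flag_value(rest, "--vocab-size") {
                Some(v) => v
                    .parse::<usize>()
                    .map_err(|e| format!("train: invalid --vocab-size: {e}"))?,
                None => 30_000, // assumed fallback, chosen only for this sketch
            };
            Ok(Command::Train {
                files,
                vocab_size,
                output: PathBuf::from(output),
            })
        }
        other => Err(format!("unknown subcommand: {other}")),
    }
}
```

With this sketch, `parse_args(&["train", "--files", "a.txt,b.txt", "--output", "out.json"])` yields a `Command::Train` value that a dispatcher could hand to the training code. Keeping the parsing in a plain function like this also makes it easy to move into a separate CLI crate, as suggested in the review.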