C and C++ bindings to Tokenizers #1888
ArthurZucker
left a comment
Very open to this! Let's make sure the functions we bind closely match what users expect.
CPP bindings coverage improvement
fixed benchmarks
- Introduced a new submodule for Jinja2Cpp to handle chat template rendering.
- Enhanced the C++ bindings to load and apply chat templates from a configuration file.
- Added methods to retrieve special tokens and their IDs from the tokenizer configuration.
- Updated the CMake configuration to include Jinja2Cpp and link it with the tokenizers_cpp library.
- Refactored tests to validate the new chat template functionality and special token handling.
Support chat template in c++ api
Hi @ArthurZucker, I've made some progress on the C++ side and added more APIs. I tried integrating https://github.com/jinja2cpp/Jinja2Cpp/ at the C++ bindings layer, but some features are limited and lead to crashes (IIRC, negative indexing). Any tips or recommendations for handling templates natively would be great.
add chat template jinja rendering with minijinja
After digging through the https://github.com/huggingface/text-generation-inference codebase, I found that it uses minijinja in Rust to render chat templates (with some workarounds for unsupported features). So I've removed Jinja rendering from the C++ bindings layer and moved it into the tokenizer core in Rust, using minijinja. This way, all bindings can access the functionality. Disclaimer: I'm not proficient in Rust, and most of the code was written by AI agents (though I've tried to supervise them closely). Based on my testing, everything seems to work (at least for my use case, and its tests pass).
I'll drop this mostly as an FYI: check https://github.com/mlc-ai/tokenizers-cpp. It is an existing set of C++ bindings for Tokenizers that goes through the same Rust -> C -> C++ path. The C++ code also binds more than tokenizers because it includes sentencepiece, but that can be cut out. I think if you want more battle-tested code, fork tokenizers-cpp/rust into your current Once that is done, fork tokenizers_cpp.h and huggingface_tokenizer.cc into The LICENSE shouldn't be an issue. And I think https://chat.webllm.ai/ has a live deployment of the tokenizer with WASM.
Hi! Is there any plan to merge this PR? I’m currently looking for tokenizers implemented in C.
Actually yeah, but I need to take over a little and check a few things first. It's planned for sure.
Skip unknown fields in deserialization for experimental wrappers.
ArthurZucker
left a comment
Late on the review! I can confirm we plan on maintaining more bindings, but with a much stricter, explicit list of what we maintain.
For chat templates it's a bit too early, and if it makes it into tokenizer.json we'll have a bigger update. It should be separated from this PR entirely! It's also lower priority, but we want to have an easy way to use it with minijinja, which is a great lib.
- **Rust** (baseline): 100%
- **C Bindings**: ~100% (essentially identical to Rust)
- **C++ Bindings**: 97.6% (only 2.4% slower)
- **Python**: 66.5% (33.5% slower)
cc @McPatate there's stuff we can do to improve this, I'm sure 👁️ 👁️
/// Tokenizer configuration loaded from tokenizer_config.json
/// Contains authoritative special token definitions and chat template
#[derive(Default, Clone)]
struct TokenizerConfig {
    bos_token: Option<String>,
    eos_token: Option<String>,
    pad_token: Option<String>,
    unk_token: Option<String>,
    chat_template: Option<String>,
    add_bos_token: bool,
    add_eos_token: bool,
}
We partially have a PR for #1942. Also, this is not in the core tokenizers, so we should not add it!
chat_template: potentially we'll add support for it. add_bos_token: never, as tokenizer.json already has that info.
}

impl CTokenizer {
    fn new_from_file(path: &str, config_path: Option<&str>) -> Option<Self> {
The C binding will be at worst more minimal than the Python one, and at best the same. So we should not have anything extra like TokenizerConfig.
///
/// Returns: rendered template string (caller must free with tokenizers_string_free), or null on error
#[no_mangle]
pub extern "C" fn tokenizers_apply_chat_template(
For now this should never go here unless it's in the Rust core!
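For readers unfamiliar with the ownership contract in the doc comment of that hunk (the caller must free the returned string, and null signals an error), here is a minimal sketch of the general pattern in Rust. The names `example_render` and `example_string_free` are hypothetical and are not taken from this PR; the `#[no_mangle]` attribute is left off so the sketch compiles in any edition, whereas a real export would carry it as in the hunk.

```rust
use std::ffi::{c_char, CStr, CString};
use std::ptr;

/// Hypothetical stand-in for a C-facing function that returns an owned string.
/// Returns null on error (null input, invalid UTF-8, or an interior NUL byte).
/// A real export would also be marked `#[no_mangle]` so C can link to it by name.
pub extern "C" fn example_render(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return ptr::null_mut();
    }
    // SAFETY: the caller promises `input` points to a valid NUL-terminated string.
    let text = match unsafe { CStr::from_ptr(input) }.to_str() {
        Ok(s) => s,
        Err(_) => return ptr::null_mut(),
    };
    match CString::new(format!("<rendered>{}</rendered>", text)) {
        // `into_raw` hands ownership of the allocation to the caller.
        Ok(owned) => owned.into_raw(),
        Err(_) => ptr::null_mut(),
    }
}

/// Hypothetical counterpart of a string-free function: reclaims a string
/// previously returned by `example_render`. Passing null is a no-op.
pub extern "C" fn example_string_free(s: *mut c_char) {
    if !s.is_null() {
        // SAFETY: `s` came from `CString::into_raw` and has not been freed yet.
        drop(unsafe { CString::from_raw(s) });
    }
}
```

The key design point is that memory allocated by Rust is always released by Rust: the C or C++ caller never calls `free` on the pointer directly, it passes it back to the matching free function exactly once.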
I like the idea of having examples for each binding!
let's split this in another PR! 🤗
not 100% sure this makes sense to have in tokenizers
Adding in bindings for two more languages!
- bindings/cpp
- bindings/c

C is an intermediate step to bind C++ and Rust: i.e., C++ <--> C <--> Rust.
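As a rough illustration of that layering, the Rust side of a C++ <--> C <--> Rust bridge is commonly written as an opaque handle plus plain `extern "C"` functions. The sketch below is not the code from this PR: `ToyTokenizer` and the `toy_tokenizer_*` functions are hypothetical names, the "tokenization" is a simple split on a separator, and `#[no_mangle]` is omitted so the block compiles in any edition.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical Rust type standing in for the real tokenizer.
/// C and C++ only ever see a pointer to it, never its layout.
pub struct ToyTokenizer {
    separator: char,
}

impl ToyTokenizer {
    /// Counts non-empty pieces after splitting on the separator.
    fn token_count(&self, text: &str) -> usize {
        text.split(self.separator).filter(|p| !p.is_empty()).count()
    }
}

/// C layer: allocate a tokenizer and return an opaque handle.
pub extern "C" fn toy_tokenizer_new() -> *mut ToyTokenizer {
    Box::into_raw(Box::new(ToyTokenizer { separator: ' ' }))
}

/// C layer: count tokens in a NUL-terminated UTF-8 string.
/// Returns -1 if either pointer is null or the text is not valid UTF-8.
pub extern "C" fn toy_tokenizer_count(
    handle: *const ToyTokenizer,
    text: *const c_char,
) -> isize {
    if handle.is_null() || text.is_null() {
        return -1;
    }
    // SAFETY: `handle` came from `toy_tokenizer_new` and is still alive;
    // `text` is a valid NUL-terminated string per the caller's contract.
    let (tok, text) = unsafe { (&*handle, CStr::from_ptr(text)) };
    match text.to_str() {
        Ok(s) => tok.token_count(s) as isize,
        Err(_) => -1,
    }
}

/// C layer: destroy a handle created by `toy_tokenizer_new`. Null is a no-op.
pub extern "C" fn toy_tokenizer_free(handle: *mut ToyTokenizer) {
    if !handle.is_null() {
        // SAFETY: the handle was produced by `Box::into_raw` and is freed once.
        drop(unsafe { Box::from_raw(handle) });
    }
}
```

On the C++ side, a thin wrapper class would typically hold the handle and call the free function in its destructor, so C++ users get RAII while the ABI stays plain C.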