
C and C++ bindings to Tokenizers#1888

Open
thammegowda wants to merge 19 commits into huggingface:main from thammegowda:tg/cpp


@thammegowda thammegowda commented Nov 21, 2025

Adding in bindings for two more languages!

  • bindings/cpp
  • bindings/c

C is an intermediate step to bind C++ and Rust: i.e., C++ <--> C <--> Rust.
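
As a rough illustration of the C++ <--> C <--> Rust layering, here is a minimal sketch of what the Rust side of such a bridge can look like: an opaque handle exposed through `extern "C"` functions, which a C header and then a C++ RAII wrapper could call. `ToyTokenizer` and the `toy_tokenizer_*` functions are hypothetical names invented for this example (a whitespace splitter standing in for the real tokenizer); they are not the functions this PR adds.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical stand-in for a real tokenizer: splits on whitespace
/// and reports at most `max_tokens` tokens.
pub struct ToyTokenizer {
    max_tokens: usize,
}

impl ToyTokenizer {
    fn count_tokens(&self, text: &str) -> usize {
        text.split_whitespace().count().min(self.max_tokens)
    }
}

/// Allocates a tokenizer and hands ownership to the caller as an opaque pointer.
#[no_mangle]
pub extern "C" fn toy_tokenizer_new(max_tokens: usize) -> *mut ToyTokenizer {
    Box::into_raw(Box::new(ToyTokenizer { max_tokens }))
}

/// Returns the token count, or -1 on a null pointer or invalid UTF-8.
///
/// # Safety
/// `tok` must come from `toy_tokenizer_new` and `text` must be NUL-terminated.
#[no_mangle]
pub unsafe extern "C" fn toy_tokenizer_count(
    tok: *const ToyTokenizer,
    text: *const c_char,
) -> i64 {
    if tok.is_null() || text.is_null() {
        return -1;
    }
    let tok = unsafe { &*tok };
    match unsafe { CStr::from_ptr(text) }.to_str() {
        Ok(s) => tok.count_tokens(s) as i64,
        Err(_) => -1,
    }
}

/// Takes ownership back and drops the tokenizer. Null is a no-op.
///
/// # Safety
/// `tok` must come from `toy_tokenizer_new` and must not be used afterwards.
#[no_mangle]
pub unsafe extern "C" fn toy_tokenizer_free(tok: *mut ToyTokenizer) {
    if !tok.is_null() {
        drop(unsafe { Box::from_raw(tok) });
    }
}
```

In this kind of design the C layer only sees an opaque pointer plus free functions, and a C++ wrapper class would typically call the `_new` function in its constructor and the `_free` function in its destructor.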

--

  • Added tests for the C++ bindings
  • Added benchmarks as a sanity check; the results are as expected.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

Very open to this! Let's make sure the functions we bind are broadly compatible with what users expect.

Comment thread bindings/c/src/lib.rs
Comment thread bindings/c/src/lib.rs
@thammegowda thammegowda marked this pull request as draft November 22, 2025 18:51
thammegowda and others added 8 commits November 22, 2025 10:56
CPP bindings coverage improvement
- Introduced a new submodule for Jinja2Cpp to handle chat template rendering.
- Enhanced the C++ bindings to load and apply chat templates from a configuration file.
- Added methods to retrieve special tokens and their IDs from the tokenizer configuration.
- Updated the CMake configuration to include Jinja2Cpp and link it with the tokenizers_cpp library.
- Refactored tests to validate the new chat template functionality and special token handling.
Support chat template in c++ api
@thammegowda
Author

Hi @ArthurZucker, I've made some progress on the C++ side and added more APIs.
But templating with Jinja2 is giving me a bit of trouble. I'm wondering how your team handles templates, e.g. chat_template: is there native support in the Rust code for chat_template in Jinja2 format?

I tried integrating https://github.com/jinja2cpp/Jinja2Cpp/ at the C++ bindings layer, but some features are limited and lead to crashes (IIRC, negative indexing like messages[-1]).

Any tips/recommendation for handling templates natively would be great.

@thammegowda
Author

After digging through the https://github.com/huggingface/text-generation-inference codebase, I found that it uses minijinja in Rust to render chat templates (with some workarounds for unsupported features). So I've removed Jinja rendering from the C++ bindings layer and moved it into the tokenizers core in Rust, using minijinja. This way, all bindings can access the functionality.

Disclaimer: I'm not proficient in Rust, and most of the code was written by AI agents (though I've tried to supervise them closely). Based on my testing, everything seems to work (at least for my use case, and its tests pass).

@thammegowda thammegowda marked this pull request as ready for review December 7, 2025 22:03
@IvanIsCoding

I'll drop this mostly as an FYI: check https://github.com/mlc-ai/tokenizers-cpp

There are existing C++ bindings for Tokenizers that go through the same Rust -> C -> C++ path. That C++ code also binds more than tokenizers because it includes sentencepiece, but that part can be cut out.

I think if you want more battle-tested code, fork tokenizers-cpp/rust into your current bindings/c folder and ship tokenizers_c.h.

Once that is done, fork tokenizers_cpp.h and huggingface_tokenizer.cc into bindings/cpp.

The LICENSE shouldn't be an issue. And I think https://chat.webllm.ai/ has a live deployment of the tokenizer with WASM.

@johnmai-dev

Hi! Is there any plan to merge this PR? I’m currently looking for tokenizers implemented in C.

@ArthurZucker
Collaborator

Actually, yes. I need to take over a little and check a few things, but it's planned for sure.

Collaborator

@ArthurZucker ArthurZucker left a comment

Late on the review! Confirming that we plan on maintaining more bindings, but with a much stricter, explicit list of what we maintain.

For chat templates it's a bit too early, and if it makes it into tokenizer.json we'll have a bigger update, so it should be separated from this PR entirely! It's also lower priority, but we want an easy way to use it with minijinja, which is a great lib.

Comment thread benchmarks/README.md
- **Rust** (baseline): 100%
- **C Bindings**: ~100% (essentially identical to Rust)
- **C++ Bindings**: 97.6% (only 2.4% slower)
- **Python**: 66.5% (33.5% slower)
Collaborator

cc @McPatate there's stuff we can do to improve this i'm sure 👁️ 👁️

Comment thread bindings/c/src/lib.rs
Comment on lines +18 to +29
/// Tokenizer configuration loaded from tokenizer_config.json
/// Contains authoritative special token definitions and chat template
#[derive(Default, Clone)]
struct TokenizerConfig {
    bos_token: Option<String>,
    eos_token: Option<String>,
    pad_token: Option<String>,
    unk_token: Option<String>,
    chat_template: Option<String>,
    add_bos_token: bool,
    add_eos_token: bool,
}
Collaborator

We partially have a PR for #1942, and this is not in the core tokenizers, so we should not add it!

Collaborator

We may add support for chat_template eventually, but never add_bos_token, since tokenizer.json already has that info.

Comment thread bindings/c/src/lib.rs
}

impl CTokenizer {
    fn new_from_file(path: &str, config_path: Option<&str>) -> Option<Self> {
Collaborator

The C binding will at best match the Python one and at worst be more minimal, so we should not have anything extra like TokenizerConfig.

Comment thread bindings/c/src/lib.rs
Comment thread bindings/c/src/lib.rs
///
/// Returns: rendered template string (caller must free with tokenizers_string_free), or null on error
#[no_mangle]
pub extern "C" fn tokenizers_apply_chat_template(
Collaborator

For now this should never go here unless it's in the Rust core!
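
Separately from where the function should live, the doc comment in the diff states an ownership contract: the returned string must be released by the caller with tokenizers_string_free, and null signals an error. A minimal sketch of how such a contract is commonly implemented in Rust with `CString::into_raw` and `CString::from_raw` follows. The names `demo_render_greeting` and `demo_string_free` are hypothetical and chosen for this example only; this is not the PR's implementation.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical producer: returns a newly allocated NUL-terminated string,
/// or null on error (null input or invalid UTF-8).
///
/// # Safety
/// `name` must be null or point to a valid NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn demo_render_greeting(name: *const c_char) -> *mut c_char {
    if name.is_null() {
        return ptr::null_mut();
    }
    let name = match unsafe { CStr::from_ptr(name) }.to_str() {
        Ok(s) => s,
        Err(_) => return ptr::null_mut(),
    };
    // into_raw transfers ownership of the allocation to the caller.
    match CString::new(format!("Hello, {}!", name)) {
        Ok(s) => s.into_raw(),
        Err(_) => ptr::null_mut(),
    }
}

/// Hypothetical matching free function: reclaims a string previously
/// returned by `demo_render_greeting`. Null is a no-op.
///
/// # Safety
/// `s` must come from `demo_render_greeting` and must not be used afterwards.
#[no_mangle]
pub unsafe extern "C" fn demo_string_free(s: *mut c_char) {
    if !s.is_null() {
        // from_raw rebuilds the CString so Rust's allocator frees it.
        drop(unsafe { CString::from_raw(s) });
    }
}
```

The reason for a dedicated free function is that the buffer was allocated by Rust, so it has to be released by Rust as well; passing it to C's `free` is not guaranteed to work.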

Collaborator

I like the idea of having examples for each of the bindings!

Collaborator

let's split this in another PR! 🤗

Collaborator

not 100% sure this makes sense to have in tokenizers

6 participants