Conversation

Contributor (Author):
I will update docs after the code gets approved

Contributor (Author):
I changed only docs for
Description
This PR fixes a few bugs related to the LLMs, caused by mixing two approaches - functional (as we pass the whole messages history each time) and stateful (as we keep `pos_` in the runner, representing at which position the KV cache is) - which resulted in 3 bugs:

1. Reasoning tokens were generated and counted towards the KV-cache position (`pos_ += num_generated_tokens`), but in the next turns the jinja template removed these reasoning tokens from the messages history - as a result, the KV cache was incoherent.
2. Old messages were prefilled again on each turn without resetting `pos_` - as a result, tokens were "duplicated" in the KV cache and we were running out of available tokens very fast (exceeding `context_window_length`).
3. Even though the `generate()` method is called in a functional way, it kept internal state in the runner (e.g. `pos_`).

These bugs were fixed by resetting the runner before each generation, which makes it truly functional - old messages are prefilled and the KV cache can still be used during the generation phase.
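Conceptually, the fixed flow looks like the sketch below. This is only an illustration of the pattern, assuming a hypothetical `Runner` interface with `reset`/`prefill`/`generate` methods - it is not the actual native runner API:

```typescript
// Hypothetical runner interface - illustrative only, not the real native API.
interface Runner {
  reset(): void; // drops internal state such as pos_ (the KV-cache position)
  prefill(renderedHistory: string): void; // feeds the templated message history
  generate(onToken: (token: string) => void): string; // decodes from the prefilled cache
}

// Each call is now truly functional: start from a clean runner, prefill the
// whole history, and only then generate.
function generateFunctional(runner: Runner, renderedHistory: string): string {
  runner.reset(); // pos_ starts from 0, nothing stale is left from previous turns
  runner.prefill(renderedHistory); // KV cache is rebuilt from the current history
  return runner.generate(() => {}); // KV cache is still reused during generation
}
```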
Additionally, this PR adds `ContextStrategy` to the `ChatConfig` interface, so it is now possible to define a strategy for managing context (or use one of the already implemented ones, e.g. naive, message-count-based, sliding window). This gives us more flexibility and lets the user decide what is best for their use case. From now on, `SlidingWindowContextStrategy` is also configured as the default.
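To make the idea concrete, a strategy could look roughly like the sketch below. The names and signatures here (`Message`, `apply`, `MessageCountContextStrategy`) are illustrative assumptions, not the actual exported API:

```typescript
// Illustrative sketch of a context management strategy - names and signatures
// are assumptions, not the library's actual interface.
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ContextStrategy {
  // Decides which part of the history is passed to the model.
  apply(messages: Message[]): Message[];
}

// Example: keep only the N most recent non-system messages,
// always preserving the system prompt.
class MessageCountContextStrategy implements ContextStrategy {
  constructor(private readonly maxMessages: number) {}

  apply(messages: Message[]): Message[] {
    const system = messages.filter((m) => m.role === 'system');
    const rest = messages.filter((m) => m.role !== 'system');
    return [...system, ...rest.slice(-this.maxMessages)];
  }
}
```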
Introduces a breaking change?

These changes will not break anything as long as the maximum number of messages is not modified (I removed `contextWindowLength` from `ChatConfig` and replaced it with `contextStrategy`).
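For anyone relying on `contextWindowLength`, the change at the config level is roughly the following. The field shapes below are an illustration of the migration, not the exact `ChatConfig` definition:

```typescript
// Illustrative before/after shapes - the real ChatConfig lives in the library
// and may differ in detail.
interface ContextStrategy {
  apply(messages: { role: string; content: string }[]): { role: string; content: string }[];
}

// Before this PR: context was limited by a raw window length.
interface ChatConfigBefore {
  contextWindowLength: number;
}

// After this PR: context is managed by a pluggable strategy;
// SlidingWindowContextStrategy is used when nothing is specified.
interface ChatConfigAfter {
  contextStrategy?: ContextStrategy;
}
```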
Type of change

Tested on
Testing instructions
Run the example llm app, open the executorch logs (`adb logcat | grep -i "executorch"`, for example) and check that the token counts are properly aligned and that `pos_` is correct.

To test different context management strategies, change `contextStrategy` in the llm app and modify the model configuration.
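For example, switching strategies while testing could look roughly like this. The import path and constructor arguments are assumptions for illustration; only `SlidingWindowContextStrategy` is a name taken from this PR:

```typescript
// Assumed import path and constructor arguments - illustrative only.
import { SlidingWindowContextStrategy } from 'react-native-executorch';

// Default: sliding window over the chat history.
const chatConfig = {
  contextStrategy: new SlidingWindowContextStrategy(),
};

// While testing, swap the instance above for another strategy (e.g. a naive or
// message-count-based one) and watch pos_ and the token counts in the logs.
```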
Screenshots

Related issues
#776
Checklist
Additional notes
Position in KV cache, number of prompt tokens and number of generated tokens for both non-reasoning and reasoning models BEFORE changes.
LLAMA 3.2 1B SPINQUANT (without reasoning)
QWEN 3.0 0.6B QUANTIZED (with reasoning)