What happened: The ShareGPT dataset appears to contain 2 inputs that are longer than the maximum input size for Qwen2.5 0.5B. This results in 2 request failures when running the following config.
What you expected to happen: I am not sure what the intended behavior is. Is it supposed to just fail, or should the datagen input be truncated to fit within the model's context length?
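If truncation is the preferred behavior, a minimal sketch of what the datagen could do is below. This is a hypothetical helper, not existing inference-perf code; it assumes a Hugging Face-style tokenizer with `encode`/`decode` methods and clips each prompt to the model's maximum input length before the request is sent:

```python
def truncate_to_context(prompt: str, tokenizer, max_input_tokens: int) -> str:
    """Clip a prompt to at most max_input_tokens tokens.

    Prompts that already fit are returned unchanged, so truncation only
    affects the (rare) oversized ShareGPT samples.
    """
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(token_ids) <= max_input_tokens:
        return prompt
    return tokenizer.decode(token_ids[:max_input_tokens])
```

With `transformers`, the tokenizer and limit could come from `AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` and its `model_max_length`, though where exactly this hook belongs in the datagen pipeline is up to the maintainers.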
How to reproduce it (as minimally and precisely as possible): I used Qwen/Qwen2.5-0.5B-Instruct along with the ShareGPT dataset. The following is the config.yml:
```yaml
load:
  type: constant
  stages:
  - rate: 1
    duration: 30
api:
  type: completion
  streaming: true
server:
  type: vllm
  model_name: Qwen/Qwen2.5-0.5B-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct
data:
  type: shareGPT
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
    per_adapter: true
    per_adapter_stage: true
  prometheus:
    summary: true
    per_stage: false
```
Anything else we need to know?:
Environment:
- inference-perf version: Latest (v0.3.0)
- config.yml (entire one printed by the benchmark run): See above
- cloud provider or hardware configuration: local
- others: