fix(deps): update dependency vllm to v0.20.0 [security] #89
Open
renovate[bot] wants to merge 1 commit into `main`
This PR contains the following updates:
| Package | Change |
| --- | --- |
| vllm | `==0.18.0` → `==0.20.0` |
| vllm | `==0.19.1` → `==0.20.0` |
| vllm | `==0.18.0` → `==0.20.0` |

> **Warning**
> Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
vLLM: Unauthenticated OOM Denial of Service via Unbounded `n` Parameter in OpenAI API Server
CVE-2026-34756 / GHSA-3mwp-wvh9-7528
More information
Details
Summary
A Denial of Service vulnerability exists in the vLLM OpenAI-compatible API server. Due to the lack of upper-bound validation on the `n` parameter in the `ChatCompletionRequest` and `CompletionRequest` Pydantic models, an unauthenticated attacker can send a single HTTP request with an astronomically large `n` value. This completely blocks the Python `asyncio` event loop and causes immediate out-of-memory crashes by allocating millions of request object copies on the heap before the request even reaches the scheduling queue.
Details
The root cause of this vulnerability lies in the missing upper-bound checks across the request parsing and asynchronous scheduling layers:

- In `vllm/entrypoints/openai/chat_completion/protocol.py`, the `n` parameter is defined simply as an integer without any `pydantic.Field` constraint for an upper bound (a minimal sketch of the missing bound follows this list).
- When the API request is converted to internal `SamplingParams` in `vllm/sampling_params.py`, the `_verify_args` method only checks the lower bound (`self.n < 1`), entirely omitting an upper-bound check.
- When the malicious request reaches the core engine (`vllm/v1/engine/async_llm.py`), the engine attempts to fan the request out `n` times to generate identical independent sequences within a synchronous loop.
- Because Python's `asyncio` runs on a single thread and event loop, this monolithic `for` loop monopolizes the CPU thread. The server stops responding to all other connections (including liveness probes). Simultaneously, the memory allocator is overwhelmed by cloning millions of request object instances via `copy(request)`, driving the host's Resident Set Size (RSS) up by gigabytes per second until the OS OOM killer terminates the vLLM process.
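As a minimal illustration of the missing constraint at the Pydantic layer, a bounded field would look roughly like the sketch below. The field name follows the advisory; the limit of 128 is an assumption for illustration, not vLLM's actual value.

```python
from pydantic import BaseModel, Field

class ChatCompletionRequest(BaseModel):
    # Sketch of the missing check: bound n above as well as below so a
    # single request cannot fan out into millions of sequence copies.
    # The limit of 128 is an assumed value, not vLLM's chosen bound.
    n: int = Field(default=1, ge=1, le=128)
```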
Impact

Vulnerability Type: Resource Exhaustion / Denial of Service

Impacted Parties: deployments exposing the OpenAI-compatible API server (`vllm.entrypoints.openai.api_server`), which happens to be the primary entrypoint for OpenAI-compatible setups.

Because this vulnerability exploits the control plane rather than the data plane, an unauthenticated remote attacker can achieve a high success rate in taking down production inference hosts with a single HTTP request. This effectively circumvents any hardware-level capacity planning and conventional bandwidth stress limitations.
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM: Server-Side Request Forgery (SSRF) in `download_bytes_from_url`
CVE-2026-34753 / GHSA-pf3h-qjgv-vcpr
More information
Details
Summary
A Server-Side Request Forgery (SSRF) vulnerability in `download_bytes_from_url` allows any actor who can control batch input JSON to make the vLLM batch runner issue arbitrary HTTP/HTTPS requests from the server, without any URL validation or domain restrictions. This can be used to target internal services (e.g. cloud metadata endpoints or internal HTTP APIs) reachable from the vLLM host.
Details
Vulnerable component
The vulnerable logic is in the batch runner entrypoint `vllm/entrypoints/openai/run_batch.py`, function `download_bytes_from_url`. Key properties:

- It accepts multiple URL schemes (`data`, `http`, `https`).
- For `http`/`https`, it directly calls `session.get(url)` on the provided string.
- Unlike the multimodal media path (`MediaConnector`), which implements an explicit domain allowlist, `download_bytes_from_url` does not reuse that protection.

URL controllability
The `url` argument is fully controlled by batch input JSON via the `file_url` field of `BatchTranscriptionRequest`/`BatchTranslationRequest`. There is no restriction on the domain, IP, or port of `file_url` in these models.

The batch runner reads each line of the input file (`args.input_file`), parses it as JSON, and constructs a `BatchTranscriptionRequest`/`BatchTranslationRequest`. Whatever `file_url` appears in that JSON line becomes `batch_request_body.file_url`, which is passed directly into `download_bytes_from_url`. So the data flow is:

- The attacker-controlled batch JSON line supplies `body.file_url`.
- `BatchRequestInput`/`BatchTranscriptionRequest`/`BatchTranslationRequest` parse that JSON and store `file_url` verbatim.
- `make_transcription_wrapper` calls `download_bytes_from_url(batch_request_body.file_url)`.
- `download_bytes_from_url`'s HTTP/HTTPS branch issues `aiohttp.ClientSession().get(url)` to that attacker-controlled URL with no further validation.

This is a classic SSRF pattern: a server-side component makes arbitrary HTTP requests to a URL string taken from untrusted input, as the sketch below illustrates.
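A reduced sketch of that pattern (not vLLM's exact code; the function body here is cut down to the fetch itself):

```python
import aiohttp

# Reduced sketch of the vulnerable pattern described above: the URL string
# comes verbatim from attacker-controlled batch JSON and is fetched with no
# domain, IP, or port restrictions.
async def download_bytes_from_url(url: str) -> bytes:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.read()
```

With this shape, a batch line whose `file_url` points at an internal service (e.g. a cloud metadata endpoint) is fetched from inside the vLLM host.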
Comparison with safer code
The project already contains a safer URL-handling path for multimodal media in `vllm/multimodal/media/connector.py`, which demonstrates the intent to mitigate SSRF via domain allowlists and URL normalization. `download_bytes_from_url` does not reuse this allowlist or any equivalent validation, even though it also fetches user-provided URLs.
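A sketch of equivalent validation, modeled on the allowlist approach the advisory attributes to `MediaConnector` (the setting name and domain below are assumptions, not vLLM's actual configuration):

```python
from urllib.parse import urlparse

# Assumed allowlist for illustration; vLLM's real setting differs.
ALLOWED_MEDIA_DOMAINS = {"media.internal.example.com"}

def validate_file_url(url: str) -> str:
    """Reject any URL whose scheme or host is not explicitly permitted."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_MEDIA_DOMAINS:
        raise ValueError(f"domain not allowlisted: {parsed.hostname!r}")
    return url
```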
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:L
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM: Denial of Service via Unbounded Frame Count in video/jpeg Base64 Processing
CVE-2026-34755 / GHSA-pq5c-rjhq-qp7p
More information
Details
Summary
The `VideoMediaIO.load_base64()` method at `vllm/multimodal/media/video.py:51-62` splits `video/jpeg` data URLs by comma to extract individual JPEG frames, but does not enforce a frame count limit. The `num_frames` parameter (default: 32), which is enforced by the `load_bytes()` code path at lines 47-48, is completely bypassed in the `video/jpeg` base64 path. An attacker can send a single API request containing thousands of comma-separated base64-encoded JPEG frames, causing the server to decode all frames into memory and crash with OOM.
Details
Vulnerable code
The `load_bytes()` path (lines 47-48) properly delegates to a video loader that respects `self.num_frames` (default 32). The `load_base64("video/jpeg", ...)` path bypasses this limit entirely: `data.split(",")` produces an unbounded list, and every frame is decoded into a numpy array.
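As a hedged illustration, a cap analogous to the `load_bytes()` limit could be applied before decoding; the helper name below is hypothetical:

```python
# Hypothetical helper illustrating the missing check: enforce the same
# num_frames bound on the comma-separated base64 path that load_bytes()
# already applies to binary video.
def split_jpeg_frames(data: str, num_frames: int = 32) -> list[str]:
    frames = data.split(",")
    if len(frames) > num_frames:
        raise ValueError(
            f"video/jpeg payload has {len(frames)} frames; limit is {num_frames}"
        )
    return frames
```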
video/jpeg is part of vLLM's public API

`video/jpeg` is a vLLM-specific MIME type, not IANA-registered. However, it is part of the public API surface:

- `encode_video_url()` at `vllm/multimodal/utils.py:96-108` generates `data:video/jpeg;base64,...` URLs.
- `tests/entrypoints/openai/test_video.py:62` and `tests/entrypoints/test_chat_utils.py:153` both use this format.
Memory amplification

Each JPEG frame decodes to a full numpy array. For 640x480 RGB images, each frame is ~921 KB decoded, so 5000 frames come to ~4.6 GB. `np.stack()` then creates an additional copy. The compressed JPEG payload is small (~100 KB for 5000 frames) but decompresses to gigabytes.
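A quick back-of-the-envelope check of those numbers (frame dimensions taken from the advisory):

```python
# Decoded size of one 640x480 RGB frame, and the total for 5000 frames.
frame_bytes = 640 * 480 * 3        # 921,600 bytes ≈ 921 KB per decoded frame
total_bytes = frame_bytes * 5000   # 4,608,000,000 bytes ≈ 4.6 GB
print(f"{total_bytes / 1e9:.1f} GB before np.stack() copies it again")
```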
Data flow

`connector.py:91` uses `split(",", 1)`, which splits on only the first comma. All remaining commas stay in `data` and are later split by `video.py:54`.
Comparison with existing protections

| Path | Frame limit |
| --- | --- |
| `load_bytes()` (binary video) | `num_frames` (default 32) |
| `load_base64("video/jpeg", ...)` | none; `data.split(",")` is unbounded |

Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM Vulnerable to Remote DoS via Special-Token Placeholders
CVE-2026-44222 / GHSA-hpv8-x276-m59f
More information
Details
Summary
This report describes a token injection vulnerability in vLLM's multimodal processing. Unauthenticated, text-only prompts that spell out special tokens are interpreted as control tokens. Image and video placeholder sequences supplied without matching data cause vLLM to index into empty grids during input-position computation, raising an unhandled IndexError and terminating the worker or degrading availability. Multimodal paths that rely on `image_grid_thw`/`video_grid_thw` are affected. Severity: High (remote DoS). Reproduced on vLLM 0.10.0 with Qwen2.5-VL.
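A toy illustration of that failure mode (shapes and variable names assumed; this is not vLLM's code):

```python
import torch

# A text-only prompt that spells an image placeholder implies a grid lookup,
# but no image data was supplied, so the grid tensor is empty.
image_grid_thw = torch.empty((0, 3), dtype=torch.long)  # no matching image
image_index = 0                                         # implied by the placeholder
try:
    thw = image_grid_thw[image_index]
except IndexError as exc:
    # In vLLM this IndexError is unhandled and terminates the worker.
    print(f"IndexError: {exc}")
```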
Details
Representative snippet (context):
Abbreviated call path:
PoC
Environment
Request (text-only, no image/video data)
Observed result
Impact
Fixes
Credits
Pengyu Ding (Infra Security, Ant Group)
Ziteng Xu (Infra Security, Ant Group)
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM makes Use of Uninitialized Resource
CVE-2026-7141 / GHSA-x368-4g9h-fvv4
More information
Details
A vulnerability was found in vLLM up to 0.19.0. The affected element is the function `has_mamba_layers` in the file `vllm/v1/kv_cache_interface.py` of the KV Block Handler component. Manipulation leads to use of an uninitialized resource. The attack can be initiated remotely, is considered to have high complexity, and is described as difficult to exploit. An exploit has been made public and could be used. The patch is named 1ad67864c0c20f167929e64c875f5c28e1aad9fd. To fix this issue, it is recommended to apply the patch.
Severity
CVSS:4.0/AV:N/AC:H/AT:N/PR:N/UI:N/VC:L/VI:L/VA:L/SC:N/SI:N/SA:N/E:P
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
vLLM: `extract_hidden_states` speculative decoding crashes server on any request with penalty parameters
CVE-2026-44223 / GHSA-83vm-p52w-f9pw
More information
Details
Summary
The `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`).

A single request with a penalty parameter (e.g., `"repetition_penalty": 1.1`) is sufficient to crash the server. The crash is deterministic and immediate; no concurrency, race condition, or special workload is required.
In vLLM v0.17.0, the `extract_hidden_states` proposer's `propose()` method returned `sampled_token_ids.unsqueeze(-1)`, producing a tensor of shape `(batch_size, 1)`.

In PR #37013 (first released in v0.18.0), the KV connector interface was refactored out of `propose()`. The return type changed from `tuple[Tensor, KVConnectorOutput | None]` to `Tensor`, and the `.unsqueeze(-1)` call was removed along with the KV connector output.

The refactor missed that `sampled_token_ids` changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as `(batch_size, max_spec_len + 1)`. With `num_speculative_tokens=1`, this produces shape `(batch_size, 2)` instead of the expected `(batch_size, 1)`, causing a broadcast shape mismatch during penalty application, as the toy reproduction below shows.
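A toy reproduction of the shape mismatch (sizes assumed for illustration):

```python
import torch

batch_size, max_spec_len = 4, 1
# After the first decode step, the rejection sampler's output has shape
# (batch_size, max_spec_len + 1) — here (4, 2) — instead of (4, 1).
sampled_token_ids = torch.zeros(batch_size, max_spec_len + 1, dtype=torch.long)
penalty_mask = torch.ones(batch_size, 1)     # penalty math expects (batch, 1)
fixed = sampled_token_ids[:, :1]             # the v0.20.0 fix: slice to (batch, 1)
assert fixed.shape == penalty_mask.shape     # broadcast now lines up
```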
Impact

Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with `extract_hidden_states` speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.
Patches

Fixed in PR #38610, first included in vLLM v0.20.0. The fix slices the return value to `sampled_token_ids[:, :1]`, ensuring the correct `(batch_size, 1)` shape regardless of the rejection sampler's output dimensions.
Workarounds

- Disable `extract_hidden_states` as the speculative decoding method on affected versions.
- Strip penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM, for example with a filter like the one sketched below.
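A minimal sketch of that gateway-side workaround (the filter and its placement are assumptions, not a vLLM feature):

```python
# Drop penalty parameters from a request body before forwarding it to an
# affected vLLM deployment.
PENALTY_KEYS = {"repetition_penalty", "frequency_penalty", "presence_penalty"}

def strip_penalties(body: dict) -> dict:
    return {k: v for k, v in body.items() if k not in PENALTY_KEYS}
```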
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Release Notes
vllm-project/vllm (vllm)
v0.20.0
vLLM v0.20.0
Highlights
This release features 752 commits from 320 contributors (123 new)!
- The `vllm/vllm-openai:v0.20.0` image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (#40669). As a general rule of thumb, our CUDA version policy follows PyTorch's. We highly recommend installing vLLM with `uv` and using `--torch-backend=cu129` if you are on CUDA 12.9.
- `transformers>=5` (#30566), with vision-encoder torch.compile bypass (#30518) and continued v4/v5 compat fixes including PaddleOCR-VL image processor `max_pixels` (#38629), Mistral YaRN warning (#37292), and Jina ColBERT rotary inv_freq recompute (#39176).
- `SharedFusedMoE` removed (#35782), DefaultMoERunner split (#35326) and later combined back into `MoERunnerBase` (#40560), shared/fused expert output sum moved into `MoERunnerBase` (#35949), ZeroExpertFusedMoE in new framework (#35549), `compressed_tensors_moe.py` split (#38960), `GPTQMarlinMoEMethod` reworked with MK (#37990), XPU & CUTLASS MoE relocated to `fused_moe/experts/` (#40568, #40574), `make_expert_params_mapping` renamed (#40671), MoE LoRA refactor (#40338), and MoE DP chunking removed (#39107).
- Avoid `seq_lens_cpu` GPU→CPU sync (#40654); cache `InductorPass.hash_source` (#39328); skip FX-graph deserialization on loading for faster warm compile (#40151); CUDAGraph memory profiling enabled by default for clearer startup memory accounting (#38284).

Model Support
- `mamba_ssm_cache_dtype=float32` with NemotronHNanoVLV2 auto-hook (#39032); new TP plan styles for the Transformers backend (#40467); GLM-5.1 fix on ROCm (#40763).

Engine Core
- `donate_graph_module=True` for `standalone_compile` (#39733), skip FX graph deserialization on loading (#40151), include Inductor & functorch configs in compile-cache key (#40627), respect `TORCH_COMPILE_DISABLE` at vLLM config level (#40715), disable Sequence Parallelism for piecewise compilation (#38373).
- `concat_mla_q` half-types only (#37892), batch-invariance-aware backend auto-selection (#40193), avoid `seq_lens_cpu` GPU→CPU sync (#40654).
- `shutdown()` on OffloadingConnector (#39182), request context passed through KV offload (#39185), sliding-window lookup (#36645), multi-group worker transfer (#38453), multi-KV-group lookup/load/store (#39401, #39402, #39403).
- `VLLM_MEDIA_CACHE` media URL caching (#37123), safe request abort when FSM fails to advance (#38663), KV connector prioritized over internal registry (#38301), CUDAGraph memory profiling on by default (#38284), shared-expert overlap restored (#39222), `CONFIG_REGISTRY` config-class lookup fix when on-disk model_type differs (#39554), workspace-resize GPU memory leak fix (#39226), SWA/chunked-local runtime admission capped to startup pool-sizing bound (#40946).
- `selective_state_update` (#36162).
- `get_num_embed` overhead reduced (#40143), `request_id` on `FinishedRequestStats` (#39710).

Hardware & Performance
- `--enable-vit-cuda-graph` for VLM examples (#40580), default `max_frames_per_batch` auto-infer for ViT CG video (#40445), fused FP8 output quantization into `merge_attn_states` (#36518), batched KV-cache swap via `cuMemcpyBatchAsync` (#38460), sm_110 (Jetson Thor) added to CUDA 13.0 build targets (#39233).
- `TritonInt8ScaledMMLinearKernel` (#38501), `fused_silu_mul_block_quant` enabled (#38817), KV-cache shuffle for `paged_attention_common` (#32914), MLA decode output zero-fill removed in AITER (#37539), MLA dual RMS norm fusion pass for DeepSeek/Kimi-K2 (#39242, with older-AITer guard #40386), AITER MLA + Eagle3 spec decode (#39616), DFlash on ROCm (#39703), wvSplitK FP8 path for RDNA (#37712), GPU↔NUMA-node detection (#40015), non-causal attention in `ROCM_ATTN` (#40176), engine-shutdown GPU memory leak fix (#38503), score-correction-bias dtype cast for DeepSeek/Kimi-K2 (#39999).
- `round_int8` for Intel Triton (#38825), MoE Triton in online FP8 quantization fix (#40109), `current_platform.supports_fp8()` updated for TritonExperts (#40132), NIXL import on XPU fix (#40430), fusion-pattern support disabled on XPU (#39789)…

Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about these updates again.
This PR was generated by Mend Renovate. View the repository job log.