
fix(deps): update dependency vllm to v0.20.0 [security] #89

Open

renovate[bot] wants to merge 1 commit into main from renovate/pypi-vllm-vulnerability

Conversation

renovate[bot] (Contributor) commented May 6, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package  Change                Age  Confidence
vllm     ==0.18.0 -> ==0.20.0
vllm     ==0.19.1 -> ==0.20.0
vllm     ==0.18.0 -> ==0.20.0

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


vLLM: Unauthenticated OOM Denial of Service via Unbounded n Parameter in OpenAI API Server

CVE-2026-34756 / GHSA-3mwp-wvh9-7528


Details

Summary

A Denial of Service vulnerability exists in the vLLM OpenAI-compatible API server. Due to the lack of upper-bound validation on the n parameter in the ChatCompletionRequest and CompletionRequest Pydantic models, an unauthenticated attacker can send a single HTTP request with an astronomically large n value. This blocks the Python asyncio event loop and causes an immediate Out-Of-Memory crash by allocating millions of request object copies on the heap before the request even reaches the scheduling queue.

Details

The root cause of this vulnerability lies in the missing upper bound checks across the request parsing and asynchronous scheduling layers:

  1. Protocol Layer:
    In vllm/entrypoints/openai/chat_completion/protocol.py, the n parameter is defined simply as an integer without any pydantic.Field constraints for an upper bound.
class ChatCompletionRequest(OpenAIBaseModel):
    # Ordered by official OpenAI API documentation
    # https://platform.openai.com/docs/api/reference/chat/create
    messages: list[ChatCompletionMessageParam]
    model: str | None = None
    frequency_penalty: float | None = 0.0
    logit_bias: dict[str, float] | None = None
    logprobs: bool | None = False
    top_logprobs: int | None = 0
    max_tokens: int | None = Field(
        default=None,
        deprecated="max_tokens is deprecated in favor of "
        "the max_completion_tokens field",
    )
    max_completion_tokens: int | None = None
    n: int | None = 1
    presence_penalty: float | None = 0.0
  2. SamplingParams Layer (Incomplete Validation):
    When the API request is converted to internal SamplingParams in vllm/sampling_params.py, the _verify_args method only checks the lower bound (self.n < 1), entirely omitting an upper-bound check.
    def _verify_args(self) -> None:
        if not isinstance(self.n, int):
            raise ValueError(f"n must be an int, but is of type {type(self.n)}")
        if self.n < 1:
            raise ValueError(f"n must be at least 1, got {self.n}.")
  3. Engine Layer (The OOM Trigger):
    When the malicious request reaches the core engine (vllm/v1/engine/async_llm.py), the engine attempts to fan out the request n times to generate identical independent sequences within a synchronous loop.
        # Fan out child requests (for n>1).
        parent_request = ParentRequest(request)
        for idx in range(parent_params.n):
            request_id, child_params = parent_request.get_child_info(idx)
            child_request = request if idx == parent_params.n - 1 else copy(request)
            child_request.request_id = request_id
            child_request.sampling_params = child_params
            await self._add_request(
                child_request, prompt_text, parent_request, idx, queue
            )
        return queue

Because Python's asyncio runs on a single thread and event loop, this monolithic for-loop monopolizes the CPU thread. The server stops responding to all other connections (including liveness probes). Simultaneously, the memory allocator is overwhelmed by cloning millions of request object instances via copy(request), driving the host's Resident Set Size (RSS) up by gigabytes per second until the OS OOM-killer terminates the vLLM process.
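
For illustration, a minimal sketch of the kind of upper bound that would reject such a request during parsing, before any fan-out occurs; the ceiling of 128 is an arbitrary example and the Optional part of the field is dropped for brevity, so this is not vLLM's actual patch:

# Sketch only: an upper-bounded `n`, assuming Pydantic v2 models as in the
# excerpt above. The limit of 128 is an illustrative choice, not vLLM's fix.
from pydantic import BaseModel, Field, ValidationError


class ChatCompletionRequestSketch(BaseModel):
    n: int = Field(default=1, ge=1, le=128)


try:
    ChatCompletionRequestSketch(n=10_000_000)
except ValidationError as exc:
    # Rejected during request parsing, before millions of copy(request)
    # allocations or any blocking of the asyncio event loop.
    print(exc)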

Impact

Vulnerability Type: Resource Exhaustion / Denial of Service

Impacted Parties:

  • Any individual or organization hosting a public-facing vLLM API server (vllm.entrypoints.openai.api_server), which is the primary entrypoint for OpenAI-compatible deployments.
  • SaaS / AI-as-a-Service platforms that reverse-proxy requests to vLLM without strict HTTP body validation or rate limiting.

Because this vulnerability exploits the control plane rather than the data plane, an unauthenticated remote attacker can take down production inference hosts with a single HTTP request, bypassing hardware-level capacity planning and conventional bandwidth-based rate limits.

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM: Server-Side Request Forgery (SSRF) in download_bytes_from_url

CVE-2026-34753 / GHSA-pf3h-qjgv-vcpr


Details

Summary

A Server Side Request Forgery (SSRF) vulnerability in download_bytes_from_url allows any actor who can control batch input JSON to make the vLLM batch runner issue arbitrary HTTP/HTTPS requests from the server, without any URL validation or domain restrictions.

This can be used to target internal services (e.g. cloud metadata endpoints or internal HTTP APIs) reachable from the vLLM host.


Details
Vulnerable component

The vulnerable logic is in the batch runner entrypoint vllm/entrypoints/openai/run_batch.py, function download_bytes_from_url:


##### run_batch.py Lines 442-482
async def download_bytes_from_url(url: str) -> bytes:
    """
    Download data from a URL or decode from a data URL.

    Args:
        url: Either an HTTP/HTTPS URL or a data URL (data:...;base64,...)

    Returns:
        Data as bytes
    """
    parsed = urlparse(url)

    # Handle data URLs (base64 encoded)
    if parsed.scheme == "data":
        # Format: data:...;base64,<base64_data>
        if "," in url:
            header, data = url.split(",", 1)
            if "base64" in header:
                return base64.b64decode(data)
            else:
                raise ValueError(f"Unsupported data URL encoding: {header}")
        else:
            raise ValueError(f"Invalid data URL format: {url}")

    # Handle HTTP/HTTPS URLs
    elif parsed.scheme in ("http", "https"):
        async with (
            aiohttp.ClientSession() as session,
            session.get(url) as resp,
        ):
            if resp.status != 200:
                raise Exception(
                    f"Failed to download data from URL: {url}. Status: {resp.status}"
                )
            return await resp.read()

    else:
        raise ValueError(
            f"Unsupported URL scheme: {parsed.scheme}. "
            "Supported schemes: http, https, data"
        )

Key properties:

  • The function only parses the URL to dispatch on the scheme (data, http, https).
  • For http / https, it directly calls session.get(url) on the provided string.
  • There is no validation of:
    • hostname or IP address,
    • whether the target is internal or external,
    • port number,
    • path, query, or redirect target.
  • This is in contrast to the multimodal media path (MediaConnector), which implements an explicit domain allowlist. download_bytes_from_url does not reuse that protection.
URL controllability

The url argument is fully controlled by batch input JSON via the file_url field of BatchTranscriptionRequest / BatchTranslationRequest.

  1. Batch request body type:

##### run_batch.py Line 67-80
class BatchTranscriptionRequest(TranscriptionRequest):
    """
    Batch transcription request that uses file_url instead of file.

    This class extends TranscriptionRequest but replaces the file field
    with file_url to support batch processing from audio files written in JSON format.
    """

    file_url: str = Field(
        ...,
        description=(
            "Either a URL of the audio or a data URL with base64 encoded audio data. "
        ),
    )

##### run_batch.py Line 98-111
class BatchTranslationRequest(TranslationRequest):
    """
    Batch translation request that uses file_url instead of file.

    This class extends TranslationRequest but replaces the file field
    with file_url to support batch processing from audio files written in JSON format.
    """

    file_url: str = Field(
        ...,
        description=(
            "Either a URL of the audio or a data URL with base64 encoded audio data. "
        ),
    )

There is no restriction on the domain, IP, or port of file_url in these models.

  2. Batch input is parsed directly from the batch file:

##### run_batch.py Line 139-179
class BatchRequestInput(OpenAIBaseModel):
    ...
    url: str
    body: BatchRequestInputBody
    @field_validator("body", mode="plain")
    @classmethod
    def check_type_for_url(cls, value: Any, info: ValidationInfo):
        url: str = info.data["url"]
        ...
        if url == "/v1/audio/transcriptions":
            return BatchTranscriptionRequest.model_validate(value)
        if url == "/v1/audio/translations":
            return BatchTranslationRequest.model_validate(value)

##### run_batch.py Line 770-781
   logger.info("Reading batch from %s...", args.input_file)

    # Submit all requests in the file to the engine "concurrently".
    response_futures: list[Awaitable[BatchRequestOutput]] = []
    for request_json in (await read_file(args.input_file)).strip().split("\n"):
        # Skip empty lines.
        request_json = request_json.strip()
        if not request_json:
            continue

        request = BatchRequestInput.model_validate_json(request_json)

The batch runner reads each line of the input file (args.input_file), parses it as JSON, and constructs a BatchTranscriptionRequest / BatchTranslationRequest. Whatever file_url appears in that JSON line becomes batch_request_body.file_url.

  3. file_url is passed directly into download_bytes_from_url:

##### run_batch.py Line 610-623
def wrapper(handler_fn: Callable):
        async def transcription_wrapper(
            batch_request_body: (BatchTranscriptionRequest | BatchTranslationRequest),
        ) -> (
            TranscriptionResponse
            | TranscriptionResponseVerbose
            | TranslationResponse
            | TranslationResponseVerbose
            | ErrorResponse
        ):
            try:
                # Download data from URL
                audio_data = await download_bytes_from_url(batch_request_body.file_url)

So the data flow is:

  1. Attacker supplies JSON line in the batch input file with arbitrary body.file_url.
  2. BatchRequestInput / BatchTranscriptionRequest / BatchTranslationRequest parse that JSON and store file_url verbatim.
  3. make_transcription_wrapper calls download_bytes_from_url(batch_request_body.file_url).
  4. download_bytes_from_url’s HTTP/HTTPS branch issues aiohttp.ClientSession().get(url) to that attacker-controlled URL with no further validation.

This is a classic SSRF pattern: a server-side component makes arbitrary HTTP requests to a URL string taken from untrusted input.
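
For concreteness, a hedged illustration of one such JSONL line; only the url and body.file_url fields come from the excerpts above, and whatever other fields a real batch line requires are omitted:

# Hedged illustration of a batch-input line that reaches the vulnerable path.
# Only url and body.file_url are taken from the excerpts above; any other
# required fields of a real batch line are omitted.
import json

line = json.dumps(
    {
        "url": "/v1/audio/transcriptions",
        "body": {
            # An attacker can point this at services reachable only from the
            # vLLM host, e.g. a cloud metadata endpoint.
            "file_url": "http://169.254.169.254/latest/meta-data/",
        },
    }
)
print(line)  # one line of the JSONL batch input file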

Comparison with safer code

The project already contains a safer URL-handling path for multimodal media in vllm/multimodal/media/connector.py, which demonstrates the intent to mitigate SSRF via domain allowlists and URL normalization:


##### connector.py Lines 169-189
 def load_from_url(
        self,
        url: str,
        media_io: MediaIO[_M],
        *,
        fetch_timeout: int | None = None,
    ) -> _M:  # type: ignore[type-var]
        url_spec = parse_url(url)

        if url_spec.scheme and url_spec.scheme.startswith("http"):
            self._assert_url_in_allowed_media_domains(url_spec)

            connection = self.connection
            data = connection.get_bytes(
                url_spec.url,
                timeout=fetch_timeout,
                allow_redirects=envs.VLLM_MEDIA_URL_ALLOW_REDIRECTS,
            )

            return media_io.load_bytes(data)

and:


##### connector.py Lines 158-167
  def _assert_url_in_allowed_media_domains(self, url_spec: Url) -> None:
        if (
            self.allowed_media_domains
            and url_spec.hostname not in self.allowed_media_domains
        ):
            raise ValueError(
                f"The URL must be from one of the allowed domains: "
                f"{self.allowed_media_domains}. Input URL domain: "
                f"{url_spec.hostname}"
            )

download_bytes_from_url does not reuse this allowlist or any equivalent validation, even though it also fetches user-provided URLs.
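
For comparison, a hedged sketch of how an equivalent host check could be applied in front of the http/https branch; the allowlist contents and the helper name are hypothetical, not part of vLLM's API:

# Hedged sketch: validate the target host before fetching, mirroring the
# allowlist idea used by MediaConnector. ALLOWED_DOMAINS and
# _assert_allowed_host are illustrative names, not vLLM code.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example-audio-bucket.s3.amazonaws.com"}  # deployment-specific


def _assert_allowed_host(url: str) -> None:
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https") and parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(
            f"URL host {parsed.hostname!r} is not in the allowed domains: "
            f"{sorted(ALLOWED_DOMAINS)}"
        )

# e.g. called at the top of download_bytes_from_url's http/https branch:
# _assert_allowed_host(url)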

Severity

  • CVSS Score: 5.4 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:L

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM: Denial of Service via Unbounded Frame Count in video/jpeg Base64 Processing

CVE-2026-34755 / GHSA-pq5c-rjhq-qp7p


Details

Summary

The VideoMediaIO.load_base64() method at vllm/multimodal/media/video.py:51-62 splits video/jpeg data URLs by comma to extract individual JPEG frames, but does not enforce a frame count limit. The num_frames parameter (default: 32), which is enforced by the load_bytes() code path at line 47-48, is completely bypassed in the video/jpeg base64 path. An attacker can send a single API request containing thousands of comma-separated base64-encoded JPEG frames, causing the server to decode all frames into memory and crash with OOM.

Details
Vulnerable code
##### video.py:51-62
def load_base64(self, media_type: str, data: str) -> tuple[npt.NDArray, dict[str, Any]]:
    if media_type.lower() == "video/jpeg":
        load_frame = partial(self.image_io.load_base64, "image/jpeg")
        return np.stack(
            [np.asarray(load_frame(frame_data)) for frame_data in data.split(",")]
            #                                                       ^^^^^^^^^^
            # Unbounded split — no frame count limit
        ), {}
    return self.load_bytes(base64.b64decode(data))

The load_bytes() path (line 47-48) properly delegates to a video loader that respects self.num_frames (default 32). The load_base64("video/jpeg", ...) path bypasses this limit entirely — data.split(",") produces an unbounded list and every frame is decoded into a numpy array.
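
A hedged sketch of the missing cap, written as a standalone helper; the default of 32 mirrors the num_frames default mentioned above, but this is not the upstream patch:

# Hedged sketch: cap the frame count before decoding anything. The default of
# 32 mirrors the num_frames default noted above; this is not vLLM's patch.
def split_jpeg_frames(data: str, max_frames: int = 32) -> list[str]:
    frames = data.split(",")
    if len(frames) > max_frames:
        raise ValueError(
            f"video/jpeg payload contains {len(frames)} frames; "
            f"the limit is {max_frames}"
        )
    return frames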

video/jpeg is part of vLLM's public API

video/jpeg is a vLLM-specific MIME type, not IANA-registered. However it is part of the public API surface:

  • encode_video_url() at vllm/multimodal/utils.py:96-108 generates data:video/jpeg;base64,... URLs
  • Official test suites at tests/entrypoints/openai/test_video.py:62 and tests/entrypoints/test_chat_utils.py:153 both use this format
Memory amplification

Each JPEG frame decodes to a full numpy array. For 640x480 RGB images, each frame is ~921 KB decoded. 5000 frames = ~4.6 GB. np.stack() then creates an additional copy. The compressed JPEG payload is small (~100 KB for 5000 frames) but decompresses to gigabytes.
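
A back-of-the-envelope check of those figures, assuming 640x480 uint8 RGB frames (3 bytes per pixel):

# Sanity-check of the numbers above, assuming 640x480 uint8 RGB frames.
frame_bytes = 640 * 480 * 3            # 921,600 bytes (~0.9 MB) per decoded frame
total_bytes = frame_bytes * 5000       # ~4.6 GB before np.stack() copies it again
print(f"{frame_bytes:,} bytes/frame, {total_bytes / 1e9:.1f} GB for 5000 frames")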

Data flow
POST /v1/chat/completions
  → chat_utils.py:1434   video_url type → mm_parser.parse_video()
  → chat_utils.py:872    parse_video() → self._connector.fetch_video()
  → connector.py:295     fetch_video() → load_from_url(url, self.video_io)
  → connector.py:91      _load_data_url(): url_spec.path.split(",", 1)
                          → media_type = "video/jpeg"
                          → data = "<frame1>,<frame2>,...,<frame10000>"
  → connector.py:100     media_io.load_base64("video/jpeg", data)
  → video.py:54          data.split(",")  ← UNBOUNDED
  → video.py:55-57       all frames decoded into numpy arrays
  → video.py:56          np.stack([...])  ← massive combined array → OOM

connector.py:91 uses split(",", 1) which splits on only the first comma. All remaining commas stay in data and are later split by video.py:54.

Comparison with existing protections
Code Path                        Frame Limit                       File
load_bytes() (binary video)      Yes — num_frames (default 32)     video.py:46-49
load_base64("video/jpeg", ...)   No — unlimited data.split(",")    video.py:51-62

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM Vulnerable to Remote DoS via Special-Token Placeholders

CVE-2026-44222 / GHSA-hpv8-x276-m59f


Details

Summary

This report describes a Token Injection vulnerability in vLLM’s multimodal processing. Unauthenticated, text-only prompts that spell out special tokens are interpreted as control tokens. Image and video placeholder sequences supplied without matching data cause vLLM to index into empty grids during input-position computation, raising an unhandled IndexError and terminating the worker or degrading availability. Multimodal paths that rely on image_grid_thw/video_grid_thw are affected. Severity: High (remote DoS). Reproduced on vLLM 0.10.0 with Qwen2.5-VL.

Details
  • Affected component: multimodal input position computation.
  • File/functions (paths are indicative):
    • vllm/model_executor/layers/rotary_embedding.py
      • get_input_positions_tensor(...)
      • _vl_get_input_positions_tensor(...)
  • Failure mechanism:
    • The code counts detected vision tokens and then indexes video_grid_thw/image_grid_thw accordingly.
    • When user input carries placeholder tokens but no actual multimodal payload, these grids are empty. The code does not bounds-check before indexing.

Representative snippet (context):

##### vllm/model_executor/layers/rotary_embedding.py
@classmethod
def _vl_get_input_positions_tensor(
    cls,
    input_tokens,
    hf_config,
    image_grid_thw,
    video_grid_thw,
    ...,
):
    # detect video tokens
    video_nums = (vision_tokens == video_token_id).sum()
    # later in processing
    t, h, w = (
        video_grid_thw[video_index][0],  # IndexError if no video data
        video_grid_thw[video_index][1],
        video_grid_thw[video_index][2],
    )
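
A hedged sketch of the kind of bounds check that would turn this crash into a rejected request; the function shape and names are illustrative, not the upstream fix:

# Hedged sketch: bounds-check the grid before indexing, reusing the variable
# names from the snippet above. Illustrative only, not vLLM's actual fix.
def get_video_grid(video_grid_thw: list[list[int]], video_index: int) -> tuple[int, int, int]:
    if video_index >= len(video_grid_thw):
        # Placeholder tokens arrived without matching video data: reject the
        # request instead of letting an IndexError kill the worker.
        raise ValueError(
            f"found video placeholder #{video_index} but only "
            f"{len(video_grid_thw)} video grid entries were supplied"
        )
    t, h, w = video_grid_thw[video_index]
    return t, h, w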

Abbreviated call path:

OpenAI API request
 → vllm.v1.engine.core: step/execute_model
 → vllm.v1.worker.gpu_model_runner: _update_states/execute_model
 → vllm.model_executor.layers.rotary_embedding: get_input_positions_tensor
 → _vl_get_input_positions_tensor
 → IndexError: list index out of range
PoC
Environment
  • vLLM: 0.10.0
  • Model: Qwen/Qwen2.5-VL-3B-Instruct
  • Launch server:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-3B-Instruct \
  --port 8000
Request (text-only, no image/video data)
cat > request.json <<'JSON'
{
  "model": "Qwen/Qwen2.5-VL-3B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text",
          "text": "what's in picture <|vision_start|><|image_pad|><|vision_end|>" }
      ]
    }
  ]
}
JSON

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  --data @request.json
Observed result
  • HTTP 500; logs show IndexError: list index out of range from _vl_get_input_positions_tensor(...).
  • In some deployments, the worker exits and capacity remains reduced until manual restart.
Impact
  • Type: Token Injection leading to Remote Denial of Service (unauthenticated). A single request can trigger the fault.
  • Scope: Any vLLM deployment that serves VLMs and accepts raw user text via OpenAI-compatible endpoints (self-hosted or proxied/managed fronts).
  • Effect: Request → unhandled exception in position computation → worker termination / service unavailability.
Fixes
Credits

Pengyu Ding (Infra Security, Ant Group)
Ziteng Xu (Infra Security, Ant Group)

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM makes Use of Uninitialized Resource

CVE-2026-7141 / GHSA-x368-4g9h-fvv4


Details

A vulnerability was found in vLLM up to 0.19.0. The affected element is the function has_mamba_layers in vllm/v1/kv_cache_interface.py, part of the KV Block Handler component. Manipulation leads to use of an uninitialized resource. The attack can be initiated remotely but is considered high complexity and difficult to exploit. A public exploit exists. The patch is commit 1ad67864c0c20f167929e64c875f5c28e1aad9fd; deploying it is the recommended fix.

Severity

  • CVSS Score: 2.9 / 10 (Low)
  • Vector String: CVSS:4.0/AV:N/AC:H/AT:N/PR:N/UI:N/VC:L/VI:L/VA:L/SC:N/SI:N/SA:N/E:P

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters

CVE-2026-44223 / GHSA-83vm-p52w-f9pw


Details

Summary

The extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty).

A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. The crash is deterministic and immediate — no concurrency, race condition, or special workload is required.

Details

In vLLM v0.17.0, the extract_hidden_states proposer's propose() method returned sampled_token_ids.unsqueeze(-1), producing a tensor of shape (batch_size, 1).

In PR #​37013 (first released in v0.18.0), the KV connector interface was refactored out of propose(). The return type changed from tuple[Tensor, KVConnectorOutput | None] to Tensor, and the .unsqueeze(-1) call was removed along with the KV connector output:

##### Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)

##### After (v0.18.0+):
return sampled_token_ids  # shape (batch_size, 2) after first decode step

The refactor missed that sampled_token_ids changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as (batch_size, max_spec_len + 1). With num_speculative_tokens=1, this produces shape (batch_size, 2) instead of the expected (batch_size, 1), causing a broadcast shape mismatch during penalty application.
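
A minimal sketch of the shape mismatch and of the slicing fix described under Patches, with numpy standing in for torch tensors (dummy contents; only the shapes matter):

# Minimal sketch of the shape problem; numpy stands in for torch tensors.
import numpy as np

batch_size, num_speculative_tokens = 4, 1

# After the first decode step the rejection sampler allocates
# (batch_size, max_spec_len + 1) == (4, 2) ...
sampled_token_ids = np.zeros((batch_size, num_speculative_tokens + 1), dtype=np.int64)

# ... while penalty application expects one sampled token per sequence, (4, 1).
# The fix described under Patches slices to the first column:
proposed = sampled_token_ids[:, :1]
assert proposed.shape == (batch_size, 1)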

Impact

Any vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with extract_hidden_states speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.

Patches

Fixed in PR #​38610, first included in vLLM v0.20.0. The fix slices the return value to sampled_token_ids[:, :1], ensuring the correct (batch_size, 1) shape regardless of the rejection sampler's output dimensions.

Workarounds
  • Upgrade to vLLM v0.20.0 or later.
  • If upgrading is not possible, avoid using extract_hidden_states as the speculative decoding method on affected versions.
  • Alternatively, reject or strip penalty parameters (repetition_penalty, frequency_penalty, presence_penalty) from incoming requests at an API gateway before they reach vLLM.

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


Release Notes

vllm-project/vllm (vllm)

v0.20.0

Compare Source

vLLM v0.20.0

Highlights

This release features 752 commits from 320 contributors (123 new)!

  • DeepSeek V4: Initial DeepSeek V4 support landed (#​40860), with DSML token-leakage fix in DSV4/3.2 (#​40806), DSA + MTP IMA fix (#​40772), and a silu clamp limit on the shared expert (#​40950).
  • CUDA 13.0 default: Default CUDA wheel on PyPI and vllm/vllm-openai:v0.20.0 image switched to CUDA 13.0; architecture lists and build-args cleaned up (#​39878), and CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (#​40669). As a general rule of thumb, our CUDA version policy follows PyTorch's. We highly recommend to install vLLM with uv and use --torch-backend=cu129 if you are on CUDA 12.9.
  • PyTorch 2.11 upgrade (#​34644): vLLM ships on torch 2.11 for CUDA, and XPU is now also on torch 2.11 (#​37947) — XPU is no longer pinned to 2.10. This is a breaking change for environment dependency.
  • Python 3.14: Added to the supported Python version list (#​34770).
  • Transformers v5: vLLM now runs on HuggingFace transformers>=5 (#​30566), with vision-encoder torch.compile bypass (#​30518) and continued v4/v5 compat fixes including PaddleOCR-VL image processor max_pixels (#​38629), Mistral YaRN warning (#​37292), and Jina ColBERT rotary inv_freq recompute (#​39176).
  • New large models: Hunyuan v3 (Hy3) preview (#​40681) with HYV3 reasoning parser (#​40713); Granite 4.1 Vision as a built-in multimodal model (#​40282).
  • FlashAttention 4 as default MLA prefill: FA4 re-enabled as the default MLA prefill backend (#​38819) with head-dim 512 and paged-KV support on SM90+ (#​38835), plus an upstream FA4 sync (#​38690).
  • TurboQuant 2-bit KV cache: New attention backend delivering 2-bit KV cache compression with 4× capacity (#​38479), now with FA3/FA4 prefill support (#​40092).
  • Online quantization frontend: New end-to-end online quantization frontend (#​38138), with docs (#​39736); experts_int8 consolidated into the FP8 online path (#​38463); MXFP8 online quant moved to the new frontend (#​40152).
  • vLLM IR: Initial IR skeleton with rms_norm op (#​33825), OOT-platform kernel imports (#​38807), gemma_rms_norm reworked on IR (#​39014), and IR op testing/benchmarking infra added (#​40167) — foundation for future kernel work.
  • Model Runner V2 advances: Eagle prefill full-CUDA-graph (#​37588), auto-resolve cudagraph mode/sizes from attention backend (#​32936), fused probabilistic rejection sample kernels (#​38496), config validation for unsupported features (#​38758), piecewise-fallback disabled for eagle draft decodes (#​39773), multiple prompt-logprobs support (#​39937), prefill warmup coverage (#​40746), and a fix for accuracy regression caused by stale sampled/draft tokens (#​39833).
  • MoE refactor series: Unquantized migrated to Full Oracle Flow (#​36286), CT W8A8 to Oracle (#​39187), SharedExperts class (#​35153), SharedFusedMoE removed (#​35782), DefaultMoERunner split (#​35326) and later combined back into MoERunnerBase (#​40560), shared/fused expert output sum moved into MoERunnerBase (#​35949), ZeroExpertFusedMoE in new framework (#​35549), compressed_tensors_moe.py split (#​38960), GPTQMarlinMoEMethod reworked with MK (#​37990), XPU & CUTLASS MoE relocated to fused_moe/experts/ (#​40568, #​40574), make_expert_params_mapping renamed (#​40671), MoE LoRA refactor (#​40338), and MoE DP chunking removed (#​39107).
  • Performance: Optimize batch invariant with fused rms norm — 2.1% E2E latency improvement (#​40413); avoid seq_lens_cpu GPU→CPU sync (#​40654); cache InductorPass.hash_source (#​39328); skip FX-graph deserialization on loading for faster warm compile (#​40151); CUDAGraph memory profiling enabled by default for clearer startup memory accounting (#​38284).
Model Support
Engine Core
  • Model Runner V2: Full CUDA graph for eagle prefill (#​37588), auto cudagraph mode/sizes based on attention backend (#​32936), fused probabilistic rejection-sample kernels (#​38496), config validation (#​38758), eagle-draft piecewise fallback disabled (#​39773), multiple prompt logprobs (#​39937), prefill warmup coverage (#​40746), stale sampled/draft tokens accuracy fix (#​39833).
  • vLLM IR: IR skeleton + rms_norm (#​33825), OOT kernel import hooks (#​38807), gemma_rms_norm on IR (#​39014), IR op testing/benchmarking infra (#​40167).
  • torch.compile: Opaque Objects on torch 2.11 (#​39286), AOT compile with batch-invariance mode (#​39201), Inductor cache nested under AOT dir (#​39718), split FX graph via codegen (#​38657), Inductor pre-grad passes re-enabled for torch≥2.12 (#​38944), strings in custom ops without compile regressions (#​38123), MLA + group FP8 fusion (#​38877), SiluMul activation+quant fusion refactor (#​39684), donate_graph_module=True for standalone_compile (#​39733), skip FX graph deserialization on loading (#​40151), include Inductor & functorch configs in compile-cache key (#​40627), respect TORCH_COMPILE_DISABLE at vLLM config level (#​40715), disable Sequence Parallelism for piecewise compilation (#​38373).
  • Attention: FA4 as default MLA prefill (#​38819), head-dim 512 + paged-KV on sm90+FA4 (#​38835), FA4 upstream sync (#​38690), full CUDA graph for FlexAttention (#​36298), FlexAttention non-causal support (#​40394), unified 2D/3D triton_unified_attention (#​40631), TRTLLM minimax_allreduce_rms ported (#​37045), concat_mla_q half-types only (#​37892), batch-invariance-aware backend auto-selection (#​40193), avoid seq_lens_cpu GPU→CPU sync (#​40654).
  • Helion kernels: torch.compile support for Helion kernels (#​38592).
  • HMA / KV offload: GPU-side KV events for HMA (#​37688), group block hashes/IDs tracked (#​37109), unified memory layout for offloading workers (#​37206), shutdown() on OffloadingConnector (#​39182), request context passed through KV offload (#​39185), sliding-window lookup (#​36645), multi-group worker transfer (#​38453), multi-KV-group lookup/load/store (#​39401, #​39402, #​39403).
  • Features: NUMA binding for GPU workers (#​38635), opt-in VLLM_MEDIA_CACHE media URL caching (#​37123), safe request abort when FSM fails to advance (#​38663), KV connector prioritized over internal registry (#​38301), CUDAGraph memory profiling on by default (#​38284), shared-expert overlap restored (#​39222), CONFIG_REGISTRY config-class lookup fix when on-disk model_type differs (#​39554), workspace-resize GPU memory leak fix (#​39226), SWA/chunked-local runtime admission capped to startup pool-sizing bound (#​40946).
  • Pluggable layers: Applied to llm_head / vocab embedding (#​33465) and MoE layers (#​33556).
  • Mamba: Stochastic rounding (#​35753), different Conv state layouts (#​37416), FlashInfer selective_state_update (#​36162).
  • Metrics & scheduling: Labeled waiting-breakdown (capacity/deferred) metric (#​38435), API server handshake simplified (#​39364), mm-scheduler get_num_embed overhead reduced (#​40143), request_id on FinishedRequestStats (#​39710).
  • Executor: RayExecutorV2 introduced (#​36836); unified engine process monitoring with Ray backend (#​35862).
Hardware & Performance
  • NVIDIA: swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#​38325), MXFP4 W4A4 CUTLASS MoE for SM100 (#​37463), TRTLLM GEN NVFP4 MoE with non-512-aligned hidden dims via weight padding (#​39510), TRTLLM FP8 MoE with shuffled weights + BlockMajorK layout (#​38993), fused qknorm+rope kernel on SM9.0 (#​37376), tuned fused_moe config for RTX PRO 6000 Blackwell (#​39183), ViT full CUDA graph for Qwen3-VL video (#​38061), --enable-vit-cuda-graph for VLM examples (#​40580), default max_frames_per_batch auto-infer for ViT CG video (#​40445), fused FP8 output quantization into merge_attn_states (#​36518), batched KV-cache swap via cuMemcpyBatchAsync (#​38460), sm_110 (Jetson Thor) added to CUDA 13.0 build targets (#​39233).
  • AMD ROCm: ZenCPU / AMD Zen CPU backend via zentorch (#​39967), RDNA 3.5/4 device IDs (gfx1150/1151/1201) (#​38455), gfx1102/gfx1103 added (#​40037), MORI EP for unquantized MoE with AITER (#​37529), MoRI build with AMD AINIC stack (#​38371), MoRI-IO message format aligned with P2pNcclConnector and vllm-router (#​39565), MORI prefill/decode API correction (#​39835), AITER gemm w8a8 ptpc integration (#​33773), TritonW4A16LinearKernel (#​37352), asymmetric INT8 in TritonInt8ScaledMMLinearKernel (#​38501), fused_silu_mul_block_quant enabled (#​38817), KV-cache shuffle for paged_attention_common (#​32914), MLA decode output zero-fill removed in AITER (#​37539), MLA dual RMS norm fusion pass for DeepSeek/Kimi-K2 (#​39242, with older-AITer guard #​40386), AITER MLA + Eagle3 spec decode (#​39616), DFlash on ROCm (#​39703), wvSplitK FP8 path for RDNA (#​37712), GPU↔NUMA-node detection (#​40015), non-causal attention in ROCM_ATTN (#​40176), engine-shutdown GPU memory leak fix (#​38503), score-correction-bias dtype cast for DeepSeek/Kimi-K2 (#​39999).
  • Intel XPU: torch 2.11 upgrade for XPU (#​37947) — no longer pinned to 2.10, initial GDN attention for Qwen3-Next / Qwen3.5 (#​33657), torch.compile for XPU GDN attention (#​39466), XPU MXFP8 quant op (#​38682), XPU MXFP4 quant op (#​39857), per-channel FP8 linear (#​38316), FP8 KV cache on XPU (#​37731), round_int8 for Intel Triton (#​38825), MoE Triton in online FP8 quantization fix (#​40109), current_platform.supports_fp8() updated for TritonExperts (#​40132), NIXL import on XPU fix (#​40430), fusion-pattern support disabled on XPU ([#​39789](https://re

Note

PR body was truncated to here.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • ""
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate[bot] force-pushed the renovate/pypi-vllm-vulnerability branch from 813413e to 0cbf292 on May 7, 2026 at 10:37