AV latent support for LTXVLoopingSampler and LTXVExtendSampler by jjdejong · Pull Request #472 · Lightricks/ComfyUI-LTXVideo

Jean J. de Jong (jjdejong) · 2026-04-25T08:37:51Z

Summary

Add audio-visual latent support to LTXVLoopingSampler and LTXVExtendSampler (and the LTXVBaseSampler / LTXVInContextSampler building blocks), removing the ValueError guard that previously rejected AV latents.
Audio is carried jointly with video through the temporal tiling loop — separated on input, sliced/extended per tile alongside the video overlap geometry, accumulated across tiles, and reassembled into an AV NestedTensor on output.
For stage-2 refinement (low-sigma upscale), the input audio data initializes each tile's audio frames instead of zeros, so lipsync and AV coherence are refined jointly with video — matching the behavior of the single-pass SamplerCustomAdvanced workflow.
Adds example workflows + docs for two-pass I2V looping (single-tile and 30s 3-tile variants with MultiPromptProvider) and a V2V detailer doc.
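The accumulate-and-reassemble step described in the summary can be sketched with toy data. This is a minimal illustration, not the sampler's code: `accumulate_tiles` is a hypothetical helper, plain lists stand in for latent tensors, a dict stands in for the AV NestedTensor, and it assumes the simplest policy of dropping the overlapped head of every tile after the first (the actual sampler may blend the overlap instead).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def accumulate_tiles(tiles: Sequence[Sequence[T]], overlap: int) -> List[T]:
    """Concatenate temporal tiles, dropping the overlapped head of each later tile.

    Hypothetical helper: `overlap` is counted in frames of the modality
    being accumulated (video and audio use different counts).
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    out: List[T] = []
    for index, tile in enumerate(tiles):
        if index == 0:
            out.extend(tile)  # the first tile is kept whole
            continue
        if len(tile) < overlap:
            raise ValueError("tile is shorter than the overlap")
        out.extend(tile[overlap:])  # skip frames already present in `out`
    return out


# Toy "latents": each item stands in for one latent frame.
video_tiles = [["v0", "v1", "v2"], ["v2", "v3", "v4"]]
audio_tiles = [
    ["a0", "a1", "a2", "a3", "a4", "a5"],
    ["a4", "a5", "a6", "a7", "a8", "a9"],
]

video = accumulate_tiles(video_tiles, overlap=1)
audio = accumulate_tiles(audio_tiles, overlap=2)

# Stand-in for the AV NestedTensor produced on output.
av_output = {"video": video, "audio": audio}
```

Each modality is accumulated with its own overlap count because the two latent streams do not share a frame rate.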

Implementation notes

New helpers _make_av_latent_dict() / _split_av_latent_dict() in easy_samplers.py handle NestedTensor packing/unpacking with proper noise-mask propagation for both modalities.
Audio temporal compression differs from video; frame_overlap (expressed in video latent frames) is converted via the audio VAE stride before audio slicing.
Backward compatible: with no audio component in the input latent, both samplers behave exactly as before.
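The overlap conversion in the notes above might look roughly like the following. It is a sketch under stated assumptions: `AUDIO_PER_VIDEO` is a placeholder ratio chosen for illustration (the real value depends on the audio VAE stride, which the PR does not quote), and both function names are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence

# Placeholder ratio (audio latent frames per video latent frame).
# Assumption for illustration only; not the real audio VAE stride.
AUDIO_PER_VIDEO = Fraction(10, 3)


def video_to_audio_frames(
    video_latent_frames: int, ratio: Fraction = AUDIO_PER_VIDEO
) -> int:
    """Convert a span in video latent frames to the nearest audio frame count."""
    if video_latent_frames < 0:
        raise ValueError("video_latent_frames must be non-negative")
    return round(video_latent_frames * ratio)


def audio_overlap_tail(
    audio_frames: Sequence[float],
    frame_overlap: int,
    ratio: Fraction = AUDIO_PER_VIDEO,
) -> List[float]:
    """Return the trailing audio frames covering `frame_overlap` video frames."""
    count = min(video_to_audio_frames(frame_overlap, ratio), len(audio_frames))
    # Index from the end explicitly so that count == 0 yields an empty list.
    return list(audio_frames[len(audio_frames) - count:])
```

Using an exact `Fraction` keeps the conversion free of floating-point drift when the ratio is not an integer.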

Motivation

Previously, the looping and extend samplers raised a ValueError ("LoopingSampler currently does not support Audio Visual latents."), forcing users to generate audio in a separate low-sigma pass on top of a video-only result. That workaround produces inferior lipsync because audio is never refined jointly with video across temporal tiles. This change makes joint AV generation possible for long-form clips.

Testing status

✅ LTXVLoopingSampler AV path — confirmed working end-to-end. Audio is generated jointly with video and stays synchronised across tile boundaries; verified with the included two-pass workflow.
⚠️ LTXVExtendSampler AV path — implemented but not fully validated. The extend pass runs without errors when handed a source AV latent, but joint AV continuity across the extend boundary has not been rigorously compared against the workaround pipeline.
⚠️ optional_negative_index_strength — wired through all samplers but not extensively tested. The default (1.0) preserves prior behavior; intermediate values to soften reference-image influence have not been validated on a reference set.

Test plan

Run LTX-2.3_Two_Pass_I2V_Looping.json with an AV latent and confirm decoded audio is in sync with video.
Run LTX-2.3_Two_Pass_I2V_Looping_30s.json (3 tiles) and confirm audio continuity across both tile boundaries.
Run a video-extension workflow with a source AV latent through LTXVExtendSampler and confirm audio continuation is coherent across the overlap region.
Sweep optional_negative_index_strength across {0.0, 0.5, 1.0} on a reference-conditioned generation to confirm monotonic influence reduction.
Confirm video-only inputs still produce identical output (no regression for the non-AV path).
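The monotonicity check in the sweep step could be automated along these lines. This is only a sketch: the PR does not define how reference-image influence is measured, so the influence score is a hypothetical metric supplied by the tester, and `influence_is_monotonic` is an illustrative name.

```python
from typing import Mapping


def influence_is_monotonic(
    influence_by_strength: Mapping[float, float], tolerance: float = 0.0
) -> bool:
    """Return True if influence never drops as strength increases.

    `influence_by_strength` maps each swept strength to a hypothetical
    influence score (higher means closer to the reference image).
    """
    strengths = sorted(influence_by_strength)
    scores = [influence_by_strength[s] for s in strengths]
    # Compare each score with its successor, allowing a small tolerance.
    return all(b >= a - tolerance for a, b in zip(scores, scores[1:]))
```

A nonzero `tolerance` may be useful if the chosen metric is noisy between otherwise identical runs.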

🤖 Generated with Claude Code

…ampler

The looping and extend samplers currently reject AV latents with a ValueError, forcing users to generate audio in a separate low-sigma pass. This produces inferior results because audio is not refined jointly with video across temporal tiles.

This change removes the AV rejection guard and carries audio latents through the temporal tiling loop alongside video:

- LTXVLoopingSampler: separates AV input into video + audio, passes audio slices to each tile's sampler, accumulates audio output across tiles, and reassembles the AV NestedTensor on output.
- LTXVExtendSampler: accepts audio tile data, computes audio overlap and new-frame geometry matching the video tile structure, creates proper audio noise masks, and wraps/unwraps AV latents around all SamplerCustomAdvanced calls.
- LTXVBaseSampler / LTXVInContextSampler: accept optional audio tile, wrap into AV latent before sampling, split on output.

For stage-2 refinement (low-sigma upscale pass), the input audio data is used to initialize each tile's audio frames instead of zeros, enabling the model to refine lipsync and audio-visual coherence at higher resolution, matching the behavior of the standard two-stage workflow using SamplerCustomAdvanced directly.

Helper functions _make_av_latent_dict() and _split_av_latent_dict() handle the NestedTensor packing/unpacking with proper noise mask propagation for both modalities.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
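The per-tile audio initialization and noise mask described in this commit message can be illustrated as follows. This is a toy sketch rather than the PR's implementation: `init_audio_tile` is a hypothetical name, floats stand in for latent frames, and the mask convention (0.0 keeps a frame, 1.0 marks it for denoising) is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def init_audio_tile(
    overlap: Sequence[float],
    num_new: int,
    source_new: Optional[Sequence[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Build one tile's audio frames plus a per-frame noise mask.

    overlap:    frames carried over from the previous tile (kept as-is).
    num_new:    number of new frames this tile must produce.
    source_new: input audio for the new frames (stage-2 refinement);
                when None, the new frames start from zeros.
    """
    if num_new < 0:
        raise ValueError("num_new must be non-negative")
    if source_new is None:
        new_frames = [0.0] * num_new  # first-pass generation
    else:
        if len(source_new) != num_new:
            raise ValueError("source_new must contain num_new frames")
        new_frames = list(source_new)  # refine existing audio
    frames = list(overlap) + new_frames
    # Assumed convention: 0.0 = keep the frame, 1.0 = denoise it.
    mask = [0.0] * len(overlap) + [1.0] * num_new
    return frames, mask
```

In a stage-2 pass, passing the input audio as `source_new` means the low-sigma sampler starts from existing audio rather than silence, which is the behavior the commit message describes.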

- Two-pass I2V looping workflow (single-tile and 30s 3-tile variants) with reference image conditioning at tile boundaries
- 30s variant adds MultiPromptProvider for per-tile prompt variation and RepeatImageBatch for guiding images at transitions
- V2V Detailer doc with Strix Halo OOM prevention and arbitrary-length video handling notes
- Python generator script for the two-pass workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pdates

- LTXVLoopingSampler: add optional `save_checkpoints` toggle (default off) that writes the accumulated latent to output/ltxv_looping_ckpt_v{v}_h{h}.safetensors after each temporal tile (atomic write, best-effort) so a mid-run crash leaves a decodable partial result on disk.
- stg.py: fix STG crash / attention-index miscount when cond-image guides (strength != 1.0) are combined with STG perturbation. Comfy core (CORE-166) splits guide-mask self-attention into sliced-query sub-calls; STG now detects them via low_precision_attention=False, collapses them to one logical index, and returns the matching value slice when skipping. (Also isolated on branch stg-guide-mask-fix for a standalone upstream PR.)
- example_workflows: update Two-Pass I2V looping workflow, notes, and generator.
- .gitignore: ignore macOS junk (.DS_Store, ._*) and local AI-context files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
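One common way to get an "atomic write, best-effort" checkpoint is to write to a temporary file in the target directory and then rename it over the final path. The sketch below shows that pattern with the standard library only; it is an assumption about the general technique, not the PR's code. `write_checkpoint_atomic` is a hypothetical name, and it takes already-serialized bytes instead of calling a safetensors writer.

```python
import os
import tempfile


def write_checkpoint_atomic(payload: bytes, path: str) -> bool:
    """Write `payload` to `path` atomically; return False instead of raising.

    The temp file lives in the same directory as `path` so the final
    os.replace() is a same-filesystem rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = None
    try:
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # readers see the old or the new file
        return True
    except OSError:
        # Best-effort: a failed checkpoint must not abort the sampling run.
        if tmp_path is not None and os.path.exists(tmp_path):
            try:
                os.remove(tmp_path)
            except OSError:
                pass
        return False
```

Because the rename happens only after the data is flushed, a crash mid-write leaves the previous checkpoint intact rather than a truncated file.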

Jean J. de Jong (jjdejong) · 2026-06-02T09:20:51Z

Friendly ping on this PR 🙂 — it's been open since late April and I'd love to get it reviewed whenever someone has bandwidth. Happy to address any changes, and if there's a preferred contribution/review process I should follow, just let me know.

One scope heads-up: this branch also includes a small, general STG bug fix — a crash and an attention-index miscount that occur when cond-image guides with strength != 1.0 are combined with STG perturbation (triggered by the guide-mask self-attention split introduced in CORE-166). Since that fix isn't AV-specific and helps any STG + cond-image workflow, I've extracted it into a standalone PR for easier, independent review: #503. Happy to drop it from this PR if you'd prefer to keep the two from overlapping.

Thanks!

Jean J. de Jong (jjdejong) and others added 7 commits April 9, 2026 12:34

Reformatted the json

77006fd

Sync of my fork with official branch.

db13ed4

Merge remote-tracking branch 'upstream/master' into av-looping-sampler

397274e

Merge remote-tracking branch 'upstream/master' into av-looping-sampler

adfe337

Jean J. de Jong (jjdejong) wants to merge 7 commits into Lightricks:master from jjdejong:av-looping-sampler
