[New Model] Add LensPipeline (microsoft/Lens)#13837

Open
RuixiangMa wants to merge 2 commits into huggingface:main from RuixiangMa:LensPipeline

@RuixiangMa RuixiangMa commented May 29, 2026

What does this PR do?

Add support for Lens, a family of text-to-image models from Microsoft.
Repo ID: microsoft/Lens

import torch
from diffusers import LensPipeline

pipe = LensPipeline.from_pretrained("microsoft/Lens", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Adjacent string literals are concatenated, so the prompt stays a single string.
prompt = (
    "A steampunk floating sky-city built on massive gear-driven platforms, "
    "brass and copper towers connected by chain bridges, steam-powered airships and "
    "hot air balloons docking at various levels, sunset clouds below the city, detailed concept art"
)
# Fixed seed so the result is reproducible.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, height=1440, width=1440, num_inference_steps=20, guidance_scale=5.0, generator=generator).images[0]
image.save("lens.png")

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @yiyixuxu

Signed-off-by: Lancer <maruixiang6688@gmail.com>
github-actions bot added the labels models, pipelines, and size/L (PR with diff > 200 LOC) on May 29, 2026
RuixiangMa changed the title from "[Feat] Add LensPipeline" to "[Feat] Add LensPipeline (microsoft/Lens)" on May 29, 2026
github-actions bot added the labels documentation (Improvements or additions to documentation) and tests on May 30, 2026
RuixiangMa changed the title from "[Feat] Add LensPipeline (microsoft/Lens)" to "[New Model] Add LensPipeline (microsoft/Lens)" on May 31, 2026

dxqb commented Jun 3, 2026

Thanks for the PR!
I am not associated with diffusers, and they will have to review it, but I may have some useful comments and questions because I have implemented Lens for https://github.com/Nerogar/OneTrainer using the Microsoft upstream. As soon as your PR is merged, I'd like to switch that implementation to diffusers as its upstream.

  • GPT-OSS is a 20B model that ships with most of its parameters pre-quantized to MXFP4. If you load it to CPU using the Microsoft/transformers upstream code, it is automatically dequantized and uses 40 GB of RAM, and it then also requires 40 GB of VRAM when you move it to the GPU.
    If you load it directly to the GPU, it materializes there in quantized form, using 10 GB of VRAM via the kernels library, but it then cannot be moved to the CPU.
    It does not seem to be possible to load the model to CPU in quantized form and move it to the GPU on demand.
    How do you handle this in this PR? diffusers might need some new infrastructure for this case, otherwise the VRAM-saving optimizations that diffusers provides will fail.
    For now I have settled on loading the text encoder on demand and discarding it after use, to avoid the 40 GB of RAM.

  • I think your PR does this correctly, but it's something to watch out for:
    Lens uses a few selected hidden states of the text encoder, one of which comes from the last layer. The Microsoft upstream extends the text encoder for that. At first I wondered why they don't just use output_hidden_states=True, but that is not the same thing. output_hidden_states does return the hidden states, but the last selected layer is the actual last layer of the model, and its output runs through the final norm layer before it is returned. With the upstream text encoder it does not run through the norm layer.
    I'm not sure whether output_hidden_states is intended or documented by transformers to work like that, but using it as it behaves now would produce an embedding mismatch.
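
As a rough illustration of the first point, the sketch below redoes the memory arithmetic with the round numbers quoted above and shows one way a load-on-demand, discard-after-use step could be structured. It is only a sketch under stated assumptions: `weight_footprint_gb` and `encode_then_discard` are hypothetical helpers, not part of diffusers, transformers, or OneTrainer, and the 4.25 bits per parameter figure assumes every weight is stored as MXFP4 (4-bit values plus one shared 8-bit scale per block of 32), which is not exactly true for the real checkpoint.

```python
import gc

GB = 1e9


def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (no activations)."""
    return num_params * bits_per_param / 8 / GB


# Round figure from the discussion; the real parameter count differs slightly.
NUM_PARAMS = 20e9
bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)           # dequantized weights
mxfp4_gb = weight_footprint_gb(NUM_PARAMS, 4 + 8 / 32)  # 4-bit values + block scales


def encode_then_discard(load_encoder, encode, prompt):
    """Load the text encoder, use it once, and drop it again.

    load_encoder and encode are hypothetical callables supplied by the
    caller; only the returned embeddings outlive this function.
    """
    encoder = load_encoder()
    try:
        return encode(encoder, prompt)
    finally:
        # Drop the only reference so the weights can be freed. With PyTorch
        # one would typically also call torch.cuda.empty_cache() here.
        del encoder
        gc.collect()


print(f"bf16:  ~{bf16_gb:.1f} GB")
print(f"mxfp4: ~{mxfp4_gb:.1f} GB")
```

The estimate lands close to the 40 GB and 10 GB figures reported above. The cost of this pattern is that the encoder is reloaded for every batch of prompts, which trades load time for peak memory.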

RuixiangMa (Contributor, Author) commented

Thanks, this is very helpful. This PR does not add any special handling beyond the current diffusers loading and device-placement behavior, so I agree this is something to watch for with CPU offload and other VRAM-saving paths.

For the hidden-state selection: this PR intentionally does not use output_hidden_states=True. Instead it uses LensGptOssEncoder to capture the selected decoder-layer outputs directly, before the final norm, to match the Microsoft upstream behavior.
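
To make the difference concrete, here is a toy sketch in plain Python. It is not the LensGptOssEncoder code and does not call transformers. It assumes the behavior described in this thread, namely that the last entry of the returned hidden states has already passed through the final norm, while capturing each layer's output directly yields the pre-norm value. The names `toy_decoder`, `rms_norm`, and `SELECTED` are made up for illustration.

```python
import math


def rms_norm(vec):
    """Stand-in for the model's final norm layer (no learned scale)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec))
    return [v / rms for v in vec]


def toy_decoder(vec, layers):
    """Run the layers and return both views of the intermediate states."""
    hidden_states = [vec]   # embeddings first, as output_hidden_states does
    layer_outputs = []      # raw output of each layer, captured directly
    for layer in layers:
        vec = layer(vec)
        layer_outputs.append(vec)
        hidden_states.append(vec)
    # Assumed behavior: the last returned hidden state is the normed output.
    hidden_states[-1] = rms_norm(vec)
    return hidden_states, layer_outputs


layers = [
    lambda v: [a + 1.0 for a in v],
    lambda v: [a * 2.0 for a in v],
    lambda v: [a - 0.5 for a in v],
]
SELECTED = (0, 2)  # hypothetical layer indices; the last one is the final layer

hidden_states, layer_outputs = toy_decoder([1.0, 3.0], layers)
from_hidden_states = [hidden_states[i + 1] for i in SELECTED]
from_capture = [layer_outputs[i] for i in SELECTED]

for i, a, b in zip(SELECTED, from_hidden_states, from_capture):
    print(f"layer {i}: match={a == b}")
```

In this toy, the intermediate layer agrees between the two views and only the final layer differs, which is the mismatch the direct capture is meant to avoid.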
