[New Model] Add LensPipeline （microsoft/Lens） by RuixiangMa · Pull Request #13837 · huggingface/diffusers

RuixiangMa · 2026-05-29T20:05:57Z

What does this PR do?

Add support for Lens, text-to-image models from Microsoft.
Repo ID: microsoft/Lens

import torch
from diffusers import LensPipeline

# Load the pipeline in bfloat16 and move it to the GPU
pipe = LensPipeline.from_pretrained("microsoft/Lens", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Adjacent string literals are concatenated, so the prompt can span several lines
prompt = (
    "A steampunk floating sky-city built on massive gear-driven platforms, "
    "brass and copper towers connected by chain bridges, steam-powered airships and "
    "hot air balloons docking at various levels, sunset clouds below the city, detailed concept art"
)
# Fixed seed for reproducible output
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    prompt,
    height=1440,
    width=1440,
    num_inference_steps=20,
    guidance_scale=5.0,
    generator=generator,
).images[0]
image.save("lens.png")

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @yiyixuxu

Signed-off-by: Lancer <maruixiang6688@gmail.com>

dxqb · 2026-06-03T17:03:53Z

Thanks for the PR!
I'm not affiliated with diffusers, so the maintainers will have to review it, but I may have some useful comments and questions: I have implemented Lens for https://github.com/Nerogar/OneTrainer using the Microsoft upstream code. Once your PR is merged, I'd like to switch OneTrainer to diffusers as the upstream.

GPT-OSS, the text encoder, is a 20B model that ships pre-quantized to MXFP4 for most of its parameters. If you load it to CPU using the Microsoft/transformers upstream code, it is automatically dequantized and uses 40 GB of RAM, but then it also requires 40 GB of VRAM when you move it to GPU.
If you load it directly to GPU, it materializes there quantized, taking about 10 GB of VRAM via the kernels library, but it then cannot be moved to CPU.
It does not seem possible to load the model quantized on CPU and move it to GPU on demand.
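The 40 GB and roughly 10 GB figures follow from simple arithmetic. A small sketch, assuming all ~20B parameters are stored in a single format (in reality only most of them are MXFP4), that the dequantized dtype is bf16, and ignoring any runtime overhead; `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption for illustration: treat all ~20B parameters as stored in one format.
NUM_PARAMS = 20e9

# bf16 (assumed dequantized dtype): 16 bits per parameter.
bf16_gb = weight_memory_gb(NUM_PARAMS, 16)

# MXFP4 (OCP Microscaling): 4-bit FP4 elements in blocks of 32 that share one
# 8-bit scale, i.e. 4 + 8/32 = 4.25 bits per parameter.
mxfp4_gb = weight_memory_gb(NUM_PARAMS, 4 + 8 / 32)

print(f"bf16:  {bf16_gb:.1f} GB")
print(f"mxfp4: {mxfp4_gb:.1f} GB")
```

The roughly 4x gap is why dequantizing on CPU and then moving to GPU costs so much more VRAM than materializing the quantized weights on GPU directly.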
How does this PR handle this? diffusers might need new infrastructure for this case; otherwise its VRAM-saving optimizations will fail.
For now, I have settled on loading the text encoder on demand and discarding it after use, to avoid the 40 GB of RAM.
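A minimal sketch of the "load on demand, discard after use" pattern in plain PyTorch. `encode_then_discard` and the `load_encoder` factory are hypothetical names, not OneTrainer or diffusers APIs; in practice the factory would be whatever builds the quantized encoder directly on the GPU, and a toy `nn.Linear` stands in for it here:

```python
import gc
import weakref
from typing import Callable

import torch
from torch import nn


def encode_then_discard(load_encoder: Callable[[], nn.Module], inputs: torch.Tensor) -> torch.Tensor:
    """Build an encoder, run it once, and release it before returning (hypothetical helper)."""
    encoder = load_encoder()
    with torch.no_grad():
        # Under no_grad the result carries no autograd graph, so it holds
        # no references to the encoder's weights.
        embeddings = encoder(inputs)
    # Drop the only reference to the encoder, then return cached GPU memory.
    del encoder
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return embeddings


# Usage with a toy encoder; the weak reference lets us check that it was freed.
refs = []


def load_toy_encoder() -> nn.Module:
    encoder = nn.Linear(4, 2)
    refs.append(weakref.ref(encoder))
    return encoder


embeddings = encode_then_discard(load_toy_encoder, torch.randn(3, 4))
print(embeddings.shape, "encoder freed:", refs[0]() is None)
```

The trade-off is paying the load time on every encode call in exchange for never keeping the encoder resident in RAM or VRAM.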
I think your PR does this correctly, but it's something to watch out for:
Lens uses a few selected hidden states of the text encoder, one of which is from the last layer. The Microsoft upstream extends the text encoder to capture them. At first I wondered why they don't just use output_hidden_states=True, but it is not the same thing: with output_hidden_states, the entry for the last layer is the final model output, which has already gone through the final norm layer, whereas the upstream text encoder takes the last layer's output before that norm.
I'm not sure whether transformers intends or documents output_hidden_states to behave like that, but using it as it currently behaves produces an embedding mismatch.
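To make the mismatch concrete, here is a small self-contained PyTorch sketch. `ToyDecoder` is a hypothetical stand-in for the text encoder, written under the assumption described above: the last entry of its `hidden_states` is the final-norm output, while the other entries are raw layer outputs. A forward hook on the last layer recovers the pre-norm tensor that the upstream encoder uses:

```python
import torch
from torch import nn


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for a decoder-only text encoder.

    hidden_states holds the input followed by each layer's output, except that
    the last entry is replaced by the output of the final norm.
    """

    def __init__(self, dim: int = 8, num_layers: int = 3) -> None:
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor):
        hidden_states = [x]
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        x = self.norm(x)
        # The final entry is the *normed* output, not the raw last-layer output.
        hidden_states[-1] = x
        return x, tuple(hidden_states)


torch.manual_seed(0)
model = ToyDecoder()

# Capture the raw (pre-norm) output of the last layer with a forward hook.
captured = {}


def save_output(module, args, output):
    captured["last_layer"] = output


handle = model.layers[-1].register_forward_hook(save_output)
with torch.no_grad():
    final_output, hidden_states = model(torch.randn(2, 8))
handle.remove()

pre_norm = captured["last_layer"]
post_norm = hidden_states[-1]
print("hidden_states[-1] equals pre-norm output:", torch.allclose(post_norm, pre_norm))
print("hidden_states[-1] equals norm(pre-norm):", torch.allclose(post_norm, model.norm(pre_norm)))
```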

RuixiangMa · 2026-06-03T17:48:48Z


Thanks, this is very helpful. This PR does not add any special handling beyond the current diffusers loading/device-placement behavior, so I agree this is something to watch out for with CPU offload and other VRAM-saving paths.

For the hidden-state selection: this PR intentionally does not use output_hidden_states=True. Instead it uses LensGptOssEncoder to capture the selected decoder-layer outputs directly, before the final norm, to match the Microsoft upstream behavior.
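One generic way to capture several selected layer outputs before the final norm is to register forward hooks on those layers for a single forward pass. This is a sketch only: it is not how `LensGptOssEncoder` is implemented, and the helper `capture_layer_outputs`, the toy layer stack, and the selected indices are all hypothetical. How Lens combines the selected states is not shown:

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Sequence

import torch
from torch import nn


@contextmanager
def capture_layer_outputs(
    layers: nn.ModuleList, indices: Sequence[int]
) -> Iterator[Dict[int, torch.Tensor]]:
    """Record the raw outputs of layers[i] for each i in indices (hypothetical helper)."""
    outputs: Dict[int, torch.Tensor] = {}
    handles = []
    for i in indices:
        def hook(module, args, output, i=i):
            # Some decoder layers return a tuple; keep only the hidden states.
            outputs[i] = output[0] if isinstance(output, tuple) else output

        handles.append(layers[i].register_forward_hook(hook))
    try:
        yield outputs
    finally:
        # Always remove the hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()


# Toy model: a stack of layers followed by a final norm, as in a decoder LM.
torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
final_norm = nn.LayerNorm(8)

selected = [1, 3]  # hypothetical choice; index 3 is the last layer
with torch.no_grad(), capture_layer_outputs(layers, selected) as outs:
    x = torch.randn(2, 8)
    for layer in layers:
        x = layer(x)
    x = final_norm(x)

# Pre-norm hidden states of the selected layers, in the selected order.
selected_states = [outs[i] for i in selected]
```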

[Feat] Add LensPipeline

ecb3c4e

Signed-off-by: Lancer <maruixiang6688@gmail.com>

github-actions bot added the models, pipelines, and size/L (PR with diff > 200 LOC) labels May 29, 2026

RuixiangMa changed the title ~~[Feat] Add LensPipeline~~ [Feat] Add LensPipeline （microsoft/Lens） May 29, 2026

upd

d004842

github-actions bot added the documentation (Improvements or additions to documentation) and tests labels May 30, 2026

RuixiangMa changed the title ~~[Feat] Add LensPipeline （microsoft/Lens）~~ [New Model] Add LensPipeline （microsoft/Lens） May 31, 2026

dxqb mentioned this pull request Jun 3, 2026

Loading quantized text encoder to CPU microsoft/Lens#11

Open


RuixiangMa wants to merge 2 commits into huggingface:main from RuixiangMa:LensPipeline
