Add ernie image #13432

Open
HsiaWinter wants to merge 12 commits into huggingface:main from HsiaWinter:add-ernie-image

Conversation

@HsiaWinter

What does this PR do?

We have introduced a new text-to-image model called ERNIE-Image, which will soon be open-sourced to the community. This PR includes the model architecture definition, the pipeline, as well as the related documentation and test files.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions github-actions bot added documentation Improvements or additions to documentation models tests utils pipelines size/L PR with diff > 200 LOC labels Apr 8, 2026
Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks for the PR! I left some feedback.

```python
    def __init__(self, hidden_size: int, num_heads: int, ffn_hidden_size: int, eps: float = 1e-6, qk_layernorm: bool = True):
        super().__init__()
        self.adaLN_sa_ln = RMSNorm(hidden_size, eps=eps)
        self.self_attention = Attention(
```
Collaborator

Author

ok, I recreated a custom attention class

Author

fix

```python
        return x.reshape(B, D, Hp * Wp).transpose(1, 2).contiguous()


class TimestepEmbedding(nn.Module):
```
Collaborator

Author

@HsiaWinter HsiaWinter Apr 8, 2026

fix


```python
        return ErnieImageTransformer2DModelOutput(sample=output) if return_dict else (output,)

    def _pad_text(self, text_hiddens: List[torch.Tensor], device: torch.device, dtype: torch.dtype):
```
Collaborator

Ohh, we are padding the text embeddings here. Is it possible to move this outside of the model, into the pipeline? E.g. you could pass image_ids, text_ids and text_seq_lens instead.
I think it would affect torch.compile too if we pad the text embeddings inside the transformer.
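For illustration, padding in the pipeline rather than the model could look roughly like this. A minimal pure-Python sketch: plain nested lists stand in for torch.Tensors, and `pad_text_embeddings`/`pad_value` are hypothetical names, not the PR's actual code.

```python
# Sketch only: the pipeline pads variable-length text embeddings itself and
# hands the transformer a fixed-shape batch plus the true lengths, so the
# model's forward stays static-shaped (and friendlier to torch.compile).
def pad_text_embeddings(text_hiddens, pad_value=0.0):
    """Right-pad a batch of [seq_len][dim] embeddings to a common length.

    Returns (padded_batch, text_seq_lens); the transformer consumes the
    padded batch and lengths instead of padding internally.
    """
    text_seq_lens = [len(seq) for seq in text_hiddens]
    max_len = max(text_seq_lens)
    dim = len(text_hiddens[0][0])
    pad_row = [pad_value] * dim
    padded = [seq + [pad_row] * (max_len - len(seq)) for seq in text_hiddens]
    return padded, text_seq_lens
```

With real tensors the same idea is usually expressed with `torch.nn.utils.rnn.pad_sequence` in the pipeline's prompt-encoding step.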

Author

fix

```python
        return out.float()


class EmbedND3(nn.Module):
```
Collaborator

Suggested change

```diff
- class EmbedND3(nn.Module):
+ class ErnieImageEmbedND3(nn.Module):
```

can we follow our naming conventions and add the ErnieImage prefix everywhere?
Author

fix

```python
        self.vae_scale_factor = 16  # VAE downsample factor

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
```
Collaborator

why did you write a custom from_pretrained method here? is there any reason you could not use the from_pretrained inherited from DiffusionPipeline?

Author

fix

Comment on lines +186 to +189

```python
if hasattr(self.pe, "_hf_hook") and hasattr(self.pe._hf_hook, "execution_device"):
    pe_device = self.pe._hf_hook.execution_device
else:
    pe_device = device
```
Collaborator

Suggested change

```diff
- if hasattr(self.pe, "_hf_hook") and hasattr(self.pe._hf_hook, "execution_device"):
-     pe_device = self.pe._hf_hook.execution_device
- else:
-     pe_device = device
+ pe_device = device or self._execution_device
```

this is basically self._execution_device, no? https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_utils.py#L1136
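For context, the inherited property follows roughly this pattern. This is a simplified sketch, not the actual diffusers implementation linked above (which also walks the pipeline's model components); `PipelineDeviceSketch` is an illustrative name.

```python
# Simplified sketch of the _execution_device idea from DiffusionPipeline:
# prefer the accelerate hook's execution device when offloading is active,
# otherwise fall back to the pipeline's own device.
class PipelineDeviceSketch:
    def __init__(self, device="cpu"):
        self.device = device
        self._hf_hook = None  # set by accelerate when model offloading is enabled

    @property
    def _execution_device(self):
        hook = getattr(self, "_hf_hook", None)
        if hook is not None and getattr(hook, "execution_device", None) is not None:
            return hook.execution_device
        return self.device
```

With such a property available, the hasattr dance above collapses to the single `pe_device = device or self._execution_device` line the reviewer suggests.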

Author

fix

```python
        text_hiddens = self.encode_prompt(prompt, device, num_images_per_prompt)

        # CFG with negative prompt
        do_cfg = guidance_scale > 1.0
```
Collaborator

can we add a do_classifier_free_guidance property instead?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py#L590

Suggested change

```diff
- do_cfg = guidance_scale > 1.0
```
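The linked Flux2 pipeline exposes this as a property. A minimal sketch of the pattern (illustrative class name, not the actual ErnieImage or Flux2 code):

```python
# Sketch of the do_classifier_free_guidance property pattern used in
# diffusers pipelines: __call__ stores guidance_scale once, and the CFG
# flag is derived from it everywhere instead of a local do_cfg variable.
class CfgPipelineSketch:
    def __init__(self):
        self._guidance_scale = 1.0

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        # classifier-free guidance only kicks in above a scale of 1.0
        return self._guidance_scale > 1.0
```

The property keeps the condition in one place, so every branch in the denoising loop reads `self.do_classifier_free_guidance` consistently.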

Author

fix


```python
        # CFG with negative prompt
        do_cfg = guidance_scale > 1.0
        if do_cfg:
```
Collaborator

Suggested change

```diff
- if do_cfg:
+ if self.do_classifier_free_guidance:
```

Author

fix

@yiyixuxu yiyixuxu requested a review from dg845 April 8, 2026 09:02
@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 8, 2026
Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks! I left a few more comments.

```python
        return torch.stack([emb, emb], dim=-1).reshape(*emb.shape[:-1], -1)  # [B, S, 1, head_dim]


class PatchEmbedDynamic(nn.Module):
```
Collaborator

can we add a prefix to these names too?

Author

fix

```python
        return self.processor(self, hidden_states, encoder_hidden_states, attention_mask, image_rotary_emb, **kwargs)


class FeedForward(nn.Module):
```
Collaborator

Suggested change

```diff
- class FeedForward(nn.Module):
+ class ErnieImageFeedForward(nn.Module):
```

Author

fix


```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        B, D, Hp, Wp = x.shape
```
Collaborator

Suggested change

```diff
- B, D, Hp, Wp = x.shape
+ batch_size, dim, height, width = x.shape
```

Author

fix

```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        B, D, Hp, Wp = x.shape
        return x.reshape(B, D, Hp * Wp).transpose(1, 2).contiguous()
```
Collaborator

Suggested change

```diff
- return x.reshape(B, D, Hp * Wp).transpose(1, 2).contiguous()
+ return x.reshape(batch_size, dim, height * width).transpose(1, 2).contiguous()
```

we prefer to use more descriptive variable names

```python
        self,
        attn: Attention,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor | None = None,
```
Collaborator

Suggested change

```diff
- encoder_hidden_states: torch.Tensor | None = None,
```

it's not needed since we are single-stream here, no?

Author

fix

@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 9, 2026
Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks! I left two small comments.
Let's merge this soon.

```python
        sample = sample.to(self.time_embedding.linear_1.weight.dtype)
        c = self.time_embedding(sample)
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = [t.unsqueeze(0).expand(S, -1, -1).contiguous() for t in self.adaLN_modulation(c).chunk(6, dim=-1)]
        for layer in self.layers:
```
Collaborator


```python
@unittest.skipIf(
    IS_GITHUB_ACTIONS,
    reason="Skipping test-suite inside the CI because the model has `torch.empty()` inside of it during init and we don't have a clear way to override it in the modeling tests.",
```
Collaborator

Ohhh, I think we should not skip the test here.
Let's not have torch.empty() during init then? (I didn't find any torch.empty() there actually.)

Author

fix

@yiyixuxu
Collaborator

yiyixuxu commented Apr 9, 2026

@claude can you do a review here also? Please keep these 3 notes in mind as well during your review:

  1. Compare the Ernie model/pipeline to others like Qwen/Flux, and let us know if there are any significant inconsistencies you find.
  2. If you see any unused code paths, let us know.
  3. Look over the PR comments I made and check whether the same patterns we caught/fixed still exist elsewhere in the code.

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 10, 2026
