Loading quantized text encoder to CPU

Is the issue I noted in https://github.com/huggingface/diffusers/pull/13837#issuecomment-4614858152 correct?

> GPT-OSS is a 20B model that comes pre-quantized to mxfp4 for most of its parameters. If you load it to CPU using the Microsoft/transformer upstream code, it gets automatically dequantized, using 40 GB of RAM - but then it also requires 40 GB of VRAM when you move it to GPU.
> If you load it to GPU, it materializes on GPU as 10 GB of VRAM using the kernels library - but it cannot be moved to CPU then.
> It doesn't seem to be possible to load the model quantized to CPU, and move it on demand.
> How do you handle this in this PR? diffusers might need some new infrastructure for this case, otherwise the VRAM-saving optimizations that diffusers has will fail.
> I have settled for on-demand loading and discard-after-use for now, to avoid the 40 GB of RAM.
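
As a rough sanity check on the two figures in the quoted comment, here is a back-of-the-envelope estimate. It assumes the dequantized weights are stored as 16-bit values, and it treats every parameter as MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32), although only most of the parameters are quantized. The names are illustrative only.

```python
# Rough weight-memory estimate; ignores activations, buffers and
# the parameters that are not stored as MXFP4.

def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9          # "20B model", as stated in the comment
BF16_BITS = 16             # assumption: dequantized weights are 16-bit
MXFP4_BITS = 4 + 8 / 32    # 4-bit element + 8-bit scale shared by 32 elements

dequantized_gb = weight_gb(NUM_PARAMS, BF16_BITS)
quantized_gb = weight_gb(NUM_PARAMS, MXFP4_BITS)

print(f"dequantized: ~{dequantized_gb:.1f} GB")
print(f"quantized:   ~{quantized_gb:.1f} GB")
```

Under these assumptions the estimate lands close to the 40 GB and 10 GB observed, which is consistent with the CPU path holding fully dequantized weights.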

Would it be possible to load the quantized text model to CPU and just move it to GPU and back on demand?
Many software packages do this, or at least allow it, with other models on consumer hardware, including Comfy, diffusers and OneTrainer.
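
For context, the on-demand loading and discard-after-use workaround mentioned in the quoted comment looks roughly like the sketch below. It is a minimal illustration, not the code from the PR: `loaded_on_demand`, `load_fn`, `on_release` and `FakeEncoder` are hypothetical names, and a real `load_fn` would load the text encoder directly onto the GPU.

```python
import gc
from contextlib import contextmanager


@contextmanager
def loaded_on_demand(load_fn, on_release=None):
    """Load a model right before use and drop it right after.

    load_fn:    hypothetical callable that returns a freshly loaded model.
    on_release: optional hypothetical hook that runs after the model is dropped.
    """
    holder = [load_fn()]      # the only reference this helper keeps
    try:
        yield holder[0]
    finally:
        holder.clear()        # drop our reference to the model
        gc.collect()          # collect reference cycles promptly
        if on_release is not None:
            on_release()      # e.g. release cached GPU memory


class FakeEncoder:
    """Stand-in for the real text encoder, for illustration only."""

    def encode(self, prompt):
        return [float(len(prompt))]


with loaded_on_demand(FakeEncoder) as encoder:
    embedding = encoder.encode("a photo of a cat")
del encoder  # the caller must drop its own reference as well

print(embedding)
```

With PyTorch, `on_release` could be `torch.cuda.empty_cache` so that the freed VRAM is returned to the driver. The downside is that the model is reloaded from disk on every use, which is why being able to keep the quantized weights in RAM would be preferable.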

