Loading quantized text encoder to CPU #11

@dxqb

Is the issue I have noted here huggingface/diffusers#13837 (comment) correct:

GPT-OSS is a 20B model that comes pre-quantized to mxfp4 for most of its parameters. If you load it to CPU using the Microsoft/transformer upstream code, it gets automatically dequantized, using 40 GB of RAM - but then it also requires 40 GB of VRAM when you move it to GPU.
If you load it to GPU, it materializes on GPU as 10 GB of VRAM using the kernels library - but it cannot be moved to CPU then.
It doesn't seem to be possible to load the model quantized to CPU, and move it on demand.
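The 40 GB and 10 GB figures above can be roughly reproduced with back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: it treats all 20B parameters as quantized (the comment says "most"), assumes the dequantized copy is stored at 16 bits per parameter, and assumes mxfp4 costs 4 bits per element plus one 8-bit scale shared by each block of 32 elements. The function name is made up for illustration.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); ignores activations and buffers."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # "20B model" from the comment above

# Assumption: dequantized weights are held in a 16-bit dtype such as bf16.
BITS_DEQUANTIZED = 16

# Assumption: mxfp4 = 4-bit elements plus one 8-bit scale per block of 32 elements.
BITS_MXFP4 = 4 + 8 / 32

dequantized_gb = weight_footprint_gb(NUM_PARAMS, BITS_DEQUANTIZED)
quantized_gb = weight_footprint_gb(NUM_PARAMS, BITS_MXFP4)

print(f"dequantized: ~{dequantized_gb:.1f} GB, mxfp4: ~{quantized_gb:.1f} GB")
```

The real numbers will differ somewhat, since not every parameter is quantized, but the roughly 4x gap is what makes the CPU-dequantized path so costly.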
How do you handle this in this PR? diffusers might need some new infrastructure for this case; otherwise the VRAM-saving optimizations that diffusers has will fail.
I have settled for on-demand loading and discard-after-use for now, to avoid the 40 GB of RAM.
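For readers unfamiliar with the pattern, "on-demand loading and discard-after-use" can be sketched roughly as below. This is not the code from the PR: `run_and_discard`, `load_fn` and `use_fn` are hypothetical names, and `load_fn` stands in for whatever call loads the quantized text encoder directly onto the GPU.

```python
import gc
from typing import Callable, TypeVar

M = TypeVar("M")
R = TypeVar("R")


def run_and_discard(load_fn: Callable[[], M], use_fn: Callable[[M], R]) -> R:
    """Load a model, use it once, then drop it so its memory can be reclaimed.

    load_fn: hypothetical loader, e.g. one that places the model straight on GPU.
    use_fn:  hypothetical consumer, e.g. one that computes text embeddings.
    """
    model = load_fn()
    try:
        return use_fn(model)
    finally:
        # Drop the only reference held here, then collect any reference cycles.
        del model
        gc.collect()
        try:
            import torch  # optional: only needed to release cached GPU memory

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass
```

The result of `use_fn` must not keep a reference to the model (for example, return detached embeddings only), otherwise the weights stay alive. The trade-off is that the model is reloaded from disk on every use, which is the cost that a quantized CPU copy would avoid.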

Would it be possible to load the quantized text model to CPU and just move it to GPU and back?
Many software packages do or allow this with other models on consumer hardware, including Comfy, diffusers and OneTrainer.
