Is the issue I noted in huggingface/diffusers#13837 (comment) correct? Quoting it:
GPT-OSS is a 20B model that comes pre-quantized to mxfp4 for most of its parameters. If you load it to the CPU using the Microsoft/transformer upstream code, it is automatically dequantized and uses 40 GB of RAM, and it then also requires 40 GB of VRAM when you move it to the GPU.
If you load it directly to the GPU, it materializes in 10 GB of VRAM using the kernels library, but it then cannot be moved to the CPU.
It doesn't seem to be possible to load the model quantized to the CPU and move it to the GPU on demand.
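For reference, the 40 GB and 10 GB figures are roughly what simple arithmetic predicts. This is only a back-of-the-envelope sketch: it assumes the dequantized weights are stored at 16 bits per parameter (which the 40 GB figure implies), that all 20B parameters are mxfp4 (in reality only most of them are), and that mxfp4 costs 4 bits per value plus one shared 8-bit scale per block of 32 values. The function name `estimate_gb` is made up for illustration.

```python
def estimate_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory footprint in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 20e9  # "20B" from the comment above; the exact count differs

# Assumption: dequantized weights are held at 16 bits per parameter.
dequantized_gb = estimate_gb(NUM_PARAMS, 16)

# Assumption: mxfp4 = 4-bit values + one 8-bit scale shared by 32 values.
mxfp4_bits = 4 + 8 / 32  # 4.25 bits per parameter
quantized_gb = estimate_gb(NUM_PARAMS, mxfp4_bits)

print(f"dequantized: ~{dequantized_gb:.1f} GB, mxfp4: ~{quantized_gb:.1f} GB")
```

Activations, the parameters that are not quantized, and framework overhead are all ignored here, so real numbers will differ somewhat.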
How do you handle this in this PR? diffusers might need some new infrastructure for this case, otherwise the VRAM-saving optimizations that diffusers has will fail.
I have settled on on-demand loading and discard-after-use for now, to avoid the 40 GB of RAM.
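To make that workaround concrete, here is a minimal, framework-agnostic sketch of what I mean by on-demand loading and discard-after-use. It is not the code from the PR: `run_and_discard`, `load_fn` and `use_fn` are hypothetical names, and in practice `load_fn` would be whatever call loads the text model directly onto the GPU.

```python
import gc
from typing import Any, Callable

def run_and_discard(load_fn: Callable[[], Any],
                    use_fn: Callable[[Any], Any]) -> Any:
    """Load a model, use it once, then drop it so its memory can be freed."""
    model = load_fn()          # e.g. load the quantized model straight to GPU
    try:
        return use_fn(model)   # e.g. encode the prompts
    finally:
        del model              # drop our only reference to the model
        gc.collect()           # collect any reference cycles right away
        # With PyTorch one would typically also call torch.cuda.empty_cache()
        # here to return cached VRAM blocks to the driver.

# Usage with a stand-in object instead of a real model:
class FakeTextEncoder:
    def encode(self, prompt: str) -> int:
        return len(prompt)

result = run_and_discard(FakeTextEncoder, lambda m: m.encode("a photo of a cat"))
print(result)
```

The downside is that the model is reloaded from disk every time it is needed, which is exactly what CPU offloading would avoid.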
Would it be possible to load the quantized text model to CPU and just move it to GPU and back?
Many software packages do or allow this with other models on consumer hardware, including Comfy, diffusers, and OneTrainer.