
Yes. MoE models typically use a different set of experts at each token. So while the compute is similar to a dense model the size of the "active" parameters, the VRAM requirements are those of the total parameter count. You could technically run inference and swap expert weights in and out of VRAM on demand, but the latency would be pretty horrendous. A toy sketch of the routing below.
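To make that concrete, here's a minimal sketch of per-token top-k routing (PyTorch; names and shapes are made up for illustration, not any particular model's implementation). Every expert has to be resident in memory because each token picks its own subset:

    import torch

    n_experts, top_k, d_model = 8, 2, 16
    # All experts must stay loaded, even though each token uses only top_k of them.
    experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
    router = torch.nn.Linear(d_model, n_experts)

    def moe_layer(x):  # x: (n_tokens, d_model)
        gates = torch.softmax(router(x), dim=-1)
        weights, idx = torch.topk(gates, top_k)  # each token selects its own experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * experts[int(e)](x[t])  # compute runs on top_k experts only
        return out

    print(moe_layer(torch.randn(4, d_model)).shape)  # torch.Size([4, 16])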


I think prompt processing also needs all the weights, since each token in the prompt can route to a different set of experts.
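Rough arithmetic to illustrate the memory/compute gap, using Mixtral 8x7B's published figures (~46.7B total parameters, ~12.9B active per token) and assuming fp16 weights at 2 bytes/param:

    total_params, active_params = 46.7e9, 12.9e9
    bytes_per_param = 2  # fp16
    print(f"VRAM just for weights: {total_params * bytes_per_param / 1e9:.0f} GB")  # ~93 GB
    print(f"per-token compute ~= a dense {active_params / 1e9:.1f}B model")         # ~12.9B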



