Anshad Ameenza.

Can I run this LLM?

Pick a model and a quantization and see roughly how much GPU memory it needs, whether it fits your card, and how many cards you would need. Memory has to hold all the weights, even for a mixture-of-experts model, so the total parameter count is what matters here, not the active count.

Estimate only. Real usage varies with the runtime (llama.cpp, vLLM), batch size, and KV-cache settings. It is meant to get you in the right ballpark.

How the estimate works

The big number is the weights: total parameters times bytes per weight, which the quantization sets. FP16 is two bytes per weight and 4-bit is half a byte, so a 4-bit build is roughly a quarter the size of the FP16 one. On top of that sits the KV cache, which grows with your context length, plus a slice of runtime overhead. For a mixture-of-experts model like GLM-5.2, every expert still has to live in memory, so use the total parameter count, not the active count.
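To make the arithmetic concrete, here is a minimal Python sketch of that breakdown. It is not the calculator's actual code. The bytes-per-weight table uses nominal values, the KV-cache formula is the common keys-plus-values-per-layer approximation, the 10% overhead fraction is an assumed placeholder, and every function and parameter name is hypothetical.

```python
GB = 1e9  # decimal gigabytes

# Nominal bytes per weight. Real quantized files carry extra
# scale/zero-point metadata, so treat these as lower bounds.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weights_gb(total_params, quant):
    """Weights: total parameters times bytes per weight."""
    return total_params * BYTES_PER_WEIGHT[quant] / GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len,
                bytes_per_value=2.0, batch=1):
    """KV cache (assumed formula): one key and one value vector
    per layer, per KV head, per token in the context."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return values * bytes_per_value / GB


def estimate_gb(total_params, quant, n_layers, n_kv_heads, head_dim,
                context_len, overhead_fraction=0.10):
    """Sum the three parts. overhead_fraction is an assumption,
    not a measured value."""
    w = weights_gb(total_params, quant)
    kv = kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len)
    overhead = overhead_fraction * (w + kv)
    return {"weights": w, "kv_cache": kv, "overhead": overhead,
            "total": w + kv + overhead}


# Illustrative dense model: 70B parameters, 80 layers, 8 KV heads
# of dimension 128, 8192-token context, 4-bit weights.
est = estimate_gb(70e9, "q4", 80, 8, 128, 8192)
print({k: round(v, 2) for k, v in est.items()})
```

For a mixture-of-experts model, pass the total parameter count as `total_params`. The layer and head numbers come from the model's config file, and doubling `context_len` doubles the KV-cache term while leaving the weights untouched.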

If the total lands above your card's memory, you have three moves: drop to a smaller quant, shorten the context, or add GPUs. The self-host vs API calculator can tell you whether owning the hardware even beats paying per token for your volume.
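Two of those moves can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the tool's logic: the helper names are hypothetical, the quant table is nominal, and splitting a model across cards is treated as free, when in practice each GPU adds some overhead of its own.

```python
import math

GB = 1e9  # decimal gigabytes

# Ordered from highest to lowest precision (nominal bytes per weight).
QUANTS = [("fp16", 2.0), ("int8", 1.0), ("q4", 0.5)]


def gpus_needed(total_gb, card_gb):
    """Smallest number of identical cards whose combined memory
    covers the estimate (ignores per-GPU splitting overhead)."""
    return math.ceil(total_gb / card_gb)


def best_quant_that_fits(total_params, card_gb, extra_gb):
    """Return the highest-precision quant whose weights, plus
    extra_gb for KV cache and overhead, fit on one card.
    Returns None when even the smallest quant is too large."""
    for name, bytes_per_weight in QUANTS:
        if total_params * bytes_per_weight / GB + extra_gb <= card_gb:
            return name
    return None


# Illustrative 70B model with 6 GB reserved for KV cache and overhead.
print(best_quant_that_fits(70e9, card_gb=80, extra_gb=6.0))  # int8
print(best_quant_that_fits(70e9, card_gb=24, extra_gb=6.0))  # None
print(gpus_needed(41.0, card_gb=24))                         # 2
```

Shortening the context, the remaining move, shows up here as a smaller `extra_gb`.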