Can I run this LLM?
Pick a model and a quantization to see roughly how much GPU memory it needs, whether it fits your card, and how many cards you would need. Memory has to hold all the weights, even for a mixture-of-experts model, so the total parameter count is what matters here, not the active count.
Estimate only. Real usage varies with the runtime (llama.cpp, vLLM), batch size, and KV-cache settings. It is meant to get you in the right ballpark.
How the estimate works
The big number is the weights: total parameters times bytes per weight, which the quantization sets. FP16 is two bytes per weight and 4-bit is half a byte, so a 4-bit build is roughly a quarter the size of the FP16 one. On top of that sits the KV cache, which grows with your context length, plus a slice of runtime overhead. For a mixture-of-experts model like GLM-5.2, every expert still has to live in memory, so use the total parameter count, not the active count.
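The arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual code. The function names, the 8-bit entry, the KV-cache formula (one key and one value vector per layer per token, stored at 2 bytes each), and the 10% overhead figure are all assumptions made for the example. Real quantized files also carry some extra bits per weight for scales, so treat the result as a lower-end estimate.

```python
# Bytes per weight for each quantization. FP16 and 4-bit come from the text;
# the 8-bit entry is added for illustration.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

GB = 1e9  # decimal gigabytes


def weights_gb(total_params_billion, quant):
    """Weights = total parameters x bytes per weight."""
    return total_params_billion * 1e9 * BYTES_PER_WEIGHT[quant] / GB


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Assumed KV-cache size: a key and a value vector per layer per token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / GB


def estimate_gb(total_params_billion, quant, layers, kv_heads, head_dim,
                context_tokens, overhead_fraction=0.10):
    """Return the three components and their total, in GB.

    overhead_fraction is an assumed placeholder, not a measured value.
    """
    w = weights_gb(total_params_billion, quant)
    kv = kv_cache_gb(layers, kv_heads, head_dim, context_tokens)
    overhead = overhead_fraction * (w + kv)
    return {"weights": w, "kv_cache": kv, "overhead": overhead,
            "total": w + kv + overhead}


if __name__ == "__main__":
    # Hypothetical 70B-parameter dense model with made-up architecture numbers.
    result = estimate_gb(70, "q4", layers=80, kv_heads=8, head_dim=128,
                         context_tokens=8192)
    for name, value in result.items():
        print(f"{name}: {value:.1f} GB")
```

For a mixture-of-experts model, pass the total parameter count as `total_params_billion`, since every expert has to be resident in memory.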
If the total lands above your card's memory, you have three moves: drop to a smaller quant, shorten the context, or add GPUs. The self-host vs API calculator can tell you whether owning the hardware even beats paying per token for your volume.
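The fit check and the "add GPUs" move reduce to a comparison and a ceiling division. The sketch below uses hypothetical helper names and assumes the model splits evenly across identical cards with no extra per-card cost, which real multi-GPU runtimes do not guarantee.

```python
import math


def fits(total_gb, vram_per_gpu_gb, gpu_count=1):
    """True if the estimated total fits in the combined memory of the cards."""
    return total_gb <= vram_per_gpu_gb * gpu_count


def gpus_needed(total_gb, vram_per_gpu_gb):
    """Smallest number of identical cards whose combined memory covers the total.

    Assumes a perfectly even split with no duplicated buffers per card.
    """
    return max(1, math.ceil(total_gb / vram_per_gpu_gb))


if __name__ == "__main__":
    total = 41.5  # example estimate in GB, not a measured figure
    print(fits(total, 24))         # one 24 GB card
    print(gpus_needed(total, 24))  # how many 24 GB cards would cover it
```

If the count comes out higher than you want, rerunning the estimate with a smaller quant or a shorter context is usually the cheaper move to try first.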