Can I run this LLM?

Pick a model and a quantization to see roughly how much GPU memory it needs, whether it fits on your card, and how many cards you would need. Memory has to hold all the weights, even for a mixture-of-experts model, so the total parameter count is what matters here, not the active parameter count.

Estimate only. Real usage varies with the runtime (llama.cpp, vLLM), batch size, and KV-cache settings. It is meant to get you in the right ballpark.

Will it fit?

How the estimate works

The big number is the weights: total parameters times bytes per weight, which the quantization sets. FP16 is two bytes per weight and 4-bit is half a byte, so a 4-bit build is roughly a quarter the size of the FP16 one. On top of that sits the KV cache, which grows with your context length, plus a slice of runtime overhead. For a mixture-of-experts model like GLM-5.2, every expert still has to live in memory, so use the total parameter count, not the active count.
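The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the calculator's actual code: the function name, the 10% overhead fraction, the decimal-gigabyte unit, and the architecture inputs (`n_layers`, `n_kv_heads`, `head_dim`) are all assumptions. The KV-cache term uses the common per-sequence formula for standard attention (keys plus values, per layer, per KV head), which a given runtime may not match exactly.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes (assumption; some tools report GiB instead)


@dataclass
class Estimate:
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


def estimate_memory(total_params_b, bits_per_weight, context_tokens,
                    n_layers, n_kv_heads, head_dim,
                    kv_bytes=2, overhead_frac=0.10) -> Estimate:
    """Rough memory estimate for a single sequence.

    total_params_b  -- TOTAL parameters in billions (not active, even for MoE)
    bits_per_weight -- 16 for FP16, 8 for 8-bit, 4 for 4-bit
    kv_bytes        -- bytes per cached value (2 assumes an FP16 cache)
    overhead_frac   -- hypothetical runtime overhead as a share of the rest
    """
    # Weights: parameters times bytes per weight.
    weights = total_params_b * 1e9 * bits_per_weight / 8
    # KV cache: 2 tensors (K and V) per layer, per KV head, per token.
    kv = 2 * n_layers * n_kv_heads * head_dim * context_tokens * kv_bytes
    overhead = overhead_frac * (weights + kv)
    return Estimate(weights / GB, kv / GB, overhead / GB)


if __name__ == "__main__":
    # Illustrative shape only: 70B parameters, 80 layers, 8 KV heads, head_dim 128.
    est = estimate_memory(70, 4, 8192, n_layers=80, n_kv_heads=8, head_dim=128)
    print(f"Weights:  {est.weights_gb:.1f} GB")
    print(f"KV cache: {est.kv_cache_gb:.1f} GB")
    print(f"Overhead: {est.overhead_gb:.1f} GB")
    print(f"Total:    {est.total_gb:.1f} GB")
```

Switching `bits_per_weight` from 4 to 16 multiplies the weights term by four and leaves the KV-cache term unchanged, which is why quantization is usually the biggest lever.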

If the total exceeds your card's memory, you have three moves: drop to a smaller quantization, shorten the context, or add GPUs. The self-host vs API calculator can tell you whether owning the hardware even beats paying per token at your volume.
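Two of those moves, a smaller quant and more GPUs, reduce to simple arithmetic. The sketch below is hypothetical and simplified: the helper names are invented for illustration, `other_gb` stands in for KV cache plus overhead, and the GPU count assumes memory splits evenly across cards, which real multi-GPU runtimes only approximate.

```python
import math


def gpus_needed(total_gb: float, card_gb: float) -> int:
    """Smallest number of identical cards whose combined memory covers total_gb.

    Assumes an even split with no per-card duplication (a simplification).
    """
    return max(1, math.ceil(total_gb / card_gb))


def largest_quant_that_fits(total_params_b: float, card_gb: float,
                            other_gb: float, bit_options=(16, 8, 4)):
    """Return the highest bits-per-weight that fits on one card, or None.

    total_params_b * bits / 8 gives the weight size in decimal GB;
    other_gb is the (assumed) KV cache plus runtime overhead.
    """
    for bits in sorted(bit_options, reverse=True):
        if total_params_b * bits / 8 + other_gb <= card_gb:
            return bits
    return None


if __name__ == "__main__":
    # A 13B model on a 24 GB card with ~3 GB reserved for cache and overhead.
    print(largest_quant_that_fits(13, card_gb=24, other_gb=3))
    # A ~41.5 GB total estimate spread across 24 GB cards.
    print(gpus_needed(41.5, card_gb=24))
```

Shortening the context only shrinks the `other_gb` side, so it helps most when the model is already close to fitting.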
