Generation speed is estimated using the formula memory_bandwidth / model_size × efficiency. A model must fit in VRAM (or unified memory) for peak performance: if it overflows, weights spill to system RAM, which has far lower bandwidth and sharply reduces tokens per second. Quantization (Q4, Q5, Q8, etc.) reduces model size in bits per weight, trading a small amount of output quality for faster generation and lower memory use. The KV cache grows with context length and is added on top of the model size. All figures are theoretical estimates; real-world results vary by roughly ±20% depending on framework, batch size, and system configuration.
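The estimate formula can be sketched in a few lines of Python. This is an illustration of the calculation described above, not LLMfit's actual code; the 0.7 efficiency default is an assumption.

```python
def estimate_tokens_per_second(bandwidth_gbs: float, model_size_gb: float,
                               efficiency: float = 0.7) -> float:
    """Rough tokens/s estimate: each generated token must stream the full
    set of weights from memory, so speed ~ bandwidth / model size, scaled
    by an efficiency factor (0.7 here is an illustrative assumption)."""
    return bandwidth_gbs / model_size_gb * efficiency

# e.g. RTX 4090 (1008 GB/s) running an 8B model at Q4_K_M (~5 GB):
print(estimate_tokens_per_second(1008, 5))  # ≈ 141 tok/s (theoretical)
```

Real-world numbers land within roughly ±20% of this, as noted above.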
LLMfit covers four hardware tiers: NVIDIA consumer GPUs (RTX 3000 through 5000 series), NVIDIA datacenter cards (A100, H100, H200), Apple Silicon (M1 through M4 Pro/Max/Ultra with unified memory), and AMD consumer GPUs (RX 7000 series). CPU-only inference is also supported for systems without a discrete GPU. Hardware specs are updated regularly from verified sources.
Quantization compresses a model's weights from full 16-bit or 32-bit floating point values to lower-precision integers, dramatically reducing the amount of VRAM required to run the model. A full-precision FP16 Llama 3.1 8B model requires approximately 16 GB of VRAM, while a Q4_K_M quantization of the same model needs only about 5 GB — making it accessible on consumer GPUs.
The trade-off is output quality. Lower quantizations remove more information from each weight, which can subtly degrade reasoning, coherence, and factual accuracy. In practice, Q4_K_M preserves around 96% of full-precision quality for most tasks, making it the recommended default for local inference. Q8_0 is near-lossless and only needed when maximum fidelity matters.
Quantization also affects generation speed. Because generation is memory-bandwidth-bound, smaller quantizations mean fewer bytes must be streamed from VRAM per token, resulting in measurably higher tokens per second — particularly on GPUs with high memory bandwidth like the RTX 4090 (1008 GB/s) or Apple M3 Ultra (800+ GB/s).
| Format | Bits/weight | Quality | Best for |
|---|---|---|---|
| FP16 | 16 | 100% | Max quality, high VRAM |
| Q8_0 | 8.5 | 99.5% | Near-lossless, half the VRAM of FP16 |
| Q5_K_M | 5.7 | 98% | High quality with moderate VRAM savings |
| Q4_K_M | 4.9 | 96% | Best balance of quality and speed |
| Q3_K_M | 3.9 | 90% | Low VRAM builds, noticeable quality loss |
| Q2_K | 3.0 | 80% | Extreme compression only |
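The bits-per-weight column above translates directly into an approximate file size: parameters × bits per weight ÷ 8 bits per byte. A minimal sketch (actual GGUF files carry a small amount of metadata overhead, so real files run slightly larger):

```python
# Bits per weight for each format, from the table above.
QUANT_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
              "Q4_K_M": 4.9, "Q3_K_M": 3.9, "Q2_K": 3.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB: params × bits/weight / 8 bits/byte."""
    return params_billions * QUANT_BITS[quant] / 8

# An 8B model: FP16 → 16 GB, Q4_K_M → ~4.9 GB, matching the
# Llama 3.1 8B figures quoted earlier.
print(model_size_gb(8, "FP16"), model_size_gb(8, "Q4_K_M"))
```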
Local LLM inference performance is determined almost entirely by memory bandwidth — how quickly the GPU can stream weight data from VRAM into the compute units during each token generation step. VRAM capacity determines which models can run without spilling to slower system RAM. For Apple Silicon, unified memory serves as both VRAM and system RAM, removing the spill penalty, though effective bandwidth is lower than on high-end discrete GPUs.
When a model is too large to fit in VRAM, weights spill to system RAM. DDR5 system RAM offers roughly 50–80 GB/s of bandwidth versus 500–1000+ GB/s for modern GPUs, so generation speed drops sharply. Running a large model partially in RAM is still possible, but expect 5–20x slower token generation.
| Tier | Example hardware | VRAM | Best models |
|---|---|---|---|
| NVIDIA consumer | RTX 4090 · RTX 5090 | 24–32 GB | Up to ~32B at Q4, 7–13B at FP16 |
| NVIDIA datacenter | H100 · H200 · A100 | 80–141 GB | 70B+ at FP16, frontier-scale models |
| Apple Silicon | M3/M4 Max/Ultra | 48–192 GB unified | Large models with lower bandwidth |
| CPU-only | High-core desktop/server | System RAM | Small models (≤7B) at low tok/s |
"Fits in VRAM" means the model weights plus the KV cache at your selected context length can be loaded entirely into your GPU's video memory. When a model fits, it runs at full GPU memory-bandwidth speed; when it does not, weights overflow to system RAM, which is 10–20x slower.
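The fit check itself is simple arithmetic. A minimal sketch, where the 1 GB runtime/activation overhead is an assumption rather than a number from LLMfit:

```python
def fits_in_vram(model_gb: float, kv_cache_gb: float, vram_gb: float,
                 overhead_gb: float = 1.0) -> bool:
    """Weights + KV cache + a small runtime/activation overhead
    (the 1 GB default is an illustrative assumption) must fit in VRAM."""
    return model_gb + kv_cache_gb + overhead_gb <= vram_gb

# An 8B Q4_K_M model (~5 GB) with a modest context fits on a 24 GB card;
# a 70B Q4 model (~40 GB) does not.
print(fits_in_vram(5, 1, 24), fits_in_vram(40, 4, 24))
```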
When a model does not fully fit, Ollama and llama.cpp fall back to layer offloading: some transformer layers stay on the GPU while the rest are computed on the CPU, reading weights from system RAM. Generation speed drops sharply, often below 5 tok/s, because the bottleneck becomes your RAM bandwidth rather than your GPU bandwidth.
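Why the slow portion dominates can be seen by summing per-token read times: every layer's weights must be streamed once per token, from whichever memory holds them. A sketch under assumed bandwidth figures (1008 GB/s GPU, 64 GB/s DDR5) — not measured numbers:

```python
def offload_tokens_per_second(model_gb: float, gpu_fraction: float,
                              gpu_bw: float = 1008.0, ram_bw: float = 64.0,
                              efficiency: float = 0.7) -> float:
    """Per token, GPU-resident and RAM-resident weights are each read once;
    total time per token is the sum, so the slow RAM read dominates.
    Bandwidth defaults are illustrative assumptions."""
    gpu_time = model_gb * gpu_fraction / gpu_bw        # seconds/token, GPU part
    ram_time = model_gb * (1 - gpu_fraction) / ram_bw  # seconds/token, RAM part
    return efficiency / (gpu_time + ram_time)

# A 40 GB model with 60% of weights on a 24 GB GPU: the 16 GB read from
# DDR5 dominates, pushing the estimate well under 5 tok/s.
print(offload_tokens_per_second(40, 0.6))
```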
TTFT stands for Time To First Token. It measures how long the model takes to process your prompt and produce the very first output token. Longer prompts and larger models increase TTFT. For interactive use, a TTFT under 1–2 seconds feels responsive; above 5 seconds starts to feel slow.
Estimates are typically within ±20% of real-world results. The generation speed formula (bandwidth / model_size × efficiency) captures the dominant factor well. Actual results vary based on Ollama version, batch size, system load, and whether your GPU is running other workloads simultaneously.
Q4_K_M is the recommended default for most users: it provides about 96% of FP16 quality at roughly 30% of the VRAM cost. If you have ample VRAM and want near-lossless quality, choose Q8_0. Only use Q2 or Q3 if your hardware cannot fit Q4 at all, as quality degradation becomes noticeable at those levels.
Yes. The KV cache — which stores the attention state for each token in your context — grows linearly with context length. A 128K context window can add several gigabytes on top of the model size. Reducing context length is one of the easiest ways to fit a larger model into limited VRAM.
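The linear growth follows directly from the KV cache layout: two tensors (K and V) per layer, each of size kv_heads × head_dim per token. A sketch using the standard transformer KV-cache formula; the Llama 3.1 8B shape parameters in the example are public model-architecture figures:

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim
    × context length × bytes/element. Linear in context length."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at 32K context, FP16:
# ~4.3 GB on top of the ~5 GB of Q4_K_M weights — and it scales linearly,
# so a 128K context needs 4x as much.
print(kv_cache_gb(32768, 32, 8, 128))
```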