Generation speed is estimated using the formula memory_bandwidth / model_size × efficiency. A model must fit in VRAM (or unified memory) for peak performance: if it overflows, weights spill to system RAM, which has far lower bandwidth and sharply reduces tokens per second. Quantization (Q4, Q5, Q8, etc.) reduces model size in bits per weight, trading a small amount of output quality for faster generation and lower memory use. The KV cache grows with context length and is added on top of the model size. All figures are theoretical estimates; real-world results vary by roughly ±20% depending on framework, batch size, and system configuration.
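The estimate formula can be sketched in a few lines of Python. This is an illustration of the calculation described above, not LLMfit's actual code; the 0.7 efficiency default is an assumption.

```python
def estimate_tokens_per_second(bandwidth_gbs: float, model_size_gb: float,
                               efficiency: float = 0.7) -> float:
    """Rough tokens/s estimate: each generated token must stream the full
    set of weights from memory, so speed ~ bandwidth / model size, scaled
    by an efficiency factor (0.7 here is an illustrative assumption)."""
    return bandwidth_gbs / model_size_gb * efficiency

# e.g. RTX 4090 (1008 GB/s) running an 8B model at Q4_K_M (~5 GB):
print(estimate_tokens_per_second(1008, 5))  # ≈ 141 tok/s (theoretical)
```

Real-world numbers land within roughly ±20% of this, as noted above.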
LLMfit covers four hardware tiers: NVIDIA consumer GPUs (RTX 3000 through 5000 series), NVIDIA datacenter cards (A100, H100, H200), Apple Silicon (M1 through M4 Pro/Max/Ultra with unified memory), and AMD consumer GPUs (RX 7000 series). CPU-only inference is also supported for systems without a discrete GPU. Hardware specs are updated regularly from verified sources.
Quantization compresses a model's weights from full 16-bit or 32-bit floating point values to lower-precision integers, dramatically reducing the amount of VRAM required to run the model. A full-precision FP16 Llama 3.1 8B model requires approximately 16 GB of VRAM, while a Q4_K_M quantization of the same model needs only about 5 GB — making it accessible on consumer GPUs.
The trade-off is output quality. Lower quantizations remove more information from each weight, which can subtly degrade reasoning, coherence, and factual accuracy. In practice, Q4_K_M preserves around 96% of full-precision quality for most tasks, making it the recommended default for local inference. Q8_0 is near-lossless and only needed when maximum fidelity matters.
Quantization also affects generation speed. Because generation is memory-bandwidth-bound, smaller quantizations mean fewer bytes must be streamed from VRAM per token, resulting in measurably higher tokens per second — particularly on GPUs with high memory bandwidth like the RTX 4090 (1008 GB/s) or Apple M3 Ultra (800+ GB/s).
| Format | Bits/weight | Quality | Best for |
|---|---|---|---|
| FP16 | 16 | 100% | Max quality, high VRAM |
| Q8_0 | 8.5 | 99.5% | Near-lossless, half the VRAM of FP16 |
| Q5_K_M | 5.7 | 98% | High quality with moderate VRAM savings |
| Q4_K_M | 4.9 | 96% | Best balance of quality and speed |
| Q3_K_M | 3.9 | 90% | Low VRAM builds, noticeable quality loss |
| Q2_K | 3.0 | 80% | Extreme compression only |
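The bits-per-weight column above translates directly into an approximate file size: parameters × bits per weight ÷ 8 bits per byte. A minimal sketch (actual GGUF files carry a small amount of metadata overhead, so real files run slightly larger):

```python
# Bits per weight for each format, from the table above.
QUANT_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
              "Q4_K_M": 4.9, "Q3_K_M": 3.9, "Q2_K": 3.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB: params × bits/weight / 8 bits/byte."""
    return params_billions * QUANT_BITS[quant] / 8

# An 8B model: FP16 → 16 GB, Q4_K_M → ~4.9 GB, matching the
# Llama 3.1 8B figures quoted earlier.
print(model_size_gb(8, "FP16"), model_size_gb(8, "Q4_K_M"))
```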
Local LLM inference performance is determined almost entirely by memory bandwidth — how quickly the GPU can stream weight data from VRAM into the compute units during each token generation step. VRAM capacity determines which models can run without spilling to slower system RAM. For Apple Silicon, unified memory serves as both VRAM and system RAM, removing the spill penalty, though effective bandwidth is lower than on high-end discrete GPUs.
When a model is too large to fit in VRAM, weights spill to system RAM. DDR5 system RAM offers roughly 50–80 GB/s of bandwidth versus 500–1000+ GB/s for modern GPUs, so generation speed drops sharply. Running a large model partially in RAM is still possible, but expect 5–20x slower token generation.
| Tier | Example hardware | VRAM | Best models |
|---|---|---|---|
| NVIDIA consumer | RTX 4090 · RTX 5090 | 24–32 GB | Up to ~32B at Q4, 7–13B at FP16 |
| NVIDIA datacenter | H100 · H200 · A100 | 80–141 GB | 70B+ at FP16, frontier-scale models |
| Apple Silicon | M3/M4 Max/Ultra | 48–192 GB unified | Large models with lower bandwidth |
| CPU-only | High-core desktop/server | System RAM | Small models (≤7B) at low tok/s |
"Fits in VRAM" means the model weights plus the KV cache at your selected context length can be loaded entirely into your GPU's video memory. When a model fits, it runs at full GPU memory-bandwidth speed; when it does not, weights overflow to system RAM, which is 10–20x slower.
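The fit check itself is simple arithmetic. A minimal sketch, where the 1 GB runtime/activation overhead is an assumption rather than a number from LLMfit:

```python
def fits_in_vram(model_gb: float, kv_cache_gb: float, vram_gb: float,
                 overhead_gb: float = 1.0) -> bool:
    """Weights + KV cache + a small runtime/activation overhead
    (the 1 GB default is an illustrative assumption) must fit in VRAM."""
    return model_gb + kv_cache_gb + overhead_gb <= vram_gb

# An 8B Q4_K_M model (~5 GB) with a modest context fits on a 24 GB card;
# a 70B Q4 model (~40 GB) does not.
print(fits_in_vram(5, 1, 24), fits_in_vram(40, 4, 24))
```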
When a model does not fully fit, Ollama and llama.cpp fall back to layer offloading: some transformer layers stay on the GPU while the rest are computed on the CPU, reading weights from system RAM. Generation speed drops sharply, often below 5 tok/s, because the bottleneck becomes your RAM bandwidth rather than your GPU bandwidth.
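Why the slow portion dominates can be seen by summing per-token read times: every layer's weights must be streamed once per token, from whichever memory holds them. A sketch under assumed bandwidth figures (1008 GB/s GPU, 64 GB/s DDR5) — not measured numbers:

```python
def offload_tokens_per_second(model_gb: float, gpu_fraction: float,
                              gpu_bw: float = 1008.0, ram_bw: float = 64.0,
                              efficiency: float = 0.7) -> float:
    """Per token, GPU-resident and RAM-resident weights are each read once;
    total time per token is the sum, so the slow RAM read dominates.
    Bandwidth defaults are illustrative assumptions."""
    gpu_time = model_gb * gpu_fraction / gpu_bw        # seconds/token, GPU part
    ram_time = model_gb * (1 - gpu_fraction) / ram_bw  # seconds/token, RAM part
    return efficiency / (gpu_time + ram_time)

# A 40 GB model with 60% of weights on a 24 GB GPU: the 16 GB read from
# DDR5 dominates, pushing the estimate well under 5 tok/s.
print(offload_tokens_per_second(40, 0.6))
```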
TTFT stands for Time To First Token. It measures how long the model takes to process your prompt and produce the very first output token. Longer prompts and larger models increase TTFT. For interactive use, a TTFT under 1–2 seconds feels responsive; above 5 seconds starts to feel slow.
Estimates are typically within ±20% of real-world results. The generation speed formula (bandwidth / model_size × efficiency) captures the dominant factor well. Actual results vary based on Ollama version, batch size, system load, and whether your GPU is running other workloads simultaneously.
Q4_K_M is the recommended default for most users: it provides about 96% of FP16 quality at roughly 30% of the VRAM cost. If you have ample VRAM and want near-lossless quality, choose Q8_0. Only use Q2 or Q3 if your hardware cannot fit Q4 at all, as quality degradation becomes noticeable at those levels.
Yes. The KV cache — which stores the attention state for each token in your context — grows linearly with context length. A 128K context window can add several gigabytes on top of the model size. Reducing context length is one of the easiest ways to fit a larger model into limited VRAM.
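The linear growth follows directly from the KV cache layout: two tensors (K and V) per layer, each of size kv_heads × head_dim per token. A sketch using the standard transformer KV-cache formula; the Llama 3.1 8B shape parameters in the example are public model-architecture figures:

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim
    × context length × bytes/element. Linear in context length."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at 32K context, FP16:
# ~4.3 GB on top of the ~5 GB of Q4_K_M weights — and it scales linearly,
# so a 128K context needs 4x as much.
print(kv_cache_gb(32768, 32, 8, 128))
```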