
LLMfit.io

Will it run? Estimate local LLM performance on your hardware through pure calculation.


How it works

Generation speed is estimated as memory_bandwidth / model_size × efficiency. A model must fit entirely in VRAM (or unified memory) for peak performance; if it overflows, weights spill into system RAM, whose far lower bandwidth sharply reduces tokens per second. Quantization (Q4, Q5, Q8, etc.) lowers the bits per weight, shrinking the model and trading a small amount of output quality for faster generation and lower memory use. The KV cache grows with context length and is added on top of the model size. All figures are theoretical estimates; real-world results vary by roughly ±20% depending on framework, batch size, and system configuration.
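The estimate above can be sketched in a few lines. This is a minimal illustration, not LLMfit's actual code; the efficiency factor of 0.7 and the ~4.5 bits-per-weight figure for Q4 are assumptions for the example.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Quantized model size in GB: parameters x bits-per-weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

def estimate_tokens_per_sec(bandwidth_gb_s: float, size_gb: float,
                            efficiency: float = 0.7) -> float:
    """Generation speed estimate: memory_bandwidth / model_size x efficiency.

    The efficiency factor (assumed 0.7 here) accounts for the gap between
    theoretical peak bandwidth and what inference frameworks achieve.
    """
    return bandwidth_gb_s / size_gb * efficiency

# Example: a 7B model at Q4 (~4.5 bits/weight) on a GPU with 1000 GB/s bandwidth
size = model_size_gb(7, 4.5)                      # ~3.94 GB
speed = estimate_tokens_per_sec(1000, size)       # ~178 tokens/sec
print(f"{size:.2f} GB -> {speed:.0f} tok/s")
```

Because the model is read from memory once per generated token, bandwidth divided by model size gives an upper bound on tokens per second; the efficiency factor scales it down to a realistic figure.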

Supported hardware

LLMfit covers four hardware tiers: NVIDIA consumer GPUs (RTX 3000 through 5000 series), NVIDIA datacenter cards (A100, H100, H200), Apple Silicon (M1 through M4 Pro/Max/Ultra with unified memory), and AMD consumer GPUs (RX 7000 series). CPU-only inference is also supported for systems without a discrete GPU. Hardware specs are updated regularly from verified sources.
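The "will it fit" part of the check can be illustrated against a few devices from these tiers. The specs below are approximate published figures used for illustration only, not LLMfit's verified hardware database, and the fit rule is the simple one described above: weights plus KV cache must stay within VRAM or unified memory.

```python
# Approximate illustrative specs (VRAM/unified memory in GB, bandwidth in GB/s).
HARDWARE = {
    "RTX 4090":  {"memory_gb": 24, "bandwidth_gb_s": 1008},
    "A100 80GB": {"memory_gb": 80, "bandwidth_gb_s": 2039},
    "M2 Max":    {"memory_gb": 96, "bandwidth_gb_s": 400},  # unified memory
}

def fits(hw_name: str, model_size_gb: float, kv_cache_gb: float) -> bool:
    """A model fits when weights + KV cache stay within VRAM/unified memory."""
    return model_size_gb + kv_cache_gb <= HARDWARE[hw_name]["memory_gb"]

print(fits("A100 80GB", 70, 8))  # True  — 78 GB within 80 GB
print(fits("RTX 4090", 22, 3))   # False — 25 GB exceeds 24 GB VRAM
```

A model that fails this check would spill into system RAM, which is why the tool flags fit separately from raw speed.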