How It Works

What happens behind the scenes when you hit "Get Recommendations."

1. Collecting Benchmark Data

We pull scores from six independent benchmark sources that test models in different ways:

  • Chatbot Arena – Real users vote on which model gives better answers in blind comparisons. The gold standard for "vibes."
  • Open LLM Leaderboard – Standardized academic benchmarks (MMLU-PRO, GPQA, BBH, MATH) testing knowledge and reasoning.
  • LiveBench – Regularly updated questions across coding, math, reasoning, language, and data analysis. Hard to game because questions rotate.
  • ZeroEval – Focused evaluations including SimpleQA (factual accuracy), HLE (hard problems), SWE-Bench (real software engineering), and SciCode.
  • EvalPlus – Code generation benchmarks (HumanEval+, MBPP+) that test whether models can write correct code.
  • BigCodeBench – Advanced coding tasks that go beyond simple function generation.

Each source tests something different. No single benchmark tells the whole story.

2. Scoring Models

Raw scores from different benchmarks aren't directly comparable – an 85 on MMLU means something very different from an 85 on Arena Elo. So we normalize everything to a 0–100 scale per benchmark using min-max normalization:

normalized = ((raw - min) / (max - min)) × 100
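As a minimal sketch of the formula (the function name and the Elo range are illustrative, not from the WhichLLM codebase):

```python
def min_max_normalize(raw: float, lo: float, hi: float) -> float:
    """Map a raw benchmark score onto a 0-100 scale."""
    if hi == lo:  # degenerate case: every model scored the same
        return 0.0
    return (raw - lo) / (hi - lo) * 100

# Example: suppose Arena Elo scores span 1000 to 1400.
# A model rated 1300 lands at 75 on the normalized scale.
print(min_max_normalize(1300, 1000, 1400))  # 75.0
```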

Then we compute a unified score for each use case as a weighted sum across benchmark groups:

S = w₁ × b̄₁ + w₂ × b̄₂ + ... + wₙ × b̄ₙ

Where wᵢ is the weight for benchmark group i and b̄ᵢ is the average normalized score across that group's tasks. If a model is missing data for a benchmark group, that group contributes 0 to the sum – the weight is not redistributed. This means models with sparse benchmark coverage naturally score lower:

A model with data for only 30% of the weight groups can score at most 30 out of 100.

This prevents a model with a single great benchmark score from being inflated to the top of the rankings. As more benchmarks evaluate the model, its score rises naturally.
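The missing-data behavior can be sketched like this (a simplified model, using the "General" weights from the table below; scores are normalized to 0–100 first):

```python
def unified_score(weights: dict[str, float],
                  group_scores: dict[str, float]) -> float:
    """Weighted sum over benchmark groups. Missing groups contribute 0
    and their weight is NOT redistributed, so sparse coverage caps the score."""
    return sum(w * group_scores.get(group, 0.0) / 100
               for group, w in weights.items())

general = {"Arena": 30, "Open LLM": 25, "ZeroEval": 20,
           "LiveBench": 15, "IFEval": 10}

# A model with a perfect score but data for only Arena (30% of the weight)
# caps out at 30 -- it cannot leapfrog well-covered models:
print(unified_score(general, {"Arena": 100}))  # 30.0
```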

The weights for each use case:

Use Case   Weights
General    Arena 30%, Open LLM 25%, ZeroEval 20%, LiveBench 15%, IFEval 10%
Coding     EvalPlus 20%, ZeroEval 20%, LiveBench 20%, BigCodeBench 15%, Open LLM 15%, Arena 10%
RAG        Open LLM 30%, LiveBench 25%, Arena 20%, ZeroEval 15%, Open LLM 10%
Roleplay   Arena 45%, LiveBench 25%, Open LLM 15%, ZeroEval 15%

Notice how roleplay leans heavily on Arena (human preference) while coding spreads weight across code-specific benchmarks. The weights reflect what actually matters for each task.

3. Matching to Your Hardware

When you tell us your setup, we calculate how much VRAM is available for model weights:

Windows / Linux

V_max = V_GPU - 2 GB

Mac (unified memory)

V_max = (RAM_total × 0.75) - 2 GB

The 2 GB overhead accounts for KV cache and system memory. On Mac, we use 75% of total RAM since unified memory is shared between the GPU and the rest of the system.

A model fits if:

size_variant ≤ V_max

Models that don't fit are excluded entirely – a model that doesn't load isn't useful, no matter how good its scores are.
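Putting steps 3's formulas together (model names and sizes below are hypothetical, purely to show the filter):

```python
def usable_vram_gb(gpu_vram_gb: float = 0, mac_ram_gb: float = 0,
                   overhead_gb: float = 2.0) -> float:
    """V_max: memory left for model weights after the ~2 GB overhead.
    On Mac, unified memory gives the GPU roughly 75% of total RAM."""
    if mac_ram_gb:
        return mac_ram_gb * 0.75 - overhead_gb
    return gpu_vram_gb - overhead_gb

# Hypothetical catalog of (name, variant size in GB).
models = [("big-70b-q4", 40.0), ("mid-14b-q5", 10.5), ("small-8b-q8", 8.5)]

v_max = usable_vram_gb(gpu_vram_gb=12)            # 12 GB card -> 10.0 GB
fits = [name for name, size in models if size <= v_max]
print(fits)  # ['small-8b-q8'] -- the other two are excluded entirely
```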

4. Picking the Best Variant

Most models come in multiple quantization levels – compressed versions that trade a small amount of quality for significantly less memory:

Precision   Quality                               Size
FP16        Full precision, best quality          Largest
Q8          Nearly indistinguishable from full    ~50% of FP16
Q6 / Q5     Sweet spot for most users             ~40% of FP16
Q4          Noticeable but acceptable loss        ~30% of FP16
Q3          Significant quality trade-off         ~25% of FP16
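Using the rough size fractions from the table, you can ballpark each variant's footprint from the FP16 size (the 13B / 26 GB example is hypothetical):

```python
# Approximate size relative to FP16, per the table above.
QUANT_FRACTION = {"FP16": 1.00, "Q8": 0.50, "Q6": 0.40,
                  "Q5": 0.40, "Q4": 0.30, "Q3": 0.25}

def estimate_variant_sizes(fp16_size_gb: float) -> dict[str, float]:
    """Estimate each quantized variant's size from the FP16 size."""
    return {q: round(fp16_size_gb * f, 1) for q, f in QUANT_FRACTION.items()}

# A hypothetical 13B model weighing ~26 GB in FP16:
print(estimate_variant_sizes(26.0))
# Q4 comes out around 7.8 GB -- within reach of an 8-12 GB card.
```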

Your preference controls which variant we pick for each model:

  • Max Quality – Highest precision that fits in your VRAM:
    pick variant with max(precision_rank) where size ≤ V_max
  • Balanced – Closest to 70% VRAM utilization, leaving room for longer context:
    pick variant with min(|size - 0.7 × V_max|)
  • Max Context – Smallest variant, maximizing headroom for KV cache:
    pick variant with min(precision_rank) where size ≤ V_max
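The three selection rules above can be sketched as one function (the variant list is a hypothetical model, with sizes following the fractions from the quantization table):

```python
# (precision, precision_rank, size_gb); higher rank = more bits = larger.
VARIANTS = [("FP16", 5, 26.0), ("Q8", 4, 13.0), ("Q6", 3, 10.4),
            ("Q4", 2, 7.8), ("Q3", 1, 6.5)]

def pick_variant(v_max: float, mode: str):
    fitting = [v for v in VARIANTS if v[2] <= v_max]
    if not fitting:
        return None  # nothing fits: the model is excluded
    if mode == "max_quality":   # highest precision that fits
        return max(fitting, key=lambda v: v[1])
    if mode == "max_context":   # smallest variant, most KV-cache headroom
        return min(fitting, key=lambda v: v[1])
    # balanced: closest to 70% VRAM utilization
    return min(fitting, key=lambda v: abs(v[2] - 0.7 * v_max))

v_max = 10.0  # e.g. a 12 GB GPU minus the 2 GB overhead
print(pick_variant(v_max, "max_quality"))  # ('Q4', 2, 7.8)
print(pick_variant(v_max, "balanced"))     # ('Q3', 1, 6.5), nearest 7.0 GB
```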

5. The Recommendation

You get two sets of results:

Top Picks (by quality)

Models ranked purely by their unified score for your use case. The best model is #1 regardless of size – we just pick the variant that fits your hardware. You get the top 5 models, each with the variant that makes sense for your setup.

No magic multipliers, no hidden boosts for bigger models. Quality first, hardware fit second.

Trending Now (from HuggingFace)

The currently trending models on HuggingFace that are compatible with your hardware. This list updates automatically – we pull from the HuggingFace API and match trending models to available GGUF variants in our database so you can actually run them locally.

Trending models might not have the highest benchmark scores, but they represent what the community is excited about right now. We show the benchmark score alongside download counts when available, so you can see how trending models stack up against the top-scoring ones.

© 2026 WhichLLM