How It Works
What happens behind the scenes when you hit "Get Recommendations."
1. Collecting Benchmark Data
We pull scores from six independent benchmark sources that test models in different ways:
- Chatbot Arena – Real users vote on which model gives better answers in blind comparisons. The gold standard for "vibes."
- Open LLM Leaderboard – Standardized academic benchmarks (MMLU-PRO, GPQA, BBH, MATH) testing knowledge and reasoning.
- LiveBench – Regularly updated questions across coding, math, reasoning, language, and data analysis. Hard to game because questions rotate.
- ZeroEval – Focused evaluations including SimpleQA (factual accuracy), HLE (hard problems), SWE-Bench (real software engineering), and SciCode.
- EvalPlus – Code generation benchmarks (HumanEval+, MBPP+) that test whether models can write correct code.
- BigCodeBench – Advanced coding tasks that go beyond simple function generation.
Each source tests something different. No single benchmark tells the whole story.
2. Scoring Models
Raw scores from different benchmarks aren't directly comparable: an 85 on MMLU means something very different from an 85 on Arena Elo. So we normalize everything to a 0–100 scale per benchmark using min-max normalization:
normalized = ((raw - min) / (max - min)) × 100
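As a minimal sketch of the formula above (the Elo values and leaderboard range are hypothetical, just for illustration):

```python
def normalize(raw: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw benchmark score to a 0-100 scale."""
    return (raw - lo) / (hi - lo) * 100

# Hypothetical example: an Arena Elo of 1250 on a leaderboard
# whose scores span 1000 (min) to 1400 (max)
print(normalize(1250, 1000, 1400))  # 62.5
print(normalize(1400, 1000, 1400))  # 100.0
```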
Then we compute a unified score for each use case as a weighted sum across benchmark groups:
S = w₁ × b̄₁ + w₂ × b̄₂ + … + wₙ × b̄ₙ
Where wᵢ is the weight for benchmark group i and b̄ᵢ is the average normalized score across that group's tasks. If a model is missing data for a benchmark group, that group contributes 0 to the sum; the weight is not redistributed. This means models with sparse benchmark coverage naturally score lower: a model with data for only 30% of the weight groups can score at most 30 out of 100.
This prevents a model with a single great benchmark score from being inflated to the top of the rankings. As more benchmarks evaluate the model, its score rises naturally.
The weights for each use case:
| Use Case | Weights |
|---|---|
| General | Arena 30%, Open LLM 25%, ZeroEval 20%, LiveBench 15%, IFEval 10% |
| Coding | EvalPlus 20%, ZeroEval 20%, LiveBench 20%, BigCodeBench 15%, Open LLM 15%, Arena 10% |
| RAG | Open LLM 30%, LiveBench 25%, Arena 20%, ZeroEval 15%, Open LLM 10% |
| Roleplay | Arena 45%, LiveBench 25%, Open LLM 15%, ZeroEval 15% |
Notice how roleplay leans heavily on Arena (human preference) while coding spreads weight across code-specific benchmarks. The weights reflect what actually matters for each task.
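The weighted sum and the no-redistribution rule can be sketched in a few lines, using the General weights from the table above (the dictionary keys and function names are illustrative, not the tool's actual code):

```python
# General use-case weights from the table above
GENERAL_WEIGHTS = {
    "arena": 0.30, "open_llm": 0.25, "zeroeval": 0.20,
    "livebench": 0.15, "ifeval": 0.10,
}

def unified_score(normalized: dict[str, float], weights: dict[str, float]) -> float:
    # Missing benchmark groups contribute 0; their weight is NOT redistributed.
    return sum(w * normalized.get(group, 0.0) for group, w in weights.items())

# A model with perfect scores but data for only two groups
# (30% + 15% = 45% of the total weight) caps out at 45.
sparse = {"arena": 100.0, "livebench": 100.0}
print(round(unified_score(sparse, GENERAL_WEIGHTS), 2))  # 45.0
```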
3. Matching to Your Hardware
When you tell us your setup, we calculate how much VRAM is available for model weights:
Windows / Linux
V_max = V_GPU - 2 GB
Mac (unified memory)
V_max = (RAM_total × 0.75) - 2 GB
The 2 GB overhead accounts for KV cache and system memory. On Mac, we use 75% of total RAM since unified memory is shared between the GPU and the rest of the system.
A model fits if:
size_variant ≤ V_max
Models that don't fit are excluded entirely: a model that doesn't load isn't useful, no matter how good its scores are.
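A minimal sketch of this budget calculation and fit check, following the two formulas above (function names and the example hardware sizes are illustrative):

```python
def usable_vram_gb(gpu_vram_gb: float = 0.0, mac_ram_gb: float = 0.0,
                   is_mac: bool = False) -> float:
    """VRAM available for model weights: the raw budget minus a 2 GB
    overhead for KV cache and system memory. On Mac, only 75% of
    unified memory is assumed to be usable by the GPU."""
    budget = mac_ram_gb * 0.75 if is_mac else gpu_vram_gb
    return budget - 2.0

def fits(variant_size_gb: float, v_max: float) -> bool:
    # Models whose smallest variant exceeds V_max are excluded entirely.
    return variant_size_gb <= v_max

print(usable_vram_gb(gpu_vram_gb=24))              # 22.0 (24 GB GPU)
print(usable_vram_gb(mac_ram_gb=32, is_mac=True))  # 22.0 (32 GB Mac)
```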
4. Picking the Best Variant
Most models come in multiple quantization levels: compressed versions that trade a small amount of quality for significantly less memory:
| Precision | Quality | Size |
|---|---|---|
| FP16 | Full precision, best quality | Largest |
| Q8 | Nearly indistinguishable from full | ~50% of FP16 |
| Q6 / Q5 | Sweet spot for most users | ~40% of FP16 |
| Q4 | Noticeable but acceptable loss | ~30% of FP16 |
| Q3 | Significant quality trade-off | ~25% of FP16 |
Your preference controls which variant we pick for each model:
- Max Quality – Highest precision that fits in your VRAM:
pick variant with max(precision_rank) where size ≤ V_max
- Balanced – Closest to 70% VRAM utilization, leaving room for longer context:
pick variant with min(|size - 0.7 × V_max|)
- Max Context – Smallest variant, maximizing headroom for KV cache:
pick variant with min(precision_rank) where size ≤ V_max
5. The Recommendation
You get two sets of results:
Top Picks (by quality)
Models ranked purely by their unified score for your use case. The best model is #1 regardless of size โ we just pick the variant that fits your hardware. You get the top 5 models, each with the variant that makes sense for your setup.
No magic multipliers, no hidden boosts for bigger models. Quality first, hardware fit second.
Trending Now (from HuggingFace)
The currently trending models on HuggingFace that are compatible with your hardware. This list updates automatically: we pull from the HuggingFace API and match trending models to available GGUF variants in our database so you can actually run them locally.
Trending models might not have the highest benchmark scores, but they represent what the community is excited about right now. We show the benchmark score alongside download counts when available, so you can see how trending models stack up against the top-scoring ones.