How it works
DoesItLocal turns a messy practitioner question (“can my local model do this?”) into a single, evidence-backed verdict per task. Here is the flow end to end.
1. A task enters the catalog
The unit is a task: a describable piece of work you’d ask an AI to do, such as “extract structured fields from an invoice,” “write a unit test for a pure function,” “summarize a meeting transcript,” or “refactor a 200-line module.” Tasks are seeded from real-usage and benchmark taxonomies (so they reflect what people actually ask, not invented categories) and grow from community submissions. Each carries a name, a plain description, a category, and, crucially, a verification method: how you’d check that the output is right. See Task catalog.
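As a rough illustration, a catalog entry with the four fields described above could be modeled like this. This is a minimal sketch, not DoesItLocal's actual schema: the class names, field names, and enum values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class VerificationMethod(Enum):
    # Hypothetical enum; the members mirror the verifier kinds this page mentions.
    TESTS_PASS = "tests_pass"
    LINT_CLEAN = "lint_clean"
    TYPES_CHECK = "types_check"
    EXACT_MATCH = "exact_match"
    VALIDATED_JUDGE = "validated_judge"


@dataclass(frozen=True)
class Task:
    # Field names are assumptions chosen to match the prose, not a real schema.
    name: str
    description: str
    category: str
    verification: VerificationMethod


invoice_task = Task(
    name="invoice-field-extraction",
    description="Extract structured fields from an invoice.",
    category="extraction",
    verification=VerificationMethod.EXACT_MATCH,
)
```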
2. Evidence accumulates
Two kinds of signal attach to a task, newest first:
- Eval runs: a model runs the task and its output is scored by the task’s verifier (tests pass / lint clean / types check / exact-match / a validated judge). This is the reproducible measurement.
- Practitioner reports: developers report what happened when they ran a model on this kind of task, along with their hardware and model. This is the cheap signal that scales and adds breadth, and it’s collected through a manipulation-resistant voting system, not a naïve up/down tally.
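To make the eval-run signal concrete, here is a minimal sketch of scoring one output with an exact-match verifier and listing runs newest first. The `EvalRun` record, the function names, and the choice to trim surrounding whitespace before comparing are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalRun:
    # Hypothetical record of one scored run.
    model: str
    passed: bool
    run_on: date


def exact_match(output: str, expected: str) -> bool:
    """Exact-match verifier; trimming whitespace is an assumption."""
    return output.strip() == expected.strip()


def record_eval(model: str, output: str, expected: str, run_on: date) -> EvalRun:
    """Score a model's output with the verifier and keep the result."""
    return EvalRun(model=model, passed=exact_match(output, expected), run_on=run_on)


def newest_first(runs: list[EvalRun]) -> list[EvalRun]:
    """Order evidence with the most recent run at the front."""
    return sorted(runs, key=lambda r: r.run_on, reverse=True)
```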
A model’s published benchmark score (AA Index, LiveBench, etc.) is used only to decide which models are worth running on the task at all — a shortlist input, never evidence of the verdict.
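The shortlisting role of benchmark scores can be sketched as a simple filter. The function name, the threshold parameter, and the idea of a single numeric cutoff are assumptions. The point is only that the score decides which models get run, and nothing in the result counts as evidence for a verdict.

```python
def shortlist(benchmark_scores: dict[str, float], min_score: float) -> list[str]:
    """Return models whose published score clears min_score, best first.

    The output is a list of models worth running on the task, nothing more.
    """
    eligible = [(model, score) for model, score in benchmark_scores.items()
                if score >= min_score]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [model for model, _ in eligible]
```

For example, with placeholder scores, `shortlist({"model-a": 61.0, "model-b": 38.5}, min_score=50.0)` keeps only `model-a`, which would then still need eval runs before any verdict is asserted.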
3. The verdict resolves
The signals roll up into one current verdict per task, on an asymmetric, default-conservative scale:
- 🟢 Safe for local — a named local model reliably clears the task’s verifier.
- 🟡 Local with a check — a local model can do it if you gate the output with a cheap verification step; trusting it blind is not safe.
- 🔴 Needs a bigger model — no local model reliably clears it; use the recommended fallback.
- 🔶 Needs more data — the honest default until the evidence is strong enough to assert anything.
The bar is asymmetric on purpose: a negative (“needs a bigger model”) demands reproduced evidence, and anything ambiguous stays “needs more data” — DoesItLocal never guesses a green light. See Local-safety verdicts.
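One way to picture the default-conservative rule is the sketch below, which resolves a verdict from one local model's eval-run tally on a task. It is not the real resolution logic. Every threshold is a hypothetical placeholder, mapping a middling pass rate to "local with a check" is an assumption, and `failure_reproduced` stands in for whatever "reproduced evidence" means in practice.

```python
from enum import Enum


class Verdict(Enum):
    SAFE_FOR_LOCAL = "safe for local"
    LOCAL_WITH_CHECK = "local with a check"
    NEEDS_BIGGER_MODEL = "needs a bigger model"
    NEEDS_MORE_DATA = "needs more data"


# Hypothetical thresholds, chosen only to make the example concrete.
MIN_RUNS = 5
SAFE_PASS_RATE = 0.95
CHECK_PASS_RATE = 0.60
FAIL_PASS_RATE = 0.20


def resolve(passes: int, runs: int, failure_reproduced: bool) -> Verdict:
    """Resolve a verdict, falling back to NEEDS_MORE_DATA when unsure."""
    if runs < MIN_RUNS:
        return Verdict.NEEDS_MORE_DATA  # too little evidence to assert anything
    rate = passes / runs
    if rate >= SAFE_PASS_RATE:
        return Verdict.SAFE_FOR_LOCAL
    if rate >= CHECK_PASS_RATE:
        return Verdict.LOCAL_WITH_CHECK
    if rate <= FAIL_PASS_RATE and failure_reproduced:
        return Verdict.NEEDS_BIGGER_MODEL  # a negative needs reproduced evidence
    return Verdict.NEEDS_MORE_DATA  # anything ambiguous stays undecided
```

Note that the last branch catches both the ambiguous middle band and an unreproduced failure, so the function never asserts red or green on thin evidence.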
4. You get a recommendation, not just a label
Alongside the verdict, each task shows:
- a table of local models that clear it (and at what size/quantization/hardware), and
- a recommended fallback — the cheapest open-weights or frontier model that does handle it — when local isn’t safe.
See Model recommendations.
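Picking "the cheapest model that does handle it" can be sketched as a filter followed by a minimum. The `Candidate` record and its cost field are hypothetical, and how cost is actually measured is not specified on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; cost is in whatever unit the catalog uses.
    model: str
    clears_task: bool
    cost: float


def recommended_fallback(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the cheapest candidate that clears the task, or None."""
    capable = [c for c in candidates if c.clears_task]
    return min(capable, key=lambda c: c.cost, default=None)
```

Returning `None` when no candidate clears the task keeps the caller from treating a cheap but failing model as a fallback.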
5. Agents query it directly
A coding agent or router hits the agent API / MCP with a task (and the local model it has) and gets back the verdict and recommendation in one request. It can then run the task locally when that’s safe, add a verifier when that’s the unlock, and escalate to a bigger model only when it must.
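On the agent side, the routing decision could look something like the sketch below. It deliberately does not show the agent API or MCP call itself, since the request and response shapes are not described here. The verdict strings and action names are hypothetical, and treating "needs more data" or an unknown verdict as a reason to escalate is an assumption in keeping with the conservative default.

```python
# Hypothetical verdict identifiers mapped to hypothetical agent actions.
ACTIONS = {
    "safe_for_local": "run_local",
    "local_with_check": "run_local_then_verify",
    "needs_bigger_model": "escalate",
    "needs_more_data": "escalate",  # assumption: no evidence, no blind trust
}


def plan(verdict: str) -> str:
    """Choose what the agent does next; unknown verdicts escalate."""
    return ACTIONS.get(verdict, "escalate")
```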
Why it stays honest
Freshness and asymmetry are the whole game. Models turn over every quarter, so verdicts carry a date and a staleness flag and re-resolve as new evidence lands; and because a false “safe for local” is the one error that burns trust, the default is always the conservative call. The reasoning behind this design is in Design principles.
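The staleness flag can be pictured as a date comparison. The 90-day window below is a hypothetical value picked to roughly match "every quarter"; the real cutoff is not stated on this page.

```python
from datetime import date, timedelta

# Hypothetical window: about one quarter.
STALE_AFTER = timedelta(days=90)


def is_stale(resolved_on: date, today: date) -> bool:
    """Flag a verdict whose resolution date is older than the window."""
    return today - resolved_on > STALE_AFTER
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.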