Methodology · v1 draft · 2026-04
How we build an Assay.
This page is the working specification for every Agentsia Labs benchmark. It is a living document; it will be versioned alongside each release and the published dataset. If you are considering citing an Assay result, read this first.
Authors: Agentsia Labs editorial team, with invited volunteer reviewers named per release. Feedback and errata: ammar@agentsia.uk.
1. Why these benchmarks exist
The public leaderboards by which frontier AI is measured do not cover commercial work.
Every frontier release comes with a table of scores: SWE-bench Verified, MMLU-Pro, GPQA Diamond, HLE, HumanEval, AIME. These benchmarks are rigorously constructed, publicly contested, and reliably updated. They have also, collectively, become the register in which AI progress is discussed.
None of them test the work that pays salaries inside regulated enterprises. An adtech DSP does not care whether a model can pass a graduate-level physics exam. A clinical-documentation team does not care whether it can solve a LeetCode-hard problem. A credit-decisioning group does not care about Codeforces. They care about whether a model can reason correctly, within a hard latency budget, over a narrow but deeply specific decision space: the auction window, the triage form, the transaction graph.
When a buyer at one of these teams asks “which model should I run here?”, there is no neutral reference to turn to. Frontier labs publish numbers on the work they enjoy being measured on. Everyone else relies on vendor-authored case studies, private consultancy reports, or ad-hoc trials that do not survive contact with a changed frontier version.
Agentsia Labs publishes what is missing: rigorously constructed, openly reproducible benchmarks for the verticals that matter commercially, retested every time a frontier model meaningfully changes.
2. Scenario authoring
Every scenario is authored by a practitioner and validated by a panel.
A benchmark is a set of scenarios. Each scenario has an input (prompt, context, data), a rubric (what a correct or acceptable response looks like), and a scoring function (how to turn a model response into a number).
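As a concrete sketch, a scenario record might be represented like this. The field names and the toy adtech example are illustrative assumptions, not the released dataset schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    # Illustrative field names; the published dataset schema may differ.
    scenario_id: str
    axis: str                        # capability axis label (see §4)
    prompt: str                      # input shown to the model
    context: dict                    # supporting data for the scenario
    rubric_prose: str                # human-readable acceptance criteria
    scorer: Callable[[str], float]   # maps a model response to a 0-100 score

# A toy adtech-flavoured scenario (entirely invented for illustration):
s = Scenario(
    scenario_id="adtech-0001",
    axis="bid-shading",
    prompt="Given the auction log below, recommend a shaded bid.",
    context={"floor_price": 1.20, "win_rate": 0.31},
    rubric_prose="Response must recommend a bid at or above the floor "
                 "and justify it from the observed win rate.",
    scorer=lambda response: 100.0 if "floor" in response else 0.0,
)
print(s.scorer("Bid 1.25, just above the floor, given the 31% win rate."))  # 100.0
```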
Scenarios come from two sources. First, domain practitioners on the invited review panel for the vertical: these are the people who have actually made the decisions our scenarios simulate, and they bring the texture that synthetic generation alone cannot. Second, a structured synthetic pipeline run inside Modelsmith: generate a candidate, run it past a panel reviewer, iterate to publication quality.
We publish the ratio of practitioner-authored to synthetic scenarios in every release. Neither is better absolutely. Practitioner-authored scenarios are higher signal per item; synthetic scenarios let us cover a capability grid at scale without systematic author bias.
Every scenario is labelled with the capability axis it is intended to test (see §4). A scenario that cannot be cleanly labelled, or whose rubric cannot be applied unambiguously, is rewritten or dropped. We do not publish ambiguous items.
3. Rubric design
Rubrics are prose first, code second, and always co-signed.
For each scenario, we write the rubric as a short prose paragraph first. If a human expert reading that paragraph cannot reliably decide whether a given model response meets the criteria, the rubric is wrong and we rewrite it. Only once the prose rubric passes a three-reviewer sanity check do we translate it to a scoring function.
Scoring functions are one of three kinds, in preference order: programmatic checkers (for cases where the correct output is structurally decidable), reference-matched LLM-as-judge (for cases where the correct output is expressible but has meaningful lexical variation), and panel-reviewed human scoring (for cases where the response quality is irreducibly judgement-heavy).
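A minimal sketch of the first kind, a programmatic checker, assuming a hypothetical scenario whose correct answer is a structurally decidable JSON decision object (the field names are invented for illustration):

```python
import json

def check_blocklist_decision(response_text: str) -> float:
    """Programmatic checker sketch: the scenario asks the model to emit a
    JSON decision object, so correctness is structurally decidable and no
    judge model is needed. Field names here are illustrative."""
    try:
        obj = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0
    # Required structure: a decision field with one of two values,
    # plus a non-empty list of cited rule identifiers.
    if obj.get("decision") not in {"allow", "block"}:
        return 0.0
    if not isinstance(obj.get("rules_cited"), list) or not obj["rules_cited"]:
        return 0.0
    return 100.0

print(check_blocklist_decision('{"decision": "block", "rules_cited": ["R-17"]}'))  # 100.0
print(check_blocklist_decision('not json'))                                        # 0.0
```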
Every LLM-as-judge scoring run uses a judge model disclosed in the release. We publish inter-judge agreement numbers when multiple judges are used. Judge choices are pinned per release so that retest deltas isolate the candidate-model change, not the judge change.
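One common agreement statistic for two judges giving categorical verdicts on the same items is Cohen's kappa. The sketch below shows how it could be computed; the statistic actually reported per release may differ:

```python
from collections import Counter

def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges' categorical verdicts on the same items.
    One possible agreement statistic; not necessarily the one a given
    release reports."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement under independent judges with the same marginals.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "fail"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```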
Co-signing matters. Every rubric in every Assay release is approved by at least two named reviewers drawn from the invited volunteer panel for that vertical. The reviewers see the prose rubric, a sample of representative scenarios, and the scoring-function selection. Their names appear on the release page.
4. Capability axes and aggregation
Every Assay breaks into 4 to 6 capability axes, plus one composite.
A single headline number is less useful than a capability breakdown. A DSP trader wants to know whether a model reasons correctly about bid shading separately from whether it handles blocklist decisioning. Composite scores flatten that information. We preserve it.
Each vertical release defines 4 to 6 capability axes in its methodology page, each scored on a 0 to 100 normalised scale. The composite is a weighted average. Weights are published with rationale. If a reader disagrees with the weighting, the per-axis numbers let them compute their own composite in a spreadsheet.
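The composite computation itself is the kind a reader could redo in a spreadsheet. A sketch, where the axis names and weights are invented for illustration and are not real release weights:

```python
def composite(axis_scores: dict, weights: dict) -> float:
    """Weighted average of per-axis scores (each on a 0-100 scale).
    Axis names and weights below are illustrative; real weights ship
    with each release alongside their rationale."""
    assert set(axis_scores) == set(weights)
    total_weight = sum(weights.values())
    return sum(axis_scores[k] * weights[k] for k in weights) / total_weight

scores  = {"bid-shading": 72, "blocklist": 88, "pacing": 65}
weights = {"bid-shading": 40, "blocklist": 35, "pacing": 25}  # out of 100
print(composite(scores, weights))  # 75.85
```

Because the per-axis numbers are published, a reader who prefers different weights can substitute their own dictionary and recompute.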
We publish variance. Every score is the result of three runs with different sampling seeds (temperature 0 where the API permits, temperature 0.2 otherwise), and the variance across seeds is reported alongside the mean. When variance is high relative to inter-model deltas, the release calls that out explicitly: a small gap in a high-variance regime is not a real gap.
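The three-seed report, and the "small gap in a high-variance regime" caveat, could be operationalised roughly as follows. The threshold rule here is an illustrative assumption, not the published criterion:

```python
from statistics import mean, stdev

def seed_report(runs):
    """Summarise three seeded runs of one model on one dataset."""
    return {"mean": mean(runs), "stdev": stdev(runs)}

def gap_is_real(model_a_runs, model_b_runs):
    # Illustrative rule only: treat a gap as real when the mean delta
    # exceeds the combined seed spread of the two models.
    a, b = seed_report(model_a_runs), seed_report(model_b_runs)
    delta = abs(a["mean"] - b["mean"])
    return delta > a["stdev"] + b["stdev"]

print(gap_is_real([71.0, 72.0, 73.0], [74.0, 75.0, 76.0]))  # True: delta 3 > spread 2
print(gap_is_real([68.0, 72.0, 76.0], [70.0, 74.0, 78.0]))  # False: delta 2 < spread 8
```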
5. Reproducibility
Every number we publish can be reproduced from the published artefacts in under an hour.
Each Assay release publishes: the scenario dataset (Apache 2.0), the rubric prose and scoring code (Apache 2.0), the model-runner configuration including version strings and temperature settings, the raw per-scenario model outputs (JSON), the scoring outputs per axis, the composite, and a reproduction command.
The reproduction command is a single invocation of the assay-harness CLI, pinned to the release tag, pointed at the published dataset. Running it against the published model identifiers regenerates the numbers within reported variance. If you cannot reproduce a number we published, that is a bug and we want the report.
The harness lives at github.com/agentsia-uk/assay-harness. It depends on nothing proprietary. Third parties can use it to score their own models against our datasets without involving us. Provider adapters (Anthropic, OpenAI, Google, Hugging Face inference, local vLLM) ship with the Assay-Adtech v1 release; before then, the harness is a runnable v0.1 scaffold with a stub runner used for testing.
6. Model inclusion
Frontier APIs, major open-weights families, and the post-trained specialists Agentsia builds.
Every release evaluates three classes of models side by side on identical scenarios with identical scoring. Class one: current frontier APIs (Claude, Gemini, GPT at latest-generation versions in use at the release date). Class two: major open-weights model families at their best-of-breed sizes (Nemotron, Qwen, Gemma, and where applicable Llama, DeepSeek, Mistral). Class three: Agentsia post-trained specialists for that vertical, labelled distinctly so readers can see the delta between an out-of-the-box open-weights model and one that has been post-trained through Modelsmith.
Model versions are pinned and disclosed with every run. When a frontier release ships with a material capability delta, we retest within two weeks and publish a delta note alongside the refreshed numbers. The previous release’s numbers remain available with a superseded tag.
Third-party submissions are not part of v1. Planned for v2: a submission route for teams to have their own models scored against published Assay datasets. Until then, the harness is open and anyone can reproduce numbers locally.
7. Retest cadence and versioning
Quarterly refresh per vertical. Automatic retest on frontier releases. Every number is dated and versioned.
A benchmark that is not current is not useful. Frontier models ship every two to three months; open-weights families iterate on similar or faster cycles. Two disciplines hold the refresh cadence together.
The scheduled discipline: every active vertical is refreshed once per quarter, regardless of external events. We publish the refresh date on the vertical page at least three weeks in advance. The refresh includes the current frontier model versions, any new open-weights releases in scope, and any rubric corrections that came out of the prior release’s reviewer feedback.
The triggered discipline: when a frontier model ships that we judge to be a material capability delta (new generation number, or step-change on at least one prior-release capability axis), we retest the affected verticals within two weeks and publish a delta note. We do not wait for the quarterly slot.
Versioning is simple: each benchmark has a vertical name and a version (Assay-Adtech v1.0, Assay-Adtech v1.1, Assay-Adtech v2.0). Within a version, minor increments (v1.1) indicate new model scores with the same dataset. Major increments (v2.0) indicate a revised dataset, a revised rubric, or a revised scoring function. Major increments publish a migration note explaining why the previous version’s numbers are no longer comparable.
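The comparability rule implied by this scheme can be expressed as a small check. The tag parsing below is an assumption about how tags are written, not the harness's actual implementation:

```python
import re

def parse_tag(tag):
    """Parse a release tag like 'Assay-Adtech v1.1' into (vertical, major, minor).
    Illustrative; the real harness may format its tags differently."""
    m = re.fullmatch(r"(Assay-[A-Za-z]+) v(\d+)\.(\d+)", tag)
    if not m:
        raise ValueError(f"unrecognised tag: {tag}")
    return m.group(1), int(m.group(2)), int(m.group(3))

def comparable(tag_a, tag_b):
    # Numbers are comparable only within the same vertical and major version:
    # minor increments reuse the dataset, major increments revise it.
    vertical_a, major_a, _ = parse_tag(tag_a)
    vertical_b, major_b, _ = parse_tag(tag_b)
    return vertical_a == vertical_b and major_a == major_b

print(comparable("Assay-Adtech v1.0", "Assay-Adtech v1.1"))  # True
print(comparable("Assay-Adtech v1.1", "Assay-Adtech v2.0"))  # False
```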
8. Licensing and data posture
Datasets under Apache 2.0. Harness under Apache 2.0. Raw outputs published in full.
Every scenario dataset is released under Apache 2.0 with version tags. Every line of the harness is released under Apache 2.0 on GitHub. Every per-model raw output is published in JSON alongside the leaderboard.
All scenario data is either practitioner-authored or Modelsmith-generated synthetic. No customer data is used in any Agentsia Labs publication. No scraped or unlicensed material appears in our datasets. If a scenario reflects a real-world distribution (volume of RTB traffic, fraud pattern shape), the distribution is synthesised from public information and practitioner review, not sourced from any specific third party.
In line with provider terms, model outputs are published as scored responses to our scenarios, with model versions and dates of access. We disclose temperature, system prompts, and any non-default settings. We do not use the outputs to train competing models, nor do we offer the outputs as a drop-in replacement for the underlying services.
9. Corrections, errata, and negative findings
We publish when we are wrong. We publish when we lose. Neither deserves to be hidden.
If a rubric is found to be flawed after publication, we issue a corrected version, mark the affected scenarios on the leaderboard, and recompute the composite. Superseded numbers remain visible with a superseded tag so that prior citations remain anchored.
Negative findings, including cases where Agentsia post-trained specialists underperform a frontier API on a particular capability axis, are published on the same leaderboard as everything else. We will never suppress a result because it is commercially inconvenient. A research surface that hides its losses is not credible on its wins.
Issue reports: open an issue on the harness repo, or email ammar@agentsia.uk. All substantive corrections are acknowledged within five working days.
Colophon
This methodology is authored by Agentsia Labs. It is versioned in the same git repository as the harness. Contributions welcome through the repo’s issues and pull requests, or by writing to ammar@agentsia.uk.
v1.0 draft · 2026-04 · Next revision expected ahead of the Assay-Adtech v1 release.