Open release pack
Each Assay release is designed to publish the dataset, rubric prose, scoring code, redacted runner config, non-sensitive scored artefacts, axis scores, and composite formula.
Methodology - v1 draft - 2026-04
Agentsia Labs benchmarks are designed to test commercial work that public leaderboards skip: latency-constrained decisions, regulated workflows, private rubrics, and domain-specific failure modes. This page is the operating spec for how scenarios are authored, scored, versioned, reproduced, retested, licensed, and corrected.
Authors: Agentsia Labs editorial team, with invited volunteer reviewers named per release. Feedback and errata: contact form.
What this page answers
Can I trust the benchmark before I trust the score?
Current release state
Method contract
Scenario sources, reviewer roles, scoring mode, and known limitations are disclosed so a reader can challenge the method, not just the number.
Quarterly refresh policy, triggered retests on material frontier changes, and version tags keep old results dated rather than silently overwritten.
Corrections, superseded numbers, and negative findings remain part of the record. A benchmark that hides inconvenient results is not useful.
Why these benchmarks exist
SWE-bench, MMLU-Pro, GPQA, HLE, AIME, and HumanEval are useful public reference points. They do not tell a demand-side platform (DSP) trader whether a model can reason inside a bid path, or a regulated operator whether it can handle a narrow workflow under latency, privacy, and audit constraints.
Public frontier scoreboard
Assay benchmark
Scenario authoring
Scenarios can begin as practitioner-authored items or as synthetic candidates generated through internal tooling. Neither path is accepted automatically. Each candidate must survive review, axis labelling, rubric sanity checks, and scoring selection before it can be published.
Source A
Practitioner authored
High-signal examples from operators who understand the workflow.
Source B
Synthetic candidate
Coverage generation through internal tooling, then human review.
Gate
Published or dropped
Ambiguous items are rewritten or removed before release.
Candidate scenario
Practitioner-authored or tooling-generated synthetic candidate.
Panel review
Domain reviewers reject, rewrite, or accept the candidate for rubric work.
Axis label
Each scenario is mapped to a capability axis before scoring begins.
Rubric sanity check
Three reviewers must be able to apply the prose rubric consistently.
Scoring mode
Use the most deterministic scorer that still fits the task.
Release or drop
Ambiguous scenarios do not ship. Published items carry source and scoring metadata.
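The gates above can be read as a short filter chain. The sketch below is illustrative only: the `Candidate` fields, the gate order inside `review`, and the rule that all three reviewers must assign the same rubric score are assumptions, not the Assay harness's data model or its actual agreement threshold. It also models only the drop path, not the rewrite path.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one candidate scenario."""
    scenario_id: str
    source: str                       # "practitioner" or "synthetic"
    panel_accepted: bool = False      # outcome of panel review
    axis: Optional[str] = None        # capability axis label
    reviewer_scores: List[int] = field(default_factory=list)
    scoring_mode: Optional[str] = None


def rubric_is_consistent(scores: List[int], reviewers: int = 3) -> bool:
    """Assumed check: at least `reviewers` scores, all of them identical."""
    return len(scores) >= reviewers and len(set(scores)) == 1


def review(candidate: Candidate) -> str:
    """Return "release" only if every gate passes, otherwise "drop"."""
    gates = [
        candidate.source in {"practitioner", "synthetic"},
        candidate.panel_accepted,
        candidate.axis is not None,
        rubric_is_consistent(candidate.reviewer_scores),
        candidate.scoring_mode is not None,
    ]
    return "release" if all(gates) else "drop"
```

A candidate whose three reviewers disagree on the rubric score is dropped here even if every other gate passes, which mirrors the rule that ambiguous scenarios do not ship.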
Rubric design and scoring
We write the rubric in prose before writing scoring code. If a human expert cannot apply the rubric consistently, the rubric is rewritten. Scoring then uses the most deterministic mode that fits the task.
Used when
The correct output is structurally decidable.
Evidence published
Checker code, fixtures, pass/fail outputs.
Known limit
Only works where correctness can be encoded without judgement.
Used when
A correct answer has lexical variation but a known reference shape.
Evidence published
Pinned judge model, redacted judge configuration, prompt-shape metadata, and inter-judge agreement where used.
Known limit
Judge choice can move scores, so it is pinned per release.
Used when
The response is judgement-heavy and cannot be reduced safely.
Evidence published
Reviewer rubric, sample scoring notes, disagreement treatment.
Known limit
Slower and more expensive. Reserved for high-judgement items.
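The rule "use the most deterministic scorer that still fits the task" can be sketched as a simple fall-through. The mode names (`PROGRAMMATIC`, `MODEL_JUDGE`, `HUMAN_REVIEW`) and the two boolean inputs are hypothetical labels chosen for this example; the page does not name the modes in this form.

```python
from enum import IntEnum


class ScoringMode(IntEnum):
    """Hypothetical mode names, ordered from most to least deterministic."""
    PROGRAMMATIC = 1   # checker code decides pass/fail
    MODEL_JUDGE = 2    # pinned judge model compares against a reference shape
    HUMAN_REVIEW = 3   # reviewers apply the prose rubric directly


def pick_mode(structurally_decidable: bool, has_reference_shape: bool) -> ScoringMode:
    """Choose the most deterministic mode whose precondition holds."""
    if structurally_decidable:
        return ScoringMode.PROGRAMMATIC
    if has_reference_shape:
        return ScoringMode.MODEL_JUDGE
    return ScoringMode.HUMAN_REVIEW
```

Because the checks run in order of determinism, a task that is both structurally decidable and has a reference shape is still scored programmatically.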
Capability axes and aggregation
A single headline number can flatten the reason a model helps or fails. Each vertical release defines four to six capability axes on a normalised 0 to 100 scale, reports variance across seeds, and publishes weights so readers can recompute their own composite.
Example shape, not release data
A DSP buyer can see whether a model is strong on bid shading but weak on made-for-advertising (MFA) classification instead of reading one flattened score.
Axis labels, weights, seed variance, and formula ship with the release pack.
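Recomputing a composite from published axis scores and weights could look like the sketch below. It assumes the composite is a weighted mean of per-axis means and that variance is the sample variance across seeds; the formula shipped in a given release pack is authoritative, and the function names here are invented for illustration.

```python
from statistics import mean, variance
from typing import Dict, List


def axis_summary(seed_scores: List[float]) -> Dict[str, float]:
    """Mean and sample variance of one axis across seeds (0 to 100 scale)."""
    var = variance(seed_scores) if len(seed_scores) > 1 else 0.0
    return {"mean": mean(seed_scores), "variance": var}


def composite(axis_means: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of axis scores; weights are normalised by their sum."""
    if set(axis_means) != set(weights):
        raise ValueError("axes and weights must cover the same labels")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(axis_means[a] * weights[a] for a in axis_means) / total
```

A reader who disagrees with the published weights can pass their own `weights` mapping and obtain a composite over the same axis scores.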
Reproducibility
Release artefacts are designed to make published numbers independently reproducible. The assay-harness repository is open-source and is being built around the scenario loader, scoring pipeline, non-sensitive run manifests, and provider runners. Those manifests exclude raw prompts, raw completions, secrets, internal endpoints, and any customer-derived data. Each release pack discloses the harness version and adapter set used for that release.
Dataset: Inputs, metadata, source class, axis labels, and scenario version. Apache 2.0 unless release notes state otherwise.
Rubric prose: Prose rubrics and reviewer sign-off notes for each capability axis. Apache 2.0 unless release notes state otherwise.
Scoring code: Programmatic checks, redacted judge configuration, fixtures, and scoring selection. Apache 2.0 unless release notes state otherwise.
Redacted runner config: Model identifiers, temperature, redacted prompt-shape metadata, seed policy, and public provider settings. Secrets, auth headers, internal endpoints, and private operational metadata are excluded.
Non-sensitive scored artefacts: Scored artefacts from the frozen scenario dataset only. Customer data, private prompts, raw completions, and sensitive metadata are excluded.
Axis scores and composite formula: Dated axis scores, variance, weights, and formula used for the release headline.
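The exclusions listed above amount to an allow-by-default filter with a deny list. The sketch below shows one way such redaction might work; the key names in `SENSITIVE_KEYS` are hypothetical and do not describe the real manifest schema, which each release pack defines.

```python
from typing import Any, Dict

# Hypothetical key names for illustration only.
SENSITIVE_KEYS = {
    "api_key",
    "auth_headers",
    "internal_endpoint",
    "raw_prompt",
    "raw_completion",
}


def redact(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with sensitive keys removed from nested dicts.

    Dicts nested inside lists are not traversed in this sketch.
    """
    clean: Dict[str, Any] = {}
    for key, value in config.items():
        if key in SENSITIVE_KEYS:
            continue
        clean[key] = redact(value) if isinstance(value, dict) else value
    return clean
```

The input mapping is left unmodified, so the unredacted configuration can stay private while only the returned copy is published.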
Example harness command shape
assay run \
  --dataset ./release-pack/dataset/ \
  --runner <provider>:<model> \
  --out ./reproduction-run/redacted-run-manifest.json
Model inclusion
Each release is designed to evaluate frontier APIs and open-weight families side by side, plus Agentsia specialists where a relevant specialist exists. Model versions, dates, temperatures, and non-default settings are disclosed.
Frontier APIs: Claude, Gemini, and GPT family versions current at the release date, pinned and dated.
Open-weight families: Best-of-breed Nemotron, Qwen, Gemma, and where applicable Llama, DeepSeek, or Mistral.
Agentsia specialists: Where a vertical specialist exists, it is labelled separately from its base open-weight model.
External submissions are not accepted in v1. Teams can still reproduce locally with the harness.
Retest cadence and versioning
A benchmark that is not current is not useful. Active verticals follow a quarterly refresh policy, and material frontier releases trigger a retest of affected verticals. Version tags say whether scores changed or the benchmark itself changed.
Dataset, rubrics, scoring functions, model set, redacted scored outputs, and composite formula.
New model versions or additional model classes scored against unchanged dataset and rubric.
Dataset, rubric, scorer, or weighting changed. A migration note explains non-comparability.
Every active vertical refreshes once per quarter with current model versions and approved rubric corrections.
A material frontier capability change starts a retest path rather than waiting for the next scheduled slot.
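The distinction between "scores changed" and "the benchmark itself changed" can be sketched as a classification over which parts of a release were touched. The component names and the returned labels below are hypothetical; the page does not specify the tag format.

```python
from typing import Set

# Hypothetical component names for illustration only.
BENCHMARK_PARTS = {"dataset", "rubric", "scorer", "weighting"}
SCORE_PARTS = {"model_versions", "model_classes"}


def classify_release(changed: Set[str]) -> str:
    """Label a release by the most significant kind of change it contains."""
    unknown = changed - BENCHMARK_PARTS - SCORE_PARTS
    if unknown:
        raise ValueError(f"unrecognised components: {sorted(unknown)}")
    if changed & BENCHMARK_PARTS:
        # Results are not comparable with earlier ones; a migration note is needed.
        return "benchmark-changed"
    if changed & SCORE_PARTS:
        # Same dataset and rubric, new or additional models scored.
        return "scores-changed"
    return "unchanged"
```

A release that both adds models and edits a rubric is classified as benchmark-changed, since the stronger condition governs comparability.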
Licensing and data posture
Scenario datasets and harness code are released under Apache 2.0 with version tags, unless release notes state otherwise. Provider outputs are published as scored responses to Labs scenarios, with model versions and dates of access. The data boundary is explicit.
No customer data is used in any Agentsia Labs publication.
Datasets are practitioner-authored or generated by internal tooling.
Practitioner-authored scenarios: Higher-signal items authored or reviewed by people who understand the workflow.
Synthetic scenarios: Generated to cover the capability grid, then reviewed before publication.
Provider outputs: Published as scored responses from the released scenario dataset, with model version, date, and non-sensitive settings.
Corrections, errata, and negative findings
If a rubric is found to be flawed after publication, the affected scenarios are marked, the composite is recomputed, and superseded numbers remain visible so prior citations stay anchored. Negative findings are published on the same surface as wins.
Report
Acknowledge within five working days
Triage affected scenarios
Issue corrected release
Keep superseded numbers visible
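Two of the steps above lend themselves to a small sketch: computing the acknowledgement deadline and issuing a correction without erasing the earlier number. The `PublishedScore` record and both function names are invented for illustration, and the working-day count assumes a Monday to Friday week with no public holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


def ack_deadline(reported: date, working_days: int = 5) -> date:
    """Date by which a report should be acknowledged (weekends skipped)."""
    day, remaining = reported, working_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return day


@dataclass
class PublishedScore:
    """Hypothetical record of one published composite."""
    value: float
    published_on: date
    superseded_by: Optional[int] = None  # index of the replacing entry


def issue_correction(history: List[PublishedScore], corrected: float,
                     on: date) -> List[PublishedScore]:
    """Append a corrected score and mark the latest entry superseded.

    Earlier entries are kept, so prior citations remain traceable.
    """
    updated = [PublishedScore(s.value, s.published_on, s.superseded_by)
               for s in history]
    if updated:
        updated[-1].superseded_by = len(updated)
    updated.append(PublishedScore(corrected, on))
    return updated
```

The history only grows: a superseded entry points at its replacement rather than being overwritten.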
Integrity rule
A research surface that hides its losses is not credible on its wins.
Colophon
This methodology is authored by Agentsia Labs. The public methodology page is versioned with the Agentsia website. Harness implementation changes are tracked in the assay-harness repository. Contributions are welcome through the harness repository’s issues and pull requests, or through the Agentsia contact form.
v1.0 draft - 2026-04 - Next revision expected before the Assay-Adtech v1 dataset freeze.