Methodology · v1 draft · 2026-04
How we build an Assay.
This page is the working specification for every Agentsia Labs benchmark. It is a living document; it will be versioned alongside each release and the published dataset. If you are considering citing an Assay result, read this first.
Authors: Agentsia Labs editorial team, with invited volunteer reviewers named per release. Feedback and errata: ammar@agentsia.uk.
1. Why these benchmarks exist
The public leaderboards by which frontier AI is measured do not cover commercial work.
Every frontier release comes with a table of scores: SWE-bench Verified, MMLU-Pro, GPQA Diamond, HLE, HumanEval, AIME. These benchmarks are rigorously constructed, publicly contested, and reliably updated. They have also, collectively, become the register in which AI progress is discussed.
None of them test the work that pays salaries inside regulated enterprises. An adtech DSP does not care whether a model can pass a graduate-level physics exam. A clinical-documentation team does not care whether it can solve a LeetCode-hard problem. A credit-decisioning group does not care about Codeforces. They care about whether a model can reason correctly, within a hard latency budget, over a narrow but deeply specific decision space: the auction window, the triage form, the transaction graph.
When a buyer at one of these teams asks “which model should I run here?”, there is no neutral reference to turn to. Frontier labs publish numbers on the work they enjoy being measured on. Everyone else relies on vendor-authored case studies, private consultancy reports, or ad-hoc trials that do not survive contact with a changed frontier version.
Agentsia Labs publishes what is missing: rigorously constructed, openly reproducible benchmarks for the verticals that matter commercially, retested every time a frontier model meaningfully changes.
2. Scenario authoring
Every scenario is authored by a practitioner and validated by a panel.
A benchmark is a set of scenarios. Each scenario has an input (prompt, context, data), a rubric (what a correct or acceptable response looks like), and a scoring function (how to turn a model response into a number).
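As a concrete sketch, a scenario record might be represented like this. The field names and the toy adtech example are illustrative assumptions, not the released dataset schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    # Illustrative field names; the published dataset schema may differ.
    scenario_id: str
    axis: str                        # capability axis label (see §4)
    prompt: str                      # input shown to the model
    context: dict                    # supporting data for the scenario
    rubric_prose: str                # human-readable acceptance criteria
    scorer: Callable[[str], float]   # maps a model response to a 0-100 score

# A toy adtech-flavoured scenario (entirely invented for illustration):
s = Scenario(
    scenario_id="adtech-0001",
    axis="bid-shading",
    prompt="Given the auction log below, recommend a shaded bid.",
    context={"floor_price": 1.20, "win_rate": 0.31},
    rubric_prose="Response must recommend a bid at or above the floor "
                 "and justify it from the observed win rate.",
    scorer=lambda response: 100.0 if "floor" in response else 0.0,
)
print(s.scorer("Bid 1.25, just above the floor, given the 31% win rate."))  # 100.0
```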
Scenarios come from two sources. First, domain practitioners on the invited review panel for the vertical: these are the people who have actually made the decisions our scenarios simulate, and they bring the texture that synthetic generation alone cannot. Second, a structured synthetic pipeline run inside Modelsmith: generate a candidate, run it past a panel reviewer, iterate to publication quality.
We publish the ratio of practitioner-authored to synthetic scenarios in every release. Neither is better absolutely. Practitioner-authored scenarios are higher signal per item; synthetic scenarios let us cover a capability grid at scale without systematic author bias.
Every scenario is labelled with the capability axis it is intended to test (see §4). A scenario that cannot be cleanly labelled, or whose rubric cannot be applied unambiguously, is rewritten or dropped. We do not publish ambiguous items.
3. Rubric design
Rubrics are prose first, code second, and always co-signed.
For each scenario, we write the rubric as a short prose paragraph first. If a human expert reading that paragraph cannot reliably decide whether a given model response meets the criteria, the rubric is wrong and we rewrite it. Only once the prose rubric passes a three-reviewer sanity check do we translate it to a scoring function.
Scoring functions are one of three kinds, in preference order: programmatic checkers (for cases where the correct output is structurally decidable), reference-matched LLM-as-judge (for cases where the correct output is expressible but has meaningful lexical variation), and panel-reviewed human scoring (for cases where the response quality is irreducibly judgement-heavy).
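A minimal sketch of the first kind, a programmatic checker, assuming a hypothetical scenario whose correct answer is a structurally decidable JSON decision object (the field names are invented for illustration):

```python
import json

def check_blocklist_decision(response_text: str) -> float:
    """Programmatic checker sketch: the scenario asks the model to emit a
    JSON decision object, so correctness is structurally decidable and no
    judge model is needed. Field names here are illustrative."""
    try:
        obj = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0
    # Required structure: a decision field with one of two values,
    # plus a non-empty list of cited rule identifiers.
    if obj.get("decision") not in {"allow", "block"}:
        return 0.0
    if not isinstance(obj.get("rules_cited"), list) or not obj["rules_cited"]:
        return 0.0
    return 100.0

print(check_blocklist_decision('{"decision": "block", "rules_cited": ["R-17"]}'))  # 100.0
print(check_blocklist_decision('not json'))                                        # 0.0
```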
Every LLM-as-judge scoring run uses a judge model disclosed in the release. We publish inter-judge agreement numbers when multiple judges are used. Judge choices are pinned per release so that retest deltas isolate the candidate-model change, not the judge change.
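One common agreement statistic for two judges giving categorical verdicts on the same items is Cohen's kappa. The sketch below shows how it could be computed; the statistic actually reported per release may differ:

```python
from collections import Counter

def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges' categorical verdicts on the same items.
    One possible agreement statistic; not necessarily the one a given
    release reports."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement under independent judges with the same marginals.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "fail"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```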
Co-signing matters. Every rubric in every Assay release is approved by at least two named reviewers drawn from the invited volunteer panel for that vertical. The reviewers see the prose rubric, a sample of representative scenarios, and the scoring-function selection. Their names appear on the release page.
4. Capability axes and aggregation
Every Assay breaks into 4 to 6 capability axes, plus one composite.
A single headline number is less useful than a capability breakdown. A DSP trader wants to know whether a model reasons correctly about bid shading separately from whether it handles blocklist decisioning. Composite scores flatten that information. We preserve it.
Each vertical release defines 4 to 6 capability axes in its methodology page, each scored on a 0 to 100 normalised scale. The composite is a weighted average. Weights are published with rationale. If a reader disagrees with the weighting, the per-axis numbers let them compute their own composite in a spreadsheet.
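The composite computation itself is the kind a reader could redo in a spreadsheet. A sketch, where the axis names and weights are invented for illustration and are not real release weights:

```python
def composite(axis_scores: dict, weights: dict) -> float:
    """Weighted average of per-axis scores (each on a 0-100 scale).
    Axis names and weights below are illustrative; real weights ship
    with each release alongside their rationale."""
    assert set(axis_scores) == set(weights)
    total_weight = sum(weights.values())
    return sum(axis_scores[k] * weights[k] for k in weights) / total_weight

scores  = {"bid-shading": 72, "blocklist": 88, "pacing": 65}
weights = {"bid-shading": 40, "blocklist": 35, "pacing": 25}  # out of 100
print(composite(scores, weights))  # 75.85
```

Because the per-axis numbers are published, a reader who prefers different weights can substitute their own dictionary and recompute.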
We publish variance. Every score is the result of three runs with different sampling seeds (temperature 0 where the API permits, temperature 0.2 otherwise), and the variance across seeds is reported alongside the mean. When variance is high relative to inter-model deltas, the release calls that out explicitly: a small gap in a high-variance regime is not a real gap.
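The three-seed report, and the "small gap in a high-variance regime" caveat, could be operationalised roughly as follows. The threshold rule here is an illustrative assumption, not the published criterion:

```python
from statistics import mean, stdev

def seed_report(runs):
    """Summarise three seeded runs of one model on one dataset."""
    return {"mean": mean(runs), "stdev": stdev(runs)}

def gap_is_real(model_a_runs, model_b_runs):
    # Illustrative rule only: treat a gap as real when the mean delta
    # exceeds the combined seed spread of the two models.
    a, b = seed_report(model_a_runs), seed_report(model_b_runs)
    delta = abs(a["mean"] - b["mean"])
    return delta > a["stdev"] + b["stdev"]

print(gap_is_real([71.0, 72.0, 73.0], [74.0, 75.0, 76.0]))  # True: delta 3 > spread 2
print(gap_is_real([68.0, 72.0, 76.0], [70.0, 74.0, 78.0]))  # False: delta 2 < spread 8
```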
5. Reproducibility
Every number we publish can be reproduced from the published artefacts in under an hour.
Each Assay release publishes: the scenario dataset (Apache 2.0), the rubric prose and scoring code (Apache 2.0), the model-runner configuration including version strings and temperature settings, the raw per-scenario model outputs (JSON), the scoring outputs per axis, the composite, and a reproduction command.
The reproduction command is a single invocation of the assay-harness CLI, pinned to the release tag, pointed at the published dataset. Running it against the published model identifiers regenerates the numbers within reported variance. If you cannot reproduce a number we published, that is a bug and we want the report.
The harness lives at github.com/agentsia-uk/assay-harness. It depends on nothing proprietary. Third parties can use it to score their own models against our datasets without involving us. Provider adapters (Anthropic, OpenAI, Google, Hugging Face inference, local vLLM) ship with the Assay-Adtech v1 release; before then, the harness is a runnable v0.1 scaffold with a stub runner used for testing.
6. Model inclusion
Frontier APIs, major open-weights families, and the post-trained specialists Agentsia builds.
Every release evaluates three classes of models side by side on identical scenarios with identical scoring. Class one: current frontier APIs (Claude, Gemini, GPT at latest-generation versions in use at the release date). Class two: major open-weights model families at their best-of-breed sizes (Nemotron, Qwen, Gemma, and where applicable Llama, DeepSeek, Mistral). Class three: Agentsia post-trained specialists for that vertical, labelled distinctly so readers can see the delta between an out-of-the-box open-weights model and one that has been post-trained through Modelsmith.
Model versions are pinned and disclosed with every run. When a frontier release ships with a material capability delta, we retest within two weeks and publish a delta note alongside the refreshed numbers. The previous release’s numbers remain available with a superseded tag.
Third-party submissions are not part of v1. Planned for v2: a submission route for teams to have their own models scored against published Assay datasets. Until then, the harness is open and anyone can reproduce numbers locally.
7. Retest cadence and versioning
Quarterly refresh per vertical. Automatic retest on frontier releases. Every number is dated and versioned.
A benchmark that is not current is not useful. Frontier models ship every two to three months; open-weights families iterate on similar or faster cycles. Two disciplines hold the refresh cadence together.
The scheduled discipline: every active vertical is refreshed once per quarter, regardless of external events. We publish the refresh date on the vertical page at least three weeks in advance. The refresh includes the current frontier model versions, any new open-weights releases in scope, and any rubric corrections that came out of the prior release’s reviewer feedback.
The triggered discipline: when a frontier model ships that we judge to be a material capability delta (new generation number, or step-change on at least one prior-release capability axis), we retest the affected verticals within two weeks and publish a delta note. We do not wait for the quarterly slot.
Versioning is simple: each benchmark has a vertical name and a version (Assay-Adtech v1.0, Assay-Adtech v1.1, Assay-Adtech v2.0). Within a version, minor increments (v1.1) indicate new model scores with the same dataset. Major increments (v2.0) indicate a revised dataset, a revised rubric, or a revised scoring function. Major increments publish a migration note explaining why the previous version’s numbers are no longer comparable.
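The comparability rule implied by this scheme can be expressed as a small check. The tag parsing below is an assumption about how tags are written, not the harness's actual implementation:

```python
import re

def parse_tag(tag):
    """Parse a release tag like 'Assay-Adtech v1.1' into (vertical, major, minor).
    Illustrative; the real harness may format its tags differently."""
    m = re.fullmatch(r"(Assay-[A-Za-z]+) v(\d+)\.(\d+)", tag)
    if not m:
        raise ValueError(f"unrecognised tag: {tag}")
    return m.group(1), int(m.group(2)), int(m.group(3))

def comparable(tag_a, tag_b):
    # Numbers are comparable only within the same vertical and major version:
    # minor increments reuse the dataset, major increments revise it.
    vertical_a, major_a, _ = parse_tag(tag_a)
    vertical_b, major_b, _ = parse_tag(tag_b)
    return vertical_a == vertical_b and major_a == major_b

print(comparable("Assay-Adtech v1.0", "Assay-Adtech v1.1"))  # True
print(comparable("Assay-Adtech v1.1", "Assay-Adtech v2.0"))  # False
```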
8. Licensing and data posture
Datasets under Apache 2.0. Harness under Apache 2.0. Raw outputs published in full.
Every scenario dataset is released under Apache 2.0 with version tags. Every line of the harness is released under Apache 2.0 on GitHub. Every per-model raw output is published in JSON alongside the leaderboard.
All scenario data is either practitioner-authored or Modelsmith-generated synthetic. No customer data is used in any Agentsia Labs publication. No scraped or unlicensed material appears in our datasets. If a scenario reflects a real-world distribution (volume of RTB traffic, fraud pattern shape), the distribution is synthesised from public information and practitioner review, not sourced from any specific third party.
In line with provider terms, model outputs are published as scored responses to our scenarios, with model versions and dates of access. We disclose temperature, system prompts, and any non-default settings. We do not use the outputs to train competing models, nor do we offer the outputs as a drop-in replacement for the underlying services.
9. Corrections, errata, and negative findings
We publish when we are wrong. We publish when we lose. Neither deserves to be hidden.
If a rubric is found to be flawed after publication, we issue a corrected version, mark the affected scenarios on the leaderboard, and recompute the composite. Superseded numbers remain visible with a superseded tag so that prior citations remain anchored.
Negative findings, including cases where Agentsia post-trained specialists underperform a frontier API on a particular capability axis, are published on the same leaderboard as everything else. We will never suppress a result because it is commercially inconvenient. A research surface that hides its losses is not credible on its wins.
Issue reports: open an issue on the harness repo, or email ammar@agentsia.uk. All substantive corrections are acknowledged within five working days.
Colophon
This methodology is authored by Agentsia Labs. It is versioned in the same git repository as the harness. Contributions welcome through the repo’s issues and pull requests, or by writing to ammar@agentsia.uk.
v1.0 draft · 2026-04 · Next revision expected ahead of the Assay-Adtech v1 release.