Agentsia Labs · Research surface

Vertical AI benchmarks that public leaderboards skip.

Frontier leaderboards report SWE-bench, MMLU-Pro, GPQA Diamond, and HLE. The active Agentsia evidence is in adtech: auction decisions, bid shading, and MFA classification. Credit reasoning, clinical documentation, automotive agents, and other narrow workflows remain future Assay candidates until their evidence is frozen.

Agentsia Labs publishes Assay, an open benchmark programme for domain-specific model evaluation. The current active surface is Assay-Adtech v1 on a released 344-scenario corpus; other verticals are roadmap candidates, not active proof.

Current benchmark

Assay-Adtech v1 specification.

Brand safety, bid shading, and MFA classification under a 100 ms auction envelope. Released corpus v1.8.0-rc.4 with current-hash production frontier baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5. The released leaderboard and release artefacts remain the citable record.

Available now

Specification

Available now

Methodology draft

Available now

Released corpus v1.8.0-rc.4

Available now

Current-hash frontier proof package

Available now

Technical corpus inspection

The programme

Three disciplines hold every release together.

Open by default

Assays are designed to ship with scenario datasets, rubric prose, scoring code, and redacted non-sensitive run metadata. The current public surface shows the programme status before the first frozen dataset is published.

Versioned and reproducible

Each release candidate will be tied to a dataset tag, harness version, model version, date, temperature, and run configuration. Retest policy is part of the release contract, not a vague promise.
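
As an illustration of the fields listed above, here is a minimal sketch of a run manifest. The class name `RunManifest`, its field names, the `fingerprint` helper, and every example value except the dataset tag are assumptions made for this sketch, not part of the published harness.

```python
from dataclasses import asdict, dataclass, field
import hashlib
import json


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what a release candidate is tied to."""

    dataset_tag: str
    harness_version: str
    model_version: str
    run_date: str  # ISO 8601 date string
    temperature: float
    run_config: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialise with sorted keys so equal manifests hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only; the dataset tag is the one named on this page.
example = RunManifest(
    dataset_tag="v1.8.0-rc.4",
    harness_version="0.0.0-example",
    model_version="example-model",
    run_date="2025-01-01",
    temperature=0.0,
    run_config={"max_tokens": 256},
)
```

Two runs with the same manifest produce the same fingerprint, so a retest can be checked against the original configuration before scores are compared.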

Published by practitioners

Rubrics are written in prose first, then translated into scoring functions. Practitioner reviewers are named on release pages when a release is frozen, so review status stays auditable.
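
To make the prose-first approach concrete, the sketch below pairs an invented rubric sentence with a scoring function that follows it. The rubric wording, the function `score_response`, and its parameters are hypothetical and are not taken from any Assay rubric.

```python
from typing import Sequence

# Invented rubric prose, written before any code exists.
RUBRIC_PROSE = (
    "Full credit (1.0) when the predicted label matches the reviewer label "
    "and the rationale names at least one supporting signal; half credit "
    "(0.5) when the label matches but no signal is named; otherwise 0.0."
)


def score_response(
    predicted_label: str,
    reviewer_label: str,
    cited_signals: Sequence[str],
) -> float:
    """Translate RUBRIC_PROSE into a score, one clause at a time."""
    if predicted_label != reviewer_label:
        return 0.0  # "otherwise 0.0"
    if cited_signals:
        return 1.0  # label matches and at least one signal is named
    return 0.5  # label matches but no signal is named
```

Keeping the prose next to the function lets a reviewer check each branch against the sentence it implements.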

Methodology at a glance

A release is not citable until the artefacts are frozen.

The current methodology is a draft protocol. It defines how an Assay moves from scenario authoring to reviewer validation, frozen dataset, scored run, release note, and retest.

  1. Scenario authoring
  2. Reviewer validation
  3. Frozen dataset
  4. Scored run
  5. Release note and retest policy
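
The five steps above can be sketched as an ordered set of stages that a release moves through one at a time. The names `Stage`, `advance`, and `dataset_is_frozen` are assumptions for illustration; the draft protocol itself does not prescribe an implementation.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five methodology steps, in the order listed above."""

    SCENARIO_AUTHORING = 1
    REVIEWER_VALIDATION = 2
    FROZEN_DATASET = 3
    SCORED_RUN = 4
    RELEASE_NOTE_AND_RETEST_POLICY = 5


def advance(current: Stage) -> Stage:
    """Move to the next stage; stages cannot be skipped."""
    if current is Stage.RELEASE_NOTE_AND_RETEST_POLICY:
        raise ValueError("already at the final stage")
    return Stage(current + 1)


def dataset_is_frozen(current: Stage) -> bool:
    """True once the release has reached the frozen-dataset step."""
    return current >= Stage.FROZEN_DATASET
```

Modelling the steps as an ordered enumeration makes it impossible to record a scored run against a dataset that has not yet been frozen.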

The roadmap

Adtech is active. Other verticals remain future candidates.

Future labels are planning posture, not released results. A non-adtech vertical enters the refresh cadence only after its first dataset, rubric, and result table are frozen.
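
The gate described above amounts to a three-part check. A minimal sketch follows, assuming a hypothetical `VerticalStatus` record and `enters_refresh_cadence` helper that are not part of any published Assay tooling.

```python
from dataclasses import dataclass


@dataclass
class VerticalStatus:
    """Hypothetical freeze status for one roadmap vertical."""

    name: str
    dataset_frozen: bool = False
    rubric_frozen: bool = False
    result_table_frozen: bool = False


def enters_refresh_cadence(vertical: VerticalStatus) -> bool:
    # All three artefacts must be frozen; any one missing keeps the
    # vertical in the planning posture.
    return (
        vertical.dataset_frozen
        and vertical.rubric_frozen
        and vertical.result_table_frozen
    )
```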

Assay-Adtech v1

Active evidence: released 344-scenario RTB reasoning, bid shading, pre-bid MFA filtering, supply-path optimisation, blocklist decisioning, and latency-constrained reasoning.

Released

Assay-Fintech v1

Future candidate: fraud typology, credit decisioning, transaction-pattern reasoning, and compliance-aligned reasoning after adtech evidence.

Future

Assay-Legaltech v1

Statutory interpretation, contract analysis, and regulatory reasoning under jurisdiction-specific constraints. In design.

2027

Assay-Health v1

Future candidate: clinical triage, decision support, and documentation quality after clinical reviewer gates are established.

2027

Assay-Auto v1

ADAS perception edge cases, in-vehicle agent reasoning, and driving-telemetry interpretation. In design.

2027

Colophon

Research programme produced by Agentsia.

Labs is the public research surface for Agentsia’s vertical evaluation work. The public site shows release specifications, methodology, roadmap, and links to open harness artefacts. Customer data, private prompts, raw logs, adapter paths, and internal run artefacts stay outside Labs.

If you want the platform behind the programme, visit agentsia.uk/platform. For release feedback, use the contact form.