Assay · Ad tech · May 2026

Assay-Adtech v1 specification

Brand-safety, bid shading, and MFA classification under a 100ms auction envelope.

Status: Released
Release: May 2026
Scenarios: 344 scenarios · hash 162ff7fcd8ce

Leaderboard

Released proof set: 344 scenarios, manifest v1.8.0-rc.4, scenario_set_hash 162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6. The table below contains the current no-tools production frontier baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5.

| Rank | Model | Provider | Cluster | Composite | 95% CI | Runs | Unparseable |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 | Anthropic | assay-adtech-gemma-4-e4b-it | 42.4% | 37.2-47.7% | 3 | 0 |
| 2 | claude-opus-4-7 | Anthropic | assay-adtech | 41.6% | 36.3-46.8% | 3 | 1 |
| 3 | claude-opus-4-7 | Anthropic | assay-adtech-gemma-4-e2b-it | 41.0% | 35.8-46.2% | 3 | 0 |
| 4 | gpt-5-5 | OpenAI | assay-adtech | 40.4% | 35.2-45.6% | 3 | 0 |
| 5 | gpt-5-5 | OpenAI | assay-adtech-gemma-4-e4b-it | 39.8% | 34.9-44.8% | 3 | 0 |
| 6 | gpt-5-5 | OpenAI | assay-adtech-gemma-4-e2b-it | 39.2% | 34.0-44.5% | 3 | 1 |
| 7 | gemini-3-1-pro-preview | Google | assay-adtech-gemma-4-e4b-it | 38.4% | 33.1-43.6% | 3 | 0 |
| 8 | gemini-3-1-pro-preview | Google | assay-adtech | 37.2% | 32.3-42.4% | 3 | 0 |
| 9 | gemini-3-1-pro-preview | Google | assay-adtech-gemma-4-e2b-it | 37.2% | 32.3-42.4% | 3 | 0 |

Current-hash production baselines for scenario_set_hash 162ff7fcd8ce. Each row is a no-tools three-run provider baseline against the released 344-scenario Assay-Adtech corpus.
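The page quotes both a 12-character short hash and the full 64-character scenario_set_hash. As a minimal sketch, a result file could be checked against the released set by prefix match. The function name `matches_released_set` and the 12-character minimum prefix length are assumptions made for illustration; they are not part of the published harness.

```python
import re

# Values quoted on this page.
FULL_HASH = "162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6"
SHORT_HASH = "162ff7fcd8ce"


def matches_released_set(candidate: str, full_hash: str = FULL_HASH) -> bool:
    """Return True if `candidate` is a hex prefix (12 to 64 chars) of `full_hash`.

    Hypothetical helper: the 12-character minimum mirrors the short hash
    shown on the page and is an assumption, not a documented rule.
    """
    candidate = candidate.strip().lower()
    if not re.fullmatch(r"[0-9a-f]{12,64}", candidate):
        return False
    return full_hash.startswith(candidate)
```

A baseline row reported against a different hash would fail this check and should not be compared with the table above.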

Description

Assay-Adtech v1 measures how well a model reasons under the 100ms IAB OpenRTB auction window across enterprise ad tech decisions: brand safety, MFA detection, bid shading, supply-path reasoning, privacy compliance, and multi-turn incident triage. The released public surface is the 344-scenario corpus (manifest v1.8.0-rc.4, scenario_set_hash 162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6), together with no-tools production baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5.

Capability axes

Brand safety
MFA detection
Bid shading
Latency budget

Rubric

  • Brand-safety classification: precision and recall against a held-out blocklist that ages between retest cycles.
  • MFA detection: structural reasoning over a synthetic publisher graph, scored on F1 against editorial labels.
  • Bid shading: predicted win-rate calibration measured by Brier score against synthetic, publicly derived auction cohorts.
  • Latency budget: every scenario must complete decode within 100ms on the published consumer-grade reference hardware.
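The rubric items above name standard metrics, which can be sketched as follows. This is an illustrative sketch only: the function names, the input shapes, and the inclusive treatment of the 100ms boundary are assumptions, and the way the harness combines these into the composite score is not shown here.

```python
from typing import Sequence


def precision_recall_f1(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> tuple[float, float, float]:
    """Precision, recall, and F1 for binary labels (True = flagged)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = [(bool(p), bool(a)) for p, a in zip(predicted, actual)]
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def brier_score(win_probs: Sequence[float], won: Sequence[bool]) -> float:
    """Mean squared error between predicted win probability and outcome."""
    if len(win_probs) != len(won) or len(win_probs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - float(w)) ** 2 for p, w in zip(win_probs, won)) / len(win_probs)


def within_latency_budget(decode_ms: float, budget_ms: float = 100.0) -> bool:
    """Assumption: a decode of exactly 100ms counts as within budget."""
    return decode_ms <= budget_ms
```

Lower Brier scores indicate better-calibrated win-rate predictions, whereas precision, recall, and F1 are higher-is-better.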

Retest policy

Retest triggers: (1) any frontier release within the announced families, (2) a fresh quarterly synthetic auction cohort, (3) blocklist drift exceeding 5% by editorial audit. Each retest republishes the dataset version tag and keeps superseded results dated rather than overwritten.
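The three triggers can be read as a simple disjunction. The sketch below assumes a hypothetical `RetestSignals` record and expresses blocklist drift as a fraction; neither is defined by the specification.

```python
from dataclasses import dataclass

# "Exceeding 5%" is read as strictly greater than 0.05.
DRIFT_THRESHOLD = 0.05


@dataclass
class RetestSignals:
    """Hypothetical inputs to the retest decision (names are illustrative)."""
    frontier_release_in_family: bool  # trigger (1)
    new_quarterly_cohort: bool        # trigger (2)
    blocklist_drift: float            # trigger (3), as a fraction, e.g. 0.06


def retest_reasons(signals: RetestSignals) -> list[str]:
    """Return the triggers that fired; an empty list means no retest is due."""
    reasons = []
    if signals.frontier_release_in_family:
        reasons.append("frontier release in announced family")
    if signals.new_quarterly_cohort:
        reasons.append("fresh quarterly synthetic auction cohort")
    if signals.blocklist_drift > DRIFT_THRESHOLD:
        reasons.append("blocklist drift above 5%")
    return reasons
```

Any non-empty result would call for a retest, with the superseded results kept and dated as described above.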

Publication artefacts

Specification: Available
Methodology draft: Available
Released corpus v1.8.0-rc.4: Available
Current-hash frontier proof package: Available
Technical corpus inspection: Available

Published scenario rows are intended to be scored using the open assay-harness. Provider adapter availability is disclosed per release. See the methodology for scenario authoring, HITL validation, and dataset licensing.
