Assay · Ad tech · May 2026

Assay-Adtech v1 specification

Brand-safety, bid shading, and MFA classification under a 100ms auction envelope.

Status: Released

Release: May 2026

Scenarios: 344 scenarios · hash 162ff7fcd8ce

Results: Released leaderboard available

Leaderboard

Released proof set: 344 scenarios, manifest v1.8.0-rc.4, scenario_set_hash 162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6. The table below contains the current no-tools production frontier baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5.

Rank  Model                   Provider   Cluster                      Composite  95% CI      Runs  Unparseable
      claude-opus-4-7         Anthropic  assay-adtech-gemma-4-e4b-it  42.4%      37.2-47.7%
      claude-opus-4-7         Anthropic  assay-adtech                 41.6%      36.3-46.8%
      claude-opus-4-7         Anthropic  assay-adtech-gemma-4-e2b-it  41.0%      35.8-46.2%
      gpt-5-5                 OpenAI     assay-adtech                 40.4%      35.2-45.6%
      gpt-5-5                 OpenAI     assay-adtech-gemma-4-e4b-it  39.8%      34.9-44.8%
      gpt-5-5                 OpenAI     assay-adtech-gemma-4-e2b-it  39.2%      34.0-44.5%
      gemini-3-1-pro-preview  Google     assay-adtech-gemma-4-e4b-it  38.4%      33.1-43.6%
      gemini-3-1-pro-preview  Google     assay-adtech                 37.2%      32.3-42.4%
      gemini-3-1-pro-preview  Google     assay-adtech-gemma-4-e2b-it  37.2%      32.3-42.4%
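
The page does not say how the 95% intervals were computed. As a rough sanity check only, the sketch below applies a normal-approximation (Wald) interval, under the assumption that a composite score can be treated as a pass rate over 344 independent scenarios. The function name `wald_interval` is illustrative and not part of any released harness. For the top row (42.4%), the result lands close to the published interval, though not necessarily on the same bounds; a bootstrap over scenarios or runs would differ slightly.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a proportion p_hat observed over n trials.

    Assumption: the composite behaves like a simple pass rate over n
    independent scenarios. The source does not confirm this.
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clamp to the valid probability range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Top leaderboard row: composite 42.4% over the 344-scenario corpus.
low, high = wald_interval(0.424, 344)
print(f"{low:.1%} to {high:.1%}")
```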

Benchmark Insights & Takeaways

v1.0 corpus (162ff7fc)

Key Findings

• Frontier model composite scores range from 37.2% (Gemini 3.1 Pro Preview) to 42.4% (Claude Opus 4.7), a spread of 5.2 percentage points.
• All evaluated frontier models failed 126 of the 344 scenarios (36.6%), which points to reasoning blind spots shared across providers.
• Models generally handle high-level conceptual guidelines well (e.g. general GDPR/TCF consent rules), but fail on operational field-level validations and edge cases.
• No evaluated model leads on every topic, reinforcing the case for specialized multi-agent packaging.

Topic Performance Breakdown

Topic         Scenarios  Claude Opus 4.7  GPT-5.5  Gemini 3.1 Pro
OpenRTB       48         62%              58%      42%
privacy       36         75%              69%      56%
IVT           32         44%              50%      38%
formats       40         80%              78%      70%
targeting     38         68%              63%      58%
supply-chain  30         50%              43%      40%
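
The finding that no model leads on every topic can be checked directly against the breakdown above. The sketch below copies the published per-topic pass rates into a dictionary and picks the top scorer for each topic. The variable and function names are illustrative.

```python
# Pass rates (percent) copied from the topic breakdown above.
topic_scores = {
    "OpenRTB": {"Claude Opus 4.7": 62, "GPT-5.5": 58, "Gemini 3.1 Pro": 42},
    "privacy": {"Claude Opus 4.7": 75, "GPT-5.5": 69, "Gemini 3.1 Pro": 56},
    "IVT": {"Claude Opus 4.7": 44, "GPT-5.5": 50, "Gemini 3.1 Pro": 38},
    "formats": {"Claude Opus 4.7": 80, "GPT-5.5": 78, "Gemini 3.1 Pro": 70},
    "targeting": {"Claude Opus 4.7": 68, "GPT-5.5": 63, "Gemini 3.1 Pro": 58},
    "supply-chain": {"Claude Opus 4.7": 50, "GPT-5.5": 43, "Gemini 3.1 Pro": 40},
}


def leaders(scores: dict[str, dict[str, int]]) -> dict[str, str]:
    """Return the highest-scoring model for each topic."""
    return {topic: max(by_model, key=by_model.get) for topic, by_model in scores.items()}


topic_leaders = leaders(topic_scores)
distinct_leaders = set(topic_leaders.values())

for topic, model in topic_leaders.items():
    print(f"{topic}: {model}")
print("single model leads everywhere:", len(distinct_leaders) == 1)
```

On these figures, Claude Opus 4.7 leads five of the six topics and GPT-5.5 leads IVT, so no single model is ahead everywhere.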

Caveats & Methodology

• Topic groupings are determined heuristically based on scenario ID naming conventions and prefixes.
• Pass rates represent the median performance across evaluated baseline runs.

Current-hash production baselines for scenario_set_hash 162ff7fcd8ce. Each row is a no-tools three-run provider baseline against the released 344-scenario Assay-Adtech corpus.
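
The page quotes the scenario set hash in full and in abbreviated forms (162ff7fcd8ce and 162ff7fc). A reader reconciling results across artefacts may want to confirm that an abbreviated hash refers to the same corpus. The sketch below is a plain prefix check; the helper name `matches_short_hash` is hypothetical, and nothing here reproduces how the hash itself is derived, which the page does not describe.

```python
# Full scenario_set_hash as published for the released corpus.
FULL_HASH = "162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6"


def matches_short_hash(full_hash: str, short_hash: str) -> bool:
    """True if short_hash is a non-empty, case-insensitive prefix of full_hash."""
    return bool(short_hash) and full_hash.lower().startswith(short_hash.lower())


print(matches_short_hash(FULL_HASH, "162ff7fcd8ce"))
print(matches_short_hash(FULL_HASH, "162ff7fc"))
print(matches_short_hash(FULL_HASH, "deadbeef"))
```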

Description

Assay-Adtech v1 measures how well a model can reason around the IAB OpenRTB 100ms auction window across enterprise adtech decisions including brand safety, MFA detection, bid shading, supply-path reasoning, privacy compliance, and multi-turn incident triage. The released public surface is the v1.8.0-rc.4, 344-scenario corpus with scenario_set_hash 162ff7fcd8ce4266af8848938b3fc6415000843e0901651456d3fa4191fc65b6 and no-tools production baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5.

Capability axes

Brand safety

MFA detection

Bid shading

Latency budget

Rubric

Brand-safety classification: precision and recall against a held-out blocklist that ages between retest cycles.
MFA detection: structural reasoning over a synthetic publisher graph, scored on F1 against editorial labels.
Bid shading: predicted win-rate calibration measured by Brier score against synthetic, publicly derived auction cohorts.
Latency budget: every scenario must complete decode within 100ms on the published consumer-grade reference hardware.
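
The rubric names three standard metrics: precision and recall, F1, and the Brier score. The sketch below shows their textbook definitions on toy data. It is an illustration of the metrics only and not the assay-harness scoring code; the function names and sample values are assumptions made for the example.

```python
def precision_recall_f1(predicted: list[bool], actual: list[bool]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 for parallel lists of labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted win probabilities and 0/1 outcomes."""
    if not probabilities:
        raise ValueError("brier_score needs at least one prediction")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


# Toy brand-safety labels: True means "flagged as unsafe".
precision, recall, f1 = precision_recall_f1(
    predicted=[True, True, False, True],
    actual=[True, False, False, True],
)
# Toy bid-shading predictions: predicted win probability vs. observed win (1) or loss (0).
brier = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
print(precision, recall, f1, brier)
```

A lower Brier score means better calibration; a model that always predicts 0.5 scores 0.25 regardless of outcomes.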

Retest policy

Retest triggers: (1) any frontier release within the announced families, (2) a fresh quarterly synthetic auction cohort, (3) blocklist drift exceeding 5% by editorial audit. Each retest republishes the dataset version tag and keeps superseded results dated rather than overwritten.
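
The three triggers can be read as an any-of rule. The sketch below encodes that reading; the `RetestSignals` fields and the `retest_reasons` helper are hypothetical names, and "exceeding 5%" is interpreted as strictly greater than 0.05.

```python
from dataclasses import dataclass

# Blocklist drift threshold from the retest policy (5% by editorial audit).
DRIFT_THRESHOLD = 0.05


@dataclass
class RetestSignals:
    frontier_release_in_announced_family: bool
    new_quarterly_cohort: bool
    blocklist_drift: float  # fraction of the blocklist changed, e.g. 0.07 for 7%


def retest_reasons(signals: RetestSignals) -> list[str]:
    """List every trigger that currently fires; an empty list means no retest is due."""
    reasons = []
    if signals.frontier_release_in_announced_family:
        reasons.append("frontier release")
    if signals.new_quarterly_cohort:
        reasons.append("quarterly cohort")
    if signals.blocklist_drift > DRIFT_THRESHOLD:
        reasons.append("blocklist drift")
    return reasons


print(retest_reasons(RetestSignals(False, True, 0.07)))
print(retest_reasons(RetestSignals(False, False, 0.05)))
```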

Publication artefacts

Specification: Available

Methodology draft: Available

Released corpus v1.8.0-rc.4: Available

Current-hash frontier proof package: Available

Technical corpus inspection: Available

Published scenario rows are intended to be scored using the open assay-harness. Provider adapter availability is disclosed per release. See the methodology for scenario authoring, HITL validation, and dataset licensing.
