Agentsia Labs · Research surface

Independent benchmarks for the commercial verticals no leaderboard tests.

Frontier leaderboards track SWE-bench, MMLU-Pro, GPQA Diamond, and HLE. None of them test adtech, fintech, legaltech, automotive, or clinical decisioning. Agentsia Labs publishes rigorous, open, reproducible evaluations of frontier APIs and open-weights specialists on the commercial workflows that matter to regulated enterprises.

Quarterly cadence. Automatic retest on frontier releases. Open-source harness. Published datasets. Invited expert review.

The programme

Three disciplines hold every release together.

Open by default

Every scenario, every rubric, every score aggregation, every raw model output is published. Datasets ship under Apache 2.0. The harness is on GitHub. Practitioners can reproduce any number we publish, end to end, on their own hardware.
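
As an illustration of what end-to-end reproduction could look like, the Python sketch below re-scores published raw outputs against a published rubric. The directory layout, field names, and toy keyword rubric are assumptions made for the sake of a runnable example, not the actual harness.

```python
# Minimal sketch of re-scoring a published release offline.
# Directory layout, field names, and the toy rubric are assumptions.
import json
from pathlib import Path

def score_response(output: str, criteria: list[str]) -> float:
    """Toy rubric: fraction of required criteria the output mentions."""
    hits = sum(1 for c in criteria if c.lower() in output.lower())
    return hits / len(criteria)

def recompute_scores(release_dir: str) -> dict[str, float]:
    """Recompute per-model mean scores from published raw outputs."""
    release = Path(release_dir)
    rubric = json.loads((release / "rubric.json").read_text())
    totals: dict[str, list[float]] = {}
    # Raw outputs ship as plain files, so re-scoring needs no API access.
    for record_file in sorted((release / "raw_outputs").glob("*.jsonl")):
        for line in record_file.read_text().splitlines():
            record = json.loads(line)
            score = score_response(record["output"], rubric[record["scenario_id"]])
            totals.setdefault(record["model"], []).append(score)
    return {model: sum(s) / len(s) for model, s in totals.items()}
```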

Versioned and reproducible

Each release is pinned to git SHAs for both the dataset and the harness. Model responses are recorded with model version, date, temperature, and system prompt. Frontier updates trigger automatic retests. Stale numbers do not survive.
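
For concreteness, a recorded response might look like the sketch below. The field names and values here are illustrative assumptions, not the harness's actual schema.

```python
# Illustrative shape of one recorded model response.
# Field names and values are assumptions, not the published schema.
response_record = {
    "model": "example-frontier-model",    # provider and model family
    "model_version": "2026-01-15",        # provider-reported version string
    "run_date": "2026-02-03T14:22:05Z",   # when the response was collected
    "temperature": 0.0,                   # sampling settings, pinned
    "system_prompt": "You are a bid-shading analyst. ...",
    "dataset_sha": "a1b2c3d",             # git SHA of the scenario set
    "harness_sha": "e4f5a6b",             # git SHA of the eval harness
}
```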

Published by practitioners

Rubrics are co-authored with invited domain operators (DSP traders, credit decision scientists, clinical informaticians). Methodology is readable in prose, not just code. If you disagree with a scoring choice, the argument to have is visible.

The cadence

Quarterly refresh per vertical. Automatic retest on every material frontier release.

A benchmark is only credible if it is current. Every active vertical is refreshed each quarter with the latest frontier versions and new open-weights releases. When a frontier model ships with a material capability delta, we retest within two weeks and publish. An announced cadence creates external accountability and gives practitioners a reason to return.

The roadmap

Five verticals in total. Two with committed quarters; three in design for 2027.

Assay-Adtech v1 · Q2 2026

RTB reasoning, bid shading, pre-bid MFA filtering, supply-path optimisation, blocklist decisioning, latency-constrained reasoning.

Assay-Fintech v1 · Q3 2026

Fraud typology, credit decisioning, transaction-pattern reasoning, compliance-aligned reasoning under regulatory constraints.

Assay-Legaltech v1 · 2027

Statutory interpretation, contract analysis, regulatory reasoning under jurisdiction-specific constraints. In design.

Assay-Health v1 · 2027

Clinical triage, decision support, documentation quality under HIPAA-aligned constraints. In design.

Assay-Auto v1 · 2027

ADAS perception edge cases, in-vehicle agent reasoning, driving-telemetry interpretation. In design.

Colophon

Research produced with Modelsmith.

Every benchmark release is authored end to end by Modelsmith, the specialisation control plane that Agentsia licenses to regulated enterprises. The synthetic scenarios come out of the data-generation agent. The evaluations run through the same eval harness practitioners use on their own workflows. The post-trained specialists ship with real promotion records. The surface is the receipts.

If you want the platform behind the numbers, visit agentsia.uk/platform.