Constructs before proxies
We begin with the underlying behavior or risk, not the easiest score to instrument.
Psychometrics for AI development
Eidos Research designs psychometric infrastructure for labs and product teams: constructs, rubrics, calibration loops, and launch thresholds that stay legible when scores start carrying actual power.
Operating frame
Scores become policy faster than teams expect. We design the layer that makes those scores interpretable before they become institutional.
Principle
Reliability work comes before reporting polish, because the metric matters only if the judgment is stable.
Instruments
Psychometrics spent a century building tools to answer one question: does this number mean what we think it means? Most of that apparatus has never reached AI evaluation. These are the instruments we bring, and the specific failure each one catches.
Construct validity: whether a score measures the capability you intend, or an artifact of how you built the test.
In AI eval: the question that "we scored 87%" cannot answer on its own, and the one that frontier safety frameworks and regulators are learning to ask first.
Item response theory: models item difficulty and discrimination as parameters, and ability as a latent estimate on the same scale, so systems that answered different items stay comparable.
In AI eval: replaces "fraction correct on a frozen suite" and makes adaptive, saturation-resistant benchmarks possible.
Generalizability theory: decomposes score variance into its sources (item, rater, occasion, prompt phrasing) instead of collapsing them into one number.
In AI eval: tells you whether a two-point gain is capability or a run-to-run artifact of the harness.
Differential item functioning: detects items that behave differently across groups for reasons unrelated to the trait being measured.
In AI eval: surfaces prompt formats that advantage one model family independent of capability, that is, format luck scored as skill.
Inter-rater reliability: Cohen's κ, Krippendorff's α, and intraclass correlation, measuring agreement beyond chance among independent raters.
In AI eval: the precondition for trusting any rubric-scored or LLM-judged result, human or model rater alike.
Convergent and discriminant validity: separates convergent signal (what a construct shares with its kin) from discriminant signal (what sets it apart).
In AI eval: catches composite "capability" scores that silently aggregate unrelated constructs into one misleading headline.
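To make the item-parameter idea above concrete, here is a minimal sketch of a two-parameter logistic (2PL) model, one common way to treat difficulty and discrimination as item parameters and ability as a latent estimate. It is an illustration under assumptions, not a description of any particular production pipeline: the item parameters are taken as already calibrated, the response data is invented, and the function names (`p_correct`, `estimate_ability`) are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta).

    a = discrimination, b = difficulty (both assumed already calibrated).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses):
    """responses: list of (a, b, correct) tuples, with correct in {0, 1}."""
    total = 0.0
    for a, b, correct in responses:
        p = p_correct(theta, a, b)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

def estimate_ability(responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses))

# Illustrative data: two systems that answered DIFFERENT items.
# Both got 2 of 4 correct, but system B's items are uniformly harder.
system_a = [(1.0, -1.0, 1), (1.0, -0.5, 1), (1.0, 0.0, 0), (1.2, 0.5, 0)]
system_b = [(1.0, 0.5, 1), (1.0, 1.0, 1), (1.0, 1.5, 0), (1.2, 2.0, 0)]

theta_a = estimate_ability(system_a)
theta_b = estimate_ability(system_b)
print(f"fraction correct: A=0.50 B=0.50; theta: A={theta_a:.2f} B={theta_b:.2f}")
```

Both systems have the same fraction correct, yet the ability estimates differ because they sit on the scale of the items, which is what keeps systems that saw different items comparable. A grid search is used only for transparency; all-correct or all-incorrect response patterns have no finite maximum-likelihood estimate and would simply land on the grid boundary.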
Demonstration
A practical review batch: two raters score the same set of model outputs against a gold adjudication policy. Change a verdict and watch reliability, miss patterns, and decision readiness move together.
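A review batch like the one described can be sketched in a few lines. The example below computes Cohen's κ between two raters and lists the "misses" against gold, here assumed to mean outputs the gold policy fails but a rater passed. The labels, the batch, and the helper names are all invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

def misses(gold, rater):
    """Indices where gold says 'fail' but the rater said 'pass' (assumed definition)."""
    return [i for i, (g, r) in enumerate(zip(gold, rater))
            if g == "fail" and r == "pass"]

# Hypothetical batch of ten model outputs.
gold    = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
rater_1 = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass"]
rater_2 = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]

print("kappa, rater 1 vs rater 2:", round(cohens_kappa(rater_1, rater_2), 3))
print("kappa, rater 1 vs gold:   ", round(cohens_kappa(rater_1, gold), 3))
print("rater 1 misses at indices:", misses(gold, rater_1))
```

Flipping a single verdict in `rater_1` or `rater_2` and rerunning shows how one changed label moves both the agreement statistic and the miss list at once.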
Architecture
For teams moving from evals to agents, we use the same measurement discipline at runtime. PCP turns that discipline into an execution frame: perception gathers live evidence, context preserves durable state, and permission gates every action before an agent touches the world.
Perceive what is live. Carry what matters. Approve what can change the world.
The point is not to make agent behavior feel magical. The point is to make it legible, resumable, and governable when the run crosses into operational, financial, legal, or safety-relevant work.
Threaded state, checkpoints, resumable runs.
MCP or direct connectors for docs, logs, and systems.
Role, payload, and case checks before every tool call.
Approvals, tenant boundaries, redaction, and audit trails.
Perception: normalize user input, retrieved documents, UI state, logs, and external evidence into typed signals. Retrieved material stays separate from policy so prompt injection does not become authority.
Context: persist task state, approvals, tool history, and working summaries across long-running workflows. Compact aggressively, keep lossless audit records, and version the schema so paused runs can resume safely.
Permission: evaluate each action request against least-privilege policy, payload constraints, and runtime risk. High-impact actions pause for approval, and side effects carry execution records and idempotency keys.
Durable, state-first orchestration for long-running workflows; lighter sequential runtimes where a full state machine is overkill. We stay framework-agnostic and prefer MCP when teams need live context interoperability.
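The permission step described above (role and payload checks before each tool call, a pause for high-impact actions, and idempotency keys on side effects) can be sketched as a small gate. This is a hypothetical illustration, not the PCP implementation: every class, field, and tool name below is an assumption, as is the use of an `amount` field as the payload constraint.

```python
from dataclasses import dataclass, field
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"

@dataclass(frozen=True)
class ActionRequest:
    role: str
    tool: str
    payload: dict
    idempotency_key: str

@dataclass
class PermissionGate:
    allowed_tools: dict        # role -> set of tool names (least privilege)
    high_impact_tools: set     # tools that pause for human approval
    max_amount: float          # illustrative payload constraint
    executed: dict = field(default_factory=dict)  # idempotency_key -> record

    def evaluate(self, req, approved=False):
        if req.tool not in self.allowed_tools.get(req.role, set()):
            return Decision.DENY
        if req.payload.get("amount", 0) > self.max_amount:
            return Decision.DENY
        if req.tool in self.high_impact_tools and not approved:
            return Decision.NEEDS_APPROVAL
        return Decision.ALLOW

    def execute(self, req, run_tool, approved=False):
        # A repeated key returns the stored record instead of re-running the side effect.
        if req.idempotency_key in self.executed:
            return self.executed[req.idempotency_key]
        decision = self.evaluate(req, approved)
        record = {"decision": decision, "tool": req.tool, "result": None}
        if decision is Decision.ALLOW:
            record["result"] = run_tool(req.payload)
            self.executed[req.idempotency_key] = record
        return record

gate = PermissionGate(
    allowed_tools={"support_agent": {"lookup_order", "issue_refund"}},
    high_impact_tools={"issue_refund"},
    max_amount=100.0,
)
request = ActionRequest("support_agent", "issue_refund", {"amount": 40.0}, "refund-001")
print(gate.execute(request, lambda p: "refunded")["decision"])                 # paused
print(gate.execute(request, lambda p: "refunded", approved=True)["decision"])  # runs once
```

Only executed actions are stored against their idempotency key, so a request that paused for approval can be retried once approved, while a retry after execution cannot trigger the side effect twice.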
Practice
We help teams replace improvised eval stacks with measurement systems that can survive growth, governance, and scrutiny.
Define the capability, behavior, or risk — and its indicator map — before optimization begins, so teams stop steering by unstable proxies and inviting Goodhart's law.
Design rubrics, adjudication loops, and gold sets that hold inter-rater reliability — kappa, alignment to gold — stable as the queue and headcount grow, for human and LLM raters alike.
Audit for leakage, saturation, weak transfer, and construct drift. Test whether score movement is capability or item-format luck — and whether the benchmark still measures what its name claims.
Translate score distributions into cut scores tied to launch gates, triage routes, and escalation — pricing the asymmetry between false positives and false negatives explicitly.
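Pricing the asymmetry between false positives and false negatives, as the last item describes, can be illustrated with a simple cost-minimizing cut score. The sketch below assumes a labeled validation set of `(score, is_ready)` pairs and explicit per-error costs; the data, the costs, and the function name are invented for illustration.

```python
def choose_cut_score(scored, cost_fp, cost_fn):
    """Pick the cut score that minimizes total misclassification cost.

    scored: list of (score, is_ready) pairs; an item passes if score >= cut.
    False positive: passes the gate but is not ready.
    False negative: blocked by the gate but is ready.
    Returns (cut, total_cost). On ties, the lowest cut wins.
    """
    # Candidate cuts: each observed score, plus "nothing passes".
    candidates = sorted({s for s, _ in scored}) + [float("inf")]
    best_cut, best_cost = None, float("inf")
    for cut in candidates:
        fp = sum(1 for s, ready in scored if s >= cut and not ready)
        fn = sum(1 for s, ready in scored if s < cut and ready)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return best_cut, best_cost

# Hypothetical validation set.
validation = [(0.2, False), (0.4, False), (0.5, True),
              (0.6, False), (0.7, True), (0.9, True)]

# A launch gate where shipping something unready is 5x worse than a delay...
print(choose_cut_score(validation, cost_fp=5, cost_fn=1))
# ...versus a triage route where missing a ready item is 5x worse.
print(choose_cut_score(validation, cost_fp=1, cost_fn=5))
```

The same score distribution yields different cut scores once the two error costs differ, which is the argument for stating the asymmetry explicitly rather than inheriting a conventional threshold.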
Method
Psychometrics only matters when it reaches the product boundary.
We isolate the target behavior or risk so the team knows exactly what is being measured before any benchmark becomes a target.
We shape tasks, rubrics, and rating procedures to maximize interpretability, reliability, and operational value.
We tie scores to decisions, monitor drift, and define when a queue needs adjudication instead of intuition.
Field thesis
AI evaluation and psychometrics are converging into one discipline, and practitioners in both fields don't yet realize it. The convergence completes within a decade, and the teams that build the bridge define the conventions everyone else inherits. Six shifts are already underway.
As benchmarks saturate, "did the score go up?" gives way to "is the score measuring the capability we care about?" — the century-old psychometric question, newly load-bearing.
Difficulty and discrimination become item parameters; ability becomes a latent estimate. Scoring models on fixed thousand-item suites will look like scoring students on the pre-adaptive SAT.
The unit of analysis moves from the benchmark to the production trace. Live traffic is sampled and scored; a model's capability profile updates in real time instead of at release.
The larger market is the reverse direction — AI scoring essays, tickets, code, and medical notes — held to the calibration, reliability, and bias standards built for human raters. Today it isn't, and the failure is silent.
LLMs generate plausible items at scale, so the bottleneck moves from authoring to validation. Whoever owns the validation methodology wins, because the item generator is commoditized.
The EU AI Act, NIST AI RMF, and frontier safety frameworks will require evidence of what a score measures, not just the score. "We scored 87%" stops being an answer.
"General capability" risks being reified by the very measurements built to detect it — the same trap "g" set for a century of human testing.
Done well, AI evaluation just looks like rigorous psychometrics — the way serious educational testing needs no special name. Our job is to close the methodological gap fast enough to be unnecessary in ten years.
Standard
We care less about dashboard aesthetics than whether a score can carry real decision weight without distorting the system around it.
Calibration loops that keep human judgment stable as teams scale.
Cut scores and launch gates grounded in evidence rather than convention.
Cleaner signals so product, research, and safety teams can act without relitigating the metric itself.