ER Eidos Research Psychometrics for AI

Psychometrics for AI development

Evaluation systems for AI teams that need measurement to survive real decisions.

Eidos Research designs psychometric infrastructure for labs and product teams: constructs, rubrics, calibration loops, and launch thresholds that stay legible when scores start carrying actual power.

Operating frame

Scores become policy faster than teams expect. We design the layer that makes those scores interpretable before they become institutional.

Constructs
Define the thing before the proxy starts driving behavior.
Calibration
Keep human judgment stable as headcount, queue size, and pressure rise.
Thresholds
Turn uncertain model outputs into decisions people can defend.

Principle

Measurement should hold up after the org starts trusting it.

A

Constructs before proxies

We begin with the underlying behavior or risk, not the easiest score to instrument.

B

Raters before dashboards

Reliability work comes before reporting polish, because the metric matters only if the judgment is stable.

Instruments

The measurement apparatus AI evaluation hasn't adopted yet.

Psychometrics spent a century building tools to answer one question: does this number mean what we think it means? Most of that apparatus has never reached AI evaluation. These are the instruments we bring, and the specific failure each one catches.

Construct validity

Validity

Whether a score measures the capability you intend, or an artifact of how you built the test.

In AI eval The question "we scored 87%" cannot answer on its own — and the one frontier safety frameworks and regulators are learning to ask first.

Item Response Theory

Scaling

Models item difficulty and discrimination as parameters, and ability as a latent estimate on the same scale — so systems that answered different items stay comparable.

In AI eval Replaces "fraction correct on a frozen suite" and makes adaptive, saturation-resistant benchmarks possible.

Generalizability theory

Variance

Decomposes score variance into its sources — item, rater, occasion, prompt phrasing — instead of collapsing them into one number.

In AI eval Tells you whether a two-point gain is capability or a run-to-run artifact of the harness.

Differential Item Functioning

Fairness

Detects items that behave differently across groups for reasons unrelated to the trait being measured.

In AI eval Surfaces prompt formats that advantage one model family independent of capability — format luck scored as skill.

Inter-rater reliability

Reliability

Cohen's κ, Krippendorff's α, and intraclass correlation: agreement beyond chance among independent raters.

In AI eval The precondition for trusting any rubric-scored or LLM-judged result — human or model rater alike.

Multitrait–multimethod

Structure

Separates convergent signal — what a construct shares with its kin — from discriminant signal — what sets it apart.

In AI eval Catches composite "capability" scores that silently aggregate unrelated constructs into one misleading headline.

Demonstration

Calibration workbench

A practical review batch: two raters score the same set of model outputs against a gold adjudication policy. Change a verdict and watch reliability, miss patterns, and decision readiness move together.

Architecture

A governed runtime for agent systems.

For teams moving from evals to agents, we use the same measurement discipline at runtime. PCP turns that discipline into an execution frame: perception gathers live evidence, context preserves durable state, and permission gates every action before an agent touches the world.

Perceive what is live. Carry what matters. Approve what can change the world.

The point is not to make agent behavior feel magical. The point is to make it legible, resumable, and governable when the run crosses into operational, financial, legal, or safety-relevant work.

Durable context

Threaded state, checkpoints, resumable runs.

Live evidence

MCP or direct connectors for docs, logs, and systems.

Policy gate

Role, payload, and case checks before every tool call.

Regulated controls

Approvals, tenant boundaries, redaction, and audit trails.

P

Perception

Normalize user input, retrieved documents, UI state, logs, and external evidence into typed signals. Retrieved material stays separate from policy so prompt injection does not become authority.

C

Context

Persist task state, approvals, tool history, and working summaries across long-running workflows. Compact aggressively, keep lossless audit records, and version the schema so paused runs can resume safely.

P

Permission

Evaluate each action request against least-privilege policy, payload constraints, and runtime risk. High-impact actions pause for approval, and side effects carry execution records and idempotency keys.

Runtime posture

Durable, state-first orchestration for long-running workflows; lighter sequential runtimes where a full state machine is overkill. We stay framework-agnostic and prefer MCP when teams need live context interoperability.

Practice

The layer between model behavior and organizational judgment.

We help teams replace improvised eval stacks with measurement systems that can survive growth, governance, and scrutiny.

01

Construct design

Define the capability, behavior, or risk — and its indicator map — before optimization begins, so teams stop steering by unstable proxies and inviting Goodhart's law.

02

Rater calibration

Design rubrics, adjudication loops, and gold sets that hold inter-rater reliability — kappa, alignment to gold — stable as the queue and headcount grow, for human and LLM raters alike.

03

Benchmark validity

Audit for leakage, saturation, weak transfer, and construct drift. Test whether score movement is capability or item-format luck — and whether the benchmark still measures what its name claims.

04

Threshold policy

Translate score distributions into cut scores tied to launch gates, triage routes, and escalation — pricing the asymmetry between false positives and false negatives explicitly.

Method

From construct to cutoff.

Psychometrics only matters when it reaches the product boundary.

01

Name the construct

We isolate the target behavior or risk so the team knows exactly what is being measured before any benchmark becomes a target.

02

Engineer the signal

We shape tasks, rubrics, and rating procedures to maximize interpretability, reliability, and operational value.

03

Attach the threshold

We tie scores to decisions, monitor drift, and define when a queue needs adjudication instead of intuition.

Field thesis

Psychometrics and AI evaluation are becoming one field.

They don't yet realize it. The convergence completes within a decade, and the teams that build the bridge define the conventions everyone else inherits. Six shifts are already underway.

Rising scores Construct validity

As benchmarks saturate, "did the score go up?" gives way to "is the score measuring the capability we care about?" — the century-old psychometric question, newly load-bearing.

Static suites Adaptive testing

Difficulty and discrimination become item parameters; ability becomes a latent estimate. Scoring models on fixed thousand-item suites will look like scoring students on the pre-adaptive SAT.

Discrete tests Continuous assessment

The unit of analysis moves from the benchmark to the production trace. Live traffic is sampled and scored; a model's capability profile updates in real time instead of at release.

Evaluating AI AI as the rater

The larger market is the reverse direction — AI scoring essays, tickets, code, and medical notes — held to the calibration, reliability, and bias standards built for human raters. Today it isn't, and the failure is silent.

Hand-authored items Generative items

LLMs generate plausible items at scale, so the bottleneck moves from authoring to validation. Whoever owns the validation methodology wins, because the item generator is commoditized.

Performance numbers Evidence of validity

The EU AI Act, NIST AI RMF, and frontier safety frameworks will require evidence of what a score measures, not just the score. "We scored 87%" stops being an answer.

The trap

"General capability" risks being reified by the very measurements built to detect it — the same trap "g" set for a century of human testing.

Done well, AI evaluation just looks like rigorous psychometrics — the way serious educational testing needs no special name. Our job is to close the methodological gap fast enough to be unnecessary in ten years.

Standard

Metrics should earn influence before they acquire authority.

We care less about dashboard aesthetics than whether a score can carry real decision weight without distorting the system around it.

Reliable raters

Calibration loops that keep human judgment stable as teams scale.

Defensible thresholds

Cut scores and launch gates grounded in evidence rather than convention.

Faster governance

Cleaner signals so product, research, and safety teams can act without relitigating the metric itself.

Contact

Need an evaluation system that can survive launch pressure?

hello@eidosresearch.com