Constructs before proxies
Psychometrics for AI development
Eidos Research designs psychometric infrastructure for labs and product teams: constructs, rubrics, calibration loops, and launch thresholds that stay legible when scores start carrying actual power.
Operating frame
Scores become policy faster than teams expect. We design the layer that makes those scores interpretable before they become institutional.
Principle
We begin with the underlying behavior or risk, not the easiest score to instrument.
Reliability work comes before reporting polish, because the metric matters only if the judgment is stable.
Demonstration
A practical review batch: two raters score the same set of model outputs against a gold adjudication policy. Change a verdict and watch reliability, miss patterns, and decision readiness move together.
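A minimal sketch of the arithmetic underneath that demonstration, assuming binary verdicts, two raters, and a toy batch; the labels, field names, and numbers are illustrative, not our rubric.

```ts
// Sketch only: binary verdicts and a small illustrative batch.
type Verdict = "pass" | "fail";

interface ReviewItem {
  raterA: Verdict;
  raterB: Verdict;
  gold: Verdict; // the verdict under the gold adjudication policy
}

// Cohen's kappa: agreement between the two raters, corrected for chance agreement.
function cohensKappa(items: ReviewItem[]): number {
  const n = items.length;
  const observed = items.filter(i => i.raterA === i.raterB).length / n;
  const pA = items.filter(i => i.raterA === "pass").length / n;
  const pB = items.filter(i => i.raterB === "pass").length / n;
  const expected = pA * pB + (1 - pA) * (1 - pB);
  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}

// Miss rate: gold failures that at least one rater passed.
function missRate(items: ReviewItem[]): number {
  const goldFails = items.filter(i => i.gold === "fail");
  if (goldFails.length === 0) return 0;
  const missed = goldFails.filter(i => i.raterA === "pass" || i.raterB === "pass");
  return missed.length / goldFails.length;
}

const batch: ReviewItem[] = [
  { raterA: "pass", raterB: "pass", gold: "pass" },
  { raterA: "fail", raterB: "pass", gold: "fail" },
  { raterA: "fail", raterB: "fail", gold: "fail" },
  { raterA: "pass", raterB: "pass", gold: "fail" },
];

// Flip one verdict in the batch and both numbers move at once.
console.log("kappa:", cohensKappa(batch).toFixed(2), "miss rate:", missRate(batch).toFixed(2));
```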
Architecture
For teams moving from evals to agents, we use the same measurement discipline at runtime. PCP turns that discipline into an execution frame: perception gathers live evidence, context preserves durable state, and permission gates every action before an agent touches the world.
Perceive what is live. Carry what matters. Approve what can change the world.
The point is not to make agent behavior feel magical. The point is to make it legible, resumable, and governable when the run crosses into operational, financial, legal, or safety-relevant work.
Threaded state, checkpoints, resumable runs.
MCP or direct connectors for docs, logs, and systems.
Role, payload, and case checks before every tool call.
Approvals, tenant boundaries, redaction, and audit trails.
Normalize user input, retrieved documents, UI state, logs, and external evidence into typed signals. Retrieved material stays separate from policy so prompt injection does not become authority.
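As a sketch of what that separation can look like in code, with illustrative type and field names rather than a published interface:

```ts
// Sketch only: names are assumptions. The property that matters is that retrieved
// material arrives as attributed evidence and is never promoted into instructions.
type SignalSource = "user_input" | "retrieved_doc" | "ui_state" | "log" | "external";

interface Signal {
  source: SignalSource;
  content: string;
  trusted: boolean;    // retrieved material is never trusted as instruction
  provenance?: string; // where a retrieved signal came from
}

interface PerceivedInput {
  policy: string;      // the only text the agent treats as authority
  evidence: Signal[];  // everything else, quoted and attributed
}

function perceive(
  policy: string,
  userMessage: string,
  retrieved: { uri: string; text: string }[],
): PerceivedInput {
  const evidence: Signal[] = [
    { source: "user_input", content: userMessage, trusted: true },
    ...retrieved.map(d => ({
      source: "retrieved_doc" as const,
      content: d.text,
      trusted: false,
      provenance: d.uri,
    })),
  ];
  return { policy, evidence };
}
```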
Persist task state, approvals, tool history, and working summaries across long-running workflows. Compact aggressively, keep lossless audit records, and version the schema so paused runs can resume safely.
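A sketch of the shape a versioned checkpoint can take, with assumed field names and a hypothetical one-step migration:

```ts
// Sketch only: field names and the migration step are assumptions, not a fixed schema.
const SCHEMA_VERSION = 3;

interface Checkpoint {
  schemaVersion: number;
  runId: string;
  taskState: Record<string, unknown>;                                        // compacted working state
  approvals: { action: string; approvedBy: string; at: string }[];           // who approved what, when
  toolHistory: { tool: string; args: Record<string, unknown>; at: string }[]; // lossless record of every call
  workingSummary: string;                                                    // aggressive compaction lives here
}

// A paused run resumes only if its checkpoint is readable under the current schema.
function resume(checkpoint: Checkpoint): Checkpoint {
  if (checkpoint.schemaVersion === SCHEMA_VERSION) return checkpoint;
  if (checkpoint.schemaVersion === SCHEMA_VERSION - 1) {
    // Hypothetical migration: the previous schema had no working summary.
    return { ...checkpoint, schemaVersion: SCHEMA_VERSION, workingSummary: checkpoint.workingSummary ?? "" };
  }
  throw new Error(`Checkpoint schema v${checkpoint.schemaVersion} has no migration path; do not resume.`);
}
```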
Evaluate each action request against least-privilege policy, payload constraints, and runtime risk. High-impact actions pause for approval, and side effects carry execution records and idempotency keys.
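A sketch of a gate in that spirit, with an assumed policy shape, risk tiers, and helper names:

```ts
// Sketch only: the policy shape and thresholds are assumptions, not a shipped interface.
import { createHash } from "node:crypto";

type Decision =
  | { kind: "allow" }
  | { kind: "needs_approval"; reason: string }
  | { kind: "deny"; reason: string };

interface ActionRequest {
  role: string;                      // who the agent is acting as
  tool: string;                      // the tool it wants to call
  caseId: string;                    // the case the run is scoped to
  payload: Record<string, unknown>;
}

interface Policy {
  allowedTools: Record<string, string[]>; // role -> least-privilege tool list
  highImpactTools: Set<string>;           // side effects that always pause for approval
  maxPayloadBytes: number;
}

function gate(req: ActionRequest, policy: Policy): Decision {
  const allowed = policy.allowedTools[req.role] ?? [];
  if (!allowed.includes(req.tool)) return { kind: "deny", reason: "tool not permitted for this role" };
  if (JSON.stringify(req.payload).length > policy.maxPayloadBytes)
    return { kind: "deny", reason: "payload exceeds constraints" };
  if (policy.highImpactTools.has(req.tool))
    return { kind: "needs_approval", reason: "high-impact side effect" };
  return { kind: "allow" };
}

// The same case + tool + payload never executes a side effect twice.
function idempotencyKey(req: ActionRequest): string {
  return createHash("sha256")
    .update(`${req.caseId}:${req.tool}:${JSON.stringify(req.payload)}`)
    .digest("hex");
}
```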
LangGraph.js for durable, state-heavy workflows. The OpenAI Agents SDK for lighter sequential specialists. MCP when teams need live context interoperability.
Practice
We help teams replace improvised eval stacks with measurement systems that can survive growth, governance, and scrutiny.
Define the capability, behavior, or risk before optimization begins, so teams stop steering by unstable proxies.
Design rubrics, adjudication loops, and QA procedures that keep human judgment reliable as the queue and headcount grow.
Test whether score movement means what the team thinks it means, including leakage, saturation, and weak transfer.
Connect uncertain outputs to launch gates, triage routes, and escalation rules that executives and operators can defend.
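One way that connection can look in code: gate on a lower confidence bound rather than a point estimate, so a thin sample cannot clear a launch threshold. The cut score, sample sizes, and route names below are illustrative.

```ts
// Sketch only: cut score, sample sizes, and routes are assumptions.
type Route = "ship" | "hold_for_adjudication" | "escalate";

// Wilson-style lower bound on a pass rate, so the gate sees uncertainty, not a point estimate.
function lowerBound(passes: number, n: number, z = 1.96): number {
  const p = passes / n;
  const denom = 1 + (z * z) / n;
  const centre = p + (z * z) / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return (centre - margin) / denom;
}

function launchGate(passes: number, n: number, cutScore: number): Route {
  const lb = lowerBound(passes, n);
  if (lb >= cutScore) return "ship";                           // confidently above the cut score
  if (passes / n >= cutScore) return "hold_for_adjudication";  // point estimate clears, evidence does not
  return "escalate";                                           // below the cut score
}

console.log(launchGate(94, 100, 0.85)); // "ship"
console.log(launchGate(88, 100, 0.85)); // "hold_for_adjudication"
console.log(launchGate(70, 100, 0.85)); // "escalate"
```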
Method
Psychometrics only matters when it reaches the product boundary.
We isolate the target behavior or risk so the team knows exactly what is being measured before any benchmark becomes a target.
We shape tasks, rubrics, and rating procedures to maximize interpretability, reliability, and operational value.
We tie scores to decisions, monitor drift, and define when a queue needs adjudication instead of intuition.
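A minimal sketch of that last step, with an assumed window and threshold: when the score distribution moves, the queue routes to adjudication instead of running on intuition.

```ts
// Sketch only: the threshold and routing labels are assumptions, not fixed policy.
interface DriftCheck {
  delta: number;    // absolute shift in mean score between baseline and recent window
  drifted: boolean;
}

function checkDrift(baseline: number[], recent: number[], threshold = 0.05): DriftCheck {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const delta = Math.abs(mean(recent) - mean(baseline));
  return { delta, drifted: delta > threshold };
}

function route(check: DriftCheck): "keep_scoring" | "send_to_adjudication" {
  return check.drifted ? "send_to_adjudication" : "keep_scoring";
}

// Example: baseline pass rates hover near 0.80; the recent window slips toward 0.70.
console.log(route(checkDrift([0.81, 0.79, 0.80, 0.82], [0.72, 0.70, 0.69, 0.71]))); // "send_to_adjudication"
```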
Standard
We care less about dashboard aesthetics than whether a score can carry real decision weight without distorting the system around it.
Calibration loops that keep human judgment stable as teams scale.
Cut scores and launch gates grounded in evidence rather than convention.
Cleaner signals so product, research, and safety teams can act without relitigating the metric itself.
Contact