Eidos Research · Psychometrics for AI

Case study / 01 · Buy-side

Decomposing 100 production failures in a macro fund's research-synthesis LLM.

A discretionary global macro fund had built an in-house LLM that turned central bank communications, earnings transcripts, and news into analyst-facing briefs. PMs flagged the briefs as inconsistent — sometimes confidently wrong on direction, sometimes overstepping mandate. Eidos ran a four-week PCP Decision Audit on 100 of those flagged briefs and returned a failure-layer map, per-layer fix recommendations, and a re-audit at twelve weeks.

Engagement: PCP Decision Audit
Duration: 4 weeks · re-audit at 12
Sample: 100 production briefs
Outcome: Failure rate 8.0% → 2.4%

Client

Discretionary global macro, mid-sized.

A G10-focused discretionary macro fund trading rates, FX, equity index, and selected commodity themes, with a small in-house ML and engineering team running internal LLM tooling alongside the traditional research stack. The synthesis engine sits between raw inputs (central bank communications, earnings, news, internal notes) and the analyst desk; it produces roughly two hundred briefs per week.

Problem

Failures the team could not partition.

Roughly eight percent of briefs were being flagged by analysts or PMs as wrong in some way: a confident hawkish call where the statement was deliberately ambiguous, a "long EUR" framing that ignored a position the fund was already heavily sized in, the occasional sentence recommending a position size when the engine's mandate was synthesis only.

The internal debate was not whether the failures existed. It was which team owned them. Engineering blamed the prompt. The ML team blamed the retrieval setup. Research blamed the model. Each story was internally consistent and externally unfalsifiable. The conversation kept resetting.

Engagement

A PCP Decision Audit on 100 flagged briefs.

Eidos was provided the flagged briefs, the system prompt, the retrieval configuration, the policy on what the engine should and should not say, and access to the analysts who had flagged the failures. We classified each failure along three independent layers — Perception, Context, Permission — with reasoning, and mapped the resulting concentration to actionable team ownership.

The premise: most observed failures collapse two or three layers into one undifferentiated label ("the model hallucinated") and lose the diagnostic information that would tell you what to fix. Decomposition forces the question — which layer broke? — and once that's answered, the right team owns the right fix.
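
The unit of work in that classification is small enough to show directly. Eidos has not published its internal schema, so the following is only a minimal sketch of what a per-brief record and the concentration roll-up might look like; the three layer names come from the engagement, while the field names and the FailureLayer enum are assumptions.

    # Minimal sketch of a per-brief failure record and the concentration
    # roll-up. Layer names come from the engagement; field and type names
    # are illustrative, not Eidos's internal schema.
    from collections import Counter
    from dataclasses import dataclass
    from enum import Enum

    class FailureLayer(Enum):
        PERCEPTION = "perception"   # misread what the input said
        CONTEXT = "context"         # ignored state the decision should have used
        PERMISSION = "permission"   # acted outside the authorized scope

    @dataclass
    class BriefFailure:
        brief_id: str
        layer: FailureLayer
        reasoning: str   # why this failure belongs to this layer
        owner: str       # team that owns the fix

    def concentration(failures: list[BriefFailure]) -> dict[str, float]:
        """Share of failures per layer: the failure-layer map the audit returns."""
        counts = Counter(f.layer for f in failures)
        return {layer.value: counts[layer] / len(failures)
                for layer in FailureLayer}

The point of the record is the reasoning field: every classification carries its justification, which is what makes the resulting ownership map falsifiable rather than another internally consistent story.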

Findings

Where the failures actually concentrated.

Perception · Misreading what the input said · 52%

Concentrated on misreading hedged or deliberately ambiguous central bank language. The model averaged hawkish and dovish signals into "neutral" instead of flagging the divergence as the news. The dominant pattern: a Fed statement that upgraded the inflation forecast while softening the path-of-rates language; the model read this as balanced and called it neutral, when the deliberate divergence was exactly what an analyst would have flagged.

Failure pattern: Hawkish + dovish averaged to neutral on hedged FOMC and ECB statements; missed the divergence as itself the signal.
Owner: ML team. Fix lives in training data and a small classifier.

Context · Ignoring state the decision should have used · 29%

Concentrated on ignoring portfolio state. The synthesis engine had no view of the fund's current positioning, so it would describe a EUR/USD divergence story as a long-EUR opportunity without noting the fund was already 2.5× sized in long EUR. PMs experienced this as the engine being naïve, and over time learned to discount it — which defeated the point of having the engine in the first place.

Failure pattern: Synthesis written without awareness of existing positions, hedges, or stated portfolio biases.
Owner: Engineering and product. Fix lives in surfacing portfolio state to the model via a structured tool call.

Permission · Acting outside the authorized scope · 19%

Concentrated on volunteering sizing language. The engine's mandate was synthesis only; it was not authorized to recommend position sizing or directional trades. Roughly one in five briefs contained sentences like "this argues for adding to the JGB short" — over the mandate line, even when the underlying synthesis was correct. The output looked like research, but the institution had not authorized it to act like one.

Failure pattern: Synthesis briefs introducing sizing or directional trade language outside the mandate.
Owner: Policy and post-training. Fix lives in a refusal classifier and an updated system prompt.

Fixes

Different layer, different fix, different team.

01 · Perception fix

Fine-tune a small classifier on a labeled set of hedged central bank language. The training set was assembled from prior FOMC, ECB, BOJ, and BOE statements, adjudicated by the fund's senior analysts into hawkish, dovish, and genuinely ambiguous classes. The classifier runs upstream of the synthesis model and tags the input; the synthesis model is prompted to preserve, not collapse, deliberate ambiguity.
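
The case study names neither the model nor the training stack. A minimal sketch, assuming a small pretrained encoder fine-tuned with Hugging Face transformers and a hypothetical statements.jsonl of analyst-adjudicated sentences:

    # Upstream stance classifier for hedged central-bank language (sketch).
    # "statements.jsonl" is hypothetical: one {"text": ..., "label": 0|1|2}
    # record per adjudicated sentence. Hyperparameters are illustrative.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    LABELS = ["hawkish", "dovish", "ambiguous"]

    tok = AutoTokenizer.from_pretrained("distilroberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilroberta-base", num_labels=len(LABELS))

    ds = load_dataset("json", data_files="statements.jsonl")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True), batched=True)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="stance-clf", num_train_epochs=3),
        train_dataset=ds,
        tokenizer=tok,  # default collator then pads each batch
    ).train()

    # At inference the classifier tags each statement before synthesis, so
    # the prompt can tell the model to preserve, not collapse, ambiguity.
    enc = tok("The Committee sees risks as roughly balanced.",
              return_tensors="pt").to(model.device)
    stance = LABELS[model(**enc).logits.argmax(-1).item()]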

02 · Context fix

Surface portfolio state to the model via a structured tool call before generation. The model now knows current positions, sizing, and stated biases, and explicitly references them when relevant ("the fund is already 2.5× sized long EUR; this read marginally adds to the existing thesis rather than introducing a new one"). Briefs became legibly dependent on portfolio state.
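
The schema itself was not published. A minimal sketch, assuming an OpenAI-style function tool, with get_portfolio_state() as a hypothetical stand-in for the fund's internal positions service:

    # Portfolio-state tool for the synthesis model (sketch). Schema and
    # handler are assumptions; the real call would hit an internal
    # positions service, not a hard-coded book.
    PORTFOLIO_STATE_TOOL = {
        "type": "function",
        "function": {
            "name": "get_portfolio_state",
            "description": "Current positions, sizing, and stated biases "
                           "for the instruments referenced in this brief.",
            "parameters": {
                "type": "object",
                "properties": {
                    "instruments": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "e.g. ['EURUSD', 'JGB10Y']",
                    },
                },
                "required": ["instruments"],
            },
        },
    }

    def get_portfolio_state(instruments: list[str]) -> dict:
        """Hypothetical handler. Returns the state the model must reference
        before writing a brief; values here are illustrative."""
        book = {
            "EURUSD": {"direction": "long", "size_multiple": 2.5,
                       "stated_bias": "long EUR on policy divergence"},
        }
        return {i: book.get(i, {"direction": "flat"}) for i in instruments}

Calling this before generation is what lets a brief say "marginally adds to the existing thesis" instead of pitching an already-crowded position as a new idea.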

03 · Permission fix

Add a refusal classifier that scans output for sizing and directional language and returns a synthesis-only version. Update the system prompt to make the synthesis-only mandate explicit, with named examples of what the engine does and does not produce. The review interface gained a "mandate breach" flag so analysts could surface false negatives back into training.
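
The production check was a trained classifier; a regex screen is a simplified stand-in that shows the shape of the guardrail. The patterns and the rewrite callable below are hypothetical:

    import re

    # Mandate guardrail for synthesis-only output (sketch). The engagement
    # used a trained refusal classifier; this regex screen is a simplified
    # stand-in. Patterns and the `rewrite` callable are hypothetical.
    SIZING_PATTERNS = [
        r"\badd(?:ing)? to the \S+ (?:short|long)\b",
        r"\bargues for (?:adding|going|being)\b",
        r"\b(?:size|sizing|trim|increase|reduce) (?:the|this) position\b",
    ]

    def breaches_mandate(brief: str) -> bool:
        """True if the brief contains sizing or directional trade language."""
        return any(re.search(p, brief, re.IGNORECASE) for p in SIZING_PATTERNS)

    def enforce_synthesis_only(brief: str, rewrite) -> tuple[str, bool]:
        """Returns (brief, breached). On a breach, asks the model (via the
        hypothetical `rewrite` callable) for a synthesis-only version; the
        breach flag feeds the reviewers' "mandate breach" surface either way."""
        breached = breaches_mandate(brief)
        if breached:
            brief = rewrite(brief, "Remove all position-sizing and directional "
                                   "trade language; keep the synthesis.")
        return brief, breached

Returning the flag alongside the cleaned brief is what feeds the review interface: analysts see what was caught, and what they catch that the screen missed goes back into training.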

Outcome

Re-audit at twelve weeks.

Before: 8.0% brief failure rate
After (12 weeks): 2.4% brief failure rate
Reduction: −70% failures vs. baseline

The composition of the remaining failures shifted as much as the absolute count. At re-audit, 81% of remaining failures were perception, 14% context, and 5% permission — meaning the engineering and policy fixes held, and the residual problem concentrated on the genuinely hard layer. Several of the residual perception failures were on central bank communications where senior analysts on the desk also disagreed about the right read, which the team interpreted as the appropriate end-state rather than a defect to keep grinding against.

Second-order effects

What the fund actually said changed.

The audit's headline number was the failure rate, but the durable change was in how the team talked about failures. Before the audit, a wrong brief was "the model hallucinated" — a statement that pointed at no team and no fix. After the audit, a wrong brief was a perception failure on a specific sentence pattern, or a context failure where a specific piece of state was missing, or a permission failure where a specific guardrail had not fired. Conversations got shorter. The team that owned each layer also owned each fix. The next quality push could start from where the prior one ended, instead of restarting from "what is going wrong here."

Disclosure

Client identifiers, AUM, exact percentages, and timing details have been adjusted to preserve confidentiality. The engagement structure, failure patterns, and methodology described here are representative of how Eidos delivers a PCP Decision Audit for institutional buy-side clients deploying LLM-based research workflows.

Engage

Running a research-synthesis LLM you cannot fully diagnose?

hello@eidosresearch.com