Case study / 01 · Buy-side
Decomposing 100 production failures in a macro fund's
research-synthesis LLM.
A discretionary global macro fund had built an in-house LLM that
turned central bank communications, earnings transcripts, and news
into analyst-facing briefs. PMs flagged the briefs as inconsistent
— sometimes confidently wrong on direction, sometimes overstepping
mandate. Eidos ran a four-week PCP Decision Audit on 100 of those
flagged briefs and returned a failure-layer map, per-layer fix
recommendations, and a re-audit at twelve weeks.
- Engagement: PCP Decision Audit
- Duration: 4 weeks · re-audit at 12 weeks
- Sample: 100 production briefs
- Outcome: Failure rate 8.0% → 2.4%
Client
Discretionary global macro, mid-sized.
A G10-focused discretionary macro fund trading rates, FX, equity
index, and selected commodity themes, with a small in-house ML and
engineering team running internal LLM tooling alongside the
traditional research stack. The synthesis engine sits between raw
inputs (central bank communications, earnings, news, internal
notes) and the analyst desk; it produces roughly two hundred briefs
per week.
Problem
Failures the team could not partition.
Roughly eight percent of briefs were being flagged by analysts or
PMs as wrong in some way: a confident hawkish call where the
statement was deliberately ambiguous, a "long EUR" framing that
ignored a position the fund was already heavily sized in, the
occasional sentence recommending a position size when the engine's
mandate was synthesis only.
The internal debate was not whether the failures existed. It was
which team owned them. Engineering blamed the prompt. The ML team
blamed the retrieval setup. Research blamed the model. Each story
was internally consistent and externally unfalsifiable. The
conversation kept resetting.
Engagement
A PCP Decision Audit on 100 flagged briefs.
Eidos was provided the flagged briefs, the system prompt, the
retrieval configuration, the policy on what the engine should and
should not say, and access to the analysts who had flagged the
failures. We classified each failure along three independent
layers — Perception, Context, Permission — with reasoning, and
mapped the resulting concentration to actionable team ownership.
The premise: most descriptions of observed failures collapse two or
three layers into one undifferentiated label ("the model
hallucinated") and lose the diagnostic information that would tell
you what to fix.
Decomposition forces the question — which layer broke? —
and once that's answered, the right team owns the right fix.
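For readers who want the shape of the working unit, a minimal sketch of the per-brief classification record follows. The field names, enum values, and the concentration helper are illustrative assumptions for this write-up, not the audit's actual tooling.

```python
# Illustrative sketch only: the real audit record and field names are not
# public. Each flagged brief gets one dominant failure layer, a written
# rationale, and a team owner.
from dataclasses import dataclass
from enum import Enum


class FailureLayer(Enum):
    PERCEPTION = "perception"   # misread what the input said
    CONTEXT = "context"         # ignored state the decision should have used
    PERMISSION = "permission"   # acted outside the authorized scope


@dataclass
class FailureRecord:
    brief_id: str
    layer: FailureLayer
    rationale: str   # which sentence broke, and why it maps to this layer
    owner: str       # ML, engineering/product, or policy


def concentration(records: list[FailureRecord]) -> dict[str, float]:
    """Share of the audited sample falling in each layer."""
    total = len(records)
    return {
        layer.value: sum(r.layer is layer for r in records) / total
        for layer in FailureLayer
    }
```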
Findings
Where the failures actually concentrated.
Perception · 52%
Misreading what the input said
Concentrated on misreading hedged or deliberately ambiguous
central bank language. The model averaged hawkish and dovish
signals into "neutral" instead of flagging the divergence as
the news. The dominant pattern: a Fed statement upgraded the
inflation forecast while softening the path-of-rates language;
the model read this as balanced and called it neutral, when
the deliberate divergence was the signal an analyst would
have flagged.
Failure pattern
Hawkish + dovish averaged to neutral on hedged FOMC and ECB
statements; missed the divergence as itself the signal.
Owner
ML team. Fix lives in training data and a small classifier.
Context · 29%
Ignoring state the decision should have used
Concentrated on ignoring portfolio state. The synthesis engine
had no view of the fund's current positioning, so it would
describe a EUR/USD divergence story as a long-EUR opportunity
without noting the fund was already 2.5× sized in long EUR.
PMs experienced this as the engine being naïve, and over time
learned to discount it — which defeated the point of having
the engine in the first place.
Failure pattern
Synthesis written without awareness of existing positions,
hedges, or stated portfolio biases.
Owner
Engineering and product. Fix lives in surfacing portfolio
state to the model via a structured tool call.
Permission · 19%
Acting outside the authorized scope
Concentrated on volunteering sizing language. The engine's
mandate was synthesis only; it was not authorized to recommend
position sizing or directional trades. Roughly one in five of
the flagged briefs contained sentences like "this argues for
adding to the JGB short" — over the mandate line, even when the
underlying synthesis was correct. The output looked like analyst
research, but the institution had never authorized the engine to
act like an analyst.
Failure pattern
Synthesis briefs introducing sizing or directional trade
language outside the mandate.
Owner
Policy and post-training. Fix lives in a refusal classifier
and an updated system prompt.
Fixes
Different layer, different fix, different team.
01 · Perception fix
Fine-tune a small classifier on a labeled set of hedged
central bank language. The training set was assembled from
prior FOMC, ECB, BOJ, and BOE statements, adjudicated by
the fund's senior analysts into hawkish, dovish, and
genuinely-ambiguous classes. The classifier runs upstream
of the synthesis model and tags the input; the synthesis
model is prompted to preserve, not collapse, deliberate
ambiguity.
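A hedged sketch of that upstream tagging step is below. The fund's actual classifier was fine-tuned on its adjudicated corpus, so the TF-IDF baseline, label names, and tag format here are stand-ins rather than the real implementation.

```python
# Stand-in for the fund's fine-tuned stance classifier: a simple TF-IDF
# baseline trained on analyst-adjudicated paragraphs. Labels and tag format
# are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

STANCES = ("hawkish", "dovish", "genuinely_ambiguous")


def train_stance_classifier(paragraphs: list[str], labels: list[str]):
    """Small classifier that runs upstream of the synthesis model."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(paragraphs, labels)
    return clf


def tag_statement(clf, statement: str) -> str:
    """Prepend the stance tag so the synthesis prompt can be instructed to
    preserve, rather than average away, deliberate ambiguity."""
    stance = clf.predict([statement])[0]
    return f"[stance: {stance}]\n{statement}"
```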
02 · Context fix
Surface portfolio state to the model via a structured tool
call before generation. The model now knows current
positions, sizing, and stated biases, and explicitly
references them when relevant ("the fund is already 2.5×
sized long EUR; this read marginally adds to the existing
thesis rather than introducing a new one"). Briefs became
legibly dependent on portfolio state.
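A minimal sketch of the pre-generation state injection is below. The portfolio-state call, its fields, and the message layout are assumptions, since the fund's internal position API is not described in detail here.

```python
# Hedged sketch: get_portfolio_state, its field names, and the message
# layout are assumptions. The point is that current positions and biases are
# fetched before generation and handed to the model as structured context.
import json


def get_portfolio_state(book: str) -> dict:
    """Hypothetical internal call returning positions, sizing, and biases."""
    return {
        "positions": [
            {"instrument": "EURUSD", "direction": "long", "sizing_x": 2.5},
        ],
        "stated_biases": ["long EUR vs. USD on policy divergence"],
        "hedges": [],
    }


def build_messages(source_docs: str, book: str) -> list[dict]:
    """Assemble chat messages with portfolio state placed ahead of sources."""
    state = get_portfolio_state(book)
    return [
        {"role": "system",
         "content": "You write synthesis-only research briefs. Reference "
                    "current portfolio state where relevant."},
        {"role": "system",
         "content": "Current portfolio state:\n" + json.dumps(state, indent=2)},
        {"role": "user", "content": source_docs},
    ]
```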
03 · Permission fix
Add a refusal classifier that scans output for sizing and
directional language and returns a synthesis-only version.
Update the system prompt to make the synthesis-only mandate
explicit, with named examples of what the engine does and
does not produce. The review interface gained a "mandate
breach" flag so analysts could surface false negatives back
into training.
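The production check was a trained refusal classifier; the pattern-based filter below is a deliberately simplified stand-in that makes the mandate boundary concrete. Patterns and helper names are illustrative only.

```python
# Deliberately simplified stand-in for the refusal classifier: pattern-based,
# sentence-level, illustrative only. The production check was a trained model
# with analyst-surfaced false negatives fed back into training.
import re

MANDATE_BREACH_PATTERNS = (
    r"\badd(?:ing)? to the .*\b(?:short|long)\b",
    r"\bsiz(?:e|ing) (?:up|down|into)\b",
    r"\brecommend\w* (?:a |the )?(?:position|trade)\b",
    r"\b(?:go|stay|get) (?:long|short)\b",
)


def split_sentences(brief: str) -> list[str]:
    return re.split(r"(?<=[.!?])\s+", brief.strip())


def mandate_breaches(brief: str) -> list[str]:
    """Sentences that cross from synthesis into sizing or directional calls."""
    return [
        s for s in split_sentences(brief)
        if any(re.search(p, s, flags=re.IGNORECASE)
               for p in MANDATE_BREACH_PATTERNS)
    ]


def to_synthesis_only(brief: str) -> str:
    """Return the brief with mandate-breaching sentences dropped."""
    breached = set(mandate_breaches(brief))
    return " ".join(s for s in split_sentences(brief) if s not in breached)
```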
Outcome
Re-audit at twelve weeks.
- Before: 8.0% brief failure rate
- After (12 weeks): 2.4% brief failure rate
- Reduction: −70% failures vs. baseline
The composition of the remaining failures shifted as much as the
absolute count. At re-audit, 81% of remaining failures were
perception, 14% context, and 5% permission — meaning the
engineering and policy fixes held, and the residual problem
concentrated on the genuinely hard layer. Several of the residual
perception failures were on central bank communications where
senior analysts on the desk also disagreed about the right read,
which the team interpreted as the appropriate end-state rather
than a defect to keep grinding against.
Second-order effects
What the fund actually said changed.
The audit's headline number was the failure rate, but the
durable change was in how the team talked about failures. Before
the audit, a wrong brief was "the model hallucinated" — a
statement that pointed at no team and no fix. After the audit, a
wrong brief was a perception failure on a specific sentence
pattern, or a context failure where a specific piece of state
was missing, or a permission failure where a specific guardrail
had not fired. Conversations got shorter. The team that owned
each layer also owned each fix. The next quality push could
start from where the prior one ended, instead of restarting from
"what is going wrong here."
Disclosure
Client identifiers, AUM, exact percentages, and timing details
have been adjusted to preserve confidentiality. The engagement
structure, failure patterns, and methodology described here are
representative of how Eidos delivers a PCP Decision Audit for
institutional buy-side clients deploying LLM-based research
workflows.