Executive Summary
AI coding agents are usually evaluated on whether they produce plausible code. Bullseye's evaluation asked a narrower buyer question: when an agent must make a choice, does it follow the team's actual intent or fall back to a generic default?
Across 100 realistic decision scenarios selected because the team's intent diverged from common best practice, agents given Bullseye's current served intent followed the intended decision far more often than agents without it. Adherence rose from about 41% to about 89%, and the served-intent agent won the direct comparison in 93 of 100 scenarios.
What We Tested
Each scenario described an engineering task with a team-specific decision, constraint, rejection, or unresolved tradeoff. The scenarios intentionally emphasized the cases where a fresh agent is most likely to be wrong: multi-stakeholder decisions, contested calls, conditional constraints, and rejected paths that are not obvious from final code.
The baseline agent received the task without Bullseye's served intent. The treatment agent received the same task plus a concise, current intent block: the team's goal, the constraints that bounded it, the rejected alternatives where relevant, and the provenance needed to understand why the decision existed.
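As a rough illustration of the two conditions, the sketch below models the intent block with the four parts named above and builds a prompt for each agent. Everything in it is an assumption made for illustration: the names `IntentBlock` and `build_prompt`, the rendered text layout, and the example task are hypothetical and do not describe Bullseye's actual format or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IntentBlock:
    """Hypothetical container for the four parts of served intent."""
    goal: str
    constraints: List[str] = field(default_factory=list)
    rejected_alternatives: List[str] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the block as plain text, one labeled line per item."""
        lines = [f"Goal: {self.goal}"]
        sections = (
            ("Constraint", self.constraints),
            ("Rejected", self.rejected_alternatives),
            ("Provenance", self.provenance),
        )
        for label, items in sections:
            lines.extend(f"{label}: {item}" for item in items)
        return "\n".join(lines)


def build_prompt(task: str, intent: Optional[IntentBlock] = None) -> str:
    """Baseline gets the task alone; treatment gets the task plus the intent block."""
    if intent is None:
        return task
    return f"{task}\n\nTeam intent:\n{intent.render()}"


# Invented example scenario, not one of the evaluation scenarios.
task = "Add retry handling to the payment webhook consumer."
intent = IntentBlock(
    goal="Never process the same webhook twice",
    constraints=["Retries must be idempotent"],
    rejected_alternatives=["Client-side exponential backoff without dedup keys"],
    provenance=["Design review notes (hypothetical)"],
)

baseline_prompt = build_prompt(task)
treatment_prompt = build_prompt(task, intent)
```

The only difference between the two prompts is the appended block, which mirrors the design described above: the task is held constant and the served intent is the single variable.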
Primary Findings
- Current served intent materially changed output. Adherence to the team's actual decision rose from about 41% to about 89%.
- The improvement was visible head-to-head. In direct comparisons, the served-intent agent produced the more team-aligned answer in 93 of 100 scenarios.
- The largest gains appeared where generic advice is least reliable. Multi-stakeholder, conditional, and contested decisions benefited most because the right answer depended on local history rather than universal best practice.
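The two headline metrics can be computed from per-scenario records along these lines. This is a minimal sketch under stated assumptions: the `ScenarioResult` record shape and both function names are hypothetical, and the four toy records are invented to show the arithmetic, not taken from the evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario record; field names are illustrative."""
    scenario_id: str
    baseline_adhered: bool      # did the no-intent agent follow the team's decision?
    treatment_adhered: bool     # did the served-intent agent follow it?
    head_to_head_winner: str    # "treatment", "baseline", or "tie"


def adherence_rate(results: List[ScenarioResult], arm: str) -> float:
    """Fraction of scenarios in which the given arm followed the team's decision."""
    flags = [getattr(r, f"{arm}_adhered") for r in results]
    return sum(flags) / len(flags)


def head_to_head_wins(results: List[ScenarioResult], arm: str) -> int:
    """Number of scenarios in which the given arm won the direct comparison."""
    return sum(1 for r in results if r.head_to_head_winner == arm)


# Toy records for illustration only; these are NOT evaluation data.
toy_results = [
    ScenarioResult("s1", False, True, "treatment"),
    ScenarioResult("s2", True, True, "tie"),
    ScenarioResult("s3", False, True, "treatment"),
    ScenarioResult("s4", False, False, "baseline"),
]

print(adherence_rate(toy_results, "baseline"))      # 0.25
print(adherence_rate(toy_results, "treatment"))     # 0.75
print(head_to_head_wins(toy_results, "treatment"))  # 2
```

Keeping adherence and head-to-head wins as separate measures matters because they answer different questions: adherence scores each agent against the team's decision, while the head-to-head count compares the two outputs directly within a scenario.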
Negative Control: Stale Intent
The evaluation also tested whether any extra context helps, or whether the context has to be current. In separate controlled runs, agents were given stale intent: a decision the team had already superseded.
The stale-intent agents overwhelmingly followed the outdated directive and performed worse than agents given no Bullseye context at all. That result replicated across two model setups. In one run, the scores were 60 with no context, 91 with current intent, and 43 with stale intent. In a decoupled judge setup, the scores were 43 with no context, 89 with current intent, and 7 with stale intent, and current intent beat stale intent 30 out of 30 times.
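To make the size of the effect easier to read, the short sketch below subtracts the no-context score from the other two conditions in each run. The scores are the ones reported above; the run labels and the function name `deltas_vs_no_context` are hypothetical and exist only for this illustration.

```python
# Scores as reported in the text; keys are the three context conditions.
runs = {
    "first run": {"none": 60, "current": 91, "stale": 43},
    "decoupled judge": {"none": 43, "current": 89, "stale": 7},
}


def deltas_vs_no_context(scores):
    """Return each condition's score minus the no-context score."""
    base = scores["none"]
    return {cond: value - base for cond, value in scores.items() if cond != "none"}


for name, scores in runs.items():
    print(name, deltas_vs_no_context(scores))
# first run {'current': 31, 'stale': -17}
# decoupled judge {'current': 46, 'stale': -36}
```

In both runs the current-intent delta is positive and the stale-intent delta is negative, which is the pattern the negative control was designed to detect.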
The buyer implication is direct: the value is not context volume. The value is current, reconciled, bounded intent. A manually maintained context file that drifts can actively push agents in the wrong direction.
Why This Matters
Teams do not only need agents that understand code. They need agents that understand why the team chose this code, why alternatives were rejected, and when a decision is unresolved enough to ask rather than guess. Those facts often live in tickets, PRs, chats, docs, and prior sessions rather than in the final implementation.
Bullseye captures that intent from work artifacts, reconciles it into current truth, and serves it where coding agents already read. The scorecard is not just a reporting surface; it measures the distance between served intent and agent output so teams can prove the context is improving decisions.
Limitations
- This was internal testing, not a third-party benchmark.
- The scenarios were deliberately selected for divergence from generic best practice, so results should not be generalized to every coding task.
- The tests used Claude model setups. Results may vary across vendors, prompts, and agent harnesses.
- The evaluation measured decision adherence, not full production software quality.
Conclusion
Current served intent made agents substantially more likely to follow the team's real decision. Stale intent made them worse. For teams putting AI agents into real engineering workflows, keeping org intent live is not documentation overhead; it is part of the execution path.