Papaya's 200+ analyses are grounded in current agent and LLM research — from context economy and trajectory health to model routing and verification design. Below: the research areas we draw from, the questions our engines ask, and the optimization patterns they look for in production runs.
Detect what the model actually reads vs. what it's sent — informed by recent context-distillation research.
Score tools on consistency, signal-to-noise, and whether their output changes the model's behavior.
Check whether self-checks actually catch failures the run produced — not just whether they ran.
Identify reasoning that should be staged, parallelized, or collapsed based on observed run traces.
Cluster retries by root cause: parsing, tool flakiness, or instruction conflicts.
Surface scaffolding patterns that consistently outperform on similar workloads.
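To make the retry-clustering idea above concrete, here is a minimal sketch that groups retry events into the three root causes named in the list (parsing, tool flakiness, instruction conflicts). The `RetryEvent` shape, the keyword rules, and the function names are all hypothetical, chosen for illustration; this is not how Papaya's engines necessarily work, and a production classifier would likely need richer signals than error-message keywords.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RetryEvent:
    """Hypothetical record of one retried step and the error that triggered it."""
    step: str
    error: str


# Assumed keyword rules, checked in order; real traces would need tuning.
RULES = {
    "parsing": ("json", "parse", "schema", "unexpected token"),
    "tool_flakiness": ("timeout", "connection", "503", "rate limit"),
    "instruction_conflict": ("conflict", "contradict", "ambiguous"),
}


def classify(event: RetryEvent) -> str:
    """Return the first root cause whose keywords appear in the error text."""
    text = event.error.lower()
    for cause, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return cause
    return "unknown"


def cluster_retries(events: list[RetryEvent]) -> dict[str, list[RetryEvent]]:
    """Group retry events by their classified root cause."""
    clusters: dict[str, list[RetryEvent]] = defaultdict(list)
    for event in events:
        clusters[classify(event)].append(event)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        RetryEvent("extract", "JSONDecodeError: Expecting value"),
        RetryEvent("search", "Request timeout after 30s"),
        RetryEvent("plan", "Instructions contradict the system prompt"),
    ]
    for cause, grouped in cluster_retries(sample).items():
        print(cause, [e.step for e in grouped])
```

Keeping an explicit `unknown` bucket is a deliberate choice in this sketch: retries that match no rule stay visible instead of being forced into the nearest category.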
Or did it claim success without doing the work the user asked for?
Critical fields the model needs are missing, and bloat drowns out what is there.
Half the prompt may be invisible to the answer.
Tool output that quietly pollutes the next step's context.
Some belong in a sub-task; others belong inline.
Hand-offs drop context and roles are unclear.
Papaya runs every analysis on your workflows. Individual findings are tailored to a specific workflow, while these pages explain the broader pattern, impact, and common fixes.
Inputs the model needed to succeed weren't in the prompt or available via tools.
Context was sent but never referenced — bloat without payoff.
Prompts so large the signal gets drowned or truncated downstream.
The same instructions or examples sent over and over across steps.
The agent keeps asking the user for things it could resolve itself.
The same step retried without changing inputs, tools, or strategy.
Runs cut off mid-thought and resumed with lossy state.
Right tool, wrong arguments — or wrong tool for the job entirely.
Repeated failing tool calls without escalation or fallback.
A sequence of tool calls that should be collapsed into one operation.
Writes happening without confirmation, idempotency, or rollback.
Decisions made on stale state instead of the latest observed change.
Returned shape doesn't match what the caller — or the next step — expects.
Whole classes of runs missing from your evals or success metrics.
Evals that don't exercise the failure modes production actually hits.
Over- or under-powered model choices for the step's real difficulty.
Sub-agents adding latency and tokens without earning the hand-off.
Steps run in an order that forces rework or blocks parallelism.
An ad-hoc workflow that recurs often enough to deserve a reusable template.
No structured check between a risky step and what it affects.
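One of the patterns above, the same instructions or examples being sent again and again across steps, lends itself to a small illustration. The sketch below assumes each step's prompt is available as a plain string and that blocks are separated by blank lines; `repeated_blocks` and its parameters are hypothetical names, and exact-match fingerprinting is only one possible way to spot repetition (it will miss paraphrased duplicates).

```python
import hashlib
from collections import defaultdict


def _fingerprint(block: str) -> str:
    """Hash a block after lowercasing and collapsing whitespace."""
    normalized = " ".join(block.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def repeated_blocks(step_prompts: list[str], min_steps: int = 2) -> dict[str, list[int]]:
    """Map each repeated prompt block to the sorted step indices that contain it.

    Blocks are assumed to be separated by blank lines (an assumption of
    this sketch, not a property of any particular framework).
    """
    seen: dict[str, set[int]] = defaultdict(set)
    first_text: dict[str, str] = {}
    for index, prompt in enumerate(step_prompts):
        for block in prompt.split("\n\n"):
            if not block.strip():
                continue
            fp = _fingerprint(block)
            seen[fp].add(index)
            # Remember the first spelling so the report is readable.
            first_text.setdefault(fp, block.strip())
    return {
        first_text[fp]: sorted(steps)
        for fp, steps in seen.items()
        if len(steps) >= min_steps
    }


if __name__ == "__main__":
    prompts = [
        "You are a helpful agent.\n\nTask: fetch the invoice.",
        "You are a  helpful agent.\n\nTask: summarize the invoice.",
        "Task: email the summary.",
    ]
    for block, steps in repeated_blocks(prompts).items():
        print(f"steps {steps}: {block!r}")
```

A block flagged this way is only a candidate: some repetition, such as a short system instruction, may be intentional, so the output is better read as a list of places to look than as a list of defects.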