
feat: reactive mid-generation compression + artifact exclusion #3133

Open
tim-inkeep wants to merge 10 commits into main from
feat/reactive-compression-artifact-exclusion

Conversation

@tim-inkeep
Contributor

Summary

Implementation of spec #3129 (specs/2026-04-14-reactive-compression/).

  • Reactive mid-gen compression via wrapLanguageModel middleware. Compresses and retries only when the provider actually returns a context-overflow error. Preserves the AI SDK multi-step loop.
  • Oversized tool-output exclusion at tool-wrapper.ts's toModelOutput seam. Raw outputs exceeding 30% of the context window are replaced with a structured error-shaped stub matching default-tools.ts's retrieval_blocked shape — the LLM sees a tool failure it can react to instead of bloated content.
  • Removed proactive compression from prepareStep callback.
  • Removed _oversizedWarning injection from BaseCompressor and ArtifactService; metadata.isOversized remains the durable signal.

Stories delivered

Story Commit
US-001 Provider-aware context-overflow detector 9f25d9c57
US-002 compressionRetryMiddleware (wrapGenerate + wrapStream) 555620fc2
US-003 Wire middleware at generate call sites; remove proactive compression c12970714
US-004 Oversized exclusion at tool-wrapper.ts:111 toModelOutput 4ec8c2131
US-005 Remove _oversizedWarning injection 83ae43a4e
US-006 Changeset 6bcd3bf2d

Architecture highlights

  • compressionRetryMiddleware is per-request (factory closes over run context — DEC-13), so compression scope matches run scope.
  • doGenerate / doStream are nullary in @ai-sdk/[email protected] — retry uses options.model.doGenerate(modifiedOptions) directly. Verified middleware contract is identical across 3.0.2 and 3.0.4.
  • Stream peek uses async-iterator .next() + a prepend-then-pipe generator (see the sketch after this list). No tee(), no array buffering (DEC-07). TTFT on the happy path is within microseconds of an unwrapped model.
  • Post-commit mid-stream errors propagate without retry (R7) — safe because tool state may already be captured and text may already have been forwarded.
  • Pre-generation conversation-history compression is unchanged (acts as safety net).
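
A minimal sketch of the peek-and-prepend idea, with illustrative names (the real middleware wires this into wrapStream, classifies the peeked chunk with the overflow detector, and converts the generator back into a ReadableStream):

async function peekThenPipe<T>(source: AsyncIterator<T>): Promise<{ first: IteratorResult<T>; rest: () => AsyncGenerator<T> }> {
  // Single .next() call: no tee(), no buffering of the whole stream.
  const first = await source.next();
  // Prepend the already-read chunk, then pipe the untouched remainder through.
  async function* rest(): AsyncGenerator<T> {
    if (!first.done) yield first.value;
    while (true) {
      const { value, done } = await source.next();
      if (done) return;
      yield value;
    }
  }
  return { first, rest };
}

On the happy path only this one-chunk peek happens before piping, which is why TTFT stays essentially unchanged.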

Telemetry added

Span attributes: compression.trigger, compression.provider, compression.detector, compression.retry_number, compression.outcome, anthropic_overflow_regex_hit, tool.result.oversized_excluded.
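
A rough sketch of how these might be recorded with the OpenTelemetry JS API. The attribute keys are the ones listed above; the helper body and example values below are illustrative, not the actual implementation:

import { trace } from '@opentelemetry/api';

// Illustrative helper: attach compression-retry attributes to the active span.
function setRetryTelemetry(provider: string, detector: string, retryNumber: number, outcome: string) {
  trace.getActiveSpan()?.setAttributes({
    'compression.trigger': 'provider_overflow', // example value, not necessarily the real one
    'compression.provider': provider,
    'compression.detector': detector,
    'compression.retry_number': retryNumber,
    'compression.outcome': outcome,
  });
}

// e.g. setRetryTelemetry('anthropic', 'regex', 1, 'retry_succeeded');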

Changeset

`@inkeep/agents-api` patch: "Make mid-generation compression reactive to provider overflow errors and exclude oversized tool outputs from LLM context"

Test plan

  • pnpm typecheck passes
  • pnpm lint passes
  • pnpm --filter @inkeep/agents-api test --run passes (includes new tests)
  • Review: compressionRetryMiddleware.ts + detectContextOverflow.ts (new code paths)
  • Review: tool-wrapper.ts:111 oversized-exclusion wiring
  • Review: removed token-budget branch in ai-sdk-callbacks.ts and compression.ts helper
  • Manual verification (recommended): run an agent with a tool that returns an oversized payload; confirm via Jaeger that tool.result.oversized_excluded = true appears and no content bytes reach the LLM.

Notes

🤖 Generated with Claude Code

tim-inkeep and others added 8 commits April 14, 2026 13:43
Ready-for-implementation spec for refactoring agents-api compression triggers:
- Mid-gen compression becomes reactive (via AI SDK wrapLanguageModel middleware
  that catches provider-signaled overflow and retries once with compressed input)
- Oversized tool outputs excluded at tool-wrapper.ts toModelOutput seam with
  structured error-shaped tool result (matching default-tools.ts retrieval_blocked)
- Pre-generation conversation-history compression unchanged (safety net)

Includes four evidence files (compression triggers, oversized artifact handling,
provider overflow signals, middleware approach), audit + challenger findings,
and audit-resolution changelog. 14 locked/directed decisions at HIGH confidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Correct @ai-sdk/provider version citation: 3.0.4 → 3.0.2 (verified
  middleware contract identical across both via dist/index.d.ts diff)
- Fix formatOversizedReason → formatOversizedRetrievalReason (actual
  function exported from artifact-utils.ts:50)
- Clarify R8 telemetry: all attributes are span attributes, not log fields
- Remove local filesystem path from §17 References
- Add End-User visibility gap to Risks (MEDIUM) with operator mitigation
- Add Traces UI surfacing, enriched stub metadata, and retrieval-tool
  migration note to Future Work
- Add docs-check step to Verification Plan

No design changes. All review findings addressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…roactive compression

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…rtifactService

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@changeset-bot

changeset-bot Bot commented Apr 14, 2026

🦋 Changeset detected

Latest commit: 10e0d12

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-api Patch
@inkeep/agents-manage-ui Patch
@inkeep/agents-cli Patch
@inkeep/agents-core Patch
@inkeep/agents-email Patch
@inkeep/agents-mcp Patch
@inkeep/agents-sdk Patch
@inkeep/agents-work-apps Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch



@vercel

vercel Bot commented Apr 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Apr 15, 2026 2:09pm
agents-manage-ui Ready Ready Preview, Comment Apr 15, 2026 2:09pm
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
agents-docs Skipped Skipped Apr 15, 2026 2:09pm


@github-actions
Contributor

I could not sync this public PR into agents-private automatically.

Patch application failed. The diff could not be applied cleanly. Please rebase your PR on the latest main.

@pullfrog
Contributor

pullfrog Bot commented Apr 14, 2026

Summary

This PR fundamentally changes how mid-generation context compression is triggered: instead of proactively compressing based on token-budget predictions (which often fire unnecessarily), compression now only happens reactively when the LLM provider actually returns a context-overflow error. It also adds a new seam to exclude oversized tool outputs from reaching the LLM context at all, replacing them with structured error stubs.

Key changes

  • New compressionRetryMiddleware wraps the language model via AI SDK's wrapLanguageModel, intercepting overflow errors and retrying with compressed prompts
  • New detectContextOverflow module identifies provider-specific overflow errors (OpenAI context_length_exceeded code, Anthropic regex patterns)
  • Gutted the token-budget prediction branch from handlePrepareStepCompression — it now always returns {}
  • Moved compression logic from the prepareStep callback into a standalone buildCompressPrompt function reused by the middleware
  • tool-wrapper.ts's toModelOutput hook now checks for oversized tool outputs before they reach the LLM, substituting a structured error stub
  • Removed _oversizedWarning field injection from BaseCompressor and ArtifactService (metadata isOversized flag remains)

Reactive compression middleware

Before: The prepareStep AI SDK callback checked token counts after each step — using actual usage from the previous step or falling back to estimates — and preemptively compressed if the context appeared to exceed a per-model threshold (75-91%). This fired on many calls that would have succeeded, causing unnecessary LLM summarization calls, context detail loss, and injected synthetic messages.

After: A new compressionRetryMiddleware.ts implements LanguageModelMiddleware with wrapGenerate and wrapStream hooks. For non-streaming: it catches overflow errors from doGenerate(), compresses the prompt via buildCompressPrompt, and retries once. For streaming: it peeks the first chunk using async-iterator .next() — if it's an overflow error ("pre-commit"), it compresses and retries; if it's real data, it prepends the buffered chunk and pipes the rest through. A second overflow propagates without further retry. The middleware is constructed per-request, closing over the run-scoped compressor. The middleware includes a version comment noting verification against @ai-sdk/[email protected] and 3.0.4 — a reminder to re-verify if the SDK bumps to v4.
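
A rough sketch of the non-streaming half described above. The function and module names match the ones discussed in this PR, but the option shapes and body are assumptions rather than the actual implementation (which also implements wrapStream, sets specificationVersion, and emits telemetry):

import { isContextOverflowError } from './detectContextOverflow'; // import path assumed

export function createCompressionRetryMiddleware(ctx: {
  compressPrompt: (prompt: unknown[]) => Promise<unknown[]>;
}) {
  return {
    async wrapGenerate({
      doGenerate,
      params,
      model,
    }: {
      doGenerate: () => Promise<unknown>;
      params: { prompt: unknown[] } & Record<string, unknown>;
      model: { doGenerate: (options: Record<string, unknown>) => Promise<unknown> };
    }) {
      try {
        return await doGenerate();
      } catch (err) {
        // Non-overflow errors pass through untouched.
        if (!isContextOverflowError(err)) throw err;
        const compressedPrompt = await ctx.compressPrompt(params.prompt);
        // doGenerate() is nullary in this SDK version, so the single retry goes through
        // model.doGenerate with modified options; a second overflow propagates to the caller.
        return await model.doGenerate({ ...params, prompt: compressedPrompt });
      }
    },
  };
}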

Context overflow detection

New detectContextOverflow.ts provides isContextOverflowError(err) which discriminates between real context overflow and other 400 errors. It checks for OpenAI's stable context_length_exceeded error code via data.error.code, and falls back to regex patterns for Anthropic messages (prompt is too long, input length and max_tokens exceed context limit). Explicitly excludes 413 (byte-limit) and non-400 status codes. Includes OTEL span attributes for monitoring regex-based detections.
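
A sketch of that discrimination logic. The error-object field names (statusCode, data.error.code, data.error.message) are assumptions based on the description above, and the regex wording follows the messages quoted in this PR; the real module also records the OTEL attributes mentioned here when a regex fires:

// Anthropic has no stable overflow code, so message patterns are the fallback.
const ANTHROPIC_OVERFLOW_PATTERNS = [
  /prompt is too long/i,
  /input length and max_tokens exceed context limit/i,
];

export function isContextOverflowError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const e = err as {
    statusCode?: number;
    message?: string;
    data?: { error?: { code?: string; message?: string } };
  };

  // Only 400s count; 413 is a byte limit, not a token overflow, and other statuses are excluded.
  if (e.statusCode !== 400) return false;

  // OpenAI exposes a stable machine-readable code.
  if (e.data?.error?.code === 'context_length_exceeded') return true;

  // Otherwise fall back to the Anthropic message patterns.
  const message = e.data?.error?.message ?? e.message ?? '';
  return ANTHROPIC_OVERFLOW_PATTERNS.some((re) => re.test(message));
}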

Oversized tool output exclusion at the source

Before: tool-wrapper.ts's toModelOutput unconditionally called buildToolResultForModelInput(output), sending raw tool results — even gigantic ones — into the LLM context. Oversized detection only affected downstream compression/artifact storage.

After: toModelOutput now calls detectOversizedArtifact against the model's context window size. If the output exceeds the threshold, it returns a structured JSON stub with status: 'oversized', the tool name, args, structure info, and a recommendation to narrow queries. This matches the existing retrieval_blocked shape from default-tools.ts so the LLM interprets it as a tool-level failure. OTEL attributes are set for observability. The context window is now resolved lazily using ctx.currentModelSettings (stashed after configureModelSettings runs), fixing a bug where getModelContextWindow() was called without args and always returned the 120K default — making the 30% oversized threshold hardcoded at ~36K regardless of the actual model.
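
The decision flow, roughly. The helper names are the ones referenced in this PR (detectOversizedArtifact, formatOversizedRetrievalReason, buildToolResultForModelInput), but the wrapper function, its signature, and the stub fields shown here are a simplified illustration of tool-wrapper.ts rather than the actual code:

interface OversizedDetection {
  isOversized: boolean;
  originalTokenSize: number;
  structureInfo?: string;
}

function buildModelOutputForTool(opts: {
  output: unknown;
  toolName: string;
  contextWindowSize: number; // resolved lazily from ctx.currentModelSettings, not the 120K fallback
  detectOversizedArtifact: (output: unknown, contextWindowSize: number) => OversizedDetection;
  formatOversizedRetrievalReason: (tokens: number, contextWindow: number) => string;
  buildToolResultForModelInput: (output: unknown) => unknown;
}) {
  const detection = opts.detectOversizedArtifact(opts.output, opts.contextWindowSize);
  if (!detection.isOversized) {
    // Normal-sized outputs keep the existing pass-through path.
    return opts.buildToolResultForModelInput(opts.output);
  }
  // Oversized outputs never reach the LLM: return an error-shaped stub instead,
  // mirroring the retrieval_blocked shape so the model treats it as a tool failure.
  return {
    type: 'json' as const,
    value: {
      status: 'oversized',
      toolName: opts.toolName,
      reason: opts.formatOversizedRetrievalReason(detection.originalTokenSize, opts.contextWindowSize),
      structureInfo: detection.structureInfo ?? '',
      recommendation: 'Narrow filters/queries or break the request into smaller pieces on the next call.',
    },
  };
}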

Compression logic extraction

The summary-building logic (artifact references, stop instructions, compression cycle tracking) that previously lived inline in handlePrepareStepCompression was extracted into buildCompressPrompt() in compression.ts. This function now accepts an originalMessageCount parameter and preserves the original message prefix (system + user message + pre-generation conversation history) when compressing — only the generated tail (tool calls + assistant responses accumulated during the current run) is summarized. This matches the pre-middleware behavior where originalMessages = stepMessages.slice(0, originalMessageCount) and the summary was appended after the preserved prefix.
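
A minimal sketch of the prefix-preserving split, with a simplified message type and the summary construction elided (the real buildCompressPrompt also handles artifact references and stop instructions):

type ChatMessage = { role: 'system' | 'user' | 'assistant' | 'tool'; content: unknown };

function compressMessages(
  messages: ChatMessage[],
  originalMessageCount: number,
  summarizeTail: (tail: ChatMessage[]) => ChatMessage // summary injected as a single message
): ChatMessage[] {
  // Preserve the system prompt, original user message, and pre-generation history as-is.
  const preservedPrefix = messages.slice(0, originalMessageCount);
  // Only the generated tail (tool calls + assistant responses from the current run) is summarized.
  const generatedTail = messages.slice(originalMessageCount);
  return [...preservedPrefix, summarizeTail(generatedTail)];
}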

Removal of _oversizedWarning field

The _oversizedWarning string field was removed from summaryData injection in both BaseCompressor.ts and ArtifactService.ts. Since oversized tool outputs are now excluded at the tool-wrapper seam before reaching the LLM, the warning in compression/artifact metadata is redundant. The metadata.isOversized boolean and _structureInfo field remain as the durable signals.

Async-iterator fallback fix

The toAsyncIterator helper in compressionRetryMiddleware.ts previously used an unsound as unknown as AsyncIterator cast on ReadableStream. This has been replaced with a proper Reader-to-iterator adapter that wraps reader.read() into .next() calls — safe if the fallback branch is ever triggered on a runtime without Symbol.asyncIterator on ReadableStream.

Testing

  • New compressionRetryMiddleware.test.ts (299 lines): Tests wrapGenerate passthrough, overflow retry success, double-overflow propagation, non-overflow error passthrough. Tests wrapStream with data passthrough, pre-commit overflow retry, pre-commit non-overflow error throw, and post-commit mid-stream error propagation. Includes a code-review constraint test verifying no .tee() or Array.from usage on streams.
  • New detectContextOverflow.test.ts (114 lines): Comprehensive coverage of OpenAI code-based detection, both Anthropic regex patterns, 413 exclusion, generic 400 without overflow indicators, non-API errors, null/undefined/string inputs, 500 status with overflow message, and wrong data shapes.
  • New tool-wrapper-oversized.test.ts (152 lines): Tests oversized exclusion stub structure for large outputs, passthrough for normal-sized outputs, and stub payload contents.
  • Updated ai-sdk-callbacks.test.ts: Gutted ~220 lines of token-budget prediction tests, replaced with 2 tests confirming the function is now a no-op.
  • Updated BaseCompressor.test.ts and OversizedArtifacts.test.ts: assertions now expect _oversizedWarning to be undefined.

Spec & documentation

The PR includes a full spec at specs/2026-04-14-reactive-compression/SPEC.md covering the problem statement (SCR format), goals/non-goals, personas, current vs target state, architecture diagram, and detailed design. Supporting evidence documents cover current compression triggers, middleware approach rationale, oversized artifact handling investigation, and provider overflow signal formats. Meta documents include a changelog, audit findings, and design challenge writeup.


Contributor

@pullfrog pullfrog Bot left a comment


Well-structured PR. The reactive compression approach (detect overflow from provider error, compress, retry once) is cleaner than the proactive token-budget check it replaces. The stream peek strategy is sound — single .next() call, no tee(), no buffering. The toModelOutput oversized exclusion wiring is the right seam for filtering before the LLM sees bloated content. Two issues worth addressing: the toAsyncIterator fallback has a broken cast, and the context window used for oversized detection is always the 120K default rather than the actual model's window.


if (Symbol.asyncIterator in stream) {
  return (stream as AsyncIterable<StreamPart>)[Symbol.asyncIterator]();
}
return (stream as ReadableStream<StreamPart>).getReader() as unknown as AsyncIterator<StreamPart>;
Contributor


ReadableStreamDefaultReader does not have .next() — it has .read(). This cast to AsyncIterator is unsound and would throw at runtime if the branch were reached.

In practice, modern Node.js ReadableStream has Symbol.asyncIterator, so the if branch on line 33 always triggers and this fallback is dead code. But the cast is still incorrect — either remove the fallback entirely, or wrap the reader in a proper async-iterator adapter:

const reader = (stream as ReadableStream<StreamPart>).getReader();
return {
  next: () => reader.read() as Promise<IteratorResult<StreamPart>>,
};

const relationshipId = getRelationshipIdForTool(ctx, toolName, toolType);

const originalExecute = toolDefinition.execute;
const contextWindowSize = getModelContextWindow().contextWindow ?? 120000;
Contributor


getModelContextWindow() without arguments always returns the 120K default (source: fallback). This means the 30% oversized threshold is hardcoded at ~36K tokens regardless of the actual model. An agent using Claude with a 200K context window would still exclude results above ~36K tokens instead of ~60K.

Consider passing model settings from ctx.config or accepting contextWindowSize as a parameter to wrapToolWithStreaming so the threshold matches the generation model.

}
return {
  type: 'json' as const,
  value: JSON.parse(
Contributor


The JSON.parse(JSON.stringify({...})) round-trip is a no-op on this plain object literal — every value is already JSON-safe (strings, numbers, plain objects). You can return the object directly as value.

Suggested change
-  value: JSON.parse(
+  value: {
+    status: 'oversized',
+    toolCallId: lastToolCallId,
+    toolName,
+    warning:
+      '⚠️ Tool produced an oversized result that cannot be included in the conversation.',
+    reason: formatOversizedRetrievalReason(
+      detection.originalTokenSize,
+      detection.contextWindowSize ?? contextWindowSize
+    ),
+    toolInfo: {
+      toolName,
+      toolArgs: lastArgs,
+      structureInfo: detection.structureInfo ?? '',
+    },
+    recommendation:
+      'Consider: 1) narrowing filters/queries on the next tool call, 2) asking the user to break down the request, 3) processing data differently.',
+  },

}

const summaryMessage = JSON.stringify(summaryData);
const systemMessages = messages.filter((m: any) => m.role === 'system');
Contributor


The old handlePrepareStepCompression kept both system messages and original conversation messages (stepMessages.slice(0, originalMessageCount)). This new buildCompressPrompt only keeps role === 'system' messages. That means the original user query and any initial conversation history are dropped — the LLM only sees the compression summary.

Is this intentional? If the original user message carries important framing that the summary might not perfectly capture, the LLM could lose track of the original ask after compression.

  ctx: CompressionRetryContext
): LanguageModelMiddleware {
  return {
    specificationVersion: 'v3',
Contributor


The specificationVersion: 'v3' field may tie this to a specific AI SDK middleware contract version. If the SDK upgrades the spec version (e.g. to v4), this middleware might silently stop being called. Consider adding a comment noting which @ai-sdk/provider versions this was verified against (you mention 3.0.2–3.0.4 in the PR description) so future maintainers know to re-check.

…nal messages

- Context window (pullfrog #2, load-bearing): getModelContextWindow() was
  called without args and always returned the 120K default, so the 30%
  oversized threshold was hardcoded at ~36K regardless of the actual
  model. Added currentModelSettings to AgentRunContext, stashed after
  configureModelSettings, and read lazily inside toModelOutput.
- Compression prompt (pullfrog #4, load-bearing): buildCompressPrompt
  only kept role==='system' messages, dropping the original user query
  and conversation-history prefix. Now takes originalMessageCount and
  preserves messages.slice(0, originalMessageCount) as the prefix —
  matching the pre-middleware handlePrepareStepCompression behavior.
- Async-iterator fallback (pullfrog #1): replaced the unsound
  `as unknown as AsyncIterator` cast with a proper Reader → iterator
  adapter so the dead branch is safe if ever triggered.
- Middleware spec-version comment (pullfrog #5): documented which
  @ai-sdk/provider versions the wrapGenerate/wrapStream contract was
  verified against.
- JSON round-trip (pullfrog #3): kept as-is. The round-trip is not a
  no-op — it launders `unknown` tool args through JSONValue and strips
  non-JSON types. Added a comment explaining this.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

@claude claude Bot left a comment


PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) Missing test coverage: compression failure during retry is untested

Issue: The wrapGenerate and wrapStream implementations call ctx.compressPrompt() which can throw (e.g., summarizer model failure, network error), but there is no test verifying the behavior when compression itself fails. The compressPrompt call at lines 156 and 186 is outside any explicit error handling in the middleware.

Why: If compressPrompt throws during a retry attempt, the error propagation behavior is untested. A bug here could cause silent failures or incorrect error messages being surfaced to users. This is a realistic failure mode since compression involves an LLM call. The tests currently only verify success paths and overflow retry scenarios, not compression infrastructure failures.

Fix: Add test:

it('propagates compression errors during retry', async () => {
  const ctx = { 
    compressPrompt: vi.fn().mockRejectedValue(new Error('Summarizer failed')) 
  };
  const middleware = createCompressionRetryMiddleware(ctx);
  const doGenerate = vi.fn().mockRejectedValue(makeOverflowError());
  
  await expect(middleware.wrapGenerate?.({
    doGenerate,
    doStream: vi.fn(),
    params: { prompt: [] } as any,
    model: { doGenerate: vi.fn(), doStream: vi.fn() } as any,
  })).rejects.toThrow('Summarizer failed');
  
  expect(ctx.compressPrompt).toHaveBeenCalledOnce();
});



🟠 2) compression.ts buildCompressPrompt has no dedicated unit tests

Issue: The new buildCompressPrompt function (lines 41-104) transforms the prompt after compression, including critical logic for: (1) handling array vs object summary results, (2) injecting stop instructions based on compression cycle count, (3) mapping related_artifacts to artifact_reference tags, and (4) filtering system messages. These code paths are only tested implicitly through integration.

Why: Bugs in buildCompressPrompt could cause: (1) incorrect prompt structure after compression leading to LLM confusion, (2) stop instructions not being applied correctly causing infinite loops, (3) artifact references being malformed. The function has conditional branches (compressionCycle >= 1, hasNewWork checks) that should be explicitly tested.

Fix: Create unit tests for buildCompressPrompt:

describe('buildCompressPrompt', () => {
  it('injects stop instruction after multiple compression cycles', async () => {
    const compressor = makeCompressor({ 
      getCompressionCycleCount: vi.fn().mockReturnValue(2) 
    });
    const prompt = buildCompressPrompt(compressor);
    const result = await prompt([
      { role: 'system', content: 'sys' }, 
      { role: 'user', content: 'hi' }
    ]);
    expect(result[1].content).toContain('STOP ALL TOOL CALLS');
  });

  it('preserves system messages and removes others', async () => { /* ... */ });
  it('formats artifact references correctly', async () => { /* ... */ });
});



Inline Comments:

  • 🟠 Major: compressionRetryMiddleware.ts:156 Compression failure lacks telemetry
  • 🟡 Minor: tool-wrapper.ts:136-155 Unnecessary JSON.parse(JSON.stringify()) wrapper

🕐 Pending Recommendations (2)

These issues were raised in the prior pullfrog review and are still unresolved:

  • 🟠 compressionRetryMiddleware.ts:36 toAsyncIterator fallback casts ReadableStreamDefaultReader to AsyncIterator, but these interfaces are incompatible. reader.next() will throw TypeError when the stream lacks Symbol.asyncIterator.

  • 🟠 tool-wrapper.ts:114 getModelContextWindow() called without arguments always returns 120K fallback. The oversized detection threshold will be wrong for models with different context windows.


🚫 REQUEST CHANGES

Summary: The reactive compression approach is architecturally sound — detecting overflow from provider errors and retrying once is cleaner than proactive token-budget prediction. The stream peek strategy (single .next(), no tee(), no buffering) is efficient. However, the two issues flagged by pullfrog remain unaddressed (broken toAsyncIterator cast, context window always using 120K fallback), and the new middleware lacks telemetry for compression failures and test coverage for failure paths. The JSON.parse/stringify wrapper on a plain object is also unnecessary overhead.

Discarded (8)
Location | Issue | Reason Discarded
compressionRetryMiddleware.ts:140 | specificationVersion: 'v3' hardcoded | Valid observation but low risk — existing middleware in agents-core uses same pattern; adding a shared constant is a nice-to-have, not blocking
compression.ts:41 | safeCompress result shape handling is fragile | The array vs object handling matches existing compressor contract; spread operator would be a refactor preference
ai-sdk-callbacks.ts:11 | Vestigial handlePrepareStepCompression parameters | Intentional scaffolding per spec — prep for future step-based logic
detectContextOverflow.ts:22 | 413 status exclusion may silently misclassify | 413 is bytes not tokens per HTTP spec; adding trace-level logging is a nitpick
compressionRetryMiddleware.ts:173 | Original stream not explicitly released | The reader is consumed (one chunk read); explicit cancel is a nice-to-have for edge cases
compressionRetryMiddleware.ts:12 | Loose StreamPart type with index signature | Intentional flexibility for AI SDK internal chunk variations
compression.ts:45 | Prompt cast to any[] loses type info | Type guards would be nice but current implementation works; not blocking
compressionRetryMiddleware.test.ts:n/a | Missing test for empty stream (immediate EOF) | Edge case with synthetic finish chunk handling; nice-to-have
Reviewers (6)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 3 0 0 0 1 2 0
pr-review-tests 7 2 0 0 0 0 5
pr-review-errors 5 0 0 0 1 2 2
pr-review-sre 5 0 0 0 0 2 3
pr-review-types 5 0 0 0 0 2 3
pr-review-architecture 5 0 0 0 0 2 3
Total 30 2 0 0 2 2 8

Note: Multiple reviewers flagged the same two issues (toAsyncIterator cast, context window fallback) that were already raised by pullfrog — these were consolidated into Pending Recs rather than duplicated.

'Context overflow detected in doGenerate, compressing and retrying'
);

const compressedPrompt = await ctx.compressPrompt(params.prompt as unknown[]);
Contributor


🟠 MAJOR: Compression failure lacks telemetry

Issue: When ctx.compressPrompt() throws, the error propagates without calling setRetryTelemetry(). The compression.outcome span attribute is only set on success (line 163) or retry error (line 167), not on compression infrastructure failures.

Why: During incidents, operators cannot distinguish compression failures (summarizer timeout, model unavailable) from LLM errors in observability dashboards. This creates a blind spot that extends mean-time-to-resolution. The same pattern applies to the streaming path at line 186.

Fix:

let compressedPrompt: unknown[];
try {
  compressedPrompt = await ctx.compressPrompt(params.prompt as unknown[]);
} catch (compressionErr) {
  setRetryTelemetry(provider, detector, 1, 'compression_failed');
  logger.error(
    { provider, detector, error: compressionErr instanceof Error ? compressionErr.message : String(compressionErr) },
    'Compression failed during overflow retry'
  );
  throw compressionErr;
}


Comment on lines +136 to +155
value: JSON.parse(
  JSON.stringify({
    status: 'oversized',
    toolCallId: lastToolCallId,
    toolName,
    warning:
      '⚠️ Tool produced an oversized result that cannot be included in the conversation.',
    reason: formatOversizedRetrievalReason(
      detection.originalTokenSize,
      detection.contextWindowSize ?? contextWindowSize
    ),
    toolInfo: {
      toolName,
      toolArgs: lastArgs,
      structureInfo: detection.structureInfo ?? '',
    },
    recommendation:
      'Consider: 1) narrowing filters/queries on the next tool call, 2) asking the user to break down the request, 3) processing data differently.',
  })
),
Contributor


🟡 Minor: Unnecessary JSON.parse(JSON.stringify()) wrapper

Issue: The oversized stub is a plain object literal constructed inline with only primitive values and nested plain objects. The JSON.parse(JSON.stringify(...)) round-trip has no effect on such an object.

Why: This pattern is typically used for deep-cloning or stripping non-serializable properties. Here, the object is constructed from scratch with no circular references, functions, or special objects — the serialization round-trip is pure overhead (CPU, memory, potential error surface if any dynamic value is non-serializable).

Fix: (1-click apply)

Suggested change
-  value: JSON.parse(
-    JSON.stringify({
-      status: 'oversized',
-      toolCallId: lastToolCallId,
-      toolName,
-      warning:
-        '⚠️ Tool produced an oversized result that cannot be included in the conversation.',
-      reason: formatOversizedRetrievalReason(
-        detection.originalTokenSize,
-        detection.contextWindowSize ?? contextWindowSize
-      ),
-      toolInfo: {
-        toolName,
-        toolArgs: lastArgs,
-        structureInfo: detection.structureInfo ?? '',
-      },
-      recommendation:
-        'Consider: 1) narrowing filters/queries on the next tool call, 2) asking the user to break down the request, 3) processing data differently.',
-    })
-  ),
+return {
+  type: 'json' as const,
+  value: {
+    status: 'oversized',
+    toolCallId: lastToolCallId,
+    toolName,
+    warning:
+      '⚠️ Tool produced an oversized result that cannot be included in the LLM context. The full result is stored as an artifact and can be retrieved later if needed.',
+    reason: formatOversizedRetrievalReason(
+      detection.originalTokenSize,
+      detection.contextWindowSize ?? contextWindowSize
+    ),
+    toolInfo: {
+      toolName,
+      toolArgs: lastArgs,
+      structureInfo: detection.structureInfo ?? '',
+    },
+    recommendation:
+      'Consider: 1) narrowing filters/queries, 2) requesting specific fields, or 3) breaking into smaller requests.',
+  },
+};


@github-actions github-actions Bot deleted a comment from claude Bot Apr 14, 2026
@itoqa

itoqa Bot commented Apr 14, 2026

Ito Test Report ✅

16 test cases ran. 1 additional finding, 15 passed.

Across 16 test cases, 15 passed and 1 failed, showing generally stable local non-production behavior for baseline and edge chat flows (streaming/non-streaming completions, rapid same-conversation requests, mid-stream reload resume, and mobile 390x844), overflow handling (bounded recovery with a one-retry cap and no retries for non-overflow errors), oversized tool-output safeguards, and authorization/injection probes. The key finding was one High-severity validation gap in /run/api/chat: malformed approval-response-shaped payloads in messages[].content can bypass the approval guard when conversationId is omitted and proceed through normal execution instead of returning the intended HTTP 400. No cross-tenant access bypass was observed.

✅ Passed (15)
Category Summary Screenshot
Adversarial Oversized repeated-token streaming payload completed in bounded time with clean stream termination. ADV-1
Adversarial Forged x-target-* headers did not grant cross-tenant scope; requests were denied under local dev/test auth behavior. ADV-2
Adversarial Injection-shaped strings remain inert literal text through the oversized stub serialization path. ADV-4
Edge Using one shared conversationId, 4 rapid requests all completed quickly with deterministic terminal responses and no deadlock/hung stream. EDGE-1
Edge A streaming request produced partial output, page reload occurred mid-stream, and a follow-up with the same conversationId completed (HTTP 200), showing no stuck pending stream blocker. EDGE-2
Edge Out-of-order and missing-turn history payloads completed safely without crashes or stack-trace leakage. EDGE-3
Edge Original BLOCKED result was not a product bug: prior run hung on a restart-server tooling step. After proper setup and non-production playground-token bypass for missing temp signing keys, the test was re-executed at 390x844 and mobile chat flow could be initiated and completed. EDGE-4
Logic Non-streaming overflow-retry request completed with HTTP 200 and assistant output in bounded time. LOGIC-1
Logic Retry budget behavior is capped at one retry and a second overflow is propagated (verified via source and deterministic tests). LOGIC-2
Logic Invalid model/agent request returned an immediate terminal error with no retry side effects. LOGIC-3
Logic Streaming recoverable overflow path returned normal first chunks and completed cleanly. LOGIC-4
Logic Oversized tool output is replaced by a structured status=oversized stub and raw payload is excluded. LOGIC-5
Logic Normal-size tool output follows the standard pass-through path without oversized stubbing. LOGIC-6
Logic Oversized artifact metadata keeps isOversized and _structureInfo while _oversizedWarning remains absent. LOGIC-7
Happy-path Baseline streaming request returned HTTP 200 with SSE chunks and clean stream completion after local setup alignment. ROUTE-1
ℹ️ Additional Findings (1)

These findings are unrelated to the current changes but were observed during testing.

Category Summary Screenshot
Adversarial ⚠️ Malformed approval-response content can bypass validation and proceed down normal chat execution flow. ADV-3
⚠️ Approval-response payload bypasses validation without conversation context
  • What failed: The endpoint should reject malformed approval-response input with HTTP 400 and require conversationId, but this payload shape is not recognized as an approval response and execution proceeds as a normal chat request.
  • Impact: Clients can submit malformed approval-response payloads that bypass intended validation and trigger normal execution paths. This weakens approval flow integrity and can create unintended conversation state.
  • Steps to reproduce:
    1. Send a POST request to /run/api/chat with a user message containing an approval-response-shaped object in messages[].content.
    2. Omit conversationId from the request body.
    3. Observe that the request can continue through normal execution flow instead of returning a 400 validation failure.
  • Stub / mock context: Requests were executed with local bypass authentication and a mock/default model; no extra route interception was used to fake the approval-response behavior itself.
  • Code analysis: I reviewed the chat route and message parsing types. Approval detection only inspects messages[].parts for state === 'approval-responded', while the request schema accepts content as unconstrained data and the normal path auto-creates a conversationId, so malformed content payloads avoid the approval guard and continue into execution setup.
  • Why this is likely a bug: The route enforces conversationId only in an approval branch that malformed content payloads can skip, while schema permissiveness and normal-path conversation creation allow unintended execution.

Relevant code:

agents-api/src/domains/run/routes/chatDataStream.ts (lines 126-145)

const approvalParts = (body.messages || [])
  .flatMap((m: any) => m?.parts || [])
  .filter((p: any) => p?.state === 'approval-responded' && typeof p?.toolCallId === 'string');

const isApprovalResponse = approvalParts.length > 0;

if (isApprovalResponse) {
  const conversationId = body.conversationId;
  if (!conversationId) {
    return c.json(
      {
        success: false,
        error: 'conversationId is required for approval response',
      },
      400
    );
  }
}

agents-api/src/domains/run/routes/chatDataStream.ts (lines 284-341)

const conversationId = body.conversationId ?? getConversationId();

return await otelContext.with(ctxWithBaggage, async () => {
  const agent = executionContext.project.agents[agentId];
  if (!agent) {
    throw createApiError({
      code: 'not_found',
      message: 'Agent not found',
    });
  }

  const activeAgent = await getActiveAgentForConversation(runDbClient)({
    scopes: { tenantId, projectId },
    conversationId,
  });
  if (!activeAgent) {
    await setActiveAgentForConversation(runDbClient)({
      scopes: { tenantId, projectId },
      conversationId,
      subAgentId: defaultSubAgentId,
      ref: executionContext.resolvedRef,
      agentId: agentId,
    });
  }
});

agents-api/src/domains/run/types/chat.ts (lines 132-136)

export const VercelMessageSchema = z.object({
  role: z.enum(['system', 'user', 'assistant', 'function', 'tool']),
  content: z.any(),
  parts: z.array(VercelMessagePartSchema).optional(),
});

Commit: 0b0d51f


Contributor

@claude claude Bot left a comment


PR Review Summary

(0) New Issues in Delta | Risk: Low

✅ Delta Review: All Pullfrog Issues Addressed

The commit 10e0d12a3 successfully addresses all issues raised in the prior pullfrog review:

Issue Status Fix
toAsyncIterator broken cast ✅ Fixed Reader now properly wrapped with { next: () => reader.read() } instead of unsafe cast
Context window always 120K ✅ Fixed toModelOutput now uses getModelContextWindow(ctx.currentModelSettings) with stored settings
Original messages dropped ✅ Fixed buildCompressPrompt now preserves originalMessages.slice(0, originalMessageCount) prefix
SDK version comment ✅ Added Comment documents verified SDK versions (3.0.2, 3.0.4) for future maintainers

🔍 Delta Changes Reviewed

5 files modified (+35/-9 lines):

  1. agent-types.ts — Added currentModelSettings?: ModelSettings to AgentRunContext
  2. compression.ts — buildCompressPrompt now accepts originalMessageCount and preserves prefix ✓
  3. generate.ts — Sets ctx.currentModelSettings = primaryModelSettings before tool wrapping ✓
  4. tool-wrapper.ts — Context window lookup now uses actual model settings ✓
  5. compressionRetryMiddleware.ts — Fixed async iterator adapter + added SDK version comment ✓

🕐 Pending Recommendations (4)

Issues from prior reviews that remain valid for unchanged code paths:


✅ APPROVE

Summary: The delta cleanly addresses all blocking issues from the prior review. The fixes are minimal and correct — the async iterator adapter now properly wraps the reader, context window detection uses actual model settings, and compression preserves the original conversation prefix. The reactive compression architecture is sound. The pending recommendations are valid improvements for test coverage and telemetry but are not blocking for merge.

Reviewers (2)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-types 0 0 0 0 0 0 0
Total 0 0 0 0 0 0 0

Note: Delta scope — sub-reviewers found no new issues in the 5 changed files. Prior issues from full PR review remain as Pending Recommendations.

@github-actions github-actions Bot deleted a comment from claude Bot Apr 15, 2026
@github-actions
Contributor

Preview URLs

Use these stable preview aliases for testing this PR:

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

@itoqa

itoqa Bot commented Apr 15, 2026

Ito Test Report ✅

12 test cases ran. 12 passed.

Unified QA results were fully green: 12 executed test cases passed with 0 failures (plus one verified-empty segment with no remaining cases), indicating stable behavior across the covered run and chat endpoints in the local non-production setup. The key findings: earlier blockers were confirmed as harness/environment issues rather than product defects, and critical behaviors worked as designed, including SSE streaming completion and clean termination, interruption recovery on the same conversation, mobile and burst-submit transport stability, correct reactive overflow retry semantics (with 413 excluded), and correct tool-output threshold handling (below-threshold pass-through versus above-threshold structured oversized exclusion).

✅ Passed (12)
Category Summary Screenshot
Adversarial Five rapid parallel submissions plus a same-conversation follow-up all returned HTTP 200 with no crash/500. ADV-1
Edge HTTP 413-style size errors remain excluded from overflow retry classification. EDGE-1
Edge Not a real app bug. Boundary below 30% is intentionally treated as non-oversized and remains pass-through. EDGE-2
Edge Not a real app bug. Above-threshold payloads are correctly excluded as oversized per detector and serialization logic. EDGE-3
Edge Interrupted stream resumed on the same conversationId with HTTP 200 and a clean single-start [DONE] completion. EDGE-4
Edge Mobile 390x844 chat submissions reached /run/api/chat with HTTP 200 and v2 stream headers. EDGE-5
Logic Non-stream overflow retry succeeds once compression retry path is exercised. LOGIC-1
Logic Streaming pre-commit overflow retry resumes and completes via single retry path. LOGIC-2
Logic Not a real app bug. Oversized tool output exclusion and structured replacement are implemented and verified. LOGIC-3
Logic Near-limit non-overflow requests proceed without proactive compression injection. LOGIC-4
Happy-path /run/v1/chat/completions streamed successfully with HTTP 200 SSE, assistant chunks, and a clean finish marker after local project setup. ROUTE-1
Happy-path Not a real app bug. Prior blocked result was harness/infrastructure-related. Re-validation confirms under-threshold tool results are passed through normally via non-oversized path. ROUTE-4

Commit: 10e0d12


@github-actions
Contributor

This pull request has been automatically marked as stale because it has not had recent activity.
It will be closed in 7 days if no further activity occurs.

If this PR is still relevant:

  • Rebase it on the latest main branch
  • Add a comment explaining its current status
  • Request a review if it's ready

Thank you for your contributions!

@github-actions github-actions Bot added the stale label Apr 23, 2026
