Docs simplify auth by nick-inkeep · Pull Request #198 · inkeep/agents

nick-inkeep · 2025-09-14T01:03:03Z

No description provided.

…plify-auth

vercel · 2025-09-14T01:03:07Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
agents-docs	Building	Preview	Comment	Sep 14, 2025 1:03am
agents-manage-api	Building	Preview	Comment	Sep 14, 2025 1:03am
agents-manage-ui	Building	Preview	Comment	Sep 14, 2025 1:03am
agents-run-api	Building	Preview	Comment	Sep 14, 2025 1:03am

claude · 2025-09-14T01:03:17Z

Claude finished @nick-inkeep's task —— View job

Code Review: PR #198 - Docs Simplify Auth

I'll provide a comprehensive TypeScript Staff Engineer and System Architect review of this authentication simplification.

Todo List:

Read repository guidelines (CLAUDE.md)
Examine git history and commits to understand actual changes
Focus on authentication documentation changes specifically
Analyze system design implications of auth simplification
Evaluate customer/user experience impact
Provide final comprehensive review feedback

Analysis Summary

This PR appears to be the initial commit of the Inkeep Agent Framework with documentation focused on "simplifying auth". The codebase includes sophisticated multi-API authentication patterns across two main services (Run API and Manage API) with comprehensive documentation.

Comprehensive Code Review

🏗️ System Architecture: Authentication Design

Strengths:

Multi-layered auth strategy: Elegant fallback pattern from bypass secrets → API keys → development mode
Environment-aware authentication: Clean separation between development, staging, and production modes
Security-first approach: Proper scrypt hashing for API keys, secure token validation, and request scoping

Architectural Concerns:

1. Authentication Complexity vs "Simplification" ⚠️ High Priority

The title claims "Docs Simplify Auth" but the implementation reveals significant complexity:

Run API has 3 auth modes:

Development (headers only)
Standard (API keys)
Bypass (secret + headers OR API keys)

This creates decision paralysis for developers. Consider:

// Current: Complex decision tree
if (ENVIRONMENT === 'development') { /* headers */ }
else if (BYPASS_SECRET && token === BYPASS_SECRET) { /* bypass + headers */ }
else if (BYPASS_SECRET && token !== BYPASS_SECRET) { /* validate API key */ }
else { /* standard API key */ }

Recommendation: Consolidate to 2 modes:

Development: Header-based (current)
Production: API key only (remove bypass secret complexity)

2. Bypass Secret Anti-Pattern ⚠️ High Priority

The bypass secret creates several architectural concerns:

// agents-run-api/src/middleware/api-key-auth.ts:90-131
if (env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  if (apiKey === env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
    // Bypass with headers
  } else {
    // Fall back to API key validation
  }
}

Problems:

Security: Single secret grants access to ALL graphs across ALL tenants
Auditability: Bypass authentication cannot be traced to specific API keys
Inconsistency: Run API requires headers with bypass, Manage API doesn't
Operational Risk: Secret compromise exposes entire system

Recommendation: Remove bypass secret pattern entirely. Use service-to-service API keys with broader scopes if needed.

3. Authentication State Management ⚠️ Medium Priority

The execution context creation is inconsistent:

// Development fallback has different defaults than production
executionContext = createExecutionContext({
  apiKey: 'development',  // Hardcoded magic string
  tenantId: tenantId || 'test-tenant',  // Could be undefined
  projectId: projectId || 'test-project',
  // ...
});

Issues:

Magic strings for development keys
Inconsistent validation between dev/prod contexts
No validation of required headers in development mode

📚 Documentation Review

Strengths:

Comprehensive coverage: All auth modes documented with examples
Clear structure: Tabbed interface separating development vs production
Security warnings: Appropriate warnings about development mode

Areas for Improvement:

1. User Journey Optimization ⚠️ Medium Priority

The documentation presents 3 auth modes for Run API but doesn't guide users on when to use each:

# Current docs show all options equally
### Development Mode
### Standard Mode  
### Bypass Mode

Better approach:

### Quick Start (Recommended)
Use API keys for both development and production

### Advanced: Development Headers
For local development without API key setup

### Enterprise: Service Authentication
For service-to-service communication

2. Security Guidance Missing ⚠️ Medium Priority

Documentation lacks:

API key rotation procedures
Scope limitation best practices
Incident response (key compromise)
Network security requirements (HTTPS enforcement)

🔒 Security Architecture Review

Strengths:

Proper hashing: Using scrypt for API key storage (/home/runner/work/agents/agents/agents-docs/content/docs/api-keys.mdx:14)
Header validation: Checking for proper Bearer token format
Scope enforcement: API keys tied to specific tenant/project/graph

Critical Security Concerns:

1. Development Mode Security ⚠️ High Priority

// agents-run-api/src/middleware/api-key-auth.ts:36-79
if (process.env.ENVIRONMENT === 'development' || process.env.ENVIRONMENT === 'test') {
  // Bypasses ALL authentication
  executionContext = createExecutionContext({...});
}

Problems:

No tenant isolation in development
Default credentials could leak to production
Unlimited access to all graphs/projects

Recommendation:

Require explicit opt-in for auth bypass: DEVELOPMENT_BYPASS_AUTH=true
Add tenant validation even in development mode
Log all bypass authentications for audit trails

2. Error Information Disclosure ⚠️ Low Priority

Authentication errors could provide information to attackers:

throw new HTTPException(401, {
  message: 'Missing or invalid tenant, project, or graph ID',  // Reveals expected headers
});

Consider more generic error messages for production.

🚀 Performance & Scalability

Concerns:

1. Database Lookups per Request ⚠️ Medium Priority

Every authenticated request hits the database for API key validation:

const apiKeyRecord = await validateAndGetApiKey(apiKey, dbClient);

Recommendations:

JWT tokens with embedded scope (self-validating)
Redis cache for API key lookups
API key pre-validation to avoid database hits for malformed keys

2. Async Context Creation ⚠️ Low Priority

Multiple async operations in middleware could impact request latency. Consider batching context operations.

🎯 Customer Experience Impact

Positive:

Clear separation of concerns between Run API and Manage API
Flexible authentication for different deployment scenarios
Good developer experience with development mode

Areas for Improvement:

1. API Key Management UX ⚠️ Medium Priority

Documentation mentions "Visual Builder Dashboard" but doesn't provide:

Screenshots of the UI flow
API key format examples (sk_live_... pattern)
Troubleshooting guide for common auth errors

2. Migration Path Unclear ⚠️ Low Priority

No guidance for:

Migrating from development to production auth
Updating from old auth patterns (if applicable)
Batch API key creation/management

🛠️ Implementation Quality

Strengths:

Type safety: Proper TypeScript interfaces for auth context
Error handling: Appropriate HTTP exceptions and logging
Middleware pattern: Clean separation of concerns

Improvement Opportunities:

1. Code Duplication ⚠️ Low Priority

Auth logic duplicated between Manage API and Run API with slight variations. Consider shared auth utilities.

2. Testing Coverage ⚠️ Medium Priority

Need comprehensive tests for:

All authentication mode combinations
Edge cases (expired keys, malformed headers)
Security boundary testing

Final Recommendations

High Priority Actions:

Simplify auth modes - Remove bypass secret pattern, consolidate to dev/prod
Strengthen development mode - Add tenant validation and audit logging
Improve documentation flow - Guide users to recommended patterns first

Medium Priority Actions:

Add performance optimizations - API key caching, JWT consideration
Enhance security documentation - Rotation, incident response procedures
Implement comprehensive testing - All auth scenarios and edge cases

Architecture Decision Records Needed:

Why bypass secrets over service API keys?
Authentication mode selection criteria
API key scope design principles

The authentication system shows solid engineering fundamentals but could benefit from simplification to match the PR title's intent. The multi-mode approach, while flexible, may create more complexity than value for most users.

…e + preview paths (inkeep#198) (inkeep#3172) * fix(ci): close remaining silent strandings in release cascade + bridge + preview paths Bundle of 10 code + 1 doc changes, each closing a distinct silent-failure mode where CI would pass green (or sit red invisibly) while production work stranded. Complement to inkeep#194's release-cascade hardening — each fix here addresses a gap that pass missed. Release cascade (5) - release-handler.yml: pin GitHub Release to client_payload.commit_sha, not origin/main. Auto-format landing between sync and handler used to create a SHA drift where Vercel deployed a newer commit than npm got. - release-handler.yml: 3-attempt bash retry around gh release view/edit/ create. Mirrors the dispatch-side retry from inkeep#174 — one transient 5xx used to leave npm published with no Release (so no Vercel deploy) and no alert. - release-handler.yml: notify-on-failure job. If the success job itself errors (app-token, checkout, or handler logic), the dispatch was already delivered and public-side retries don't re-fire. Without this notifier, npm publishes but the cascade stops silently. - public-mirror-sync.yml: 3-attempt retry on sync PR approval. One dropped GH API call used to leave the sync PR open with no approval; now exhaustion exits non-zero so the workflow turns red and the janitor's sync-PR sweep (below) catches long-open PRs as a safety net. - scripts/check-monorepo-traps.mjs: flip check:override-masks-bump from warn to hard-fail. Current tree is clean of masks, so flipping is safe and prevents future PR inkeep#170-class silent divergence. Preview envs (2) - deploy-vercel-preview.sh: validate captured deployment URL matches *.vercel.app pattern before alias-set. The log-grep fallback could silently capture a docs/inspection URL and alias api.preview.inkeep.com to the wrong place. - public-agents-preview-environments.yml: add teardown-failure-notify job. Railway and Vercel teardowns run in parallel by design; this closes the observability gap when either silently fails — Railway continues billing, Vercel env vars pollute, and the next 6h janitor sweep re-attempts cleanup. Dependency sync + bridge (2) - dependabot-sync-root-lockfile.yml: on push failure, full-reset onto origin/$HEAD_REF and re-regenerate both lockfiles (root + public/agents) from the new baseline. Previous simple rebase would replay our stale lockfile on top of a different package.json after a Dependabot force-push, yielding inconsistent committed state that either broke --frozen-lockfile or silently shipped drift into main. - bridge-public-pr-to-monorepo.mjs (both copies): truncate bridged PR body to fit GitHub's 65,536-char limit, with a link to the original PR. Was failing 422 "body is too long" on Dependabot mega-bumps, stranding those PRs outside the agents-private review surface. Janitor (1) - public-agents-preview-janitor.yml: new sweep-stale-sync-prs job. Every 6h, identify open copybara/sync PRs on inkeep/agents >4h old, open (idempotent) tracking issue on agents-private, re-dispatch public-mirror-sync.yml. Does NOT auto-close the PR — destructive actions stay human-driven. Docs (1) - CI_RUNBOOK.md: partial-npm-publish recovery path. Documents what to do when pnpm changeset publish fails on package N of 10 (some live, some not): confirm via npm view, re-run the failed job (publish is idempotent), verify all 10 present before unblocking cascade. Each change verified in isolation with pnpm check:structural; YAML validated with safe_load; bridge truncation logic tested at boundary. * fix(ci): add checkout step and remove heredoc indentation in release-handler notify Addresses review feedback on PR inkeep#198: - notify-success-failure now does a Checkout before gh calls, matching the failure job pattern (without it, gh CLI lacks git context and issue-create would silently fail with 'fatal: not a git repository'). - Body is now built with echo into a temp file instead of a heredoc. The original 10-space indent inside run: | would have been preserved literally by bash, making the tracking issue body unreadable. * fix(ci): address review feedback on release-handler retry + docs drift Review feedback from inkeep#198 (Claude + Pullfrog): release-handler.yml: - Capture stderr in gh_with_retry so on-call sees the actual API error (rate-limit, 5xx, auth) instead of a generic 'command failed' line. Was making incident triage 15-30m slower. (Claude Major) - Stop wrapping 'gh release view' in gh_with_retry. It returns non-zero for the legitimate 'release doesn't exist' case (the normal new-release happy path); retrying burned 15s of backoff before falling through to create. Only mutating calls (edit/create) need retry. (Pullfrog + Claude) - notify-success-failure now fails loudly (exit 1) if gh issue create fails, matching the failure job's pattern. Silent swallow would mean the success handler failed AND the notifier failed with no visible signal beyond a red check — exactly the class this PR closes elsewhere. (Claude Major) Docs drift (Pullfrog): - AGENTS.md: remove '(soft today)' / 'non-blocking today' annotations from override-masks-bump in both the cheatsheet and the trap-list. - AGENTS.md trap inkeep#4: flip 'Currently a warning, not a hard fail' to 'Hard-fail (flipped...)' to match the code. - CI_RUNBOOK.md: rename the override-masks-bump section header from 'warns (non-blocking today)' to 'fails — root override masks a workspace bump'. Preview janitor (Claude Consider): - Document that sweep-stale-sync-prs runs unconditionally regardless of PREVIEW_ENVIRONMENTS_ENABLED. Pr body said so; workflow didn't. Bridge script (Claude Consider): - console.log when PR body is truncated, with original + new sizes. Helps CI log debugging when a bridged PR looks shorter than expected. Applied to both copies (public/agents + public/agents-optional-local-dev). Intentionally not addressed: - dependabot-sync $MSG staleness after reset: Pullfrog flagged as intentional (message describes the operation, not the specific baseline), Claude flagged as cosmetic minor. Siding with Pullfrog. - Preview teardown notify swallowed error: lower severity given the janitor re-attempts cleanup every 6h; accepting current pattern. * fix(ci): pre-fetch PR base into agents-private before git apply --3way Closes the bridge-failure class seen on inkeep#3171 (and eventually on any bridged PR with a conflicting hunk). Root cause: `git apply --index --3way` resolves the patch's `index <old>..<new>` blob SHAs against the local repo's object store. The SHAs come from inkeep/agents' object graph; agents-private has never fetched from inkeep/agents, so those SHAs aren't resolvable. Clean patches work (no --3way fallback needed); any conflict triggers --3way, which fails with 'repository lacks the necessary blob to perform 3-way merge' — bridge stops with no useful diagnostic, PR strands outside the canonical review surface. Fix: Before calling git apply, shallow-fetch the PR's base commit from the public repo into agents-private's object store. GitHub permits fetching by SHA when the SHA is reachable from a ref (PR base is on main, always reachable), and shallow=1 keeps it cheap. After fetch, --3way can find the `<old>` blob and resolve conflicts into merge markers instead of a hard error. Why not drop --3way: Dropping --3way makes every conflicting PR fail hard with no recovery path. Pre-fetching preserves the --3way fallback so conflicting hunks produce visible conflict markers in the bridged PR, which is reviewable rather than opaque. Failure handling: If the fetch itself fails (rate-limit, network flake), we log and proceed. The subsequent git apply will hit the same blob error as before the fix — no regression, just no improvement for that run. Applied to both script copies: public/agents and public/agents-optional-local-dev. Bundled into PR inkeep#198 since it's the same release/sync hardening scope. * Revert "fix(ci): pre-fetch PR base into agents-private before git apply --3way" This reverts commit e7e0640e2345bbd7cdea1185eb3abc666f2c60fa. * fix(ci): switch bridge patch fetch from .patch to .diff media type Root cause of bridge failures on multi-commit PRs (eg inkeep#3171, 9 commits): `gh api .../pulls/N Accept: application/vnd.github.patch` returns a mailbox-format patch series — one patch per PR commit, each with `index <old>..<new>` lines referencing INTERMEDIATE blob SHAs created during the PR's history on inkeep/agents. Those intermediates exist only in inkeep/agents' object store; agents-private has never seen them. The first patch applies against agents-private's file (blob content matches the PR base even if SHAs differ), git's index advances to the intermediate state (only referenced by SHA, never fetched as an object). Any subsequent conflicting hunk triggers `git apply --3way` fallback, which needs the intermediate blob → 'repository lacks the necessary blob' → stranded PR. Single-commit PRs don't hit this (no intermediates). That's why the bridge used to work for most Dependabot bumps but started failing on longer branches like inkeep#3171. Fix: switch `Accept: application/vnd.github.patch` → `Accept: application/vnd.github.diff` in both bridge-public-pr-to- monorepo.mjs copies. .diff returns a single consolidated base-vs-head unified diff with only the PR's base blob SHAs referenced. Those are content-identical to agents-private's copies (Copybara mirrors blobs 1:1 without content rewrites for .ts files), so --3way can resolve them locally from agents-private's own object store. Also updated githubRequest's text-response check to include .diff alongside .patch (both are text, not JSON). Previous pre-fetch commit (`e7e0640e2`) was reverted in a prior commit — it was addressing the wrong layer (only PR-base blob, not intermediates). This commit supersedes that approach entirely. * chore(ci): biome format fix on bridge script console.log GitOrigin-RevId: a26d5dac74ed9938fc512703af21a25d82616504 Co-authored-by: Varun Varahabhotla <[email protected]>

nick-inkeep added 4 commits September 13, 2025 17:30

add relative link

195a113

auth

380fe64

quick start

c419c00

Merge branch 'main' of https://github.com/inkeep/agents into docs-sim…

d8b2adc

…plify-auth

nick-inkeep closed this Sep 14, 2025

vercel Bot had a problem deploying to Preview – agents-manage-ui September 14, 2025 01:03 Failure

nick-inkeep reopened this Sep 14, 2025

nick-inkeep merged commit 3fc8ad8 into main Sep 14, 2025
2 of 8 checks passed

nick-inkeep deleted the docs-simplify-auth branch September 14, 2025 01:03

vercel Bot deployed to Preview – agents-manage-api September 14, 2025 01:03 View deployment

vercel Bot deployed to Preview – agents-run-api September 14, 2025 01:04 View deployment

vercel Bot had a problem deploying to Preview – agents-docs September 14, 2025 01:09 Failure

vnv-varun added a commit that referenced this pull request Apr 22, 2026

chore: retrigger monorepo-pr-bridge after #198 .diff fix landed

fe9c4d9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docs simplify auth#198

Docs simplify auth#198
nick-inkeep merged 4 commits intomainfrom
docs-simplify-auth

nick-inkeep commented Sep 14, 2025

Uh oh!

vercel Bot commented Sep 14, 2025

Uh oh!

claude Bot commented Sep 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nick-inkeep commented Sep 14, 2025

Uh oh!

vercel Bot commented Sep 14, 2025

Uh oh!

claude Bot commented Sep 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: PR #198 - Docs Simplify Auth

Todo List:

Analysis Summary

Comprehensive Code Review

🏗️ System Architecture: Authentication Design

1. Authentication Complexity vs "Simplification" ⚠️ High Priority

2. Bypass Secret Anti-Pattern ⚠️ High Priority

3. Authentication State Management ⚠️ Medium Priority

📚 Documentation Review

Strengths:

Areas for Improvement:

1. User Journey Optimization ⚠️ Medium Priority

2. Security Guidance Missing ⚠️ Medium Priority

🔒 Security Architecture Review

Strengths:

Critical Security Concerns:

1. Development Mode Security ⚠️ High Priority

2. Error Information Disclosure ⚠️ Low Priority

🚀 Performance & Scalability

Concerns:

1. Database Lookups per Request ⚠️ Medium Priority

2. Async Context Creation ⚠️ Low Priority

🎯 Customer Experience Impact

Positive:

Areas for Improvement:

1. API Key Management UX ⚠️ Medium Priority

2. Migration Path Unclear ⚠️ Low Priority

🛠️ Implementation Quality

Strengths:

Improvement Opportunities:

1. Code Duplication ⚠️ Low Priority

2. Testing Coverage ⚠️ Medium Priority

Final Recommendations

High Priority Actions:

Medium Priority Actions:

Architecture Decision Records Needed:

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Sep 14, 2025 •

edited

Loading