Skip to content

Docs simplify auth#198

Merged
nick-inkeep merged 4 commits intomainfrom
docs-simplify-auth
Sep 14, 2025
Merged

Docs simplify auth#198
nick-inkeep merged 4 commits intomainfrom
docs-simplify-auth

Conversation

@nick-inkeep
Copy link
Copy Markdown
Collaborator

No description provided.

@vercel
Copy link
Copy Markdown

vercel Bot commented Sep 14, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
agents-docs Building Building Preview Comment Sep 14, 2025 1:03am
agents-manage-api Building Building Preview Comment Sep 14, 2025 1:03am
agents-manage-ui Building Building Preview Comment Sep 14, 2025 1:03am
agents-run-api Building Building Preview Comment Sep 14, 2025 1:03am

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Sep 14, 2025

Claude finished @nick-inkeep's task —— View job


Code Review: PR #198 - Docs Simplify Auth

I'll provide a comprehensive TypeScript Staff Engineer and System Architect review of this authentication simplification.

Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Examine git history and commits to understand actual changes
  • Focus on authentication documentation changes specifically
  • Analyze system design implications of auth simplification
  • Evaluate customer/user experience impact
  • Provide final comprehensive review feedback

Analysis Summary

This PR appears to be the initial commit of the Inkeep Agent Framework with documentation focused on "simplifying auth". The codebase includes sophisticated multi-API authentication patterns across two main services (Run API and Manage API) with comprehensive documentation.

Comprehensive Code Review

🏗️ System Architecture: Authentication Design

Strengths:

  • Multi-layered auth strategy: Elegant fallback pattern from bypass secrets → API keys → development mode
  • Environment-aware authentication: Clean separation between development, staging, and production modes
  • Security-first approach: Proper scrypt hashing for API keys, secure token validation, and request scoping

Architectural Concerns:

1. Authentication Complexity vs "Simplification" ⚠️ High Priority

The title claims "Docs Simplify Auth" but the implementation reveals significant complexity:

Run API has 3 auth modes:

  • Development (headers only)
  • Standard (API keys)
  • Bypass (secret + headers OR API keys)

This creates decision paralysis for developers. Consider:

// Current: Complex decision tree
if (ENVIRONMENT === 'development') { /* headers */ }
else if (BYPASS_SECRET && token === BYPASS_SECRET) { /* bypass + headers */ }
else if (BYPASS_SECRET && token !== BYPASS_SECRET) { /* validate API key */ }
else { /* standard API key */ }

Recommendation: Consolidate to 2 modes:

  • Development: Header-based (current)
  • Production: API key only (remove bypass secret complexity)

2. Bypass Secret Anti-Pattern ⚠️ High Priority

The bypass secret creates several architectural concerns:

// agents-run-api/src/middleware/api-key-auth.ts:90-131
if (env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
  if (apiKey === env.INKEEP_AGENTS_RUN_API_BYPASS_SECRET) {
    // Bypass with headers
  } else {
    // Fall back to API key validation
  }
}

Problems:

  • Security: Single secret grants access to ALL graphs across ALL tenants
  • Auditability: Bypass authentication cannot be traced to specific API keys
  • Inconsistency: Run API requires headers with bypass, Manage API doesn't
  • Operational Risk: Secret compromise exposes entire system

Recommendation: Remove bypass secret pattern entirely. Use service-to-service API keys with broader scopes if needed.

3. Authentication State Management ⚠️ Medium Priority

The execution context creation is inconsistent:

// Development fallback has different defaults than production
executionContext = createExecutionContext({
  apiKey: 'development',  // Hardcoded magic string
  tenantId: tenantId || 'test-tenant',  // Could be undefined
  projectId: projectId || 'test-project',
  // ...
});

Issues:

  • Magic strings for development keys
  • Inconsistent validation between dev/prod contexts
  • No validation of required headers in development mode

📚 Documentation Review

Strengths:

  • Comprehensive coverage: All auth modes documented with examples
  • Clear structure: Tabbed interface separating development vs production
  • Security warnings: Appropriate warnings about development mode

Areas for Improvement:

1. User Journey Optimization ⚠️ Medium Priority

The documentation presents 3 auth modes for Run API but doesn't guide users on when to use each:

# Current docs show all options equally
### Development Mode
### Standard Mode  
### Bypass Mode

Better approach:

### Quick Start (Recommended)
Use API keys for both development and production

### Advanced: Development Headers
For local development without API key setup

### Enterprise: Service Authentication
For service-to-service communication

2. Security Guidance Missing ⚠️ Medium Priority

Documentation lacks:

  • API key rotation procedures
  • Scope limitation best practices
  • Incident response (key compromise)
  • Network security requirements (HTTPS enforcement)

🔒 Security Architecture Review

Strengths:

  • Proper hashing: Using scrypt for API key storage (/home/runner/work/agents/agents/agents-docs/content/docs/api-keys.mdx:14)
  • Header validation: Checking for proper Bearer token format
  • Scope enforcement: API keys tied to specific tenant/project/graph

Critical Security Concerns:

1. Development Mode Security ⚠️ High Priority

// agents-run-api/src/middleware/api-key-auth.ts:36-79
if (process.env.ENVIRONMENT === 'development' || process.env.ENVIRONMENT === 'test') {
  // Bypasses ALL authentication
  executionContext = createExecutionContext({...});
}

Problems:

  • No tenant isolation in development
  • Default credentials could leak to production
  • Unlimited access to all graphs/projects

Recommendation:

  • Require explicit opt-in for auth bypass: DEVELOPMENT_BYPASS_AUTH=true
  • Add tenant validation even in development mode
  • Log all bypass authentications for audit trails

2. Error Information Disclosure ⚠️ Low Priority

Authentication errors could provide information to attackers:

throw new HTTPException(401, {
  message: 'Missing or invalid tenant, project, or graph ID',  // Reveals expected headers
});

Consider more generic error messages for production.

🚀 Performance & Scalability

Concerns:

1. Database Lookups per Request ⚠️ Medium Priority

Every authenticated request hits the database for API key validation:

const apiKeyRecord = await validateAndGetApiKey(apiKey, dbClient);

Recommendations:

  • JWT tokens with embedded scope (self-validating)
  • Redis cache for API key lookups
  • API key pre-validation to avoid database hits for malformed keys

2. Async Context Creation ⚠️ Low Priority

Multiple async operations in middleware could impact request latency. Consider batching context operations.

🎯 Customer Experience Impact

Positive:

  • Clear separation of concerns between Run API and Manage API
  • Flexible authentication for different deployment scenarios
  • Good developer experience with development mode

Areas for Improvement:

1. API Key Management UX ⚠️ Medium Priority

Documentation mentions "Visual Builder Dashboard" but doesn't provide:

  • Screenshots of the UI flow
  • API key format examples (sk_live_... pattern)
  • Troubleshooting guide for common auth errors

2. Migration Path Unclear ⚠️ Low Priority

No guidance for:

  • Migrating from development to production auth
  • Updating from old auth patterns (if applicable)
  • Batch API key creation/management

🛠️ Implementation Quality

Strengths:

  • Type safety: Proper TypeScript interfaces for auth context
  • Error handling: Appropriate HTTP exceptions and logging
  • Middleware pattern: Clean separation of concerns

Improvement Opportunities:

1. Code Duplication ⚠️ Low Priority

Auth logic duplicated between Manage API and Run API with slight variations. Consider shared auth utilities.

2. Testing Coverage ⚠️ Medium Priority

Need comprehensive tests for:

  • All authentication mode combinations
  • Edge cases (expired keys, malformed headers)
  • Security boundary testing

Final Recommendations

High Priority Actions:

  1. Simplify auth modes - Remove bypass secret pattern, consolidate to dev/prod
  2. Strengthen development mode - Add tenant validation and audit logging
  3. Improve documentation flow - Guide users to recommended patterns first

Medium Priority Actions:

  1. Add performance optimizations - API key caching, JWT consideration
  2. Enhance security documentation - Rotation, incident response procedures
  3. Implement comprehensive testing - All auth scenarios and edge cases

Architecture Decision Records Needed:

  • Why bypass secrets over service API keys?
  • Authentication mode selection criteria
  • API key scope design principles

The authentication system shows solid engineering fundamentals but could benefit from simplification to match the PR title's intent. The multi-mode approach, while flexible, may create more complexity than value for most users.


@nick-inkeep nick-inkeep reopened this Sep 14, 2025
@nick-inkeep nick-inkeep merged commit 3fc8ad8 into main Sep 14, 2025
2 of 8 checks passed
@nick-inkeep nick-inkeep deleted the docs-simplify-auth branch September 14, 2025 01:03
Zeeeepa pushed a commit to Zeeeepa/inkeep_agents that referenced this pull request Apr 23, 2026
…e + preview paths (inkeep#198) (inkeep#3172)

* fix(ci): close remaining silent strandings in release cascade + bridge + preview paths

Bundle of 10 code + 1 doc changes, each closing a distinct silent-failure
mode where CI would pass green (or sit red invisibly) while production
work stranded. Complement to inkeep#194's release-cascade hardening — each fix
here addresses a gap that pass missed.

Release cascade (5)
- release-handler.yml: pin GitHub Release to client_payload.commit_sha,
  not origin/main. Auto-format landing between sync and handler used to
  create a SHA drift where Vercel deployed a newer commit than npm got.
- release-handler.yml: 3-attempt bash retry around gh release view/edit/
  create. Mirrors the dispatch-side retry from inkeep#174 — one transient 5xx
  used to leave npm published with no Release (so no Vercel deploy) and
  no alert.
- release-handler.yml: notify-on-failure job. If the success job itself
  errors (app-token, checkout, or handler logic), the dispatch was
  already delivered and public-side retries don't re-fire. Without this
  notifier, npm publishes but the cascade stops silently.
- public-mirror-sync.yml: 3-attempt retry on sync PR approval. One
  dropped GH API call used to leave the sync PR open with no approval;
  now exhaustion exits non-zero so the workflow turns red and the
  janitor's sync-PR sweep (below) catches long-open PRs as a safety net.
- scripts/check-monorepo-traps.mjs: flip check:override-masks-bump from
  warn to hard-fail. Current tree is clean of masks, so flipping is safe
  and prevents future PR inkeep#170-class silent divergence.

Preview envs (2)
- deploy-vercel-preview.sh: validate captured deployment URL matches
  *.vercel.app pattern before alias-set. The log-grep fallback could
  silently capture a docs/inspection URL and alias api.preview.inkeep.com
  to the wrong place.
- public-agents-preview-environments.yml: add teardown-failure-notify
  job. Railway and Vercel teardowns run in parallel by design; this
  closes the observability gap when either silently fails — Railway
  continues billing, Vercel env vars pollute, and the next 6h janitor
  sweep re-attempts cleanup.

Dependency sync + bridge (2)
- dependabot-sync-root-lockfile.yml: on push failure, full-reset onto
  origin/$HEAD_REF and re-regenerate both lockfiles (root + public/agents)
  from the new baseline. Previous simple rebase would replay our stale
  lockfile on top of a different package.json after a Dependabot
  force-push, yielding inconsistent committed state that either broke
  --frozen-lockfile or silently shipped drift into main.
- bridge-public-pr-to-monorepo.mjs (both copies): truncate bridged PR
  body to fit GitHub's 65,536-char limit, with a link to the original PR.
  Was failing 422 "body is too long" on Dependabot mega-bumps, stranding
  those PRs outside the agents-private review surface.

Janitor (1)
- public-agents-preview-janitor.yml: new sweep-stale-sync-prs job. Every
  6h, identify open copybara/sync PRs on inkeep/agents >4h old, open
  (idempotent) tracking issue on agents-private, re-dispatch
  public-mirror-sync.yml. Does NOT auto-close the PR — destructive
  actions stay human-driven.

Docs (1)
- CI_RUNBOOK.md: partial-npm-publish recovery path. Documents what to
  do when pnpm changeset publish fails on package N of 10 (some live,
  some not): confirm via npm view, re-run the failed job (publish is
  idempotent), verify all 10 present before unblocking cascade.

Each change verified in isolation with pnpm check:structural; YAML
validated with safe_load; bridge truncation logic tested at boundary.

* fix(ci): add checkout step and remove heredoc indentation in release-handler notify

Addresses review feedback on PR inkeep#198:
- notify-success-failure now does a Checkout before gh calls, matching
  the failure job pattern (without it, gh CLI lacks git context and
  issue-create would silently fail with 'fatal: not a git repository').
- Body is now built with echo into a temp file instead of a heredoc.
  The original 10-space indent inside run: | would have been preserved
  literally by bash, making the tracking issue body unreadable.

* fix(ci): address review feedback on release-handler retry + docs drift

Review feedback from inkeep#198 (Claude + Pullfrog):

release-handler.yml:
- Capture stderr in gh_with_retry so on-call sees the actual API error
  (rate-limit, 5xx, auth) instead of a generic 'command failed' line.
  Was making incident triage 15-30m slower. (Claude Major)
- Stop wrapping 'gh release view' in gh_with_retry. It returns non-zero
  for the legitimate 'release doesn't exist' case (the normal new-release
  happy path); retrying burned 15s of backoff before falling through to
  create. Only mutating calls (edit/create) need retry. (Pullfrog + Claude)
- notify-success-failure now fails loudly (exit 1) if gh issue create
  fails, matching the failure job's pattern. Silent swallow would mean
  the success handler failed AND the notifier failed with no visible
  signal beyond a red check — exactly the class this PR closes elsewhere.
  (Claude Major)

Docs drift (Pullfrog):
- AGENTS.md: remove '(soft today)' / 'non-blocking today' annotations
  from override-masks-bump in both the cheatsheet and the trap-list.
- AGENTS.md trap inkeep#4: flip 'Currently a warning, not a hard fail' to
  'Hard-fail (flipped...)' to match the code.
- CI_RUNBOOK.md: rename the override-masks-bump section header from
  'warns (non-blocking today)' to 'fails — root override masks a
  workspace bump'.

Preview janitor (Claude Consider):
- Document that sweep-stale-sync-prs runs unconditionally regardless of
  PREVIEW_ENVIRONMENTS_ENABLED. Pr body said so; workflow didn't.

Bridge script (Claude Consider):
- console.log when PR body is truncated, with original + new sizes.
  Helps CI log debugging when a bridged PR looks shorter than expected.
  Applied to both copies (public/agents + public/agents-optional-local-dev).

Intentionally not addressed:
- dependabot-sync $MSG staleness after reset: Pullfrog flagged as
  intentional (message describes the operation, not the specific
  baseline), Claude flagged as cosmetic minor. Siding with Pullfrog.
- Preview teardown notify swallowed error: lower severity given the
  janitor re-attempts cleanup every 6h; accepting current pattern.

* fix(ci): pre-fetch PR base into agents-private before git apply --3way

Closes the bridge-failure class seen on inkeep#3171 (and
eventually on any bridged PR with a conflicting hunk).

Root cause:
`git apply --index --3way` resolves the patch's `index <old>..<new>`
blob SHAs against the local repo's object store. The SHAs come from
inkeep/agents' object graph; agents-private has never fetched from
inkeep/agents, so those SHAs aren't resolvable. Clean patches work
(no --3way fallback needed); any conflict triggers --3way, which
fails with 'repository lacks the necessary blob to perform 3-way
merge' — bridge stops with no useful diagnostic, PR strands outside
the canonical review surface.

Fix:
Before calling git apply, shallow-fetch the PR's base commit from
the public repo into agents-private's object store. GitHub permits
fetching by SHA when the SHA is reachable from a ref (PR base is on
main, always reachable), and shallow=1 keeps it cheap. After fetch,
--3way can find the `<old>` blob and resolve conflicts into merge
markers instead of a hard error.

Why not drop --3way:
Dropping --3way makes every conflicting PR fail hard with no
recovery path. Pre-fetching preserves the --3way fallback so
conflicting hunks produce visible conflict markers in the bridged
PR, which is reviewable rather than opaque.

Failure handling:
If the fetch itself fails (rate-limit, network flake), we log and
proceed. The subsequent git apply will hit the same blob error as
before the fix — no regression, just no improvement for that run.

Applied to both script copies: public/agents and
public/agents-optional-local-dev. Bundled into PR inkeep#198 since it's
the same release/sync hardening scope.

* Revert "fix(ci): pre-fetch PR base into agents-private before git apply --3way"

This reverts commit e7e0640e2345bbd7cdea1185eb3abc666f2c60fa.

* fix(ci): switch bridge patch fetch from .patch to .diff media type

Root cause of bridge failures on multi-commit PRs (eg inkeep#3171,
9 commits):

`gh api .../pulls/N Accept: application/vnd.github.patch` returns a
mailbox-format patch series — one patch per PR commit, each with
`index <old>..<new>` lines referencing INTERMEDIATE blob SHAs created
during the PR's history on inkeep/agents. Those intermediates exist
only in inkeep/agents' object store; agents-private has never seen
them. The first patch applies against agents-private's file (blob
content matches the PR base even if SHAs differ), git's index advances
to the intermediate state (only referenced by SHA, never fetched as
an object). Any subsequent conflicting hunk triggers `git apply
--3way` fallback, which needs the intermediate blob → 'repository
lacks the necessary blob' → stranded PR.

Single-commit PRs don't hit this (no intermediates). That's why the
bridge used to work for most Dependabot bumps but started failing on
longer branches like inkeep#3171.

Fix: switch `Accept: application/vnd.github.patch` →
`Accept: application/vnd.github.diff` in both bridge-public-pr-to-
monorepo.mjs copies. .diff returns a single consolidated base-vs-head
unified diff with only the PR's base blob SHAs referenced. Those are
content-identical to agents-private's copies (Copybara mirrors blobs
1:1 without content rewrites for .ts files), so --3way can resolve
them locally from agents-private's own object store.

Also updated githubRequest's text-response check to include .diff
alongside .patch (both are text, not JSON).

Previous pre-fetch commit (`e7e0640e2`) was reverted in a prior commit
— it was addressing the wrong layer (only PR-base blob, not
intermediates). This commit supersedes that approach entirely.

* chore(ci): biome format fix on bridge script console.log

GitOrigin-RevId: a26d5dac74ed9938fc512703af21a25d82616504

Co-authored-by: Varun Varahabhotla <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant