BhumikaSaini/KOBy

KOBy

Check out a quick demo video here

KOBy is a multi-agent AI library that diagnoses and patches issues in distributed systems (Kubernetes cluster issues supported as of this release). An orchestrator agent breaks the problem down across a diagnosis team and a patching team, with a human approval gate before anything gets written to the cluster.


Contents

  1. Architecture overview
  2. Interaction flow
  3. Tech stack
  4. Engineering highlights
  5. Setup — Docker Compose
  6. Setup — local dev
  7. Demo walkthrough

Architecture Overview

System Architecture (diagram)

Agent hierarchy

Orchestrator  (L0 — top-level supervisor)
│
├──▶ Diagnosis Supervisor  (L1 — troubleshooting team supervisor)
│         ├──▶ Cluster Inspector    (L2 — ReAct)
│         ├──▶ RAG Agent            (L2 — ReAct)
│         └──▶ Web Search Agent     (L2 — ReAct)
│
└──▶ Patch Supervisor  (L1 — patching team supervisor)
          ├──▶ Patch Executor       (L2 — ReAct)
          └──▶ Patch Validator      (L2 — ReAct)

8 agents, 20 LangGraph nodes (15 LLM, 5 deterministic)

Progressive task decomposition

Each level only knows about its direct reports. The orchestrator has no knowledge of Kubernetes tooling or RAG internals — it just creates a TODO list and routes to teams. LangGraph subgraph boundaries enforce this at the framework level.

L0 Orchestrator
  "Something is wrong with the cluster" → create TODO, route to teams

  L1 Diagnosis Supervisor
    "Investigate these symptoms" → decide inspection strategy, synthesize

    L2 Cluster Inspector
      "Gather raw cluster data for these symptoms" → MCP read tool calls

    L2 RAG Agent
      "Find Knowledge Base articles for these symptoms" → Pinecone query

    L2 Web Search Agent
      "Supplement Knowledge Base research with web results" → Tavily query

  L1 Patch Supervisor
    "Fix these diagnosed issues" → produce PatchPlan, Human-in-the-Loop gate

    L2 Patch Executor
      "Execute approved patch plan" → MCP write tool call

    L2 Patch Validator
      "Confirm symptoms resolved" → MCP read tool calls, reason
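The L0 contract above can be sketched in a few lines. This is an illustrative, dependency-free sketch of the "create TODO, route to teams" step — `TodoItem` mirrors the name used later in this README, but the fields and `orchestrate` function are assumptions, not the project's actual API:

```python
from dataclasses import dataclass

@dataclass
class TodoItem:
    description: str
    team: str          # "diagnosis" or "patch" -- the only routing detail L0 knows
    done: bool = False

def orchestrate(symptoms: str) -> list[TodoItem]:
    """L0 creates a TODO list and routes to teams; it never sees
    kubectl/MCP/RAG internals, only team names."""
    return [
        TodoItem(f"Investigate: {symptoms}", team="diagnosis"),
        TodoItem("Fix diagnosed issues", team="patch"),
    ]

todos = orchestrate("Pod in CrashLoopBackOff")
```

The point is what is *absent*: nothing at this level mentions pods, Pinecone, or MCP tools.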

Interaction Flow

User                 Frontend           FastAPI           LangGraph           MCP
 │                      │                  │                  │                │
 │  enter symptoms       │                  │                  │                │
 │  select scenario      │                  │                  │                │
 │  click Run ──────────▶│                  │                  │                │
 │                       │ POST /api/run ──▶│                  │                │
 │                       │                  │ graph.astream_   │                │
 │                       │                  │ events() ───────▶│                │
 │                       │◀── SSE stream ───│                  │                │
 │                       │                  │           AGENT_START events      │
 │                       │ (timeline fills) │                  │                │
 │                       │                  │              cluster_inspector    │
 │                       │                  │              calls MCP tools ────▶│
 │                       │◀── TOOL_CALL ────│◀─────────────────│◀── JSON state │
 │                       │◀── TOOL_RESULT ──│                  │                │
 │                       │                  │           rag_agent queries       │
 │                       │                  │           Pinecone ──────────────▶ Pinecone
 │                       │◀── AGENT_THINKING│                  │                │
 │                       │  (streaming tok) │           diagnosis synthesized   │
 │                       │                  │           patch plan produced     │
 │                       │◀── HITL_REQUIRED─│ ◀── interrupt() ─│                │
 │◀─── approval card ────│                  │                  │                │
 │                       │                  │              (graph paused)       │
 │  review patch plan    │                  │                  │                │
 │  click Approve ───────▶                  │                  │                │
 │                       │ POST /api/resume▶│                  │                │
 │                       │                  │ graph.update_    │                │
 │                       │                  │ state() ────────▶│                │
 │                       │                  │ astream_events() │                │
 │                       │◀── SSE stream ───│                  │                │
 │                       │                  │              patch_executor      │
 │                       │                  │              calls write tools ──▶│
 │                       │◀── TOOL_CALL ────│◀─────────────────│◀── mutation   │
 │                       │◀── TOOL_RESULT ──│                  │                │
 │                       │                  │              patch_validator      │
 │                       │                  │              reads cluster ──────▶│
 │                       │◀── AGENT_COMPLETE│                  │                │
 │◀─── run complete ─────│                  │                  │                │
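The SSE legs of the diagram all use the same wire format. A minimal sketch, assuming a dict-shaped event (the real `StatusEvent` schema lives in the project's Pydantic models; the field names here are illustrative):

```python
import json

def format_sse(event: dict) -> str:
    """Serialize one status event as a Server-Sent Events frame:
    a 'data:' line carrying JSON, terminated by a blank line."""
    return f"data: {json.dumps(event)}\n\n"

frame = format_sse({"type": "AGENT_START", "agent": "cluster_inspector"})
```

The frontend's SSE client parses each `data:` frame back into a typed event and appends it to the timeline.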

Tech Stack

| Concern | Choice | Notes |
| --- | --- | --- |
| LLM | Google Gemini Flash | Google AI Studio free tier |
| Embeddings (prod) | text-embedding-004 | 768 dims, Google AI Studio |
| Embeddings (dev) | sentence-transformers/all-MiniLM-L6-v2 | Local, offline |
| Vector DB | Pinecone | Starter tier |
| Agent orchestration | LangGraph | |
| Agent framework | LangChain + LangGraph | ReAct prebuilts, tool wrappers, prompt templates |
| HITL mechanism | LangGraph interrupt() + MemorySaver | Graph suspends mid-run; resumes on /api/resume |
| Web search | Tavily API | |
| MCP | Official Anthropic MCP Python SDK | |
| Backend | FastAPI + SSE | StreamingResponse, Server-Sent Events |
| Frontend | Next.js 14 + TypeScript | |
| SSE client | @microsoft/fetch-event-source | |
| Deployment | Docker Compose | Two services: backend, frontend |
| Evals | ragas | |

Engineering Highlights

LangGraph subgraph boundaries as information barriers

The orchestrator shouldn't need to know anything about Kubernetes, RAG, or MCP schemas — that's the whole point of having specialist teams. LangGraph subgraphs make this enforced rather than just a convention.

Each team is a standalone StateGraph with its own typed state schema. The orchestrator passes in a typed input contract (DiagnosisRequest / PatchPlan) and gets back a typed output (DiagnosisResult / ValidationResult). Everything that happens inside — cluster snapshots, RAG confidence scores, intermediate patch steps — is invisible at the orchestrator level.
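The boundary can be sketched without LangGraph at all — the essential idea is that the team is a function from a typed request to a typed result, and internal working state never escapes. A dependency-free sketch (the real code uses StateGraph subgraphs and Pydantic models; names and values below are illustrative):

```python
from dataclasses import dataclass

@dataclass
class DiagnosisRequest:
    symptoms: str

@dataclass
class DiagnosisResult:
    root_cause: str
    confidence: float

def diagnosis_team(req: DiagnosisRequest) -> DiagnosisResult:
    # Internal state (cluster snapshot, RAG scores) stays local to this scope
    # and is invisible to the orchestrator that calls this function.
    cluster_snapshot = {"pod": "CrashLoopBackOff", "restarts": 14}
    rag_confidence = 0.91
    return DiagnosisResult(root_cause="bad resource limits",
                           confidence=rag_confidence)

result = diagnosis_team(DiagnosisRequest(symptoms="CrashLoopBackOff"))
```

LangGraph subgraphs give the same property at the framework level: the parent graph sees only the subgraph's declared input/output schema.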


Two-level quality assurance

The producing agent runs a self_validate LLM call before returning and attaches a QualityAssessment ({passed, confidence, gaps, recommendation}) to its output. The receiving supervisor evaluates this independently — it doesn't just trust the producer's self-score.

If passed=False and retries remain, the orchestrator re-routes to the diagnosis subgraph with the gaps list as retry_context, so the next attempt has specific direction rather than starting blind. After two failed retries, cannot_diagnose surfaces the gaps to the user.
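The gate-plus-retry logic reads naturally as a conditional-edge function. A sketch under stated assumptions — the names mirror this README (`QualityAssessment`, `retry_context`, `cannot_diagnose`), but the 0.7 threshold and the exact signature are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class QualityAssessment:
    passed: bool
    confidence: float
    gaps: list = field(default_factory=list)
    recommendation: str = ""

MAX_RETRIES = 2

def route_after_diagnosis(qa: QualityAssessment, retries: int) -> tuple[str, list]:
    """Return (next_node, retry_context), conditional-edge style.
    The supervisor applies its own threshold; it does not just trust qa.passed."""
    if qa.passed and qa.confidence >= 0.7:
        return "patch_supervisor", []
    if retries < MAX_RETRIES:
        return "diagnosis_subgraph", qa.gaps   # retry with specific direction
    return "cannot_diagnose", qa.gaps          # surface the gaps to the user

node, ctx = route_after_diagnosis(
    QualityAssessment(passed=False, confidence=0.4, gaps=["no pod logs"]),
    retries=0,
)
```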


Typed Pydantic contracts as API between every layer

Every inter-agent handoff uses a Pydantic model: DiagnosisRequest, DiagnosisResult, PatchPlan, PatchStep, ValidationResult, QualityAssessment, TodoItem, StatusEvent. TypeScript types in frontend/src/lib/types.ts mirror each one. EventType is defined once in frontend/src/constants/events.ts — not redefined in types.ts.

Untyped dict passing between agents is the most common source of runtime surprises in multi-agent systems. Explicit contracts surface mismatches at parse time instead of mid-run.
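To show the parse-time failure mode without pulling in Pydantic, here is the same idea with a stdlib dataclass and a `__post_init__` check standing in for Pydantic validation — `PatchStep`'s fields and the allowed-action list are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PatchStep:
    action: str
    target: str

    ALLOWED = ("restart_pod", "apply_patch")  # plain class attr, not a field

    def __post_init__(self):
        # Validation runs at construction time, Pydantic-style.
        if self.action not in self.ALLOWED:
            raise ValueError(f"unknown action: {self.action}")

ok = PatchStep(action="restart_pod", target="payment-service")
try:
    PatchStep(action="delete_namespace", target="prod")  # unknown action
    caught = False
except ValueError:
    caught = True   # the mismatch surfaces here, not mid-run on the cluster
```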

Manual supervisor routing instead of langgraph-supervisor

The langgraph-supervisor prebuilt was excluded deliberately. The routing logic in check_diagnosis_quality, hitl_gate, and check_rag_confidence contains real business logic — retry thresholds, quality gates, HITL branching — that needs to be readable and testable. Burying it in a prebuilt would make it harder to audit and harder to change. Manual conditional edges also make the retry loop and HITL interrupt easy to trace without stepping through library internals.
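What that buys in practice: a routing function like `check_rag_confidence` is a few lines of plain Python that can be unit-tested without a graph. A sketch — the 0.7 threshold, state keys, and node names here are illustrative, not the project's actual values:

```python
CONFIDENCE_THRESHOLD = 0.7

def check_rag_confidence(state: dict) -> str:
    """Return the next node's name, LangGraph conditional-edge style."""
    if state.get("partial_result"):          # ReAct run hit its hard cap
        return "web_search_agent"            # always fall back
    if state.get("rag_confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return "web_search_agent"
    return "diagnosis_synthesis"

low = check_rag_confidence({"rag_confidence": 0.4})
high = check_rag_confidence({"rag_confidence": 0.9})
```

Wired in via `add_conditional_edges`, the same function becomes the graph's routing logic, so the test and the production path are identical.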

LangGraph subgraphs over a flat graph

A flat 20-node graph would give the orchestrator implicit visibility into cluster inspection strategy, RAG confidence thresholds, and patch execution mechanics. Subgraph boundaries enforce the team abstraction at the framework level rather than relying on convention.

Batched SSE token flushes on the frontend

At 10–30 tokens per second, naive 1:1 setState per token produces hundreds of re-renders per second and visible jank. The 50ms flush window collapses this to ≤20 re-renders per second while keeping the streaming-text effect smooth.
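The flush-window logic is language-agnostic; the sketch below shows it in Python for brevity (the real implementation is in the TypeScript frontend, and the class/callback names here are illustrative):

```python
class TokenBatcher:
    """Buffer streamed tokens; hand them to `render` at most once per window."""

    def __init__(self, render, window_ms: float = 50.0):
        self.render = render          # e.g. a setState-like callback
        self.window_ms = window_ms
        self.buffer = []
        self.last_flush = 0.0

    def on_token(self, token: str, now_ms: float):
        self.buffer.append(token)
        if now_ms - self.last_flush >= self.window_ms:
            self.render("".join(self.buffer))
            self.buffer.clear()
            self.last_flush = now_ms

renders = []
b = TokenBatcher(renders.append)
for i, tok in enumerate(["a", "b", "c", "d"]):
    b.on_token(tok, now_ms=i * 20.0)   # a token every 20 ms
```

A production version also flushes on a trailing timer so the last partial buffer is not stranded when the stream pauses.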


ReAct loop hard caps and partial-result handling

Every ReAct agent has a recursion_limit in its LangGraph config: RAG Agent and Web Search Agent cap at 5 iterations; Cluster Inspector at 6; Patch Executor at 8; Patch Validator at 5. When the cap is hit, a forced-exit node returns the best result so far with partial_result=True.

Supervisors check this flag before routing — a partial RAG result triggers the web search fallback regardless of the confidence threshold, since partial usually means the Pinecone results weren't enough.
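The forced-exit behaviour can be sketched as a capped loop, where `step` stands in for one think/act/observe cycle — a minimal illustration of the pattern, not the LangGraph internals:

```python
def react_loop(step, recursion_limit: int) -> dict:
    """Run think/act/observe cycles up to recursion_limit; if the agent never
    declares itself done, return the best-so-far result flagged as partial."""
    best = None
    for _ in range(recursion_limit):
        best, done = step(best)
        if done:
            return {"result": best, "partial_result": False}
    # Forced exit: cap reached before the agent decided it was done.
    return {"result": best, "partial_result": True}

# A toy agent that would need 10 iterations but is capped at 5:
out = react_loop(lambda prev: ((prev or 0) + 1, (prev or 0) + 1 >= 10),
                 recursion_limit=5)
```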


Setup — Docker Compose

Prerequisites

  • Docker Desktop
  • API keys (Google AI Studio, Pinecone, Tavily; see .env.example)

Steps

git clone <repo-url> koby && cd koby

cp .env.example .env
# Fill in your API keys

# Ingest the knowledge base into Pinecone (one-time setup)
docker compose run --rm backend python -m scripts.ingest_knowledge_base

docker compose up --build

# Frontend:  http://localhost:3000
# Backend:   http://localhost:8000
# API docs:  http://localhost:8000/docs

Setup — Local Dev

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Pinecone index created (768 dims, cosine metric, name matching PINECONE_INDEX_NAME in .env)

Backend

# From project root
python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

cp .env.example .env
# Fill in your API keys

# Ingest knowledge base (one-time)
python -m scripts.ingest_knowledge_base

# Start the backend
uvicorn backend.main:app --reload --port 8000

Frontend

cd frontend
npm install
npm run dev

Demo Walkthrough

Scenario: CrashLoopBackOff

  1. Open http://localhost:3000.
  2. In the Scenario dropdown, select crashloopbackoff.
  3. In the Symptoms textarea, enter:
    Pod payment-service-7d9f8b-xkz2p is in CrashLoopBackOff.
    It has restarted 14 times in the last 30 minutes.
    
  4. Click Run.

What to watch in the timeline:

| Phase | Events |
| --- | --- |
| Orchestrator | AGENT_START → creates TODO list → routes to diagnosis team |
| Cluster Inspector | AGENT_THINKING (reasoning about which tools to call) → TOOL_CALL + TOOL_RESULT per MCP read (get_pod_status, get_pod_logs, get_events) |
| Diagnosis Supervisor | AGENT_THINKING (streaming) → planning RAG queries |
| RAG Agent | TOOL_CALL pinecone_query → TOOL_RESULT KB chunks returned |
| Diagnosis Supervisor | AGENT_THINKING → synthesizing diagnosis → AGENT_COMPLETE |
| Patch Supervisor | AGENT_THINKING → producing patch plan |
| HITL Gate | Timeline pauses — inline approval card appears |
  5. Review the patch plan. Steps typically include:

    • restart_pod — restarts the failing pod
    • apply_patch — corrects the misconfigured resource limits or env var
  6. Click Approve (or Edit to modify parameters before applying).

What happens after approval:

| Phase | Events |
| --- | --- |
| Patch Executor | TOOL_CALL + TOOL_RESULT per write tool (mutates in-memory state) |
| Patch Validator | AGENT_THINKING → TOOL_CALL get_pod_status → TOOL_RESULT → AGENT_COMPLETE (confirmed resolved) |
| Orchestrator | AGENT_COMPLETE — final summary |
  7. Click Reset to restore the cluster to its broken baseline for another run.

Other scenarios

| Scenario name | Issue simulated |
| --- | --- |
| crashloopbackoff | Pod restart loop due to OOM / bad config |
| oomkilled | Container killed by kernel OOM killer |
| imagepullbackoff | Container image not found or registry unreachable |
| pvc_binding_failure | PersistentVolumeClaim stuck in Pending |
| node_notready | Worker node unreachable or under disk pressure |
