Check out a quick demo video here
KOBy is a multi-agent AI library that diagnoses and patches issues in distributed systems (Kubernetes cluster issues supported as of this release). An orchestrator agent breaks the problem down across a diagnosis team and a patching team, with a human approval gate before anything gets written to the cluster.
- Architecture overview
- Interaction flow
- Tech stack
- Engineering highlights
- Setup — Docker Compose
- Setup — local dev
- Demo walkthrough
Orchestrator (L0 — top-level supervisor)
│
├──▶ Diagnosis Supervisor (L1 — troubleshooting team supervisor)
│ ├──▶ Cluster Inspector (L2 — ReAct)
│ ├──▶ RAG Agent (L2 — ReAct)
│ └──▶ Web Search Agent (L2 — ReAct)
│
└──▶ Patch Supervisor (L1 — patching team supervisor)
├──▶ Patch Executor (L2 — ReAct)
└──▶ Patch Validator (L2 — ReAct)
8 agents, 20 LangGraph nodes (15 LLM, 5 deterministic)
Each level only knows about its direct reports. The orchestrator has no knowledge of Kubernetes tooling or RAG internals — it just creates a TODO list and routes to teams. LangGraph subgraph boundaries enforce this at the framework level.
L0 Orchestrator
"Something is wrong with the cluster" → create TODO, route to teams
L1 Diagnosis Supervisor
"Investigate these symptoms" → decide inspection strategy, synthesize
L2 Cluster Inspector
"Gather raw cluster data for these symptoms" → MCP read tool calls
L2 RAG Agent
"Find Knowledge Base articles for these symptoms" → Pinecone query
L2 Web Search Agent
"Supplement Knowledge Base research with web results" → Tavily query
L1 Patch Supervisor
"Fix these diagnosed issues" → produce PatchPlan, Human-in-the-Loop gate
L2 Patch Executor
"Execute approved patch plan" → MCP write tool call
L2 Patch Validator
"Confirm symptoms resolved" → MCP read tool calls, reason
User Frontend FastAPI LangGraph MCP
│ │ │ │ │
│ enter symptoms │ │ │ │
│ select scenario │ │ │ │
│ click Run ──────────▶│ │ │ │
│ │ POST /api/run ──▶│ │ │
│ │ │ graph.astream_ │ │
│ │ │ events() ───────▶│ │
│ │◀── SSE stream ───│ │ │
│ │ │ AGENT_START events │
│ │ (timeline fills) │ │ │
│ │ │ cluster_inspector │
│ │ │ calls MCP tools ────▶│
│ │◀── TOOL_CALL ────│◀─────────────────│◀── JSON state │
│ │◀── TOOL_RESULT ──│ │ │
│ │ │ rag_agent queries │
│ │ │ Pinecone ──────────────▶ Pinecone
│ │◀── AGENT_THINKING│ │ │
│ │ (streaming tok) │ diagnosis synthesized │
│ │ │ patch plan produced │
│ │◀── HITL_REQUIRED─│ ◀── interrupt() ─│ │
│◀─── approval card ────│ │ │ │
│ │ │ (graph paused) │
│ review patch plan │ │ │ │
│ click Approve ───────▶ │ │ │
│ │ POST /api/resume▶│ │ │
│ │ │ graph.update_ │ │
│ │ │ state() ────────▶│ │
│ │ │ astream_events() │ │
│ │◀── SSE stream ───│ │ │
│ │ │ patch_executor │
│ │ │ calls write tools ──▶│
│ │◀── TOOL_CALL ────│◀─────────────────│◀── mutation │
│ │◀── TOOL_RESULT ──│ │ │
│ │ │ patch_validator │
│ │ │ reads cluster ──────▶│
│ │◀── AGENT_COMPLETE│ │ │
│◀─── run complete ─────│ │ │ │
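The backend half of the flow above can be sketched as an async generator that drains `graph.astream_events()` and re-frames each event for the SSE stream. This is a minimal, self-contained illustration, not the project's actual code: `FakeGraph` stands in for the compiled LangGraph, and the event shapes are assumptions.

```python
# Sketch of the /api/run streaming path. FakeGraph fakes a compiled
# LangGraph so the example runs on its own; event fields are illustrative.
import asyncio
import json

class FakeGraph:
    async def astream_events(self, inputs):
        for agent in ("orchestrator", "cluster_inspector"):
            yield {"event": "AGENT_START", "agent": agent}

def format_sse(event: dict) -> str:
    # SSE wire format: a "data:" line followed by a blank line.
    return f"data: {json.dumps(event)}\n\n"

async def run_stream(graph, symptoms: str):
    # In FastAPI this generator would be wrapped in a StreamingResponse
    # with media_type="text/event-stream".
    async for event in graph.astream_events({"symptoms": symptoms}):
        yield format_sse(event)
```

The same generator shape serves both the initial run and the post-approval resume, since the frontend consumes one SSE stream per request.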
| Concern | Choice | Notes |
|---|---|---|
| LLM | Google Gemini Flash | Google AI Studio free tier |
| Embeddings (prod) | text-embedding-004 | 768 dims, Google AI Studio |
| Embeddings (dev) | sentence-transformers/all-MiniLM-L6-v2 | Local, offline |
| Vector DB | Pinecone Starter | |
| Agent orchestration | LangGraph | |
| Agent framework | LangChain + LangGraph | ReAct prebuilts, tool wrappers, prompt templates |
| HITL mechanism | LangGraph interrupt() + MemorySaver | Graph suspends mid-run; resumes on /api/resume |
| Web search | Tavily API | |
| MCP | Official Anthropic MCP Python SDK | |
| Backend | FastAPI + SSE | StreamingResponse, Server-Sent Events |
| Frontend | Next.js 14 + TypeScript | |
| SSE client | @microsoft/fetch-event-source | |
| Deployment | Docker Compose | Two services: backend, frontend |
| Evals | ragas | |
The orchestrator shouldn't need to know anything about Kubernetes, RAG, or MCP schemas — that's the whole point of having specialist teams. LangGraph subgraphs enforce this boundary at the framework level rather than leaving it as a convention.
Each team is a standalone StateGraph with its own typed state schema. The
orchestrator passes in a typed input contract (DiagnosisRequest /
PatchPlan) and gets back a typed output (DiagnosisResult /
ValidationResult). Everything that happens inside — cluster snapshots, RAG
confidence scores, intermediate patch steps — is invisible at the orchestrator
level.
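The boundary can be sketched as a function whose signature is the contract. The model names come from the README; the field choices are illustrative assumptions, and plain dataclasses stand in for the project's Pydantic models to keep the sketch dependency-free.

```python
# Sketch of the typed team boundary. Field names are assumptions,
# not the project's actual schema.
from dataclasses import dataclass, field

@dataclass
class DiagnosisRequest:
    symptoms: str
    retry_context: list[str] = field(default_factory=list)

@dataclass
class DiagnosisResult:
    root_cause: str
    confidence: float

def diagnosis_team(request: DiagnosisRequest) -> DiagnosisResult:
    # Everything internal to the team (cluster snapshots, RAG scores,
    # intermediate notes) stays inside; only the typed result crosses
    # back to the orchestrator.
    internal_snapshot = {"pod": "payment-service", "restarts": 14}
    return DiagnosisResult(
        root_cause=f"OOMKilled after {internal_snapshot['restarts']} restarts",
        confidence=0.9,
    )

result = diagnosis_team(DiagnosisRequest(symptoms="CrashLoopBackOff"))
```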
The producing agent runs a self_validate LLM call before returning and
attaches a QualityAssessment ({passed, confidence, gaps, recommendation})
to its output. The receiving supervisor evaluates this independently — it
doesn't just trust the producer's self-score.
If passed=False and retries remain, the orchestrator re-routes to the
diagnosis subgraph with the gaps list as retry_context, so the next
attempt has specific direction rather than starting blind. After two failed
retries, cannot_diagnose surfaces the gaps to the user.
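The gate's routing logic reduces to a small pure function, which is part of why it lives in an explicit conditional edge (see below). This is a hedged sketch: the field names mirror the README's QualityAssessment shape, but the function body is an assumption about how check_diagnosis_quality behaves, not a copy of it.

```python
# Sketch of the quality-gate routing. Return values name the next
# destination in the graph; in LangGraph this function would drive a
# conditional edge.
from dataclasses import dataclass, field

@dataclass
class QualityAssessment:
    passed: bool
    confidence: float
    gaps: list[str] = field(default_factory=list)
    recommendation: str = ""

MAX_RETRIES = 2  # after two failed retries, surface the gaps to the user

def check_diagnosis_quality(qa: QualityAssessment, retries: int) -> str:
    # The supervisor evaluates independently; the producer's self-score
    # is a signal, not a verdict.
    if qa.passed:
        return "patch_team"
    if retries < MAX_RETRIES:
        # The gaps list becomes retry_context so the next attempt
        # has specific direction rather than starting blind.
        return "diagnosis_team"
    return "cannot_diagnose"
```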
Every inter-agent handoff uses a Pydantic model: DiagnosisRequest,
DiagnosisResult, PatchPlan, PatchStep, ValidationResult,
QualityAssessment, TodoItem, StatusEvent. TypeScript types in
frontend/src/lib/types.ts mirror each one. EventType is defined once in
frontend/src/constants/events.ts — not redefined in types.ts.
Untyped dict passing between agents is the most common source of runtime
surprises in multi-agent systems. Explicit contracts surface mismatches at
parse time instead of mid-run.
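A tiny demonstration of failing at parse time: Pydantic rejects a malformed handoff before any agent consumes it. The PatchStep fields here are illustrative assumptions, not the project's actual schema.

```python
# A typo'd field in an untyped dict handoff would propagate silently;
# with a Pydantic contract it fails immediately at parse time.
from pydantic import BaseModel, ValidationError

class PatchStep(BaseModel):
    action: str
    target: str

class PatchPlan(BaseModel):
    steps: list[PatchStep]

caught = False
try:
    # "tagret" leaves the required "target" field missing.
    PatchPlan(steps=[{"action": "restart_pod", "tagret": "payment-service"}])
except ValidationError:
    caught = True

# A well-formed handoff parses cleanly.
plan = PatchPlan(steps=[{"action": "restart_pod", "target": "payment-service"}])
```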
The langgraph-supervisor prebuilt was excluded deliberately. The routing
logic in check_diagnosis_quality, hitl_gate, and check_rag_confidence
contains real business logic — retry thresholds, quality gates, HITL branching
— that needs to be readable and testable. Burying it in a prebuilt would make
it harder to audit and harder to change. Manual conditional edges also make the
retry loop and HITL interrupt easy to trace without stepping through library
internals.
A flat 20-node graph would give the orchestrator implicit visibility into cluster inspection strategy, RAG confidence thresholds, and patch execution mechanics. Subgraph boundaries enforce the team abstraction at the framework level rather than relying on convention.
At 10–30 tokens per second, naive 1:1 setState per token produces hundreds
of re-renders per second and visible jank. The 50ms flush window collapses
this to ≤20 re-renders per second while keeping the streaming-text effect
smooth.
Every ReAct agent has a recursion_limit in its LangGraph config: RAG Agent
and Web Search Agent cap at 5 iterations; Cluster Inspector at 6; Patch
Executor at 8; Patch Validator at 5. When the cap is hit, a forced-exit node
returns the best result so far with partial_result=True.
Supervisors check this flag before routing — a partial RAG result triggers the web search fallback regardless of the confidence threshold, since partial usually means the Pinecone results weren't enough.
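The cap-and-fallback behaviour can be sketched with a plain loop. In the real code the cap is LangGraph's recursion_limit config; the loop below is an assumption-laden stand-in that only shows the forced-exit contract, with the per-agent limits taken from the paragraph above.

```python
# Per-agent iteration caps, as stated above.
RECURSION_LIMITS = {
    "rag_agent": 5,
    "web_search_agent": 5,
    "cluster_inspector": 6,
    "patch_executor": 8,
    "patch_validator": 5,
}

def run_react_loop(agent: str, step):
    # step(best_so_far) -> (new_best, done); stands in for one
    # reason/act iteration of the ReAct agent.
    best = None
    for _ in range(RECURSION_LIMITS[agent]):
        best, done = step(best)
        if done:
            return {"result": best, "partial_result": False}
    # Cap hit: forced exit returns the best result so far, flagged as
    # partial so the supervisor can trigger its fallback.
    return {"result": best, "partial_result": True}
```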
- Docker Desktop
- API keys
git clone <repo-url> koby && cd koby
cp .env.example .env
# Fill in your API keys
# Ingest the knowledge base into Pinecone (one-time setup)
docker compose run --rm backend python -m scripts.ingest_knowledge_base
docker compose up --build
# Frontend: http://localhost:3000
# Backend: http://localhost:8000
# API docs: http://localhost:8000/docs
- Python 3.11+
- Node.js 18+
- Pinecone index created (768 dims, cosine metric, name matching
  PINECONE_INDEX_NAME in .env)
# From project root
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Fill in your API keys
# Ingest knowledge base (one-time)
python -m scripts.ingest_knowledge_base
# Start the backend
uvicorn backend.main:app --reload --port 8000
cd frontend
npm install
npm run dev
- Open http://localhost:3000.
- In the Scenario dropdown, select crashloopbackoff.
- In the Symptoms textarea, enter: Pod payment-service-7d9f8b-xkz2p is in CrashLoopBackOff. It has restarted 14 times in the last 30 minutes.
- Click Run.
What to watch in the timeline:
| Phase | Events |
|---|---|
| Orchestrator | AGENT_START → creates TODO list → routes to diagnosis team |
| Cluster Inspector | AGENT_THINKING (reasoning about which tools to call) → TOOL_CALL + TOOL_RESULT per MCP read (get_pod_status, get_pod_logs, get_events) |
| Diagnosis Supervisor | AGENT_THINKING (streaming) → planning RAG queries |
| RAG Agent | TOOL_CALL pinecone_query → TOOL_RESULT KB chunks returned |
| Diagnosis Supervisor | AGENT_THINKING → synthesizing diagnosis → AGENT_COMPLETE |
| Patch Supervisor | AGENT_THINKING → producing patch plan |
| HITL Gate | Timeline pauses — inline approval card appears |
- Review the patch plan. Steps typically include:
  - restart_pod: restarts the failing pod
  - apply_patch: corrects the misconfigured resource limits or env var
- Click Approve (or Edit to modify parameters before applying).
What happens after approval:
| Phase | Events |
|---|---|
| Patch Executor | TOOL_CALL + TOOL_RESULT per write tool (mutates in-memory state) |
| Patch Validator | AGENT_THINKING → TOOL_CALL get_pod_status → TOOL_RESULT → AGENT_COMPLETE (confirmed resolved) |
| Orchestrator | AGENT_COMPLETE — final summary |
- Click Reset to restore the cluster to its broken baseline for another run.
| Scenario name | Issue simulated |
|---|---|
| crashloopbackoff | Pod restart loop due to OOM / bad config |
| oomkilled | Container killed by kernel OOM killer |
| imagepullbackoff | Container image not found or registry unreachable |
| pvc_binding_failure | PersistentVolumeClaim stuck in Pending |
| node_notready | Worker node unreachable or under disk pressure |