| Paper | Model | Dataset | Data Pipeline | Training | Evaluation |
- 🌐 4-round data pipeline — turns raw Wikipedia seeds into a verified multi-hop QA dataset through generation, self-verification, and external judging
- 💸 Zero API spend — no money spent on paid web-search APIs or search-enabled LLMs; the entire pipeline runs on Selenium + DeepSeek Chat and free search aggregators such as DDGS
- 🤖 GRPO training with web search — trains search-augmented LLMs via verl-tool with a parallel multi-backend DDGS retrieval server
- 📊 12-benchmark evaluation suite — covers single-hop (NQ, TriviaQA, PopQA), multi-hop (HotpotQA, 2WikiMHQA, MuSiQue, Bamboogle), and hard reasoning (FRAMES, GAIA, MoNaCo, WebWalkerQA, WebShaper) benchmarks
- 🗂️ Data Pipeline
- 🌱 Round 1 — Seed Creation
- ❓ Round 2 — QA Generation
- ✅ Round 3 — Self-Verification
- 🏅 Round 4 — External Verification
- 🏋️ Model Training
- 📐 Data Processing
- 🔎 Evaluation
ORBIT QA pairs are built through four sequential rounds. See data/README.md for the full overview.
| Round | What it does | Key script |
|---|---|---|
| 1 | Wikipedia category pages → seed titles | create_seeds.py |
| 2 | Seeds → multi-hop reasoning questions + answers | deepseek_generate_qa.py |
| 3 | Self-verification via DeepSeek Chat + Selenium | deepseek_self_verify.py |
| 4 | External verification on scraped documents via vLLM | external_verification.py |
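The four rounds above can be pictured as a single record flowing through a pipeline. The sketch below is a hypothetical simplification — the record fields, function names, and verdict logic are illustrative placeholders, not the real schema used by the scripts:

```python
from dataclasses import dataclass

# Hypothetical sketch of how one record moves through the four rounds.
# Field names and verdict logic are illustrative, not the real schema.

@dataclass
class Record:
    seed_title: str
    question: str = ""
    answer: str = ""
    self_verified: bool = False
    externally_verified: bool = False

def round1_seed(category_titles):    # create_seeds.py
    return [Record(seed_title=t) for t in category_titles]

def round2_generate(rec):            # deepseek_generate_qa.py
    rec.question = f"Multi-hop question derived from '{rec.seed_title}'?"
    rec.answer = "placeholder answer"
    return rec

def round3_self_verify(rec):         # deepseek_self_verify.py
    rec.self_verified = bool(rec.question and rec.answer)
    return rec

def round4_external_verify(rec):     # external_verification.py
    # Here the external judge simply agrees with self-verification.
    rec.externally_verified = rec.self_verified
    return rec

records = [round4_external_verify(round3_self_verify(round2_generate(r)))
           for r in round1_seed(["The Emoji Movie"])]
kept = [r for r in records if r.externally_verified]
```

Only records that survive all four rounds end up in the final dataset.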
Dataset Example — TV Shows & Movies domain (click to expand)
Question: What was the exact runtime (minutes) of the 2017 animated feature set inside a smartphone's messaging application, directed by a filmmaker previously known for sequels to popular children's franchises, featuring a protagonist whose facial expression malfunctions, with voice casting that includes a lead actor from a critically acclaimed sitcom, and produced by a studio that later won an Oscar for Spider-Man animation?
Answer: 86 minutes
Verification / Summary of Supporting Facts:
| | Clue | Supporting Evidence |
|---|---|---|
| ✅ | Animated Set | The Emoji Movie (2017), set inside a smartphone messaging app world called Textopolis |
| ✅ | Filmmaker | Tony Leondis, previously directed sequels: Lilo & Stitch 2, Kung Fu Panda: Secrets of the Masters |
| ✅ | Protagonist | Gene, the main character in The Emoji Movie, struggles with malfunctioning facial expressions |
| ✅ | Voice Cast | T.J. Miller, lead actor from the HBO sitcom Silicon Valley |
| ✅ | Studio | Sony Pictures Animation, which later won an Oscar for Spider-Man: Into the Spider-Verse |
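The verification table above implies a simple acceptance rule: a QA pair survives only if every clue in the question is backed by supporting evidence. A minimal sketch of that rule (the clue/evidence dictionary format is hypothetical, taken from the example above):

```python
# Hypothetical sketch of the acceptance rule implied by the example:
# keep a QA pair only when every clue has non-empty supporting evidence.

clues = {
    "Animated Set": "The Emoji Movie (2017), set inside Textopolis",
    "Filmmaker": "Tony Leondis, directed Lilo & Stitch 2",
    "Protagonist": "Gene, malfunctioning facial expressions",
    "Voice Cast": "T.J. Miller, lead actor in Silicon Valley",
    "Studio": "Sony Pictures Animation, Oscar for Into the Spider-Verse",
}

def accept(qa_clues: dict) -> bool:
    # Reject the pair if any clue lacks supporting evidence.
    return all(evidence.strip() for evidence in qa_clues.values())

assert accept(clues)                         # all five clues supported
assert not accept({**clues, "Studio": ""})   # one unsupported clue -> reject
```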
Installing uv itself:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
```

Prior to training and evaluating search agents, you should prepare the training, eval, and test datasets. See models/data_process for the full data processing scripts.
```bash
export HF_HOME=/u3/n3thakur/projects/cache
export HF_TOKEN=hf_...
```

Training data — ORBIT + NQ + HotpotQA mixed at a 1:1:1 ratio:
```bash
python models/data_process/prepare_train_data.py \
    --datasets 'nq,hotpotqa,orbit-ai/orbit-20k' \
    --local_dir models/train/data/mix-nq-hotpotqa-orbit-ratio-1-1-1 \
    --ratio 1:1:1
```

Eval data — all 12 validation benchmarks:
```bash
python models/data_process/prepare_eval_data.py \
    --data_sources nq,triviaqa,popqa,hotpotqa,2wikimultihopqa,musique,bamboogle,frames,gaia,monaco,webwalkerqa,webshaper \
    --local_dir models/eval/data/all-12-val-datasets
```

Test data — 8 Wikipedia test benchmarks:
```bash
python models/data_process/prepare_test_data.py \
    --local_dir models/eval/data/all-8-wikipedia-test-datasets
```

To train search agents using ORBIT, please see models/train.
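The 1:1:1 training-data mix described above can be sketched as a round-robin interleave over the three sources. This is a hypothetical simplification — `prepare_train_data.py` may instead sample by ratio; the sketch only illustrates what an equal mix means:

```python
from itertools import chain, zip_longest

# Hypothetical sketch of 1:1:1 mixing: interleave examples so each of the
# three sources contributes equally. Example IDs below are placeholders.

def mix_equal(*sources):
    """Round-robin over the sources, dropping zip_longest's None padding."""
    return [x for x in chain.from_iterable(zip_longest(*sources)) if x is not None]

nq = ["nq-0", "nq-1"]
hotpotqa = ["hp-0", "hp-1"]
orbit = ["orbit-0", "orbit-1"]

mixed = mix_equal(nq, hotpotqa, orbit)
# -> ["nq-0", "hp-0", "orbit-0", "nq-1", "hp-1", "orbit-1"]
```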
```bash
export HF_TOKEN="hf_..."
export WANDB_API_KEY="..."
bash models/train/ddgs_web_search.sh  # run ddgs web server
bash models/train/run_grpo.sh         # train search agent with verl-tool
```

To evaluate search agents trained using ORBIT, please see models/eval.
```bash
bash models/eval/retrieval_server_bge.sh  # run BGE index using faiss (conda env required)
bash models/eval/run_eval.sh              # evaluate search agent on Wikipedia datasets
```

```
orbit/
├── data/
│   ├── round-1-seed-creation/           # Wikipedia pages → seed JSONL
│   ├── round-2-qa-generation/           # Seeds → inverted QA pairs (DeepSeek)
│   ├── round-3-self-verification/       # QA pairs → self-verified answers (DeepSeek)
│   ├── round-4-external-verification/   # Verified pairs → externally judged (vLLM)
│   └── outputs/                         # Intermediate + final JSONL files
├── models/
│   ├── train/                           # GRPO training + DDGS retrieval server
│   ├── eval/                            # BGE retrieval index + evaluation
│   └── data_process/                    # Prepare train/eval parquet files
├── pyproject.toml
└── README.md
```
We thank the following open-source projects:
- verl-tool for the GRPO framework used to train search agents.
- vLLM for LLM inference.
- DDGS for the web-search aggregator used during GRPO training of search agents.
- Search-R1 for retrieval server design inspiration and initial experiments.
If you have any questions or suggestions, please contact us at:
- Nandan Thakur: [email protected]
- Zijian Chen: [email protected]
- Xueguang Ma: [email protected]
If you find this data generation repository helpful, please cite our preprint, *ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget*:
```bibtex
@misc{thakur2026orbit,
  title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
  author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
  year={2026},
  eprint={2604.01195},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01195},
}
```