CROPI is a curriculum reinforcement learning framework for large language models that improves RLVR efficiency through off-policy influence estimation over pre-collected trajectories. Instead of relying on repeated online trial-and-error to identify useful training data, CROPI estimates which prompts are most helpful for the current policy, selects a compact high-value subset, and then runs RL on that subset. In short: less wasted compute, more targeted learning, and a much cleaner RL loop. 🎯
Paper: Data-Efficient RLVR via Off-Policy Influence Guidance
- 2025-11: CROPI repository initialized 🎉 The project was set up and organized for public release.
- 2026-03: First public code release 🚀 The initial open-source implementation of the CROPI pipeline is now available.
Large-scale RL for reasoning models is expensive because it spends substantial compute on prompts that contribute little to policy improvement. CROPI addresses this by:
- estimating prompt utility with off-policy influence scores
- reusing pre-collected rollout logs instead of requiring fresh online sampling for every scoring step
- using sparse random projection to make gradient-based scoring practical at scale
- building a curriculum over selected data subsets, rather than training on the full pool at every stage
The result is a more compute-efficient RL loop that maintains or improves downstream performance while substantially reducing the amount of data trained on at each stage. That is the core appeal of CROPI: strong efficiency without sacrificing performance. 🔥
CROPI follows three core steps:
- **Off-policy influence estimation**: compute projected policy-gradient features from pre-collected rollout trajectories.
- **Validation-targeted scoring**: measure how aligned each training prompt is with the validation objectives via influence scores.
- **Curriculum RL**: select a compact training subset, run RL on that subset, then repeat for the next round.
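The scoring-and-selection step can be sketched with synthetic features. This is a minimal illustration only: in the real pipeline the projected gradients come from the JSONL shards produced by `cropi-get-grad`, and the dimensions and aggregation details differ.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 64                                            # projected-gradient dimension (illustrative)
n_train, n_valid = 1000, 50

# Stand-ins for projected policy-gradient features, one row per prompt.
train_feats = rng.standard_normal((n_train, k))
valid_feats = rng.standard_normal((n_valid, k))

# Influence score: alignment of each train prompt's gradient with the
# aggregate validation-gradient direction, i.e. a first-order estimate of
# how much an update on that prompt would help the validation objective.
valid_direction = valid_feats.mean(axis=0)
scores = train_feats @ valid_direction            # shape: (n_train,)

# Curriculum selection: keep only the top 10% of prompts for this RL stage.
budget = int(0.10 * n_train)
selected = np.argsort(scores)[::-1][:budget]      # indices of the chosen prompts
```

Each CROPI round recomputes the train-side features for the current policy before reselecting, so the curriculum tracks the model as it improves.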
This makes CROPI particularly attractive when rollout generation is expensive, RL budgets are limited, or you want a more principled curriculum than heuristic filtering. It is a simple idea operationally, but it unlocks a very strong efficiency story in practice. ⚡
On the 1.5B setting, CROPI achieves 2.66x step-level acceleration while training on only 10% of the data per stage. The key takeaway is not just faster training, but better use of RL compute: CROPI focuses updates on prompts that matter most for validation-time improvement. This is where the method really stands out. 🏎️
The paper shows that CROPI consistently improves the efficiency-quality tradeoff compared with strong but data-hungry baselines. The method is designed to be practical: it works with realistic RL pipelines and avoids introducing another expensive online loop solely for data scoring. Better data selection, less redundant RL, stronger results. ✅
These results highlight a second advantage of CROPI: the influence-estimation pipeline remains usable even for larger models because it relies on sparse random projection rather than storing or comparing raw full-dimensional gradients. That scalability is a big part of why CROPI is practical beyond toy settings. 📦
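The dimensionality-reduction idea can be illustrated with a sparse sign (Rademacher) projection. This is a hedged sketch: the repository relies on TRAK's `fast_jl` kernels, and the exact projection scheme and scaling used there may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20_000, 512            # full gradient dimension vs. projected dimension
n_grads = 8                   # number of per-prompt gradients to project

# Sparse sign projection: each entry is +-1/sqrt(s*k) with probability s,
# zero otherwise, so inner products are preserved in expectation
# (a Johnson-Lindenstrauss-style guarantee).
s = 0.01                      # fraction of nonzero entries
mask = rng.random((d, k)) < s
signs = rng.choice([-1.0, 1.0], size=(d, k))
P = (mask * signs) / np.sqrt(s * k)

grads = rng.standard_normal((n_grads, d))
projected = grads @ P         # (n_grads, k): compact features for scoring

# Gradient norms (and dot products) approximately survive projection,
# which is what makes k-dimensional influence scoring meaningful.
exact_norms = np.sum(grads**2, axis=1)
approx_norms = np.sum(projected**2, axis=1)
```

Because only the k-dimensional features are stored, scoring cost no longer scales with the model's full parameter count.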
This repository contains the open-source implementation of the CROPI selection and curriculum loop. The runnable code lives under cropi/. The goal is to make the paper pipeline easy to inspect, easy to reproduce, and straightforward to extend. 🛠️
- `cropi/core`: core logic for gradient extraction, influence-score computation, and prompt selection.
- `cropi/utils`: utilities for JSONL splitting and RL checkpoint merging/export.
- `cropi/scripts`: shell entry points, including a single-script end-to-end controller.
- `cropi/inference`: optional wrappers for external rollout-generation scripts.
- `figures`: paper figures used in this README.
- CROPI scoring and selection pipeline
- multi-round orchestration through one script
- support for iterative `select -> RL -> select -> RL` curricula
- support for exporting RL checkpoints into Hugging Face format for the next CROPI round
This repository does not include:
- datasets
- model checkpoints
- rollout logs
- experiment logs and tracking artifacts
You are expected to prepare these assets locally.
cd CROPI
uv venv
source .venv/bin/activate
uv pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
uv pip install -e .
uv pip install numpy pandas pyarrow tqdm "transformers<5" math-verify tabulate
uv pip install --no-deps traker==0.3.2
uv pip install fast_jl --no-build-isolation

If you already have a suitable PyTorch installation on the machine, you can also use:
uv venv --system-site-packages
source .venv/bin/activate

A quick smoke test of the installed entry points:

uv run python -m compileall cropi
uv run cropi-select --help
uv run cropi-compute-inf-score --help
uv run cropi-get-grad --help
bash -n cropi/scripts/*.sh cropi/inference/*.sh

The public pipeline expects files in the following structure:
data/
<train_dataset>/
train_qwen.parquet
<model_name>/
train_<infer_note>.jsonl
train_<infer_note>_grad_<proj_note>.jsonl.<rank>
<valid_dataset>/
valid_qwen.parquet
<model_name>/
valid_<valid_infer_note>.jsonl
valid_<valid_infer_note>_grad_<proj_note>.jsonl.<rank>
In other words:
- parquet files provide the raw train/validation prompt pool
- rollout JSONL files contain `prompt`, `answer`, `responses`, and `rewards`
- gradient JSONL shards contain projected gradient features for CROPI scoring
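A single rollout record in that shape might look like the following. Field names come from the structure above; the content values are made up for illustration.

```python
import json

# One JSON object per line in the rollout JSONL (illustrative values).
record = {
    "prompt": "What is 17 * 12?",
    "answer": "204",
    "responses": ["17 * 12 = 204.", "17 * 12 = 214."],   # sampled rollouts
    "rewards": [1.0, 0.0],                               # one verifiable reward per response
}

line = json.dumps(record)    # append this line to the JSONL file
parsed = json.loads(line)
```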
To compute influence scores and select data once:
cd CROPI
bash cropi/scripts/run_cropi.sh select-only ./data Qwen2.5-1.5B-Instruct_curriculum

This runs `cropi-compute-inf-score` followed by `cropi-select`, and writes the selected parquet under the corresponding dataset directory. One command, one selection stage, clean and reproducible. 🎯
cropi/scripts/run_cropi.sh is the top-level entry point for the full pipeline. It supports:
- `select-only`
- `rl-only`
- `full`
The full mode executes the iterative pipeline:
select -> RL -> recompute gradients for the new checkpoint -> select -> RL -> ...
cd CROPI
BASE_MODEL_PATH=/path/to/Qwen2.5-1.5B-Instruct \
RL_PYTHON=/path/to/verl-env/bin/python \
TRAIN_DATA_NAMES=gsm_math_dsr_test \
VALID_DATA_NAMES=gsm8k,math,gaokao2023en \
RL_VAL_DATA_NAMES=gsm8k,math,gaokao2023en \
NUM_RL_ROUNDS=2 \
RL_NUM_GPUS=8 \
RL_TP_SIZE=2 \
bash cropi/scripts/run_cropi.sh full ./data Qwen2.5-1.5B-Instruct_curriculum

Key environment variables:

- `BASE_MODEL_PATH`: initial Hugging Face checkpoint for the first RL round.
- `RL_PYTHON`: Python executable with `verl` installed.
- `NUM_RL_ROUNDS`: number of CROPI+RL stages to run.
- `RL_NUM_GPUS`: number of GPUs for RL. Default: `8`.
- `RL_TP_SIZE`: vLLM tensor-parallel size during rollout. Default: `2`.
- `NUM_PARALLEL`: number of gradient shards / `cropi-get-grad` workers.
- `RL_TOTAL_TRAINING_STEPS`: RL steps per stage.
- `DRY_RUN=1`: print the full command chain without executing it.
- `run_cropi.sh` is the supported public entry point for the full iterative pipeline.
- The script supports 8-GPU RL by default. 🖥️
- After each RL stage, the script exports the actor checkpoint to `huggingface/` format so the next CROPI stage can reuse it directly.
- If your RL environment is separate from the CROPI environment, point `RL_PYTHON` to that interpreter.
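The full mode's control flow can be summarized as a dry-run plan. Entry-point names mirror the repo's CLI commands, but `rl-train` and the argument strings are illustrative placeholders, not the script's actual invocations.

```python
# Dry-run sketch of the iterative pipeline orchestrated by run_cropi.sh.
NUM_RL_ROUNDS = 2

def plan_round(round_idx: int) -> list[list[str]]:
    # Round 0 starts from BASE_MODEL_PATH; later rounds reuse the exported
    # huggingface/ checkpoint from the previous RL stage.
    ckpt = "BASE_MODEL_PATH" if round_idx == 0 else f"round{round_idx - 1}/huggingface"
    return [
        ["cropi-get-grad", "--model", ckpt],        # projected gradients for the current policy
        ["cropi-compute-inf-score"],                # score train prompts against validation
        ["cropi-select"],                           # pick the compact subset for this stage
        ["rl-train", "--round", str(round_idx)],    # RL on the selected subset (placeholder)
    ]

commands = [cmd for r in range(NUM_RL_ROUNDS) for cmd in plan_round(r)]
```

Setting `DRY_RUN=1` on the real script prints an analogous command chain without executing it.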
We thank the following projects and teams for making this work possible:
- VeRL for the RL training framework
- TRAK / traker for efficient influence-inspired gradient projection tooling
- Qwen2.5 for the base language models
- Qwen2.5-Math for math-oriented model and evaluation resources
- vLLM for efficient rollout generation and serving
We also acknowledge the external math-evaluation tooling used by the original project setup, including the Qwen2.5-Math evaluation pipeline. This repository builds on a very strong open ecosystem, and we are grateful for it. 🌍
If you find this repository useful, please cite our paper. If CROPI helps your research or engineering workflow, a citation is greatly appreciated. 💙
@misc{zhu2025dataefficientrlvroffpolicyinfluence,
title={Data-Efficient RLVR via Off-Policy Influence Guidance},
author={Erle Zhu and Dazhi Jiang and Yuan Wang and Xujun Li and Jiale Cheng and Yuxian Gu and Yilin Niu and Aohan Zeng and Jie Tang and Minlie Huang and Hongning Wang},
year={2025},
eprint={2510.26491},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.26491}
}