CROPI: Data-Efficient RLVR via Off-Policy Influence Guidance 🚀

CROPI is a curriculum reinforcement learning framework for large language models that improves the efficiency of RLVR (reinforcement learning with verifiable rewards) through off-policy influence estimation over pre-collected trajectories. Instead of relying on repeated online trial-and-error to identify useful training data, CROPI estimates which prompts are most helpful for the current policy, selects a compact high-value subset, and then runs RL on that subset. In short: less wasted compute and more targeted learning. 🎯

Paper: Data-Efficient RLVR via Off-Policy Influence Guidance (https://arxiv.org/abs/2510.26491)

News 📣

2025-11: CROPI repository initialized. 🎉 The project was set up and organized for public release.
2026-03: First public code release. 🚀 The initial open-source implementation of the CROPI pipeline is now available.

Why CROPI? ✨

Large-scale RL for reasoning models is expensive because it spends substantial compute on prompts that contribute little to policy improvement. CROPI addresses this by:

estimating prompt utility with off-policy influence scores
reusing pre-collected rollout logs instead of requiring fresh online sampling for every scoring step
using sparse random projection to make gradient-based scoring practical at scale
building a curriculum over selected data subsets, rather than training on the full pool at every stage

The result is a more compute-efficient RL loop that maintains or improves downstream performance while substantially reducing the amount of data trained on per stage. 🔥

Main Idea 🧠

CROPI follows three core steps:

Off-policy influence estimation: Compute projected policy-gradient features from pre-collected rollout trajectories.
Validation-targeted scoring: Measure how aligned each train prompt is with the validation objectives via influence scores.
Curriculum RL: Select a compact training subset and run RL on that subset, then repeat for the next round.
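
The scoring and selection steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the small vectors stand in for projected policy-gradient features, cosine similarity to the mean validation feature is an assumed scoring rule, and the names `cosine`, `influence_scores`, and `select_top_fraction` are hypothetical rather than the repository's API.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def influence_scores(train_feats: Sequence[Sequence[float]],
                     valid_feats: Sequence[Sequence[float]]) -> List[float]:
    """Score each train prompt against the mean validation feature (assumed rule)."""
    dim = len(valid_feats[0])
    target = [sum(v[i] for v in valid_feats) / len(valid_feats) for i in range(dim)]
    return [cosine(f, target) for f in train_feats]

def select_top_fraction(scores: Sequence[float], fraction: float = 0.1) -> List[int]:
    """Indices of the highest-scoring prompts, best first (at least one is kept)."""
    k = max(1, int(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy features: one vector per train prompt, one per validation prompt.
train_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
valid_feats = [[1.0, 0.2], [1.0, -0.2]]
scores = influence_scores(train_feats, valid_feats)
print(select_top_fraction(scores, fraction=0.5))  # prints [0, 3]
```

Prompts whose features point in the same direction as the validation target rank first, and prompts pointing away from it are dropped from the stage.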

This makes CROPI particularly attractive when rollout generation is expensive, RL budgets are limited, or you want a more principled curriculum than heuristic filtering. The procedure is operationally simple, yet it yields substantial efficiency gains in practice. ⚡

Results 📈

1.5B Main Training Curve

On the 1.5B setting, CROPI achieves 2.66x step-level acceleration while training on only 10% of the data per stage. The key takeaway is not just faster training, but better use of RL compute: CROPI focuses updates on prompts that matter most for validation-time improvement. This is where the method really stands out. 🏎️

Overall Comparison

The paper shows that CROPI consistently improves the efficiency-quality tradeoff compared with strong but data-hungry baselines. The method is designed to be practical: it works with realistic RL pipelines and avoids introducing another expensive online loop solely for data scoring. ✅

Sparse Projection Efficiency

The sparse-projection results highlight a second advantage of CROPI: the influence-estimation pipeline remains usable even for larger models because it relies on sparse random projection rather than storing or comparing raw full-dimensional gradients. That scalability is a large part of why CROPI is practical beyond toy settings. 📦
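
To make the idea concrete, here is a minimal pure-Python sketch of a sparse random projection. It is not the `fast_jl` or `traker` implementation used by the repository; the function name `sparse_project`, the `density` parameter, and the seeding scheme are assumptions chosen for illustration. The projection matrix is regenerated from a seed on the fly, so it is never stored.

```python
import math
import random
from typing import List, Sequence

def sparse_project(x: Sequence[float], out_dim: int,
                   density: float = 0.1, seed: int = 0) -> List[float]:
    """Project x to out_dim dimensions with a sparse random sign matrix.

    Each matrix entry is +1 or -1 with probability density/2 each and 0
    otherwise, scaled by 1/sqrt(density * out_dim) so that squared norms
    are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(density * out_dim)
    y = [0.0] * out_dim
    for xi in x:
        for j in range(out_dim):
            r = rng.random()  # one draw per matrix entry, independent of x
            if r < density:
                sign = scale if r < density / 2 else -scale
                y[j] += sign * xi
    return y

def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(a * a for a in v))

# A stand-in "gradient" with 2000 coordinates, compressed to 64.
data_rng = random.Random(1)
grad = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
proj = sparse_project(grad, out_dim=64)
print(len(proj), round(norm(proj) / norm(grad), 2))
```

Because the same seed reproduces the same matrix, train and validation gradients can be projected independently and still be compared in the same low-dimensional space.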

Repository Overview 🗂️

This repository contains the open-source implementation of the CROPI selection and curriculum loop. The runnable code lives under cropi/. The goal is to make the paper pipeline easy to inspect, easy to reproduce, and straightforward to extend. 🛠️

Repository structure

cropi/core: Core logic for gradient extraction, influence-score computation, and prompt selection.
cropi/utils: Utilities for JSONL splitting and RL checkpoint merging/export.
cropi/scripts: Shell entry points, including a single-script end-to-end controller.
cropi/inference: Optional wrappers for external rollout-generation scripts.
figures: Paper figures used in this README.

What is included

CROPI scoring and selection pipeline
multi-round orchestration through one script
support for iterative select -> RL -> select -> RL
support for exporting RL checkpoints into Hugging Face format for the next CROPI round

What is not included

datasets
model checkpoints
rollout logs
experiment logs and tracking artifacts

You are expected to prepare these assets locally.

How To Run ▶️

1. Create the environment with `uv`

cd CROPI

uv venv
source .venv/bin/activate

uv pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
uv pip install -e .
uv pip install numpy pandas pyarrow tqdm "transformers<5" math-verify tabulate
uv pip install --no-deps traker==0.3.2
uv pip install fast_jl --no-build-isolation

If you already have a suitable PyTorch installation on the machine, you can also use:

uv venv --system-site-packages
source .venv/bin/activate

2. Sanity-check the installation

uv run python -m compileall cropi
uv run cropi-select --help
uv run cropi-compute-inf-score --help
uv run cropi-get-grad --help
for f in cropi/scripts/*.sh cropi/inference/*.sh; do bash -n "$f"; done

3. Prepare the expected data layout

The public pipeline expects files in the following structure:

data/
  <train_dataset>/
    train_qwen.parquet
    <model_name>/
      train_<infer_note>.jsonl
      train_<infer_note>_grad_<proj_note>.jsonl.<rank>
  <valid_dataset>/
    valid_qwen.parquet
    <model_name>/
      valid_<valid_infer_note>.jsonl
      valid_<valid_infer_note>_grad_<proj_note>.jsonl.<rank>

In other words:

parquet files provide the raw train/validation prompt pool
rollout JSONL files contain prompt, answer, responses, and rewards
gradient JSONL shards contain projected gradient features for CROPI scoring
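
Before launching a run, it can help to check that the layout above is in place. The following is a small sketch, not part of the repository: the helper names `expected_files` and `missing_files` are hypothetical, shard ranks are assumed to start at 0, and the example values for `infer_note` and `proj_note` are placeholders.

```python
from pathlib import Path
from typing import Iterable, List

def expected_files(root: str, dataset: str, model_name: str, split: str,
                   infer_note: str, proj_note: str, num_shards: int) -> List[Path]:
    """Build the file paths the layout above describes for one dataset."""
    base = Path(root) / dataset
    model_dir = base / model_name
    stem = f"{split}_{infer_note}"
    files = [base / f"{split}_qwen.parquet", model_dir / f"{stem}.jsonl"]
    # Assumption: gradient shards are numbered 0 .. num_shards - 1.
    files += [model_dir / f"{stem}_grad_{proj_note}.jsonl.{rank}"
              for rank in range(num_shards)]
    return files

def missing_files(paths: Iterable[Path]) -> List[Path]:
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]

# Placeholder names; substitute your own dataset, model, and notes.
paths = expected_files("data", "my_train_set", "my_model", "train",
                       infer_note="rollouts", proj_note="proj", num_shards=2)
for p in paths:
    print(p.as_posix())
print("missing:", len(missing_files(paths)))
```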

4. Run one CROPI selection stage

To compute influence scores and select data once:

cd CROPI

bash cropi/scripts/run_cropi.sh select-only ./data Qwen2.5-1.5B-Instruct_curriculum

This runs:

cropi-compute-inf-score
cropi-select

and writes the selected parquet under the corresponding dataset directory. One command, one selection stage, clean and reproducible. 🎯

5. Run the full CROPI loop from one script

cropi/scripts/run_cropi.sh is the top-level entry point for the full pipeline. It supports:

select-only
rl-only
full

The full mode executes the iterative pipeline:

select -> RL -> recompute gradients for the new checkpoint -> select -> RL -> ...
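
The control flow of this loop can be sketched in Python as follows. This is a schematic under stated assumptions, not the logic of run_cropi.sh itself: the stage functions are stubs, the name `run_curriculum` is hypothetical, and skipping the gradient recomputation after the final round is an assumption.

```python
from typing import Callable, List

def run_curriculum(
    checkpoint: str,
    grads: str,
    num_rounds: int,
    select: Callable[[str], str],
    train_rl: Callable[[str, str], str],
    compute_grads: Callable[[str], str],
) -> List[str]:
    """Run select -> RL for num_rounds rounds and return each round's checkpoint."""
    checkpoints = []
    for round_idx in range(num_rounds):
        subset = select(grads)                     # influence scoring + selection
        checkpoint = train_rl(checkpoint, subset)  # RL on the selected subset only
        checkpoints.append(checkpoint)
        if round_idx + 1 < num_rounds:
            grads = compute_grads(checkpoint)      # refresh features for the new policy
    return checkpoints

# Stub stages that only record the order in which they are called.
calls: List[str] = []

def fake_select(grads: str) -> str:
    calls.append("select")
    return f"subset({grads})"

def fake_train_rl(checkpoint: str, subset: str) -> str:
    calls.append("rl")
    return checkpoint + "+rl"

def fake_compute_grads(checkpoint: str) -> str:
    calls.append("grads")
    return f"grads({checkpoint})"

print(run_curriculum("base", "grads(base)", 2,
                     fake_select, fake_train_rl, fake_compute_grads))
print(calls)  # prints ['select', 'rl', 'grads', 'select', 'rl']
```

The key point is that selection in round k+1 always uses gradient features computed from the checkpoint produced in round k, so the curriculum tracks the current policy.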

Minimal full-pipeline example

cd CROPI

BASE_MODEL_PATH=/path/to/Qwen2.5-1.5B-Instruct \
RL_PYTHON=/path/to/verl-env/bin/python \
TRAIN_DATA_NAMES=gsm_math_dsr_test \
VALID_DATA_NAMES=gsm8k,math,gaokao2023en \
RL_VAL_DATA_NAMES=gsm8k,math,gaokao2023en \
NUM_RL_ROUNDS=2 \
RL_NUM_GPUS=8 \
RL_TP_SIZE=2 \
bash cropi/scripts/run_cropi.sh full ./data Qwen2.5-1.5B-Instruct_curriculum

Important runtime knobs

BASE_MODEL_PATH: Initial Hugging Face checkpoint for the first RL round.
RL_PYTHON: Python executable with verl installed.
NUM_RL_ROUNDS: Number of CROPI+RL stages to run.
RL_NUM_GPUS: Number of GPUs for RL. Default: 8.
RL_TP_SIZE: vLLM tensor-parallel size during rollout. Default: 2.
NUM_PARALLEL: Number of gradient shards / cropi-get-grad workers.
RL_TOTAL_TRAINING_STEPS: RL steps per stage.
DRY_RUN=1: Print the full command chain without executing it.
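
If you drive the pipeline from Python rather than a shell, the knobs above can be passed as environment variables. The sketch below is one possible wrapper, not something shipped with the repository: `build_invocation` and `launch` are hypothetical helpers, and the meaning of the third positional argument (called `run_name` here) is an assumption based on the example commands in this README.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Tuple

VALID_MODES = {"select-only", "rl-only", "full"}

def build_invocation(mode: str, data_root: str, run_name: str,
                     knobs: Mapping[str, object]) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for one run_cropi.sh call."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    env = dict(os.environ)
    env.update({name: str(value) for name, value in knobs.items()})
    cmd = ["bash", "cropi/scripts/run_cropi.sh", mode, data_root, run_name]
    return cmd, env

def launch(cmd: List[str], env: Dict[str, str]) -> int:
    """Run the command from the repository root and return its exit code."""
    return subprocess.run(cmd, env=env, check=True).returncode

# DRY_RUN=1 asks the script to print its command chain instead of executing it.
cmd, env = build_invocation(
    "full", "./data", "Qwen2.5-1.5B-Instruct_curriculum",
    {"NUM_RL_ROUNDS": 2, "RL_NUM_GPUS": 8, "RL_TP_SIZE": 2, "DRY_RUN": 1},
)
print(" ".join(cmd))
```

Calling `launch(cmd, env)` from the repository root would then start the script. Starting with `DRY_RUN=1` is a cheap way to review the planned stages before committing GPUs.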

6. Notes on RL support

run_cropi.sh is the supported public entry point for the full iterative pipeline.
The script now supports 8-GPU RL by default. 🖥️
After each RL stage, the script exports the actor checkpoint to huggingface/ format so the next CROPI stage can reuse it directly.
If your RL environment is separate from the CROPI environment, point RL_PYTHON to that interpreter.

Acknowledgements 🙏

We thank the following projects and teams for making this work possible:

VeRL for the RL training framework
TRAK / traker for efficient influence-inspired gradient projection tooling
Qwen2.5 for the base language models
Qwen2.5-Math for math-oriented model and evaluation resources
vLLM for efficient rollout generation and serving

We also acknowledge the external math-evaluation tooling used by the original project setup, including the Qwen2.5-Math evaluation pipeline. This repository builds on a very strong open ecosystem, and we are grateful for it. 🌍

Citation 📝

If you find this repository useful, or if CROPI helps your research or engineering workflow, please cite our paper. 💙

@misc{zhu2025dataefficientrlvroffpolicyinfluence,
  title={Data-Efficient RLVR via Off-Policy Influence Guidance},
  author={Erle Zhu and Dazhi Jiang and Yuan Wang and Xujun Li and Jiale Cheng and Yuxian Gu and Yilin Niu and Aohan Zeng and Jie Tang and Minlie Huang and Hongning Wang},
  year={2025},
  eprint={2510.26491},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.26491}
}
