Skip to content

thu-coai/CROPI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CROPI: Data-Efficient RLVR via Off-Policy Influence Guidance 🚀

CROPI logo

Paper

CROPI is a curriculum reinforcement learning framework for large language models that improves RLVR efficiency through off-policy influence estimation over pre-collected trajectories. Instead of relying on repeated online trial-and-error to identify useful training data, CROPI estimates which prompts are most helpful for the current policy, selects a compact high-value subset, and then runs RL on that subset. In short: less wasted compute, more targeted learning, and a much cleaner RL loop. 🎯

Paper: Data-Efficient RLVR via Off-Policy Influence Guidance

News 📣

  • 2025-11: CROPI repository initialized 🎉 The project was set up and organized for public release.
  • 2026-03: First public code release 🚀 The initial open-source implementation of the CROPI pipeline is now available.

Why CROPI? ✨

Large-scale RL for reasoning models is expensive because it spends substantial compute on prompts that contribute little to policy improvement. CROPI addresses this by:

  • estimating prompt utility with off-policy influence scores
  • reusing pre-collected rollout logs instead of requiring fresh online sampling for every scoring step
  • using sparse random projection to make gradient-based scoring practical at scale
  • building a curriculum over selected data subsets, rather than training on the full pool at every stage

The result is a more compute-efficient RL loop that keeps or improves downstream performance while substantially reducing the amount of data trained per stage. That is the core appeal of CROPI: stronger efficiency without giving up serious performance. 🔥

Main Idea 🧠

CROPI framework

CROPI follows three core steps:

  1. Off-policy influence estimation Compute projected policy-gradient features from pre-collected rollout trajectories.
  2. Validation-targeted scoring Measure how aligned each train prompt is with the validation objectives via influence scores.
  3. Curriculum RL Select a compact training subset and run RL on that subset, then repeat for the next round.

This makes CROPI particularly attractive when rollout generation is expensive, RL budgets are limited, or you want a more principled curriculum than heuristic filtering. It is a simple idea operationally, but it unlocks a very strong efficiency story in practice. ⚡

Results 📈

1.5B Main Training Curve

1.5B main curve

On the 1.5B setting, CROPI achieves 2.66x step-level acceleration while training on only 10% of the data per stage. The key takeaway is not just faster training, but better use of RL compute: CROPI focuses updates on prompts that matter most for validation-time improvement. This is where the method really stands out. 🏎️

Overall Comparison

Main result table

The paper shows that CROPI consistently improves the efficiency-quality tradeoff compared with stronger data-hungry baselines. The method is designed to be practical: it works with realistic RL pipelines and avoids introducing another expensive online loop solely for data scoring. Better data selection, less redundant RL, stronger results. ✅

Sparse Projection Efficiency

1.5B sparse projection

3B sparse projection

These results highlight a second advantage of CROPI: the influence-estimation pipeline remains usable even for larger models because it relies on sparse random projection rather than storing or comparing raw full-dimensional gradients. That scalability is a big part of why CROPI is practical beyond toy settings. 📦

Repository Overview 🗂️

This repository contains the open-source implementation of the CROPI selection and curriculum loop. The runnable code lives under cropi/. The goal is to make the paper pipeline easy to inspect, easy to reproduce, and straightforward to extend. 🛠️

Repository structure

  • cropi/core Core logic for gradient extraction, influence-score computation, and prompt selection.
  • cropi/utils Utilities for JSONL splitting and RL checkpoint merging/export.
  • cropi/scripts Shell entry points, including a single-script end-to-end controller.
  • cropi/inference Optional wrappers for external rollout-generation scripts.
  • figures Paper figures used in this README.

What is included

  • CROPI scoring and selection pipeline
  • multi-round orchestration through one script
  • support for iterative select -> RL -> select -> RL
  • support for exporting RL checkpoints into Hugging Face format for the next CROPI round

What is not included

  • datasets
  • model checkpoints
  • rollout logs
  • experiment logs and tracking artifacts

You are expected to prepare these assets locally.

How To Run ▶️

1. Create the environment with uv

cd CROPI

uv venv
source .venv/bin/activate

uv pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
uv pip install -e .
uv pip install numpy pandas pyarrow tqdm "transformers<5" math-verify tabulate
uv pip install --no-deps traker==0.3.2
uv pip install fast_jl --no-build-isolation

If you already have a suitable PyTorch installation on the machine, you can also use:

uv venv --system-site-packages
source .venv/bin/activate

2. Sanity-check the installation

uv run python -m compileall cropi
uv run cropi-select --help
uv run cropi-compute-inf-score --help
uv run cropi-get-grad --help
bash -n cropi/scripts/*.sh cropi/inference/*.sh

3. Prepare the expected data layout

The public pipeline expects files in the following structure:

data/
  <train_dataset>/
    train_qwen.parquet
    <model_name>/
      train_<infer_note>.jsonl
      train_<infer_note>_grad_<proj_note>.jsonl.<rank>
  <valid_dataset>/
    valid_qwen.parquet
    <model_name>/
      valid_<valid_infer_note>.jsonl
      valid_<valid_infer_note>_grad_<proj_note>.jsonl.<rank>

In other words:

  • parquet files provide the raw train/validation prompt pool
  • rollout JSONL files contain prompt, answer, responses, and rewards
  • gradient JSONL shards contain projected gradient features for CROPI scoring

4. Run one CROPI selection stage

To compute influence scores and select data once:

cd CROPI

bash cropi/scripts/run_cropi.sh select-only ./data Qwen2.5-1.5B-Instruct_curriculum

This runs:

  1. cropi-compute-inf-score
  2. cropi-select

and writes the selected parquet under the corresponding dataset directory. One command, one selection stage, clean and reproducible. 🎯

5. Run the full CROPI loop from one script

cropi/scripts/run_cropi.sh is the top-level entry point for the full pipeline. It supports:

  • select-only
  • rl-only
  • full

The full mode executes the iterative pipeline:

select -> RL -> recompute gradients for the new checkpoint -> select -> RL -> ...

Minimal full-pipeline example

cd CROPI

BASE_MODEL_PATH=/path/to/Qwen2.5-1.5B-Instruct \
RL_PYTHON=/path/to/verl-env/bin/python \
TRAIN_DATA_NAMES=gsm_math_dsr_test \
VALID_DATA_NAMES=gsm8k,math,gaokao2023en \
RL_VAL_DATA_NAMES=gsm8k,math,gaokao2023en \
NUM_RL_ROUNDS=2 \
RL_NUM_GPUS=8 \
RL_TP_SIZE=2 \
bash cropi/scripts/run_cropi.sh full ./data Qwen2.5-1.5B-Instruct_curriculum

Important runtime knobs

  • BASE_MODEL_PATH Initial Hugging Face checkpoint for the first RL round.
  • RL_PYTHON Python executable with verl installed.
  • NUM_RL_ROUNDS Number of CROPI+RL stages to run.
  • RL_NUM_GPUS Number of GPUs for RL. Default: 8.
  • RL_TP_SIZE vLLM tensor-parallel size during rollout. Default: 2.
  • NUM_PARALLEL Number of gradient shards / cropi-get-grad workers.
  • RL_TOTAL_TRAINING_STEPS RL steps per stage.
  • DRY_RUN=1 Print the full command chain without executing it.

6. Notes on RL support

  • run_cropi.sh is the supported public entry point for the full iterative pipeline.
  • The script now supports 8-GPU RL by default. 🖥️
  • After each RL stage, the script exports the actor checkpoint to huggingface/ format so the next CROPI stage can reuse it directly.
  • If your RL environment is separate from the CROPI environment, point RL_PYTHON to that interpreter.

Acknowledgements 🙏

We thank the following projects and teams for making this work possible:

  • VeRL for the RL training framework
  • TRAK / traker for efficient influence-inspired gradient projection tooling
  • Qwen2.5 for the base language models
  • Qwen2.5-Math for math-oriented model and evaluation resources
  • vLLM for efficient rollout generation and serving

We also acknowledge the external math-evaluation tooling used by the original project setup, including the Qwen2.5-Math evaluation pipeline. This repository builds on a very strong open ecosystem, and we are grateful for it. 🌍

Citation 📝

If you find this repository useful, please cite our paper. If CROPI helps your research or engineering workflow, a citation is greatly appreciated. 💙

@misc{zhu2025dataefficientrlvroffpolicyinfluence,
  title={Data-Efficient RLVR via Off-Policy Influence Guidance},
  author={Erle Zhu and Dazhi Jiang and Yuan Wang and Xujun Li and Jiale Cheng and Yuxian Gu and Yilin Niu and Aohan Zeng and Jie Tang and Minlie Huang and Hongning Wang},
  year={2025},
  eprint={2510.26491},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.26491}
}

About

Official Repository for for paper "Data-Efficient RLVR via Off-Policy Influence Guidance"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors