Authors: Ying Feng*, Hongjie Fang*, Yinong He*, Jingjing Chen, Chenxi Wang, Zihao He, Ruonan Liu, Cewu Lu
Please follow the installation guide to install the rise and vq-vae conda environments and the dependencies, as well as the real robot environments. Also, remember to adjust the constant parameters in dataset/constants.py and utils/constants.py according to your own environment.
Make sure that TRANS_MIN/MAX and WORKSPACE_MIN/MAX are correctly set in the camera coordinate frame, or you may obtain meaningless output. We recommend expanding TRANS_MIN/MAX by 0.15 - 0.3 meters on both sides of the actual translation range to accommodate spatial data augmentation. You can follow command_train.sh for data visualization and parameter checks.
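The recommended expansion of the translation bounds can be sketched as follows; the bound values below are hypothetical placeholders, not the ones from dataset/constants.py, so substitute the range measured in your own setup:

```python
import numpy as np

# Hypothetical workspace translation bounds in the camera frame (meters);
# replace these with the actual range measured in your own environment.
actual_trans_min = np.array([-0.35, -0.25, 0.30])
actual_trans_max = np.array([0.35, 0.25, 0.80])

# Expand each side by a margin (recommended 0.15 - 0.3 m) so that
# spatially augmented samples still fall inside TRANS_MIN/MAX.
MARGIN = 0.2
TRANS_MIN = actual_trans_min - MARGIN
TRANS_MAX = actual_trans_max + MARGIN
```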
Please calibrate the camera(s) with the robot before data collection and evaluation to ensure correct spatial transformations between camera(s) and the robot. Please refer to calibration guide for more details.
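A correct calibration lets you map camera-frame observations into the robot frame with a single homogeneous transform. As a generic sketch (not this codebase's API; the function name and the identity-plus-translation example are illustrative), applying a 4x4 extrinsic matrix to a point cloud looks like:

```python
import numpy as np

def camera_to_base(points_cam, extrinsic):
    """Transform Nx3 points from the camera frame to the target frame
    using a 4x4 homogeneous extrinsic matrix (camera -> target)."""
    points_h = np.concatenate(
        [points_cam, np.ones((points_cam.shape[0], 1))], axis=1)  # Nx4
    return (extrinsic @ points_h.T).T[:, :3]

# Illustrative extrinsic: identity rotation plus a pure translation.
E = np.eye(4)
E[:3, 3] = [1.0, 2.0, 3.0]
transformed = camera_to_base(np.zeros((1, 3)), E)
```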
Please follow the data collection guide to collect data. We provide sample data for each task on Google Drive and Baidu Netdisk (code: 643b). You may need to adjust dataset/pretrain.py to accommodate different data formats. The sample data are organized as follows:
collect_cups
|-- calib/
| |-- [calib timestamp 1]/
| | |-- extrinsics.npy # extrinsics (camera to marker)
| | |-- intrinsics.npy # intrinsics
| | `-- tcp.npy # tcp pose of calibration
| `-- [calib timestamp 2]/ # similar calib structure
`-- train/
|-- [episode identifier 1]/
| |-- metadata.json # metadata
| |-- timestamp.txt # calib timestamp
| |-- cam_[serial_number 1]/
| | |-- color # RGB
| | | |-- [timestamp 1].png
| | | |-- [timestamp 2].png
| | | |-- ...
| | | `-- [timestamp T].png
| | |-- depth # depth
| | | |-- [timestamp 1].png
| | | |-- [timestamp 2].png
| | | |-- ...
| | | `-- [timestamp T].png
| | `-- lowdim
| | | |-- tcp.npz # tcp
| | | | |-- [timestamp 1]
| | | | |-- [timestamp 2]
| | | | |-- ...
| | | | `-- [timestamp T]
| | | `-- pos.npz # finger pose
| | | | |-- [timestamp 1]
| | | | |-- [timestamp 2]
| | | | |-- ...
| | | | `-- [timestamp T]
| `-- cam_[serial_number 2]/ # similar camera structure
`-- [episode identifier 2]/ # similar episode structure
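Given this layout, paired RGB/depth frames for one camera of one episode can be enumerated by timestamp. The sketch below uses only pathlib and assumes the directory and file names shown above; it is an illustration, not a loader from this codebase:

```python
from pathlib import Path

def list_frames(episode_dir, serial_number):
    """Collect sorted (timestamp, color_path, depth_path) triples for one
    camera of one episode, assuming the layout shown above."""
    cam_dir = Path(episode_dir) / f"cam_{serial_number}"
    frames = []
    for color_path in sorted((cam_dir / "color").glob("*.png")):
        timestamp = color_path.stem
        depth_path = cam_dir / "depth" / f"{timestamp}.png"
        # Keep only timestamps that have both a color and a depth frame.
        if depth_path.exists():
            frames.append((timestamp, color_path, depth_path))
    return frames
```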
Please follow the training guide when working with this codebase.
The guide provides a step-by-step example for Task 1: Pull Tissue, including:
- Preprocessing the dataset
- Training the VQ-VAE
- Training VQ-Rise
This covers the full workflow from raw data to model training.
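At the core of the VQ-VAE stage is a codebook lookup that maps each continuous hand state to its nearest learned code. The NumPy sketch below illustrates that lookup only; the shapes, names, and toy codebook are assumptions for illustration, not this codebase's API:

```python
import numpy as np

def quantize(hand_states, codebook):
    """Map each D-dim hand state to its nearest codebook entry.

    hand_states: (N, D) array; codebook: (K, D) array.
    Returns (indices, quantized) where quantized = codebook[indices].
    """
    # Squared Euclidean distance between every state and every code: (N, K).
    dists = ((hand_states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]
```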
Here we provide sample real-world evaluation code based on our hardware setup (a Flexiv Rizon 4 robotic arm, an OyMotion RoHand dexterous hand, and an Intel RealSense camera). For other hardware settings, please follow the deployment guide to modify the evaluation script.
Modify the arguments in scripts/command_eval_rise_vae.sh, then

conda activate rise
bash command_eval_rise_vae.sh

Here are the explanations of the evaluation arguments:
- --ckpt [ckpt_path]: the checkpoint to be evaluated.
- --calib [calib_dir]: the calibration directory.
- --num_inference_step [Ni]: how often to perform a policy inference, measured in steps.
- --max_steps [Nstep]: the maximum number of steps in an evaluation.
- --num_action [Nstep]: the number of steps predicted at each inference.
- --vae_codebook: the path to the VQ-VAE codebook.
- --robot_ip: the IP address of the robot arm.
- --robot_serial_number: the serial number of the robot arm.
- --com_port: the communication port of the dexterous hand.
- --camera_ids: the camera IDs.
- --vis: enable Open3D visualization after every inference. This visualization is blocking and will pause the evaluation process until the window is closed.
- --ensemble_mode [mode]: the temporal ensemble mode.
  - [mode] = "new": use the newest predicted action in this step.
  - [mode] = "old": use the oldest predicted action in this step.
  - [mode] = "avg": use the average of the predicted actions in this step.
  - [mode] = "act": use a weighted average of the predicted actions in this step, with weights set following ACT.
  - [mode] = "hato": use a weighted average of the predicted actions in this step, with weights set following HATO.
- The other arguments remain the same as in the training script.
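For the "act" ensemble mode, ACT describes an exponential weighting scheme over all actions predicted for the current step, with the oldest prediction weighted highest. The sketch below follows that scheme; the function name and the temperature value m are illustrative assumptions, not necessarily what this codebase uses:

```python
import numpy as np

def ensemble_act(predictions, m=0.01):
    """ACT-style temporal ensembling: exponentially weighted average of all
    actions predicted for the current step. Weights are w_i = exp(-m * i),
    with i = 0 for the oldest prediction, so older predictions dominate."""
    predictions = np.asarray(predictions)  # (num_predictions, action_dim)
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return (weights[:, None] * predictions).sum(axis=0)
```

With m = 0 this reduces to the "avg" mode; larger m moves it toward the "old" mode.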
- Our codebase is built upon RISE.
- Our diffusion module is adapted from Diffusion Policy. This part is under MIT License.
- Our transformer module is adapted from ACT, which used DETR in their implementations. The DETR part is under APACHE 2.0 License.
- Our Minkowski ResNet observation encoder is adapted from the examples of the MinkowskiEngine repository. This part is under MIT License.
- Our temporal ensemble implementation is inspired by the recent HATO project.
@article{feng2025learning,
title = {Learning Dexterous Manipulation with Quantized Hand State},
author = {Feng, Ying and Fang, Hongjie and He, Yinong and Chen, Jingjing and Wang, Chenxi and He, Zihao and Liu, Ruonan and Lu, Cewu},
journal = {arXiv preprint arXiv:2509.17450},
year = {2025}
}
@inproceedings{wang2024rise,
title = {RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective},
author = {Wang, Chenxi and Fang, Hongjie and Fang, Hao-Shu and Lu, Cewu},
booktitle = {2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2024},
pages = {2870-2877},
doi = {10.1109/IROS58592.2024.10801678}
}

DQ-RISE is licensed under CC BY-NC-SA 4.0.