CVPR 2026 Oral
Haiyang Mei
Qiming Huang
Hai Ci
Mike Zheng Shou✉️
Show Lab, National University of Singapore
We introduce RobotSeg, the first foundation model for robot segmentation that: 🌈
- supports both images and videos,
- enables fine-grained segmentation of the robot arm, gripper, and whole robot, and
- offers promptable capabilities for flexible editing and annotation.
Table of Contents
🚀 1. Introduction
⚡️ 2. Key Challenges
🎥 3. VRS Dataset
✨ 4. RobotSeg Model
🏆 5. State-of-the-Art Performance
🦾 6. Applications of RobotSeg
🛠️ 7. Getting Started
• 7.1 Installation
• 7.2 Download
• 7.3 Demo Use
• 7.4 Testing
• 7.5 Evaluation
• 7.6 Training
🙌 8. Acknowledgments
📚 9. Citation
## 🚀 1. Introduction
Existing segmentation models such as SAM 1/2/3 are powerful, yet surprisingly ⚡️ they still struggle to segment robots reliably.
We are thrilled to introduce RobotSeg ✨, the first foundation model and dataset designed specifically for segmenting robots in image and video.
RobotSeg delivers accurate and consistent robot masks that support:
🤖 visual servoing for VLA systems
🧩 robot-centric data augmentation
🏗️ real-to-sim transfer
🛡️ safety monitoring for collision warning
## ⚡️ 2. Key Challenges
RobotSeg targets four challenges that make robot segmentation uniquely difficult:
- Embodiment Diversity – robots vary dramatically in shape, size, and articulation
- Appearance Ambiguity – their visual patterns often blend with cluttered backgrounds
- Structural Complexity – articulated arm links, joints, and grippers form intricate structures
- Rapid Shape Changes – fast manipulation causes large geometric and motion variations
## 🎥 3. VRS Dataset
To support comprehensive evaluation and training, we construct VRS, the first video robot segmentation benchmark:
📌 2,812 videos (138,707 frames)
📌 10 robot embodiments (Franka, Fanuc Mate, UR5, KUKA iiwa, Google Everyday Robot, MobileALOHA, xArm, WidowX, Sawyer, Hello Stretch)
📌 Fine-grained masks for arm, gripper, and whole robot
## ✨ 4. RobotSeg Model
Built upon SAM 2, RobotSeg introduces three robot-centric innovations:
✨ Structure-Enhanced Memory Associator (SEMA): injects robot structural cues into memory matching to maintain stable, structure-preserving masks across video frames
✨ Robot Prompt Generator (RPG): produces semantic robot prompts that guide segmentation without requiring manual clicks or boxes
✨ Label-Efficient Training (LET): supervises the model using only the first-frame ground-truth mask through cycle, semantic, and patch consistency losses
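As a rough, self-contained sketch of the LET idea (not the released implementation — the tensor names, loss forms, and weights below are all assumptions), the objective can be pictured as a supervised first-frame term plus cycle, semantic, and patch consistency terms:

```python
import torch
import torch.nn.functional as F

def label_efficient_loss(pred_first, gt_first, pred_fwd, pred_cycle,
                         feat_a, feat_b,
                         w_cycle=1.0, w_sem=0.5, w_patch=0.5):
    """Illustrative LET-style objective (hypothetical names/weights).

    pred_first, pred_fwd, pred_cycle: B x 1 x H x W mask logits.
    gt_first: B x 1 x H x W ground-truth mask for the first frame only.
    feat_a, feat_b: B x D pooled robot-region features from two frames.
    """
    # Supervised term: only the first frame carries a ground-truth mask.
    sup = F.binary_cross_entropy_with_logits(pred_first, gt_first)
    # Cycle consistency: propagating the first-frame mask forward and
    # then back should reproduce the original first-frame mask.
    cycle = F.binary_cross_entropy_with_logits(pred_cycle, gt_first)
    # Semantic consistency: robot features should stay similar across frames.
    sem = 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
    # Patch consistency: coarse (pooled) mask statistics should agree
    # between the forward prediction and the cycled prediction.
    patch = F.l1_loss(F.avg_pool2d(torch.sigmoid(pred_fwd), 8),
                      F.avg_pool2d(torch.sigmoid(pred_cycle), 8))
    return sup + w_cycle * cycle + w_sem * sem + w_patch * patch
```

All four terms are non-negative, so the combined loss is a stable training signal even though only one frame is labeled.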
## 🏆 5. State-of-the-Art Performance
🔥 Leading performance over robot-specific baselines (RoVi-Aug, RoboEngine)
🔥 Outperforms language-conditioned approaches including CLIPSeg, LISA, EVF-SAM, VideoLISA, and SAM 3
🔥 Surpasses SAM 2.1 across prompt settings (automatic, 1-click, 3-click, box, online-interactive)
🔥 Lightweight: only 41.3M parameters, running at over 10 FPS during inference
🔥 Robust to 10 diverse robot embodiments
The table below summarizes the quantitative comparisons on the RoboEngine (image) and VRS (video) datasets across diverse prompt settings: automatic (AU), 1-click (1C), 3-click (3C), bounding-box (BB), and online-interactive (OI). "–" denotes that a method does not support the given setting. RobotSeg delivers the best segmentation performance while maintaining competitive computational efficiency.
(a) Comparison against image-level robot segmentation method RoboEngine
(b) Comparison against general promptable segmentation method SAM 2.1
(c) Comparison against concept segmentation method SAM 3
(d) Comparison under point or box prompts
## 🦾 6. Applications of RobotSeg
RobotSeg's accurate and consistent robot masks enable a range of downstream applications:
Precise robot masks allow compositing the robot into new environments, generating diverse visual conditions for robust policy learning and sim-to-real adaptation.
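With a predicted mask in hand, this kind of compositing is a one-liner. The sketch below (plain NumPy; the function name and array layout are our own, not part of the RobotSeg API) pastes the masked robot pixels onto a new background:

```python
import numpy as np

def composite_robot(frame, mask, background):
    """Copy robot pixels (mask == 1) from `frame` onto `background`.

    frame, background: H x W x 3 uint8 images of the same size.
    mask: H x W binary array where 1 marks the robot.
    """
    m = mask.astype(bool)[..., None]   # H x W x 1, broadcasts over channels
    return np.where(m, frame, background)
```

Replacing `background` with rendered or randomized scenes yields augmented training frames in which only the robot is preserved.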
RobotSeg provides accurate robot masks that can be used by modern 3D reconstruction pipelines (e.g., SAM-3D Objects) to generate high-quality robot geometry for digital-twin modeling.
## 🛠️ 7. Getting Started
### 7.1 Installation
Our implementation uses python==3.11, torch==2.5.1, and torchvision==0.20.1. You can install RobotSeg on a GPU machine using:

```shell
conda create -n robotseg python=3.11
conda activate robotseg
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e ".[dev]"
python setup.py build_ext --inplace
```
### 7.2 Download
- Checkpoint
- Dataset
  - VRS [ OneDrive ] [ BaiduDisk ]
  - RoboEngine [ OneDrive ] [ BaiduDisk ] (reorganized from the original RoboEngine dataset with a unified folder structure for easier use; if you use it, remember to cite the RoboEngine paper)
### 7.3 Demo Use
```shell
cd test
python demo.py
```
### 7.4 Testing
(a) Prepare mask_gt_info
If you test on the VRS or RoboEngine dataset, you can skip this step: the required mask_gt_info is already included in the released dataset and can be used directly. This step is mainly needed when testing on your own custom dataset.

```shell
python tools/save_gt_mask_multiprocess.py
```
(b) Automatic / semi-automatic inference
```shell
cd test
sh infer_auto_semi.sh
```
(c) Interactive inference
```shell
sh infer_interactive.sh
```
### 7.5 Evaluation
```shell
sh eval_auto_semi.sh
sh eval_interactive.sh
```
### 7.6 Training
(a) Prepare pseudo masks for video training
If you train on the VRS dataset, you can skip this step: the required pseudo masks are already included in the released dataset and can be used directly. This step is mainly needed when training on your own custom video dataset.

```shell
curl -Ls https://micro.mamba.pm/install.sh | bash
source ~/.bashrc
cd dinov3
micromamba env create -f conda.yaml
micromamba activate dinov3
python generate_pseudo_masks.py
```
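As a rough, self-contained illustration of the pseudo-mask idea (not the actual logic in generate_pseudo_masks.py — the function and the clustering recipe here are assumptions), patch-level features such as DINOv3's can be split into a foreground/background pair with a tiny 2-means:

```python
import numpy as np

def two_means_patch_labels(feats, iters=10):
    """Split patch features (N x D) into two clusters via k-means (k=2).

    Illustrative only: the released pipeline derives pseudo masks from
    DINOv3 features with its own logic. Returns a 0/1 label per patch;
    reshaping the labels to the patch grid gives a coarse binary mask.
    """
    # Deterministic init: the two patches with extreme feature sums.
    sums = feats.sum(axis=1)
    centers = feats[[sums.argmin(), sums.argmax()]].astype(float)
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        # Assign each patch to its nearest center (squared L2 distance).
        dists = ((feats[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned patches.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels
```

In practice the cluster overlapping the robot region would be kept as the pseudo mask and refined before use as a training target.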
(b) Start training
```shell
sh train.sh
```
## 🙌 8. Acknowledgments
RobotSeg is built upon SAM 2 and DINOv3.
## 📚 9. Citation
If you find our work useful, please consider citing our paper:

```bibtex
@article{mei2025robotseg,
  title={RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video},
  author={Mei, Haiyang and Huang, Qiming and Ci, Hai and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2511.22950},
  year={2025}
}
```