CVPR 2026 Oral
Haiyang Mei
Qiming Huang
Hai Ci
Mike Zheng Shou✉️
Show Lab, National University of Singapore
We introduce RobotSeg, the first foundation model for robot segmentation that: 🌈
- supports both images and videos,
- enables fine-grained segmentation of the robot arm, gripper, and whole robot, and
- offers promptable capabilities for flexible editing and annotation.
Table of Contents
🚀 1. Introduction
⚡️ 2. Key Challenges
🎥 3. VRS Dataset
✨ 4. RobotSeg Model
🏆 5. State-of-the-Art Performance
🦾 6. Applications of RobotSeg
🛠️ 7. Getting Started
• 7.1 Installation
• 7.2 Download
• 7.3 Demo Use
• 7.4 Testing
• 7.5 Evaluation
• 7.6 Training
🙌 8. Acknowledgments
📚 9. Citation
## 🚀 1. Introduction
Existing segmentation models such as SAM 1/2/3 are powerful, yet surprisingly ⚡️ they still struggle to segment robots reliably.
We are thrilled to introduce RobotSeg ✨, the first foundation model and dataset designed specifically for segmenting robots in image and video.
RobotSeg delivers accurate and consistent robot masks that support:
🤖 visual servoing for VLA systems
🧩 robot-centric data augmentation
🏗️ real-to-sim transfer
🛡️ safety monitoring for collision warning
## ⚡️ 2. Key Challenges
RobotSeg targets four challenges that make robot segmentation uniquely difficult:
- Embodiment Diversity – robots vary dramatically in shape, size, and articulation
- Appearance Ambiguity – their visual patterns often blend with cluttered backgrounds
- Structural Complexity – articulated arm links, joints, and grippers form intricate structures
- Rapid Shape Changes – fast manipulation causes large geometric and motion variations
## 🎥 3. VRS Dataset
To support comprehensive evaluation and training, we construct VRS, the first video robot segmentation benchmark:
📌 2,812 videos (138,707 frames)
📌 10 robot embodiments (Franka, Fanuc Mate, UR5, KUKA iiwa, Google Everyday Robot, MobileALOHA, xArm, WidowX, Sawyer, Hello Stretch)
📌 Fine-grained masks for arm, gripper, and whole robot
## ✨ 4. RobotSeg Model
Built upon SAM 2, RobotSeg introduces three robot-centric innovations:
✨ Structure-Enhanced Memory Associator (SEMA): injects robot structural cues into memory matching to maintain stable, structure-preserving masks across video frames
✨ Robot Prompt Generator (RPG): produces semantic robot prompts that guide segmentation without requiring manual clicks or boxes
✨ Label-Efficient Training (LET): supervises the model using only the first-frame ground-truth mask through cycle, semantic, and patch consistency losses
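As a rough, self-contained sketch of the LET idea (not the released implementation — the tensor names, loss forms, and weights below are all assumptions), the objective can be pictured as a supervised first-frame term plus cycle, semantic, and patch consistency terms:

```python
import torch
import torch.nn.functional as F

def label_efficient_loss(pred_first, gt_first, pred_fwd, pred_cycle,
                         feat_a, feat_b,
                         w_cycle=1.0, w_sem=0.5, w_patch=0.5):
    """Illustrative LET-style objective (hypothetical names/weights).

    pred_first, pred_fwd, pred_cycle: B x 1 x H x W mask logits.
    gt_first: B x 1 x H x W ground-truth mask for the first frame only.
    feat_a, feat_b: B x D pooled robot-region features from two frames.
    """
    # Supervised term: only the first frame carries a ground-truth mask.
    sup = F.binary_cross_entropy_with_logits(pred_first, gt_first)
    # Cycle consistency: propagating the first-frame mask forward and
    # then back should reproduce the original first-frame mask.
    cycle = F.binary_cross_entropy_with_logits(pred_cycle, gt_first)
    # Semantic consistency: robot features should stay similar across frames.
    sem = 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
    # Patch consistency: coarse (pooled) mask statistics should agree
    # between the forward prediction and the cycled prediction.
    patch = F.l1_loss(F.avg_pool2d(torch.sigmoid(pred_fwd), 8),
                      F.avg_pool2d(torch.sigmoid(pred_cycle), 8))
    return sup + w_cycle * cycle + w_sem * sem + w_patch * patch
```

All four terms are non-negative, so the combined loss is a stable training signal even though only one frame is labeled.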
## 🏆 5. State-of-the-Art Performance
🔥 Leading performance over robot-specific baselines (RoVi-Aug, RoboEngine)
🔥 Outperforms language-conditioned approaches including CLIPSeg, LISA, EVF-SAM, VideoLISA, and SAM 3
🔥 Surpasses SAM 2.1 across prompt settings (automatic, 1-click, 3-click, box, online-interactive)
🔥 Lightweight: only 41.3M parameters, running at over 10 FPS during inference
🔥 Robust to 10 diverse robot embodiments
The table below summarizes the quantitative comparisons on the RoboEngine (image) and VRS (video) datasets across diverse prompt settings: automatic (AU), 1-click (1C), 3-click (3C), bounding-box (BB), and online-interactive (OI). "–" denotes that a method does not support the given setting. RobotSeg delivers the best segmentation performance while maintaining competitive computational efficiency.
(a) Comparison against image-level robot segmentation method RoboEngine
(b) Comparison against general promptable segmentation method SAM 2.1
(c) Comparison against concept segmentation method SAM 3
(d) Comparison under point or box prompts
## 🦾 6. Applications of RobotSeg
RobotSeg's accurate and consistent robot masks enable a range of downstream applications:
Precise robot masks allow compositing the robot into new environments, generating diverse visual conditions for robust policy learning and sim-to-real adaptation.
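With a predicted mask in hand, this kind of compositing is a one-liner. The sketch below (plain NumPy; the function name and array layout are our own, not part of the RobotSeg API) pastes the masked robot pixels onto a new background:

```python
import numpy as np

def composite_robot(frame, mask, background):
    """Copy robot pixels (mask == 1) from `frame` onto `background`.

    frame, background: H x W x 3 uint8 images of the same size.
    mask: H x W binary array where 1 marks the robot.
    """
    m = mask.astype(bool)[..., None]   # H x W x 1, broadcasts over channels
    return np.where(m, frame, background)
```

Replacing `background` with rendered or randomized scenes yields augmented training frames in which only the robot is preserved.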
RobotSeg provides accurate robot masks that can be used by modern 3D reconstruction pipelines (e.g., SAM-3D Objects) to generate high-quality robot geometry for digital-twin modeling.
## 🛠️ 7. Getting Started
### 7.1 Installation
Our implementation uses python==3.11, torch==2.5.1, and torchvision==0.20.1. You can install RobotSeg on a GPU machine using:

```shell
conda create -n robotseg python=3.11
conda activate robotseg
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e ".[dev]"
python setup.py build_ext --inplace
```
### 7.2 Download
- Checkpoint
- Dataset
  - VRS [ OneDrive ] [ BaiduDisk ]
  - RoboEngine [ OneDrive ] [ BaiduDisk ] (reorganized from the original RoboEngine dataset with a unified folder structure for easier use; if you use it, remember to cite the RoboEngine paper)
### 7.3 Demo Use
```shell
cd test
python demo.py
```
### 7.4 Testing
(a) Prepare mask_gt_info
If you test on the VRS or RoboEngine dataset, you can skip this step: the required mask_gt_info is already included in the released dataset and can be used directly. This step is mainly needed when testing on your own custom dataset.

```shell
python tools/save_gt_mask_multiprocess.py
```
(b) Automatic / semi-automatic inference
```shell
cd test
sh infer_auto_semi.sh
```
(c) Interactive inference
```shell
sh infer_interactive.sh
```
### 7.5 Evaluation
```shell
sh eval_auto_semi.sh
sh eval_interactive.sh
```
### 7.6 Training
(a) Prepare pseudo masks for video training
If you train on the VRS dataset, you can skip this step: the required pseudo masks are already included in the released dataset and can be used directly. This step is mainly needed when training on your own custom video dataset.

```shell
curl -Ls https://micro.mamba.pm/install.sh | bash
source ~/.bashrc
cd dinov3
micromamba env create -f conda.yaml
micromamba activate dinov3
python generate_pseudo_masks.py
```
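As a rough, self-contained illustration of the pseudo-mask idea (not the actual logic in generate_pseudo_masks.py — the function and the clustering recipe here are assumptions), patch-level features such as DINOv3's can be split into a foreground/background pair with a tiny 2-means:

```python
import numpy as np

def two_means_patch_labels(feats, iters=10):
    """Split patch features (N x D) into two clusters via k-means (k=2).

    Illustrative only: the released pipeline derives pseudo masks from
    DINOv3 features with its own logic. Returns a 0/1 label per patch;
    reshaping the labels to the patch grid gives a coarse binary mask.
    """
    # Deterministic init: the two patches with extreme feature sums.
    sums = feats.sum(axis=1)
    centers = feats[[sums.argmin(), sums.argmax()]].astype(float)
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        # Assign each patch to its nearest center (squared L2 distance).
        dists = ((feats[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned patches.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels
```

In practice the cluster overlapping the robot region would be kept as the pseudo mask and refined before use as a training target.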
(b) Start training
```shell
sh train.sh
```
## 🙌 8. Acknowledgments
RobotSeg is built upon SAM 2 and DINOv3.
## 📚 9. Citation
If you find our work useful, please consider citing our paper:

```bibtex
@article{mei2025robotseg,
  title={RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video},
  author={Mei, Haiyang and Huang, Qiming and Ci, Hai and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2511.22950},
  year={2025}
}
```