RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

CVPR 2026 Oral

Haiyang Mei    Qiming Huang    Hai Ci    Mike Zheng Shou✉️
Show Lab, National University of Singapore


We introduce RobotSeg, the first foundation model for robot segmentation that: 🌈

  1. supports both images and videos,
  2. enables fine-grained segmentation of the robot arm, gripper, and whole robot, and
  3. offers promptable capabilities for flexible editing and annotation.

Table of Contents
🚀 1. Introduction
⚡️ 2. Key Challenges
🎥 3. VRS Dataset
✨ 4. RobotSeg Model
🏆 5. State-of-the-Art Performance
🦾 6. Applications of RobotSeg
🛠️ 7. Getting Started
      • 7.1 Installation
      • 7.2 Download
      • 7.3 Demo Use
      • 7.4 Testing
      • 7.5 Evaluation
      • 7.6 Training
🙌 8. Acknowledgments
📚 9. Citation

🚀 1. Introduction

Existing segmentation models such as SAM 1/2/3 are powerful, yet surprisingly ⚡️ they still struggle to segment robots reliably.

We are thrilled to introduce RobotSeg ✨, the first foundation model and dataset designed specifically for segmenting robots in image and video.

RobotSeg delivers accurate and consistent robot masks that support:
🤖 visual servoing for VLA systems
🧩 robot-centric data augmentation
🏗️ real-to-sim transfer
🛡️ safety monitoring for collision warning

⚡️ 2. Key Challenges

RobotSeg targets four challenges that make robot segmentation uniquely difficult:

  • Embodiment Diversity – robots vary dramatically in shape, size, and articulation
  • Appearance Ambiguity – their visual patterns often blend with cluttered backgrounds
  • Structural Complexity – articulated arm links, joints, and grippers form intricate structures
  • Rapid Shape Changes – fast manipulation causes large geometric and motion variations

🎥 3. VRS Dataset

To support comprehensive evaluation and training, we construct VRS, the first video robot segmentation benchmark:
📌 2,812 videos (138,707 frames)
📌 10 robot embodiments (Franka, Fanuc Mate, UR5, Kuka iiwa, Google Everyday Robot, MobileALOHA, xArm, WidowX, Sawyer, Hello Stretch)
📌 Fine-grained masks for arm, gripper, and whole robot

✨ 4. RobotSeg Model

Built upon SAM 2, RobotSeg introduces three robot-centric innovations:

  • Structure-Enhanced Memory Associator (SEMA) – injects robot structural cues into memory matching to maintain stable, structure-preserving masks across video frames
  • Robot Prompt Generator (RPG) – produces semantic robot prompts that guide segmentation without requiring manual click or box inputs
  • Label-Efficient Training (LET) – supervises the model using only the first-frame ground-truth mask through cycle, semantic, and patch consistency losses

🏆 5. State-of-the-Art Performance

🔥 Leading performance over robot-specific baselines (RoVi-Aug, RoboEngine)
🔥 Outperforms language-conditioned approaches including CLIPSeg, LISA, EVF-SAM, VideoLISA, and SAM 3
🔥 Surpasses SAM 2.1 across prompt settings (automatic, 1-click, 3-click, box, online-interactive)
🔥 Lightweight: only 41.3M parameters, running at over 10 FPS during inference
🔥 Robust to 10 diverse robot embodiments

5.1 Quantitative Comparison

The table below summarizes the quantitative comparisons on the RoboEngine (image) and VRS (video) datasets across diverse settings: automatic (AU), 1-click (1C), 3-click (3C), bounding-box (BB), and online-interactive (OI). "–" denotes that a method does not support the setting. RobotSeg delivers the best segmentation performance while maintaining competitive computational efficiency.

5.2 Qualitative Comparison

(a) Comparison against image-level robot segmentation method RoboEngine

(b) Comparison against general promptable segmentation method SAM 2.1

(c) Comparison against concept segmentation method SAM 3

(d) Comparison under point or box prompts

🦾 6. Applications of RobotSeg

Accurate and consistent robot masks from RobotSeg enable a range of downstream applications:

6.1 Robot-Centric Data Augmentation

Precise robot masks allow compositing the robot into new environments, generating diverse visual conditions for robust policy learning and sim-to-real adaptation.
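As a concrete sketch of such compositing (assuming binary masks and images of matching size; not the repository's own augmentation pipeline):

```python
import numpy as np

def composite_robot(frame, mask, background):
    """Paste the robot pixels (where mask is nonzero) from `frame`
    onto a new `background` image.

    frame, background: H x W x 3 uint8 arrays; mask: H x W binary array."""
    robot = mask.astype(bool)[..., None]  # broadcast mask over color channels
    return np.where(robot, frame, background)
```

Swapping in different `background` images for the same trajectory yields the diverse visual conditions used for robust policy learning.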

6.2 Robot 3D Reconstruction

RobotSeg provides accurate robot masks that can be used by modern 3D reconstruction pipelines (e.g., SAM-3D Objects) to generate high-quality robot geometry for digital-twin modeling.

🛠 7. Getting Started

7.1 Installation

Our implementation uses python==3.11, torch==2.5.1, and torchvision==0.20.1. You can install RobotSeg on a GPU machine with:

conda create -n robotseg python=3.11
conda activate robotseg
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e ".[dev]"
python setup.py build_ext --inplace
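A quick sanity check that the pinned packages are importable from the new environment (a convenience snippet, not part of the repository):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# After a successful installation this should print an empty list:
print(missing_packages(["torch", "torchvision", "torchaudio"]))
```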

7.2 Download

7.3 Demo Use

cd test
python demo.py

7.4 Testing

(a) Prepare mask_gt_info

If you test on the VRS or RoboEngine dataset, you can skip this step, since the required mask_gt_info is already included in the released dataset and can be used directly. This step is mainly needed when testing on your own custom dataset.

python tools/save_gt_mask_multiprocess.py

(b) Auto / Semi inference

cd test
sh infer_auto_semi.sh

(c) Interactive inference

sh infer_interactive.sh

7.5 Evaluation

sh eval_auto_semi.sh
sh eval_interactive.sh
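The evaluation scripts report standard segmentation metrics; the core quantity, region IoU, can be sketched as follows (assuming binary numpy masks; simplified relative to the repository's scripts):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```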

7.6 Training

(a) Prepare pseudo mask for video training

If you train on the VRS dataset, you can skip this step, since the required pseudo mask is already included in the released dataset and can be used directly. This step is mainly needed when training on your own custom video dataset.

curl -Ls https://micro.mamba.pm/install.sh | bash
source ~/.bashrc
cd dinov3
micromamba env create -f conda.yaml
micromamba activate dinov3
python generate_pseudo_masks.py

(b) Start training

sh train.sh

🙌 8. Acknowledgments

RobotSeg is built upon SAM 2 and DINOv3.

📚 9. Citation

If you find our work useful, please consider citing our paper:

@article{mei2025robotseg,
      title={RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video}, 
      author={Mei, Haiyang and Huang, Qiming and Ci, Hai and Shou, Mike Zheng},
      journal={arXiv:2511.22950},
      year={2025}
}

