GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

1PRIOR @ Allen Institute for AI, 2Boston University, 3University of Washington, 4Vercept, 5UT Austin

Abstract

We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model, and PRISM, a large-scale synthetic dataset used to train it. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body or lid.

Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from PRISM, a novel large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects.

In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success rate on complex tasks, compared to 35% for the next best alternative. GraspMolmo also predicts semantically correct bimanual grasps zero-shot.

We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.


Overview of the PRISM dataset and GraspMolmo model

GraspMolmo is a generalizable open-vocabulary task-oriented grasping model that predicts semantically appropriate grasps given a natural language instruction.


Data Generation

PRISM Dataset

Purpose-driven Robotic Interaction in Scene Manipulation (PRISM) is a large-scale synthetic dataset for Task-Oriented Grasping featuring cluttered environments and diverse, realistic task descriptions. We use 2,365 object instances from ShapeNet-Sem, along with stable grasps from ACRONYM, to compose 10,000 unique and diverse scenes. For each scene we capture 10 views, and each view contains multiple tasks to be performed, yielding 379k task-grasp samples in total.
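To make the scene → view → task hierarchy concrete, the sketch below shows one way such samples could be assembled. This is a minimal illustration only: the dataclass fields, attribute names, and the assemble_samples helper are our own assumptions and do not reflect the released PRISM format.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TaskGraspSample:
    """One task-grasp sample: an RGB-D view, a task instruction, and the
    task-appropriate grasp expressed in that view's camera frame."""
    rgb: np.ndarray             # (H, W, 3) rendered color image
    depth: np.ndarray           # (H, W) rendered depth map
    instruction: str            # natural-language task, e.g. "pour me some tea"
    grasp_pose_cam: np.ndarray  # (4, 4) grasp pose in the camera frame

def assemble_samples(scenes) -> List[TaskGraspSample]:
    """Flatten the scene -> view -> task hierarchy into individual samples.
    The `scenes` structure and its attributes are hypothetical."""
    samples = []
    for scene in scenes:                      # ~10,000 composed scenes
        for view in scene.views:              # 10 captured views per scene
            for task in view.visible_tasks:   # several valid tasks per view
                samples.append(TaskGraspSample(
                    rgb=view.rgb,
                    depth=view.depth,
                    instruction=task.instruction,
                    # express the world-frame grasp in this view's camera frame
                    grasp_pose_cam=view.world_to_cam @ task.grasp_pose_world,
                ))
    return samples  # ~379k samples in total
```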

Overview of the PRISM data generation pipeline

TaskGrasp-Image

TaskGrasp is a standard benchmark for Task-Oriented Grasping with manually annotated grasps. However, TaskGrasp only contains segmented partial object point clouds, which (a) fail to capture the complexity of real-world scenes, and (b) include fusion and segmentation artifacts. The use of point clouds also makes it difficult to leverage models that use RGB input, such as VLMs.

To address these issues, we derive TaskGrasp-Image from TaskGrasp. TaskGrasp-Image carries the same ground-truth annotations as TaskGrasp, but the grasps have been transformed into the image frame, allowing models to use RGB input. We perform this transformation with point-cloud registration, aligning each object point cloud with the segmented object points from each RGB-D frame. Since the image frames are, by construction, real captured data, TaskGrasp-Image does not suffer from fusion or segmentation artifacts. As a result, TaskGrasp-Image is a realistic and representative benchmark for Task-Oriented Grasping.
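As a rough illustration of the kind of alignment involved, the sketch below registers the fused object point cloud to the back-projected frame points with Open3D's point-to-point ICP and then carries the annotated grasp poses into the camera frame. The function name, its arguments, and the 2 cm correspondence threshold are our own assumptions, not the exact procedure used to build TaskGrasp-Image.

```python
import numpy as np
import open3d as o3d

def register_and_transform_grasps(object_pts, frame_pts, grasps_obj, init=np.eye(4)):
    """Align the fused object cloud (TaskGrasp frame) to the segmented object
    points from one RGB-D frame, then move the annotated grasps into that
    frame's camera coordinates.

    object_pts : (N, 3) points of the fused object cloud
    frame_pts  : (M, 3) segmented object points back-projected from the RGB-D frame
    grasps_obj : (G, 4, 4) annotated grasp poses in the object frame
    """
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(object_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(frame_pts))

    # Point-to-point ICP; a coarse global registration could seed `init`
    # when the initial misalignment is large.
    result = o3d.pipelines.registration.registration_icp(
        source, target,
        max_correspondence_distance=0.02,  # assumed 2 cm threshold
        init=init,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    T_cam_obj = result.transformation  # object frame -> camera frame

    # Apply the same rigid transform to every annotated grasp pose.
    grasps_cam = np.einsum("ij,gjk->gik", T_cam_obj, grasps_obj)
    return grasps_cam, T_cam_obj
```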

A visualization of the segmented pointcloud (red) of a pot registered to one of the input RGB-D views. By performing this registration, we can transform the annotated grasps (yellow) into the camera frame.


Evaluation

Experimental Results

We evaluate GraspMolmo against the most relevant baselines on three benchmark settings of increasing complexity and real-world applicability: a benchmark from the literature with simple objects and minimal visual diversity, a synthetic held-out set of fully composed scenes with unseen objects, and real-world transfer scenarios. The performance gap between methods widens notably as the evaluations grow more complex, revealing differences in capability that are not apparent on basic benchmarks.

GraspMolmo outperforms all baselines across every evaluation, with the largest margin in the real-world transfer experiments. This demonstrates its strong generalization and its ability to handle complexity in scene composition, object diversity, and task specification.

We also note that performance on PRISM-Test correlates well with real-world performance, further demonstrating its importance and utility.

PRISM-Test is a better predictor of real-world performance than TaskGrasp-Image.

Method        TaskGrasp-Image   PRISM-Test   PRISM-Real (Prediction)   PRISM-Real (Overall)
Random        54.5%             29.3%        -                         -
GraspGPT      72.3%             40.0%        35.1%                     24.0%
Molmo         75.6%             49.8%        33.7%                     31.0%
GraspMolmo    76.7%             62.5%        70.4%                     61.1%

Real-World Evaluation

Scene 1

Scene 2

Scene 3

Scene 1
  French Press   "Pour coffee from the french press"
                 "Press down the knob of the plunger of the french press"
  Kitchen Knife  "Use the knife to cut fruit"
                 "Hand me the knife safely"
  Mug            "Pour the water out of the blue mug"
                 "Hang the blue mug onto a hook by the handle"
Scene 2
  Water Bottle   "Open the lid of the water bottle"
                 "Give me some water"
  Sink           "Adjust the faucet"
                 "Turn on the sink"
  Spray Bottle   "Spray cleaning solution with the spray bottle"
                 "Unscrew the spray bottle"
Scene 3
  Books          "Pass the book written by Greg Bear"
                 "Pass the book written by Orson Scott Card"
  Telephone      "Answer the phone"
                 "Put the phone back on the hook"
  Flower + Vase  "Take the flowers out of the vase"
                 "Dump the flowers out of the vase"

A full list of the scenes, objects, and tasks used for real-world evaluation.

Extension to Bimanual Grasping

By making multiple predictions, GraspMolmo can also produce multiple grasps, enabling complex task-oriented bimanual grasping. To do so, we decompose a bimanual task into two single-arm tasks and use GraspMolmo to predict a grasp for each arm sequentially. For example, "Open the water bottle" becomes "Lift the water bottle" and "Unscrew the lid".
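A minimal sketch of this decomposition is shown below; predict_bimanual, decompose, and predict_fn are hypothetical names standing in for the actual GraspMolmo interface, which may differ.

```python
def predict_bimanual(rgbd, task, decompose, predict_fn):
    """Split a bimanual task into two single-arm instructions and query the
    grasp predictor once per arm, e.g.
    "Open the water bottle" -> ("Lift the water bottle", "Unscrew the lid").

    rgbd       : the RGB-D observation of the scene
    task       : natural-language bimanual instruction
    decompose  : callable mapping the task to two single-arm instructions
    predict_fn : callable (rgbd, instruction) -> grasp, one GraspMolmo query
    """
    first_instr, second_instr = decompose(task)
    grasp_a = predict_fn(rgbd, first_instr)   # grasp for the first arm
    grasp_b = predict_fn(rgbd, second_instr)  # grasp for the second arm
    return grasp_a, grasp_b
```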

BibTeX

@misc{deshpande2025graspmolmo,
      title={GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation}, 
      author={Abhay Deshpande and Yuquan Deng and Arijit Ray and Jordi Salvador and Winson Han and Jiafei Duan and Kuo-Hao Zeng and Yuke Zhu and Ranjay Krishna and Rose Hendrix},
      year={2025},
      eprint={2505.13441},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.13441}, 
}