BiDexVLA: A Hybrid Framework for Fast and Robust Bimanual Dexterous Grasping
Abstract
Robot dexterous grasping is a fundamental challenge in embodied AI. While conventional methods struggle with generalization and efficiency, recent Vision-Language-Action (VLA) models have shown impressive zero-shot capabilities. However, a significant gap remains in applying these models to complex bimanual tasks that demand fine-grained control. To address this, we propose BiDexVLA, a hybrid two-stage framework for fast and robust bimanual dexterous grasping. The first stage uses a novel model-based strategy to efficiently generate grasp poses and grasp types, culminating in the robot moving to a pre-grasp position. The second stage then employs a VLA model as a low-level, closed-loop controller that executes the final grasp with real-time visual and proprioceptive feedback, ensuring high success rates. Experiments show that our framework achieves a grasping success rate of 76.7%, outperforming a state-of-the-art baseline by 55.0 percentage points across diverse object types. Most importantly, our method enables fast and robust bimanual dexterous grasping, whereas the baseline is restricted to a single hand, a constrained workspace, and slower execution. Our contributions include a two-stage architecture, a grasp generation strategy specialized for bimanual tasks, and a closed-loop VLA controller.
Method Overview
BiDexVLA adopts a two-stage hybrid design: a model-based dexterous grasp generator proposes a coordinated bimanual pre-grasp, and a Vision-Language-Action (VLA) controller executes the motion with closed-loop feedback. Figure 1 in the paper illustrates how sensing, grasp synthesis, and policy control interact within this pipeline.
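To make the hand-off between the stages concrete, here is a minimal control-flow sketch; every class, method, and variable name below is a hypothetical illustration of the architecture described above, not the released interface.

# Minimal sketch of the two-stage hand-off. All names are hypothetical.
def run_bidexvla(scene, instruction, planner, controller, robot):
    # Stage 1: model-based planning runs once per grasp attempt and
    # ends with an open-loop transit to the selected pre-grasp pose.
    pre_grasp = planner.propose_bimanual_pregrasp(scene, instruction)
    robot.move_to(pre_grasp)

    # Stage 2: the VLA policy closes the loop from pre-grasp to lift-off,
    # re-planning a short action chunk from fresh observations each step.
    while not robot.grasp_succeeded():
        obs = robot.get_observations()   # RGB views, point cloud, joints, force
        chunk = controller.act(obs)      # short horizon of joint targets
        robot.execute(chunk)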
Stage 1 — Model-based dexterous grasp generation. The robot reconstructs an accurate point cloud using stereo depth, Qwen-2.5-VL semantics, and SAM segmentation. It samples two-finger grasp candidates, maps them through dexterous grasp templates for each hand, and scores them for reachability, collision avoidance, and grasp quality before moving to the selected pre-grasp pose.
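As a rough sketch of the candidate-ranking step, the snippet below scores pre-grasp candidates with a weighted sum of collision clearance and grasp quality after a hard reachability filter. The linear scoring form, the weights, and all function names are assumptions for exposition; the paper's exact criteria may differ.

import numpy as np

def select_pregrasp(candidates, reachable, clearance, quality, w=(1.0, 1.0)):
    """Rank bimanual pre-grasp candidates (hypothetical scoring form).

    reachable(c) -> bool : IK feasibility for both arms
    clearance(c) -> float: min distance to obstacles, in metres
    quality(c)   -> float: template-fit grasp quality in [0, 1]
    """
    scores = []
    for c in candidates:
        if not reachable(c):              # hard filter: discard unreachable poses
            scores.append(-np.inf)
        else:
            scores.append(w[0] * clearance(c) + w[1] * quality(c))
    best = int(np.argmax(scores))
    return candidates[best]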
Stage 2 — VLA-based low-level controller. Multimodal observations (RGB views, point clouds, joint states, and force feedback) pass through a hierarchical encoder to produce observation tokens. A CVAE predictor generates action chunks at high frequency, and a flow-matching refiner corrects them, enabling responsive, robust execution during teleoperation hand-off and autonomous runs.
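The sketch below shows one plausible shape for this controller: observation tokens condition a CVAE decoder that proposes an action chunk, and a learned flow-matching velocity field is Euler-integrated to refine it. All dimensions, module layouts, and names are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class ChunkController(nn.Module):
    """Hypothetical Stage 2 head: CVAE proposal + flow-matching refinement."""

    def __init__(self, obs_dim=512, act_dim=26, horizon=16, z_dim=64):
        super().__init__()
        self.horizon, self.act_dim, self.z_dim = horizon, act_dim, z_dim
        # CVAE decoder: latent sample + observation tokens -> coarse chunk
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + obs_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim))
        # Velocity field for flow matching: (chunk, obs, time) -> correction
        self.velocity = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim))

    @torch.no_grad()
    def act(self, obs_tokens, refine_steps=4):
        b = obs_tokens.shape[0]
        z = torch.randn(b, self.z_dim, device=obs_tokens.device)
        chunk = self.decoder(torch.cat([z, obs_tokens], dim=-1))
        # Euler-integrate the learned velocity field to refine the chunk.
        for i in range(refine_steps):
            t = torch.full((b, 1), i / refine_steps, device=obs_tokens.device)
            v = self.velocity(torch.cat([chunk, obs_tokens, t], dim=-1))
            chunk = chunk + v / refine_steps
        return chunk.view(b, self.horizon, self.act_dim)

A few Euler steps keep refinement cheap enough for closed-loop control; more steps would trade latency for smoother corrections.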
Experimental Highlights
Hardware trials. On a full-sized humanoid platform, BiDexVLA reaches a 76.7% success rate when grasping challenging household objects, validating the synergy between planning and VLA control.
Baseline comparison. The hybrid pipeline yields a +55.0-point absolute improvement over DexGraspVLA by reducing hand-over-hand interference, shortening teleoperation time, and stabilizing execution.
Generalization. The controller maintains performance across normal, soft, and thin objects, and switches seamlessly between right-hand, left-hand, and coordinated bimanual execution without additional tuning.
Teleoperation Workflow Comparison
BiDexVLA autonomously completes the pre-grasp stage before handing control to the operator, greatly reducing manual workload.
Right-Hand Performance on Diverse Objects
We compare DexGraspVLA and BiDexVLA on normal, soft, and thin objects to evaluate robustness and generalization.
Normal Objects · Seen
Normal Objects · Unseen
Soft Objects
Thin Objects
Bimanual Coordination and Left-Handed Skills
BiDexVLA flexibly switches between hands and maintains stable coordination during bimanual tasks.
Left-Hand Only
Bimanual Coordination
BibTeX
@article{BiDexVLA2026,
  title={BiDexVLA: A Hybrid Framework for Fast and Robust Bimanual Dexterous Grasping},
  author={Anonymous Authors},
  journal={},
  year={2026},
  url={}
}