BiDexVLA: A Hybrid Framework for Fast and Robust Bimanual Dexterous Grasping
Abstract
Robot dexterous grasping is a fundamental challenge in embodied AI. While conventional methods struggle with generalization and efficiency, recent Vision-Language-Action (VLA) models have shown impressive zero-shot capabilities. However, a significant gap remains in applying these models to complex bimanual tasks that demand fine-grained control. To address this, we propose BiDexVLA, a hybrid two-stage framework for fast and robust bimanual dexterous grasping. The first stage uses a novel model-based strategy to efficiently generate grasp poses and grasp types, culminating in the robot moving to a pre-grasp position. The second stage then employs a VLA model as a low-level, closed-loop controller that executes the final grasp with real-time visual and proprioceptive feedback, ensuring high success rates. Experiments show that our framework achieves a grasping success rate of 76.7%, outperforming a state-of-the-art baseline by 55.0 percentage points across diverse object types. Most importantly, our method enables fast and robust bimanual dexterous grasping, whereas the baseline is restricted to a single hand, a constrained workspace, and slower execution. Our contributions include a two-stage architecture, a grasp generation strategy specialized for bimanual tasks, and a closed-loop VLA controller.
Method Overview
BiDexVLA adopts a two-stage hybrid design: a model-based dexterous grasp generator proposes a coordinated bimanual pre-grasp, and a Vision-Language-Action (VLA) controller executes the motion with closed-loop feedback. Figure 1 in the paper illustrates how sensing, grasp synthesis, and policy control interact within this pipeline.
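To make the hand-off between the stages concrete, here is a minimal control-flow sketch; every class, method, and variable name below is a hypothetical illustration of the architecture described above, not the released interface.

# Minimal sketch of the two-stage hand-off. All names are hypothetical.
def run_bidexvla(scene, instruction, planner, controller, robot):
    # Stage 1: model-based planning runs once per grasp attempt and
    # ends with an open-loop transit to the selected pre-grasp pose.
    pre_grasp = planner.propose_bimanual_pregrasp(scene, instruction)
    robot.move_to(pre_grasp)

    # Stage 2: the VLA policy closes the loop from pre-grasp to lift-off,
    # re-planning a short action chunk from fresh observations each step.
    while not robot.grasp_succeeded():
        obs = robot.get_observations()   # RGB views, point cloud, joints, force
        chunk = controller.act(obs)      # short horizon of joint targets
        robot.execute(chunk)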
Stage 1 — Model-based dexterous grasp generation. The robot reconstructs an accurate point cloud using stereo depth, Qwen-2.5-VL semantics, and SAM segmentation. It samples two-finger grasp candidates, maps them through dexterous grasp templates for each hand, and scores them for reachability, collision avoidance, and grasp quality before moving to the selected pre-grasp pose.
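As a rough sketch of the candidate-ranking step, the snippet below scores pre-grasp candidates with a weighted sum of collision clearance and grasp quality after a hard reachability filter. The linear scoring form, the weights, and all function names are assumptions for exposition; the paper's exact criteria may differ.

import numpy as np

def select_pregrasp(candidates, reachable, clearance, quality, w=(1.0, 1.0)):
    """Rank bimanual pre-grasp candidates (hypothetical scoring form).

    reachable(c) -> bool : IK feasibility for both arms
    clearance(c) -> float: min distance to obstacles, in metres
    quality(c)   -> float: template-fit grasp quality in [0, 1]
    """
    scores = []
    for c in candidates:
        if not reachable(c):              # hard filter: discard unreachable poses
            scores.append(-np.inf)
        else:
            scores.append(w[0] * clearance(c) + w[1] * quality(c))
    best = int(np.argmax(scores))
    return candidates[best]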
Stage 2 — VLA-based low-level controller. Multimodal observations (RGB views, point clouds, joint states, and force feedback) pass through a hierarchical encoder to produce observation tokens. A CVAE predictor generates action chunks at high frequency, and a flow-matching refiner corrects them, enabling responsive, robust execution during teleoperation hand-off and autonomous runs.
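The sketch below shows one plausible shape for this controller: observation tokens condition a CVAE decoder that proposes an action chunk, and a learned flow-matching velocity field is Euler-integrated to refine it. All dimensions, module layouts, and names are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class ChunkController(nn.Module):
    """Hypothetical Stage 2 head: CVAE proposal + flow-matching refinement."""

    def __init__(self, obs_dim=512, act_dim=26, horizon=16, z_dim=64):
        super().__init__()
        self.horizon, self.act_dim, self.z_dim = horizon, act_dim, z_dim
        # CVAE decoder: latent sample + observation tokens -> coarse chunk
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + obs_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim))
        # Velocity field for flow matching: (chunk, obs, time) -> correction
        self.velocity = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim))

    @torch.no_grad()
    def act(self, obs_tokens, refine_steps=4):
        b = obs_tokens.shape[0]
        z = torch.randn(b, self.z_dim, device=obs_tokens.device)
        chunk = self.decoder(torch.cat([z, obs_tokens], dim=-1))
        # Euler-integrate the learned velocity field to refine the chunk.
        for i in range(refine_steps):
            t = torch.full((b, 1), i / refine_steps, device=obs_tokens.device)
            v = self.velocity(torch.cat([chunk, obs_tokens, t], dim=-1))
            chunk = chunk + v / refine_steps
        return chunk.view(b, self.horizon, self.act_dim)

A few Euler steps keep refinement cheap enough for closed-loop control; more steps would trade latency for smoother corrections.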
Experimental Highlights
Hardware trials. On a full-sized humanoid platform, BiDexVLA reaches a 76.7% success rate when grasping challenging household objects, validating the synergy between planning and VLA control.
Baseline comparison. The hybrid pipeline yields a +55.0-point absolute improvement over DexGraspVLA by reducing hand-over-hand interference, shortening teleoperation time, and stabilizing execution.
Generalization. The controller maintains performance across normal, soft, and thin objects, and switches seamlessly between right-hand, left-hand, and coordinated bimanual execution without additional tuning.
Teleoperation Workflow Comparison
BiDexVLA autonomously completes the pre-grasp stage before handing control to the operator, greatly reducing manual workload.
Right-Hand Performance on Diverse Objects
We compare DexGraspVLA and BiDexVLA on normal, soft, and thin objects to evaluate robustness and generalization.
Normal Objects · Seen
Normal Objects · Unseen
Soft Objects
Thin Objects
Bimanual Coordination and Left-Handed Skills
BiDexVLA flexibly switches between hands and maintains stable coordination during bimanual tasks.
Left-Hand Only
Bimanual Coordination
BibTeX
@article{BiDexVLA2026,
  title={BiDexVLA: A Hybrid Framework for Fast and Robust Bimanual Dexterous Grasping},
  author={Anonymous Authors},
  journal={},
  year={2026},
  url={}
}