VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning

Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye* · Zhejiang University · ICLR 2025 · OpenReview · PDF

One-liner. A human-collected visual-tactile dataset (565k frames, 10 daily tasks, 182 objects, gathered with a low-cost piezoresistive sensor glove) plus an Isaac Gym RL benchmark of six dexterous Shadow Hand tasks. Together they show that even binary, sparse tactile signals, when jointly pretrained with vision (MAE-style masked reconstruction), boost manipulation success by about 40% in total (per the abstract, roughly 20% from adding touch to the policy and a further 20% from joint pretraining with vision) and stay robust to viewpoint, tactile noise, and the binarization threshold. In short, touch matters most exactly when the hand occludes its own view.

Problem & motivation

Vision and touch are the two senses humans lean on for manipulation, yet large-scale pretraining for robotics has stayed vision-and-language-only (MVP, R3M, Voltron, RT-2-style data). Existing visual-tactile datasets (Touch and Go, SSVTP, TaRF, PHYSICLEAR) are built for texture/material classification, tactile localization, or physical-property reasoning, not for learning complex dexterous manipulation skills. High-resolution optical tactile sensors (GelSight, DIGIT) are bulky and hard to wear through long human data-collection sessions. Online visual-tactile manipulation work (RotateIt, See-to-Touch) trains the representation jointly with the policy for a single task, so it does not transfer across tasks. The paper's bet: the combination and tempo of contact across fingers carries planning-level information about how to manipulate, and that prior can be harvested cheaply from humans wearing a low-cost pressure-sensor glove.

Method

1. Dataset: the wearable glove. A cloth glove with 20 commercial piezoresistive pressure sensors (18.3 mm pads on the thumb tip and palm for sensitivity, 10 mm elsewhere), read out through an STM32 microcontroller at 200 Hz. Each sensor is calibrated against an F/T sensor over 0.5–7.5 N; the fitted force-to-voltage law is U = 0.7216·F^0.5025 + 0.0398 (Eq 1). Vision is captured ego-centrically with a HoloLens 2, synchronized with the glove. 5 subjects collected 2,032 sequences, yielding 565k visual-tactile frame pairs over 10 tasks (PickUp, BottleCap Turning, In-hand Reorientation, Bowl Unstacking, Articulated Manipulation, Peg-in-hole, Water Pouring, Table-top Manipulation, Scissors, Pressing) and 182 objects (Table 8). Crucially, the raw pressure values are thresholded into binary contact / no-contact signals to shrink the real-to-sim and sim-to-real gap.
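The calibration law and the thresholding step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the default threshold of 0.2 V is borrowed from the real-hardware setup mentioned in the sim-to-real step, since the note does not state the value used for the glove dataset.

```python
# Constants of the fitted force-to-voltage law (Eq 1): U = A * F**B + C.
# The fit was calibrated over 0.5-7.5 N; outside that range it is an extrapolation.
A, B, C = 0.7216, 0.5025, 0.0398


def force_to_voltage(force_n: float) -> float:
    """Predicted sensor voltage (V) for a force in newtons, per Eq 1."""
    return A * force_n ** B + C


def voltage_to_force(voltage: float) -> float:
    """Invert Eq 1. Readings at or below the offset C map to zero force."""
    if voltage <= C:
        return 0.0
    return ((voltage - C) / A) ** (1.0 / B)


def binarize(voltages, threshold_v=0.2):
    """Map raw readings to contact (1) / no-contact (0).

    threshold_v is an illustrative assumption, not the dataset's value.
    """
    return [1 if v > threshold_v else 0 for v in voltages]


if __name__ == "__main__":
    frame = [force_to_voltage(f) for f in (0.0, 0.5, 7.5)]
    print(frame, binarize(frame))
```

Binarizing discards force magnitude, which is the point: a contact bit is far easier to match between a glove and a simulated sensor than a calibrated voltage.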

2. Benchmark: the simulation platform. Six Shadow Hand tasks in Isaac Gym (Fig 2a): BottleCap Turning, Faucet Screwing, Lever Sliding, Table Reorientation, In-hand Reorientation, Bimanual Hand-over. Objects come from ShapeNet/DexGraspNet, SAPIEN, YCB, and custom CAD; each task has seen and unseen object splits. 20 force sensors are arranged on the simulated hand, mirroring the glove, and their outputs are binarized with a tactile threshold of 0.01 N to give C_sim.

3. VT-JointPretrain (the fusion model). An MAE-style masked autoencoder over visual-tactile pairs (V, C) (Fig 9). The RGB image is patchified and linearly projected to visual tokens; the binary tactile vector C ∈ {0,1}^Nc is sliced into Nc patches, each projected by an MLP to a tactile token (Eq 2). Both modalities are randomly masked at ratios γ_v and γ_c (Eq 3). A transformer encoder fuses a learnable CLS token with the visible tokens to produce h = {h_CLS, h_v, h_c} (Eq 4); a decoder reconstructs the masked image (MLP reconstructor) and each masked tactile patch (ensemble of per-patch MLPs) (Eqs 5–6). After convergence, the frozen encoder feeds h_CLS to a policy. The single-modality variants V-Pretrain and T-Pretrain drop the other branch; V-Pretrain+T-Pretrain concatenates the two separately trained features (a weaker non-fusion baseline).
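The masking step can be sketched with token indices alone. This is a toy illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, tokens are represented by string labels rather than embeddings, and the patch count (196) and mask ratios (0.75 and 0.5) in the usage lines are placeholders, since the note does not give the paper's values.

```python
import random


def random_mask(num_tokens: int, mask_ratio: float, rng: random.Random):
    """Split token indices into (visible, masked) at the given ratio."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def build_encoder_input(num_visual: int, num_tactile: int,
                        gamma_v: float, gamma_c: float, seed: int = 0):
    """Return the encoder's token sequence plus the indices to reconstruct.

    The sequence is [CLS] + visible visual tokens + visible tactile tokens;
    the masked indices are what the decoder would be asked to reconstruct.
    """
    rng = random.Random(seed)
    vis_v, mask_v = random_mask(num_visual, gamma_v, rng)
    vis_c, mask_c = random_mask(num_tactile, gamma_c, rng)
    tokens = ["CLS"] + [f"v{i}" for i in vis_v] + [f"c{i}" for i in vis_c]
    return tokens, mask_v, mask_c


if __name__ == "__main__":
    # 196 image patches and 20 tactile patches; both ratios are assumptions.
    tokens, mask_v, mask_c = build_encoder_input(196, 20, 0.75, 0.5)
    print(len(tokens), len(mask_v), len(mask_c))
```

Because visible tokens from both modalities share one sequence, the encoder can use surviving contact bits to help reconstruct masked image patches and vice versa, which is what separates joint pretraining from concatenating two separately trained features.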

4. Policy learning. Each task is an MDP solved with PPO (Fig 2c). The state S = {h ← M_θ(·), P} concatenates the frozen representation h with proprioception P (joint angles + velocities); M_θ is the frozen pretrained (or non-pretrained) encoder taking ego-centric V_sim and/or binarized C_sim. An episode in which the goal is achieved counts as a success; all numbers are mean±std over 4 random seeds, 100 test episodes.
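The state construction and the reporting protocol can be sketched as below. The function names are hypothetical, and the use of the population standard deviation is an assumption: the note says only "mean±std" and does not specify population versus sample.

```python
from statistics import mean, pstdev


def build_state(h_cls, joint_angles, joint_velocities):
    """Concatenate the frozen representation with proprioception."""
    return list(h_cls) + list(joint_angles) + list(joint_velocities)


def success_rate(episode_success):
    """Percentage of test episodes in which the goal was achieved."""
    return 100.0 * sum(episode_success) / len(episode_success)


def summarize(per_seed_outcomes):
    """Mean and (population) std of per-seed success rates."""
    rates = [success_rate(outcomes) for outcomes in per_seed_outcomes]
    return mean(rates), pstdev(rates)


if __name__ == "__main__":
    # Four seeds with 100 test episodes each; the outcomes are made up.
    seeds = [[1] * k + [0] * (100 - k) for k in (70, 72, 74, 76)]
    m, s = summarize(seeds)
    print(f"{m:.1f}±{s:.1f}")
```

With only four seeds the reported spread is a coarse estimate, which is worth keeping in mind when comparing methods whose intervals overlap.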

5. Sim-to-real. A teacher policy is trained in sim with domain randomization (Gaussian noise σ=0.1 on joint angles, velocities, actions, and tactile forces), then distilled with DAgger into a student that takes augmented visual input. Real hardware: a Shadow Hand, an Azure Kinect, and a 20-sensor tactile collection board, binarization threshold 0.2 V.
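The tactile part of the domain randomization can be sketched as follows. The helper names are hypothetical, and applying the noise before binarization is an assumption inferred from the phrase "noise on tactile forces"; the 0.01 N threshold is the simulation value quoted for the benchmark.

```python
import random


def add_gaussian_noise(values, sigma=0.1, rng=None):
    """Perturb each value with zero-mean Gaussian noise of std sigma."""
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in values]


def binarize_forces(forces, threshold_n=0.01):
    """Contact bit per sensor: 1 if the force exceeds the threshold."""
    return [1 if f > threshold_n else 0 for f in forces]


def randomized_tactile(forces, sigma=0.1, threshold_n=0.01, rng=None):
    """Noise first, then threshold (ordering is an assumption)."""
    return binarize_forces(add_gaussian_noise(forces, sigma, rng), threshold_n)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.0] * 18 + [1.5, 2.0]  # 20 simulated sensors, two in contact
    print(binarize_forces(clean))
    print(randomized_tactile(clean, rng=rng))
```

Under this ordering, noise with σ=0.1 flips many no-contact bits when the threshold is as low as 0.01, so a policy trained this way has to tolerate spurious contacts rather than trust any single sensor.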

Setup

Results

Headline: across all six tasks, VT-JointPretrain reaches 72.2% (seen) / 66.8% (unseen) task-mean success, vs 54.0 / 46.1 for V-Pretrain (vision-only) and 34.8 / 27.8 for the Base proprioception-only model (Table 2). The abstract's framing: adding binary tactile to the policy buys ~+20%, and joint pretraining with vision buys a further ~+20%.

| Success rate (%), task mean | Base | T-Pretrain | V-Pretrain | VT-JointPretrain |
|---|---|---|---|---|
| Seen | 34.8±5.8 | 55.7±8.8 | 54.0±8.5 | 72.2±2.4 |
| Unseen | 27.8±5.0 | 49.2±9.4 | 46.1±8.1 | 66.8±2.7 |
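The gaps between methods are easy to misread from the raw means, so here is the arithmetic on the Table 2 numbers above. The dictionary layout and the function name are just for illustration; the values are copied from the table.

```python
# Task-mean success rates (%) from Table 2.
TABLE2 = {
    "seen": {"Base": 34.8, "T-Pretrain": 55.7,
             "V-Pretrain": 54.0, "VT-JointPretrain": 72.2},
    "unseen": {"Base": 27.8, "T-Pretrain": 49.2,
               "V-Pretrain": 46.1, "VT-JointPretrain": 66.8},
}


def gains_over(row, best="VT-JointPretrain"):
    """Percentage-point gain of `best` over every other method in a row."""
    return {name: round(row[best] - value, 1)
            for name, value in row.items() if name != best}


if __name__ == "__main__":
    for split, row in TABLE2.items():
        print(split, gains_over(row))
```

On the seen split, joint pretraining is 37.4 points above Base, 18.2 above V-Pretrain, and 16.5 above T-Pretrain; on the unseen split the gaps are 39.0, 20.7, and 17.6 points.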

Benchmarking the broader method zoo (Table 3, all six tasks averaged):

| Method | Modality | Seen | Unseen |
|---|---|---|---|
| T (non-pretrain) | t | 50.8±2.5 | 47.0±2.5 |
| V (non-pretrain) | v | 24.0±3.0 | 22.2±2.9 |
| V+T (non-pretrain) | v+t | 23.6±2.6 | 19.3±2.9 |
| V_CLIP | v | 61.3±1.5 | 49.4±1.8 |
| V_CLIP+T | v+t | 65.4±1.7 | 55.9±1.7 |
| V-Pretrain+T-Pretrain | v+t | 62.6±6.3 | 53.3±7.3 |
| VT-JointPretrain | v+t | 74.3±0.6 | 65.7±0.7 |

Where it wins and where it is surprising:

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This sits squarely on the thesis behind the 2026-06-24 batch: many manipulation predicates — is_grasped, is_inserted, is_screwed_tight, cap_is_turning — are not visually evaluable; they live in the tempo and combination of finger contact. VTDexManip is the clearest dataset-scale demonstration that this signal is real and learnable: a 20-bit binary contact vector, jointly pretrained with vision, carries enough of the how to move dexterous RL from ~35% to ~72% success, with the biggest gains exactly where vision is occluded by the hand.

Relative to BLADE: this is squarely a representation/RL paper, not a planning-abstraction or predicate-learning paper, so I shouldn't overclaim a direct method connection. But it is load-bearing evidence for a BLADE-flavored research direction. BLADE's own "what I noticed" flagged that continuous/force parameters sit entirely inside the diffusion policy and the abstraction layer is purely categorical. VTDexManip suggests the missing ingredient for force-modulated predicates may be cheap and even binary: if a contact-pattern encoder this crude already separates "turning" from "slipping," then a BLADE-style classifier fθ(turned-on(faucet)) grounded in touch rather than pixels is plausible. The dataset's per-finger contact tempo (the t-SNE clusters of Fig 1e) is exactly the kind of signal one would want to segment contact-primitive bodies from — a tactile analogue of BLADE's gripper-state segmentation. Worth a topic page on tactile-grounded predicates once 2–3 of this batch's touch-language papers are ingested.

Quotable

Despite the tactile modality used in our experiments being binary and sparse, including it directly in the policy training boosts the success rate by about 20% and joint pretraining it with vision gains a further 20%. — Abstract
the combination and tempo of touch status of different hand parts may provide abundant information on how to manipulate in a higher planning level. — §1, Introduction / p.3
in contrast with the existing visual-tactile datasets, our dataset is the first visual-tactile dataset for complex robotic manipulation skill learning. — §3 / p.3

Related

Papers cited here that are worth ingesting next (forward references):

Newly ingested in the 2026-06-24 batch — directly relevant to this work: