VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning

Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye* · Zhejiang University · ICLR 2025 · OpenReview · PDF

One-liner. A human-collected visual-tactile dataset (565k frames, 10 daily tasks, 182 objects, gathered with a cheap piezoresistive sensor glove) plus an Isaac Gym RL benchmark of six dexterous Shadow Hand tasks, used to show that even binary, sparse tactile signals jointly pretrained with vision (MAE-style masked reconstruction) lift task-mean success by roughly 37 points over a proprioception-only baseline and about 18 points over vision-only pretraining (Table 2, seen split), and stay robust to viewpoint, tactile noise, and the binarization threshold. In short, touch matters most exactly when the hand occludes its own view.

Problem & motivation

Vision and touch are the two senses humans lean on for manipulation, yet large-scale pretraining for robotics has stayed vision-and-language-only (MVP, R3M, Voltron, RT-2-style data). Existing visual-tactile datasets (Touch and Go, SSVTP, TaRF, PHYSICLEAR) are built for texture/material classification, tactile localization, or physical-property reasoning, not for learning complex dexterous manipulation skills. High-resolution optical tactile sensors (GelSight, DIGIT) are bulky and hard to wear through long human data-collection sessions, and online visual-tactile manipulation work (RotateIt, See-to-Touch) trains the representation jointly with the policy for a single task, so it does not transfer across tasks. The paper's bet: the combination and tempo of contact across fingers carries planning-level information about how to manipulate, and that prior can be harvested cheaply from humans wearing a low-cost pressure-sensor glove.

Method

1. Dataset — the wearable glove. A cloth glove with 20 commercial piezoresistive pressure sensors (18.3 mm pad on thumb tip and palm for sensitivity, 10 mm elsewhere), read out through an STM32 microcontroller at 200 Hz. Each sensor is calibrated against an F/T sensor over 0.5–7.5 N; the fitted force→voltage law is U = 0.7216·F^0.5025 + 0.0398 (Eq 1). Vision is captured ego-centrically with a HoloLens 2, synchronized with the glove. 5 subjects collected 2,032 sequences → 565k visual-tactile frame pairs over 10 tasks (PickUp, BottleCap Turning, In-hand Reorientation, Bowl Unstacking, Articulated Manipulation, Peg-in-hole, Water Pouring, Table-top Manipulation, Scissors, Pressing) and 182 objects (Table 8). Crucially, the raw pressure values are thresholded into binary contact / no-contact signals to shrink the real-to-sim and sim-to-real gap.
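The calibration law and the binarization step can be made concrete with a short sketch. It uses only the Eq 1 coefficients quoted above; the function names, the example reading, and the choice of 0.2 V as the threshold (borrowed from the real-hardware figure in the sim-to-real step, since the dataset's own threshold is not stated here) are assumptions for illustration.

```python
import numpy as np

# Eq. 1 coefficients as quoted in the notes: U = A * F**B + C
A, B, C = 0.7216, 0.5025, 0.0398

def force_to_voltage(force_n):
    """Fitted calibration law (Eq. 1): force in newtons -> sensor voltage."""
    return A * np.power(force_n, B) + C

def voltage_to_force(voltage):
    """Algebraic inverse of Eq. 1; voltages at or below the offset C map to 0 N."""
    excess = np.clip(np.asarray(voltage, dtype=float) - C, 0.0, None)
    return np.power(excess / A, 1.0 / B)

def binarize(voltage, threshold_v=0.2):
    """Contact / no-contact per sensor. The 0.2 V default is an assumption."""
    return (np.asarray(voltage) > threshold_v).astype(np.uint8)

# Hypothetical 20-sensor reading: three pads pressed, the rest at the offset voltage.
reading = np.full(20, C)
reading[[0, 4, 19]] = force_to_voltage(np.array([1.0, 0.5, 3.0]))
contact = binarize(reading)

# Force that Eq. 1 associates with a 0.2 V threshold.
threshold_force = float(voltage_to_force(0.2))
```

Inverting Eq 1 at 0.2 V gives roughly 0.05 N, which lies below the 0.5–7.5 N range the sensors were calibrated over, so that force value is an extrapolation of the fit rather than a measured point.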

2. Benchmark — the simulation platform. Six Shadow-Hand tasks in Isaac Gym (Fig 2a): BottleCap Turning, Faucet Screwing, Lever Sliding, Table Reorientation, In-hand Reorientation, Bimanual Hand-over. Objects come from ShapeNet/DexGraspNet, SAPIEN, YCB, and custom CAD; each task has seen + unseen object splits. 20 force sensors are arranged on the simulated hand mirroring the glove, and their outputs are binarized at a tactile threshold of 0.01 N to give the simulated contact vector C_sim.

3. VT-JointPretrain (the fusion model). An MAE-style masked autoencoder over visual-tactile pairs (V, C) (Fig 9). The RGB image is patchified and linearly projected to tokens v̄; the binary tactile vector C ∈ {0,1}^Nc is sliced into Nc patches, each MLP-projected to c̄ (Eq 2). Both modalities are randomly masked at ratios γ_v, γ_c (Eq 3). A transformer encoder fuses a learnable CLS token with the visible tokens to produce h = {h_CLS, h_v, h_c} (Eq 4); a decoder reconstructs the masked image (MLP reconstructor) and each masked tactile patch (ensemble of per-patch MLPs) (Eqs 5–6). After convergence the frozen encoder feeds h_CLS to a policy. Single-modality variants V-Pretrain and T-Pretrain drop the other branch; V-Pretrain+T-Pretrain concatenates the two separately-trained features (a weaker non-fusion baseline).
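The per-modality masking of Eq 3 and the token set handed to the encoder in Eq 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the token width, the 14x14 patch grid, Nc = 20, and the mask ratios are all assumed values, and the random arrays stand in for the projected tokens v̄ and c̄.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of (visible, masked) tokens for one modality."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
dim = 64                                   # hypothetical token width
v_tokens = rng.normal(size=(196, dim))     # stand-in for 14x14 projected image patches
c_tokens = rng.normal(size=(20, dim))      # stand-in for one token per tactile sensor
gamma_v, gamma_c = 0.75, 0.5               # assumed ratios, not the paper's values

v_vis, v_msk = random_mask(len(v_tokens), gamma_v, rng)
c_vis, c_msk = random_mask(len(c_tokens), gamma_c, rng)

# Only visible tokens, plus a CLS slot, enter the encoder.
cls = np.zeros((1, dim))                   # stands in for the learnable CLS token
encoder_input = np.concatenate([cls, v_tokens[v_vis], c_tokens[c_vis]], axis=0)
```

The masked index sets (`v_msk`, `c_msk`) are what a decoder would be asked to reconstruct; keeping separate ratios per modality is what lets the model be forced to infer touch from vision and vice versa.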

4. Policy learning. Each task is an MDP solved with PPO (Fig 2c). State S = {h ← M_θ(·), P} concatenates the frozen representation h with proprioception P (joint angles + velocities); M_θ is the frozen pretrained (or non-pretrained) encoder taking ego-centric V_sim and/or binarized C_sim. An episode counts as a success if the goal is achieved within it; all numbers are mean±std over 4 random seeds and 100 test episodes.
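For reading the tables that follow, it helps to see how a "mean±std over seeds" figure is typically formed from binary episode outcomes. The sketch below uses simulated outcomes, assumes 100 test episodes per seed, and assumes a population standard deviation across seeds; the notes do not say which convention the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical outcomes: 4 seeds x 100 test episodes, True = goal reached in that episode.
outcomes = rng.random((4, 100)) < 0.7

per_seed = outcomes.mean(axis=1) * 100.0      # success rate (%) for each seed
mean, std = per_seed.mean(), per_seed.std()   # ddof=0 is an assumption
print(f"{mean:.1f}±{std:.1f}")
```

With only four seeds, switching to the sample standard deviation (`ddof=1`) would widen every reported spread by a factor of about 1.15, which matters when comparing rows whose error bars nearly overlap.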

5. Sim-to-real. A teacher policy is trained in sim with domain randomization (Gaussian noise σ=0.1 on joint angles, velocities, actions, and tactile forces), then distilled with DAgger into a student that takes augmented visual input. Real hardware: a Shadow Hand, an Azure Kinect, and a 20-sensor tactile collection board, binarization threshold 0.2 V.
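The domain-randomization step amounts to perturbing each signal with zero-mean Gaussian noise before the teacher sees it. A minimal sketch, assuming the noise is applied independently per element and that σ=0.1 is expressed in each signal's own units (the notes do not state units or any normalization); the dictionary keys are hypothetical names.

```python
import numpy as np

def randomize(signals, rng, sigma=0.1):
    """Add zero-mean Gaussian noise (std sigma) to every array in a dict of signals."""
    return {k: v + rng.normal(0.0, sigma, size=v.shape) for k, v in signals.items()}

rng = np.random.default_rng(0)
signals = {
    "joint_angles": np.zeros(24),       # Shadow Hand joints, as listed in Setup
    "joint_velocities": np.zeros(24),
    "actions": np.zeros(24),            # action dimension is an assumption
    "tactile_forces": np.zeros(20),     # raw forces, before binarization
}
noisy = randomize(signals, rng)
```

Because the noise is added to the raw tactile forces, it acts before binarization, so its practical effect is to flip contact bits for sensors whose force sits near the threshold.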

Setup

Datasets / benchmarks: the new VTDexManip dataset (565k visual-tactile frames, 10 daily tasks, 182 objects, 2,032 human sequences, RGB + binary pressure) for pretraining; the new six-task VTDexManip simulation benchmark (BottleCap Turning, Faucet Screwing, Lever Sliding, Table Reorientation, In-hand Reorientation, Bimanual Hand-over) with seen/unseen object splits for RL evaluation. Source objects from ShapeNet, DexGraspNet (Wang et al.), SAPIEN, YCB.
Hardware / simulator: Isaac Gym + Shadow Hand (24 joints) with 20 simulated force sensors; ego-centric RGB cameras. Data collection: cloth glove with 20 piezoresistive sensors (STM32, 200 Hz) + HoloLens 2. Real-world: Shadow Hand, Azure Kinect, 20-sensor tactile board, ROS.
Baselines: non-pretrained Base (proprioception only), T, V, V+T (ResNet18 + MLP encoders); pretrained visual encoders V_MVP, V_Voltron, V_R3M, V_CLIP, V_ResNet and their +T concatenation variants; the trained-on-this-dataset V-Pretrain, T-Pretrain, V-Pretrain+T-Pretrain, and the proposed VT-JointPretrain. 18+ methods total.
Compute: not reported (PPO, 2000–3000 iterations per task per Fig 3; no GPU-hours given).

Results

Headline: across all six tasks, VT-JointPretrain reaches 72.2% (seen) / 66.8% (unseen) task-mean success, vs 54.0 / 46.1 for V-Pretrain (vision-only) and 34.8 / 27.8 for the Base proprioception-only model (Table 2). The abstract's framing: adding binary tactile to the policy buys ~+20%, and joint pretraining with vision buys a further ~+20%.

Metric (task mean)	Split	Base	T-Pretrain	V-Pretrain	VT-JointPretrain
Success rate (%)	Seen	34.8±5.8	55.7±8.8	54.0±8.5	72.2±2.4
Success rate (%)	Unseen	27.8±5.0	49.2±9.4	46.1±8.1	66.8±2.7
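A quick arithmetic check of the "~20 + ~20" framing against the Table 2 rows above. The numbers are copied from the table; the helper name is illustrative, and treating T-Pretrain as the "tactile added" step is an assumption about which comparison the abstract has in mind.

```python
# Task-mean success rates (%) copied from the Table 2 rows above.
table2 = {
    "seen":   {"Base": 34.8, "T-Pretrain": 55.7, "V-Pretrain": 54.0, "VT-JointPretrain": 72.2},
    "unseen": {"Base": 27.8, "T-Pretrain": 49.2, "V-Pretrain": 46.1, "VT-JointPretrain": 66.8},
}

def gain(split, method, reference):
    """Absolute gain in percentage points of `method` over `reference`."""
    return round(table2[split][method] - table2[split][reference], 1)

for split in table2:
    print(split,
          "| T over Base:", gain(split, "T-Pretrain", "Base"),
          "| VT over T:", gain(split, "VT-JointPretrain", "T-Pretrain"),
          "| VT over V:", gain(split, "VT-JointPretrain", "V-Pretrain"),
          "| VT over Base:", gain(split, "VT-JointPretrain", "Base"))
```

On these rows the tactile step is worth about 21 points on both splits, joint pretraining adds a further 16.5 (seen) to 17.6 (unseen) over T-Pretrain, and the total over Base is 37.4 to 39.0 points. All of these are absolute percentage points, not relative improvements.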

Benchmarking the broader method zoo (Table 3, all six tasks averaged):

Method	Modality	Seen	Unseen
T (non-pretrain)	t	50.8±2.5	47.0±2.5
V (non-pretrain)	v	24.0±3.0	22.2±2.9
V+T (non-pretrain)	v+t	23.6±2.6	19.3±2.9
V_CLIP	v	61.3±1.5	49.4±1.8
V_CLIP+T	v+t	65.4±1.7	55.9±1.7
V-Pretrain+T-Pretrain	v+t	62.6±6.3	53.3±7.3
VT-JointPretrain	v+t	74.3±0.6	65.7±0.7

Where it wins and where it is surprising:

Touch helps most under occlusion. Tactile gains are largest on the first four tasks where the hand blocks the object from the camera; on Table/In-hand Reorientation (cleaner vision) the visual advantage grows — complementary, not redundant.
Non-pretrained V (24.0) is the weakest single-modality row, and naive V+T concatenation (23.6) is no better than V alone while falling far below T alone (50.8): raw multimodal concatenation hurts. Joint masked pretraining is what makes fusion pay off.
T-only pretraining (55.7, Table 2) edges out V-Pretrain (54.0) and, among the vision-only rows reproduced above, trails only V_CLIP (61.3, Table 3), a striking result for a 20-bit binary signal.
Robustness (Tables 4–6): VT-JointPretrain stays strong across ego/arm/3rd-person viewpoints, across binarization thresholds 0.01/0.5/1.0 N, and across Gaussian tactile noise σ=0.01–1 N when a hysteresis threshold is applied; binary tactile tolerates a 20-fold threshold mismatch (0.05 vs 1.0 N) between pretraining and RL.
Distillation asymmetry (Table 7): after joint pretraining, masking tactile at deployment (VT-JointPretrain-MaskT, 73.3/65.7) still beats pure V-Pretrain (70.8/58.5) — tactile regularizes the visual features; the reverse (mask vision) does not transfer, i.e. vision does not improve the tactile branch.
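The robustness bullet mentions a hysteresis threshold for noisy tactile forces. The notes do not give its exact form, so the following is a generic two-threshold (Schmitt-trigger style) sketch under that assumption: a sensor switches on above one threshold and only switches off below a lower one. The threshold values and the example trace are hypothetical.

```python
import numpy as np

def hysteresis_binarize(forces, on_threshold, off_threshold):
    """Binarize a (T, N) force sequence with separate on/off thresholds per sensor."""
    forces = np.asarray(forces, dtype=float)
    contact = np.zeros(forces.shape, dtype=np.uint8)
    state = np.zeros(forces.shape[1], dtype=bool)
    for t, frame in enumerate(forces):
        # Sensors already in contact stay on until they drop below off_threshold;
        # sensors not in contact turn on only above on_threshold.
        state = np.where(state, frame > off_threshold, frame > on_threshold)
        contact[t] = state
    return contact

# Hypothetical single-sensor trace that jitters around a 0.5 N threshold.
trace = np.array([[0.0], [0.6], [0.45], [0.55], [0.2], [0.0]])
plain = (trace > 0.5).astype(np.uint8).ravel()
smooth = hysteresis_binarize(trace, on_threshold=0.5, off_threshold=0.3).ravel()
```

On this trace the plain threshold chatters (contact, release, contact) as the force dips to 0.45 N, while the hysteresis version reports one continuous contact, which is the behavior one would want if the tempo of contact is the signal being learned.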

Limitations & open questions

From the authors:

Tactile signals are reduced to binary contact; pressure magnitude, shear, and sensor-sensitivity variation are discarded. They flag the discrete, inherently-sparse nature of touch as an unsolved modeling challenge.
Fully exploiting touch needs high-speed, high-fidelity simulation of object-sensor contact, which the field lacks; the sim-to-real gap remains.
Vision-to-tactile transfer fails after joint learning (only tactile→vision works), an open asymmetry.
VT-JointPretrain is trained from scratch; integrating large VLMs with tactile data is left as future work.

What I noticed reading it:

Human-glove to Shadow-Hand morphology gap is hand-waved. 20 human-glove pressure pads are mapped to 20 simulated force sensors on a 24-joint Shadow Hand, but the kinematic mismatch between a human hand and the Shadow Hand is never quantified — how the pretrained contact pattern prior survives the embodiment change is asserted, not measured.
No language modality at all, despite the dataset being organized by 10 named "tasks." The tasks are labels for collection, not conditioning signals — the policy is per-task PPO, so there is no multi-task or language-conditioned generalization claim.
Success is binary "goal reached in one episode," but the quality of the manipulation (smoothness, force economy) is unmeasured — ironic for a paper whose thesis is that touch carries the how.
The headline "~20% + ~20%" decomposition is an abstract-level rounding; per-task gains in Table 2 vary widely (e.g. Lever Sliding jumps far more than Bimanual Hand-over), so the additive story is a convenient average.
Real-world section reports only qualitative deployment on 4–5 tasks (Fig 4); no real-world success-rate table, so the sim-to-real claim is the weakest-evidenced part of the paper.

Why I care

This sits squarely on the thesis behind the 2026-06-24 batch: many manipulation predicates — is_grasped, is_inserted, is_screwed_tight, cap_is_turning — are not visually evaluable; they live in the tempo and combination of finger contact. VTDexManip is the clearest dataset-scale demonstration that this signal is real and learnable: a 20-bit binary contact vector, jointly pretrained with vision, carries enough of the how to move dexterous RL from ~35% to ~72% success, with the biggest gains exactly where vision is occluded by the hand.

Relative to BLADE: this is squarely a representation/RL paper, not a planning-abstraction or predicate-learning paper, so I shouldn't overclaim a direct method connection. But it is load-bearing evidence for a BLADE-flavored research direction. BLADE's own "what I noticed" flagged that continuous/force parameters sit entirely inside the diffusion policy and the abstraction layer is purely categorical. VTDexManip suggests the missing ingredient for force-modulated predicates may be cheap and even binary: if a contact-pattern encoder this crude already separates "turning" from "slipping," then a BLADE-style classifier f_θ(turned-on(faucet)) grounded in touch rather than pixels is plausible. The dataset's per-finger contact tempo (the t-SNE clusters of Fig 1e) is exactly the kind of signal one would want to segment contact-primitive bodies from — a tactile analogue of BLADE's gripper-state segmentation. Worth a topic page on tactile-grounded predicates once 2–3 of this batch's touch-language papers are ingested.

Quotable

Despite the tactile modality used in our experiments being binary and sparse, including it directly in the policy training boosts the success rate by about 20% and joint pretraining it with vision gains a further 20%. — Abstract

the combination and tempo of touch status of different hand parts may provide abundant information on how to manipulate in a higher planning level. — §1, Introduction / p.3

in contrast with the existing visual-tactile datasets, our dataset is the first visual-tactile dataset for complex robotic manipulation skill learning. — §3 / p.3

Papers cited here that are worth ingesting next (forward references):

Making Sense of Vision and Touch (Lee et al. 2020) — canonical learned multimodal representation for contact-rich tasks; the conceptual ancestor of joint visual-tactile fusion.
Liu et al. 2024 — Masked visual-tactile pre-training for robot manipulation (ICRA) — the authors' own prior MAE-style visual-tactile pretraining work that VTDexManip extends to dexterous hands.
Radosavovic et al. 2023 — MVP, Nair et al. 2022 — R3M, Karamcheti et al. 2023 — Voltron — the visual-pretraining baselines; foundational for the structure-of-the-benchmark comparison.
Yang et al. 2022 — Touch and Go, Yu et al. 2024 — PHYSICLEAR/Octopi — the prior visual-tactile datasets it positions against (Table 1).

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ObjectFolder 2.0 and ObjectFolder Benchmark — sibling multisensory dataset/benchmark efforts (vision + touch + audio); the closest dataset-paper neighbors in Cluster I.
Touch and Go — the human-collected vision+touch dataset VTDexManip explicitly contrasts itself with (texture/stylization vs manipulation-skill learning).
Kaiwu — another multimodal manipulation dataset; complementary scale/modality comparison.
Dexterity from Touch (T-DEX) and See to Touch — the dexterous tactile-policy line VTDexManip critiques as per-task / non-transferable.
Sparsh, AnyTouch, MViTac — tactile/visuo-tactile representation-pretraining peers; methodological cousins of VT-JointPretrain (masked vs contrastive fusion).
TACTO, Taxim, TacEx — tactile simulators; relevant to the authors' stated need for high-fidelity contact simulation.
Towards Forceful Robotic Foundation Models (survey) — situates VTDexManip's "touch carries the how" thesis in the broader force/contact foundation-model agenda.