One-liner. A human-collected visual-tactile dataset (565k frames, 10 daily tasks, 182 objects, gathered with a cheap piezoresistive sensor glove) plus an Isaac Gym RL benchmark of six dexterous Shadow-Hand tasks. Together they show that even binary, sparse tactile signals, jointly pretrained with vision (MAE-style masked reconstruction), lift task-mean success from about 35% (proprioception-only Base) to about 72%, roughly 18 points above vision-only pretraining, and stay robust to viewpoint, tactile noise, and the binarization threshold. In short, touch matters most exactly when the hand occludes its own view.
Vision and touch are the two senses humans lean on for manipulation, yet large-scale pretraining for robotics has stayed limited to vision and language (MVP, R3M, Voltron, RT-2-style data). Existing visual-tactile datasets (Touch and Go, SSVTP, TaRF, PHYSICLEAR) are built for texture/material classification, tactile localization, or physical-property reasoning, not for learning complex dexterous manipulation skills. High-resolution optical tactile sensors (GelSight, DIGIT) are bulky and hard to wear through long human data-collection sessions. Online visual-tactile manipulation work (RotateIt, See-to-Touch) trains the representation jointly with the policy for a single task, so the representation does not transfer across tasks. The paper's bet: the combination and tempo of contact across fingers carries planning-level information about how to manipulate, and that prior can be harvested cheaply from humans wearing a low-cost pressure-sensor glove.
1. Dataset — the wearable glove. A cloth glove with 20
commercial piezoresistive pressure sensors (18.3 mm pad on thumb tip and
palm for sensitivity, 10 mm elsewhere), read out through an STM32
microcontroller at 200 Hz. Each sensor is calibrated against an F/T sensor
over 0.5–7.5 N; the fitted force→voltage law is
$U = 0.7216 \cdot F^{0.5025} + 0.0398$ (Eq 1). Vision is
captured ego-centrically with a HoloLens 2, synchronized with the glove.
5 subjects collected 2,032 sequences → 565k visual-tactile frame
pairs over 10 tasks (PickUp, BottleCap Turning, In-hand Reorientation,
Bowl Unstacking, Articulated Manipulation, Peg-in-hole, Water Pouring, Table-top
Manipulation, Scissors, Pressing) and 182 objects (Table 8). Crucially, the raw
pressure values are thresholded into binary contact / no-contact
signals to shrink the real-to-sim and sim-to-real gap.
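
The calibration law and the binarization step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the 0.2 V default is the threshold these notes quote for the real-hardware board (reusing it for the glove is an assumption), and treating a reading exactly at the threshold as contact is also an assumption.

```python
# Calibration constants from Eq 1: U = A * F**B + C0 (U in volts, F in newtons).
A, B, C0 = 0.7216, 0.5025, 0.0398


def force_to_voltage(force_n: float) -> float:
    """Forward calibration law (Eq 1): applied force -> sensor voltage."""
    return A * force_n ** B + C0


def voltage_to_force(voltage: float) -> float:
    """Invert Eq 1. Readings at or below the offset C0 map to zero force."""
    if voltage <= C0:
        return 0.0
    return ((voltage - C0) / A) ** (1.0 / B)


def binarize(voltages, threshold_v: float = 0.2):
    """Threshold raw voltages into contact (1) / no-contact (0) bits.

    threshold_v is an assumed default, see the note above.
    """
    return [1 if u >= threshold_v else 0 for u in voltages]


# Hypothetical frame of three sensor readings in volts.
contact_bits = binarize([0.05, 0.2, 1.0])
```

The fit was calibrated over 0.5–7.5 N, so inverting it outside that range is an extrapolation and should be treated with caution.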
2. Benchmark — the simulation platform. Six
Shadow-Hand tasks in Isaac Gym (Fig 2a): BottleCap Turning, Faucet Screwing,
Lever Sliding, Table Reorientation, In-hand Reorientation, Bimanual Hand-over.
Objects come from ShapeNet/DexGraspNet, SAPIEN, YCB, and custom CAD; each task
has seen + unseen object splits. 20 force sensors are arranged on the simulated
hand mirroring the glove, and their outputs are binarized with a tactile
threshold of 0.01 N ($C_{\mathrm{sim}}$).
3. VT-JointPretrain (the fusion model). An MAE-style masked
autoencoder over visual-tactile pairs (V, C) (Fig 9). The RGB image
is patchified and linearly projected to tokens $\bar{v}$; the binary
tactile vector $C \in \{0,1\}^{N_c}$ is sliced into
$N_c$ patches, each MLP-projected to $\bar{c}$ (Eq 2). Both
modalities are randomly masked at ratios $\gamma_v$,
$\gamma_c$ (Eq 3). A transformer encoder fuses a learnable
CLS token with the visible tokens to produce
$h = \{h_{\mathrm{CLS}}, h_v, h_c\}$ (Eq 4); a
decoder reconstructs the masked image (MLP reconstructor) and each masked
tactile patch (ensemble of per-patch MLPs) (Eqs 5–6). After convergence the
frozen encoder feeds $h_{\mathrm{CLS}}$ to a policy. Single-modality
variants V-Pretrain and T-Pretrain drop the
other branch; V-Pretrain+T-Pretrain concatenates the two
separately-trained features (a weaker non-fusion baseline).
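
The per-modality masking step (Eq 3) and the assembly of the encoder input can be sketched as index bookkeeping. This is a simplified illustration under assumptions: tokens are represented by `(modality, index)` tags rather than embeddings, the function names are hypothetical, and the token counts and mask ratios in the usage line (196 image patches, 20 tactile patches, 0.75 and 0.5) are placeholders, not values taken from the paper.

```python
import random


def random_mask(num_tokens: int, mask_ratio: float, rng: random.Random):
    """Split token indices of one modality into (visible, masked) sets."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked


def build_encoder_input(num_visual: int, num_tactile: int,
                        gamma_v: float, gamma_c: float, seed: int = 0):
    """Mask each modality independently, then prepend the CLS token."""
    rng = random.Random(seed)
    vis_v, masked_v = random_mask(num_visual, gamma_v, rng)
    vis_c, masked_c = random_mask(num_tactile, gamma_c, rng)
    # Encoder sees: CLS, visible visual tokens, visible tactile tokens.
    sequence = ["CLS"] + [("v", i) for i in vis_v] + [("c", i) for i in vis_c]
    # The masked indices are the reconstruction targets for the decoder.
    return sequence, masked_v, masked_c


# Placeholder sizes and ratios (assumptions, see the note above).
sequence, masked_v, masked_c = build_encoder_input(196, 20, 0.75, 0.5)
print(len(sequence), len(masked_v), len(masked_c))  # 60 147 10
```

Only the visible tokens pass through the encoder; the decoder is then asked to reconstruct the masked image patches and the masked contact bits.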
4. Policy learning. Each task is an MDP solved with PPO
(Fig 2c). State $S = \{h \leftarrow M_\theta(\cdot),\ P\}$
concatenates the frozen representation $h$ with proprioception
$P$ (joint angles + velocities); $M_\theta$ is
the frozen pretrained (or non-pretrained) encoder taking ego-centric
$V_{\mathrm{sim}}$ and/or binarized $C_{\mathrm{sim}}$.
An episode counts as a success if the goal is reached within it; all numbers are
mean±std over 4 random seeds and 100 test episodes.
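
The state construction and the reporting convention can be sketched as follows. Several details are assumptions: the helper names are hypothetical, the 100 test episodes are taken to be per seed, the standard deviation is computed as a population std across seeds (the notes do not say which variant is used), and the per-seed outcomes in the example are made-up numbers, not results from the paper.

```python
from statistics import mean, pstdev


def build_state(h_cls, joint_angles, joint_velocities):
    """Concatenate the frozen representation with proprioception."""
    return list(h_cls) + list(joint_angles) + list(joint_velocities)


def success_rate(outcomes):
    """Percentage of episodes (1 = goal reached, 0 = not) that succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)


def aggregate(per_seed_outcomes):
    """Mean and population std of the per-seed success rates."""
    rates = [success_rate(o) for o in per_seed_outcomes]
    return mean(rates), pstdev(rates)


# Hypothetical outcomes: 4 seeds x 100 episodes each.
fake_runs = [[1] * k + [0] * (100 - k) for k in (70, 72, 74, 72)]
avg, std = aggregate(fake_runs)
print(f"{avg:.1f}±{std:.1f}")  # 72.0±1.4
```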
5. Sim-to-real. A teacher policy is trained in sim with domain randomization (Gaussian noise σ=0.1 on joint angles, velocities, actions, and tactile forces), then distilled with DAgger into a student that takes augmented visual input. Real hardware: a Shadow Hand, an Azure Kinect, and a 20-sensor tactile collection board, binarization threshold 0.2 V.
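
The observation-noise part of the domain randomization can be sketched like this; DAgger distillation is not shown. The sketch rests on assumptions: the function names are hypothetical, the noise is injected on raw tactile forces before thresholding, and σ=0.1 is applied in each quantity's native units, since the notes do not state units or ordering.

```python
import random

SIGMA = 0.1              # noise std quoted in the notes
SIM_THRESHOLD_N = 0.01   # sim binarization threshold quoted in the notes


def add_gaussian_noise(values, sigma, rng):
    """Perturb each entry with independent zero-mean Gaussian noise."""
    return [x + rng.gauss(0.0, sigma) for x in values]


def binarize_forces(forces_n, threshold_n=SIM_THRESHOLD_N):
    """Map simulated contact forces (N) to binary contact bits."""
    return [1 if f > threshold_n else 0 for f in forces_n]


def randomized_observation(joint_angles, joint_velocities, tactile_forces,
                           rng, sigma=SIGMA):
    """Noisy proprioception plus binarized noisy touch for one sim step."""
    q = add_gaussian_noise(joint_angles, sigma, rng)
    dq = add_gaussian_noise(joint_velocities, sigma, rng)
    # Assumption: noise is added to raw forces before thresholding,
    # so contact bits near the threshold can flip.
    contacts = binarize_forces(add_gaussian_noise(tactile_forces, sigma, rng))
    return q + dq + contacts


obs = randomized_observation([0.1, 0.2], [0.0, 0.0], [0.0, 1.0],
                             random.Random(0))
```

With a 0.01 N threshold and σ=0.1 on forces, a sensor with no true contact will often read as touching under this ordering, which is one way a policy could be exposed to spurious contact bits during training.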
Headline: across all six tasks, VT-JointPretrain reaches 72.2% (seen) / 66.8% (unseen) task-mean success, vs 54.0 / 46.1 for V-Pretrain (vision-only) and 34.8 / 27.8 for the Base proprioception-only model (Table 2). The abstract's framing: adding binary tactile to the policy buys ~+20%, and joint pretraining with vision buys a further ~+20%.
| Metric (task mean) | Split | Base | T-Pretrain | V-Pretrain | VT-JointPretrain |
|---|---|---|---|---|---|
| Success rate (%) | Seen | 34.8±5.8 | 55.7±8.8 | 54.0±8.5 | 72.2±2.4 |
| Success rate (%) | Unseen | 27.8±5.0 | 49.2±9.4 | 46.1±8.1 | 66.8±2.7 |
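
The gains implied by Table 2 can be checked with a few subtractions. The snippet below only restates the task-mean values from the table; the `gain` helper is a hypothetical convenience, and differences are in percentage points.

```python
# Task-mean success rates (%) copied from Table 2.
table2 = {
    "seen": {"Base": 34.8, "T-Pretrain": 55.7,
             "V-Pretrain": 54.0, "VT-JointPretrain": 72.2},
    "unseen": {"Base": 27.8, "T-Pretrain": 49.2,
               "V-Pretrain": 46.1, "VT-JointPretrain": 66.8},
}


def gain(split: str, new: str, old: str) -> float:
    """Percentage-point difference between two methods on one split."""
    return round(table2[split][new] - table2[split][old], 1)


for split in ("seen", "unseen"):
    print(split,
          gain(split, "T-Pretrain", "Base"),              # touch vs Base
          gain(split, "VT-JointPretrain", "T-Pretrain"),  # joint vs touch-only
          gain(split, "VT-JointPretrain", "V-Pretrain"),  # joint vs vision-only
          gain(split, "VT-JointPretrain", "Base"))        # total
# seen 20.9 16.5 18.2 37.4
# unseen 21.4 17.6 20.7 39.0
```

So the "about 20 plus a further 20" framing corresponds to roughly 21 and 17 points in this table, for a total of 37 to 39 points over Base.
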
Benchmarking the broader method zoo (Table 3, all six tasks averaged):
| Method | Modality | Seen | Unseen |
|---|---|---|---|
| T (non-pretrain) | t | 50.8±2.5 | 47.0±2.5 |
| V (non-pretrain) | v | 24.0±3.0 | 22.2±2.9 |
| V+T (non-pretrain) | v+t | 23.6±2.6 | 19.3±2.9 |
| V_CLIP | v | 61.3±1.5 | 49.4±1.8 |
| V_CLIP+T | v+t | 65.4±1.7 | 55.9±1.7 |
| V-Pretrain+T-Pretrain | v+t | 62.6±6.3 | 53.3±7.3 |
| VT-JointPretrain | v+t | 74.3±0.6 | 65.7±0.7 |
Where it wins and where it is surprising:
From the authors:
What I noticed reading it:
This sits squarely on the thesis behind the 2026-06-24 batch: many
manipulation predicates — is_grasped, is_inserted,
is_screwed_tight, cap_is_turning — are not
visually evaluable; they live in the tempo and combination of finger
contact. VTDexManip is the clearest dataset-scale demonstration that this signal
is real and learnable: a 20-bit binary contact vector, jointly pretrained with
vision, carries enough of the how to move dexterous RL from ~35% to ~72%
success, with the biggest gains exactly where vision is occluded by the hand.
Relative to BLADE:
this is squarely a representation/RL paper, not a planning-abstraction
or predicate-learning paper, so I shouldn't overclaim a direct method connection.
But it is load-bearing evidence for a BLADE-flavored research direction. BLADE's
own "what I noticed" flagged that continuous/force parameters sit entirely inside
the diffusion policy and the abstraction layer is purely categorical. VTDexManip
suggests the missing ingredient for force-modulated predicates may be cheap and
even binary: if a contact-pattern encoder this crude already separates
"turning" from "slipping," then a BLADE-style classifier
$f_\theta$(turned-on(faucet)) grounded in touch rather than
pixels is plausible. The dataset's per-finger contact tempo (the t-SNE
clusters of Fig 1e) is exactly the kind of signal one would want to segment
contact-primitive bodies from — a tactile analogue of BLADE's gripper-state
segmentation. Worth a topic page on tactile-grounded predicates once 2–3
of this batch's touch-language papers are ingested.
Despite the tactile modality used in our experiments being binary and sparse, including it directly in the policy training boosts the success rate by about 20% and joint pretraining it with vision gains a further 20%. — Abstract
the combination and tempo of touch status of different hand parts may provide abundant information on how to manipulate in a higher planning level. — §1, Introduction / p.3
in contrast with the existing visual-tactile datasets, our dataset is the first visual-tactile dataset for complex robotic manipulation skill learning. — §3 / p.3
Papers cited here that are worth ingesting next (forward references):
Newly ingested in the 2026-06-24 batch — directly relevant to this work: