Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks

Jialiang Zhao, Yuxiang Ma, Lirui Wang, Edward H. Adelson · MIT CSAIL · CoRL 2024 · arXiv:2406.13640

One-liner. T3 treats the zoo of incompatible camera-based tactile sensors as a multi-modal alignment problem — sensor-specific encoders feed a shared transformer trunk that funnels into task-specific decoders — and shows that pre-training this trunk on FoTa (the largest unified tactile dataset, 3M+ images, 13 sensors, 11 tasks) yields a reusable tactile encoder that transfers across both new sensors and new tasks with little or no fine-tuning, and lifts a real sub-millimeter insertion policy.

Problem & motivation

Camera-based tactile sensors (GelSight-family) are the dominant high-res touch modality, but they are extremely heterogeneous: different shapes, form factors, illumination colors, marker patterns, camera counts. Every new sensor-task pairing today means re-collecting data and training an encoder from scratch — expensive, and especially wasteful in long-horizon contact-rich tasks where the policy reward is sparse and many components must be trained jointly. The authors argue there should be shareable latent structure across sensors and tasks, but the central obstacle is that, unlike vision-touch or audio-visual learning where you have temporally-aligned pairs, heterogeneous tactile data is unaligned: it is ill-defined to declare two tactile images from different hardware "the same contact." So standard contrastive / distance-based cross-modal recipes don't apply. The goal: a scalable representation learnable from unaligned tactile data, shareable across sensors and tasks.

Method

Two ingredients: a dataset (FoTa) and an architecture (T3).

FoTa (Foundation Tactile) dataset. Aggregates many of the largest open-sourced tactile datasets (VisGel, TVL, Touch-and-Go, Calandra'17, Yuan'18, YCB-Sight, ObjectFolder-real, Tippur'23) plus new in-house data, into a single unified, I/O-efficient WebDataset format: 3,083,452 tactile images, 13 sensors, 11 tasks (Table 2). New in-house data was collected on two rigs to add recently-emerged sensors (GelSight Finray, Svelte, Wedge, DenseTact 2.0, GelSight 360): a 7-DoF Franka platform (2 sensors on a parallel jaw gripper exploring a clamped object, recording SE(3) pose) and a 3-DoF CNC gantry (3D-printed textured probes on a force/torque sensor probing a fixed sensor's surface, recording translational pose + probing force). VisGel's "flat" no-contact images were sub-sampled out using a Laplacian-variance threshold (Eqn. 2, σi = 4.24).
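
A minimal sketch of the flat-image filter described above, in pure Python. The 4-neighbour Laplacian kernel, the grayscale input, and the "keep if variance is at or above the threshold" direction are assumptions on my part; the paper's Eqn. 2 is not reproduced in these notes, and the 4.24 default only mirrors the value quoted above without confirming which pixel scale it applies to. All function names are hypothetical.

```python
from statistics import pvariance


def laplacian(img):
    """4-neighbour discrete Laplacian over the interior pixels of a 2D list."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            neighbours = img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1]
            out.append(neighbours - 4 * img[r][c])
    return out


def laplacian_variance(img):
    """Variance of the Laplacian response; near zero for flat, no-contact frames."""
    return pvariance(laplacian(img))


def keep_image(img, threshold=4.24):
    """Assumed rule: keep frames whose Laplacian variance reaches the threshold."""
    return laplacian_variance(img) >= threshold


# Toy inputs: a uniform frame and a high-contrast checkerboard.
flat = [[100] * 5 for _ in range(5)]
checker = [[255 * ((r + c) % 2) for c in range(5)] for r in range(5)]
```

Here `keep_image(flat)` is false because every Laplacian response is zero, while `keep_image(checker)` is true. The same scalar doubles as the regression target for the Laplacian-variance task mentioned under Pre-train II.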

T3 architecture (Fig 1). For N_s sensors and N_t tasks: one ViT encoder Enc_i per sensor, one shared ViT Trunk, and one decoder Dec_j per task. Each batch always comes from a single sensor-task pairing (X_i, Y_j), so the matching Enc_i and Dec_j are attached to the trunk; the trunk is the only component shared across everything, which is what carries the cross-sensor / cross-task knowledge. Decoder architecture is chosen per task type: ViT for generative (MAE reconstruction), MLP for classification, ResNet+MLP for pose regression. Loss for a single-image task is L_j(Y_j, Dec_j(Trunk(Enc_i(X_i)))); pose tasks consume two tactile images, encode/trunk each separately, then concatenate before the decoder (Eqn. 1). Similar sensors share an encoder (e.g. GelSight17 variants 1–4 share one encoder).
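
The routing described above can be sketched in a few lines of pure Python. This shows only the wiring (per-sensor encoder, one shared trunk, per-task decoder, two-image concatenation for pose), not the actual ViT blocks; the class name, the sensor and task keys, and the toy `scale` modules are all hypothetical stand-ins.

```python
class T3Sketch:
    """Routing only: pick Enc_i by sensor, run the shared trunk, pick Dec_j by task."""

    def __init__(self, encoders, trunk, decoders, sensor_to_encoder=None):
        self.encoders = encoders      # encoder key -> callable
        self.trunk = trunk            # the single shared callable
        self.decoders = decoders      # task name -> callable
        # Similar sensors may map onto one encoder key.
        self.sensor_to_encoder = sensor_to_encoder or {}

    def _embed(self, sensor, image):
        key = self.sensor_to_encoder.get(sensor, sensor)
        return self.trunk(self.encoders[key](image))

    def forward(self, sensor, task, *images):
        # One image for single-image tasks; two for pose. Each image is
        # encoded and passed through the trunk separately, then concatenated.
        features = []
        for image in images:
            features.extend(self._embed(sensor, image))
        return self.decoders[task](features)


def scale(k):
    """Toy stand-in for a network: multiplies every feature by k."""
    return lambda xs: [k * x for x in xs]


model = T3Sketch(
    encoders={"gelsight17": scale(2.0), "densetact2": scale(0.5)},
    trunk=scale(10.0),
    decoders={"classify": max, "pose": lambda feats: feats},
    sensor_to_encoder={"gelsight17_v1": "gelsight17", "gelsight17_v2": "gelsight17"},
)

single = model.forward("gelsight17_v1", "classify", [1.0, 3.0])
pair = model.forward("densetact2", "pose", [2.0], [4.0])
```

Because every batch comes from one sensor-task pairing, a single `forward` call never mixes encoders or decoders; only `trunk` sees data from every pairing.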

Three-stage training.

Pre-train I — self-supervised MAE. Masked auto-encoding (L2 patch-normalized pixel loss) over all FoTa data, including unlabeled images. A single reconstruction decoder Dec_0 (8 ViT blocks) is shared by all sensors. Targets local/pixel-level understanding. Best masking ratio found to be 80%.
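
A hedged sketch of the two mechanics named above: random patch masking at an 80% ratio, and an L2 loss against per-patch-normalized targets computed on masked patches only. This follows the general MAE recipe as I understand it rather than the T3 code; the function names, the seed handling, and the `eps` value are assumptions.

```python
import random
from statistics import fmean, pvariance


def random_mask(num_patches, mask_ratio=0.8, seed=0):
    """Return the sorted indices of the patches hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def normalize_patch(patch, eps=1e-6):
    """Zero-mean, unit-variance pixels within one patch (the regression target)."""
    mu = fmean(patch)
    std = (pvariance(patch) + eps) ** 0.5
    return [(p - mu) / std for p in patch]


def mae_loss(pred_patches, target_patches, masked_idx):
    """Mean squared error on masked patches against normalized targets."""
    total, count = 0.0, 0
    for i in masked_idx:
        target = normalize_patch(target_patches[i])
        for p, t in zip(pred_patches[i], target):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy "image": 10 patches of 4 pixels each.
patches = [[float(i + j) for j in range(4)] for i in range(10)]
masked = random_mask(len(patches), mask_ratio=0.8, seed=0)
```

With 10 patches and a ratio of 0.8, `masked` holds 8 indices, so the encoder would see only the remaining 2 patches and the decoder must reconstruct the rest.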

Pre-train II — supervised, labels distilled from public datasets. Trains the trunk under supervision on 10 tasks for which the public datasets carry labels: classification tasks (object, material, fabric smoothness / fuzziness / texture-type) via MLP decoders + cross-entropy; regression tasks (SE(3) relative pose between two overlapping tactile images via ResNet+MLP, and Laplacian-variance / information-content estimation) via MSE. Targets global / semantic understanding.
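
The loss selection described above (cross-entropy for classification decoders, MSE for regression decoders) can be sketched as follows. The task names in `TASK_KIND` are hypothetical labels for illustration, not identifiers from the paper or its code, and the pose target is treated as a generic vector because these notes do not record its parameterization.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical task names mapped to the two loss families named above.
TASK_KIND = {
    "object_cls": "classification",
    "material_cls": "classification",
    "relative_pose": "regression",
    "laplacian_variance": "regression",
}


def task_loss(task, output, label):
    """Dispatch to the loss that matches the task's decoder type."""
    if TASK_KIND[task] == "classification":
        return cross_entropy(output, label)
    return mse(output, label)
```

For example, `task_loss("object_cls", [0.0, 0.0], 0)` evaluates to `log(2)`, the loss of a uniform guess over two classes, and `task_loss("relative_pose", pred, target)` is zero only when the predicted vector matches the target exactly.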

Fine-tuning (optional). Further train on the specific target sensor-task pairing with a small dataset; this stage can be skipped if the pairing already existed in pre-training. Four network sizes are reported: tiny (12M), small (45M), medium (174M), and large (308M) (Table 1).

Setup

Results

Pre-training helps and scales (Fig 3a). Across all 24 configurations, pre-training improved evaluation accuracy with a median improvement of 24%. Larger networks did better — large's classification accuracy was 19% higher than tiny's — though the medium-vs-large gap was insignificant. With only half the fine-tuning data, pre-trained networks beat from-scratch networks, and for medium/large the half-data and full-data fine-tuned performances were close, suggesting pre-trained models generalize with less novel data. Masking ratio mattered little (85% at mr=30% up to 89% at mr=80%; Fig 3b).

Zero-shot transfer to novel sensors (Fig 3c,d). Mixed by task. On object classification, zero-shot transfer gave only minor gains over random guessing, but a small fine-tune (2,000 points) bounced classification accuracy up to 17%. On pose estimation, zero-shot already showed significant improvement over the dataset average, and fine-tuning reduced RMSE by 5.5mm for DenseTact 2.0; on GelSight Svelte the pose errors before and after fine-tuning were nearly identical and close to optimal.

Long-horizon insertion (Fig 4c). Headline claim from the abstract: T3-encoder policies achieved a task success rate 25% higher than policies trained with from-scratch tactile encoders, and 53% higher than without tactile sensing. The vision-only policy failed every insertion on the two harder parts (USB, VGA). The T3 encoder also reduced the average number of tactile exploration steps to insert a part.

| Insertion policy (15 episodes) | Tactile encoder | Relative outcome |
| --- | --- | --- |
| No tactile (vision-only) | None | Failed all USB & VGA insertions |
| Tactile, from scratch | Trained from scratch | Baseline tactile |
| Tactile, T3 | Pre-trained T3 | +25% success vs. scratch; +53% vs. no-tactile; fewer steps |

Per-part success rates and step counts are reported only as bar charts (Fig 4c); the text gives no exact numbers.

Limitations & open questions

From the authors (§6):

What I noticed reading it:

Why I care

This is squarely on-theme for the touch-grounding thesis behind this batch: many manipulation predicates I care about for BLADE-style planning — is_inserted, is_grasped, surface_is_rough, is_screwed_tight — are not visually evaluable; they live in contact. The sub-millimeter insertion result is exactly the kind of is_inserted predicate / controller that vision alone can't ground (the paper explicitly notes vision fails under heavy occlusion). T3 is the representation-layer answer: a reusable tactile encoder you could drop in front of a BLADE predicate classifier or diffusion-policy body so that the touch features don't have to be re-learned per sensor.

The deeper relevance is methodological. BLADE's bottleneck is that diffusion policies and predicate classifiers are trained per-skill from scratch; T3 is the tactile analogue of "pre-train a heterogeneous encoder once, reuse everywhere," the same move PARL/BLADE make at the symbolic layer but here at the sensory layer. The "shared trunk across unaligned heterogeneous inputs" trick is a clean alternative to contrastive alignment and worth tracking if I ever want force/touch predicates that survive a sensor swap. Adelson + Lirui Wang (of HPT, the proprioceptive-visual heterogeneous-pretraining line) make this a credible foundation-model thread, not a one-off.

Quotable

T3 captures the shared latent information across different sensor-task pairings by constructing a shared trunk transformer with sensor-specific encoders and task-specific decoders. — Abstract
one significant difference is that heterogeneous tactile representation learning lacks aligned data collected with different sensors for different tasks. It is often ill-defined to infer the resemblance of two tactile images collected from different hardware… — §2, Representation learning with heterogeneous data
T3 achieved a task success rate 25% higher than that of policies trained with tactile encoders trained from scratch, or 53% higher than without tactile sensing. — Abstract / §5.3

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: