Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks

Jialiang Zhao, Yuxiang Ma, Lirui Wang, Edward H. Adelson · MIT CSAIL · CoRL 2024 · arXiv:2406.13640 · PDF

One-liner. T3 treats the zoo of incompatible camera-based tactile sensors as a multi-modal alignment problem — sensor-specific encoders feed a shared transformer trunk that funnels into task-specific decoders — and shows that pre-training this trunk on FoTa (the largest unified tactile dataset, 3M+ images, 13 sensors, 11 tasks) yields a reusable tactile encoder that transfers across both new sensors and new tasks with little or no fine-tuning, and lifts a real sub-millimeter insertion policy.

Problem & motivation

Camera-based tactile sensors (GelSight-family) are the dominant high-res touch modality, but they are extremely heterogeneous: different shapes, form factors, illumination colors, marker patterns, camera counts. Every new sensor-task pairing today means re-collecting data and training an encoder from scratch — expensive, and especially wasteful in long-horizon contact-rich tasks where the policy reward is sparse and many components must be trained jointly. The authors argue there should be shareable latent structure across sensors and tasks, but the central obstacle is that, unlike vision-touch or audio-visual learning where you have temporally-aligned pairs, heterogeneous tactile data is unaligned: it is ill-defined to declare two tactile images from different hardware "the same contact." So standard contrastive / distance-based cross-modal recipes don't apply. The goal: a scalable representation learnable from unaligned tactile data, shareable across sensors and tasks.

Method

Two ingredients: a dataset (FoTa) and an architecture (T3).

FoTa (Foundation Tactile) dataset. Aggregates many of the largest open-sourced tactile datasets (VisGel, TVL, Touch-and-Go, Calandra'17, Yuan'18, YCB-Sight, ObjectFolder-real, Tippur'23) plus new in-house data, into a single unified, I/O-efficient WebDataset format: 3,083,452 tactile images, 13 sensors, 11 tasks (Table 2). New in-house data was collected on two rigs to add recently-emerged sensors (GelSight Finray, Svelte, Wedge, DenseTact 2.0, GelSight 360): a 7-DoF Franka platform (2 sensors on a parallel jaw gripper exploring a clamped object, recording SE(3) pose) and a 3-DoF CNC gantry (3D-printed textured probes on a force/torque sensor probing a fixed sensor's surface, recording translational pose + probing force). VisGel's "flat" no-contact images were sub-sampled out using a Laplacian-variance threshold (Eqn. 2, σ_i = 4.24).
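
A minimal sketch of the kind of Laplacian-variance filter described above, in pure Python. The 4-neighbour kernel, the function names, and the choice to compare the variance directly against 4.24 are assumptions made for illustration; these notes do not reproduce Eqn. 2 itself, so the paper's exact formulation may differ.

```python
from statistics import pvariance

# 4-neighbour discrete Laplacian kernel (assumed; the kernel used in the
# paper's Eqn. 2 is not given in these notes).
LAPLACIAN = ((0, 1, 0), (1, -4, 1), (0, 1, 0))
THRESHOLD = 4.24  # value quoted for VisGel; its exact role in Eqn. 2 is assumed


def laplacian_variance(gray):
    """Variance of the Laplacian response over the interior pixels of a 2D grid."""
    height, width = len(gray), len(gray[0])
    responses = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += LAPLACIAN[dr + 1][dc + 1] * gray[r + dr][c + dc]
            responses.append(acc)
    return pvariance(responses)


def has_contact(gray, threshold=THRESHOLD):
    """Keep an image only if its Laplacian variance exceeds the threshold."""
    return laplacian_variance(gray) > threshold


# Hypothetical test images: a uniform "flat" frame and a checkerboard texture.
flat = [[128.0] * 8 for _ in range(8)]
textured = [[255.0 if (r + c) % 2 == 0 else 0.0 for c in range(8)] for r in range(8)]

kept = [name for name, img in (("flat", flat), ("textured", textured)) if has_contact(img)]
```

A uniform frame has zero Laplacian response everywhere, so it falls below any positive threshold and is dropped, while an image with sharp local intensity changes is kept.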

T3 architecture (Fig 1). For N_s sensors and N_t tasks: one ViT encoder Enc_i per sensor, one shared ViT Trunk, and one decoder Dec_j per task. Each batch always comes from a single sensor-task pairing (X_i, Y_j), so only the matching Enc_i and Dec_j are attached to the trunk for that batch; the trunk is the only component shared across all pairings, and is therefore what carries the cross-sensor / cross-task knowledge. Decoder architecture is chosen per task type: ViT for generative (MAE reconstruction), MLP for classification, ResNet+MLP for pose regression. Loss for a single-image task is L_j(Y_j, Dec_j(Trunk(Enc_i(X_i)))); pose tasks consume two tactile images, pass each through the encoder and trunk separately, then concatenate the two outputs before the decoder (Eqn. 1). Similar sensors share an encoder (e.g. GelSight17 variants 1–4 share one encoder).
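
The routing logic in this paragraph can be sketched as a toy in pure Python. This is an illustration of the encoder / trunk / decoder dispatch only, not the authors' implementation: the class name `T3Router`, the sensor and task keys, and the placeholder "modules" (plain functions on lists of floats) are all hypothetical stand-ins for real ViT, MLP, and ResNet+MLP networks.

```python
class T3Router:
    """Routes one batch through Enc_i -> shared trunk -> Dec_j (toy sketch)."""

    def __init__(self, trunk):
        self.trunk = trunk            # the only module shared by every pairing
        self.encoders = {}            # encoder key -> callable
        self.sensor_to_encoder = {}   # sensor name -> encoder key
        self.decoders = {}            # task name -> (callable, images per sample)

    def add_encoder(self, key, encoder, sensors):
        self.encoders[key] = encoder
        for sensor in sensors:        # similar sensors may share one encoder
            self.sensor_to_encoder[sensor] = key

    def add_decoder(self, task, decoder, images_per_sample=1):
        self.decoders[task] = (decoder, images_per_sample)

    def forward(self, sensor, task, images):
        encoder = self.encoders[self.sensor_to_encoder[sensor]]
        decoder, expected = self.decoders[task]
        if len(images) != expected:
            raise ValueError(f"{task} expects {expected} image(s), got {len(images)}")
        # Each image goes through the encoder and the trunk on its own.
        latents = [self.trunk(encoder(image)) for image in images]
        # Concatenate latents before the decoder (a no-op for single-image tasks).
        fused = [value for latent in latents for value in latent]
        return decoder(fused)


# Placeholder modules: plain functions on lists of floats, not real networks.
def toy_encoder(image):
    return [2.0 * v for v in image]


def toy_trunk(tokens):
    return [v + 1.0 for v in tokens]


def toy_classifier(latent):
    return sum(latent)


def toy_pose_head(latent):
    return list(latent)


model = T3Router(toy_trunk)
model.add_encoder("gelsight17", toy_encoder, ["gelsight17_v1", "gelsight17_v2"])
model.add_decoder("object_cls", toy_classifier)
model.add_decoder("pose", toy_pose_head, images_per_sample=2)

cls_out = model.forward("gelsight17_v1", "object_cls", [[1.0, 2.0]])
pose_out = model.forward("gelsight17_v2", "pose", [[1.0, 2.0], [3.0, 4.0]])
```

The point of the design is visible in `forward`: the encoder and decoder are looked up per batch, while `self.trunk` is applied unconditionally, so every sensor-task pairing updates the same trunk parameters during training.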

Three-stage training.

Pre-train I — self-supervised MAE. Masked auto-encoding (L2 patch-normalized pixel loss) over all FoTa data, including unlabeled images. A single reconstruction decoder Dec_0 (8 ViT blocks) is shared by all sensors. Targets local/pixel-level understanding. Best masking ratio found to be 80%.

Pre-train II — supervised, labels distilled from public datasets. Trains the trunk under supervision on 10 tasks for which the public datasets carry labels: classification tasks (object, material, fabric smoothness / fuzziness / texture-type) via MLP decoders + cross-entropy; regression tasks (SE(3) relative pose between two overlapping tactile images via ResNet+MLP, and Laplacian-variance / information-content estimation) via MSE. Targets global / semantic understanding.

Fine-tuning (optional). Further train on the specific target sensor-task with a small dataset; skippable if the pairing already existed in pre-training. Network sizes range tiny-12M / small-45M / medium-174M / large-308M (Table 1).
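
The Pre-train I objective above (random patch masking plus an L2 loss on patch-normalized pixels) can be sketched as follows. This follows the generic MAE recipe rather than the T3 code: the 196-patch grid (a 224-pixel image with 16-pixel patches), the 4-value toy patches, and all function names are assumptions for illustration, and only the 80% masking ratio comes from the notes.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return (kept, masked) patch indices after a random shuffle, MAE-style."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_keep]), sorted(order[num_keep:])


def normalize_patch(patch, eps=1e-6):
    """Normalize one patch to zero mean and (approximately) unit variance."""
    mean = sum(patch) / len(patch)
    var = sum((v - mean) ** 2 for v in patch) / len(patch)
    return [(v - mean) / (var + eps) ** 0.5 for v in patch]


def masked_l2_loss(pred_patches, target_patches, masked):
    """Mean squared error against normalized targets, on masked patches only."""
    total = 0.0
    for i in masked:
        target = normalize_patch(target_patches[i])
        pred = pred_patches[i]
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return total / len(masked)


rng = random.Random(0)
NUM_PATCHES = 196   # assumed 14 x 14 grid; not stated in the notes
MASK_RATIO = 0.8    # best masking ratio reported in the notes

kept, masked = random_mask(NUM_PATCHES, MASK_RATIO, rng)

# Toy pixel data: 4 values per patch instead of 16 x 16 x 3.
targets = [[rng.random() for _ in range(4)] for _ in range(NUM_PATCHES)]
perfect = [normalize_patch(p) for p in targets]
zeros = [[0.0] * 4 for _ in range(NUM_PATCHES)]

loss_perfect = masked_l2_loss(perfect, targets, masked)
loss_zeros = masked_l2_loss(zeros, targets, masked)
```

Only the kept patches would be fed to the encoder and trunk; the reconstruction decoder is then scored on the masked ones, which is why the loss iterates over `masked` alone.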

Setup

Datasets / benchmarks: FoTa (3,083,452 images, 13 sensors, 11 tasks; constituents in Table 2). Pre-training uses all FoTa except the gantry-collected object-classification and 3D pose-estimation data, which are held out for evaluation / fine-tuning. The §5.1 fine-tuning/eval task is a 6-category object classification on 2 sensors (GelSight Wedge, GelSight Mini), ~3,300 points each. §5.2 zero-shot transfer uses two novel sensors (GelSight Svelte, DenseTact 2.0) on two seen tasks (object classification, pose estimation).
Hardware / simulator: Real robots. Data-collection rigs: 7-DoF Franka Emika Panda with 2 tactile sensors on a parallel jaw gripper; 3-DoF desktop CNC gantry with a 6-axis force/torque sensor (MMS101). Downstream insertion task: 2 GelSight Wedge sensors on a parallel jaw gripper on a 7-DoF Franka, PCB fixed to workbench, RGB camera on the side; 0.4 mm hole-pin clearance; 3 parts (3-pin toggle switch, 12-pin double-stack USB, 17-pin VGA).
Baselines: For pre-training value: train-from-scratch vs. fine-tune-from-pre-train-I vs. fine-tune-from-pre-train-I&II, across 4 network sizes and 2 fine-tuning-data amounts (Fig 3a). For zero-shot transfer: random guess and dataset average (Fig 3c,d). For insertion: a no-tactile (vision-only) policy, a tactile policy with an encoder trained from scratch, and a tactile policy with the T3 encoder (Fig 4c). All insertion policies share a robot-state MLP and a pre-trained ResNet18 vision encoder.
Compute: not reported (network parameter counts given: 12M / 45M / 174M / 308M; encoder ViT MLP-ratio 4.0; reconstruction decoder 512-dim, 16 heads, 8 layers).
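
The two zero-shot reference points named under Baselines (random guess, dataset average) can be made concrete with a short sketch. The definitions below are the usual ones and are assumed here, since the notes do not spell them out: random guess is 1/K accuracy for K balanced classes, and the dataset-average baseline always predicts the mean training label. The toy labels and the per-sample Euclidean form of RMSE are hypothetical.

```python
import math


def random_guess_accuracy(num_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


def dataset_average_rmse(train_labels, test_labels):
    """RMSE of always predicting the per-dimension mean of the training labels."""
    dims = len(train_labels[0])
    mean = [sum(y[d] for y in train_labels) / len(train_labels) for d in range(dims)]
    squared_errors = [
        sum((y[d] - mean[d]) ** 2 for d in range(dims)) for y in test_labels
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Six classes matches the object-classification task in §5.1; the class count
# for the §5.2 evaluation is not stated in these notes.
chance = random_guess_accuracy(6)

# Hypothetical 3D translation labels (arbitrary units), not data from the paper.
train = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
test = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
baseline_rmse = dataset_average_rmse(train, test)
```

A zero-shot model is only interesting to the extent that it beats these trivial predictors, which is how the transfer results are framed in Fig 3c,d.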

Results

Pre-training helps and scales (Fig 3a). Across all 24 configurations, pre-training improved evaluation accuracy with a median improvement of 24%. Larger networks did better — large's classification accuracy was 19% higher than tiny's — though the medium-vs-large gap was insignificant. With only half the fine-tuning data, pre-trained networks beat from-scratch networks, and for medium/large the half-data and full-data fine-tuned performances were close, suggesting pre-trained models generalize with less novel data. Masking ratio mattered little (85% at mr=30% up to 89% at mr=80%; Fig 3b).

Zero-shot transfer to novel sensors (Fig 3c,d). Mixed by task. On object classification, zero-shot transfer gave only minor gains over random guessing, but a small fine-tune (2,000 points) raised classification accuracy by up to 17%. On pose estimation, zero-shot already showed significant improvement over the dataset average, and fine-tuning reduced RMSE by 5.5 mm for DenseTact 2.0; on GelSight Svelte the pose errors before and after fine-tuning were nearly identical and close to optimal.

Long-horizon insertion (Fig 4c). Headline claim from the abstract: T3-encoder policies achieved a task success rate 25% higher than policies trained with from-scratch tactile encoders, and 53% higher than without tactile sensing. The vision-only policy failed every insertion on the two harder parts (USB, VGA). The T3 encoder also reduced the average number of tactile exploration steps to insert a part.

Insertion policy (15 episodes)	Tactile encoder	Relative outcome
No tactile (vision-only)	—	Failed all USB & VGA insertions
Tactile, from scratch	Trained from scratch	Baseline tactile
Tactile, T3	Pre-trained T3	+25% success vs. scratch; +53% vs. no-tactile; fewer steps

Per-part success rates and step counts are reported only as bar charts (Fig 4c); exact numbers not reported in text.

Limitations & open questions

From the authors (§6):

FoTa is unbalanced — the 2 most popular sensors constitute over 50% of the dataset, so the trained model (and policy) may be biased toward them.
T3 currently does per-image encoding and is trained with explicit labels in pre-train II / fine-tuning; representation learning on tactile image sequences with sparse or implicit labels is left as future work.
T3 / FoTa are limited to camera-based tactile sensors; extending to non-camera-based sensors is future work.

What I noticed reading it:

The headline real-world numbers (25% / 53%) come from 15 episodes per part on 3 parts and are presented as bar charts without seeds or confidence intervals — a small-N claim relative to the heavily-evaluated pre-training ablations. The exact per-part success counts aren't in the text.
The "shared trunk on unaligned data" mechanism is appealing but never isolated: there's no ablation showing the trunk actually transfers structure vs. each encoder doing the work. Trunk-attention-map visualizations (Fig 5) are suggestive, not causal.
Zero-shot transfer for classification is essentially random-guess level — the "zero-shot transferability" headline really only holds for pose estimation; classification needs fine-tuning. The abstract phrasing ("achieved zero-shot transferability in certain sensor-task pairings") is accurate but easy to over-read.
medium (174M) ≈ large (308M) suggests the FoTa data may already be saturating model capacity — the "scales with bigger networks" claim has a near-term ceiling on this dataset.
Pre-train II depends on label availability in the constituent public datasets, so the supervised semantic stage inherits whatever task/label biases those datasets carry.
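
The small-N concern in the first point above can be illustrated with a Wilson score interval for a binomial success rate at 15 episodes. The success counts below (12/15 and 8/15) are hypothetical, chosen only to show how wide the intervals are at this sample size; the paper's per-part counts are not given in the text, and this is not an analysis of its actual results.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


EPISODES = 15  # episodes per part, as stated in the notes

# Hypothetical success counts for two policies on one part.
low_a, high_a = wilson_interval(12, EPISODES)
low_b, high_b = wilson_interval(8, EPISODES)

intervals_overlap = low_a < high_b and low_b < high_a
```

With these made-up counts the two intervals overlap even though the point estimates differ by more than 25 percentage points, which is why per-part counts or confidence intervals would make the insertion comparison easier to weigh.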

Why I care

This is squarely on-theme for the touch-grounding thesis behind this batch: many manipulation predicates I care about for BLADE-style planning — is_inserted, is_grasped, surface_is_rough, is_screwed_tight — are not visually evaluable; they live in contact. The sub-millimeter insertion result is exactly the kind of is_inserted predicate / controller that vision alone can't ground (the paper explicitly notes vision fails under heavy occlusion). T3 is the representation-layer answer: a reusable tactile encoder you could drop in front of a BLADE predicate classifier or diffusion-policy body so that the touch features don't have to be re-learned per sensor.

The deeper relevance is methodological. BLADE's bottleneck is that diffusion policies and predicate classifiers are trained per-skill from scratch; T3 is the tactile analogue of "pre-train a heterogeneous encoder once, reuse everywhere," the same move PARL/BLADE make at the symbolic layer but here at the sensory layer. The "shared trunk across unaligned heterogeneous inputs" trick is a clean alternative to contrastive alignment and worth tracking if I ever want force/touch predicates that survive a sensor swap. Adelson + Lirui Wang (of HPT, the proprioceptive-visual heterogeneous-pretraining line) make this a credible foundation-model thread, not a one-off.

Quotable

T3 captures the shared latent information across different sensor-task pairings by constructing a shared trunk transformer with sensor-specific encoders and task-specific decoders. — Abstract

one significant difference is that heterogeneous tactile representation learning lacks aligned data collected with different sensors for different tasks. It is often ill-defined to infer the resemblance of two tactile images collected from different hardware… — §2, Representation learning with heterogeneous data

T3 achieved a task success rate 25% higher than that of policies trained with tactile encoders trained from scratch, or 53% higher than without tactile sensing. — Abstract / §5.3

Papers cited that should likely be ingested next:

[36] Wang, Chen, Zhao, He 2024 — HPT (Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers) (PDF) — same shared-trunk heterogeneous-pretraining idea by an overlapping author (Lirui Wang), for proprioceptive-visual policies; the direct architectural sibling.
[27] Fu et al. 2024 — A Touch, Vision, and Language Dataset (TVL) (PDF) — a FoTa constituent and the touch–vision–language alignment counterpart; in this batch.
[6] Girdhar et al. 2023 — ImageBind (PDF) — the cited cross-modal binding anchor T3 contrasts against (aligned vs. unaligned); in this batch.
[39] Chen, Van der Merwe, Sipos, Fazeli 2022 — Visuo-tactile transformers for manipulation — prior shared-self-attention cross-modal tactile work; the closest architectural precedent.
[9] Wang, Zhao, Du, Adelson, Tedrake 2024 — PoCo (Policy composition for heterogeneous robot learning) — heterogeneous policy composition from the same lab; relevant to reusing T3 features in policies.
[30] Gao et al. 2022 — ObjectFolder 2.0 (PDF) — a FoTa constituent; multisensory object dataset; in this batch.
[29] Yang et al. 2022 — Touch and Go (PDF) — a FoTa constituent; in this batch.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

Sparsh — the other major self-supervised touch-representation foundation model (cluster B); the direct competing recipe for a reusable tactile encoder.
AnyTouch — unified static+dynamic visuo-tactile representation across sensors; the closest peer to T3's cross-sensor goal, adding the temporal axis T3 lacks.
UniT — data-efficient tactile representation; complementary low-data angle to T3's large-pretraining angle.
MViTac — visual-tactile contrastive pretraining; the aligned-pairs approach T3 deliberately avoids (good contrast point).
TVL and UniTouch — touch–language/everything binding lines; the alignment-based counterpart to T3's unaligned-trunk strategy, and TVL is a FoTa constituent.
ObjectFolder 2.0, Touch and Go — FoTa constituent datasets ingested in this batch.
GelSight and DIGIT — the sensor hardware whose heterogeneity T3 is built to absorb (FoTa includes both families).