Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong · Yale University / University of Michigan · CVPR 2024 · arXiv:2401.18084

One-liner. UniTouch binds vision-based tactile sensing into a pretrained image embedding (ImageBind's) by contrastively aligning touch to its paired RGB image — and because that image space is already aligned with language and audio, touch inherits zero-shot links to text and sound "for free," enabling material classification, grasp-stability prediction, touch↔text/audio retrieval, touch-conditioned image synthesis, and a Touch-LLM, all without task-specific tactile labels.

Problem & motivation

Touch is expensive to collect (it requires actively probing objects) and vision-based tactile sensors are not standardized: GelSight, DIGIT, GelSlim, and simulators (Taxim, TACTO) produce visually divergent outputs because of different elastomer designs, illumination, and calibration (Fig 2). The result is that tactile representations are usually trained per-sensor, per-task, and never reach the scale or generality of vision-language models. The paper's bet: rather than collect paired touch–text or touch–audio data, exploit the fact that touch is naturally paired with the visual image of the contact patch, and that pretrained image embeddings (ImageBind [35]) are already bound to many other modalities. Align touch to that image space once, and every modality ImageBind already speaks becomes reachable from touch zero-shot.

Method

Three ideas (items 1–3): bind touch to a frozen image embedding, handle multiple sensors via learnable per-sensor tokens, and balance the multi-sensor batch. Item 4 lists the zero-shot applications the resulting embedding supports.

1. Binding touch to images (Fig 3). Given a batch of B visuo-tactile pairs {(v_i, t_i)}, a trainable touch encoder F_T produces a tactile embedding aligned to the frozen ImageBind image embedding F_V(v_i) via a symmetric InfoNCE loss in both directions: L = L_{T→V} + L_{V→T}, with temperature τ and feature dim C (Eqs 1–2). Because the image encoder is frozen and already aligned to text/audio/depth/etc., pulling touch toward its paired image transitively binds touch to all of ImageBind's modalities — no paired touch–text or touch–audio data is ever used in training.
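The symmetric InfoNCE objective described in item 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the embeddings are L2-normalised before the dot product (the note does not say), and the function names `log_softmax` and `symmetric_info_nce` are hypothetical. Only the touch embeddings would receive gradients in training, since the image encoder is frozen.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(touch_emb, image_emb, tau=0.07):
    """touch_emb, image_emb: (B, C) arrays; row i of each is a matched pair."""
    # Assumption: cosine similarity, i.e. L2-normalised embeddings.
    t = touch_emb / np.linalg.norm(touch_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / tau                  # (B, B); positives sit on the diagonal
    diag = np.arange(len(t))
    loss_t2v = -log_softmax(logits, axis=1)[diag, diag].mean()  # touch -> image
    loss_v2t = -log_softmax(logits, axis=0)[diag, diag].mean()  # image -> touch
    return loss_t2v + loss_v2t

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))        # stand-in for frozen image embeddings
aligned = symmetric_info_nce(image_emb, image_emb)          # perfectly matched pairs
shuffled = symmetric_info_nce(image_emb[::-1], image_emb)   # every pair mismatched
```

The matched batch yields a much lower loss than the mismatched one, which is the signal that pulls each touch embedding toward its paired image and away from the other images in the batch.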

2. Learning from multiple sensors at once (sensor-specific tokens). To bridge the cross-sensor domain gap, UniTouch introduces a set of learnable sensor-specific tokens {s_k}, k = 1..K, each s_k ∈ R^{L×D}, capturing per-sensor calibration / background color / intensity profiles. For touch image t_i from sensor k, the L tokens are prepended as a prefix to the touch patch tokens before encoding, yielding F_T(t_i, s_{k_i}) (Eq 3). The rest of the encoder capacity is freed to learn shared texture/geometry. At inference on unseen sensors, they retrieve the nearest known sensor token: compute a prototype per sensor (mean over its raw tactile pixels), then pick the stored token minimizing L1 distance to the input.
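A toy sketch of the two mechanics in item 2: prepending the L sensor tokens to the patch tokens, and choosing a stored token for an unseen sensor by L1 distance to per-sensor prototypes. Several details are assumptions for illustration: the prototype is taken to be the per-pixel mean image of each sensor's data, the token width D = 8 and the image size are toy values, the sensor data are synthetic, and all function names are hypothetical.

```python
import numpy as np

L_TOKENS, D = 5, 8   # L = 5 follows the note; D = 8 is a toy width (assumption)

def prepend_sensor_tokens(patch_tokens, sensor_tokens):
    """patch_tokens: (N, D); sensor_tokens: (L, D) -> (L + N, D) encoder input."""
    return np.concatenate([sensor_tokens, patch_tokens], axis=0)

def sensor_prototype(touch_images):
    """Per-pixel mean image for one sensor; touch_images: (M, H, W, 3)."""
    return touch_images.mean(axis=0)

def nearest_sensor(touch_image, prototypes):
    """Name of the known sensor whose prototype is closest in L1 distance."""
    dists = {name: np.abs(touch_image - proto).sum()
             for name, proto in prototypes.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(0)
# Synthetic stand-ins: each sensor gets a different background intensity.
data = {
    "gelsight": 0.2 + 0.05 * rng.random((10, 4, 4, 3)),
    "digit": 0.5 + 0.05 * rng.random((10, 4, 4, 3)),
    "taxim": 0.8 + 0.05 * rng.random((10, 4, 4, 3)),
}
prototypes = {name: sensor_prototype(imgs) for name, imgs in data.items()}
tokens = {name: rng.normal(size=(L_TOKENS, D)) for name in data}  # learned in practice

unseen = 0.55 + 0.05 * rng.random((4, 4, 3))     # image from a sensor not seen in training
chosen = nearest_sensor(unseen, prototypes)
encoder_input = prepend_sensor_tokens(rng.normal(size=(16, D)), tokens[chosen])
```

Here the unseen image's intensity profile is closest to the synthetic "digit" data, so that sensor's tokens are reused as the prefix.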

3. In-batch data sampling. Naive uniform sampling across datasets yields a surplus of easy negatives (cross-sensor pairs are trivially separable by domain artifacts). They instead sample so that a fraction σ of each batch comes from a single dataset and the remaining (1-σ) from others; the per-dataset selection probability is proportional to dataset cardinality, p_n = |D_n| / Σ_m |D_m| (Eq 4). This keeps training focused on intra-sensor hard negatives while still exposing inter-sensor discrimination. Ablation uses σ = 0.75.
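One way to read the sampling rule in item 3 is sketched below. The dominant dataset for a batch is drawn with probability p_n, and a fraction σ of the batch is filled from it. How the remaining share is split across the other datasets is not spelled out in this note, so the sketch assumes it is also size-proportional. The function name `sample_batch` and the dataset keys are hypothetical; the sizes are the ones listed under Setup.

```python
import numpy as np

def sample_batch(dataset_sizes, batch_size=48, sigma=0.75, rng=None):
    """Return a list of (dataset_name, example_index) pairs for one batch."""
    rng = rng or np.random.default_rng()
    names = list(dataset_sizes)
    sizes = np.array([dataset_sizes[n] for n in names], dtype=float)
    p = sizes / sizes.sum()                        # p_n = |D_n| / sum_m |D_m|
    primary = int(rng.choice(len(names), p=p))     # dataset that dominates this batch
    n_primary = int(round(sigma * batch_size))

    # Assumption: the remaining (1 - sigma) share is drawn from the other
    # datasets, again in proportion to their sizes.
    others = [i for i in range(len(names)) if i != primary]
    p_others = sizes[others] / sizes[others].sum()
    rest = rng.choice(others, size=batch_size - n_primary, p=p_others)
    picks = [primary] * n_primary + [int(i) for i in rest]
    return [(names[i], int(rng.integers(dataset_sizes[names[i]]))) for i in picks]

sizes = {
    "touch_and_go": 120_000,
    "feeling_of_success": 9_300,
    "ycb_slide": 183_000,
    "objectfolder_2": 180_000,
}
batch = sample_batch(sizes, rng=np.random.default_rng(0))
```

With σ = 0.75 and batch size 48, 36 examples share one sensor's dataset, so most in-batch negatives are same-sensor (hard) negatives, while the other 12 keep some cross-sensor contrast.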

4. Downstream applications (all zero-shot, §3.3). (i) Zero-shot touch understanding: encode text prompts with the frozen CLIP/ImageBind text encoder, rank cosine similarity to the touch embedding (material classification, grasp-stability prediction). (ii) Cross-modal retrieval touch↔vision/audio/text. (iii) Touch-LLM: feed the UniTouch embedding into an off-the-shelf VLM (in place of its image embedding) to do tactile question answering. (iv) Image synthesis with touch: condition a pretrained text-to-image diffusion model [89] on touch features for touch→image generation and tactile-driven stylization. (v) X-to-touch generation: an image-to-touch diffusion model lets vision/text/audio generate touch images.
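Application (i) reduces to a nearest-prompt lookup in the shared embedding space. The sketch below uses random placeholder vectors where the real pipeline would call the frozen text encoder and the trained touch encoder; the label set and the function name `zero_shot_classify` are illustrative assumptions. The prompt template is the haptic-style one quoted in the prompt analysis.

```python
import numpy as np

def zero_shot_classify(touch_emb, prompt_embs, labels):
    """Pick the label whose prompt embedding has the highest cosine similarity."""
    t = touch_emb / np.linalg.norm(touch_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = p @ t                                   # one cosine score per label
    return labels[int(np.argmax(scores))], scores

labels = ["wood", "metal", "fabric"]                 # illustrative label set
prompts = [f"This feels like {c}" for c in labels]   # text that would be encoded

# Placeholders: real embeddings would come from the text and touch encoders.
rng = np.random.default_rng(0)
prompt_embs = rng.normal(size=(len(prompts), 1024))  # C = 1024 as in the setup
touch_emb = prompt_embs[1] + 0.1 * rng.normal(size=1024)   # a touch that lies near "metal"

pred, scores = zero_shot_classify(touch_emb, prompt_embs, labels)
```

Grasp-stability prediction follows the same pattern with two prompts (stable vs. unstable grasp), and retrieval replaces the prompt matrix with a gallery of image, audio, or text embeddings.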

Setup

Datasets / benchmarks: Train on four visuo-tactile datasets (Table 1): Touch and Go (GelSight, 120k), The Feeling of Success (GelSight, 9.3k), YCB-Slide (DIGIT, 183k), ObjectFolder 2.0 (Taxim, 180k). Evaluate in-domain plus three out-of-domain sets with unseen sensors: ObjectFolder Real (GelSlim, 20k), ObjectFolder 1.0 (TACTO, 20k), SSVTP (DIGIT, 4.6k). Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go.
Hardware / simulator: vision-based tactile sensors only, namely GelSight, DIGIT, GelSlim (real) and Taxim, TACTO (simulators). The paper runs no physical robot; grasp-stability is a prediction task over existing tactile datasets (The Feeling of Success). K = 3 sensor types, L = 5 learnable tokens each.
Baselines: Supervised ImageNet features; self-supervised visuo-tactile methods VT CMC [111] and SSVTP [57] (re-trained on the multi-dataset setup with the same ViT); for retrieval: CCA, PLSCA, DSCMR, DAR (supervised cross-modal methods); for touch-to-image: Pix2Pix, VisGel, Vision-from-touch [112]; for Touch-LLM captioning: BLIP-2, InstructBLIP, LLaVA-1.5.
Compute: 4× NVIDIA A40 GPUs, 150 epochs, batch size 48; AdamW, base LR 1e-5, cosine schedule, τ = 0.07, ViT backbone (24 blocks, 16 heads), feature dim C = 1024.
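For reference, a cosine schedule with the listed base LR and epoch count can be sketched as below. The note does not state whether warmup is used, what the final learning rate is, or whether the schedule steps per epoch or per iteration, so this sketch assumes no warmup, decay to zero, and per-epoch stepping; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch, base_lr=1e-5, total_epochs=150, min_lr=0.0):
    """Learning rate at a given epoch under plain cosine decay (no warmup assumed)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in range(151)]   # epochs 0..150
```

Under these assumptions the rate starts at 1e-5, passes through half of that at epoch 75, and reaches zero at epoch 150.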

Results

UniTouch's linear-probe features beat all baselines on both touch-understanding tasks (material classification and grasp-stability prediction), and its zero-shot numbers are competitive with supervised methods on some datasets. Material classification (Table 2, accuracy %, "All" = trained on all datasets):

Method	Touch&Go	OF 2.0	YCB-Slide	OF 1.0 (OOD)	OF Real (OOD)	SSVTP (OOD)
Chance	5.0	14.2	10.0	14.2	14.2	16.6
Supervised (ImageNet)	47.1	70.3	72.3	37.5	54.8	73.4
VT CMC [111] (All)	49.2	70.3	69.5	33.8	48.1	68.5
SSVTP [57] (All)	43.8	68.9	67.4	35.1	49.7	66.8
UniTouch (linear probe)	61.3	85.4	78.1	41.3	61.2	77.4
UniTouch (zero-shot)	52.7	43.5	66.4	32.7	33.2	60.9
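The margins implied by the table can be recomputed directly. The short script below copies the rows above verbatim and reports, per column, how far the linear-probe and zero-shot rows sit from the best baseline; variable names are illustrative.

```python
columns = ["Touch&Go", "OF 2.0", "YCB-Slide", "OF 1.0", "OF Real", "SSVTP"]
baselines = {
    "Supervised (ImageNet)": [47.1, 70.3, 72.3, 37.5, 54.8, 73.4],
    "VT CMC (All)": [49.2, 70.3, 69.5, 33.8, 48.1, 68.5],
    "SSVTP (All)": [43.8, 68.9, 67.4, 35.1, 49.7, 66.8],
}
linear_probe = [61.3, 85.4, 78.1, 41.3, 61.2, 77.4]
zero_shot = [52.7, 43.5, 66.4, 32.7, 33.2, 60.9]

# Best baseline accuracy in each column.
best_baseline = [max(col) for col in zip(*baselines.values())]

# Accuracy-point margins relative to the best baseline (positive = UniTouch ahead).
probe_margin = [round(u - b, 1) for u, b in zip(linear_probe, best_baseline)]
zero_shot_margin = [round(u - b, 1) for u, b in zip(zero_shot, best_baseline)]

for name, pm, zm in zip(columns, probe_margin, zero_shot_margin):
    print(f"{name:10s} linear probe {pm:+.1f}   zero-shot {zm:+.1f}")
```

The linear-probe margin is positive in every column (largest on OF 2.0 at +15.1), while the zero-shot row exceeds the best baseline only on Touch&Go (+3.5).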

Key results:

Linear-probe UniTouch leads every column, in- and out-of-domain — e.g. +15.1 over the best baseline on ObjectFolder 2.0 (85.4 vs 70.3), and consistent OOD gains validate the sensor tokens + sampling.
Zero-shot material classification (no tactile labels at all) is mixed against the supervised ImageNet baseline: it wins on Touch and Go (52.7 vs 47.1), trails moderately on YCB-Slide (66.4 vs 72.3) and SSVTP (60.9 vs 73.4), and is far weaker on ObjectFolder 2.0 (43.5 vs 70.3) and ObjectFolder Real (33.2 vs 54.8), where simulated/unseen-sensor touch is harder to ground in text.
Grasp-stability prediction (Table 3): linear probe 82.3 / 78.1 / 64.7 (Feeling / OF2.0 / OF1.0), again topping baselines; zero-shot 65.5 / 64.3 / 64.7 — aligning touch to text transfers to a robotics task.
Cross-modal retrieval (Table 4, mAP on ObjectFolder 2.0): touch→vision 41.9, touch→audio 37.9, touch→text 38.0 — zero-shot, yet beating fully-supervised CCA/PLSCA/DSCMR/DAR (best supervised touch→vision 32.3) by a large margin.
Touch-to-image (Table 5): best CVTP (0.56) and material consistency (0.31), but a clearly worse FID (103.11) than Vision-from-touch's 81.2 (lower FID is better).
Touch-LLM captioning (Table 6, GPT-4 rating 1–5): 3.30 vs LLaVA-1.5 2.33, InstructBLIP 1.93, BLIP-2 1.01.
Prompt analysis (Table 7): haptic-phrased prompts ("This feels like [CLS]", "Touch of [CLS]") beat visual-phrased prompts, direct evidence that language carries genuine tactile semantics.
Ablation (Table 8, zero-shot material on Touch and Go): vanilla baseline drops 43.1→21.4 when moving from one dataset to all (the sensor-gap penalty); +sensor token recovers to 38.1, +sampling to 40.3, both together reach 52.7.
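For readers unfamiliar with the retrieval metric reported above, here is a generic mean-average-precision sketch over cosine-similarity rankings. It is not the benchmark's evaluation script: relevance is assumed to mean "same object id", the embeddings are random placeholders, and `mean_average_precision` is a hypothetical helper.

```python
import numpy as np

def mean_average_precision(query_embs, gallery_embs, query_ids, gallery_ids):
    """mAP of cosine-similarity retrieval; an item is relevant if its id matches."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = q @ g.T                                   # (num_queries, num_gallery)
    gallery_ids = np.asarray(gallery_ids)
    average_precisions = []
    for i, qid in enumerate(query_ids):
        order = np.argsort(-sims[i])                 # best match first
        relevant = gallery_ids[order] == qid
        if not relevant.any():
            continue                                 # skip queries with no match
        precision_at_k = np.cumsum(relevant) / np.arange(1, len(order) + 1)
        average_precisions.append(precision_at_k[relevant].mean())
    return float(np.mean(average_precisions))

rng = np.random.default_rng(0)
ids = np.arange(6)
vision = rng.normal(size=(6, 32))                    # placeholder gallery embeddings
touch = vision + 0.05 * rng.normal(size=(6, 32))     # well-aligned placeholder queries
aligned_map = mean_average_precision(touch, vision, ids, ids)
random_map = mean_average_precision(rng.normal(size=(6, 32)), vision, ids, ids)
```

Well-aligned touch queries rank their own object first and score 1.0, while unrelated queries score lower, which is the scale on which the touch→vision/audio/text figures should be read.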

Limitations & open questions

From the authors:

Scope limited to vision-based tactile sensors; barometric / force / temperature sensors (different output formats) are excluded. They flag scaling the training strategy to those as future work.
The representation is a "black box" — not easily interpretable, no explainability in the embedding space.
The field of multimodal foundational tactile models is "admittedly still young"; this is positioned as a concrete first step, not a finished system.

What I noticed reading it:

The whole approach inherits ImageBind's binding through vision: touch only ever sees text/audio via the image bottleneck. Any concept that isn't visually distinguishable at the contact patch (e.g. hardness vs. softness that looks identical in a GelSight image) may be unreachable in principle — the zero-shot material drop on simulated ObjectFolder is consistent with this.
"Grasp stability prediction" is framed as a robotics task but is really a classification over a static tactile dataset (The Feeling of Success); no closed-loop robot trial, no real grasp executed. The robotics claim is weaker than it reads.
The nearest-sensor-token retrieval for unseen sensors is a hard assignment to one of only K = 3 learned tokens — untested how it degrades when a genuinely novel sensor lands between prototypes, or as K scales to many sensors.
Touch-LLM is evaluated by GPT-4 rating on 400 captions — a single automatic judge, no human eval, no inter-rater check; the 3.30 vs 2.33 gap should be read cautiously.
Continuous tactile quantities (force magnitude, slip velocity) never appear — everything is categorical/semantic, so force-modulated distinctions stay outside the representation.

Why I care

This is a clean anchor for the thesis behind the 2026-06-24 batch: many manipulation predicates (surface_is_rough, is_grasped, material_is_metal) are not visually evaluable — they live in touch. UniTouch is the cleanest demonstration that a touch embedding can be made to answer language queries about those properties zero-shot, which is exactly the missing-modality predicate-grounding problem BLADE side-steps. In BLADE, predicate classifiers f_θ(p): O → {T,F} are trained from RGB observations; UniTouch suggests a route to instantiate tactile predicate classifiers without per-predicate labels, by phrasing the predicate as a text prompt over a touch-language space. The catch is the vision-bottleneck above: UniTouch can only ground predicates that are visible in the contact image, whereas the predicates BLADE most needs touch for (is_screwed_tight, force-state) are precisely the non-visual ones. So UniTouch is a strong method anchor for "touch↔language grounding" and a partial answer — it grounds touch semantics but not touch forces. That gap is exactly where a BLADE-style symbolic abstraction over genuinely non-visual tactile/force signals would contribute. Relevant batch neighbors that push on the same axis are listed below.

Quotable

We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. — Abstract / p.1

As the visual embedding comes from a joint space that has already aligned with different modalities, touch that is bound with images will bridge a connection to other modalities, yielding a multi-modal unified tactile representation. — §3.1, Binding touch with images / p.3

We empirically found that our prompts can significantly improve the performance, indicating that language can indeed understand touch. — §4.6, Language prompting for touch / p.8

Papers cited that should likely be ingested next:

[35] Girdhar et al. 2023, ImageBind (CVPR): the frozen joint embedding UniTouch binds to; the foundational dependency.
[111] Yang et al., Touch and Go (VT CMC): primary training dataset + a direct visuo-tactile-contrastive baseline.
[57] Kerr et al., SSVTP (RSS): DIGIT visuo-tactile pretraining baseline + OOD eval set.
[30/32/33] Gao et al., ObjectFolder 1.0 / 2.0 / Benchmark: training + cross-modal-retrieval benchmark.
[59] Lambeta et al., DIGIT, and [54/117] Johnson & Adelson / Yuan et al., GelSight: the sensor-hardware references behind the domain gap.
[90] Si & Yuan, Taxim, and [101] Wang et al., TACTO: the simulators generating two of the training/OOD sets.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

TVL (Touch-Vision-Language Dataset) — same touch↔language-via-vision binding idea, but adds explicit touch-language pairs and a tactile-language model; the closest sibling in Cluster A.
Octopi and Octopi-1.5 — tactile-language models for object-property reasoning; the LLM-side successor to UniTouch's Touch-LLM.
Touch100k and TLV — large touch-language-vision datasets that supply the paired supervision UniTouch deliberately avoids; complementary data foundation.
AnyTouch, T3, and Sparsh — the multi-sensor unified-tactile-representation line; same cross-sensor-generalization goal as UniTouch's sensor-specific tokens.
LanguageBind and Meta-Transformer — the multimodal-binding anchors (alongside ImageBind) UniTouch's method generalizes.
Towards Forceful Robotic Foundation Models (survey) — frames the non-visual force/contact axis that UniTouch's vision bottleneck leaves untouched.