Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang*, Chao Feng*, Ziyang Chen*, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong · Yale University / University of Michigan · CVPR 2024 · arXiv:2401.18084

One-liner. UniTouch binds vision-based tactile sensing into a pretrained image embedding (ImageBind's) by contrastively aligning touch to its paired RGB image — and because that image space is already aligned with language and audio, touch inherits zero-shot links to text and sound "for free," enabling material classification, grasp-stability prediction, touch↔text/audio retrieval, touch-conditioned image synthesis, and a Touch-LLM, all without task-specific tactile labels.

Problem & motivation

Touch is expensive to collect (it requires actively probing objects) and vision-based tactile sensors are not standardized: GelSight, DIGIT, GelSlim, and simulators (Taxim, TACTO) produce visually divergent outputs because of different elastomer designs, illumination, and calibration (Fig 2). The result is that tactile representations are usually trained per-sensor, per-task, and never reach the scale or generality of vision-language models. The paper's bet: rather than collect paired touch–text or touch–audio data, exploit the fact that touch is naturally paired with the visual image of the contact patch, and that pretrained image embeddings (ImageBind [35]) are already bound to many other modalities. Align touch to that image space once, and every modality ImageBind already speaks becomes reachable from touch zero-shot.

Method

Three ideas: bind touch to a frozen image embedding, handle multiple sensors via learnable sensor-specific tokens, and balance the multi-sensor batch. A fourth item covers the downstream applications these enable.

1. Binding touch to images (Fig 3). Given a batch of B visuo-tactile pairs {(v_i, t_i)}, a trainable touch encoder F_T produces a tactile embedding aligned to the frozen ImageBind image embedding F_V(v_i) via a symmetric InfoNCE loss in both directions: L = L_{T→V} + L_{V→T}, with temperature τ and feature dim C (Eqs 1–2). Because the image encoder is frozen and already aligned to text/audio/depth/etc., pulling touch toward its paired image transitively binds touch to all of ImageBind's modalities — no paired touch–text or touch–audio data is ever used in training.
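The symmetric loss in item 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the L2 normalization of embeddings, the default temperature of 0.07, and the function names are assumptions made here for the example, and both inputs are treated as precomputed (B, C) arrays rather than encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Assumption: embeddings are compared by cosine similarity.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(touch_emb, image_emb, tau=0.07):
    """touch_emb, image_emb: (B, C) arrays; row i of each is a matched pair.

    tau=0.07 is an illustrative default, not a value taken from the paper.
    """
    t = l2_normalize(touch_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / tau            # (B, B): entry (i, j) scores touch i vs image j
    idx = np.arange(len(t))
    # Touch -> vision: each touch must pick its own image among all images.
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Vision -> touch: each image must pick its own touch among all touches.
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return loss_t2v + loss_v2t
```

In a real training loop only the touch branch would receive gradients, since the image encoder is frozen; the NumPy version above only evaluates the loss value.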

2. Learning from multiple sensors at once (sensor-specific tokens). To bridge the cross-sensor domain gap, UniTouch introduces a set of learnable sensor-specific tokens {s_k}, k = 1..K, each s_k ∈ R^{L×D}, capturing per-sensor calibration / background color / intensity profiles. For a touch image t_i from sensor k_i, the L tokens are prepended as a prefix to the touch patch tokens before encoding, yielding F_T(t_i, s_{k_i}) (Eq 3). This frees the rest of the encoder's capacity to learn shared texture and geometry. At inference on an unseen sensor, they retrieve the nearest known sensor token: compute a prototype for each training sensor (the mean over its raw tactile pixels), then use the stored token of the sensor whose prototype has the smallest L1 distance to the input touch image.
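A rough sketch of the two mechanics in item 2, prefixing sensor tokens and picking a token for an unseen sensor, might look as follows. Several things here are assumptions for illustration: the function names, the reading of "prototype" as a per-pixel mean image over a sensor's training images, and the requirement that all touch images share one resolution. The token array stands in for parameters that would be learned during training.

```python
import numpy as np

def prepend_sensor_tokens(patch_tokens, sensor_tokens, sensor_id):
    """Prefix the L tokens of one sensor to a sequence of patch tokens.

    patch_tokens:  (N, D) patch embeddings of one touch image
    sensor_tokens: (K, L, D) per-sensor tokens (learnable in training)
    returns:       (L + N, D) sequence fed to the touch encoder
    """
    return np.concatenate([sensor_tokens[sensor_id], patch_tokens], axis=0)

def sensor_prototypes(images_by_sensor):
    """One prototype per known sensor.

    images_by_sensor: list of K arrays, each (M_k, H, W, 3).
    Assumption: the prototype is the mean image over that sensor's data.
    """
    return np.stack([imgs.mean(axis=0) for imgs in images_by_sensor])

def nearest_sensor(image, prototypes):
    """Index of the known sensor whose prototype is closest in L1 distance."""
    dists = np.abs(prototypes - image).reshape(len(prototypes), -1).sum(axis=1)
    return int(np.argmin(dists))
```

For an image from an unseen sensor, one would call `nearest_sensor` first and pass the returned index as `sensor_id` to `prepend_sensor_tokens`.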

3. In-batch data sampling. Naive uniform sampling across datasets yields a surplus of easy negatives (cross-sensor pairs are trivially separable by domain artifacts). They instead sample so that a fraction σ of each batch comes from a single dataset and the remaining (1-σ) from others; the per-dataset selection probability is proportional to dataset cardinality, p_n = |D_n| / Σ_m |D_m| (Eq 4). This keeps training focused on intra-sensor hard negatives while still exposing inter-sensor discrimination. Ablation uses σ = 0.75.
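The sampling rule in item 3 can be sketched as below. This is one plausible reading rather than the paper's code: the summary does not say how the remaining (1-σ) share is split among the other datasets, so the sketch assumes it is also drawn in proportion to dataset size, and it samples with replacement for simplicity. The function names and the (dataset index, example index) return format are hypothetical.

```python
import numpy as np

def selection_probabilities(dataset_sizes):
    # Eq 4: p_n = |D_n| / sum_m |D_m|
    sizes = np.asarray(dataset_sizes, dtype=float)
    return sizes / sizes.sum()

def sample_batch(dataset_sizes, batch_size, sigma, rng):
    """Return a list of (dataset_index, example_index) pairs for one batch."""
    p = selection_probabilities(dataset_sizes)
    n_datasets = len(dataset_sizes)
    # Pick the dominant dataset for this batch with probability p_n.
    main = int(rng.choice(n_datasets, p=p))
    others = [n for n in range(n_datasets) if n != main]
    # A fraction sigma of the batch comes from the dominant dataset.
    n_main = int(round(sigma * batch_size)) if others else batch_size
    batch = [(main, int(i))
             for i in rng.integers(0, dataset_sizes[main], size=n_main)]
    if others:
        # Assumption: the rest is spread over the other datasets by size.
        p_other = p[others] / p[others].sum()
        for n in rng.choice(others, size=batch_size - n_main, p=p_other):
            batch.append((int(n), int(rng.integers(0, dataset_sizes[n]))))
    return batch
```

With σ = 0.75 and a batch of 64, 48 examples share one sensor and therefore act as hard negatives for each other, while the remaining 16 keep some cross-sensor contrast in the batch.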

4. Downstream applications (all zero-shot, §3.3). (i) Zero-shot touch understanding: encode text prompts with the frozen CLIP/ImageBind text encoder, rank cosine similarity to the touch embedding (material classification, grasp-stability prediction). (ii) Cross-modal retrieval touch↔vision/audio/text. (iii) Touch-LLM: feed the UniTouch embedding into an off-the-shelf VLM (in place of its image embedding) to do tactile question answering. (iv) Image synthesis with touch: condition a pretrained text-to-image diffusion model [89] on touch features for touch→image generation and tactile-driven stylization. (v) X-to-touch generation: an image-to-touch diffusion model lets vision/text/audio generate touch images.
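Application (i) reduces to a nearest-prompt lookup in the shared embedding space. The sketch below assumes the touch and text embeddings have already been computed by their respective encoders; the function name, the toy vectors, and the label set are hypothetical and only illustrate the ranking step.

```python
import numpy as np

def zero_shot_classify(touch_emb, text_embs, labels):
    """Rank text prompts by cosine similarity to one touch embedding.

    touch_emb: (C,) embedding of a touch image
    text_embs: (N, C) embeddings of N text prompts, one per label
    labels:    list of N label strings
    returns:   (best_label, similarities) with similarities of shape (N,)
    """
    t = touch_emb / np.linalg.norm(touch_emb)
    p = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = p @ t
    return labels[int(np.argmax(sims))], sims

# Toy usage with placeholder vectors standing in for real encoder outputs.
labels = ["fabric", "metal", "wood"]
text_embs = np.eye(3)
touch_emb = np.array([0.1, 2.0, 0.3])
best, sims = zero_shot_classify(touch_emb, text_embs, labels)
```

The same routine covers grasp-stability prediction if the prompts describe the two outcomes instead of materials; no tactile labels are needed because the class definitions live entirely in the text prompts.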

Setup

Results

UniTouch's linear-probe features beat all baselines on both downstream tasks. Its zero-shot numbers are competitive with supervised methods on some datasets (52.7 vs 47.1 on Touch&Go) but fall well behind on others (43.5 vs 70.3 on OF 2.0). Material classification (Table 2, accuracy %, "All" = trained on all datasets):

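| Method | Touch&Go | OF 2.0 | YCB-Slide | OF 1.0 (OOD) | OF Real (OOD) | SSVTP (OOD) |
|---|---|---|---|---|---|---|
| Chance | 5.0 | 14.2 | 10.0 | 14.2 | 14.2 | 16.6 |
| Supervised (ImageNet) | 47.1 | 70.3 | 72.3 | 37.5 | 54.8 | 73.4 |
| VT CMC [111] (All) | 49.2 | 70.3 | 69.5 | 33.8 | 48.1 | 68.5 |
| SSVTP [57] (All) | 43.8 | 68.9 | 67.4 | 35.1 | 49.7 | 66.8 |
| UniTouch (linear probe) | 61.3 | 85.4 | 78.1 | 41.3 | 61.2 | 77.4 |
| UniTouch (zero-shot) | 52.7 | 43.5 | 66.4 | 32.7 | 33.2 | 60.9 |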

Key results:

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is a clean anchor for the thesis behind the 2026-06-24 batch: many manipulation predicates (surface_is_rough, is_grasped, material_is_metal) are not visually evaluable — they live in touch. UniTouch is the cleanest demonstration that a touch embedding can be made to answer language queries about those properties zero-shot, which is exactly the missing-modality predicate-grounding problem BLADE side-steps. In BLADE, predicate classifiers f_θ(p): O → {T,F} are trained from RGB observations; UniTouch suggests a route to instantiate tactile predicate classifiers without per-predicate labels, by phrasing the predicate as a text prompt over a touch-language space. The catch is the vision-bottleneck above: UniTouch can only ground predicates that are visible in the contact image, whereas the predicates BLADE most needs touch for (is_screwed_tight, force-state) are precisely the non-visual ones. So UniTouch is a strong method anchor for "touch↔language grounding" and a partial answer — it grounds touch semantics but not touch forces. That gap is exactly where a BLADE-style symbolic abstraction over genuinely non-visual tactile/force signals would contribute. Relevant batch neighbors that push on the same axis are listed below.

Quotable

We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. — Abstract / p.1
As the visual embedding comes from a joint space that has already aligned with different modalities, touch that is bound with images will bridge a connection to other modalities, yielding a multi-modal unified tactile representation. — §3.1, Binding touch with images / p.3
We empirically found that our prompts can significantly improve the performance, indicating that language can indeed understand touch. — §4.6, Language prompting for touch / p.8

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: