AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors

Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang†, Di Hu† · Renmin University of China / Wuhan UST / Beijing UPT · ICLR 2025 · arXiv:2502.12191

One-liner. AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities — pixel-level masked modeling for fine detail and semantic-level touch–vision–language alignment plus a novel cross-sensor matching task — so one model perceives static properties (material/hardness) and dynamic changes (sliding/pressure) and transfers across GelSight, DIGIT, DuraGel, and unseen sensors.

Problem & motivation

Visuo-tactile sensors (GelSight, DIGIT, etc.) are poorly standardized: each sensor renders the same physical contact differently, so models trained on one sensor's data don't transfer, and per-sensor datasets stay small. Two gaps motivate the paper. First, data: prior cross-sensor efforts lack aligned multi-sensor data (Rodriguez et al. 2024 paired only two sensors on narrow manipulation tasks, ignoring material/hardness and multi-modal text). Second, representation: human touch is both static (a brief touch recognizes texture/material) and dynamic (continuous perception while unlocking a lock or pouring), but existing tactile representation work (e.g. UniTouch, TLV-Link) learns from static tactile images only and clusters by sensor rather than by the object property being touched. AnyTouch attacks both: a new aligned dataset (TacQuad) and a unified static-dynamic, multi-level framework.

Method

Two contributions: the TacQuad dataset and the AnyTouch multi-level framework (Fig 2). The framework integrates tactile images and videos and trains in two sequential stages.

Unified input format. A static tactile image I ∈ R^(1×H×W×3) is treated as a single-frame video by replicating it F times into a 4-D tensor X_T ∈ R^(F×H×W×3); videos pass through directly. A shared patch projection turns both into spatio-temporal tokens z ∈ R^(N×d), so images and videos are processed by the same encoder.
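A minimal NumPy sketch of the unified input format described above, not the authors' code. It assumes a tubelet-style patch projection; the sizes (F, H, W, PATCH, TUBE, D), the helper names to_clip and patchify, and the random projection matrix are all hypothetical stand-ins.

```python
import numpy as np

def to_clip(x, num_frames):
    """Return an (F, H, W, 3) clip; a static image (1, H, W, 3) is repeated F times."""
    if x.shape[0] == 1:
        return np.repeat(x, num_frames, axis=0)
    if x.shape[0] != num_frames:
        raise ValueError("video must already have num_frames frames")
    return x

def patchify(clip, patch, tube):
    """Cut an (F, H, W, 3) clip into flattened tube x patch x patch blocks."""
    f, h, w, c = clip.shape
    blocks = clip.reshape(f // tube, tube, h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, block contents last
    return blocks.reshape(-1, tube * patch * patch * c)  # (N, tube * patch * patch * 3)

rng = np.random.default_rng(0)
F, H, W, PATCH, TUBE, D = 4, 32, 32, 16, 2, 8  # hypothetical sizes, not from the paper
# Shared projection used for both inputs (random stand-in for learned weights).
proj = rng.normal(size=(TUBE * PATCH * PATCH * 3, D))

image = rng.random((1, H, W, 3))  # static tactile image
video = rng.random((F, H, W, 3))  # tactile video

z_image = patchify(to_clip(image, F), PATCH, TUBE) @ proj
z_video = patchify(to_clip(video, F), PATCH, TUBE) @ proj
print(z_image.shape, z_video.shape)
```

Both inputs end up as token matrices of the same shape (N, d), which is what lets a single encoder consume images and videos.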

Stage 1 — Masked modeling (pixel level, §4.2). Randomly mask tokens of images/videos at ratio ρ; a decoder reconstructs the static image and dynamic video under MSE losses L_rec^S and L_rec^D (Eq 1). An additional next-frame prediction task (Eq 2) predicts frame V_(F+1) to capture continuous deformation dynamics. This learns fine-grained, sensor-specific deformation detail.
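The Stage 1 objective can be sketched as follows. This is an illustration under stated assumptions, not the paper's Eq 1–2: it assumes, in the MAE convention, that reconstruction error is averaged over masked tokens only, and the mask ratio, tensor sizes, and constant-offset "decoder outputs" are hypothetical.

```python
import numpy as np

def random_mask(num_tokens, ratio, rng):
    """Boolean mask with round(ratio * num_tokens) True entries (True = masked)."""
    num_masked = int(round(ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked tokens only (an assumption, see lead-in)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
N, P = 8, 12   # hypothetical token count and pixels per token
RHO = 0.75     # hypothetical mask ratio

target_static = rng.random((N, P))   # patches of the static image
target_dynamic = rng.random((N, P))  # patches of the video
next_frame = rng.random((4, P))      # patches of frame V_(F+1)

mask = random_mask(N, RHO, rng)

# Stand-in decoder outputs: the targets shifted by a constant.
pred_static = target_static + 0.1
pred_dynamic = target_dynamic - 0.2
pred_next = next_frame + 0.3

l_rec_s = masked_mse(pred_static, target_static, mask)
l_rec_d = masked_mse(pred_dynamic, target_dynamic, mask)
l_pred_d = float(np.mean((pred_next - next_frame) ** 2))

# Stage 1 total: L_rec^S + L_rec^D + L_pred^D
stage1 = l_rec_s + l_rec_d + l_pred_d
print(mask.sum(), stage1)
```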

Stage 2a — Multi-modal aligning (semantic level, §4.3). Bind touch to vision and text via CLIP-style contrastive learning, using text (tactile attribute descriptions) as the anchor since text consistently describes attributes across datasets. Because tri-modal tactile data is rare, they use GPT-4o to expand text pairings (1.4M new text pairs across four datasets) and adopt a modality-missing-aware contrastive loss (Eq 3–4) that aligns over the largest available subset per modality pair (touch↔vision, touch↔text, vision↔text) with weights α_TV, α_TL, α_VL.
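One way to read "aligns over the largest available subset per modality pair" is sketched below. This is not the paper's Eq 3–4: it assumes a symmetric InfoNCE loss per pair, computed only on samples where both modalities are present. The temperature, batch layout, and the has_vision / has_text flags are hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between row-aligned embedding matrices a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def missing_aware_loss(touch, vision, text, has_vision, has_text,
                       a_tv=1.0, a_tl=1.0, a_vl=1.0):
    """Weighted sum of pairwise losses, each over samples that have both modalities."""
    pairs = (
        (a_tv, touch, vision, has_vision),
        (a_tl, touch, text, has_text),
        (a_vl, vision, text, has_vision & has_text),
    )
    total = 0.0
    for weight, x, y, keep in pairs:
        if keep.sum() >= 2:  # a contrastive term needs at least two samples
            total += weight * info_nce(x[keep], y[keep])
    return float(total)

rng = np.random.default_rng(0)
B, D = 6, 16  # hypothetical batch size and embedding width
touch = rng.normal(size=(B, D))
vision = touch + 0.01 * rng.normal(size=(B, D))
text = touch + 0.01 * rng.normal(size=(B, D))
has_vision = np.array([True, True, True, True, False, False])
has_text = np.array([True, True, False, False, True, True])
vision[~has_vision] = 0.0  # missing modalities are placeholders, never used
text[~has_text] = 0.0

loss_aligned = missing_aware_loss(touch, vision, text, has_vision, has_text)
print(loss_aligned)
```

Here touch↔vision uses four samples, touch↔text uses four, and vision↔text uses only the two samples that carry both, so no sample is discarded merely because one modality is absent.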

Stage 2b — Cross-sensor matching (sensor-agnostic features, §4.4). The novel task: given a tactile sample X_T, decide whether a second sample comes from the same physical contact. Positives X_T^+ share the object and contact position but come from a different sensor; negatives X_T^- differ in object or position. The two touch representations are multiplied element-wise (x_T · x_T^+ for a positive pair, x_T · x_T^- for a negative pair) and scored by an MLP under a binary cross-entropy loss (Eq 5–6). This explicitly clusters representations of the same object across sensors (Fig 3), going beyond UniTouch's image-only alignment.
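The matching head can be sketched as below. It is a hedged illustration rather than the paper's Eq 5–6: the one-hidden-layer ReLU MLP, the layer sizes, the random weights, and the equal weighting of positive and negative terms are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_prob(x_a, x_b, params):
    """Probability that two touch embeddings share a contact: MLP(x_a * x_b)."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(0.0, (x_a * x_b) @ w1 + b1)  # ReLU hidden layer
    return sigmoid(hidden @ w2 + b2)

def matching_loss(anchor, positive, negative, params, eps=1e-12):
    """Binary cross-entropy: positives labeled 1, negatives labeled 0."""
    p_pos = match_prob(anchor, positive, params)
    p_neg = match_prob(anchor, negative, params)
    pos_term = np.mean(np.log(p_pos + eps))
    neg_term = np.mean(np.log(1.0 - p_neg + eps))
    return float(-0.5 * (pos_term + neg_term))

rng = np.random.default_rng(0)
B, D, HID = 4, 16, 32  # hypothetical batch, embedding, and hidden sizes
params = (
    rng.normal(scale=0.1, size=(D, HID)),  # w1
    np.zeros(HID),                         # b1
    rng.normal(scale=0.1, size=HID),       # w2
    0.0,                                   # b2
)

anchor = rng.normal(size=(B, D))                    # x_T
positive = anchor + 0.05 * rng.normal(size=(B, D))  # same contact, other sensor
negative = rng.normal(size=(B, D))                  # different object or position

loss = matching_loss(anchor, positive, negative, params)
print(loss)
```

An untrained head with all-zero weights outputs 0.5 for every pair, so its loss is ln 2; training should push the loss below that.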

Universal sensor tokens (§4.5). Instead of only sensor-specific tokens s_k (which can't transfer to unseen sensors), they add a shared universal token s_u. During training each input randomly uses s_u with probability p_u that increases linearly (Eq 7); at inference, unseen sensors always use s_u. Stage losses: Stage 1 = L_rec^S + L_rec^D + L_pred^D (Eq 8); Stage 2 = L_align + λ L_match (Eq 9).
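A small sketch of the token-selection rule, assuming details the summary does not give: the ramp is taken to start at 0 and end at a hypothetical ceiling P_MAX, and seen sensors are assumed to keep their own token s_k at inference. The function names and step counts are illustrative only.

```python
import random

def universal_prob(step, total_steps, p_max):
    """Linear ramp: 0 at step 0, p_max at total_steps (and after)."""
    return p_max * min(step, total_steps) / total_steps

def training_token(sensor, step, total_steps, p_max, rng):
    """During training, swap the sensor-specific token for s_u with probability p_u."""
    if rng.random() < universal_prob(step, total_steps, p_max):
        return "s_u"
    return "s_" + sensor

def inference_token(sensor, seen_sensors):
    """Unseen sensors always fall back to the universal token."""
    return "s_" + sensor if sensor in seen_sensors else "s_u"

rng = random.Random(0)
SEEN = {"gelsight", "digit", "duragel"}
TOTAL, P_MAX = 1000, 0.75  # hypothetical schedule length and ceiling

# Count how often s_u is drawn early versus late in training.
early = sum(training_token("digit", 10, TOTAL, P_MAX, rng) == "s_u" for _ in range(2000))
late = sum(training_token("digit", 1000, TOTAL, P_MAX, rng) == "s_u" for _ in range(2000))
print(early, late)
print(inference_token("digit", SEEN), inference_token("tacto", SEEN))
```

The ramp lets the sensor-specific tokens absorb sensor detail early on while the universal token is trained on progressively more of the data.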

Setup

Results

Static perception on seen sensor (GelSight), Table 2 (accuracy):

Method | TAG Material | TAG Roughness | TAG Hardness | Feel Grasp
CLIP | 52.96 | 84.09 | 88.34 | 72.37
ViT-LENS-2 | 63.0 | 85.1 | 92.0 | -
TLV-Link | 67.2 | 84.7 | 91.3 | 94.5‡
UniTouch | 61.3 | - | - | 82.3
TLV-Link† | 74.12 | 85.94 | 76.97 | 76.97
AnyTouch (TAG, Feel, YCB, OF2.0) | 82.74 | 86.01 | 94.24 | 87.17
AnyTouch (all data) | 80.82 | 86.74 | 94.68 | 80.53

Static perception on unseen sensors (TACTO & Taxim), Table 3 (linear probe, material acc): AnyTouch reaches 49.62 on ObjectFolder 1.0 (vs UniTouch† 47.25, CLIP 41.00) and 85.87 on ObjectFolder 2.0 (vs 75.29 / 73.16 for the same two baselines). It is best on both unseen sensors, which is consistent with cross-sensor transfer from the universal sensor token and cross-sensor matching.

Real-world dynamic pouring, Table 4 (mean error in grams, lower better):

Method | Dynamic Perception | Fine-tune err | Freeze err
CLIP | no | 5.22 | 49.1
T3 (3M data) | no | 2.33 | 9.74
AnyTouch (static-only) | no | 2.45 | 9.60
AnyTouch | yes | 1.56 | 8.22

The full dynamic AnyTouch beats T3 despite T3 using more data; the static-only ablation is roughly on par with T3, isolating the value of dynamic (video) perception.

Where it loses / is mixed. On TAG material (Table 1), the model trained on GelSight data alone scores 83.55, higher than the versions trained on additional multi-sensor data (as low as 79.61). Adding more sensors dilutes the seen dataset's share of the training data and slightly hurts that one in-distribution task (the authors note this echoes CLIP's data-overlap finding). Also, integrating the large DIGIT data helps less than adding GelSight Mini data, which is attributed to hardware image differences.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is squarely on-theme for the touch-as-a-perception-channel thesis behind the 2026-06-24 batch: many manipulation predicates (is_grasped, is_inserted, surface_is_rough, is_full) are not visually evaluable and live in touch. Relative to BLADE, where predicate classifiers f_θ(p) are learned from RGB-D crops, AnyTouch is a candidate tactile backbone for predicates that vision can't ground — and crucially it claims a sensor-agnostic representation, which matters if a BLADE-style system should not be re-trained per gripper/sensor. Two specific hooks: (1) its static branch maps to instantaneous predicates (material, hardness) while its dynamic branch maps to process/effect predicates (the pouring task is literally estimating a continuous fill quantity — the touch analogue of BLADE's diffusion-policy continuous parameters). (2) Its cross-sensor matching is a clean way to make learned predicates portable, a generalization axis orthogonal to BLADE's novel-state generalization. The representation is foundation-model-style, not a planner, so it complements rather than competes with the abstraction layer — it would sit below the predicate classifiers, supplying the tactile features they read.

Quotable

We recognize that the human tactile perception is a combination of static and dynamic processes… Drawing on this insight, we propose learning unified representations from both static and dynamic perspectives to accommodate a range of tasks. — §1, Introduction / p.4
Our method not only uses multi-modal data to bridge the gap between sensors, but also explicitly clusters representations of the same position on the same object from different sensors together. — Fig 3 caption / p.6

Related

Papers cited that should likely be ingested next:

Newly ingested in 2026-06-24 batch — directly relevant: