AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors

Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang†, Di Hu† · Renmin University of China / Wuhan University of Science and Technology / Beijing University of Posts and Telecommunications · ICLR 2025 · arXiv:2502.12191

One-liner. AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities — pixel-level masked modeling for fine detail and semantic-level touch–vision–language alignment plus a novel cross-sensor matching task — so one model perceives static properties (material/hardness) and dynamic changes (sliding/pressure) and transfers across GelSight, DIGIT, DuraGel, and unseen sensors.

Problem & motivation

Visuo-tactile sensors (GelSight, DIGIT, etc.) are poorly standardized: each sensor renders the same physical contact differently, so models trained on one sensor's data don't transfer, and per-sensor datasets stay small. Two gaps motivate the paper. First, data: prior cross-sensor efforts lack aligned multi-sensor data (Rodriguez et al. 2024 paired only two sensors on narrow manipulation tasks, ignoring material/hardness and multi-modal text). Second, representation: human touch is both static (a brief touch recognizes texture/material) and dynamic (continuous perception while unlocking a lock or pouring), but existing tactile representation work (e.g. UniTouch, TLV-Link) learns from static tactile images only and clusters by sensor rather than by the object property being touched. AnyTouch attacks both: a new aligned dataset (TacQuad) and a unified static-dynamic, multi-level framework.

Method

Two contributions: the TacQuad dataset and the AnyTouch multi-level framework (Fig 2). The framework integrates tactile images and videos and trains in two sequential stages.

Unified input format. A static tactile image I ∈ R^(1×H×W×3) is treated as a single-frame video by replicating it F times into a 4-D tensor X_T ∈ R^(F×H×W×3); videos pass through directly. A shared patch projection turns both into spatio-temporal tokens z ∈ R^(N×d), so images and videos are processed by the same encoder.
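
A minimal sketch of the unified input format, assuming NumPy and illustrative sizes. The helper names (to_clip, patchify), the temporal tubelet size, and the random projection matrix are hypothetical stand-ins, not the paper's implementation; the point is only that an image and a video end up with the same token shape.

```python
import numpy as np

def to_clip(x, num_frames):
    """Return a (F, H, W, 3) clip; a single image is replicated F times."""
    x = np.asarray(x, dtype=np.float32)
    if x.ndim == 3:                       # (H, W, 3) static image
        x = x[None]                       # -> (1, H, W, 3)
    if x.shape[0] == 1:                   # single frame: replicate along time
        x = np.repeat(x, num_frames, axis=0)
    if x.shape[0] != num_frames:
        raise ValueError("expected a single image or a clip with num_frames frames")
    return x

def patchify(clip, patch, tubelet):
    """Split a (F, H, W, 3) clip into N flattened spatio-temporal patches."""
    f, h, w, c = clip.shape
    assert f % tubelet == 0 and h % patch == 0 and w % patch == 0
    x = clip.reshape(f // tubelet, tubelet, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # move the within-patch axes last
    return x.reshape(-1, tubelet * patch * patch * c)

rng = np.random.default_rng(0)
F, H, W, P, T, D = 4, 32, 32, 16, 2, 64   # illustrative sizes, not the paper's
# Shared patch projection (random here; learned in a real model).
W_proj = rng.normal(size=(T * P * P * 3, D)).astype(np.float32)

image = rng.random((H, W, 3))             # static tactile image
video = rng.random((F, H, W, 3))          # tactile video
z_img = patchify(to_clip(image, F), P, T) @ W_proj
z_vid = patchify(to_clip(video, F), P, T) @ W_proj
print(z_img.shape, z_vid.shape)           # both are (N, d) token matrices
```

With these sizes N = (F/T)·(H/P)·(W/P) = 8, so both inputs yield an 8×64 token matrix that one encoder can consume.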

Stage 1 — Masked modeling (pixel level, §4.2). Randomly mask tokens of images/videos at ratio ρ; a decoder reconstructs the static image Î and dynamic video V̂ under MSE losses L_rec^S and L_rec^D (Eq 1). An additional next-frame prediction task (Eq 2) predicts frame V_(F+1) to capture continuous deformation dynamics. This learns fine-grained, sensor-specific deformation detail.
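
A rough sketch of the Stage 1 objective, assuming NumPy. Averaging the reconstruction error over masked tokens only follows the MAE convention and is an assumption here; the mask ratio, the sizes, and the stand-in decoder output are illustrative. The sum in stage1_loss mirrors Eq 8 as summarized above.

```python
import numpy as np

def random_mask(num_tokens, ratio, rng):
    """Boolean mask with round(ratio * N) True entries (True = masked)."""
    num_masked = int(round(ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """MSE averaged over masked tokens only (MAE convention; an assumption)."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return float(per_token[mask].mean())

def next_frame_mse(pred_frame, true_frame):
    """MSE between the predicted and the actual frame F+1."""
    return float(((pred_frame - true_frame) ** 2).mean())

def stage1_loss(l_rec_static, l_rec_dynamic, l_pred_dynamic):
    """Unweighted sum of the three Stage 1 terms."""
    return l_rec_static + l_rec_dynamic + l_pred_dynamic

rng = np.random.default_rng(0)
N, D, rho = 16, 12, 0.75                      # illustrative values
target = rng.random((N, D))                   # ground-truth patch pixels
mask = random_mask(N, rho, rng)
pred = np.where(mask[:, None], 0.0, target)   # stand-in for a decoder output
loss_rec = masked_mse(pred, target, mask)
```

Only the masked positions contribute to loss_rec, so a decoder that copies visible tokens gets no credit for them.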

Stage 2a — Multi-modal aligning (semantic level, §4.3). Bind touch to vision and text via CLIP-style contrastive learning, using text (tactile attribute descriptions) as the anchor since text consistently describes attributes across datasets. Because tri-modal tactile data is rare, they use GPT-4o to expand text pairings (1.4M new text pairs across four datasets) and adopt a modality-missing-aware contrastive loss (Eq 3–4) that aligns over the largest available subset per modality pair (touch↔vision, touch↔text, vision↔text) with weights α_TV, α_TL, α_VL.
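
One way to picture the modality-missing-aware loss, assuming NumPy: for each modality pair, run a symmetric CLIP-style loss over only those batch entries where both modalities exist, then weight and sum. The function names, the temperature, the unit weights, and the presence masks are hypothetical; Eq 3–4 in the paper define the actual form.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style loss for row-aligned embeddings a[i] <-> b[i]."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def ce_diag(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))            # matched pairs sit on the diagonal

    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

def missing_aware_align(emb, present, weights):
    """Weighted sum of pairwise losses, each over samples that have both modalities."""
    total = 0.0
    for (m1, m2), alpha in weights.items():
        both = present[m1] & present[m2]
        if both.sum() >= 2:                           # need at least one negative
            total += alpha * info_nce(emb[m1][both], emb[m2][both])
    return total

rng = np.random.default_rng(0)
B, D = 6, 8
emb = {m: rng.normal(size=(B, D)) for m in ("touch", "vision", "text")}
present = {
    "touch": np.ones(B, dtype=bool),
    "vision": np.array([False, True, True, True, False, True]),
    "text": np.array([False, True, True, True, True, True]),
}
# alpha_TV, alpha_TL, alpha_VL; the values are placeholders.
weights = {("touch", "vision"): 1.0, ("touch", "text"): 1.0, ("vision", "text"): 1.0}
loss_align = missing_aware_align(emb, present, weights)
```

Sample 0 has touch only, so it contributes to none of the three pairwise terms, and sample 4 contributes to touch↔text alone.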

Stage 2b — Cross-sensor matching (sensor-agnostic features, §4.4). The novel task: given a tactile sample X_T and a second sample, decide whether the two come from the same physical contact. Positives X_T^+ are captured at the same object and position with a different sensor; negatives X_T^- come from a different object or position. The two touch representations are multiplied element-wise (x_T · x_T^+ for a positive pair, x_T · x_T^- for a negative pair) and scored by an MLP under a binary cross-entropy loss (Eq 5–6). This explicitly clusters representations of the same object across sensors (Fig 3), going beyond UniTouch's image-only alignment.
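
A small sketch of the matching head, assuming NumPy: the element-wise product of two touch embeddings goes through a two-layer MLP, and the logit is trained with binary cross-entropy (label 1 for a cross-sensor positive, 0 for a negative). The MLP depth, the hidden width, and the equal averaging of positive and negative terms are assumptions, not details taken from Eq 5–6.

```python
import numpy as np

def init_params(dim, hidden, rng):
    """Random weights for a hypothetical two-layer matching MLP."""
    return {
        "w1": rng.normal(scale=0.1, size=(dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=hidden),
        "b2": 0.0,
    }

def match_logit(x_a, x_b, params):
    """Score a pair from the element-wise product of its two embeddings."""
    h = np.maximum(0.0, (x_a * x_b) @ params["w1"] + params["b1"])  # ReLU layer
    return h @ params["w2"] + params["b2"]

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on raw logits."""
    return np.maximum(logit, 0) - logit * label + np.log1p(np.exp(-np.abs(logit)))

def matching_loss(x, x_pos, x_neg, params):
    """Average BCE over positive (label 1) and negative (label 0) pairs."""
    pos = bce_with_logits(match_logit(x, x_pos, params), 1.0)
    neg = bce_with_logits(match_logit(x, x_neg, params), 0.0)
    return float(0.5 * (pos.mean() + neg.mean()))

rng = np.random.default_rng(0)
B, D, HID = 4, 8, 16                      # illustrative sizes
params = init_params(D, HID, rng)
x = rng.normal(size=(B, D))               # anchor samples from sensor A
x_pos = rng.normal(size=(B, D))           # same contact seen by sensor B
x_neg = rng.normal(size=(B, D))           # different object or position
loss_match = matching_loss(x, x_pos, x_neg, params)
```

With an untrained head that outputs a zero logit, every pair costs log 2, which is the chance-level baseline the matching loss has to beat.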

Universal sensor tokens (§4.5). Instead of only sensor-specific tokens s_k (which can't transfer to unseen sensors), they add a shared universal token s_u. During training each input randomly uses s_u with probability p_u that increases linearly (Eq 7); at inference, unseen sensors always use s_u. Stage losses: Stage 1 = L_rec^S + L_rec^D + L_pred^D (Eq 8); Stage 2 = L_align + λ L_match (Eq 9).
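
A sketch of the token-selection logic, in plain Python. The ramp from 0 to a cap p_max, the value 0.75, and the choice to keep the sensor-specific token for seen sensors at inference are all assumptions for illustration; Eq 7 defines the actual schedule. stage2_loss follows the Eq 9 summary above.

```python
import random

def universal_prob(step, total_steps, p_max=0.75):
    """Linearly increasing probability of using the universal token.

    Starting at 0 and capping at p_max are assumptions, not the paper's values.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_max * frac

def pick_sensor_token(sensor, step, total_steps, known_sensors, rng, training=True):
    """Return the name of the sensor token to prepend to the input."""
    if sensor not in known_sensors:
        return "universal"                 # unseen sensor: always the shared token
    if training and rng.random() < universal_prob(step, total_steps):
        return "universal"                 # randomly train the shared token
    return sensor                          # otherwise the sensor-specific token

def stage2_loss(l_align, l_match, lam):
    """Alignment loss plus lambda-weighted matching loss."""
    return l_align + lam * l_match

rng = random.Random(0)
known = {"gelsight", "digit", "duragel"}
for step in (0, 50, 100):
    token = pick_sensor_token("digit", step, 100, known, rng)
    print(step, universal_prob(step, 100), token)
print(pick_sensor_token("tacto", 0, 100, known, rng, training=False))
```

Because seen sensors also pass through the universal token during training, that token has received gradient from every sensor by the time an unseen one relies on it.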

Setup

Datasets / benchmarks: 9 tactile training datasets — Touch and Go, VisGel, Cloth, ObjectFolder Real, TVL, YCB-Slide, SSVTP, Octopi, plus the coarse-grained subset of their new TacQuad (aligned, 4 sensors, 72,606 contact frames: 17,524 fine-grained spatio-temporal across 25 objects + 55,082 coarse-grained spatial across 99 objects; each frame has GPT-4o-generated, manually-corrected tactile text). Downstream eval: Touch and Go, Feel, ObjectFolder 1.0 and 2.0; real-world fine-grained pouring.
Hardware / simulator: four visuo-tactile sensors — GelSight Mini (public), DIGIT (public), DuraGel (self-made), Tac3D (force-field). Real-world pouring uses a robot arm with GelSight Mini pouring beads into a cylinder (Fig 5). Unseen-sensor eval includes simulated sensors TACTO and Taxim.
Baselines: single-sensor models ViT-LENS-2, TLV-Link, OmniBind; multi-sensor models UniTouch and UniTouch† (retrained on matched data); CLIP (no tactile pre-training); and T3 (trained on 3M data) for the real-world pouring task.
Compute: not reported (data volumes up to 2,427k samples reported).

Results

Static perception on seen sensor (GelSight), Table 2 (accuracy):

Method	TAG Material	TAG Roughness	TAG Hardness	Feel Grasp
CLIP	52.96	84.09	88.34	72.37
ViT-LENS-2	63.0	85.1	92.0	–
TLV-Link	67.2	84.7	91.3	94.5‡
UniTouch	61.3	–	–	82.3
TLV-Link†	74.12	85.94	76.97	76.97
AnyTouch (TAG, Feel, YCB, OF2.0)	82.74	86.01	94.24	87.17
AnyTouch (all data)	80.82	86.74	94.68	80.53

Static perception on unseen sensors (TACTO & Taxim), Table 3 (linear probe, material acc): AnyTouch reaches 49.62 on ObjectFolder 1.0 (vs UniTouch† 47.25, CLIP 41.00) and 85.87 on ObjectFolder 2.0 (vs UniTouch† 75.29, CLIP 73.16). It is best on both unseen sensors, consistent with cross-sensor transfer from the universal sensor token plus cross-sensor matching.
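
For readers unfamiliar with the protocol, a linear probe fits only a linear classifier on frozen encoder features. The sketch below assumes NumPy and uses closed-form ridge regression onto one-hot labels as a simple stand-in (the paper's probe is presumably trained differently); the synthetic two-class features are made up and unrelated to the reported numbers.

```python
import numpy as np

def fit_linear_probe(feats, labels, num_classes, ridge=1e-3):
    """Closed-form ridge regression onto one-hot labels; the encoder stays frozen."""
    x = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    y = np.eye(num_classes)[labels]                    # one-hot targets
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)

def probe_accuracy(weights, feats, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    x = np.hstack([feats, np.ones((len(feats), 1))])
    return float(np.mean((x @ weights).argmax(axis=1) == labels))

rng = np.random.default_rng(0)
centers = np.array([[3.0, 3.0, 3.0, 3.0], [-3.0, -3.0, -3.0, -3.0]])
labels = rng.integers(0, 2, size=200)
# Stand-in for frozen encoder outputs: two well-separated synthetic clusters.
feats = centers[labels] + 0.5 * rng.normal(size=(200, 4))

w = fit_linear_probe(feats[:150], labels[:150], num_classes=2)
acc = probe_accuracy(w, feats[150:], labels[150:])
```

Since the encoder is never updated, probe accuracy reflects how linearly separable the frozen features already are, which is why it suits a transfer comparison.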

Real-world dynamic pouring, Table 4 (mean error in grams, lower better):

Method	Dynamic Perception	Fine-tune err	Freeze err
CLIP	✗	5.22	49.1
T3 (3M data)	✗	2.33	9.74
AnyTouch (static-only)	✗	2.45	9.60
AnyTouch	✓	1.56	8.22

The full dynamic AnyTouch beats T3 despite T3 using more data; the static-only ablation is roughly on par with T3, isolating the value of dynamic (video) perception.

Where it loses / is mixed. In TAG material (Table 1), the model trained on only GelSight data scores 83.55, higher than versions trained on more multi-sensor data (down to 79.61). Adding more sensors dilutes the seen dataset's share of the training mix and slightly hurts that one in-distribution task (the authors note this echoes CLIP's data-overlap finding). Also, integrating large amounts of DIGIT data helps less than GelSight Mini data, which the authors attribute to hardware image differences.

Limitations & open questions

From the authors:

Fine-grained spatio-temporal aligned collection is costly and time-consuming, capping the precisely-aligned subset at 30 sets / 25 objects (17,524 frames); the larger subset is only coarse-grained.
Adding more multi-sensor data can reduce performance on an in-distribution task when it shrinks that dataset's proportion (the GelSight-only-best-on-TAG-material effect).
Benefit of integrating a given sensor depends on hardware similarity — DIGIT's images differ more, so its data transfers less than GelSight Mini's.

What I noticed reading it:

The cross-sensor matching task needs same-object same-position pairs across sensors, which is exactly the expensive aligned data TacQuad provides — the method's headline novelty is bottlenecked on the dataset that is admittedly hard to scale. How does matching degrade as alignment quality drops from fine- to coarse-grained?
Real-world dynamic eval is a single task (pouring) averaged over only 10 runs, reported as mean error without variance — a thin statistical basis for the dynamic-perception claim, the paper's most novel capability.
"Static-dynamic" is operationalized as image-vs-video plus next-frame prediction; there is no explicit force/pressure signal — all dynamics are inferred from the optical gel deformation, so force-magnitude predicates (e.g. is_screwed_tight) aren't directly addressed.
Most evaluation tasks are binary/few-class classification (material, roughness, hardness, grasp success); harder open-ended property reasoning (cf. Octopi) isn't tested.

Why I care

This is squarely on-theme for the touch-as-a-perception-channel thesis behind the 2026-06-24 batch: many manipulation predicates (is_grasped, is_inserted, surface_is_rough, is_full) are not visually evaluable and live in touch. Relative to BLADE, where predicate classifiers f_θ(p) are learned from RGB-D crops, AnyTouch is a candidate tactile backbone for predicates that vision can't ground — and crucially it claims a sensor-agnostic representation, which matters if a BLADE-style system should not be re-trained per gripper/sensor. Two specific hooks: (1) its static branch maps to instantaneous predicates (material, hardness) while its dynamic branch maps to process/effect predicates (the pouring task is literally estimating a continuous fill quantity — the touch analogue of BLADE's diffusion-policy continuous parameters). (2) Its cross-sensor matching is a clean way to make learned predicates portable, a generalization axis orthogonal to BLADE's novel-state generalization. The representation is foundation-model-style, not a planner, so it complements rather than competes with the abstraction layer — it would sit below the predicate classifiers, supplying the tactile features they read.

Quotable

We recognize that the human tactile perception is a combination of static and dynamic processes… Drawing on this insight, we propose learning unified representations from both static and dynamic perspectives to accommodate a range of tasks. — §1, Introduction / p.4

Our method not only uses multi-modal data to bridge the gap between sensors, but also explicitly clusters representations of the same position on the same object from different sensors together. — Fig 3 caption / p.6

Papers cited that should likely be ingested next:

Feng et al. 2024 — Play to the Score (stage-guided dynamic multi-sensory fusion) — same first author; dynamic visuo-tactile + audio fusion for pouring and peg insertion. Direct conceptual precursor to AnyTouch's dynamic branch.
Rodriguez et al. 2024 — the dual-sensor cross-sensor-generation dataset AnyTouch positions TacQuad against.
He et al. 2022 — Masked Autoencoders (MAE) — the pixel-level masked-modeling backbone of Stage 1.
Radford et al. 2021 — CLIP — the contrastive aligning recipe underlying Stage 2a.
Yang et al. 2024 (sensor-specific tokens) — the token scheme AnyTouch extends with universal sensor tokens.

Newly ingested in 2026-06-24 batch — directly relevant:

UniTouch — the closest prior multi-sensor tactile representation and a key baseline; AnyTouch differs by adding dynamic (video) perception and explicit cross-sensor matching.
TVL and Touch100k / TLV — touch–vision–language datasets AnyTouch trains on and whose TLV-Link is a baseline; same touch-language-alignment family.
Sparsh and T3 — sibling tactile representation/foundation models (T3 is the real-world pouring baseline); same Cluster B problem, different recipe (self-supervised touch vs. transferable transformers vs. static-dynamic multi-sensor).
UniT and MViTac — other tactile pretraining approaches (data-efficient / visual-tactile contrastive) in the same cluster.
Octopi — tactile-language property reasoning; the harder open-ended-reasoning eval AnyTouch's classification benchmarks don't cover.