One-liner. AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities — pixel-level masked modeling for fine detail and semantic-level touch–vision–language alignment plus a novel cross-sensor matching task — so one model perceives static properties (material/hardness) and dynamic changes (sliding/pressure) and transfers across GelSight, DIGIT, DuraGel, and unseen sensors.
Visuo-tactile sensors (GelSight, DIGIT, etc.) are poorly standardized: each sensor renders the same physical contact differently, so models trained on one sensor's data don't transfer, and per-sensor datasets stay small. Two gaps motivate the paper. First, data: prior cross-sensor efforts lack aligned multi-sensor data (Rodriguez et al. 2024 paired only two sensors on narrow manipulation tasks, ignoring material/hardness and multi-modal text). Second, representation: human touch is both static (a brief touch recognizes texture/material) and dynamic (continuous perception while unlocking a lock or pouring), but existing tactile representation work (e.g. UniTouch, TLV-Link) learns from static tactile images only and clusters by sensor rather than by the object property being touched. AnyTouch attacks both: a new aligned dataset (TacQuad) and a unified static-dynamic, multi-level framework.
Two contributions: the TacQuad dataset and the AnyTouch multi-level framework (Fig 2). The framework integrates tactile images and videos and trains in two sequential stages.
Unified input format. A static tactile image I ∈ R^(1×H×W×3)
is treated as a static video: it is replicated F times along the time axis into a 4-D tensor
X_T ∈ R^(F×H×W×3); videos pass through directly. A shared
patch projection turns both into spatio-temporal tokens z ∈ R^(N×d),
so images and videos are processed by the same encoder.
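A minimal NumPy sketch of this unified format, under stated assumptions: the tubelet size of 2 frames, the 16×16 patch, the 32×32 input, and the names `to_clip`, `patchify`, and `W_proj` are all hypothetical and chosen only for illustration; the notes above say only that a shared patch projection is used.

```python
import numpy as np

def to_clip(x: np.ndarray, num_frames: int) -> np.ndarray:
    """Return an (F, H, W, 3) clip; a (1, H, W, 3) image is repeated F times."""
    if x.shape[0] == 1:
        return np.repeat(x, num_frames, axis=0)
    if x.shape[0] != num_frames:
        raise ValueError("video must already have num_frames frames")
    return x

def patchify(clip: np.ndarray, patch: int = 16, tubelet: int = 2) -> np.ndarray:
    """Cut a clip into non-overlapping (tubelet, patch, patch) tubes, one row per tube."""
    F, H, W, C = clip.shape
    assert F % tubelet == 0 and H % patch == 0 and W % patch == 0
    t, h, w = F // tubelet, H // patch, W // patch
    x = clip.reshape(t, tubelet, h, patch, w, patch, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (t, h, w, tubelet, patch, patch, C)
    return x.reshape(t * h * w, tubelet * patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((1, 32, 32, 3))   # static tactile image
video = rng.random((4, 32, 32, 3))   # tactile video with F = 4

# One projection matrix shared by images and videos (assumed linear, d = 64).
W_proj = rng.standard_normal((2 * 16 * 16 * 3, 64)) * 0.02

z_img = patchify(to_clip(image, 4)) @ W_proj   # tokens z in R^(N x d)
z_vid = patchify(to_clip(video, 4)) @ W_proj
print(z_img.shape, z_vid.shape)
```

Both inputs end up with the same token shape, which is what lets a single encoder consume them.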
Stage 1 — Masked modeling (pixel level, §4.2). Randomly mask
tokens of images/videos at ratio ρ; a decoder reconstructs the static image
Î and dynamic video V̂ under MSE losses
L_rec^S and L_rec^D (Eq 1). An additional next-frame
prediction task (Eq 2) predicts frame V_(F+1) to capture continuous
deformation dynamics. This learns fine-grained, sensor-specific deformation detail.
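The Stage 1 objective can be sketched roughly as follows. Eq 1 and Eq 2 are not reproduced in these notes, so this assumes the common masked-autoencoder convention of averaging the MSE over masked tokens only, and a plain MSE for the next-frame term. The function names and toy shapes are hypothetical.

```python
import numpy as np

def random_mask(num_tokens: int, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask with round(ratio * N) True entries (True = hidden from the encoder)."""
    num_masked = int(round(ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Reconstruction MSE averaged over masked tokens only (assumed convention)."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

def next_frame_mse(pred_frame: np.ndarray, true_frame: np.ndarray) -> float:
    """MSE between the predicted frame F+1 and the real one."""
    return float(np.mean((pred_frame - true_frame) ** 2))

rng = np.random.default_rng(0)
tokens = rng.random((8, 12))          # N = 8 target patches, 12 values each
mask = random_mask(8, 0.75, rng)      # rho = 0.75 is an illustrative value
pred = np.zeros_like(tokens)          # stand-in for the decoder output
loss = masked_mse(pred, tokens, mask)

next_true = rng.random((4, 4, 3))     # toy frame V_(F+1)
next_pred = np.zeros_like(next_true)  # stand-in for the prediction head
stage1 = loss + next_frame_mse(next_pred, next_true)
print(mask.sum(), loss, stage1)
```

In the real pipeline the static and dynamic reconstruction terms would each be computed this way on image and video batches and summed with the next-frame term.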
Stage 2a — Multi-modal aligning (semantic level, §4.3). Bind
touch to vision and text via CLIP-style contrastive learning, using text
(tactile attribute descriptions) as the anchor since text consistently describes attributes
across datasets. Because tri-modal tactile data is rare, they use GPT-4o to expand text
pairings (1.4M new text pairs across four datasets) and adopt a
modality-missing-aware contrastive loss (Eq 3–4) that aligns over the
largest available subset per modality pair (touch↔vision, touch↔text,
vision↔text) with weights α_TV, α_TL, α_VL.
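One way to read the modality-missing-aware loss is sketched below. Eq 3 and Eq 4 are not reproduced in these notes, so the symmetric InfoNCE form, the temperature of 0.07, the unit weights, and the names `info_nce` and `missing_aware_loss` are assumptions. The only idea taken from the text is that each modality pair is aligned over the samples where both modalities exist.

```python
import numpy as np

def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between row-aligned embedding matrices a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)              # stabilise the softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(logp)))             # matching pairs sit on the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

def missing_aware_loss(emb: dict, present: dict, weights: dict) -> float:
    """Sum weighted pairwise losses, each over samples that have both modalities."""
    total = 0.0
    for (m1, m2), w in weights.items():
        both = present[m1] & present[m2]
        if both.sum() < 2:        # in-batch negatives need at least two samples
            continue
        total += w * info_nce(emb[m1][both], emb[m2][both])
    return total

rng = np.random.default_rng(0)
B, d = 4, 8
emb = {m: rng.standard_normal((B, d)) for m in ("touch", "vision", "text")}
present = {
    "touch": np.array([True, True, True, True]),
    "vision": np.array([True, True, True, False]),   # sample 3 has no image
    "text": np.array([True, False, True, True]),     # sample 1 has no text
}
# Stand-ins for alpha_TV, alpha_TL, alpha_VL.
weights = {("touch", "vision"): 1.0, ("touch", "text"): 1.0, ("vision", "text"): 1.0}
loss = missing_aware_loss(emb, present, weights)
print(loss)
```

Pairs with fewer than two co-present samples are skipped here; that threshold is a choice made for this sketch, not something stated in the notes.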
Stage 2b — Cross-sensor matching (sensor-agnostic features, §4.4).
The novel task: given a tactile sample X_T, decide whether a second sample comes from
the same physical contact. Positives X_T^+ share the object and contact position but come
from a different sensor; negatives X_T^- differ in object or position.
Touch representations are element-wise multiplied (x_T · x_T^+) and scored by an
MLP under a binary-cross-entropy loss (Eq 5–6). This explicitly clusters
representations of the same object across sensors (Fig 3), going beyond UniTouch's
image-only alignment.
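The matching head described above can be sketched as follows. The two-layer MLP with ReLU, the hidden width, and the pairing of one positive with one negative per anchor are assumptions; Eq 5 and Eq 6 are not reproduced in these notes. Only the element-wise product and the binary cross-entropy come from the text.

```python
import numpy as np

def match_logit(x_a: np.ndarray, x_b: np.ndarray, params: dict) -> float:
    """Score a pair: element-wise product of the two touch features, then a small MLP."""
    h = np.maximum(0.0, (x_a * x_b) @ params["W1"] + params["b1"])  # ReLU hidden layer
    return float(h @ params["w2"] + params["b2"])

def bce(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return float(np.maximum(logit, 0.0) - logit * label + np.log1p(np.exp(-abs(logit))))

def matching_loss(x: np.ndarray, x_pos: np.ndarray, x_neg: np.ndarray, params: dict) -> float:
    """Positive pair should score 1 (same contact), negative pair 0."""
    return bce(match_logit(x, x_pos, params), 1.0) + bce(match_logit(x, x_neg, params), 0.0)

rng = np.random.default_rng(0)
d, hidden = 8, 16
params = {
    "W1": rng.standard_normal((d, hidden)) * 0.1,
    "b1": np.zeros(hidden),
    "w2": rng.standard_normal(hidden) * 0.1,
    "b2": 0.0,
}
x, x_pos, x_neg = rng.standard_normal((3, d))   # stand-ins for encoder outputs
loss = matching_loss(x, x_pos, x_neg, params)
print(loss)
```

Because the two features are combined by a product, the score does not depend on which sample is treated as the anchor.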
Universal sensor tokens (§4.5). Instead of only sensor-specific
tokens s_k (which can't transfer to unseen sensors), they add a shared
universal token s_u. During training each input randomly uses
s_u with probability p_u that increases linearly (Eq 7); at
inference, unseen sensors always use s_u. Stage losses: Stage 1
= L_rec^S + L_rec^D + L_pred^D (Eq 8); Stage 2 = L_align + λ L_match (Eq 9).
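The token-selection rule can be sketched as below. Eq 7 is not reproduced in these notes, so the ramp from 0 to a ceiling `p_max`, the value 0.75, and the choice to keep the sensor-specific token for seen sensors at inference are all assumptions; the function names are hypothetical.

```python
import random

def universal_prob(step: int, total_steps: int, p_max: float = 0.75) -> float:
    """Linearly ramp the probability of using the universal token from 0 to p_max."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_max * frac

def pick_sensor_token(sensor: str, step: int, total_steps: int,
                      known_sensors: set, rng: random.Random,
                      training: bool = True) -> str:
    """Return the name of the sensor token to attach to one input."""
    if sensor not in known_sensors:
        return "universal"                       # unseen sensors always use s_u
    if training and rng.random() < universal_prob(step, total_steps):
        return "universal"                       # randomly swap in s_u during training
    return sensor                                # otherwise the sensor-specific s_k

rng = random.Random(0)
known = {"GelSight", "DIGIT", "DuraGel"}
late = [pick_sensor_token("DIGIT", 100, 100, known, rng) for _ in range(1000)]
print(late.count("universal") / 1000)            # fraction that used s_u at the end of training
print(pick_sensor_token("TACTO", 0, 100, known, rng, training=False))
```

The ramp means early training mostly learns sensor-specific tokens, while later steps increasingly expose the shared token that unseen sensors rely on.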
Static perception on seen sensor (GelSight), Table 2 (accuracy):
| Method | TAG Material | TAG Roughness | TAG Hardness | Feel Grasp |
|---|---|---|---|---|
| CLIP | 52.96 | 84.09 | 88.34 | 72.37 |
| ViT-LENS-2 | 63.0 | 85.1 | 92.0 | – |
| TLV-Link | 67.2 | 84.7 | 91.3 | 94.5‡ |
| UniTouch | 61.3 | – | – | 82.3 |
| TLV-Link† | 74.12 | 85.94 | 76.97 | 76.97 |
| AnyTouch (TAG, Feel, YCB, OF2.0) | 82.74 | 86.01 | 94.24 | 87.17 |
| AnyTouch (all data) | 80.82 | 86.74 | 94.68 | 80.53 |
Static perception on unseen sensors (TACTO & Taxim), Table 3 (linear probe, material acc): AnyTouch reaches 49.62 on ObjectFolder 1.0 (UniTouch† 47.25, CLIP 41.00) and 85.87 on ObjectFolder 2.0 (UniTouch† 75.29, CLIP 73.16). It is best on both unseen sensors, which is consistent with cross-sensor transfer from the universal sensor token and cross-sensor matching.
Real-world dynamic pouring, Table 4 (mean error in grams, lower better):
| Method | Dynamic Perception | Fine-tune err | Freeze err |
|---|---|---|---|
| CLIP | ✗ | 5.22 | 49.1 |
| T3 (3M data) | ✗ | 2.33 | 9.74 |
| AnyTouch (static-only) | ✗ | 2.45 | 9.60 |
| AnyTouch | ✓ | 1.56 | 8.22 |
The full dynamic AnyTouch beats T3 in both settings (1.56 vs 2.33 fine-tuned, 8.22 vs 9.74 frozen) despite T3 using more data; the static-only ablation is roughly on par with T3 (2.45 vs 2.33, 9.60 vs 9.74), isolating the value of dynamic (video) perception.
Where it loses / is mixed. In TAG material (Table 1), the model trained on only GelSight data scores 83.55, higher than versions trained on more multi-sensor data (as low as 79.61). Adding more sensors dilutes the seen dataset's share of the training mix and slightly hurts that one in-distribution task (the authors note this echoes CLIP's data-overlap finding). Integrating the large DIGIT data also helps less than GelSight Mini data, a gap attributed to hardware image differences.
From the authors:
What I noticed reading it:
is_screwed_tight) aren't directly addressed.
This is squarely on-theme for the touch-as-a-perception-channel thesis behind the 2026-06-24
batch: many manipulation predicates (is_grasped, is_inserted,
surface_is_rough, is_full) are not visually evaluable and live
in touch. Relative to BLADE,
where predicate classifiers f_θ(p) are learned from RGB-D crops, AnyTouch is a
candidate tactile backbone for predicates that vision can't ground — and crucially
it claims a sensor-agnostic representation, which matters if a BLADE-style system
should not be re-trained per gripper/sensor. Two specific hooks: (1) its static branch
maps to instantaneous predicates (material, hardness) while its dynamic branch maps to
process/effect predicates (the pouring task is literally estimating a continuous fill
quantity — the touch analogue of BLADE's diffusion-policy continuous parameters). (2) Its
cross-sensor matching is a clean way to make learned predicates portable, a generalization axis
orthogonal to BLADE's novel-state generalization. The representation is foundation-model-style, not
a planner, so it complements rather than competes with the abstraction layer — it would sit
below the predicate classifiers, supplying the tactile features they read.
We recognize that the human tactile perception is a combination of static and dynamic processes… Drawing on this insight, we propose learning unified representations from both static and dynamic perspectives to accommodate a range of tasks. — §1, Introduction / p.4
Our method not only uses multi-modal data to bridge the gap between sensors, but also explicitly clusters representations of the same position on the same object from different sensors together. — Fig 3 caption / p.6
Papers cited that should likely be ingested next:
Newly ingested in 2026-06-24 batch — directly relevant: