Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation

Carolina Higuera*, Akash Sharma*, Taosha Fan*, Chaithanya Krishna Bodduluri, Byron Boots, Michael Kaess, Mike Lambeta, Tingfan Wu, Zixi Liu, Francois Robert Hogan†, Mustafa Mukadam† · FAIR at Meta / UW / CMU · 2025 · arXiv:2506.14754

One-liner. Sparsh-X is the first self-supervised touch backbone that fuses four tactile modalities — image, audio, motion (IMU), and pressure — from the Digit 360 fingertip into one shared embedding, and the paper shows that touch "beyond the camera image" is what lets a representation pick up contact properties (force, material, slip) that a tactile-image-only encoder simply cannot see.

Problem & motivation

Human dexterity leans on a wide spectrum of touch signals — skin deformation, vibration, motion, pressure — yet robot tactile sensing has been overwhelmingly unimodal, built around GelSight-style tactile images because that hardware was standardized and available. New compact sensors like the Digit 360 now expose images, contact microphones, IMU, and static pressure in one fingertip, but there is no unified, scalable method to actually exploit all of them together. Prior multimodal work either treats every signal as an RGB image and concatenates tokens (quadratic attention cost; MULSA [34]) or learns each modality independently (MimicTouch [5]). Sparsh-X argues that representation learning over fused heterogeneous touch is the missing ingredient: complementary contact cues (high-frequency audio + tactile image both flag make/break contact; IMU + pressure carry shear and slip) collapse into one latent that is more data-efficient and robust for downstream manipulation. This directly extends the unimodal vision-based Sparsh line to the multisensory regime.

Method

Sparsh-X is a transformer backbone (ViT-style [36]) over four Digit 360 streams, fused through attention bottlenecks ([35], Nagrani et al.) rather than full cross-modal concatenation. Architecture and pipeline are in Fig 2.

Per-modality tokenization. Each stream is preprocessed to its own spatio-temporal scale before a linear projection to 768-dim tokens:

Bottleneck fusion. Each modality first runs independently through L_f = 8 unimodal self-attention layers. Cross-modal information is then exchanged only through B = 4 shared bottleneck tokens over L_b = 4 fusion layers (total depth L = L_f + L_b = 12). After each fusion block the bottleneck tokens are averaged across modalities, which forces cross-modal information through a low-bandwidth summarizer. Because no modality attends directly to another modality's tokens, attention cost avoids the quadratic growth in combined token count that naive token concatenation incurs.
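The fusion scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the layers are single-head attention with no MLP or layer norm, the width `D = 64` is a toy value (the text states 768), and the per-modality token counts, `make_layer`, and `forward` are hypothetical names and values chosen for the example.

```python
import numpy as np

D = 64                      # toy width for speed; the text states 768
N_BOTTLENECK = 4            # B in the text
L_F, L_B = 8, 4             # unimodal and fusion depths from the text
rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def make_layer():
    # Hypothetical single-head attention layer: Q, K, V projections only
    # (no MLP, no layer norm), to keep the sketch short.
    return [rng.normal(0.0, D ** -0.5, (D, D)) for _ in range(3)]

def self_attention(tokens, layer):
    w_q, w_k, w_v = layer
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(D))
    return tokens + weights @ v          # residual connection

def forward(streams, unimodal, fusion, bottleneck):
    streams = dict(streams)              # do not mutate the caller's dict
    # Stage 1: L_F layers where each modality attends only to itself.
    for name in list(streams):
        for layer in unimodal[name]:
            streams[name] = self_attention(streams[name], layer)
    # Stage 2: L_B layers where each modality attends to its own tokens
    # plus the shared bottleneck tokens, never to another modality.
    for depth in range(L_B):
        per_modality = []
        for name in list(streams):
            n = streams[name].shape[0]
            joint = np.concatenate([streams[name], bottleneck], axis=0)
            joint = self_attention(joint, fusion[name][depth])
            streams[name] = joint[:n]
            per_modality.append(joint[n:])
        # Average each modality's updated copy of the bottleneck tokens.
        bottleneck = np.mean(per_modality, axis=0)
    return streams, bottleneck

# Hypothetical token counts per modality; the real values are not given here.
token_counts = {"image": 49, "audio": 32, "imu": 16, "pressure": 8}
streams = {m: rng.normal(size=(n, D)) for m, n in token_counts.items()}
unimodal = {m: [make_layer() for _ in range(L_F)] for m in token_counts}
fusion = {m: [make_layer() for _ in range(L_B)] for m in token_counts}
bottleneck0 = rng.normal(size=(N_BOTTLENECK, D))

fused, bottleneck_out = forward(streams, unimodal, fusion, bottleneck0)
print({m: t.shape for m, t in fused.items()}, bottleneck_out.shape)
```

With these toy counts, one fusion layer scores 53² + 36² + 20² + 12² = 4,649 query-key pairs, versus 105² = 11,025 if all 105 tokens were concatenated into a single sequence. That gap is the saving the bottleneck design is after.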

SSL objective. Training uses a teacher–student self-distillation scheme (DINO/iBOT-style [40, 11]) in which both branches consist of an encoder followed by a predictor. After tokenizing and adding sinusoidal positional embeddings plus a register token, masking is applied to the student input tokens per modality, retaining 10–50% of the signal for local masks and 50–100% for global masks. Register tokens from global and local masks are concatenated and passed through prediction heads. As in DINO, the prediction task is online clustering: teacher outputs serve as pseudo-labels (centroids adapt over time), and the loss is the cross-entropy between teacher and student softmax outputs.
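A rough sketch of the loss and the mask sampling, in the generic DINO style. The keep ratios come from the text; everything else is an assumption borrowed from the original DINO recipe rather than confirmed for Sparsh-X: the temperatures (0.1 and 0.04), the centering term, the EMA teacher update, and the choice to drop tokens uniformly at random (the paper may mask contiguous blocks instead). The function names are hypothetical.

```python
import numpy as np

def softmax(logits, temp):
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    # Teacher output is centered and sharpened, then used as a fixed
    # pseudo-label distribution (no gradient flows through it).
    targets = softmax(teacher_logits - center, teacher_temp)
    log_probs = np.log(softmax(student_logits, student_temp) + 1e-12)
    # Cross-entropy between teacher and student distributions.
    return float(-(targets * log_probs).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    # Assumed DINO-style teacher: exponential moving average of the student.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def sample_keep_mask(n_tokens, kind, rng):
    # Keep ratios from the text: 10-50% for local, 50-100% for global masks.
    lo, hi = (0.1, 0.5) if kind == "local" else (0.5, 1.0)
    n_keep = max(1, int(round(rng.uniform(lo, hi) * n_tokens)))
    mask = np.zeros(n_tokens, dtype=bool)
    # Assumption: kept tokens are chosen uniformly at random.
    mask[rng.choice(n_tokens, size=n_keep, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
local_mask = sample_keep_mask(100, "local", rng)
global_mask = sample_keep_mask(100, "global", rng)

teacher = np.array([[5.0, 0.0, 0.0]])
center = np.zeros(3)
loss_agree = distillation_loss(np.array([[5.0, 0.0, 0.0]]), teacher, center)
loss_disagree = distillation_loss(np.array([[0.0, 5.0, 0.0]]), teacher, center)
print(local_mask.sum(), global_mask.sum(), loss_agree < loss_disagree)
```

The loss is near zero when the student puts its mass on the same cluster as the teacher and grows when it does not, which is all the "online clustering" framing requires.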

Downstream usage. Sparsh-X is then frozen. For each downstream task a small task-specific attentive-pooling decoder ([41], context autoencoder) is trained on top, both for physical-property probes and for policy learning.
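One common form of attentive pooling is a single learned query that cross-attends over the frozen backbone tokens, followed by a linear head. The sketch below assumes that form; the exact decoder in [41] may differ. `AttentivePoolingProbe`, the token count, and the toy width are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

class AttentivePoolingProbe:
    """Hypothetical probe: one learned query pools frozen tokens."""

    def __init__(self, dim, n_outputs, rng):
        scale = dim ** -0.5
        self.query = rng.normal(0.0, scale, (1, dim))     # learned query token
        self.w_k = rng.normal(0.0, scale, (dim, dim))     # key projection
        self.w_v = rng.normal(0.0, scale, (dim, dim))     # value projection
        self.w_out = rng.normal(0.0, scale, (dim, n_outputs))

    def __call__(self, frozen_tokens):
        # frozen_tokens: (n_tokens, dim) output of the frozen backbone.
        k = frozen_tokens @ self.w_k
        v = frozen_tokens @ self.w_v
        scores = self.query @ k.T / np.sqrt(k.shape[-1])  # (1, n_tokens)
        scores = scores - scores.max()
        weights = np.exp(scores)
        weights = weights / weights.sum()                 # attention over tokens
        pooled = weights @ v                              # (1, dim) summary
        return (pooled @ self.w_out)[0], weights[0]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(20, 64))       # stand-in for frozen Sparsh-X tokens
probe = AttentivePoolingProbe(dim=64, n_outputs=36, rng=rng)  # 36-class task
logits, attention = probe(tokens)
print(logits.shape, round(float(attention.sum()), 6))
```

Only the probe's parameters would receive gradients during downstream training; the backbone tokens are treated as fixed inputs.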

Setup

Results

Headline claims (abstract + §4–5): multisensory pretraining boosts plug-insertion policy success by 63% over an E2E tactile-image model, improves robustness in recovering object state from touch by 90% (measured as vertical-drift reduction vs. Hora during in-hand rotation), and improves physical-property characterization by 48% on average over E2E approaches.

Physical-property probes (frozen Sparsh-X, Fig 4):

Task | Finding
--- | ---
Object-Action-Surface (36-class) | Audio+IMU pairing gives +32% over tactile image alone; all modalities give +13% over image alone; pretraining adds up to a 10% margin in the lowest-data regime. At 50% data: E2E-image 68.8% vs. Sparsh-X-all 87.5% (Fig 12).
Material-Quantity (18-class) | Sparsh-X (all modalities) is highest across every data budget; +20.5% over E2E tactile-image. At 33% data: E2E-image 68.8% vs. Sparsh-X 87.5% (Fig 13).
Normal-Force regression (≤3.5 N) | Combining all modalities → average error 35 mN, a 17% improvement over tactile-image only.

Plug insertion via imitation learning (Fig 5, 20 trials): multisensory touch reaches a 90% success rate on the tight-tolerance task. Pretraining gives a 90% performance boost over training all modalities from scratch jointly with the policy. Sparsh-X (all modalities) is +500% over external-vision-only and +63% over E2E tactile-vision-only. Where it loses / nuance: for the tactile-image modality alone, E2E training actually outperforms frozen pretrained representations — the authors attribute this to the tactile-image signal varying little across trials, so a task-specific encoder can specialize; multimodality is what closes the gap.
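The percentages in this paragraph read most naturally as relative gains in success rate, although the summary does not say so explicitly. Under that assumption, a quick back-of-envelope check shows what baseline success rates they would imply. The derived baselines are inferences for sanity-checking only, not numbers reported in the paper, and the helper names are hypothetical.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, e.g. 0.63 for +63%."""
    return (new - base) / base

def implied_baseline(new, gain):
    """Baseline value that a relative `gain` over it would turn into `new`."""
    return new / (1.0 + gain)

TRIALS = 20                   # from the text
sparsh_x_success = 0.90       # from the text

# Assumption: +500% and +63% are relative gains in success rate.
vision_only = implied_baseline(sparsh_x_success, 5.00)
e2e_tactile = implied_baseline(sparsh_x_success, 0.63)

print(f"external vision only: {vision_only:.3f} (~{vision_only * TRIALS:.1f} of {TRIALS} trials)")
print(f"E2E tactile image:    {e2e_tactile:.3f} (~{e2e_tactile * TRIALS:.1f} of {TRIALS} trials)")
```

Both implied baselines land close to whole trial counts out of 20 (about 3 and about 11), which is at least consistent with the relative-gain reading.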

In-hand rotation, sim-to-real tactile adaptation (Figs 6–7; 10 trials under nominal conditions, 5 trials under altered dynamics): a ControlNet [52] tactile adaptation module with a zero-init convolution injects frozen Sparsh-X features into the frozen Hora policy. This reduces vertical translation (drift) by 90% vs. Hora and outperforms fine-tuned Hora, the proprio-only IL baseline (which often fails from OOD states), and Hora+ControlNet(E2E). Under reduced friction, Hora+ControlNet(Sparsh-X) keeps the object stable without losing grasp, while Hora+ControlNet(image) struggles; the authors say contact-patch changes are too subtle for tactile images alone. Under +20g mass, both the Sparsh-X and image variants let the base policy adapt finger gaiting.
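The key property of a zero-initialized ControlNet-style adapter is that, before any training, the combined system reproduces the frozen base policy exactly. The toy sketch below illustrates that property under loud assumptions: a two-layer MLP stands in for Hora, a zero-initialized linear projection stands in for the zero-init convolution, and all class names and dimensions are hypothetical.

```python
import numpy as np

class FrozenPolicy:
    """Hypothetical stand-in for the frozen base policy (not Hora itself)."""

    def __init__(self, obs_dim, hidden, act_dim, rng):
        self.w1 = rng.normal(0.0, obs_dim ** -0.5, (obs_dim, hidden))
        self.w2 = rng.normal(0.0, hidden ** -0.5, (hidden, act_dim))

    def hidden(self, obs):
        return np.tanh(obs @ self.w1)

    def head(self, h):
        return h @ self.w2

class TactileAdapter:
    """Trainable branch that maps frozen touch features to a residual."""

    def __init__(self, touch_dim, hidden, rng):
        self.w_in = rng.normal(0.0, touch_dim ** -0.5, (touch_dim, hidden))
        # Zero-initialized output projection (linear analogue of a zero conv):
        # the adapter contributes nothing until training moves these weights.
        self.w_zero = np.zeros((hidden, hidden))

    def __call__(self, touch):
        return np.tanh(touch @ self.w_in) @ self.w_zero

def adapted_action(policy, adapter, obs, touch):
    # Inject the adapter's residual into the frozen policy's hidden state.
    h = policy.hidden(obs) + adapter(touch)
    return policy.head(h)

rng = np.random.default_rng(0)
policy = FrozenPolicy(obs_dim=12, hidden=32, act_dim=16, rng=rng)
adapter = TactileAdapter(touch_dim=64, hidden=32, rng=rng)
obs, touch = rng.normal(size=12), rng.normal(size=64)

base = policy.head(policy.hidden(obs))
at_init = adapted_action(policy, adapter, obs, touch)

# Simulate a training step by perturbing only the adapter's weights.
adapter.w_zero += 0.1 * rng.normal(size=adapter.w_zero.shape)
after_update = adapted_action(policy, adapter, obs, touch)
print(np.allclose(base, at_init), np.allclose(base, after_update))
```

Starting from the base policy's exact behavior means adaptation can only drift away from it as far as the tactile signal justifies, which is a plausible reason the approach stays stable with few real-world trials.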

Limitations & open questions

Author-stated:

What I noticed reading it:

Why I care

This paper is close to the heart of a thesis I want to push from BLADE: many manipulation predicates are not visually evaluable. is_inserted(plug, socket), is_screwed_tight, surface_is_rough, is_slipping, is_full(bottle) — these live in touch, force, vibration, and pressure, not pixels. BLADE learns visual classifiers fθ(p): O → {T,F} for predicates; Sparsh-X is precisely an argument that for a large family of contact predicates, the observation O must be multisensory tactile, not RGB. Its three property probes are essentially predicate-classifier benchmarks in disguise: material-quantity = is_full/material_is(x), object-action-surface = surface_is_rough/contact_made, normal-force = a graded force_applied predicate. A frozen Sparsh-X embedding is a candidate perceptual backbone for grounding those non-visual predicates — the missing input channel that would let a BLADE-style abstraction layer cover contact-rich behaviors its gripper-state segmentation and visual classifiers currently can't (a limitation BLADE itself flagged for contact-rich tasks).

The bottleneck-fusion + frozen-backbone + small-decoder recipe is also a clean template: pretrain a multisensory contact representation once, then attach a thin predicate head per symbol. That is a concrete path to multisensory predicate invention. The ControlNet tactile-adaptation trick (inject frozen touch features into a frozen privileged-information policy) is separately interesting as a way to bolt a touch representation onto an existing planner/policy without retraining.
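As a concrete picture of "a thin predicate head per symbol", here is a minimal sketch that trains a logistic-regression head on frozen embeddings to evaluate one Boolean predicate. The embeddings are synthetic stand-ins, not real Sparsh-X outputs, and `train_predicate_head`, `evaluate_predicate`, and the `is_inserted` labels are hypothetical names for illustration.

```python
import numpy as np

def train_predicate_head(embeddings, labels, lr=0.5, steps=200):
    """Fit a logistic-regression head with plain gradient descent."""
    n, d = embeddings.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(embeddings @ w + b)))
        grad = p - labels                   # gradient of the logistic loss
        w -= lr * embeddings.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def evaluate_predicate(w, b, embedding):
    """f(p): O -> {T, F}, with O a frozen touch embedding."""
    return bool(embedding @ w + b > 0.0)

rng = np.random.default_rng(0)
n, d = 200, 16
# Synthetic stand-in for frozen backbone embeddings: the predicate's truth
# value shifts one latent direction, plus unit Gaussian noise everywhere.
is_inserted = rng.random(n) < 0.5
embeddings = rng.normal(size=(n, d))
embeddings[:, 0] += np.where(is_inserted, 3.0, -3.0)

w, b = train_predicate_head(embeddings, is_inserted.astype(float))
predictions = np.array([evaluate_predicate(w, b, e) for e in embeddings])
accuracy = float((predictions == is_inserted).mean())
print(f"training accuracy: {accuracy:.3f}")
```

The backbone never changes; adding a new symbol costs one more `(w, b)` pair, which is what makes the recipe attractive for predicate invention.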

Quotable

We present Sparsh-X, the first multisensory touch representations across four tactile modalities: image, audio, motion, and pressure. — Abstract / p.1
Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images, and improves robustness by 90% in recovering object states from touch. — Abstract / p.1
Representations that encode these physical properties at the fingertip level are especially valuable for dexterous manipulation, as they enable feedback of object and contact state directly within latent space. — §1 / p.2

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: