Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation

Carolina Higuera*, Akash Sharma*, Taosha Fan*, Chaithanya Krishna Bodduluri, Byron Boots, Michael Kaess, Mike Lambeta, Tingfan Wu, Zixi Liu, Francois Robert Hogan†, Mustafa Mukadam† · FAIR at Meta / UW / CMU · 2025 · arXiv:2506.14754

One-liner. Sparsh-X is the first self-supervised touch backbone that fuses four tactile modalities — image, audio, motion (IMU), and pressure — from the Digit 360 fingertip into one shared embedding, and the paper shows that touch "beyond the camera image" is what lets a representation pick up contact properties (force, material, slip) that a tactile-image-only encoder simply cannot see.

Problem & motivation

Human dexterity leans on a wide spectrum of touch signals — skin deformation, vibration, motion, pressure — yet robot tactile sensing has been overwhelmingly unimodal, built around GelSight-style tactile images because that hardware was standardized and available. New compact sensors like the Digit 360 now expose images, contact microphones, IMU, and static pressure in one fingertip, but there is no unified, scalable method to actually exploit all of them together. Prior multimodal work either treats every signal as an RGB image and concatenates tokens (quadratic attention cost; MULSA [34]) or learns each modality independently (MimicTouch [5]). Sparsh-X argues that representation learning over fused heterogeneous touch is the missing ingredient: complementary contact cues (high-frequency audio + tactile image both flag make/break contact; IMU + pressure carry shear and slip) collapse into one latent that is more data-efficient and robust for downstream manipulation. This directly extends the unimodal vision-based Sparsh line to the multisensory regime.

Method

Sparsh-X is a transformer backbone (ViT-style [36]) over four Digit 360 streams, fused through attention bottlenecks ([35], Nagrani et al.) rather than full cross-modal concatenation. Architecture and pipeline are in Fig 2.

Per-modality tokenization. Each stream is preprocessed to its own spatio-temporal scale before a linear projection to 768-dim tokens:

Image — tactile images at 30fps, temporal stride 5, two frames concatenated along channels, cropped/resized to 224×224×3, patch size 16×16.
Audio — two contact mics at 48kHz; a 0.55s window converted to a log-mel spectrogram (128 channels, 5ms Hamming window, 2.5ms hop), both mics concatenated into a 224×256 image, patch size 16.
IMU — 3-axis accelerometer at 400Hz, 0.55s window → 224×3 temporal signal, chunked + linearly projected.
Pressure — static pressure at 200Hz, 1.1s window → 224×1.
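As a quick sanity check on the tokenization above, the per-modality token counts can be worked out directly. This is a minimal sketch assuming non-overlapping patches. The chunk length of 16 samples for the 1-D streams is a hypothetical value chosen for illustration, since these notes do not state it.

```python
def num_patches_2d(height, width, patch):
    """Number of non-overlapping square patches in a height x width input."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)


def num_chunks_1d(length, chunk):
    """Number of non-overlapping chunks in a 1-D signal of the given length."""
    assert length % chunk == 0
    return length // chunk


ASSUMED_CHUNK = 16  # hypothetical: the IMU/pressure chunk length is not given in the notes

tokens = {
    "image": num_patches_2d(224, 224, 16),     # 224x224 tactile image, 16x16 patches
    "audio": num_patches_2d(224, 256, 16),     # 224x256 two-mic spectrogram, patch size 16
    "imu": num_chunks_1d(224, ASSUMED_CHUNK),  # 224-step accelerometer window
    "pressure": num_chunks_1d(224, ASSUMED_CHUNK),
}
print(tokens)
```

Under these assumptions the image and audio streams account for almost all tokens, so they would dominate the cost of any scheme that concatenates every modality into one sequence.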

Bottleneck fusion. Each modality runs independently through L_f = 8 unimodal self-attention layers, then cross-modal information is exchanged only through B = 4 shared bottleneck tokens over L_b = 4 fusion layers (total depth L = L_f + L_b = 12). After each fusion block the bottleneck tokens are averaged across modalities, forcing information to pass through a low-bandwidth summarizer. Attention therefore stays quadratic only within each modality's own tokens (plus the B bottleneck tokens), rather than quadratic in the combined token count as with naive concatenation.
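The fusion step described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: attention here is single-head with identity Q/K/V projections, there are no MLPs or layer norms, and the names `self_attention` and `fusion_layer` are hypothetical. It only shows the data flow: each modality attends over its own tokens plus the shared bottleneck, and the updated bottleneck copies are then averaged across modalities.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fusion_layer(modal_tokens, bottleneck):
    """One fusion step: modalities never attend to each other directly,
    only to their own tokens and the shared bottleneck tokens."""
    new_tokens, bottleneck_copies = {}, []
    for name, toks in modal_tokens.items():
        out = self_attention(toks + bottleneck)
        new_tokens[name] = out[:len(toks)]
        bottleneck_copies.append(out[len(toks):])
    # Average the per-modality bottleneck updates into one shared set of tokens.
    n_mod, d = len(bottleneck_copies), len(bottleneck[0])
    fused = [
        [sum(copy[i][j] for copy in bottleneck_copies) / n_mod for j in range(d)]
        for i in range(len(bottleneck))
    ]
    return new_tokens, fused


# Toy inputs: two modalities with 2-dim tokens and B = 2 bottleneck tokens.
modal = {"image": [[1.0, 0.0], [0.5, 0.5]], "imu": [[0.0, 1.0]]}
bottleneck = [[0.1, 0.1], [0.2, -0.1]]
new_modal, fused = fusion_layer(modal, bottleneck)
```

Stacking this layer L_b times, with the fused bottleneck fed back in each time, gives the overall shape of the fusion stage under these simplifications.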

SSL objective. Training uses a teacher–student self-distillation scheme (DINO/iBOT-style [40, 11]) in which both branches consist of an encoder plus a predictor. After tokenizing and adding sinusoidal positional embeddings plus a register token, masking is applied to the student input tokens per modality, retaining 10–50% of the signal for local masks and 50–100% for global masks. Register tokens from global and local masks are concatenated and passed through prediction heads. As in DINO, the prediction task is online clustering: teacher tokens act as pseudo-labels (centroids adapt over time), and the loss is the cross-entropy between teacher and student softmax outputs.
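To make the objective concrete, here is a minimal sketch of a DINO-style distillation loss and of sampling local/global keep-masks. Several details are assumptions rather than facts from the paper: the temperatures (0.1 student, 0.04 teacher), the use of teacher centering, and uniform random token selection for the masks. The function names are hypothetical.

```python
import math
import random


def log_softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student) over prototype scores.
    Temperatures are assumed values, not taken from the paper."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(lp) for lp in log_softmax(centered, t_teacher)]
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def sample_keep_mask(num_tokens, low, high, rng):
    """Keep a random fraction in [low, high] of tokens; True means the token is kept.
    Uniform random selection is an assumption about the mask shape."""
    keep = max(1, round(num_tokens * rng.uniform(low, high)))
    kept = set(rng.sample(range(num_tokens), keep))
    return [i in kept for i in range(num_tokens)]


rng = random.Random(0)
local_mask = sample_keep_mask(196, 0.10, 0.50, rng)   # local view keeps 10-50%
global_mask = sample_keep_mask(196, 0.50, 1.00, rng)  # global view keeps 50-100%

teacher = [2.0, 0.5, -1.0]
center = [0.0, 0.0, 0.0]
agree = distillation_loss([2.0, 0.5, -1.0], teacher, center)
disagree = distillation_loss([-1.0, 0.5, 2.0], teacher, center)
```

The loss is small when the student ranks prototypes the same way as the teacher and large when it ranks them in the opposite order, which is the signal that drives the student toward the teacher's cluster assignments.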

Downstream usage. Sparsh-X is then frozen. For each downstream task a small task-specific attentive-pooling decoder ([41], context autoencoder) is trained on top, both for physical-property probes and for policy learning.
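The frozen-backbone plus small-decoder pattern can be sketched as below. This is a deliberately reduced stand-in for the attentive-pooling decoder of [41]: a single learned query with no key/value projections, followed by a linear head. All names and numeric values are hypothetical. Only `query`, `weight`, and `bias` would be trained, while the backbone tokens stay fixed.

```python
import math


def attentive_pool(tokens, query):
    """Pool a token sequence into one vector using attention weights from a learned query."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return pooled, weights


def linear_head(vector, weight, bias):
    """Task head: one logit per class computed from the pooled vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Stand-in for frozen backbone output: three 2-dim tokens.
frozen_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]                       # trainable in a real probe
pooled, weights = attentive_pool(frozen_tokens, query)
logits = linear_head(pooled, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
```

Because the backbone is frozen, differences between probes on the same task come only from the pooled features, which is what lets the property probes isolate the effect of pretraining.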

Setup

Datasets / benchmarks: ~1M unlabeled contact-interaction samples (18.6 hours, Fig 8) collected with Digit 360 from two platforms — an Allegro hand rummaging in a tray (4.5h, 8 sequences ~8.5min) and a two-fingered manual picker doing tap/slide/place/drop on varied surfaces (14.1h, 104 sequences ~3min, with object/action/surface annotations). Six distinct Digit 360 sensors. Downstream probes: object-action-surface classification (36 classes), material-quantity estimation (18 classes: corn/lentils/pills/rice/water/oil × full/half/quarter), normal-force regression (up to 3.5N).
Hardware / simulator: Digit 360 multimodal fingertips on an Allegro hand; Franka arm with Digit-360-equipped parallel gripper for material/force tasks; Meca arm for force probe. Policy tasks: plug insertion (Allegro + Digit 360) and in-hand rotation (Hora [51] on Allegro). Real-world only — ROS2 deployment, ~20Hz inference for 4 fingers on an RTX 4090.
Baselines: End-to-end (E2E) encoder-decoder models trained from scratch on tactile images (and on modality subsets); for policies, a vision-only (no-touch) baseline, E2E tactile-image policies, and for in-hand rotation: Hora baseline (proprio only), Hora fine-tuned, a proprio-only imitation-learning baseline, and Hora+ControlNet(E2E). Sparsh-X is also ablated against its own modality subsets.
Compute: 200 epochs on 16 A100 GPUs, batch size 128, AdamW with linear warmup followed by a cosine schedule.
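For reference, a linear-warmup-then-cosine learning-rate schedule of the kind mentioned above can be written in a few lines. The base learning rate, warmup length, and total step count below are hypothetical placeholders; the notes do not record the values used in the paper.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings for illustration only.
TOTAL, WARMUP, BASE = 1000, 100, 1e-3
schedule = [lr_at(s, TOTAL, WARMUP, BASE) for s in range(TOTAL + 1)]
```

The schedule rises linearly over the first `WARMUP` steps, peaks at `BASE`, and then decays smoothly toward `min_lr` by the final step.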

Results

Headline claims (abstract + §4–5): multisensory pretraining boosts plug-insertion policy success by 63% over an E2E tactile-image model, improves robustness in recovering object state from touch by 90% (measured as vertical-drift reduction vs. Hora during in-hand rotation), and improves physical-property characterization by 48% on average over E2E approaches.

Physical-property probes (frozen Sparsh-X, Fig 4):

Task	Finding
Object-Action-Surface (36-class)	Audio+IMU pairing gives +32% over tactile image alone; all modalities give +13% over image alone; pretraining adds up to a 10% margin in the lowest-data regime. At 50% data: E2E-image 68.8% vs. Sparsh-X-all 87.5% (Fig 12).
Material-Quantity (18-class)	Sparsh-X (all modalities) is highest across every data budget; +20.5% over E2E tactile-image. At 33% data: E2E-image 68.8% vs. Sparsh-X 87.5% (Fig 13).
Normal-Force regression (≤3.5N)	Combining all modalities → average error 35mN, a 17% improvement over tactile-image only.

Plug insertion via imitation learning (Fig 5, 20 trials): multisensory touch reaches a 90% success rate on the tight-tolerance task. Pretraining gives a 90% performance boost over training all modalities from scratch jointly with the policy. Sparsh-X (all modalities) is +500% over external-vision-only and +63% over E2E tactile-image-only. Where it loses / nuance: for the tactile-image modality alone, E2E training actually outperforms frozen pretrained representations. The authors attribute this to the tactile-image signal varying little across trials, so a task-specific encoder can specialize; multimodality is what closes the gap.

In-hand rotation, sim-to-real tactile adaptation (Figs 6–7, 10 trials nominal / 5 trials dynamical): a ControlNet [52] tactile adaptation module with a zero-init convolution wraps frozen Sparsh-X onto the frozen Hora policy. This reduces vertical translation (drift) by 90% vs. Hora and outperforms fine-tuned Hora, the proprio-only IL baseline (which fails often from OOD states), and Hora+ControlNet(E2E). Under reduced friction, Hora+ControlNet (Sparsh-X) keeps the object stable without losing grasp, while Hora+ControlNet(image) struggles — the authors say contact-patch changes are too subtle for tactile images alone. Under +20g mass, both Sparsh-X and image variants let the base policy adapt finger gaiting.
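The key property of the zero-initialized adapter is that, before any training, the adapted policy behaves exactly like the frozen base policy. The sketch below illustrates that property with a zero-initialized linear map standing in for the zero-init convolution. Where the residual is injected into Hora is not stated in these notes, so the "policy features" here are a hypothetical intermediate vector and all names are illustrative.

```python
class ZeroInitLinear:
    """Linear map whose weights and bias start at zero (stand-in for a zero-init convolution)."""

    def __init__(self, in_dim, out_dim):
        self.weight = [[0.0] * in_dim for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def adapted_features(policy_features, tactile_embedding, adapter):
    """Add the adapter's residual to the frozen policy's intermediate features."""
    residual = adapter(tactile_embedding)
    return [p + r for p, r in zip(policy_features, residual)]


adapter = ZeroInitLinear(in_dim=3, out_dim=2)
base = [0.4, -0.2]                 # hypothetical frozen-policy features
touch = [1.0, 2.0, 3.0]            # hypothetical frozen tactile embedding

at_init = adapted_features(base, touch, adapter)   # identical to base before training

adapter.weight[0][0] = 0.1         # pretend one training update changed a weight
after_update = adapted_features(base, touch, adapter)
```

Starting from an exact copy of the base behavior means training can only move away from the base policy as far as the tactile signal proves useful, which fits the setting where both the policy and the touch backbone stay frozen.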

Limitations & open questions

Author-stated:

The tactile-image pretraining modality has the lowest diversity (few distinct Digit 360 devices, each with its own optical artifacts), potentially limiting generalization; broader collaborative datasets needed.
Experiments use frozen Sparsh-X to isolate pretraining effects; fine-tuning could further help and compensate for modality-specific data limits.
Force evaluation is limited to normal-force estimation under controlled perpendicular contact; varying contact geometries and multiple simultaneous contacts are open.
Shear-force estimation is not attempted — separating extrinsic forces from the elastomer's internal deformation is a hard modeling problem (the soft hemispherical dome deforms and moves on contact).

What I noticed reading it:

Nearly all evaluation is real-world with small trial counts (20 trials for insertion; 10 and 5 trials for rotation). Several flagship numbers ("90%", "63%", "500%") are derived from these low-N runs and reported as point estimates without confidence intervals, so the statistical strength is modest relative to the bold percentages.
The "63% boost" and "48% improvement" are relative percentage gains over particular E2E baselines; the absolute success/accuracy deltas (e.g., insertion 90% vs ~55% E2E from Fig 5, a gap of about 35 percentage points) are smaller than the relative framing suggests and easier to interpret.
The interesting negative result — E2E beats frozen pretrained encoders on the image-only modality — is under-explored; it suggests the SSL objective may be over-smoothing precisely the modality with the richest spatial contact detail, and is only rescued by the other three modalities.
No sim evaluation and no public benchmark numbers mean that cross-paper comparison (e.g., vs. AnyTouch or T3) is impossible from this paper alone; everything is Digit-360-specific.
It locks to one sensor (Digit 360). The whole pitch rests on hardware that happens to expose 4 modalities; cross-sensor transfer (a strength of Sparsh and T3) is explicitly out of scope.
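The first two observations in this list (low trial counts, relative vs. absolute gains) can be made concrete with a short calculation. The sketch assumes that a 90% success rate over 20 trials means 18 successes and that the roughly 55% E2E rate read off Fig 5 means 11 successes; both counts are my inference, not numbers reported in the paper. It uses the standard Wilson score interval at 95% confidence.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


sparsh_x = wilson_interval(18, 20)    # assumed: 90% of 20 trials = 18 successes
e2e_image = wilson_interval(11, 20)   # assumed: ~55% of 20 trials = 11 successes

gain = relative_gain(0.90, 0.55)      # the "+63%" style of reporting
absolute = 0.90 - 0.55                # the same gap in percentage points

print(f"Sparsh-X 95% CI: {sparsh_x[0]:.2f} to {sparsh_x[1]:.2f}")
print(f"E2E image 95% CI: {e2e_image[0]:.2f} to {e2e_image[1]:.2f}")
print(f"relative gain: {gain:.1%}, absolute gap: {absolute:.0%}")
```

Under these assumed counts the two intervals overlap slightly, which is the sense in which the headline percentages rest on modest statistical support.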

Why I care

This paper is close to the heart of a thesis I want to push from BLADE: many manipulation predicates are not visually evaluable. is_inserted(plug, socket), is_screwed_tight, surface_is_rough, is_slipping, is_full(bottle) — these live in touch, force, vibration, and pressure, not pixels. BLADE learns visual classifiers f_θ(p): O → {T,F} for predicates; Sparsh-X is precisely an argument that for a large family of contact predicates, the observation O must be multisensory tactile, not RGB. Its three property probes are essentially predicate-classifier benchmarks in disguise: material-quantity = is_full/material_is(x), object-action-surface = surface_is_rough/contact_made, normal-force = a graded force_applied predicate. A frozen Sparsh-X embedding is a candidate perceptual backbone for grounding those non-visual predicates — the missing input channel that would let a BLADE-style abstraction layer cover contact-rich behaviors its gripper-state segmentation and visual classifiers currently can't (a limitation BLADE itself flagged for contact-rich tasks).

The bottleneck-fusion + frozen-backbone + small-decoder recipe is also a clean template: pretrain a multisensory contact representation once, then attach a thin predicate head per symbol. That is a concrete path to multisensory predicate invention. The ControlNet tactile-adaptation trick (inject frozen touch features into a frozen privileged-information policy) is separately interesting as a way to bolt a touch representation onto an existing planner/policy without retraining.
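The "thin predicate head per symbol" idea can be sketched as follows. Everything here is hypothetical: the `PredicateHead` class, the weights, and the 2-dim stand-in embedding are illustrative only, and neither BLADE nor Sparsh-X defines such an interface. The point is just the shape of the recipe: one shared frozen embedding, one small logistic head per predicate symbol.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class PredicateHead:
    """Thin logistic head mapping a frozen embedding to a truth value for one symbol."""

    def __init__(self, name, weights, bias, threshold=0.5):
        self.name = name
        self.weights = weights
        self.bias = bias
        self.threshold = threshold

    def probability(self, embedding):
        return sigmoid(sum(w * e for w, e in zip(self.weights, embedding)) + self.bias)

    def __call__(self, embedding):
        return self.probability(embedding) >= self.threshold


# Made-up parameters; in practice each head would be trained on labeled contact data.
heads = {
    "is_slipping": PredicateHead("is_slipping", [1.5, -0.5], 0.0),
    "is_inserted": PredicateHead("is_inserted", [-1.0, 2.0], -0.5),
}

embedding = [0.2, 1.0]   # stand-in for a frozen multisensory embedding
state = {name: head(embedding) for name, head in heads.items()}
print(state)
```

Adding a new contact predicate then costs one extra head rather than a new perception stack, which is what makes the frozen-backbone recipe attractive for predicate invention.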

Quotable

We present Sparsh-X, the first multisensory touch representations across four tactile modalities: image, audio, motion, and pressure. — Abstract / p.1

Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images, and improves robustness by 90% in recovering object states from touch. — Abstract / p.1

Representations that encode these physical properties at the fingertip level are especially valuable for dexterous manipulation, as they enable feedback of object and contact state directly within latent space. — §1 / p.2

Papers cited that should likely be ingested next:

[4] Lambeta et al. 2024 — Digit 360 / Digitizing touch with an artificial multimodal fingertip — the sensor this entire paper is built on; the multimodal hardware enabling all four streams.
[34] Li et al. 2023 — See, Hear, and Feel (MULSA) — the concatenate-all-tokens multimodal-transformer baseline Sparsh-X improves on. Forward ref: see_hear_feel_sensory_fusion.
[35] Nagrani et al. 2021 — Attention Bottlenecks for Multimodal Fusion — the bottleneck-fusion mechanism at the core of the architecture.
[5] Yu et al. 2024 — MimicTouch — unimodal SSL on tactile image + audio separately; the no-explicit-fusion contrast. Forward ref: mimictouch_human_tactile_demos.
[6] Mejia et al. 2024 — Hearing Touch — audio-visual pretraining for contact-rich manipulation. Forward ref: hearing_touch_audio_visual_pretraining.
[28] Thankaraj & Pinto 2023 — That Sounds Right — auditory self-supervision for dynamic manipulation. Forward ref: that_sounds_right_auditory_self_supervision.
[29] Liu et al. 2024 — ManiWAV — in-the-wild audio-visual manipulation. Forward ref: maniwav_in_the_wild_audio_visual.
[51] Qi et al. 2022 — Hora (In-Hand Object Rotation via RMA) — the base privileged policy adapted in the in-hand-rotation experiment.
[52] Zhang et al. 2023 — ControlNet — the adaptation mechanism reused for tactile sim-to-real injection.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

Sparsh — the direct unimodal (vision-based) predecessor; Sparsh-X is its multisensory extension. The single most important sibling.
AnyTouch — unified static-dynamic visuo-tactile representation across sensors; the cross-sensor counterpoint to Sparsh-X's single-sensor multimodal stance.
Transferable Tactile Transformers (T3) — cross-sensor/task tactile representation; same foundation-model-for-touch goal, different (transfer-first) axis.
UniT and MViTac — other tactile representation-learning backbones in cluster B; SSL/contrastive alternatives to Sparsh-X's self-distillation.
See, Hear, and Feel and Hearing Touch — the multisensory-fusion baselines/ancestors Sparsh-X positions against (treat-as-image vs. bottleneck fusion).
Reactive Diffusion Policy — consumes visuo-tactile signals in a contact-rich policy; a downstream policy that a Sparsh-X-style backbone could feed.
Towards Forceful Robotic Foundation Models (survey) — situates Sparsh-X within the "touch/force foundation model" research program.