Sparsh: Self-supervised touch representations for vision-based tactile sensing

Carolina Higuera*, Akash Sharma*, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, Mustafa Mukadam · FAIR at Meta / U. Washington / CMU · CoRL 2024 · arXiv:2410.24090

One-liner. Sparsh is a family of self-supervised touch encoders (MAE / DINO / DINOv2 / I-JEPA / V-JEPA) pre-trained on 460k+ unlabeled vision-based tactile images that generalize across DIGIT, GelSight 2017, and GelSight Mini sensors and across six touch-centric tasks — the "foundation-model-for-touch" move, replacing per-sensor, per-task handcrafted encoders, and shipped with TacBench to standardize the benchmark.

Problem & motivation

Vision-based tactile sensors (GelSight, DIGIT) capture contact geometry, texture, and force at the sensor-object interface, but the field's prevailing practice is to train a custom model with task-specific labeled data for each sensor. That fragments effort: a feature extractor tuned for GelSight with markers may not transfer to markerless DIGIT, and an encoder optimized for texture recognition may be useless for slip reasoning. Worse, the labels that would let you build a general model — ground-truth contact forces, slip, deformation tracking, extrinsic contact — are exactly the quantities that are expensive or infeasible to instrument at scale. Sparsh borrows the self-supervised-learning (SSL) recipe that reshaped NLP and vision: learn data-agnostic objectives from cheap unlabeled tactile images, then probe the frozen representation on downstream tasks. It also fills a missing-benchmark gap with TacBench, so progress is measurable across sensors and models.

Method

Two deliverables: the Sparsh encoder family and the TacBench benchmark (Fig 1, Fig 2).

SSL pre-training (the encoders). All encoders are ViT-B/14 (Table 1), pre-trained without labels on ~460k tactile images (70% / 462.7k of a ~661k-image pool; the rest held out for online probes). Five SSL paradigms are adapted from their official vision codebases:

Tactile-specific design moves. (i) Background subtraction for markerless DIGIT and GelSight Mini, giving the model a no-contact reference so static shear from a perpendicular force is legible; empirically improves same-sensor generalization. (ii) Temporal tokenization: for image SSL methods, two frames at stride 5 are concatenated channel-wise, I_t ⊕ I_{t−5} → x ∈ ℝ^{h×w×6}, an ~80 ms window matching the human partial-slip reaction time; for V-JEPA, 4-frame clips at [t, t−2, t−4, t−6] (~100 ms). All inputs are resized to 224×224.
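The two design moves above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the signed-difference form of background subtraction, and the 60 fps frame rate used to convert strides to milliseconds are all assumptions, and resizing to 224×224 is skipped by generating frames at that size directly.

```python
import numpy as np

FPS = 60.0  # assumed sensor frame rate; not stated in these notes


def subtract_background(frame, background):
    """Signed difference from a no-contact reference image (assumed form)."""
    return frame.astype(np.float32) - background.astype(np.float32)


def temporal_pair(frames, t, stride=5):
    """Concatenate frame t and frame t - stride channel-wise: two (h, w, 3) -> (h, w, 6)."""
    if t - stride < 0:
        raise ValueError("not enough history for the requested stride")
    return np.concatenate([frames[t], frames[t - stride]], axis=-1)


def vjepa_clip(frames, t, offsets=(0, 2, 4, 6)):
    """Stack frames [t, t-2, t-4, t-6] into a clip of shape (4, h, w, 3)."""
    if t - max(offsets) < 0:
        raise ValueError("not enough history for the requested clip")
    return np.stack([frames[t - k] for k in offsets], axis=0)


def window_ms(span_frames, fps=FPS):
    """Time covered by `span_frames` frame intervals, in milliseconds."""
    return 1000.0 * span_frames / fps


# Synthetic stand-in data: 10 RGB frames of 224x224 plus a no-contact background.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(10, 224, 224, 3), dtype=np.uint8)
background = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
frames = np.stack([subtract_background(f, background) for f in raw])

x = temporal_pair(frames, t=9)   # image-SSL input, shape (224, 224, 6)
clip = vjepa_clip(frames, t=9)   # V-JEPA input, shape (4, 224, 224, 3)
```

Under the 60 fps assumption, a stride of 5 frames spans roughly 83 ms and a span of 6 frames is 100 ms, which is consistent with the ~80 ms and ~100 ms windows quoted above.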

TacBench (the benchmark) & evaluation protocol. Six tasks under three questions: comprehend properties — T1 force estimation, T1A force-field visualization, T2 slip detection; enable perception — T3 SE(2) pose estimation, T4 grasp stability, T5 textile recognition; enable planning — T6 bead-maze tactile policy. Standard evaluation freezes the Sparsh encoder and trains an attentive decoder (cross-attention + 2-layer MLP) on the labeled set; dense tasks (force fields) use a DPT decoder (Fig 7). The headline comparison is against an E2E baseline of identical capacity trained from scratch, swept across labeled-data budgets (1% / 10% / 33% / 50% / 100%).
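The frozen-encoder protocol can be pictured as a forward pass through a small attentive probe. The sketch below is a simplified single-head NumPy version under stated assumptions: one learnable query, randomly initialised weights, a ReLU MLP, and a 256-token, 768-wide input standing in for frozen ViT-B features at 224×224. The class name and hidden size are hypothetical, and the paper's actual decoder (number of heads, normalisation, widths) may differ.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class AttentiveProbe:
    """Single-head cross-attention pooling followed by a 2-layer MLP (hypothetical sketch)."""

    def __init__(self, dim, hidden, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.query = rng.normal(0, s, size=(1, dim))    # learnable query token
        self.w_k = rng.normal(0, s, size=(dim, dim))    # key projection
        self.w_v = rng.normal(0, s, size=(dim, dim))    # value projection
        self.w1 = rng.normal(0, s, size=(dim, hidden))  # MLP layer 1
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 1.0 / np.sqrt(hidden), size=(hidden, out_dim))  # MLP layer 2
        self.b2 = np.zeros(out_dim)

    def attention(self, tokens):
        """Attention weights of the query over the patch tokens, shape (1, n_tokens)."""
        keys = tokens @ self.w_k
        scores = self.query @ keys.T / np.sqrt(tokens.shape[-1])
        return softmax(scores, axis=-1)

    def __call__(self, tokens):
        """tokens: (n_tokens, dim) frozen encoder output -> (out_dim,) prediction."""
        pooled = self.attention(tokens) @ (tokens @ self.w_v)  # (1, dim)
        h = np.maximum(pooled @ self.w1 + self.b1, 0.0)         # ReLU
        return (h @ self.w2 + self.b2)[0]


# Stand-in for frozen encoder features: 256 patch tokens of width 768.
tokens = np.random.default_rng(1).normal(size=(256, 768))
probe = AttentiveProbe(dim=768, hidden=256, out_dim=3)  # e.g. a 3-axis force vector
force_pred = probe(tokens)
```

In this setup only the probe's parameters would be trained on the labeled set; `tokens` is treated as a constant, which is what keeps the comparison across labeled-data budgets about the representation rather than the decoder.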

Setup

Results

Headline: SSL pre-training + frozen probe beats from-scratch E2E by an average of 95.1% across TacBench under limited (33–50%) labeled budgets. Sparsh (DINO) and Sparsh (I-JEPA) are the most competitive overall, with DINO ahead of I-JEPA by ~5.6% on average. Both leaders learn in latent space, which the authors read as evidence that latent-space SSL beats pixel-space reconstruction (MAE) for tactile images. Task-level character (Fig 4):

Task | Best Sparsh variant | Headline finding
--- | --- | ---
T1 Force estimation | Sparsh (DINO) | Low force error even with sparse labels; markerless GelSight Mini where E2E fails
T2 Slip detection | Sparsh (V-JEPA) | Highest F1; 4-frame temporal window helps; strong even at 1% data
T3 Pose estimation | pre-trained (DINO) | Holds accuracy in low-data; E2E degrades sharply
T4 Grasp stability | Sparsh (I-JEPA)/(V-JEPA) | ~80% acc from a single finger’s touch, surpassing Calandra et al. (tactile+vision)
T5 Textile recognition | Sparsh (MAE) | MAE’s pixel features shine; big gain at 10% data over hard-to-train E2E
T6 Bead maze (policy) | Sparsh (DINO)/(I-JEPA) | ~20–53% lower trajectory error vs E2E; but no model finishes the real maze

Where it loses / is honest: on T6 the lower BC trajectory error did not translate to real-robot success — none of the models completes the full maze on hardware (lack of force control, no error recovery after losing grip, drift from local decision-making). Specialist from-scratch models can show better real rollout performance because the narrow task domain overfits favorably — a known pre-trained-vision phenomenon the authors cite. MAE generally lags the latent-space methods except on the texture-heavy T5.

Limitations & open questions

From the authors (§8):

What I noticed reading it:

Why I care

This is squarely on the thesis behind this batch, that many manipulation predicates are not visually evaluable: is_grasped, is_inserted, surface_is_rough, is_slipping, and is_screwed_tight live in touch and force, not pixels. BLADE learns visual classifiers for its predicates and explicitly flags contact-rich tasks as a segmentation/observation weakness. Sparsh is the infrastructure that could supply the missing modality: a frozen, sensor-agnostic touch encoder whose features, per TacBench, already support read-out of force, slip state, SE(2) pose, and grasp success through a lightweight attentive probe. Concretely, a BLADE-style predicate classifier f_θ(p): O → {T, F} could take a Sparsh embedding instead of (or alongside) a cropped RGB region, making touch-dependent preconditions/effects learnable from the same auto-labeled-demo pipeline. The bead-maze result is also a cautionary note for the BLADE thesis: good representation-level read-out does not guarantee closed-loop control without force-aware action. The symbolic layer can name is_slipping, but the controller still needs force modulation, which BLADE's purely categorical abstraction punts to the diffusion policy. Cluster B (representation/foundation models) is the natural feeder; the policy clusters (C/F) are where these reps would actually plug into manipulation.
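One way to picture the predicate idea above is a small logistic head over a frozen touch embedding. Everything in this sketch is hypothetical: the `TouchPredicate` class, the 768-dimensional embedding width, the synthetic linearly separable data, and the plain gradient-descent loop are illustrative stand-ins, not part of BLADE or Sparsh.

```python
import numpy as np


class TouchPredicate:
    """Logistic head mapping a frozen embedding to a boolean predicate (hypothetical)."""

    def __init__(self, name, dim):
        self.name = name
        self.w = np.zeros(dim)
        self.b = 0.0

    def prob(self, emb):
        z = emb @ self.w + self.b
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, embs, labels, lr=0.1, steps=500):
        """Plain gradient descent on the mean binary cross-entropy."""
        for _ in range(steps):
            err = self.prob(embs) - labels  # gradient of BCE w.r.t. the logits
            self.w -= lr * embs.T @ err / len(labels)
            self.b -= lr * err.mean()
        return self

    def __call__(self, emb):
        return bool(self.prob(emb) > 0.5)


# Synthetic, linearly separable stand-in for auto-labeled embeddings.
rng = np.random.default_rng(0)
dim = 768
direction = rng.normal(size=dim) / np.sqrt(dim)
embs = rng.normal(size=(200, dim))
labels = (embs @ direction > 0).astype(float)

is_slipping = TouchPredicate("is_slipping", dim).fit(embs, labels)
train_acc = np.mean([is_slipping(e) == bool(y) for e, y in zip(embs, labels)])
```

In a BLADE-style pipeline the labels would come from auto-labeled demonstrations rather than a synthetic direction. Whether real touch predicates are this easy to separate is an open question that the sketch does not address.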

Directly related newly-ingested batch papers: see Related below.

Quotable

We find that SSL pre-training for touch representation outperforms task and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. — Abstract / p.1
However, current approaches primarily focus on texture and visual properties and overlook physical contact properties, such as forces, slippage, and poses, which are essential for dexterous manipulation. — §2 Related work / p.3
Touch comes before sight, before speech. — §1 Introduction / p.1 (Margaret Atwood, via the authors)

Related

Papers cited that should likely be ingested next:

Newly ingested in 2026-06-24 batch — directly relevant: