Sparsh: Self-supervised touch representations for vision-based tactile sensing

Carolina Higuera*, Akash Sharma*, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, Mustafa Mukadam · FAIR at Meta / U. Washington / CMU · CoRL 2024 · arXiv:2410.24090 · PDF · project page

One-liner. Sparsh is a family of self-supervised touch encoders (MAE / DINO / DINOv2 / I-JEPA / V-JEPA), pre-trained on 460k+ unlabeled vision-based tactile images, that generalizes across DIGIT, GelSight 2017, and GelSight Mini sensors and across six touch-centric tasks. It is the "foundation-model-for-touch" move: it replaces per-sensor, per-task handcrafted encoders and ships with TacBench to standardize the benchmark.

Problem & motivation

Vision-based tactile sensors (GelSight, DIGIT) capture contact geometry, texture, and force at the sensor-object interface, but the field's prevailing practice is to train a custom model with task-specific labeled data for each sensor. That fragments effort: a feature extractor tuned for GelSight with markers may not transfer to markerless DIGIT, and an encoder optimized for texture recognition may be useless for slip reasoning. Worse, the labels that would let you build a general model (ground-truth contact forces, slip, deformation tracking, extrinsic contact) are exactly the quantities that are expensive or infeasible to instrument at scale. Sparsh borrows the self-supervised-learning (SSL) recipe that reshaped NLP and vision: learn representations from cheap unlabeled tactile images using task-agnostic objectives, then probe the frozen representation on downstream tasks. It also fills a missing-benchmark gap with TacBench, so progress is measurable across sensors and models.

Method

Two deliverables: the Sparsh encoder family and the TacBench benchmark (Fig 1, Fig 2).

SSL pre-training (the encoders). All encoders are ViT-B/14 (Table 1), pre-trained without labels on ~460k tactile images (70% / 462.7k of a ~661k-image pool; the rest held out for online probes). Five SSL paradigms are adapted from their official vision codebases:

Sparsh (MAE) — masked image modeling; ViT encoder + lightweight ViT decoder, L2 pixel reconstruction ‖I_target − I_recon‖². Pixel-space objective.
Sparsh (DINO / DINOv2) — self-distillation; a student ViT predicts an EMA teacher's softmax-normalized outputs over different crops (cross-entropy). Latent-space objective.
Sparsh (I-JEPA / V-JEPA) — joint-embedding predictive; a context encoder predicts an EMA target encoder's features for masked local regions through a small predictor (L2 in feature space). I-JEPA uses spatial block masking on images; V-JEPA uses tube masking on 4-frame clips.
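The three objective families above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' training code: the function names are hypothetical, restricting the MAE loss to masked patches follows the original MAE recipe rather than anything stated in this note, the DINO temperatures are illustrative defaults, and DINO's teacher centering is omitted for brevity.

```python
import numpy as np

def mae_loss(target_patches, recon_patches, mask):
    """Pixel-space L2: mean squared error over masked patches.

    target_patches, recon_patches: (n_patches, patch_dim).
    mask: (n_patches,), 1.0 where the patch was hidden from the encoder.
    Masked-only averaging is an assumption borrowed from the MAE paper.
    """
    per_patch = ((target_patches - recon_patches) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())

def log_softmax(logits, temperature):
    """Numerically stable log-softmax with a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Latent-space cross-entropy between teacher and student softmax outputs.

    Temperatures are illustrative; teacher centering is left out.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def jepa_loss(predicted_feats, target_feats):
    """L2 in feature space between predictor output and target-encoder features."""
    return float(((predicted_feats - target_feats) ** 2).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Exponential moving average used for the teacher / target encoder."""
    return momentum * teacher_params + (1.0 - momentum) * student_params
```

The pixel-space versus latent-space split that the note keeps returning to is visible here: only `mae_loss` compares raw pixel patches, while `dino_loss` and `jepa_loss` compare encoder outputs.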

Tactile-specific design moves. (i) Background subtraction for markerless DIGIT and GelSight Mini, giving the model a no-contact reference so that static shear from a perpendicular force is legible; empirically this improves same-sensor generalization. (ii) Temporal tokenization: for image SSL methods, two frames at stride 5 are concatenated channel-wise, I_t ⊕ I_{t−5} → x ∈ ℝ^{h×w×6}, an ~80 ms window matching the human partial-slip reaction time; for V-JEPA, 4-frame clips at [t, t−2, t−4, t−6] (~100 ms). All inputs are resized to 224×224.
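A minimal sketch of the two input-side moves, background subtraction followed by channel-wise concatenation of two frames, is shown below. The function names are hypothetical, resizing to 224×224 is left out, and the 60 Hz frame rate is an assumption chosen because it reproduces the ~80 ms and ~100 ms windows quoted above.

```python
import numpy as np

def temporal_token_input(frames, t, background, stride=5):
    """Build the 6-channel input for the image SSL methods.

    frames: (T, h, w, 3) tactile video; background: (h, w, 3) no-contact image.
    Returns an (h, w, 6) float32 array: current frame, then the frame `stride` steps earlier.
    """
    if t < stride:
        raise ValueError("need at least `stride` frames of history")
    bg = background.astype(np.float32)
    current = frames[t].astype(np.float32) - bg
    past = frames[t - stride].astype(np.float32) - bg
    return np.concatenate([current, past], axis=-1)

def vjepa_clip_indices(t):
    """Frame indices of the 4-frame clip used for V-JEPA."""
    return [t, t - 2, t - 4, t - 6]

def window_ms(frame_gap, fps=60.0):
    """Time spanned by `frame_gap` frames; fps=60 is an assumed sensor rate."""
    return 1000.0 * frame_gap / fps
```

Under the assumed 60 Hz rate, `window_ms(5)` is about 83 ms for the two-frame input and `window_ms(6)` is 100 ms for the V-JEPA clip, which matches the approximate figures in the text.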

TacBench (the benchmark) & evaluation protocol. Six tasks under three questions: comprehend properties — T1 force estimation, T1A force-field visualization, T2 slip detection; enable perception — T3 SE(2) pose estimation, T4 grasp stability, T5 textile recognition; enable planning — T6 bead-maze tactile policy. Standard evaluation freezes the Sparsh encoder and trains an attentive decoder (cross-attention + 2-layer MLP) on the labeled set; dense tasks (force fields) use a DPT decoder (Fig 7). The headline comparison is against an E2E baseline of identical capacity trained from scratch, swept across labeled-data budgets (1% / 10% / 33% / 50% / 100%).
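The frozen-encoder protocol can be illustrated with a forward pass of a small attentive probe: a learnable query cross-attends over the frozen patch tokens, and the pooled vector goes through a 2-layer MLP. This is a single-head sketch under stated assumptions; the parameter names, initialization, and the `subsample_labels` helper for the labeled-data budgets are all hypothetical, and the paper's actual probe may differ in head count, normalization, and sampling.

```python
import numpy as np

def init_probe(d, hidden, n_out, rng):
    """Randomly initialize the only trainable parameters; the encoder stays frozen."""
    s = 1.0 / np.sqrt(d)
    return {
        "query": rng.normal(0.0, s, d),
        "w_k": rng.normal(0.0, s, (d, d)),
        "w_v": rng.normal(0.0, s, (d, d)),
        "w1": rng.normal(0.0, s, (d, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def attentive_probe(tokens, params):
    """tokens: (n_tokens, d) outputs of the frozen encoder for one sample."""
    keys = tokens @ params["w_k"]
    values = tokens @ params["w_v"]
    scores = keys @ params["query"] / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over tokens
    pooled = weights @ values                     # cross-attention pooling, shape (d,)
    hidden = np.maximum(0.0, pooled @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]   # task output, e.g. a 3-D force

def subsample_labels(n_samples, fraction, rng):
    """Pick a random labeled subset for a given data budget (e.g. 0.01, 0.33)."""
    k = max(1, int(round(n_samples * fraction)))
    return rng.permutation(n_samples)[:k]
```

In this setup only the probe parameters would receive gradients, whereas the E2E baseline described above trains the encoder and the probe together from scratch.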

Setup

Datasets / benchmarks: SSL pre-training pools YCB-Slide (DIGIT, 180k), the authors' new Touch-Slide (DIGIT, 180k), Touch-and-Go (GelSight, 220k), and ObjectFolder (GelSight Mini, 81k) — ~661k total, 462.7k used for pre-training. TacBench labeled sets (Table 3): T1 75k (DIGIT) + 75k (GelSight Mini); T2 125k DIGIT, 13% slip; T3 49k DIGIT; T4 Feeling-of-Success 9.3k GelSight 2017; T5 Clothing/Clothing-Dataset 120k GelSight 2017, 20 classes; T6 bead-maze ~34k DIGIT.
Hardware / simulator: Three real vision-based tactile sensors — DIGIT, GelSight 2017 (markers), GelSight Mini (markerless). Downstream tasks use real robot arms sliding/grasping (Franka-class setups, Fig 3) and a hexapod-mounted force sensor for force/slip data collection. Bead-maze uses a real arm with DIGIT and VR/kinesthetic teleop demos.
Baselines: task- and sensor-specific E2E models of identical capacity trained from scratch (encoder + probe both trained); the five Sparsh SSL variants compared against each other; on T4, comparison to Calandra et al. 2017 (which combined tactile + vision).
Compute: 8× NVIDIA A100 (80 GB) GPUs; 150 epochs, AdamW, cosine weight-decay schedule 0.04→0.4, 30-epoch LR warmup (Table 1). ~86.3M parameters per backbone; inference up to 112 FPS on RTX 3080 (V-JEPA 60 FPS) (Table 2).
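As a quick consistency check on the dataset figures above, the per-source counts can be summed and split 70/30. The counts are the rounded thousands quoted in this note, and the variable names are only illustrative.

```python
# Pre-training pool sizes in thousands of images, as quoted in the Setup list.
pools_k = {
    "YCB-Slide (DIGIT)": 180,
    "Touch-Slide (DIGIT)": 180,
    "Touch-and-Go (GelSight)": 220,
    "ObjectFolder (GelSight Mini)": 81,
}

total_k = sum(pools_k.values())          # full pool, in thousands
pretrain_k = round(0.70 * total_k, 1)    # 70% used for SSL pre-training
heldout_k = round(total_k - pretrain_k, 1)

# Share of the pool contributed by DIGIT sources.
digit_share = sum(v for k, v in pools_k.items() if "DIGIT" in k) / total_k

print(total_k, pretrain_k, heldout_k, round(digit_share, 3))
```

The sum comes to 661k and the 70% split to 462.7k, matching the numbers in the Method section; DIGIT data makes up a little over half of the pool.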

Results

Headline: SSL pre-training + frozen probe beats from-scratch E2E by an average of 95.1% across TacBench under limited (33–50%) labeled budgets. Sparsh (DINO) and Sparsh (I-JEPA) are the most competitive overall, with DINO outperforming I-JEPA by ~5.6% on average — evidence that latent-space SSL beats pixel-space (MAE) for tactile images. Task-level character (Fig 4):

Task	Best Sparsh variant	Headline finding
T1 Force estimation	Sparsh (DINO)	Low force error even with sparse labels; also works on markerless GelSight Mini, where E2E fails
T2 Slip detection	Sparsh (V-JEPA)	Highest F1; 4-frame temporal window helps; strong even at 1% data
T3 Pose estimation	Sparsh (DINO)	Holds accuracy in low-data; E2E degrades sharply
T4 Grasp stability	Sparsh (I-JEPA)/(V-JEPA)	~80% acc from a single finger’s touch, surpassing Calandra et al. (tactile+vision)
T5 Textile recognition	Sparsh (MAE)	MAE’s pixel features shine; big gain at 10% data over hard-to-train E2E
T6 Bead maze (policy)	Sparsh (DINO)/(I-JEPA)	~20–53% lower trajectory error vs E2E; but no model finishes the real maze
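To make the headline "average improvement" figure easier to interpret, here is a sketch of how an average relative gain over a baseline behaves. The per-task numbers are invented for illustration and are not from the paper, and the paper's exact aggregation is not given in this note, so the formula is an assumption.

```python
def relative_gain(baseline, model, higher_is_better):
    """Relative improvement of `model` over `baseline` as a fraction of the baseline."""
    if higher_is_better:
        return (model - baseline) / baseline
    return (baseline - model) / baseline

# Hypothetical per-task results: (name, baseline, model, higher_is_better).
hypothetical_tasks = [
    ("task A (F1)", 0.40, 0.80, True),        # weak baseline, gain of 100%
    ("task B (error)", 100.0, 50.0, False),   # error halved, gain of 50%
    ("task C (accuracy)", 0.60, 0.66, True),  # strong baseline, gain of about 10%
]

gains = [relative_gain(b, m, hib) for _, b, m, hib in hypothetical_tasks]
average_gain = sum(gains) / len(gains)        # about 0.53 for these made-up numbers
```

With these made-up numbers the average is pulled up mostly by the task where the from-scratch baseline is weakest, which is why the per-task absolute numbers are worth reading alongside any averaged relative figure.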

Where it loses / is honest: on T6 the lower BC trajectory error did not translate to real-robot success. None of the models completes the full maze on hardware (lack of force control, no error recovery after losing grip, drift from local decision-making). Specialist from-scratch models can show better real rollout performance because they overfit favorably to the narrow task domain, a known phenomenon with pre-trained vision models that the authors cite. MAE generally lags the latent-space methods except on the texture-heavy T5.

Limitations & open questions

From the authors (§8):

Open-source tactile datasets used here are predominantly discrete contact interactions; data rich in shear would likely improve representations.
They do not ablate the length of tactile-image history used for learning (the stride-5 / 2-frame choice); doing so could guide downstream gains.
Bead-maze policies on the real robot only partially complete the maze before compounding error drops the bead; how to effectively leverage pre-trained touch reps in behavioral cloning is open.
At deployment, throughput is currently limited by sensor data streaming rates, not by encoder inference.

What I noticed reading it:

"95.1% improvement" is an average relative gain over a from-scratch E2E baseline at restricted budgets — an impressive but soft framing; the absolute numbers per task (Fig 4) tell a more mixed story, and at 100% labels the gap narrows on several tasks.
No external SOTA baselines on most tasks — the comparison is Sparsh variants vs. a same-capacity scratch model. The closest concurrent works (T3, UniT) are discussed but not benchmarked head-to-head, so "best general touch rep" is asserted relative to its own family.
The "generalize across sensors" claim rests on pre-training on three sensor families and testing on real instances unseen during SSL, but per-task decoders are still trained per-sensor. Cross-sensor transfer appears only as a few-shot (10-shot) result in the appendix, not as a zero-shot result and not in the headline.
Grasp-stability and slip datasets are class-imbalanced (36% fail / 13% slip); F1 is reported, but the small eval splits (1.3k grasps) make the ~80% numbers statistically thin.
The whole benchmark is about reading out properties from a frozen representation — it shows touch reps encode force/slip/pose, not that a downstream planner can act on those predicates reliably (T6 is the one action test, and it fails on hardware).
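The class-imbalance point can be made concrete with a toy example: on a set with 13% positives (mirroring the slip ratio quoted above), a degenerate classifier that never predicts slip scores high accuracy but zero F1. The data here is synthetic and the helper name is hypothetical.

```python
def f1_and_accuracy(y_true, y_pred):
    """Binary F1 (positive class = 1) and plain accuracy, from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return f1, correct / len(y_true)

# Synthetic labels: 13 slip samples out of 100.
y_true = [1] * 13 + [0] * 87
always_no_slip = [0] * 100

f1, acc = f1_and_accuracy(y_true, always_no_slip)   # F1 = 0.0, accuracy = 0.87
```

This is why F1 is the more informative headline metric for T2, and why accuracy figures on the imbalanced grasp set deserve the same caution.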

Why I care

This is squarely on the thesis behind this batch: many manipulation predicates are not visually evaluable. Predicates such as is_grasped, is_inserted, surface_is_rough, is_slipping, and is_screwed_tight live in touch and force, not pixels. BLADE learns visual classifiers for its predicates and explicitly flags contact-rich tasks as a segmentation/observation weakness. Sparsh is the infrastructure that could supply the missing modality: a frozen, sensor-agnostic touch encoder whose features, as TacBench shows, already encode force, slip state, SE(2) pose, and grasp success in a form that a lightweight attentive probe can read out. Concretely, a BLADE-style predicate classifier f_θ(p): O → {T,F} could take a Sparsh embedding instead of (or alongside) a cropped RGB region, making touch-dependent preconditions/effects learnable from the same auto-labeled-demo pipeline. The bead-maze result is also a cautionary note for the BLADE thesis: good representation-level read-out does not guarantee closed-loop control without force-aware action. The symbolic layer can name is_slipping, but the controller still needs force modulation, which BLADE's purely categorical abstraction punts to the diffusion policy. Cluster B (representation/foundation models) is the natural feeder; the policy clusters (C/F) are where these reps would actually plug into manipulation.
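As a sketch of what a touch-conditioned predicate classifier could look like, the snippet below puts a logistic head on a frozen touch embedding, optionally concatenated with an RGB feature. Everything here is hypothetical: neither BLADE nor Sparsh defines this interface, the embedding is a stand-in vector, and the weights would have to be learned from labeled demonstrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def touch_predicate(touch_embedding, weights, bias, rgb_feature=None, threshold=0.5):
    """Hypothetical predicate head, e.g. for is_slipping.

    touch_embedding: pooled feature from a frozen touch encoder (stand-in vector here).
    rgb_feature: optional visual feature to fuse by concatenation.
    Returns (truth value, probability).
    """
    if rgb_feature is None:
        features = touch_embedding
    else:
        features = np.concatenate([touch_embedding, rgb_feature])
    prob = float(sigmoid(features @ weights + bias))
    return prob >= threshold, prob

# Toy usage with made-up numbers (not a trained classifier).
embedding = np.array([1.0, 0.0])
holds, prob = touch_predicate(embedding, weights=np.array([2.0, 0.0]), bias=0.0)
```

A head like this only names the predicate; as the bead-maze result suggests, acting on it would still require a controller that can modulate force.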

Directly related newly-ingested batch papers: see the "Newly ingested in 2026-06-24 batch" list below.

Quotable

We find that SSL pre-training for touch representation outperforms task and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. — Abstract / p.1

However, current approaches primarily focus on texture and visual properties and overlook physical contact properties, such as forces, slippage, and poses, which are essential for dexterous manipulation. — §2 Related work / p.3

Touch comes before sight, before speech. — §1 Introduction / p.1 (Margaret Atwood, via the authors)

Papers cited that should likely be ingested next:

[45] Zhao et al. 2024 — Transferable Tactile Transformers (T3) — the closest concurrent work; shared-trunk sensor-specific encoders with MAE + labeled supervision. PDF (forward-ref)
[46] Xu et al. 2024 — UniT — VQGAN-based tactile representation for GelSight Mini; the other concurrent baseline. PDF (forward-ref)
[39] Fu et al. 2024 — A Touch, Vision, and Language Dataset (TVL) — same Meta/Goldberg orbit; touch-vision-language alignment. PDF (forward-ref)
[40] Yang et al. 2024 — Binding touch to everything (UniTouch) — unified multimodal tactile reps via cross-sensory binding. PDF (forward-ref)
[20] Yang et al. 2022 — Touch and Go — a core SSL pre-training dataset here. PDF (forward-ref)
[37/38] Gao et al. — ObjectFolder / ObjectFolder 2.0 — multisensory datasets feeding SSL. PDF (forward-ref)
[1] Yuan et al. 2017 — GelSight and [3] Lambeta et al. 2020 — DIGIT — the foundational sensor papers Sparsh trains on. GelSight PDF · DIGIT PDF

Newly ingested in 2026-06-24 batch — directly relevant:

Tactile Beyond Pixels (Sparsh-X) — the direct successor from the same line; extends Sparsh beyond a single image stream to multi-modal tactile signals.
Transferable Tactile Transformers (T3) and UniT — the two concurrent tactile-representation works Sparsh names as closest; cross-sensor / data-efficient counterpoints.
AnyTouch and MViTac — unified static/dynamic and contrastive visuo-tactile pre-training; same Cluster-B representation-learning family.
Dexterity from Touch (T-DEX) and See to Touch — SSL tactile pre-training aimed at dexterity/policy, the action-side use of such representations.
UniTouch and TVL — touch–language/vision binding; where Sparsh-style encoders meet the language-grounding thesis BLADE cares about.
Beyond Sight (FuSe) and Tactile-VLA — multisensory VLA policies; the downstream consumers a frozen Sparsh encoder could feed.
ObjectFolder, Touch and Go — pre-training datasets Sparsh actually uses.
Towards Forceful Robotic Foundation Models (survey) — situates Sparsh in the force/touch foundation-model landscape.