One-liner. Sparsh is a family of self-supervised touch encoders (MAE / DINO / DINOv2 / I-JEPA / V-JEPA), pre-trained on 460k+ unlabeled vision-based tactile images, that generalizes across DIGIT, GelSight 2017, and GelSight Mini sensors and across six touch-centric tasks: the "foundation-model-for-touch" move, replacing per-sensor, per-task handcrafted encoders and shipping with TacBench to standardize evaluation.
Vision-based tactile sensors (GelSight, DIGIT) capture contact geometry, texture, and force at the sensor-object interface, but the field's prevailing practice is to train a custom model with task-specific labeled data for each sensor. That fragments effort: a feature extractor tuned for GelSight with markers may not transfer to markerless DIGIT, and an encoder optimized for texture recognition may be useless for slip reasoning. Worse, the labels that would let you build a general model — ground-truth contact forces, slip, deformation tracking, extrinsic contact — are exactly the quantities that are expensive or infeasible to instrument at scale. Sparsh borrows the self-supervised-learning (SSL) recipe that reshaped NLP and vision: learn data-agnostic objectives from cheap unlabeled tactile images, then probe the frozen representation on downstream tasks. It also fills a missing-benchmark gap with TacBench, so progress is measurable across sensors and models.
Two deliverables: the Sparsh encoder family and the TacBench benchmark (Fig 1, Fig 2).
SSL pre-training (the encoders). All encoders are ViT-B/14 (Table 1), pre-trained without labels on ~460k tactile images (70% / 462.7k of a ~661k-image pool; the rest held out for online probes). Five SSL paradigms are adapted from their official vision codebases:
$\lVert I_{\text{target}} - I_{\text{recon}} \rVert_2$ (pixel-space objective).
Tactile-specific design moves. (i) Background subtraction for markerless DIGIT and GelSight Mini, giving the model a no-contact reference so static shear from a perpendicular force is legible; empirically improves same-sensor generalization. (ii) Temporal tokenization: for image SSL methods, two frames at stride 5 are concatenated channel-wise, $I_t \oplus I_{t-5} \to x \in \mathbb{R}^{h \times w \times 6}$, an ~80 ms window matching the human partial-slip reaction time; for V-JEPA, 4-frame clips at $[t, t-2, t-4, t-6]$ (~100 ms). All inputs are reshaped to 224×224.
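The two design moves above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the `(T, h, w, 3)` frame layout is an assumption, and the 60 fps frame rate used for the window arithmetic is an assumption chosen because it reproduces the ~80 ms and ~100 ms figures quoted above. Resizing to 224×224 is left out.

```python
import numpy as np

def subtract_background(frame, background):
    """Signed difference between a frame and a no-contact reference image."""
    return frame.astype(np.float32) - background.astype(np.float32)

def stack_temporal_pair(frames, t, stride=5, background=None):
    """Concatenate frame t and frame t - stride along the channel axis.

    frames: array of shape (T, h, w, 3) (assumed layout).
    Returns an array of shape (h, w, 6), i.e. I_t (+) I_{t-stride}.
    """
    if t - stride < 0:
        raise IndexError("need at least `stride` frames of history before t")
    current = frames[t].astype(np.float32)
    past = frames[t - stride].astype(np.float32)
    if background is not None:
        current = subtract_background(current, background)
        past = subtract_background(past, background)
    return np.concatenate([current, past], axis=-1)

def clip_indices(t, num_frames=4, step=2):
    """Frame indices for a video clip: [t, t-2, t-4, t-6] by default."""
    return [t - step * k for k in range(num_frames)]

def window_ms(frame_gap, fps=60.0):
    """Time span covered by `frame_gap` frames; fps=60 is an assumption."""
    return 1000.0 * frame_gap / fps

# Usage on synthetic data (random pixels standing in for tactile frames).
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(10, 8, 8, 3), dtype=np.uint8)
background = frames[0]
x = stack_temporal_pair(frames, t=7, background=background)
print(x.shape, clip_indices(6), round(window_ms(5), 1), window_ms(6))
```

Under the 60 fps assumption, a stride of 5 frames spans about 83 ms and the 6-frame span of the V-JEPA clip is 100 ms, which matches the rounded figures in the text.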
TacBench (the benchmark) & evaluation protocol. Six tasks under three questions: comprehend properties — T1 force estimation, T1A force-field visualization, T2 slip detection; enable perception — T3 SE(2) pose estimation, T4 grasp stability, T5 textile recognition; enable planning — T6 bead-maze tactile policy. Standard evaluation freezes the Sparsh encoder and trains an attentive decoder (cross-attention + 2-layer MLP) on the labeled set; dense tasks (force fields) use a DPT decoder (Fig 7). The headline comparison is against an E2E baseline of identical capacity trained from scratch, swept across labeled-data budgets (1% / 10% / 33% / 50% / 100%).
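To make the frozen-encoder protocol concrete, here is a minimal forward-pass sketch of an attentive probe: one learnable query cross-attends over the frozen patch tokens, and the pooled vector goes through a 2-layer MLP. It is a sketch under stated assumptions, not the Sparsh decoder itself: the class name, single-head attention, the ReLU, the hidden width, and the 3-dimensional output (standing in for a force vector) are all hypothetical choices, and training is omitted. The token shape assumes a ViT-B/14 on 224×224 inputs, i.e. 16×16 = 256 patch tokens of width 768, with only patch tokens passed in.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentiveProbe:
    """Single-query cross-attention pooling followed by a 2-layer MLP."""

    def __init__(self, dim, hidden, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.query = rng.normal(0.0, scale, (1, dim))    # learnable query
        self.w_k = rng.normal(0.0, scale, (dim, dim))    # key projection
        self.w_v = rng.normal(0.0, scale, (dim, dim))    # value projection
        self.w1 = rng.normal(0.0, scale, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, hidden ** -0.5, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def attention_weights(self, tokens):
        """tokens: (B, N, D) frozen features -> weights of shape (B, 1, N)."""
        keys = tokens @ self.w_k
        scores = self.query @ keys.transpose(0, 2, 1) / np.sqrt(keys.shape[-1])
        return softmax(scores, axis=-1)

    def __call__(self, tokens):
        attn = self.attention_weights(tokens)             # (B, 1, N)
        pooled = (attn @ (tokens @ self.w_v))[:, 0, :]    # (B, D)
        hidden = np.maximum(pooled @ self.w1 + self.b1, 0.0)
        return hidden @ self.w2 + self.b2                 # (B, out_dim)

# Usage: random arrays stand in for frozen encoder outputs.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(2, 256, 768))
probe = AttentiveProbe(dim=768, hidden=128, out_dim=3)
prediction = probe(tokens)
weights = probe.attention_weights(tokens)
print(prediction.shape, weights.shape)
```

Because the encoder stays frozen, only the probe's parameters would be fitted on the labeled set, which is what lets the comparison across label budgets isolate the quality of the pre-trained representation.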
Headline: SSL pre-training + frozen probe beats from-scratch E2E by an average of 95.1% across TacBench under limited (33–50%) labeled budgets. Sparsh (DINO) and Sparsh (I-JEPA) are the most competitive overall, with DINO outperforming I-JEPA by ~5.6% on average — evidence that latent-space SSL beats pixel-space (MAE) for tactile images. Task-level character (Fig 4):
| Task | Best Sparsh variant | Headline finding |
|---|---|---|
| T1 Force estimation | Sparsh (DINO) | Low force error even with sparse labels; works on markerless GelSight Mini, where E2E fails |
| T2 Slip detection | Sparsh (V-JEPA) | Highest F1; 4-frame temporal window helps; strong even at 1% data |
| T3 Pose estimation | Sparsh (DINO) | Holds accuracy in the low-data regime; E2E degrades sharply |
| T4 Grasp stability | Sparsh (I-JEPA)/(V-JEPA) | ~80% acc from a single finger’s touch, surpassing Calandra et al. (tactile+vision) |
| T5 Textile recognition | Sparsh (MAE) | MAE’s pixel features shine; big gain at 10% data over hard-to-train E2E |
| T6 Bead maze (policy) | Sparsh (DINO)/(I-JEPA) | ~20–53% lower trajectory error vs E2E; but no model finishes the real maze |
Where it loses / is honest: on T6 the lower BC trajectory error did not translate to real-robot success — none of the models completes the full maze on hardware (lack of force control, no error recovery after losing grip, drift from local decision-making). Specialist from-scratch models can show better real rollout performance because the narrow task domain overfits favorably — a known pre-trained-vision phenomenon the authors cite. MAE generally lags the latent-space methods except on the texture-heavy T5.
From the authors (§8):
What I noticed reading it:
This is squarely on the thesis behind this batch: many manipulation predicates are not visually evaluable. `is_grasped`, `is_inserted`, `surface_is_rough`, `is_slipping`, `is_screwed_tight` live in touch and force, not pixels. BLADE learns visual classifiers for its predicates and explicitly flags contact-rich tasks as a segmentation/observation weakness. Sparsh is the infrastructure that could supply the missing modality: a frozen, sensor-agnostic touch encoder whose features TacBench shows already linearly encode force, slip state, SE(2) pose, and grasp success. Concretely, a BLADE-style predicate classifier $f_\theta(p): O \to \{T, F\}$ could take a Sparsh embedding instead of (or alongside) a cropped RGB region, making touch-dependent preconditions/effects learnable from the same auto-labeled-demo pipeline. The bead-maze result is also a cautionary note for the BLADE thesis: good representation-level read-out does not guarantee closed-loop control without force-aware action. The symbolic layer can name `is_slipping`, but the controller still needs force modulation, which BLADE's purely categorical abstraction punts to the diffusion policy. Cluster B (representation/foundation models) is the natural feeder; the policy clusters (C/F) are where these reps would actually plug into manipulation.
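As a purely illustrative sketch of that idea, a touch predicate can be written as a linear read-out on a frozen embedding, thresholded to a boolean. Nothing here comes from BLADE or Sparsh: the `TouchPredicate` class, its fields, and the toy 2-dimensional embedding are hypothetical, and the weights are hand-set rather than learned.

```python
import math
from dataclasses import dataclass
from typing import Sequence

@dataclass
class TouchPredicate:
    """Hypothetical predicate f(p): embedding -> {True, False}."""
    name: str
    weights: Sequence[float]
    bias: float = 0.0
    threshold: float = 0.5

    def probability(self, embedding: Sequence[float]) -> float:
        if len(embedding) != len(self.weights):
            raise ValueError("embedding and weights must have the same length")
        logit = self.bias + sum(w * x for w, x in zip(self.weights, embedding))
        # Numerically stable sigmoid for large positive or negative logits.
        if logit >= 0:
            return 1.0 / (1.0 + math.exp(-logit))
        z = math.exp(logit)
        return z / (1.0 + z)

    def __call__(self, embedding: Sequence[float]) -> bool:
        return self.probability(embedding) >= self.threshold

# Usage with a toy 2-D "embedding"; a real one would come from the frozen encoder.
is_slipping = TouchPredicate(name="is_slipping", weights=[1.0, -1.0])
print(is_slipping([2.0, 0.0]), is_slipping([0.0, 2.0]))
```

The boolean output is all a symbolic planner would see, which is exactly where the bead-maze caveat bites: the magnitude information needed for force modulation is discarded at the threshold.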
Directly related newly-ingested batch papers: see Related below.
We find that SSL pre-training for touch representation outperforms task and sensor-specific end-to-end training by 95.1% on average over TacBench, and Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. — Abstract / p.1
However, current approaches primarily focus on texture and visual properties and overlook physical contact properties, such as forces, slippage, and poses, which are essential for dexterous manipulation. — §2 Related work / p.3
Touch comes before sight, before speech. — §1 Introduction / p.1 (Margaret Atwood, via the authors)
Papers cited that should likely be ingested next:
Newly ingested in 2026-06-24 batch — directly relevant: