Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

Vedant Dave*, Fotios Lygerakis*, Elmar Rueckert · Cyber-Physical-Systems Lab, Montanuniversität Leoben · 2024 · arXiv preprint (cs.RO) · arXiv:2401.12024

One-liner. MViTac pre-trains paired vision and touch ResNet-18 encoders with a four-term InfoNCE objective — two intra-modal (vision–vision, touch–touch) plus two inter-modal (vision↔touch) losses — and shows that adding the within-modality terms on top of the usual cross-modal contrastive recipe yields representations that beat both prior SSL visuo-tactile methods and a supervised baseline on material-property classification, with linear probing on frozen features.

Problem & motivation

Vision captures global scene structure but misses the fine-grained contact attributes (hardness, roughness, texture, slip) that touch delivers; fusing the two is hard because they carry very different information densities and most prior fusion work leaned on human-labeled data, which is especially expensive for tactile signals (every sample requires physical interaction). Contrastive self-supervised learning is the obvious lever for unlabeled tactile data, but earlier visuo-tactile SSL methods (e.g., SSVTP, the CMC variant in Touch-and-Go) optimize only cross-modal agreement. MViTac's thesis is that the discriminative power needed for downstream tasks also benefits from within-modal contrast, so it learns both at once.

Method

The architecture (Fig. 1) uses dual encoders, an image encoder f(·;θ) and a tactile encoder f(·;ψ), each a ResNet-18 backbone pre-trained on ImageNet, together with momentum counterparts (θ_k, ψ_k) updated by EMA (Eq. 1–3). Both modalities are treated as RGB images in R^{H×W×3} (the GelSight-style tactile imprint is itself an image). Each encoder feeds two 2-layer MLP projection heads, one for the intra-modal task and one for the inter-modal task, mapping the 512-d backbone output to a 128-d embedding. The loss throughout is InfoNCE (Eq. 8) with temperature τ; unlike MoCo/MCT, negatives are simply the other in-batch keys (no memory queue).
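
The EMA update of the momentum encoders can be sketched as follows. This is a minimal illustration over flat parameter lists, not the authors' code: the function name `ema_update` is hypothetical, and the coefficient m = 0.9 is an assumption chosen for readability, since these notes do not record the paper's momentum value.

```python
from typing import List


def ema_update(online: List[float], momentum: List[float], m: float = 0.9) -> List[float]:
    """Return the momentum-encoder parameters after one EMA step.

    Each momentum parameter moves a fraction (1 - m) of the way toward the
    corresponding online parameter. The value of m is an assumption here.
    """
    if len(online) != len(momentum):
        raise ValueError("parameter lists must have the same length")
    return [m * k + (1.0 - m) * q for q, k in zip(online, momentum)]


# Toy stand-ins for the online weights (theta) and momentum weights (theta_k).
theta = [1.0, -2.0, 0.5]
theta_k = [0.0, 0.0, 0.0]

# One step moves theta_k 10% of the way toward theta.
theta_k = ema_update(theta, theta_k, m=0.9)
```

Only the online encoder receives gradients; the momentum copy trails it through this update, which keeps the keys slowly varying between steps.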

Intra-modal (Eq. 4–8). Augment one image twice into query/key, encode through encoder + momentum encoder, contrast. Done separately for vision (L_vv) and touch (L_tt) to sharpen within-modality features.

Inter-modal (Eq. 9–13). For a paired (o_v, o_t) sample: image-to-tactile (L_vt) pulls the image-encoder query toward the momentum-tactile key, and tactile-to-image (L_tv) does the reverse. The paper notes it adapts Multimodal Contrastive Training (MCT, [42]) but drops MCT's margin/dot-product formulation in favor of plain InfoNCE since both modalities live in image space.
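
Both the intra-modal and inter-modal terms reduce to the same in-batch InfoNCE computation, only with different query/key sources. The sketch below is a plain-Python illustration under stated assumptions: similarity is cosine (dot product of L2-normalized embeddings), the helper names `l2_normalize` and `info_nce` are hypothetical, and the toy 2-d vectors stand in for the 128-d projections.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries: Sequence[Vector], keys: Sequence[Vector], tau: float = 0.07) -> float:
    """Mean InfoNCE loss over a batch.

    keys[i] is the positive for queries[i]; every other key in the batch
    acts as a negative (no memory queue).
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy paired embeddings: row i of each list comes from the same sample.
vision_q = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]   # image-encoder queries
touch_k = [[0.9, 0.0], [0.0, 0.8], [-0.7, 0.1]]    # momentum tactile keys
touch_q = [[0.8, 0.1], [0.1, 0.9], [-0.9, 0.0]]    # tactile-encoder queries
vision_k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.1]]   # momentum image keys

loss_vt = info_nce(vision_q, touch_k)  # image-to-tactile term
loss_tv = info_nce(touch_q, vision_k)  # tactile-to-image term
```

For the intra-modal terms the same function would be called with two augmented views of one image as query and key, for vision and for touch separately.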

Combined loss (Eq. 14): L_mm = L_vv + L_tt + λ_inter(L_vt + L_tv), where λ_inter trades off the within- vs across-modality objectives. After pre-training, projection heads are discarded, the encoder is frozen, and a linear classifier is trained on top (linear probing) for each downstream task.
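
The linear-probing protocol can be illustrated with a toy example. This is a sketch under loud assumptions, not the paper's pipeline: `frozen_encoder` is a hypothetical fixed function standing in for the frozen ResNet-18, the linear classifier is a binary logistic regression trained with plain gradient descent (the paper uses Adam), and the eight 2-d inputs with hard = 1 / soft = 0 labels are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def frozen_encoder(x: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a frozen pre-trained backbone.

    It has no trainable parameters, so probing cannot alter its features.
    """
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            err = p - y
            for j in range(dim):
                grad_w[j] += err * f[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, f: Sequence[float]) -> int:
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0.0 else 0


# Invented toy inputs: four "hard" samples (label 1), four "soft" (label 0).
inputs = [(2.0, 1.5), (1.5, 2.5), (2.5, 2.0), (1.8, 1.9),
          (-2.0, -1.5), (-1.5, -2.5), (-2.5, -2.0), (-1.8, -1.9)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Features are computed once from the frozen encoder; only w and b are trained.
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Because the encoder never changes, probe accuracy measures how linearly separable the pre-trained features already are, which is the property the downstream numbers in these notes report.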

Setup

Datasets / benchmarks: Touch-and-Go (TAG) for material-property identification — three tasks: 20-way material category, hard/soft binary, rough/smooth binary, using the authors' prescribed splits. Calandra ("More Than a Feeling") dataset for grasp success prediction — paired GelSight (left/right) + RGB triplets (before/during/after); no standard split, so authors create their own using 40 of 106 objects (~18,000 samples), tactile pair stacked across channels.
Hardware / simulator: No robot rollouts — purely offline learning from pre-collected real-world datasets. Tactile inputs are vision-based optical sensor images (GelSight-style). Grasping experiment baseline uses the TACTO implementation. Training on a single RTX 4090 GPU.
Baselines: ResNet-18 supervised (tactile, and tactile+visual concatenated); TAG [11] (Contrastive Multiview Coding); SSVTP [10] (Kerr et al., InfoNCE-based visuo-tactile pretraining); Calandra et al. [4] supervised (grasping only); chance.
Compute: single RTX 4090 GPU, batch size 256, 240 epochs of pre-training + 60 epochs downstream; Adam optimizer, pre-training LR 0.03, downstream LR 1e-4, best τ = 0.07.

Results

On material-property identification (Table I, Top-1 accuracy %), the tactile+visual MViTac is best across all three tasks. The tactile-only MViTac trails the tactile+visual SSL baselines but edges the tactile-only supervised ResNet-18 on category (57.6 vs 57.4) and rough/smooth (82.1 vs 79.3):

Method	Modality	Category	Hard/Soft	Rough/Smooth
ResNet18 (supervised)	Tactile	57.4	89.1	79.3
ResNet18 (supervised)	Tac+Vis	48.0	85.9	80.0
TAG (CMC)	Tac+Vis	68.6	87.1	82.4
SSVTP (InfoNCE)	Tac+Vis	70.7	88.6	83.6
MViTac (ours)	Tactile	57.6	86.2	82.1
MViTac (ours)	Tac+Vis	74.9	91.8	84.1
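
The per-task margins in Table I can be checked with a few lines of arithmetic. The snippet below only re-enters the numbers from the table; the dictionary layout and the helper name `margin` are illustrative assumptions.

```python
from typing import Dict, Tuple

TASKS = ("Category", "Hard/Soft", "Rough/Smooth")

# Top-1 accuracy (%) copied from Table I, keyed by (method, modality).
TABLE_I: Dict[Tuple[str, str], Tuple[float, float, float]] = {
    ("ResNet18 (supervised)", "Tactile"): (57.4, 89.1, 79.3),
    ("ResNet18 (supervised)", "Tac+Vis"): (48.0, 85.9, 80.0),
    ("TAG (CMC)", "Tac+Vis"): (68.6, 87.1, 82.4),
    ("SSVTP (InfoNCE)", "Tac+Vis"): (70.7, 88.6, 83.6),
    ("MViTac", "Tactile"): (57.6, 86.2, 82.1),
    ("MViTac", "Tac+Vis"): (74.9, 91.8, 84.1),
}


def margin(a: Tuple[str, str], b: Tuple[str, str]) -> Tuple[float, ...]:
    """Accuracy of row a minus row b, in percentage points, per task."""
    return tuple(round(x - y, 1) for x, y in zip(TABLE_I[a], TABLE_I[b]))


# MViTac (Tac+Vis) against the strongest SSL baseline.
vs_ssvtp = margin(("MViTac", "Tac+Vis"), ("SSVTP (InfoNCE)", "Tac+Vis"))

# Tactile-only MViTac against the tactile-only supervised ResNet-18.
tactile_vs_supervised = margin(("MViTac", "Tactile"), ("ResNet18 (supervised)", "Tactile"))

# Effect of adding vision to the supervised ResNet-18.
supervised_vision_gain = margin(
    ("ResNet18 (supervised)", "Tac+Vis"), ("ResNet18 (supervised)", "Tactile")
)

for name, deltas in [("MViTac Tac+Vis vs SSVTP", vs_ssvtp),
                     ("MViTac Tactile vs supervised Tactile", tactile_vs_supervised),
                     ("Supervised Tac+Vis vs Tactile", supervised_vision_gain)]:
    print(name, dict(zip(TASKS, deltas)))
```

The margins over SSVTP are 4.2, 3.2, and 0.5 points; with no variance reported, the smallest of these should be read with caution.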

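Where it wins: tactile+visual MViTac tops every material task, including beating the best supervised model on hard/soft (91.8 vs 89.1), which is notable since supervised training usually wins in low-data regimes. Adding vision helps MViTac on all three tasks, but not every method: the supervised ResNet-18 gets worse on category (57.4 → 48.0) and hard/soft (89.1 → 85.9) when visual input is concatenated, improving only on rough/smooth (79.3 → 80.0).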

Where it loses: on grasp success prediction (Table II), MViTac reaches 60.3%, beating TAG/CMC (56.3%) by 4.0 percentage points on unseen objects but losing badly to the Calandra supervised baseline (73.1%). The authors attribute the gap to the small (~18k), imbalanced training set: contrastive methods need bigger, more diverse data. In the tactile-only material setting the supervised ResNet-18 still edges MViTac on hard/soft (89.1 vs 86.2).

Limitations & open questions

From the authors:

SSL does not yet beat supervised methods when data is limited (grasp prediction); closing this gap is flagged future work.
Tactile alone is insufficient for complex/uncontrolled surfaces; needs visual augmentation.
Evaluated only on offline datasets — no real-robot or real-time deployment; broader/more sophisticated manipulation tasks untested.

What I noticed reading it:

The Calandra grasp split is self-created (no standard split), so the 60.3 vs 56.3 comparison rests on a split the authors chose — not directly comparable to published Calandra numbers, and the supervised baseline they cite (73.1) used a different setup.
No ablation isolating the contribution of the intra-modal terms — the paper's whole pitch is "intra + inter beats inter-only," yet there is no λ_inter sweep or drop-the-intra-loss ablation reported. The claim is argued from out-performing inter-focused baselines (TAG, SSVTP), which also differ in backbone/training details, confounding the attribution.
Single seed / no variance reported on any table — small accuracy gaps (e.g., 57.4 vs 57.6 on category, tactile-only) are within plausible noise.
"Tactile" here means an optical GelSight-style image; the method is really image–image contrastive learning and would not transfer directly to non-image tactile streams (force/torque, taxel arrays, audio) without re-engineering the encoder.

Why I care

Off the central BLADE thesis (no language, no planning, no long-horizon manipulation), but squarely on the batch's grounding sub-thesis: many manipulation predicates — is_grasped, surface_is_rough, is_hard — are not visually evaluable and live in touch. MViTac is a clean, minimal demonstration that a frozen self-supervised touch encoder can linearly read out exactly those properties (hard/soft, rough/smooth, grasp success). That is precisely the kind of cheap, frozen tactile feature extractor a BLADE-style predicate classifier could sit on top of when a predicate is tactile rather than visual. Its specific contribution — that intra-modal contrast helps on top of cross-modal — is a representation-learning design note rather than a manipulation result; treat this as a method/baseline anchor in the visuo-tactile SSL lineage, not as a control-policy paper. The grasp-prediction loss to supervised learning is also a useful data-point on where contrastive pretraining still under-delivers in the low-data tactile regime.

Quotable

MViTac leverages intra and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction. — Abstract / p.1

Our model not only distinguishes similarities and differences between visual and tactile data but also places substantial emphasis on learning within the same sensory modality. — §IV.B / p.5

CL techniques necessitate bigger and more diverse datasets to perform comparably with supervised methods. — §IV.B / p.6

Papers cited that should likely be ingested next:

[11] Yang et al. — Touch and Go (NeurIPS 2022 D&B) — the TAG dataset + CMC baseline; primary training/eval data. expected slug.
[10] Kerr et al. — SSVTP (Self-Supervised Visuo-Tactile Pretraining, 2023) — the closest SSL baseline (InfoNCE visuo-tactile); not in the batch list.
[51] Wang et al. — TACTO simulator — used for the grasping baseline. expected slug.
[4]/[37] Calandra et al. — More Than a Feeling / The Feeling of Success — the grasp dataset + supervised baseline; canonical visuo-tactile grasping reference.
[34] GelSight, [35] DIGIT — the vision-based tactile sensors underlying the image-style tactile inputs.

Newly ingested in 2026-06-24 batch — directly relevant:

Dexterity from Touch (T-DEX) — sibling self-supervised tactile-representation paper (Guzey et al. [12], cited here); same "pretrain a touch encoder, reuse downstream" recipe but for dexterous manipulation rather than linear-probe classification.
See to Touch — same lab line; visual incentives for tactile dexterity, complementary visuo-tactile representation angle.
Sparsh and T3 (Transferable Tactile Transformers) — the scaled-up successors to MViTac's idea: large self-supervised touch foundation models; MViTac is an early, small-scale point on that trajectory.
AnyTouch and UniT — unified / data-efficient visuo-tactile representation learning; direct descendants in the same cluster B.
Touch and Go — the dataset + CMC baseline MViTac trains on and compares against (if ingested in this batch).
TACTO — the GelSight simulator MViTac uses for its grasping baseline (if ingested in this batch).