Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

Vedant Dave*, Fotios Lygerakis*, Elmar Rueckert · Cyber-Physical-Systems Lab, Montanuniversität Leoben · 2024 · arXiv preprint (cs.RO) · arXiv:2401.12024

One-liner. MViTac pre-trains paired vision and touch ResNet-18 encoders with a four-term InfoNCE objective — two intra-modal (vision–vision, touch–touch) plus two inter-modal (vision↔touch) losses — and shows that adding the within-modality terms on top of the usual cross-modal contrastive recipe yields representations that beat both prior SSL visuo-tactile methods and a supervised baseline on material-property classification, with linear probing on frozen features.

Problem & motivation

Vision captures global scene structure but misses the fine-grained contact attributes (hardness, roughness, texture, slip) that touch delivers; fusing the two is hard because they carry very different information densities and most prior fusion work leaned on human-labeled data, which is especially expensive for tactile signals (every sample requires physical interaction). Contrastive self-supervised learning is the obvious lever for unlabeled tactile data, but earlier visuo-tactile SSL methods (e.g., SSVTP, the CMC variant in Touch-and-Go) optimize only cross-modal agreement. MViTac's thesis is that the discriminative power needed for downstream tasks also benefits from within-modal contrast, so it learns both at once.

Method

The architecture (Fig 1) uses dual encoders, an image encoder f(·;θ) and a tactile encoder f(·;ψ), each a ResNet-18 backbone pre-trained on ImageNet, together with momentum counterparts (θk, ψk) updated by EMA (Eq. 1–3). Both modalities are treated as RGB images in ℝ^(H×W×3) (the GelSight-style tactile imprint is itself an image). Each encoder feeds two 2-layer MLP projection heads, one for the intra-modal task and one for the inter-modal task, mapping the 512-d backbone output to a 128-d embedding. Loss throughout is InfoNCE (Eq. 8) with temperature τ; unlike MoCo/MCT, negatives are simply the other in-batch keys (no memory queue).
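The EMA update of the momentum encoders can be sketched in a few lines. This is a minimal illustration of the standard momentum rule (momentum weights move a small step toward the online weights), not the authors' code: the function name `ema_update`, the dict-of-lists weight layout, and the coefficient `m` are assumptions made for the example, and the paper's Eq. 1–3 may use a different coefficient or schedule.

```python
def ema_update(online, momentum, m=0.99):
    """Return momentum-encoder weights after one EMA step.

    online, momentum: dicts mapping a parameter name to a list of floats.
    m: momentum coefficient (0.99 is an illustrative value, not from the paper).
    """
    return {
        name: [m * k + (1.0 - m) * q
               for k, q in zip(momentum[name], online[name])]
        for name in momentum
    }


# Toy weights standing in for the image encoder (theta) and its
# momentum counterpart (theta_k); real encoders are ResNet-18 backbones.
theta = {"w": [1.0, 2.0]}
theta_k = {"w": [0.0, 0.0]}

theta_k = ema_update(theta, theta_k, m=0.9)
print(theta_k)  # each momentum weight has moved 10% of the way toward theta
```

The same update would be applied to the tactile pair (ψ, ψk). Only the online encoders receive gradients; the momentum encoders change solely through this rule.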

Intra-modal (Eq. 4–8). Augment one image twice into query/key, encode through encoder + momentum encoder, contrast. Done separately for vision (Lvv) and touch (Ltt) to sharpen within-modality features.

Inter-modal (Eq. 9–13). For a paired (ov, ot) sample: image-to-tactile (Lvt) pulls the image-encoder query toward the momentum-tactile key, and tactile-to-image (Ltv) does the reverse. The paper notes it adapts Multimodal Contrastive Training (MCT, [42]) but drops MCT's margin/dot-product formulation in favor of plain InfoNCE since both modalities live in image space.
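The intra- and inter-modal terms share one InfoNCE form and differ only in which encoder supplies the query and which momentum encoder supplies the key. The block below writes this out in a generic in-batch formulation. The notation is an assumption for illustration: N is the batch size, q_i is the projected query for sample i, k_j is the projected key for sample j, and superscripts v and t mark the vision and touch branches. The paper's Eq. 8 may differ in details such as embedding normalization, and since each encoder has separate intra- and inter-modal projection heads, the q^v in Lvv and the q^v in Lvt are different projections of the same backbone feature.

```latex
\mathcal{L}(q, k) = -\frac{1}{N} \sum_{i=1}^{N} \log
  \frac{\exp(q_i \cdot k_i / \tau)}{\sum_{j=1}^{N} \exp(q_i \cdot k_j / \tau)}

\begin{aligned}
L_{vv} &= \mathcal{L}(q^{v}, k^{v}), & L_{tt} &= \mathcal{L}(q^{t}, k^{t}), \\
L_{vt} &= \mathcal{L}(q^{v}, k^{t}), & L_{tv} &= \mathcal{L}(q^{t}, k^{v})
\end{aligned}
```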

Combined loss (Eq. 14): Lmm = Lvv + Ltt + λinter(Lvt + Ltv), where λinter trades off the within- vs across-modality objectives. After pre-training, projection heads are discarded, the encoder is frozen, and a linear classifier is trained on top (linear probing) for each downstream task.
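A small stdlib-only sketch of how the four terms combine is given here. It is illustrative rather than the authors' implementation: `info_nce` is a generic in-batch InfoNCE, `combined_loss` mirrors the weighted sum in Eq. 14 as summarized above, and the values `tau=0.07` and `lam_inter=1.0` are placeholders, not settings taken from the paper. For brevity the toy reuses one embedding per modality for both tasks, whereas the method uses separate intra- and inter-modal projection heads and momentum-encoder keys.

```python
import math


def info_nce(queries, keys, tau=0.07):
    """Mean InfoNCE over a batch.

    keys[i] is the positive for queries[i]; every other key in the
    batch acts as a negative (no memory queue).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def combined_loss(l_vv, l_tt, l_vt, l_tv, lam_inter=1.0):
    """Intra-modal terms plus lam_inter times the inter-modal terms."""
    return l_vv + l_tt + lam_inter * (l_vt + l_tv)


# Toy 2-d unit embeddings for a batch of two paired samples.
v = [[1.0, 0.0], [0.0, 1.0]]    # vision embeddings
t = [[0.8, 0.6], [-0.6, 0.8]]   # touch embeddings (slightly rotated)

total = combined_loss(
    l_vv=info_nce(v, v),
    l_tt=info_nce(t, t),
    l_vt=info_nce(v, t),
    l_tv=info_nce(t, v),
    lam_inter=1.0,
)
print(total)
```

Setting `lam_inter` to 0 reduces the objective to two independent single-modality contrastive losses, while a large value lets the cross-modal agreement terms dominate.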

Setup

Results

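On material-property identification (Table I, Top-1 accuracy %), the tactile+visual MViTac is best across all three tasks. Among the rows listed here, tactile-only MViTac also beats the tactile-only supervised ResNet-18 on category (57.6 vs 57.4) and rough/smooth (82.1 vs 79.3), though it trails the tactile+visual SSL baselines on every task: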

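| Method                | Modality | Category | Hard/Soft | Rough/Smooth |
|-----------------------|----------|----------|-----------|--------------|
| ResNet18 (supervised) | Tactile  | 57.4     | 89.1      | 79.3         |
| ResNet18 (supervised) | Tac+Vis  | 48.0     | 85.9      | 80.0         |
| TAG (CMC)             | Tac+Vis  | 68.6     | 87.1      | 82.4         |
| SSVTP (InfoNCE)       | Tac+Vis  | 70.7     | 88.6      | 83.6         |
| MViTac (ours)         | Tactile  | 57.6     | 86.2      | 82.1         |
| MViTac (ours)         | Tac+Vis  | 74.9     | 91.8      | 84.1         |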

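Where it wins: tactile+visual MViTac tops every material task, and it beats the best supervised result on hard/soft (91.8 vs 89.1), which is notable because supervised training usually wins in low-data regimes. Adding vision lifts MViTac on all three tasks (category goes from 57.6 to 74.9), whereas the supervised ResNet-18 gets worse on category (57.4 to 48.0) and hard/soft (89.1 to 85.9) when vision is added.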
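For the two methods that appear in Table I with both a tactile-only and a tactile+visual row, the effect of adding vision can be checked by simple subtraction. The snippet below copies those four rows from the table; the helper name `vision_gain` is made up for this example.

```python
# Top-1 accuracy (%) copied from Table I:
# (category, hard/soft, rough/smooth).
table = {
    ("ResNet18 (supervised)", "Tactile"): (57.4, 89.1, 79.3),
    ("ResNet18 (supervised)", "Tac+Vis"): (48.0, 85.9, 80.0),
    ("MViTac", "Tactile"): (57.6, 86.2, 82.1),
    ("MViTac", "Tac+Vis"): (74.9, 91.8, 84.1),
}
tasks = ("category", "hard/soft", "rough/smooth")


def vision_gain(method):
    """Percentage-point change per task when vision is added to touch."""
    tac = table[(method, "Tactile")]
    both = table[(method, "Tac+Vis")]
    return {task: round(b - a, 1) for task, a, b in zip(tasks, tac, both)}


for method in ("ResNet18 (supervised)", "MViTac"):
    print(method, vision_gain(method))
```

A positive value means the tactile+visual variant scores higher than the tactile-only variant on that task.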

Where it loses: on grasp success prediction (Table II), MViTac reaches 60.3%, beating TAG/CMC (56.3%) by about 4 percentage points on unseen objects but losing badly to the Calandra supervised baseline (73.1%), a gap of 12.8 points. The authors attribute the gap to the small (~18k), imbalanced training set, arguing that contrastive methods need bigger, more diverse data. In the tactile-only material setting the supervised ResNet-18 still edges MViTac on hard/soft (89.1 vs 86.2).

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

Off the central BLADE thesis (no language, no planning, no long-horizon manipulation), but squarely on the batch's grounding sub-thesis: many manipulation predicates — is_grasped, surface_is_rough, is_hard — are not visually evaluable and live in touch. MViTac is a clean, minimal demonstration that a frozen self-supervised touch encoder can linearly read out exactly those properties (hard/soft, rough/smooth, grasp success). That is precisely the kind of cheap, frozen tactile feature extractor a BLADE-style predicate classifier could sit on top of when a predicate is tactile rather than visual. Its specific contribution — that intra-modal contrast helps on top of cross-modal — is a representation-learning design note rather than a manipulation result; treat this as a method/baseline anchor in the visuo-tactile SSL lineage, not as a control-policy paper. The grasp-prediction loss to supervised learning is also a useful data-point on where contrastive pretraining still under-delivers in the low-data tactile regime.

Quotable

MViTac leverages intra and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction. — Abstract / p.1
Our model not only distinguishes similarities and differences between visual and tactile data but also places substantial emphasis on learning within the same sensory modality. — §IV.B / p.5
CL techniques necessitate bigger and more diverse datasets to perform comparably with supervised methods. — §IV.B / p.6

Related

Papers cited that should likely be ingested next:

Newly ingested in 2026-06-24 batch — directly relevant: