One-liner. MViTac pre-trains paired vision and touch ResNet-18 encoders with a four-term InfoNCE objective: two intra-modal losses (vision–vision, touch–touch) plus two inter-modal losses (vision→touch, touch→vision). The paper reports that adding the within-modality terms on top of the usual cross-modal contrastive recipe yields representations that, under linear probing on frozen features, beat both prior SSL visuo-tactile methods and a supervised baseline on material-property classification.
Vision captures global scene structure but misses the fine-grained contact attributes (hardness, roughness, texture, slip) that touch delivers; fusing the two is hard because they carry very different information densities and most prior fusion work leaned on human-labeled data, which is especially expensive for tactile signals (every sample requires physical interaction). Contrastive self-supervised learning is the obvious lever for unlabeled tactile data, but earlier visuo-tactile SSL methods (e.g., SSVTP, the CMC variant in Touch-and-Go) optimize only cross-modal agreement. MViTac's thesis is that the discriminative power needed for downstream tasks also benefits from within-modal contrast, so it learns both at once.
The architecture (Fig. 1) uses dual encoders, an image encoder $f(\cdot;\theta)$ and a tactile encoder $f(\cdot;\psi)$, each a ResNet-18 backbone pre-trained on ImageNet, together with momentum counterparts $(\theta_k, \psi_k)$ updated by EMA (Eq. 1–3). Both modalities are treated as RGB images in $\mathbb{R}^{H \times W \times 3}$ (the GelSight-style tactile imprint is itself an image). Each encoder feeds two 2-layer MLP projection heads, one for the intra-modal task and one for the inter-modal task, mapping the 512-d backbone output to a 128-d embedding. The loss throughout is InfoNCE (Eq. 8) with temperature $\tau$; unlike MoCo/MCT, negatives are simply the other in-batch keys (no memory queue).
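The momentum (EMA) update of the key encoders can be sketched in a few lines. This is a minimal illustration, not the authors' code: flat lists of floats stand in for encoder weights, the helper name `ema_update` is hypothetical, and the momentum coefficient is an assumed value because these notes do not record the one used in the paper.

```python
def ema_update(key_params, query_params, momentum=0.99):
    """Return key-encoder parameters after one EMA step.

    Implements key <- momentum * key + (1 - momentum) * query.
    The default momentum is an assumption, not a value from the paper.
    """
    if len(key_params) != len(query_params):
        raise ValueError("parameter lists must have the same length")
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(key_params, query_params)
    ]


# Toy example: flat lists of floats stand in for encoder weights.
theta = [1.0, 2.0, 3.0]    # online image encoder (trained by gradients)
theta_k = [0.0, 0.0, 0.0]  # momentum image encoder (EMA only, no gradients)

# With momentum=0.9 each key weight moves 10% of the way toward theta.
theta_k = ema_update(theta_k, theta, momentum=0.9)
```

The same update would be applied to the tactile pair $(\psi, \psi_k)$; a momentum close to 1 keeps the keys changing slowly between steps.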
Intra-modal (Eq. 4–8). Augment one image twice into a query and a key, encode the query with the encoder and the key with the momentum encoder, then contrast. Done separately for vision ($\mathcal{L}_{vv}$) and touch ($\mathcal{L}_{tt}$) to sharpen within-modality features.
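As a rough sketch of the InfoNCE loss with in-batch negatives, the function below treats `keys[i]` as the positive for `queries[i]` and every other key in the batch as a negative. It is a generic pure-Python formulation for illustration only: the names `l2_normalize` and `info_nce` are hypothetical, embeddings are assumed to be non-zero vectors, and L2-normalizing them before the dot product is an assumption rather than something stated in these notes.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch with in-batch negatives.

    keys[i] is the positive for queries[i]; all other keys are negatives.
    """
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of this query to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive key.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy 2-d embeddings: two queries and two candidate key orderings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[0.9, 0.1], [0.1, 0.9]])  # positives match
swapped = info_nce(queries, [[0.1, 0.9], [0.9, 0.1]])  # positives mismatched
```

The loss is small when each query is closest to its own key (`aligned`) and large when the pairing is wrong (`swapped`). For the intra-modal terms the query and key would come from two augmentations of the same image; the cross-modal terms would reuse the same function with the key taken from the other modality's momentum encoder.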
Inter-modal (Eq. 9–13). For a paired $(o_v, o_t)$ sample: image-to-tactile ($\mathcal{L}_{vt}$) pulls the image-encoder query toward the momentum-tactile key, and tactile-to-image ($\mathcal{L}_{tv}$) does the reverse. The paper notes it adapts Multimodal Contrastive Training (MCT, [42]) but drops MCT's margin/dot-product formulation in favor of plain InfoNCE since both modalities live in image space.
Combined loss (Eq. 14): $\mathcal{L}_{mm} = \mathcal{L}_{vv} + \mathcal{L}_{tt} + \lambda_{inter}(\mathcal{L}_{vt} + \mathcal{L}_{tv})$, where $\lambda_{inter}$ trades off the within- vs across-modality objectives. After pre-training, projection heads are discarded, the encoder is frozen, and a linear classifier is trained on top (linear probing) for each downstream task.
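To make the weighting in Eq. 14 concrete, here is a small sketch that combines four already-computed loss values. The function name `combined_loss`, the default `lambda_inter=1.0`, and the numbers in the example are illustrative assumptions; these notes do not record the value of the weight used in the paper.

```python
def combined_loss(l_vv, l_tt, l_vt, l_tv, lambda_inter=1.0):
    """Weighted sum of intra-modal and inter-modal contrastive losses.

    l_vv, l_tt: vision-vision and touch-touch (intra-modal) losses.
    l_vt, l_tv: image-to-tactile and tactile-to-image (inter-modal) losses.
    lambda_inter: weight on the inter-modal pair (assumed default).
    """
    intra = l_vv + l_tt
    inter = l_vt + l_tv
    return intra + lambda_inter * inter


# Made-up loss values for one training step.
total = combined_loss(0.5, 0.7, 1.2, 1.0, lambda_inter=0.5)

# Setting the weight to zero leaves only the within-modality terms.
intra_only = combined_loss(0.5, 0.7, 1.2, 1.0, lambda_inter=0.0)
```

Setting `lambda_inter` to zero leaves a purely intra-modal objective, while a large value makes the cross-modal terms dominate.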
τ = 0.07.
On material-property identification (Table I, Top-1 accuracy %), the tactile+visual MViTac is best across all three tasks, and the tactile-only MViTac edges the tactile-only supervised ResNet-18 on category (57.6 vs 57.4) and rough/smooth (82.1 vs 79.3):
| Method | Modality | Category | Hard/Soft | Rough/Smooth |
|---|---|---|---|---|
| ResNet18 (supervised) | Tactile | 57.4 | 89.1 | 79.3 |
| ResNet18 (supervised) | Tac+Vis | 48.0 | 85.9 | 80.0 |
| TAG (CMC) | Tac+Vis | 68.6 | 87.1 | 82.4 |
| SSVTP (InfoNCE) | Tac+Vis | 70.7 | 88.6 | 83.6 |
| MViTac (ours) | Tactile | 57.6 | 86.2 | 82.1 |
| MViTac (ours) | Tac+Vis | 74.9 | 91.8 | 84.1 |
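Where it wins: tactile+visual MViTac tops every material task, including beating the supervised model on hard/soft (91.8 vs 89.1), which is notable since supervised training usually wins in low-data regimes. Adding vision to touch helps MViTac on all three tasks, but in Table I it hurts the supervised ResNet-18 on category (48.0 vs 57.4) and hard/soft (85.9 vs 89.1); TAG and SSVTP are listed only with both modalities.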
Where it loses: on grasp success prediction (Table II), MViTac reaches 60.3%, beating TAG/CMC (56.3%) by 4.0 percentage points on unseen objects but losing badly to the Calandra supervised baseline (73.1%). The authors attribute the gap to the small (~18k), imbalanced training set: contrastive methods need bigger, more diverse data. In the tactile-only material setting the supervised ResNet-18 still edges MViTac on hard/soft (89.1 vs 86.2).
From the authors:
What I noticed reading it:
No $\lambda_{inter}$ sweep or drop-the-intra-loss ablation is reported. The claim is argued from outperforming inter-focused baselines (TAG, SSVTP), which also differ in backbone/training details, confounding the attribution.

Off the central BLADE
thesis (no language, no planning, no long-horizon manipulation), but
squarely on the batch's grounding sub-thesis: many manipulation
predicates — is_grasped, surface_is_rough,
is_hard — are not visually evaluable and live
in touch. MViTac is a clean, minimal demonstration that a frozen
self-supervised touch encoder can linearly read out exactly those
properties (hard/soft, rough/smooth, grasp success). That is precisely the
kind of cheap, frozen tactile feature extractor a BLADE-style predicate
classifier could sit on top of when a predicate is tactile rather than
visual. Its specific contribution — that intra-modal
contrast helps on top of cross-modal — is a representation-learning
design note rather than a manipulation result; treat this as a
method/baseline anchor in the visuo-tactile SSL
lineage, not as a control-policy paper. The grasp-prediction loss to
supervised learning is also a useful data-point on where contrastive
pretraining still under-delivers in the low-data tactile regime.
MViTac leverages intra and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction. — Abstract / p.1
Our model not only distinguishes similarities and differences between visual and tactile data but also places substantial emphasis on learning within the same sensory modality. — §IV.B / p.5
CL techniques necessitate bigger and more diverse datasets to perform comparably with supervised methods. — §IV.B / p.6
Papers cited that should likely be ingested next:
Newly ingested in 2026-06-24 batch — directly relevant: