Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han · Beijing Jiaotong University / BUPT · 2024 · arXiv preprint · arXiv:2403.09813

One-liner. TLV is the first touch–language–vision dataset with sentence-level (not just lexical-label) tactile descriptions — ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-machine cascade — paired with STLV-Align, a LoRA recipe that aligns touch, language, and vision in one embedding space by tuning only 1% of parameters.

Problem & motivation

Tactile-multimodal research has fixated on the visual–tactile pair and, where language enters at all, kept it at the lexical level — words or class labels used purely for classification (e.g., Touch and Go, "Connecting look and feel"). Lexical labels carry thin semantics; richer sentence-level captions could convey object identity, contact location, material, and texture/softness jointly, but annotating lengthy tactile text by hand is expensive and slow. The authors argue that modern image-to-text models (GPT-4V, Gemini) now make long-form tactile annotation feasible at scale, and that binding touch to language at the sentence level (in the spirit of ImageBind / LanguageBind extending the joint space to new modalities) is the missing piece for “comprehensive” tactile perception.

Method

The contribution is two-part: a dataset (TLV) built by a three-stage human-machine cascade, and a training recipe (STLV-Align). The dataset construction pipeline is shown in Fig 1.

Stage I — Touch and vision collection. Paired tactile/visual frames are drawn from VisGel [26], a large vision–touch dataset in which a robot arm touches objects while a GelSight [22] sensor and an RGB camera record synchronized, timestamped video. From each of 10,000 synchronized videos the authors extract two paired frame sets, one touched and one untouched. The first frame (arm away from the objects) serves as the background; frame differencing [40] against it selects the frame of maximum difference as the touch frame, and the 40th frame is used uniformly as the no-contact frame.
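The frame selection in Stage I can be sketched as follows. This is a minimal illustration, not the authors' code: the difference metric (mean absolute pixel difference), the function name `select_touch_frame`, and the reading of "the 40th frame" as 1-indexed are all assumptions made here.

```python
import numpy as np

# Assumption: "the 40th frame" in the text is 1-indexed, hence index 39.
NO_CONTACT_INDEX = 39


def select_touch_frame(frames: np.ndarray) -> int:
    """Return the index of the frame that differs most from the background.

    frames: array of shape (T, H, W) or (T, H, W, C) for one video stream.
    The first frame is treated as the background (arm away from objects).
    """
    video = frames.astype(np.float32)
    background = video[0]
    # Mean absolute difference of each frame against the background frame.
    # The exact metric is an assumption; the source only says "frame differencing".
    diffs = np.abs(video - background).reshape(len(video), -1).mean(axis=1)
    return int(np.argmax(diffs))


def select_pair_indices(frames: np.ndarray) -> tuple:
    """Return (touch_index, no_contact_index) for one video."""
    return select_touch_frame(frames), NO_CONTACT_INDEX
```

Because the tactile and visual streams are synchronized, the same two indices would be used to pull the paired frames from both videos.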

Stage II — Touch localization. Human annotators draw a red bounding box around the touched object in the visual image and give it an open-ended name (no fixed vocabulary). Untouched frames are skipped. Data that can't be cleanly annotated (occluded object, no real interaction) is filtered out. The box plus name is the "data-specific prompt" handed to the next stage.

Stage III — Tactile labeling. GPT-4V is prompted with the boxed visual image and a carefully designed data-specific prompt instructing it to describe the touched object's name, contact location, material at the point of contact, and texture/softness-hardness. For untouched frames, a fixed string — "No object is being touched." — is used instead of an LLM call. Treating touch and vision as "two views of the same semantics" is what licenses captioning the tactile signal from the visual image.
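The branching between a model call and the fixed string could look roughly like the sketch below. The prompt wording, the `Annotation` type, and the function names are hypothetical; only the four requested attributes and the fixed no-touch string come from the text above. No real captioning API is called here.

```python
from dataclasses import dataclass
from typing import Optional

# Fixed caption for untouched frames, as quoted in the text.
NO_TOUCH_CAPTION = "No object is being touched."


@dataclass
class Annotation:
    """Stage II output for a touched frame (hypothetical container)."""
    object_name: str  # open-ended name given by the human annotator


def build_prompt(ann: Annotation) -> str:
    """Build a data-specific prompt; the wording is illustrative only."""
    return (
        f"The object in the red bounding box is a {ann.object_name} "
        "and is being touched. Describe in one or two sentences: "
        "the object's name, the contact location, the material at the "
        "point of contact, and its texture and softness or hardness."
    )


def caption_request(ann: Optional[Annotation]) -> tuple:
    """Return (text, needs_model_call).

    Untouched frames (ann is None) get the fixed caption with no model call;
    touched frames get a prompt to send along with the boxed image.
    """
    if ann is None:
        return NO_TOUCH_CAPTION, False
    return build_prompt(ann), True
```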

STLV-Align (the model, Fig 2). Three OpenCLIP encoders handle touch, text, and vision. The touch modality is processed as an RGB image (GelSight frames look like images), so both the touch and vision encoders are instantiated as OpenCLIP ViT [9] vision encoders; the text encoder is the OpenCLIP text encoder. Rather than the large-scale pretraining of the prior approach (ViT-LENS-2 [25]), they use LoRA [18]: the base weight W₀ stays frozen and only a low-rank update BA is learned, so f(x) = W₀x + BAx with B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} (Eq 1). During joint training the text encoder is frozen (OpenCLIP text already generalizes well); only the touch and vision encoders update, and the vision update exists mainly to assist the touch update. Three contrastive losses (Eq 2) bind the pairs: touch↔language L_{T,L}, vision↔language L_{V,L}, and touch↔vision L_{T,V}. They are combined as a symmetric joint loss L = (L_{T,L} + L_{L,T}) + α(L_{V,L} + L_{L,V}) + β(L_{T,V} + L_{V,T}), with α = β = 0.1 (vision is auxiliary).
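A small NumPy sketch of the two ingredients, the LoRA forward pass and the weighted symmetric joint loss, may make the recipe concrete. It is an illustration under assumptions: the note only says "contrastive losses", so the InfoNCE form and the temperature of 0.07 are choices made here, and all function names are hypothetical.

```python
import numpy as np


def lora_forward(x, W0, B, A):
    """f(x) = W0 x + B A x. W0 (d x k) is frozen; only B (d x r) and A (r x k) train."""
    return W0 @ x + B @ (A @ x)


def info_nce(a, b, temperature=0.07):
    """One-directional contrastive loss: row i of `a` should match row i of `b`.

    Assumption: an InfoNCE-style loss over cosine similarities.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def symmetric(a, b):
    """Sum of both directions, e.g. L_{T,L} + L_{L,T}."""
    return info_nce(a, b) + info_nce(b, a)


def joint_loss(touch, text, vision, alpha=0.1, beta=0.1):
    """(L_TL + L_LT) + alpha * (L_VL + L_LV) + beta * (L_TV + L_VT)."""
    return (
        symmetric(touch, text)
        + alpha * symmetric(vision, text)
        + beta * symmetric(touch, vision)
    )
```

With B set to zero the adapter is a no-op and `lora_forward` reduces to the frozen layer, which is why a LoRA-adapted encoder can start from exactly the pretrained behavior. The small α and β keep the touch↔language term dominant, matching the note that vision is auxiliary.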

Setup

Results

Headline: with only 19,843 unlabeled training pairs and 1% tuned parameters, STLV-Align (Large) lifts its OpenCLIP foundation by 8.3 points on material and by more than 30 points on both hard/soft (+32.9) and rough/smooth (+31.9) zero-shot accuracy on Touch and Go (Table 1). It does not beat the absolute numbers of the heavily-pretrained ViT-LENS-2 (I+T), which reaches 65.8% on material; the claim is parameter-/data-efficiency, not SOTA.

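Table 1. Zero-shot accuracy (%) on Touch and Go. Parenthesized deltas are relative to the nearest baseline row above (ImageBind or OpenCLIP).

| Model | Size | Material | Hard/Soft | Rough/Smooth |
|---|---|---|---|---|
| ImageBind | Base | 24.2 | 65.7 | 69.8 |
| ViT-LENS-2 (I) | Large | 31.2 (+7.0) | 74.3 (+8.6) | 78.2 (+8.4) |
| ViT-LENS-2 (I+T) | Large | 65.8 (+41.6) | 74.7 (+9.0) | 63.8 (−6.0) |
| OpenCLIP | Large | 17.7 | 32.2 | 42.7 |
| STLV-Align | Large | 26.0 (+8.3) | 65.1 (+32.9) | 74.6 (+31.9) |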
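The parenthesized deltas can be rechecked with a few lines of arithmetic. The baseline mapping below (ViT-LENS-2 against ImageBind, STLV-Align against OpenCLIP) is inferred from the numbers themselves rather than stated in this note, so treat it as an assumption.

```python
# Accuracy (%) per model: (material, hard/soft, rough/smooth), copied from the table.
scores = {
    "ImageBind": (24.2, 65.7, 69.8),
    "ViT-LENS-2 (I)": (31.2, 74.3, 78.2),
    "ViT-LENS-2 (I+T)": (65.8, 74.7, 63.8),
    "OpenCLIP": (17.7, 32.2, 42.7),
    "STLV-Align": (26.0, 65.1, 74.6),
}

# Assumed baseline for each delta, inferred from the arithmetic.
baseline_of = {
    "ViT-LENS-2 (I)": "ImageBind",
    "ViT-LENS-2 (I+T)": "ImageBind",
    "STLV-Align": "OpenCLIP",
}


def deltas(model):
    """Point differences against the model's baseline, rounded to one decimal."""
    base = scores[baseline_of[model]]
    return tuple(round(s - b, 1) for s, b in zip(scores[model], base))


for name in baseline_of:
    print(name, deltas(name))
```

All reported deltas are differences in accuracy points, not relative percentage changes.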

Where it loses: STLV-Align's absolute material accuracy (26.0) trails ViT-LENS-2 (I+T)'s 65.8 by a wide margin, and even its hard/soft and rough/smooth sit below ViT-LENS-2 (I). The efficiency comparison (Table 2) is the real selling point: STLV-Align is unsupervised, uses 19,843 vs ViT-LENS-2's 91,982 training items, tunes 1% vs 100% of parameters, and supports cross-domain evaluation.

Ablation (Table 3). Aligning both touch and text with vision helps hard/soft (65.1) and rough/smooth (74.6); aligning only one of them with vision hurts. Curiously, removing vision entirely (−TV&VL) gives the best material score (32.5 vs 26.0) — so vision is a net positive on texture/softness but a net negative on material classification.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This sits squarely on the batch thesis that many manipulation predicates — surface_is_rough, is_soft, material_is_metal — are not visually evaluable and live in touch. TLV is an attempt to attach sentence-level language to those tactile properties, which is exactly the substrate BLADE would need if it ever wanted to learn touch-grounded predicate classifiers (BLADE today learns only visual predicate classifiers). The appeal: a touch↔language embedding space could let an LLM name a tactile predicate and have it grounded in GelSight readings rather than pixels.

But the caveat is sharp and worth flagging for my own future work: TLV's "tactile" captions are produced from the visual image, so the language is not actually grounded in touch. If I cite TLV in a related-work section, it should be framed as an early, vision-captioned touch–language dataset — a motivating proof-of-concept, not a clean tactile-grounding benchmark. The genuinely touch-grounded successors in this batch (TVL, Octopi, UniTouch, Touch100k) are the ones to lean on for grounded predicate learning. TLV is adjacent-but-flawed relative to BLADE: it shares the language-grounds-perception ambition but does not deliver true touch grounding, and it has no planning/manipulation component at all.

Quotable

Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. — Abstract / p.1
The two modalities of touch and vision can be regarded as different views containing the same semantics. — §3.2 Touch Localization / p.5
To our knowledge, this is the first touch-language-vision dataset with sentence-level descriptions. — §3.4 Dataset Statistics / p.5

Related

Papers cited here that are worth ingesting next:

Newly ingested in the 2026-06-24 batch — directly relevant: