Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han · Beijing Jiaotong University / BUPT · 2024 · arXiv preprint · arXiv:2403.09813

One-liner. TLV is the first touch–language–vision dataset with sentence-level (not just lexical-label) tactile descriptions — ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-machine cascade — paired with STLV-Align, a LoRA recipe that aligns touch, language, and vision in one embedding space by tuning only 1% of parameters.

Problem & motivation

Tactile-multimodal research has fixated on the visual–tactile pair and, where language enters at all, kept it at the lexical level — words or class labels used purely for classification (e.g., Touch and Go, "Connecting look and feel"). Lexical labels carry thin semantics; richer sentence-level captions could convey object identity, contact location, material, and texture/softness jointly, but annotating lengthy tactile text by hand is expensive and slow. The authors argue that modern image-to-text models (GPT-4V, Gemini) now make long-form tactile annotation feasible at scale, and that binding touch to language at the sentence level (in the spirit of ImageBind / LanguageBind extending the joint space to new modalities) is the missing piece for “comprehensive” tactile perception.

Method

The contribution is two-part: a dataset (TLV) built by a three-stage human-machine cascade, and a training recipe (STLV-Align). Fig 1 shows the dataset construction pipeline.

Stage I — Touch and vision collection. Paired tactile/visual frames are drawn from VisGel [26], a large vision–touch dataset where a robot arm touches objects while a GelSight [22] sensor and an RGB camera record synchronized, timestamped video. From each of 10,000 synchronized videos they extract two paired frame sets: a touched set and an untouched set. The first frame (arm away from objects) serves as the background. Frame differencing [40] against that background selects the frame of maximum difference as the touch frame, and the 40th frame of every video is taken as the no-contact frame.
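
The touch-frame selection in Stage I can be sketched as follows. This is a minimal illustration, not the authors' code: the function name select_touch_frame, the mean-absolute-difference score, and the reading of "40th frame" as 1-indexed (index 39) are all assumptions, and the notes do not say which stream (tactile or visual) is differenced.

```python
import numpy as np

# Assumption: "40th frame" is 1-indexed, so it sits at index 39.
NO_CONTACT_INDEX = 39


def select_touch_frame(frames: np.ndarray) -> int:
    """Return the index of the frame that differs most from the background.

    frames has shape (T, H, W) or (T, H, W, C); frame 0 is treated as the
    background (arm away from the objects).
    """
    as_float = frames.astype(np.float32)
    background = as_float[0]
    # Mean absolute pixel difference of every frame against the background.
    diffs = np.abs(as_float - background)
    scores = diffs.reshape(len(frames), -1).mean(axis=1)
    return int(np.argmax(scores))
```

A single global argmax is as blunt as the heuristic it mirrors: any large change unrelated to contact (lighting, arm motion) would win the score just as easily.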

Stage II — Touch localization. Human annotators draw a red bounding box around the touched object in the visual image and give it an open-ended name (no fixed vocabulary). Untouched frames are skipped. Data that can't be cleanly annotated (occluded object, no real interaction) is filtered out. The box plus name is the "data-specific prompt" handed to the next stage.

Stage III — Tactile labeling. GPT-4V is prompted with the boxed visual image and a carefully designed data-specific prompt instructing it to describe the touched object's name, contact location, material at the point of contact, and texture/softness-hardness. For untouched frames, a fixed string — "No object is being touched." — is used instead of an LLM call. Treating touch and vision as "two views of the same semantics" is what licenses captioning the tactile signal from the visual image.
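
A rough sketch of the Stage III branching, under stated assumptions: the Sample record, build_prompt, and the prompt wording are hypothetical (the paper's actual prompt is not reproduced in these notes), and captioner is a stand-in callable for the GPT-4V request rather than a real API. Only the fixed untouched string comes from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Fixed caption for untouched frames, as quoted in the notes.
UNTOUCHED_CAPTION = "No object is being touched."


@dataclass
class Sample:
    touched: bool
    # Open-ended object name from Stage II; the red box is drawn on the image itself.
    object_name: Optional[str] = None


def build_prompt(sample: Sample) -> str:
    """Hypothetical data-specific prompt covering the four requested attributes."""
    return (
        f"The red box marks the touched object: {sample.object_name}. "
        "Describe the object's name, the contact location, the material at the "
        "point of contact, and its texture and softness or hardness."
    )


def caption(sample: Sample, captioner: Callable[[str], str]) -> str:
    """Use the fixed string for untouched frames; otherwise call the captioner."""
    if not sample.touched:
        return UNTOUCHED_CAPTION
    return captioner(build_prompt(sample))
```

Note that nothing in this flow reads the GelSight frame; the caption depends only on the boxed visual image and the annotator's name for the object.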

STLV-Align (the model, Fig 2). Three OpenCLIP encoders — touch, text, vision. The touch modality is processed as an RGB image (GelSight frames look like images), so both the touch and vision encoders are instantiated as OpenCLIP ViT [9] vision encoders; the text encoder is the OpenCLIP text encoder. Rather than the large-scale pretraining of the prior approach (ViT-LENS-2 [25]), they use LoRA [18]: the base weight W₀ stays frozen and only a low-rank update BA is learned, so f(x) = W₀x + BAx with B∈ℝ^d×r, A∈ℝ^r×k (Eq 1). During joint training the text encoder is frozen (OpenCLIP text already generalizes well); only the touch and vision encoders update, and the vision update exists mainly to assist the touch update. Three contrastive losses (Eq 2) bind the pairs — touch↔language L_T,L, vision↔language L_V,L, touch↔vision L_T,V — combined as a symmetric joint loss (L_T,L+L_L,T) + α(L_V,L+L_L,V) + β(L_T,V+L_V,T), with both α and β set to 0.1 (vision is auxiliary).
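
To make Eq 1 and the joint loss concrete, here is a small NumPy sketch. The LoRA forward pass follows the formula above directly. The contrastive term is an assumption: it uses a standard InfoNCE form with temperature 0.07, since the exact form of Eq 2 is not reproduced in these notes. The function names are illustrative.

```python
import numpy as np


def lora_forward(x, W0, A, B):
    """f(x) = W0 x + B A x, with W0 (d, k) frozen and B (d, r), A (r, k) trained."""
    return W0 @ x + B @ (A @ x)


def info_nce(a, b, temperature=0.07):
    """One direction of a contrastive loss: row i of a should match row i of b.

    Assumption: InfoNCE with cosine similarity and a fixed temperature.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def joint_loss(touch, text, vision, alpha=0.1, beta=0.1):
    """(L_TL + L_LT) + alpha (L_VL + L_LV) + beta (L_TV + L_VT)."""
    def symmetric(p, q):
        return info_nce(p, q) + info_nce(q, p)

    return (
        symmetric(touch, text)
        + alpha * symmetric(vision, text)
        + beta * symmetric(touch, vision)
    )
```

With alpha = beta = 0.1, the touch-language term dominates the gradient, which matches the description of vision as an auxiliary signal.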

Setup

Datasets / benchmarks: TLV itself — 19,834 annotated entries (9,834 touched pairs after filtering + 10,000 untouched pairs), all sourced from VisGel [26]. Downstream zero-shot evaluation on the Touch and Go [36] dataset across three tactile-classification tasks: material, hard/soft, rough/smooth (cross-domain — TLV is train, Touch and Go is test).
Hardware / simulator: No new robot. Tactile data is GelSight [22] camera-based touch imagery inherited from VisGel; visual data is RGB camera frames. The robot arm and capture rig are VisGel's.
Baselines: ViT-LENS-2 [25] (SOTA omni-modal model, anchored by images "(I)" and by images+text "(I+T)"), with its foundation ImageBind [14]; OpenCLIP-large [19] as STLV-Align's own foundation.
Compute: not reported (only the 1% parameter-tuning ratio and the LoRA design are given).
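
For reference, zero-shot classification in a shared embedding space is usually done CLIP-style: embed one text prompt per class, then assign each touch embedding to the nearest class by cosine similarity. The sketch below assumes that protocol; the notes do not spell out the exact evaluation procedure or prompts, and the function names are illustrative.

```python
import numpy as np


def zero_shot_predict(touch_embs, class_text_embs):
    """For each touch embedding, return the index of the most similar class text embedding."""
    t = touch_embs / np.linalg.norm(touch_embs, axis=1, keepdims=True)
    c = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    # Cosine similarity matrix of shape (num_samples, num_classes).
    return np.argmax(t @ c.T, axis=1)


def accuracy(pred, labels):
    """Fraction of predictions that equal the ground-truth labels."""
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))
```

For a binary task such as hard/soft, class_text_embs would hold two rows, one per class prompt.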

Results

Headline: with only 19,843 unlabeled training pairs and 1% tuned parameters, STLV-Align (Large) lifts its OpenCLIP foundation by 8.3 points on material and by more than 30 points on both hard/soft (+32.9) and rough/smooth (+31.9) zero-shot accuracy on Touch and Go (Table 1). It does not match the absolute material accuracy of the heavily pretrained ViT-LENS-2 (I+T), which reaches 65.8%; the claim is parameter and data efficiency, not SOTA.

Model	Size	Material (%)	Hard/Soft (%)	Rough/Smooth (%)
ImageBind	Base	24.2	65.7	69.8
ViT-LENS-2 (I)	Large	31.2 (+7.0)	74.3 (+8.6)	78.2 (+8.4)
ViT-LENS-2 (I+T)	Large	65.8 (+41.6)	74.7 (+9.0)	63.8 (−6.0)
OpenCLIP	Large	17.7	32.2	42.7
STLV-Align	Large	26.0 (+8.3)	65.1 (+32.9)	74.6 (+31.9)
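
The parenthesised gains in the table are percentage-point differences against each model's foundation (ImageBind for ViT-LENS-2, OpenCLIP for STLV-Align, per the Setup section). A quick check of that arithmetic, with the accuracies copied from the table and a helper name, gains, chosen for illustration:

```python
# Zero-shot accuracies (%) from the table: (material, hard/soft, rough/smooth).
table = {
    "ImageBind": (24.2, 65.7, 69.8),
    "ViT-LENS-2 (I)": (31.2, 74.3, 78.2),
    "ViT-LENS-2 (I+T)": (65.8, 74.7, 63.8),
    "OpenCLIP": (17.7, 32.2, 42.7),
    "STLV-Align": (26.0, 65.1, 74.6),
}

# Foundation each derived model is compared against.
foundation = {
    "ViT-LENS-2 (I)": "ImageBind",
    "ViT-LENS-2 (I+T)": "ImageBind",
    "STLV-Align": "OpenCLIP",
}


def gains(model):
    """Percentage-point gain over the model's foundation, per task."""
    base = table[foundation[model]]
    return tuple(round(m - b, 1) for m, b in zip(table[model], base))
```

Because the two foundations start from very different baselines (OpenCLIP is far weaker on touch than ImageBind), the gains are comparable within a family but not across families.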

Where it loses: STLV-Align's absolute material accuracy (26.0) trails ViT-LENS-2 (I+T)'s 65.8 by a wide margin, and even its hard/soft and rough/smooth sit below ViT-LENS-2 (I). The efficiency comparison (Table 2) is the real selling point: STLV-Align is unsupervised, uses 19,843 vs ViT-LENS-2's 91,982 training items, tunes 1% vs 100% of parameters, and supports cross-domain evaluation.

Ablation (Table 3). Aligning both touch and text with vision helps hard/soft (65.1) and rough/smooth (74.6); aligning only one of them with vision hurts. Curiously, removing vision entirely (−TV&VL) gives the best material score (32.5 vs 26.0) — so vision is a net positive on texture/softness but a net negative on material classification.

Limitations & open questions

From the authors:

The method "may apply to specific scenarios, but there is room for improvement in terms of performance" — they concede STLV-Align is not competitive on absolute accuracy.
They intend to extend TLV to more tasks to fully exploit its potential (i.e., current task coverage is narrow: one material classification and two binary ones, hard/soft and rough/smooth).

What I noticed reading it:

The "tactile" captions are generated from vision, not from touch. GPT-4V never sees the GelSight image — it captions the boxed RGB image, and material/softness claims are inferred from visual appearance. This risks systematically wrong tactile labels (a matte plastic that looks like metal) and undercuts the premise that the text encodes genuine tactile semantics. The whole dataset's tactile grounding hinges on the "two views of the same semantics" assumption, which is exactly where it's weakest.
No human verification of GPT-4V captions is reported. No annotation-quality audit, inter-annotator agreement, or caption-accuracy study. For a dataset paper, that's the missing experiment.
The frame-selection heuristics are crude. "40th frame = no-contact" and "max frame-difference = contact" are blunt; mislabeled contact frames would propagate silently into training.
The ablation showing vision hurts material classification is left unexplained, yet vision-as-auxiliary is the paper's design thesis — a real tension that isn't reconciled.
No per-class or per-seed variance is reported on the classification tables; single accuracy numbers only.

Why I care

This sits squarely on the batch thesis that many manipulation predicates — surface_is_rough, is_soft, material_is_metal — are not visually evaluable and live in touch. TLV is an attempt to attach sentence-level language to those tactile properties, which is exactly the substrate BLADE would need if it ever wanted to learn touch-grounded predicate classifiers (BLADE today learns only visual predicate classifiers). The appeal: a touch↔language embedding space could let an LLM name a tactile predicate and have it grounded in GelSight readings rather than pixels.

But the caveat is sharp and worth flagging for my own future work: TLV's "tactile" captions are produced from the visual image, so the language is not actually grounded in touch. If I cite TLV in a related-work section, it should be framed as an early, vision-captioned touch–language dataset — a motivating proof-of-concept, not a clean tactile-grounding benchmark. The genuinely touch-grounded successors in this batch (TVL, Octopi, UniTouch, Touch100k) are the ones to lean on for grounded predicate learning. TLV is adjacent-but-flawed relative to BLADE: it shares the language-grounds-perception ambition but does not deliver true touch grounding, and it has no planning/manipulation component at all.

Quotable

Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. — Abstract / p.1

The two modalities of touch and vision can be regarded as different views containing the same semantics. — §3.2 Touch Localization / p.5

To our knowledge, this is the first touch-language-vision dataset with sentence-level descriptions. — §3.4 Dataset Statistics / p.5

Papers cited here that are worth ingesting next:

[26] Li et al. 2019 — Connecting touch and vision via cross-modal prediction (VisGel) (CVPR) — the source dataset TLV is entirely built on; the direct predecessor.
[36] Yang et al. 2022 — Touch and Go (NeurIPS) — the human-collected vision-touch dataset used as the cross-domain test set. Already in the batch cross-ref list.
[25] Lei et al. 2023 — ViT-LENS-2 — the SOTA omni-modal baseline STLV-Align positions itself against.
[14] Girdhar et al. 2023 — ImageBind — foundation of the binding paradigm; in this batch.
[41] Zhu et al. 2024 — LanguageBind — the language-centered alignment strategy TLV's framing builds on.
[12] Gao et al. 2021 — ObjectFolder 2.0 and [13] ObjectFolder Benchmark — multisensory datasets cited as prior tactile datasets; in this batch.

Newly ingested in the 2026-06-24 batch — directly relevant:

TVL (A Touch, Vision, and Language Dataset) — the close sibling and the stronger version of this idea: human-labeled tactile captions (grounded in the GelSight signal, not inferred from vision) plus a touch-vision-language model. Read TLV and TVL together; TVL fixes TLV's core grounding flaw.
Touch100k — same touch-language-vision framing at much larger scale; the natural follow-on dataset.
Octopi and Octopi-1.5 — tactile-language models that reason over object properties from touch; the downstream-capability counterpart to TLV's alignment.
UniTouch — binds touch into a shared multimodal space (ImageBind-style); the representation-learning peer to STLV-Align, with broader binding.
ImageBind and LanguageBind — the binding models from which STLV-Align inherits its contrastive multi-encoder recipe.
AnyTouch and T3 — tactile representation/foundation models; the encoder-side direction STLV-Align's frozen-foundation-plus-LoRA approach is a lightweight alternative to.