One-liner. TLV is the first touch–language–vision dataset with sentence-level (not just lexical-label) tactile descriptions — ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-machine cascade — paired with STLV-Align, a LoRA recipe that aligns touch, language, and vision in one embedding space by tuning only 1% of parameters.
Tactile-multimodal research has fixated on the visual–tactile pair and, where language enters at all, kept it at the lexical level — words or class labels used purely for classification (e.g., Touch and Go, "Connecting look and feel"). Lexical labels carry thin semantics; richer sentence-level captions could convey object identity, contact location, material, and texture/softness jointly, but annotating lengthy tactile text by hand is expensive and slow. The authors argue that modern image-to-text models (GPT-4V, Gemini) now make long-form tactile annotation feasible at scale, and that binding touch to language at the sentence level (in the spirit of ImageBind / LanguageBind extending the joint space to new modalities) is the missing piece for “comprehensive” tactile perception.
The contribution is two-part: a dataset (TLV) built by a three-stage human-machine cascade, and a training recipe (STLV-Align). The dataset construction pipeline is shown in Fig 1.
Stage I — Touch and vision collection. Paired
tactile/visual frames are drawn from VisGel [26], a large vision–touch dataset in which a robot arm touches objects while a GelSight [22] sensor and an RGB camera record synchronized, timestamped video. From each of 10,000 synchronized videos they extract two paired frame sets: a touched set and an untouched set. The first frame (arm away from objects) serves as the background; frame differencing [40] against it picks the frame of maximum difference as the touch frame, and the 40th frame is uniformly taken as the no-contact frame.
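A minimal sketch of the touch-frame selection described in Stage I. It assumes grayscale frames stored as nested lists and a sum-of-absolute-differences score; the note does not say which difference metric the authors use, and the function names are hypothetical.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (an assumed representation).
Frame = Sequence[Sequence[int]]


def frame_difference(frame: Frame, background: Frame) -> int:
    """Sum of absolute per-pixel differences between a frame and the background."""
    return sum(
        abs(pixel - bg_pixel)
        for row, bg_row in zip(frame, background)
        for pixel, bg_pixel in zip(row, bg_row)
    )


def select_touch_frame(frames: List[Frame]) -> int:
    """Return the index of the frame that differs most from frame 0 (the background)."""
    background = frames[0]
    scores = [frame_difference(frame, background) for frame in frames]
    return max(range(len(frames)), key=scores.__getitem__)


# Toy 2x2 "video": the frame at index 2 deviates most from the background.
video = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[5, 5], [5, 5]],
    [[2, 2], [0, 0]],
]
touch_index = select_touch_frame(video)
```

On real video the same idea would run over image arrays, and the score would probably need smoothing against sensor noise; both are left out here.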
Stage II — Touch localization. Human annotators draw a red bounding box around the touched object in the visual image and give it an open-ended name (no fixed vocabulary). Untouched frames are skipped. Data that can't be cleanly annotated (occluded object, no real interaction) is filtered out. The box plus name is the "data-specific prompt" handed to the next stage.
Stage III — Tactile labeling. GPT-4V is prompted with the boxed visual image and a carefully designed data-specific prompt instructing it to describe the touched object's name, contact location, material at the point of contact, and texture/softness-hardness. For untouched frames, a fixed string — "No object is being touched." — is used instead of an LLM call. Treating touch and vision as "two views of the same semantics" is what licenses captioning the tactile signal from the visual image.
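The Stage II to Stage III hand-off can be sketched as below. This is an illustration under assumptions: the `Annotation` container, the function names, and the prompt wording are all hypothetical (the paper's actual prompt text is not reproduced in these notes), and no captioning model is called. Only the fixed untouched caption comes from the source.

```python
from dataclasses import dataclass
from typing import Optional

# Fixed caption used for untouched frames, as quoted in the source.
UNTOUCHED_CAPTION = "No object is being touched."


@dataclass
class Annotation:
    """Hypothetical container for a Stage II human annotation."""
    object_name: Optional[str]  # open-ended name; None for untouched frames
    touched: bool


def build_prompt(ann: Annotation) -> Optional[str]:
    """Return a data-specific prompt for touched frames, or None if no model call is needed."""
    if not ann.touched:
        return None
    # Illustrative wording only, covering the four attributes the notes list.
    return (
        f"The object inside the red bounding box is a {ann.object_name}. "
        "Describe the touched object's name, the contact location, "
        "the material at the point of contact, and its texture and softness or hardness."
    )


def caption_for(ann: Annotation, model_caption: Optional[str] = None) -> str:
    """Use the fixed caption for untouched frames, otherwise the model's caption."""
    if not ann.touched:
        return UNTOUCHED_CAPTION
    if model_caption is None:
        raise ValueError("touched frames need a caption from the captioning model")
    return model_caption
```

Keeping the untouched branch out of the model call mirrors the source's design: half of each frame pair gets a deterministic caption at no annotation cost.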
STLV-Align (the model, Fig 2). Three OpenCLIP encoders are used: touch, text, and vision. The touch modality is processed as an RGB image (GelSight frames look like images), so both the touch and vision encoders are instantiated as OpenCLIP ViT [9] vision encoders; the text encoder is the OpenCLIP text encoder. Rather than the large-scale pretraining of the prior approach (ViT-LENS-2 [25]), they use LoRA [18]: the base weight $W_0$ stays frozen and only a low-rank update $BA$ is learned, so $f(x) = W_0 x + BAx$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ (Eq 1). During joint training the text encoder is frozen (OpenCLIP text already generalizes well); only the touch and vision encoders update, and the vision update exists mainly to assist the touch update. Three contrastive losses (Eq 2) bind the pairs: touch↔language $L_{T,L}$, vision↔language $L_{V,L}$, and touch↔vision $L_{T,V}$. They are combined as a symmetric joint loss $(L_{T,L} + L_{L,T}) + \alpha(L_{V,L} + L_{L,V}) + \beta(L_{T,V} + L_{V,T})$, with both $\alpha$ and $\beta$ set to 0.1 (vision is auxiliary).
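To make the two formulas concrete, here is a small pure-Python sketch of the LoRA forward pass and the weighted symmetric joint loss. It is not the authors' implementation: the notes do not give the exact form of Eq 2, so an InfoNCE-style loss over cosine similarities with temperature 0.07 is assumed, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for a row-major nested-list matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w0: Matrix, b: Matrix, a: Matrix, x: Vector) -> Vector:
    """f(x) = W0 x + B (A x); W0 is frozen, only B (d x r) and A (r x k) would train."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def contrastive(xs: Matrix, ys: Matrix, temperature: float = 0.07) -> float:
    """One direction of an assumed InfoNCE loss: xs[i] should match ys[i] among all ys."""
    xs = [normalize(v) for v in xs]
    ys = [normalize(v) for v in ys]
    total = 0.0
    for i, x in enumerate(xs):
        logits = [sum(p * q for p, q in zip(x, y)) / temperature for y in ys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(xs)


def joint_loss(touch: Matrix, text: Matrix, vision: Matrix,
               alpha: float = 0.1, beta: float = 0.1) -> float:
    """(L_TL + L_LT) + alpha (L_VL + L_LV) + beta (L_TV + L_VT)."""
    def symmetric(p: Matrix, q: Matrix) -> float:
        return contrastive(p, q) + contrastive(q, p)

    return (symmetric(touch, text)
            + alpha * symmetric(vision, text)
            + beta * symmetric(touch, vision))
```

With $\alpha = \beta = 0.1$, a mismatch between touch and text costs ten times more than the same mismatch in either vision pairing, which is one way to read "vision is auxiliary".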
Headline: with only 19,843 unlabeled training pairs and 1% tuned parameters, STLV-Align (Large) lifts its OpenCLIP foundation by 8.3 percentage points on material and by more than 30 points on both hard/soft (+32.9) and rough/smooth (+31.9) zero-shot accuracy on Touch and Go (Table 1). It does not beat the absolute numbers of the heavily-pretrained ViT-LENS-2 (I+T), which hits 65.8% material; the claim is parameter-/data-efficiency, not SOTA.
| Model | Size | Material | Hard/Soft | Rough/Smooth |
|---|---|---|---|---|
| ImageBind | Base | 24.2 | 65.7 | 69.8 |
| ViT-LENS-2 (I) | Large | 31.2 (+7.0) | 74.3 (+8.6) | 78.2 (+8.4) |
| ViT-LENS-2 (I+T) | Large | 65.8 (+41.6) | 74.7 (+9.0) | 63.8 (−6.0) |
| OpenCLIP | Large | 17.7 | 32.2 | 42.7 |
| STLV-Align | Large | 26.0 (+8.3) | 65.1 (+32.9) | 74.6 (+31.9) |
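The parenthesized gains in the table are absolute differences against each model's foundation. The short check below recomputes them from the table values; the pairing of ViT-LENS-2 with ImageBind as its reference row is inferred from the arithmetic, not stated in these notes, and the `gains` helper is hypothetical.

```python
# Zero-shot accuracies (%) copied from the table: (material, hard/soft, rough/smooth).
baselines = {
    "OpenCLIP": (17.7, 32.2, 42.7),
    "ImageBind": (24.2, 65.7, 69.8),
}

# Each model maps to (assumed reference row, scores).
models = {
    "STLV-Align": ("OpenCLIP", (26.0, 65.1, 74.6)),
    "ViT-LENS-2 (I)": ("ImageBind", (31.2, 74.3, 78.2)),
    "ViT-LENS-2 (I+T)": ("ImageBind", (65.8, 74.7, 63.8)),
}


def gains(name: str) -> tuple:
    """Absolute accuracy gains (percentage points) over the model's reference row."""
    base_name, scores = models[name]
    return tuple(round(s - b, 1) for s, b in zip(scores, baselines[base_name]))


for model_name in models:
    print(model_name, gains(model_name))
```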
Where it loses: STLV-Align's absolute material accuracy (26.0) trails ViT-LENS-2 (I+T)'s 65.8 by a wide margin, and even its hard/soft and rough/smooth sit below ViT-LENS-2 (I). The efficiency comparison (Table 2) is the real selling point: STLV-Align is unsupervised, uses 19,843 vs ViT-LENS-2's 91,982 training items, tunes 1% vs 100% of parameters, and supports cross-domain evaluation.
Ablation (Table 3). Aligning both touch and text with vision helps hard/soft (65.1) and rough/smooth (74.6); aligning only one of them with vision hurts. Curiously, removing vision entirely (−TV&VL) gives the best material score (32.5 vs 26.0) — so vision is a net positive on texture/softness but a net negative on material classification.
From the authors:
What I noticed reading it:
This sits squarely on the batch thesis that many manipulation predicates (surface_is_rough, is_soft, material_is_metal) are not visually evaluable and live in touch. TLV is an attempt to attach sentence-level language to those tactile properties, which is exactly the substrate BLADE would need if it ever wanted to learn touch-grounded predicate classifiers (BLADE today learns only visual predicate classifiers). The appeal: a touch↔language embedding space could let an LLM name a tactile predicate and have it grounded in GelSight readings rather than pixels.
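As a rough illustration of what a touch-grounded predicate could look like in a shared embedding space, the sketch below scores a touch embedding against text embeddings for two opposing labels. Everything here is hypothetical: the vectors are made-up stand-ins for encoder outputs, no real encoder is called, and neither TLV nor BLADE provides this interface.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_label(touch_emb: Sequence[float],
               text_embs: Dict[str, Sequence[float]]) -> str:
    """Zero-shot choice: the label whose text embedding is closest to the touch embedding."""
    return max(text_embs, key=lambda label: cosine(touch_emb, text_embs[label]))


def is_soft(touch_emb: Sequence[float],
            text_embs: Dict[str, Sequence[float]]) -> bool:
    """Predicate is_soft, decided by comparing the 'soft' and 'hard' text embeddings."""
    pair = {label: text_embs[label] for label in ("soft", "hard")}
    return best_label(touch_emb, pair) == "soft"


# Toy 3-d embeddings standing in for encoder outputs (made up for illustration).
text_embs = {"soft": [1.0, 0.1, 0.0], "hard": [0.0, 0.1, 1.0]}
plush_touch = [0.9, 0.2, 0.1]
metal_touch = [0.1, 0.0, 0.8]
```

The same nearest-label rule is how zero-shot classification over label prompts typically works, so a predicate like this is only as grounded as the touch↔language alignment behind it.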
But the caveat is sharp and worth flagging for my own future work: TLV's "tactile" captions are produced from the visual image, so the language is not actually grounded in touch. If I cite TLV in a related-work section, it should be framed as an early, vision-captioned touch–language dataset — a motivating proof-of-concept, not a clean tactile-grounding benchmark. The genuinely touch-grounded successors in this batch (TVL, Octopi, UniTouch, Touch100k) are the ones to lean on for grounded predicate learning. TLV is adjacent-but-flawed relative to BLADE: it shares the language-grounds-perception ambition but does not deliver true touch grounding, and it has no planning/manipulation component at all.
Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. — Abstract / p.1
The two modalities of touch and vision can be regarded as different views containing the same semantics. — §3.2 Touch Localization / p.5
To our knowledge, this is the first touch-language-vision dataset with sentence-level descriptions. — §3.4 Dataset Statistics / p.5
Papers cited here that are worth ingesting next:
Newly ingested in the 2026-06-24 batch — directly relevant: