Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation

Ning Cheng, Changhao Guan, Jing Gao, Weihao Wang, You Li, Fandong Meng, Jie Zhou, Bin Fang, Jinan Xu, Wenjuan Han · Beijing Jiaotong Univ. / WeChat AI, Tencent / BUPT · 2024 · arXiv preprint (under review) · arXiv:2406.03813

One-liner. Touch100k is the first ~100k-scale paired touch–language–vision dataset in which GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus key-feature phrases). The companion method, TLV-Link, uses curriculum learning to "grow" a tactile encoder out of a pretrained vision encoder. The practical payoff is that touch becomes a language-queryable modality for material and grasp reasoning instead of a bag of classification labels.

Problem & motivation

Tactile perception is underexplored relative to vision and audio, and the language that does accompany existing tactile datasets is almost always a flat classification label (e.g., "rough", "metal"). That impoverished language signal caps how richly touch can be cross-modally associated — you cannot ground "fine clay, hard, rough, abrasive to the hand" if your supervision is a single class token. The authors argue tactile research needs (i) a dataset at real scale with (ii) descriptive, multi-granularity natural language tied to (iii) the paired vision, so that a tactile encoder can be aligned to language the way CLIP aligned images to text. Touch100k is positioned as the first dataset combining tactile + multi-granularity language + vision at 100k scale (Table 1 contrasts it against TLV, TVL, ObjectFolder Real/2.0/1.0, SSVTP, Touch-and-Go (TAG), YCB-Slide, VisGel, ViTac Cloth, Data_ICRA18, GelFabric, Feel — Touch100k is the only one checking every column including multi-granularity language and ≥100k size).

Method

The paper has two halves: building the dataset, and the TLV-Link pretraining method.

Dataset construction (Fig 1, three stages).

TLV-Link pretraining (Fig 2). The goal is a tactile encoder for the GelSight sensor aligned to language. Two stages:

1. Curriculum representation for tactile encoding (§4.1). A teacher–student curriculum: the pretrained vision encoder is the teacher, the touch encoder is the student (tactile images are treated as RGB). The "curriculum representation" is a weighted blend $x = \beta_i x^{(v)} + (1 - \beta_i) x^{(t)}$ (Eq. 1), where $x^{(v)}$ and $x^{(t)}$ are the vision and touch features. Early in training the weak student leans on the strong visual teacher (large $\beta_1$); $\beta_i$ then decays linearly toward $\beta_{\min} = 0$ over $N$ steps, $\beta_i = \beta_1 - \frac{i}{N}(\beta_1 - \beta_{\min})$ (Eq. 2), handing control to the maturing touch encoder.

2. Modality alignment (§4.2). Multi-granularity descriptions are encoded by a frozen OpenCLIP-large text encoder; sentence- and phrase-level features are fused into a final text feature y. The blended tactile representation x is aligned to y via an InfoNCE contrastive loss (Eq. 3). The touch encoder is fully trained; the vision encoder is LoRA fine-tuned; the text encoder stays frozen. Both encoders are 24-layer 1024-dim ViTs (patch 14) initialized from OpenCLIP-large. The authors call the focus on adapting a visual encoder into the touch domain "touch-centric multimodal representation learning".
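The two stages above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the starting weight `beta_start=0.8`, the temperature `0.07`, the one-directional (touch-to-text) form of the loss, and all function names are hypothetical choices, and the fused text feature y is taken as given because these notes do not record how sentence and phrase features are combined.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def beta_schedule(step: int, total_steps: int,
                  beta_start: float, beta_min: float = 0.0) -> float:
    """Linear decay of the teacher weight, as in Eq. 2."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, N] stay at the schedule's endpoints.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start - frac * (beta_start - beta_min)


def blend(vision_feat: Vector, touch_feat: Vector, beta: float) -> List[float]:
    """Curriculum representation, as in Eq. 1: beta * x_v + (1 - beta) * x_t."""
    if len(vision_feat) != len(touch_feat):
        raise ValueError("feature vectors must have the same dimension")
    return [beta * v + (1.0 - beta) * t for v, t in zip(vision_feat, touch_feat)]


def l2_normalize(vec: Vector) -> List[float]:
    norm = math.sqrt(sum(a * a for a in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in vec]


def info_nce(x_batch: Sequence[Vector], y_batch: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Touch-to-text InfoNCE: row i of x_batch is the positive for row i of y_batch.

    The temperature and the one-directional form are assumptions of this sketch.
    """
    xs = [l2_normalize(x) for x in x_batch]
    ys = [l2_normalize(y) for y in y_batch]
    total = 0.0
    for i, x in enumerate(xs):
        logits = [sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(xs)


if __name__ == "__main__":
    # Toy 2-sample batch: vision features already match the text, touch does not yet.
    vision = [[1.0, 0.0], [0.0, 1.0]]
    touch = [[0.6, 0.4], [0.5, 0.5]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    for step in (0, 50, 100):
        beta = beta_schedule(step, total_steps=100, beta_start=0.8)
        x = [blend(v, t, beta) for v, t in zip(vision, touch)]
        print(step, round(beta, 3), round(info_nce(x, text), 4))
```

In this toy the touch features are held fixed, so the loss rises as the teacher weight decays; in real training the touch encoder would be updated at each step so that it can take over as $\beta_i$ approaches $\beta_{\min}$.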

Setup

Results

Headline: TLV-Link pretrained on Touch100k sets new SOTA on most subtasks under both linear probing and zero-shot (Table 2). Accuracy (%):

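| Setting | Model | Train data | Material | Hard/Soft | Rough/Smooth | Robot Grasping |
|---|---|---|---|---|---|---|
| - | Chance | - | 5.0 | 50.0 | 50.0 | 50.0 |
| Linear probe | VT CMC | TAG | 54.7 | 77.3 | 79.4 | 78.1 |
| Linear probe | UniTouch | TAG, Feel, YCB-Slide, OF2.0 | 61.3 | - | - | 82.3 |
| Linear probe | VIT-LENS-2 | TAG | 63.0 | 92.0 | 85.1 | - |
| Linear probe | TLV-Link (ours) | Touch100k | 67.2 | 93.1 | 84.7 | 94.5 |
| Zero-shot | UniTouch | TAG, Feel, YCB-Slide, OF2.0 | 52.7 | - | - | 65.5 |
| Zero-shot | VIT-LENS-2 | TAG | 65.8 | 74.7 | 63.8 | - |
| Zero-shot | TLV-Link (ours) | Touch100k | 70.0 | 79.3 | 77.8 | 65.4 |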
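
To make the "most subtasks" claim concrete, the sketch below computes TLV-Link's margin over the best reported baseline in each column. It rests on an assumption about where the partially filled rows sit: UniTouch's two numbers are read as Material and Robot Grasping, and VIT-LENS-2's three as Material, Hard/Soft, and Rough/Smooth. The `margins` helper and the `results` layout are illustrative names, not anything from the paper.

```python
from typing import Dict, Optional, Tuple

TASKS = ("Material", "Hard/Soft", "Rough/Smooth", "Robot Grasping")
Row = Tuple[Optional[float], ...]

# Accuracy (%) per task; None marks a cell with no reported value.
# Column placement of the partially filled rows is an assumption (see lead-in).
results: Dict[str, Dict[str, Row]] = {
    "linear probe": {
        "VT CMC": (54.7, 77.3, 79.4, 78.1),
        "UniTouch": (61.3, None, None, 82.3),
        "VIT-LENS-2": (63.0, 92.0, 85.1, None),
        "TLV-Link": (67.2, 93.1, 84.7, 94.5),
    },
    "zero-shot": {
        "UniTouch": (52.7, None, None, 65.5),
        "VIT-LENS-2": (65.8, 74.7, 63.8, None),
        "TLV-Link": (70.0, 79.3, 77.8, 65.4),
    },
}


def margins(rows: Dict[str, Row], ours: str = "TLV-Link") -> Dict[str, float]:
    """Return ours minus the best baseline, per task, in accuracy points."""
    out = {}
    for k, task in enumerate(TASKS):
        best_baseline = max(
            vals[k] for name, vals in rows.items()
            if name != ours and vals[k] is not None
        )
        out[task] = round(rows[ours][k] - best_baseline, 1)
    return out


if __name__ == "__main__":
    for setting, rows in results.items():
        print(setting, margins(rows))
```

Under that reading, TLV-Link leads in six of the eight cells and trails narrowly in linear-probe Rough/Smooth and zero-shot Robot Grasping.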

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is squarely on the thesis behind the 2026-06-24 batch: many manipulation predicates — surface_is_rough, is_hard, material_is_metal, grasp_is_stable — are not reliably visually evaluable; they live in touch. BLADE learns visual predicate classifiers fθ(p): O → {T,F}, but several of the most useful manipulation predicates would need a tactile O. Touch100k + TLV-Link is exactly the missing ingredient: a tactile encoder aligned to language, so a touch observation could in principle feed a language-named predicate classifier the way BLADE feeds an RGB crop. The robot-grasping result (94.5 linear-probe) is direct evidence that touch carries a clean grasp_is_stable-style signal.

Two honest caveats keep this adjacent rather than core to BLADE. First, the language labels are GPT-4V-on-vision, so this dataset can't yet ground the genuinely non-visual predicates (compliance, temperature) that are the strongest argument for touch; it's a first rung, not the full ladder. Second, it's a representation/dataset paper with no planning or closed-loop control. The value to my line of work is: (i) it's a concrete recipe for a language-queryable tactile encoder that a BLADE-style predicate layer could call, and (ii) its curriculum trick (grow touch out of a pretrained, LoRA-tuned vision encoder via a decaying blend weight) is a cheap way to bootstrap a new sensory modality's predicate classifiers from a vision-pretrained backbone, which is useful if I want to add touch predicates without collecting touch-scale data from scratch.

Quotable

Despite many works involving language, the language presentation of these works is often in the form of textual classification labels. — §1 Introduction / p.1
To the best of our knowledge, Touch100k is the first dataset to encompass tactile, multi-granularity language, and visual modalities at a scale of 100k. — §1 Introduction / p.2
Initially, the curriculum representation relies heavily on the teacher model due to the limited capacity of the student model. As pretraining progresses, the student model improves, allowing for a gradual decrease in the teacher's influence. — §1 / p.2 (TLV-Link overview)

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: