Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation

Ning Cheng, Changhao Guan, Jing Gao, Weihao Wang, You Li, Fandong Meng, Jie Zhou, Bin Fang, Jinan Xu, Wenjuan Han · Beijing Jiaotong Univ. / WeChat AI, Tencent / BUPT · 2024 · arXiv preprint (under review) · arXiv:2406.03813

One-liner. Touch100k is the first ~100k-scale paired touch–language–vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus key-feature phrases), and the companion method TLV-Link uses curriculum learning to "grow" a tactile encoder out of a frozen vision encoder — the practical payoff being that touch becomes a language-queryable modality for material and grasp reasoning instead of a bag of classification labels.

Problem & motivation

Tactile perception is underexplored relative to vision and audio, and the language that does accompany existing tactile datasets is almost always a flat classification label (e.g., "rough", "metal"). That impoverished language signal caps how richly touch can be cross-modally associated — you cannot ground "fine clay, hard, rough, abrasive to the hand" if your supervision is a single class token. The authors argue tactile research needs (i) a dataset at real scale with (ii) descriptive, multi-granularity natural language tied to (iii) the paired vision, so that a tactile encoder can be aligned to language the way CLIP aligned images to text. Touch100k is positioned as the first dataset combining tactile + multi-granularity language + vision at 100k scale (Table 1 contrasts it against TLV, TVL, ObjectFolder Real/2.0/1.0, SSVTP, Touch-and-Go (TAG), YCB-Slide, VisGel, ViTac Cloth, Data_ICRA18, GelFabric, Feel — Touch100k is the only one checking every column including multi-granularity language and ≥100k size).

Method

The paper has two halves: building the dataset, and the TLV-Link pretraining method.

Dataset construction (Fig 1, three stages).

Stage I — Data collection. Foundational vision-touch data is curated from two existing GelSight datasets: Touch-and-Go (TAG) (91,982 observations, already has classification labels) and VisGel (10,000 observations; lacks textual labels, so the authors manually label the touched-object names). Total: 101,982 paired visual-tactile observations.
Stage II — Multi-granularity description generation. GPT-4V (gpt-4-turbo, April/May 2024) is prompted with the visual image to generate two description levels: sentence-level (rich, high-level expressions including contextual and dynamic relationships) and phrase-level (key tactile features). A prompt pool is sampled randomly per item to diversify outputs (prompt pools in Appendix B).
Stage III — Multi-step quality enhancement. (1) Pattern filtering + machine correction: regex-based filtering of bad patterns (multilanguage confusion, special markers, semantic redundancy), re-fed to GPT-4V for correction. (2) Consistency assessment: Gemini 1.0 Pro acts as a referee comparing image vs. description; hired workers correct inconsistencies. (3) Completeness assessment: human evaluation that descriptions cover touch-point locations, texture features, etc. After filtering, 100,147 valid touch-language-vision entries remain.
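
The pattern-filtering step of Stage III can be sketched in a few lines of Python. The notes above only say the filter is regex-based; the three patterns below and the names BAD_PATTERNS and flag_description are my own illustrative assumptions, not the authors' rules.

```python
import re

# Hypothetical patterns, one per failure mode named in Stage III.
# The paper's actual regexes are not given here; these are stand-ins.
BAD_PATTERNS = {
    # CJK characters inside an otherwise English description
    "multilanguage_confusion": re.compile(r"[\u4e00-\u9fff]"),
    # leftover markup such as "##", "**", or "<tag>"
    "special_markers": re.compile(r"#{2,}|\*{2,}|<[^>]+>"),
    # the same word repeated back to back ("rough rough")
    "semantic_redundancy": re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE),
}


def flag_description(text: str) -> list[str]:
    """Return the names of every bad pattern found in `text`."""
    return [name for name, pattern in BAD_PATTERNS.items() if pattern.search(text)]


samples = [
    "The surface feels rough and slightly abrasive.",
    "The surface feels rough rough and abrasive.",
    "**Texture:** smooth 表面",
]
for sample in samples:
    print(flag_description(sample), "<-", sample)
```

In the pipeline described above, anything flagged this way would be sent back to GPT-4V for correction rather than dropped outright.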

TLV-Link pretraining (Fig 2). The goal is a tactile encoder for the GelSight sensor aligned to language. Two stages:

1. Curriculum representation for tactile encoding (§4.1). A teacher–student curriculum: the pretrained vision encoder is the teacher, the touch encoder is the student (tactile images are treated as RGB). The "curriculum representation" is a weighted blend x = β_i x^(v) + (1 − β_i) x^(t) (Eq. 1). Early in training the weak student leans on the strong visual teacher (large β₁); β_i linearly decays toward β_min = 0 over N steps (β_i = β₁ − (i/N)(β₁ − β_min), Eq. 2), handing control to the maturing touch encoder.

2. Modality alignment (§4.2). Multi-granularity descriptions are encoded by a frozen OpenCLIP-large text encoder; sentence- and phrase-level features are fused into a final text feature y. The blended tactile representation x is aligned to y via an InfoNCE contrastive loss (Eq. 3). The touch encoder is fully trained; the vision encoder is LoRA fine-tuned (its pretrained weights stay frozen and only low-rank adapters are updated); the text encoder stays frozen. Both the vision and touch encoders are 24-layer 1024-dim ViTs (patch 14) initialized from OpenCLIP-large. The authors call the focus on adapting a visual encoder into the touch domain "touch-centric multimodal representation learning".
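
The two stages above can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: beta_1 = 0.8 and the temperature 0.07 are made-up values, the random vectors stand in for encoder outputs, and the loss is a one-directional (touch-to-text) InfoNCE because the exact form of Eq. 3 is not spelled out in these notes.

```python
import math
import random

Vector = list[float]


def beta_schedule(i: int, n_steps: int, beta_1: float, beta_min: float = 0.0) -> float:
    """Eq. 2: linear decay from beta_1 at step 0 to beta_min at step n_steps."""
    return beta_1 - (i / n_steps) * (beta_1 - beta_min)


def curriculum_blend(x_v: Vector, x_t: Vector, beta: float) -> Vector:
    """Eq. 1: weighted blend of teacher (vision) and student (touch) features."""
    return [beta * v + (1.0 - beta) * t for v, t in zip(x_v, x_t)]


def _normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def info_nce(xs: list[Vector], ys: list[Vector], temperature: float = 0.07) -> float:
    """InfoNCE over a batch: xs[k] should match ys[k] against every other ys[j]."""
    xs = [_normalize(x) for x in xs]
    ys = [_normalize(y) for y in ys]
    total = 0.0
    for k, x in enumerate(xs):
        logits = [sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[k]
    return total / len(xs)


def random_batch(batch: int, dim: int) -> list[Vector]:
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]


random.seed(0)
n_steps = 10
x_v = random_batch(4, 8)  # stand-in vision (teacher) features
x_t = random_batch(4, 8)  # stand-in touch (student) features
y = random_batch(4, 8)    # stand-in fused text features
for i in (0, n_steps // 2, n_steps):
    beta = beta_schedule(i, n_steps, beta_1=0.8)  # 0.8 is a hypothetical beta_1
    x = [curriculum_blend(v, t, beta) for v, t in zip(x_v, x_t)]
    print(f"step {i}: beta={beta:.2f}, loss={info_nce(x, y):.3f}")
```

At the final step beta reaches beta_min = 0, so the blended feature is the touch encoder's output alone, which is the hand-off the curriculum is designed to produce.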

Setup

Datasets / benchmarks: Pretraining on Touch100k (100,147 entries, sourced from TAG + VisGel). Downstream: material property identification on the three TAG test sets — material classification (20 labels, multi-class), hard/soft (binary), rough/smooth (binary); robot grasping prediction on the Feel ("The Feeling of Success") dataset, predicting grasp success from before/after tactile observations of left/right sensors, split by objects 8:1:1.
Hardware / simulator: The authors run no robot experiments; this is a dataset/representation paper. All tactile data is GelSight (vision-based tactile sensor); grasping is evaluated offline on the Feel dataset's parallel-gripper tactile records.
Baselines: Chance; supervised ImageNet features; and SOTA tactile representation models VT CMC (TAG's method), MViTac, UniTouch, VIT-LENS-2. Evaluated under linear probing (for vision/touch-only encoders) and zero-shot (for language-capable encoders: UniTouch, VIT-LENS-2, TLV-Link).
Compute: not reported.
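
The "split by objects 8:1:1" detail matters because a record-level split would let the same object appear in both train and test. A minimal sketch of an object-level split follows; the record schema (an "object_id" key) and the helper name split_by_object are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def split_by_object(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records so that no object appears in more than one partition.

    `records` is a list of dicts with an "object_id" key (hypothetical schema).
    """
    by_object = defaultdict(list)
    for record in records:
        by_object[record["object_id"]].append(record)

    # Shuffle object ids, not records, so all attempts on an object stay together.
    object_ids = sorted(by_object)
    random.Random(seed).shuffle(object_ids)

    n_train = round(ratios[0] * len(object_ids))
    n_val = round(ratios[1] * len(object_ids))
    groups = {
        "train": object_ids[:n_train],
        "val": object_ids[n_train:n_train + n_val],
        "test": object_ids[n_train + n_val:],
    }
    return {
        name: [record for oid in ids for record in by_object[oid]]
        for name, ids in groups.items()
    }


# Toy data: 20 objects with 3 grasp attempts each.
records = [{"object_id": f"obj{o:02d}", "attempt": a} for o in range(20) for a in range(3)]
splits = split_by_object(records)
print({name: len(part) for name, part in splits.items()})
```

With 20 toy objects this yields 16/2/2 objects, and every attempt on a given object lands in exactly one partition.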

Results

Headline: TLV-Link pretrained on Touch100k sets a new SOTA on most subtasks under both linear probing and zero-shot evaluation (Table 2). Accuracy (%):

Setting	Model	Train data	Material	Hard/Soft	Rough/Smooth	Robot Grasping
-	Chance	-	5.0	50.0	50.0	50.0
Linear probe	VT CMC	TAG	54.7	77.3	79.4	78.1
Linear probe	UniTouch	TAG, Feel, YCB-Slide, OF2.0	61.3	-	-	82.3
Linear probe	VIT-LENS-2	TAG	63.0	92.0	85.1	-
Linear probe	TLV-Link (ours)	Touch100k	67.2	93.1	84.7	94.5
Zero-shot	UniTouch	TAG, Feel, YCB-Slide, OF2.0	52.7	-	-	65.5
Zero-shot	VIT-LENS-2	TAG	65.8	74.7	63.8	-
Zero-shot	TLV-Link (ours)	Touch100k	70.0	79.3	77.8	65.4

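Where it wins: linear-probe robot grasping 94.5 (+12.2 over the runner-up, UniTouch at 82.3); linear-probe material 67.2 (+4.2 over VIT-LENS-2 at 63.0); linear-probe hard/soft 93.1 (+1.1 over VIT-LENS-2 at 92.0). Zero-shot material 70.0 (+4.2), hard/soft 79.3 (+4.6), rough/smooth 77.8 (+14.0), all measured against VIT-LENS-2.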
Where it loses / ties: linear-probe rough/smooth, where VIT-LENS-2 edges it (85.1 vs 84.7); and zero-shot robot grasping (65.4 vs UniTouch 65.5).
Dataset-vs-dataset (Table 3): holding the method fixed (VT CMC + Touch-Text Contrastive Learning), Touch100k beats TAG on every subtask (e.g., material 64.1 vs 60.8; grasping 61.3 vs 43.4) — isolating the dataset's contribution from the method's.
Scale (Table 4): 25%→100% of Touch100k improves linear-probe and especially zero-shot (e.g., grasping zero-shot 44.7→65.4) — more data helps, with the larger gains in zero-shot.
Ablation (Table 5): removing the multi-stage curriculum representation drops every subtask; the cliff is zero-shot grasping −21.2 (65.4→44.2), arguing the curriculum is what gives the touch encoder generalization, not just the data.
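
The margins quoted in the bullets above can be recomputed from the Table 2 excerpt. The snippet below is just that arithmetic; the numbers are copied from the table in these notes, and the helper name margins is my own.

```python
# Accuracy (%) copied from the Table 2 excerpt; None marks an unreported cell.
TASKS = ("material", "hard_soft", "rough_smooth", "grasping")
TABLE_2 = {
    "linear_probe": {
        "VT CMC": (54.7, 77.3, 79.4, 78.1),
        "UniTouch": (61.3, None, None, 82.3),
        "VIT-LENS-2": (63.0, 92.0, 85.1, None),
        "TLV-Link": (67.2, 93.1, 84.7, 94.5),
    },
    "zero_shot": {
        "UniTouch": (52.7, None, None, 65.5),
        "VIT-LENS-2": (65.8, 74.7, 63.8, None),
        "TLV-Link": (70.0, 79.3, 77.8, 65.4),
    },
}


def margins(setting: str, ours: str = "TLV-Link") -> dict[str, float]:
    """TLV-Link accuracy minus the best reported baseline, per task."""
    rows = TABLE_2[setting]
    result = {}
    for j, task in enumerate(TASKS):
        baselines = [
            acc[j] for name, acc in rows.items()
            if name != ours and acc[j] is not None
        ]
        result[task] = round(rows[ours][j] - max(baselines), 1)
    return result


for setting in TABLE_2:
    print(setting, margins(setting))
```

This reproduces the +12.2, +4.2, +4.6, and +14.0 gaps and the two small deficits (-0.4 on linear-probe rough/smooth, -0.1 on zero-shot grasping).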

Limitations & open questions

From the authors:

The dataset and method are specifically designed for GelSight; transfer to other tactile sensors (DIGIT, GelSlim) is an open question and explicit future work.
t-SNE analysis (Fig 3) shows TLV-Link separates the binary tasks (hard/soft, rough/smooth) cleanly but degrades on multi-class material classification and on grasping — they flag tactile multi-classification and manipulation as not yet robust.

What I noticed reading it:

The language supervision is image-derived, not touch-derived. Descriptions are generated by GPT-4V from the visual image, not from the tactile signal. So the "touch-language" alignment is really "vision-implied-touch-language" — any tactile property not visible in the RGB image (true compliance, temperature, sub-surface texture) cannot be in the label, capping the ceiling of what touch can learn.
Quality control leans on other black-box models. GPT-4V generates, Gemini referees, humans correct; the reproducibility and bias of this pipeline aren't quantified (e.g., what fraction of entries each correction stage touched, inter-rater agreement). The 100,147/101,982 retention implies ~1.8% hard-filtered, but consistency/completeness correction rates aren't reported.
Downstream is all classification. "Robot grasping prediction" is offline success classification on the Feel dataset, not a closed-loop policy — the paper demonstrates representation quality, not control. No manipulation rollouts.
Inherited-data ceiling. 90% of the data is TAG; the dataset's diversity is bounded by TAG's object/scene coverage. The "100k scale" is multi-granularity language over largely pre-existing tactile observations.
Curriculum weight schedule is linear and hand-set (β₁, β_min=0); no sensitivity analysis over the schedule shape.
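
The two percentages I lean on in this list (the ~1.8% filtered and the ~90% TAG share) follow directly from the counts reported in Stage I and Stage III. A quick check, using only those counts:

```python
tag_observations = 91_982      # Touch-and-Go, from Stage I
visgel_observations = 10_000   # VisGel, from Stage I
retained = 100_147             # valid entries after Stage III

collected = tag_observations + visgel_observations
dropped = collected - retained

print(f"collected: {collected}")
print(f"dropped:   {dropped} ({dropped / collected:.2%})")
print(f"TAG share: {tag_observations / collected:.1%}")
```

So 1,835 of 101,982 observations were removed (about 1.80%), and TAG supplies about 90.2% of what was collected.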

Why I care

This is squarely on the thesis behind the 2026-06-24 batch: many manipulation predicates — surface_is_rough, is_hard, material_is_metal, grasp_is_stable — are not reliably visually evaluable; they live in touch. BLADE learns visual predicate classifiers f_θ(p): O → {T,F}, but several of the most useful manipulation predicates would need a tactile O. Touch100k + TLV-Link is exactly the missing ingredient: a tactile encoder aligned to language, so a touch observation could in principle feed a language-named predicate classifier the way BLADE feeds an RGB crop. The robot-grasping result (94.5 linear-probe) is direct evidence that touch carries a clean grasp_is_stable-style signal.

Two honest caveats keep this adjacent rather than core to BLADE. First, the language labels are GPT-4V-on-vision, so this dataset can't yet ground the genuinely non-visual predicates (compliance, temperature) that are the strongest argument for touch — it's a first rung, not the full ladder. Second, it's a representation/dataset paper with no planning or closed-loop control. The value to my line of work is: (i) it's a concrete recipe for a language-queryable tactile encoder that a BLADE-style predicate layer could call, and (ii) its curriculum trick (grow touch out of a frozen vision encoder via decaying blend weight) is a cheap way to bootstrap a new sensory modality's predicate classifiers from a vision-pretrained backbone — useful if I want to add touch predicates without collecting touch-scale data from scratch.
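
To make the "predicate layer calls a tactile encoder" idea concrete, here is a toy sketch of a zero-shot predicate f(o) -> {True, False} built from two prompt embeddings. Everything in it is hypothetical: neither BLADE nor TLV-Link exposes this interface as far as these notes show, and the 3-d vectors are stand-ins for real encoder outputs in a shared touch-text embedding space.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_predicate(pos_text_emb: Vector, neg_text_emb: Vector) -> Callable[[Vector], bool]:
    """Build a predicate that is True when the touch embedding is closer
    to the positive prompt than to the negative one (zero-shot style)."""
    def predicate(touch_emb: Vector) -> bool:
        return cosine(touch_emb, pos_text_emb) > cosine(touch_emb, neg_text_emb)
    return predicate


# Toy embeddings; real ones would come from aligned touch and text encoders.
surface_is_rough = make_predicate(
    pos_text_emb=[0.9, 0.1, 0.0],  # stand-in for the prompt "a rough surface"
    neg_text_emb=[0.1, 0.9, 0.0],  # stand-in for the prompt "a smooth surface"
)
print(surface_is_rough([0.8, 0.2, 0.1]))  # True
print(surface_is_rough([0.2, 0.7, 0.3]))  # False
```

The point of the sketch is only the shape of the interface: once touch lives in a language-aligned space, naming a new predicate costs two prompts rather than a new labeled dataset. The first caveat above still applies, since the prompts can only pick out properties the image-derived labels actually covered.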

Quotable

Despite many works involving language, the language presentation of these works is often in the form of textual classification labels. — §1 Introduction / p.1

To the best of our knowledge, Touch100k is the first dataset to encompass tactile, multi-granularity language, and visual modalities at a scale of 100k. — §1 Introduction / p.2

Initially, the curriculum representation relies heavily on the teacher model due to the limited capacity of the student model. As pretraining progresses, the student model improves, allowing for a gradual decrease in the teacher's influence. — §1 / p.2 (TLV-Link overview)

Papers cited that should likely be ingested next:

[33] Radford et al. — CLIP — the vision-language contrastive backbone this whole alignment paradigm inherits.
[16] Girdhar et al. — ImageBind — six-modality binding to an image-centric space; the cross-modal-binding precedent (expected slug: imagebind_one_embedding_space).
[51] Zhu et al. — LanguageBind — language-centric binding framework; direct philosophical sibling (expected slug: languagebind_language_anchored).
[43] Yang et al. — Touch-and-Go (TAG) — the 90%-of-data source + a downstream benchmark (expected slug: touch_and_go_human_collected_vision_touch).
[28] / VisGel — the second data source (visual-tactile, GelSight).
[5] Cheng et al. — TLV (Touch-Language-Vision Dataset) — the authors' own predecessor dataset (expected slug: tlv_touch_language_vision_dataset).
[12] Fu et al. — TVL (A Touch, Vision, and Language Dataset) — the closest competing touch-language-vision dataset (expected slug: touch_vision_language_dataset_multimodal_alignment).
[13,14,15] Gao et al. — ObjectFolder 1.0 / Benchmark / 2.0 — multisensory object datasets compared in Table 1 (objectfolder_dataset_implicit_representations, objectfolder_benchmark_neural_real, objectfolder_2_multisensory_sim2real).
[4] Calandra et al. — The Feeling of Success (Feel) — the grasp-prediction downstream dataset.
[42] UniTouch — Binding Touch to Everything — the main language-capable baseline (expected slug: binding_touch_to_everything_unitouch).
[3] Bengio et al. — Curriculum Learning — origin of the CL idea TLV-Link adapts to modality alignment.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

TLV (Touch-Language-Vision Dataset) — same authors' earlier, smaller touch-language-vision dataset; Touch100k is its 100k-scale, multi-granularity successor.
TVL — the closest peer touch-vision-language dataset + alignment method; the head-to-head comparison in Table 1.
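UniTouch — the language-capable tactile-binding baseline. TLV-Link beats it on material in both settings and on linear-probe grasping (94.5 vs 82.3), and trails it by 0.1 on zero-shot grasping (65.4 vs 65.5); same "touch into a CLIP-aligned space" family.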
Octopi — tactile-language model for object-property reasoning; the LLM-reasoning counterpart to Touch100k's contrastive-representation framing of touch-language grounding.
AnyTouch — unified visuo-tactile representation across sensors; directly addresses Touch100k's main limitation (GelSight-only transfer).
Touch-and-Go — the vision-touch dataset supplying ~90% of Touch100k's observations.
ImageBind and LanguageBind — the multimodal-binding anchors whose contrastive-alignment paradigm Touch100k instantiates for the touch modality.