Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation
Ning Cheng, Changhao Guan, Jing Gao, Weihao Wang, You Li, Fandong Meng,
Jie Zhou, Bin Fang, Jinan Xu, Wenjuan Han
· Beijing Jiaotong Univ. / WeChat AI, Tencent / BUPT
· 2024 · arXiv preprint (under review)
· arXiv:2406.03813
· PDF
One-liner. Touch100k is the first ~100k-scale paired
touch–language–vision dataset where GelSight tactile observations are
annotated with GPT-4V-generated multi-granularity language (full sentences
plus key-feature phrases), and the companion method TLV-Link uses curriculum
learning to "grow" a tactile encoder out of a frozen vision encoder — the
practical payoff being that touch becomes a language-queryable modality for material
and grasp reasoning instead of a bag of classification labels.
Problem & motivation
Tactile perception is underexplored relative to vision and audio, and the language
that does accompany existing tactile datasets is almost always a flat
classification label (e.g., "rough", "metal"). That impoverished language signal caps
how richly touch can be cross-modally associated — you cannot ground "fine clay,
hard, rough, abrasive to the hand" if your supervision is a single class token. The
authors argue tactile research needs (i) a dataset at real scale with (ii) descriptive,
multi-granularity natural language tied to (iii) the paired vision, so that a tactile
encoder can be aligned to language the way CLIP aligned images to text. Touch100k is
positioned as the first dataset combining tactile + multi-granularity language + vision
at 100k scale (Table 1 contrasts it against TLV, TVL, ObjectFolder Real/2.0/1.0,
SSVTP, Touch-and-Go (TAG), YCB-Slide, VisGel, ViTac Cloth, Data_ICRA18, GelFabric,
Feel — Touch100k is the only one checking every column including multi-granularity
language and ≥100k size).
Method
The paper has two halves: building the dataset, and the TLV-Link pretraining method.
Dataset construction (Fig 1, three stages).
- Stage I — Data collection. Foundational vision-touch data is
curated from two existing GelSight datasets: Touch-and-Go (TAG) (91,982
observations, already has classification labels) and VisGel (10,000
observations; lacks textual labels, so the authors manually label the touched-object
names). Total: 101,982 paired visual-tactile observations.
- Stage II — Multi-granularity description generation. GPT-4V
(gpt-4-turbo, April/May 2024) is prompted with the visual image to generate
two description levels: sentence-level (rich, high-level expressions
including contextual and dynamic relationships) and phrase-level (key tactile
features). A prompt pool is sampled randomly per item to diversify outputs (prompt
pools in Appendix B).
- Stage III — Multi-step quality enhancement. (1) Pattern
filtering + machine correction: regex-based filtering of bad patterns
(multilanguage confusion, special markers, semantic redundancy), re-fed to GPT-4V for
correction. (2) Consistency assessment: Gemini 1.0 Pro acts as a referee comparing
image vs. description; hired workers correct inconsistencies. (3) Completeness
assessment: human evaluation that descriptions cover touch-point locations, texture
features, etc. After filtering, 100,147 valid touch-language-vision
entries remain.
TLV-Link pretraining (Fig 2). The goal is a tactile encoder for the
GelSight sensor aligned to language. Two stages:
1. Curriculum representation for tactile encoding (§4.1). A
teacher–student curriculum: the pretrained vision encoder is the teacher,
the touch encoder is the student (tactile images are treated as RGB). The
"curriculum representation" is a weighted blend
x = βi x(v) + (1 − βi) x(t)
(Eq. 1). Early in training the weak student leans on the strong visual teacher (large
β1); βi linearly decays toward
βmin = 0 over N steps
(βi = β1 − (i/N)(β1 − βmin),
Eq. 2), handing control to the maturing touch encoder.
2. Modality alignment (§4.2). Multi-granularity descriptions are
encoded by a frozen OpenCLIP-large text encoder; sentence- and phrase-level
features are fused into a final text feature y. The blended tactile
representation x is aligned to y via an InfoNCE contrastive loss
(Eq. 3). The touch encoder is fully trained; the vision encoder is LoRA fine-tuned; the
text encoder stays frozen. Both encoders are 24-layer 1024-dim ViTs (patch 14)
initialized from OpenCLIP-large. The authors call the focus on adapting a visual encoder
into the touch domain "touch-centric multimodal representation learning".
Setup
- Datasets / benchmarks: Pretraining on Touch100k (100,147 entries,
sourced from TAG + VisGel). Downstream: material property identification on
the three TAG test sets — material classification (20 labels, multi-class),
hard/soft (binary), rough/smooth (binary); robot grasping prediction on the
Feel ("The Feeling of Success") dataset, predicting grasp success from before/after
tactile observations of left/right sensors, split by objects 8:1:1.
- Hardware / simulator: No robot run by the authors — this is a
dataset/representation paper. All tactile data is GelSight (vision-based tactile sensor);
grasping is evaluated offline on the Feel dataset's parallel-gripper tactile records.
- Baselines: Chance; supervised ImageNet features; and SOTA tactile
representation models VT CMC (TAG's method), MViTac, UniTouch, VIT-LENS-2. Evaluated
under linear probing (for vision/touch-only encoders) and zero-shot (for language-capable
encoders: UniTouch, VIT-LENS-2, TLV-Link).
- Compute: not reported.
Results
Headline: TLV-Link pretrained on Touch100k sets new SOTA on most subtasks under both
linear probing and zero-shot (Table 2). Accuracy (%):
| Setting / Model | Train data | Material | Hard/Soft | Rough/Smooth | Robot Grasping |
| Chance | — | 5.0 | 50.0 | 50.0 | 50.0 |
| Linear probe: VT CMC | TAG | 54.7 | 77.3 | 79.4 | 78.1 |
| UniTouch | TAG, Feel, YCB-Slide, OF2.0 | 61.3 | — | — | 82.3 |
| VIT-LENS-2 | TAG | 63.0 | 92.0 | 85.1 | — |
| TLV-Link (ours) | Touch100k | 67.2 | 93.1 | 84.7 | 94.5 |
| Zero-shot: UniTouch | TAG, Feel, YCB-Slide, OF2.0 | 52.7 | — | — | 65.5 |
| VIT-LENS-2 | TAG | 65.8 | 74.7 | 63.8 | — |
| TLV-Link (ours) | Touch100k | 70.0 | 79.3 | 77.8 | 65.4 |
- Where it wins: robot grasping linear-probe 94.5 (+12.2 over
suboptimal UniTouch 82.3); material 67.2 (+4.2); hard/soft 93.1. Zero-shot material 70.0
(+4.2), hard/soft 79.3 (+4.6), rough/smooth 77.8 (+14.0).
- Where it loses / ties: linear-probe rough/smooth, where VIT-LENS-2
edges it (85.1 vs 84.7); and zero-shot robot grasping (65.4 vs UniTouch 65.5).
- Dataset-vs-dataset (Table 3): holding the method fixed (VT CMC +
Touch-Text Contrastive Learning), Touch100k beats TAG on every subtask (e.g., material
64.1 vs 60.8; grasping 61.3 vs 43.4) — isolating the dataset's contribution from
the method's.
- Scale (Table 4): 25%→100% of Touch100k improves linear-probe and
especially zero-shot (e.g., grasping zero-shot 44.7→65.4) — more data helps,
with the larger gains in zero-shot.
- Ablation (Table 5): removing the multi-stage curriculum representation
drops every subtask; the cliff is zero-shot grasping −21.2 (65.4→44.2),
arguing the curriculum is what gives the touch encoder generalization, not just the data.
Limitations & open questions
From the authors:
- The dataset and method are specifically designed for GelSight; transfer to other
tactile sensors (DIGIT, GelSlim) is an open question and explicit future work.
- t-SNE analysis (Fig 3) shows TLV-Link separates the binary tasks (hard/soft,
rough/smooth) cleanly but degrades on multi-class material classification and on
grasping — they flag tactile multi-classification and manipulation as not yet robust.
What I noticed reading it:
- The language supervision is image-derived, not touch-derived.
Descriptions are generated by GPT-4V from the visual image, not from the tactile
signal. So the "touch-language" alignment is really "vision-implied-touch-language" —
any tactile property not visible in the RGB image (true compliance, temperature,
sub-surface texture) cannot be in the label, capping the ceiling of what touch can learn.
- Quality control leans on other black-box models. GPT-4V generates,
Gemini referees, humans correct — reproducibility and bias of this pipeline aren't
quantified (e.g., what fraction of entries each correction stage touched, inter-rater
agreement). The 100,147/101,982 retention implies ~1.8% hard-filtered, but consistency/
completeness correction rates aren't reported.
- Downstream is all classification. "Robot grasping prediction" is
offline success classification on the Feel dataset, not a closed-loop policy — the
paper demonstrates representation quality, not control. No manipulation rollouts.
- Inherited-data ceiling. 90% of the data is TAG; the dataset's diversity
is bounded by TAG's object/scene coverage. The "100k scale" is multi-granularity
language over largely pre-existing tactile observations.
- Curriculum weight schedule is linear and hand-set (
β1,
βmin=0); no sensitivity analysis over the schedule shape.
Why I care
This is squarely on the thesis behind the 2026-06-24 batch: many manipulation predicates
— surface_is_rough, is_hard, material_is_metal,
grasp_is_stable — are not reliably visually evaluable; they live in
touch. BLADE
learns visual predicate classifiers fθ(p): O → {T,F}, but
several of the most useful manipulation predicates would need a tactile O.
Touch100k + TLV-Link is exactly the missing ingredient: a tactile encoder aligned to
language, so a touch observation could in principle feed a language-named predicate
classifier the way BLADE feeds an RGB crop. The robot-grasping result (94.5 linear-probe) is
direct evidence that touch carries a clean grasp_is_stable-style signal.
Two honest caveats keep this adjacent rather than core to BLADE. First, the language
labels are GPT-4V-on-vision, so this dataset can't yet ground the genuinely
non-visual predicates (compliance, temperature) that are the strongest argument for touch —
it's a first rung, not the full ladder. Second, it's a representation/dataset paper with no
planning or closed-loop control. The value to my line of work is: (i) it's a concrete recipe for
a language-queryable tactile encoder that a BLADE-style predicate layer could call, and (ii) its
curriculum trick (grow touch out of a frozen vision encoder via decaying blend weight) is a cheap
way to bootstrap a new sensory modality's predicate classifiers from a vision-pretrained
backbone — useful if I want to add touch predicates without collecting touch-scale data
from scratch.
Quotable
Despite many works involving language, the language presentation of these works is often in the
form of textual classification labels.
— §1 Introduction / p.1
To the best of our knowledge, Touch100k is the first dataset to encompass tactile,
multi-granularity language, and visual modalities at a scale of 100k.
— §1 Introduction / p.2
Initially, the curriculum representation relies heavily on the teacher model due to the limited
capacity of the student model. As pretraining progresses, the student model improves, allowing
for a gradual decrease in the teacher's influence.
— §1 / p.2 (TLV-Link overview)
Related
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work:
- TLV (Touch-Language-Vision Dataset)
— same authors' earlier, smaller touch-language-vision dataset; Touch100k is its
100k-scale, multi-granularity successor.
- TVL
— the closest peer touch-vision-language dataset + alignment method; the head-to-head
comparison in Table 1.
- UniTouch
— the language-capable tactile-binding baseline TLV-Link beats on material and edges on
grasping; same "touch into a CLIP-aligned space" family.
- Octopi
— tactile-language model for object-property reasoning; the LLM-reasoning counterpart to
Touch100k's contrastive-representation framing of touch-language grounding.
- AnyTouch
— unified visuo-tactile representation across sensors; directly addresses Touch100k's
main limitation (GelSight-only transfer).
- Touch-and-Go
— the vision-touch dataset supplying ~90% of Touch100k's observations.
- ImageBind and
LanguageBind
— the multimodal-binding anchors whose contrastive-alignment paradigm Touch100k
instantiates for the touch modality.