CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding

Wenxuan Ma*, Xiaoge Cao*, Yixiang Zhang, Chaofan Zhang, Shaobo Yang, Peng Hao, Bin Fang, Yinghao Cai, Shaowei Cui†, Shuo Wang · Institute of Automation CAS / Beihang / BUPT / Samsung Research China · 2025 · arXiv preprint (cs.RO, 13 May 2025) · arXiv:2505.08194

One-liner. CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pre-aligned CLIP vision-language space, yielding a tactile encoder that supports zero-shot contact-state classification and feeds an LLM that can reason over — and emit corrective adjustments about — what the sensor is touching.

Problem & motivation

Existing tactile–language alignment work (TVL, Touch100K, UniTouch, TacQuad) operates on 2D tactile images and describes mostly superficial material attributes ("hard", "smooth", texture), which the authors argue is insufficient for contact-rich manipulation: what a controller needs is contact-state information — where on the gel the contact is, how deep the press, the geometry of the contact patch, the force. Two stated gaps: (1) robust tactile–language alignment needs sensor-agnostic representations that transfer across hardware (optical vs resistive) and across sim2real; (2) there is no language-annotated dataset for contact states (location, area, depth) as opposed to texture labels. CLTP targets the contact-geometry layer of tactile semantics that prior touch-language models skip.

Method

Two pieces: a dataset (TCL3D) and a CLIP-style alignment recipe (CLTP).

Representation choice. Rather than a tactile image, CLTP uses the 3D contact-deformed point cloud read off the sensor gel (1024 points sampled). This is the load-bearing modality decision: a 3D deformation field captures contact location/depth/geometry directly and is posited to transfer across sensor types better than raw images.

Frozen-CLIP bridge. CLTP trains only a 3D tactile encoder E_T; the image encoder E_I and text encoder E_L are the frozen, pre-aligned vision-language encoders from CLIP (the model is built on ULIP-2). For each contact the tactile point cloud is pulled toward both (i) the text feature of its generated contact-state description and (ii) the image feature of a rendered RGB image of that contact. Training minimizes the sum of two CLIP-style contrastive losses, a tactile-to-language term L_T2L and a tactile-to-image term L_T2I (Eqs. 1–3). The text term injects high-level contact semantics; the image term recovers fine shape/texture detail that text words discretize away.
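
A minimal sketch of the two-term objective, not the paper's exact Eqs. 1–3. It assumes each term is a symmetric InfoNCE loss over L2-normalized features with a temperature of 0.07; both the symmetric form and the temperature value are assumptions borrowed from common CLIP-style practice. The text and image features stand in for outputs of the frozen E_L and E_I, so only the tactile features would receive gradients in a real training loop. All function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature):
    """One-directional InfoNCE: row i of `queries` should match row i of `keys`."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]  # negative log-softmax at the matching index
    return total / len(q)

def symmetric_contrastive(a, b, temperature):
    """Average of the a-to-b and b-to-a directions (assumed symmetric form)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def cltp_loss(tactile, text, image, temperature=0.07):
    """Sum of the tactile-to-language and tactile-to-image terms."""
    l_t2l = symmetric_contrastive(tactile, text, temperature)
    l_t2i = symmetric_contrastive(tactile, image, temperature)
    return l_t2l + l_t2i

# Toy batch of two contacts with 2-D features (made-up numbers).
tactile = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
image = [[1.0, 0.2], [0.2, 1.0]]

aligned = cltp_loss(tactile, text, image)
shuffled = cltp_loss(tactile, text[::-1], image[::-1])  # pairings deliberately broken
```

On this toy batch the correctly paired features give a much lower loss than the shuffled pairing, which is the behavior the contrastive objective rewards.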

TCL3D dataset construction. Each sample is a triplet (tactile 3D point cloud, language description, rendered contact image). Tactile point clouds come mostly from the TACTO simulator over PyBullet (cheap, scalable), with real samples from GelStereo and GelSight Mini sensors to verify sim2real. Contact state is annotated along five discretized dimensions: contact shape (19 categories, e.g. spherical/cylindrical), texture (5, e.g. smooth/ridged), depth (4, e.g. slight/moderate/deep), position (9, e.g. top-left/center), and area (5, e.g. tiny/small/medium). Shape and texture are characterized by GPT-4o from the rendered image; position, depth, and area are computed directly from the point cloud. A template prompt composes these into sentences such as "a [Texture] [Shape] object, pressed [Depth] in [Position] with [Area] contact area." Images for each contact are synthesized by meshing the tactile point cloud and rendering from the sensor's view.
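
The geometric part of the annotation can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the bin thresholds, the millimetre unit, the contact threshold, the 3x3 grid naming, and every label marked "placeholder" are invented here. Only the bin counts (4 depth, 9 position, 5 area), the example labels, the deepest-point definition of position, and the sentence template come from the summary above.

```python
from bisect import bisect_right

# Labels marked "placeholder" are NOT from the paper; all thresholds are invented.
DEPTH_LABELS = ["slight", "moderate", "deep", "very deep"]        # last is a placeholder
AREA_LABELS = ["tiny", "small", "medium", "large", "very large"]  # last two are placeholders
ROWS = ["top", "middle", "bottom"]   # assumed 3x3 grid giving 9 positions
COLS = ["left", "center", "right"]

DEPTH_EDGES_MM = [0.5, 1.0, 1.5]         # 3 edges -> 4 depth bins (hypothetical)
AREA_EDGES = [0.05, 0.15, 0.30, 0.50]    # contact fraction edges -> 5 bins (hypothetical)
CONTACT_EPS_MM = 0.1                     # points pressed deeper than this count as contact

def annotate(points):
    """points: list of (x, y, depth_mm); x, y normalized to [0, 1], y grows downward."""
    # Position is taken at the deepest contact point, as the paper defines it.
    x, y, depth = max(points, key=lambda p: p[2])
    row = ROWS[min(int(y * 3), 2)]
    col = COLS[min(int(x * 3), 2)]
    position = "center" if (row, col) == ("middle", "center") else f"{row}-{col}"
    contact_fraction = sum(p[2] > CONTACT_EPS_MM for p in points) / len(points)
    return {
        "depth": DEPTH_LABELS[bisect_right(DEPTH_EDGES_MM, depth)],
        "position": position,
        "area": AREA_LABELS[bisect_right(AREA_EDGES, contact_fraction)],
    }

def describe(texture, shape, state):
    """Fill the sentence template; texture and shape would come from GPT-4o."""
    return (f"a {texture} {shape} object, pressed {state['depth']} "
            f"in {state['position']} with {state['area']} contact area.")

# Toy cloud: 2 of 8 points in contact, deepest one near the top-left corner.
cloud = [(0.1, 0.1, 1.2), (0.2, 0.1, 0.8)] + [(0.5, 0.5, 0.0)] * 6
state = annotate(cloud)
sentence = describe("smooth", "spherical", state)
```

Taking position from the single deepest point keeps the rule simple, but it also shows why large flat contacts are noisy: many points tie or nearly tie for the maximum depth.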

Downstream uses. The frozen pretrained tactile encoder is applied three ways: (a) zero-shot 3D contact-state classification via cosine similarity to text prompts per dimension; (b) supervised contact-state classification (freeze encoder, finetune per-attribute MLP heads); (c) Tac3D-LLM — an MLP aligns tactile features to LLM tokens (Qwen2.5-VL-3B backbone, LLaVA-style captioning finetune) so the LLM can do tactile question-answering and multimodal reasoning over contact state.
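
Use (a) can be sketched as a per-dimension nearest-prompt lookup. The snippet assumes the tactile feature and the prompt text features are already computed by the respective encoders; the 3-D feature vectors below are made-up numbers, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(tactile_feature, prompt_features):
    """prompt_features: {dimension: {label: text_feature}} -> {dimension: best label}."""
    return {
        dim: max(labels, key=lambda name: cosine(tactile_feature, labels[name]))
        for dim, labels in prompt_features.items()
    }

# Toy text features for two of the five dimensions (invented values).
prompts = {
    "depth": {"slight": [1.0, 0.0, 0.0], "deep": [0.0, 1.0, 0.0]},
    "position": {"top-left": [0.0, 0.2, 1.0], "center": [1.0, 0.0, 0.1]},
}
feature = [0.1, 0.9, 0.4]  # stand-in for the tactile encoder output of one contact
prediction = zero_shot_classify(feature, prompts)
```

Each dimension is classified independently, so no head is trained: adding a new label only requires encoding one more prompt.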

Setup

Datasets / benchmarks: TCL3D, the paper's own dataset — 52,425 samples (50,860 simulated; 1,450 GelStereo real; 115 GelSight real), 117 objects (62 large, mostly YCB; 55 small pegs / McMaster parts), ~400–500 contact samples per object. Evaluated on zero-shot 3D classification, standard contact-state classification (TCL3D + real-world), and Tac3D-LLM QA.
Hardware / simulator: TACTO tactile simulator + PyBullet for synthesis; real data from GelStereo and GelSight Mini (visuotactile, optical) sensors pressed via a 3D CNC with shaped probes. 1024 points sampled per tactile point cloud.
Baselines: Point-BERT and Point-MAE (3D point-cloud encoders) for classification; an ablated CLTP (w/o image loss); for Tac3D-LLM, PointMAE-LLM and PointBERT-LLM.
Compute: not reported.

Results

Zero-shot contact-state classification (Table 1, accuracy %): the full CLTP beats the no-image ablation on every dimension (Shape 70.1 vs 52.6, Texture 68.8 vs 65.7, Depth 98.3 vs 96.5, Position 94.4 vs 91.9, Area 81.8 vs 67.1). The gains concentrate on shape (+17.5) and area (+14.7), while texture improves by only +3.1, which is consistent with the tactile-to-image loss supplying fine-grained geometric detail more than texture detail.
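
As a quick check on where the image loss helps most, the per-dimension gains can be recomputed from the Table 1 figures quoted above. The variable names are hypothetical; the numbers are exactly those in the text.

```python
# Zero-shot accuracy (%) per dimension: (full CLTP, CLTP without image loss).
zero_shot = {
    "Shape": (70.1, 52.6),
    "Texture": (68.8, 65.7),
    "Depth": (98.3, 96.5),
    "Position": (94.4, 91.9),
    "Area": (81.8, 67.1),
}

# Gain in percentage points from adding the tactile-to-image term.
gains = {dim: round(full - ablated, 1) for dim, (full, ablated) in zero_shot.items()}

# Dimensions ordered from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)
```

Shape and area come out well ahead of the other three dimensions, and depth and position were already above 90% without the image term.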

Standard (supervised-head) contact-state classification (Table 2, accuracy %):

Method	Shape (TCL3D)	Shape (Real)	Texture (TCL3D)	Texture (Real)	Depth (TCL3D)	Depth (Real)
Point-BERT	28.7	21.9	89.5	91.1	64.4	52.1
Point-MAE	31.6	23.9	90.2	89.2	68.1	52.1
CLTP (w/o image)	61.2	55.3	94.5	90.3	94.7	77.3
CLTP	84.8	71.2	96.1	92.7	99.2	83.6

Headlines: CLTP reaches 84.8% shape accuracy on TCL3D vs 28.7/31.6 for Point-BERT/Point-MAE, and >95% on texture, position, and area. On real-world sensor data it gets 71.2% shape, surpassing Point-MAE by 47.3 points, and the full model beats its own no-image variant by 15.9 points on real shape. Language and visual supervision both help, and the encoder trained mostly on sim transfers to real GelSight/GelStereo data. The case study (Fig. 4) shows CLTP matching GT on shape/depth/position where Point-BERT/Point-MAE miss. Tac3D-LLM (Sec. 5.3, Fig. 7) does contact-description and comparative-reasoning QA ("which object is most likely to be this touch point cloud and why?") and generalizes to unseen question types, whereas the unaligned baselines only answer their training-question type.
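
The shape figures in Table 2 also allow a rough look at the sim-to-real gap. The sketch below uses only the shape numbers as transcribed above, reading each pair as (TCL3D, real-world), the reading that the headline figures confirm. Variable names are hypothetical.

```python
# Shape accuracy (%) from Table 2: (TCL3D, real-world).
shape_acc = {
    "Point-BERT": (28.7, 21.9),
    "Point-MAE": (31.6, 23.9),
    "CLTP (w/o image)": (61.2, 55.3),
    "CLTP": (84.8, 71.2),
}

cltp_real = shape_acc["CLTP"][1]

# Margin of the full CLTP over each other method on real-world shape accuracy.
real_margin = {
    name: round(cltp_real - real, 1)
    for name, (_, real) in shape_acc.items()
    if name != "CLTP"
}

# Accuracy lost when moving from TCL3D to real sensor data, per method.
sim_to_real_drop = {
    name: round(sim - real, 1) for name, (sim, real) in shape_acc.items()
}
```

The full model keeps the best real-world shape accuracy, but it also shows the largest absolute drop from TCL3D to real data (13.6 points), so the transfer is strong in relative terms rather than lossless.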

Limitations & open questions

From the authors:

Shape ambiguity: the 19 shape classes overlap (triangle vs peak, sphere vs ellipsoid), causing confusable labels (Fig. 6, Case 0).
The "contact position" is defined as the deepest contact point, which is noisy for large flat contacts (Fig. 6, Case 1).
Real-world data is far smaller than simulated (1,450 + 115 vs 50,860), so sim2real claims rest on a thin real set.
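
To put a number on how thin the real set is, the split reported under Setup works out as follows. This is plain arithmetic on the sample counts already quoted; the variable names are hypothetical.

```python
# Sample counts from the TCL3D description.
simulated, gelstereo_real, gelsight_real = 50_860, 1_450, 115

real = gelstereo_real + gelsight_real
total = simulated + real

real_share_pct = round(100 * real / total, 2)                   # real share of the dataset
sim_per_real = round(simulated / real, 1)                       # simulated samples per real one
gelsight_share_pct = round(100 * gelsight_real / real, 1)       # GelSight share of the real data
```

Real contacts are about 3% of TCL3D, roughly one per 32.5 simulated samples, and GelSight accounts for only about 7% of that real subset.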

What I noticed reading it:

The contact state is entirely static: a single press snapshot annotated on five dimensions. There is no temporal axis — no notion of a contact state change over time (became firmer, started slipping, contact growing as object deforms). The dataset is a set of independent snapshots, not trajectories.
Language is restricted to a fixed template over discretized bins; it is descriptive, not predictive or corrective at the representation level — the "corrective adjustment" capability lives only in the downstream Tac3D-LLM, and only as free-text generation, not as a grounded action signal or a predicate over state.
No closed-loop manipulation experiment in the main paper: classification and QA only. The grasping demo (strawberry positioning / delicate grasping) is deferred to an appendix, so the manipulation payoff is asserted more than evaluated.
"Sensor-agnostic" is argued via the 3D-point-cloud choice but tested only across two optical/visuotactile sensors (GelStereo, GelSight); the harder optical-vs-resistive transfer it motivates is not actually evaluated.

Why I care

Like Octopi, this is a tactile-language candidate for BLADE's missing non-visual predicate layer: BLADE learns visual predicate classifiers, and CLTP is a ready recipe for a tactile one, a learned encoder that maps raw 3D contact geometry to language-aligned contact-state labels. Predicates like is_grasped, is_inserted, or a contact-firmness precondition could in principle be read off a CLTP encoder where vision can't see them. The downstream Tac3D-LLM even closes my "tactile state → reasoned correction" loop on paper: the seed brief frames this cluster as "an LLM can emit corrective adjustments ('slightly reduce gripping force')", and the strawberry-grasping appendix gestures at exactly that.

But it sits on the wrong side of the axis I actually care about. CLTP grounds static contact geometry — one snapshot, five descriptive dimensions — not a dynamic state-change concept. There is no time axis, no "became slippery after wetting", no monitored transition. Compared to BLADE's symbolic predicates with preconditions/effects, CLTP's "state" is a neural/textual descriptor with no transition structure; compared to closed-loop-correction work like REFLECT or Inner Monologue, its correction story is unevaluated. So for the dynamic-state-change idea it is the grasp-stabilization, static-contact-geometry reference point: the strongest existing tactile→language grounding, and a clean illustration of exactly the temporal / transition gap the idea wants to fill. It also pairs with Tactile-VLA (force-word grounding + CoT correction) as the "tactile + language + correction" neighbors that stop short of dynamic state.

Quotable

Existing tactile descriptions remain limited to superficial attributes like texture, neglecting critical contact states essential for robotic manipulation. — Abstract

To the best of our knowledge, this is the first study to align tactile and language representations from the contact state perspective for manipulation tasks, providing great potential for tactile-language-action model learning. — Abstract

Their textual descriptions predominantly focus on material texture evaluation rather than grasping state information critical for contact-rich manipulation tasks—such as contact location, depth, and interaction dynamics. — §2.2, Tactile-Language Pre-training

Papers cited that could be ingested next:

Fu et al. — TVL (Touch-Vision-Language dataset / model) [23] — the 2D tactile-image–language alignment baseline CLTP positions itself against (already in corpus as TVL).
Cheng et al. — Touch100K / TLV-Link [22] — open tactile-visual-text dataset; a direct tactile-language predecessor.
Yang et al. — UniTouch [50] — tactile aligned to visual/textual/auditory modalities.
Feng et al. — TacQuad / AnyTouch [51] — cross-sensor alignment of dynamic contact sequences with vision and text; the closest cited work that does add a temporal axis (worth ingesting for the dynamic-state angle).
Xue et al. — ULIP / ULIP-2 [36, 44] — the point-cloud–image–text alignment framework CLTP is built on.
Radford et al. — CLIP [13] — the frozen vision-language space CLTP distills into.
Zhu et al. — PointCLIP [45, 46] — CLIP-for-3D open-world classification precedent.

Related ingested papers:

Octopi — the static tactile property-reasoning sibling; CLTP is the contact-geometry counterpart, both pre-dynamic.
Tactile-VLA — force-word grounding + CoT correction; the closest "tactile + language + closed-loop correction" neighbor.
BLADE — anchor; CLTP is a candidate tactile-predicate-classifier layer for BLADE's visual-only predicate learning.
REFLECT and Inner Monologue — closed-loop semantic correction; the capability Tac3D-LLM gestures at but does not evaluate.