CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding

Wenxuan Ma*, Xiaoge Cao*, Yixiang Zhang, Chaofan Zhang, Shaobo Yang, Peng Hao, Bin Fang, Yinghao Cai, Shaowei Cui†, Shuo Wang · Institute of Automation CAS / Beihang / BUPT / Samsung Research China · 2025 · arXiv preprint (cs.RO, 13 May 2025) · arXiv:2505.08194

One-liner. CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pre-aligned CLIP vision-language space, yielding a tactile encoder that supports zero-shot contact-state classification and feeds an LLM that can reason over — and emit corrective adjustments about — what the sensor is touching.

Problem & motivation

Existing tactile–language alignment work (TVL, Touch100K, UniTouch, TacQuad) operates on 2D tactile images and describes mostly superficial material attributes ("hard", "smooth", texture), which the authors argue is insufficient for contact-rich manipulation: what a controller needs is contact-state information — where on the gel the contact is, how deep the press, the geometry of the contact patch, the force. Two stated gaps: (1) robust tactile–language alignment needs sensor-agnostic representations that transfer across hardware (optical vs resistive) and across sim2real; (2) there is no language-annotated dataset for contact states (location, area, depth) as opposed to texture labels. CLTP targets the contact-geometry layer of tactile semantics that prior touch-language models skip.

Method

Two pieces: a dataset (TCL3D) and a CLIP-style alignment recipe (CLTP).

Representation choice. Rather than a tactile image, CLTP uses the 3D contact-deformed point cloud read off the sensor gel (1024 points sampled). This is the load-bearing modality decision: a 3D deformation field captures contact location/depth/geometry directly and is posited to transfer across sensor types better than raw images.

Frozen-CLIP bridge. CLTP trains only a 3D tactile encoder E_T; the image encoder E_I and text encoder E_L are the frozen, pre-aligned vision-language encoders from CLIP (the model is built on ULIP-2). For each contact the tactile point cloud is pulled toward both (i) the text feature of its generated contact-state description and (ii) the image feature of a rendered RGB image of that contact. Training minimizes the sum of two CLIP-style contrastive losses, a tactile-to-language term L_T2L and a tactile-to-image term L_T2I (Eqs. 1–3). The text term injects high-level contact semantics; the image term recovers fine shape/texture detail that text words discretize away.
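A minimal sketch of what a two-term objective of this kind could look like, in plain Python. The notes only record that the loss is a sum of two CLIP-style contrastive terms; the symmetric InfoNCE form, the temperature of 0.07, and all function names below are assumptions for illustration, not the paper's Eqs. 1–3.

```python
import math

def normalize(v):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys in the batch."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)

def symmetric_contrastive(a, b, temperature=0.07):
    """CLIP-style symmetric loss (assumed form): average of a->b and b->a InfoNCE."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def cltp_style_loss(tactile, text, image, temperature=0.07):
    """Sum of a tactile-to-language and a tactile-to-image term.

    `text` and `image` stand in for outputs of the frozen encoders;
    only the encoder producing `tactile` would receive gradients.
    """
    l_t2l = symmetric_contrastive(tactile, text, temperature)
    l_t2i = symmetric_contrastive(tactile, image, temperature)
    return l_t2l + l_t2i
```

With a batch whose tactile, text, and image features already line up row by row, the loss is close to zero; shuffling the text rows against the tactile rows drives it up, which is the signal that would train the tactile encoder.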

TCL3D dataset construction. Each sample is a triplet (tactile 3D point cloud, language description, rendered contact image). Tactile point clouds come mostly from the TACTO simulator over PyBullet (cheap, scalable), with real samples from GelStereo and GelSight Mini sensors to verify sim2real. Contact state is annotated along five discretized dimensions: contact shape (19 categories, e.g. spherical/cylindrical), texture (5, e.g. smooth/ridged), depth (4, e.g. slight/moderate/deep), position (9, e.g. top-left/center), and area (5, e.g. tiny/small/medium). Shape and texture are characterized by GPT-4o from the rendered image; position, depth, and area are computed directly from the point cloud. A template prompt composes these into sentences such as "a [Texture] [Shape] object, pressed [Depth] in [Position] with [Area] contact area." Images for each contact are synthesized by meshing the tactile point cloud and rendering from the sensor's view.
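The annotation step can be sketched roughly as follows. The sentence template is the one quoted above; the 3x3 grid binning rule, the grid label names other than "top-left" and "center", and both function names are hypothetical, since these notes do not record how the paper computes position from the point cloud.

```python
# Hypothetical 3x3 label grid; only "top-left" and "center" appear in the notes.
POSITION_GRID = [
    ["top-left", "top-center", "top-right"],
    ["middle-left", "center", "middle-right"],
    ["bottom-left", "bottom-center", "bottom-right"],
]

def position_label(x, y, width, height):
    """Bin a contact centroid into a 3x3 grid (assumes image-style axes, y growing downward)."""
    col = min(int(3 * x / width), 2)   # clamp so x == width lands in the last column
    row = min(int(3 * y / height), 2)  # clamp so y == height lands in the last row
    return POSITION_GRID[row][col]

def describe_contact(texture, shape, depth, position, area):
    """Fill the template prompt with one label per annotated dimension."""
    return (
        f"a {texture} {shape} object, pressed {depth} "
        f"in {position} with {area} contact area."
    )

# Usage: a centroid in the middle of a 30 x 30 sensing surface (units are illustrative).
sentence = describe_contact(
    texture="smooth",
    shape="spherical",
    depth="deep",
    position=position_label(15.0, 15.0, 30.0, 30.0),
    area="small",
)
```

Keeping the geometric labels (position, depth, area) as deterministic functions of the point cloud, as the paper does, means only shape and texture depend on a vision-language model's judgment.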

Downstream uses. The frozen pretrained tactile encoder is applied three ways: (a) zero-shot 3D contact-state classification via cosine similarity to text prompts per dimension; (b) supervised contact-state classification (freeze encoder, finetune per-attribute MLP heads); (c) Tac3D-LLM — an MLP aligns tactile features to LLM tokens (Qwen2.5-VL-3B backbone, LLaVA-style captioning finetune) so the LLM can do tactile question-answering and multimodal reasoning over contact state.
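Use (a) amounts to a nearest-prompt lookup per dimension. A minimal sketch, assuming the tactile feature and the per-label text features have already been computed by the respective encoders; the function names and the nested-dict layout are illustrative, not the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(tactile_feature, prompt_features):
    """Pick, for each dimension, the label whose text feature is closest to the tactile feature.

    prompt_features maps dimension -> {label: text_feature}.
    """
    return {
        dimension: max(labels, key=lambda name: cosine(tactile_feature, labels[name]))
        for dimension, labels in prompt_features.items()
    }

# Toy 3-d features purely for illustration; real features come from the encoders.
toy_prompts = {
    "depth": {"slight": [1.0, 0.0, 0.0], "deep": [0.0, 1.0, 0.0]},
    "position": {"center": [0.0, 0.0, 1.0], "top-left": [1.0, 1.0, 0.0]},
}
prediction = zero_shot_classify([0.9, 0.1, 0.0], toy_prompts)
```

Because each dimension is scored independently against its own prompt set, no classifier head is trained; the label vocabulary can be changed by swapping the prompts.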

Setup

Results

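Zero-shot contact-state classification (Table 1, accuracy %): the full CLTP beats the no-image ablation on every dimension (Shape 70.1 vs 52.6, Texture 68.8 vs 65.7, Depth 98.3 vs 96.5, Position 94.4 vs 91.9, Area 81.8 vs 67.1). The gains are largest on Shape (+17.5) and Area (+14.7) and small on Texture (+3.1), Position (+2.5), and Depth (+1.8), which is consistent with the tactile-to-image loss supplying fine-grained geometric detail, although its effect on texture is modest.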
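The per-dimension gaps can be tallied directly from the Table 1 numbers quoted above. A small Python sketch (variable names are illustrative):

```python
# Zero-shot accuracy (%) as quoted from Table 1.
full = {"Shape": 70.1, "Texture": 68.8, "Depth": 98.3, "Position": 94.4, "Area": 81.8}
no_image = {"Shape": 52.6, "Texture": 65.7, "Depth": 96.5, "Position": 91.9, "Area": 67.1}

# Gain of the full model over the no-image ablation, rounded to one decimal
# to hide floating-point noise.
gains = {dim: round(full[dim] - no_image[dim], 1) for dim in full}

# Dimensions ordered from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)
```

Sorting by gain makes it easy to see which dimensions lean most on the image term.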

Standard (supervised-head) contact-state classification (Table 2, accuracy %):

Method              Shape (TCL3D / Real)    Texture (TCL3D / Real)    Depth (TCL3D / Real)
Point-BERT          28.7 / 21.9             89.5 / 91.1               64.4 / 52.1
Point-MAE           31.6 / 23.9             90.2 / 89.2               68.1 / 52.1
CLTP (w/o image)    61.2 / 55.3             94.5 / 90.3               94.7 / 77.3
CLTP                84.8 / 71.2             96.1 / 92.7               99.2 / 83.6

Headlines: CLTP reaches 84.8% shape accuracy on TCL3D vs 28.7/31.6 for Point-BERT/Point-MAE, and >95% on texture, position, and area. On real-world sensor data it gets 71.2% shape, surpassing Point-MAE (23.9) by +47.3 points, and the full model beats its own no-image variant (55.3) by +15.9 on real shape. Both the language and the visual supervision help, and the encoder trained mostly on sim transfers to real GelSight/GelStereo data.

The case study (Fig. 4) shows CLTP matching GT on shape/depth/position where Point-BERT/Point-MAE miss.

Tac3D-LLM (Sec. 5.3, Fig. 7) does contact-description and comparative-reasoning QA ("which object is most likely to be this touch point cloud and why?") and generalizes to unseen question types, where the unaligned baselines only answer their training-question type.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is the tactile-language analogue of Octopi for BLADE's missing non-visual predicate layer: BLADE learns visual predicate classifiers, and CLTP is exactly the recipe for a tactile classifier — a learned encoder that maps raw 3D contact geometry to language-aligned contact-state labels. Predicates like is_grasped, is_inserted, or a contact-firmness precondition could in principle be read off a CLTP encoder where vision can't see them. The downstream Tac3D-LLM even closes my "tactile state → reasoned correction" loop on paper: the seed brief frames this cluster as "an LLM can emit corrective adjustments ('slightly reduce gripping force')", and the strawberry-grasping appendix gestures at exactly that.

But it sits on the wrong side of the axis I actually care about. CLTP grounds static contact geometry — one snapshot, five descriptive dimensions — not a dynamic state-change concept. There is no time axis, no "became slippery after wetting", no monitored transition. Compared to BLADE's symbolic predicates with preconditions/effects, CLTP's "state" is a neural/textual descriptor with no transition structure; compared to closed-loop-correction work like REFLECT or Inner Monologue, its correction story is unevaluated. So for the dynamic-state-change idea it is the grasp-stabilization, static-contact-geometry reference point: the strongest existing tactile→language grounding, and a clean illustration of exactly the temporal / transition gap the idea wants to fill. It also pairs with Tactile-VLA (force-word grounding + CoT correction) as the "tactile + language + correction" neighbors that stop short of dynamic state.

Quotable

Existing tactile descriptions remain limited to superficial attributes like texture, neglecting critical contact states essential for robotic manipulation. — Abstract
To the best of our knowledge, this is the first study to align tactile and language representations from the contact state perspective for manipulation tasks, providing great potential for tactile-language-action model learning. — Abstract
Their textual descriptions predominantly focus on material texture evaluation rather than grasping state information critical for contact-rich manipulation tasks—such as contact location, depth, and interaction dynamics. — §2.2, Tactile-Language Pre-training

Related

Papers cited that could be ingested next:

Related ingested papers: