UniT: Data Efficient Tactile Representation with Generalization to Unseen Objects

Zhengtong Xu, Raghava Uppuluri, Xinwei Zhang, Cael Fitch, Philip Glen Crandall, Wan Shou, Dongyi Wang, Yu She · Purdue University / University of Arkansas · 2025 · arXiv:2408.06481

One-liner. Train a VQGAN autoencoder on the tactile images from one simple object (an Allen key or a small ball, <20 min of GelSight-mini data) and its encoder becomes a plug-and-play tactile representation that zero-shot generalizes to unseen objects and beats T3, MAE, BYOL, and from-scratch ResNet on pose estimation, classification, and visuo-tactile diffusion-policy learning.

Problem & motivation

Vision-based tactile sensors (GelSight) carry rich contact geometry, in-hand pose, and force-distribution information that pure RGB / point-cloud inputs miss in contact-rich manipulation. But existing tactile representation learners are either task-specific or, like the scaling-oriented T3 and masked-autoencoder approaches, demand large multi-sensor / multi-task tactile datasets. The authors ask whether a single, simple object can yield a tactile representation that (1) generalizes, (2) preserves as much of the rich tactile signal as possible, and (3) transfers zero-shot across many downstream tasks. The key insight motivating the method: tactile images have a much more compact, low-variance color distribution than natural images (the imaging process discards object/background color, keeping only contact geometry and marker shear), so a VQ-regularized latent should be learnable from very little data.

Method

Backbone — VQGAN as a tactile autoencoder. UniT repurposes VQGAN (a VQVAE generator + patch-based discriminator, originally for high-res image synthesis, Fig 1) purely as an autoencoder for tactile image compression. A CNN encoder E maps a tactile image x ∈ R^{H×W×3} to a latent z = E(x) ∈ R^{h×w×c} through a quantization layer; a CNN decoder G reconstructs x̂; a patch-based discriminator D supplies an adversarial real/fake loss. The whole thing is trained self-supervised (reconstruction + VQ + adversarial). The authors argue VQ regularization yields a structured, low-variance latent that amplifies salient contact features and suppresses background color variance (Fig 6 visualizes CNN-vs-VQGAN latents: without VQ the latent still mirrors the input color distribution; with VQ it becomes a distinct, dichotomous representation).
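To make the quantization step concrete for myself, here is a minimal NumPy sketch of the nearest-codebook lookup that a VQ layer performs on the latent z. It is not the authors' code: the channel count c = 4, the codebook size K = 32, and the function name quantize are assumptions chosen for illustration (only the 16×20 spatial size echoes a rep-dim mentioned in these notes). It covers the forward lookup only; VQ-VAE-style training additionally uses codebook and commitment losses with a straight-through gradient.

```python
import numpy as np

def quantize(z, codebook):
    """Replace each latent vector with its nearest codebook entry.

    z:        (h, w, c) encoder output
    codebook: (K, c) embedding vectors
    returns:  (z_q, indices) with shapes (h, w, c) and (h, w)
    """
    h, w, c = z.shape
    flat = z.reshape(-1, c)  # (h*w, c)
    # Squared Euclidean distance from every latent vector to every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (h*w, K)
    idx = dists.argmin(axis=1)
    z_q = codebook[idx].reshape(h, w, c)
    return z_q, idx.reshape(h, w)

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 20, 4))
codebook = rng.normal(size=(32, 4))
z_q, idx = quantize(z, codebook)
print(z_q.shape, idx.shape)
```

Every vector in z_q is one of only K codebook rows, which is the sense in which the latent becomes low-variance compared with the unquantized encoder output.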

Train with a single simple object. The autoencoder is trained on tactile images from one object — an Allen key (10,854 images) or a small ball (4,831 images), each collected from one GelSight mini in <20 min at 10 Hz by continuously varying contact configuration and force. Both objects lack surface texture; the small ball is the extreme case (no texture, edges, sharp shape; omnidirectionally symmetric), so generalization from it is the hardest test of the approach.
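A quick sanity check of the "<20 min at 10 Hz" figure against the image counts above, assuming continuous recording with no dropped frames (my assumption; the helper name collection_minutes is mine):

```python
RATE_HZ = 10  # recording frequency stated in the notes

def collection_minutes(num_images, rate_hz=RATE_HZ):
    """Minutes of continuous recording needed to capture num_images frames."""
    return num_images / rate_hz / 60

for name, n in {"Allen key": 10_854, "small ball": 4_831}.items():
    print(f"{name}: {collection_minutes(n):.1f} min")
```

Under that assumption the Allen key set corresponds to roughly 18 minutes of recording and the small ball set to roughly 8, so both counts are consistent with the stated bound.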

Decoder head for downstream tasks. After training, the decoder and discriminator are discarded; the frozen (or fine-tuned) encoder is connected to lightweight decoder blocks built from Conv2D, GroupNorm, and SEBlock modules (Fig 2) feeding perception or policy heads. The paper shows that freezing the UniT encoder and training only the decoder head matches full fine-tuning. For policy learning, UniT plugs in as the tactile encoder of a diffusion policy-style visuo-tactile imitation pipeline, identical to baselines except for the tactile encoder.
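For reference, a NumPy sketch of a standard squeeze-and-excitation (SE) block, the least familiar of the three modules named above. The notes give no configuration for the SEBlock, so the reduction ratio r = 4, the weight shapes, and the function name se_block are assumptions; this follows the usual SE formulation and may differ from what the authors implemented.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation over a (C, H, W) feature map.

    w1: (C // r, C), b1: (C // r,)  bottleneck projection
    w2: (C, C // r), b2: (C,)       expansion back to C channels
    """
    s = x.mean(axis=(1, 2))                    # squeeze: global average pool, (C,)
    e = np.maximum(w1 @ s + b1, 0.0)           # excitation, ReLU bottleneck
    g = 1.0 / (1.0 + np.exp(-(w2 @ e + b2)))   # sigmoid gate per channel, in (0, 1)
    return x * g[:, None, None]                # rescale each channel by its gate

# Hypothetical sizes: C = 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.normal(size=(C, 5, 6))
w1, b1 = rng.normal(size=(C // r, C)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)
```

The block only reweights channels, so the output has the same shape as the input and never exceeds it in magnitude.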

Setup

Datasets / benchmarks: Self-collected GelSight-mini tactile datasets (Allen key 10,854 imgs; small ball 4,831 imgs) for representation training. Downstream: USB-plug 3D pose (quaternion) and 6D pose (position + quaternion) estimation, self-collected with OptiTrack ground truth (10,111 train / 1,229 test images, 9:1 episode split); YCB-Sight tactile classification (6 objects, 240 contact images, 8:2 split); three real Aloha manipulation tasks (Chicken Legs Hanging, Chips Grasping, Allen Key Insertion); a simulated peg-insertion task built on TacSL (30 demos via SpaceMouse).
Hardware / simulator: GelSight Mini sensors with markers (three sensors used for the reconstruction generalization tests). Aloha low-cost dual-arm platform, each gripper carrying a GelSight Mini; bimanual 14-DoF / single-arm 7-DoF action spaces, 10 Hz rollout, 24 diffusion denoising steps. Simulated peg insertion uses TacSL with two camera views + one GelSight image.
Baselines: T3 (state-of-the-art tactile transformer; Tiny/Small/Medium/Large, pretrained on large tactile data), MAE (ViT-Tiny/Base, mask ratios 0.25/0.5/0.75), BYOL, ResNet from scratch and ResNet with ImageNet pretraining; for policy: vision-only diffusion policy and visual-tactile diffusion policy with from-scratch tactile encoder.
Compute: not reported (model sizes given: UniT rep-dim 16×20 is 79.81 M params vs MAE ViT-Base 143.20 M; training schedule 500 epochs for policies, rollouts every 25 epochs).
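On the "9:1 episode split" for the pose data: I read this as splitting by episode rather than by image, so that frames from one contact sequence never land on both sides. The image counts fit that reading, since 10,111 / (10,111 + 1,229) is about 0.89 rather than exactly 0.9, as expected when episodes have unequal lengths. A minimal sketch of such a split, assuming each sample carries an episode id (the data layout and the function name split_by_episode are hypothetical, not the authors' pipeline):

```python
import random
from collections import defaultdict

def split_by_episode(samples, train_frac=0.9, seed=0):
    """Split (episode_id, item) pairs so that no episode appears in both sets."""
    by_episode = defaultdict(list)
    for episode_id, item in samples:
        by_episode[episode_id].append(item)
    episodes = sorted(by_episode)
    random.Random(seed).shuffle(episodes)      # reproducible shuffle of episode ids
    n_train = round(train_frac * len(episodes))
    train_ids = set(episodes[:n_train])
    train = [(e, it) for e, it in samples if e in train_ids]
    test = [(e, it) for e, it in samples if e not in train_ids]
    return train, test

# Toy data: 20 episodes of 5 frames each.
samples = [(ep, f"frame_{ep}_{i}") for ep in range(20) for i in range(5)]
train, test = split_by_episode(samples)
print(len(train), len(test))
```

Splitting per image instead would leak near-duplicate consecutive frames into the test set and flatter every method's error.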

Results

USB-plug pose estimation (mean absolute error; lower is better). On 6D pose (Table II), UniT achieves rotation error 0.155 rad and position error 4.8 mm, beating all baselines:

Method (6D pose)	Rotation (rad)↓	Position (mm)↓
BYOL	1.202	11.2
MAE (best, 0.75)	0.284	6.0
T3 (best, Medium)	0.306	5.8
ResNet (pretrain)	0.336	5.8
UniT	0.155	4.8

On 3D pose (Table I), UniT at rep-dim 16×20 reaches 0.128 rad when trained on the Allen key and 0.166 rad when trained on the small ball, the best among all methods, including T3-Large (0.332 rad). Notably, the ball-trained UniT (0.166 rad) still beats T3 and ResNet, which supports the single-simple-object claim.
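For intuition about what a rotation error in radians means for quaternion outputs, here is one common way to compute it: the geodesic angle between the predicted and ground-truth orientations. The notes only say "mean absolute error", so the paper's exact metric may be defined differently; treat this as an illustrative assumption, including the function name rotation_error_rad and the (w, x, y, z) component order.

```python
import numpy as np

def rotation_error_rad(q_pred, q_true):
    """Geodesic angle in radians between two quaternions.

    Inputs are normalized first; the absolute value of the dot product
    handles the fact that q and -q encode the same rotation.
    """
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_true = q_true / np.linalg.norm(q_true)
    dot = abs(float(np.dot(q_pred, q_true)))
    return 2.0 * float(np.arccos(min(dot, 1.0)))  # clamp guards against rounding above 1

identity = [1.0, 0.0, 0.0, 0.0]
quarter_turn_z = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]  # 90 degrees about z
print(rotation_error_rad(identity, quarter_turn_z))
```

On this scale, an error of 0.155 rad corresponds to about 9 degrees.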

Classification (YCB-Sight, Table III). UniT reaches 92.1% in the frozen-encoder (zero-shot transfer) setting, far above MAE (max 81.0%), T3 (max 82.3%), and ResNet (85.4%). After fine-tuning, UniT's 97.3% is comparable to ResNet-pretrain (97.9%) and BYOL (97.5%), so the headline advantage is zero-shot.

Policy learning (Table V, real tasks). Visual-tactile policy with UniT gives the best success on all three real tasks: Chicken Legs Hanging 13/15 one-leg & 9/15 two-legs (vs vision-only 9/15, 6/15), Chips Grasping 14/15, Allen Key Insertion 23/30 total (vs 15/30 vision-only, 17/30 scratch). On simulated peg insertion (Table VI), UniT-Freeze 0.571 / UniT-Fine-Tune 0.575 beat vision-only (0.501), scratch visual-tactile (0.502), and T3 (0.509/0.516).

Where it loses / is only a tie. In the fine-tuned classification and fine-tuned pose regimes, UniT is roughly comparable to, not clearly ahead of, ResNet-with-pretraining and BYOL; the authors attribute the strength of those baselines to the small dataset size. Ablations (Table I) show removing VQ hurts across all rep-dims, and the discriminator generally helps (with a minor exception on the small ball at 8×10).

Limitations & open questions

From the authors:

UniT does not transfer across different sensor types with drastically different image characteristics — it relies on minimal data from a single sensor type, so a new sensor needs new (small) data.
The single-object training caps texture diversity; future work proposes training on a broader set of physical objects (e.g., intricate fabric-like patterns) to enrich fine-texture capture.
Approach is scoped to GelSight-family marker-gel sensors; other sensor families are assumed compatible but untested.

What I noticed reading it:

Real-task results are reported as success counts out of 10–15 trials (e.g., 13/15, 8/10), not rates over multiple seeds. That is a weaker statistical claim than the pose/classification tables, and with N this small the per-task gaps over baselines are noisy.
The central claim ("a single simple object suffices") is shown for two objects (Allen key, small ball) on essentially one downstream-sensor type. Whether the choice of training object matters is only lightly probed — Fig 3/text note the Allen key generalizes better than the ball, but there's no systematic study of which object properties drive the representation.
VQGAN is borrowed wholesale; the paper offers an intuition (compact tactile color distribution) for why VQ helps but no quantitative analysis of the codebook (utilization, size sensitivity) — the mechanism is argued visually (Fig 6) rather than measured.
Everything stays at the level of a learned dense feature; UniT produces a good tactile embedding but no symbolic / interpretable structure (no force or contact-state read-out), so downstream heads must relearn any task-relevant predicate from scratch.
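To put a number on the small-N point in the first note, here is a sketch that computes 95% Wilson score intervals for two of the reported counts (13/15 for UniT vs 9/15 for vision-only on one-leg Chicken Legs Hanging). The choice of the Wilson interval and of z = 1.96 is mine, not something the paper reports.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z = 1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

unit = wilson_interval(13, 15)    # UniT, one-leg hanging
vision = wilson_interval(9, 15)   # vision-only baseline, same task
overlap = unit[0] < vision[1]     # True when the two intervals overlap
print(unit, vision, overlap)
```

The intervals come out at roughly 0.62 to 0.96 and 0.36 to 0.80, so they overlap substantially: the direction of the gap is plausible, but 15 trials per method cannot pin down its size.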

Why I care

Direct relevance to the thesis behind BLADE: many manipulation predicates BLADE would want — is_grasped, is_inserted, is_screwed_tight, surface_is_rough — are not visually evaluable; they live in contact geometry and shear, exactly what a GelSight image encodes. BLADE currently learns predicate classifiers from RGB crops; UniT is a concrete answer to "what feature backbone feeds a tactile predicate classifier," and its data-efficiency story (one object, <20 min) is attractive because collecting per-predicate tactile labels is expensive. UniT's USB-plug 6D-pose result is essentially a learned continuous read-out of in-hand insertion state — the kind of signal BLADE's purely-categorical abstraction layer cannot currently express (a limitation I flagged in BLADE's own "what I noticed"). This makes UniT a candidate perception module for a force/contact-aware extension of BLADE's predicate set.

That said, UniT is a representation-learning paper, not a planning or language paper — it has no symbolic abstraction, no language, and no long-horizon composition. Its relevance is as an upstream sensory encoder, not a method comparison for BLADE's bi-level planning. Within this batch it sits in Cluster B (tactile representation backbones) alongside Sparsh and T3, which it directly outperforms on the shared benchmarks.

Quotable

Can we use a single simple object to learn a data efficient tactile representation of one type of GelSight that 1) possesses generalizability, 2) incorporates as much of the rich information present in tactile images as possible, and 3) can be applied in a zero-shot manner across a variety of downstream tasks involving different objects? — §I Introduction / p.1

In this paper, we demonstrate that VQGAN can serve as a highly effective tactile representation learner that can be trained with minimal data. — §III Background / p.2

Recording images at a frequency of 10 Hz enables the acquisition of such a dataset for a single object for no more than 20 minutes. — §IV-B / p.4

Papers cited that should likely be ingested next:

[10] Zhao et al. — T3: Transferable Tactile Transformers — the SOTA tactile-representation baseline UniT beats; the scale-driven counterpoint to UniT's data-efficiency thesis. (expected PDF)
[21] Higuera et al. — Sparsh — self-supervised multi-sensor touch representations; the other major tactile-foundation backbone in this batch. (expected PDF)
[22] Yang et al. — Binding Touch to Everything (UniTouch) — aligns tactile embeddings to pretrained image/everything embeddings; the multimodal-binding alternative to UniT's pure reconstruction. (expected PDF)
[1] Chi et al. — Diffusion Policy (RSS 2023) — the policy backbone UniT plugs into; shared low-level controller with BLADE.
[4] Yuan, Dong & Adelson — GelSight — the sensor and marker-tracking method underpinning UniT's tactile images. (expected PDF)
[30] Akinola et al. — TacSL — the visuo-tactile sim library used for the simulated peg-insertion benchmark.
[23] Esser et al. — Taming Transformers (VQGAN) / [24] van den Oord et al. — VQVAE — the autoencoder formalism UniT repurposes.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

BLADE — the anchor: UniT is a candidate tactile backbone for the non-visually-evaluable predicates BLADE's abstraction layer can't currently express.
Sparsh, T3, AnyTouch, and MViTac — same Cluster B tactile-representation family; UniT's data-efficiency angle is the contrast to their scaling/multimodal-alignment approaches.
T-DEX and See to Touch — self-supervised tactile pretraining for dexterity; downstream-policy siblings to UniT's policy-learning experiments.
UniTouch and TVL — the touch–language / multimodal-binding line; a route to make a UniT-style encoder language-queryable for predicate grounding.