OmniVTLA: Vision-Tactile-Language-Action Models with Semantic-Aligned Tactile Sensing

Zhengxue Cheng†, Yiqian Zhang, Anni Tang, Keyu Wang, Wenkang Zhang, Haoyu Li, Hengdi Zhang, Li Song · Shanghai Jiao Tong University / Paxini Tech · 2026 (arXiv v3, June 2026) · arXiv:2508.08706

One-liner. OmniVTLA bolts a semantically-aligned tactile encoder onto a π0-style VLA: instead of treating touch as a raw low-level signal, it pretrains a tactile ViT via cross-modal contrastive learning (on a new 135K-sample tri-modal dataset, ObjTac) so that tactile tokens land in the same latent space as vision and language — and that grounding, not just more parameters, is what lifts contact-rich pick-place and peg-insertion success.

Problem & motivation

VLA models inherit CLIP/SigLIP vision backbones that are already contrastively aligned to language, but they rely on vision and language alone and stumble on contact-rich tasks where the decisive feedback lives in touch (material, roughness, hardness, contact state). Prior attempts to bolt touch onto VLAs (Tactile-VLA, VTLA) treat tactile data as a low-level signal and do not align it semantically with vision and language. The paper's wager (Fig 1): the tactile encoder's architecture and alignment objective matter more than raw parameter count, and a vision-language-tactile semantic alignment can be engineered the way CLIP aligned vision and language.

Method

OmniVTLA is built on π0 (Black et al. 2024) with a Gemma-2.6B backbone, a PaliGemma text tokenizer, a SigLIP-400M image encoder, and a flow-matching action expert. The action model targets p(A_t | o_t) over an action chunk of length H=50. The VTLA variant extends the π0 observation with tactile tokens: A_t ~ M(fφ(I_t), fθ(T_t), l_t), where I_t is the image observation, l_t the language instruction, and T_t the tactile data remapped to a ViT-compatible 3-channel image via max-min force normalization and tensor reshaping. Stitched/multi-sensor inputs are resized to 224×224, yielding 256 tokens.
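The tactile-to-image remapping and the 256-token figure can be sketched as follows. This is a minimal illustration, not the paper's code: the per-taxel (fx, fy, fz) layout, the per-channel normalization, and the patch size of 14 are all assumptions, chosen because a 14-pixel patch is what makes a 224×224 input yield 256 tokens.

```python
def min_max_normalize(values):
    """Scale a flat list of force readings to [0, 1] (max-min normalization)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant signal: avoid dividing by zero
    return [(v - lo) / span for v in values]


def to_three_channel(forces_xyz):
    """Turn per-taxel (fx, fy, fz) readings into three normalized channels.

    Assumption: each force axis becomes one image channel and is normalized
    independently. The notes only say "max-min force normalization".
    """
    channels = zip(*forces_xyz)  # regroup readings by axis
    return [min_max_normalize(list(channel)) for channel in channels]


def vit_token_count(image_size, patch_size):
    """Number of non-overlapping square patches a ViT sees."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


PATCH_SIZE = 14  # assumption: consistent with 224x224 -> 256 tokens
tokens = vit_token_count(224, PATCH_SIZE)
channels = to_three_channel([(0.0, 0.0, 1.0), (2.0, 4.0, 3.0)])
print(tokens, channels)
```

Reshaping the three normalized channels into a 2D grid and resizing to 224×224 is left out here because it depends on the sensor's taxel layout, which the notes do not specify.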

Dual-encoder tactile path (Fig 2). The core design is a dual-ViT tactile encoder: one path is the SigLIP image encoder reused on the tactile-as-image input (knowledge transfer from large-scale vision pretraining), the other is a Semantic-Aligned tactile ViT (SA-ViT) trained explicitly via cross-modal contrastive learning to push tactile embeddings toward the matching vision and text concepts. This dual path addresses the heterogeneity between vision and touch and across different tactile sensors (Touch-and-Go, SSVTP, ObjectFolder, ObjTac all look different).

SA-ViT training. Training builds on the second-stage alignment pipeline of AnyTouch. Because ObjTac is tri-modal, the paper uses a symmetric all-pairs alignment loss (Eq. 3): L_align = α_VL(L_{V→L}+L_{L→V})/2 + α_VT(L_{V→T}+L_{T→V})/2 + α_TL(L_{T→L}+L_{L→T})/2, where V, L, and T denote vision, language, and tactile, and each α weights one modality pair. A cross-sensor matching loss (binary cross-entropy) is added so that embeddings align across sensors. The SA-ViT has an encoder/decoder structure; alignment grounds tactile signals (material, roughness, hardness) in visual and linguistic context.
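The structure of Eq. 3 can be sketched in a few lines. This is a hedged illustration only: it assumes each directional term is a standard InfoNCE loss over cosine similarities with a fixed temperature, which the notes do not confirm, and the function names and the 0.07 temperature are hypothetical. The cross-sensor matching loss is omitted.

```python
import math


def normalize(vec):
    """L2-normalize one embedding."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Directional contrastive loss: queries[i] should match keys[i].

    Assumption: InfoNCE with cosine similarity; the paper's exact form may differ.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # cross-entropy against the true pair
    return total / len(q)


def pair_loss(a, b, temperature=0.07):
    """Symmetric term (L_{A->B} + L_{B->A}) / 2 for one modality pair."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def alignment_loss(vision, language, tactile, a_vl=1.0, a_vt=1.0, a_tl=1.0):
    """All-pairs alignment loss with one weight per modality pair."""
    return (a_vl * pair_loss(vision, language)
            + a_vt * pair_loss(vision, tactile)
            + a_tl * pair_loss(tactile, language))


vision = [[1.0, 0.0], [0.0, 1.0]]
language = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(vision, language, tactile=[[1.0, 0.0], [0.0, 1.0]])
shuffled = alignment_loss(vision, language, tactile=[[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In the toy batch, swapping the two tactile embeddings breaks the vision-tactile and tactile-language pairings, so the loss rises while the vision-language term stays unchanged.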

ObjTac dataset. 56 objects across 10 material categories (plastic, glass, wood, brick, metal, fabric, leather, ceramic, paper, others), labeled by surface roughness (rough/smooth) and hardness (rigid/soft). Sensor is a Paxini Gen2 magnetic Hall-effect tactile sensor that recovers 6D pose of embedded magnets and inverts to a multi-dimensional force distribution. Per object: 2–5 interaction trials, 10–60 s each at 60 Hz, plus 720p first-person video at 30 FPS, temporally synchronized by timestamp. Totals: 135K tri-modal samples, 270K force recordings, 252 video sequences. Tri-modal text annotations give object name, material, roughness, hardness, and descriptions.
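The dataset totals can be cross-checked against the stated sampling rates. The sketch below assumes one tri-modal sample per video frame and one force recording per 60 Hz tactile tick, and treats "135K" and "270K" as rounded counts; neither assumption is stated in the notes, and the 2:1 ratio could equally come from two sensors per frame.

```python
TRI_MODAL_SAMPLES = 135_000  # rounded, from the notes
FORCE_RECORDINGS = 270_000   # rounded, from the notes
VIDEO_SEQUENCES = 252
OBJECTS = 56
VIDEO_FPS = 30
TACTILE_HZ = 60

# Assumption: samples are counted per video frame, force recordings per tactile tick.
video_seconds = TRI_MODAL_SAMPLES / VIDEO_FPS
tactile_seconds = FORCE_RECORDINGS / TACTILE_HZ

mean_sequence_seconds = video_seconds / VIDEO_SEQUENCES
sequences_per_object = VIDEO_SEQUENCES / OBJECTS

print(f"total video time:      {video_seconds / 60:.0f} min")
print(f"total tactile time:    {tactile_seconds / 60:.0f} min")
print(f"mean sequence length:  {mean_sequence_seconds:.1f} s")
print(f"sequences per object:  {sequences_per_object:.1f}")
```

Under these assumptions both streams cover the same total duration, the mean sequence length falls inside the stated 10–60 s range, and the sequences-per-object average falls inside the stated 2–5 trials.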

Action representation. For two-finger grippers: 10 tokens (3 relative positions, 6 relative angles, 1 gripper state). For the four-finger hand: 25 tokens (3 relative positions, 6 relative angles, 16 absolute joint positions).
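The two action layouts can be written down as a small bookkeeping structure. This is an illustrative sketch: the class name, the field names, and the ordering of components inside the vector are assumptions; only the component sizes and the chunk length H=50 come from the notes.

```python
from dataclasses import dataclass

CHUNK_LENGTH = 50  # H, the action-chunk length from the Method section


@dataclass(frozen=True)
class ActionLayout:
    position: int      # relative end-effector positions
    rotation: int      # relative angle terms
    end_effector: int  # gripper state, or absolute hand joint positions

    @property
    def dim(self) -> int:
        return self.position + self.rotation + self.end_effector

    def split(self, action):
        """Split one flat action into (position, rotation, end_effector).

        Assumption: components are stored in this order.
        """
        if len(action) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(action)}")
        p_end = self.position
        r_end = p_end + self.rotation
        return action[:p_end], action[p_end:r_end], action[r_end:]


GRIPPER = ActionLayout(position=3, rotation=6, end_effector=1)
HAND = ActionLayout(position=3, rotation=6, end_effector=16)

for name, layout in (("gripper", GRIPPER), ("hand", HAND)):
    print(name, "chunk shape:", (CHUNK_LENGTH, layout.dim))
```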

Setup

Datasets / benchmarks: ObjTac (their tri-modal tactile dataset, 135K samples, 56 objects); tactile-perception evaluation also on Touch-and-Go. Real-world manipulation: pick-and-place (gripper + dexterous hand) and peg-insertion (square / round / triangular, clearances 0.9 / 1.2 / 1.8 mm). Soft-object generalization on Sponge and Towel.
Hardware / simulator: UR5 arm; a two-finger jaw gripper with two Paxini tactile sensors + wrist camera; an 11-tactile-sensor four-finger dexterous hand + wrist camera; a base camera (Fig 4). All real-robot, no simulator.
Baselines: π0 (the vanilla VLA, no tactile), Diffusion Policy (DP) as a from-scratch VA baseline, and internal tactile-encoder ablations: VTLA-FS (tactile from scratch), VTLA-Pre (tactile = pretrained vision encoder), VTLA-SA (semantic-aligned, single path), VTLA-Pre-Pre (dual SigLIP+CLIP, parameter-matched anchor), VTLA-Pre-Tac (dual SigLIP+Tac-ViT w/o align), and OmniVTLA (dual SigLIP+SA-ViT). Tactile perception baselines: CLIP, SigLIP, Tac-ViT (w/o align), AnyTouch.
Compute: NVIDIA A100 80GB; VTLA/π0 fine-tuned 30K steps, batch 32, 2.5e-5 peak LR (1K warmup, cosine decay to 2.5e-6); DP/VTA trained from scratch 200K steps, LR 1e-4 (Table 10).
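The fine-tuning learning-rate schedule described in the Compute line can be sketched as below. The numbers (2.5e-5 peak, 1K warmup, 30K steps, 2.5e-6 floor) are from the notes; the linear warmup from zero and the choice to spread the cosine decay over the remaining 29K steps are assumptions, and the function name is hypothetical.

```python
import math


def lr_at(step, peak=2.5e-5, final=2.5e-6, warmup=1_000, total=30_000):
    """Learning rate at a given step: warmup, then cosine decay to `final`.

    Assumptions: warmup is linear from 0, and the decay spans steps
    `warmup`..`total`. The paper's exact schedule may differ.
    """
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 15_500, 30_000):
    print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```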

Results

Headline real-world numbers: on pick-and-place with the gripper, OmniVTLA reaches 96.9% average SR vs 75.0% for π0 (+21.9 percentage points); with the dexterous hand, 100% vs 93.8% (+6.2 points); on peg-insertion, 83.3% vs 50.0% (+33.3 points). It also cuts completion time (657→498 steps on gripper pick-and-place) and smooths trajectories.

Pick-and-place, two-finger gripper (Table 3):

Model	Tactile enc.	SR % (avg)	CT steps (avg)
VLA (π0)	none	75.0	657
VTLA-FS	from scratch	81.2	537
VTLA-Pre	pretrained vision	84.4	586
VTLA-SA	semantic-aligned	87.5	484
VTLA-Pre-Pre	SigLIP+CLIP	81.3	501
VTLA-Pre-Tac	SigLIP+Tac-ViT	84.4	550
OmniVTLA	SigLIP+SA-ViT	96.9	498
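The Table 3 rows can be turned into deltas against the π0 baseline with a few lines of arithmetic. The values below are copied from the table; the helper name is hypothetical. SR gains are absolute percentage points, while the CT reduction is expressed relative to the baseline.

```python
# (success rate %, completion time in steps), copied from Table 3 above
TABLE_3 = {
    "VLA (pi0)": (75.0, 657),
    "VTLA-FS": (81.2, 537),
    "VTLA-Pre": (84.4, 586),
    "VTLA-SA": (87.5, 484),
    "VTLA-Pre-Pre": (81.3, 501),
    "VTLA-Pre-Tac": (84.4, 550),
    "OmniVTLA": (96.9, 498),
}


def deltas_vs_baseline(model, baseline="VLA (pi0)"):
    """Return (SR gain in points, relative CT reduction in %) vs the baseline."""
    sr, ct = TABLE_3[model]
    base_sr, base_ct = TABLE_3[baseline]
    return sr - base_sr, 100.0 * (base_ct - ct) / base_ct


for model in TABLE_3:
    sr_gain, ct_cut = deltas_vs_baseline(model)
    print(f"{model:<13} SR {sr_gain:+5.1f} pts   CT {ct_cut:+5.1f} %")

best_sr = max(TABLE_3, key=lambda m: TABLE_3[m][0])
fastest = min(TABLE_3, key=lambda m: TABLE_3[m][1])
print("best SR:", best_sr, "| lowest CT:", fastest)
```

The split between the two winners is worth noting: OmniVTLA has the highest SR, but the single-path VTLA-SA has the lowest completion time.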

Other findings:

Tactile perception (Table 2): SA-ViT beats CLIP/SigLIP on material/roughness/hardness on ObjTac (e.g. material 70.44 vs CLIP 54.64, SigLIP 49.15) while staying near baseline on Touch-and-Go. It does not uniformly beat AnyTouch on Touch-and-Go (AnyTouch leads on material 79.39 and hardness 95.16 there), but wins on the in-domain ObjTac set.
Architecture > parameters: VTLA-Pre-Pre and VTLA-Pre-Tac are parameter-matched to OmniVTLA but underperform it on SR, CT, and smoothness — the gain is attributed to the SA-ViT encoder, not parameter count.
DP baseline (Table 5): adding tactile to DP lifts avg SR by 18.7 points (59.4→78.1) and cuts CT by 19.9%: "tactile universally helps."
Peg-insertion (Table 6): OmniVTLA 83.3% avg SR (including 90% on the unseen round shape) vs π0 50.0%.
Where it loses: offline-validation MSE (Fig 5) is led by VTLA-Pre-Tac, with OmniVTLA only second-best. On soft-object SR (Table 8), VTLA-FS/VTLA-Pre hit 93.8% vs 81.3% for OmniVTLA, but the paper argues that VTLA-FS's high SR comes from crushing the object with a tight gripper closure, while VTLA-SA/OmniVTLA win on the max-gripper-width metric (less deformation). On smoothness (Table 7), VTLA-SA, not OmniVTLA, is best.

Limitations & open questions

From the authors:

Training data used repeated-attempt demos; the authors found that single-attempt data sufficed for pick-place and that only contact-rich tasks needed multiple attempts, which suggests the tactile benefit is task-specific.
No improvement in messy clutter, where success depends on visual navigation rather than tactile feedback.
Future work: leverage tactile without introducing visual ambiguity, more efficient tactile representations, and temporally-dynamic fusion architectures.

What I noticed reading it:

Small-N real-robot statistics. Gripper pick-place is 32 rollouts and the dexterous hand only 16; peg-insertion is 10 trials/shape. SR deltas of a few points ride on a handful of trials (the dexterous-hand gap of 100% vs 93.8% is a single rollout out of 16), and no variance or seeds are reported for the real-world tables.
The best model isn't best on the diagnostics. OmniVTLA is second on offline MSE (behind VTLA-Pre-Tac) and second on smoothness (behind VTLA-SA). So the dual-encoder + alignment story is supported on end-task SR but not cleanly on the intermediate metrics — the single-path VTLA-SA is a surprisingly strong, simpler competitor.
Tactile-as-image is an assumption. Force fields are reshaped into a 3-channel ViT image; whether magnetic Hall-effect force distributions behave enough like the natural images SigLIP saw during pretraining, or even like optical GelSight-style tactile images, for that transfer to hold is asserted, not measured.
Alignment vs domain. SA-ViT wins on ObjTac (in-domain) but not on Touch-and-Go; the contrastive alignment may be partly fitting the Paxini sensor distribution rather than a sensor-agnostic semantic space, despite the cross-sensor loss.
Sponge/Towel inversion. Reporting SR and max-gripper-width and then arguing the high-SR baseline "cheats" by crushing is a fair point, but it means SR alone is the wrong metric for soft objects — an honest wrinkle the headline numbers don't carry.
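To put the small-N point above in numbers, here is a sketch that computes 95% Wilson score intervals for the headline success rates. The success counts are not reported in these notes; they are inferred from the percentages and rollout counts (96.9% of 32 is 31, 75.0% of 32 is 24, 93.8% of 16 is 15), so treat them as an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (successes, rollouts): inferred from reported SR and rollout counts
RESULTS = {
    "gripper / pi0": (24, 32),
    "gripper / OmniVTLA": (31, 32),
    "hand / pi0": (15, 16),
    "hand / OmniVTLA": (16, 16),
}

for name, (k, n) in RESULTS.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name:<20} {k}/{n}  95% CI [{lo:.3f}, {hi:.3f}]")

pi0_lo, pi0_hi = wilson_interval(24, 32)
omni_lo, omni_hi = wilson_interval(31, 32)
print("gripper intervals overlap:", pi0_hi > omni_lo)
```

Even for the largest headline gap on pick-and-place, the two gripper intervals overlap slightly under this approximation, which is why per-task variance or repeated seeds would strengthen the real-world tables.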

Why I care

OmniVTLA is a clean data point for the thesis that many manipulation predicates are not visually evaluable — is_grasped, is_inserted, surface_is_rough, is_soft live in force/contact, exactly the signals OmniVTLA shows decide pick-place and peg-insertion. Relative to BLADE: BLADE's contact predicates (turned-on(faucet), is-grasped) are learned as visual classifiers, and BLADE itself flags that gripper-state segmentation can't handle caging grasps or contact-rich tasks, and that continuous force sits opaquely inside the diffusion policy. OmniVTLA is on the opposite end of the structure-vs-scale axis — a monolithic flow-matching VLA with no symbolic abstraction layer — but its tactile encoder is precisely the kind of grounded perceptual front-end a force-aware predicate would need. The provocative connection: could SA-ViT-style semantically-aligned tactile embeddings serve as the classifier input for tactile predicates (is_inserted, is_screwed_tight) in a BLADE-style bilevel planner, giving the abstraction layer access to non-visual state? The "architecture > parameters" result also rhymes with BLADE's "structure as prior" stance against pure end-to-end scaling.

Caveat on relevance: OmniVTLA contributes no planning, no compositionality, and no predicate learning — it is a perception/representation + end-to-end-policy paper. Its value to my line is as a tactile grounding module, not as a planning method.

Quotable

Existing attempts to incorporate touch into VLA frameworks often treat tactile data as low-level signals, failing to align them semantically with visual and linguistic contexts. — §1 Introduction / p.1

This indicates the performance gains are attributed to the choice of tactile encoder architecture rather than the parameter count. — §4.2 Real-World Pick and Place Results / p.8

… smoother trajectories that adhere to the intuitive principle of “move quickly when clear, only slow down during contact approach.” — §1 / p.2

Papers cited that should likely be ingested next:

Black et al. 2024 — π0 (arXiv:2410.24164) — the flow-matching VLA backbone OmniVTLA is built on and the primary baseline. Foundational dependency; not in the batch cross-ref list.
Zhang et al. 2025b — VTLA (insertion VLA, arXiv:2505.09577) and Hao et al. 2025 — TLA (Tactile-Language-Action, arXiv:2503.08548) — the direct VTLA/TLA predecessors in Table 1.
Yu et al. 2024 — Octopi — cited as the pioneer tactile-language object-property work; in this batch as octopi_object_property_reasoning_tactile_language.

Newly ingested in the 2026-06-24 batch — directly relevant:

AnyTouch — OmniVTLA's SA-ViT directly reuses AnyTouch's second-stage alignment pipeline and is benchmarked against it; the closest methodological parent.
Tactile-VLA and ForceVLA — sibling Cluster-C tactile/force VLA policies; OmniVTLA frames itself against exactly these "tactile-as-low-level-signal" approaches.
TaF-VLA and VLA-Touch — other tactile-aligned / tactile-feedback VLAs in the same cluster; direct comparison candidates on contact-rich SR.
Octopi and UniTouch — touch-language grounding / tactile-to-multimodal-binding precursors that motivate SA-ViT's contrastive alignment.
ObjectFolder 2.0 and Touch and Go — the heterogeneous tactile datasets shown in Fig 2/Table 2 that ObjTac is positioned alongside; Touch-and-Go is a direct eval set here.