Vision-Based Tactile Sensing · Data Augmentation · Robot Manipulation
University of Maryland, College Park
ControlTac performs force- and pose-conditioned generation to synthesize millions of realistic tactile images from a single reference, enhancing a wide range of downstream robotic applications.
Vision-based tactile sensing is widely used in perception, reconstruction, and robotic manipulation, yet collecting large-scale tactile data remains costly due to diverse sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data — simulation and free-form tactile generation — often yield unrealistically rendered signals with poor transfer to highly dynamic real-world tasks. We propose ControlTac, a two-stage controllable tactile image generation framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact pose. By grounding generation in these important physical priors, ControlTac synthesizes realistic samples across different sensors while effectively capturing task-relevant variations. Across a series of downstream tasks and real-world experiments, the augmented datasets using our approach consistently improve performance and demonstrate practical utility in dynamic real-world settings.
Free-form generative methods lack physical grounding; simulators suffer from the sim-to-real gap. Our insight is that tactile synthesis should be explicitly conditioned on contact forces and poses, anchored by a single real reference image.
We propose conditioning tactile image synthesis on physically meaningful parameters — 3D contact force vectors and 2D contact pose masks — as the key to achieving both realism and controllable diversity.
We introduce ControlTac, a two-stage framework that decouples force and pose control to avoid representation entanglement, enabling cross-sensor transfer and diverse data synthesis from a single reference image.
We validate ControlTac across object classification, force estimation, pose estimation, real-world weighing, tracking, insertion, and imitation learning, consistently outperforming all baselines.
ControlTac decouples physical priors into contact force and spatial pose, operating sequentially to avoid entangling force-induced deformations with contact geometry changes.
The raw image is background-subtracted and encoded into a compressed latent space via a pre-trained autoencoder. A conditional Diffusion Transformer (with DDIM sampling) takes this latent representation and a target relative force vector ΔF to synthesize force-adjusted tactile images with accurate elastomer deformations.
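To make the sampling step concrete, here is a minimal NumPy sketch of a deterministic DDIM update (eta = 0) applied in latent space. It is not ControlTac's implementation: `eps_model`, its argument order, and the noise schedule `alpha_bars` are hypothetical placeholders, and only the update rule itself follows the standard DDIM formulation.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Estimate the clean latent implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the previous, less noisy timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def sample(eps_model, delta_f, ref_latent, alpha_bars, rng):
    """Run the sampler from Gaussian noise down to a clean latent.

    eps_model(x_t, t, ref_latent, delta_f) is a hypothetical noise predictor
    conditioned on the reference latent and the relative force vector.
    alpha_bars[i] is the cumulative signal level at step i (decreasing in i).
    """
    x = rng.standard_normal(ref_latent.shape)
    for i in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[i]
        a_prev = alpha_bars[i - 1] if i > 0 else 1.0  # final step lands on a clean latent
        eps = eps_model(x, i, ref_latent, delta_f)
        x = ddim_step(x, eps, a_t, a_prev)
    return x
```

In a real pipeline the returned latent would be passed through the autoencoder's decoder; that stage is omitted here.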
Background subtraction isolates contact features and mitigates cross-sensor domain gaps arising from different lighting conditions.
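As an illustration, background subtraction can be sketched as a signed per-pixel difference between a contact frame and a no-contact frame from the same sensor. The function names and the display mapping below are assumptions made for this example, not the authors' preprocessing code.

```python
import numpy as np

def subtract_background(contact_img, background_img):
    """Signed difference between a contact frame and a no-contact frame.

    Both inputs are arrays of the same shape (for example uint8 images).
    Casting to float first avoids uint8 wrap-around on negative differences.
    """
    contact = contact_img.astype(np.float32)
    background = background_img.astype(np.float32)
    return contact - background

def to_uint8(diff):
    """Map a signed difference in [-255, 255] to [0, 255] for storage or viewing."""
    return np.clip(diff / 2.0 + 127.5, 0, 255).astype(np.uint8)
```

Because the static illumination pattern is present in both frames, it cancels in the difference, which is why the result depends less on any one sensor's lighting.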
A compact binary contact mask P serves as a global pose-control signal. Masks are aligned with ground-truth images at ±3 px and ±1° precision during training, and transformed to arbitrary target poses via 2D rigid transforms at inference.
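The inference-time mask transform can be sketched as a 2D rigid warp with nearest-neighbor sampling. This is an illustrative version under stated assumptions: rotation is taken about the image center, translation is in pixels, and the function name `transform_mask` is hypothetical.

```python
import numpy as np

def transform_mask(mask, dx, dy, theta_deg):
    """Rotate a binary mask by theta_deg about the image center, then shift by (dx, dy).

    Coordinates follow image conventions (x to the right, y downward).
    Uses inverse mapping so every output pixel is filled exactly once.
    """
    h, w = mask.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    theta = np.deg2rad(theta_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)

    ys, xs = np.mgrid[0:h, 0:w]
    # Undo the translation, then undo the rotation (apply R transposed).
    xr = xs - dx - cx
    yr = ys - dy - cy
    src_x = np.rint(cos_t * xr + sin_t * yr + cx).astype(int)
    src_y = np.rint(-sin_t * xr + cos_t * yr + cy).astype(int)

    # Pixels whose source falls outside the image stay empty.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(mask)
    out[valid] = mask[src_y[valid], src_x[valid]]
    return out
```

Sweeping `dx`, `dy`, and `theta_deg` over a grid yields one target mask per pose from a single reference mask.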
Inspired by ControlNet, a frozen force-conditional backbone is augmented with a pose-guided adapter — injecting spatial constraints without disturbing the learned force-deformation mapping.
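The idea of adding pose control without disturbing a frozen backbone can be illustrated with a toy NumPy model. Everything here is a stand-in: the classes, layer shapes, and the zero-initialized output projection (the mechanism ControlNet uses) are assumptions for illustration, since the source does not specify how ControlTac's adapter is initialized.

```python
import numpy as np

class FrozenBackbone:
    """Toy stand-in for the frozen force-conditional model (weights never updated)."""
    def __init__(self, dim, rng):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, h, force):
        return np.tanh(h @ self.w + force.sum())

class PoseAdapter:
    """Toy pose branch whose output projection starts at zero."""
    def __init__(self, dim, mask_dim, rng):
        self.w_in = rng.standard_normal((mask_dim, dim)) / np.sqrt(mask_dim)
        self.w_out = np.zeros((dim, dim))  # zero init: no effect before training

    def __call__(self, mask_flat):
        return np.tanh(mask_flat @ self.w_in) @ self.w_out

def forward(backbone, adapter, h, force, mask_flat):
    # The adapter's contribution is added as a residual to the backbone features.
    return backbone(h, force) + adapter(mask_flat)
```

With `w_out` at zero, the combined model reproduces the backbone exactly, so training the adapter starts from the learned force-deformation mapping rather than perturbing it.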
ControlTac is the only approach achieving all three desiderata: high realism, high variation, and physical controllability.
ControlTac is evaluated on twelve seen and unseen objects across two datasets using pixel-wise MSE↓ and structural similarity SSIM↑. ControlTac outperforms all baselines — including the Simulator and Sim2Real methods — across every evaluation setting.
| Method | Seen Objects MSE↓ | Seen Objects SSIM↑ | Unseen: FeelAnyForce MSE↓ | Unseen: FeelAnyForce SSIM↑ | Unseen: AnyTouch2 MSE↓ | Unseen: AnyTouch2 SSIM↑ |
|---|---|---|---|---|---|---|
| Simulator | 1054±19 | 0.68±0.03 | 1065±23 | 0.69±0.03 | 2157±14 | 0.61±0.04 |
| Sim2Real | 239±17 | 0.74±0.03 | 253±25 | 0.73±0.03 | 545±31 | 0.70±0.05 |
| Hybrid | 31±5 | 0.81±0.04 | 37±6 | 0.75±0.04 | — | — |
| Separate | 157±8 | 0.79±0.04 | 199±11 | 0.72±0.05 | — | — |
| ControlTac (Ours) | 23±2 | 0.83±0.03 | 26±3 | 0.79±0.04 | 29±2 | 0.81±0.02 |
Table 1. Quantitative results (mean ± SD). ControlTac achieves the best performance across all seen and unseen evaluation sets.
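For readers who want to reproduce this kind of comparison, the two metrics can be sketched as follows. The SSIM here is a simplified single-window variant computed over the whole image; the standard metric averages the same expression over local windows, so the numbers will not match Table 1. The `data_range` of 255 is an assumption about the pixel scale.

```python
import numpy as np

def mse(a, b):
    """Mean squared pixel error between two images of equal shape."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over the full image (simplified; not the windowed metric)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)
```

Lower MSE and higher SSIM both indicate that a generated image is closer to its real counterpart.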
ControlTac achieves roughly 46× lower MSE than the Simulator on seen objects (23 vs. 1054) and roughly 10× lower MSE than Sim2Real (23 vs. 239), while maintaining the highest SSIM across all evaluation sets, including the completely unseen AnyTouch2 objects.
ControlTac achieves strong zero-shot generalization to unseen sensor instances and reaches comparable performance to in-distribution sensors with minimal fine-tuning, evaluated on the 9DTact dataset comprising 3D-printed objects of various geometric shapes.
ControlTac's generated data is validated sequentially on three representative tasks spanning discrete categorization, sensitive force regression, and spatial contact reasoning.
Using one reference image for six unseen objects, ControlTac generates tactile data under varying forces and poses. Classification accuracy is benchmarked across CNN, ViT (scratch), and ImageNet-pretrained ViT architectures.
ViT (ImageNet-pretrained) classification accuracy on the six unseen objects: ControlTac reaches 0.99 versus the Simulator's 0.92. Text2Tac performs poorly because free-form generation lacks physical grounding.
3D contact force estimation (1–10 N, 0.1 N precision) is the most sensitive regression benchmark given GelSight's high sensitivity to subtle force variations. ControlTac synthesizes 15k–30k images to co-train a force estimator with varying subsets of real data.
Key result: Supplementing only 1/3 of the real dataset with ControlTac data matches full-dataset performance, whereas using that real subset alone yields poor results. ControlTac effectively covers the pose and force variation space that real data struggles to provide at scale.
2D pose estimation (X, Y coordinates and rotation angle θ) generalizes to unknown objects. ControlTac automatically extracts pose labels from 2D contact masks during synthesis, making it effortless to generate large labeled datasets.
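One plausible way to read a pose label off a binary contact mask is through image moments: the centroid gives X and Y, and the principal axis gives the angle. This is a sketch under assumptions, not the authors' labeling code; the function name is hypothetical, outputs are in pixels and degrees, and converting pixels to millimeters would need a sensor calibration that is not given here.

```python
import numpy as np

def pose_from_mask(mask):
    """Return (x, y, angle_deg) of the contact region in a binary mask.

    x, y: centroid in pixel coordinates (x to the right, y downward).
    angle_deg: orientation of the principal axis, defined modulo 180 degrees.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no contact pixels")
    cx, cy = xs.mean(), ys.mean()
    # Second-order central moments of the contact region.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return float(cx), float(cy), float(np.rad2deg(theta))
```

The angle is ill-defined for shapes with equal moments along both axes, such as a circle or a symmetric cross; in practice the rotation applied to the mask during synthesis could serve as the label directly.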
Estimators trained solely on ControlTac-generated data achieve strong performance across all objects and outperform models trained on real data alone, because ControlTac's diverse synthesis fills gaps in contact-pose and force coverage.
ControlTac generalizes to unseen shapes (T-shape, Type-C connector) and new sensor instances. Generating training data with varying (unfixed) forces consistently outperforms fixed-force training, covering real-world dynamics more comprehensively.
| Method | Cylinder (3 Types) X | Cylinder (3 Types) Y | Cylinder (3 Types) Ang | Cross X | Cross Y | Cross Ang | T-shape (Unseen) X | T-shape (Unseen) Y | T-shape (Unseen) Ang | USB (Unseen) X | USB (Unseen) Y | USB (Unseen) Ang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA | 15 | 13 | 22 | 56 | 19 | 18 | — | — | — | — | — | — |
| Real | 8 | 8 | 4 | 6 | 6 | 2 | — | — | — | — | — | — |
| Simulator | 17 | 15 | 6 | 19 | 18 | 5 | — | — | — | — | — | — |
| Sim + Real | 12 | 13 | 6 | 17 | 16 | 4 | — | — | — | — | — | — |
| Ours (fixed force) | 9 | 8 | 5 | 7 | 9 | 4 | — | — | — | — | — | — |
| Ours (unfixed force) | 4 | 5 | 3 | 3 | 4 | 1 | 4 | 5 | 2 | 5 | 4 | 3 |
Table 2. Pose estimation errors (X/Y in mm, Angle in degrees). ControlTac (unfixed force) outperforms all baselines including models trained on real data, and generalizes to unseen objects on a new sensor instance.
Estimators and policies trained solely on ControlTac-augmented data are deployed across four challenging real-world tasks, demonstrating robust generalization, millimeter-level precision, and practical utility in dynamic settings.
We train ACT-based visuo-tactile policies on a multi-stage Pick and Peg-in-Hole task using third-view RGB, wrist RGB, and tactile inputs. Force-controlled trajectory-level augmentation (1–9× copies per trajectory with random force offsets) significantly improves policy robustness against contact inconsistencies in teleoperated demonstrations. With 70–100 real demonstrations, tactile augmentation consistently outperforms real-only training; with 100 demos and 9× augmentation, success rate reaches 72% — a +12 percentage point gain over no augmentation (60%), evaluated across 25 real-robot trials per condition.
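The trajectory-level augmentation described above might be organized roughly as follows. This is a sketch under several assumptions: `generate_fn(frame, delta_f)` is a hypothetical stand-in for the force-conditioned generator, one random offset is shared by all frames within a copy, and offsets are drawn uniformly. None of these details are confirmed by the source.

```python
import numpy as np

def augment_trajectory(tactile_frames, forces, generate_fn, n_copies, max_offset, rng):
    """Create n_copies force-shifted versions of one demonstration.

    tactile_frames: array of shape (T, H, W) or (T, H, W, C).
    forces: array of shape (T, 3) with the recorded contact forces.
    generate_fn(frame, delta_f): hypothetical generator returning a new frame.
    """
    augmented = []
    for _ in range(n_copies):
        # One offset per copy keeps the force shift consistent along the trajectory.
        offset = rng.uniform(-max_offset, max_offset, size=3)
        new_frames = np.stack([generate_fn(frame, offset) for frame in tactile_frames])
        augmented.append((new_frames, forces + offset))
    return augmented
```

Each augmented copy would reuse the original RGB observations and actions, with only the tactile stream replaced.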
A UR5 robot pushes four objects (1 kg metal weight, two water-filled cylinders of 0.50 kg and 0.56 kg, and a 0.63 kg glass bottle) at constant speed. An ATI Axia80 force sensor provides ground truth readings. The estimator trained on ControlTac-generated images — without any real training data from these objects — predicts the pushing force in real time from tactile input alone, achieving accuracy within 0.1 N of the real-data-trained model. This demonstrates robust generalization across complex and diverse material properties.
We track the 2D contact pose (position + rotation angle) of printed cylinder, cross, and T-shape objects as they undergo continuous, in-hand rotation and translation against the sensor. The pose estimator — trained entirely on ControlTac-generated data — runs at 10 Hz in real time, smoothly tracking across the full range of poses and dynamic motion. This task highlights ControlTac's practical utility: accurate 10 Hz tactile-based tracking was previously only achievable with large annotated real datasets.
An XArm7 robot equipped with dual GelSight Mini tactile sensors grasps objects at random angles and performs high-precision peg-in-hole insertion with a tight 3 mm tolerance. The pose estimator — trained on ControlTac data — predicts in-hand object pose from tactile feedback, and the robot compensates for grasping uncertainty by rotating and translating the end-effector to align the object vertically above the hole. ControlTac-trained models achieve up to 90% success rate across four object types, including a daily-use Type-C USB connector, demonstrating millimeter-level real-world manipulation from single-image augmentation.
@article{luo2025controltac,
title={ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image},
author={Luo, Dongyu and Yu, Kelin and Shahidzadeh, Amir-Hossein and Ferm{\"u}ller, Cornelia and Aloimonos, Yiannis and Gao, Ruohan},
journal={arXiv preprint arXiv:2505.20498},
year={2025}
}