CONTROLTAC: Scaling Tactile Data with
Physically Controlled Tactile Image Generation

Vision-Based Tactile Sensing  ·  Data Augmentation  ·  Robot Manipulation

Dongyu Luo*, Kelin Yu*, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos, Ruohan Gao

* Co-first authors

University of Maryland, College Park


Given One Image — Generate Millions

ControlTac performs force- and pose-conditioned generation to synthesize millions of realistic tactile images from a single reference, enhancing a wide range of downstream robotic applications.

Figure 1. Overview of ControlTac. Given a single reference image, ControlTac performs force- and pose-conditioned generation to synthesize millions of realistic tactile images (center). This augmented dataset enhances various downstream applications, including object classification, weight estimation, real-time pose tracking, object insertion, and training imitation learning policies.

Synthesizing Realistic Tactile Images
Conditioned on Physical Priors

Vision-based tactile sensing is widely used in perception, reconstruction, and robotic manipulation, yet collecting large-scale tactile data remains costly due to diverse sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data — simulation and free-form tactile generation — often yield unrealistically rendered signals with poor transfer to highly dynamic real-world tasks. We propose ControlTac, a two-stage controllable tactile image generation framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact pose. By grounding generation in these important physical priors, ControlTac synthesizes realistic samples across different sensors while effectively capturing task-relevant variations. Across a series of downstream tasks and real-world experiments, the augmented datasets using our approach consistently improve performance and demonstrate practical utility in dynamic real-world settings.
Vision-Based Tactile Sensing  ·  Tactile Data Augmentation  ·  Controllable Generation  ·  Diffusion Models  ·  Robot Manipulation  ·  GelSight

Why Physical Priors Matter

Free-form generative methods lack physical grounding; simulators suffer from the sim-to-real gap. Our insight is that tactile synthesis should be explicitly conditioned on contact forces and poses, anchored by a single real reference image.

Contribution 01: Physically Controlled Generation

We propose conditioning tactile image synthesis on physically meaningful parameters — 3D contact force vectors and 2D contact pose masks — as the key to achieving both realism and controllable diversity.

Contribution 02: Two-Stage ControlTac Framework

We introduce ControlTac, a two-stage framework that decouples force and pose control to avoid representation entanglement, enabling cross-sensor transfer and diverse data synthesis from a single reference image.

Contribution 03: Broad Real-World Validation

We validate ControlTac across object classification, force estimation, pose estimation, real-world weighing, tracking, insertion, and imitation learning, where it consistently outperforms all baselines.

Two-Stage Conditional Generation Framework

ControlTac decouples physical priors into contact force and spatial pose, operating sequentially to avoid entangling force-induced deformations with contact geometry changes.

Figure 2. Two-stage ControlTac framework. (a) Force-Control: The raw image x′ is background-subtracted and encoded into latent space. Conditioned on the target force ΔF, the generator synthesizes a force-adjusted intermediate image y_Int. (b) Pose-Control: An object-specific contact mask P is transformed via rigid 2D operations to serve as an explicit spatial constraint, guiding the model to generate the final tactile image y satisfying both ΔF and the target pose.

Stage One: Force-Control Generation

The raw image is background-subtracted and encoded into a compressed latent space via a pre-trained autoencoder. A conditional Diffusion Transformer (with DDIM sampling) takes this latent representation and a target relative force vector ΔF to synthesize force-adjusted tactile images with accurate elastomer deformations.
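
To make the sampling step concrete, here is a minimal sketch of one deterministic DDIM update (eta = 0) in NumPy. It is not ControlTac's code: the function name, the scalar schedule values, and the hypothetical `model(x_t, t, ref_latent, delta_f)` call mentioned in the docstring are assumptions for illustration. Only the update rule itself is the standard DDIM formulation.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to the previous step.

    x_t            : current noisy latent
    eps_pred       : noise predicted by the denoiser; in a force-conditioned
                     setting this could come from a hypothetical call such as
                     model(x_t, t, ref_latent, delta_f)  (assumed interface)
    alpha_bar_t    : cumulative noise-schedule product at step t
    alpha_bar_prev : cumulative product at the previous, less noisy step
    """
    # Estimate the clean latent implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move that estimate to the previous noise level, reusing the same noise.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because the update is deterministic, a perfect noise prediction followed by a step to `alpha_bar_prev = 1.0` returns the clean latent exactly, which is a convenient sanity check for a sampler.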

Background subtraction isolates contact features and mitigates cross-sensor domain gaps arising from different lighting conditions.
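
Background subtraction can be sketched in a few lines of NumPy. This is a sketch under assumptions, not the authors' preprocessing: the function name is hypothetical, and it assumes 8-bit frames and a no-contact reference frame captured by the same sensor.

```python
import numpy as np

def subtract_background(contact_img, background_img):
    """Return the signed per-pixel difference between a contact frame and a no-contact frame."""
    if contact_img.shape != background_img.shape:
        raise ValueError("contact and background images must have the same shape")
    # Widen to a signed type first: uint8 subtraction would wrap around below zero.
    return contact_img.astype(np.int16) - background_img.astype(np.int16)
```

Pixels the contact does not affect become zero regardless of the sensor's lighting, which is one way to read the claim that subtraction narrows the cross-sensor gap.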

Force-Control stage parameters: 147M
Training data: 20,000
Training time: ~21h
Note: Trained on a single A5000 GPU

Stage Two: Pose-Control Generation

A compact binary contact mask P serves as a global pose-control signal. Masks are aligned with ground-truth images at ±3 px and ±1° precision during training, and transformed to arbitrary target poses via 2D rigid transforms at inference.
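
A 2D rigid transform of a binary mask can be sketched as follows. This is an illustrative NumPy version, not the project's implementation: the function name, rotation about the image centre, nearest-neighbour resampling, and the x-right, y-down pixel frame are all assumptions.

```python
import numpy as np

def transform_mask(mask, dx, dy, angle_deg):
    """Rotate a binary mask about the image centre by angle_deg, then shift it by (dx, dy) pixels."""
    h, w = mask.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)

    # Coordinates of every output pixel.
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: undo the shift, then undo the rotation, to find the source pixel.
    xr, yr = xs - dx - cx, ys - dy - cy
    src_x = np.rint(cos_t * xr + sin_t * yr + cx).astype(int)
    src_y = np.rint(-sin_t * xr + cos_t * yr + cy).astype(int)

    # Copy only from source pixels that fall inside the image; the rest stay zero.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(mask)
    out[inside] = mask[src_y[inside], src_x[inside]]
    return out
```

Since the transform parameters are chosen by the caller, the target pose of every generated mask is known by construction.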

Inspired by ControlNet, a frozen force-conditional backbone is augmented with a pose-guided adapter — injecting spatial constraints without disturbing the learned force-deformation mapping.
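
The idea of adding control to a frozen backbone can be illustrated with a toy linear model. Everything here is a stand-in: the single-matrix "backbone", the adapter, and all names are hypothetical. The zero-initialised output projection is the mechanism ControlNet uses; whether ControlTac initialises its adapter the same way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Stand-in for the frozen force-conditional backbone (a single fixed linear layer here).
W_BACKBONE = rng.normal(size=(DIM, DIM))
# Trainable adapter weights that encode pose-mask features.
W_ADAPTER = rng.normal(size=(DIM, DIM))

def forward(latent, mask_feat, w_out):
    """Backbone output plus a pose-guided residual projected through w_out."""
    backbone_out = latent @ W_BACKBONE             # frozen path, never updated
    control = (mask_feat @ W_ADAPTER) @ w_out      # adapter path
    return backbone_out + control

latent = rng.normal(size=(1, DIM))
mask_feat = rng.normal(size=(1, DIM))

# With a zero-initialised output projection the adapter is a no-op,
# so adapter training starts exactly from the backbone's behaviour.
w_out = np.zeros((DIM, DIM))
assert np.allclose(forward(latent, mask_feat, w_out), latent @ W_BACKBONE)
```

As `w_out` moves away from zero during training, the pose signal is blended in as a residual while the backbone weights stay untouched.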

Total parameters (Force + Pose): 223M
Training data: 7,000
Training time: ~5h
Note: Trained on a single A5000 GPU

ControlTac vs. Prior Paradigms

ControlTac is the only approach achieving all three desiderata: high realism, high variation, and physical controllability.

Method              Realism   Variation   Controllable
Text2Tac            Low       Low
Vis2Tac             Low       Medium
Simulation          Medium    Medium
ControlTac (Ours)   High      High        Yes
Figure 3. Comparison of tactile data augmentation approaches along three criteria: visual realism, output diversity, and physical controllability.

Tactile Image Generation Quality

ControlTac is evaluated on twelve seen and unseen objects across two datasets using pixel-wise MSE↓ and structural similarity SSIM↑. ControlTac outperforms all baselines — including the Simulator and Sim2Real methods — across every evaluation setting.

Method              Seen Objects            Unseen: FeelAnyForce    Unseen: AnyTouch2
                    MSE↓       SSIM↑        MSE↓       SSIM↑        MSE↓       SSIM↑
Simulator           1054±19    0.68±0.03    1065±23    0.69±0.03    2157±14    0.61±0.04
Sim2Real            239±17     0.74±0.03    253±25     0.73±0.03    545±31     0.70±0.05
Hybrid              31±5       0.81±0.04    37±6       0.75±0.04
Separate            157±8      0.79±0.04    199±11     0.72±0.05
ControlTac (Ours)   23±2       0.83±0.03    26±3       0.79±0.04    29±2       0.81±0.02

Table 1. Quantitative results (mean ± SD). ControlTac achieves best performance across all seen and unseen evaluation sets.
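
For readers unfamiliar with the two metrics in Table 1, the sketch below computes pixel-wise MSE and a simplified SSIM in NumPy. Two assumptions apply: the 0 to 255 pixel scale is a guess based on the magnitude of the reported MSE values, and the SSIM here is a single global window over the whole image, whereas the standard metric averages over local sliding windows, so its numbers would not match the table.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=255.0):
    """Simplified SSIM computed from whole-image statistics (no sliding window)."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    # Stabilising constants from the original SSIM definition.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))
```

Lower MSE and higher SSIM both indicate that a generated image is closer to its ground-truth counterpart; identical images give an MSE of 0 and an SSIM of 1.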

Key Takeaways

ControlTac achieves 45× lower MSE than the Simulator on seen objects (23 vs. 1054) and 10× lower MSE than Sim2Real, while maintaining the highest SSIM across all evaluation sets including completely unseen AnyTouch2 objects.

Inference Throughput (Tested on a single A6000)

ControlTac (Ours): 6.5 imgs/sec
Hybrid: 7.0 imgs/sec
Separate: 3.7 imgs/sec
Visual Comparison Across Objects & Conditions
Figure 4. Qualitative generation results across diverse objects and contact conditions. Visual comparison of ControlTac against baselines on: (a) seen (rows 1–4) and unseen (rows 5–6) objects from FeelAnyForce, (b) unseen objects from AnyTouch2, and (c) a failure case. ControlTac captures complex force-induced deformations and fine textures even on unseen objects.

Generalize Across Different Sensor Instances

ControlTac achieves strong zero-shot generalization to unseen sensor instances and reaches comparable performance to in-distribution sensors with minimal fine-tuning, evaluated on the 9DTact dataset comprising 3D-printed objects of various geometric shapes.

Figure 5(a). Impact of fine-tuning data size on MSE and SSIM metrics. Even with very limited fine-tuning data, ControlTac achieves competitive performance on unseen sensor instances.
Figure 5(b). Visualization of generated results across different sensors, confirming ControlTac's ability to capture unique lighting and texture details of unseen sensor instances.

Three Downstream Benchmarks

ControlTac's generated data is validated sequentially on three representative tasks spanning discrete categorization, sensitive force regression, and spatial contact reasoning.

Task 01

Object Classification

Using one reference image for each of six unseen objects, ControlTac generates tactile data under varying forces and poses. Classification accuracy is benchmarked across CNN, ViT (scratch), and ImageNet-pretrained ViT architectures.

Geo+Col Aug: 0.79
Simulator: 0.92
ControlTac (Ours): 0.99
Text2Tac: 0.14

ViT (ImageNet) accuracy on 6 unseen objects. ControlTac achieves 0.99 vs. Simulator's 0.92. Text2Tac performs poorly as free-form generation lacks physical grounding.

Figure 6. The unseen objects used for validation in the downstream object classification benchmark.
Task 02

Force Estimation

3D contact force estimation (1–10 N, 0.1 N precision) is the most sensitive regression benchmark given GelSight's high sensitivity to subtle force variations. ControlTac synthesizes 15k–30k images to co-train a force estimator with varying subsets of real data.

Key result: Supplementing only 1/3 of the real dataset with ControlTac data matches full-dataset performance, whereas using that real subset alone yields poor results. ControlTac effectively covers the pose and force variation space that real data struggles to provide at scale.

Figure 7. Force estimation performance comparison across different data ratio configurations.
Task 03

Pose Estimation

The pose estimation task predicts a 2D contact pose (X, Y coordinates and rotation angle θ) and must generalize to unknown objects. ControlTac extracts pose labels automatically from the 2D contact masks during synthesis, so large labeled datasets can be generated without manual annotation.

Estimators trained solely on ControlTac-generated data achieve strong performance across all objects and outperform models trained on real data alone, because the synthesized data covers contact poses and force dynamics that the real data misses.
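
One way a pose label could be read off a binary contact mask is through image moments, sketched below in NumPy. The source does not specify the extraction method, so this is an assumption for illustration; the function name is hypothetical. Note that a principal-axis angle is only defined up to 180° and is ill-defined for rotationally symmetric shapes such as a cross, so a real pipeline may instead reuse the known parameters of the transform applied to the mask.

```python
import numpy as np

def pose_from_mask(mask):
    """Return (x, y, angle_deg): mask centroid in pixels and principal-axis angle in [0, 180)."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no contact pixels")
    x_c, y_c = xs.mean(), ys.mean()
    # Second-order central moments of the contact region.
    mu20 = np.mean((xs - x_c) ** 2)
    mu02 = np.mean((ys - y_c) ** 2)
    mu11 = np.mean((xs - x_c) * (ys - y_c))
    # Orientation of the principal axis in the x-right, y-down pixel frame.
    angle = 0.5 * np.degrees(np.arctan2(2.0 * mu11, mu20 - mu02))
    return float(x_c), float(y_c), float(angle % 180.0)
```

Converting the pixel centroid to millimetres would additionally need the sensor's pixel pitch, which is not given here.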

Pose Estimation — Full Results Including Unseen Objects

ControlTac generalizes to unseen shapes (T-shape, Type-C connector) and new sensor instances. Using varying force generation (unfixed) consistently outperforms fixed-force training, covering real-world dynamics more comprehensively.

Method                 Cylinder (3 Types)   Cross         T-shape (Unseen)   USB (Unseen)
                       X   Y   Ang          X   Y   Ang   X   Y   Ang        X   Y   Ang
PCA                    151322561918
Real                   884662
Simulator              1715619185
Sim + Real             1213617164
Ours (fixed force)     985794
Ours (unfixed force)   4   5   3            3   4   1     4   5   2          5   4   3

Table 2. Pose estimation errors (X/Y in mm, Angle in degrees). ControlTac (unfixed force) outperforms all baselines including models trained on real data, and generalizes to unseen objects on a new sensor instance.

Four Real-World Deployment Tasks

Estimators and policies trained solely on ControlTac-augmented data are deployed across four challenging real-world tasks, demonstrating robust generalization, millimeter-level precision, and practical utility in dynamic settings.

Real World · Task 01
Imitation Learning: Pick & Peg-in-Hole

We train ACT-based visuo-tactile policies on a multi-stage Pick and Peg-in-Hole task using third-view RGB, wrist RGB, and tactile inputs. Force-controlled trajectory-level augmentation (1–9× copies per trajectory with random force offsets) significantly improves policy robustness against contact inconsistencies in teleoperated demonstrations. With 70–100 real demonstrations, tactile augmentation consistently outperforms real-only training; with 100 demos and 9× augmentation, success rate reaches 72% — a +12 percentage point gain over no augmentation (60%), evaluated across 25 real-robot trials per condition.
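
Trajectory-level augmentation can be sketched as follows. This is a hedged illustration, not the authors' pipeline: `generate(frame, delta_f)` is a hypothetical callable standing in for the force-control generator, the ±1 N offset range is an arbitrary placeholder, and sharing one offset across all frames of a copy is an assumption about what "trajectory-level" means.

```python
import random

def augment_trajectory(tactile_frames, generate, n_copies, max_offset_n=1.0, seed=None):
    """Return n_copies synthetic versions of one demonstration's tactile stream.

    generate(frame, delta_f) is a hypothetical stand-in for the force-control
    generator; delta_f is a 3D force offset (fx, fy, fz) in newtons.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        # One offset per copy, shared by every frame, so each copy stays temporally consistent.
        delta_f = tuple(rng.uniform(-max_offset_n, max_offset_n) for _ in range(3))
        copies.append([generate(frame, delta_f) for frame in tactile_frames])
    return copies
```

With `n_copies=9`, each real demonstration contributes nine synthetic tactile streams; the RGB streams and actions of the demonstration would be reused unchanged in each copy under this reading.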

Best success rate (Aug×9): 72%
No-augmentation baseline: 60%
Improvement: +12 pp
Figure 8. Success rate comparison for Pick & Peg-in-Hole task under different augmentation scales.
Real World · Task 02
Object Weighing

A UR5 robot pushes four objects (1 kg metal weight, two water-filled cylinders of 0.50 kg and 0.56 kg, and a 0.63 kg glass bottle) at constant speed. An ATI Axia80 force sensor provides ground truth readings. The estimator trained on ControlTac-generated images — without any real training data from these objects — predicts the pushing force in real time from tactile input alone, achieving accuracy within 0.1 N of the real-data-trained model. This demonstrates robust generalization across complex and diverse material properties.

Force error: <0.1 N
Object types: 4
Ground truth: ATI
Real World · Task 03
Real-time Pose Tracking

We track the 2D contact pose (position + rotation angle) of printed cylinder, cross, and T-shape objects as they undergo continuous, in-hand rotation and translation against the sensor. The pose estimator — trained entirely on ControlTac-generated data — runs at 10 Hz in real time, smoothly tracking across the full range of poses and dynamic motion. This task highlights ControlTac's practical utility: accurate 10 Hz tactile-based tracking was previously only achievable with large annotated real datasets.

Tracking rate: 10 Hz
Object shapes: 3
Deployment: Real-time
Real World · Task 04
High-Precision Object Insertion

An XArm7 robot equipped with dual GelSight Mini tactile sensors grasps objects at random angles and performs high-precision peg-in-hole insertion with a tight 3 mm tolerance. The pose estimator — trained on ControlTac data — predicts in-hand object pose from tactile feedback, and the robot compensates for grasping uncertainty by rotating and translating the end-effector to align the object vertically above the hole. ControlTac-trained models achieve up to 90% success rate across four object types, including a daily-use Type-C USB connector, demonstrating millimeter-level real-world manipulation from single-image augmentation.
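
The compensation step can be viewed as inverting the estimated planar pose. The sketch below is an assumption-laden illustration rather than the deployed controller: it treats the estimate as a 2D rigid transform (x, y, angle) from the nominal grasp to the actual in-hand pose, and the frames and sign conventions on a real robot would have to be matched to the sensor mounting.

```python
import math

def correction_for(x_mm, y_mm, angle_deg):
    """Invert the estimated planar pose (x, y, angle) to get the compensating motion."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    # The inverse of the rigid transform (R(t), p) is (R(-t), -R(-t) p).
    return (-(c * x_mm + s * y_mm), -(-s * x_mm + c * y_mm), -angle_deg)

def compose(a, b):
    """Apply planar pose b first, then a; both are (x, y, angle_deg)."""
    t = math.radians(a[2])
    c, s = math.cos(t), math.sin(t)
    return (a[0] + c * b[0] - s * b[1], a[1] + s * b[0] + c * b[1], a[2] + b[2])
```

Composing the correction with the estimated pose gives the identity pose, for example `compose(correction_for(1.5, -2.0, 12.0), (1.5, -2.0, 12.0))` is zero up to floating-point error, which is the condition for the object to sit at its nominal alignment above the hole.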

Cylinder success rate: 90%
Cross/T-shape success rate: 85%
Type-C success rate: 75%

BibTeX

@article{luo2025controltac,
  title={ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image},
  author={Luo, Dongyu and Yu, Kelin and Shahidzadeh, Amir-Hossein and Ferm{\"u}ller, Cornelia and Aloimonos, Yiannis and Gao, Ruohan},
  journal={arXiv preprint arXiv:2505.20498},
  year={2025}
}