ControlTac: Scaling Tactile Data with Physically Controlled Tactile Image Generation

Overview

Given One Image — Generate Millions

ControlTac performs force- and pose-conditioned generation to synthesize millions of realistic tactile images from a single reference, enhancing a wide range of downstream robotic applications.

Figure 1. Overview of ControlTac. Given a single reference image, ControlTac performs force- and pose-conditioned generation to synthesize millions of realistic tactile images (center). This augmented dataset enhances various downstream applications, including object classification, weight estimation, real-time pose tracking, object insertion, and training imitation learning policies.

Abstract

Synthesizing Realistic Tactile Images
Conditioned on Physical Priors

Vision-based tactile sensing is widely used in perception, reconstruction, and robotic manipulation, yet collecting large-scale tactile data remains costly due to diverse sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data — simulation and free-form tactile generation — often yield unrealistically rendered signals with poor transfer to highly dynamic real-world tasks. We propose ControlTac, a two-stage controllable tactile image generation framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact pose. By grounding generation in these important physical priors, ControlTac synthesizes realistic samples across different sensors while effectively capturing task-relevant variations. Across a series of downstream tasks and real-world experiments, the augmented datasets using our approach consistently improve performance and demonstrate practical utility in dynamic real-world settings.

Vision-Based Tactile Sensing · Tactile Data Augmentation · Controllable Generation · Diffusion Models · Robot Manipulation · GelSight

Key Insight & Contributions

Why Physical Priors Matter

Free-form generative methods lack physical grounding; simulators suffer from the sim-to-real gap. Our insight is that tactile synthesis should be explicitly conditioned on contact forces and poses, anchored by a single real reference image.

Contribution

Physically Controlled Generation

We propose conditioning tactile image synthesis on physically meaningful parameters — 3D contact force vectors and 2D contact pose masks — as the key to achieving both realism and controllable diversity.

Contribution

Two-Stage ControlTac Framework

We introduce ControlTac, a two-stage framework that decouples force and pose control to avoid representation entanglement, enabling cross-sensor transfer and diverse data synthesis from a single reference image.

Contribution

Broad Real-World Validation

We validate ControlTac across object classification, force estimation, pose estimation, and real-world weighing, tracking, insertion, and imitation learning, consistently outperforming all baselines.

Methodology

Two-Stage Conditional Generation Framework

ControlTac decouples physical priors into contact force and spatial pose, operating sequentially to avoid entangling force-induced deformations with contact geometry changes.

Figure 2. Two-stage ControlTac framework. (a) Force-Control: The raw image x′ is background-subtracted and encoded into latent space. Conditioned on the target force ΔF, the generator synthesizes a force-adjusted intermediate image y_Int. (b) Pose-Control: An object-specific contact mask P is transformed via rigid 2D operations to serve as an explicit spatial constraint, guiding the model to generate the final tactile image y satisfying both ΔF and the target pose.

Stage One

Force-Control Generation

The raw image is background-subtracted and encoded into a compressed latent space via a pre-trained autoencoder. A conditional Diffusion Transformer (with DDIM sampling) takes this latent representation and a target relative force vector ΔF to synthesize force-adjusted tactile images with accurate elastomer deformations.
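As a rough illustration of the DDIM sampling mentioned above, the sketch below applies one deterministic DDIM update to a latent. It assumes a noise-prediction parameterization with eta = 0. The names `ddim_step`, `alpha_bar_t`, and `alpha_bar_prev` are hypothetical, and the noise schedule, parameterization, and force conditioning actually used by ControlTac are not specified here.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a noise-prediction model."""
    # Estimate the clean latent implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the noise level of the previous (less noisy) step.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy check: with an exact noise prediction, stepping to alpha_bar_prev = 1 recovers x0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))        # hypothetical latent shape
eps = rng.normal(size=x0.shape)
alpha_bar_t = 0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```

In a real sampler the noise prediction would come from the conditional Diffusion Transformer, given the latent, the timestep, and ΔF.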

Background subtraction isolates contact features and mitigates cross-sensor domain gaps arising from different lighting conditions.
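A minimal sketch of the background-subtraction idea, assuming a no-contact reference frame is available for each sensor and that lighting acts as a simple additive offset (an idealization). The function name `subtract_background` and the toy intensities are hypothetical.

```python
import numpy as np

def subtract_background(raw, background):
    """Return the signed per-pixel difference between a contact frame and a no-contact frame."""
    # Cast to a signed type first so uint8 arithmetic cannot wrap around.
    return raw.astype(np.int16) - background.astype(np.int16)

# Toy example: the same contact patch seen under two different sensor lightings.
contact = np.zeros((4, 4, 3), dtype=np.int16)
contact[1:3, 1:3] = 20                       # hypothetical intensity change caused by contact
bg_a = np.full((4, 4, 3), 100, dtype=np.uint8)
bg_b = np.full((4, 4, 3), 140, dtype=np.uint8)
raw_a = (bg_a + contact).astype(np.uint8)
raw_b = (bg_b + contact).astype(np.uint8)
diff_a = subtract_background(raw_a, bg_a)
diff_b = subtract_background(raw_b, bg_b)
```

Under this additive assumption the two difference images are identical even though the raw frames are not, which is the sense in which subtraction narrows the gap between sensors.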

Force-Control stage parameters: 147M
Training data: 20,000
Training time: ~21h on a single A5000 GPU

Stage Two

Pose-Control Generation

A compact binary contact mask P serves as a global pose-control signal. Masks are aligned with ground-truth images at ±3 px and ±1° precision during training, and transformed to arbitrary target poses via 2D rigid transforms at inference.
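The inference-time mask transform can be sketched as follows. This is an illustrative NumPy version, assuming rotation about the image center followed by a pixel shift, with nearest-neighbor sampling. The function name `transform_mask` and its argument conventions are hypothetical, not the authors' implementation.

```python
import numpy as np

def transform_mask(mask, dx, dy, theta_deg):
    """Rotate a binary mask about the image center by theta_deg, then shift it by (dx, dy) pixels."""
    h, w = mask.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(theta_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: undo the shift, then undo the rotation, to find each output pixel's source.
    xr, yr = xs - dx - cx, ys - dy - cy
    src_x = np.rint(cos_t * xr + sin_t * yr + cx).astype(int)
    src_y = np.rint(-sin_t * xr + cos_t * yr + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(mask)
    out[inside] = mask[src_y[inside], src_x[inside]]  # nearest-neighbor keeps the mask binary
    return out

square = np.zeros((9, 9), dtype=np.uint8)
square[3:6, 3:6] = 1                      # 3x3 contact patch centered in the image
shifted = transform_mask(square, dx=2, dy=1, theta_deg=0)

bar = np.zeros((9, 9), dtype=np.uint8)
bar[4, 2:7] = 1                           # horizontal bar through the center
rotated = transform_mask(bar, dx=0, dy=0, theta_deg=90)
```

Because the applied (dx, dy, θ) are known when the mask is built, the same parameters are available as a pose label for the generated image.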

Inspired by ControlNet, a frozen force-conditional backbone is augmented with a pose-guided adapter — injecting spatial constraints without disturbing the learned force-deformation mapping.
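The toy sketch below shows how an adapter can add pose information without disturbing a frozen backbone. It borrows ControlNet's zero-initialized projection, so the adapter contributes nothing at the start of training. Whether ControlTac uses exactly this initialization is an assumption, and the linear `backbone`, `PoseAdapter`, and all dimensions are hypothetical stand-ins for the real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen, force-conditioned backbone: a fixed linear map that is never updated.
W_backbone = rng.normal(size=(16, 16))

def backbone(latent):
    return W_backbone @ latent

class PoseAdapter:
    """Toy adapter: encodes the flattened mask, then projects through a zero-initialized layer."""

    def __init__(self, mask_dim, feat_dim):
        self.W_in = rng.normal(size=(feat_dim, mask_dim))
        self.W_zero = np.zeros((feat_dim, feat_dim))  # trainable, starts at zero

    def __call__(self, mask_flat):
        return self.W_zero @ np.tanh(self.W_in @ mask_flat)

def pose_controlled(latent, mask_flat, adapter):
    # The adapter output is added as a residual; the backbone weights are untouched.
    return backbone(latent) + adapter(mask_flat)

latent = rng.normal(size=16)
mask_flat = (rng.random(64) > 0.5).astype(float)
adapter = PoseAdapter(mask_dim=64, feat_dim=16)
out_before_training = pose_controlled(latent, mask_flat, adapter)
```

With the projection at zero, the combined model reproduces the backbone output exactly, so training the adapter starts from the learned force-deformation mapping rather than from scratch.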

Total parameters (Force + Pose): 223M
Training data: 7,000
Training time: ~5h on a single A5000 GPU

Approach Comparison

ControlTac vs. Prior Paradigms

ControlTac is the only approach achieving all three desiderata: high realism, high variation, and physical controllability.

Method

Realism

Variation

Controllable

Text2Tac

Low

✗

Vis2Tac

Low

Medium

✗

Simulation

Medium

✓

ControlTac (Ours)

High

✓

Figure 3. Comparison of tactile data augmentation approaches along three criteria: visual realism, output diversity, and physical controllability.

Experiments

Tactile Image Generation Quality

ControlTac is evaluated on twelve seen and unseen objects across two datasets using pixel-wise MSE↓ and structural similarity SSIM↑. ControlTac outperforms all baselines — including the Simulator and Sim2Real methods — across every evaluation setting.

Method	Seen Objects		Unseen: FeelAnyForce		Unseen: AnyTouch2
	MSE↓	SSIM↑	MSE↓	SSIM↑	MSE↓	SSIM↑
Simulator	1054±19	0.68±0.03	1065±23	0.69±0.03	2157±14	0.61±0.04
Sim2Real	239±17	0.74±0.03	253±25	0.73±0.03	545±31	0.70±0.05
Hybrid	31±5	0.81±0.04	37±6	0.75±0.04	—	—
Separate	157±8	0.79±0.04	199±11	0.72±0.05	—	—
ControlTac (Ours)	23±2	0.83±0.03	26±3	0.79±0.04	29±2	0.81±0.02

Table 1. Quantitative results (mean ± SD). ControlTac achieves best performance across all seen and unseen evaluation sets.
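For readers unfamiliar with the two metrics in Table 1, here is a simplified sketch. It assumes 8-bit intensities and computes SSIM once over the whole image. Standard SSIM implementations average the same expression over local windows, so values from this global version will not match the table. The function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared pixel error between two images of equal shape."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=255.0):
    """SSIM evaluated on whole-image statistics (no sliding window)."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
noisy = np.clip(reference + rng.normal(0, 10, size=reference.shape), 0, 255).astype(np.uint8)
```

Lower MSE and higher SSIM both indicate a generated image closer to the real one. An identical pair gives an MSE of 0 and an SSIM of 1.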

Key Takeaways

ControlTac achieves 45× lower MSE than the Simulator on seen objects (23 vs. 1054) and 10× lower MSE than Sim2Real, while maintaining the highest SSIM across all evaluation sets including completely unseen AnyTouch2 objects.
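The ratios quoted above follow directly from the seen-object MSE means in Table 1. A quick check, using only those three values:

```python
# Seen-object MSE means copied from Table 1.
mse = {"Simulator": 1054, "Sim2Real": 239, "ControlTac": 23}

ratio_simulator = mse["Simulator"] / mse["ControlTac"]
ratio_sim2real = mse["Sim2Real"] / mse["ControlTac"]
print(f"Simulator / ControlTac: {ratio_simulator:.1f}x")
print(f"Sim2Real / ControlTac: {ratio_sim2real:.1f}x")
```

The ratios come out at roughly 45.8 and 10.4, consistent with the rounded 45× and 10× figures.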

Inference Throughput (tested on a single A6000)

ControlTac (Ours): 6.5 imgs/sec
Hybrid: 7.0 imgs/sec
Separate: 3.7 imgs/sec

Qualitative Results

Visual Comparison Across Objects & Conditions


Figure 4. Qualitative generation results across diverse objects and contact conditions. Visual comparison of ControlTac against baselines on: (a) seen (rows 1–4) and unseen (rows 5–6) objects from FeelAnyForce, (b) unseen objects from AnyTouch2, and (c) a failure case. ControlTac captures complex force-induced deformations and fine textures even on unseen objects.

Cross-Sensor Transfer

Generalize Across Different Sensor Instances

ControlTac achieves strong zero-shot generalization to unseen sensor instances and reaches comparable performance to in-distribution sensors with minimal fine-tuning, evaluated on the 9DTact dataset comprising 3D-printed objects of various geometric shapes.

Figure 5(a). Impact of fine-tuning data size on MSE and SSIM metrics. Even with very limited fine-tuning data, ControlTac achieves competitive performance on unseen sensor instances.

Figure 5(b). Visualization of generated results across different sensors, confirming ControlTac's ability to capture unique lighting and texture details of unseen sensor instances.

Downstream Applications

Three Downstream Benchmarks

ControlTac's generated data is validated sequentially on three representative tasks spanning discrete categorization, sensitive force regression, and spatial contact reasoning.

Task 01

Object Classification

Using one reference image per object for six unseen objects, ControlTac generates tactile data under varying forces and poses. Classification accuracy is benchmarked across CNN, ViT (scratch), and ImageNet-pretrained ViT architectures.

Geo+Col Aug: 0.79
Simulator: 0.92
ControlTac (Ours): 0.99
Text2Tac: 0.14

ViT (ImageNet) accuracy on 6 unseen objects. ControlTac achieves 0.99 vs. Simulator's 0.92. Text2Tac performs poorly as free-form generation lacks physical grounding.

Figure 6. The unseen objects used for validation in the downstream object classification benchmark.

Task 02

Force Estimation

3D contact force estimation (1–10 N, 0.1 N precision) is the most sensitive regression benchmark given GelSight's high sensitivity to subtle force variations. ControlTac synthesizes 15k–30k images to co-train a force estimator with varying subsets of real data.

Key result: Supplementing only 1/3 of the real dataset with ControlTac data matches full-dataset performance, whereas using that real subset alone yields poor results. ControlTac effectively covers the pose and force variation space that real data struggles to provide at scale.
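The co-training setup described above can be sketched as a simple dataset mix: keep a fraction of the real samples and add the generated ones. The helper name `build_cotraining_set`, the real dataset size of 3,000, and the placeholder sample tags are all hypothetical. Only the 1/3 fraction and the 15k synthetic count come from the text.

```python
import random

def build_cotraining_set(real, synthetic, real_fraction, seed=0):
    """Keep a random fraction of the real samples and add every synthetic sample."""
    rng = random.Random(seed)
    n_real = round(len(real) * real_fraction)
    subset = rng.sample(real, n_real)
    mixed = subset + list(synthetic)
    rng.shuffle(mixed)                      # interleave real and synthetic samples
    return mixed

# Placeholders: each sample is a (source, index) tag standing in for an (image, force) pair.
real = [("real", i) for i in range(3000)]
synthetic = [("synthetic", i) for i in range(15000)]
mixed = build_cotraining_set(real, synthetic, real_fraction=1 / 3)
n_real_kept = sum(1 for source, _ in mixed if source == "real")
```

Sweeping `real_fraction` while holding the synthetic set fixed is one way to reproduce the kind of data-ratio comparison reported for this task.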

Figure 7. Force estimation performance comparison across different data ratio configurations.

Task 03

Pose Estimation

The 2D pose estimator (X and Y position plus rotation angle θ) generalizes to unseen objects. Because every generated image is conditioned on a 2D contact mask, ControlTac obtains pose labels from the masks automatically during synthesis, which makes large labeled datasets cheap to produce.

Estimators trained solely on ControlTac-generated data achieve strong performance across all objects and outperform models trained on real data alone, because the diverse synthetic set covers contact poses and force variations that the real data misses.
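The text does not spell out how pose labels are read off the contact masks. One standard option, shown here purely as an assumption, is to use image moments: the centroid gives position and the principal axis gives orientation. The function name `pose_from_mask` is hypothetical.

```python
import numpy as np

def pose_from_mask(mask):
    """Estimate (x, y, angle_deg) of a binary mask from its first and second image moments."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()              # centroid in pixel coordinates
    mu20 = np.mean((xs - cx) ** 2)
    mu02 = np.mean((ys - cy) ** 2)
    mu11 = np.mean((xs - cx) * (ys - cy))
    # Orientation of the principal axis; only defined modulo 180 degrees.
    angle = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return float(cx), float(cy), float(np.rad2deg(angle))

bar = np.zeros((9, 9), dtype=np.uint8)
bar[4, 2:7] = 1                                # horizontal bar centered at (4, 4)
diag = np.eye(9, dtype=np.uint8)               # diagonal line from top-left to bottom-right

bar_pose = pose_from_mask(bar)
diag_pose = pose_from_mask(diag)
```

For rotationally symmetric contacts, such as a circular imprint, the moment-based angle is ill-defined. In that case the transform parameters applied to the mask are a more reliable label.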

Pose Estimation — Full Results Including Unseen Objects

ControlTac generalizes to unseen shapes (T-shape, Type-C connector) and new sensor instances. Using varying force generation (unfixed) consistently outperforms fixed-force training, covering real-world dynamics more comprehensively.

Method	Cylinder (3 Types)			Cross			T-shape (Unseen)			USB (Unseen)
	X	Y	Ang	X	Y	Ang	X	Y	Ang	X	Y	Ang
PCA	15	13	22	56	19	18	—	—	—	—	—	—
Real	8	8	4	6	6	2	—	—	—	—	—	—
Simulator	17	15	6	19	18	5	—	—	—	—	—	—
Sim + Real	12	13	6	17	16	4	—	—	—	—	—	—
Ours (fixed force)	9	8	5	7	9	4	—	—	—	—	—	—
Ours (unfixed force)	4	5	3	3	4	1	4	5	2	5	4	3

Table 2. Pose estimation errors (X/Y in mm, Angle in degrees). ControlTac (unfixed force) outperforms all baselines including models trained on real data, and generalizes to unseen objects on a new sensor instance.

Real-World Experiments

Four Real-World Deployment Tasks

Estimators and policies trained solely on ControlTac-augmented data are deployed across four challenging real-world tasks, demonstrating robust generalization, millimeter-level precision, and practical utility in dynamic settings.

Real World · Task 01

Imitation Learning: Pick & Peg-in-Hole

We train ACT-based visuo-tactile policies on a multi-stage Pick and Peg-in-Hole task using third-view RGB, wrist RGB, and tactile inputs. Force-controlled trajectory-level augmentation (1–9× copies per trajectory with random force offsets) significantly improves policy robustness against contact inconsistencies in teleoperated demonstrations. With 70–100 real demonstrations, tactile augmentation consistently outperforms real-only training; with 100 demos and 9× augmentation, the success rate reaches 72%, a 12-percentage-point gain over no augmentation (60%), evaluated across 25 real-robot trials per condition.

Best success rate (9× augmentation): 72%
No-augmentation baseline: 60%
Improvement: +12 pp


Figure 8. Success rate comparison for Pick & Peg-in-Hole task under different augmentation scales.
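The trajectory-level augmentation used for this task can be sketched as below. The sketch assumes that each augmented copy shares a single random force offset across all of its frames. The text only says "random force offsets", so per-frame offsets are equally possible. The helper name, the offset range, and the demo force profile are hypothetical.

```python
import random

def augment_trajectory(forces, n_copies, max_offset, seed=0):
    """Return n_copies force profiles, each shifted by one random offset shared by all frames."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        offset = rng.uniform(-max_offset, max_offset)  # one offset per copy (trajectory level)
        copies.append([f + offset for f in forces])
    return copies

demo_forces = [0.0, 1.5, 3.0, 3.2, 2.8]   # hypothetical per-frame force in newtons
augmented = augment_trajectory(demo_forces, n_copies=9, max_offset=0.5)
```

Each shifted force profile would then be passed to the force-control stage to regenerate the tactile frames of that copy, while the RGB observations and actions of the demonstration stay unchanged.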

Real World · Task 02

Object Weighing

A UR5 robot pushes four objects (1 kg metal weight, two water-filled cylinders of 0.50 kg and 0.56 kg, and a 0.63 kg glass bottle) at constant speed. An ATI Axia80 force sensor provides ground truth readings. The estimator trained on ControlTac-generated images — without any real training data from these objects — predicts the pushing force in real time from tactile input alone, achieving accuracy within 0.1 N of the real-data-trained model. This demonstrates robust generalization across complex and diverse material properties.

<0.1N

Force Error

Ground truth: ATI Axia80 force sensor

Real World · Task 03

Real-time Pose Tracking

We track the 2D contact pose (position + rotation angle) of printed cylinder, cross, and T-shape objects as they undergo continuous, in-hand rotation and translation against the sensor. The pose estimator — trained entirely on ControlTac-generated data — runs at 10 Hz in real time, smoothly tracking across the full range of poses and dynamic motion. This task highlights ControlTac's practical utility: accurate 10 Hz tactile-based tracking was previously only achievable with large annotated real datasets.

Tracking rate: 10 Hz
Deployment: real-time

Real World · Task 04

High-Precision Object Insertion

An XArm7 robot equipped with dual GelSight Mini tactile sensors grasps objects at random angles and performs high-precision peg-in-hole insertion with a tight 3 mm tolerance. The pose estimator — trained on ControlTac data — predicts in-hand object pose from tactile feedback, and the robot compensates for grasping uncertainty by rotating and translating the end-effector to align the object vertically above the hole. ControlTac-trained models achieve up to 90% success rate across four object types, including a daily-use Type-C USB connector, demonstrating millimeter-level real-world manipulation from single-image augmentation.
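The compensation step can be illustrated with plain 2D rigid-transform algebra. Suppose the estimated in-hand pose is a rotation by θ followed by a translation (x, y) away from the nominal grasp. The correction is then the inverse transform. The function names, the shared planar frame, and the units are assumptions. A real system would also need sensor-to-end-effector calibration, which is omitted here.

```python
import math

def correction_for(x, y, theta_deg):
    """Inverse of the 2D rigid transform (rotate by theta_deg, then translate by (x, y))."""
    t = math.radians(-theta_deg)
    cx = -(math.cos(t) * x - math.sin(t) * y)
    cy = -(math.sin(t) * x + math.cos(t) * y)
    return cx, cy, -theta_deg

def apply_pose(pose, point):
    """Apply a pose (x, y, theta_deg) to a 2D point: rotate first, then translate."""
    x, y, theta_deg = pose
    t = math.radians(theta_deg)
    px, py = point
    return (math.cos(t) * px - math.sin(t) * py + x,
            math.sin(t) * px + math.cos(t) * py + y)

estimated = (1.5, -0.8, 12.0)             # hypothetical estimator output (mm, mm, degrees)
correction = correction_for(*estimated)
tip = (0.0, 10.0)                         # hypothetical point on the object, nominal frame
realigned = apply_pose(correction, apply_pose(estimated, tip))
```

Applying the correction after the estimated offset maps the point back to its nominal position, which is the alignment the robot needs before lowering the object into the hole.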

Cylinder success rate: 90%
Cross/T-shape success rate: 85%
Type-C success rate: 75%
