Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic outputs and poor transferability to downstream tasks.
To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With these physical priors as control inputs, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach.
ControlTac consists of two key components: (a) Force-Control: We input the background-removed tactile image x into the DiT model, conditioned on the 3D contact force ΔF, to generate force-specific tactile variations. (b) Position-Control: We transfer the pretrained DiT from stage one and fine-tune it using ControlNet, conditioned on a contact mask c, to synthesize realistic tactile images y_B under different contact positions and forces.
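Below is a minimal, hypothetical sketch of the two-stage conditioning interface in PyTorch. The class and function names (ForceConditioner, stage2_control_input), the embedding dimension, and the channel-wise stacking of the mask with the reference image are illustrative assumptions; the actual DiT backbone and ControlNet fine-tuning are not reproduced here.

```python
# Sketch of the two-stage conditioning interface (assumed names and shapes;
# the real DiT backbone and ControlNet weights are not shown).
import torch
import torch.nn as nn

class ForceConditioner(nn.Module):
    """Embeds the 3D contact force ΔF into a token the DiT can attend to."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, delta_f: torch.Tensor) -> torch.Tensor:
        # delta_f: (B, 3) -> (B, 1, dim) conditioning token
        return self.mlp(delta_f).unsqueeze(1)

def stage2_control_input(x_ref: torch.Tensor, contact_mask: torch.Tensor) -> torch.Tensor:
    """ControlNet-style spatial conditioning: stack the reference image and the
    binary contact mask along the channel dimension."""
    # x_ref: (B, 3, H, W), contact_mask: (B, 1, H, W) -> (B, 4, H, W)
    return torch.cat([x_ref, contact_mask], dim=1)

# Shape check: one reference image can be expanded into many variations
# by sweeping delta_f and contact_mask.
x = torch.rand(2, 3, 224, 224)            # background-removed reference images
dF = torch.tensor([[0.0, 0.0, 2.5]] * 2)  # contact force (Fx, Fy, Fz), in N
mask = torch.zeros(2, 1, 224, 224)        # contact masks for the target positions
print(ForceConditioner()(dF).shape, stage2_control_input(x, mask).shape)
```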
Here, we demonstrate how to annotate the contact mask to represent the contact position.
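As an illustration, a contact mask can be rasterized as a filled shape at the annotated contact position. The elliptical parameterization and the OpenCV-based drawing below are assumptions made for this sketch; the exact annotation procedure may differ.

```python
# Illustrative sketch of building a binary contact mask from an annotated
# contact position (center, half-axes, angle). The ellipse parameterization
# is an assumption, not the paper's exact annotation procedure.
import numpy as np
import cv2

def make_contact_mask(h, w, center_xy, axes, angle_deg):
    """Return an (h, w) uint8 mask with 1s inside the annotated contact region."""
    mask = np.zeros((h, w), dtype=np.uint8)
    # Draw a filled ellipse: center, half-axes, rotation, full arc, value 1.
    cv2.ellipse(mask, center_xy, axes, angle_deg, 0, 360, 1, -1)
    return mask

# Example: a contact centered at (120, 90) with 40x25 px half-axes, rotated 30°.
mask = make_contact_mask(240, 320, center_xy=(120, 90), axes=(40, 25), angle_deg=30)
```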
We conduct a qualitative comparison between ControlTac and other generators and simulators. ControlTac exhibits superior realism, variation, and controllability in the generated tactile images.
The first column displays 3D previews of six objects, followed by the input tactile image (Ref. Image) in the second column and the Contact Mask in the third column. The fourth column shows the initial force (top) and target force (bottom). Subsequent columns depict the Ground Truth (G.T.) and results from ControlTac, the hybrid force-position conditional diffusion model (Hybrid), and the separate-control pipeline (Separate). In part A), we visualize the generated images for comparison; in part B), we visualize the error maps highlighting the differences from the ground-truth tactile image.
The figure below showcases the generation results of force-controlled and position-controlled components in ControlTac.
The figure below clearly demonstrates that ControlTac can generate a diverse range of tactile images from a single reference tactile image.
The left figure illustrates how the force-controlled component in ControlTac augments 1,000 real samples with a larger set of generated tactile images, leading to a substantial reduction in MAE compared to using real data alone. Notably, by incorporating the generated data, the model achieves comparable performance to training on the full real dataset (20,000 images) using only 8,000 real samples. This demonstrates that the generated data effectively enrich the force distribution at each contact position, thereby enhancing the training of the force estimator. Moreover, combining a larger quantity of both real and generated data yields the best overall performance, underscoring the realism and utility of the generated samples.
Building on the force-controlled component, we further integrate the position-controlled component of ControlTac. To highlight the importance of diverse contact positions in training a robust force estimator, we divide the real dataset by contact angle, since tactile image appearance varies across different contact angles. The right figure presents the MAE of force estimation under different training conditions. The results show that incorporating position-controlled generation effectively compensates for limited angular diversity in real data, significantly improving performance even when only a small subset of real images is available—especially in scenarios where the real data covers a narrow range of angles.
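For concreteness, the following sketch shows how real and ControlTac-generated (image, force) pairs might be pooled to train a force estimator with an L1 (MAE) objective. The dataset sizes, image resolution, and the tiny CNN regressor are placeholders for the sketch, not the setup used in the experiments.

```python
# Minimal sketch of mixing real and generated (image, force) pairs to train a
# force estimator evaluated with MAE. Random tensors stand in for the datasets
# (counts scaled down here), and the small CNN is a placeholder estimator.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

real = TensorDataset(torch.rand(100, 3, 64, 64), torch.rand(100, 3))       # real pairs
generated = TensorDataset(torch.rand(800, 3, 64, 64), torch.rand(800, 3))  # ControlTac pairs
loader = DataLoader(ConcatDataset([real, generated]), batch_size=64, shuffle=True)

model = nn.Sequential(                      # tiny CNN regressor for a 3D force
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
l1 = nn.L1Loss()                            # L1 on force components corresponds to MAE

for images, forces in loader:               # one training step shown for brevity
    optimizer.zero_grad()
    l1(model(images), forces).backward()
    optimizer.step()
    break
```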
We further validate the effectiveness of ControlTac in real-world pushing experiments. The force estimator trained only with generated tactile data achieves comparable performance to the one trained on real tactile data, demonstrating that the generated data is realistic and reliable enough to be used directly for training in practical scenarios.
As shown in the table below, pose estimators trained on tactile images generated by ControlTac achieve strong performance across all objects, including the unseen T Shape. In particular, using a larger amount of generated data leads to better results than using real data alone, since the generated data is sufficiently realistic and covers a much wider range of contact positions and forces. We also compare the performance of the pose estimator trained with varying forces versus a fixed force (denoted as fixed in the table below, where the fixed force is set to the median value of 6.5 N). The results show that using varying forces yields better performance, as contact force naturally changes during inference.
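The sketch below illustrates such an augmentation loop: for each annotated contact mask (which fixes the pose label), tactile images are generated at forces sampled from a range rather than at a fixed 6.5 N. The controltac_generate wrapper, the force range, and the pose parameterization are hypothetical.

```python
# Sketch of an augmentation loop for pose estimation: sample varied forces and
# contact positions and query the generator. `controltac_generate` is a
# hypothetical wrapper around the two-stage model, not an actual API.
import random

def controltac_generate(ref_image, delta_f, contact_mask):
    """Placeholder for the two-stage generator; returns a tactile image."""
    ...

def augment_for_pose(ref_image, annotated_masks, n_per_mask=10, f_range=(3.0, 10.0)):
    """Build (image, pose) pairs; each contact mask encodes a known in-plane pose."""
    samples = []
    for mask, pose in annotated_masks:       # pose = (x, y, theta) used as the label
        for _ in range(n_per_mask):
            fz = random.uniform(*f_range)    # varied normal force instead of a fixed 6.5 N
            img = controltac_generate(ref_image, (0.0, 0.0, fz), mask)
            samples.append((img, pose))
    return samples
```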
To further evaluate the performance of the pose estimator trained with ControlTac-generated data, we conducted a real-time pose tracking experiment. Our model successfully tracked poses at a frequency of 10 Hz, highlighting its practicality in dynamic real-world scenarios.
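A fixed-rate tracking loop of this kind could look like the following sketch; read_tactile_frame and pose_model are placeholders for the sensor driver and the trained estimator, and the 10 Hz rate matches the figure reported above.

```python
# Sketch of a fixed-rate tracking loop around a trained pose estimator.
# `read_tactile_frame` and `pose_model` are placeholders, not real APIs.
import time

def track_poses(pose_model, read_tactile_frame, rate_hz=10.0, duration_s=5.0):
    period = 1.0 / rate_hz
    poses = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        t0 = time.time()
        frame = read_tactile_frame()      # latest tactile image from the sensor
        poses.append(pose_model(frame))   # (x, y, theta) estimate
        time.sleep(max(0.0, period - (time.time() - t0)))  # hold the target rate
    return poses
```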
In the Precise Insertion task, the pose estimator trained with ControlTac-generated data achieved success rates of 90% on the cylinder and 85% on the cross. Notably, it also reached an 85% success rate on the unseen T-shape.
In the object classification task, we found that compared to traditional augmentation methods, using ControlTac for data augmentation yields significantly better performance—whether with a simple CNN classifier, a ViT trained from scratch, or a ViT pretrained on ImageNet.
Note: G = geometric augmentation; C = color augmentation; Gen = our ControlTac-based augmentation method.
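For reference, the baseline augmentations abbreviated in the table can be expressed with standard torchvision transforms; the specific parameters below are illustrative, and "Gen" corresponds to appending offline ControlTac-generated images to the training split rather than applying an on-the-fly transform.

```python
# Sketch of the baseline augmentations abbreviated in the table: G (geometric)
# and C (color) via torchvision; parameter values are illustrative only.
from torchvision import transforms

geometric = transforms.Compose([        # "G"
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])
color = transforms.Compose([            # "C"
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])
# "Gen": no on-the-fly transform; ControlTac images generated offline at new
# forces and positions are added to the training set instead.
```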
@article{luo2025controltac,
  title={ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image},
  author={Luo, Dongyu and Yu, Kelin and Shahidzadeh, Amir-Hossein and Fermuller, Cornelia and Aloimonos, Yiannis and Gao, Ruohan},
  journal={arXiv preprint arXiv:2505.20498},
  year={2025}
}