Vision-based tactile sensing is widely used in perception, reconstruction, and robotic manipulation, yet collecting large-scale tactile data remains costly due to diverse sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, namely simulation and free-form tactile generation, often yield unrealistically rendered signals that transfer poorly to highly dynamic real-world tasks. We propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. By grounding generation in these physical priors, ControlTac produces realistic samples that capture task-relevant variations. Across three downstream tasks and three real-world experiments, datasets augmented with ControlTac consistently improve performance and demonstrate practical utility in dynamic real-world settings.
ControlTac consists of two key components: (a) Force-Control: the background-removed tactile image x is fed into a DiT model conditioned on the 3D contact force ΔF to generate force-specific tactile variations. (b) Position-Control: the DiT pretrained in the first stage is fine-tuned with ControlNet, conditioned on a contact mask c, to synthesize realistic tactile images y_B under different contact positions and forces.
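As a rough illustration of these two stages, the sketch below (plain PyTorch, not the authors' code) shows a DiT-style block whose adaptive LayerNorm is modulated by the 3D force condition, plus a ControlNet-style branch that injects the contact mask through a zero-initialized projection so that fine-tuning starts from the pretrained model's behavior. All module and variable names here are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of force- and position-conditioning.
import torch
import torch.nn as nn

class ForceConditionedBlock(nn.Module):
    """Transformer block modulated by the 3D contact-force condition (adaLN-style)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Map the 3D force ΔF (timestep embedding omitted for brevity) to scale/shift pairs.
        self.force_mlp = nn.Sequential(nn.Linear(3, dim), nn.SiLU(), nn.Linear(dim, 4 * dim))

    def forward(self, tokens: torch.Tensor, delta_force: torch.Tensor) -> torch.Tensor:
        shift1, scale1, shift2, scale2 = self.force_mlp(delta_force).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(tokens) * (1 + scale1) + shift1
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(tokens) * (1 + scale2) + shift2
        return tokens + self.mlp(h)

class ContactMaskBranch(nn.Module):
    """ControlNet-style branch: encode the contact mask, add it through a zero-initialized projection."""
    def __init__(self, dim: int, patch: int = 16):
        super().__init__()
        self.encode = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify the mask
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)  # zero-init so fine-tuning starts as identity
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        m = self.encode(mask).flatten(2).transpose(1, 2)  # (B, N, dim), same token grid
        return tokens + self.zero_proj(m)

# Toy usage: a 224x224 tactile image tokenized into 14x14 patches of width 256.
tokens = torch.randn(2, 196, 256)
delta_force = torch.randn(2, 3)            # target 3D contact force condition
contact_mask = torch.rand(2, 1, 224, 224)  # binary mask of the desired contact position
out = ContactMaskBranch(256)(ForceConditionedBlock(256)(tokens, delta_force), contact_mask)
print(out.shape)  # torch.Size([2, 196, 256])
```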
Here, we demonstrate how to annotate the contact mask to represent the contact position.
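One plausible way to construct such a mask, sketched below with OpenCV (an assumption for illustration, not necessarily the annotation procedure used here), is to difference the reference tactile frame against the sensor's no-contact background, threshold the result into a binary contact region, and then translate or rotate that region to encode the target contact position.

```python
# Illustrative contact-mask annotation sketch; function names and thresholds are assumptions.
import cv2
import numpy as np

def annotate_contact_mask(tactile: np.ndarray, background: np.ndarray,
                          thresh: int = 15) -> np.ndarray:
    """Binary mask of the contact region (255 inside contact, 0 elsewhere)."""
    diff = cv2.absdiff(tactile, background)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes

def move_contact_mask(mask: np.ndarray, dx: float, dy: float,
                      angle_deg: float = 0.0) -> np.ndarray:
    """Translate/rotate the mask to encode a different target contact position."""
    h, w = mask.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rot[:, 2] += (dx, dy)
    return cv2.warpAffine(mask, rot, (w, h), flags=cv2.INTER_NEAREST)

# Example: mask from the reference frame, then shifted 20 px as the position condition.
# ref, bg = cv2.imread("ref_tactile.png"), cv2.imread("background.png")
# target_mask = move_contact_mask(annotate_contact_mask(ref, bg), dx=20, dy=0)
```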
We conduct a qualitative comparison between ControlTac and other generators and simulators. ControlTac exhibits superior realism, variation, and controllability in the generated tactile images.
The first column shows 3D previews of six objects, followed by the input tactile image (Ref. Image) in the second column and the Contact Mask in the third column. The fourth column displays the initial force (top) and target force (bottom). Subsequent columns present the Ground Truth (G.T.) and results from ControlTac, the hybrid force-position conditional diffusion model (Hybrid), the separate-control pipeline (Separate), and simulation results from Taxim (Si & Yuan, 2022). In the upper part, we visualize the generated images for comparison; in the lower part, we show the error maps highlighting differences from the ground-truth tactile image.
The figure below showcases generation results from the force-controlled and position-controlled components of ControlTac.
The figure below clearly demonstrates that ControlTac can generate a diverse range of tactile images from a single reference tactile image.
The figure below demonstrates that ControlTac can cover the variation in contact positions and forces and substantially reduces MAE even with small real subsets. With only a third of the real data, performance becomes competitive with training on the full real dataset, whereas the same small real subset alone performs much worse because it cannot cover all the variations in forces and positions. It is worth noting that combining all real and generated data performs slightly worse than using the full real dataset alone; this is because FeelAnyForce already achieves near-oracle performance when the real data fully covers forces and positions, even though collecting such coverage in the real world is challenging.
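As a concrete illustration of this recipe, here is a minimal sketch of mixing a one-third real subset with ControlTac-generated samples for training a force estimator. The placeholder tensors, dataset sizes, and loader settings are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: keep a small real subset, fill force/position coverage with generated data.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

# Placeholder datasets of (tactile image, 3D force label) pairs.
real = TensorDataset(torch.randn(900, 3, 224, 224), torch.randn(900, 3))
generated = TensorDataset(torch.randn(2700, 3, 224, 224), torch.randn(2700, 3))

# Use only one third of the real data; generated samples supply the remaining coverage.
real_subset = Subset(real, torch.randperm(len(real))[: len(real) // 3].tolist())
train_loader = DataLoader(ConcatDataset([real_subset, generated]),
                          batch_size=64, shuffle=True)

# A force estimator (e.g., a CNN or ViT regressor) would then be trained on
# train_loader with an L1 (MAE) loss against the 3D force labels.
```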
We further validate the effectiveness of ControlTac in real-world pushing experiments. A force estimator trained only on generated tactile data achieves performance comparable to one trained on real tactile data, demonstrating that the generated data is realistic and reliable enough to be used directly for training in practical scenarios.
As shown in the table below, pose estimators trained solely on tactile images generated by ControlTac achieve strong performance across all objects, including the unseen T-shape and the USB tested on a new sensor instance. Remarkably, training on the same amount of generated data outperforms training on real data alone, even when the real dataset is relatively large, because capturing tactile data that fully covers all contact variations in the dynamic real world is extremely challenging. In such cases, generated data is particularly valuable, since images covering all desired contact positions can be generated.
Furthermore, ControlTac not only outperforms simulation-based data from Taxim (Si & Yuan, 2022), whose simulated images are less realistic, but also surpasses traditional PCA-based pose estimation (She et al., 2021). We also evaluate the pose estimator under varying versus fixed forces (the "fixed" setting in the table uses the median value of 6.5 N). Results show that varying the force improves performance, since it covers the force variations encountered in real-world scenarios.
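For illustration, a small sketch of how the generation conditions for such a pose-estimation training set might be sampled: the position offsets and the 3-10 N force range below are assumptions, with the fixed setting pinned at the 6.5 N median mentioned above.

```python
# Sketch of sampling (position, orientation, force) conditions for generation.
import numpy as np

rng = np.random.default_rng(0)

def sample_conditions(n: int, fixed_force: bool = False) -> np.ndarray:
    x = rng.uniform(-8.0, 8.0, n)           # contact position offsets (mm), illustrative
    y = rng.uniform(-8.0, 8.0, n)
    theta = rng.uniform(-180.0, 180.0, n)   # in-plane orientation (deg)
    force = np.full(n, 6.5) if fixed_force else rng.uniform(3.0, 10.0, n)  # normal force (N)
    return np.stack([x, y, theta, force], axis=1)

conditions = sample_conditions(5000)        # one generated tactile image per row
```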
To further evaluate the performance of the pose estimator trained with ControlTac-generated data, we conducted a real-time pose tracking experiment. Our model successfully tracked poses at a frequency of 10 Hz, highlighting its practicality in dynamic real-world scenarios.
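For reference, a minimal sketch of such a real-time loop, assuming a hypothetical sensor.get_frame() interface and a trained pose_estimator that maps a tactile image to an in-plane pose; the rate limiting simply sleeps off the remainder of each 100 ms period.

```python
# Sketch of a 10 Hz pose-tracking loop; the sensor and model objects are placeholders.
import time
import torch

def track_pose(sensor, pose_estimator, hz: float = 10.0):
    """Run the pose estimator on live tactile frames at a fixed rate."""
    period = 1.0 / hz
    with torch.no_grad():
        while True:
            start = time.time()
            frame = sensor.get_frame()  # assumed: HxWx3 uint8 tactile image
            image = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            pose = pose_estimator(image)  # e.g., predicted (x, y, theta)
            print(pose.squeeze().tolist())
            # Sleep off the remainder of the control period to hold ~10 Hz.
            time.sleep(max(0.0, period - (time.time() - start)))
```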
In the Precise Insertion task, the pose estimator trained with ControlTac-generated data achieved success rates of 90% on the cylinder and 85% on the cross. Notably, it achieved success rates of 85% on the unseen T-shape and 75% on the Type-C connector.
In the object classification task, we find that data augmentation with ControlTac yields significantly better performance than traditional augmentation methods, whether the classifier is a simple CNN, a ViT trained from scratch, or a ViT pretrained on ImageNet.
Note: G = geometric augmentation; C = color augmentation; Gen = our ControlTac-based augmentation method.
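For context, the geometric (G) and color (C) baselines are typically implemented with standard torchvision transforms of the kind shown below (illustrative settings, not necessarily the exact ones used here); the ControlTac route (Gen) instead adds generated tactile images directly to the training set.

```python
# Illustrative geometric (G) and color (C) augmentation baselines.
import torchvision.transforms as T

geometric_aug = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),
])
color_aug = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
])
```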
@article{luo2025controltac,
title={ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image},
author={Luo, Dongyu and Yu, Kelin and Shahidzadeh, Amir-Hossein and Fermuller, Cornelia and Aloimonos, Yiannis and Gao, Ruohan},
journal={arXiv preprint arXiv:2505.20498},
year={2025}
}