TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors

Shaoxiong Wang, Mike Lambeta, Po-Wei Chou, Roberto Calandra · MIT / Meta AI · IEEE RA-L 2022 · arXiv:2012.08456

One-liner. TACTO renders high-resolution color+depth tactile imprints for vision-based touch sensors (DIGIT, OmniTact) at up to 200 FPS by bridging a rigid-body physics engine (PyBullet) to a GPU OpenGL renderer (Pyrender) — turning million-scale tactile-data collection (infeasible on real hardware) into a one-day simulation job, and giving the touch-sensing community its OpenAI-Gym-style playground.

Problem & motivation

Simulators have driven progress in robot control and learning, but tactile sensing — especially high-resolution vision-based sensors like GelSight, DIGIT, and OmniTact — has lagged. Accurately simulating this sensor family requires modeling not just contact dynamics but also the optical properties of the elastomer gel and its illumination, which is hard. Prior low-dimensional tactile sims (BioTac, fabric/iCub skins in Gazebo) modeled arrays of independent tactels, which scales poorly — barely faster than real-time — and doesn't capture the millions of "tactels" a vision-based sensor effectively provides. The gap: no simulator that is simultaneously fast (hundreds of FPS), flexible (multiple sensor form factors/optics), realistic, and easy to use, integrated with a physics engine for active interaction during grasping/manipulation.

Method

Design desiderata. Four targets drive the architecture: high-throughput (hundreds of FPS), flexible (support varied optics/geometry: complex gel shapes, mirrors, transparent light-piping cases), realistic (illumination, shadows, deformation on the contact boundary), and easy to use.

Architecture (Fig 2). TACTO sits between a physics simulator and a back-end rendering engine, configured per sensor through config files. The default physics engine is PyBullet (other engines pluggable; the only required hooks are object/link poses and contact forces). The renderer is Pyrender (a lightweight OpenGL interface with GPU support). The workflow has three phases (Fig 3):

Initialize: load the sensor config (camera pose/FOV/clipping, lights pose/color/intensity, gel mesh, noise level, force→deformation mapping) and set up the renderer.
Create scene: PyBullet loads object URDFs; TACTO parses the URDF (via urdfpy), extracts object meshes, and adds them plus the gel surface, cameras, and lights into the OpenGL scene.
Step simulation: PyBullet computes physics; TACTO synchronizes object/sensor poses into the renderer and fetches the rendered imprint.
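
The three phases above can be sketched with stand-in classes. This is a minimal sketch, not TACTO's code: ToyPhysics, ToyRenderer, the config keys, and the one-number "imprint" are all hypothetical placeholders invented for these notes, and no PyBullet or Pyrender call is used. The point it illustrates is the split between one-time scene setup and per-step pose synchronization.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRenderer:
    """Hypothetical stand-in for the rendering back end."""
    meshes: dict = field(default_factory=dict)  # name -> mesh payload
    poses: dict = field(default_factory=dict)   # name -> (x, y, z)
    mesh_loads: int = 0                         # counts expensive uploads

    def add_mesh(self, name, mesh):
        self.meshes[name] = mesh
        self.mesh_loads += 1

    def set_pose(self, name, pose):
        self.poses[name] = pose                 # cheap per-step update

    def render(self, gel_name, obj_name):
        # Toy "imprint": how far the object sits below the gel surface in z.
        gel_z = self.poses[gel_name][2]
        obj_z = self.poses[obj_name][2]
        return max(0.0, gel_z - obj_z)


class ToyPhysics:
    """Hypothetical stand-in for the physics engine: owns poses, steps time."""

    def __init__(self):
        self.poses = {"gel": (0.0, 0.0, 0.0), "ball": (0.0, 0.0, 0.003)}

    def step(self, dz=-0.001):
        x, y, z = self.poses["ball"]
        self.poses["ball"] = (x, y, z + dz)     # ball moves toward the gel


def run(num_steps=5):
    # 1) Initialize: load a sensor config (keys are illustrative only).
    config = {"gel_mesh": "gel.obj", "resolution": (160, 120)}
    renderer = ToyRenderer()
    physics = ToyPhysics()

    # 2) Create scene: every mesh is loaded exactly once.
    renderer.add_mesh("gel", config["gel_mesh"])
    renderer.add_mesh("ball", "ball.obj")

    # 3) Step simulation: physics step, pose sync, render.
    imprints = []
    for _ in range(num_steps):
        physics.step()
        for name, pose in physics.poses.items():
            renderer.set_pose(name, pose)
        imprints.append(renderer.render("gel", "ball"))
    return renderer.mesh_loads, imprints
```

Calling run(5) performs two mesh loads regardless of the number of steps; only poses change inside the loop, which is the property the synchronized-scene design relies on.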

The key throughput trick: render from a synchronized scene, not from depth maps. The paper compares three ways to produce the RGB imprint: (1) Phong's reflection model applied to a depth map (simple, but it assumes a single light bounce and is hard to extend to mirrors/refraction); (2) OpenGL-from-depth (powerful but I/O-bound: reloading meshes from depth maps each step caps throughput at ~20 FPS even on GPU); (3) OpenGL on a synchronized scene (chosen). Option 3 preloads the gel and object meshes into OpenGL once, then updates only their poses each step (fast) and overlaps the gel geometry with the contact object to extract depth+RGB. This reaches up to 200 FPS. It assumes rigid objects (the gel is softer than everyday objects, so object deformation is negligible); very deformable objects fall back to the slower render-from-depth path.

Force-dependent deformation. PyBullet's rigid-body contact model yields contact forces; TACTO maps force→gel deformation depth via a user-supplied function. The default is a piece-wise linear map (approximating real elastomer linear elasticity within range) with a lower threshold (below which the sensor doesn't deform) and an upper saturation threshold. Light forces → small deformation; higher forces → more.
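
A piece-wise linear map of the kind described above might look like the following. This is a sketch under stated assumptions: the function name, the thresholds (0.5 N and 10 N), and the 1 mm maximum depth are illustrative values chosen here, not numbers from the paper or from TACTO's config files.

```python
def force_to_depth(force_n, f_min=0.5, f_max=10.0, max_depth_mm=1.0):
    """Map a contact force (N) to a gel deformation depth (mm).

    All constants are illustrative assumptions, not TACTO defaults.
    """
    if force_n <= f_min:
        return 0.0                      # below the lower threshold: no deformation
    if force_n >= f_max:
        return max_depth_mm             # above the upper threshold: saturated
    # Linear ramp between the two thresholds.
    return max_depth_mm * (force_n - f_min) / (f_max - f_min)
```

Because the map is user-supplied, a non-linear or sensor-characterized function could be swapped in with the same signature.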

Realism extras. Shadow rendering is optional (Fig 8, ~2 ms extra on GPU). Calibration from real sensors (Fig 6) computes the pixel-wise difference between simulated images with and without touch, then adds that difference to a reference background image from the real sensor, which tunes the output toward a specific sensor's non-uniform illumination (Fig 7). The authors acknowledge that contact-boundary deformation is hard to model exactly and suggest mitigating it by smoothing object meshes in preprocessing, data augmentation, or generative refinement.
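
The calibration step can be written out as simple pixel arithmetic. The sketch below assumes images are H x W lists of (r, g, b) tuples with 8-bit values; the function name and the clamping to 0..255 are choices made here for illustration, and the paper's actual implementation may differ.

```python
def calibrate(sim_touch, sim_no_touch, real_background):
    """Add the simulated (touch - no touch) difference onto a real background.

    Images are H x W lists of (r, g, b) tuples with values in 0..255.
    """
    out = []
    for row_t, row_n, row_b in zip(sim_touch, sim_no_touch, real_background):
        out_row = []
        for px_t, px_n, px_b in zip(row_t, row_n, row_b):
            out_row.append(tuple(
                min(255, max(0, b + (t - n)))   # clamp to the valid 8-bit range
                for t, n, b in zip(px_t, px_n, px_b)
            ))
        out.append(out_row)
    return out
```

Pixels the contact does not change have a zero difference, so they come out identical to the real background; only the touched region carries simulated content.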

Setup

Datasets / benchmarks: No external benchmark dataset. Two self-generated demonstrations: (a) a 1-million-grasp simulated grasp-stability dataset (orders of magnitude larger than the largest prior real tactile grasp dataset [6, Calandra et al. 2017], which it explicitly compares against); (b) Sim2Real pose-estimation: 10,000 simulated tactile imprints + 200 real imprints with manually annotated poses.
Hardware / simulator: TACTO + PyBullet. Renders DIGIT (Fig 4) and the more complex OmniTact (round surface, 5 cameras, 11 lights; Fig 5). Grasp task: two DIGIT sensors on a WSG-50 parallel-jaw gripper + external camera. Marble task: two DIGIT sensors, upper sensor position-controlled to roll a marble. Real-signal comparisons use a real DIGIT touching balls of 3.7–5.3 mm diameter.
Baselines: For the simulator itself: internal comparison of the three rendering strategies (Phong / OpenGL-from-depth / synchronized-scene) on speed. For control: Bayesian optimization vs. random search on marble manipulation. For Sim2Real pose estimation: Sim2Sim, Real2Real, Sim2Real (with/without augmentation), and Sim+Real at varying real-data counts.
Compute: GPU results use an Nvidia RTX 2080 Super with an Intel i9-9900K; CPU results use an Intel i7-6820HQ. The 1M-grasp collection took ~1 day with 5 threads.

Results

Speed (Table I). Rendering a single DIGIT (12K-face mesh, 1 object in contact) on GPU: 220 FPS at 160×120, 90 FPS at 640×480. Render time grows with output resolution (in the rows below, 16× the pixels costs roughly 2.4× the time, so growth is sublinear in pixel count) but stays flat as mesh size scales from 2K→12K faces; throughput holds even in cluttered scenes (only in-contact object count drives Pyrender cost). Four DIGIT sensors on an Allegro hand render at 50 FPS (160×120, GPU). Physics can run async to speed up further.

Config (1 obj in contact)	Device	Resolution	FPS
Single DIGIT	GPU	160×120	220
Single DIGIT	GPU	320×240	140
Single DIGIT	GPU	640×480	90
Single DIGIT	CPU	160×120	60
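
To read the GPU rows as per-frame cost, the FPS figures can be converted to milliseconds and compared against pixel count. The snippet is plain arithmetic on the numbers in the table; the variable names are just for this note.

```python
# FPS for a single DIGIT on GPU, keyed by (width, height), taken from the table.
fps = {(160, 120): 220, (320, 240): 140, (640, 480): 90}


def frame_ms(frames_per_second):
    """Render time per frame in milliseconds."""
    return 1000.0 / frames_per_second


base_res, top_res = (160, 120), (640, 480)
pixel_ratio = (top_res[0] * top_res[1]) / (base_res[0] * base_res[1])
time_ratio = frame_ms(fps[top_res]) / frame_ms(fps[base_res])

for (width, height), value in fps.items():
    print(f"{width}x{height}: {frame_ms(value):.1f} ms/frame")
print(f"{pixel_ratio:.0f}x pixels -> {time_ratio:.2f}x render time")
```

This gives about 4.5, 7.1, and 11.1 ms per frame, and a 16× increase in pixels for a 2.44× increase in render time, which suggests a sizeable fixed per-frame cost that does not depend on resolution.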

Grasp stability (Fig 10). ResNet-18 models trained on the 1M dataset to predict lift success from vision and/or touch. Findings: touch learns fast with little data; vision needs 3–4 orders of magnitude more data to catch up; two tactile sensors beat one (an object can look stable from one side while slipping on the other); vision+touch is best in most regimes — qualitatively matching the real-robot trends of [6], but now explorable 2 orders of magnitude beyond the largest real dataset.

Marble manipulation (Fig 11). Bayesian optimization learns a controller gain K to roll a marble to target locations in ~50 iterations; converges faster than random search. Whole run is ~8 minutes (6 min optimization + 2 min simulation over 20,000 imprints).
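
For intuition about what "tuning a gain K" means here, the following is a toy version of the random-search baseline only. Everything in it is an assumption made for illustration: the 1-D marble dynamics, the cost (accumulated absolute position error), the gain range, and the function names. It is not the paper's task setup, and Bayesian optimization is not implemented.

```python
import random


def rollout_cost(gain, target=1.0, steps=30, dt=0.1):
    """Accumulated |position error| for a toy 1-D marble under a P-controller."""
    x, cost = 0.0, 0.0
    for _ in range(steps):
        error = target - x
        cost += abs(error)
        x += dt * gain * error      # commanded velocity = gain * error
    return cost


def random_search(num_iters=50, gain_range=(0.0, 20.0), seed=0):
    """Sample gains uniformly and keep the one with the lowest rollout cost."""
    rng = random.Random(seed)
    best_gain, best_cost = None, float("inf")
    for _ in range(num_iters):
        gain = rng.uniform(*gain_range)
        cost = rollout_cost(gain)
        if cost < best_cost:
            best_gain, best_cost = gain, cost
    return best_gain, best_cost
```

In this toy model the error shrinks by a factor of |1 - dt * gain| per step, so a gain of 10 reaches the target in one step while a gain of 0 never moves. A model-based optimizer such as Bayesian optimization would aim to find that region in fewer rollouts than uniform sampling.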

Sim2Real pose estimation (Table II). Sim2Sim achieves 0.41±0.01 mm / 3.48±0.34° error. Sim2Real without augmentation has a clear gap (4.56 mm / 17.64°); color jittering augmentation closes much of it (1.66 mm / 11.60°), reaching parity with Real2Real trained on ~64 real datapoints. Mixing simulated + a little real data (Sim+Real) consistently beats Real2Real at equal real-data counts — e.g. Sim+Real (128) hits 0.52±0.03 mm / 4.14±0.57°, better than Real2Real (128) at 0.76 mm / 4.96° — evidence simulated data improves data efficiency.
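
Color jittering of the kind credited above with narrowing the Sim2Real gap can be sketched as a random per-channel gain and offset. The jitter ranges, the function name, and the image representation (H x W lists of (r, g, b) tuples) are assumptions for this note; the summary does not record the paper's exact augmentation parameters.

```python
import random


def color_jitter(image, rng, max_gain=0.2, max_shift=20):
    """Apply one random per-channel gain and offset to the whole image.

    `image` is an H x W list of (r, g, b) tuples with values in 0..255.
    """
    gains = [1.0 + rng.uniform(-max_gain, max_gain) for _ in range(3)]
    shifts = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [
        [
            tuple(
                min(255, max(0, round(value * g + s)))  # clamp to 8-bit range
                for value, g, s in zip(pixel, gains, shifts)
            )
            for pixel in row
        ]
        for row in image
    ]
```

Applying a fresh jitter to each simulated training image exposes the model to a range of illumination and color casts, so it is less tied to the exact colors of the simulated lights.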

Limitations & open questions

From the authors:

Contact-boundary gel deformation is hard to model because RGB and depth are computed at the same time; mostly affects sharp edges. Future: better (non-linear, sensor-characterized) force→deformation transfer functions.
The synchronized-scene fast path assumes rigid objects; very deformable objects need the slower render-from-depth path.
TACTO renders realistic perceptual outputs but leaves accurate contact dynamics to the underlying physics engine — the "ideal" sim providing both is explicitly out of scope.
Future work: more deformation-model variations, image filtering, comparison across geometries, and transferring grasp policies across objects.

What I noticed reading it:

Sim2Real and grasp results are demonstrated on a single box object / a pen / small balls — deliberately a "proof of concept." Generalization across object geometry and material (the harder Sim2Real regime) is asserted as "straightforward to extend" but not shown.
Realism is validated qualitatively (Figs 7–9, sim-vs-real image overlays) rather than with a quantitative image-similarity / perceptual metric against real sensors. The only hard Sim2Real number is downstream pose-estimation error, which conflates renderer fidelity with the calibration and augmentation pipeline.
The force→deformation map takes PyBullet's rigid-body contact forces as its input, so normal-force fidelity is only as good as the rigid contact model. Shear/friction imprint detail (which real GelSight-class sensors capture richly) is not modeled, which limits usefulness for slip/shear predicates.
Table I reports render time only ("excluding physics simulation"), so the headline 200 FPS is the renderer alone; end-to-end throughput with a stiff contact solver in the loop could be lower.

Why I care

TACTO is infrastructure, not a manipulation method — but it is squarely load-bearing for the thesis that many manipulation predicates (is_grasped, is_inserted, is_slipping, surface_is_rough) are not visually evaluable and live in touch. BLADE learns visual predicate classifiers from RGB-D because that is what was cheaply labelable; the obvious next step is grounding tactile/force predicates, and that needs cheap, abundant tactile data with which to train classifiers. TACTO's whole pitch is exactly that: a 1M-grasp tactile dataset in a day, which is infeasible on real hardware. Concretely:

Predicate grounding in touch. A simulated tactile stream paired with auto-labeled symbolic state (BLADE's every-frame labeling trick) could train is_grasped/is_stable classifiers the way BLADE trained visual ones — TACTO is the data engine for that.
Sim-to-real for tactile policies. The Sim+Real data-efficiency result is the encouraging signal for training tactile diffusion-policy controllers (BLADE's low-level layer) in sim and finetuning on a handful of real touches.
Continuous/force parameters. A BLADE limitation I flagged is that force-modulated continuous parameters sit opaquely inside the diffusion policy; a tactile simulator is a prerequisite for ever making force/contact a first-class part of the abstraction layer.

Cluster I (datasets/benchmarks/simulators) sibling: this is the renderer-side tactile sim; Taxim (example-based GelSight) and TacEx (GelSight in Isaac Sim) are the direct comparison points to ingest next.

Quotable

TACTO – a fast, flexible, and open-source simulator for vision-based tactile sensors. This simulator allows to render realistic high-resolution touch readings at hundreds of frames per second. — Abstract

Our proposed system design builds a synchronized scene from the physical simulator, and directly renders both depth and RGB images in OpenGL. It can achieve high speed with powerful rendering functionalities. — §III-B, OpenGL for synchronized scenes / p.3

It took only one day to collect 1 million grasps with 5 threads … The resulting dataset collected with TACTO is several orders of magnitude larger than any publicly available dataset of grasps using tactile sensing. — §IV-A, Learning Grasp Stability / p.6

Papers cited that should likely be ingested next:

[2] Yuan, Dong, Adelson 2017 — GelSight — the high-resolution sensor TACTO simulates the optics of; in this batch.
[4] Lambeta et al. 2020 — DIGIT — the primary sensor rendered in all experiments; in this batch.
[6] Calandra et al. 2017 — The Feeling of Success (CoRL) — the real-robot grasp-stability work TACTO's 1M-grasp experiment scales up; the direct comparison.
[13] Gomes, Wilson, Luo 2019 — GelSight simulation for sim2real (ViTac workshop) — Phong-model RGB-from-depth baseline TACTO trades against.
[27] Tian et al. 2019 — Manipulation by Feel (ICRA) — the real-robot marble-rolling work the control task mirrors.

Newly ingested in the 2026-06-24 batch — directly relevant:

Taxim — sibling Cluster-I GelSight simulator (example-based / polynomial-lookup rendering); the closest methodological alternative to TACTO's synchronized-scene OpenGL approach.
TacEx — GelSight simulation inside Isaac Sim; the GPU-parallel, RL-throughput successor in the same tactile-sim niche.
GelSight and DIGIT — the two real sensors whose optics/form-factor TACTO is built to reproduce.
ObjectFolder 2.0 and ObjectFolder Benchmark — the multisensory (visual/acoustic/tactile) simulated-object datasets that pursue the same "simulate touch for cheap data + Sim2Real" goal at object-library scale.
Sparsh and T3 — tactile representation/foundation models whose pretraining hunger TACTO-style simulated data could help feed.