One-liner. TACTO renders high-resolution color+depth tactile imprints for vision-based touch sensors (DIGIT, OmniTact) at up to 200 FPS by bridging a rigid-body physics engine (PyBullet) to a GPU OpenGL renderer (Pyrender) — turning million-scale tactile-data collection (infeasible on real hardware) into a one-day simulation job, and giving the touch-sensing community its OpenAI-Gym-style playground.
Simulators have driven progress in robot control and learning, but tactile sensing — especially high-resolution vision-based sensors like GelSight, DIGIT, and OmniTact — has lagged. Accurately simulating this sensor family requires modeling not just contact dynamics but also the optical properties of the elastomer gel and its illumination, which is hard. Prior low-dimensional tactile sims (BioTac, fabric/iCub skins in Gazebo) modeled arrays of independent tactels, which scales poorly — barely faster than real-time — and doesn't capture the millions of "tactels" a vision-based sensor effectively provides. The gap: no simulator that is simultaneously fast (hundreds of FPS), flexible (multiple sensor form factors/optics), realistic, and easy to use, integrated with a physics engine for active interaction during grasping/manipulation.
Design desiderata. Four targets drive the architecture: high-throughput (hundreds of FPS), flexible (support varied optics/geometry: complex gel shapes, mirrors, transparent light-piping cases), realistic (illumination, shadows, deformation on the contact boundary), and easy to use.
Architecture (Fig 2). TACTO sits between a physics simulator
and a back-end rendering engine, configured per sensor through config files.
The default physics engine is PyBullet
(other engines pluggable; the only required hooks are object/link poses and
contact forces). The renderer is Pyrender (a lightweight OpenGL
interface with GPU support). The workflow has three phases (Fig 3):
urdfpy), extracts object meshes, and adds them plus the
gel surface, cameras, and lights into the OpenGL scene.

The key throughput trick: render from a synchronized scene, not from depth maps. The paper compares three RGB-from-depth options: (1) Phong's reflection model (simple but assumes single light bounce, hard to extend to mirrors/refraction); (2) OpenGL-from-depth (powerful but I/O-bound: reloading meshes from depth maps each step caps throughput at ~20 FPS even on GPU); (3) OpenGL on a synchronized scene (chosen). Option 3 preloads gel and object meshes into OpenGL once, then only updates their poses each step (fast) and overlaps the gel geometry with the contact object to extract depth+RGB. This reaches up to 200 FPS. It assumes rigid objects (gel is softer than everyday objects, so object deformation is negligible); very deformable objects fall back to the slower render-from-depth path.
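The "load meshes once, update only poses" idea can be illustrated with a toy scene cache. This is a minimal sketch and not TACTO's or Pyrender's actual API: the class `SyncedScene`, its methods, and the mesh names are all hypothetical, and the "meshes" are just small vertex arrays.

```python
import numpy as np


class SyncedScene:
    """Toy stand-in for a renderer scene (hypothetical, not a real API).

    Meshes are uploaded once; each simulation step only overwrites a 4x4 pose.
    """

    def __init__(self):
        self._meshes = {}    # name -> (N, 3) vertex array, loaded once
        self._poses = {}     # name -> 4x4 homogeneous transform
        self.mesh_loads = 0  # counts the expensive uploads

    def add_mesh(self, name, vertices):
        """Expensive step: done once per mesh, before the simulation loop."""
        self._meshes[name] = np.asarray(vertices, dtype=float)
        self._poses[name] = np.eye(4)
        self.mesh_loads += 1

    def set_pose(self, name, position, rotation=None):
        """Cheap step: done every simulation step from the physics engine's poses."""
        pose = np.eye(4)
        if rotation is not None:
            pose[:3, :3] = rotation
        pose[:3, 3] = position
        self._poses[name] = pose

    def world_vertices(self, name):
        """Vertices transformed into the world frame with the current pose."""
        pose = self._poses[name]
        return self._meshes[name] @ pose[:3, :3].T + pose[:3, 3]


scene = SyncedScene()
scene.add_mesh("gel", [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
scene.add_mesh("object", [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# Simulated loop: the object approaches the gel; only its pose changes.
for step in range(100):
    scene.set_pose("object", position=[0.0, 0.0, 1.0 - 0.01 * step])

final_vertices = scene.world_vertices("object")
```

After 100 steps `scene.mesh_loads` is still 2, which is the point: the per-step cost is a pose write, not a mesh reload as in the render-from-depth option.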
Force-dependent deformation. PyBullet's rigid-body contact model yields contact forces; TACTO maps force→gel deformation depth via a user-supplied function. The default is a piece-wise linear map (approximating real elastomer linear elasticity within range) with a lower threshold (below which the sensor doesn't deform) and an upper saturation threshold. Light forces → small deformation; higher forces → more.
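A piece-wise linear force-to-depth map with a dead zone and a saturation point might look like the sketch below. The function name, the thresholds, and the maximum depth are illustrative assumptions, not values from the paper or from TACTO's config files.

```python
def force_to_depth(force_n, f_min=0.5, f_max=10.0, max_depth_mm=1.0):
    """Map a contact force (N) to a gel deformation depth (mm).

    All default values are hypothetical placeholders.
    - Below f_min the gel does not deform.
    - Above f_max the deformation saturates at max_depth_mm.
    - In between, depth grows linearly with force.
    """
    if force_n <= f_min:
        return 0.0
    if force_n >= f_max:
        return max_depth_mm
    return max_depth_mm * (force_n - f_min) / (f_max - f_min)


light = force_to_depth(0.1)    # below the lower threshold
medium = force_to_depth(5.25)  # halfway through the linear range
heavy = force_to_depth(20.0)   # beyond saturation
```

Because the map is a user-supplied function, a nonlinear elastomer model could be swapped in with the same signature.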
Realism extras. Optional shadow rendering (Fig 8, ~2 ms extra on GPU); calibration from real sensors (Fig 6): compute the pixel-wise difference of simulated images with vs. without touch, then add a reference real-sensor background image to fine-tune toward a specific sensor's non-uniform illumination (Fig 7). Contact-boundary deformation is admitted as hard to model exactly; mitigated by smoothing object meshes in preprocessing, data augmentation, or generative refinement.
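The calibration step described above (simulated touch minus simulated no-touch, added onto a real background) can be sketched with NumPy. The function name and the 8-bit image assumption are mine; the paper's exact arithmetic and any clipping or scaling it applies may differ.

```python
import numpy as np


def calibrate_to_real_background(sim_touch, sim_no_touch, real_background):
    """Transfer the simulated contact imprint onto a real sensor's background.

    Assumes all three inputs are uint8 images of identical shape (an assumption).
    """
    # Signed arithmetic so that darkening (negative differences) survives.
    imprint = sim_touch.astype(np.int16) - sim_no_touch.astype(np.int16)
    combined = real_background.astype(np.int16) + imprint
    return np.clip(combined, 0, 255).astype(np.uint8)


# Tiny synthetic example: a 4x4 grayscale frame with a 2x2 "contact" patch.
sim_no_touch = np.full((4, 4), 100, dtype=np.uint8)
sim_touch = sim_no_touch.copy()
sim_touch[1:3, 1:3] = 140  # contact brightens the patch by 40
real_background = np.full((4, 4), 120, dtype=np.uint8)

calibrated = calibrate_to_real_background(sim_touch, sim_no_touch, real_background)
```

Pixels outside the contact patch reproduce the real background exactly, so the sensor-specific non-uniform illumination carries over to every simulated frame.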
Speed (Table I). Rendering a single DIGIT (12K-face mesh, 1 object in contact) on GPU: 220 FPS at 160×120, 90 FPS at 640×480. Render time grows linearly with output resolution but stays flat as mesh size scales from 2K→12K faces; throughput holds even in cluttered scenes (only in-contact object count drives Pyrender cost). Four DIGIT sensors on an Allegro hand render at 50 FPS (160×120, GPU). Physics can run async to speed up further.
| Config (1 object in contact; GPU unless noted) | Resolution | FPS |
|---|---|---|
| Single DIGIT | 160×120 | 220 |
| Single DIGIT | 320×240 | 140 |
| Single DIGIT | 640×480 | 90 |
| Single DIGIT (CPU) | 160×120 | 60 |
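For budgeting a simulation loop it can help to read the table as per-frame render times. The sketch below only does that conversion; the helper names are hypothetical, and the shadow estimate assumes the ~2 ms shadow cost simply adds to the frame time, which the paper does not state.

```python
def frame_ms(fps):
    """Per-frame render time in milliseconds for a given throughput."""
    return 1000.0 / fps


def fps_with_overhead(fps, extra_ms):
    """Throughput after adding a fixed per-frame cost (assumed additive)."""
    return 1000.0 / (frame_ms(fps) + extra_ms)


# FPS values copied from the table above.
table_fps = {
    "GPU 160x120": 220,
    "GPU 320x240": 140,
    "GPU 640x480": 90,
    "CPU 160x120": 60,
}
budgets_ms = {name: frame_ms(fps) for name, fps in table_fps.items()}

# Rough estimate only: lowest-resolution GPU throughput with shadows enabled.
shadow_fps_estimate = fps_with_overhead(220, extra_ms=2.0)
```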
Grasp stability (Fig 10). ResNet-18 models trained on the 1M dataset to predict lift success from vision and/or touch. Findings: touch learns fast with little data; vision needs 3–4 orders of magnitude more data to catch up; two tactile sensors beat one (an object can look stable from one side while slipping on the other); vision+touch is best in most regimes — qualitatively matching the real-robot trends of [6], but now explorable 2 orders of magnitude beyond the largest real dataset.
Marble manipulation (Fig 11). Bayesian optimization learns a
controller gain K to roll a marble to target locations in ~50
iterations; converges faster than random search. Whole run is ~8 minutes
(6 min optimization + 2 min simulation over 20,000 imprints).
Sim2Real pose estimation (Table II). Sim2Sim achieves 0.41±0.01 mm / 3.48±0.34° error. Sim2Real without augmentation has a clear gap (4.56 mm / 17.64°); color jittering augmentation closes much of it (1.66 mm / 11.60°), reaching parity with Real2Real trained on ~64 real datapoints. Mixing simulated + a little real data (Sim+Real) consistently beats Real2Real at equal real-data counts — e.g. Sim+Real (128) hits 0.52±0.03 mm / 4.14±0.57°, better than Real2Real (128) at 0.76 mm / 4.96° — evidence simulated data improves data efficiency.
From the authors:
What I noticed reading it:
TACTO is infrastructure, not a manipulation method — but it is squarely
load-bearing for the thesis that many manipulation predicates
(is_grasped, is_inserted, is_slipping,
surface_is_rough) are not visually evaluable and live in touch.
BLADE
learns visual predicate classifiers from RGB-D because that is what was cheaply
labelable; the obvious next step is grounding tactile/force predicates, and that
needs cheap, abundant tactile data with which to train classifiers. TACTO's whole
pitch is exactly that: a 1M-grasp tactile dataset in a day, which is infeasible on
real hardware. Concretely:
is_grasped/is_stable classifiers the way
BLADE trained visual ones; TACTO is the data engine for that.

Cluster I (datasets/benchmarks/simulators) sibling: this is the renderer-side tactile sim; Taxim (example-based GelSight) and TacEx (GelSight in Isaac Sim) are the direct comparison points to ingest next.
TACTO – a fast, flexible, and open-source simulator for vision-based tactile sensors. This simulator allows to render realistic high-resolution touch readings at hundreds of frames per second. — Abstract
Our proposed system design builds a synchronized scene from the physical simulator, and directly renders both depth and RGB images in OpenGL. It can achieve high speed with powerful rendering functionalities. — §III-B, OpenGL for synchronized scenes / p.3
It took only one day to collect 1 million grasps with 5 threads … The resulting dataset collected with TACTO is several orders of magnitude larger than any publicly available dataset of grasps using tactile sensing. — §IV-A, Learning Grasp Stability / p.6
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant: