ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

Ruohan Gao*, Zilin Si*, Yen-Yu Chang*, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu · Stanford / CMU · CVPR 2022 · arXiv:2204.02389

One-liner. A 1,000-object dataset in which every household object is a single implicit neural network (an "Object File") that renders its visual, acoustic, and tactile signals on demand in real time; crucially, models trained on these virtual objects transfer to their real-world counterparts on scale estimation, contact localization, and shape reconstruction, making it a Sim2Real testbed for the non-visual senses that manipulation relies on.

Problem & motivation

Real objects are not just geometry plus a texture: an alarm clock looks glossy, a plate clinks when struck, a knife feels sharp. Prior object datasets (ImageNet, ShapeNet, ABO, YCB, BigBIRD) model only the visual modality or only low-quality geometry, so multisensory object-centric learning has had no realistic, scalable substrate. ObjectFolder 1.0 introduced the implicit-neural "object as a function" idea but covered only 100 objects, rendered slowly, and produced limited-quality audio and touch. ObjectFolder 2.0 attacks all three deficits at once: 10× the objects, ~37× faster rendering overall (Table 1), and higher fidelity across vision, audio, and touch. It also adds what 1.0 never demonstrated: that learning on the virtual objects generalizes to the real ones.

Method

Each object is an Object File: one implicit neural representation with three coordinate-MLP sub-networks — VisionNet, AudioNet, TouchNet (Fig 3). The pipeline starts from 1,000 high-quality scanned meshes (855 from ABO after material filtering to ceramic/glass/wood/plastic/iron/polycarbonate/steel, all 100 from OF 1.0, 45 polycarbonate from Google Scanned Objects), with real-product metadata (category, material, color, dimensions). It then simulates each modality and distills it into the Object File so that storage is independent of the extrinsic query parameters (viewpoint, impact force, gel deformation).

VisionNet (real-time appearance). Builds on the Object Scattering Function (OSF) volumetric appearance model, but replaces the single large MLP with a KiloOSF design (after KiloNeRF): the object is subdivided into a uniform grid s = (s_x, s_y, s_z), each cell gets a tiny MLP v(i), and a spatial-binning map m(x) routes a 3D point to its cell's network before querying color/density (c, σ) = F_{v(m(x))}(x, r). Trained by distillation from a full OSF teacher, with empty-space skipping and early ray termination. Result: 60× faster inference than 1.0 (Table 1) at better visual quality.
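The routing step m(x) described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (cell_index, flat_index, query), the axis-aligned bounding box, the x-major flattening order, and the stand-in "networks" are all hypothetical, and real KiloNeRF-style code batches points per cell on the GPU rather than looping.

```python
import math

def cell_index(x, bbox_min, bbox_max, grid):
    """Map a 3D point to the (ix, iy, iz) index of the uniform grid cell containing it."""
    idx = []
    for xi, lo, hi, n in zip(x, bbox_min, bbox_max, grid):
        t = (xi - lo) / (hi - lo)          # normalize to [0, 1] along this axis
        i = int(math.floor(t * n))         # cell number along this axis
        idx.append(min(max(i, 0), n - 1))  # clamp so boundary points stay in range
    return tuple(idx)

def flat_index(idx, grid):
    """Flatten (ix, iy, iz) into a single network id (x-major order, an assumption)."""
    ix, iy, iz = idx
    return (ix * grid[1] + iy) * grid[2] + iz

def query(x, r, networks, bbox_min, bbox_max, grid):
    """Route x to its cell's tiny network and evaluate (c, sigma) = F_{v(m(x))}(x, r)."""
    net = networks[flat_index(cell_index(x, bbox_min, bbox_max, grid), grid)]
    return net(x, r)

# Toy usage: a 2x2x2 grid over [-1, 1]^3 with stand-in "networks" that
# return a constant color and their own id as the density.
grid = (2, 2, 2)
bbox_min, bbox_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
networks = [lambda x, r, k=k: ((0.5, 0.5, 0.5), float(k)) for k in range(8)]
color, sigma = query((0.5, -0.5, 0.5), (0.0, 0.0, 1.0),
                     networks, bbox_min, bbox_max, grid)
```

The point of the design is that each query touches only one tiny MLP, so the per-sample cost is independent of how many cells the object is split into.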

AudioNet (impact sound). Uses linear modal analysis. Each mesh is converted to a quadratic tetrahedral mesh and run through Finite Element Methods (FEM, second-order elements in Abaqus) to get the modal decomposition q̈ + (αI + βΛ)q̇ + Λq = U^T f, yielding N damped sinusoids with frequencies ω_i, dampings d_i, gains g_i. The key trick: frequencies and dampings are object intrinsics (predicted once), but the per-mode gains g_i depend on contact force and location, so AudioNet is an MLP that, given a tetrahedral-vertex coordinate, predicts the gain vector for unit force in each of three axis directions. At inference an arbitrary force f = k_x f_x + k_y f_y + k_z f_z linearly combines the three predicted gain branches; the waveform is synthesized as S(t) = ∑_{i=1}^{N} ĝ_i e^{-d_i t} sin(2πω_i t). Predicting only the location-dependent part (vs. 1.0's direct spectrogram regression) is what raises quality and lets it run in real time.
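The two inference-time steps above (linear combination of the three unit-force gain branches, then summing damped sinusoids) can be sketched as follows. This is an illustrative sketch only: the function names, the sample rate, and every numeric value (gains, frequencies, dampings) are made up for the example, and the gain branches would in practice come from the trained network rather than hard-coded lists.

```python
import math

def combine_gains(gains_xyz, force):
    """Linearly combine the three unit-force gain branches for an arbitrary force.

    gains_xyz: (g_x, g_y, g_z), each a list of N per-mode gains for a unit
               force along one axis.
    force:     (k_x, k_y, k_z), the force coefficients along the three axes.
    """
    g_x, g_y, g_z = gains_xyz
    k_x, k_y, k_z = force
    return [k_x * a + k_y * b + k_z * c for a, b, c in zip(g_x, g_y, g_z)]

def synthesize(gains, freqs_hz, dampings, duration_s, sample_rate):
    """Sample S(t) = sum_i g_i * exp(-d_i * t) * sin(2 * pi * w_i * t)."""
    n = round(duration_s * sample_rate)
    wave = []
    for k in range(n):
        t = k / sample_rate
        wave.append(sum(g * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
                        for g, f, d in zip(gains, freqs_hz, dampings)))
    return wave

# Toy usage with two modes; all numbers are hypothetical.
gains_xyz = ([1.0, 0.5], [0.0, 0.2], [0.3, 0.0])   # unit-force gains per axis
gains = combine_gains(gains_xyz, (2.0, 0.0, 1.0))  # force = 2*x + 0*y + 1*z
wave = synthesize(gains, freqs_hz=[440.0, 1320.0], dampings=[6.0, 20.0],
                  duration_s=0.5, sample_rate=8000)
```

Because frequencies and dampings are fixed per object, only the short gain vector changes between strikes, which is why the per-query cost stays small.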

TouchNet (GelSight-style tactile). Two stages. First, a per-contact deformation map is simulated from the object's local shape in the contact area plus the gelpad shape elsewhere, rendered with Pyrender/OpenGL at ~700 fps. TouchNet is the MLP F_T : (x, y, z, θ_T, φ_T, p, w, h) → d that encodes these maps: it takes an 8D input (3D location, contact-orientation unit vector parameterized by the two angles θ_T and φ_T, gel penetration depth p, pixel location (w, h)) and returns the deformation-map value d. Second, the deformation map is converted to a GelSight RGB tactile image via Taxim, calibrated against a real GelSight sensor. Unlike OF 1.0 (a single tactile image per vertex along the normal), 2.0 renders varied rotation (±15°) and pressing depth (0.5–2 mm), enabling Sim2Real tactile transfer.
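To make the 8D query concrete, the sketch below assembles one (x, y, z, θ_T, φ_T, p, w, h) tuple per pixel and evaluates a TouchNet-like callable to fill a deformation map. Everything here is an assumption for illustration: the helper names, the raw integer pixel coordinates (a real model would likely normalize its inputs), and the stand-in model, which simply echoes the pressing depth.

```python
def touch_queries(contact_xyz, theta, phi, depth, width, height):
    """Build one 8D query (x, y, z, theta, phi, p, w, h) per pixel, row by row."""
    x, y, z = contact_xyz
    return [(x, y, z, theta, phi, depth, w, h)
            for h in range(height) for w in range(width)]

def render_deformation_map(model, contact_xyz, theta, phi, depth, width, height):
    """Evaluate the model on every pixel query and reshape into rows of the map."""
    values = [model(q) for q in
              touch_queries(contact_xyz, theta, phi, depth, width, height)]
    return [values[row * width:(row + 1) * width] for row in range(height)]

def toy_model(q):
    """Stand-in for a trained network: returns the pressing depth p unchanged."""
    return q[5]

# Toy usage: a 3x2 map at one contact point, pressed 1.0 mm with no rotation.
queries = touch_queries((0.1, 0.2, 0.3), 0.0, 0.0, 1.0, width=3, height=2)
deform = render_deformation_map(toy_model, (0.1, 0.2, 0.3), 0.0, 0.0, 1.0,
                                width=3, height=2)
```

Varying theta, phi, and depth in these queries is what corresponds to the varied rotation and pressing depth mentioned above.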

Setup

Datasets / benchmarks: ObjectFolder 2.0 itself — 1,000 implicit-neural objects (855 ABO + 100 OF 1.0 + 45 Google Scanned Objects). Three downstream Sim2Real evaluation tasks: object scale estimation, audio-tactile contact localization, visuo-tactile shape reconstruction. 13 real objects collected for the cross-domain tests.
Hardware / simulator: FEM via Abaqus (audio); Pyrender + OpenGL + Taxim, calibrated to a real GelSight sensor (touch); KiloOSF volume rendering (vision). Real-world rig (Fig 5): impact-sound collection setup and GelSight tactile-data collection on a robot arm; contact localization uses an FCRN depth net and MFCC audio features with a particle filter.
Baselines: ObjectFolder 1.0 (rendering-quality and downstream-task comparison); "Random" predictors; per-modality variants (vision-only / audio-only / touch-only / fused) within each task; TACTO + DIGIT (1.0) vs. GelSight (2.0) for tactile ground truth; Point Completion Network (PCN) backbone and ResNet-18 feature extractor for shape reconstruction.
Compute: not reported (rendering speed reported in seconds/sample, Table 1; deformation-map generation at 700 fps).

Results

Rendering speed (Table 1, seconds per sample) and quality (Table 2):

Method	Total render (s)↓	Vision PSNR↑	Vision SSIM↑	Audio STFT (×10^-5)↓	Audio ENV (×10^-4)↓	Touch PSNR↑	Touch SSIM↑
ObjectFolder 1.0	4.129	35.7	0.97	4.94	7.65	27.9	0.64
ObjectFolder 2.0	0.111	36.3	0.98	0.19	1.29	31.6	0.78

~37× faster overall (VisionNet alone 60×), with audio STFT distance cut by ~26× and touch PSNR up ~3.7 dB.
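The headline ratios above follow directly from the two table rows; a small check, using only the numbers already listed (the dictionary keys are just labels chosen for this sketch):

```python
# Values copied from the rendering speed / quality table above.
table = {
    "ObjectFolder 1.0": {"render_s": 4.129, "audio_stft": 4.94, "touch_psnr": 27.9},
    "ObjectFolder 2.0": {"render_s": 0.111, "audio_stft": 0.19, "touch_psnr": 31.6},
}
old, new = table["ObjectFolder 1.0"], table["ObjectFolder 2.0"]

speedup = old["render_s"] / new["render_s"]         # about 37.2x faster overall
stft_ratio = old["audio_stft"] / new["audio_stft"]  # about 26x lower STFT distance
psnr_gain = new["touch_psnr"] - old["touch_psnr"]   # about 3.7 dB higher touch PSNR

print(f"render speedup: {speedup:.1f}x")
print(f"audio STFT reduction: {stft_ratio:.1f}x")
print(f"touch PSNR gain: {psnr_gain:.1f} dB")
```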

Sim2Real downstream tasks (the headline contribution):

Object scale estimation (Table 3, mean error cm). 2.0 models generalize far better to real objects than 1.0: on real objects, Vision 5.08 (vs 1.0's 7.41), Audio 4.68 (vs 6.85), Touch 3.51 (vs 4.92); random baseline is 14.5. Touch has the smallest Sim2Real gap (virtual 0.45 → real 3.51 is the closest virtual–real ratio among the three for 2.0).
Audio-tactile contact localization (Table 4, mean dist cm). Touch-based localization is far more accurate than audio across six complex objects; e.g. on the mug, Touch 0.03 (sim) / 0.78 (real) vs Audio 0.26 / 1.16. Audio+Touch fusion gives the best real-object result (mug 0.04 / 0.36), although on this object touch alone is marginally ahead in simulation (0.03 vs 0.04). Several real audio entries are missing (dashes in Table 4) for the brush, flute, and toy plane.
Visuo-tactile shape reconstruction (Table 5, Chamfer-L1 cm). Vision+Touch fusion wins on every one of six objects; e.g. coffee mug 0.18 (sim) / 0.46 (real) beats vision-only 0.30 / 0.72 and touch-only 0.29 / 0.80, and all three have much lower error than the "Average-mesh" baseline (2.97 / 1.91).
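For readers unfamiliar with the shape-reconstruction metric, here is a brute-force sketch of one common Chamfer-L1 definition: the symmetric mean of nearest-neighbor Euclidean (unsquared) distances between two point sets. The paper's exact normalization and point-sampling scheme are not recorded in these notes, so treat this as an assumption rather than the evaluation code; the toy point sets are made up.

```python
import math

def chamfer_l1(points_a, points_b):
    """Symmetric mean nearest-neighbor distance between two point sets.

    Brute force, O(len(a) * len(b)); fine for small sets, not for dense clouds.
    """
    def one_sided(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_sided(points_a, points_b) + one_sided(points_b, points_a))

# Toy usage: two 2-point sets that share one point and differ in the other.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
d_ab = chamfer_l1(a, b)
d_aa = chamfer_l1(a, a)
```

Lower is better: identical point sets score zero, and the score is in the same length unit as the coordinates (cm in Table 5).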

Across all three tasks the story is the same: models trained purely on 2.0's virtual sensory data transfer to real objects, and combining modalities beats any single one — the empirical case for multisensory representations.

Limitations & open questions

From the authors:

All objects are rigid bodies with a single homogeneous material; real objects are often multi-part, non-rigid, and multi-material.
The simulation does not model properties of the real 3D environment such as lighting, background noise, and reverberation; full Sim2Real transfer needs these factored in.
Material filtering discards object classes whose material is not acoustically simulable, so coverage is biased toward a fixed material set.

What I noticed reading it:

The downstream tasks use only 6–13 real objects; these are perceptual-property benchmarks, not control or manipulation tasks — no robot policy is learned or executed, so "transfers to robotics" is a substrate claim, not a closed-loop demonstration.
The audio modality is the weakest link: largest Sim2Real gap (Table 3), worst localization (Table 4), and missing real-object entries (dashes) for thin/irregular objects — modal analysis presumably breaks down there, but the paper doesn't analyze the failures.
Everything is impact / quasi-static contact: an impulse strike for audio, a single GelSight press for touch. Continuous contact dynamics — sliding, scraping, dragging, pouring (the regimes where contact-audio and shear-tactile signals are richest) — are out of scope.
Numbers are reported as mean error per object with no variance / seeds, so the per-object win margins (especially the small real-object differences) are hard to read as statistically robust.

Why I care

This is squarely on the batch thesis that many manipulation predicates are not visually evaluable — is_struck, surface_is_rough, material_is_ceramic, is_full live in sound and touch. ObjectFolder 2.0 is the substrate for learning exactly those: a scalable repository where the non-visual senses are first-class, queryable functions, and where the empirical result is that fusing them beats vision alone. For BLADE, whose predicate classifiers are currently visual (f_θ(p): O → {T,F} over RGB crops), this paper is the data answer to BLADE's own flagged limitation that "many predicates aren't visually evaluable": a tactile/acoustic predicate classifier needs touch/sound training data, and ObjectFolder 2.0 (and its successors) is where that data is manufacturable at scale with Sim2Real validity. The honest caveat: it is a perception dataset, not a planning or long-horizon-manipulation paper — it gives BLADE-style work a sensory front-end, not a new abstraction mechanism. The implicit-neural "object as a function" framing is also a clean counterpoint to explicit symbolic state: here the "world model" of an object is a renderer of raw multisensory signals, which a downstream predicate classifier would still have to interpret.

Quotable

We virtualize each object by encoding its intrinsics (texture, material type, and 3D shape) with an Object File implicit neural representation. Then we can render its visual appearance, impact sound, and tactile readings based on any extrinsic parameters. — Fig 1 caption / p.1

Models learned from virtual objects in our dataset successfully transfer to their real-world counterparts in three challenging tasks: object scale estimation, contact localization, and shape reconstruction. — Abstract / p.1

Among the three modalities, tactile data has the smallest Sim2Real gap compared to vision and audio. — §5.1, Object Scale Estimation / p.6

Papers cited that should likely be ingested next:

Taxim [61] — the example-based GelSight simulation TouchNet uses to turn deformation maps into tactile RGB. Direct dependency.
TACTO [70] — tactile simulator used as the touch ground-truth pipeline for the OF 1.0 comparison; to ingest.
GelSight [16] — the vision-based tactile sensor whose readings TouchNet emulates; to ingest.
DIGIT [35] — the compact tactile sensor OF 1.0 used; to ingest.
ObjectFolder 1.0 [18] — the predecessor this paper extends 10×; to ingest.

Newly ingested in the 2026-06-24 batch and directly relevant:

ObjectFolder — the 100-object predecessor; this paper is its scale-up + Sim2Real demonstration. (Cluster I.)
ObjectFolder Benchmark — the follow-up that turns the dataset into a standardized multisensory benchmark with real-object scans. (Cluster I.)
Taxim and TACTO — the GelSight simulation tooling TouchNet builds on / compares against. (Cluster I.)
Touch and Go — a real human-collected vision–touch dataset; the real-data counterpart to ObjectFolder's simulated touch. (Cluster I.)
SoundSpaces 2.0 — sibling implicit/visual-acoustic simulator from the same audio-visual-Sim2Real lineage. (Cluster I.)
SonicSense and See, Hear, and Feel: downstream manipulation work that consumes exactly the contact-audio / multisensory signals ObjectFolder 2.0 manufactures. (Cluster D.)