ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

Ruohan Gao*, Zilin Si*, Yen-Yu Chang*, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu · Stanford / CMU · CVPR 2022 · arXiv:2204.02389

One-liner. A 1,000-object dataset where every household object is a single implicit neural network (an "Object File") that renders its visual, acoustic, and tactile signals on demand in real time — and, crucially, models trained on these virtual objects transfer to their real-world counterparts on scale estimation, contact localization, and shape reconstruction, making it a Sim2Real testbed for the non-visual senses manipulation actually runs on.

Problem & motivation

Real objects are not just geometry plus a texture: an alarm clock looks glossy, the plate clinks when struck, the knife feels sharp. Prior object datasets (ImageNet, ShapeNet, ABO, YCB, BigBIRD) model only the visual modality or only low-quality geometry, so multisensory object-centric learning has had no realistic, scalable substrate. ObjectFolder 1.0 introduced the implicit-neural "object as a function" idea but covered only 100 objects, rendered slowly, and produced limited-quality audio/touch. ObjectFolder 2.0 attacks all three deficits at once: 10× the objects, ~100× faster rendering, higher fidelity across vision/audio/touch — and adds the thing 1.0 never demonstrated: that learning on the virtual objects generalizes to the real ones.

Method

Each object is an Object File: one implicit neural representation with three coordinate-MLP sub-networks — VisionNet, AudioNet, TouchNet (Fig 3). The pipeline starts from 1,000 high-quality scanned meshes (855 from ABO after material filtering to ceramic/glass/wood/plastic/iron/polycarbonate/steel, all 100 from OF 1.0, 45 polycarbonate from Google Scanned Objects), with real-product metadata (category, material, color, dimensions). It then simulates each modality and distills it into the Object File so that storage is independent of the extrinsic query parameters (viewpoint, impact force, gel deformation).

VisionNet (real-time appearance). Builds on the Object Scattering Function (OSF) volumetric appearance model, but replaces the single large MLP with a KiloOSF design (after KiloNeRF): the object is subdivided into a uniform grid s = (s_x, s_y, s_z), each cell gets a tiny MLP v(i), and a spatial-binning map m(x) routes a 3D point to its cell's network before querying color/density (c, σ) = F_{v(m(x))}(x, r). Trained by distillation from a full OSF teacher, with empty-space skipping and early ray termination. Result: 60× faster inference than 1.0 (Table 1) at better visual quality.
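The routing step can be sketched in a few lines of NumPy. This is an illustration of the spatial-binning idea only, not the authors' implementation: the 4×4×4 grid, the bounding box, the `TinyMLP` class with random untrained weights, and the treatment of r as a single 3D direction are all assumptions made for the example.

```python
import numpy as np

def cell_index(x, bounds_min, bounds_max, grid):
    """Spatial binning m(x): map a 3D point to the index of its grid cell."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid)
    # Normalise into the bounding box, then scale to the grid resolution.
    rel = (x - bounds_min) / (bounds_max - bounds_min)
    ijk = np.clip(np.floor(rel * grid).astype(int), 0, grid - 1)
    return tuple(int(i) for i in ijk)

class TinyMLP:
    """Hypothetical stand-in for one per-cell network v(i): (x, r) -> (c, sigma)."""
    def __init__(self, rng, hidden=32):
        self.w1 = rng.normal(0.0, 0.1, (6, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, 4))

    def __call__(self, x, r):
        h = np.maximum(0.0, np.concatenate([x, r]) @ self.w1)  # ReLU layer
        out = h @ self.w2
        color = 1.0 / (1.0 + np.exp(-out[:3]))  # RGB squashed into (0, 1)
        sigma = np.maximum(0.0, out[3])          # non-negative density
        return color, sigma

# Assumed setup: unit-ish bounding box split into a 4x4x4 uniform grid.
rng = np.random.default_rng(0)
grid = (4, 4, 4)
bounds_min = np.array([-1.0, -1.0, -1.0])
bounds_max = np.array([1.0, 1.0, 1.0])
networks = {idx: TinyMLP(rng) for idx in np.ndindex(*grid)}

def query(x, r):
    """Route the point to its cell's network, then evaluate (c, sigma)."""
    idx = cell_index(x, bounds_min, bounds_max, grid)
    return networks[idx](np.asarray(x, dtype=float), np.asarray(r, dtype=float))

color, sigma = query([0.3, -0.2, 0.9], [0.0, 0.0, 1.0])
```

Each query touches only one small network, which is where the speedup over a single large MLP comes from; empty-space skipping would additionally avoid calling `query` for cells known to contain nothing.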

AudioNet (impact sound). Uses linear modal analysis. Each mesh is converted to a quadratic tetrahedral mesh and analyzed with the finite element method (FEM, second-order elements in Abaqus) to obtain the modal decomposition q̈ + (αI + βΛ)q̇ + Λq = U^T f, yielding N damped sinusoids with frequencies ω_i, dampings d_i, and gains g_i. The key trick: frequencies and dampings are object intrinsics (predicted once), but the per-mode gains g_i depend on contact force and location, so AudioNet is an MLP that, given a tetrahedral-vertex coordinate, predicts the gain vector for a unit force in each of the three axis directions. At inference an arbitrary force f = k_x f_x + k_y f_y + k_z f_z linearly combines the three predicted gain branches; the waveform is synthesized as S(t) = ∑_i ĝ_i e^{-d_i t} sin(2πω_i t). Predicting only the location-dependent part (vs. 1.0's direct spectrogram regression) is what raises quality and lets it run in real time.
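A minimal sketch of the inference-time synthesis, assuming the three gain branches have already been predicted for the contact vertex. The function names, the three example modes, and all numeric values are made up for illustration; only the linear combination of branches and the damped-sinusoid sum come from the description above.

```python
import numpy as np

def combine_gains(gain_branches, k):
    """Combine per-axis unit-force gains into gains for an arbitrary force.

    gain_branches: (3, N) array, gains for unit forces along x, y, z.
    k:             (3,) coefficients (k_x, k_y, k_z) of the applied force.
    """
    return np.asarray(k, dtype=float) @ np.asarray(gain_branches, dtype=float)

def synthesize(freqs_hz, dampings, gains, duration=1.0, sample_rate=44100):
    """S(t) = sum_i g_i * exp(-d_i * t) * sin(2 * pi * w_i * t)."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    modes = (gains[:, None]
             * np.exp(-dampings[:, None] * t)
             * np.sin(2.0 * np.pi * freqs_hz[:, None] * t))
    return t, modes.sum(axis=0)

# Hypothetical object intrinsics: three modes (real objects have many more).
freqs_hz = np.array([440.0, 1210.0, 2350.0])
dampings = np.array([8.0, 15.0, 30.0])

# Hypothetical network output at one vertex: one gain vector per force axis.
gain_branches = np.array([
    [1.0, 0.2, 0.05],   # unit force along x
    [0.3, 0.8, 0.10],   # unit force along y
    [0.0, 0.1, 0.60],   # unit force along z
])

gains = combine_gains(gain_branches, k=[0.5, 0.0, 1.0])
t, waveform = synthesize(freqs_hz, dampings, gains)
```

Because the frequencies and dampings are fixed per object, changing the strike location or force only changes `gains`, so a new sound costs one small network evaluation plus the sum above.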

TouchNet (GelSight-style tactile). Two stages. First, a per-contact deformation map is simulated from the object's local shape in the contact area plus the gelpad shape elsewhere, rendered with Pyrender/OpenGL at ~700 fps. TouchNet is then an MLP F_T : (x, y, z, θ_T, φ_T, p, w, h) → d mapping an 8D input (3D location, contact-orientation unit vector parameterized by the two angles (θ_T, φ_T), gel penetration depth p, pixel location (w, h)) to the deformation-map value. Second, the deformation map is converted to a GelSight RGB tactile image via Taxim, calibrated against a real GelSight sensor. Unlike OF 1.0 (single tactile image per vertex along the normal), 2.0 renders varied rotation (±15°) and pressing depth (0.5–2 mm), enabling Sim2Real tactile transfer.
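To make the 8D input concrete, the sketch below assembles one query row per pixel of a deformation map for a single contact. The helper name `touch_queries`, the tiny 4×3 map size, the use of raw (unnormalised) pixel indices, and the choice of millimetres and radians as units are assumptions for illustration; the paper's actual input encoding may differ.

```python
import numpy as np

def touch_queries(xyz, theta, phi, depth, width, height):
    """Build the (height * width, 8) input batch for one contact.

    Each row is (x, y, z, theta, phi, p, w, h): contact location,
    orientation angles, penetration depth, and one pixel coordinate.
    """
    ws, hs = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    pixels = np.stack([ws.ravel(), hs.ravel()], axis=1).astype(float)
    contact = np.array([*xyz, theta, phi, depth], dtype=float)
    # The six contact parameters are shared by every pixel of the map.
    return np.hstack([np.tile(contact, (len(pixels), 1)), pixels])

# Hypothetical contact: tilted 15 degrees, pressed 1.0 mm into the gel.
q = touch_queries(
    xyz=(0.1, 0.2, 0.3),
    theta=0.0,
    phi=np.deg2rad(15.0),
    depth=1.0,
    width=4,
    height=3,
)
```

A network evaluated on `q` would return one deformation value per row, which can be reshaped with `.reshape(height, width)` into the map that the tactile renderer then converts to an RGB image.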

Setup

Results

Rendering speed (Table 1, seconds per sample) and quality (Table 2):

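| Method | Total render (s) ↓ | Vision PSNR ↑ | Vision SSIM ↑ | Audio STFT (×10^-5) ↓ | Audio ENV (×10^-4) ↓ | Touch PSNR ↑ | Touch SSIM ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ObjectFolder 1.0 | 4.129 | 35.7 | 0.97 | 4.94 | 7.65 | 27.9 | 0.64 |
| ObjectFolder 2.0 | 0.111 | 36.3 | 0.98 | 0.19 | 1.29 | 31.6 | 0.78 |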

~37× faster overall (VisionNet alone 60×), with audio STFT distance cut by ~26× and touch PSNR up ~3.7 dB.
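The headline ratios follow directly from the reported numbers; a quick check, with the values copied from the 1.0 and 2.0 rows (the dictionary keys are just labels chosen for this snippet):

```python
# Reported values for ObjectFolder 1.0 and 2.0 (Tables 1 and 2).
of1 = {"render_s": 4.129, "audio_stft": 4.94, "touch_psnr": 27.9}
of2 = {"render_s": 0.111, "audio_stft": 0.19, "touch_psnr": 31.6}

speedup = of1["render_s"] / of2["render_s"]        # total render time ratio
stft_ratio = of1["audio_stft"] / of2["audio_stft"]  # audio STFT distance ratio
psnr_gain = of2["touch_psnr"] - of1["touch_psnr"]   # touch PSNR difference in dB

print(f"render speedup: {speedup:.1f}x")       # 37.2x
print(f"STFT distance cut: {stft_ratio:.1f}x")  # 26.0x
print(f"touch PSNR gain: {psnr_gain:.1f} dB")   # 3.7 dB
```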

Sim2Real downstream tasks (the headline contribution):

Across all three tasks the story is the same: models trained purely on 2.0's virtual sensory data transfer to real objects, and combining modalities beats any single one — the empirical case for multisensory representations.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This sits squarely on the batch thesis that many manipulation predicates are not visually evaluable: is_struck, surface_is_rough, material_is_ceramic, and is_full live in sound and touch. ObjectFolder 2.0 is the substrate for learning exactly those: a scalable repository where the non-visual senses are first-class, queryable functions, and where the empirical result is that fusing them beats vision alone. For BLADE, whose predicate classifiers are currently visual (f_θ(p): O → {T,F} over RGB crops), this paper is the data answer to BLADE's own flagged limitation that "many predicates aren't visually evaluable": a tactile/acoustic predicate classifier needs touch/sound training data, and ObjectFolder 2.0 (and its successors) is where that data is manufacturable at scale with Sim2Real validity. The honest caveat: it is a perception dataset, not a planning or long-horizon-manipulation paper; it gives BLADE-style work a sensory front-end, not a new abstraction mechanism. The implicit-neural "object as a function" framing is also a clean counterpoint to explicit symbolic state: here the "world model" of an object is a renderer of raw multisensory signals, which a downstream predicate classifier would still have to interpret.

Quotable

We virtualize each object by encoding its intrinsics (texture, material type, and 3D shape) with an Object File implicit neural representation. Then we can render its visual appearance, impact sound, and tactile readings based on any extrinsic parameters. — Fig 1 caption / p.1
Models learned from virtual objects in our dataset successfully transfer to their real-world counterparts in three challenging tasks: object scale estimation, contact localization, and shape reconstruction. — Abstract / p.1
Among the three modalities, tactile data has the smallest Sim2Real gap compared to vision and audio. — §5.1, Object Scale Estimation / p.6

Related

Papers cited that should likely be ingested next:

Newly ingested in 2026-06-24 batch — directly relevant: