One-liner. A 1,000-object dataset where every household object is a single implicit neural network (an "Object File") that renders its visual, acoustic, and tactile signals on demand in real time — and, crucially, models trained on these virtual objects transfer to their real-world counterparts on scale estimation, contact localization, and shape reconstruction, making it a Sim2Real testbed for the non-visual senses manipulation actually runs on.
Real objects are not just geometry plus a texture: an alarm clock looks glossy, a plate clinks when struck, a knife feels sharp. Prior object datasets (ImageNet, ShapeNet, ABO, YCB, BigBIRD) model only the visual modality or only low-quality geometry, so multisensory object-centric learning has had no realistic, scalable substrate. ObjectFolder 1.0 introduced the implicit-neural "object as a function" idea but covered only 100 objects, rendered slowly, and produced limited-quality audio/touch. ObjectFolder 2.0 attacks all three deficits at once: 10× the objects, ~37× faster rendering end to end (Table 1), and higher fidelity across vision/audio/touch. It also adds the thing 1.0 never demonstrated: that learning on the virtual objects generalizes to the real ones.
Each object is an Object File: one implicit neural representation with three coordinate-MLP sub-networks — VisionNet, AudioNet, TouchNet (Fig 3). The pipeline starts from 1,000 high-quality scanned meshes (855 from ABO after material filtering to ceramic/glass/wood/plastic/iron/polycarbonate/steel, all 100 from OF 1.0, 45 polycarbonate from Google Scanned Objects), with real-product metadata (category, material, color, dimensions). It then simulates each modality and distills it into the Object File so that storage is independent of the extrinsic query parameters (viewpoint, impact force, gel deformation).
VisionNet (real-time appearance). Builds on the Object
Scattering Function (OSF) volumetric appearance model, but replaces the single
large MLP with a KiloOSF design (after KiloNeRF): the object is
subdivided into a uniform grid s = (s_x, s_y, s_z), each cell gets
a tiny MLP v(i), and a spatial-binning map m(x) routes
a 3D point to its cell's network before querying color/density
(c, σ) = F_{v(m(x))}(x, r). Trained by distillation from a
full OSF teacher, with empty-space skipping and early ray termination. Result:
60× faster inference than 1.0 (Table 1) at better visual quality.
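A minimal sketch of the routing step m(x), assuming an axis-aligned bounding box and a dictionary of per-cell networks. The names `cell_index`, `query`, and `cell_nets` are hypothetical, the lambda stands in for a trained tiny MLP, and treating a missing dictionary entry as empty space is a simplification of empty-space skipping. This is not the authors' code.

```python
import math
from typing import Callable, Dict, Tuple

Vec3 = Tuple[float, float, float]
Cell = Tuple[int, int, int]
# A per-cell network maps (x, r) to (color, density); hypothetical signature.
CellNet = Callable[[Vec3, Vec3], Tuple[Vec3, float]]


def cell_index(x: Vec3, bbox_min: Vec3, bbox_max: Vec3, s: Cell) -> Cell:
    """Spatial binning m(x): map a 3D point to its grid cell (i, j, k)."""
    idx = []
    for xi, lo_i, hi_i, n in zip(x, bbox_min, bbox_max, s):
        u = (xi - lo_i) / (hi_i - lo_i)    # normalize to [0, 1] inside the box
        i = math.floor(u * n)              # scale by the grid resolution
        idx.append(min(max(i, 0), n - 1))  # clamp points on or outside the boundary
    return (idx[0], idx[1], idx[2])


def query(x: Vec3, r: Vec3, bbox_min: Vec3, bbox_max: Vec3, s: Cell,
          cell_nets: Dict[Cell, CellNet]) -> Tuple[Vec3, float]:
    """Evaluate (c, sigma) = F_{v(m(x))}(x, r) by routing x to its cell's network."""
    net = cell_nets.get(cell_index(x, bbox_min, bbox_max, s))
    if net is None:
        # No network stored for this cell: treat it as empty space.
        return (0.0, 0.0, 0.0), 0.0
    return net(x, r)


# Toy 2x2x2 grid over the unit cube with a single occupied cell.
grid = (2, 2, 2)
lo, hi = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
nets = {(1, 0, 1): lambda x, r: ((1.0, 0.0, 0.0), 5.0)}  # stand-in for a tiny MLP

print(cell_index((0.75, 0.25, 0.5), lo, hi, grid))  # (1, 0, 1)
print(query((0.75, 0.25, 0.5), (0.0, 0.0, 1.0), lo, hi, grid, nets))
print(query((0.1, 0.1, 0.1), (0.0, 0.0, 1.0), lo, hi, grid, nets))
```

The point of the design is that each query touches only one small network, so per-sample cost is set by the tiny MLP rather than by a single large one.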
AudioNet (impact sound). Uses linear modal analysis. Each
mesh is converted to a quadratic tetrahedral mesh and run through Finite Element
Methods (FEM, second-order elements in Abaqus) to get the modal decomposition
\ddot{q} + (αI + βΛ)\dot{q} + Λq = U^T f,
yielding N damped sinusoids with frequencies ω_i,
dampings d_i, gains g_i. The key trick: frequencies and
dampings are object intrinsics (fixed once per object), but the per-mode gains
g_i depend on contact force and location, so AudioNet is an
MLP that, given a tetrahedral-vertex coordinate, predicts the gain vector for
unit force in each of three axis directions. At inference an arbitrary force
f = k_x f_x + k_y f_y + k_z f_z, with f_x, f_y, f_z the unit forces along the
axes, combines the three predicted gain branches with the same weights
(k_x, k_y, k_z) to give the gains ĝ_i; the waveform is synthesized as
S(t) = ∑_{i=1}^{N} ĝ_i e^{-d_i t} sin(2πω_i t). Predicting
only the location-dependent part (vs. 1.0's direct spectrogram regression) is
what raises quality and lets it run in real time.
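The gain mixing and synthesis formula can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `combine_gains` and `synthesize` are hypothetical names, the two-mode frequencies, dampings, and gain branches are made-up numbers, and in the real pipeline the branches would come from AudioNet and the frequencies and dampings from the FEM modal analysis.

```python
import math
from typing import List, Sequence


def combine_gains(k: Sequence[float],
                  gains_x: Sequence[float],
                  gains_y: Sequence[float],
                  gains_z: Sequence[float]) -> List[float]:
    """Mix the three unit-force gain branches with the force weights (k_x, k_y, k_z)."""
    k_x, k_y, k_z = k
    return [k_x * gx + k_y * gy + k_z * gz
            for gx, gy, gz in zip(gains_x, gains_y, gains_z)]


def synthesize(gains: Sequence[float],
               freqs_hz: Sequence[float],
               dampings: Sequence[float],
               duration_s: float,
               sample_rate: int = 44100) -> List[float]:
    """Sample S(t) = sum_i g_i * exp(-d_i * t) * sin(2 * pi * w_i * t)."""
    n_samples = int(round(duration_s * sample_rate))
    wave = []
    for j in range(n_samples):
        t = j / sample_rate
        wave.append(sum(g * math.exp(-d * t) * math.sin(2.0 * math.pi * w * t)
                        for g, w, d in zip(gains, freqs_hz, dampings)))
    return wave


# Made-up values for a two-mode object.
freqs_hz = [440.0, 1320.0]
dampings = [8.0, 20.0]
gains = combine_gains((1.0, 2.0, 3.0), [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
wave = synthesize(gains, freqs_hz, dampings, duration_s=0.5, sample_rate=8000)

print(gains)      # [4.0, 5.0]
print(len(wave))  # 4000
```

Because the mixing is linear, one network evaluation per contact location covers every force direction and magnitude; only the three weights change.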
TouchNet (GelSight-style tactile). Two stages. The first predicts a
per-contact deformation map; its training targets are simulated from the
object's local shape in the contact area plus the gelpad shape elsewhere,
rendered with Pyrender/OpenGL at ~700 fps. TouchNet is an MLP
F_T : (x, y, z, θ_T, φ_T, p, w, h) → d mapping an 8D input (3D contact
location (x, y, z), unit contact orientation parameterized by the angles
(θ_T, φ_T), gel penetration depth p, and pixel location (w, h)) to the
deformation-map value d at that pixel. The second stage converts the
deformation map to a GelSight RGB tactile image via Taxim, calibrated
against a real GelSight sensor. Unlike OF 1.0 (a single tactile image per vertex
along the normal), 2.0 renders varied rotation (±15°) and pressing
depth (0.5–2 mm), enabling Sim2Real tactile transfer.
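A rough sketch of how the 8D TouchNet input could be assembled for one contact: the six contact parameters are held fixed and only the pixel location varies, so a full deformation map is one network evaluation per pixel. The names `touch_queries`, `render_deformation_map`, and `toy_net` are hypothetical, `toy_net` is a stand-in for a trained network, and using raw integer pixel indices (rather than normalized coordinates) is an assumption.

```python
from typing import Callable, List, Tuple

# (x, y, z, theta, phi, p, w, h)
Query = Tuple[float, float, float, float, float, float, float, float]


def touch_queries(location: Tuple[float, float, float], theta: float, phi: float,
                  depth: float, width: int, height: int) -> List[Query]:
    """Build one 8D query per pixel, in row-major order."""
    x, y, z = location
    return [(x, y, z, theta, phi, depth, float(w), float(h))
            for h in range(height) for w in range(width)]


def render_deformation_map(touch_net: Callable[[Query], float],
                           location: Tuple[float, float, float],
                           theta: float, phi: float, depth: float,
                           width: int, height: int) -> List[List[float]]:
    """Evaluate the network at every pixel and reshape to height x width."""
    values = [touch_net(q)
              for q in touch_queries(location, theta, phi, depth, width, height)]
    return [values[row * width:(row + 1) * width] for row in range(height)]


def toy_net(q: Query) -> float:
    """Stand-in for a trained TouchNet: returns the penetration depth p."""
    return q[5]


queries = touch_queries((0.1, 0.2, 0.3), 0.0, 0.0, 1.0, width=4, height=3)
dmap = render_deformation_map(toy_net, (0.1, 0.2, 0.3), 0.0, 0.0, 1.0, width=4, height=3)
print(len(queries), len(dmap), len(dmap[0]))  # 12 3 4
```

In practice the per-pixel queries would be batched through the MLP in one call; the loop here only makes the input layout explicit.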
Rendering speed (Table 1, seconds per sample) and quality (Table 2):
| Method | Total render (s)↓ | Vision PSNR↑ | Vision SSIM↑ | Audio STFT (×10^{-5})↓ | Audio ENV (×10^{-4})↓ | Touch PSNR↑ | Touch SSIM↑ |
|---|---|---|---|---|---|---|---|
| ObjectFolder 1.0 | 4.129 | 35.7 | 0.97 | 4.94 | 7.65 | 27.9 | 0.64 |
| ObjectFolder 2.0 | 0.111 | 36.3 | 0.98 | 0.19 | 1.29 | 31.6 | 0.78 |
~37× faster overall (VisionNet alone 60×), with audio STFT distance cut by ~26× and touch PSNR up ~3.7 dB.
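The summary ratios follow directly from the two table rows. A quick check, with the numbers copied from the table and the dictionary keys chosen here for illustration:

```python
# Values copied from the table rows for ObjectFolder 1.0 and 2.0.
of1 = {"render_s": 4.129, "audio_stft": 4.94, "touch_psnr": 27.9}
of2 = {"render_s": 0.111, "audio_stft": 0.19, "touch_psnr": 31.6}

render_speedup = of1["render_s"] / of2["render_s"]        # lower is better, so 1.0 / 2.0
stft_reduction = of1["audio_stft"] / of2["audio_stft"]    # same scale factor cancels
touch_psnr_gain = of2["touch_psnr"] - of1["touch_psnr"]   # dB difference

print(f"render speedup: {render_speedup:.1f}x")       # render speedup: 37.2x
print(f"STFT reduction: {stft_reduction:.1f}x")       # STFT reduction: 26.0x
print(f"touch PSNR gain: {touch_psnr_gain:.1f} dB")   # touch PSNR gain: 3.7 dB
```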
Sim2Real downstream tasks (the headline contribution):
Across all three tasks the story is the same: models trained purely on 2.0's virtual sensory data transfer to real objects, and combining modalities beats any single one — the empirical case for multisensory representations.
From the authors:
What I noticed reading it:
This is squarely on the batch thesis that many manipulation
predicates are not visually evaluable — is_struck,
surface_is_rough, material_is_ceramic,
is_full live in sound and touch. ObjectFolder 2.0
is the substrate for learning exactly those: a scalable repository where
the non-visual senses are first-class, queryable functions, and where the
empirical result is that fusing them beats vision alone. For
BLADE,
whose predicate classifiers are currently visual
(f_θ(p): O → {T,F} over RGB crops), this paper is the
data answer to BLADE's own flagged limitation that "many predicates aren't
visually evaluable": a tactile/acoustic predicate classifier needs touch/sound
training data, and ObjectFolder 2.0 (and its successors) is where that data is
manufacturable at scale with Sim2Real validity. The honest caveat: it is a
perception dataset, not a planning or long-horizon-manipulation paper
— it gives BLADE-style work a sensory front-end, not a new abstraction
mechanism. The implicit-neural "object as a function" framing is also a clean
counterpoint to explicit symbolic state: here the "world model" of an object is
a renderer of raw multisensory signals, which a downstream predicate classifier
would still have to interpret.
We virtualize each object by encoding its intrinsics (texture, material type, and 3D shape) with an Object File implicit neural representation. Then we can render its visual appearance, impact sound, and tactile readings based on any extrinsic parameters. — Fig 1 caption / p.1
Models learned from virtual objects in our dataset successfully transfer to their real-world counterparts in three challenging tasks: object scale estimation, contact localization, and shape reconstruction. — Abstract / p.1
Among the three modalities, tactile data has the smallest Sim2Real gap compared to vision and audio. — §5.1, Object Scale Estimation / p.6
Papers cited that should likely be ingested next:
Newly ingested in 2026-06-24 batch — directly relevant: