SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

Changan Chen*, Carl Schissler*, Sanchit Garg*, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip Robinson, Kristen Grauman · UT Austin / Reality Labs at Meta / Georgia Tech / FAIR · NeurIPS 2022 (Datasets and Benchmarks Track) · arXiv:2206.08312

One-liner. SoundSpaces 2.0 is an on-the-fly, geometry-based acoustic renderer that, given any 3D mesh, produces physically-grounded room impulse responses (reflection, reverb, transmission, diffraction, air absorption, binaural spatialization) for arbitrary source/receiver placements — turning a static visual simulator into a joint audio-visual one fast enough for embodied RL training and faithful enough to transfer to the real world.

Problem & motivation

What we see and hear are tightly coupled: room geometry and surface materials transform every sound that reaches our ears (a marble museum vs. a carpeted bookshop). Embodied-AI and AR/VR systems — a rescue robot localizing a person calling for help, a service robot listening for whether the espresso machine is running — need that visual↔acoustic correspondence in simulation. Prior audio simulators were too rigid. The original SoundSpaces [14] precomputes room impulse responses (RIRs) on a fixed 0.5 m discrete grid for a fixed list of 100 environments, weighing on the order of terabytes; agents can only hop between grid points, configurations (microphones, materials) can't be changed, and it doesn't generalize to new scenes. ThreeDWorld [21] renders continuously but only for oversimplified "shoebox" (rectangular) rooms, so it can't use real-scan meshes. The gap: no geometry-based acoustic simulator was simultaneously high-fidelity, configurable, generalizable to arbitrary meshes, and fast enough for embodied learning.

Method

The core is the RLR-Audio-Propagation engine, integrated into the Habitat-Sim [63] visual simulator (Fig 1). Given a scene mesh plus user-specified source/receiver locations, it computes an RIR via a bidirectional path-tracing algorithm [10] over M logarithmically-spaced frequency bands. It models reflection, transmission, and diffraction, plus binaural spatialization. Rays from each source carry per-band energy and bounce through the scene (reflection via a Phong BRDF, diffraction probabilistically at specially-constructed edge geometry, transmission with probability set by material coefficients); rays are then emitted from the listener and connected to source paths via multiple importance sampling. Energy is accumulated into an energy-time histogram whose spherical-harmonic coefficients per bin encode the directional distribution of arriving sound, which is then spatialized to an ambisonic or binaural pressure IR [69] and convolved with the source audio.
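The final step of this pipeline, convolving dry source audio with the rendered IR, can be sketched in a few lines. This is a minimal illustration and not the RLR-Audio-Propagation code: the function names, the toy two-ear IRs, and the pure-Python loop are my assumptions (a real pipeline would more likely use an FFT-based convolution).

```python
def convolve(signal, ir):
    """Full discrete convolution: out[n] = sum_k signal[k] * ir[n - k]."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def render_binaural(dry, ir_left, ir_right):
    """Convolve one dry (anechoic) signal with a per-ear IR pair."""
    return convolve(dry, ir_left), convolve(dry, ir_right)


# Toy IRs with made-up values: a direct arrival followed by one weaker reflection.
# The right ear receives the direct sound one sample later and quieter.
ir_left = [1.0, 0.0, 0.0, 0.5]
ir_right = [0.0, 0.8, 0.0, 0.0, 0.25]
dry = [1.0, 2.0, 0.0]

left, right = render_binaural(dry, ir_left, ir_right)
print(left)   # [1.0, 2.0, 0.0, 0.5, 1.0, 0.0]
print(right)
```

The inter-ear delay and level difference in the toy IRs stand in for what the binaural spatialization step would produce from the directional energy histogram.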

Improvements over original SoundSpaces. (i) Added acoustic diffraction via the fast diffraction approach [68], fixing abrupt occlusion artifacts. (ii) Fixed a direct-to-reverberant ratio (DRR) bias of √(4π) present in the indirect sound pressure of the original.
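For scale, a pressure-domain factor of √(4π) can be converted to decibels. The snippet below is only a worked conversion; the variable names are mine. That the result lands close to the 11.0 dB average DRR error reported for the original SoundSpaces under Results is my observation, and the summary does not state that this bias alone accounts for that error.

```python
import math

# A pressure (amplitude) ratio converts to decibels as 20 * log10(ratio).
pressure_bias = math.sqrt(4.0 * math.pi)
bias_db = 20.0 * math.log10(pressure_bias)

print(f"sqrt(4*pi) = {pressure_bias:.3f} -> {bias_db:.2f} dB")  # about 10.99 dB
```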

Continuity. Spatial continuity: render R(s, r, θ) at arbitrary (non-grid) source s, receiver r, heading θ; received sound is Aⁿ = Aₛ ∗ R(s, r, θ). Acoustic continuity: instead of convolving a fresh IR per discrete step (which made the source effectively stop/restart each second), they index the source audio at the right delayed sample (tₙ = tᵢ − R(s, xᵢ, θ) + 1) and apply linear crossfading between successive listener positions over a window T to smooth transitions.
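The crossfading step can be illustrated as follows. This is a sketch under assumptions: the function name, the equal-length inputs, and placing the ramp at the start of the chunk are mine; the source only says that linear crossfading is applied between successive listener positions over a window T.

```python
def linear_crossfade(prev_audio, next_audio, window):
    """Blend from prev_audio to next_audio over the first `window` samples.

    prev_audio: the chunk rendered with the previous listener position's IR
    next_audio: the same chunk rendered with the new position's IR
    """
    if len(prev_audio) != len(next_audio):
        raise ValueError("both renderings must cover the same samples")
    if not 0 < window <= len(prev_audio):
        raise ValueError("window must be between 1 and the chunk length")
    out = []
    for n, (p, q) in enumerate(zip(prev_audio, next_audio)):
        w = min(n / window, 1.0)  # weight of the new rendering, ramps 0 -> 1
        out.append((1.0 - w) * p + w * q)
    return out


# Toy chunks: the old position hears a constant 1.0, the new one hears silence.
blended = linear_crossfade([1.0] * 4, [0.0] * 4, window=2)
print(blended)  # [1.0, 0.5, 0.0, 0.0]
```

Without the ramp, the output would jump from 1.0 to 0.0 in one sample, which is the kind of audible discontinuity the crossfade is meant to avoid.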

Configurability. Exposed parameters: sampling rate, number of frequency bands, ray counts, whether reflection/transmission/diffraction are enabled. Microphone types: mono, binaural, ambisonics, or a user-specified array. Custom HRTFs can be loaded. Material modeling: 29 built-in acoustic materials, each with frequency-dependent absorption, scattering, and transmission coefficients of the form [f₁, c₁, ..., fₙ, cₙ], plus distance-dependent air absorption [8]. This fixes the original's rigid object-category→material map (e.g., all floors forced to "carpet").
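One way to read the flat [f₁, c₁, ..., fₙ, cₙ] coefficient format is as frequency/value pairs that get resampled at the simulator's band centers. The sketch below assumes exactly that; the helper names, the log-frequency interpolation, the clamping outside the given range, and the example absorption values are all hypothetical and are not taken from the SoundSpaces 2.0 API.

```python
import math


def parse_coefficients(flat):
    """Turn [f1, c1, ..., fn, cn] into a frequency-sorted list of (f, c) pairs."""
    if len(flat) < 2 or len(flat) % 2:
        raise ValueError("expected a non-empty flat list of (frequency, coefficient) pairs")
    return sorted(zip(flat[0::2], flat[1::2]))


def coefficient_at(pairs, freq_hz):
    """Interpolate linearly in log-frequency, clamping outside the given range."""
    if freq_hz <= pairs[0][0]:
        return pairs[0][1]
    if freq_hz >= pairs[-1][0]:
        return pairs[-1][1]
    for (f0, c0), (f1, c1) in zip(pairs, pairs[1:]):
        if f0 <= freq_hz <= f1:
            t = math.log(freq_hz / f0) / math.log(f1 / f0)
            return c0 + t * (c1 - c0)


def log_spaced_bands(f_min, f_max, count):
    """Center frequencies spaced evenly on a log axis (count >= 2)."""
    ratio = (f_max / f_min) ** (1.0 / (count - 1))
    return [f_min * ratio ** i for i in range(count)]


# Made-up absorption curve for an illustrative material.
absorption = parse_coefficients([125, 0.10, 500, 0.30, 2000, 0.50])
bands = log_spaced_bands(125, 2000, 5)
per_band = [coefficient_at(absorption, f) for f in bands]
print([round(c, 3) for c in per_band])
```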

Two rendering modes (Sec 3.5): high-quality (all parameters maxed, temporal coherence off, gold standard) and high-speed (fewer rays, reuses previously-computed IRs under a locally-continuous-motion assumption [66]) for RL training that needs millions of steps.

SoundSpaces-PanoIR dataset. To serve users who want visual-acoustic data without running the simulator, they render an offline dataset of 10M panoramic image + IR pairs across 750 environments from Matterport3D, Gibson, and HM3D, each with the source's polar coordinate relative to the panorama center (Fig 2).

Setup

Datasets / benchmarks: Mesh assets from Replica [71], Matterport3D [11], Gibson [81], HM3D [59] (arbitrary-mesh compatible). Real-world validation: 7 source/receiver IR measurements collected in the FRL apartment of the Replica scene. Released SoundSpaces-PanoIR (10M image-IR pairs, 750 envs). Downstream: continuous AudioGoal navigation dataset [14]; far-field ASR on LibriSpeech [57] train-clean-100, tested on a real RIR dataset [72] + RWCP/REVERB/AIR databases.
Hardware / simulator: SoundSpaces 2.0 / RLR-Audio-Propagation engine integrated with Habitat-Sim [63]. Speed profiled on a Xeon Gold 6230 CPU @ 2.10 GHz. Real IR capture used a B&K Type 4295 omnidirectional speaker (100 Hz–8 kHz) and an Earthworks M30 mic via exponential sine sweep. No robot hardware — this is a simulator paper.
Baselines: Simulators — SoundSpaces 1.0 [14], GWA [74], ThreeDWorld [21], Pyroomacoustics [64] (Table 1, on Audio-Visual / Geometric / Configurable / Arbitrary-Env axes). Navigation: DAV-Nav agent [80, 14] trained on SoundSpaces 1.0 (discrete) vs. SoundSpaces 2.0 (continuous). ASR: SpeechBrain [61] transformer finetuned on real IRs [39], Pyroomacoustics, SoundSpaces 1.0, vs. SoundSpaces 2.0.
Compute: Rendering speed reported (Table 2): high-quality 0.9 FPS (1 thread) / 4.0 FPS (5 threads); high-speed 7.7 / 33.5 FPS; the original SoundSpaces, which only looks up precomputed IRs, runs at 500+ FPS, bottlenecked on I/O. Training compute for the nav/ASR benchmarks is not reported.

Results

Simulator capability (Table 1): SoundSpaces 2.0 is the only platform checking all four boxes — Audio-Visual, Geometric, Configurable, and Arbitrary-Env. SoundSpaces 1.0 lacks Configurable and Arbitrary-Env; ThreeDWorld and Pyroomacoustics lack Geometric/Arbitrary-Env; GWA lacks Audio-Visual.

Accuracy vs. real measurements (Sec 5.2). Against the 7 FRL-apartment measurements, SoundSpaces 2.0 reduces average DRR error from 11.0 dB (original SoundSpaces) to 0.98 dB while holding relative RT60 error at 12.4%; its energy-decay curves track the real measurements much more closely (Fig 3b,c).
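For readers unfamiliar with the two metrics, here is a minimal sketch of how DRR and RT60 are commonly estimated from an impulse response. The 2.5 ms direct-sound window, the Schroeder backward integration, and the -5 to -25 dB fit range are common conventions that I am assuming; the summary does not say which exact procedure the paper used, and the function names are mine.

```python
import math


def drr_db(ir, sample_rate, direct_window_ms=2.5):
    """Direct-to-reverberant ratio: energy around the main peak vs. energy after it."""
    peak = max(range(len(ir)), key=lambda n: abs(ir[n]))
    half = int(sample_rate * direct_window_ms / 1000)
    start, stop = max(0, peak - half), min(len(ir), peak + half + 1)
    direct = sum(x * x for x in ir[start:stop])
    reverb = sum(x * x for x in ir[stop:])
    return 10.0 * math.log10(direct / reverb)


def schroeder_decay_db(ir):
    """Energy decay curve: backward-integrated energy, normalized to 0 dB at t = 0."""
    tail, tails = 0.0, []
    for x in reversed(ir):
        tail += x * x
        tails.append(tail)
    tails.reverse()
    return [10.0 * math.log10(t / tails[0]) for t in tails if t > 0.0]


def rt60_seconds(ir, sample_rate, fit_range_db=(-5.0, -25.0)):
    """Least-squares line through the decay curve, extrapolated to a 60 dB drop."""
    upper, lower = fit_range_db
    points = [(n / sample_rate, d)
              for n, d in enumerate(schroeder_decay_db(ir)) if lower <= d <= upper]
    t_mean = sum(t for t, _ in points) / len(points)
    d_mean = sum(d for _, d in points) / len(points)
    slope = (sum((t - t_mean) * (d - d_mean) for t, d in points)
             / sum((t - t_mean) ** 2 for t, _ in points))  # dB per second
    return -60.0 / slope


def relative_error(estimate, reference):
    return abs(estimate - reference) / reference


# Synthetic check: an ideal exponential decay whose energy falls 60 dB in 0.5 s.
fs, true_rt60 = 1000, 0.5
decay_ir = [10.0 ** (-3.0 * n / (fs * true_rt60)) for n in range(1000)]
estimated = rt60_seconds(decay_ir, fs)
print(f"RT60 estimate: {estimated:.3f} s, relative error {relative_error(estimated, true_rt60):.2%}")

# Toy IR: unit direct impulse, then ten reverberant samples of amplitude 0.1.
toy_ir = [1.0, 0.0, 0.0, 0.0, 0.0] + [0.1] * 10
print(f"DRR: {drr_db(toy_ir, fs):.2f} dB")
```

A DRR error is then the absolute difference in dB between simulated and measured DRR, and a relative RT60 error is `relative_error` applied to the two RT60 estimates.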

Speed/quality tradeoff (Table 2): high-speed mode is roughly 8× faster than high-quality at either thread count (7.7 vs. 0.9 FPS on 1 thread; 33.5 vs. 4.0 FPS on 5 threads) while losing only 9.5% relative RT60 accuracy; a nav model trained in high-speed and evaluated in high-quality differs by <1%.
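The speedups follow directly from the Table 2 frame rates quoted under Setup. A quick check of that arithmetic, plus the wall-clock cost of a hypothetical one-million-step run, is sketched below. The 1M-step budget and the one-rendered-IR-per-step assumption are mine, chosen only to put the FPS figures in perspective.

```python
# FPS figures as quoted from Table 2 in this summary.
fps = {
    ("high-quality", 1): 0.9,
    ("high-quality", 5): 4.0,
    ("high-speed", 1): 7.7,
    ("high-speed", 5): 33.5,
}


def speedup(threads):
    """High-speed FPS divided by high-quality FPS at the same thread count."""
    return fps[("high-speed", threads)] / fps[("high-quality", threads)]


def hours_for_steps(steps, frames_per_second):
    """Wall-clock hours to render `steps` frames at a constant frame rate."""
    return steps / frames_per_second / 3600.0


for threads in (1, 5):
    print(f"{threads} thread(s): {speedup(threads):.1f}x")
# 1 thread(s): 8.6x
# 5 thread(s): 8.4x

for (mode, threads), rate in fps.items():
    print(f"{mode}, {threads} thread(s): {hours_for_steps(1_000_000, rate):.1f} h per 1M steps")
```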

Benchmark 1 — Continuous Audio-Visual Navigation (Table 3):

Train	Test	Success %	SPL %	DTG (m)
SoundSpaces [14]	Continuous space	64.2 ± 0.8	27.5 ± 0.4	5.6 ± 0.2
SoundSpaces [14]	Continuous space & sound	0.9 ± 0.2	0.3 ± 0.1	12.9 ± 0.1
SoundSpaces 2.0	Continuous space & sound	64.7 ± 3.9	49.3 ± 3.0	5.9 ± 0.5

An agent trained on discrete-sound SoundSpaces collapses (0.9% success) when tested under acoustically-continuous audio, because it leans on an always-present (but unrealistic) direct-sound cue; SoundSpaces 2.0 training restores 64.7%. Note the discrete agent retains decent success but much worse SPL (27.5 vs. 49.3) even on continuous space alone — spatial discretization mainly hurts path efficiency.
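The gap between Success and SPL is easier to read with the metric written out. The sketch below uses the standard definition of SPL (success weighted by the ratio of shortest-path length to the length of the path actually taken); the summary does not restate the definition, so treating the table's SPL as this standard form is an assumption, and the episode tuples are made-up values.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length).
    Each successful episode contributes shortest / max(agent, shortest);
    failed episodes contribute 0.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


def success_rate(episodes):
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


# Two agents with identical success but different path efficiency (toy numbers).
direct_agent = [(True, 10.0, 10.0), (True, 10.0, 10.0), (False, 10.0, 5.0)]
wandering_agent = [(True, 10.0, 20.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]

print(success_rate(direct_agent), spl(direct_agent))
print(success_rate(wandering_agent), spl(wandering_agent))
```

Both toy agents succeed in two of three episodes, but the one that takes paths twice as long as necessary ends up with half the SPL, which mirrors the pattern in the first and third rows of the table.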

Benchmark 2 — Far-field ASR (Table 4, Word Error Rate):

Finetuning IRs	WER %
Pretrained (none)	29.10
Real IRs [39]	13.32
Pyroomacoustics [64]	16.24
SoundSpaces 1.0 [14]	18.48
SoundSpaces 2.0	12.48

Finetuning ASR on SoundSpaces 2.0 synthetic IRs beats finetuning on real IRs (12.48 vs. 13.32 WER) — the headline sim2real result. Adding their acoustic randomization (random per-category material + Gaussian coefficient noise) pushes WER further to 12.04%, whereas naive uniform coefficient sampling hurts (12.58%), showing structured randomization matters.
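The structured randomization described here (a random material per object category, then Gaussian noise on its coefficients) might look roughly like the sketch below. Everything in it is hypothetical: the material names, coefficient values, candidate lists, noise scale, and clipping to [0, 1] are illustrative assumptions and not the paper's settings or the simulator's API.

```python
import random

# Hypothetical per-band absorption coefficients for a few materials.
MATERIALS = {
    "carpet": [0.10, 0.30, 0.60],
    "wood": [0.15, 0.10, 0.07],
    "concrete": [0.02, 0.03, 0.04],
}

# Hypothetical plausible materials for each object category.
CANDIDATES = {
    "floor": ["carpet", "wood", "concrete"],
    "wall": ["wood", "concrete"],
}


def randomize_materials(rng, sigma=0.05):
    """Pick one plausible material per category, then jitter its coefficients."""
    assignment = {}
    for category, options in CANDIDATES.items():
        name = rng.choice(options)
        noisy = [min(1.0, max(0.0, c + rng.gauss(0.0, sigma)))
                 for c in MATERIALS[name]]
        assignment[category] = (name, noisy)
    return assignment


def uniform_coefficients(rng, bands=3):
    """The naive alternative: coefficients drawn uniformly, ignoring category."""
    return [rng.uniform(0.0, 1.0) for _ in range(bands)]


rng = random.Random(0)  # seeded so a run is reproducible
for category, (name, coeffs) in randomize_materials(rng).items():
    print(category, name, [round(c, 3) for c in coeffs])
```

The contrast with `uniform_coefficients` is the point: the structured version stays near physically plausible materials for each category, while uniform sampling can produce combinations no real surface has.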

Limitations & open questions

From the authors (Sec 6):

Path tracing needs high-quality watertight meshes; large holes cause ray leakage and inaccurate RT60 (Matterport3D's broken meshes show this). They expose a ray-leak API but mesh repair is on the user.
Acoustic material properties can't be estimated from geometry/visuals alone; they fall back on coarse category→material assignment, so fidelity is capped without real acoustic measurements per scene.
Inherits geometrical-acoustics shortcomings (room modes), though diffraction is now handled.
Real-IR validation is only on one apartment (Replica FRL); broader geometry/material validation is future work.

What I noticed reading it:

The marquee "beats real IRs for ASR" claim rests on a single finetuning/test split (Table 4) with no variance reported on WER; a 0.84-point WER gap over real IRs (12.48 vs. 13.32) could be within run-to-run noise. The headline would be stronger with seeds/CIs on the ASR numbers (they do report std for nav and speed, but not ASR).
Real-world acoustic validation is n = 7 source/receiver pairs in one apartment. The DRR/RT60 fidelity claims (0.98 dB DRR error, 12.4% relative RT60 error) are persuasive but rest on a single room; generalization of acoustic accuracy across diverse real rooms is asserted, not measured.
The whole edifice assumes object-category material maps are an acceptable prior. For any contact-rich / fine-grained acoustic task (distinguishing wood vs. metal from impact sound), this category-level material prior is exactly the wrong granularity — the simulator can't render an object's instance-specific acoustic signature it was never told about.
Speed numbers (0.9–33.5 FPS) are CPU-only on one machine; for RL at "millions to billions" of steps these are still slow relative to pure visual sim, and no GPU/throughput-at-scale figure is given.
It renders propagation of arbitrary input sounds but does not synthesize the source sounds themselves (impact/contact audio) the way ObjectFolder-style object-centric simulators do — an important scoping boundary for manipulation use.

Why I care

Off-theme flag: this is an infrastructure/dataset paper for audio-visual navigation and far-field ASR, not robot manipulation, and there is no robot, gripper, or contact-rich task in it. I should not overstate its relevance to BLADE's long-horizon manipulation / planning-abstraction agenda. That said, it is the canonical acoustic-rendering substrate for the broader thesis I keep circling: many manipulation predicates — is_full (pouring), is_screwed_tight, is_inserted, surface_is_rough — are not visually evaluable and live in sound/vibration. If a future BLADE-style system wants to learn a poured-enough(cup) or contact-made predicate classifier from audio, it needs (a) realistic acoustic data at scale and (b) sim2real acoustic transfer. SoundSpaces 2.0 is the cleanest existence proof that geometry-based acoustic sim can out-transfer real recordings (the ASR result), which de-risks the "can we even train acoustic predicate classifiers in sim?" question.

Concrete tangential relevance to my themes:

Sim2real for non-visual sensing. The acoustic-randomization finding (structured material randomization > uniform) is a direct analogue to domain randomization for tactile/force sim — a recipe worth borrowing if acoustic predicates ever enter the abstraction layer.
Adjacent, not core. Unlike SonicSense or ManiWAV, this renders environmental acoustics (room reverb / propagation), not contact audio at the gripper. The manipulation-relevant signal (impact/vibration from grasping, screwing, pouring) is a different physics regime this paper doesn't target — the source-sound synthesis is out of scope here.
Simulator-cluster placement. Sits with ObjectFolder 2.0, TACTO, and Taxim as multisensory-simulation infrastructure — but ObjectFolder is the object-centric / contact-audio counterpart, whereas SoundSpaces is the scene-centric / propagation counterpart.

Quotable

To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. — Abstract / p.1

Finetuning on real IRs also reduces the error substantially, but still not as much as our simulated data, which can be generated at scale across a wide variety of environments. Our simulation generates realistic IRs that help machine learning models generalize better to reality. — §5.4 / p.9

At the object level, we can anticipate the sounds an object makes based on how it looks, and vice versa (a dog barks, a door slams, a baby cries). At the environment level, materials and geometry of the surrounding 3D space ... transform the sounds that reach our ears. — §1 / p.1

Papers cited that should likely be ingested next:

[14] Chen et al. 2020 — SoundSpaces: Audio-Visual Navigation in 3D Environments (ECCV) — the direct predecessor this paper overhauls; defines AudioGoal navigation and the precomputed-grid RIR approach.
[24] Gao et al. 2022 — ObjectFolder 2.0 (CVPR) — the object-centric multisensory sim2real counterpart; already in this batch (see below).
[63] Savva et al. 2019 — Habitat (ICCV) — the visual-simulation platform SoundSpaces 2.0 integrates into.
[80] Wijmans et al. 2020 — DD-PPO (ICLR) — the distributed RL engine powering the DAV-Nav agent.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ObjectFolder 2.0 and ObjectFolder — the object-centric multisensory (visual/acoustic/tactile) simulation line; SoundSpaces is the scene-/propagation-centric complement to ObjectFolder's contact-/impact-sound focus.
ObjectFolder Benchmark — neural-vs-real multisensory benchmark; same sim2real-for-audio evaluation spirit as SoundSpaces 2.0's real-IR validation.
SonicSense, ManiWAV, and Making Sense of Audio Vibration (pouring) — manipulation papers that use contact audio at the gripper; SoundSpaces renders environmental acoustics, the orthogonal physics regime — useful contrast for any acoustic-sensing-for-manipulation survey.
The Sound of Simulation — generative audio sim2real; closest sibling on the "simulate audio to transfer to real" thesis.
TACTO, Taxim, TacEx — the tactile-sim counterparts in this batch's simulator cluster; SoundSpaces is the acoustic member of the same "non-visual sensing simulators" family.