SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

Changan Chen*, Carl Schissler*, Sanchit Garg*, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip Robinson, Kristen Grauman · UT Austin / Reality Labs at Meta / Georgia Tech / FAIR · NeurIPS 2022 (Datasets and Benchmarks Track) · arXiv:2206.08312

One-liner. SoundSpaces 2.0 is an on-the-fly, geometry-based acoustic renderer that, given any 3D mesh, produces physically-grounded room impulse responses (reflection, reverb, transmission, diffraction, air absorption, binaural spatialization) for arbitrary source/receiver placements — turning a static visual simulator into a joint audio-visual one fast enough for embodied RL training and faithful enough to transfer to the real world.

Problem & motivation

What we see and hear are tightly coupled: room geometry and surface materials transform every sound that reaches our ears (a marble museum vs. a carpeted bookshop). Embodied-AI and AR/VR systems — a rescue robot localizing a person calling for help, a service robot listening for whether the espresso machine is running — need that visual↔acoustic correspondence in simulation. Prior audio simulators were too rigid. The original SoundSpaces [14] precomputes room impulse responses (RIRs) on a fixed 0.5 m discrete grid for a fixed list of 100 environments, weighing on the order of terabytes; agents can only hop between grid points, configurations (microphones, materials) can't be changed, and it doesn't generalize to new scenes. ThreeDWorld [21] renders continuously but only for oversimplified "shoebox" (rectangular) rooms, so it can't use real-scan meshes. The gap: no geometry-based acoustic simulator was simultaneously high-fidelity, configurable, generalizable to arbitrary meshes, and fast enough for embodied learning.

Method

The core is the RLR-Audio-Propagation engine, integrated into the Habitat-Sim [63] visual simulator (Fig 1). Given a scene mesh plus user-specified source/receiver locations, it computes an RIR via a bidirectional path-tracing algorithm [10] over M logarithmically-spaced frequency bands. It models reflection, transmission, and diffraction, plus binaural spatialization. Rays from each source carry per-band energy and bounce through the scene (reflection via a Phong BRDF, diffraction probabilistically at specially-constructed edge geometry, transmission with probability set by material coefficients); rays are then emitted from the listener and connected to source paths via multiple importance sampling. Energy is accumulated into an energy-time histogram whose spherical-harmonic coefficients per bin encode the directional distribution of arriving sound, which is then spatialized to an ambisonic or binaural pressure IR [69] and convolved with the source audio.
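
Two pieces of this pipeline are easy to make concrete: the logarithmically spaced frequency bands and the final convolution of the IR with the source audio. The following is a minimal pure-Python sketch, not the engine's code. The function names, the 125 Hz to 8 kHz range, the choice of 7 bands, and the toy IR are all illustrative assumptions.

```python
def log_spaced_centers(f_low, f_high, num_bands):
    """Band centers spaced evenly on a log-frequency axis (endpoints included)."""
    if num_bands == 1:
        return [f_low]
    ratio = f_high / f_low
    return [f_low * ratio ** (k / (num_bands - 1)) for k in range(num_bands)]


def convolve(signal, impulse_response):
    """Full linear convolution of two 1-D sequences (O(N*M), fine for toy sizes)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Assumed range and band count, chosen so each band is one octave wide.
centers = log_spaced_centers(125.0, 8000.0, 7)

# Toy IR: direct sound at sample 0 plus one weaker reflection 3 samples later.
toy_ir = [1.0, 0.0, 0.0, 0.5]
dry = [1.0, 2.0, 0.0]
wet = convolve(dry, toy_ir)  # the reflection shows up as a delayed, scaled copy
```

At realistic sampling rates and IR lengths, an FFT-based convolution would replace the double loop.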

Improvements over original SoundSpaces. (i) Added acoustic diffraction via the fast diffraction approach [68], fixing abrupt occlusion artifacts. (ii) Fixed a direct-to-reverberant ratio (DRR) bias of √(4π) present in the indirect sound pressure of the original.
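
To get a feel for the size of that bias, the pressure factor can be converted to decibels. This is only arithmetic on the stated factor, assuming the usual 20·log10 convention for pressure (amplitude) ratios.

```python
import math

# Pressure (amplitude) ratios convert to decibels as 20 * log10(ratio).
bias_pressure = math.sqrt(4 * math.pi)
bias_db = 20 * math.log10(bias_pressure)
print(f"{bias_db:.2f} dB")  # prints roughly 10.99 dB
```

The result is close in magnitude to the 11.0 dB average DRR error reported for the original SoundSpaces in Sec 5.2, although these notes do not record whether the paper attributes that error entirely to this bias.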

Continuity. Spatial continuity: render R(s, r, θ) at arbitrary (non-grid) source s, receiver r, heading θ; received sound is Aⁿ = Aₛ ∗ R(s, r, θ). Acoustic continuity: instead of convolving a fresh IR per discrete step (which made the source effectively stop/restart each second), they index the source audio at the right delayed sample (tₙ = tᵢ − R(s, xᵢ, θ) + 1) and apply linear crossfading between successive listener positions over a window T to smooth transitions.
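
The crossfading step can be sketched as follows, assuming the audio rendered with the previous and the current IR is already available as two equal-length segments covering the window T. The function name and the per-sample linear ramp are illustrative assumptions, not the simulator's implementation.

```python
def linear_crossfade(prev_segment, next_segment):
    """Blend two equal-length segments with a weight ramping from prev to next."""
    n = len(prev_segment)
    if n != len(next_segment):
        raise ValueError("segments must have equal length")
    if n == 0:
        return []
    if n == 1:
        return [next_segment[0]]
    out = []
    for k in range(n):
        w = k / (n - 1)  # 0.0 at the window start, 1.0 at the window end
        out.append((1.0 - w) * prev_segment[k] + w * next_segment[k])
    return out


# Toy window of 5 samples: fade from a constant 1.0 signal to silence.
faded = linear_crossfade([1.0] * 5, [0.0] * 5)
```

The ramp removes the discontinuity that would otherwise occur when the IR switches abruptly between listener positions.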

Configurability. Exposed parameters: sampling rate, number of frequency bands, ray counts, whether reflection/transmission/diffraction are enabled. Microphone types: mono, binaural, ambisonics, or a user-specified array. Custom HRTFs can be loaded. Material modeling: 29 built-in acoustic materials, each with frequency-dependent absorption, scattering, and transmission coefficients of the form [f₁, c₁, ..., fₙ, cₙ], plus distance-dependent air absorption [8]. This fixes the original's rigid object-category→material map (e.g., all floors forced to "carpet").
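
The flat [f₁, c₁, ..., fₙ, cₙ] layout can be unpacked into (frequency, coefficient) pairs and evaluated at a band center. In the sketch below the coefficient values are made up, and linear interpolation with clamping at the ends is an assumption of the example. How the engine evaluates coefficients between the listed frequencies is not stated in these notes.

```python
def parse_coefficients(flat):
    """Turn [f1, c1, ..., fn, cn] into a frequency-sorted list of (f, c) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating frequency/coefficient values")
    return sorted(zip(flat[0::2], flat[1::2]))


def coefficient_at(pairs, freq_hz):
    """Linearly interpolate a coefficient; clamp outside the listed range."""
    if freq_hz <= pairs[0][0]:
        return pairs[0][1]
    if freq_hz >= pairs[-1][0]:
        return pairs[-1][1]
    for (f0, c0), (f1, c1) in zip(pairs, pairs[1:]):
        if f0 <= freq_hz <= f1:
            t = (freq_hz - f0) / (f1 - f0)
            return c0 + t * (c1 - c0)


# Hypothetical absorption curve for one material (values are illustrative only).
absorption = parse_coefficients([125, 0.10, 1000, 0.30, 4000, 0.50])
mid_band = coefficient_at(absorption, 2500)  # halfway between 1 kHz and 4 kHz
```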

Two rendering modes (Sec 3.5): high-quality (all parameters maxed, temporal coherence off, gold standard) and high-speed (fewer rays, reuses previously-computed IRs under a locally-continuous-motion assumption [66]) for RL training that needs millions of steps.

SoundSpaces-PanoIR dataset. To serve users who want visual-acoustic data without running the simulator, they render an offline dataset of 10M panoramic image + IR pairs across 750 environments from Matterport3D, Gibson, and HM3D, each with the source's polar coordinate relative to the panorama center (Fig 2).

Setup

Results

Simulator capability (Table 1): SoundSpaces 2.0 is the only platform checking all four boxes — Audio-Visual, Geometric, Configurable, and Arbitrary-Env. SoundSpaces 1.0 lacks Configurable and Arbitrary-Env; ThreeDWorld and Pyroomacoustics lack Geometric/Arbitrary-Env; GWA lacks Audio-Visual.

Accuracy vs. real measurements (Sec 5.2). Against the 7 FRL-apartment measurements, SoundSpaces 2.0 reduces average DRR error from 11.0 dB (original SoundSpaces) to 0.98 dB while preserving a 12.4% relative RT60 error; its energy-decay curves track the real measurements much more closely (Fig 3b,c).
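
For reference, both metrics can be estimated directly from an impulse response. The sketch below uses common conventions: a short window around the peak counts as direct sound for DRR, and RT60 is extrapolated from the -5 dB to -25 dB span of the Schroeder energy-decay curve. The window width, the fit range, and all names are assumptions of the example. The paper's exact estimators are not recorded in these notes.

```python
import math


def drr_db(ir, direct_half_width):
    """Direct-to-reverberant ratio: energy near the peak vs. energy after it."""
    peak = max(range(len(ir)), key=lambda i: abs(ir[i]))
    lo = max(0, peak - direct_half_width)
    hi = min(len(ir), peak + direct_half_width + 1)
    direct = sum(x * x for x in ir[lo:hi])
    reverberant = sum(x * x for x in ir[hi:])
    if reverberant == 0.0:
        raise ValueError("no reverberant energy after the direct window")
    return 10 * math.log10(direct / reverberant)


def schroeder_edc_db(ir):
    """Energy-decay curve: backward-integrated energy, normalized, in dB."""
    edc = [0.0] * len(ir)
    tail = 0.0
    for i in range(len(ir) - 1, -1, -1):
        tail += ir[i] * ir[i]
        edc[i] = tail
    total = edc[0]
    return [10 * math.log10(e / total) if e > 0 else float("-inf") for e in edc]


def rt60_seconds(ir, sample_rate, hi_db=-5.0, lo_db=-25.0):
    """Extrapolate the decay slope between hi_db and lo_db to a 60 dB drop."""
    edc = schroeder_edc_db(ir)
    i_hi = next(i for i, v in enumerate(edc) if v <= hi_db)
    i_lo = next(i for i, v in enumerate(edc) if v <= lo_db)
    slope = (edc[i_lo] - edc[i_hi]) / (i_lo - i_hi)  # dB per sample, negative
    return (-60.0 / slope) / sample_rate


# Synthetic exponentially decaying IR sampled at 1 kHz (toy values).
decay_ir = [math.exp(-0.01 * n) for n in range(3000)]
rt60 = rt60_seconds(decay_ir, sample_rate=1000)  # about 0.69 s

# Toy IR: unit direct impulse followed by a weak, flat reverberant tail.
toy_ir = [1.0] + [0.0] * 9 + [0.1] * 10
drr = drr_db(toy_ir, direct_half_width=2)  # about 10 dB
```

A relative RT60 error such as the 12.4% above would then be |estimate - reference| / reference.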

Speed/quality tradeoff (Table 2): high-speed mode is 8× faster (1 thread) / 33× (5 threads) than high-quality while losing only 9.5% relative RT60 accuracy; a navigation model trained in high-speed mode and evaluated in high-quality mode differs by <1%.

Benchmark 1 — Continuous Audio-Visual Navigation (Table 3):

| Train | Test | Success % | SPL % | DTG (m) |
|---|---|---|---|---|
| SoundSpaces [14] | Continuous space | 64.2 ± 0.8 | 27.5 ± 0.4 | 5.6 ± 0.2 |
| SoundSpaces [14] | Continuous space & sound | 0.9 ± 0.2 | 0.3 ± 0.1 | 12.9 ± 0.1 |
| SoundSpaces 2.0 | Continuous space & sound | 64.7 ± 3.9 | 49.3 ± 3.0 | 5.9 ± 0.5 |

An agent trained on discrete-sound SoundSpaces collapses (0.9% success) when tested under acoustically continuous audio, because it leans on an always-present (but unrealistic) direct-sound cue; training on SoundSpaces 2.0 restores success to 64.7%. Even on continuous space alone, the discrete-trained agent keeps a comparable success rate (64.2%) but a much lower SPL (27.5 vs. 49.3 for the SoundSpaces 2.0 agent, which was tested on continuous space and sound), suggesting that spatial discretization mainly hurts path efficiency.
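
The gap between success rate and SPL is easier to read with the metric written out. The sketch below uses the standard definition of Success weighted by Path Length from the embodied-navigation literature, where each successful episode is credited with the ratio of the shortest path to the path actually taken. The episode tuples are toy values and the function name is hypothetical.

```python
def spl(episodes):
    """episodes: iterable of (success, shortest_path_len, agent_path_len)."""
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # Credit is 1.0 for an optimal path and shrinks as the path gets longer.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


toy_episodes = [
    (True, 10.0, 10.0),   # success along the shortest path
    (True, 10.0, 20.0),   # success, but the path is twice as long as needed
    (False, 10.0, 5.0),   # failure contributes nothing
]
toy_spl = spl(toy_episodes)
toy_success = sum(1 for s, _, _ in toy_episodes if s) / len(toy_episodes)
```

Here the success rate is 2/3 while SPL is 0.5: an agent can reach the goal about as often and still score much lower on SPL if it takes inefficient routes.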

Benchmark 2 — Far-field ASR (Table 4, Word Error Rate):

| Finetuning IRs | WER % |
|---|---|
| Pretrained (none) | 29.10 |
| Real IRs [39] | 13.32 |
| Pyroomacoustics [64] | 16.24 |
| SoundSpaces 1.0 [14] | 18.48 |
| SoundSpaces 2.0 | 12.48 |
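
As a reminder of what the WER column measures, here is a minimal sketch of word error rate as word-level edit distance (substitutions, insertions, deletions) divided by the reference length. The example sentences are made up, and the paper's exact scoring setup is not recorded in these notes.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


# One deleted word ("on") and one substituted word over 5 reference words.
wer = word_error_rate("turn on the espresso machine",
                      "turn the espresso machines")
```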

Finetuning ASR on SoundSpaces 2.0 synthetic IRs beats finetuning on real IRs (12.48 vs. 13.32 WER) — the headline sim2real result. Adding their acoustic randomization (random per-category material + Gaussian coefficient noise) pushes WER further to 12.04%, whereas naive uniform coefficient sampling hurts (12.58%), showing structured randomization matters.
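
One way to picture the structured randomization is sketched below: each object category is assigned a material drawn at random from a bank, and that material's coefficients are then jittered with Gaussian noise. This is an illustration under assumptions, not the authors' procedure. The material names, coefficient values, noise scale, and the clipping to [0, 1] are all hypothetical.

```python
import random


def randomize_materials(categories, material_bank, sigma, rng):
    """Pick a random material per category, then add Gaussian coefficient noise."""
    names = sorted(material_bank)  # fixed order so a seeded rng is reproducible
    assignment = {}
    for category in categories:
        name = rng.choice(names)
        noisy = [
            min(1.0, max(0.0, c + rng.gauss(0.0, sigma)))  # keep a valid range
            for c in material_bank[name]
        ]
        assignment[category] = (name, noisy)
    return assignment


# Hypothetical bank: per-band absorption coefficients for two placeholder materials.
bank = {
    "material_a": [0.10, 0.20, 0.30],
    "material_b": [0.60, 0.70, 0.80],
}
sample = randomize_materials(["floor", "wall", "sofa"], bank,
                             sigma=0.05, rng=random.Random(0))
```

Because the noise is centered on a real material's curve, every sampled coefficient set stays close to a plausible material, unlike coefficients drawn uniformly at random.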

Limitations & open questions

From the authors (Sec 6):

What I noticed reading it:

Why I care

Off-theme flag: this is an infrastructure/dataset paper for audio-visual navigation and far-field ASR, not robot manipulation, and there is no robot, gripper, or contact-rich task in it. I should not overstate its relevance to BLADE's long-horizon manipulation / planning-abstraction agenda. That said, it is the canonical acoustic-rendering substrate for the broader thesis I keep circling: many manipulation predicates — is_full (pouring), is_screwed_tight, is_inserted, surface_is_rough — are not visually evaluable and live in sound/vibration. If a future BLADE-style system wants to learn a poured-enough(cup) or contact-made predicate classifier from audio, it needs (a) realistic acoustic data at scale and (b) sim2real acoustic transfer. SoundSpaces 2.0 is the cleanest existence proof that geometry-based acoustic sim can out-transfer real recordings (the ASR result), which de-risks the "can we even train acoustic predicate classifiers in sim?" question.

Concrete tangential relevance to my themes:

Quotable

To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. — Abstract / p.1
Finetuning on real IRs also reduces the error substantially, but still not as much as our simulated data, which can be generated at scale across a wide variety of environments. Our simulation generates realistic IRs that help machine learning models generalize better to reality. — §5.4 / p.9
At the object level, we can anticipate the sounds an object makes based on how it looks, and vice versa (a dog barks, a door slams, a baby cries). At the environment level, materials and geometry of the surrounding 3D space ... transform the sounds that reach our ears. — §1 / p.1

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: