Dongyu supplement. Added as a candidate missing/adjacent dataset paper after searching primary arXiv sources. This is a draft summary for triage, not a full paper read.
One-liner. A public contact-rich manipulation dataset with more than 110k real-world sequences, visual/force/audio/action streams, human videos, and language descriptions.
Large-scale real-world dataset for one-shot imitation/generalization across diverse manipulation skills.
Very relevant to the map because it provides force, audio, and language descriptions at scale, rather than vision alone.
Language descriptions may be task-level rather than fine-grained dynamic state-change annotations.
arXiv lines 38-42 report dataset size, modalities, language descriptions, and public availability.