Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction

Shuo Jiang, Haonan Li, Ruochen Ren, Yanmin Zhou, Zhipeng Wang, Bin He · 2025 · IEEE Robotics and Automation Letters (RA-L) · arXiv:2503.05231

One-liner. Kaiwu is a human-demonstration dataset for industrial assembly that records, time-synchronized to absolute timestamps, the modalities most manipulation datasets omit — per-finger tactile pressure, EMG muscle signals, eye gaze, full-body motion capture, and ambient contact sound — alongside RGB-D video, so robots can learn the dynamics and intention behind dexterous contact-rich skills rather than just their kinematics.

Problem & motivation

Foundation-model and imitation-learning methods are bottlenecked on large-scale, high-quality, multimodal data, but the authors argue current robot datasets share two critical limitations. First, most rely on video and therefore capture only kinematics (trajectory, velocity); they lack the dynamics — force, tactile pressure, muscle activation — that govern real contact-rich manipulation, leaving learning "superficial." Second, there is no universal, intuitive human-perception framework: prior datasets incorporate some sensing (video, IMU) but miss the synchronized, fine-grained, multi-sensory signal needed for complex assembly in open, human-inhabited settings. Kaiwu (named after the Ming-dynasty encyclopedia Tiangong Kaiwu) targets this gap with a wearable + environment-mounted sensor framework aimed at assembly-style human-robot collaboration and human-intention prediction.

Method

This is a dataset/framework paper; the "method" is the data-collection platform, sensor suite, and annotation pipeline (Fig 1, Fig 5). The three stated contributions are: (1) a multimodal collection framework with full situational awareness (manipulation dynamics, neural/EMG signals, attention, multi-view vision); (2) a high-quality large-scale dataset of 11,664 integrated action instances across 20 subjects and 30 interaction objects; (3) rich cross-modal spatio-temporal synchronization and fine-grained multi-level annotation.

Sensor suite. Wearable and environment-mounted devices, each on its own sampling rate, fused via absolute timestamps:

Collection platform. Multi-threaded/multi-processing software for synchronized streaming, storage, and visualization across heterogeneous sampling rates; outputs are aligned through absolute timestamps. Calibration is staged: one-point eye-tracker calibration per subject, mocap skeleton calibration per subject, and a per-session "process initialization" gesture calibration to start data synchronization (0–10 s sync window inside each ~140 s action session).
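To make the absolute-timestamp alignment concrete, here is a minimal sketch of nearest-timestamp matching between two streams recorded at different rates. It is not the authors' software: the function names, the millisecond integer timestamps, the two example rates, and the `max_gap_ms` tolerance are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the entry in sorted `timestamps` that is closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer neighbour; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i


def align_to_reference(ref_ts, stream_ts, stream_values, max_gap_ms):
    """For each reference timestamp, pick the nearest sample of another stream.

    Entries are None where the nearest sample is more than `max_gap_ms` away,
    so dropped packets show up as gaps instead of stale values.
    """
    aligned = []
    for t in ref_ts:
        j = nearest_index(stream_ts, t)
        close_enough = abs(stream_ts[j] - t) <= max_gap_ms
        aligned.append(stream_values[j] if close_enough else None)
    return aligned


# Hypothetical data: a 20 Hz reference stream and a 100 Hz stream, both
# stamped in milliseconds on the same absolute clock.
ref_ts = [i * 50 for i in range(5)]     # 0, 50, 100, 150, 200 ms
fast_ts = [i * 10 for i in range(25)]   # 0, 10, 20, ... 240 ms
fast_values = list(range(25))           # sample index as a stand-in payload

aligned = align_to_reference(ref_ts, fast_ts, fast_values, max_gap_ms=10)
```

Resampling everything onto one reference clock is only one possible design; a consumer could equally keep each stream at its native rate and query by timestamp on demand.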

Annotation (Table IV, Fig 4). Five annotation types over absolute-timestamped streams: gesture classification (10 object tags / 4,959 instances), AOIs / regions-of-interest as personal attention (298), semantic segmentation of 30 key objects (610,778 closed-area annotations; errors held under 8 px), action segmentation (26 tags / 7,197 motion-segmentation events), and gesture segmentation (9 tags / 4,467). Action annotation is two-level: coarse left/right-hand action segments, then fine-grained gesture states within them. Hand states use 8 grasp taxonomy classes (Cylindrical, Oblique palmar, Lumbrical, Intermediate power-precision, Pinch, Lateral pinch, Special pinch, Non-prehensile).
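The two-level action annotation can be pictured as coarse per-hand segments that contain fine-grained gesture states. The sketch below is one possible in-memory representation, not the dataset's actual file schema: the class and field names, the example action label, and the timestamps are hypothetical; only the eight grasp class names come from the summary above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Grasp(Enum):
    """The 8 grasp taxonomy classes used for hand states."""
    CYLINDRICAL = "Cylindrical"
    OBLIQUE_PALMAR = "Oblique palmar"
    LUMBRICAL = "Lumbrical"
    INTERMEDIATE_POWER_PRECISION = "Intermediate power-precision"
    PINCH = "Pinch"
    LATERAL_PINCH = "Lateral pinch"
    SPECIAL_PINCH = "Special pinch"
    NON_PREHENSILE = "Non-prehensile"


@dataclass
class GestureState:
    """Fine level: one grasp state over the interval [start, end) in seconds."""
    start: float
    end: float
    grasp: Grasp


@dataclass
class ActionSegment:
    """Coarse level: one left- or right-hand action segment."""
    hand: str    # "left" or "right"
    label: str   # action tag (hypothetical names in this sketch)
    start: float
    end: float
    gestures: list = field(default_factory=list)

    def add_gesture(self, gesture):
        # A gesture state must lie entirely inside its parent action segment.
        if not (self.start <= gesture.start < gesture.end <= self.end):
            raise ValueError("gesture falls outside the action segment")
        self.gestures.append(gesture)

    def grasp_at(self, t):
        """Grasp class active at time t, or None if no gesture covers t."""
        for g in self.gestures:
            if g.start <= t < g.end:
                return g.grasp
        return None


# Hypothetical example: one right-hand action with two gesture states.
segment = ActionSegment(hand="right", label="fasten_nut", start=12.0, end=18.0)
segment.add_gesture(GestureState(12.0, 14.5, Grasp.PINCH))
segment.add_gesture(GestureState(14.5, 18.0, Grasp.CYLINDRICAL))
```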

Setup

Results

There are no experimental learning results; the "results" are dataset descriptive statistics. Headline figures: 20 participants (mean height 172.4±6.5 cm, mean age 23.95±5.1 yr), recorded over 11 working days; 30 interaction objects; 11,664 integrated action instances; ~6.3 hours of assembling. Storage footprint per modality (Table V):

Data type        Size        Sampling rate
Glove data       264 MB      100 Hz
Glove export     1,124 MB    20 Hz
Eye tracking     14 GB       25 Hz
RGB-D video      3,476 GB    60 Hz
Motion capture   4,160 MB    60 Hz
Audio            7,955 MB    50 Hz
ACC              354 MB      40 Hz
EMG              362 MB      40 Hz

Annotation totals (Table IV): 4,959 gesture-classification instances, 298 AOIs, 610,778 semantic-segmentation closed-area annotations, 7,197 action-segmentation events, 4,467 gesture-segmentation instances. Data is released on ScienceDB (keyword "Kaiwu"), packaged as Kaiwu Data Annotation (31.84 GB), AOIs (17.9 GB), gesture segmentation (973 MB), semantic segmentation (11.6 GB), gesture classification (417 MB), plus a raw-data directory and an action-segmentation supplement (193 MB) adding four missing experimental sets. Directory layout is per-subject (P1–P20) with EMGData / GloveData / KinectData / KonovaData / TobiiData / VoiceData sub-folders (Fig 5).
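A quick way to sanity-check a download against the per-subject layout is to enumerate the expected folders and report any that are absent. This sketch assumes the subject folders are named exactly P1 to P20 and that each holds the six modality sub-folders directly; the root path and function names are hypothetical, and the real archive may nest things differently.

```python
from pathlib import Path

# Subject and modality folder names as listed in the summary above.
SUBJECTS = [f"P{i}" for i in range(1, 21)]
MODALITY_DIRS = [
    "EMGData", "GloveData", "KinectData",
    "KonovaData", "TobiiData", "VoiceData",
]


def expected_dirs(root):
    """All <root>/<subject>/<modality> paths the layout implies."""
    root = Path(root)
    return [root / s / m for s in SUBJECTS for m in MODALITY_DIRS]


def missing_dirs(root):
    """Expected folders that do not exist on disk under `root`."""
    return [p for p in expected_dirs(root) if not p.is_dir()]


# Usage (hypothetical root path):
#   for p in missing_dirs("kaiwu_raw"):
#       print("missing:", p)
```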

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

Directly on-thesis for the batch's core claim: many manipulation predicates are not visually evaluable; is_grasped, is_screwed_tight, is_inserted, surface_is_rough live in touch, force, and sound. Kaiwu is the rare dataset that actually records human demonstrations with those non-visual channels (per-finger tactile, EMG, contact audio) on contact-rich tasks where the predicate of interest (a nut fastened, a bearing seated, a column tightened) is exactly the kind of thing a camera can't see. For BLADE, whose biggest acknowledged gap is that segmentation and predicate classifiers are vision-grounded (gripper-state segmentation, visual predicate classifiers) and whose continuous force parameters hide inside the diffusion policy, a dataset like this is the raw material for learning tactile/force-grounded predicate classifiers and a richer contact-segmentation signal than gripper open/close. The grasp-taxonomy and fine-grained gesture annotation also speak to the "predicate invention" line: the 8 grasp classes are essentially human-labeled symbolic contact states.

That said, Kaiwu is a data resource, not a method — no language conditioning, no policy, no planning abstraction. Its relevance is upstream (it could feed a force/tactile-grounded BLADE-style pipeline), not as a method comparison. It pairs naturally with the tactile-representation and multisensory-policy clusters of this batch, which would turn its raw signals into learnable embeddings.
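As a toy illustration of what a tactile-grounded predicate could look like downstream, the sketch below thresholds per-finger pressure to get an is_grasped signal and then derives contact segments from it. Nothing here comes from the paper or from BLADE: the threshold rule, the `min_fingers` parameter, the pressure units, and the sample frames are all assumptions, and a learned classifier would replace the hand-set threshold in practice.

```python
def is_grasped(finger_pressures, threshold, min_fingers=2):
    """True when at least `min_fingers` fingers press harder than `threshold`."""
    in_contact = sum(1 for p in finger_pressures if p > threshold)
    return in_contact >= min_fingers


def contact_segments(frames, threshold, min_fingers=2):
    """Half-open (start, end) frame-index pairs where is_grasped holds."""
    segments = []
    start = None
    for i, frame in enumerate(frames):
        grasped = is_grasped(frame, threshold, min_fingers)
        if grasped and start is None:
            start = i                      # a grasp begins at this frame
        elif not grasped and start is not None:
            segments.append((start, i))    # the grasp ended before this frame
            start = None
    if start is not None:
        segments.append((start, len(frames)))  # still grasping at the end
    return segments


# Hypothetical per-finger pressures (arbitrary units), one row per frame.
frames = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.7, 0.0, 0.0, 0.0],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0, 0.0],
    [0.7, 0.8, 0.6, 0.0, 0.0],
]

segments = contact_segments(frames, threshold=0.5)
```

Segment boundaries obtained this way mark changes in contact state, which is a finer signal than gripper open/close.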

Quotable

Missing dynamics information including force will deteriorate the learning performance, resulting in the superficial learning. — §I, Introduction / p.1
The Kaiwu dataset directly collects dynamic and static data using wearable sensors. It integrates assembly actions into a coherent process, enhancing human-robot interaction with a narrative and causal structure. — §II, Related work / p.3
Absolute timestamp records are used during data collection, and the output of each module's data stream is synchronized through absolute timestamps. — §III, Data synchronization / p.5

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: