One-liner. Kaiwu is a human-demonstration dataset for industrial assembly that records, time-synchronized to absolute timestamps, the modalities most manipulation datasets omit — per-finger tactile pressure, EMG muscle signals, eye gaze, full-body motion capture, and ambient contact sound — alongside RGB-D video, so robots can learn the dynamics and intention behind dexterous contact-rich skills rather than just their kinematics.
Foundation-model and imitation-learning methods are bottlenecked on large-scale, high-quality, multimodal data, but the authors argue current robot datasets share two critical limitations. First, most rely on video and therefore capture only kinematics (trajectory, velocity); they lack the dynamics — force, tactile pressure, muscle activation — that govern real contact-rich manipulation, leaving learning "superficial." Second, there is no universal, intuitive human-perception framework: prior datasets incorporate some sensing (video, IMU) but miss the synchronized, fine-grained, multi-sensory signal needed for complex assembly in open, human-inhabited settings. Kaiwu (named after the Ming-dynasty encyclopedia Tiangong Kaiwu) targets this gap with a wearable + environment-mounted sensor framework aimed at assembly-style human-robot collaboration and human-intention prediction.
This is a dataset/framework paper; the "method" is the data-collection platform, sensor suite, and annotation pipeline (Fig 1, Fig 5). The three stated contributions are: (1) a multimodal collection framework with full situational awareness (manipulation dynamics, neural/EMG signals, attention, multi-view vision); (2) a high-quality large-scale dataset of 11,664 integrated action instances across 20 subjects and 30 interaction objects; (3) rich cross-modal spatio-temporal synchronization and fine-grained multi-level annotation.
Sensor suite. Wearable and environment-mounted devices, each running at its own sampling rate, are fused via absolute timestamps.
Collection platform. Multi-threaded/multi-processing software for synchronized streaming, storage, and visualization across heterogeneous sampling rates; outputs are aligned through absolute timestamps. Calibration is staged: one-point eye-tracker calibration per subject, mocap skeleton calibration per subject, and a per-session "process initialization" gesture calibration to start data synchronization (0–10 s sync window inside each ~140 s action session).
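As a rough illustration of what aligning heterogeneous-rate streams through absolute timestamps can look like, the sketch below resamples one stream onto another stream's clock by nearest-timestamp lookup. The function names, the nearest-neighbor policy, the start time, and the synthetic 100 Hz and 40 Hz streams are all assumptions made for illustration; the paper does not publish its alignment code.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the sample whose timestamp is closest to t.

    `timestamps` must be non-empty and sorted in ascending order.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_reference(reference_ts, stream_ts, stream_values):
    """Resample one stream onto the reference clock by nearest timestamp."""
    return [stream_values[nearest_index(stream_ts, t)] for t in reference_ts]


# Synthetic one-second example: a 100 Hz stream and a 40 Hz stream
# that share one absolute clock (values are placeholders, not real data).
t0 = 1_700_000_000.0  # hypothetical absolute start time, in seconds
glove_ts = [t0 + k / 100 for k in range(100)]
emg_ts = [t0 + k / 40 for k in range(40)]
emg_values = list(range(40))

emg_on_glove_clock = align_to_reference(glove_ts, emg_ts, emg_values)
```

Nearest-neighbor lookup is only one possible policy; interpolation or windowed averaging may suit continuous signals such as EMG better.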
Annotation (Table IV, Fig 4). Five annotation types over absolute-timestamped streams: gesture classification (10 object tags / 4,959 instances), areas of interest (AOIs) marking personal attention (298 AOIs), semantic segmentation of 30 key objects (610,778 closed-area annotations; errors held under 8 px), action segmentation (26 tags / 7,197 motion-segmentation events), and gesture segmentation (9 tags / 4,467 instances). Action annotation is two-level: coarse left/right-hand action segments, then fine-grained gesture states within them. Hand states use 8 grasp-taxonomy classes (Cylindrical, Oblique palmar, Lumbrical, Intermediate power-precision, Pinch, Lateral pinch, Special pinch, Non-prehensile).
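One way to picture the two-level action annotation is as nested time intervals: a coarse per-hand action segment that contains fine-grained gesture segments. The sketch below is a hypothetical in-memory representation, not the dataset's actual file format; the class names, field names, the `fasten_nut` tag, and the timestamps are invented for illustration, while the grasp class names come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The 8 grasp-taxonomy classes named in the annotation description.
GRASP_CLASSES = (
    "Cylindrical", "Oblique palmar", "Lumbrical",
    "Intermediate power-precision", "Pinch", "Lateral pinch",
    "Special pinch", "Non-prehensile",
)


@dataclass
class GestureSegment:
    start: float       # absolute timestamp, seconds
    end: float         # exclusive end timestamp
    grasp_class: str   # one of GRASP_CLASSES


@dataclass
class ActionSegment:
    start: float
    end: float
    hand: str          # "left" or "right"
    action_tag: str    # hypothetical tag name
    gestures: List[GestureSegment] = field(default_factory=list)


def gesture_at(actions: List[ActionSegment], hand: str, t: float) -> Optional[str]:
    """Return the grasp class active for `hand` at time `t`, or None."""
    for action in actions:
        if action.hand == hand and action.start <= t < action.end:
            for gesture in action.gestures:
                if gesture.start <= t < gesture.end:
                    return gesture.grasp_class
    return None


# Invented example: one right-hand action with two gesture states inside it.
actions = [
    ActionSegment(10.0, 20.0, "right", "fasten_nut", [
        GestureSegment(10.0, 14.0, "Pinch"),
        GestureSegment(14.0, 20.0, "Cylindrical"),
    ]),
]
```

Keeping the gesture segments nested inside their parent action makes the coarse-to-fine relationship explicit and keeps a lookup at a given timestamp scoped to one hand.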
There are no experimental learning results; the "results" are dataset descriptive statistics. Headline figures: 20 participants (mean height 172.4±6.5 cm, mean age 23.95±5.1 yr), recorded over 11 working days; 30 interaction objects; 11,664 integrated action instances; ~6.3 hours of assembling. Storage footprint per modality (Table V):
| Data type | Size | Sampling rate |
|---|---|---|
| Glove data | 264 MB | 100 Hz |
| Glove export | 1,124 MB | 20 Hz |
| Eye tracking | 14 GB | 25 Hz |
| RGB-D video | 3,476 GB | 60 Hz |
| Motion capture | 4,160 MB | 60 Hz |
| Audio | 7,955 MB | 50 Hz |
| ACC | 354 MB | 40 Hz |
| EMG | 362 MB | 40 Hz |
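To put the mixed MB/GB figures on one scale, the short calculation below converts every row to megabytes and computes each modality's share of the total. It assumes 1 GB = 1000 MB, which the source does not state; with binary units the shares would shift only slightly.

```python
# Sizes copied from the table above; the GB-to-MB factor is an assumption.
MB_PER_GB = 1000

sizes_mb = {
    "Glove data": 264,
    "Glove export": 1_124,
    "Eye tracking": 14 * MB_PER_GB,
    "RGB-D video": 3_476 * MB_PER_GB,
    "Motion capture": 4_160,
    "Audio": 7_955,
    "ACC": 354,
    "EMG": 362,
}

total_mb = sum(sizes_mb.values())
shares = {name: size / total_mb for name, size in sizes_mb.items()}

# Print modalities from largest to smallest share of the total footprint.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {share:8.3%}")
```

Under that assumption, RGB-D video makes up more than 99% of the listed footprint, so the non-visual channels that distinguish the dataset add very little storage cost.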
Annotation totals (Table IV): 4,959 gesture-classification instances, 298 AOIs, 610,778 semantic-segmentation closed-area annotations, 7,197 action-segmentation events, 4,467 gesture-segmentation instances. Data is released on ScienceDB (keyword "Kaiwu"), packaged as Kaiwu Data Annotation (31.84 GB), AOIs (17.9 GB), gesture segmentation (973 MB), semantic segmentation (11.6 GB), gesture classification (417 MB), plus a raw-data directory and an action-segmentation supplement (193 MB) adding four missing experimental sets. Directory layout is per-subject (P1–P20) with EMGData / GloveData / KinectData / KonovaData / TobiiData / VoiceData sub-folders (Fig 5).
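For orientation, the per-subject layout can be enumerated as below. This is a minimal sketch that assumes each subject folder directly contains the six modality folders; the root name `Kaiwu_raw` is hypothetical, and the actual nesting in the release may differ from what Fig 5 suggests.

```python
from pathlib import PurePosixPath

# Subject folders P1..P20 and the modality sub-folders named in the summary.
SUBJECTS = [f"P{i}" for i in range(1, 21)]
MODALITY_DIRS = [
    "EMGData", "GloveData", "KinectData",
    "KonovaData", "TobiiData", "VoiceData",
]


def expected_dirs(root):
    """List every subject/modality directory expected under `root`."""
    root = PurePosixPath(root)
    return [root / subject / modality
            for subject in SUBJECTS
            for modality in MODALITY_DIRS]


dirs = expected_dirs("Kaiwu_raw")  # "Kaiwu_raw" is a hypothetical root name
```

A list like this can be compared against a downloaded copy to spot missing subject or modality folders before any processing starts.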
From the authors:
What I noticed reading it:
Directly on-thesis for the batch's core claim: many manipulation predicates are not visually evaluable; is_grasped, is_screwed_tight, is_inserted, and surface_is_rough live in touch, force, and sound. Kaiwu is the rare dataset that actually records human demonstrations with those non-visual channels (per-finger tactile, EMG, contact audio) on contact-rich tasks where the predicate of interest (a nut fastened, a bearing seated, a column tightened) is exactly the kind of thing a camera can't see. For BLADE, whose biggest acknowledged gap is that segmentation and predicate classifiers are vision-grounded (gripper-state segmentation, visual predicate classifiers) and whose continuous force parameters hide inside the diffusion policy, a dataset like this is the raw material for learning tactile/force-grounded predicate classifiers and a richer contact-segmentation signal than gripper open/close. The grasp-taxonomy and fine-grained gesture annotation also speak to the "predicate invention" line: the 8 grasp classes are essentially human-labeled symbolic contact states.
That said, Kaiwu is a data resource, not a method — no language conditioning, no policy, no planning abstraction. Its relevance is upstream (it could feed a force/tactile-grounded BLADE-style pipeline), not as a method comparison. It pairs naturally with the tactile-representation and multisensory-policy clusters of this batch, which would turn its raw signals into learnable embeddings.
Missing dynamics information including force will deteriorate the learning performance, resulting in the superficial learning. — §I, Introduction / p.1
The Kaiwu dataset directly collects dynamic and static data using wearable sensors. It integrates assembly actions into a coherent process, enhancing human-robot interaction with a narrative and causal structure. — §II, Related work / p.3
Absolute timestamp records are used during data collection, and the output of each module's data stream is synchronized through absolute timestamps. — §III, Data synchronization / p.5
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: