Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction

Shuo Jiang, Haonan Li, Ruochen Ren, Yanmin Zhou, Zhipeng Wang, Bin He · 2025 · IEEE Robotics and Automation Letters (RA-L) · arXiv:2503.05231

One-liner. Kaiwu is a human-demonstration dataset for industrial assembly that records, time-synchronized to absolute timestamps, the modalities most manipulation datasets omit — per-finger tactile pressure, EMG muscle signals, eye gaze, full-body motion capture, and ambient contact sound — alongside RGB-D video, so robots can learn the dynamics and intention behind dexterous contact-rich skills rather than just their kinematics.

Problem & motivation

Foundation-model and imitation-learning methods are bottlenecked on large-scale, high-quality, multimodal data, but the authors argue current robot datasets share two critical limitations. First, most rely on video and therefore capture only kinematics (trajectory, velocity); they lack the dynamics — force, tactile pressure, muscle activation — that govern real contact-rich manipulation, leaving learning "superficial." Second, there is no universal, intuitive human-perception framework: prior datasets incorporate some sensing (video, IMU) but miss the synchronized, fine-grained, multi-sensory signal needed for complex assembly in open, human-inhabited settings. Kaiwu (named after the Ming-dynasty encyclopedia Tiangong Kaiwu) targets this gap with a wearable + environment-mounted sensor framework aimed at assembly-style human-robot collaboration and human-intention prediction.

Method

This is a dataset/framework paper; the "method" is the data-collection platform, sensor suite, and annotation pipeline (Fig 1, Fig 5). The three stated contributions are: (1) a multimodal collection framework with full situational awareness (manipulation dynamics, neural/EMG signals, attention, multi-view vision); (2) a high-quality large-scale dataset of 11,664 integrated action instances across 20 subjects and 30 interaction objects; (3) rich cross-modal spatio-temporal synchronization and fine-grained multi-level annotation.

Sensor suite. Wearable and environment-mounted devices, each on its own sampling rate, fused via absolute timestamps:

Data glove (Fig 2): 19 finger angle sensors + 19 finger pressure/tactile sensors (accuracy 9 g), plus an arm sensor recording quaternion data for palm, forearm, and upper arm. Glove data format (Table II) packs hand/forearm/upper-arm quaternions, 19 angle columns (15–33), and 19 tactile columns (34–52).
EMG + ACC: 16 Trigno EMG sensors (8 left, 8 right) on forearm muscle groups, each with an integrated 9-DoF IMU providing acceleration (ACC) time-synchronized to the EMG.
RGB-D camera: front-facing, captures body posture and environment RGB + depth.
Eye tracker: binocular dark-pupil corneal-reflex tracker, average accuracy 0.6°, data-loss rate 0.01%, giving first-person gaze / region-of-interest over the assembly.
Microphones: four mics — two on the operator's table (operating-environment sound), two in the accessory area (gripping tools/parts sound) — capturing contact/rubbing audio.
Motion capture (ground truth): optical mocap with 37 reflective body markers, giving high-precision 3D skeleton.
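
The glove column ranges quoted from Table II map directly onto array slices. The following is a minimal sketch, assuming the export loads as a numeric table with one row per sample and that the quoted column numbers are 1-indexed and inclusive. The function name `split_glove_block` and the synthetic array are illustrative only and are not part of the released tooling.

```python
import numpy as np

# Column ranges as quoted from Table II (assumed 1-indexed, inclusive).
ANGLE_COLS = (15, 33)    # 19 finger-angle channels
TACTILE_COLS = (34, 52)  # 19 finger-pressure channels


def split_glove_block(data: np.ndarray) -> dict:
    """Split a (samples x columns) glove array into angle and tactile blocks."""
    if data.ndim != 2 or data.shape[1] < TACTILE_COLS[1]:
        raise ValueError("expected a 2-D array with at least 52 columns")

    def cols(first: int, last: int) -> np.ndarray:
        # Convert a 1-indexed inclusive range into a 0-indexed Python slice.
        return data[:, first - 1:last]

    return {"angles": cols(*ANGLE_COLS), "tactile": cols(*TACTILE_COLS)}


# Synthetic stand-in for a real export: 5 samples, 52 columns,
# where every entry equals its 1-indexed column number.
demo = np.tile(np.arange(1, 53, dtype=float), (5, 1))
parts = split_glove_block(demo)
print(parts["angles"].shape, parts["tactile"].shape)
```

Filling each column with its own index makes an off-by-one in the slice visible immediately, which is worth checking against a real file before trusting the 1-indexed assumption.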

Collection platform. Multi-threaded/multi-processing software for synchronized streaming, storage, and visualization across heterogeneous sampling rates; outputs are aligned through absolute timestamps. Calibration is staged: one-point eye-tracker calibration per subject, mocap skeleton calibration per subject, and a per-session "process initialization" gesture calibration to start data synchronization (0–10 s sync window inside each ~140 s action session).
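
Because every stream carries absolute timestamps but runs at its own rate, a downstream user has to choose how to bring two streams onto a common time base. One plausible option is nearest-timestamp matching, sketched below. This is an assumption about how the data could be consumed, not the authors' documented procedure; the function name `align_nearest` and the synthetic 100 Hz and 40 Hz streams are hypothetical.

```python
import numpy as np


def align_nearest(ref_t: np.ndarray, src_t: np.ndarray, src_x: np.ndarray):
    """For each reference timestamp, pick the source sample nearest in time.

    Both timestamp arrays must be sorted ascending, share one clock,
    and src_t must hold at least two samples.
    Returns the aligned values and the absolute time offsets in seconds.
    """
    right = np.searchsorted(src_t, ref_t)          # first index with src_t >= ref_t
    right = np.clip(right, 1, len(src_t) - 1)      # keep a valid left neighbour
    left = right - 1
    use_left = (ref_t - src_t[left]) <= (src_t[right] - ref_t)
    idx = np.where(use_left, left, right)
    return src_x[idx], np.abs(src_t[idx] - ref_t)


# Synthetic example: a 40 Hz signal resampled onto 100 Hz timestamps.
glove_t = np.arange(100) / 100.0          # 100 Hz reference clock, 1 s
emg_t = np.arange(41) / 40.0              # 40 Hz source covering the same second
emg_x = np.sin(2 * np.pi * emg_t)
emg_on_glove, offsets = align_nearest(glove_t, emg_t, emg_x)
print(f"max offset: {offsets.max() * 1000:.1f} ms")
```

Returning the offsets alongside the values lets the caller drop or flag matches that are too far apart in time, which matters for fast contact transients.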

Annotation (Table IV, Fig 4). Five annotation types over absolute-timestamped streams: gesture classification (10 object tags / 4,959 instances), AOIs / regions-of-interest as personal attention (298), semantic segmentation of 30 key objects (610,778 closed-area annotations; errors held under 8 px), action segmentation (26 tags / 7,197 motion-segmentation events), and gesture segmentation (9 tags / 4,467). Action annotation is two-level: coarse left/right-hand action segments, then fine-grained gesture states within them. Hand states use 8 grasp taxonomy classes (Cylindrical, Oblique palmar, Lumbrical, Intermediate power-precision, Pinch, Lateral pinch, Special pinch, Non-prehensile).
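
The two-level action annotation described above (coarse per-hand segments containing fine-grained gesture states) can be pictured as a small nested record. This is a sketch of one possible in-memory representation, not the dataset's file format: the class and field names, the example label string, and the timestamps are all hypothetical. Only the eight grasp class names come from the summary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Grasp(Enum):
    """The eight hand-state classes listed in the annotation scheme."""
    CYLINDRICAL = "Cylindrical"
    OBLIQUE_PALMAR = "Oblique palmar"
    LUMBRICAL = "Lumbrical"
    INTERMEDIATE = "Intermediate power-precision"
    PINCH = "Pinch"
    LATERAL_PINCH = "Lateral pinch"
    SPECIAL_PINCH = "Special pinch"
    NON_PREHENSILE = "Non-prehensile"


@dataclass
class GestureState:
    start: float   # absolute timestamp in seconds (inclusive)
    end: float     # absolute timestamp in seconds (exclusive)
    grasp: Grasp


@dataclass
class ActionSegment:
    hand: str      # "left" or "right"
    label: str     # coarse action tag (hypothetical string)
    start: float
    end: float
    gestures: List[GestureState] = field(default_factory=list)

    def gesture_at(self, t: float) -> Optional[Grasp]:
        """Return the fine-grained grasp state active at time t, if any."""
        if not (self.start <= t < self.end):
            return None
        for g in self.gestures:
            if g.start <= t < g.end:
                return g.grasp
        return None


seg = ActionSegment(
    hand="right", label="Fasten a Screw", start=12.0, end=20.0,
    gestures=[
        GestureState(12.0, 14.5, Grasp.PINCH),
        GestureState(14.5, 20.0, Grasp.CYLINDRICAL),
    ],
)
print(seg.gesture_at(13.0), seg.gesture_at(25.0))
```

Half-open intervals keep adjacent gesture states from both claiming the boundary timestamp.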

Setup

Datasets / benchmarks: the Kaiwu dataset itself, with 20 subjects × 15 assembly tasks (C1–C15: e.g. Drill a Hole, Clamp a Patch, File Rough Edge, Tighten a Steel Column, Install a Bearing, Fasten a Screw/Nut, Install a Motor) × left/right = 6×15×20 sets and 11,664 total integrated action instances. ~6.3 hours of assembling in total; each subject contributes ~19 min of data out of about 2 h recorded. Compared against TSU, Harmonic, HBOD, Humbi, OXE, ActionSense in Table I (Kaiwu is the only one combining IMU + Motion Capture + Hand Pose + Arm + Gaze + EMG + Tactile + RGB + Depth + Audio for industrial assembly).
Hardware / simulator: Wearable data glove (angle + pressure), 16 Trigno EMG/IMU sensors, RGB-D camera, binocular eye tracker, 4 microphones, optical motion-capture rig (37 markers). No robot is run; this is a human-demonstration capture rig. Per-modality sampling rates (Table V): glove 100 Hz, glove export 20 Hz, eye 25 Hz, RGB-D 60 Hz, mocap 60 Hz, audio 50 Hz, ACC 40 Hz, EMG 40 Hz.
Baselines: not reported — no learning benchmark or model is trained/evaluated; the paper is a dataset descriptor. Related datasets are compared qualitatively (Table I) but there are no method baselines.
Compute: not reported.
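
The duration figures above can be cross-checked with simple arithmetic: 20 subjects at roughly 19 min each gives 380 min, or about 6.3 h, matching the stated total. The sketch below also derives nominal per-modality sample counts under the assumption that every stream ran for that full duration at its listed rate. That assumption is not confirmed by the source, so the counts are order-of-magnitude estimates only.

```python
SUBJECTS = 20
MINUTES_PER_SUBJECT = 19           # approximate per-subject figure from the summary
RATES_HZ = {                       # Table V sampling rates as quoted above
    "glove": 100, "glove_export": 20, "eye": 25, "rgbd": 60,
    "mocap": 60, "audio": 50, "acc": 40, "emg": 40,
}

total_seconds = SUBJECTS * MINUTES_PER_SUBJECT * 60
total_hours = total_seconds / 3600

# Nominal sample counts, assuming each stream covers the whole duration.
nominal_samples = {name: hz * total_seconds for name, hz in RATES_HZ.items()}

print(f"total: {total_hours:.2f} h")
for name, n in nominal_samples.items():
    print(f"{name:>12}: {n:,} samples")
```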

Results

There are no experimental learning results; the "results" are dataset descriptive statistics. Headline figures: 20 participants (mean height 172.4±6.5 cm, mean age 23.95±5.1 yr), recorded over 11 working days; 30 interaction objects; 11,664 integrated action instances; ~6.3 hours of assembling. Storage footprint per modality (Table V):

Data type	Size	Sampling rate
Glove data	264 MB	100 Hz
Glove export	1,124 MB	20 Hz
Eye tracking	14 GB	25 Hz
RGB-D video	3,476 GB	60 Hz
Motion capture	4,160 MB	60 Hz
Audio	7,955 MB	50 Hz
ACC	354 MB	40 Hz
EMG	362 MB	40 Hz
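
To make the storage split concrete, the table can be normalized to a single unit and turned into per-modality shares. A minimal sketch, assuming 1 GB = 1000 MB (the source does not state which convention Table V uses); the dictionary simply restates the sizes listed above.

```python
# Sizes from Table V as quoted above, converted to MB.
# Assumption: 1 GB = 1000 MB; the source does not state the convention.
MB_PER_GB = 1000
SIZES_MB = {
    "Glove data": 264,
    "Glove export": 1_124,
    "Eye tracking": 14 * MB_PER_GB,
    "RGB-D video": 3_476 * MB_PER_GB,
    "Motion capture": 4_160,
    "Audio": 7_955,
    "ACC": 354,
    "EMG": 362,
}

total_mb = sum(SIZES_MB.values())
shares = {name: size / total_mb for name, size in SIZES_MB.items()}

# Print modalities from largest to smallest share of the total.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<15} {SIZES_MB[name]:>10,} MB  {share:8.4%}")
```

Under that unit assumption, RGB-D video accounts for more than 99% of the listed bytes, while EMG and the raw glove stream each fall below 0.1%.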

Annotation totals (Table IV): 4,959 gesture-classification instances, 298 AOIs, 610,778 semantic-segmentation closed-area annotations, 7,197 action-segmentation events, 4,467 gesture-segmentation instances. Data is released on ScienceDB (keyword "Kaiwu"), packaged as Kaiwu Data Annotation (31.84 GB), AOIs (17.9 GB), gesture segmentation (973 MB), semantic segmentation (11.6 GB), gesture classification (417 MB), plus a raw-data directory and an action-segmentation supplement (193 MB) adding four missing experimental sets. Directory layout is per-subject (P1–P20) with EMGData / GloveData / KinectData / KonovaData / TobiiData / VoiceData sub-folders (Fig 5).
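
Given the per-subject layout described above, a quick completeness check before any processing is cheap insurance, especially since the release is known to include supplements for missing sets. The sketch below assumes the nesting is `<root>/<subject>/<modality folder>`. The folder names come from the summary, but the exact nesting in the release and the helper name `missing_dirs` are assumptions.

```python
from pathlib import Path
import tempfile

SUBJECTS = [f"P{i}" for i in range(1, 21)]            # P1 .. P20
MODALITY_DIRS = ["EMGData", "GloveData", "KinectData",
                 "KonovaData", "TobiiData", "VoiceData"]


def missing_dirs(root: Path) -> list:
    """List expected <subject>/<modality> folders that are absent under root."""
    return [
        f"{subject}/{modality}"
        for subject in SUBJECTS
        for modality in MODALITY_DIRS
        if not (root / subject / modality).is_dir()
    ]


# Demo on a throwaway directory that mimics the layout for one subject only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for modality in MODALITY_DIRS:
        (root / "P1" / modality).mkdir(parents=True)
    gaps = missing_dirs(root)

print(f"{len(gaps)} expected folders missing")
```

On a real download, `missing_dirs(Path(...))` would be pointed at the raw-data directory, and a non-empty result would indicate which subject/modality pairs need the supplement or a re-download.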

Limitations & open questions

From the authors:

The finger tactile sensors "struggle to detect data changes in small, delicate parts," so action units are deliberately kept coarse (large components only) and small parts are pre-assembled before capture.
The dataset ships with "missing data and additions to the raw data" — an action-segmentation supplement exists precisely because some experimental sets were missing from the main segmentation.
Future work (Conclusion) is framed as downstream use — cross-modal prediction, assembly-logic sequence prediction, task planning, robot self-assembly — none demonstrated here; "holding potential for future benchmark" is explicitly aspirational.

What I noticed reading it:

No model, no metric. Despite the "Robot Learning" title, the paper trains nothing and reports zero task-success or prediction numbers. Its value is entirely as a data artifact; the claim that dynamics signals cure "superficial learning" is asserted, not measured against a video-only ablation.
Single-site, narrow demographic. 20 subjects, mean age ~24, one industrial-assembly domain. Generalization beyond this assembly bench is untested, and 20 subjects is small for foundation-model claims.
RGB-D dominates storage (3.5 TB) while the distinctive modalities are tiny (EMG 362 MB, glove 264 MB). The signals the paper sells as its differentiator are a rounding error in the byte budget, which raises questions about their effective resolution/coverage relative to the vision stream.
Synchronization is timestamp-based across very different rates (20–100 Hz). Sub-frame alignment quality for fast contact events (the sound/force transients that matter most) is asserted but not quantified.
The glove's 19 pressure sensors show where on the finger contact happens, but they measure pressure, not calibrated force vectors; how usable this is as a true force-dynamics signal, as opposed to a binary contact cue, is unclear.
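
The synchronization concern can be given a rough floor. Even with perfectly synchronized clocks, matching an event to the nearest sample of a uniformly sampled stream can be off by up to half a sample period, i.e. 1/(2f). The sketch below applies that bound to the quoted sampling rates. It assumes ideal uniform sampling and nearest-sample matching, so real misalignment (clock offset, jitter, buffering) would only add to these numbers.

```python
RATES_HZ = {                       # Table V sampling rates as quoted above
    "glove": 100, "glove_export": 20, "eye": 25, "rgbd": 60,
    "mocap": 60, "audio": 50, "acc": 40, "emg": 40,
}


def worst_case_offset_ms(rate_hz: float) -> float:
    """Largest gap to the nearest sample of a uniformly sampled stream, in ms."""
    return 1000.0 / (2.0 * rate_hz)


offsets_ms = {name: worst_case_offset_ms(hz) for name, hz in RATES_HZ.items()}

# Print streams from tightest to loosest timing bound.
for name, ms in sorted(offsets_ms.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: up to {ms:.1f} ms from the nearest sample")
```

For the 20 Hz glove export this floor is 25 ms, which is long relative to an impact or snap-fit transient, so the raw 100 Hz glove stream is the safer choice when contact timing matters.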

Why I care

Directly on-thesis for the batch's core claim: many manipulation predicates are not visually evaluable — is_grasped, is_screwed_tight, is_inserted, surface_is_rough live in touch, force, and sound. Kaiwu is the rare dataset that actually records human demonstrations with those non-visual channels (per-finger tactile, EMG, contact audio) on contact-rich tasks where the predicate of interest (a nut fastened, a bearing seated, a column tightened) is exactly the kind of thing a camera can't see. For BLADE, whose biggest acknowledged gap is that segmentation and predicate classifiers are vision-grounded (gripper-state segmentation, visual predicate classifiers) and whose continuous force parameters hide inside the diffusion policy, a dataset like this is the raw material for learning tactile/force-grounded predicate classifiers and a richer contact-segmentation signal than gripper open/close. The grasp-taxonomy and fine-grained gesture annotation also speak to the "predicate invention" line: the 8 grasp classes are essentially human-labeled symbolic contact states.

That said, Kaiwu is a data resource, not a method — no language conditioning, no policy, no planning abstraction. Its relevance is upstream (it could feed a force/tactile-grounded BLADE-style pipeline), not as a method comparison. It pairs naturally with the tactile-representation and multisensory-policy clusters of this batch, which would turn its raw signals into learnable embeddings.

Quotable

Missing dynamics information including force will deteriorate the learning performance, resulting in the superficial learning. — §I, Introduction / p.1

The Kaiwu dataset directly collects dynamic and static data using wearable sensors. It integrates assembly actions into a coherent process, enhancing human-robot interaction with a narrative and causal structure. — §II, Related work / p.3

Absolute timestamp records are used during data collection, and the output of each module's data stream is synchronized through absolute timestamps. — §III, Data synchronization / p.5

Papers cited that should likely be ingested next:

ManiWAV (Liu et al. 2024, ref [11]) — in-the-wild audio-visual manipulation; the contact-audio analogue of Kaiwu's microphone modality, but with a trained policy.
ActionSense (DelPreto et al. 2022, ref [18]) — the closest comparator in Table I: multimodal kitchen-activity capture with wearable EMG + tactile + gaze. Direct dataset peer.
Open X-Embodiment (Padalkar et al., ref [15]) and RT-2 (Brohan et al., ref [9]) — the foundation-model datasets Kaiwu positions itself against (vision-heavy, dynamics-poor).
RH20T (ref [12]), DROID (ref [13]), GraspNet-1Billion (ref [14]) — large-scale manipulation datasets in the same competitive landscape.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ManiWAV — contact-audio for manipulation; turns Kaiwu's microphone modality into a learned policy.
See, Hear, and Feel and Making Sense of Vision and Touch — multisensory (vision + touch + audio) fusion for manipulation; the method side of what Kaiwu only provides as data.
SonicSense — in-hand acoustic/vibration sensing; complements Kaiwu's contact-audio capture for object/material reasoning.
Touch and Go and ObjectFolder 2.0 — sibling multisensory datasets/benchmarks in Cluster I; vision-touch and vision-touch-audio resources Kaiwu sits alongside.
Sparsh and AnyTouch — tactile representation learners that could turn Kaiwu's raw per-finger pressure into learnable embeddings.
FuSe (Beyond Sight) — fuses heterogeneous sensors with language into a VLA; the downstream policy paradigm a Kaiwu-style multimodal corpus is meant to feed.