| Professor Survey |
A Touch, Vision, and Language Dataset for Multimodal Alignment |
2024 |
introduces benchmark |
unknown |
captions / descriptions; property words |
captions / descriptions: New TVL Benchmark: open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again… |
introduced data artifact has unknown release status |
| Supplement |
ABC-130k |
2026 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Active Acoustic Sensing for Robot Manipulation |
2023 |
self-collected eval data |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
AgiBot World 2026 |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: LeRobot v2.1: per-episode Parquet + MP4 for 9 image streams (top/left/right hand RGB, head depth, head fisheye x3, head stereo x2), joint pos/vel and EE actions, plus subtask/bbox/instruc… |
dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag |
| Supplement |
AIRoA MoMa Dataset |
2025 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
ALFA (AirLab Failure and Anomaly Dataset) |
2020 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
ALOHA Static |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no task-family tag |
| Professor Survey |
Analyzing Material Recognition Performance of Thermal Tactile Sensing using a Large Materials Database and a Real Robot |
2022 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Any2Policy: Learning Visuomotor Policy with Any-Modality |
2024 |
introduces benchmark |
unknown |
task instructions / commands |
task instructions / commands: Every task is annotated with k = 5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations |
introduced data artifact has unknown release status |
| Professor Survey |
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors |
2025 |
uses existing datasets |
open / public |
property words |
property words: AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities: pixel-level masked modeling for fine detail and semantic-… |
|
| Professor Survey |
Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation |
2025 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
AudioCLIP: Extending CLIP to Image, Text and Audio |
2021 |
uses existing datasets |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: A newly collected real-world multi-task dataset of 27K (text says ~26,866) robot trajectories spanning vision, touch, audio, proprioception (9-DoF IMU), and language instructions across t… |
|
| Professor Survey |
Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations |
2024 |
uses existing datasets |
partial or indirect |
captions / descriptions; property words |
captions / descriptions: Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go. | property … |
|
| Supplement |
BridgeData V2: A Dataset for Robot Learning at Scale |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag |
| Professor Survey |
Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation |
2024 |
survey / review |
not applicable |
none apparent |
none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
CALVIN |
2022 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Simulated benchmark for long-horizon, language-conditioned Franka Panda manipulation: 4 envs, 34 tasks, 24h of play. |
|
| Professor Survey |
CLAP: Learning Audio Concepts From Natural Language Supervision |
2022 |
uses existing datasets |
unknown |
captions / descriptions |
captions / descriptions: CLAP is “CLIP for audio”: train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio-caption pairs to build a joint embedding space, which then does zero-shot … |
|
| Professor Survey |
CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding |
2025 |
uses existing datasets |
unknown |
captions / descriptions; property words |
captions / descriptions: CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pr… |
|
| Professor Survey |
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection |
2025 |
uses existing datasets |
not open |
predicates / constraints |
predicates / constraints: ConSeg is trained on BridgeData V2 [64]: 10,181 trajectories / 219,356 images, with GPT-4o decomposing instructions into subgoals/constraints/object associations and Grounded SAM [53] + Seman… |
|
| Professor Survey |
Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization |
2024 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Demonstrating the Octopi-1.5 Visual-Tactile-Language Model |
2025 |
uses existing datasets |
partial or indirect |
property words |
property words: Octopi-1.5 is a Qwen2-VL-7B visual-tactile-language model that turns GelSight tactile-video frames into tokens, reasons about object properties (hardness, roughness, texture) in language, and adds a si… |
|
| Supplement |
DexMimicGen |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play |
2023 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
DexYCB |
2021 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor with Application to In-Hand Manipulation |
2020 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment |
2024 |
uses existing datasets |
not open |
captions / descriptions; predicates / constraints |
captions / descriptions: A fine-tuning set of 128 image-text pairs (5 fruit demos) for the VLM-FT variant. | predicates / constraints: DoReMi makes the LLM emit not just a high-level plan but also, for each skill, a set o… |
|
| Supplement |
DreamDojo GR-1 Post-Training |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: DreamDojo-HV: human egocentric RGB videos (640x480) with GPT-derived language task annotations |
|
| Supplement |
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset |
2024 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Every episode uses a standardized Franka Panda 7-DoF arm with two exterior ZED 2 stereo cameras and a wrist-mounted ZED Mini, recording RGB/stereo video, depth, joint and Cartesian propri… |
|
| Supplement |
Ego-Exo4D |
2023 |
introduces dataset |
open / public |
captions / descriptions |
captions / descriptions: It contains 1,286.3 hours of video from 740 camera wearers across 13 cities and 123 scene contexts, with multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired l… |
|
| Supplement |
Ego4D |
2022 |
introduces dataset |
open / public |
captions / descriptions |
captions / descriptions: Portions include audio, 3D environment meshes, eye gaze, stereo, multi-camera footage, IMU, and dense textual narrations, supporting five benchmark suites (episodic memory, hands-and-objects, … |
|
| Supplement |
EgoDex |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: It pairs each frame with 3D pose annotations for the head, upper body, and hands (68 joints) via on-device tracking, plus camera intrinsics and natural-language task descriptions. |
|
| Supplement |
EPIC-KITCHENS-100 |
2022 |
introduces dataset |
open / public |
captions / descriptions; temporal phrases |
captions / descriptions: It provides ~90K fine-grained action segments with dense language narrations, plus optical flow, audio, object segmentation masks, and hand-object bounding boxes. | temporal phrases: It provid… |
|
| Professor Survey |
FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning |
2026 |
self-collected eval data |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning |
2025 |
self-collected eval data |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation |
2026 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
FMB (Functional Manipulation Benchmark) |
2024 |
introduces benchmark |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
FoAR: Force-Aware Reactive Policy for Contact-Rich Robotic Manipulation |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation |
2025 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
FTP-1 Dataset |
2026 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
FurnitureBench |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
Galaxea Open-World Dataset |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: 500+ hours of real-world mobile bimanual manipulation on the Galaxea R1 Lite robot with subtask language annotations. |
|
| Professor Survey |
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force |
2017 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning |
2021 |
introduces benchmark |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Grounding Predicates through Actions |
2022 |
uses existing datasets |
not open |
predicates / constraints |
predicates / constraints: Trains a visual predicate classifier from weak supervision (just an action label per video) by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and … |
|
| Professor Survey |
Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation |
2024 |
self-collected eval data |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
HIW-500: Humanoids In-the-Wild |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Each episode records synchronized head (stereo RGB) and wrist (RGB + stereo IR) cameras, 29-DoF joint states, end-effector state, IMU, odometry, and language annotations. |
|
| Supplement |
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation |
2025 |
introduces dataset |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no data-supervision tag; introduced data artifact has unknown release status |
| Supplement |
Humanoid Everyday |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Each trajectory aggregates egocentric and third-person RGB, depth, LiDAR point clouds, tactile, IMU, and proprioception at 30 Hz, with natural-language annotations, in LeRobot v2.0 format. |
dataset-relevant row has no task-family tag |
| Supplement |
HumanPlus |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Hybrid Position/Force Control of Manipulators |
1981 |
sensor / foundation paper |
not applicable |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
ImageBind: One Embedding Space To Bind Them All |
2023 |
sensor / foundation paper |
not applicable |
captions / descriptions |
captions / descriptions: Problem & motivation: CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modal… |
|
| Supplement |
In-flight Positional and Energy-Use Dataset of Package-Delivery Quadcopter UAVs |
2021 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag |
| Professor Survey |
Inner Monologue: Embodied Reasoning through Planning with Language Models |
2022 |
uses existing datasets |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Interactive Perception: Leveraging Action in Perception and Perception in Action |
2017 |
survey / review |
not applicable |
none apparent |
none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. |
|
| Professor Survey |
Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction |
2025 |
introduces dataset |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Professor Survey |
KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
Language-Table |
2022 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Google’s large language-conditioned tabletop block-manipulation dataset: ~442K real + ~181K sim xArm6 trajectories, plus a sim benchmark. |
|
| Professor Survey |
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
2024 |
uses existing datasets |
not open |
captions / descriptions |
captions / descriptions: Evaluated on 15 benchmarks, including video-text retrieval (MSR-VTT, MSVD, DiDeMo, ActivityNet) |
|
| Professor Survey |
Learning Compositional Behaviors from Demonstration and Language (BLADE) |
2024 |
uses existing datasets |
partial or indirect |
predicates / constraints |
predicates / constraints: BLADE automatically recovers PDDL-style behavior abstractions (preconditions, effects, a contact-primitive body) from language-annotated demos by querying an LLM, learns visual classifiers f… |
|
| Supplement |
LeRobot Shirt-Folding Dataset |
2026 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning |
2023 |
introduces benchmark |
open / public |
task instructions / commands |
task instructions / commands: A language-conditioned lifelong robot learning benchmark with four task suites, 130 tasks, and human teleoperated demonstrations. |
|
| Professor Survey |
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos |
2022 |
introduces dataset |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring |
2019 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks |
2019 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
ManipArena |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Demonstrations are recorded on 5 robot platforms with 3 synchronized RGB cameras (one overhead + two wrist), 56-D proprioception (joint positions/velocities/currents), gripper and mobile-… |
|
| Supplement |
ManiSkill2 |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data |
2024 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Material Classification Using Active Temperature Controllable Robotic Gripper |
2021 |
introduces dataset |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Professor Survey |
Material Recognition via Heat Transfer Given Ambiguous Initial Conditions |
2020 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Meta-Transformer: A Unified Framework for Multimodal Learning |
2023 |
uses existing datasets |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
MimicGen |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation |
2024 |
introduces dataset |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Supplement |
Mobile ALOHA |
2024 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: The TFDS/Open X release contains 276 episodes with 3 RGB cameras (overhead + two wrist cameras at 480x640), a 14-dim state, a 16-dim action, and per-step language instructions. |
|
| Supplement |
MolmoAct Dataset |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: It uses a Franka arm with three RGB views (primary, secondary, wrist) and a 7-dim end-effector action space, in LeRobot format with per-episode language annotations. |
dataset-relevant row has no task-family tag |
| Supplement |
MolmoAct2 Bimanual YAM Dataset |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Each episode provides three RGB camera views (left, right, top) plus 14-dim joint/gripper states, in LeRobot format with per-episode annotated language instructions. |
|
| Supplement |
MolmoAct2 SO-100/SO-101 Dataset |
2026 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: A MolmoAct2 resource from Ai2 providing per-episode annotated language instructions for low-cost SO-100 and SO-101 arm data sourced from 1,220 community LeRobot repositories (377 users). |
dataset-relevant row has no task-family tag |
| Professor Survey |
Multimodal Detection and Identification of Robot Manipulation Failures (FINO-Net) |
2023 |
introduces dataset |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training |
2024 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
NVIDIA GR00T X-Embodiment Sim |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Multiple per-embodiment/per-task LeRobot datasets (data/meta/videos) -> episodes -> steps -> {observation (rgb images + state/proprioception), action, language task annotation} |
|
| Professor Survey |
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer |
2022 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations |
2021 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Objects that Sound |
2018 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Octopi: Object Property Reasoning with Large Tactile-Language Models |
2024 |
introduces dataset |
unknown |
property words |
property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel, predicting hardness, roughness, and b… |
introduced data artifact has unknown release status |
| Professor Survey |
OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing |
2026 |
introduces dataset |
unknown |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status |
| Supplement |
Open X-Embodiment: Robotic Learning Datasets and RT-X Models |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag |
| Professor Survey |
PaLM-E: An Embodied Multimodal Language Model |
2023 |
uses existing datasets |
not open |
captions / descriptions |
captions / descriptions: General VL benchmarks: OK-VQA, VQA v2, COCO captioning. |
|
| Professor Survey |
Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning |
2022 |
self-collected eval data |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL |
2024 |
uses existing datasets |
unknown |
task instructions / commands; predicates / constraints |
task instructions / commands: An end-to-end PR2 cooking system that takes a natural-language recipe, converts it to a sequence of robot-interpretable cooking functions via few-shot GPT-4 prompting, complements the omi… |
|
| Professor Survey |
Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot |
2023 |
self-collected eval data |
not open |
captions / descriptions; temporal phrases |
captions / descriptions: negative natural-language description of a heat-induced food change (e.g. | temporal phrases: the time-series of that probability, smoothed and thresholded, becomes a recognizer for when the s… |
|
| Professor Survey |
REFLECT: Summarizing Robot Experiences for Failure Explanation and CorrecTion |
2023 |
uses existing datasets |
unknown |
captions / descriptions |
captions / descriptions: REFLECT converts raw multisensory robot observations (RGB-D, audio, proprioception) into a three-level hierarchical text summary, then queries an LLM progressively to detect, localize, and exp… |
|
| Supplement |
RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot |
2023 |
introduces dataset |
open / public |
captions / descriptions |
captions / descriptions: Each sequence includes visual, force, audio, and action information, plus a human demonstration video and language description. |
|
| Supplement |
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots |
2024 |
introduces simulator |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots |
2026 |
introduces benchmark |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no modality tag |
| Supplement |
robomimic |
2021 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence |
2026 |
introduces dataset |
unknown |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status |
| Supplement |
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation |
2025 |
introduces benchmark |
partial or indirect |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
RoboNet |
2019 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
RoboSet (RoboAgent) |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
Robotic Interestingness Dataset (SubTF) |
2020 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins |
2025 |
introduces benchmark |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no data-supervision tag |
| Supplement |
RT-1 Robot Action Dataset |
2022 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: Each step pairs an RGB image and natural-language instruction with a discretized arm+base action, plus success/feasible/undesirable labels, stored in RLDS/TFDS format. |
|
| Professor Survey |
See to Touch: Learning Tactile Dexterity through Visual Incentives |
2023 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation |
2022 |
self-collected eval data |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
SonicSense: Object Perception from In-Hand Acoustic Vibration |
2024 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning |
2022 |
introduces simulator |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Sparsh: Self-supervised touch representations for vision-based tactile sensing |
2024 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
SubT-MRS |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no task-family tag |
| Professor Survey |
TacEx: GelSight Tactile Simulation in Isaac Sim – Combining Soft-Body and Visuotactile Simulators |
2024 |
introduces simulator |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation |
2025 |
sensor / foundation paper |
not applicable |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation |
2026 |
uses existing datasets |
unknown |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization |
2025 |
introduces dataset |
unknown |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status |
| Professor Survey |
TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors |
2022 |
introduces simulator |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
TartanAir |
2020 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
TartanAir V2 |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
TartanAviation |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
TartanDrive |
2022 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag |
| Supplement |
TartanDrive 2.0 |
2024 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Taxim: An Example-based Simulation Model for GelSight Tactile Sensors |
2021 |
introduces simulator |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Professor Survey |
Teaching Physical Awareness to LLMs through Sounds |
2025 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation |
2022 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects |
2023 |
introduces benchmark |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
The Sound of Pixels |
2018 |
sensor / foundation paper |
not applicable |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio |
2025 |
uses existing datasets |
partial or indirect |
task instructions / commands |
task instructions / commands: … Table 1), each scored over 12 evaluations (4 language commands × 3 random locations). |
|
| Professor Survey |
The Sound of Water: Inferring Physical Properties from Pouring Liquids |
2025 |
introduces dataset |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Supplement |
TLA: Tactile-Language-Action Model for Contact-Rich Manipulation |
2025 |
introduces dataset |
open / public |
task instructions / commands |
task instructions / commands: This is a direct fit for language-conditioned tactile manipulation. |
|
| Professor Survey |
Touch and Go: Learning from Human-Collected Vision and Touch |
2022 |
introduces dataset |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation |
2024 |
introduces dataset |
unknown |
captions / descriptions; property words |
captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus … |
introduced data artifact has unknown release status |
| Professor Survey |
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset |
2024 |
introduces dataset |
unknown |
captions / descriptions |
captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m… |
introduced data artifact has unknown release status |
| Professor Survey |
Towards Forceful Robotic Foundation Models: a Literature Survey |
2025 |
survey / review |
not applicable |
none apparent |
none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation |
2026 |
uses existing datasets |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
TrajAir |
2021 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag |
| Professor Survey |
Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks |
2024 |
uses existing datasets |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
UniT: Data Efficient Tactile Representation with Generalization to Unseen Objects |
2025 |
self-collected eval data |
partial or indirect |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Variable Impedance Control and Learning – A Review |
2020 |
survey / review |
not applicable |
none apparent |
none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. |
|
| Professor Survey |
VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation |
2026 |
introduces benchmark |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
Visually Indicated Sounds |
2016 |
uses existing datasets |
not open |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Professor Survey |
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning |
2024 |
uses existing datasets |
unknown |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag |
| Professor Survey |
VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback |
2025 |
self-collected eval data |
not open |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Professor Survey |
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation |
2025 |
introduces benchmark |
partial or indirect |
task instructions / commands; captions / descriptions |
task instructions / commands: CSI (CALVIN with Speech Instructions): CALVIN’s 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes; captions / descriptions: SQA: 185K image-au… |
|
| Professor Survey |
VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning |
2025 |
introduces benchmark |
unknown |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
introduced data artifact has unknown release status |
| Professor Survey |
What Foundation Models can Bring for Robot Learning in Manipulation: A Survey |
2025 |
survey / review |
not applicable |
none apparent |
none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
Wire Detection Dataset |
2017 |
introduces dataset |
open / public |
none apparent |
none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. |
language modality/method signal, but no explicit dataset language annotation evidence |
| Supplement |
WIT-UAS (Wildland-fire Infrared Thermal UAS Dataset) |
2023 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|
| Supplement |
Yamaha-CMU Off-Road Dataset (YCOR) |
2018 |
introduces dataset |
open / public |
none apparent |
none apparent: No explicit language annotation is stated in the dataset/setup evidence. |
|