Filter Audit for Multimodal Dataset Map

Generated: 2026-06-28

Scope and Rule

This audit checks every catalog row against the local professor summaries and the generated supplement summaries. The Language Annotation field records an explicit annotation in the data itself; a paper merely using an LLM, VLM, or VLA does not qualify. When language appears only in the method, title, or tags and the dataset evidence states no language annotations, the row is marked "none apparent" and flagged with a warning.

Primary local sources reviewed for the rule: source_materials/professor_survey_raw/index.html, knowledge/brainstorm/grounding_dynamic_state_change_concepts_multimodal.html, and knowledge/search/language_multimodal_sensing_manipulation_2026_06_24.html.
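
The rule in this section can be sketched in Python as follows. This is a minimal sketch, not the audit's actual implementation: the Row shape, its field names, and the function audit_language_annotation are hypothetical. Only the "none apparent" value and the warning text are taken from this report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NONE_APPARENT = "none apparent"
LANGUAGE_SIGNAL_WARNING = (
    "language modality/method signal, but no explicit dataset "
    "language annotation evidence"
)


@dataclass
class Row:
    # Hypothetical row shape; the real catalog schema may differ.
    title: str
    # Annotation type -> evidence quoted from the dataset description.
    annotation_evidence: Dict[str, str] = field(default_factory=dict)
    # True when language shows up only in the method, title, or tags.
    mentions_language: bool = False


def audit_language_annotation(row: Row) -> Tuple[str, List[str]]:
    """Return (language annotation value, warnings) for one catalog row."""
    if row.annotation_evidence:
        # Explicit dataset evidence exists: report every annotation type found.
        return "; ".join(row.annotation_evidence), []
    # No dataset evidence: a language signal elsewhere earns a warning only.
    warnings = [LANGUAGE_SIGNAL_WARNING] if row.mentions_language else []
    return NONE_APPARENT, warnings
```

Under this sketch, a VLA paper whose dataset description never mentions instructions or captions gets "none apparent" plus the warning, while a row with quoted caption evidence gets "captions / descriptions" and no warning.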

Summary

Metric                                                     Count
Total catalog rows                                         149
Professor Survey rows                                      91
Dataset-relevant rows                                      140
Introduced datasets/benchmarks/simulators                  92
Open or partial data rows                                  87
Rows with explicit language annotations                    45
Professor Survey rows with explicit language annotations   23
Language annotations + non-visual modality                 40
Language annotations + dynamic concept                     37

Source Counts

Source             Count
Professor Survey   91
Supplement         58

Language Annotation Counts

Value                          Count
none apparent                  104
task instructions / commands   23
captions / descriptions        17
property words                 7
predicates / constraints       5
temporal phrases               2
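
The explicit values in this table sum to 54, not to the 45 rows reported in the Summary, because a single row can carry several values joined by "; " (for example "captions / descriptions; property words"). The sketch below shows that counting scheme on four field values copied from the tables in this report. The function name count_annotation_values is hypothetical, and the splitting rule is an assumption inferred from how the values are printed.

```python
from collections import Counter

# Language annotation field values copied from four rows of this report.
field_values = [
    "captions / descriptions; property words",
    "task instructions / commands",
    "captions / descriptions; temporal phrases",
    "predicates / constraints",
]


def count_annotation_values(values):
    """Count each annotation value once per row that carries it."""
    counts = Counter()
    for value in values:
        for tag in value.split(";"):
            tag = tag.strip()
            if tag:
                counts[tag] += 1
    return counts


counts = count_annotation_values(field_values)

# Per-value counts as reported in the table above.
reported = {
    "none apparent": 104,
    "task instructions / commands": 23,
    "captions / descriptions": 17,
    "property words": 7,
    "predicates / constraints": 5,
    "temporal phrases": 2,
}
explicit_value_total = sum(reported.values()) - reported["none apparent"]

print(len(field_values), "rows carry", sum(counts.values()), "values")
print("explicit values in the table:", explicit_value_total)
```

The four sample rows carry six values between them, which is the same effect that lifts the table's explicit total (54) above the row count (45).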

Data Supervision Counts

Value                           Count
demonstrations / trajectories   85
simulation labels               70
class labels                    51
self-supervised pairs           50
property labels                 33
temporal / event labels         26

Modality Counts

Value              Count
vision             134
language           80
proprioception     65
tactile            57
point cloud / 3D   53
audio              43
force              43
thermal            14

Dataset Role Counts

Value                       Count
introduces dataset          75
uses existing datasets      27
self-collected eval data    21
introduces benchmark        12
introduces simulator        5
survey / review             5
sensor / foundation paper   4

Open Data Counts

Value                 Count
open / public         61
not open              29
partial or indirect   26
unknown               24
not applicable        9
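
The count tables above can be cross-checked against the Summary with simple arithmetic. The sketch below assumes that Source, Dataset role, and Open data are single-valued per row (their counts sum to the 149 catalog rows) and that the Summary metrics for introduced artifacts and open or partial data are plain sums of the matching table entries. Both assumptions are inferred from the numbers, not stated by the audit.

```python
TOTAL_ROWS = 149

# Counts copied from the tables in this report.
source = {"Professor Survey": 91, "Supplement": 58}
dataset_role = {
    "introduces dataset": 75,
    "uses existing datasets": 27,
    "self-collected eval data": 21,
    "introduces benchmark": 12,
    "introduces simulator": 5,
    "survey / review": 5,
    "sensor / foundation paper": 4,
}
open_data = {
    "open / public": 61,
    "not open": 29,
    "partial or indirect": 26,
    "unknown": 24,
    "not applicable": 9,
}


def sums_to_total(table, total=TOTAL_ROWS):
    """True when a single-valued field accounts for every catalog row."""
    return sum(table.values()) == total


# Derived Summary metrics, under the assumptions stated in the lead-in.
introduced = (
    dataset_role["introduces dataset"]
    + dataset_role["introduces benchmark"]
    + dataset_role["introduces simulator"]
)
open_or_partial = open_data["open / public"] + open_data["partial or indirect"]

for name, table in [("source", source), ("role", dataset_role), ("open", open_data)]:
    print(name, "sums to total:", sums_to_total(table))
print("introduced:", introduced)            # Summary reports 92
print("open or partial:", open_or_partial)  # Summary reports 87
```

Multi-valued fields such as Modality and Data Supervision are left out of the total check because a row can contribute to several of their counts.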

Interpretation

  • The language-annotation count dropped compared with the earlier UI because ordinary class/property labels are no longer counted as language annotations.
  • The strongest language-annotated tactile cluster is TVL / TLV / Touch100k / Octopi / CLTP / AnyTouch / UniTouch.
  • Temporal language grounding remains sparse: only 2 of the 45 language-annotated rows carry the temporal phrases value.
  • The professor brainstorm’s target gap remains visible: learned, reusable dynamic state-change predicates grounded in non-visual multimodal sensing are still weakly covered, especially for thermal/acoustic/force signals.

Professor Survey Rows With Language Annotations

Source Title Year Dataset role Open data Language annotation Evidence Warnings
Professor Survey A Touch, Vision, and Language Dataset for Multimodal Alignment 2024 introduces benchmark unknown captions / descriptions; property words captions / descriptions: New TVL Benchmark : open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again… introduced data artifact has unknown release status
Professor Survey Any2Policy: Learning Visuomotor Policy with Any-Modality 2024 introduces benchmark unknown task instructions / commands task instructions / commands: Every task is annotated with k =5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations introduced data artifact has unknown release status
Professor Survey AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors 2025 uses existing datasets open / public property words property words: AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities - pixel-level masked modeling for fine detail and semantic-…  
Professor Survey Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding 2025 introduces dataset open / public task instructions / commands task instructions / commands: A newly collected real-world multi-task dataset of 27K (text says ~26,866) robot trajectories spanning vision, touch, audio, proprioception (9-DoF IMU), and language instructions across t…  
Professor Survey Binding Touch to Everything: Learning Unified Multimodal Tactile Representations 2024 uses existing datasets partial or indirect captions / descriptions; property words captions / descriptions: Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go. | property …  
Professor Survey CLAP: Learning Audio Concepts From Natural Language Supervision 2022 uses existing datasets unknown captions / descriptions captions / descriptions: CLAP is “CLIP for audio”: train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio-caption pairs to build a joint embedding space, which then does zero-shot …  
Professor Survey CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding 2025 uses existing datasets unknown captions / descriptions; property words captions / descriptions: CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pr…  
Professor Survey Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection 2025 uses existing datasets not open predicates / constraints predicates / constraints: ConSeg is trained on BridgeData V2 [64]: 10,181 trajectories / 219,356 images, with GPT-4o decomposing instructions into subgoals/constraints/object associations and Grounded SAM [53] + Seman…  
Professor Survey Demonstrating the Octopi-1.5 Visual-Tactile-Language Model 2025 uses existing datasets partial or indirect property words property words: Octopi-1.5 is a Qwen2-VL-7B visual-tactile-language model that turns GelSight tactile-video frames into tokens, reasons about object properties (hardness, roughness, texture) in language, and adds a si…  
Professor Survey DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment 2024 uses existing datasets not open captions / descriptions; predicates / constraints captions / descriptions: A 128 image-text pair fine-tuning set (5 fruit demos) for the VLM-FT variant. | predicates / constraints: DoReMi makes the LLM emit not just a high-level plan but also, for each skill, a set o…  
Professor Survey Grounding Predicates through Actions 2022 uses existing datasets not open predicates / constraints predicates / constraints: Trains a visual predicate classifier from weak supervision - just an action label per video - by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and …  
Professor Survey ImageBind: One Embedding Space To Bind Them All 2023 sensor / foundation paper not applicable captions / descriptions captions / descriptions: Problem & motivation CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modal…  
Professor Survey LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment 2024 uses existing datasets not open captions / descriptions captions / descriptions: Evaluate on 15 benchmarks: video-text retrieval (MSR-VTT, MSVD, DiDeMo, ActivityNet)  
Professor Survey Learning Compositional Behaviors from Demonstration and Language (BLADE) 2024 uses existing datasets partial or indirect predicates / constraints predicates / constraints: BLADE automatically recovers PDDL-style behavior abstractions (preconditions, effects, a contact-primitive body ) from language-annotated demos by querying an LLM, learns visual classifiers f…  
Professor Survey Octopi: Object Property Reasoning with Large Tactile-Language Models 2024 introduces dataset unknown property words property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b… introduced data artifact has unknown release status
Professor Survey PaLM-E: An Embodied Multimodal Language Model 2023 uses existing datasets not open captions / descriptions captions / descriptions: General VL benchmarks: OK-VQA, VQA v2, COCO captioning.  
Professor Survey Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL 2024 uses existing datasets unknown task instructions / commands; predicates / constraints task instructions / commands: An end-to-end PR2 cooking system that takes a natural-language recipe, converts it to a sequence of robot-interpretable cooking functions via few-shot GPT-4 prompting, complements the omi…  
Professor Survey Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot 2023 self-collected eval data not open captions / descriptions; temporal phrases captions / descriptions: negative natural-language description of a heat-induced food change (e.g. | temporal phrases: the time-series of that probability, smoothed and thresholded, becomes a recognizer for when the s…  
Professor Survey REFLECT: Summarizing Robot Experiences for Failure Explanation and CorrecTion 2023 uses existing datasets unknown captions / descriptions captions / descriptions: REFLECT converts raw multisensory robot observations (RGB-D, audio, proprioception) into a three-level hierarchical text summary, then queries an LLM progressively to detect, localize, and exp…  
Professor Survey The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio 2025 uses existing datasets partial or indirect task instructions / commands task instructions / commands: Table 1), each scored over 12 evaluations (4 language commands x 3 random locations).  
Professor Survey Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation 2024 introduces dataset unknown captions / descriptions; property words captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus … introduced data artifact has unknown release status
Professor Survey Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset 2024 introduces dataset unknown captions / descriptions captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m… introduced data artifact has unknown release status
Professor Survey VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation 2025 introduces benchmark partial or indirect task instructions / commands; captions / descriptions task instructions / commands: CSI (CALVIN with Speech Instructions): CALVIN’s 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes | captions / descriptions: SQA : 185K image-au…  

Rows Needing Manual Follow-Up

Source Title Year Dataset role Open data Language annotation Evidence Warnings
Professor Survey A Touch, Vision, and Language Dataset for Multimodal Alignment 2024 introduces benchmark unknown captions / descriptions; property words captions / descriptions: New TVL Benchmark : open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again… introduced data artifact has unknown release status
Supplement ABC-130k 2026 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement AgiBot World 2026 2026 introduces dataset open / public task instructions / commands task instructions / commands: LeRobot v2.1: per-episode Parquet + MP4 for 9 image streams (top/left/right hand RGB, head depth, head fisheye x3, head stereo x2), joint pos/vel and EE actions, plus subtask/bbox/instruc… dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement AIRoA MoMa Dataset 2025 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement ALOHA Static 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag
Professor Survey Any2Policy: Learning Visuomotor Policy with Any-Modality 2024 introduces benchmark unknown task instructions / commands task instructions / commands: Every task is annotated with k =5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations introduced data artifact has unknown release status
Professor Survey Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation 2025 uses existing datasets partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey AudioCLIP: Extending CLIP to Image, Text and Audio 2021 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement BridgeData V2: A Dataset for Robot Learning at Scale 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation 2024 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization 2024 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation 2026 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement FMB (Functional Manipulation Benchmark) 2024 introduces benchmark open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation 2025 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement FTP-1 Dataset 2026 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no data-supervision tag; introduced data artifact has unknown release status
Supplement Humanoid Everyday 2025 introduces dataset open / public task instructions / commands task instructions / commands: Each trajectory aggregates egocentric and third-person RGB, depth, LiDAR point clouds, tactile, IMU, and proprioception at 30 Hz, with natural-language annotations, in LeRobot v2.0 format. dataset-relevant row has no task-family tag
Supplement In-flight Positional and Energy-Use Dataset of Package-Delivery Quadcopter UAVs 2021 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey Inner Monologue: Embodied Reasoning through Planning with Language Models 2022 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring 2019 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Material Classification Using Active Temperature Controllable Robotic Gripper 2021 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey Meta-Transformer: A Unified Framework for Multimodal Learning 2023 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation 2024 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Supplement MolmoAct Dataset 2026 introduces dataset open / public task instructions / commands task instructions / commands: It uses a Franka arm with three RGB views (primary, secondary, wrist) and a 7-dim end-effector action space, in LeRobot format with per-episode language annotations. dataset-relevant row has no task-family tag
Supplement MolmoAct2 SO-100/SO-101 Dataset 2026 introduces dataset open / public task instructions / commands task instructions / commands: A MolmoAct2 resource from Ai2 providing per-episode annotated language instructions for low-cost SO-100 and SO-101 arm data sourced from 1,220 community LeRobot repositories (377 users). dataset-relevant row has no task-family tag
Professor Survey Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training 2024 introduces dataset partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Octopi: Object Property Reasoning with Large Tactile-Language Models 2024 introduces dataset unknown property words property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b… introduced data artifact has unknown release status
Professor Survey OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing 2026 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement Open X-Embodiment: Robotic Learning Datasets and RT-X Models 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Supplement RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots 2024 introduces simulator open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots 2026 introduces benchmark partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no modality tag
Supplement RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence 2026 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation 2025 introduces benchmark partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement RoboSet (RoboAgent) 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins 2025 introduces benchmark open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no data-supervision tag
Supplement SubT-MRS 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag
Professor Survey Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation 2026 uses existing datasets unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization 2025 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement TartanAviation 2024 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement TartanDrive 2022 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey Taxim: An Example-based Simulation Model for GelSight Tactile Sensors 2021 introduces simulator unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey Teaching Physical Awareness to LLMs through Sounds 2025 uses existing datasets partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey The Sound of Water: Inferring Physical Properties from Pouring Liquids 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation 2024 introduces dataset unknown captions / descriptions; property words captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus … introduced data artifact has unknown release status
Professor Survey Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset 2024 introduces dataset unknown captions / descriptions captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m… introduced data artifact has unknown release status
Professor Survey Towards Forceful Robotic Foundation Models: a Literature Survey 2025 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation 2026 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement TrajAir 2021 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning 2024 uses existing datasets unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning 2025 introduces benchmark unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey What Foundation Models can Bring for Robot Learning in Manipulation: A Survey 2025 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Supplement Wire Detection Dataset 2017 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence

Full Per-Row Audit

The CSV next to this report contains the full untruncated evidence and online/local links for all rows.
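
A minimal sketch of pulling the flagged rows back out of that CSV, using only the Python standard library. The CSV file name is not given in this report, and the column names below are assumed to match the table header in this section; adjust both to the real file. The in-memory sample holds two rows copied from this report.

```python
import csv
import io

# Two sample rows in the assumed column layout (header taken from this report).
SAMPLE_CSV = '''Source,Title,Year,Dataset role,Open data,Language annotation,Evidence,Warnings
Supplement,ABC-130k,2026,introduces dataset,open / public,none apparent,"Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.","language modality/method signal, but no explicit dataset language annotation evidence"
Professor Survey,Active Acoustic Sensing for Robot Manipulation,2023,self-collected eval data,partial or indirect,none apparent,No explicit language annotation is stated in the dataset/setup evidence.,
'''


def rows_with_warnings(csv_file):
    """Return the rows whose Warnings column is non-empty."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if (row.get("Warnings") or "").strip()]


flagged = rows_with_warnings(io.StringIO(SAMPLE_CSV))
for row in flagged:
    print(row["Title"], "->", row["Warnings"])
```

For the real file, the same function accepts a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV actually lives.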

Source Title Year Dataset role Open data Language annotation Evidence Warnings
Professor Survey A Touch, Vision, and Language Dataset for Multimodal Alignment 2024 introduces benchmark unknown captions / descriptions; property words captions / descriptions: New TVL Benchmark : open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again… introduced data artifact has unknown release status
Supplement ABC-130k 2026 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Active Acoustic Sensing for Robot Manipulation 2023 self-collected eval data partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement AgiBot World 2026 2026 introduces dataset open / public task instructions / commands task instructions / commands: LeRobot v2.1: per-episode Parquet + MP4 for 9 image streams (top/left/right hand RGB, head depth, head fisheye x3, head stereo x2), joint pos/vel and EE actions, plus subtask/bbox/instruc… dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement AIRoA MoMa Dataset 2025 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement ALFA (AirLab Failure and Anomaly Dataset) 2020 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement ALOHA Static 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag
Professor Survey Analyzing Material Recognition Performance of Thermal Tactile Sensing using a Large Materials Database and a Real Robot 2022 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Any2Policy: Learning Visuomotor Policy with Any-Modality 2024 introduces benchmark unknown task instructions / commands task instructions / commands: Every task is annotated with k = 5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations introduced data artifact has unknown release status
Professor Survey AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors 2025 uses existing datasets open / public property words property words: AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities - pixel-level masked modeling for fine detail and semantic-…  
Professor Survey Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation 2025 uses existing datasets partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey AudioCLIP: Extending CLIP to Image, Text and Audio 2021 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding 2025 introduces dataset open / public task instructions / commands task instructions / commands: A newly collected real-world multi-task dataset of 27K (text says ~26,866) robot trajectories spanning vision, touch, audio, proprioception (9-DoF IMU), and language instructions across t…  
Professor Survey Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Binding Touch to Everything: Learning Unified Multimodal Tactile Representations 2024 uses existing datasets partial or indirect captions / descriptions; property words captions / descriptions: Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go. | property …  
Supplement BridgeData V2: A Dataset for Robot Learning at Scale 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation 2024 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Supplement CALVIN 2022 introduces dataset open / public task instructions / commands task instructions / commands: Simulated benchmark for long-horizon, language-conditioned Franka Panda manipulation: 4 envs, 34 tasks, 24h of play.  
Professor Survey CLAP: Learning Audio Concepts From Natural Language Supervision 2022 uses existing datasets unknown captions / descriptions captions / descriptions: CLAP is “CLIP for audio”: train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio-caption pairs to build a joint embedding space, which then does zero-shot …  
Professor Survey CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding 2025 uses existing datasets unknown captions / descriptions; property words captions / descriptions: CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pr…  
Professor Survey Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection 2025 uses existing datasets not open predicates / constraints predicates / constraints: ConSeg is trained on BridgeData V2 [64]: 10,181 trajectories / 219,356 images, with GPT-4o decomposing instructions into subgoals/constraints/object associations and Grounded SAM [53] + Seman…  
Professor Survey Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization 2024 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Demonstrating the Octopi-1.5 Visual-Tactile-Language Model 2025 uses existing datasets partial or indirect property words property words: Octopi-1.5 is a Qwen2-VL-7B visual-tactile-language model that turns GelSight tactile-video frames into tokens, reasons about object properties (hardness, roughness, texture) in language, and adds a si…  
Supplement DexMimicGen 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play 2023 uses existing datasets partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement DexYCB 2021 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor with Application to In-Hand Manipulation 2020 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment 2024 uses existing datasets not open captions / descriptions; predicates / constraints captions / descriptions: A 128 image-text pair fine-tuning set (5 fruit demos) for the VLM-FT variant. | predicates / constraints: DoReMi makes the LLM emit not just a high-level plan but also, for each skill, a set o…  
Supplement DreamDojo GR-1 Post-Training 2026 introduces dataset open / public task instructions / commands task instructions / commands: DreamDojo-HV: human egocentric RGB videos (640x480) with GPT-derived language task annotations  
Supplement DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset 2024 introduces dataset open / public task instructions / commands task instructions / commands: Every episode uses a standardized Franka Panda 7-DoF arm with two exterior ZED 2 stereo cameras and a wrist-mounted ZED Mini, recording RGB/stereo video, depth, joint and Cartesian propri…  
Supplement Ego-Exo4D 2023 introduces dataset open / public captions / descriptions captions / descriptions: It contains 1,286.3 hours of video from 740 camera wearers across 13 cities and 123 scene contexts, with multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired l…  
Supplement Ego4D 2022 introduces dataset open / public captions / descriptions captions / descriptions: Portions include audio, 3D environment meshes, eye gaze, stereo, multi-camera footage, IMU, and dense textual narrations, supporting five benchmark suites (episodic memory, hands-and-objects, …  
Supplement EgoDex 2025 introduces dataset open / public task instructions / commands task instructions / commands: It pairs each frame with 3D pose annotations for the head, upper body, and hands (68 joints) via on-device tracking, plus camera intrinsics and natural-language task descriptions.  
Supplement EPIC-KITCHENS-100 2022 introduces dataset open / public captions / descriptions; temporal phrases captions / descriptions: It provides ~90K fine-grained action segments with dense language narrations, plus optical flow, audio, object segmentation masks, and hand-object bounding boxes. | temporal phrases: It provid…  
Professor Survey FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning 2026 self-collected eval data partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning 2025 self-collected eval data unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation 2026 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement FMB (Functional Manipulation Benchmark) 2024 introduces benchmark open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey FoAR: Force-Aware Reactive Policy for Contact-Rich Robotic Manipulation 2025 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation 2025 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement FTP-1 Dataset 2026 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement FurnitureBench 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement Galaxea Open-World Dataset 2025 introduces dataset open / public task instructions / commands task instructions / commands: 500+ hours of real-world mobile bimanual manipulation on the Galaxea R1 Lite robot with subtask language annotations.  
Professor Survey GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force 2017 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning 2021 introduces benchmark not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Grounding Predicates through Actions 2022 uses existing datasets not open predicates / constraints predicates / constraints: Trains a visual predicate classifier from weak supervision - just an action label per video - by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and …  
Professor Survey Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation 2024 self-collected eval data partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement HIW-500: Humanoids In-the-Wild 2026 introduces dataset open / public task instructions / commands task instructions / commands: Each episode records synchronized head (stereo RGB) and wrist (RGB + stereo IR) cameras, 29-DoF joint states, end-effector state, IMU, odometry, and language annotations.  
Supplement Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no data-supervision tag; introduced data artifact has unknown release status
Supplement Humanoid Everyday 2025 introduces dataset open / public task instructions / commands task instructions / commands: Each trajectory aggregates egocentric and third-person RGB, depth, LiDAR point clouds, tactile, IMU, and proprioception at 30 Hz, with natural-language annotations, in LeRobot v2.0 format. dataset-relevant row has no task-family tag
Supplement HumanPlus 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Hybrid Position/Force Control of Manipulators 1981 sensor / foundation paper not applicable none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey ImageBind: One Embedding Space To Bind Them All 2023 sensor / foundation paper not applicable captions / descriptions captions / descriptions: Problem & motivation: CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modal…  
Supplement In-flight Positional and Energy-Use Dataset of Package-Delivery Quadcopter UAVs 2021 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey Inner Monologue: Embodied Reasoning through Planning with Language Models 2022 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Interactive Perception: Leveraging Action in Perception and Perception in Action 2017 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.  
Professor Survey Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement Language-Table 2022 introduces dataset open / public task instructions / commands task instructions / commands: Google’s large language-conditioned tabletop block-manipulation dataset: ~442K real + ~181K sim xArm6 trajectories, plus a sim benchmark.  
Professor Survey LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment 2024 uses existing datasets not open captions / descriptions captions / descriptions: Evaluate on 15 benchmarks: video-text retrieval (MSR-VTT, MSVD, DiDeMo, ActivityNet)  
Professor Survey Learning Compositional Behaviors from Demonstration and Language (BLADE) 2024 uses existing datasets partial or indirect predicates / constraints predicates / constraints: BLADE automatically recovers PDDL-style behavior abstractions (preconditions, effects, a contact-primitive body) from language-annotated demos by querying an LLM, learns visual classifiers f…  
Supplement LeRobot Shirt-Folding Dataset 2026 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning 2023 introduces benchmark open / public task instructions / commands task instructions / commands: A language-conditioned lifelong robot learning benchmark with four task suites, 130 tasks, and human teleoperated demonstrations.  
Professor Survey Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos 2022 introduces dataset not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring 2019 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks 2019 uses existing datasets partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement ManipArena 2026 introduces dataset open / public task instructions / commands task instructions / commands: Demonstrations are recorded on 5 robot platforms with 3 synchronized RGB cameras (one overhead + two wrist), 56-D proprioception (joint positions/velocities/currents), gripper and mobile-…  
Supplement ManiSkill2 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data 2024 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Material Classification Using Active Temperature Controllable Robotic Gripper 2021 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey Material Recognition via Heat Transfer Given Ambiguous Initial Conditions 2020 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Meta-Transformer: A Unified Framework for Multimodal Learning 2023 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement MimicGen 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation 2024 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Supplement Mobile ALOHA 2024 introduces dataset open / public task instructions / commands task instructions / commands: The TFDS/Open X release contains 276 episodes with 3 RGB cameras (overhead + two wrist cameras at 480x640), a 14-dim state, a 16-dim action, and per-step language instructions.  
Supplement MolmoAct Dataset 2026 introduces dataset open / public task instructions / commands task instructions / commands: It uses a Franka arm with three RGB views (primary, secondary, wrist) and a 7-dim end-effector action space, in LeRobot format with per-episode language annotations. dataset-relevant row has no task-family tag
Supplement MolmoAct2 Bimanual YAM Dataset 2026 introduces dataset open / public task instructions / commands task instructions / commands: Each episode provides three RGB camera views (left, right, top) plus 14-dim joint/gripper states, in LeRobot format with per-episode annotated language instructions.  
Supplement MolmoAct2 SO-100/SO-101 Dataset 2026 introduces dataset open / public task instructions / commands task instructions / commands: A MolmoAct2 resource from Ai2 providing per-episode annotated language instructions for low-cost SO-100 and SO-101 arm data sourced from 1,220 community LeRobot repositories (377 users). dataset-relevant row has no task-family tag
Professor Survey Multimodal Detection and Identification of Robot Manipulation Failures (FINO-Net) 2023 introduces dataset not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training 2024 introduces dataset partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement NVIDIA GR00T X-Embodiment Sim 2025 introduces dataset open / public task instructions / commands task instructions / commands: Multiple per-embodiment/per-task LeRobot datasets (data/meta/videos) -> episodes -> steps -> {observation (rgb images + state/proprioception), action, language task annotation}  
Professor Survey ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer 2022 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations 2021 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Objects that Sound 2018 uses existing datasets partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Octopi: Object Property Reasoning with Large Tactile-Language Models 2024 introduces dataset unknown property words property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b… introduced data artifact has unknown release status
Professor Survey OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing 2026 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement Open X-Embodiment: Robotic Learning Datasets and RT-X Models 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey PaLM-E: An Embodied Multimodal Language Model 2023 uses existing datasets not open captions / descriptions captions / descriptions: General VL benchmarks: OK-VQA, VQA v2, COCO captioning.  
Professor Survey Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning 2022 self-collected eval data unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation 2025 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL 2024 uses existing datasets unknown task instructions / commands; predicates / constraints task instructions / commands: An end-to-end PR2 cooking system that takes a natural-language recipe, converts it to a sequence of robot-interpretable cooking functions via few-shot GPT-4 prompting, complements the omi…  
Professor Survey Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot 2023 self-collected eval data not open captions / descriptions; temporal phrases captions / descriptions: negative natural-language description of a heat-induced food change (e.g. | temporal phrases: the time-series of that probability, smoothed and thresholded, becomes a recognizer for when the s…  
Professor Survey REFLECT: Summarizing Robot Experiences for Failure Explanation and CorrecTion 2023 uses existing datasets unknown captions / descriptions captions / descriptions: REFLECT converts raw multisensory robot observations (RGB-D, audio, proprioception) into a three-level hierarchical text summary, then queries an LLM progressively to detect, localize, and exp…  
Supplement RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot 2023 introduces dataset open / public captions / descriptions captions / descriptions: Each sequence includes visual, force, audio, and action information, plus a human demonstration video and language description.  
Supplement RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots 2024 introduces simulator open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots 2026 introduces benchmark partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no modality tag
Supplement robomimic 2021 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence 2026 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation 2025 introduces benchmark partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement RoboNet 2019 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement RoboSet (RoboAgent) 2023 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement Robotic Interestingness Dataset (SubTF) 2020 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins 2025 introduces benchmark open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no data-supervision tag
Supplement RT-1 Robot Action Dataset 2022 introduces dataset open / public task instructions / commands task instructions / commands: Each step pairs an RGB image and natural-language instruction with a discretized arm+base action, plus success/feasible/undesirable labels, stored in RLDS/TFDS format.  
Professor Survey See to Touch: Learning Tactile Dexterity through Visual Incentives 2023 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation 2022 self-collected eval data unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey SonicSense: Object Perception from In-Hand Acoustic Vibration 2024 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning 2022 introduces simulator open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Sparsh: Self-supervised touch representations for vision-based tactile sensing 2024 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement SubT-MRS 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag
Professor Survey TacEx: GelSight Tactile Simulation in Isaac Sim – Combining Soft-Body and Visuotactile Simulators 2024 introduces simulator not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation 2025 sensor / foundation paper not applicable none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation 2025 self-collected eval data not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation 2026 uses existing datasets unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization 2025 introduces dataset unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Professor Survey TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors 2022 introduces simulator open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement TartanAir 2020 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement TartanAir V2 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement TartanAviation 2024 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement TartanDrive 2022 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement TartanDrive 2.0 2024 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Taxim: An Example-based Simulation Model for GelSight Tactile Sensors 2021 introduces simulator unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey Teaching Physical Awareness to LLMs through Sounds 2025 uses existing datasets partial or indirect none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation 2022 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects 2023 introduces benchmark partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey The Sound of Pixels 2018 sensor / foundation paper not applicable none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio 2025 uses existing datasets partial or indirect task instructions / commands task instructions / commands: Per Table 1, each result is scored over 12 evaluations (4 language commands x 3 random locations).  
Professor Survey The Sound of Water: Inferring Physical Properties from Pouring Liquids 2025 introduces dataset unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Supplement TLA: Tactile-Language-Action Model for Contact-Rich Manipulation 2025 introduces dataset open / public task instructions / commands task instructions / commands: This is a direct fit for language-conditioned tactile manipulation.  
Professor Survey Touch and Go: Learning from Human-Collected Vision and Touch 2022 introduces dataset partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation 2024 introduces dataset unknown captions / descriptions; property words captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus … introduced data artifact has unknown release status
Professor Survey Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset 2024 introduces dataset unknown captions / descriptions captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions: ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m… introduced data artifact has unknown release status
Professor Survey Towards Forceful Robotic Foundation Models: a Literature Survey 2025 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation 2026 uses existing datasets not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement TrajAir 2021 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks 2024 uses existing datasets partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey UniT: Data Efficient Tactile Representation with Generalization to Unseen Objects 2025 self-collected eval data partial or indirect none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Variable Impedance Control and Learning – A Review 2020 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.  
Professor Survey VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation 2026 introduces benchmark not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey Visually Indicated Sounds 2016 uses existing datasets not open none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Professor Survey VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning 2024 uses existing datasets unknown none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback 2025 self-collected eval data not open none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation 2025 introduces benchmark partial or indirect task instructions / commands; captions / descriptions task instructions / commands: CSI (CALVIN with Speech Instructions): CALVIN’s 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes | captions / descriptions: SQA: 185K image-au…  
Professor Survey VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning 2025 introduces benchmark unknown none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence. introduced data artifact has unknown release status
Professor Survey What Foundation Models can Bring for Robot Learning in Manipulation: A Survey 2025 survey / review not applicable none apparent none apparent: Survey/review paper; no paper-specific dataset language annotations are reported. language modality/method signal, but no explicit dataset language annotation evidence
Supplement Wire Detection Dataset 2017 introduces dataset open / public none apparent none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations. language modality/method signal, but no explicit dataset language annotation evidence
Supplement WIT-UAS (Wildland-fire Infrared Thermal UAS Dataset) 2023 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.  
Supplement Yamaha-CMU Off-Road Dataset (YCOR) 2018 introduces dataset open / public none apparent none apparent: No explicit language annotation is stated in the dataset/setup evidence.
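
Two of the flags in the rows above appear to follow simple rules: the language-signal flag shows up when a row carries a language tag but its Language Annotation value is none apparent, and the release-status flag shows up when a row introduces a dataset, benchmark, or simulator and its Open Data value is unknown. The Python sketch below reproduces those two rules under stated assumptions. The Row fields, the audit_flags and format_flags helpers, and the idea that the language signal is stored as a modality tag are all hypothetical names chosen for illustration; the audit's real schema and generator are not shown in this document.

```python
from dataclasses import dataclass, field

# Flag strings copied from the rows above.
LANGUAGE_FLAG = (
    "language modality/method signal, but no explicit dataset "
    "language annotation evidence"
)
RELEASE_FLAG = "introduced data artifact has unknown release status"


@dataclass
class Row:
    """Hypothetical row shape; field names are assumptions, not the audit's schema."""
    title: str
    dataset_role: str                 # e.g. "introduces dataset"
    open_data: str                    # e.g. "open / public", "unknown"
    language_annotation: list         # e.g. ["none apparent"]
    modalities: set = field(default_factory=set)


def audit_flags(row):
    """Return the flags this sketch can derive for one row, in table order."""
    flags = []
    has_explicit = any(v != "none apparent" for v in row.language_annotation)
    # Assumed rule: a language tag without explicit annotation evidence is flagged.
    if "language" in row.modalities and not has_explicit:
        flags.append(LANGUAGE_FLAG)
    # Assumed rule: an introduced artifact with unknown release status is flagged.
    if row.dataset_role.startswith("introduces") and row.open_data == "unknown":
        flags.append(RELEASE_FLAG)
    return flags


def format_flags(row):
    """Join flags with '; ' as in the Flags column above."""
    return "; ".join(audit_flags(row))
```

For a row shaped like the Tactile-VLA entry (introduces dataset, unknown, none apparent, with a language tag assumed), format_flags returns both flags joined by "; ". A row shaped like the See, Hear, and Feel entry (self-collected eval data, unknown, no language tag assumed) returns an empty string, which matches the table. The other flags in the table (missing task-family or data-supervision tags) depend on fields not visible here and are left out of the sketch.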