Filter Audit for Multimodal Dataset Map

Generated: 2026-06-28

Scope and Rule

This audit checks every catalog row against the local professor summaries and the generated supplement summaries. A Language Annotation is an explicit annotation in the data itself; a paper's use of an LLM, VLM, or VLA does not count on its own. When language appears only in the method, title, or tags and the dataset evidence states no language annotations, the row is marked none apparent and flagged with a warning.
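The rule above can be sketched as a small function. This is a minimal illustration, not the audit's actual implementation: the names `AuditResult` and `classify_language_annotation` and both parameters are hypothetical. Only the warning text and the none apparent value are taken from the tables in this report.

```python
from dataclasses import dataclass, field
from typing import List

# Warning text as it appears in the Warnings column of this report.
LANGUAGE_SIGNAL_WARNING = (
    "language modality/method signal, but no explicit dataset "
    "language annotation evidence"
)


@dataclass
class AuditResult:
    """Hypothetical container for one row's audit outcome."""
    language_annotation: List[str]
    warnings: List[str] = field(default_factory=list)


def classify_language_annotation(explicit_annotations, language_signal_only):
    """Apply the audit rule to one catalog row (hypothetical helper).

    explicit_annotations: annotation types stated in the dataset evidence.
    language_signal_only: True when language shows up in the method,
        title, or tags of the row.
    """
    if explicit_annotations:
        # Explicit dataset evidence wins; drop duplicates, keep order.
        return AuditResult(list(dict.fromkeys(explicit_annotations)))
    # No explicit evidence: mark "none apparent" and flag the row only
    # when a language signal exists elsewhere.
    warnings = [LANGUAGE_SIGNAL_WARNING] if language_signal_only else []
    return AuditResult(["none apparent"], warnings)


flagged = classify_language_annotation([], language_signal_only=True)
print(flagged.language_annotation, flagged.warnings)
```

Under this sketch, a VLA paper whose dataset evidence lists no instructions or captions comes out as none apparent with the language-signal warning, while a row with no language signal at all comes out as none apparent with no language warning.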

Primary local sources reviewed for the rule: source_materials/professor_survey_raw/index.html, knowledge/brainstorm/grounding_dynamic_state_change_concepts_multimodal.html, and knowledge/search/language_multimodal_sensing_manipulation_2026_06_24.html.

Summary

Metric	Count
Total catalog rows	149
Professor Survey rows	91
Dataset-relevant rows	140
Introduced datasets/benchmarks/simulators	92
Open or partial data rows	87
Rows with explicit language annotations	45
Professor Survey rows with explicit language annotations	23
Language annotations + non-visual modality	40
Language annotations + dynamic concept	37

Source Counts

Source	Count
Professor Survey	91
Supplement	58

Language Annotation Counts

Value	Count
none apparent	104
task instructions / commands	23
captions / descriptions	17
property words	7
predicates / constraints	5
temporal phrases	2
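
The five annotation values above sum to 54, more than the 45 rows with explicit language annotations (149 total minus 104 none apparent). The likely reason is that one row can carry several values, written as `a; b` in the Language annotation column. A minimal sketch of that counting, with a hypothetical helper name and a few cells copied from this report:

```python
from collections import Counter


def count_annotation_values(cells):
    """Count each ';'-separated value across Language annotation cells."""
    counts = Counter()
    for cell in cells:
        for value in cell.split(";"):
            value = value.strip()
            if value:
                counts[value] += 1
    return counts


# Cells copied from four rows of this report.
sample_cells = [
    "captions / descriptions; property words",
    "task instructions / commands",
    "property words",
    "none apparent",
]
counts = count_annotation_values(sample_cells)
annotated_rows = sum(1 for cell in sample_cells if cell != "none apparent")
label_total = sum(n for value, n in counts.items() if value != "none apparent")
print(annotated_rows, label_total)  # 3 annotated rows carry 4 labels

# The same arithmetic on the counts reported in the table above.
reported = {
    "task instructions / commands": 23,
    "captions / descriptions": 17,
    "property words": 7,
    "predicates / constraints": 5,
    "temporal phrases": 2,
}
print(sum(reported.values()))  # 54 labels spread over 45 annotated rows
```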

Data Supervision Counts

Value	Count
demonstrations / trajectories	85
simulation labels	70
class labels	51
self-supervised pairs	50
property labels	33
temporal / event labels	26

Modality Counts

Value	Count
vision	134
language	80
proprioception	65
tactile	57
point cloud / 3D	53
audio	43
force	43
thermal	14

Dataset Role Counts

Value	Count
introduces dataset	75
uses existing datasets	27
self-collected eval data	21
introduces benchmark	12
introduces simulator	5
survey / review	5
sensor / foundation paper	4

Open Data Counts

Value	Count
open / public	61
not open	29
partial or indirect	26
unknown	24
not applicable	9
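
The summary metrics can be cross-checked against the breakdown tables. The sketch below hard-codes the counts reported above. The derivations it tests (for example, that "Open or partial data rows" equals open / public plus partial or indirect) are inferred from the numbers and are not stated by the audit itself.

```python
TOTAL_ROWS = 149
NONE_APPARENT = 104

source = {"Professor Survey": 91, "Supplement": 58}
dataset_role = {
    "introduces dataset": 75,
    "uses existing datasets": 27,
    "self-collected eval data": 21,
    "introduces benchmark": 12,
    "introduces simulator": 5,
    "survey / review": 5,
    "sensor / foundation paper": 4,
}
open_data = {
    "open / public": 61,
    "not open": 29,
    "partial or indirect": 26,
    "unknown": 24,
    "not applicable": 9,
}

introduced = sum(
    dataset_role[key]
    for key in ("introduces dataset", "introduces benchmark", "introduces simulator")
)

# Each entry maps a description to whether the reported numbers agree.
checks = {
    "source counts sum to total": sum(source.values()) == TOTAL_ROWS,
    "dataset-role counts sum to total": sum(dataset_role.values()) == TOTAL_ROWS,
    "open-data counts sum to total": sum(open_data.values()) == TOTAL_ROWS,
    "introduced datasets/benchmarks/simulators is 92": introduced == 92,
    "open or partial rows is 87":
        open_data["open / public"] + open_data["partial or indirect"] == 87,
    "explicit language annotation rows is 45": TOTAL_ROWS - NONE_APPARENT == 45,
}

for name, ok in checks.items():
    print("OK  " if ok else "FAIL", name)
```

All six checks pass on the reported numbers, which suggests Source, Dataset Role, and Open Data are single-valued per row, unlike the multi-valued annotation, supervision, and modality columns.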

Interpretation

The language-annotation count dropped compared with the earlier UI because ordinary class/property labels are no longer counted as language annotations.
The strongest language-annotated tactile cluster is TVL / TLV / Touch100k / Octopi / CLTP / AnyTouch / UniTouch.
Temporal language grounding remains sparse: few rows explicitly pair language with state-change or temporal phrases, and only 2 rows carry the temporal phrases value.
The professor brainstorm’s target gap remains visible: learned, reusable dynamic state-change predicates grounded in non-visual multimodal sensing are still weakly covered, especially for thermal/acoustic/force signals.

Professor Survey Rows With Language Annotations

Source	Title	Year	Dataset role	Open data	Language annotation	Evidence	Warnings
Professor Survey	A Touch, Vision, and Language Dataset for Multimodal Alignment	2024	introduces benchmark	unknown	captions / descriptions; property words	captions / descriptions: New TVL Benchmark: open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again…	introduced data artifact has unknown release status
Professor Survey	Any2Policy: Learning Visuomotor Policy with Any-Modality	2024	introduces benchmark	unknown	task instructions / commands	task instructions / commands: Every task is annotated with k = 5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations	introduced data artifact has unknown release status
Professor Survey	AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors	2025	uses existing datasets	open / public	property words	property words: AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities - pixel-level masked modeling for fine detail and semantic-…
Professor Survey	Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: A newly collected real-world multi-task dataset of 27K (text says ~26,866) robot trajectories spanning vision, touch, audio, proprioception (9-DoF IMU), and language instructions across t…
Professor Survey	Binding Touch to Everything: Learning Unified Multimodal Tactile Representations	2024	uses existing datasets	partial or indirect	captions / descriptions; property words	captions / descriptions: Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go. \| property …
Professor Survey	CLAP: Learning Audio Concepts From Natural Language Supervision	2022	uses existing datasets	unknown	captions / descriptions	captions / descriptions: CLAP is “CLIP for audio”: train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio-caption pairs to build a joint embedding space, which then does zero-shot …
Professor Survey	CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding	2025	uses existing datasets	unknown	captions / descriptions; property words	captions / descriptions: CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pr…
Professor Survey	Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection	2025	uses existing datasets	not open	predicates / constraints	predicates / constraints: ConSeg is trained on BridgeData V2 [64]: 10,181 trajectories / 219,356 images, with GPT-4o decomposing instructions into subgoals/constraints/object associations and Grounded SAM [53] + Seman…
Professor Survey	Demonstrating the Octopi-1.5 Visual-Tactile-Language Model	2025	uses existing datasets	partial or indirect	property words	property words: Octopi-1.5 is a Qwen2-VL-7B visual-tactile-language model that turns GelSight tactile-video frames into tokens, reasons about object properties (hardness, roughness, texture) in language, and adds a si…
Professor Survey	DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment	2024	uses existing datasets	not open	captions / descriptions; predicates / constraints	captions / descriptions: A 128 image-text pair fine-tuning set (5 fruit demos) for the VLM-FT variant. \| predicates / constraints: DoReMi makes the LLM emit not just a high-level plan but also, for each skill, a set o…
Professor Survey	Grounding Predicates through Actions	2022	uses existing datasets	not open	predicates / constraints	predicates / constraints: Trains a visual predicate classifier from weak supervision - just an action label per video - by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and …
Professor Survey	ImageBind: One Embedding Space To Bind Them All	2023	sensor / foundation paper	not applicable	captions / descriptions	captions / descriptions: Problem & motivation CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modal…
Professor Survey	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2024	uses existing datasets	not open	captions / descriptions	captions / descriptions: Evaluate on 15 benchmarks: video-text retrieval (MSR-VTT, MSVD, DiDeMo, ActivityNet)
Professor Survey	Learning Compositional Behaviors from Demonstration and Language (BLADE)	2024	uses existing datasets	partial or indirect	predicates / constraints	predicates / constraints: BLADE automatically recovers PDDL-style behavior abstractions (preconditions, effects, a contact-primitive body) from language-annotated demos by querying an LLM, learns visual classifiers f…
Professor Survey	Octopi: Object Property Reasoning with Large Tactile-Language Models	2024	introduces dataset	unknown	property words	property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b…	introduced data artifact has unknown release status
Professor Survey	PaLM-E: An Embodied Multimodal Language Model	2023	uses existing datasets	not open	captions / descriptions	captions / descriptions: General VL benchmarks: OK-VQA, VQA v2, COCO captioning.
Professor Survey	Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL	2024	uses existing datasets	unknown	task instructions / commands; predicates / constraints	task instructions / commands: An end-to-end PR2 cooking system that takes a natural-language recipe, converts it to a sequence of robot-interpretable cooking functions via few-shot GPT-4 prompting, complements the omi…
Professor Survey	Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot	2023	self-collected eval data	not open	captions / descriptions; temporal phrases	captions / descriptions: negative natural-language description of a heat-induced food change (e.g. \| temporal phrases: the time-series of that probability, smoothed and thresholded, becomes a recognizer for when the s…
Professor Survey	REFLECT: Summarizing Robot Experiences for Failure Explanation and CorrecTion	2023	uses existing datasets	unknown	captions / descriptions	captions / descriptions: REFLECT converts raw multisensory robot observations (RGB-D, audio, proprioception) into a three-level hierarchical text summary, then queries an LLM progressively to detect, localize, and exp…
Professor Survey	The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio	2025	uses existing datasets	partial or indirect	task instructions / commands	task instructions / commands: Table 1), each scored over 12 evaluations (4 language commands x 3 random locations).
Professor Survey	Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation	2024	introduces dataset	unknown	captions / descriptions; property words	captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus …	introduced data artifact has unknown release status
Professor Survey	Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset	2024	introduces dataset	unknown	captions / descriptions	captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m…	introduced data artifact has unknown release status
Professor Survey	VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation	2025	introduces benchmark	partial or indirect	task instructions / commands; captions / descriptions	task instructions / commands: CSI (CALVIN with Speech Instructions): CALVIN’s 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes \| captions / descriptions: SQA: 185K image-au…

Rows Needing Manual Follow-Up

Source	Title	Year	Dataset role	Open data	Language annotation	Evidence	Warnings
Professor Survey	A Touch, Vision, and Language Dataset for Multimodal Alignment	2024	introduces benchmark	unknown	captions / descriptions; property words	captions / descriptions: New TVL Benchmark: open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again…	introduced data artifact has unknown release status
Supplement	ABC-130k	2026	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	AgiBot World 2026	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: LeRobot v2.1: per-episode Parquet + MP4 for 9 image streams (top/left/right hand RGB, head depth, head fisheye x3, head stereo x2), joint pos/vel and EE actions, plus subtask/bbox/instruc…	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement	AIRoA MoMa Dataset	2025	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	ALOHA Static	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag
Professor Survey	Any2Policy: Learning Visuomotor Policy with Any-Modality	2024	introduces benchmark	unknown	task instructions / commands	task instructions / commands: Every task is annotated with k = 5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations	introduced data artifact has unknown release status
Professor Survey	Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation	2025	uses existing datasets	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	AudioCLIP: Extending CLIP to Image, Text and Audio	2021	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	BridgeData V2: A Dataset for Robot Learning at Scale	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation	2024	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization	2024	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation	2026	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	FMB (Functional Manipulation Benchmark)	2024	introduces benchmark	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation	2025	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	FTP-1 Dataset	2026	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no data-supervision tag; introduced data artifact has unknown release status
Supplement	Humanoid Everyday	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: Each trajectory aggregates egocentric and third-person RGB, depth, LiDAR point clouds, tactile, IMU, and proprioception at 30 Hz, with natural-language annotations, in LeRobot v2.0 format.	dataset-relevant row has no task-family tag
Supplement	In-flight Positional and Energy-Use Dataset of Package-Delivery Quadcopter UAVs	2021	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey	Inner Monologue: Embodied Reasoning through Planning with Language Models	2022	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring	2019	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Material Classification Using Active Temperature Controllable Robotic Gripper	2021	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	Meta-Transformer: A Unified Framework for Multimodal Learning	2023	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation	2024	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Supplement	MolmoAct Dataset	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: It uses a Franka arm with three RGB views (primary, secondary, wrist) and a 7-dim end-effector action space, in LeRobot format with per-episode language annotations.	dataset-relevant row has no task-family tag
Supplement	MolmoAct2 SO-100/SO-101 Dataset	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: A MolmoAct2 resource from Ai2 providing per-episode annotated language instructions for low-cost SO-100 and SO-101 arm data sourced from 1,220 community LeRobot repositories (377 users).	dataset-relevant row has no task-family tag
Professor Survey	Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training	2024	introduces dataset	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Octopi: Object Property Reasoning with Large Tactile-Language Models	2024	introduces dataset	unknown	property words	property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b…	introduced data artifact has unknown release status
Professor Survey	OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing	2026	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement	Open X-Embodiment: Robotic Learning Datasets and RT-X Models	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Supplement	RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots	2024	introduces simulator	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots	2026	introduces benchmark	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no modality tag
Supplement	RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence	2026	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement	RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation	2025	introduces benchmark	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	RoboSet (RoboAgent)	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins	2025	introduces benchmark	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no data-supervision tag
Supplement	SubT-MRS	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag
Professor Survey	Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation	2026	uses existing datasets	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization	2025	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement	TartanAviation	2024	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	TartanDrive	2022	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey	Taxim: An Example-based Simulation Model for GelSight Tactile Sensors	2021	introduces simulator	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	Teaching Physical Awareness to LLMs through Sounds	2025	uses existing datasets	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	The Sound of Water: Inferring Physical Properties from Pouring Liquids	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation	2024	introduces dataset	unknown	captions / descriptions; property words	captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus …	introduced data artifact has unknown release status
Professor Survey	Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset	2024	introduces dataset	unknown	captions / descriptions	captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m…	introduced data artifact has unknown release status
Professor Survey	Towards Forceful Robotic Foundation Models: a Literature Survey	2025	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation	2026	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	TrajAir	2021	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning	2024	uses existing datasets	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning	2025	introduces benchmark	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	What Foundation Models can Bring for Robot Learning in Manipulation: A Survey	2025	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	Wire Detection Dataset	2017	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence

Full Per-Row Audit

The CSV next to this report contains the full untruncated evidence and online/local links for all rows.
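
A minimal sketch of how that CSV might be read and narrowed to the rows needing follow-up. It assumes the CSV uses the same column headers as the tables in this report and that a row needs follow-up exactly when its Warnings cell is non-empty; both are inferences from the tables here, not documented behavior. The helper names are hypothetical, and two abridged rows stand in for the real file.

```python
import csv
import io


def load_audit_rows(handle, delimiter=","):
    """Read audit rows into dicts keyed by the header line."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def rows_needing_follow_up(rows):
    """Keep rows whose Warnings cell is non-empty (assumed criterion)."""
    return [row for row in rows if (row.get("Warnings") or "").strip()]


# Two rows abridged from this report; the second has no warning.
sample = io.StringIO(
    "Source,Title,Year,Dataset role,Open data,Language annotation,Evidence,Warnings\n"
    "Supplement,ALOHA Static,2023,introduces dataset,open / public,none apparent,"
    "No explicit language annotation is stated in the dataset/setup evidence.,"
    "dataset-relevant row has no task-family tag\n"
    "Professor Survey,Active Acoustic Sensing for Robot Manipulation,2023,"
    "self-collected eval data,partial or indirect,none apparent,"
    "No explicit language annotation is stated in the dataset/setup evidence.,\n"
)

rows = load_audit_rows(sample)
flagged = rows_needing_follow_up(rows)
print([row["Title"] for row in flagged])
```

For the real file, the `io.StringIO` sample would be replaced by a handle from `open(path, newline="", encoding="utf-8")`, where `path` is whatever the CSV is actually named; pass `delimiter="\t"` if the file turns out to be tab-separated like the tables in this report.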

Source	Title	Year	Dataset role	Open data	Language annotation	Evidence	Warnings
Professor Survey	A Touch, Vision, and Language Dataset for Multimodal Alignment	2024	introduces benchmark	unknown	captions / descriptions; property words	captions / descriptions: New TVL Benchmark: open-vocabulary 402-way tactile classification (top-1/top-5, tactile-vision and tactile-language) + a tactile-semantic description task scored 1-10 by text-only GPT-4 again…	introduced data artifact has unknown release status
Supplement	ABC-130k	2026	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Active Acoustic Sensing for Robot Manipulation	2023	self-collected eval data	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	AgiBot World 2026	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: LeRobot v2.1: per-episode Parquet + MP4 for 9 image streams (top/left/right hand RGB, head depth, head fisheye x3, head stereo x2), joint pos/vel and EE actions, plus subtask/bbox/instruc…	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement	AIRoA MoMa Dataset	2025	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	ALFA (AirLab Failure and Anomaly Dataset)	2020	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	ALOHA Static	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag
Professor Survey	Analyzing Material Recognition Performance of Thermal Tactile Sensing using a Large Materials Database and a Real Robot	2022	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Any2Policy: Learning Visuomotor Policy with Any-Modality	2024	introduces benchmark	unknown	task instructions / commands	task instructions / commands: Every task is annotated with k = 5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations	introduced data artifact has unknown release status
Professor Survey	AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors	2025	uses existing datasets	open / public	property words	property words: AnyTouch builds a sensor-agnostic visuo-tactile representation by training a shared encoder on tactile images and videos at two granularities - pixel-level masked modeling for fine detail and semantic-…
Professor Survey	Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation	2025	uses existing datasets	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	AudioCLIP: Extending CLIP to Image, Text and Audio	2021	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: A newly collected real-world multi-task dataset of 27K (text says ~26,866) robot trajectories spanning vision, touch, audio, proprioception (9-DoF IMU), and language instructions across t…
Professor Survey	Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Binding Touch to Everything: Learning Unified Multimodal Tactile Representations	2024	uses existing datasets	partial or indirect	captions / descriptions; property words	captions / descriptions: Tasks: material classification, grasp-stability prediction, ObjectFolder 2.0 cross-modal retrieval, touch-to-image generation on Touch and Go, Touch-LLM captioning on Touch and Go. \| property …
Supplement	BridgeData V2: A Dataset for Robot Learning at Scale	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation	2024	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	CALVIN	2022	introduces dataset	open / public	task instructions / commands	task instructions / commands: Simulated benchmark for long-horizon, language-conditioned Franka Panda manipulation: 4 envs, 34 tasks, 24h of play.
Professor Survey	CLAP: Learning Audio Concepts From Natural Language Supervision	2022	uses existing datasets	unknown	captions / descriptions	captions / descriptions: CLAP is “CLIP for audio”: train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio-caption pairs to build a joint embedding space, which then does zero-shot …
Professor Survey	CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding	2025	uses existing datasets	unknown	captions / descriptions; property words	captions / descriptions: CLTP aligns 3D contact-deformed tactile point clouds with natural-language descriptions of multidimensional contact state (shape, area, depth, position, texture) by distilling into a frozen pr…
Professor Survey	Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection	2025	uses existing datasets	not open	predicates / constraints	predicates / constraints: ConSeg is trained on BridgeData V2 [64]: 10,181 trajectories / 219,356 images, with GPT-4o decomposing instructions into subgoals/constraints/object associations and Grounded SAM [53] + Seman…
Professor Survey	Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization	2024	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Demonstrating the Octopi-1.5 Visual-Tactile-Language Model	2025	uses existing datasets	partial or indirect	property words	property words: Octopi-1.5 is a Qwen2-VL-7B visual-tactile-language model that turns GelSight tactile-video frames into tokens, reasons about object properties (hardness, roughness, texture) in language, and adds a si…
Supplement	DexMimicGen	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play	2023	uses existing datasets	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	DexYCB	2021	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor with Application to In-Hand Manipulation	2020	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment	2024	uses existing datasets	not open	captions / descriptions; predicates / constraints	captions / descriptions: A 128 image-text pair fine-tuning set (5 fruit demos) for the VLM-FT variant. \| predicates / constraints: DoReMi makes the LLM emit not just a high-level plan but also, for each skill, a set o…
Supplement	DreamDojo GR-1 Post-Training	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: DreamDojo-HV: human egocentric RGB videos (640x480) with GPT-derived language task annotations
Supplement	DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset	2024	introduces dataset	open / public	task instructions / commands	task instructions / commands: Every episode uses a standardized Franka Panda 7-DoF arm with two exterior ZED 2 stereo cameras and a wrist-mounted ZED Mini, recording RGB/stereo video, depth, joint and Cartesian propri…
Supplement	Ego-Exo4D	2023	introduces dataset	open / public	captions / descriptions	captions / descriptions: It contains 1,286.3 hours of video from 740 camera wearers across 13 cities and 123 scene contexts, with multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired l…
Supplement	Ego4D	2022	introduces dataset	open / public	captions / descriptions	captions / descriptions: Portions include audio, 3D environment meshes, eye gaze, stereo, multi-camera footage, IMU, and dense textual narrations, supporting five benchmark suites (episodic memory, hands-and-objects, …
Supplement	EgoDex	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: It pairs each frame with 3D pose annotations for the head, upper body, and hands (68 joints) via on-device tracking, plus camera intrinsics and natural-language task descriptions.
Supplement	EPIC-KITCHENS-100	2022	introduces dataset	open / public	captions / descriptions; temporal phrases	captions / descriptions: It provides ~90K fine-grained action segments with dense language narrations, plus optical flow, audio, object segmentation masks, and hand-object bounding boxes. \| temporal phrases: It provid…
Professor Survey	FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning	2026	self-collected eval data	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning	2025	self-collected eval data	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation	2026	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	FMB (Functional Manipulation Benchmark)	2024	introduces benchmark	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	FoAR: Force-Aware Reactive Policy for Contact-Rich Robotic Manipulation	2025	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation	2025	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	FTP-1 Dataset	2026	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	FurnitureBench	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	Galaxea Open-World Dataset	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: 500+ hours of real-world mobile bimanual manipulation on the Galaxea R1 Lite robot with subtask language annotations.
Professor Survey	GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force	2017	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning	2021	introduces benchmark	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Grounding Predicates through Actions	2022	uses existing datasets	not open	predicates / constraints	predicates / constraints: Trains a visual predicate classifier from weak supervision - just an action label per video - by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and …
Professor Survey	Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation	2024	self-collected eval data	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	HIW-500: Humanoids In-the-Wild	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: Each episode records synchronized head (stereo RGB) and wrist (RGB + stereo IR) cameras, 29-DoF joint states, end-effector state, IMU, odometry, and language annotations.
Supplement	Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no data-supervision tag; introduced data artifact has unknown release status
Supplement	Humanoid Everyday	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: Each trajectory aggregates egocentric and third-person RGB, depth, LiDAR point clouds, tactile, IMU, and proprioception at 30 Hz, with natural-language annotations, in LeRobot v2.0 format.	dataset-relevant row has no task-family tag
Supplement	HumanPlus	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Hybrid Position/Force Control of Manipulators	1981	sensor / foundation paper	not applicable	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	ImageBind: One Embedding Space To Bind Them All	2023	sensor / foundation paper	not applicable	captions / descriptions	captions / descriptions: Problem & motivation: CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modal…
Supplement	In-flight Positional and Energy-Use Dataset of Package-Delivery Quadcopter UAVs	2021	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Professor Survey	Inner Monologue: Embodied Reasoning through Planning with Language Models	2022	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Interactive Perception: Leveraging Action in Perception and Perception in Action	2017	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.
Professor Survey	Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	Language-Table	2022	introduces dataset	open / public	task instructions / commands	task instructions / commands: Google’s large language-conditioned tabletop block-manipulation dataset: ~442K real + ~181K sim xArm6 trajectories, plus a sim benchmark.
Professor Survey	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2024	uses existing datasets	not open	captions / descriptions	captions / descriptions: Evaluate on 15 benchmarks: video-text retrieval (MSR-VTT, MSVD, DiDeMo, ActivityNet)
Professor Survey	Learning Compositional Behaviors from Demonstration and Language (BLADE)	2024	uses existing datasets	partial or indirect	predicates / constraints	predicates / constraints: BLADE automatically recovers PDDL-style behavior abstractions (preconditions, effects, a contact-primitive body) from language-annotated demos by querying an LLM, learns visual classifiers f…
Supplement	LeRobot Shirt-Folding Dataset	2026	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning	2023	introduces benchmark	open / public	task instructions / commands	task instructions / commands: A language-conditioned lifelong robot learning benchmark with four task suites, 130 tasks, and human teleoperated demonstrations.
Professor Survey	Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos	2022	introduces dataset	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring	2019	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks	2019	uses existing datasets	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	ManipArena	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: Demonstrations are recorded on 5 robot platforms with 3 synchronized RGB cameras (one overhead + two wrist), 56-D proprioception (joint positions/velocities/currents), gripper and mobile-…
Supplement	ManiSkill2	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data	2024	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Material Classification Using Active Temperature Controllable Robotic Gripper	2021	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	Material Recognition via Heat Transfer Given Ambiguous Initial Conditions	2020	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Meta-Transformer: A Unified Framework for Multimodal Learning	2023	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	MimicGen	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation	2024	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Supplement	Mobile ALOHA	2024	introduces dataset	open / public	task instructions / commands	task instructions / commands: The TFDS/Open X release contains 276 episodes with 3 RGB cameras (overhead + two wrist cameras at 480x640), a 14-dim state, a 16-dim action, and per-step language instructions.
Supplement	MolmoAct Dataset	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: It uses a Franka arm with three RGB views (primary, secondary, wrist) and a 7-dim end-effector action space, in LeRobot format with per-episode language annotations.	dataset-relevant row has no task-family tag
Supplement	MolmoAct2 Bimanual YAM Dataset	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: Each episode provides three RGB camera views (left, right, top) plus 14-dim joint/gripper states, in LeRobot format with per-episode annotated language instructions.
Supplement	MolmoAct2 SO-100/SO-101 Dataset	2026	introduces dataset	open / public	task instructions / commands	task instructions / commands: A MolmoAct2 resource from Ai2 providing per-episode annotated language instructions for low-cost SO-100 and SO-101 arm data sourced from 1,220 community LeRobot repositories (377 users).	dataset-relevant row has no task-family tag
Professor Survey	Multimodal Detection and Identification of Robot Manipulation Failures (FINO-Net)	2023	introduces dataset	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training	2024	introduces dataset	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	NVIDIA GR00T X-Embodiment Sim	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: Multiple per-embodiment/per-task LeRobot datasets (data/meta/videos) -> episodes -> steps -> {observation (rgb images + state/proprioception), action, language task annotation}
Professor Survey	ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer	2022	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations	2021	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Objects that Sound	2018	uses existing datasets	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Octopi: Object Property Reasoning with Large Tactile-Language Models	2024	introduces dataset	unknown	property words	property words: Octopi bolts a GelSight tactile encoder onto a Vicuna LLM (via a CLIP visual backbone + a LLaVA-style projection module) so that a vision-language model can feel - predicting hardness, roughness, and b…	introduced data artifact has unknown release status
Professor Survey	OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing	2026	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement	Open X-Embodiment: Robotic Learning Datasets and RT-X Models	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	PaLM-E: An Embodied Multimodal Language Model	2023	uses existing datasets	not open	captions / descriptions	captions / descriptions: General VL benchmarks: OK-VQA, VQA v2, COCO captioning.
Professor Survey	Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning	2022	self-collected eval data	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation	2025	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL	2024	uses existing datasets	unknown	task instructions / commands; predicates / constraints	task instructions / commands: An end-to-end PR2 cooking system that takes a natural-language recipe, converts it to a sequence of robot-interpretable cooking functions via few-shot GPT-4 prompting, complements the omi…
Professor Survey	Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot	2023	self-collected eval data	not open	captions / descriptions; temporal phrases	captions / descriptions: negative natural-language description of a heat-induced food change (e.g. \| temporal phrases: the time-series of that probability, smoothed and thresholded, becomes a recognizer for when the s…
Professor Survey	REFLECT: Summarizing Robot Experiences for Failure Explanation and CorrecTion	2023	uses existing datasets	unknown	captions / descriptions	captions / descriptions: REFLECT converts raw multisensory robot observations (RGB-D, audio, proprioception) into a three-level hierarchical text summary, then queries an LLM progressively to detect, localize, and exp…
Supplement	RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot	2023	introduces dataset	open / public	captions / descriptions	captions / descriptions: Each sequence includes visual, force, audio, and action information, plus a human demonstration video and language description.
Supplement	RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots	2024	introduces simulator	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots	2026	introduces benchmark	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no modality tag
Supplement	robomimic	2021	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence	2026	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Supplement	RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation	2025	introduces benchmark	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	RoboNet	2019	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	RoboSet (RoboAgent)	2023	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	Robotic Interestingness Dataset (SubTF)	2020	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins	2025	introduces benchmark	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no data-supervision tag
Supplement	RT-1 Robot Action Dataset	2022	introduces dataset	open / public	task instructions / commands	task instructions / commands: Each step pairs an RGB image and natural-language instruction with a discretized arm+base action, plus success/feasible/undesirable labels, stored in RLDS/TFDS format.
Professor Survey	See to Touch: Learning Tactile Dexterity through Visual Incentives	2023	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation	2022	self-collected eval data	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	SonicSense: Object Perception from In-Hand Acoustic Vibration	2024	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning	2022	introduces simulator	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Sparsh: Self-supervised touch representations for vision-based tactile sensing	2024	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	SubT-MRS	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag
Professor Survey	TacEx: GelSight Tactile Simulation in Isaac Sim – Combining Soft-Body and Visuotactile Simulators	2024	introduces simulator	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation	2025	sensor / foundation paper	not applicable	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation	2025	self-collected eval data	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation	2026	uses existing datasets	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization	2025	introduces dataset	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; introduced data artifact has unknown release status
Professor Survey	TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors	2022	introduces simulator	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	TartanAir	2020	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	TartanAir V2	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	TartanAviation	2024	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	TartanDrive	2022	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	dataset-relevant row has no task-family tag; dataset-relevant row has no data-supervision tag
Supplement	TartanDrive 2.0	2024	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Taxim: An Example-based Simulation Model for GelSight Tactile Sensors	2021	introduces simulator	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	Teaching Physical Awareness to LLMs through Sounds	2025	uses existing datasets	partial or indirect	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation	2022	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects	2023	introduces benchmark	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	The Sound of Pixels	2018	sensor / foundation paper	not applicable	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio	2025	uses existing datasets	partial or indirect	task instructions / commands	task instructions / commands: … Table 1), each scored over 12 evaluations (4 language commands x 3 random locations).
Professor Survey	The Sound of Water: Inferring Physical Properties from Pouring Liquids	2025	introduces dataset	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Supplement	TLA: Tactile-Language-Action Model for Contact-Rich Manipulation	2025	introduces dataset	open / public	task instructions / commands	task instructions / commands: This is a direct fit for language-conditioned tactile manipulation.
Professor Survey	Touch and Go: Learning from Human-Collected Vision and Touch	2022	introduces dataset	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Touch100k: A Large-Scale Touch-Language-Vision Dataset for Touch-Centric Multimodal Representation	2024	introduces dataset	unknown	captions / descriptions; property words	captions / descriptions: Touch100k is the first ~100k-scale paired touch-language-vision dataset where GelSight tactile observations are annotated with GPT-4V-generated multi-granularity language (full sentences plus …	introduced data artifact has unknown release status
Professor Survey	Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset	2024	introduces dataset	unknown	captions / descriptions	captions / descriptions: TLV is the first touch-language-vision dataset with sentence-level (not just lexical-label) tactile descriptions - ~20K GelSight-touch / RGB-vision pairs auto-captioned by GPT-4V via a human-m…	introduced data artifact has unknown release status
Professor Survey	Towards Forceful Robotic Foundation Models: a Literature Survey	2025	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation	2026	uses existing datasets	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	TrajAir	2021	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks	2024	uses existing datasets	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	UniT: Data Efficient Tactile Representation with Generalization to Unseen Objects	2025	self-collected eval data	partial or indirect	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Variable Impedance Control and Learning – A Review	2020	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.
Professor Survey	VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation	2026	introduces benchmark	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	Visually Indicated Sounds	2016	uses existing datasets	not open	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Professor Survey	VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning	2024	uses existing datasets	unknown	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence; dataset-relevant row has no task-family tag
Professor Survey	VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback	2025	self-collected eval data	not open	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Professor Survey	VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation	2025	introduces benchmark	partial or indirect	task instructions / commands; captions / descriptions	task instructions / commands: CSI (CALVIN with Speech Instructions): CALVIN’s 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes | captions / descriptions: SQA: 185K image-au…
Professor Survey	VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning	2025	introduces benchmark	unknown	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.	introduced data artifact has unknown release status
Professor Survey	What Foundation Models can Bring for Robot Learning in Manipulation: A Survey	2025	survey / review	not applicable	none apparent	none apparent: Survey/review paper; no paper-specific dataset language annotations are reported.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	Wire Detection Dataset	2017	introduces dataset	open / public	none apparent	none apparent: Language appears in the method/title/tags, but the dataset evidence does not show explicit language annotations.	language modality/method signal, but no explicit dataset language annotation evidence
Supplement	WIT-UAS (Wildland-fire Infrared Thermal UAS Dataset)	2023	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
Supplement	Yamaha-CMU Off-Road Dataset (YCOR)	2018	introduces dataset	open / public	none apparent	none apparent: No explicit language annotation is stated in the dataset/setup evidence.
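
The rows above can be cross-checked against the summary counts with a small script. The following is a minimal sketch, not the audit's own tooling: the column names, the helper names parse_row and tally, and the third sample row are assumptions made for illustration. It assumes each row is tab-separated, that the flags cell is optional, and that multi-valued cells use "; " as the separator, as the rows above appear to.

```python
from collections import Counter

# Assumed column names, inferred from the visible row layout (not from the audit's header).
COLUMNS = ["source", "title", "year", "role", "open_data",
           "language_annotation", "evidence", "flags"]


def parse_row(line):
    """Split one tab-separated audit row into a dict; the flags cell may be absent."""
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (len(COLUMNS) - len(cells))  # pad a missing trailing flags cell
    row = dict(zip(COLUMNS, cells))
    # Multi-valued cells are split on ";" and stripped of surrounding whitespace.
    for key in ("language_annotation", "flags"):
        row[key] = [v.strip() for v in row[key].split(";") if v.strip()]
    return row


def tally(rows, key):
    """Count every value of a multi-valued column across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(row[key])
    return counts


NO_LANG = ("none apparent: No explicit language annotation is stated "
           "in the dataset/setup evidence.")

sample_lines = [
    # Two rows copied from the table above.
    "\t".join(["Supplement", "TartanAir", "2020", "introduces dataset",
               "open / public", "none apparent", NO_LANG]),
    "\t".join(["Professor Survey",
               "Taxim: An Example-based Simulation Model for GelSight Tactile Sensors",
               "2021", "introduces simulator", "unknown", "none apparent", NO_LANG,
               "introduced data artifact has unknown release status"]),
    # Hypothetical row, included only to show a multi-valued annotation cell.
    "\t".join(["Supplement", "Example Dataset (hypothetical)", "2024",
               "introduces dataset", "unknown",
               "captions / descriptions; property words",
               "captions / descriptions: placeholder evidence"]),
]

rows = [parse_row(line) for line in sample_lines]
annotation_counts = tally(rows, "language_annotation")
flag_counts = tally(rows, "flags")
print(annotation_counts)
print(flag_counts)
```

Because a row can carry more than one annotation value, per-value tallies computed this way may sum to more than the number of rows. That matches the Language Annotation Counts table, whose values sum to more than the 149 catalog rows.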

Dongyu Luo