Abstract
Robotic agents rely on multimodal perception to understand and interact with their environments. Recent foundation models with open-vocabulary semantics promise task-agnostic generalization across natural language processing and computer vision tasks. Yet for robotic applications, two critical gaps remain: enabling perception systems to process long-form temporal input and maintain spatial awareness efficiently and reliably, and leveraging such perception to support generalized navigation and manipulation. To address these gaps, we leverage vision-language foundation models to tackle long-form multimodal video understanding, scaling beyond short clips while mitigating hallucination. We further introduce a generalizable, near-zero-shot harmonic mobile manipulation agent that tightly couples base and arm control and requires minimal task- or environment-specific fine-tuning.
Our research advances reliable and efficient robot learning along four fronts: (1) we enable foundation model–based agents to abstract over and process hour-long multimodal videos by integrating explicit, interpretable tools, which mitigates hallucination; (2) we pair LLM instruction-tuning with redundant-token pruning to shorten sequences, reduce latency, and sometimes improve accuracy; (3) we build an open-vocabulary navigation system that reasons about affordances and task feasibility while self-correcting visual perception in an annotation-free manner; and (4) we design a harmonic mobile manipulation agent that jointly optimizes base-gripper motion for tightly coordinated control.
Bio
Hung-Ting Su is a postdoctoral researcher in multi-modal AI and robotics with 8 years of hands-on research
experience. He has led teams of PhD and master’s students on vision-language and embodied-AI
projects. His current work extends the reasoning limits of LLMs/VLMs, applying them to hour-long video
understanding and mobile manipulation. He has published 30+ peer-reviewed papers with 900+ citations.