Research Center for Information Technology Innovation, Academia Sinica

Abstract

Conventional computer vision tasks rely on visual input, categorical labels, and bounding boxes. A recent trend is to consider multimodal inputs that contain images, videos, scene graphs, and captions. Multimodal learning, e.g., OpenAI’s CLIP (Contrastive Language–Image Pre-training), has received a lot of attention. Although large language models (LLMs) have impressive performance on multimodal benchmarks, their black-box nature and high training and inference costs are major challenges. Over the last three years, my lab at USC has developed multimodal learning algorithms grounded in the Green Learning principle. They are interpretable and efficient: they cluster heterogeneous multimodal data into homogeneous subgroups and then establish a mechanism to connect visual and textual data. The modular design yields intermediate results with semantic meaning, and the final decisions are made by aggregating conditional probabilities. Furthermore, subgroup inference eliminates the need to train complex large models that handle heterogeneous data simultaneously. I will illustrate these points using three examples: human-object-interaction (HOI) detection, image-text retrieval, and video-text retrieval.
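As a rough illustration of what "aggregating conditional probabilities" over subgroups can look like, the sketch below applies the law of total probability, P(y | x) = sum over k of P(g=k | x) * P(y | x, g=k). It is a generic toy example and not the speaker's method: the distance-based soft assignment, the names `soft_assign` and `aggregate`, the centroids, and the per-subgroup probabilities are all hypothetical values chosen for illustration.

```python
import numpy as np

def soft_assign(x, centroids, temperature=1.0):
    """P(g=k | x): softmax over negative squared distances to subgroup centroids.

    Assumption: subgroup membership is modeled by distance to a centroid.
    """
    d2 = np.sum((centroids - x) ** 2, axis=1)
    logits = -d2 / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def aggregate(p_group, p_label_given_group):
    """P(y | x) = sum_k P(g=k | x) * P(y | x, g=k)."""
    return p_group @ p_label_given_group

# Hypothetical setup: two subgroups in a 2-D feature space.
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

# Hypothetical outputs of two small per-subgroup models for one query.
# Rows are subgroups, columns are labels; each row sums to 1.
p_label_given_group = np.array([[0.9, 0.1],
                                [0.2, 0.8]])

x = np.array([0.5, 0.2])
p_group = soft_assign(x, centroids)
p_label = aggregate(p_group, p_label_given_group)

print("P(subgroup | x):", p_group)
print("P(label | x):   ", p_label)
```

Both intermediate quantities, the subgroup weights and the per-subgroup label probabilities, can be inspected directly, which is one sense in which a modular pipeline of this kind exposes intermediate results.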

Bio

Dr. C.-C. Jay Kuo received his Ph.D. from the Massachusetts Institute of Technology in 1987. He is now with the University of Southern California (USC) as the Ming Hsieh Chair Professor, a Distinguished Professor of Electrical and Computer Engineering and Computer Science, and the Director of the Media Communications Laboratory. His research interests are in visual computing and communication. He is a Fellow of AAAS, ACM, IEEE, NAI, and SPIE and an Academician of Academia Sinica. Dr. Kuo has received several awards for his research contributions, including the 2010 Electronic Imaging Scientist of the Year Award, the 2010-11 Fulbright-Nokia Distinguished Chair in Information and Communications Technologies, the 2019 IEEE Computer Society Edward J. McCluskey Technical Achievement Award, the 2019 IEEE Signal Processing Society Claude Shannon-Harry Nyquist Technical Achievement Award, the 72nd annual Technology and Engineering Emmy Award (2020), and the 2021 IEEE Circuits and Systems Society Charles A. Desoer Technical Achievement Award. Dr. Kuo was the Editor-in-Chief of the IEEE Transactions on Information Forensics and Security (2012-2014) and the Journal of Visual Communication and Image Representation (1997-2011). He is currently the Editor-in-Chief of the APSIPA Transactions on Signal and Information Processing (2022-2025). He has guided 181 students to their Ph.D. degrees and supervised 31 postdoctoral research fellows.

Intelligent Internet of Things Thematic Center

Academic Lecture

Interpretable and Efficient Multimodal Learning