AICC

Explainable End-to-End Neural Networks for Far-Field Conversation Recognition


  • Speaker: Dr. Shinji Watanabe
  • Date: 2023/12/21 (Thu.) 10:30–12:30
  • Venue: Auditorium 122, Research Center for Information Technology Innovation (資創中心); also available via video conference
  • Host: Dr. Yu Tsao (曹昱)
  • Online meeting link: https://asmeet.webex.com/asmeet/j.php?MTID=m52e51f4c1cda1c678577350088c2bcbc
  • Meeting number: 2511 645 1670
  • Password: W7RyA2hXDT5
Abstract
This presentation introduces some of our group's attempts at integrating various speech processing modules into a single end-to-end neural network while maintaining explainability. We will focus on far-field conversation recognition as an example and show how to unify automatic speech recognition, denoising, dereverberation, separation, and localization. We will also introduce our latest techniques for combining self-supervised learning, careful pre-training/fine-tuning strategies, and multi-task learning within our integrated network. This work achieved the best performance reported in the literature on several noisy, reverberant speech recognition benchmarks, reaching clean-speech recognition performance.
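To make the idea of an integrated yet explainable pipeline concrete, below is a minimal PyTorch sketch, not the speaker's actual system: the class name JointEnhanceASR, the module sizes, and the loss weighting are illustrative assumptions. It joins a mask-based enhancement front-end to a CTC recognition back-end and trains both with a multi-task loss, so the intermediate enhanced spectrogram stays inspectable.

```python
import torch
import torch.nn as nn

class JointEnhanceASR(nn.Module):
    """Hypothetical joint front-end/back-end model: a mask-based enhancement
    front-end feeding a CTC ASR encoder, trained with a multi-task loss."""

    def __init__(self, n_freq=257, hidden=256, vocab_size=500, mtl_weight=0.3):
        super().__init__()
        # Enhancement front-end: predicts a time-frequency mask in [0, 1].
        self.enhance = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),
        )
        # ASR back-end: BLSTM encoder with a CTC output layer.
        self.encoder = nn.LSTM(n_freq, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ctc_out = nn.Linear(2 * hidden, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.mtl_weight = mtl_weight

    def forward(self, noisy_spec, clean_spec, tokens, feat_lens, token_lens):
        mask = self.enhance(noisy_spec)           # (B, T, F) mask
        enhanced = mask * noisy_spec              # denoised spectrogram (inspectable)
        enc, _ = self.encoder(enhanced)           # (B, T, 2*hidden)
        log_probs = self.ctc_out(enc).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        asr_loss = self.ctc_loss(log_probs, tokens, feat_lens, token_lens)
        enh_loss = nn.functional.mse_loss(enhanced, clean_spec)
        # Multi-task objective: the enhancement term keeps the front-end
        # interpretable, the CTC term drives end-to-end recognition accuracy.
        return asr_loss + self.mtl_weight * enh_loss, enhanced
```

In the systems discussed in the talk, the front-end would additionally handle dereverberation, separation, and localization, and the back-end would typically be a stronger attention/CTC model initialized from self-supervised pre-training; this sketch only illustrates the joint-training and explainability idea.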
Bio
Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Before joining Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 400 papers in peer-reviewed journals and conferences and has received several awards, including the best paper award at IEEE ASRU 2019. He is a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE and ISCA Fellow.