Abstract
Online link: https://asmeet.webex.com/asmeet/j.php?MTID=m5ad15466537e6e515ba32cf214597dde
Meeting number: 2514 833 9682
Password: ZcyE4J7Eh56
We propose to use the Transformer-Transducer (T-T) for streaming end-to-end (E2E) speech translation (ST). Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed model enjoys low inference latency and computational cost while approaching the quality of cascaded ST. We then extend it to build a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into the text of a target language. We further extend SM2 to multiple output languages, which adds only a small number of parameters to the original model and enables truly zero-shot capability for unseen {source-speech, target-text} pairs. A non-erasing decoding method will also be introduced, which solves the stability issue for online streaming systems. Finally, the model is further improved by simultaneously generating automatic speech recognition (ASR) and ST results.
Bio
Dr. Jinyu Li earned his Ph.D. from the Georgia Institute of Technology in 2008. From 2000 to 2003, he was a Researcher at the Intel China Research Center and a Research Manager at iFlytek, China. He joined Microsoft in 2008 and now serves as Partner Applied Science Manager, leading a dynamic team dedicated to designing and enhancing speech modeling algorithms and technologies, with the aim of ensuring that Microsoft products maintain cutting-edge quality within the industry. His research areas include end-to-end modeling for speech recognition and speech translation, deep learning, acoustic modeling, and noise robustness. He has been a member of the IEEE Speech and Language Processing Technical Committee since 2017 and served as an Associate Editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing from 2015 to 2020. He was named an Industrial Distinguished Leader by the Asia-Pacific Signal and Information Processing Association (APSIPA) in 2021 and received the APSIPA Sadaoki Furui Prize Paper Award in 2023.