AICC

Towards Universal Audio Intelligence: From Unified Generation to Precise Separation


  • Speaker : Dr. Yi-Chiao Wu (吳宜樵)
  • Date : 2026/02/11 (Wed.) 10:30~12:30
  • Venue : Auditorium 122, Research Center for Information Technology Innovation (資創中心); also available via videoconference
  • Host : Yu Tsao (曹昱)
[Google Meet] Link: click here

Abstract

Achieving Universal Audio Intelligence requires a system that possesses the duality of a world model: the ability to simulate the physical properties of sound (Generation) and the capacity to deconstruct complex auditory scenes (Perception). In this talk, I present a unified research roadmap based on Flow Matching and Diffusion Transformers (DiT) that bridges these two critical capabilities.
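As background for readers unfamiliar with flow matching, the following is a minimal sketch of the standard conditional flow-matching objective (generic notation, not necessarily the exact formulation used in the speaker's models; the conditioning variable $c$ is illustrative): a velocity field $v_\theta$ is trained along straight-line paths between noise $x_0 \sim \mathcal{N}(0, I)$ and data $x_1$,

    x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2,

and a sample is then synthesized by integrating $\dot{x}_t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$, with text or voice prompts supplied through $c$.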
First, I will introduce Audiobox, a foundation model that breaks down the traditional silos between speech, music, and sound effects. I will demonstrate how we achieve controllable, open-domain synthesis using natural language and voice prompts within a single unified framework. Next, I will discuss scaling this capability to multimodal and long-context settings with Movie Gen Audio. I will illustrate how we leveraged an automated aesthetic data engine to scale the model to 13B parameters, achieving high-fidelity video-to-audio synchronization and semantic alignment.
Finally, to close the loop between creation and perception, I will present SAM Audio. Treating separation as the inverse problem of generation, we adapt the "Segment Anything" paradigm to the auditory domain. I will show how SAM Audio utilizes the same architectural backbone to perform precise, zero-shot separation of sound objects via text, visual, and temporal prompts. Together, these works illustrate a path toward a unified, scalable, and interactive audio intelligence capable of both generating and reasoning about the auditory world.
Bio
Yi-Chiao Wu was a Research Scientist specializing in audio and speech generation at Meta FAIR. He has been a core contributor to several state-of-the-art foundation models, including Audiobox (Unified Audio Generation), Movie Gen (Video-to-Audio Synthesis), and SAM Audio (Universal Audio Separation). His research focuses on building scalable systems that unify speech, music, and sound processing, leveraging flow matching and diffusion transformers to bridge the gap between generative synthesis and semantic perception.
Prior to FAIR, he was a Research Scientist at Meta Reality Labs (Codec Avatars), where he focused on neural audio codecs and spatial audio rendering for immersive virtual reality. Before joining Meta, he worked at Academia Sinica, ASUS, and Realtek, focusing on voice conversion, speech enhancement, speaker identification, and audio systems. He received his Ph.D. from Nagoya University (2021), where his research centered on neural vocoders and voice conversion. With over a decade of experience in audio signal processing and machine learning, Yi-Chiao brings a deep understanding of the full audio stack.