In this talk, I will introduce a raw waveform generation model named Quasi-Periodic WaveNet (QPNet), which adopts a fundamental frequency (F0)-dependent adaptive architecture with a pitch-dependent dilated convolution neural network (PDCNN) to improve the F0 controllability of vanilla WaveNet (WN). As a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high performance in high-fidelity audio waveform generation. However, its purely data-driven nature and its lack of prior knowledge about audio signals degrade its F0 controllability. In particular, WN struggles to precisely generate the periodic components of audio signals when the given auxiliary F0 features fall outside the F0 range observed in the training data. To address this problem, QPNet introduces two novel designs. First, the PDCNN component dynamically changes the length of WN's receptive field according to the given auxiliary F0 features. Second, a cascade network structure jointly models the long- and short-term dependencies of quasi-periodic signals such as speech. The model is evaluated on single-tone sinusoid and speech generation tasks. The experimental results confirm the effectiveness of the PDCNN for unseen auxiliary F0 features and of the cascade structure for speech generation.
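To make the pitch-dependent dilation idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the talk's actual implementation: I assume a kernel size of 2, a per-sample dilation of the form E_t * d with E_t = fs / (f0_t * a) for a "dense factor" a, a fallback dilation of 1 for unvoiced samples (f0 = 0), and hypothetical names (`pitch_dependent_dilations`, `pdconv1d`, `dense_factor`) that do not come from the source.

```python
import numpy as np

def pitch_dependent_dilations(f0, fs=16000, dense_factor=4, base_dilation=1):
    """Per-sample dilation size round(E_t * d), where E_t = fs / (f0_t * dense_factor).

    Assumed formulation: higher F0 -> shorter pitch period -> smaller dilation.
    Unvoiced samples (f0 == 0) fall back to E_t = 1. All names are illustrative.
    """
    e = np.where(f0 > 0, fs / (np.maximum(f0, 1e-8) * dense_factor), 1.0)
    return np.maximum(np.rint(e * base_dilation).astype(int), 1)

def pdconv1d(x, f0, w_cur, w_prev, fs=16000, dense_factor=4, base_dilation=1):
    """Causal pitch-dependent dilated convolution with kernel size 2:

        y[t] = w_cur * x[t] + w_prev * x[t - d_t]

    where the dilation d_t varies per sample with the auxiliary F0, so the
    effective receptive field stretches or shrinks with the pitch period.
    """
    d = pitch_dependent_dilations(f0, fs, dense_factor, base_dilation)
    t = np.arange(len(x))
    # Gather the pitch-shifted past sample; treat out-of-range taps as zeros.
    prev = np.where(t - d >= 0, x[np.maximum(t - d, 0)], 0.0)
    return w_cur * x + w_prev * prev
```

For example, with fs = 16000 Hz, f0 = 200 Hz, and dense_factor = 4, the dilation is 16000 / (200 * 4) = 20 samples, i.e. the second tap always looks one quarter of a pitch period back, regardless of the actual F0 value.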
YI-CHIAO WU received his B.S. and M.S. degrees in engineering from the School of Communication Engineering, National Chiao Tung University, in 2009 and 2011, respectively. He worked at Realtek, ASUS, and Academia Sinica for five years. He is currently pursuing a Ph.D. degree at the Graduate School of Informatics, Nagoya University. His research focuses on speech generation applications based on machine learning methods, such as voice conversion and speech enhancement.