New progress made in speech synthesis by the National & Local Joint Engineering Research Center of Intelligent Information Processing for Mongolian - Inner Mongolia University


Recently, the National & Local Joint Engineering Research Center of Intelligent Information Processing for Mongolian (Inner Mongolia Key Laboratory of Mongolian Information Processing Technology), in collaboration with Prof. Li Haizhou's research team at the Chinese University of Hong Kong, published a research article titled "Decoding Knowledge Transfer for Neural Text-to-Speech Training" (DOI: 10.1109/TASLP.2022.3171974) in IEEE/ACM Transactions on Audio, Speech, and Language Processing, a top international journal for audio, acoustics and language processing. The journal is ranked as a top journal in Section 1 of SCI indexing by the Chinese Academy of Sciences and is a Class A journal in Tsinghua University's recommended list of computer science journals. Its impact factor is 3.919.

The article addresses robustness and expressiveness modeling in speech synthesis and proposes a multi-teacher knowledge distillation (MT-KD) network for the Tacotron2 TTS model. Liu Rui, a researcher at the College of Computer Science of Inner Mongolia University (IMU), is the first author, and Prof. Gao Guanglai is the corresponding author. IMU is the affiliation of both the first author and the corresponding author.

The main purpose of speech synthesis is to transform text into high-quality speech. End-to-end text-to-speech (TTS) based on an encoder-decoder structure achieves excellent synthesis performance and has become a mainstream approach. However, the decoder of an end-to-end TTS model decodes differently during training and inference, and this mismatch degrades robustness and expressiveness. To improve the robustness and expressiveness of the end-to-end TTS model, the article proposes a speech modeling method based on multi-teacher knowledge distillation (MT-KD) learning. The whole system consists of two teacher models and one student model. The two teacher models adopt two different decoding mechanisms, teacher forcing and scheduled sampling respectively, and can output natural and stable speech parameters. The student model decodes in free-running mode during inference. During multi-teacher knowledge distillation, the MT-KD loss functions are added to the training objective so that the knowledge of the teacher models guides the output of the student model. After training, the student model can be used directly at inference to output stable and reliable speech parameters, from which the speech is generated. The experiments show that, compared with a conventional end-to-end TTS system, MT-KD synthesizes more robust and more expressive speech.
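The three decoding modes and the idea of a combined distillation loss can be illustrated with a small sketch. This is not the authors' implementation: the scalar "frames", the `toy_step` decoder, the mean-squared-error distance, and the weights `w_tf` and `w_ss` are all assumptions chosen for illustration, and the loss terms actually used in the article may differ.

```python
import random


def toy_step(prev_frame, t):
    """Hypothetical one-step decoder: predicts a frame from the previous one."""
    return 0.5 * prev_frame + 1.0


def decode(step_fn, targets, mode, sampling_prob=0.5, rng=None):
    """Run a toy autoregressive decoder for len(targets) steps.

    mode selects what is fed back as the previous frame:
      "teacher_forcing"    -> always the ground-truth frame
      "scheduled_sampling" -> the model's own prediction with probability
                              sampling_prob, otherwise the ground-truth frame
      "free_running"       -> always the model's own prediction
    """
    rng = rng or random.Random(0)
    prev = 0.0  # assumed all-zero start frame
    outputs = []
    for t, target in enumerate(targets):
        pred = step_fn(prev, t)
        outputs.append(pred)
        if mode == "teacher_forcing":
            prev = target
        elif mode == "scheduled_sampling":
            prev = pred if rng.random() < sampling_prob else target
        elif mode == "free_running":
            prev = pred
        else:
            raise ValueError(f"unknown decoding mode: {mode}")
    return outputs


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mt_kd_loss(student_out, teacher_tf_out, teacher_ss_out, targets,
               w_tf=1.0, w_ss=1.0):
    """Illustrative objective: reconstruction loss plus one distillation
    term per teacher (weights w_tf and w_ss are assumptions)."""
    recon = mse(student_out, targets)
    distill_tf = mse(student_out, teacher_tf_out)
    distill_ss = mse(student_out, teacher_ss_out)
    return recon + w_tf * distill_tf + w_ss * distill_ss
```

With targets `[1.0, 2.0, 3.0]`, the teacher-forcing pass sees the true previous frame at every step, while the free-running pass only sees its own earlier predictions, so the two outputs drift apart from the third frame onward. The distillation terms penalize that drift, pulling the free-running student toward what the teachers produce.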

The research was funded by the IMU Steed Plan project for high-level new talents, a national key research project (Grant No. 2018YFE0122900), the National Natural Science Foundation of China (Grant Nos. 61773224 and 62066033), the Natural Science Foundation of Inner Mongolia Autonomous Region (Grant No. 2018MS06006), and the Research and Development of Applied Technologies of Inner Mongolia Fund (Grant Nos. 2019GG372 and 2020GG0046).

About the first author: Liu Rui (personal homepage: https://ttslr.github.io/) is a researcher at the College of Computer Science of IMU, holds a Class B1 position under IMU's Steed Plan for attracting new talents, and is a PhD supervisor. Liu has worked extensively on artificial intelligence, deep learning, and expressive text-to-speech, and has published over 20 articles in well-known journals in these fields, including IEEE/ACM TASLP (a top journal in JCR Q1 and Section 1 of SCI indexing), IEEE Internet of Things Journal (a top journal in JCR Q1 and Section 1 of SCI indexing), Neural Networks (a journal in JCR Q1 and Section 2 of SCI indexing), and IEEE Signal Processing Letters (a journal in JCR Q1 and Section 2 of SCI indexing), and at academic conferences such as ICASSP (CCF-B, a top conference in speech) and Interspeech (CCF-C).

URL: https://ieeexplore.ieee.org/document/9767637
