Recently, the Mongolian Language Intelligence Information Processing Team of the College of Computer Science of IMU, in cooperation with the team headed by Prof. Li Haizhou from the National University of Singapore, published a paper titled Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis (DOI: 10.1109/TASLP.2020.3040523) in IEEE/ACM Transactions on Audio, Speech, and Language Processing. The journal is a top-level international journal in audio, acoustics and language signal processing. It is ranked as a Section I TOP journal in the SCI journal classification of the Chinese Academy of Sciences and as a Class A journal on Tsinghua University’s latest list of recommended journals in computer science, and it has an impact factor of 3.531.
The paper studies prosody modeling for Mongolian speech synthesis and proposes a prosody modeling method for Mongolian that integrates morphology and phonology. Liu Rui, a 2020 doctoral graduate of IMU (under the supervision of Prof. Gao Guanglai), is the first author, and Assoc. Prof. Fei Long from the College of Computer Science of IMU is the corresponding author. IMU is the affiliation of both the first author and the corresponding author.
Prosody modeling is a major factor in the naturalness and intelligibility of synthesized speech. With the development of deep learning and the support of large quantities of text and speech data, prosody modeling can achieve satisfactory results. For Mongolian, a low-resource language, it remains challenging for two reasons. First, the data available for Mongolian prosody modeling is limited, so there is not enough text and speech data to train a model sufficiently. Second, Mongolian is an agglutinative language with complicated word formation rules, and existing prosody modeling methods have not fully exploited the word formation features that relate to prosody. To address these problems, the paper proposes a method that enriches the feature representation of Mongolian words and predicts prosodic labels with a self-attention mechanism. Knowledge of the morphological and phonological formation of Mongolian words is used to strengthen the feature representation produced by the text encoder. Because self-attention models can learn the correlations across the input context, the paper uses a self-attention model as the decoder of the prosody model to predict the prosodic labels. The effectiveness of the proposed method is verified through a series of objective and subjective experiments, which indicate that it improves the accuracy of prosody modeling for Mongolian and, in turn, the overall performance of the Mongolian speech synthesis system.
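The idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes that each word has a word vector, a morphological vector and a phonological vector, concatenates them, passes the sequence through a single scaled dot-product self-attention layer, and scores each word for two hypothetical labels (break, no break). All names, dimensions and the random, untrained weights are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    out = []
    for q in Q:
        # Each word attends to every word in the sentence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out


def predict_breaks(word_vecs, morph_vecs, phon_vecs, params):
    """Return, for each word, a probability distribution over prosodic labels."""
    # Enrich each word representation by concatenating the three feature vectors.
    X = [w + m + p for w, m, p in zip(word_vecs, morph_vecs, phon_vecs)]
    H = self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    return [softmax(matvec(params["Wo"], h)) for h in H]


rng = random.Random(0)
n_words, d_word, d_morph, d_phon, d_attn, n_labels = 4, 4, 3, 3, 6, 2
d_in = d_word + d_morph + d_phon

# Hypothetical feature vectors for a four-word sentence.
word_vecs = [[rng.uniform(-1, 1) for _ in range(d_word)] for _ in range(n_words)]
morph_vecs = [[rng.uniform(-1, 1) for _ in range(d_morph)] for _ in range(n_words)]
phon_vecs = [[rng.uniform(-1, 1) for _ in range(d_phon)] for _ in range(n_words)]

params = {
    "Wq": random_matrix(d_attn, d_in, rng),
    "Wk": random_matrix(d_attn, d_in, rng),
    "Wv": random_matrix(d_attn, d_in, rng),
    "Wo": random_matrix(n_labels, d_attn, rng),
}

probs = predict_breaks(word_vecs, morph_vecs, phon_vecs, params)
for i, p in enumerate(probs):
    print(f"word {i}: P(break)={p[0]:.3f}, P(no break)={p[1]:.3f}")
```

Because the weights are random, the printed probabilities carry no linguistic meaning; the sketch only shows how the enriched word features flow through self-attention to per-word label scores.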
The research was sponsored by the National Key R&D Program of China (2018YFE0122900), the National Natural Science Foundation of China (61773224, 62066033), the Natural Science Foundation of Inner Mongolia (2018MS06006), the Program of Commercialization of Scientific and Research Findings of Inner Mongolia (CGZH2018125) and the Program of the Funds for the R&D of Applied Technologies of Inner Mongolia (2019GG372, 2020GG0046).
URL for the paper: https://ieeexplore.ieee.org/document/9271923