Modeling sound duration in an Arabic text to speech system

Abstract

This article describes our method for modeling sound duration in standard Arabic speech. The final aim is prosody generation for an Arabic Text-To-Speech (TTS) synthesis system. Several authors have already identified, on isolated-word corpora, the effects of speaker, word structure, sound nature and Arabic dialect. Our working hypothesis is that these effects also hold for standard Arabic in continuous speech. First, a continuous speech corpus is defined and recorded. Then two automatic tools (the SYNTHAR+ grapheme-phoneme system and the MBROLIGN alignment tool) are used to label the recorded speech database. Finally, several contextual analyses of phonetic unit duration are conducted. Our studies on continuous speech confirm the results already obtained on isolated words: both the effect of the consonantal and final-pause context and the effect of consonantal gemination on vowel duration. We also report on the effect of the number of syllables on the speech rate.

Introduction

Current TTS synthesis systems for languages such as French, English and German are able to produce intelligible and almost natural synthetic speech (Dutoit, 1997). For French, these qualities are the result of decades of basic research dedicated to the study of prosody, for example (Di Cristo, 1998), (Malfrère, 1998) and (Mertens, 1999). This quality is essentially due to the use of high-performance automatic learning and labeling tools for prosodic processing, as well as to the availability of speech databases. (Bartkova, 1987) defined a set of rules that determine segmental duration according to syntactic and prosodic markers relative to the word, the position of the syllable in the word, the position of the phoneme in the syllable, etc. Many works have studied the variation of sound duration in Arabic according to several factors: speakers, phonological phenomena, syllabic structures, consonantal gemination, Arabic dialects, etc.
Among them, (Ghazali, 1992a) considered the phonological levels for Arabic TTS; (Jomaa, 1994) treated the opposition between short and long vowels in Arabic, whereas (Amrouche, 1998) studied the variation of vowel duration according to syllabic structure. It is important to mention that all these studies were mainly based on corpora of isolated and nonsense words. A neural-network-based model of Arabic syllable duration was presented by (Chehab, 2000) to improve the naturalness of Arabic TTS, and (Zaki, 2000) proposed a set of rules to describe the variation of the stylized F0 curves for synthesized interrogative sentences. (Zemirli, 1998) underlined the major role of sound duration in the intelligibility rate of the Multivox system.

Few works have so far been dedicated to prosody generation for Arabic TTS. The prosodic representation comprises three parameters: fundamental frequency, duration and intensity. Sound duration is one of the most difficult to model. As reported by the works on isolated words described above, duration depends on the contextual realization of phonemes, which is characterized by the nature, size and structure of syllables, by stress, etc.

The aim of this paper is to study the effects of these parameters on continuous speech corpora, in order to elaborate prosodic schemes for new generations of TTS systems. The present work describes the approach adopted to predict sound duration for automatic speech rate generation in a standard Arabic TTS system. It aims to quantify, on continuous speech corpora, the effects of the left/right consonantal context, the lengthening of geminated consonants and the influence of syllabic structure on sound duration. First, we elaborate a representative continuous speech corpus that includes all the phonotactic constraints of Arabic.
Secondly, we briefly describe the two automatic tools used to align the phonetic units with the speech signal. Several criteria based on the syllabic structure and phonetic characteristics of Arabic are then defined to compute phonetic unit durations. These criteria are quite similar to those described in (Dutoit, 1997), but they are adapted to the phonological and syllabic phenomena of Arabic. Various results reporting on the effects of the syllabic structure, the gemination and the position [...]

The 1st International Symposium on Computers and Arabic Language & Exhibition 2007 © KACST & SCS

[...] which can be read by the MBROLA system). Each record of this file is associated with a phoneme (including the pause).
Each record gives the phoneme's label, its duration and, optionally, a series of number pairs, each giving the position of an F0 point, expressed as a percentage of the total duration of the phoneme, and the value of F0 at that position. The following example is the prosodic output file for the sentence “كَانَ فِيهَا سَرِيرَانِ كَبِيرَانِ”, in MBROLA formalism /kaanafiihaasariiraanikabiiraani/, “There were two double beds inside.”

_ 200
k 74
aa 160 9 115 39 135 64 147 79 152 99 152
n 82 8 150 37 145 46 143 56 139 66 136 75 134 85 129 95 126
a 112 3 120 32 111 39 111 46 109 75 105 82 101 89 98 96 89
f 107
ii 150 11 137 16 139 32 145 48 146 53 149 69 149 74 148 80 146 85 146 90 145 96 143
h 70 2 141 14 139 25 135 37 134 48 130 60 129 71 125 82 124 94 122
aa 145 2 118 8 118 19 114 30 111 41 111 52 110 63 108 74 107 85 103 90 103 96 103
s 104
a 70 2 113 14 112 25 112 37 113 48 111 60 111 71 111 82 110 94 109
r 110 3 109 18 108 25 108 40 106 54 95 61 100 69 102 76 103 83 104 90 104 98 105
ii 190 11 107 20 108 36 105 45 105 53 104 62 102 70 100 78 100 91 99 95 98 100 98
r 70 11 98 22 97 34 94 45 90 57 99 68 100 80 102 91 106
aa 208 12 120 20 126 35 140 50 146 58 149 66 150 77 148 85 145 93 143 96 142
n 61 2 132 15 133 28 133 41 131 54 128 67 124 80 122 93 119
i 70 5 116 17 114 28 109 40 110 51 108 62 107 74 105 85 102 97 93
k 107
a 89 3 101 12 101 21 97 29 97 38 95 47 94 56 93 65 92 74 91 83 89 92 88
b 54 2 87 17 86 32 86 46 86 61 86 76 87 90 88
ii 199 13 93 17 94 25 95 37 95 49 94 61 92 65 91 77 90 81 90 85 90 89 90 93 89 97 89
r 100 4 90 12 90 20 89 28 89 36 88 44 72 92 100 100 103
aa 212 11 108 15 110 22 114 37 118 48 118 60 114 71 107 82 95 90 92 93 90 97 86
n 97 3 86 11 81 19 82 27 82 36 81 44 79 52 79 60 77 69 78 77 76 85 72 93 73
i 90 2 72 11 72

From the data structures obtained in this way for all the sentences of the study corpus described above, a base of phonetic unit durations was computed.
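The record layout described above (label, duration, then optional position/F0 pairs) can be read with a few lines of code. The following Python sketch is only an illustration, not the tool used in this work: the names PhoRecord, parse_pho_line and parse_pho are hypothetical, durations are assumed to be integer milliseconds and F0 values integer Hz, and lines starting with ";" are skipped as comments. The sample reuses the first four records of the example above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhoRecord:
    """One phoneme record (hypothetical container, not part of MBROLA)."""
    label: str                    # phoneme symbol, "_" for a pause
    duration_ms: int              # segment duration, assumed to be in ms
    pitch: List[Tuple[int, int]] = field(default_factory=list)  # (position %, F0)


def parse_pho_line(line: str) -> PhoRecord:
    """Parse 'label duration [pos f0 pos f0 ...]' into a PhoRecord."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError(f"expected a label and a duration: {line!r}")
    values = [int(tok) for tok in tokens[2:]]
    if len(values) % 2 != 0:
        raise ValueError(f"pitch values must come in pairs: {line!r}")
    # Pair up the flat list: even indices are positions, odd indices are F0.
    pitch = list(zip(values[0::2], values[1::2]))
    return PhoRecord(tokens[0], int(tokens[1]), pitch)


def parse_pho(text: str) -> List[PhoRecord]:
    """Parse a whole file body, skipping blank lines and ';' comments."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        records.append(parse_pho_line(line))
    return records


# First records of the example sentence given above.
SAMPLE = """\
_ 200
k 74
aa 160 9 115 39 135 64 147 79 152 99 152
n 82 8 150 37 145 46 143 56 139 66 136 75 134 85 129 95 126
"""

records = parse_pho(SAMPLE)
total_ms = sum(r.duration_ms for r in records)
```

Here records[2] holds the long vowel "aa" with a duration of 160 and five pitch points, and total_ms sums the four segment durations, including the initial pause.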
For each phoneme, the base stores its left and right contexts, the position and value of the first pitch point (F0), the position and value of the last pitch point, the number of pitch points (NP) and the sign of the F0 slope (+, -).

H 145 0 99 50 105 100 130

Figure 1. Example of a line of the “.pho” file generated with MBROLIGN
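As a rough illustration of how one entry of such a base could be derived from aligned segments, the Python sketch below builds, for each phoneme, the fields listed above. It rests on assumptions that are not stated in the text: the names DurationEntry, slope_sign and build_entries are hypothetical, the slope sign is taken as the sign of the difference between the last and the first F0 value, a flat contour is marked "0", and phonemes with fewer than two pitch points get no slope.

```python
from typing import List, NamedTuple, Optional, Tuple

Pitch = Tuple[int, int]                  # (position %, F0 value)
Segment = Tuple[str, int, List[Pitch]]   # (label, duration, pitch points)


class DurationEntry(NamedTuple):
    """One entry of the duration base (hypothetical layout)."""
    phoneme: str
    duration_ms: int
    left: Optional[str]           # left context, None at the start
    right: Optional[str]          # right context, None at the end
    first_pitch: Optional[Pitch]
    last_pitch: Optional[Pitch]
    n_pitch: int                  # NP in the text
    slope: Optional[str]          # "+", "-", "0" (flat) or None


def slope_sign(points: List[Pitch]) -> Optional[str]:
    """Assumed definition: sign of (last F0 - first F0)."""
    if len(points) < 2:
        return None
    delta = points[-1][1] - points[0][1]
    if delta > 0:
        return "+"
    if delta < 0:
        return "-"
    return "0"


def build_entries(segments: List[Segment]) -> List[DurationEntry]:
    """Attach contexts and pitch summary to every segment."""
    entries = []
    for i, (label, duration, points) in enumerate(segments):
        left = segments[i - 1][0] if i > 0 else None
        right = segments[i + 1][0] if i + 1 < len(segments) else None
        entries.append(DurationEntry(
            phoneme=label,
            duration_ms=duration,
            left=left,
            right=right,
            first_pitch=points[0] if points else None,
            last_pitch=points[-1] if points else None,
            n_pitch=len(points),
            slope=slope_sign(points),
        ))
    return entries


# Three consecutive segments taken from the example sentence.
segments: List[Segment] = [
    ("k", 74, []),
    ("aa", 160, [(9, 115), (39, 135), (64, 147), (79, 152), (99, 152)]),
    ("n", 82, [(8, 150), (37, 145), (46, 143), (56, 139),
               (66, 136), (75, 134), (85, 129), (95, 126)]),
]
entries = build_entries(segments)

# The single line shown in Figure 1, treated on its own.
figure_entry = build_entries([("H", 145, [(0, 99), (50, 105), (100, 130)])])[0]
```

Under these assumptions, the vowel "aa" is stored with left context "k", right context "n", NP = 5 and a rising slope, while the line of Figure 1 gives NP = 3 and a rising slope.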
