A study of high quality speech synthesis based on the analysis of the randomness in speech signals

Files in This Item:
学位論文2000.pdf (1.65 MB, PDF)
Please use this identifier to cite or link to this item: https://doi.org/10.11501/3168677

Title: A study of high quality speech synthesis based on the analysis of the randomness in speech signals
Other Titles: 音声信号におけるランダムネスの解析に基づいた高品質音声合成に関する研究
Authors: Aoki, Naofumi
Authors (alt): 青木, 直史
Issue Date: 24-Mar-2000
Abstract: Randomness observed in human speech signals is considered to be a key factor in the naturalness of human speech. This research project investigated the characteristics of several kinds of randomness observed in human speech signals phonated by normal speakers. Based on the results of the analysis, some advanced techniques for artificially reproducing such randomness were developed with the aim of enhancing the voice quality of synthesized speech. The types of randomness particularly investigated in this project were: (1) amplitude fluctuation, (2) period fluctuation, (3) waveform fluctuation, (4) random fractalness of the source signals obtained by linear predictive analysis, and (5) unvoiced characteristics, namely, aperiodicity observed in voiced consonants. Using their statistical characteristics, simple models were built for these forms of randomness, and their contribution to high quality speech synthesis systems based on the LPC (linear predictive coding) vocoder was evaluated. Normal sustained vowels always contain cycle-to-cycle changes in maximum peak amplitudes and pitch periods, even at those times when the values seem to be quite stable. This project investigated the statistical characteristics of these fluctuations, labeled amplitude fluctuation and period fluctuation, respectively. Since the frequency characteristics of these fluctuation sequences appeared to be roughly subject to a 1/f power law, the author concluded that, as a preliminary model, amplitude and period fluctuation could be modeled as 1/f fluctuations. Psychoacoustic experiments performed in this study indicated that the differences in the frequency characteristics of the amplitude and period fluctuation could potentially influence the voice quality of synthesized speech.
Compared with 1/f⁰ (white noise), 1/f², and 1/f³ fluctuation models, amplitude and period fluctuation modeled as 1/f fluctuations could produce voice quality which was more similar to that of human speech phonated by normal speakers. Normal sustained vowels also always contain a cycle-to-cycle change of the waveform itself, even during their most steady parts. This project investigated the statistical characteristics of the waveform fluctuations extracted from the residual signals of the LPC vocoder. Since the frequency characteristics of the waveform fluctuations appeared to be subject to a 1/f² power law, the author concluded that, as a preliminary model, the waveform fluctuations could be modeled as 1/f² fluctuations. Psychoacoustic experiments performed in this study indicated that the differences in the frequency characteristics of waveform fluctuations could potentially influence the voice quality of synthesized speech. Compared with 1/f⁰ (white noise), 1/f, and 1/f³ fluctuation models, waveform fluctuations modeled as 1/f² fluctuations could produce voice quality which was more similar to that of human speech phonated by normal speakers. Theoretically, the source signals of the LPC vocoder are defined as being characterized by a spectral −6 dB/oct decay in the frequency domain, when the −12 dB/oct glottal vibration and the +6 dB/oct mouth radiation characteristics are taken into consideration simultaneously. Since this frequency characteristic is equivalent to a 1/f² decay of the power spectrum, the source signals of the LPC vocoder can potentially be classified as Brownian motion from the viewpoint of random fractal theory. This project employed a multiresolution analysis method, based on Schauder expansion, in order to statistically investigate the time domain characteristics of the source signals. The results of the analysis indicated that the random fractalness was clearly observed, particularly when a large resolution level was chosen.
The author also found that a certain limitation existed in the size of the discontinuity for the source signal waveforms obtained from human speech signals. Based on the results of the analysis, a new technique was developed with the aim of enhancing the voice quality of synthesized speech produced by the conventional impulse train. This study concluded that the buzzer-like degraded voice quality resulting from the use of the impulse train could be improved by removing the extremely large discontinuity of the waveforms from the impulse train. The developed technique also included a method called random fractal interpolation for restoring power in the high frequency region, which had been undesirably decreased by removing the sharpness of the impulse train. The author implemented two applications that exemplified the effectiveness of the techniques developed through this research. One such application was a real-time vocoder system implemented on a digital signal processor (DSP) evaluation module (Texas Instruments, TMS320C62EVM); the other was a Japanese rule-based speech synthesis system implemented on a personal computer (Apple, Macintosh Quadra 840AV). Both applications used as their speech synthesizer the modified LPC vocoder, which fully implemented the features that were investigated in this research. In addition, these applications demonstrated how the voice quality of voiced consonants was enhanced by a MELP (mixed excitation linear prediction) scheme. Since voiced consonants are a mixture of both a periodic component attributed to voiced characteristics and an aperiodic component attributed to unvoiced characteristics, the waveforms of voiced consonants, which seem basically periodic because they reflect the voiced feature, are disturbed in detail by the unvoiced feature.
Psychoacoustic experiments conducted in this research clarified that synthesized voiced consonants produced by the conventional LPC vocoder tended to degrade in voice quality, since such a vocoder does not incorporate the unvoiced feature into voiced consonants at all. An advanced technique, employing a wavelet transform for subband decomposition and reconstruction, was developed as a method for mixing the unvoiced component into the voiced component in the desired bands. It was concluded that synthesized voiced consonants, for which the unvoiced feature was incorporated at high frequency subbands, could be perceived as possessing a more natural voice quality than that of the conventional LPC vocoder. This project reached the following two major conclusions: (1) the voice quality of synthesized speech can be enhanced by the inclusion of randomness that is artificially produced by adequate models, and (2) the knowledge acquired through the techniques developed in this project can be applied to the design of LPC-vocoder-based high quality speech synthesis systems that can be expected to produce more realistic, human-like natural speech.
Conferring University: Hokkaido University
Degree Report Number: 甲第5113号
Degree Level: Doctoral
Degree Discipline: Engineering
Type: theses (doctoral)
URI: http://hdl.handle.net/2115/28112
Appears in Collections: 学位論文 (Theses) > 博士 (工学)

Submitter: 青木 直史
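
The 1/f^β fluctuation models compared in the abstract (β = 0, 1, 2, 3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the spectral-synthesis method (fixed power-law amplitudes with random phases), the function names `power_law_fluctuation`, `spectral_slope`, and `jittered_periods`, and the 8 ms mean pitch period and 1% fluctuation depth are all assumptions chosen for illustration.

```python
import cmath
import math
import random
from typing import List


def power_law_fluctuation(n: int, beta: float, seed: int = 0) -> List[float]:
    """Return n samples whose power spectrum falls off as 1/f**beta.

    Harmonic k gets amplitude k**(-beta/2), so its power is k**(-beta),
    plus a random phase. beta = 0 gives a flat (white) spectrum.
    """
    rng = random.Random(seed)
    harmonics = range(1, n // 2)  # skip DC and the Nyquist bin
    amps = [k ** (-beta / 2.0) for k in harmonics]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in harmonics]
    return [
        sum(a * math.cos(2.0 * math.pi * k * t / n + p)
            for k, a, p in zip(harmonics, amps, phases))
        for t in range(n)
    ]


def spectral_slope(x: List[float]) -> float:
    """Least-squares slope of log(power) against log(frequency bin)."""
    n = len(x)
    log_f, log_p = [], []
    for k in range(1, n // 2):
        # Naive DFT coefficient for bin k (fine for short sequences).
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        log_f.append(math.log(k))
        log_p.append(math.log(abs(coeff) ** 2))
    mean_f = sum(log_f) / len(log_f)
    mean_p = sum(log_p) / len(log_p)
    num = sum((f - mean_f) * (p - mean_p) for f, p in zip(log_f, log_p))
    den = sum((f - mean_f) ** 2 for f in log_f)
    return num / den


def jittered_periods(n: int, mean_period_ms: float = 8.0,
                     depth: float = 0.01, beta: float = 1.0,
                     seed: int = 0) -> List[float]:
    """Pitch periods fluctuating around a mean with a 1/f**beta spectrum.

    mean_period_ms and depth are illustrative assumptions, not values
    taken from the thesis.
    """
    fluct = power_law_fluctuation(n, beta, seed)
    # The sequence has no DC term, so its RMS equals its standard deviation.
    rms = math.sqrt(sum(v * v for v in fluct) / n)
    return [mean_period_ms * (1.0 + depth * v / rms) for v in fluct]
```

For example, `spectral_slope(power_law_fluctuation(256, 1.0))` comes out very close to -1, and `jittered_periods(256, beta=1.0)` yields 256 pitch periods scattered around 8 ms that could set the pulse spacing of a vocoder excitation. Changing `beta` to 0, 2, or 3 gives the alternative models against which the abstract compares the 1/f case.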
