SPEECH ACQUISITION

The first step at this level was to select good speakers. A notice was circulated among the departments of Islamic University inviting speakers for this purpose. A large number of interested students, both male and female, applied. An audition was arranged to compare their accuracy of pronunciation. Based on their performance, 75 male and 75 female speakers were selected to finalize the speaker list. Most of the selected speakers were students of language-related and computer-related departments. All were young, with an age range of 18-25 years.
The finally selected speakers attended a two-day workshop. The objective of the workshop was to familiarize the speakers with the theory, methods, and work plan of the project. The speakers were given practical training in speech acquisition, such as headset setup and loudness of utterance. Speech data were recorded in a laboratory environment with a close-talking microphone connected directly to the computer.
The speakers were asked to read the text in standard pronunciation as accurately as possible. The entire recording was done in the presence of the researcher, who controlled the session: running the recording script, starting the recording, pausing it, playing back the recorded speech, re-recording a sentence if required, and moving on to the next sentence. Every speaker was told the purpose of the project and instructed to start recording when he or she was ready to read. After a reader's entire session was finished, all the utterances were reviewed by the reader and the researcher for corrections.
Furthermore, as only 8-10 speakers were scheduled to read per day, the researcher was able to listen again to the recorded speech at the end of each day to control its quality. Console operators assisted the researcher; when an operator identified hesitation or an erroneous utterance, the speaker was asked to repeat the utterance until it was correct. The speech data were recorded at an 8 kHz sampling rate with 8-bit quantization.
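The recording format just described (8 kHz sampling, 8-bit quantization, mono WAV) can be sketched with Python's standard `wave` module. The file name and the synthetic test tone below are illustrative, not part of the original corpus.

```python
import math
import wave

SAMPLE_RATE = 8000   # 8 kHz sampling rate
SAMPLE_WIDTH = 1     # 8-bit quantization = 1 byte per sample

def write_8khz_8bit(path, samples):
    """Write float samples in [-1.0, 1.0] as an 8 kHz, 8-bit mono WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        # 8-bit WAV stores unsigned bytes centred on 128.
        frames = bytes(int((s + 1.0) * 127.5) for s in samples)
        wf.writeframes(frames)

# Example: one second of a 440 Hz tone standing in for a recorded utterance.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
write_8khz_8bit("utterance.wav", tone)
```

At this rate and depth, one second of speech occupies only 8 000 bytes, which keeps a 150-speaker corpus manageable.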
The recorded speech was stored in WAV file format in files of various lengths, depending on the speaker's ability to speak over a continuous stretch of time. Connected words were uttered with a minimum pause between two consecutive words, but the speakers were very conscious of the context dependence of word utterances. The final result was that each speaker spoke normally but in a word - short pause - word style, which ensured a minimum separation between two consecutive words. Depending on the speaker's ability, the total text was recorded and stored in WAV format in two or three primary unedited files. The speakers were trained and directed continuously to maintain a similar rhythm in text reading so as to ensure consistent utterance of the text.

5. SPEECH EDITING AND LABELLING

It is necessary to check the recorded data on the following points: differences between the utterance and the utterance list, degree of dialectal accent, speech rate, clarity of pronunciation, recording level, noise, etc.
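Some of these checks can be partly automated. The sketch below flags files whose recording level or speech rate looks suspicious; the thresholds, file name, and test file are illustrative assumptions, not values from the original project.

```python
import wave

def check_recording(path, n_words, min_rate=0.5, max_rate=4.0):
    """Return a list of warnings for an 8-bit mono WAV file."""
    warnings = []
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        duration = wf.getnframes() / wf.getframerate()
    # 8-bit WAV samples are unsigned bytes centred on 128.
    peak = max(abs(b - 128) for b in frames)
    if peak < 10:
        warnings.append("recording level very low")
    elif peak >= 127:
        warnings.append("possible clipping")
    rate = n_words / duration
    if not (min_rate <= rate <= max_rate):
        warnings.append(f"unusual speech rate: {rate:.2f} words/s")
    return warnings

# Example: a nearly silent 1-second "recording" should be flagged for level.
with wave.open("check_demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(1)
    wf.setframerate(8000)
    wf.writeframes(bytes([128] * 8000))  # pure silence
flags = check_recording("check_demo.wav", n_words=3)
```

Checks such as dialectal accent and clarity of pronunciation still require a human listener; automation only narrows down which files need attention.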
In particular, we found that some speakers spoke at a very low level, and it was possible to magnify the amplitude of such speech data to a suitable level. However, the recording was carried out in an environment that was not truly noiseless, and in some cases the recording instruments themselves produced a little noise. All such problems were identified and corrected during the editing phase.
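The amplitude magnification mentioned above amounts to peak normalization: scaling a quiet signal so its largest sample reaches a comfortable level. This is a minimal sketch; the target peak value is an assumption, not a figure from the original project.

```python
import numpy as np

def normalize_peak(samples, target_peak=0.9):
    """Scale a float signal so its largest absolute sample equals target_peak."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples  # silent file: nothing to scale
    return samples * (target_peak / peak)

# Example: a very quiet 440 Hz utterance brought up to a usable level.
quiet = 0.05 * np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
loud = normalize_peak(quiet)
```

Because the scaling is linear, it raises the noise floor along with the speech, which is why noisy files still needed separate filtering.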
Noiseless, clean speech files were separated from the noisy speech files. Noisy files were cleaned using various filters and tagged with a comment. The original unedited recorded files were then split so that each file contains one complete sentence.
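The "various filters" are not specified in the text; one common choice for this kind of cleanup is a first-order high-pass filter that removes DC offset and low-frequency hum. The cutoff frequency and test signal below are illustrative assumptions.

```python
import numpy as np

def highpass(x, sr=8000, cutoff=100.0):
    """Simple one-pole (RC) high-pass filter in difference-equation form."""
    rc = 1.0 / (2 * np.pi * cutoff)
    dt = 1.0 / sr
    alpha = rc / (rc + dt)
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1])
    return y

# Example: 50 Hz mains hum plus a 1 kHz speech-band component;
# the filter suppresses the hum while largely passing the 1 kHz tone.
t = np.arange(8000) / 8000
noisy = 0.5 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
clean = highpass(noisy)
```

Broadband noise overlapping the speech band cannot be removed this simply, which is why such files were also tagged with a comment rather than silently accepted.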
As the HMM Toolkit (HTK) developed by the Cambridge University Engineering Department had already proved its efficiency and is used frequently by most researchers, the database was labelled following the specified speech data format to make it ready for use with HTK for evaluation [11]. Two edited and finalized files are shown in Figure 1. As shown in the figure, the minimum-length file contains 3 words with a duration of 2.3 seconds and the maximum-length file contains 11 words with a duration of 8.5 seconds. Thus the estimated average time required for each file was 5.4 seconds. The total recording time for the connected-words database was about 62 hours after editing and labelling.
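HTK's label (.lab) files list one segment per line as "start end label", with times given in 100-nanosecond units. The sketch below writes such a file for one sentence; the words and boundary times are hypothetical, not taken from the actual corpus.

```python
def write_htk_label(path, segments):
    """Write an HTK-style label file; segments is a list of
    (start_sec, end_sec, word) tuples."""
    with open(path, "w") as f:
        for start, end, word in segments:
            # Convert seconds to HTK's 100 ns time units.
            f.write(f"{round(start * 1e7)} {round(end * 1e7)} {word}\n")

# Hypothetical word boundaries for a 2.3-second, 3-word utterance.
write_htk_label("sentence01.lab", [
    (0.0, 0.7, "word1"),
    (0.7, 1.5, "word2"),
    (1.5, 2.3, "word3"),
])
```

Pairing each sentence-level WAV file with a label file of this shape is what makes the database directly usable by HTK's training and evaluation tools.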