Abstract
Korean pronunciation errors occur mainly among children and foreigners who are not yet accustomed to the language. These errors degrade the quality of speech-recognition-based services for these groups. To address this, AI Hub has released the ‘Korean children’s speech data’ and ‘Foreign Korean speech data’ datasets. However, neither dataset provides labels for pronunciation errors, which has made it difficult to improve quality by analyzing and removing errors. To resolve this problem, this paper proposes a multimodal learning model that uses speech data and text data together to detect pronunciation error segments. Experimental results show that the proposed multimodal model improved on the existing speech-only model by 20.1 to 20.2% in terms of Character Error Rate (CER) and by 57.4 to 60.5% in terms of Word Error Rate (WER). CER and WER are originally measures of speech recognition error; in this study they are used to indicate the detection accuracy of pronunciation error segments. Finally, we evaluated error rates for samples with and without error segments detected by the proposed model. Samples without error segments had average error rates 1.47% (WER) and 1.3% (CER) lower than samples with error segments.
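For readers unfamiliar with the two metrics, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length. It illustrates the standard definitions only. The function names `edit_distance`, `cer`, and `wer` are hypothetical, and the paper's exact evaluation procedure for error-segment detection (for example, any text normalization or whether spaces count as characters) is not stated in the abstract, so those choices here are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate, ignoring spaces (an assumption made for this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "나는 학교에 갑니다"
    hypothesis = "나는 학교에 감니다"  # one mispronounced syllable
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

A single wrong syllable counts as one whole-word error for WER but only one character error for CER, which is why the two metrics can move by very different amounts on the same data.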
| Translated title of the contribution | Multimodal Learning Model for Detecting Pronunciation Error Segments of Children's and Foreigners' Speech Data |
|---|---|
| Original language | Korean |
| Pages (from-to) | 396-401 |
| Number of pages | 6 |
| Journal | 정보과학회 컴퓨팅의 실제 논문지 |
| Volume | 29 |
| Issue number | 8 |
| State | Published - 2023 |