아동 및 외국인 음성 데이터의 발음 오류 구간 검출을 위한 멀티모달 학습 모델

Translated title of the contribution: Multimodal Learning Model for Detecting Pronunciation Error Segments of Children's and Foreigners' Speech Data

Research output: Contribution to journal › Article › peer-review

Abstract

Korean pronunciation errors occur mainly among children and foreigners who are not yet accustomed to the language. These errors degrade the quality of speech-recognition-based services for these groups. To address this, AI Hub has released the ‘Korean children’s speech data’ and ‘Foreign Korean speech data’ datasets. However, because neither dataset provides labels for pronunciation errors, it has been difficult to improve quality through error analysis and removal. To resolve this problem, this paper proposes a multimodal learning model that uses speech data and text data together to detect pronunciation error segments. Experimental results show that the proposed multimodal model improved on the existing speech-only model by 20.1 to 20.2% in terms of Character Error Rate (CER) and by 57.4 to 60.5% in terms of Word Error Rate (WER). CER and WER are originally measures of speech recognition error; in this study they are used to indicate the detection accuracy of pronunciation error segments. Finally, we evaluated error rates for samples with and without error segments detected by the proposed model. The average error rates of samples without error segments were lower than those of samples with error segments by 1.47% in WER and 1.3% in CER.
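For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length, counted in characters or in words. It is not the authors' evaluation code. The function names, the choice to ignore spaces when computing CER, and the toy strings are all assumptions made for illustration, and the abstract does not specify how the metrics were adapted to score error-segment detection.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate; spaces are ignored (an assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Toy example (hypothetical strings, not taken from the datasets).
reference = "the cat sat"
hypothesis = "the cat sit"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In the toy example one of three words differs and one of nine non-space characters differs, which shows why a single wrong character typically moves WER much more than CER.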
Original language: Korean
Pages (from-to): 396-401
Number of pages: 6
Journal: 정보과학회 컴퓨팅의 실제 논문지 (KIISE Transactions on Computing Practices)
Volume: 29
Issue number: 8
State: Published - 2023
