Search : [ keyword: speech recognition ] (3)

A Study on Improving the Accuracy of Korean Speech Recognition Texts Using KcBERT

Donguk Min, Seungsoo Nam, Daeseon Choi

http://doi.org/10.5626/JOK.2024.51.12.1115

In the field of speech recognition, models such as Whisper, Wav2Vec2.0, and Google STT are widely utilized. However, Korean speech recognition faces challenges because complex phonological rules and diverse pronunciation variations hinder performance improvements. To address these issues, this study proposed a method that combined the Whisper model with a post-processing approach using KcBERT. By applying KcBERT’s bidirectional contextual learning to text generated by the Whisper model, the proposed method could enhance contextual coherence and refine the text for greater naturalness. Experimental results showed that post-processing reduced the Character Error Rate (CER) from 5.12% to 1.88% in clean environments and from 22.65% to 10.17% in noisy environments. Furthermore, the Word Error Rate (WER) was significantly improved, decreasing from 13.29% to 2.71% in clean settings and from 38.98% to 11.15% in noisy settings. BERTScore also exhibited overall improvement. These results demonstrate that the proposed approach is effective in addressing complex phonological rules and maintaining text coherence within Korean speech recognition.
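The CER and WER figures reported above are both edit-distance-based metrics, computed over characters and words respectively: the Levenshtein distance between reference and hypothesis, divided by the reference length. A minimal plain-Python sketch of these metrics (not the paper's implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    n = len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

In practice a library such as jiwer is typically used for these metrics; the sketch shows only the underlying computation.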

Creating a Noisy-Environment Speech Mixture Dataset for Korean Speech Separation

Jaehoo Jang, Kun Park, Jeongpil Lee, Myoung-Wan Koo

http://doi.org/10.5626/JOK.2024.51.6.513

In the field of speech separation, models are typically trained using datasets that contain mixtures of speech and overlapping noise. Although there are established international datasets for advancing speech separation techniques, Korea currently lacks a similar precedent for constructing datasets with overlapping speech and noise. Therefore, this paper presents a dataset generator specifically designed for single-channel speech separation models tailored to the Korean language. The Korean Speech mixture with Noise dataset is introduced, which has been constructed using this generator. In our experiments, we train and evaluate a Conv-TasNet speech separation model using the newly created dataset. Additionally, we verify the dataset's efficacy by comparing the Character Error Rate (CER) between the separated speech and the original speech using a pre-trained speech recognition model.
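A mixture dataset of this kind is typically built by overlapping a clean utterance with a noise clip scaled to a target signal-to-noise ratio. A hedged sketch of that mixing step (plain Python lists for clarity; the `mix_at_snr` helper and its signature are illustrative assumptions, not the paper's generator):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech at a target SNR (in dB).

    speech, noise: sequences of float samples at the same sample rate.
    """
    # Repeat and trim the noise clip to match the speech length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(speech)]
    # Scale the noise so the mixture reaches the requested SNR:
    # SNR_dB = 10 * log10(P_speech / P_noise_scaled)
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Real generators operate on audio arrays (e.g., via torchaudio or numpy) and also emit the clean sources as separation targets, but the SNR-scaling logic is the core of the construction.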

An Automated Error Detection Method for Speech Transcription Corpora Based on Speech Recognition and Language Models

Jeongpil Lee, Jeehyun Lee, Yerin Choi, Jaehoo Jang, Myoung-Wan Koo

http://doi.org/10.5626/JOK.2024.51.4.362

This research proposes a "machine-in-the-loop" approach for automatic error detection in Korean speech corpora by integrating the knowledge of CTC-based speech recognition models and language models. We experimentally validated its error detection performance through a three-step procedure that leveraged Character Error Rate (CER) from the speech recognition model and Perplexity (PPL) from the language model to identify potential transcription error candidates and verify their text labels. This research focused on the Korean speech corpus, KsponSpeech, reducing the character error rate on the test set from 9.44% to 8.9%. Notably, this performance enhancement was achieved while inspecting only approximately 11% of the test data, demonstrating that the proposed method is considerably more efficient than comprehensive manual inspection. Our study affirms the potential of this efficient "machine-in-the-loop" approach as a cost-effective error detection mechanism for speech data while ensuring accuracy.
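The candidate-selection step described above can be sketched as a filter over per-utterance scores: transcripts where the ASR hypothesis diverges strongly from the label (high CER) and the language model also finds the label unnatural (high PPL) are flagged for verification. The threshold values below are illustrative assumptions, not the paper's settings:

```python
def flag_error_candidates(entries, cer_threshold=0.2, ppl_threshold=150.0):
    """Select likely transcription errors from per-utterance scores.

    entries: list of dicts with precomputed fields
      'cer' - CER between the ASR hypothesis and the corpus transcript
      'ppl' - LM perplexity of the corpus transcript
    Thresholds are hypothetical; a real pipeline would tune them.
    """
    candidates = []
    for e in entries:
        if e["cer"] > cer_threshold and e["ppl"] > ppl_threshold:
            candidates.append(e)
    return candidates
```

Only the flagged subset is then passed on for label verification, which is how the method keeps the inspection cost to a small fraction of the corpus.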






Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr