
SyllaBERT: A Syllable-Based Efficient Robust Transformer Model for Real-World Noise and Typographical Errors

Seongwan Park, Yumin Heo, Youngjoong Ko

http://doi.org/10.5626/JOK.2025.52.3.250

Training a Korean language model requires a tokenizer designed for the unique features of the Korean language, which makes tokenizer design a crucial step in the modeling process. Most current language models use morpheme-based or subword-based tokenization. While these approaches work well on clean Korean text, they are prone to out-of-vocabulary (OOV) issues caused by the abbreviations and neologisms frequently encountered in real-world Korean data. Moreover, real-world Korean text often contains typos and non-standard expressions, to which traditional morpheme-based or subword-based tokenizers are not sufficiently robust. To tackle these challenges, this paper introduces SyllaBERT, a model that uses syllable-level tokenization to handle the specific characteristics of Korean, even in noisy and non-standard contexts, with minimal resources. A compact syllable-level vocabulary was created, and a syllable-based language model was developed by reducing the embedding and hidden layer sizes of existing models. Experimental results show that, despite having approximately one quarter as many parameters as subword-based models, SyllaBERT outperforms them on natural language understanding tasks over noisy real-world conversational Korean data.
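To make the idea of syllable-level tokenization concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that one token corresponds to one precomposed Hangul syllable (a single Unicode code point after NFC normalization), that whitespace is dropped, and that the special tokens follow the usual BERT convention. The function names `syllable_tokenize`, `build_vocab`, and `encode` are hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import Counter

# Assumed BERT-style special tokens; the paper's actual inventory may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def syllable_tokenize(text):
    """Split text into one token per character, dropping whitespace.

    NFC normalization composes decomposed jamo sequences into single
    precomposed Hangul syllables, so each syllable becomes one token.
    """
    normalized = unicodedata.normalize("NFC", text)
    return [ch for ch in normalized if not ch.isspace()]


def build_vocab(corpus, min_freq=1):
    """Build a compact syllable vocabulary from an iterable of sentences."""
    counts = Counter(tok for sent in corpus for tok in syllable_tokenize(sent))
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map text to token ids, using [UNK] for syllables outside the vocabulary."""
    unk_id = vocab["[UNK]"]
    ids = [vocab["[CLS]"]]
    ids.extend(vocab.get(tok, unk_id) for tok in syllable_tokenize(text))
    ids.append(vocab["[SEP]"])
    return ids


if __name__ == "__main__":
    corpus = ["안녕하세요", "반갑습니다"]
    vocab = build_vocab(corpus)
    print(syllable_tokenize("안녕 하세요"))
    # A typo ("안뇽" for "안녕") changes only the one affected syllable token.
    print(encode("안녕", vocab))
    print(encode("안뇽", vocab))
```

In this sketch a misspelled syllable changes at most one token id, while the neighboring syllables keep their ids. That locality is one plausible reading of why a syllable-level vocabulary can be both small and tolerant of typos, though the abstract itself does not spell out the mechanism.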


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal
