SyllaBERT: A Syllable-Based Efficient Robust Transformer Model for Real-World Noise and Typographical Errors 


Vol. 52,  No. 3, pp. 250-259, Mar.  2025
10.5626/JOK.2025.52.3.250



  Abstract

Training a Korean language model requires a tokenizer designed for the unique features of the Korean language, making tokenizer design a crucial step in the modeling process. Most current language models use morpheme-based or subword-based tokenization. While these approaches work well on clean Korean text, they are prone to out-of-vocabulary (OOV) issues caused by the abbreviations and neologisms frequently encountered in real-world Korean data. Moreover, real Korean text often contains typos and non-standard expressions, to which traditional morpheme-based and subword-based tokenizers are not sufficiently robust. To tackle these challenges, this paper introduces SyllaBERT, a model that employs syllable-level tokenization to handle the specific characteristics of Korean, even in noisy and non-standard contexts, with minimal resources. A compact syllable-level vocabulary was constructed, and a syllable-based language model was built by reducing the embedding and hidden-layer sizes of existing models. Experimental results show that, despite having approximately four times fewer parameters than subword-based models, SyllaBERT outperforms them on natural language understanding tasks over real-world conversational Korean data that includes noise.
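The robustness of syllable-level tokenization rests on a property of Unicode: each precomposed Hangul syllable block is a single code point, so character-level splitting yields syllables directly. The sketch below is an illustration of this idea only, not the paper's actual tokenizer; the function name and the example word are hypothetical.

```python
# Minimal sketch of syllable-level tokenization for Korean.
# Each precomposed Hangul syllable (U+AC00..U+D7A3) is one code point,
# so iterating over the string's characters yields syllables.

def syllable_tokenize(text: str) -> list[str]:
    """Split text into syllable tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

# A non-standard spelling still maps to known syllable tokens,
# whereas a subword vocabulary trained on clean text may treat it as OOV:
noisy = "안뇽하세요"  # colloquial variant of "안녕하세요" ("hello")
print(syllable_tokenize(noisy))  # ['안', '뇽', '하', '세', '요']
```

Because the full inventory of Hangul syllables is small and closed (11,172 blocks), a syllable vocabulary stays compact while covering any spelling variant, which is consistent with the parameter reduction reported in the abstract.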




  Cite this article

[IEEE Style]

S. Park, Y. Heo, Y. Ko, "SyllaBERT: A Syllable-Based Efficient Robust Transformer Model for Real-World Noise and Typographical Errors," Journal of KIISE, JOK, vol. 52, no. 3, pp. 250-259, 2025. DOI: 10.5626/JOK.2025.52.3.250.


[ACM Style]

Seongwan Park, Yumin Heo, and Youngjoong Ko. 2025. SyllaBERT: A Syllable-Based Efficient Robust Transformer Model for Real-World Noise and Typographical Errors. Journal of KIISE, JOK, 52, 3, (2025), 250-259. DOI: 10.5626/JOK.2025.52.3.250.


[KCI Style]

Seongwan Park, Yumin Heo, Youngjoong Ko, "SyllaBERT: A Syllable-Based Lightweight Transformer Model Robust to Real-World Noise and Typos," Journal of KIISE, vol. 52, no. 3, pp. 250-259, 2025. DOI: 10.5626/JOK.2025.52.3.250.









Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr