Search : [ keyword: tokenizer ] (2)

SyllaBERT: A Syllable-Based Efficient Robust Transformer Model for Real-World Noise and Typographical Errors

Seongwan Park, Yumin Heo, Youngjoong Ko

http://doi.org/10.5626/JOK.2025.52.3.250

Training a Korean language model necessitates the development of a tokenizer specifically designed for the unique features of the Korean language, making this a crucial step in the modeling process. Most current language models utilize morpheme-based or subword-based tokenization. While these approaches work well with clean Korean text data, they are prone to out-of-vocabulary (OOV) issues due to abbreviations and neologisms frequently encountered in real-world Korean data. Moreover, actual Korean text often contains various typos and non-standard expressions, to which traditional morpheme-based or subword-based tokenizers are not sufficiently robust. To tackle these challenges, this paper introduces the SyllaBERT model, which employs syllable-level tokenization to effectively address the specific characteristics of Korean, even in noisy and non-standard contexts, with minimal resources. A compact syllable-level vocabulary was created, and a syllable-based language model was developed by reducing the embedding and hidden layer sizes of existing models. Experimental results show that, despite having approximately four times fewer parameters than subword-based models, the SyllaBERT model outperforms them in natural language understanding tasks on real-world conversational Korean data that includes noise.
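For context, the sketch below shows what syllable-level tokenization of Korean text can look like in practice. It is an illustration only, not the authors' SyllaBERT implementation; the special tokens and the vocabulary-building step are assumptions made for the example.

```python
# Minimal sketch of syllable-level tokenization for Korean text.
# Illustrative only: not the SyllaBERT authors' code; the special tokens
# and vocabulary handling below are assumptions for the example.

def syllable_tokenize(text: str) -> list[str]:
    """Split text into individual characters, skipping whitespace.

    Every Hangul syllable block (U+AC00..U+D7A3) becomes its own token,
    so abbreviations, neologisms, and typos do not fall out of the
    vocabulary as long as their syllables are covered.
    """
    return [ch for ch in text if not ch.isspace()]

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Build a compact syllable-level vocabulary (special tokens are assumed)."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}
    for line in corpus:
        for tok in syllable_tokenize(line):
            vocab.setdefault(tok, len(vocab))
    return vocab

if __name__ == "__main__":
    corpus = ["딥러닝 모델을 학습한다", "토크나이저가 중요하다"]
    vocab = build_vocab(corpus)
    print(syllable_tokenize("토크나이저"))  # ['토', '크', '나', '이', '저']
    print(len(vocab))                        # small syllable-level vocabulary
```

Because the token inventory is bounded by the set of possible syllables rather than a learned subword list, the embedding table stays small, which is consistent with the abstract's point about reducing embedding and hidden layer sizes.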

A Small-Scale Korean-Specific BERT Language Model

Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin

http://doi.org/10.5626/JOK.2020.47.7.682

Recent models for sentence embedding rely on huge corpora and large numbers of parameters; they require massive data and hardware and take extensive time to pre-train. This tendency raises the need for a model that achieves comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses sub-character-level and character-level Korean dictionaries together with a BidirectionalWordPiece Tokenizer. As a result, our KR-BERT model performs comparably to, and even better than, other existing pre-trained models while using one-tenth of their training data. This demonstrates that, for a morphologically complex and low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the Multilingual BERT model misses.
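For context, the sub-character (jamo) level the abstract refers to can be illustrated with the standard Unicode decomposition of precomposed Hangul syllables. The sketch below is an illustration of that decomposition only; it is not the authors' KR-BERT code and does not implement their BidirectionalWordPiece Tokenizer.

```python
# Minimal sketch: decompose precomposed Hangul syllables (U+AC00..U+D7A3)
# into sub-characters (jamo) using the standard Unicode composition formula.
# Illustrative only; not the KR-BERT authors' implementation.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # 19 leading consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # 21 vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # 27 trailing consonants (+ none)

def to_subcharacters(text: str) -> list[str]:
    """Return the jamo sequence for each Hangul syllable; other characters pass through."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            index = code - 0xAC00
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if JONGSEONG[tail]:
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)
    return out

if __name__ == "__main__":
    # '한' decomposes into lead ㅎ, vowel ㅏ, tail ㄴ (conjoining jamo code points)
    print(to_subcharacters("한국어"))
```

A tokenizer vocabulary built over these jamo units (alongside whole characters) stays small and avoids out-of-vocabulary tokens, which is the practical benefit the abstract attributes to the sub-character-level dictionary.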

