
A Small-Scale Korean-Specific BERT Language Model

Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin

http://doi.org/10.5626/JOK.2020.47.7.682

Recent sentence embedding models rely on huge corpora and very large numbers of parameters. Pre-training them requires massive data, large-scale hardware, and extensive time. This trend raises the need for a model that achieves comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses Korean dictionaries ranging from the sub-character level to the character level, together with a BidirectionalWordPiece Tokenizer. KR-BERT performs comparably to, or even better than, existing pre-trained models while using one-tenth of the training data those models use. This demonstrates that, for a morphologically complex and low-resource language, sub-character-level representation and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the Multilingual BERT model misses.
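To illustrate what "sub-character level" can mean for Korean, the following minimal sketch decomposes precomposed Hangul syllables into their constituent jamo using Unicode NFD normalization from the Python standard library. This is an assumption made for illustration only: the abstract does not state how KR-BERT performs the decomposition, and the function names `to_subcharacters` and `to_characters` are hypothetical.

```python
import unicodedata


def to_subcharacters(text: str) -> str:
    """Split each precomposed Hangul syllable into conjoining jamo (NFD).

    Illustrative only; not taken from the KR-BERT implementation.
    """
    return unicodedata.normalize("NFD", text)


def to_characters(text: str) -> str:
    """Recompose conjoining jamo back into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


word = "한국어"                  # three syllable characters
jamo = to_subcharacters(word)    # 3 + 3 + 2 jamo code points

print(len(word), len(jamo))      # prints: 3 8
print(to_characters(jamo) == word)
```

A vocabulary built over the decomposed form sees eight symbols for this word rather than three, which is the kind of finer granularity that a sub-character dictionary works with.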






Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal
