A Small-Scale Korean-Specific BERT Language Model 


Vol. 47,  No. 7, pp. 682-692, Jul.  2020
10.5626/JOK.2020.47.7.682



  Abstract

Recent sentence-embedding models rely on huge corpora and enormous numbers of parameters; they demand massive data and large-scale hardware, and pre-training them takes an extensive amount of time. This tendency raises the need for a model that achieves comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses sub-character-level to character-level Korean vocabularies and a BidirectionalWordPiece Tokenizer. As a result, KR-BERT performs comparably to, and in some cases better than, other existing pre-trained models while using only one-tenth of their training data. This demonstrates that, for a morphologically complex and low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the Multilingual BERT model misses.
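For illustration only, the sketch below shows the two ingredients named in the abstract in generic form: decomposing precomposed Hangul syllables into sub-character jamo via standard Unicode arithmetic, and greedy longest-match WordPiece segmentation run left-to-right versus right-to-left. The function names, the toy vocabulary, and the directional matchers are assumptions made for this sketch; they are not the authors' released implementation, whose exact procedure is not described in this abstract.

```python
# Illustrative sketch (assumed helper names and toy vocabulary; not the KR-BERT release).

def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo (sub-characters)."""
    LEADS = [chr(0x1100 + i) for i in range(19)]          # choseong (initial consonants)
    VOWELS = [chr(0x1161 + i) for i in range(21)]         # jungseong (vowels)
    TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # jongseong (final consonants); index 0 = none
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:   # syllable = 0xAC00 + (lead*21 + vowel)*28 + tail
            lead, rest = divmod(code - 0xAC00, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)
    return "".join(out)


def greedy_wordpiece(word: str, vocab: set, right_to_left: bool = False) -> list:
    """Greedy longest-match WordPiece over a single word.

    Forward mode matches the longest known prefix first; backward mode matches the
    longest known suffix first. Pieces that do not start the word get the '##' prefix.
    """
    pieces = []
    if not right_to_left:
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):          # longest candidate first
                cand = word[start:end] if start == 0 else "##" + word[start:end]
                if cand in vocab:
                    pieces.append(cand)
                    start = end
                    break
            else:
                return ["[UNK]"]
    else:
        end = len(word)
        while end > 0:
            for start in range(0, end):                      # longest candidate first
                cand = word[start:end] if start == 0 else "##" + word[start:end]
                if cand in vocab:
                    pieces.insert(0, cand)
                    end = start
                    break
            else:
                return ["[UNK]"]
    return pieces


if __name__ == "__main__":
    vocab = {"한국", "##어", "한", "##국어"}                  # toy vocabulary, purely for illustration
    print(to_jamo("한국어"))                                  # sub-character (jamo) string for '한국어'
    print(greedy_wordpiece("한국어", vocab))                          # ['한국', '##어']
    print(greedy_wordpiece("한국어", vocab, right_to_left=True))      # ['한', '##국어']
```

As the example shows, the forward and backward passes can segment the same word differently, which is presumably what motivates combining the two directions; the rule KR-BERT uses to choose between them is not given in this abstract.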




  Cite this article

[IEEE Style]

S. Lee, H. Jang, Y. Baik, S. Park, and H. Shin, "A Small-Scale Korean-Specific BERT Language Model," Journal of KIISE, JOK, vol. 47, no. 7, pp. 682-692, 2020. DOI: 10.5626/JOK.2020.47.7.682.


[ACM Style]

Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, and Hyopil Shin. 2020. A Small-Scale Korean-Specific BERT Language Model. Journal of KIISE, JOK, 47, 7, (2020), 682-692. DOI: 10.5626/JOK.2020.47.7.682.


[KCI Style]

Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, and Hyopil Shin, "A Small-Scale Korean-Specific BERT Language Model," Journal of KIISE, vol. 47, no. 7, pp. 682-692, 2020. DOI: 10.5626/JOK.2020.47.7.682.








