Digital Library [Search Result]
Study on the Evaluation of Embedding Models in the Natural Language Processing
http://doi.org/10.5626/JOK.2025.52.2.141
This paper applies embedding techniques to key tasks in the field of Natural Language Processing (NLP), including semantic textual search, text classification, question answering, and clustering, and evaluates their performance. Recently, with the advancement of large-scale language models, embedding technologies have played a crucial role in various NLP applications. Several types of embedding models have been publicly released, and this paper assesses the performance of these models. For this evaluation, vector representations generated by embedding models were used as an intermediate step for each selected task. The experiments utilized publicly available Korean and English datasets, and five NLP tasks were defined. Notably, the BGE-M3 model, which demonstrated exceptional performance in multilingual, cross-lingual, and long-document retrieval tasks, was a key focus of this study. The experimental results show that the BGE-M3 model outperforms other models in three of the evaluated NLP tasks. The findings of this research are expected to provide guidance in selecting embedding models for identifying similar sentences or documents in recent Retrieval-Augmented Generation (RAG) applications.
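As an illustration of the evaluation setup described in this abstract, the sketch below shows how an embedding model might serve as the intermediate step of a semantic textual search / RAG-style retrieval task. The sentence-transformers API, the "BAAI/bge-m3" checkpoint name, and the toy document collection are assumptions made for illustration, not details taken from the paper.

from sentence_transformers import SentenceTransformer, util

# Load an embedding model (checkpoint name is an assumption for illustration).
model = SentenceTransformer("BAAI/bge-m3")

# A toy document collection and a query; in the paper's setting these would
# come from the publicly available Korean and English evaluation datasets.
documents = [
    "BGE-M3 supports multilingual and long-document retrieval.",
    "KR-BERT is a small-scale Korean-specific BERT model.",
    "Embedding vectors can be reused for classification and clustering.",
]
query = "Which model handles multilingual retrieval?"

# Encode the query and documents into dense vectors (the intermediate step
# being evaluated), then rank documents by cosine similarity.
doc_emb = model.encode(documents, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]

for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i].item():.3f}  {documents[i]}")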
A Small-Scale Korean-Specific BERT Language Model
Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin
http://doi.org/10.5626/JOK.2020.47.7.682
Recent sentence-embedding models rely on huge corpora and parameter counts; they require massive data and large-scale hardware, and pre-training them takes an extensive amount of time. This tendency raises the need for a model that achieves comparable performance while using training data economically. In this study, we propose KR-BERT, a Korean-specific model that uses sub-character-level and character-level Korean dictionaries and a BidirectionalWordPiece Tokenizer. As a result, our KR-BERT model performs comparably to, and in some cases better than, other existing pre-trained models while using one-tenth of their training data. This demonstrates that, for a morphologically complex and low-resource language, sub-character-level representations and the BidirectionalWordPiece Tokenizer capture language-specific linguistic phenomena that the multilingual BERT model misses.
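The sketch below illustrates how a pre-trained Korean BERT such as KR-BERT can be loaded and used to produce a sentence representation. The Hugging Face transformers API, the checkpoint name "snunlp/KR-BERT-char16424", and the mean-pooling step are illustrative assumptions, not details from the abstract.

import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is an assumption (a published character-level KR-BERT);
# substitute whichever KR-BERT variant is actually being used.
tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-BERT-char16424")
model = AutoModel.from_pretrained("snunlp/KR-BERT-char16424")

sentence = "한국어는 형태적으로 복잡한 언어입니다."  # "Korean is a morphologically complex language."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)  # e.g. torch.Size([768])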

Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr