Journal of KIISE

Search : [ author: Jae-kyun Kim ] (3)

Named Entity Tagged Corpus Augmentation Using Automatic Editing

A corpus is an essential resource for machine learning and deep learning in natural language processing. Korean has few well-refined named entity corpora compared with countries that lead research in this area, such as the United States, Japan, and China. Most projects that build a named entity corpus proceed manually or semi-automatically and therefore require considerable cost and effort. In this paper, we propose a novel method for automatically augmenting a small named entity corpus. The proposed method augments the corpus through automatic editing operations such as substitution, insertion, and deletion. We use probabilistic sampling rather than simple editing so that the augmented corpus is natural and diverse. Experiments show that the augmented corpus improves the performance of Korean named entity recognition, which suggests that the proposed method is useful in practice.
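The editing idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the sampling scheme, so everything below is an assumption made for illustration: the function name `augment`, the probabilities `p_sub`, `p_ins`, and `p_del`, the weighted vocabulary `vocab_weights`, and the choice to edit only tokens tagged "O" so that entity spans stay intact.

```python
import random


def augment(tokens, tags, vocab_weights, rng, p_sub=0.1, p_ins=0.05, p_del=0.05):
    """Return one randomly edited copy of a BIO-tagged sentence.

    Hypothetical sketch: only tokens tagged "O" are edited, and
    replacement words are sampled in proportion to vocab_weights.
    """
    words = list(vocab_weights)
    weights = list(vocab_weights.values())
    out_tokens, out_tags = [], []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            # Entity tokens are copied unchanged (assumption of this sketch).
            out_tokens.append(tok)
            out_tags.append(tag)
            continue
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this non-entity token
        if r < p_del + p_sub:
            # substitution: sample a replacement word by weight
            tok = rng.choices(words, weights=weights, k=1)[0]
        out_tokens.append(tok)
        out_tags.append("O")
        if rng.random() < p_ins:
            # insertion: add a sampled non-entity word after this token
            out_tokens.append(rng.choices(words, weights=weights, k=1)[0])
            out_tags.append("O")
    return out_tokens, out_tags


sentence = ["Kim", "visited", "Seoul", "yesterday"]
labels = ["B-PER", "O", "B-LOC", "O"]
vocab = {"went": 3, "to": 5, "today": 2}
new_tokens, new_labels = augment(sentence, labels, vocab, random.Random(0))
```

Calling `augment` repeatedly with different seeds yields several variants of the same sentence; a realistic system would sample replacements from a distribution estimated on the corpus rather than from a toy dictionary.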

A BIT Named Entity Format Suitable for Low Resource Environments

Ho Yoon, Chang-Hyun Kim, Min-ah Cheon, Ho-min Park, Young Namgoong, Min-seok Choi, Jae-kyun Kim, Jae-Hoon Kim

http://doi.org/10.5626/JOK.2021.48.3.293

Named entity recognition (NER) seeks to locate named entities and classify them into predefined categories such as person, organization, and location. Most named entities consist of more than one word, so most annotated NER corpora are encoded in the BIO (short for Beginning, Inside, and Outside) format: a “B-” prefix before a tag indicates that the word begins a named entity, an “I-” prefix indicates that the word is inside a named entity, and an “O” tag indicates that the word belongs to no named entity. In this format, words with “O” tags account for roughly 90% or more of the words in a corpus, which causes two problems: high perplexity of words with “O” tags and imbalanced learning. In this paper, we propose a novel format for representing NER corpora, called the BIT format, which uses “T” tags (short for POS tags) in place of “O” tags. Experiments show that the BIT format outperforms the BIO format when the meaning projection of the word representation is unreliable, namely when word embeddings are trained on a relatively small number of words.
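To make the difference between the two formats concrete, the sketch below converts a BIO-tagged sentence to a BIT-style tagging by replacing each "O" tag with the word's part-of-speech tag. The abstract does not say how T tags are written, so the `T-` prefix, the function names `bio_to_bit` and `bit_to_bio`, and the English example with Penn Treebank POS tags are assumptions for illustration only.

```python
def bio_to_bit(bio_tags, pos_tags):
    """Replace every "O" tag with a POS-based tag; keep B-/I- tags as they are.

    The "T-" prefix is an assumption of this sketch, used so that POS
    labels cannot be confused with entity labels.
    """
    if len(bio_tags) != len(pos_tags):
        raise ValueError("bio_tags and pos_tags must have the same length")
    return [f"T-{pos}" if tag == "O" else tag
            for tag, pos in zip(bio_tags, pos_tags)]


def bit_to_bio(bit_tags):
    """Map BIT tags back to BIO by collapsing every T- tag to "O"."""
    return ["O" if tag.startswith("T-") else tag for tag in bit_tags]


tokens = ["Kim", "visited", "Seoul", "yesterday"]
pos = ["NNP", "VBD", "NNP", "NN"]
bio = ["B-PER", "O", "B-LOC", "O"]
bit = bio_to_bit(bio, pos)
print(list(zip(tokens, bit)))
```

Because the conversion is reversible, a model trained on BIT tags can still be scored against a BIO reference by mapping its predictions back with `bit_to_bio`.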

Defining Chunks and Chunking using Its Corpus and Bi-LSTM/CRFs in Korean

Young Namgoong, Chang-Hyun Kim, Min-ah Cheon, Ho-min Park, Ho Yoon, Min-seok Choi, Jae-kyun Kim, Jae-Hoon Kim

http://doi.org/10.5626/JOK.2020.47.6.587

Korean dependency parsing has several notorious problems, notably the head position problem and the constituent unit problem. Chunking can partly resolve them. Chunking seeks to locate constituents, referred to as chunks, and classify them into predefined categories. So far, several studies on Korean chunking have been conducted without a fully clear definition of chunks. We therefore define Korean chunks thoroughly, build a chunk-tagged corpus based on that definition, and propose a Bi-LSTM/CRF chunking model trained on the corpus. Experiments show that the proposed model achieves an F1-score of 98.54% and can be used in practical applications. We also analyzed how performance varies with the word embedding, and fastText performed best. Finally, we performed an error analysis to guide future improvements to the model.
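For readers unfamiliar with how a chunking F1-score is usually obtained, the sketch below computes a chunk-level F1 from gold and predicted tag sequences. It assumes BIO-style chunk tags and exact-match scoring of (start, end, label) spans; the abstract does not state the evaluation script the authors used, and the names `extract_chunks` and `chunk_f1` as well as the `NP`/`VP` labels are hypothetical.

```python
def extract_chunks(tags):
    """Collect (start, end, label) spans from a BIO-style tag sequence."""
    chunks = set()
    start, label = None, None
    # A trailing "O" sentinel closes any chunk still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            chunks.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # A stray I- tag is treated leniently as the start of a chunk.
            start, label = i, tag[2:]
    return chunks


def chunk_f1(gold_tags, pred_tags):
    """Exact-match chunk-level F1 between two tag sequences."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-NP", "I-NP", "O", "B-VP"]
pred = ["B-NP", "I-NP", "O", "O"]
score = chunk_f1(gold, pred)  # one of two gold chunks is found
```

In this example the prediction finds one of the two gold chunks with no false positives, so precision is 1.0 and recall is 0.5.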

ISSN : 2383-630X(Print)
ISSN : 2383-6296(Electronic)
KCI Accredited Journal

Editorial Office

Tel. +82-2-588-9240
Fax. +82-2-521-1352
E-mail. chwoo@kiise.or.kr

Digital Library [ Search Result ]

Named Entity Tagged Corpus Augmentation Using Automatic Editing

A BIT Named Entity Format Suitable for Low Resource Environments

Defining Chunks and Chunking using Its Corpus and Bi-LSTM/CRFs in Korean
