Digital Library [Search Result]
Performance Improvement of a Korean Open Domain Q&A System by Applying the Trainable Re-ranking and Response Filtering Model
Hyeonho Shin, Myunghoon Lee, Hong-Woo Chun, Jae-Min Lee, Sung-Pil Choi
http://doi.org/10.5626/JOK.2023.50.3.273
Research on open-domain Q&A, which identifies answers to user questions without the target paragraph being prepared in advance, is actively underway as deep learning is applied to natural language processing. However, existing studies that rely on keyword-based information retrieval have limitations in semantic matching. To address this, research on deep learning-based information retrieval is in progress, but few domestic studies have been empirically applied to real systems. In this paper, a two-step method was proposed to improve the performance of a Korean open-domain Q&A system. The proposed method sequentially applies a machine learning-based re-ranking model and a response filtering model to a baseline system in which a search engine and an MRC model were combined. The baseline system's initial performance was an F1 score of 74.43 and an EM score of 60.79; with the proposed method, performance improved to an F1 score of 82.5 and an EM score of 68.82.
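The two-step pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not code from the paper: the learned scores, field names, and threshold are fabricated stand-ins for the trainable re-ranker and the response filtering model.

```python
# Hypothetical sketch: candidate passages from a keyword search engine are
# re-scored by a trainable re-ranker, then MRC answers below a confidence
# threshold are filtered out. All names and scores are illustrative.

def rerank(candidates, score_fn):
    """Sort retrieved passages by a learned relevance score (descending)."""
    return sorted(candidates, key=score_fn, reverse=True)

def filter_answers(answers, threshold=0.5):
    """Response filtering: drop MRC answers whose confidence is too low."""
    return [a for a in answers if a["confidence"] >= threshold]

# Toy stand-ins for the search engine's output and the MRC model's output.
passages = [
    {"text": "passage A", "learned": 0.30},
    {"text": "passage B", "learned": 0.90},
]
ranked = rerank(passages, score_fn=lambda p: p["learned"])

answers = [
    {"span": "Seoul", "confidence": 0.81},
    {"span": "1998", "confidence": 0.12},
]
kept = filter_answers(answers, threshold=0.5)
print(ranked[0]["text"], [a["span"] for a in kept])  # passage B ['Seoul']
```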
Training Data Augmentation Technique for Machine Comprehension by Question-Answer Pairs Generation Models based on a Pretrained Encoder-Decoder Model
http://doi.org/10.5626/JOK.2022.49.2.166
The goal of Machine Reading Comprehension (MRC) research is to find answers to questions in documents. MRC research requires large-scale, high-quality data, but individual researchers and small research institutes have limited capacity to construct it. To overcome this limitation, in this paper we propose an MRC data augmentation technique using a pre-trained language model. The technique consists of a Q&A pair generation model and a data validation model. The Q&A pair generation model consists of an answer extraction model and a question generation model, both built by fine-tuning the BART model. The data validation model, built by fine-tuning the ELECTRA model as an MRC model, verifies the generated data to increase the reliability of the augmentation. To measure the performance improvement of the MRC model from the data augmentation technique, we applied it to the KorQuAD v1.0 data. In the experiments, compared to the previous model, the Exact Match (EM) score increased by up to 7.2 and the F1 score by up to 5.7.
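The validation step described above can be sketched as a round-trip check: a generated question-answer pair is kept only if a validation MRC model, given the question and context, predicts the same answer the generator extracted. The MRC model below is a trivial stub; in the paper this role is played by a fine-tuned ELECTRA model.

```python
# Illustrative sketch of filtering augmented QA pairs with a validation
# MRC model. `stub_mrc` is a fabricated stand-in for the real model.

def validate_pairs(pairs, mrc_predict):
    """Keep only QA pairs the validation MRC model answers correctly."""
    return [
        p for p in pairs
        if mrc_predict(p["question"], p["context"]) == p["answer"]
    ]

def stub_mrc(question, context):
    # Stub: "answers" with the first capitalized word of the context.
    for tok in context.split():
        if tok[0].isupper():
            return tok
    return ""

pairs = [
    {"context": "Sejong created Hangul.",
     "question": "Who created Hangul?", "answer": "Sejong"},
    {"context": "it rained in 1998.",
     "question": "When did it rain?", "answer": "1998"},
]
valid = validate_pairs(pairs, stub_mrc)
print(len(valid))  # 1 (the second pair fails the round-trip check)
```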
A Comparative Study on the Performance of Named Entity Recognition in Materials and Chemistry Fields through Multiple Embedding Combination Based on a Pre-trained Neural Network Language Model
Myunghoon Lee, Hyeonho Shin, Hong-Woo Chun, Jae-Min Lee, Taehyun Ha, Sung-Pil Choi
http://doi.org/10.5626/JOK.2021.48.6.696
Recently, with the rapid development of the materials and chemistry fields, the academic literature has grown exponentially. Accordingly, studies are being conducted to extract meaningful information from the accumulated data, and Named Entity Recognition (NER) is one of the methodologies being utilized. NER in the materials and chemistry fields is the task of extracting standardized entities, such as materials, material property information, and experimental conditions, from academic literature and classifying their types. In this paper, we studied NER in the materials and chemistry fields using combinations of embeddings from existing published language models with a Bidirectional LSTM-CRF model, without pre-training a neural network language model ourselves. As a result, we found the best-performing embedding combinations and analyzed their performance. Additionally, the pre-trained language model itself was fine-tuned as an NER model for performance comparison. The results showed that using public pre-trained language models for embedding combinations can yield meaningful results for NER in the materials and chemistry fields.
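The "embedding combination" idea above can be sketched as concatenating, per token, the vectors obtained from several published embedding models into a single input vector for the Bidirectional LSTM-CRF tagger. The tiny lookup tables below are fabricated for illustration.

```python
# Minimal sketch: combine multiple pre-trained embeddings by concatenation.
# The 2-d and 1-d vectors here are fabricated, not real embeddings.
import numpy as np

word_emb = {"graphene": np.array([0.1, 0.2]), "oxide": np.array([0.3, 0.4])}
char_emb = {"graphene": np.array([0.5]), "oxide": np.array([0.6])}

def combine(token, tables):
    """Concatenate this token's vectors from every embedding table;
    the result is fed to the sequence tagger as one input vector."""
    return np.concatenate([t[token] for t in tables])

x = combine("graphene", [word_emb, char_emb])
print(x.shape)  # (3,)
```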
Sentence Generation from Knowledge Base Triples Using Attention Mechanism Encoder-decoder
http://doi.org/10.5626/JOK.2019.46.9.934
In this paper, we investigated the generation of natural language sentences from structured Knowledge Base triples. To generate a sentence expressing a triple, an LSTM (Long Short-Term Memory network) encoder-decoder structure is used together with an attention mechanism. The BLEU scores for the test data were 42.264 (BLEU-1), 32.441 (BLEU-2), 26.820 (BLEU-3), and 24.446 (BLEU-4), with a ROUGE score of 47.341, a 0.8% improvement over the comparison model (based on BLEU-1). In addition, the average BLEU-1 score of the top 10 test examples was 99.393.
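For reference, a BLEU-1 score like those above is clipped unigram precision multiplied by a brevity penalty. The sketch below shows the standard definition on a toy pair; it is not code from the paper.

```python
# Hedged sketch of BLEU-1: clipped unigram precision times brevity penalty.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Each candidate unigram is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("john lives in berlin", "john lives in berlin city")
print(round(score, 4))  # exp(-0.25) ≈ 0.7788
```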
Metadata Extraction based on Deep Learning from Academic Paper in PDF
Seon-Wu Kim, Seon-Yeong Ji, Hee-Seok Jeong, Hwa-Mook Yoon, Sung-Pil Choi
http://doi.org/10.5626/JOK.2019.46.7.644
Recently, with a rapid increase in the number of academic documents, the need has arisen for academic database services that provide information on the latest research trends. Although automated metadata extraction for academic database construction has been studied, most academic texts are distributed as PDF, which makes automatic information extraction difficult. In this paper, we propose an automatic metadata extraction method for PDF documents. First, the PDF is transformed into XML format; then the coordinate, size, width, and text features of each XML markup token are extracted and constructed as a vector. The extracted feature information is analyzed using a Bidirectional GRU-CRF, a deep learning model specialized for sequence labeling, and finally the metadata are extracted. In this study, 10 journals were selected from among various domestic journals, a training set for metadata extraction was constructed, and the proposed methodology was evaluated. In extraction experiments on 9 kinds of metadata, 88.27% accuracy and an 84.39% F1 score were obtained.
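The feature construction step described above can be sketched as turning each XML token (produced by PDF-to-XML conversion) into a vector of layout and text features for the Bidirectional GRU-CRF labeler. The field names below are assumptions for illustration, not the paper's actual schema.

```python
# Illustrative sketch: build a per-token feature vector from layout
# attributes (coordinates, font size, width) plus simple text features.
# The token dictionary layout is a fabricated assumption.

def token_features(tok):
    """Numeric layout features plus two simple text features
    (is the token capitalized / is it purely numeric)."""
    return [
        tok["x"], tok["y"], tok["size"], tok["width"],
        1.0 if tok["text"][0].isupper() else 0.0,
        1.0 if tok["text"].isdigit() else 0.0,
    ]

tok = {"x": 72.0, "y": 690.5, "size": 14.0, "width": 120.3, "text": "Deep"}
vec = token_features(tok)
print(vec)  # [72.0, 690.5, 14.0, 120.3, 1.0, 0.0]
```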
Comparative Analysis of Various Korean Morpheme Embedding Models using Massive Textual Resources
http://doi.org/10.5626/JOK.2019.46.5.413
Word embedding is a transformation technique that enables a computer to process natural language. It is used in various fields of machine learning-based natural language processing, such as machine translation and named-entity recognition. Various word-embedding models are available; however, few studies have compared their performance under similar conditions. In this paper, we compare and analyze the performance of the widely used Word2Vec Skip-gram, CBOW, GloVe, and FastText models with respect to Korean morpheme spacing. Based on experiments with a large news corpus and the Sejong corpus, FastText yielded the best performance among the four models.
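Embedding comparisons of this kind are commonly scored with cosine similarity between word vectors (e.g., on word-similarity or analogy test sets). The toy sketch below, with fabricated 2-d vectors, shows that measurement only; it is not the paper's actual evaluation.

```python
# Toy sketch of cosine-similarity evaluation of word embeddings.
# The 2-d vectors are fabricated for illustration.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

vectors = {"king": (0.9, 0.1), "queen": (0.85, 0.2), "apple": (0.1, 0.95)}
sim_royal = cosine(vectors["king"], vectors["queen"])
sim_far = cosine(vectors["king"], vectors["apple"])
print(sim_royal > sim_far)  # True: related words lie closer
```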
Research on Joint Models for Korean Word Spacing and POS (Part-Of-Speech) Tagging based on Bidirectional LSTM-CRF
http://doi.org/10.5626/JOK.2018.45.8.792
In general, Korean part-of-speech tagging takes as input a sentence in which word spacing is already complete. To process a sentence that is not properly spaced, automatic spacing is needed to correct the errors first. However, if automatic spacing and POS tagging are performed sequentially, errors at each step can compound into serious performance degradation. In this study, we address this problem by constructing an integrated model that performs automatic spacing and POS (Part-Of-Speech) tagging simultaneously. Based on the Bidirectional LSTM-CRF model, we propose an integrated model that performs syllable-based word spacing and POS tagging jointly and complementarily. In experiments using the Sejong tagged corpus, we obtained 98.77% POS tagging accuracy on completely spaced sentences and 97.92% morpheme accuracy on sentences without any word spacing.
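A joint model of this kind typically assigns each syllable a single combined label encoding both the spacing decision (e.g., B = begins a new word, I = inside a word) and the POS tag, so one sequence model predicts both tasks at once. The sketch below shows that labeling scheme; the tag inventory and scheme details are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of a joint syllable-level label scheme combining
# spacing tags (B/I) with POS tags. Tags here are fabricated examples.

def combine_labels(spacing_tags, pos_tags):
    """Merge per-syllable spacing and POS tags into joint labels."""
    return [f"{s}-{p}" for s, p in zip(spacing_tags, pos_tags)]

def split_labels(joint):
    """Recover the two tag sequences from the joint labels."""
    pairs = [lbl.split("-", 1) for lbl in joint]
    return [s for s, _ in pairs], [p for _, p in pairs]

joint = combine_labels(["B", "I", "B"], ["NNG", "NNG", "JKS"])
spacing, pos = split_labels(joint)
print(joint)  # ['B-NNG', 'I-NNG', 'B-JKS']
```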
Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr