KorSciQA 2.0: Question Answering Dataset for Machine Reading Comprehension of Korean Papers in Science & Technology Domain
Hyesoo Kong, Hwamook Yoon, Mihwan Hyun, Hyejin Lee, Jaewook Seol
http://doi.org/10.5626/JOK.2022.49.9.686
Recently, the performance of Machine Reading Comprehension (MRC) systems has improved through various open-ended Question Answering (QA) tasks, and challenging QA tasks that require comprehensively understanding multiple text paragraphs and making discrete inferences are being released to train more intelligent MRC systems. However, because no QA dataset exists for complex reasoning over academic information in Korean, MRC research on academic papers has been limited. In this paper, we construct KorSciQA 2.0, a QA dataset covering the full text, including abstracts, of Korean academic papers, and divide the questions into three difficulty levels (general, easy, and hard) to better discriminate between MRC systems. We also propose a methodology, process, and system for constructing KorSciQA 2.0. In MRC performance evaluation experiments, fine-tuning the KorSciBERT model, a Korean BERT model for the science and technology domain, achieved the highest performance, with an F1 score of 80.76%.
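The abstract reports an F1 score but does not say how it is computed. As a rough illustration only, the sketch below shows the token-overlap F1 commonly used in extractive MRC evaluation. The function name `token_f1` and the whitespace tokenization are assumptions made here, not details taken from the paper; Korean evaluations often use character-level or morpheme-level overlap instead.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: tokens are obtained by whitespace splitting. The paper may
    use a different tokenization or normalization.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score would typically be the mean of `token_f1` over all question-answer pairs, taking the maximum over references when a question has several.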
Metadata Extraction based on Deep Learning from Academic Paper in PDF
Seon-Wu Kim, Seon-Yeong Ji, Hee-Seok Jeong, Hwa-Mook Yoon, Sung-Pil Choi
http://doi.org/10.5626/JOK.2019.46.7.644
Recently, with the rapid increase in the number of academic documents, a need has arisen for academic database services that provide information about the latest research trends. Although automated metadata extraction for academic database construction has been studied, most academic texts are distributed as PDF files, which makes it difficult to extract information automatically. In this paper, we propose an automatic metadata extraction method for PDF documents. First, the PDF is transformed into XML format, and the coordinates, size, width, and text features of each XML markup token are extracted and assembled into a vector. The extracted features are then analyzed with a Bidirectional GRU-CRF, a deep learning model specialized for sequence labeling, and the metadata are finally extracted. In this study, 10 domestic journals were selected, and a training set for metadata extraction was constructed and used in experiments with the proposed methodology. In extraction experiments on 9 kinds of metadata, the method achieved 88.27% accuracy and an F1 score of 84.39%.
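The feature-construction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<token>` element, its `x`, `y`, `size`, and `width` attributes, and the particular text features are hypothetical, since the abstract does not specify the XML schema or the exact feature set. The Bidirectional GRU-CRF labeling stage is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML produced by a PDF-to-XML converter; the element and
# attribute names are assumptions made for this example.
SAMPLE_XML = """
<page>
  <token x="72.0" y="90.5" size="16.0" width="210.3">Metadata</token>
  <token x="72.0" y="120.0" size="10.0" width="95.1">kim@example.org</token>
</page>
"""


def token_features(token: ET.Element) -> list[float]:
    """Build one feature vector from a token's layout attributes and text."""
    text = (token.text or "").strip()
    return [
        float(token.get("x", "0")),        # horizontal coordinate
        float(token.get("y", "0")),        # vertical coordinate
        float(token.get("size", "0")),     # font size
        float(token.get("width", "0")),    # rendered width
        float(text[:1].isupper()),         # starts with an uppercase letter
        float(any(ch.isdigit() for ch in text)),  # contains a digit
        float("@" in text),                # looks like part of an e-mail
        float(len(text)),                  # character length
    ]


def page_to_sequence(xml_string: str) -> list[list[float]]:
    """Convert all tokens on a page into a sequence of feature vectors."""
    root = ET.fromstring(xml_string.strip())
    return [token_features(tok) for tok in root.iter("token")]
```

Calling `page_to_sequence(SAMPLE_XML)` yields one vector per token in document order, which is the kind of input a sequence labeling model would consume to assign a metadata label (for example title, author, or e-mail) to each token.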
Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr