Search : [ author: Seon-Wu Kim ] (2)

Metadata Extraction based on Deep Learning from Academic Paper in PDF

Seon-Wu Kim, Seon-Yeong Ji, Hee-Seok Jeong, Hwa-Mook Yoon, Sung-Pil Choi

http://doi.org/10.5626/JOK.2019.46.7.644

Recently, with the rapid increase in the number of academic documents, a need has arisen for academic database services that provide information about the latest research trends. Although automated metadata extraction for academic database construction has been studied, most academic texts are distributed as PDF files, which makes it difficult to extract information automatically. In this paper, we propose an automatic metadata extraction method for PDF documents. First, the PDF is transformed into XML format, and the coordinates, size, width, and text features of each XML markup token are extracted and assembled into a vector. The extracted features are then analyzed with a Bidirectional GRU-CRF, a deep learning model specialized for sequence labeling, to extract the metadata. In this study, 10 journals were selected from among various domestic journals, and a training set for metadata extraction was constructed and used in experiments with the proposed method. In extraction experiments on 9 kinds of metadata, the method achieved 88.27% accuracy and an F1 score of 84.39%.
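The feature-construction step described in the abstract can be illustrated with a minimal sketch. The XML layout here (a `page` element containing `token` elements with `x`, `y`, `width`, and `font-size` attributes) is a hypothetical schema chosen for illustration, not the format used in the paper, and the particular text features are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML produced by a PDF-to-XML converter (schema is an assumption).
SAMPLE_XML = """<page width="600" height="800">
  <token x="60" y="40" width="480" font-size="18">Metadata Extraction</token>
  <token x="120" y="80" width="360" font-size="11">Seon-Wu Kim</token>
</page>"""


def token_features(xml_text):
    """Turn each token into a numeric vector of layout and text features."""
    page = ET.fromstring(xml_text)
    page_w = float(page.get("width"))
    page_h = float(page.get("height"))
    rows = []
    for tok in page.iter("token"):
        text = tok.text or ""
        rows.append([
            float(tok.get("x")) / page_w,           # horizontal position, normalized
            float(tok.get("y")) / page_h,           # vertical position, normalized
            float(tok.get("width")) / page_w,       # token width, normalized
            float(tok.get("font-size")),            # font size
            float(len(text)),                       # text length in characters
            float(any(c.isdigit() for c in text)),  # 1.0 if the text contains a digit
        ])
    return rows


features = token_features(SAMPLE_XML)
```

A sequence of such vectors, one per token, is the kind of input a sequence labeler such as a Bidirectional GRU-CRF would consume; the model itself is not shown here.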

Research on Joint Models for Korean Word Spacing and POS (Part-Of-Speech) Tagging based on Bidirectional LSTM-CRF

Seon-Wu Kim, Sung-Pil Choi

http://doi.org/10.5626/JOK.2018.45.8.792

In general, Korean part-of-speech tagging takes as input a sentence whose word spacing is already correct. To process a sentence that is not properly spaced, automatic spacing is needed to correct the errors first. However, if automatic spacing and part-of-speech tagging are performed sequentially, errors occurring at each step may cause serious performance degradation. In this study, we address this problem by constructing an integrated model that performs automatic spacing and POS (Part-Of-Speech) tagging simultaneously. Based on the Bidirectional LSTM-CRF model, we propose an integrated model in which syllable-based word spacing and POS tagging complement each other. In experiments using Sejong tagged text, we obtained 98.77% POS tagging accuracy for completely spaced sentences and 97.92% morpheme accuracy for sentences without any word spacing.
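One way a single syllable-level tagger could predict spacing and POS together is to fold both decisions into one label per syllable. The sketch below is an illustrative assumption, not the label scheme from the paper: the `S`/`N` spacing flag, the `B`/`I` morpheme-boundary flag, and the `encode`/`decode` helpers are all hypothetical, and it assumes each morpheme's surface form aligns with whole syllables.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (surface morpheme, POS tag)
Word = List[Morph]       # one space-delimited word


def encode(sentence: List[Word]) -> Tuple[str, List[str]]:
    """Flatten a spaced, tagged sentence into unspaced text and joint labels."""
    syllables: List[str] = []
    labels: List[str] = []
    for word in sentence:
        word_start = True
        for surface, pos in word:
            for i, syllable in enumerate(surface):
                space = "S" if word_start else "N"  # S: a space precedes this syllable
                bound = "B" if i == 0 else "I"      # B: first syllable of a morpheme
                syllables.append(syllable)
                labels.append(f"{space}/{bound}-{pos}")
                word_start = False
    return "".join(syllables), labels


def decode(text: str, labels: List[str]) -> List[Word]:
    """Rebuild words and tagged morphemes from per-syllable joint labels."""
    sentence: List[Word] = []
    for syllable, label in zip(text, labels):
        space, rest = label.split("/", 1)
        bound, pos = rest.split("-", 1)
        if space == "S" or not sentence:
            sentence.append([])
        word = sentence[-1]
        if bound == "B" or not word:
            word.append((syllable, pos))
        else:
            surface, prev_pos = word[-1]
            word[-1] = (surface + syllable, prev_pos)
    return sentence


example = [[("나", "NP"), ("는", "JX")], [("학교", "NNG"), ("에", "JKB")]]
text, labels = encode(example)
restored = decode(text, labels)
```

Because every label carries both decisions, a sequence model trained on such labels recovers spacing and tags in one pass, which avoids passing spacing errors on to a separate tagging step.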


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr