TY  - JOUR
T1  - Metadata Extraction based on Deep Learning from Academic Paper in PDF
AU  - Kim, Seon-Wu 
AU  - Ji, Seon-Yeong 
AU  - Jeong, Hee-Seok 
AU  - Yoon, Hwa-Mook 
AU  - Choi, Sung-Pil 
JO  - Journal of KIISE, JOK
PY  - 2019
DA  - 2019/01/14/
DO  - 10.5626/JOK.2019.46.7.644
KW  - PDF metadata extraction
KW  - metadata extraction
KW  - information extraction
KW  - text mining
KW  - deep learning
AB  - The rapid increase in the number of academic documents has created a need for academic database services that provide information about the latest research trends. Although automated metadata extraction for building such databases has been studied, most academic papers are distributed as PDF files, which makes automatic information extraction difficult. In this paper, we propose an automatic metadata extraction method for PDF documents. First, the PDF is converted to XML, and the coordinates, size, width, and text features of each XML markup token are extracted and assembled into a feature vector. The extracted features are then analyzed with a Bidirectional GRU-CRF, a deep learning model specialized for sequence labeling, to extract the metadata. For this study, 10 domestic journals were selected, a training set for metadata extraction was constructed from them, and experiments were conducted using the proposed method. In extraction experiments on 9 types of metadata, the method achieved 88.27% accuracy and an F1 score of 84.39%.