Metadata Extraction based on Deep Learning from Academic Paper in PDF 


Vol. 46, No. 7, pp. 644-652, Jul. 2019
10.5626/JOK.2019.46.7.644



  Abstract

With the rapid increase in the number of academic documents, there is a growing need for academic database services that provide information on the latest research trends. Automated metadata extraction for building such databases has been studied, but most academic texts are distributed as PDF, which makes it difficult to extract information automatically. In this paper, we propose an automatic metadata extraction method for PDF documents. First, the PDF is converted to XML, and the coordinates, size, width, and text features of each XML markup token are extracted and assembled into a vector. The extracted features are then analyzed with a Bidirectional GRU-CRF, a deep learning model specialized for sequence labeling, to extract the metadata. In this study, 10 domestic journals were selected, a training set for metadata extraction was constructed from them, and experiments were run using the proposed method. In an extraction experiment on 9 kinds of metadata, the method achieved 88.27% accuracy and an F1 score of 84.39%.
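The feature-construction step described in the abstract can be illustrated with a minimal sketch. The XML layout below (a `page` element containing `token` elements with `x`, `y`, `width`, and `size` attributes) and the helper names `token_features` and `page_to_sequence` are assumptions made for illustration only; the abstract does not specify the actual XML schema, the exact text features, or the conversion tool. The sketch only shows how per-token layout and text features might be turned into a sequence of vectors suitable as input to a sequence labeler such as a Bidirectional GRU-CRF.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML produced by a PDF-to-XML converter (schema is assumed).
SAMPLE_XML = """
<page width="600" height="800">
  <token x="100" y="50" width="400" size="18">Metadata Extraction</token>
  <token x="120" y="90" width="150" size="11">Seon-Wu Kim</token>
  <token x="60" y="140" width="480" size="9">Abstract</token>
</page>
"""


def token_features(token, page_width, page_height):
    """Build one feature vector from a token's layout attributes and text."""
    text = (token.text or "").strip()
    x = float(token.get("x"))
    y = float(token.get("y"))
    width = float(token.get("width"))
    size = float(token.get("size"))
    n_chars = max(len(text), 1)  # avoid division by zero on empty tokens
    return [
        x / page_width,                            # normalized horizontal position
        y / page_height,                           # normalized vertical position
        width / page_width,                        # normalized token width
        size,                                      # font size
        float(len(text)),                          # text length
        sum(c.isdigit() for c in text) / n_chars,  # share of digit characters
        float(text[:1].isupper()),                 # starts with an uppercase letter
    ]


def page_to_sequence(xml_string):
    """Convert one page of XML into a list of per-token feature vectors."""
    page = ET.fromstring(xml_string.strip())
    page_width = float(page.get("width"))
    page_height = float(page.get("height"))
    return [token_features(t, page_width, page_height) for t in page.iter("token")]


if __name__ == "__main__":
    for vector in page_to_sequence(SAMPLE_XML):
        print(vector)
```

Each page yields one sequence of vectors in reading order; a sequence labeling model would then assign a metadata label (for example title, author, or abstract) to every position in that sequence.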




  Cite this article

[IEEE Style]

S. Kim, S. Ji, H. Jeong, H. Yoon, S. Choi, "Metadata Extraction based on Deep Learning from Academic Paper in PDF," Journal of KIISE, JOK, vol. 46, no. 7, pp. 644-652, 2019. DOI: 10.5626/JOK.2019.46.7.644.


[ACM Style]

Seon-Wu Kim, Seon-Yeong Ji, Hee-Seok Jeong, Hwa-Mook Yoon, and Sung-Pil Choi. 2019. Metadata Extraction based on Deep Learning from Academic Paper in PDF. Journal of KIISE, JOK, 46, 7, (2019), 644-652. DOI: 10.5626/JOK.2019.46.7.644.


[KCI Style]

김선우, 지선영, 정희석, 윤화묵, 최성필, "학술논문 PDF에 대한 딥러닝 기반의 메타데이터 추출 방법 연구," 한국정보과학회 논문지, 제46권, 제7호, 644~652쪽, 2019. DOI: 10.5626/JOK.2019.46.7.644.









Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr