Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification 


Vol. 45,  No. 7, pp. 690-700, Jul.  2018
10.5626/JOK.2018.45.7.690



  Abstract

Neural networks with word embeddings have recently been used for document classification. Researchers have concentrated on designing new architectures or optimizing model parameters to increase performance. However, most recent studies have overlooked text preprocessing and word embedding: the text preprocessing applied is often insufficiently described, and a particular pretrained word embedding model is usually adopted without a clear rationale. Our paper shows that finding a suitable combination of text preprocessing and word embedding can be an important factor in enhancing performance. We conducted experiments on the AG's News dataset to compare the possible combinations, as well as zero versus random padding and the presence or absence of fine-tuning. We used pretrained word embedding models such as skip-gram, GloVe, and fastText. For diversity, we also used an average of multiple pretrained embeddings (Average), a randomly initialized embedding (Random), and a skip-gram model trained on the task data (AGNews-Skip). In addition, we used three advanced neural networks for the sake of generality. The experimental results, together with out-of-vocabulary (OOV) word statistics, demonstrate the necessity of such comparisons and of choosing a suitable combination of text preprocessing and word embedding.
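
To make the comparison dimensions above concrete, the following is a minimal sketch (not the authors' implementation) of how a pretrained embedding, the zero versus random initialization of padding/OOV vectors, and the fine-tuning switch typically enter a neural text classifier. The vector dimensionality, the simple CNN architecture, the file path, and all identifiers below are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import initializers, layers, models

EMB_DIM = 300        # assumed dimensionality of the pretrained vectors
NUM_CLASSES = 4      # AG's News has four topic classes

def init_row(kind):
    """Zero vs. small random initialization, used for padding and OOV rows."""
    if kind == "zero":
        return np.zeros(EMB_DIM)
    return np.random.uniform(-0.25, 0.25, EMB_DIM)

def build_embedding_matrix(vocab, kv, pad_init="zero", oov_init="random"):
    """vocab maps word -> integer index (>= 1); index 0 is the padding token.
    Words covered by the pretrained model `kv` keep their pretrained vector;
    padding and OOV rows are initialized with zeros or small random values."""
    matrix = np.empty((len(vocab) + 1, EMB_DIM))
    matrix[0] = init_row(pad_init)                      # zero vs. random padding
    for word, idx in vocab.items():
        matrix[idx] = kv[word] if word in kv else init_row(oov_init)
    return matrix

def build_classifier(embedding_matrix, fine_tune=False):
    """A simple CNN classifier; `trainable` toggles embedding fine-tuning."""
    model = models.Sequential([
        layers.Embedding(
            input_dim=embedding_matrix.shape[0],
            output_dim=EMB_DIM,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            trainable=fine_tune,
        ),
        layers.Conv1D(128, 5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage sketch (the vector file path is a placeholder, not from the paper):
# kv = KeyedVectors.load_word2vec_format("skipgram_vectors.bin", binary=True)
# matrix = build_embedding_matrix(vocab, kv, pad_init="zero", oov_init="random")
# model = build_classifier(matrix, fine_tune=True)
```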




  Cite this article

[IEEE Style]

Y. Kim and S. Lee, "Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification," Journal of KIISE, JOK, vol. 45, no. 7, pp. 690-700, 2018. DOI: 10.5626/JOK.2018.45.7.690.


[ACM Style]

Yeongsu Kim and Seungwoo Lee. 2018. Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification. Journal of KIISE, JOK, 45, 7, (2018), 690-700. DOI: 10.5626/JOK.2018.45.7.690.


[KCI Style]

Yeongsu Kim and Seungwoo Lee, "Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification," Journal of KIISE, vol. 45, no. 7, pp. 690-700, 2018. DOI: 10.5626/JOK.2018.45.7.690.




