Digital Library [Search Result]
Incremental Clustering and Multi-Document Summarization for Issue Analysis based on Real-time News
Hongyeon Yu, Seungwoo Lee, Youngjoong Ko
http://doi.org/10.5626/JOK.2019.46.4.355
To analyze issues from real-time news articles, automatic clustering and multi-document summarization techniques are essential. Although traditional clustering and summarization techniques have been widely and successfully used in many natural language processing tasks, they have focused on static corpora rather than real-time data. We therefore propose an incremental, hierarchical news clustering and multi-document summarization method for analyzing large sets of news articles in real time. We employed both qualitative and quantitative evaluation. For the qualitative evaluation, we used about two months of real-time data collected between October and November 2016, and professionally trained researchers assessed the results with Precision at k. For the quantitative evaluation, we used manually constructed news evaluation data: clustering performance was measured by document allocation accuracy, and summarization performance by ROUGE. In the qualitative evaluation, clustering performance averaged 66% and summarization performance averaged 92%. In the quantitative evaluation, clustering performance averaged 53.95%, and summarization performance was ROUGE-1: 0.2269, ROUGE-2: 0.1018, and ROUGE-L: 0.1689.
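The two metrics named in this abstract, Precision at k (for the human evaluation) and ROUGE-1 (for summarization), can be illustrated with a minimal sketch. This is not the paper's code; the function names and the recall-style ROUGE-1 formulation are illustrative assumptions:

```python
from collections import Counter

def precision_at_k(retrieved, relevant, k):
    """Precision@k: fraction of the top-k retrieved items that are relevant.
    `retrieved` is a ranked list; `relevant` is a set of gold items."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k

def rouge_1(candidate, reference):
    """ROUGE-1 recall: unigram overlap between a candidate summary and a
    reference summary, divided by the reference length (clipped counts)."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)
    return overlap / sum(ref_counts.values())

# Toy example: 2 of the top 4 clustered articles are relevant -> 0.5
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 4))  # 0.5
# 3 of the 5 reference unigrams appear in the candidate -> 0.6
print(rouge_1("the cat sat", "the cat sat on mat"))  # 0.6
```

Published ROUGE scores are usually computed with the standard ROUGE toolkit (with stemming and stopword options), so this sketch only conveys the idea of the metric.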
Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification
http://doi.org/10.5626/JOK.2018.45.7.690
Neural networks with word embeddings have recently been used for document classification. Researchers concentrate on designing new architectures or optimizing model parameters to improve performance. However, most recent studies have overlooked text preprocessing and word embedding: descriptions of the preprocessing used are often insufficient, and a particular pretrained word embedding model is typically adopted without clear justification. This paper shows that finding a suitable combination of text preprocessing and word embedding can be an important factor in enhancing performance. We conducted experiments on the AG's News dataset to compare the possible combinations, along with zero vs. random padding and the presence or absence of fine-tuning. We used pretrained word embedding models such as skip-gram, GloVe, and fastText. For diversity, we also used an average of multiple pretrained embeddings (Average), a randomly initialized embedding (Random), and a skip-gram model trained on the task data (AGNews-Skip). In addition, we used three advanced neural networks for the sake of generality. Experimental results, together with OOV (out-of-vocabulary) word statistics, demonstrate the necessity of these comparisons and identify a suitable combination of text preprocessing and word embedding.
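The OOV word statistics mentioned here capture how well a preprocessing scheme matches a pretrained embedding's vocabulary. A minimal sketch (not the paper's code; the vocabulary and tokens are invented for illustration):

```python
def oov_rate(tokens, embedding_vocab):
    """Fraction of corpus tokens absent from the embedding vocabulary.
    A high OOV rate means many words fall back to random/unknown vectors."""
    if not tokens:
        return 0.0
    oov = sum(1 for t in tokens if t not in embedding_vocab)
    return oov / len(tokens)

# Hypothetical pretrained vocabulary (lowercased entries):
vocab = {"stock", "market", "rally"}
tokens = ["Stock", "market", "rally", "continues"]

# Without lowercasing, "Stock" (case mismatch) and "continues" are OOV -> 0.5
print(oov_rate(tokens, vocab))  # 0.5
# Lowercasing during preprocessing halves the OOV rate here -> 0.25
print(oov_rate([t.lower() for t in tokens], vocab))  # 0.25
```

This is why the choice of preprocessing (casing, tokenization) interacts with the choice of pretrained embedding: the same corpus can yield very different OOV rates against different vocabularies.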

Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr