TY  - JOUR
T1  - Semi-automatic Expansion for a Chatting Corpus Based on a K-means Clustering Method and Similarity Measure
AU  - An, Jaehyun
AU  - Ko, Youngjoong
JO  - Journal of KIISE, JOK
PY  - 2019
DA  - 2019/1/14
DO  - 10.5626/JOK.2019.46.5.440
KW  - chatting system
KW  - semi-automatic expansion
KW  - similarity
KW  - convolutional neural networks
KW  - utterance embedding
AB  - In this paper, we propose a semi-automatic method for expanding a chatting corpus using a large amount of utterance data from movie subtitles and drama scripts. To expand the chatting corpus, the proposed system uses a previously constructed chatting corpus and a similarity measure. If the similarity calculated between the previously constructed chatting corpus and an input utterance is greater than a threshold value set in the experiment, the input utterance is selected as a new chatting utterance, that is, as a correct chatting pair. We use morpheme-unit word embeddings and a convolutional neural network to efficiently calculate the similarity of the utterance embeddings. To improve the speed of the semi-automatic expansion process, we reduce the amount of computation by clustering the chatting corpus with the K-means clustering algorithm. Experimental results showed that the precision, recall, and F1 score of the proposed system were 61.28%, 53.19%, and 56.94%, respectively, which were 5.16%p, 6.09%p, and 5.73%p higher than those of the baseline system. The term frequency and the speed of our system were also about a hundred times faster.