Digital Library [ Search Results ]
LEXAI : Legal Document Similarity Analysis Service using Explainable AI
http://doi.org/10.5626/JOK.2020.47.11.1061
Recently, in keeping with advances in deep learning, studies applying deep learning to specialized fields have diversified. Semantic search over legal documents is an essential part of the legal field. However, it is difficult to provide such a service outside of an expert system because it requires professional knowledge of the relevant field. It is also challenging to establish an automated retrieval environment for semantically similar legal documents because the cost of hiring professional human resources is high. While existing retrieval services provide an environment based on expert systems and statistical systems, the proposed method adopts a deep learning approach with a classification task. We propose a database system structure that provides search for legal documents with high semantic similarity using an explainable neural network. We also develop and verify a visual similarity assessment method that shows the semantic relevance among similar documents.
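The abstract does not give the encoder architecture, so the following Python sketch only illustrates the retrieval step it describes: ranking stored legal documents by cosine similarity of their embedding vectors. The function name and the random placeholder vectors are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: ranking legal documents by semantic similarity of their
# embedding vectors. The neural encoder that produces the embeddings is
# abstracted away; `doc_vectors` stands in for whatever vectors it emits.
import numpy as np

def rank_by_similarity(query_vec: np.ndarray, doc_vectors: np.ndarray, top_k: int = 5):
    """Return indices and scores of the top_k documents most similar to the query."""
    # Cosine similarity between the query and every stored document vector.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(100, 128))   # placeholder embeddings for 100 documents
    query = rng.normal(size=128)         # placeholder embedding for the query document
    idx, sims = rank_by_similarity(query, docs, top_k=3)
    print(list(zip(idx.tolist(), sims.round(3).tolist())))
```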
Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification
http://doi.org/10.5626/JOK.2018.45.7.690
Neural networks with word embedding have recently been used for document classification. Researchers concentrate on designing new architectures or optimizing model parameters to increase performance. However, most recent studies have overlooked text preprocessing and word embedding, in that the description of the text preprocessing used is insufficient and a particular pretrained word embedding model is usually adopted without any plausible reason. Our paper shows that finding a suitable combination of text preprocessing and word embedding can be one of the important factors in enhancing performance. We conducted experiments on the AG's News dataset to compare the possible combinations, as well as zero versus random padding and the presence or absence of fine-tuning. We used pretrained word embedding models such as skip-gram, GloVe, and fastText. For diversity, we also used an average of multiple pretrained embeddings (Average), a randomly initialized embedding (Random), and a skip-gram model trained on the task data (AGNews-Skip). In addition, we used three advanced neural networks for the sake of generality. Experimental results, together with OOV (out-of-vocabulary) word statistics, suggest the necessity of these comparisons and of a suitable combination of text preprocessing and word embedding.
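As a hedged illustration of what one such combination involves, the sketch below builds an embedding matrix from a pretrained vocabulary, reserving a zero row for padding and using either zero or random vectors for OOV words, and reports the resulting OOV rate. The function name and the tiny stand-in "pretrained" dictionary are assumptions for illustration only.

```python
# Minimal sketch of one preprocessing/embedding combination: a whitespace
# tokenizer, an OOV check against a pretrained vocabulary, and an embedding
# matrix with zero padding and zero- or random-initialized unknown words.
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, unknown="random", seed=0):
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim))        # row 0 reserved for padding (zeros)
    oov = 0
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
        else:
            oov += 1
            if unknown == "random":                 # random init for OOV words
                matrix[idx] = rng.normal(scale=0.1, size=dim)
            # unknown == "zero" leaves the row as zeros
    return matrix, oov / len(vocab)

if __name__ == "__main__":
    texts = ["breaking news on markets", "sports results tonight"]
    tokens = sorted({w for t in texts for w in t.lower().split()})
    vocab = {w: i + 1 for i, w in enumerate(tokens)}
    # Toy stand-in for skip-gram/GloVe/fastText vectors.
    pretrained = {w: np.ones(50) * i for i, w in enumerate(["news", "sports", "markets"])}
    emb, oov_rate = build_embedding_matrix(vocab, pretrained, dim=50)
    print(emb.shape, f"OOV rate: {oov_rate:.2f}")
```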
Hybrid Word-Character Neural Network Model for the Improvement of Document Classification
http://doi.org/10.5626/JOK.2017.44.12.1290
Document classification, the task of assigning a category to each document based on its text, is one of the fundamental areas of natural language processing. Document classification can be used in various fields such as topic classification and sentiment classification. Neural network models for document classification can be divided into two categories: word-level models and character-level models, which treat words and characters as their basic units, respectively. In this study, we propose a neural network model that combines character-level and word-level models to improve the performance of document classification. The proposed model extracts a feature vector for each word by combining information obtained from a word embedding matrix with information encoded by a character-level neural network. Based on the feature vectors of words, the model classifies documents with a hierarchical structure in which recurrent neural networks with attention mechanisms are used at both the word and sentence levels. Experiments on real-life datasets demonstrate the effectiveness of our proposed model.
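A minimal PyTorch sketch of the hybrid word representation described above: each word's feature vector concatenates a word-embedding lookup with the output of a character-level recurrent encoder over its characters. The class name, layer sizes, and the choice of a BiLSTM are illustrative assumptions, not the paper's exact configuration, and the hierarchical attention classifier on top is omitted.

```python
# Minimal sketch (PyTorch): hybrid word features = word embedding
# concatenated with the final states of a character-level BiLSTM.
import torch
import torch.nn as nn

class HybridWordEncoder(nn.Module):
    def __init__(self, word_vocab, char_vocab, word_dim=100, char_dim=25, char_hidden=25):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.char_rnn = nn.LSTM(char_dim, char_hidden, batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1)
        _, (h, _) = self.char_rnn(chars)                    # h: (2, b*s, char_hidden)
        char_feat = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        return torch.cat([self.word_emb(word_ids), char_feat], dim=-1)

if __name__ == "__main__":
    enc = HybridWordEncoder(word_vocab=1000, char_vocab=60)
    words = torch.randint(1, 1000, (2, 7))       # 2 documents, 7 words each
    chars = torch.randint(1, 60, (2, 7, 12))     # up to 12 characters per word
    print(enc(words, chars).shape)               # -> torch.Size([2, 7, 150])
```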
Feature Extraction to Detect Hoax Articles
Readership of online newspapers has grown with the proliferation of smart devices. However, fierce competition between Internet newspaper companies has resulted in a large increase in the number of hoax articles. Hoax articles are those whose title does not convey the content of the main story, giving readers the wrong impression of the contents. We note that hoax articles have certain characteristics, such as unnecessary celebrity quotations, a mismatch between the title and the content, or incomplete sentences. Based on these characteristics, we extract and validate features to identify hoax articles. We built a large-scale training dataset by analyzing text keywords in replies to articles and thus extracted five effective features. We evaluated the performance of a support vector machine classifier on the extracted features and observed 92% accuracy on our validation set. In addition, we present a selective bigram model to measure the consistency between the title and the content, which can also be used effectively to analyze short texts in general.
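As a rough sketch of this pipeline, the example below computes a title/body bigram-overlap score (a simplified stand-in for the paper's selective bigram consistency measure), combines it with other toy numeric cues, and trains a scikit-learn SVM. The feature names and toy data are hypothetical and do not correspond to the paper's five features.

```python
# Minimal sketch: hand-crafted features (title/body bigram overlap plus toy
# counts) fed to an SVM classifier for hoax-article detection.
from sklearn.svm import SVC
import numpy as np

def bigram_overlap(title: str, body: str) -> float:
    """Fraction of title bigrams that also appear in the body."""
    def bigrams(text):
        toks = text.lower().split()
        return set(zip(toks, toks[1:]))
    t, b = bigrams(title), bigrams(body)
    return len(t & b) / len(t) if t else 0.0

if __name__ == "__main__":
    # Each row: [bigram_overlap, quote_count, exclamation_count] (toy values).
    X = np.array([[0.8, 0, 0], [0.1, 3, 2], [0.7, 1, 0], [0.0, 4, 3]])
    y = np.array([0, 1, 0, 1])              # 1 = hoax, 0 = normal article
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([[0.05, 2, 1]]))      # low overlap, many quotes: likely hoax
```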
Feature Expansion based on LDA Word Distribution for Performance Improvement of Informal Document Classification
Hokyung Lee, Seon Yang, Youngjoong Ko
Data such as Twitter posts, Facebook posts, and customer reviews belong to the informal document group, whereas newspapers, which go through a grammar correction step, belong to the formal document group. Finding consistent rules or patterns in informal documents is difficult compared to formal documents. Hence, additional approaches are needed to improve informal document analysis. In this study, we classified Twitter data, a representative informal document type, into ten categories. To improve performance, we revised and expanded features based on the LDA (Latent Dirichlet Allocation) word distribution. Using the LDA top-ranked words, the other words were separated or bundled, and the feature set was thus expanded repeatedly. Finally, we conducted document classification with the expanded features. Experimental results indicate that the proposed method improved the micro-averaged F1-score by 7.11 percentage points compared to the results before the feature expansion step.
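A minimal sketch of the feature-expansion idea, assuming a scikit-learn LDA: the top-ranked words of each topic are tagged with their topic so they become separate, topic-specific features. The topic count, top-k size, and toy corpus are illustrative only, and the paper's repeated separate/bundle procedure is not reproduced here.

```python
# Minimal sketch: fit LDA, take each topic's top-ranked words, and expand
# documents by replacing those words with topic-tagged feature tokens.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

docs = [
    "the match ended with a late goal",
    "the election results were announced today",
    "the striker scored another goal in the match",
    "voters discussed the election and the results",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())

# Top-3 words per topic become topic-tagged features.
top_words = {t: set(terms[np.argsort(lda.components_[t])[-3:]]) for t in range(2)}

def expand(doc):
    out = []
    for w in doc.lower().split():
        tags = [f"{w}__topic{t}" for t, ws in top_words.items() if w in ws]
        out.extend(tags if tags else [w])   # non-top words pass through unchanged
    return out

print(expand(docs[0]))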