Digital Library [Search Results]
Contract Eligibility Verification Enhanced by Keyword and Contextual Embeddings
Sangah Lee, Seokgi Kim, Eunjin Kim, Minji Kang, Hyopil Shin
http://doi.org/10.5626/JOK.2022.49.10.848
Contracts must be reviewed to verify that they include all the clauses essential for their validity. Such clauses are highly formal and repetitive regardless of the kind of contract, and automated legal technologies are required for legal text comprehension. In this paper, we construct a simple item-by-item classification model for contract clauses that estimates contract eligibility by exploiting the formal and repetitive properties of the clauses. We use keyword embeddings based on conventional contract requirements and concatenate them with sentence embeddings of clauses extracted from a BERT model fine-tuned on legal documents. Contract eligibility can then be verified from the predicted labels. With this method, we report reasonable performance, with accuracies of 90.57 and 90.64 and F1-scores of 93.27 and 93.26, when the additional keyword embeddings are used alongside the BERT embeddings.
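The core of the described method is fusing a clause-level BERT sentence embedding with a keyword embedding before classification. Below is a minimal sketch of that fusion step, assuming a generic Hugging Face checkpoint and hypothetical keyword-vocabulary and label sizes (the paper's fine-tuned legal BERT and keyword set are not reproduced here):

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ClauseClassifier(nn.Module):
    """Concatenates a BERT sentence embedding with a keyword embedding."""
    def __init__(self, bert_name="bert-base-multilingual-cased",
                 n_keywords=50, kw_dim=64, n_labels=12):   # hypothetical sizes
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.kw_emb = nn.Embedding(n_keywords, kw_dim)     # one vector per requirement keyword
        self.head = nn.Linear(self.bert.config.hidden_size + kw_dim, n_labels)

    def forward(self, input_ids, attention_mask, keyword_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sent = out.last_hidden_state[:, 0]                 # [CLS] sentence embedding
        kw = self.kw_emb(keyword_ids)                      # keyword embedding for the clause
        return self.head(torch.cat([sent, kw], dim=-1))    # fused representation -> label logits

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = ClauseClassifier()
batch = tok(["The term of this contract is one year."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"], torch.tensor([3]))
```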
Effective Transfer Learning in Text Classification with the Label-Based Discriminative Feature Learning
http://doi.org/10.5626/JOK.2022.49.3.214
The performance of natural language processing with transfer learning has improved by pre-training language models on large amounts of general data and applying them to downstream tasks. However, because the data used in pre-training is unrelated to the downstream tasks, the models learn general features rather than task-specific ones. This paper proposes a novel learning method that lets the embeddings of pre-trained models capture features specific to the downstream tasks. The proposed method learns the label features of the downstream tasks through contrastive learning between label embeddings and sampled data pairs. To demonstrate its performance, we conducted experiments on sentence classification datasets and evaluated whether the features of the downstream tasks had been learned, using PCA (Principal Component Analysis) and clustering on the embeddings.
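One rough sketch of such a label-based contrastive objective: give each class label a trainable embedding, pull sentence embeddings toward their own label embedding, and push them away from the others, InfoNCE-style. The temperature and dimensions below are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def label_contrastive_loss(sent_emb, labels, label_emb, tau=0.1):
    """Pull each sentence embedding toward its label embedding,
    push it away from the other labels' embeddings (InfoNCE-style)."""
    sent = F.normalize(sent_emb, dim=-1)        # (batch, dim)
    lab = F.normalize(label_emb, dim=-1)        # (n_labels, dim)
    logits = sent @ lab.T / tau                 # similarity to every label embedding
    return F.cross_entropy(logits, labels)

# toy usage: 8 sentences, 3 classes, 128-dim embeddings
sent_emb = torch.randn(8, 128, requires_grad=True)
label_emb = torch.nn.Parameter(torch.randn(3, 128))
loss = label_contrastive_loss(sent_emb, torch.randint(0, 3, (8,)), label_emb)
loss.backward()
```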
Method for the Automatic Generation of Training Sets for Word Embedding Reflecting Sentiment Information
Dahee Lee, Won-Min Lee, Byung-Won On
http://doi.org/10.5626/JOK.2022.49.1.42
Word embedding is a method of representing a word as a vector. However, because existing word embedding methods predict words that appear together, words with opposite sentiment are represented by similar vectors. When a sentiment analysis model is built on such embeddings, sentences with similar patterns may be classified as having the same polarity, which is one factor that degrades the model's performance. In this paper, to address this problem, we propose the automatic generation of a training set for word embedding that reflects sentiment information, using morphological analysis, dependency parsing, and a sentiment dictionary. Using the sentiment-specific word embedding vectors generated by the proposed model, we show that the proposed model outperforms existing word embedding models, including CBOW, Skip-Gram, FastText, ELMo, and BERT.
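A minimal sketch of the corpus-relabeling idea: tag sentiment-bearing tokens with their dictionary polarity before embedding training, so that words of opposite sentiment stop sharing identical contexts. The dictionary, tokenization, and tagging scheme here are placeholders, not the paper's Korean pipeline of morphological analysis and dependency parsing:

```python
from gensim.models import Word2Vec

# hypothetical sentiment dictionary; the paper instead derives polarity via
# morphological analysis, dependency parsing, and a Korean sentiment lexicon
SENT_DICT = {"good": "POS", "great": "POS", "bad": "NEG", "awful": "NEG"}

def tag_sentiment(tokens):
    """Append the dictionary polarity to sentiment-bearing tokens."""
    return [f"{t}_{SENT_DICT[t]}" if t in SENT_DICT else t for t in tokens]

corpus = [
    "the movie was good and the acting was great".split(),
    "the movie was bad and the acting was awful".split(),
]
tagged = [tag_sentiment(sent) for sent in corpus]

# Skip-gram embeddings over the sentiment-tagged corpus
model = Word2Vec(tagged, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("good_POS"))
```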
GPT-2 for Knowledge Graph Completion
http://doi.org/10.5626/JOK.2021.48.12.1281
Knowledge graphs have become an important resource in many artificial intelligence (AI) tasks, and many studies aim to complete incomplete knowledge graphs. Among them, interest in knowledge completion via link prediction and relation prediction is growing. BERT and GPT-2 are among the most discussed language models in natural language processing; KG-BERT tackles knowledge completion with BERT. In this paper, we address the knowledge completion problem with GPT-2. We propose and describe triple-based and path-triple-based knowledge completion as methods for solving the problem with the GPT-2 language model. We define the proposed model as KG-GPT2 and evaluate its knowledge completion performance by comparing the link prediction and relation prediction results of TransE, TransR, KG-BERT, and KG-GPT2. For link prediction we used the WN18RR, FB15k-237, and UMLS datasets, and for relation prediction we used FB15k. In the experiments, the path-triple-based variant of KG-GPT2 recorded the best link prediction performance on all datasets except UMLS, and it also recorded the best relation prediction performance on FB15k.
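One way to read the triple-based completion idea: verbalize a (head, relation, tail) triple as text and use GPT-2's language-modeling loss as a plausibility score, ranking candidate tails for a link-prediction query. This is a sketch of that scoring step under those assumptions, not KG-GPT2 itself:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def triple_score(head, relation, tail):
    """Lower LM loss = GPT-2 finds the verbalized triple more plausible."""
    ids = tok(f"{head} {relation} {tail}.", return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss       # mean negative log-likelihood
    return -loss.item()                       # higher is better

# rank candidate tails for a link-prediction query
candidates = ["France", "Japan", "banana"]
ranked = sorted(candidates,
                key=lambda t: triple_score("Paris", "is the capital of", t),
                reverse=True)
print(ranked)
```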
An Efficient Document Clustering Method using Space Transformation based on LDA and WMD
http://doi.org/10.5626/JOK.2021.48.9.1052
Existing TF-IDF-based document clustering methods do not properly exploit the contextual information of documents, i.e., co-occurrence and word order, and tend to degrade in performance due to the curse of dimensionality. To overcome these problems, techniques such as a weighted average of word embedding vectors and Word Mover's Distance (WMD) have been proposed. These techniques perform well on document classification, but not on document clustering, which needs to group documents. In this study, we use LDA to define a representative "topic document" for each document group and address the problem by computing WMD against the topic documents. However, since WMD requires a large amount of computation, we propose a space transformation method that maintains good performance while reducing the computation cost by mapping each document to a low-dimensional space in which each axis is the WMD to one topic document.
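A compact sketch of the space transformation: pick a representative topic document per LDA topic, map every document to a K-dimensional point whose k-th coordinate is its WMD to topic document k, and cluster in that space. The corpus, pre-trained vectors, and topic-document selection rule below are placeholder assumptions:

```python
import numpy as np
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.cluster import KMeans

docs = [["stock", "market", "trading"], ["soccer", "match", "goal"],
        ["bank", "interest", "rates"], ["tennis", "player", "serve"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

# pick, per topic, the document with the highest probability for that topic
topic_docs = []
for k in range(lda.num_topics):
    probs = [dict(lda.get_document_topics(b, minimum_probability=0)).get(k, 0)
             for b in bow]
    topic_docs.append(docs[int(np.argmax(probs))])

wv = api.load("glove-wiki-gigaword-50")   # pre-trained vectors; WMD needs the POT package
# each axis of the transformed space = WMD to one topic document
X = np.array([[wv.wmdistance(d, t) for t in topic_docs] for d in docs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```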
A Comparative Study on the Performance of Named Entity Recognition in Materials and Chemistry Fields through Multiple Embedding Combination Based on a Pre-trained Neural Network Language Model
Myunghoon Lee, Hyeonho Shin, Hong-Woo Chun, Jae-Min Lee, Taehyun Ha, Sung-Pil Choi
http://doi.org/10.5626/JOK.2021.48.6.696
Recently, with the rapid development of the materials and chemistry fields, the academic literature has grown exponentially. Accordingly, studies are being conducted to extract meaningful information from the accumulated data, and Named Entity Recognition (NER) is one of the methodologies being utilized. NER in the materials and chemistry fields extracts standardized entities, such as materials, material properties, and experimental conditions, from the academic literature and classifies the types of those entities. In this paper, we studied NER in the materials and chemistry fields using combinations of embeddings with a Bi-directional LSTM-CRF model, building on existing published language models rather than pre-training a neural language model ourselves. We identified the best-performing embedding combinations and analyzed their performance. Additionally, a pre-trained language model was fine-tuned as a NER model for comparison. The results show that using public pre-trained language models in embedding combinations can produce meaningful results for NER in the materials and chemistry fields.
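A skeletal version of such an embedding-combination tagger: concatenate two (or more) pre-computed token embeddings, run a Bi-LSTM, and decode with a CRF. This sketch uses the third-party pytorch-crf package and random tensors in place of real language-model embeddings; dimensions and tag count are assumptions:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, emb_dims=(768, 300), hidden=256, n_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(sum(emb_dims), hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def forward(self, embs, tags=None):
        x = torch.cat(embs, dim=-1)           # combine multiple embeddings per token
        h, _ = self.lstm(x)
        emissions = self.proj(h)
        if tags is not None:                  # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)     # inference: best tag sequence

# stand-ins for BERT-style and word-vector embeddings of a 10-token sentence
bert_like, w2v_like = torch.randn(1, 10, 768), torch.randn(1, 10, 300)
model = BiLSTMCRF()
print(model([bert_like, w2v_like]))           # decoded tag ids
```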
An Embedding Method of Emotes for the Detection of Popular Clips on Twitch.tv
Hyeonho Song, Kunwoo Park, Meeyoung Cha
http://doi.org/10.5626/JOK.2020.47.12.1153
This study presents an embedding method that effectively learns the meanings of emotes on Twitch.tv to understand audience reactions in live streaming. The proposed method first trains separate embedding matrices for text and emotes and then merges the two matrices into one. Using 2,220,761 clips shared on Twitch.tv, this study conducted two experiments: clustering and clip popularity prediction. The results show that the approach identifies clusters of emotes expressing similar emotions and detects popular clips. Future studies could utilize the proposed emote embedding method for highlight prediction in live streams.
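A simple reading of the merge step: train one embedding space over chat tokens and one over emote sequences, then combine the vocabularies into a single lookup table. The tokenization and the naive merge below are assumptions; in practice the two spaces would need to be aligned (e.g., via co-occurring tokens) before cross-space similarities are meaningful:

```python
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

chat = [["that", "play", "was", "insane"], ["nice", "shot", "streamer"]]
emotes = [["PogChamp", "Kreygasm"], ["LUL", "KEKW", "LUL"]]

text_model = Word2Vec(chat, vector_size=50, min_count=1, sg=1)
emote_model = Word2Vec(emotes, vector_size=50, min_count=1, sg=1)

# merge the two embedding matrices into one lookup table
# (no alignment here: a placeholder for the paper's merge operation)
merged = KeyedVectors(vector_size=50)
for m in (text_model.wv, emote_model.wv):
    merged.add_vectors(m.index_to_key, m.vectors)

print(merged.most_similar("LUL"))
```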
Defining Chunks and Chunking using Its Corpus and Bi-LSTM/CRFs in Korean
Young Namgoong, Chang-Hyun Kim, Min-ah Cheon, Ho-min Park, Ho Yoon, Min-seok Choi, Jae-kyun Kim, Jae-Hoon Kim
http://doi.org/10.5626/JOK.2020.47.6.587
There are several notorious problems in Korean dependency parsing, namely the head position problem and the constituent unit problem. Such problems can be partially resolved by chunking, which seeks to locate constituents, referred to as chunks, and classify them into predefined categories. So far, several Korean studies have been conducted without a clear and complete definition of chunks. We therefore define chunks in Korean thoroughly, build a chunk-tagged corpus based on that definition, and propose a Bi-LSTM/CRF chunking model trained on the corpus. Experiments show that the proposed model achieves an F1-score of 98.54% and can be used in practical applications. We also analyzed performance variation across word embeddings, among which fastText performed best, and carried out an error analysis to guide future improvements of the model.
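For readers unfamiliar with the task format, a chunk-tagged corpus is typically encoded per token in a BIO scheme. The sketch below shows that conversion for an English stand-in sentence; the paper's Korean chunk categories are not reproduced here:

```python
def chunks_to_bio(tokens, chunks):
    """chunks: list of (start, end_exclusive, label) spans over tokens."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = f"B-{label}"          # chunk-initial token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # chunk-internal tokens
    return tags

tokens = ["the", "quick", "fox", "jumped", "over", "the", "lazy", "dog"]
chunks = [(0, 3, "NP"), (3, 4, "VP"), (4, 8, "PP")]
for tok, tag in zip(tokens, chunks_to_bio(tokens, chunks)):
    print(f"{tok}\t{tag}")
```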
Comparison of Context-Sensitive Spelling Error Correction using Embedding Techniques
Jung-Hun Lee, Minho Kim, Hyuk-Chul Kwon
http://doi.org/10.5626/JOK.2020.47.2.147
This paper focuses on using embedding techniques to solve problems in context-sensitive spelling error correction and compares the performance of each technique. Word vectors obtained through embedding learning are used to measure the distance between a correction target word and its surrounding context words. We sought to improve correction performance by handling words not included in the training corpus and by reflecting the contextual information surrounding the correction target. The embedding techniques used for correction were divided into word-based embeddings and embeddings that reflect contextual information. We performed correction experiments with these embedding techniques, focusing on the two improvement goals above, and obtained reliable correction performance.
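The distance-based correction idea can be sketched as scoring each correction candidate by its cosine similarity to the averaged embedding of the surrounding context words. The toy corpus, window, and candidate list below are illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["i", "drank", "a", "cup", "of", "coffee"],
          ["she", "poured", "a", "cup", "of", "tea"],
          ["he", "bought", "a", "copy", "of", "the", "book"]]
wv = Word2Vec(corpus, vector_size=50, min_count=1, sg=1, seed=0).wv

def best_candidate(context_words, candidates):
    """Rank candidates by cosine similarity to the mean context vector."""
    ctx = np.mean([wv[w] for w in context_words if w in wv], axis=0)
    ctx /= np.linalg.norm(ctx)
    def score(c):
        v = wv[c] / np.linalg.norm(wv[c])
        return float(ctx @ v)
    return max(candidates, key=score)

# was it "cup" or "copy" in "a ___ of coffee"?
print(best_candidate(["a", "of", "coffee"], ["cup", "copy"]))
```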
Analyzing Semantic Relations of Word Vectors trained by The Word2vec Model
http://doi.org/10.5626/JOK.2019.46.10.1088
As the use of artificial intelligence (AI) in natural language processing has increased, the importance of word embedding has grown significantly. This paper qualitatively analyzes the capability of word2vec models to represent semantic relations, specifically antonymy and hyponymy, based on clustering characteristics and t-SNE distributions. To this end, a K-means clustering algorithm was applied to a set of words drawn from 10 categories. Some antonyms were found not to be embedded properly, which is attributed to the fact that antonym pairs typically share many attributes and differ in only a few opposing ones. It was also observed that words in hyponymic relations were not properly embedded at all. This can be attributed to the fact that hyponymic relations rest on information gathered by learning a knowledge system, as opposed to the natural process of language acquisition. Thus, word2vec models based on the distributional hypothesis appear limited in representing certain antonymic relations and do not properly represent hyponymic relations at all.
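The analysis pipeline (K-means over word vectors, visualized with t-SNE) can be reproduced in a few lines. The word list and the pre-trained GloVe vectors below are stand-ins for the paper's word2vec model and its 10 word categories:

```python
import gensim.downloader as api
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

wv = api.load("glove-wiki-gigaword-50")       # stand-in pre-trained vectors
words = ["hot", "cold", "big", "small", "dog", "poodle", "animal", "cat"]
X = wv[words]                                  # (n_words, 50) matrix

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

for w, l, (x, y) in zip(words, labels, coords):
    print(f"{w:8s} cluster={l} t-SNE=({x:+.1f}, {y:+.1f})")
```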