Search : [ keyword: Word Embedding ] (22)

Biomedical Named Entity Recognition using Multi-head Attention with Highway Network

Minsoo Cho, Jinuk Park, Jihwan Ha, Chanhee Park, Sanghyun Park

http://doi.org/10.5626/JOK.2019.46.6.544

Biomedical named entity recognition (BioNER) is the process of extracting biomedical entities such as diseases, genes, proteins, and chemicals from biomedical literature. BioNER is an indispensable technique for extracting meaningful data from the biomedical domain. The proposed model employs a deep learning based Bi-LSTM-CRF architecture, which eliminates the need for hand-crafted feature engineering. Additionally, the model uses multi-head attention to capture the relevance between words, which is exploited when predicting the label of each input token. In the input embedding layer, the model integrates character-level embedding with word-level embedding and passes the combined embedding through a highway network, which adaptively carries each embedding to the input of the Bi-LSTM. Two English biomedical benchmark datasets were used to evaluate performance. The proposed model achieved a higher F1-score than previously studied models, demonstrating the effectiveness of the proposed methods for biomedical named entity recognition.
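
A minimal sketch (not the authors' code) of the highway layer described in the abstract, which adaptively combines character-level and word-level embeddings before the Bi-LSTM; all dimensions and names are illustrative assumptions.

import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)        # T(x), the carry/transform gate

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        # y = t * H(x) + (1 - t) * x : the gate decides how much of the
        # combined embedding is transformed and how much is carried through.
        return t * h + (1.0 - t) * x

# Toy usage: concatenate word- and character-level embeddings, pass them
# through the highway layer, then feed a Bi-LSTM.
word_emb = torch.randn(2, 7, 100)   # (batch, seq_len, word_dim)
char_emb = torch.randn(2, 7, 50)    # (batch, seq_len, char_dim)
combined = torch.cat([word_emb, char_emb], dim=-1)
out = Highway(150)(combined)
bilstm = nn.LSTM(150, 200, batch_first=True, bidirectional=True)
hidden, _ = bilstm(out)
print(hidden.shape)  # torch.Size([2, 7, 400])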

Comparative Analysis of Various Korean Morpheme Embedding Models using Massive Textual Resources

Da-Bin Lee, Sung-Pil Choi

http://doi.org/10.5626/JOK.2019.46.5.413

Word embedding is a transformation technique that enables a computer to handle natural language. It is used in various machine learning based fields of natural language processing, such as machine translation and named-entity recognition. Various word embedding models are available; however, few studies have compared their performance under similar conditions. In this paper, we compare and analyze the performance of Word2Vec Skip-gram and CBOW, GloVe, and FastText, all actively used models, with respect to Korean morpheme spacing. Experiments on a large news corpus and the Sejong corpus show that FastText yielded the best performance among the four models.
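
A minimal sketch, under assumed settings, of training the compared embedding models on a morpheme-tokenized Korean corpus with gensim (GloVe would require a separate toolkit); the toy corpus and hyperparameters are illustrative only.

from gensim.models import Word2Vec, FastText

# Each sentence is a list of morphemes; different spacing/tokenization
# variants would change how this list is produced.
corpus = [["나", "는", "학교", "에", "가", "ㄴ다"],
          ["그", "는", "도서관", "에", "가", "ㄴ다"]]

cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)   # CBOW
skip = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)   # Skip-gram
ft   = FastText(corpus, vector_size=100, window=5, min_count=1, sg=1)   # subword n-grams

# Compare nearest neighbours of the same morpheme across models.
for name, model in [("CBOW", cbow), ("Skip-gram", skip), ("FastText", ft)]:
    print(name, model.wv.most_similar("학교", topn=3))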

Word Embedding using Relative Position Information between Words

Hyunsun Hwang, Changki Lee, HyunKi Jang, Dongho Kang

http://doi.org/10.5626/JOK.2018.45.9.943

In word embedding, which is used to apply deep learning to natural language processing, each word is represented in a vector space. This has the advantage of dimensionality reduction, whereby similar words have similar vector values. Word embedding must be learned from a large-scale corpus to achieve good performance. However, the widely used word2vec model does not exploit the relative position information between words, because the model is simplified for learning from large corpora and mainly captures word co-occurrence rates. In this paper, we modified the existing word embedding learning model so that it can learn from relative position information between words. Experimental results show that the word-analogy performance of the modified model improves when word embedding is learned with relative position information between words.
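
A minimal sketch of one way to inject relative position information into a skip-gram-style objective; this is an assumption for illustration, not the paper's exact model. Each relative offset gets its own context table, so the score of a (center, context) pair depends on where the context word appears.

import torch
import torch.nn as nn

vocab, dim, window = 1000, 100, 2
word_emb = nn.Embedding(vocab, dim)
# One context embedding table per relative position in [-window, window] \ {0}.
ctx_emb = nn.ModuleList(nn.Embedding(vocab, dim) for _ in range(2 * window))

def pos_index(d):
    # Map relative offset d (e.g. -2, -1, 1, 2) to a table index 0..2*window-1.
    return d + window if d < 0 else d + window - 1

def score(center_id, context_id, d):
    # Dot product between the center word vector and the position-specific
    # context vector; training would raise this score for observed pairs.
    v = word_emb(torch.tensor([center_id]))
    c = ctx_emb[pos_index(d)](torch.tensor([context_id]))
    return (v * c).sum()

print(score(3, 17, -1).item())  # word 17 one position to the left of word 3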

Multi-sense Word Embedding to Improve Performance of a CNN-based Relation Extraction Model

Sangha Nam, Kijong Han, Eun-kyung Kim, Sunggoo Kwon, Yoosung Jung, Key-Sun Choi

http://doi.org/10.5626/JOK.2018.45.8.816

The relation extraction task is to classify the relation between two entities in an input sentence and is important in natural language processing and knowledge extraction. Many studies have designed relation extraction models using the distant supervision method. Recently, deep learning based relation extraction models such as CNNs and RNNs have become mainstream. However, existing studies do not solve the homograph problem of the word embedding used as input to the model. As a result, training proceeds with a single embedding value for homographs that have different meanings; that is, the relation extraction model is trained without accurately grasping the meaning of each word. In this paper, we propose a relation extraction model using multi-sense word embedding. To learn multi-sense word embedding, we used a word sense disambiguation module based on the CoreNet concept hierarchy, and the relation extraction model used CNN and PCNN models to learn key words in sentences.
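
A hypothetical sketch of the input layer only: after a word sense disambiguation step, each token is mapped to a sense-specific id, so a homograph receives different vectors in different contexts before the CNN. The sense inventory and the disambiguate() placeholder are illustrative assumptions, not the CoreNet-based module.

import torch
import torch.nn as nn

# Toy sense inventory distinguishing senses of the homograph "배" (pear/ship).
sense_vocab = {("배", "fruit"): 0, ("배", "ship"): 1, ("먹다", "eat"): 2}
sense_emb = nn.Embedding(len(sense_vocab), 50)

def disambiguate(token, context):
    # Placeholder WSD: a real system would use CoreNet concepts and context.
    return "fruit" if "먹" in "".join(context) else "ship"

context = ["배", "먹다"]
sense_of_bae = disambiguate("배", context)            # "fruit" in an eating context
ids = torch.tensor([sense_vocab[("배", sense_of_bae)], sense_vocab[("먹다", "eat")]])

x = sense_emb(ids).unsqueeze(0)          # (1, seq_len, emb_dim)
conv = nn.Conv1d(50, 64, kernel_size=2)  # CNN over the sense-aware embeddings
features = torch.relu(conv(x.transpose(1, 2)))
print(features.shape)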

Combinations of Text Preprocessing and Word Embedding Suitable for Neural Network Models for Document Classification

Yeongsu Kim, Seungwoo Lee

http://doi.org/10.5626/JOK.2018.45.7.690

Neural networks with word embedding have recently been used for document classification. Researchers concentrate on designing new architectures or optimizing model parameters to increase performance. However, most recent studies have overlooked text preprocessing and word embedding: the text preprocessing used is insufficiently described, and a particular pretrained word embedding model is usually adopted without any clear justification. Our paper shows that finding a suitable combination of text preprocessing and word embedding can be one of the important factors for enhancing performance. We conducted experiments on the AG’s News dataset to compare the possible combinations, as well as zero versus random padding and the presence or absence of fine-tuning. We used pretrained word embedding models such as skip-gram, GloVe, and fastText. For diversity, we also used an average of multiple pretrained embeddings (Average), randomly initialized embedding (Random), and a skip-gram model trained on the task data (AGNews-Skip). In addition, we used three advanced neural networks for the sake of generality. Experimental results, based on out-of-vocabulary (OOV) word statistics, suggest the necessity of these comparisons and a suitable combination of text preprocessing and word embedding.
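
A minimal sketch, under assumed names, of the kind of choices compared here: building an embedding matrix where OOV words get zero or random vectors, and toggling fine-tuning by freezing the embedding layer. The pretrained-vector dict is a stub standing in for skip-gram/GloVe/fastText vectors.

import numpy as np
import torch
import torch.nn as nn

pretrained = {"news": np.ones(50, dtype=np.float32)}   # stub for a pretrained lookup
vocab = {"<pad>": 0, "news": 1, "oov_word": 2}

def build_matrix(oov="random"):
    mat = np.zeros((len(vocab), 50), dtype=np.float32)
    for word, idx in vocab.items():
        if word in pretrained:
            mat[idx] = pretrained[word]
        elif word != "<pad>" and oov == "random":
            mat[idx] = np.random.uniform(-0.25, 0.25, 50)  # random init for OOV words
    return mat

emb = nn.Embedding.from_pretrained(torch.from_numpy(build_matrix("random")),
                                   freeze=False,   # freeze=True would disable fine-tuning
                                   padding_idx=0)
print(emb(torch.tensor([1, 2])).shape)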

Morpheme-based Efficient Korean Word Embedding

Dongjun Lee, Yubin Lim, Ted “Taekyoung” Kwon

http://doi.org/10.5626/JOK.2018.45.5.444

Previous word embedding models such as word2vec and GloVe are unable to learn the internal structure of words. This is a serious limitation for morphologically rich agglutinative languages such as Korean. In this paper, we propose a new model that extends the previous skip-gram model. It defines each word vector as the sum of its morpheme vectors and hence learns the vectors of morphemes. To evaluate our embedding, we conducted a word similarity test and a word analogy test. Furthermore, we applied the trained vectors to other NLP tasks to measure how much the performance actually improved.
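
A minimal numpy sketch (an illustration, not the authors' implementation) of the core idea: a word's vector is composed as the sum of its morpheme vectors, so a skip-gram objective would update morpheme vectors that are shared across words.

import numpy as np

dim = 100
morph_vocab = {"먹": 0, "었": 1, "다": 2, "가": 3}
morph_vecs = np.random.uniform(-0.1, 0.1, (len(morph_vocab), dim))

def word_vector(morphemes):
    # Compose a word's vector from its morpheme vectors.
    return sum(morph_vecs[morph_vocab[m]] for m in morphemes)

v_ate = word_vector(["먹", "었", "다"])   # "먹었다" composed from 먹 + 었 + 다
v_went = word_vector(["가", "었", "다"])  # shares the 었/다 morphemes
cos = v_ate @ v_went / (np.linalg.norm(v_ate) * np.linalg.norm(v_went))
print(round(float(cos), 3))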

Assignment Semantic Category of a Word using Word Embedding and Synonyms

Da-Sol Park, Jeong-Won Cha

http://doi.org/10.5626/JOK.2017.44.9.946

Semantic role decision defines the semantic relationship between a predicate and its arguments in natural language processing (NLP) tasks. Both semantic role information and semantic category information should be used to make semantic role decisions. The Sejong Electronic Dictionary contains frame information that is used to determine semantic roles. In this paper, we propose a method to extend the Sejong Electronic Dictionary using word embedding and synonyms. The same experiment is performed using existing word embedding vectors and retrofitted vectors. For words that do not appear in the Sejong Electronic Dictionary, the system achieves 32.19% for semantic category assignment and 51.14% for extended semantic category assignment when using word embedding, and 33.33% and 53.88%, respectively, when using retrofitted vectors. We also show that assigning semantic categories to new words that lack them helps extend the semantic category vocabulary of the Sejong Electronic Dictionary.
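
A minimal sketch of retrofitting as used to pull synonym vectors together, following the standard retrofitting update; the toy vectors and synonym lexicon are illustrative and do not come from the Sejong Electronic Dictionary.

import numpy as np

vecs = {"집": np.array([1.0, 0.0]), "주택": np.array([0.0, 1.0]), "학교": np.array([0.5, 0.5])}
synonyms = {"집": ["주택"], "주택": ["집"], "학교": []}

def retrofit(vecs, synonyms, iters=10, alpha=1.0, beta=1.0):
    new = {w: v.copy() for w, v in vecs.items()}
    for _ in range(iters):
        for w, nbrs in synonyms.items():
            if not nbrs:
                continue
            # Move each vector toward the average of its synonyms while
            # staying close to its original (pretrained) vector.
            new[w] = (alpha * vecs[w] + beta * sum(new[n] for n in nbrs)) / (alpha + beta * len(nbrs))
    return new

print(retrofit(vecs, synonyms)["집"])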

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, Jong C. Park

http://doi.org/

We present a system that automatically annotates unverified Web sentences with information from credible sources. The system applies neural theorem proving to the task of annotating cancer-related Wikipedia data (1,486 propositions) with Korean National Cancer Center data (19,304 propositions). By switching the recursive module in a neural theorem prover to a word embedding module, we overcome the fundamental problem of excessive learning time. In an identical environment, the original neural theorem prover was estimated to require 233.9 days of learning time, whereas the revised neural theorem prover took only 102.1 minutes. We demonstrate that a neural theorem prover, which encodes a proposition as a tensor, subsumes a classic theorem prover for exact matches and enables end-to-end differentiable logic over analogous words.
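
A hypothetical sketch of the key substitution described in the abstract: symbol matching during proving is scored by word-embedding similarity, so exact matches score 1 while "analogous" words score close to 1, keeping the proof differentiable. The vectors and terms below are toy placeholders.

import numpy as np

emb = {"암": np.array([1.0, 0.0]), "종양": np.array([0.9, 0.1]), "감기": np.array([0.0, 1.0])}

def soft_unify(a, b):
    # Cosine similarity as a soft, differentiable replacement for exact symbol match.
    va, vb = emb[a], emb[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(soft_unify("암", "암"))    # exact match keeps score 1.0
print(soft_unify("암", "종양"))  # an analogous word scores close to 1
print(soft_unify("암", "감기"))  # an unrelated word scores near 0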

Korean Named Entity Recognition and Classification using Word Embedding Features

Yunsu Choi, Jeongwon Cha

http://doi.org/

Named Entity Recognition and Classification (NERC) is the task of recognizing and classifying named entities such as a person's name, a location, or an organization. Various studies have been carried out on Korean NERC, but they have some problems, for example a lack of features compared with English NERC. In this paper, we propose a method that uses word embedding as features for Korean NERC. We generate word vectors using a Continuous Bag-of-Words (CBOW) model from a POS-tagged corpus, and word cluster symbols by applying the K-means algorithm to the word vectors. We use the word vectors and word cluster symbols as word embedding features in Conditional Random Fields (CRFs). In the experiments, performance improved by 1.17%, 0.61%, and 1.19% over the baseline system for the TV, Sports, and IT domains, respectively. The system also outperformed other NERC systems, demonstrating the effectiveness and efficiency of the proposed method.
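
A minimal sketch, under assumed settings, of deriving the two feature types named in the abstract: CBOW word vectors and K-means cluster symbols, which could then be added as categorical features to a CRF tagger (the CRF step itself is omitted).

from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [["손흥민/NNP", "이/JKS", "골/NNG", "을/JKO", "넣/VV", "었다/EP+EF"],
          ["류현진/NNP", "이/JKS", "공/NNG", "을/JKO", "던지/VV", "었다/EP+EF"]]

w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)  # CBOW
words = list(w2v.wv.index_to_key)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(w2v.wv[words])

# Each token now has a dense vector and a coarse cluster symbol usable as a
# CRF feature, e.g. "CLUSTER=2".
cluster_of = dict(zip(words, clusters))
print({w: int(c) for w, c in cluster_of.items()})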

Improving The Performance of Triple Generation Based on Distant Supervision By Using Semantic Similarity

Hee-Geun Yoon, Su Jeong Choi, Seong-Bae Park

http://doi.org/

Existing pattern-based triple generation systems based on distant supervision can be flawed by the distant supervision assumption. To mitigate this excessive assumption, previous studies have commonly used statistical information to measure the confidence of patterns. In this study, we propose a more accurate confidence measure based on the semantic similarity between patterns and properties. Word embedding, an unsupervised learning method, and WordNet-based similarity measures were adopted to learn word meanings and to measure semantic similarity. To resolve the language mismatch between patterns and properties, we adopted CCA to align bilingual word embedding models and a translation-based approach for the WordNet-based measure. Our experiments indicated that the accuracy of triples filtered by the semantic similarity based confidence measure was 16% higher than that of the statistics-based approach. These results suggest that the semantic similarity based confidence measure is more effective than the statistics-based approach for generating high-quality triples.
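
A minimal sketch, with toy vectors, of the two pieces described: CCA aligns the two languages' embeddings through a small seed lexicon of paired vectors, and cosine similarity in the shared space serves as a semantic-similarity-based confidence score. Real inputs would be pretrained monolingual embeddings for the pattern and property vocabularies.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
ko_seed = rng.normal(size=(60, 20))                                   # Korean side of seed pairs
en_seed = ko_seed @ rng.normal(size=(20, 15)) + 0.01 * rng.normal(size=(60, 15))  # English side

cca = CCA(n_components=5)
cca.fit(ko_seed, en_seed)

def confidence(ko_vec, en_vec):
    # Project both vectors into the shared CCA space and use cosine similarity
    # as the confidence of a (pattern, property) pair.
    k, e = cca.transform(ko_vec.reshape(1, -1), en_vec.reshape(1, -1))
    k, e = k.ravel(), e.ravel()
    return float(k @ e / (np.linalg.norm(k) * np.linalg.norm(e)))

print(round(confidence(ko_seed[0], en_seed[0]), 3))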

