Search : [ keyword: Topic Model ] (5)

Infinite Latent Topic Models for Document Analysis

Bong-Kee Sin

http://doi.org/10.5626/JOK.2018.45.7.701

Because the concept of a topic is highly abstract, what characterizes the topics of a text is not clearly defined. Depending on the context or needs of the problem, topics may be described at various levels of detail, which makes automatic document analysis difficult. This paper presents two infinite-topic extensions of the well-known Latent Dirichlet Allocation (LDA) model: the infinite Latent Dirichlet Topic model and the infinite Latent Markov Topic model. The first model relaxes LDA's constraint of a fixed, known number of topics by using the Dirichlet process. The second model extends the first with Markov dynamics that capture the sequential evolution of topics in a text. Both models are theoretically rigorous and structurally flexible, and can capture document organization at a desired level of topics. A set of experiments shows interesting results, with more intuitive topic characterization and local stationarity properties than related models using Gibbs sampling and variational inference.
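
As a rough illustration of how a Dirichlet process removes the need to fix the number of topics in advance, the sketch below samples topic labels with a Chinese restaurant process, one standard representation of the Dirichlet process. It is not the inference procedure of the paper; the function name `crp_assignments` and the concentration value `alpha=1.0` are assumptions chosen for the example.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample topic labels for n_items via a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values open
    new topics more readily. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items already assigned to topic k
    labels = []
    for _ in range(n_items):
        # An existing topic k has weight counts[k]; a new topic has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a previously unused topic
        else:
            counts[k] += 1        # join an existing topic
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n_items=200, alpha=1.0)
print(len(counts), "topics used for", len(labels), "items")
```

The number of topics is an outcome of sampling rather than an input, which is the property the infinite models rely on.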

Analysis System for SNS Issues per Country based on Topic Model

Seong Hoon Kim, Ji Won Yoon

http://doi.org/

As the use of social networking services (SNS) continues to increase, various related studies have been conducted. Because topic models have proven effective for theme extraction, a large number of studies on topic model based analysis have been introduced. In this research, we propose an automated system that analyzes the topics of each country and their distribution on Twitter by combining world map visualization with an issue matching method. The system consists of three core modules: 1) collection of tweets and classification by nation, 2) extraction of topics and their distribution by country using a topic model algorithm, and 3) visualization of topics and their distribution using Google GeoChart. In experiments with the USA and the UK, we identified the issues of the two nations and how they changed. Based on these results, we analyzed the differences in each nation's position on the ISIS problem.

Automatic Prioritization of Requirements using Topic Modeling and Stakeholder Needs-Artifacts

Jong-In Jang, Jongmoon Baik

http://doi.org/

Because of limits on the budget, resources, and time invested in a project, software requirements should be prioritized and implemented in order of importance. Existing approaches to prioritizing requirements mostly depend on human decisions. Manual prioritization relies on intensive interaction with stakeholders, which raises issues of scalability and biased prioritization. To solve these problems, we propose a fully automated requirements prioritization approach, ToMSN (Topic Modeling Stakeholder Needs for requirements prioritization), which applies topic modeling to the stakeholder needs-artifacts gathered in the requirements elicitation phase. The requirements dataset of a 30,000-user system was used for the performance evaluation. ToMSN achieved prioritization accuracy competitive with existing approaches without human assistance, thereby addressing the scalability and biased prioritization issues.

Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling

JuRyong Cheon, YoungJoong Ko

http://doi.org/

In this paper, we propose a method for building a Korean-English parallel corpus from Wikipedia by finding similar sentences using language resources and topic modeling. We first applied language resources (a Wiki-dictionary, numbers, and the Daum online dictionary) to match words sequentially. We constructed the Wiki-dictionary from Wikipedia titles. To take advantage of Wikipedia, we used the translation probabilities in the Wiki-dictionary for word matching. In addition, we improved the accuracy of the sentence similarity measure by using word distributions based on topic modeling. In the experiments, a previous study achieved an F1-score of 48.4% with only language resources based on linear combination, and 51.6% when topic modeling over the entire word distribution was added. In contrast, our proposed sequential matching, which adds translation probability to the language resources, achieved 58.3%, 9.9 percentage points higher than the previous study. When the proposed sequential matching of language resources was combined with topic modeling that considers important word distributions, the proposed system achieved 59.1%, 7.5 percentage points higher than the previous study.
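
The reported gains can be read as absolute differences in F1-score (percentage points) rather than relative improvements. The short check below reproduces that arithmetic from the four F1 values quoted in the abstract; the variable names are assumptions made for the example.

```python
# F1-scores (in %) quoted in the abstract.
previous_resources_only = 48.4   # previous study, language resources only
previous_with_topics = 51.6      # previous study, plus topic modeling
proposed_sequential = 58.3       # proposed sequential matching
proposed_with_topics = 59.1      # proposed matching plus topic modeling

# Absolute gains in percentage points.
gain_sequential = round(proposed_sequential - previous_resources_only, 1)
gain_with_topics = round(proposed_with_topics - previous_with_topics, 1)

# Relative gains, for comparison, as a percentage of the earlier score.
relative_sequential = 100 * gain_sequential / previous_resources_only
relative_with_topics = 100 * gain_with_topics / previous_with_topics

print(gain_sequential, gain_with_topics)
print(round(relative_sequential, 1), round(relative_with_topics, 1))
```

The absolute gains match the 9.9 and 7.5 figures in the abstract, which is why they are best described as percentage points.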

Semantic Dependency Link Topic Model for Biomedical Acronym Disambiguation

Seonho Kim, Juntae Yoon, Jungyun Seo

http://doi.org/

Many important terms in biomedical text are expressed as abbreviations or acronyms. We propose a semantic link topic model, based on the concepts of topic and dependency link, to disambiguate biomedical abbreviations and to cluster long-form variants of abbreviations that refer to the same sense. The model is a generative model inspired by the latent Dirichlet allocation (LDA) topic model, in which each document is viewed as a mixture of topics and each topic is characterized by a distribution over words. Thus, the words of a document are generated from its hidden topic structure, and the topic structure is inferred from the observable word sequences of document collections. In this study, we allow two distinct modes of word generation in order to incorporate semantic dependencies between words, particularly between expansions (long forms) of abbreviations and the words that co-occur with them in a sentence. In addition to topic information, the semantic dependency between words is defined as a link, and a new random parameter for link presence is assigned to each word. As a result, the most probable expansions of the abbreviations in a given abstract are determined by the word-topic distribution, document-topic distribution, and word-link distribution estimated from the document collection through the semantic dependency link topic model. Abstracts retrieved from the MEDLINE Entrez interface with queries for 22 abbreviations and their 186 expansions were used as the data set. The link topic model correctly predicted the expansions of abbreviations with an accuracy of 98.30%.
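
The generative view of plain LDA described above (a document as a mixture of topics, a topic as a distribution over words) can be sketched as follows. This covers only the standard LDA part, not the link extension proposed in the paper; the toy vocabulary, the two-topic setup, and the helper names `sample_dirichlet` and `generate_document` are assumptions for illustration.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topic_word, vocab, doc_alpha, n_words):
    """Generate one document under the standard LDA generative process."""
    n_topics = len(topic_word)
    # Document-topic distribution: the document's mixture over topics.
    theta = sample_dirichlet(rng, [doc_alpha] * n_topics)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]   # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]     # pick a word from it
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
vocab = ["gene", "protein", "cell", "court", "law", "judge"]  # toy vocabulary
# Topic-word distributions: one distribution over the vocabulary per topic.
topic_word = [sample_dirichlet(rng, [1.0] * len(vocab)) for _ in range(2)]
theta, topics, words = generate_document(rng, topic_word, vocab,
                                         doc_alpha=1.0, n_words=20)
print(theta)
print(words)
```

Inference runs in the opposite direction: given only the observed words, the hidden `theta` and topic assignments are estimated.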


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr