Search : [ author: Kyuseok Shim ] (9)

Semi-Supervised Learning Exploiting Robust Loss Function for Sparse Labeled Data

Youngjun Ahn, Kyuseok Shim

http://doi.org/10.5626/JOK.2021.48.12.1343

This paper proposes a semi-supervised learning method that uses data augmentation and a robust loss function when labeled data are extremely sparse. Existing semi-supervised learning methods augment unlabeled data and use the one-hot label predicted by the current model only if the confidence of the prediction is high. Because such methods discard low-confidence data, a recent work has used low-confidence data in training by utilizing a robust loss function. However, if labeled data are extremely sparse, the prediction can be incorrect even when the confidence is high. In this paper, we propose a method to improve the performance of a classification model when labeled data are extremely sparse by using the predicted probability, instead of a one-hot vector, as the label. Experiments show that the proposed method improves the performance of a classification model.
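
The difference between a one-hot pseudo-label and a predicted-probability label can be sketched as follows. This is only an illustration under stated assumptions: plain cross-entropy stands in for the robust loss function, which the abstract does not specify, and the prediction vectors and function names are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

def one_hot(probs):
    """Hard pseudo-label: 1 at the most probable class, 0 elsewhere."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

# Hypothetical model outputs for one unlabeled example and its augmented view.
pred_original = [0.6, 0.3, 0.1]
pred_augmented = [0.5, 0.4, 0.1]

# Hard label: all target mass is placed on the top class, even if it is wrong.
hard_loss = cross_entropy(one_hot(pred_original), pred_augmented)

# Soft label: the predicted probabilities themselves serve as the target.
soft_loss = cross_entropy(pred_original, pred_augmented)
```

With the soft target, a mistaken top class contributes only in proportion to its predicted probability rather than with full weight.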

An Efficient and Differentially Private K-Means Clustering Algorithm Using the Voronoi Diagram

Daeyoung Hong, Kyuseok Shim

http://doi.org/10.5626/JOK.2020.47.9.879

Studies have recently been conducted on preventing the leakage of personal information from the analysis results of data. Among the proposed approaches, differential privacy is a widely studied standard since it guarantees rigorous and provable privacy preservation. In this paper, we propose an algorithm based on the Voronoi diagram to publish the results of K-means clustering for 2D data while guaranteeing differential privacy. Existing algorithms have the disadvantage that it is difficult to select the number of samples for the data, since the running time and the accuracy of the clustering results may change according to the number of samples. The proposed algorithm, however, can quickly provide an accurate clustering result without requiring such a parameter. We also demonstrate the performance of the proposed algorithm through experiments using real-life data.

Improving the Upper Bound of the Dynamic Time Warping for Sparse and Long Time Sequences

Janghyuk Seo, Woohwan Jung, Kyuseok Shim

http://doi.org/10.5626/JOK.2019.46.6.570

Dynamic Time Warping (DTW), a distance measure widely used in time series analysis, suffers from long execution times on long time sequences. To alleviate the problem, several algorithms which use a compression technique called run-length encoding and compute approximate DTW distances were recently proposed. However, the computation of the upper bounds by such algorithms involves adding unnecessary distance values. In this paper, we propose an approximation algorithm which improves the state of the art for computing approximate DTW distances while keeping the time complexity unchanged. Experimental results with both synthetic and real-life data confirm the effectiveness of our proposed algorithm.
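
For context, the two building blocks named in the abstract can be sketched as below: the textbook exact DTW recurrence and run-length encoding. This is not the paper's approximation algorithm or its upper bound; the function names and the absolute-difference point distance are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Exact DTW distance between numeric sequences, O(len(a) * len(b)) time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the DTW distance between a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def run_length_encode(seq):
    """Compress a sequence into (value, run_length) pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            runs.append((x, 1))
    return runs
```

The quadratic table is what makes exact DTW slow on long sequences, while run-length encoding shrinks sparse sequences in which the same value repeats many times in a row.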

Differentially Private k-Means Clustering based on Dynamic Space Partitioning using a Quad-Tree

Hanjun Goo, Woohwan Jung, Seongwoong Oh, Suyong Kwon, Kyuseok Shim

http://doi.org/10.5626/JOK.2018.45.3.288

There have recently been several studies investigating how to apply privacy-preserving techniques to publish data. Differential privacy can protect personal information regardless of an attacker’s background knowledge by adding probabilistic noise to the original data. To perform differentially private k-means clustering, the existing algorithm builds a differentially private histogram and performs k-means clustering on it. Since it constructs an equi-width histogram without considering the distribution of data, there are many buckets to which noise should be added. We propose a k-means clustering algorithm using a quad-tree that captures the distribution of data by using a small number of buckets. Our experiments show that the proposed algorithm outperforms the existing algorithm.
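
The histogram step can be illustrated with the standard Laplace mechanism. This is a minimal sketch, not the paper's quad-tree algorithm. It assumes neighboring datasets differ by adding or removing one record, so each bucket count has sensitivity 1. The function names and the `rng` parameter are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_histogram(counts, epsilon, rng=None):
    """Add Laplace(1 / epsilon) noise to every bucket count.

    Assumes sensitivity 1: one record changes exactly one bucket by 1.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# Every bucket receives noise, including empty ones, so a histogram with many
# buckets accumulates more total noise than one with few.
noisy = noisy_histogram([3, 0, 5, 0, 0, 0], epsilon=1.0, rng=random.Random(0))
```

Fewer buckets mean fewer noisy values, which is the motivation the abstract gives for replacing the equi-width histogram.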

Hybrid Word-Character Neural Network Model for the Improvement of Document Classification

Daeyoung Hong, Kyuseok Shim

http://doi.org/10.5626/JOK.2017.44.12.1290

Document classification, the task of classifying the category of each document based on its text, is one of the fundamental areas of natural language processing. Document classification may be used in various fields such as topic classification and sentiment classification. Neural network models for document classification can be divided into two categories: word-level models and character-level models, which treat words and characters as basic units, respectively. In this study, we propose a neural network model that combines character-level and word-level models to improve the performance of document classification. The proposed model extracts the feature vector of each word by combining information obtained from a word embedding matrix and information encoded by a character-level neural network. Based on the feature vectors of words, the model classifies documents with a hierarchical structure wherein recurrent neural networks with attention mechanisms are used at both the word and the sentence levels. Experiments on real-life datasets demonstrate the effectiveness of our proposed model.

Efficient Authentication of Aggregation Queries for Outsourced Databases

Jongmin Shin, Kyuseok Shim

http://doi.org/10.5626/JOK.2017.44.7.703

Database outsourcing offloads storage and computationally intensive tasks to a third-party server. Data owners can therefore manage big data and handle queries from clients without building a costly infrastructure. However, because of the insecurity of network systems, the third-party server may be untrusted, and thus the query results from the server may be tampered with. This problem has motivated significant research efforts on authenticating various queries such as range queries, kNN queries, and function queries. Although aggregation queries play a key role in analyzing big data, authenticating aggregation queries has not been extensively studied, and the previous works are not efficient for data with high dimension or a large number of distinct values. In this paper, we propose the AMR-tree, a data structure used to authenticate aggregation queries. We also propose an efficient proof construction method and a verification method with the AMR-tree. Furthermore, we validate the performance of the proposed algorithm by conducting various experiments while changing parameters such as the number of distinct values, the number of records, and the dimension of data.

A Traffic-Classification Method Using the Correlation of the Network Flow

YoungHoon Goo, Kyuseok Shim, Sungho Lee, Baraka D. Sija, MyungSup Kim

http://doi.org/

Presently, the ubiquitous emergence of high-speed network environments has led to a rapid increase in the variety of applications, making network traffic increasingly complicated. To manage networks efficiently, the traffic classification of specific units is essential. While various traffic-classification methods have been studied, a method for the complete classification of network traffic has not yet been developed. In this paper, a correlation model of the network flow is defined, and a traffic-classification method that uses this model is proposed. The proposed network-correlation model for traffic classification consists of a similarity model and a connectivity model. The effectiveness of the proposed method is demonstrated in terms of accuracy and completeness through experiments.

Inverse Document Frequency-Based Word Embedding of Unseen Words for Question Answering Systems

Wooin Lee, Gwangho Song, Kyuseok Shim

http://doi.org/

A question answering system (QA system) is a system that finds an actual answer to the question posed by a user, whereas a typical search engine would only find links to the relevant documents. Recent works on open-domain QA systems are receiving much attention in the fields of natural language processing, artificial intelligence, and data mining. However, prior works on QA systems simply replace all words that are not in the training data with a single token, even though such unseen words are likely to play crucial roles in differentiating the candidate answers from the actual answers. In this paper, we propose a method to compute vectors of such unseen words by taking into account the context in which the words have occurred. Next, we also propose a model which utilizes inverse document frequencies (IDF) to efficiently process unseen words by expanding the system’s vocabulary. Finally, we validate through experiments that the proposed method and model improve the performance of a QA system.
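
As background for the IDF weighting mentioned above, inverse document frequency can be computed as in the sketch below. It uses the common definition log(N / df) and does not reproduce how the paper's model applies IDF to unseen words. The function name and the toy corpus are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Return {word: log(N / df)} for a corpus given as lists of tokens."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each word at most once per document.
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

# Toy corpus: "the" occurs in every document, "quokka" in only one.
corpus = [["the", "cat"], ["the", "dog"], ["the", "quokka"]]
idf = inverse_document_frequency(corpus)
```

Words that appear in few documents receive high IDF values, consistent with the abstract's point that unseen words tend to be informative.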

Using a Greedy Algorithm for the Improvement of a MapReduce, Theta join, M-Bucket-I Heuristic

Wooyeol Kim, Kyuseok Shim

http://doi.org/

Theta join is one of the essential types of queries in database systems. As the amount of data to be processed increases, processing theta joins on a single machine becomes impractical. Therefore, theta join algorithms using distributed computing frameworks have been studied widely. Although one of the state-of-the-art theta-join algorithms uses the M-Bucket-I heuristic, it is hard to use since the running time of the M-Bucket-I heuristic, which computes a mapping from a record to a reducer (i.e., reducer mapping), is O(n) where n is the size of the input data. In this paper, we propose the MBI-I algorithm, which reduces the running time of the M-Bucket-I heuristic to O(rmax log n) and gives the same result as the M-Bucket-I heuristic does. We also conducted several experiments to evaluate our algorithm and confirmed that it can improve the performance of a theta join by 10%.


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr