Search: [ author: Yang-Sae Moon ] (7 results)

An Automatic Framework for Nested Normalization and Table Migration of Large-Scale Hierarchical Data

Dasol Kim, Myeong-Seon Gil, Heesun Won, Yang-Sae Moon

http://doi.org/10.5626/JOK.2023.50.6.521

In open data portals, a large amount of data is distributed in hierarchical JSON and XML formats, and its scale is very large. Because of its structural characteristics, such hierarchical data contains multiple levels of nesting. As a result, nested table normalization and scale limitation problems can occur, which limits the utilization of large-scale open data. In this paper, we adopt Airbyte, an open-source ELT platform, for table migration of hierarchical files and propose a new framework that automates this migration. This is the first study to report Airbyte's nested JSON handling issue and to contribute to solving it. Through an extensive evaluation of the proposed framework on actual US data portals, we show that it operates correctly even for structures with multiple levels of nesting and, through its automated processing logic, can handle large-scale migrations of 1.6K or more. These results indicate that the proposed framework is a highly practical one that supports nested normalization of hierarchical data and provides reliable large-scale migration.
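Airbyte's actual normalization logic is more involved than an abstract can convey; as a minimal, self-contained sketch of the underlying idea (all function and key names below are hypothetical, not from the paper or from Airbyte), nested normalization recursively splits a nested JSON record into flat parent/child tables linked by generated keys:

```python
def normalize(record, table, tables, parent_key=None, parent_id=None):
    """Recursively split one nested JSON record into flat per-table rows,
    linking child rows to their parent with a generated key."""
    tables.setdefault(table, [])
    row_id = len(tables[table]) + 1
    row = {"_id": row_id}
    if parent_key is not None:
        row["_%s_id" % parent_key] = parent_id
    for key, value in record.items():
        if isinstance(value, dict):            # nested object -> child table
            normalize(value, "%s_%s" % (table, key), tables, table, row_id)
        elif isinstance(value, list):          # nested array -> child rows
            child = "%s_%s" % (table, key)
            for item in value:
                if isinstance(item, dict):
                    normalize(item, child, tables, table, row_id)
                else:                          # scalar array element
                    tables.setdefault(child, []).append(
                        {"_%s_id" % table: row_id, "value": item})
        else:                                  # scalar -> plain column
            row[key] = value
    tables[table].append(row)
    return tables

doc = {"name": "dataset", "meta": {"format": "json"},
       "columns": [{"name": "id"}, {"name": "price"}]}
tables = normalize(doc, "root", {})
```

Running this on the sample record yields three flat tables (`root`, `root_meta`, `root_columns`), each child row carrying a foreign key back to its parent.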

Performance Improvement of Distributed Parallel Graph Data Processing in InfiniBand Networks

Hyeongjong Kim, Myeong-Seon Gil, Yang-Sae Moon

http://doi.org/10.5626/JOK.2023.50.4.359

Graph data, which captures the relationships among objects, is widely used to discover new rules or associations that cannot be found in relational databases. However, its complex structure and massive size limit high-speed processing. In this paper, we propose PIGraph (Pregel- and InfiniBand-based Graph processing engine) to improve the processing performance of graph data. PIGraph is an advanced graph processing engine based on Pregel, a representative graph processing model. PIGraph supports a distributed parallel structure using InfiniBand and RDMA (Remote Direct Memory Access) technology to reduce the management complexity of distributed graph processing. In particular, PIGraph improves processing performance by optimizing RDMA communication with segment-based transmissions. Experimental results show that PIGraph improves processing time by up to 190% compared to Apache Giraph.
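PIGraph's InfiniBand/RDMA layer cannot be reproduced in a few lines, but the Pregel model it builds on can: vertices exchange messages in synchronized supersteps and halt when no vertex changes state. The toy single-process sketch below (names are illustrative, not PIGraph's API) finds connected components by minimum-label propagation:

```python
def pregel_min_label(edges, num_vertices, max_supersteps=30):
    """Toy Pregel-style BSP loop: in each superstep every active vertex
    sends its label to its neighbors, and a vertex reactivates only if
    an incoming label is smaller than its own (connected components)."""
    adj = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = {v: v for v in range(num_vertices)}
    active = set(range(num_vertices))
    for _ in range(max_supersteps):
        # Message-passing phase: active vertices send labels to neighbors.
        inbox = {}
        for v in active:
            for w in adj[v]:
                inbox.setdefault(w, []).append(label[v])
        # Compute phase: adopt the smallest incoming label if it improves.
        active = set()
        for v, msgs in inbox.items():
            m = min(msgs)
            if m < label[v]:
                label[v] = m
                active.add(v)
        if not active:            # global halt: no vertex changed
            break
    return label

labels = pregel_min_label([(0, 1), (1, 2), (3, 4)], 5)
```

In a real Pregel engine the inbox exchange is the network step; PIGraph's contribution is doing that step over RDMA with segment-based transmissions.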

Distributed Processing of Deep Learning Inference Models for Data Stream Classification

Hyojong Moon, Siwoon Son, Yang-Sae Moon

http://doi.org/10.5626/JOK.2021.48.10.1154

The increased generation of data streams has led to increased utilization of deep learning. To classify data streams using deep learning, we need to execute the model in real time through serving. Unfortunately, the serving model incurs long latency due to gRPC or HTTP communication. In addition, if the serving model uses a stacking ensemble method with high complexity, an even longer latency occurs. To address this challenge, we propose distributed processing solutions for data stream classification using Apache Storm. First, we propose a real-time distributed inference method based on Apache Storm to reduce the long latency of the existing serving method. Our experimental results showed that the proposed distributed inference method reduces the latency by up to 11 times compared to the existing serving method. Second, to reduce the long latency of the stacking-based inference model for detecting malicious URLs, we propose four distributed processing techniques for classifying URL streams in real time: Independent Stacking, Sequential Stacking, Semi-Sequential Stacking, and Stepwise-Independent Stacking. Our experiments showed that Stepwise-Independent Stacking, which combines characteristics of independent execution and sequential processing, classifies URL streams with the shortest latency.
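As a rough illustration of the stacking idea (not the actual Storm topology, and with entirely hypothetical base and meta models), the sketch below contrasts independent (parallel) and sequential execution of the base classifiers whose scores a meta-classifier then combines:

```python
from concurrent.futures import ThreadPoolExecutor

def base_model(weight):
    """Hypothetical base classifier: maps a URL feature vector to a score."""
    return lambda x: min(1.0, weight * sum(x) / len(x))

base_models = [base_model(w) for w in (0.5, 0.8, 1.1)]

def meta_model(scores):
    """Hypothetical meta-learner: thresholded average of base scores."""
    return "malicious" if sum(scores) / len(scores) > 0.5 else "benign"

def independent_stacking(x):
    """Run all base models concurrently (one task each), then combine --
    mirrors the idea of executing the base stage in parallel."""
    with ThreadPoolExecutor(max_workers=len(base_models)) as pool:
        scores = list(pool.map(lambda m: m(x), base_models))
    return meta_model(scores)

def sequential_stacking(x):
    """Run base models one after another on the same worker."""
    return meta_model([m(x) for m in base_models])

features = [0.9, 0.7, 0.8]
```

Both variants produce the same classification; they differ only in where the base-model latency is paid, which is exactly the trade-off the four proposed techniques navigate.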

Optimization of Distributed Binary Bernoulli Sampling

Wonhyeong Cho, Myeong-Seon Gil, Namsu Ju, Yang-Sae Moon

http://doi.org/10.5626/JOK.2019.46.12.1322

This paper proposes a method to improve the performance of Binary Bernoulli Sampling (BBS). BBS is a sampling technique suitable for a multi-source stream environment. Accordingly, a recent approach has been proposed for distributed processing of BBS based on Apache Storm with a multi-coordinator structure. However, this approach causes an additional coordinator waiting problem, which limits the performance improvement. In this paper, we solve the coordinator waiting problem by introducing a multi-distribution structure and a distributor separation structure. The multi-distribution structure enables multiple coordinators, rather than one, to participate in the distribution, minimizing coordinator waiting time. The distributor separation structure moves the distributing function from the coordinators to dedicated distributors, maximizing processing performance. We perform various experiments by implementing the proposed structures on Storm-based distributed BBS. The experimental results show that our structures improve performance by up to 90 times compared to the previous distributed BBS.
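The coordinator/distributor machinery of distributed BBS is beyond a short example, but the basic Bernoulli sampling step it builds on can be sketched as follows (a plain single-stream version, not the paper's distributed design):

```python
import random

def bernoulli_sample(stream, p, seed=None):
    """Plain Bernoulli sampling: include each stream element
    independently with probability p, preserving arrival order.
    Distributed BBS parallelizes this step across multiple sources."""
    rng = random.Random(seed)
    return [x for x in stream if rng.random() < p]

sample = bernoulli_sample(range(10000), 0.1, seed=42)
```

With p = 0.1 over 10,000 elements the sample size concentrates around 1,000; the challenge the paper addresses is keeping this step fast when many source streams feed coordinators concurrently.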

Secure Multiparty Computation of Principal Component Analysis

Sang-Pil Kim, Sanghun Lee, Myeong-Seon Gil, Yang-Sae Moon, Hee-Sun Won

http://doi.org/

In recent years, many research efforts have been made on privacy-preserving data mining (PPDM) over large volumes of data. In this paper, we propose a PPDM solution based on principal component analysis (PCA), which can be widely used in computing correlations among sensitive data sets. The general method of computing PCA is to collect all the data spread across multiple nodes into a single node before starting the PCA computation; however, this approach discloses the sensitive data of individual nodes, involves a large amount of computation, and incurs large communication overheads. To solve this problem, we present an efficient method that securely computes PCA without collecting all the data. The proposed method shares only limited information among individual nodes, yet obtains the same result as the original PCA. In addition, we present a dimensionality reduction technique for the proposed method and use it to improve the performance of secure similar document detection. Finally, through various experiments, we show that the proposed method works effectively and efficiently on large amounts of multi-dimensional data.
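The secure-sharing protocol itself is not reproduced here, but one reason such a result is possible is that the global covariance matrix decomposes into per-node sums, so centralized PCA can be recovered from compact per-node summaries instead of raw data. The sketch below (variable names are illustrative; a real protocol would additionally protect the summaries themselves) demonstrates that identity:

```python
import numpy as np

def local_shares(X):
    """Each node publishes only (count, column sums, Gram matrix) --
    never its raw rows."""
    return X.shape[0], X.sum(axis=0), X.T @ X

def aggregate_pca(shares):
    """Combine per-node shares into the global covariance and
    eigendecompose it -- numerically identical to centralized PCA."""
    n = sum(s[0] for s in shares)
    total = sum(s[1] for s in shares)
    gram = sum(s[2] for s in shares)
    mean = total / n
    cov = gram / n - np.outer(mean, mean)     # E[xx^T] - mean mean^T
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals[::-1], eigvecs[:, ::-1]    # descending eigenvalues

rng = np.random.default_rng(0)
nodes = [rng.normal(size=(50, 3)) for _ in range(3)]   # 3 data owners
vals, vecs = aggregate_pca([local_shares(X) for X in nodes])

# Reference: the same eigenvalues from pooling all rows in one place.
pooled = np.vstack(nodes)
ref = np.linalg.eigh(np.cov(pooled.T, bias=True))[0][::-1]
```

The aggregated eigenvalues match the pooled-data computation exactly, which is the "same result as the original PCA" property the abstract claims.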

Efficient Multi-Step k-NN Search Methods Using Multidimensional Indexes in Large Databases

Sanghun Lee, Bum-Soo Kim, Mi-Jung Choi, Yang-Sae Moon

http://doi.org/

In this paper, we address the problem of improving the performance of multi-step k-NN search using multi-dimensional indexes. Due to information loss by lower-dimensional transformations, existing multi-step k-NN search solutions produce a large tolerance (i.e., a large search range), and thus, incur a large number of candidates, which are retrieved by a range query. Those many candidates lead to overwhelming I/O and CPU overheads in the postprocessing step. To overcome this problem, we propose two efficient solutions that improve the search performance by reducing the tolerance of a range query, and accordingly, reducing the number of candidates. First, we propose a tolerance reduction-based (approximate) solution that forcibly decreases the tolerance, which is determined by a k-NN query on the index, by the average ratio of high- and low-dimensional distances. Second, we propose a coefficient control-based (exact) solution that uses c·k instead of k in a k-NN query to obtain a tighter tolerance and performs a range query using this tighter tolerance. Experimental results show that the proposed solutions significantly reduce the number of candidates, and accordingly, improve the search performance in comparison with the existing multi-step k-NN solution.
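The coefficient-control idea can be illustrated with a small filter-and-refine multi-step k-NN sketch. Here a truncated-coordinate transform stands in for the multi-dimensional index, and the parameter c plays the role of the paper's coefficient (all code is illustrative, not the paper's implementation):

```python
import math

def ldist(a, b, f=2):
    """Lower-dimensional distance: first f coordinates only. Dropping
    coordinates can only shrink a Euclidean distance, so this lower-bounds
    the true distance (the filter-and-refine correctness condition)."""
    return math.dist(a[:f], b[:f])

def multistep_knn(data, q, k, c=1):
    """Multi-step k-NN: c > 1 runs a (c*k)-NN on the low-dim 'index' to
    obtain a tighter tolerance before the range query."""
    # Step 1: (c*k)-NN in the low-dimensional space (index stand-in).
    seeds = sorted(data, key=lambda x: ldist(x, q))[:c * k]
    # Step 2: tolerance = k-th smallest TRUE distance among the seeds.
    tol = sorted(math.dist(x, q) for x in seeds)[k - 1]
    # Step 3: low-dimensional range query with that tolerance.
    candidates = [x for x in data if ldist(x, q) <= tol]
    # Step 4: postprocess candidates with true high-dimensional distances.
    result = sorted(candidates, key=lambda x: math.dist(x, q))[:k]
    return result, len(candidates)

data = [(i, i % 7, i % 3, i % 5) for i in range(200)]
q = (50.0, 3.0, 1.0, 2.0)
exact, n1 = multistep_knn(data, q, k=5, c=1)
tighter, n2 = multistep_knn(data, q, k=5, c=4)
```

Because more seeds can only lower the k-th smallest true distance, c > 1 yields a tolerance no larger than the baseline, hence no more candidates, while the answer stays exact.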

Partial Denoising Boundary Image Matching Based on Time-Series Data

Bum-Soo Kim, Sanghoon Lee, Yang-Sae Moon

http://doi.org/

Removing noise, called denoising, is essential for more intuitive and more accurate results in boundary image matching. This paper deals with a partial denoising problem that allows a limited amount of partial noise embedded in boundary images. To solve this problem, we first define partial denoising time-series, which can be generated from an original image time-series by removing a variety of partial noises, and propose an efficient mechanism that quickly obtains those partial denoising time-series in the time-series domain rather than the image domain. We next present the partial denoising distance, the minimum distance from a query time-series to all possible partial denoising time-series generated from a data time-series, and use this distance as a similarity measure in boundary image matching. Using the partial denoising distance, however, incurs severe computational overhead since there are a large number of partial denoising time-series to consider. To solve this problem, we derive a tight lower bound for the partial denoising distance and formally prove its correctness. We also propose range and k-NN search algorithms that exploit the partial denoising distance in boundary image matching. Through extensive experiments, we finally show that our lower bound-based approach improves search performance by up to an order of magnitude in partial denoising-based boundary image matching.
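The lower-bound pruning pattern the paper exploits can be sketched generically. Below, a simple Cauchy-Schwarz bound stands in for the paper's partial denoising lower bound (the actual distance and bound are specific to denoised time-series; the names here are illustrative):

```python
import math

def euclid(a, b):
    """Expensive true distance (stands in for the partial denoising
    distance, which minimizes over many denoised variants)."""
    return math.dist(a, b)

def lower_bound(a, b):
    """Cheap lower bound: by Cauchy-Schwarz,
    |sum(a) - sum(b)| / sqrt(n) <= Euclidean distance."""
    return abs(sum(a) - sum(b)) / math.sqrt(len(a))

def range_search(db, q, eps):
    """Filter-and-refine range query: skip the expensive distance
    whenever the lower bound already exceeds the threshold. Pruning is
    safe because the bound never exceeds the true distance."""
    results, refined = [], 0
    for ts in db:
        if lower_bound(q, ts) > eps:
            continue                  # pruned without a true-distance call
        refined += 1
        if euclid(q, ts) <= eps:
            results.append(ts)
    return results, refined

db = [tuple(float((i * j) % 10) for j in range(8)) for i in range(100)]
q = tuple(float(j % 10) for j in range(8))   # equals the i = 1 series
hits, refined = range_search(db, q, eps=3.0)
```

The search returns exactly what an exhaustive scan would, but computes the expensive distance for only a fraction of the database; the tighter the lower bound, the larger that saving, which is the source of the order-of-magnitude speedup reported.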


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr