Digital Library: Search Results
Compiler-directive based Heterogeneous Computing for Scala
Jungjae Woo, Seongsoo Park, Sungin Hong, Hwansoo Han
http://doi.org/10.5626/JOK.2023.50.3.197
With the advent of the big data era, heterogeneous computing is employed to process large amounts of data. Since Apache Spark, a representative big data analysis framework, is built with the Scala programming language, programs written in Scala need to be rewritten in CUDA, OpenCL, or similar languages to enjoy the benefits of GPU computing. TornadoVM automatically converts Java programs into OpenCL programs using compiler annotations defined in the Java specification. Scala shares an executable bytecode format with Java, but current Scala compilers lack the annotation support indispensable for TornadoVM's OpenCL translation. In this work, the annotation capabilities of Scala compilers are extended to enable OpenCL translation on TornadoVM. Furthermore, we experimentally confirmed that code converted from Scala to OpenCL runs as fast as code converted from Java to OpenCL. With our extension, we expect Scala programs to easily use GPU acceleration in the Apache Spark framework.
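As a rough sketch under stated assumptions: TornadoVM's @Parallel annotation (real, from uk.ac.manchester.tornado.api.annotations) marks loop induction variables for OpenCL translation in Java. Writing it on a Scala local variable, and assuming the extended compiler carries it into the emitted bytecode, is our illustration, not the paper's published code.

```scala
import uk.ac.manchester.tornado.api.annotations.Parallel

// Hypothetical Scala kernel in the style TornadoVM expects from Java.
// The paper's compiler extension (not shown) would make the annotation
// on the induction variable survive into the Scala-emitted bytecode.
object VectorOps {
  def vectorAdd(a: Array[Float], b: Array[Float], c: Array[Float]): Unit = {
    @Parallel var i = 0            // annotation placement is an assumption
    while (i < c.length) {
      c(i) = a(i) + b(i)           // each iteration maps to one OpenCL work-item
      i += 1
    }
  }
}
```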
Predicting the Cache Performance Benefits for In-memory Data Analytics Frameworks
http://doi.org/10.5626/JOK.2021.48.5.479
In-memory data analytics frameworks provide caching facilities for intermediate results to improve performance. For effective caching, the actual performance benefits from cached data should be taken into consideration. As existing frameworks only measure execution times at the distributed task level, they have limitations in predicting the cache performance benefits accurately. In this paper, we propose an operator-level time measurement method, which combines the existing task-level execution time measurement with our cost prediction model parameterized by input data size. Based on the proposed model and the execution flow of the application, we propose a method for predicting the performance benefits of data caching. Our proposed model provides opportunities for cache optimization with predicted performance benefits. Our cost model for operators showed an average prediction error rate of 7.3% when measured with 10x input data. The difference between predicted and actual performance was limited to within 24%.
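The abstract does not give the cost model's exact form. A minimal sketch, assuming a per-operator cost linear in input size fitted from task-level measurements (all names hypothetical):

```scala
// Hypothetical operator-level cost model: fit cost(n) = a*n + b per operator
// from (input size, measured time) samples, then predict the caching benefit
// as the avoided recomputation time of the operators upstream of the cache.
final case class LinearCost(a: Double, b: Double) {
  def predict(inputSize: Long): Double = a * inputSize + b
}

object CacheBenefit {
  // Least-squares fit of a and b from task-level measurements.
  def fit(samples: Seq[(Long, Double)]): LinearCost = {
    val n   = samples.size.toDouble
    val sx  = samples.map(_._1.toDouble).sum
    val sy  = samples.map(_._2).sum
    val sxy = samples.map { case (x, y) => x * y }.sum
    val sxx = samples.map { case (x, _) => x.toDouble * x }.sum
    val a   = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    LinearCost(a, (sy - a * sx) / n)
  }

  // Benefit of caching = predicted cost of the upstream operators that would
  // otherwise be re-executed on every reuse of the cached result.
  def benefit(upstream: Seq[(LinearCost, Long)], reuses: Int): Double =
    reuses * upstream.map { case (c, n) => c.predict(n) }.sum
}
```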
A Sort and Merge Method for Genome Variant Call Format (GVCF) Files using Parallel and Distributed Computing
JinWoo Lee, Jung-Im Won, JeeHee Yoon
http://doi.org/10.5626/JOK.2021.48.3.358
With the development of next-generation sequencing (NGS) techniques, a large volume of genomic data is being produced and accumulated, and parallel and distributed computing has become an essential tool. Generally, NGS data processing entails two main steps: obtaining read alignment results in BAM format and extracting variant information in genome variant call format (GVCF) or variant call format (VCF). However, each step requires a long execution time due to the size of the data. In this study, we propose a new GVCF file sorting/merging module that uses distributed parallel clusters to shorten the execution time. In the proposed algorithm, Spark is used as the distributed parallel cluster. The sort/merge process is performed in two steps according to the structural characteristics of the GVCF file in order to use the cluster's resources efficiently. The performance was evaluated by comparing our method with GATK's CombineGVCFs module on the execution time for sorting and merging multiple GVCF files. The outcomes suggest the effectiveness of the proposed method in reducing execution time. The method can be used as a scalable and powerful distributed computing tool to solve the GVCF file sort/merge problem.
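The paper's two-step procedure is not reproduced here. A minimal Spark sketch of the general idea, partitioning records by chromosome and sorting by position within each partition so no global sort of the whole file is needed (record parsing is simplified):

```scala
import org.apache.spark.{Partitioner, SparkContext}

// Route records of the same chromosome to the same partition.
class ChromPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key match {
    case (chrom: String, _) => (chrom.hashCode % n + n) % n
  }
}

object GvcfSortMerge {
  def run(sc: SparkContext, paths: Seq[String], out: String, parts: Int): Unit = {
    val records = sc.union(paths.map(sc.textFile(_)))
      .filter(line => !line.startsWith("#"))        // drop GVCF header lines
      .map { line =>
        val f = line.split("\t", 3)                  // CHROM, POS, rest of record
        ((f(0), f(1).toLong), line)
      }
    records
      .repartitionAndSortWithinPartitions(new ChromPartitioner(parts))
      .values                                        // records now grouped by
      .saveAsTextFile(out)                           // chromosome, sorted by position
  }
}
```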
OWL-Horst Ontology Inference Engine Using Distributed Table Structure in Cloud Computing Environment
Min-Sung Kim, Min-Ho Lee, Wan-Gon Lee, Young-Tack Park
http://doi.org/10.5626/JOK.2020.47.7.674
Recently, many machine learning methods that extend an ontology with data obtained from the web are being studied. As data from the web continues to increase, interest in large-scale ontology inference methods is also increasing. However, the growing amount of data slows down processing. This paper describes how to improve the performance of large-scale OWL-Horst inference using distributed table-structured data frames to address the slow processing of large volumes of data. A distributed parallel inference algorithm and the optimization methods used to improve inference performance are also described. To evaluate the performance of the inference system using the distributed table-structured data frames proposed in this paper, experiments were conducted with LUBM1000, LUBM2000, LUBM3000, and LUBM4000. Our reasoning system showed the best performance.
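As one illustration of inference over distributed tables, the sketch below encodes a single OWL-Horst style rule (rdfs7: if p is a subproperty of q and (s, p, o) holds, then (s, q, o) holds) as a DataFrame join. The schema and names are assumptions, not the paper's; a full reasoner would union the derived triples back and re-apply the rules until a fixpoint.

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: rdfs7 as a join between a triple table (s, p, o) and a
// subproperty table (sub, sup). Column names are assumed.
object HorstRules {
  def rdfs7(triples: DataFrame, subProp: DataFrame): DataFrame =
    triples
      .join(subProp, triples("p") === subProp("sub"))              // (s p o), (p subPropertyOf q)
      .select(triples("s"), subProp("sup").as("p"), triples("o"))  // => (s q o)
}
```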
An Efficient Distributed In-memory High-dimensional Indexing Scheme for Content-based Image Retrieval in Spark Environments
Dojin Choi, Songhee Park, Yeondong Kim, Jiwon Wee, Hyeonbyeong Lee, Jongtae Lim, Kyoungsoo Bok, Jaesoo Yoo
http://doi.org/10.5626/JOK.2020.47.1.95
Content-based image retrieval, which searches for an object in images, has been utilized for criminal activity monitoring and object tracking in video. In this paper, we propose a high-dimensional indexing scheme based on distributed in-memory processing for content-based image retrieval. It provides similarity search using massive feature vectors extracted from images or objects. To process a large amount of data, we utilized the big data platform Spark. Moreover, we employed a master/slave model for efficient allocation of distributed query processing: the master distributes data and queries, and the slaves index and process them. To solve the k-NN query processing performance problems of existing distributed high-dimensional indexing schemes, we propose optimization methods for k-NN query processing that consider density and search costs. We conduct various performance evaluations to demonstrate the superiority of the proposed scheme.
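A minimal sketch of the master/slave k-NN flow, assuming each partition plays the slave role (a local scan stands in for its high-dimensional index) and the driver merges the partial top-k results; the paper's density- and cost-based optimizations are not modeled:

```scala
import org.apache.spark.rdd.RDD

object DistributedKnn {
  private def l2(a: Array[Float], b: Array[Float]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    math.sqrt(s)
  }

  def knn(vectors: RDD[(Long, Array[Float])], q: Array[Float], k: Int): Array[(Long, Double)] =
    vectors
      .mapPartitions { it =>
        // Local top-k per partition; a real slave would probe its own index
        // instead of scanning every vector.
        it.map { case (id, v) => (id, l2(v, q)) }
          .toSeq.sortBy(_._2).take(k).iterator
      }
      .takeOrdered(k)(Ordering.by[(Long, Double), Double](_._2)) // merge at the driver
}
```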
S-PARAFAC: Distributed Tensor Decomposition using Apache Spark
Hye-Kyung Yang, Hwan-Seung Yong
http://doi.org/10.5626/JOK.2018.45.3.280
Recently, the use of recommendation systems and of tensor analysis over high-dimensional data is increasing, as they allow us to extract latent factors and patterns from the tensor. However, due to its large size and complexity, the tensor must be decomposed before its data can be analyzed. Several tools are available for tensor decomposition, such as rTensor, pyTensor, and MATLAB, but since they run on a single machine, they cannot handle large data. Distributed tensor decomposition tools based on Hadoop can handle tensors at scale, but their computing speed is slow. In this paper, we propose S-PARAFAC, a tensor decomposition tool based on Apache Spark, for distributed in-memory environments. We converted the PARAFAC algorithm into an Apache Spark version that enables rapid processing of tensor data. We also compared the performance of a Hadoop-based tensor tool and S-PARAFAC. The results showed that S-PARAFAC is approximately 4~25 times faster than the Hadoop-based tool.
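For context, the core computation of one PARAFAC (CP-ALS) step is the matricized-tensor-times-Khatri-Rao product (MTTKRP). A sketch over a sparse tensor stored as (i, j, k, value) entries, with the other factor matrices broadcast; the small rank-by-rank solve that finishes the factor update is omitted, and all names are ours, not the paper's:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SparkParafac {
  // Mode-1 MTTKRP: each nonzero (i, j, k, v) contributes v * (B(j) ∘ C(k))
  // to row i of the result, where ∘ is the elementwise (Hadamard) product.
  def mttkrpMode1(sc: SparkContext,
                  entries: RDD[(Int, Int, Int, Double)],
                  bRows: Map[Int, Array[Double]],
                  cRows: Map[Int, Array[Double]],
                  rank: Int): RDD[(Int, Array[Double])] = {
    val bB = sc.broadcast(bRows)
    val bC = sc.broadcast(cRows)
    entries
      .map { case (i, j, k, v) =>
        val row = new Array[Double](rank)
        val bj = bB.value(j); val ck = bC.value(k)
        var r = 0
        while (r < rank) { row(r) = v * bj(r) * ck(r); r += 1 }
        (i, row)
      }
      .reduceByKey { (x, y) =>          // sum contributions per output row i
        val z = new Array[Double](rank)
        var r = 0
        while (r < rank) { z(r) = x(r) + y(r); r += 1 }
        z
      }
  }
}
```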
Distributed Processing Method of Hotspot Spatial Analysis Based on Hadoop and Spark
Changsoo Kim, Joosub Lee, KyuMoon Hwang, Hyojin Sung
http://doi.org/10.5626/JOK.2018.45.2.99
Hotspot analysis, a form of spatial statistical analysis, is an easy way to see spatial patterns. It is based on the concept that "adjacent ones are more relevant than those that are far away". However, because hotspot analysis must take spatial adjacency into account, it is not easy to process in a distributed fashion. In this paper, we propose a distributed algorithm design for hotspot spatial analysis. Its performance was compared with a standalone system and with Hadoop- and Spark-based processing. Compared with the standalone system, Hadoop achieved a performance improvement of 625.89% and Spark 870.14%. Furthermore, the improvement of Spark over Hadoop grows as the data set becomes larger.
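The abstract does not name the statistic, so the sketch below assumes a Getis-Ord Gi*-style score on a grid of cells; the shuffle in which each cell contributes its value to its neighbors is exactly the adjacency dependence that makes distribution nontrivial:

```scala
import org.apache.spark.rdd.RDD

object Hotspot {
  // Assumed Gi*-style score with binary weights over the 3x3 neighborhood
  // (including the cell itself). mean, sd, and n are the global mean,
  // standard deviation, and count of cell values; grid boundaries are not
  // clipped in this sketch.
  def giStar(cells: RDD[((Int, Int), Double)],
             mean: Double, sd: Double, n: Long): RDD[((Int, Int), Double)] = {
    val contributions = cells.flatMap { case ((x, y), v) =>
      for (dx <- -1 to 1; dy <- -1 to 1) yield ((x + dx, y + dy), v)
    }
    contributions
      .aggregateByKey((0.0, 0L))(                        // (neighbor sum, weight count)
        { case ((s, w), v) => (s + v, w + 1) },
        { case ((s1, w1), (s2, w2)) => (s1 + s2, w1 + w2) })
      .mapValues { case (sum, w) =>
        // Gi* with binary weights: (sum - mean*w) / (sd * sqrt((n*w - w^2)/(n-1)))
        val denom = sd * math.sqrt((n * w - w.toDouble * w) / (n - 1.0))
        (sum - mean * w) / denom
      }
  }
}
```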
SWOSpark : Spatial Web Object Retrieval System based on Distributed Processing
Pyoung Woo Yang, Kwang Woo Nam
http://doi.org/10.5626/JOK.2018.45.1.53
This study describes a spatial web object retrieval system using Spark, an in-memory distributed processing system. The growth of social networks has created massive numbers of spatial web objects, and such data is difficult to retrieve and analyze with existing spatial web object retrieval systems. Recent distributed processing systems, however, support fast analysis and retrieval of large amounts of data, which motivates searching large collections of spatial web objects with a distributed processing system. Data is processed in block units, and each block is converted to an RDD and processed in Spark. We propose a system in which the entire spatial region is divided into non-overlapping subregions, one subregion is allocated to each RDD, and each RDD holds a spatial web object index for its data. By dividing the space and making the search of each division efficient, the system uses the distributed processing platform effectively. Additionally, by comparing the QP-tree with the R-tree, we confirm that the proposed system is better suited to searching spatial web objects: the QP-tree builds its index with both spatial and textual information, while the R-tree builds its index with spatial information only.
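A minimal sketch of the layout described: the space is split into non-overlapping grid cells, each cell's objects land in one partition, and every partition builds its own local spatio-textual index. A hypothetical LocalIndex trait stands in for the paper's QP-tree:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

final case class SpatialWebObject(x: Double, y: Double, text: String)

// Map an (x, y) key to one non-overlapping grid cell = one partition.
class GridPartitioner(cols: Int, rows: Int, width: Double, height: Double) extends Partitioner {
  def numPartitions: Int = cols * rows
  def getPartition(key: Any): Int = key match {
    case (x: Double, y: Double) =>
      val cx = math.min((x / width * cols).toInt, cols - 1)
      val cy = math.min((y / height * rows).toInt, rows - 1)
      cy * cols + cx
  }
}

// Hypothetical stand-in for the paper's QP-tree (spatial + keyword index).
trait LocalIndex {
  def search(x: Double, y: Double, radius: Double, keyword: String): Iterator[SpatialWebObject]
}

object Swo {
  // One local index per region, built inside the partition that owns it.
  def build(objects: RDD[((Double, Double), SpatialWebObject)],
            grid: GridPartitioner,
            mkIndex: Iterator[SpatialWebObject] => LocalIndex): RDD[LocalIndex] =
    objects
      .partitionBy(grid)
      .mapPartitions(it => Iterator(mkIndex(it.map(_._2))), preservesPartitioning = true)
}
```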
Techniques to Guarantee Real-Time Fault Recovery in Spark Streaming Based Cloud System
Jungho Kim, Daedong Park, Sangwook Kim, Yongshik Moon, Seongsoo Hong
In a real-time cloud environment, the data analysis framework plays a pivotal role. Spark Streaming meets most real-time requirements among existing frameworks. However, it does not meet the second-scale real-time fault-recovery requirement. Spark Streaming's fault recovery time increases in proportion to the length of the transformation history, called the lineage. This is because it recovers the latest state data based on the cumulative lineage recorded during normal operation, so fault recovery time is not bounded within a fixed limit. In addition, second-scale fault recovery is impossible because reading the initial state data from fault-tolerant storage takes tens of seconds. In this paper, we propose two techniques to solve these problems. We apply the proposed techniques to Spark Streaming 1.6.2. Experimental results show that the fault recovery time is bounded and the average fault recovery time is reduced by up to 41.57%.
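For context only: stock Spark Streaming already truncates lineage growth with periodic checkpointing, as below; the paper's two techniques modify Spark Streaming 1.6.2 itself to push recovery into the second scale and are not shown here. The checkpoint path and stream source are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stock Spark Streaming: checkpointing a stateful DStream bounds lineage
// growth, so recovery replays from the last checkpoint rather than the
// full cumulative lineage.
object CheckpointedStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints")          // fault-tolerant storage (path assumed)

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+")).map((_, 1L))
      .updateStateByKey[Long]((vs: Seq[Long], st: Option[Long]) => Some(st.getOrElse(0L) + vs.sum))
    counts.checkpoint(Seconds(10))                 // truncate lineage every 10 seconds
    counts.print()

    ssc.start(); ssc.awaitTermination()
  }
}
```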
Distributed Assumption-Based Truth Maintenance System for Scalable Reasoning
Batselem Jagvaral, Young-Tack Park
An assumption-based truth maintenance system (ATMS) is a tool that maintains the reasoning process of an inference engine. It also supports non-monotonic reasoning based on dependency-directed backtracking. Bookkeeping all the reasoning processes allows it to quickly check and retract beliefs and to efficiently provide solutions for problems with large search spaces. However, the amount of data has grown exponentially in recent years, making it impossible to solve large-scale problems on a single machine, and the maintenance process for such problems incurs high computation cost due to large memory overhead. To overcome this drawback, this paper presents an approach to incrementally maintaining the reasoning process of an inference engine on a cluster using Spark. It maintains data dependencies such as assumptions, labels, environments, and justifications on a cluster of machines in parallel and efficiently updates changes in a large amount of inferred data. We deployed the proposed ATMS on a cluster of 5 machines, conducted OWL/RDFS reasoning over the Lehigh University Benchmark (LUBM) data, and evaluated our system in terms of its performance and functionalities such as assertion, explanation, and retraction. In our experiments, the proposed system performed these operations in a reasonably short time on an inferred LUBM2000 dataset of over 80 GB.
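A minimal data-model sketch, with names and simplifications that are ours, not the paper's: a node's label is the set of assumption environments under which it holds, and firing a justification combines its antecedents' environments onto the consequent. For brevity, labels are collected to the driver and minimality (subsumption) pruning is skipped; the paper's system keeps labels distributed.

```scala
import org.apache.spark.rdd.RDD

object DistAtms {
  type Env = Set[Long]                                  // an environment: a set of assumption ids
  final case class Justification(antecedents: Seq[Long], consequent: Long)

  // One propagation step: for each justification whose antecedents all have
  // non-empty labels, the consequent gains the pairwise unions of those
  // environments.
  def propagate(labels: RDD[(Long, Set[Env])],
                justifs: Seq[Justification]): RDD[(Long, Set[Env])] = {
    val labelMap = labels.collectAsMap()                // sketch only: assumes labels fit
    val derived = for {
      j <- justifs
      envs = j.antecedents.map(a => labelMap.getOrElse(a, Set.empty[Env]))
      if envs.nonEmpty && envs.forall(_.nonEmpty)
      combined = envs.reduce((x, y) => for (e1 <- x; e2 <- y) yield e1 ++ e2)
    } yield (j.consequent, combined)
    labels.context.parallelize(derived).reduceByKey(_ ++ _)
  }
}
```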