Search: [ keyword: Hadoop ] (10)

Dynamic Core Affinity for Manycore Partitioning

Chan-Gyu Lee, Joong-Yeon Cho, Hyun-Wook Jin

http://doi.org/10.5626/JOK.2020.47.12.1111

As the number of cores in NUMA systems increases, contemporary operating systems fail to scale because of increased cache misses, cache-coherence activity, and synchronization overhead. To resolve this problem, several studies have suggested controlling the core affinity of system calls and event handlers so that they run on a specific set of cores. However, these core-partitioning approaches statically decide the number of cores available for affinity control, without considering the characteristics of the applications and the system architecture. In this paper, we propose a dynamic core-affinity scheme for core partitioning and compare it with a static core-partitioning mechanism.
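
As a rough user-space illustration of the difference between static and dynamic partitioning, the sketch below re-derives an affinity set at runtime and applies it with Python's os.sched_setaffinity (Linux-only); the partitioning policy and the repartition helper are placeholders, not the paper's in-kernel mechanism.

# Hypothetical user-space sketch of dynamic core partitioning.
# The paper's scheme works inside the kernel; this only illustrates
# adjusting an affinity set at runtime instead of fixing it statically.
import os  # os.sched_setaffinity/getaffinity are Linux-only

ALL_CORES = set(range(os.cpu_count()))

def repartition(io_cores_wanted):
    """Split the machine into an I/O-handling set and an application set."""
    io_cores = set(list(ALL_CORES)[:io_cores_wanted])
    return io_cores, ALL_CORES - io_cores

# A statically partitioned system would call repartition() once at boot;
# a dynamic scheme re-evaluates it as the workload changes.
io_cores, app_cores = repartition(io_cores_wanted=2)
os.sched_setaffinity(0, app_cores)  # pin this process to the app set
print("running on cores:", sorted(os.sched_getaffinity(0)))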

Development of Big Data Platform Operation and Management System Considering HPC Environments

Jae-Hyuck Kwak, Jieun Choi, Eunkyu Byun, Sangwan Kim

http://doi.org/10.5626/JOK.2020.47.3.240

Software technologies in the traditional computational-science and big-data fields have evolved in different directions, but the growth of big-data technology and recent advances in artificial intelligence have broken down the boundary between the two fields and led to generalized high-performance computing environments. However, because the two software stacks were built and developed independently, it is not easy to integrate and operate them seamlessly in a high-performance computing environment. In this paper, we developed a big-data platform operation and management system designed for high-performance computing environments. The system extends Ambari, an open-source operation and management system for the Hadoop platform, with installation management for Lustre, configuration of the Hadoop-on-Lustre execution environment, YARN job monitoring with user-defined, dynamic monitoring metrics, and a web-based interface for monitoring high-performance computing resources.
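
For context, the sketch below shows the kind of REST interaction such a management extension builds on, using stock Ambari endpoints; the host, credentials, and cluster name are placeholders, and the paper's Lustre and Hadoop-on-Lustre additions are not part of stock Ambari.

# Minimal sketch of talking to an Ambari server over its REST API.
# Host, credentials, and cluster name are placeholders.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}  # required by Ambari for write requests

def list_services(cluster):
    """List the services (HDFS, YARN, ...) Ambari manages for a cluster."""
    r = requests.get(f"{AMBARI}/clusters/{cluster}/services",
                     auth=AUTH, headers=HEADERS)
    r.raise_for_status()
    return [item["ServiceInfo"]["service_name"] for item in r.json()["items"]]

print(list_services("my_cluster"))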

Streaming Compression Scheme for Reducing Network Resource Usage in Hadoop System

Seung Joon Noh, Young Ik Eom

http://doi.org/10.5626/JOK.2018.45.6.516

Recently, the Hadoop system has become one of the most popular large-scale distributed systems in enterprises, and the amount of data it manages has been increasing continually. As the data volume grows, so does the scale of Hadoop clusters. Resources within a node, such as processor, memory, and storage, are isolated from other nodes, so even when resource usage rises due to data-processing requests from clients, the performance of other nodes is unaffected. However, all nodes in a Hadoop cluster share the network, and if some nodes dominate this shared resource, other nodes receive less network bandwidth, which can degrade the overall performance of the Hadoop system. In this paper, we propose a streaming compression scheme that decreases the network traffic generated by write operations. We also evaluate the performance of the scheme and analyze its overhead. Our experimental results with a real-world workload show that the proposed scheme decreases network traffic in a Hadoop cluster by 56% compared with the existing HDFS.
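
The core transformation behind such a scheme can be sketched as follows: compress each block on the client before it crosses the network. This is only an illustration with Python's zlib, not the paper's HDFS implementation; the block size and workload are made up.

# Sketch of streaming compression on the write path: the client
# compresses each block on the fly instead of shipping raw bytes
# to the DataNode.
import io, zlib

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS-style 128 MiB block

def compressed_blocks(stream):
    """Yield zlib-compressed blocks read from a binary stream."""
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        yield zlib.compress(block, 1)  # fast level keeps the pipeline streaming

# Example: estimate the traffic a compressed write path puts on the wire.
raw = io.BytesIO(b"127.0.0.1 GET /index.html 200\n" * 500_000)
sent = sum(len(b) for b in compressed_blocks(raw))
print(f"bytes on the wire: {sent} (raw: {raw.getbuffer().nbytes})")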

Distributed Processing Method of Hotspot Spatial Analysis Based on Hadoop and Spark

Changsoo Kim, Joosub Lee, KyuMoon Hwang, Hyojin Sung

http://doi.org/10.5626/JOK.2018.45.2.99

Hotspot analysis, a spatial statistical method, is an easy way to see spatial patterns. It is based on the concept that "adjacent things are more related than distant things." However, hotspot analysis must take spatial adjacency into account, which makes distributed processing difficult. In this paper, we propose a distributed algorithm design for hotspot spatial analysis and compare its performance on a standalone system against Hadoop-based and Spark-based processing. Compared with the standalone system, the performance improvement was 625.89% with Hadoop and 870.14% with Spark. Furthermore, Spark's advantage over Hadoop grows as the data set becomes larger.
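
The sketch below illustrates why adjacency makes distribution hard: each cell's score depends on its neighbors, so partitions must exchange boundary cells. The grid, weights, and simple local-sum statistic are illustrative only; hotspot analysis is commonly formulated as the Getis-Ord Gi* statistic, which the abstract does not spell out.

# Toy hotspot scoring on a grid, split into map/reduce-style phases.
from collections import defaultdict

values = {(x, y): (x * 7 + y * 13) % 10 for x in range(6) for y in range(6)}

def neighbors(cell):
    x, y = cell
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

# "Map" phase: every cell sends its value to each neighbor; in a
# distributed run these messages cross partition boundaries.
contributions = defaultdict(float)
for cell, v in values.items():
    for n in neighbors(cell):
        contributions[n] += v

# "Reduce" phase: a cell's score is its neighborhood sum (self included).
scores = {c: contributions[c] + values[c] for c in values}
print(max(scores, key=scores.get), "is the hottest cell")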

Design of a Large-scale Task Dispatching & Processing System based on Hadoop

Jik-Soo Kim, Nguyen Cao, Seoyoung Kim, Soonwook Hwang

http://doi.org/

This paper presents MOHA (Many-Task Computing on Hadoop), a framework that aims to effectively apply Many-Task Computing (MTC) technologies, originally developed for the high-performance processing of many tasks, to Hadoop, the existing big-data processing platform. We present the basic concepts and motivation of MOHA, preliminary results of a proof of concept based on a distributed message queue, and future research directions. MTC applications may have relatively low I/O requirements per task, but a very large number of tasks must be processed efficiently, with potentially heavy file-based inter-task communication. MTC applications can therefore exhibit a different pattern of data-intensive workload than existing Hadoop applications, which are typically based on relatively large data block sizes. Through an effective convergence of MTC and big-data technologies, MOHA can support large-scale scientific applications within the Hadoop ecosystem, which is evolving into a multi-application platform.
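
The dispatching pattern behind the proof of concept can be sketched as many short tasks pushed into a queue and pulled by competing workers; queue.Queue below is a single-process stand-in for whatever distributed message queue a real MOHA deployment would use.

# Toy many-task dispatch through a shared queue.
import queue, threading

tasks = queue.Queue()
for i in range(1000):            # many short tasks, each with low I/O
    tasks.put(f"task-{i}")

def worker():
    while True:
        try:
            tasks.get_nowait()
        except queue.Empty:
            return
        sum(range(10_000))       # placeholder for the real task body
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
tasks.join()
print("all tasks processed")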

Design of Extended Real-time Data Pipeline System Architecture

Hoseung Shin, Sungwon Kang, Jihyun Lee

http://doi.org/

Big-data systems are widely used to collect large-scale log data, so it is very important that these systems operate with a high level of performance. However, the current Hadoop-based big-data system architecture suffers from low performance as a result of redundant processing. This paper solves the problem by improving the design of the Hadoop system architecture. The proposed architecture combines the batch-based data collection of the existing architecture with a single-pass processing method: high performance is achieved by analyzing the collected data directly in memory, thereby avoiding redundant processing. The proposed architecture also preserves system expandability, an advantage of the Hadoop architecture. This paper confirms that the proposed architecture analyzes and processes data approximately 30% to 35% faster than existing architectures, and that it is also extendable.
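
A minimal sketch of the "analyze once, in memory" idea follows: the collected batch is parsed, filtered, and aggregated in one pass, with no intermediate write and re-read. The log format and metrics are invented for illustration.

# Single-pass, in-memory analysis of a collected batch.
from collections import Counter

batch = [
    "2024-01-01 10:00:01 INFO  web-1 request ok",
    "2024-01-01 10:00:02 ERROR web-2 request failed",
    "2024-01-01 10:00:03 INFO  web-1 request ok",
]

def analyze(lines):
    """One pass over the batch: no intermediate files, no re-parse."""
    per_host = Counter()
    errors = 0
    for line in lines:
        _, _, level, host, *_ = line.split()
        per_host[host] += 1
        errors += level == "ERROR"
    return per_host, errors / len(lines)

print(analyze(batch))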

Yet Another BGP Archive Forensic Analysis Tool Using Hadoop and Hive

Yeonhee Lee, YoungSeok Lee

http://doi.org/

A large and continuously growing volume of BGP data files raises two technical challenges: scalability and manageability. With the recent development of Hadoop, the open-source distributed computing infrastructure, it has become feasible to handle large amounts of data in a scalable manner. In this paper, we present a new Hadoop-based BGP tool, BGPdoop, that provides scale-out performance as well as extensible and agile analysis capability. In particular, BGPdoop realizes query-based exploration of BGP records using Hive over a partitioned BGP data structure, which enables flexible and versatile analytics of BGP archive files. In scalability experiments on a Hadoop cluster of 20 nodes, we demonstrate that BGPdoop achieves about 5 times higher performance, as well as user-defined analysis capability, by expressing diverse BGP routing analytics as Hive queries.
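
Once BGP archives sit in a partitioned Hive table, routing analytics reduce to HiveQL, roughly as sketched below; the table and column names are invented, and pyhive is just one of several clients that could submit such a query.

# Sketch of query-based BGP exploration through Hive.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000)
cur = conn.cursor()

# e.g., which origin ASes announced the most prefixes on a given day?
cur.execute("""
    SELECT origin_as, COUNT(DISTINCT prefix) AS prefixes
    FROM bgp_updates
    WHERE dt = '2013-06-01'        -- partition column prunes the scan
    GROUP BY origin_as
    ORDER BY prefixes DESC
    LIMIT 10
""")
for origin_as, prefixes in cur.fetchall():
    print(origin_as, prefixes)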

A MapReduce-based kNN Join Query Processing Algorithm for Analyzing Large-scale Data

HyunJo Lee, TaeHoon Kim, JaeWoo Chang

http://doi.org/

Recently, the amount of data has been increasing rapidly with the popularity of SNS and the development of mobile technology, so effective analysis schemes for large amounts of data have been actively studied. A typical scheme is the Voronoi diagram-based kNN join algorithm (VkNN-join) using MapReduce. For two datasets R and S, VkNN-join can reduce join query processing time over big data because it selects the corresponding subset Sj for each Ri and processes the query only with them. However, VkNN-join requires a high computational cost for constructing the Voronoi diagram, and its computational overhead is high because the number of candidate cells grows as the value of k increases. To solve these problems, we propose a MapReduce-based kNN-join query processing algorithm for analyzing large amounts of data. Using seed-based dynamic partitioning, our algorithm reduces the overhead of constructing the index structure. It also reduces the computational overhead of finding candidate partitions by selecting the corresponding partitions based on the average distance between two seeds. We show that our algorithm outperforms the existing scheme in terms of query processing time.
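
The seed-based partitioning idea can be sketched as follows: points are bucketed by their nearest seed, and a kNN probe visits only the partitions whose seeds lie closest to the query point. The seed selection, distance metric, and pruning rule here are simplified relative to the paper.

# Seed-based partitioning with seed-distance pruning for kNN search.
import math, random

random.seed(0)
points = [(random.random(), random.random()) for _ in range(10_000)]
seeds = random.sample(points, 16)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Partition phase: every point goes to its nearest seed's bucket.
partitions = {i: [] for i in range(len(seeds))}
for p in points:
    partitions[min(range(len(seeds)), key=lambda i: dist(p, seeds[i]))].append(p)

def knn(q, k=5, probe=3):
    """Search only the `probe` partitions with the closest seeds."""
    order = sorted(range(len(seeds)), key=lambda i: dist(q, seeds[i]))
    candidates = [p for i in order[:probe] for p in partitions[i]]
    return sorted(candidates, key=lambda p: dist(q, p))[:k]

print(knn((0.5, 0.5)))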

A Scalable OWL Horst Lite Ontology Reasoning Approach based on Distributed Cluster Memories

Je-Min Kim, Young-Tack Park

http://doi.org/

Current ontology studies use the Hadoop distributed storage framework to perform MapReduce-based reasoning over scalable ontologies. In this paper, however, we propose a novel approach to scalable Web Ontology Language (OWL) Horst Lite ontology reasoning based on distributed cluster memories. Rule-based reasoning, which is frequently used for scalable ontologies, iteratively executes triple-format ontology rules until no more data can be inferred. Therefore, when scalable ontology reasoning is performed on hard drives, the reasoner suffers from performance limitations. To overcome this drawback, we propose an approach that loads the ontologies into distributed cluster memories using Spark, a memory-based distributed computing framework, and executes the reasoning there. To implement an appropriate OWL Horst Lite reasoning system on Spark, our method divides the scalable ontologies into blocks, loads each block into the cluster nodes, and subsequently handles the data in the distributed memories. We used the Lehigh University Benchmark (LUBM), a standard for evaluating ontology inference and search speed, to experimentally evaluate the suggested methods on LUBM8000 (1.1 billion triples, 155 gigabytes). Compared with WebPIE, a representative MapReduce-based scalable ontology reasoner, the proposed approach showed a throughput improvement of 320% (62k triples/sec versus WebPIE's 19k triples/sec).
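
The iteration pattern of memory-resident rule reasoning on Spark can be sketched with a single rule, the transitive closure of rdfs:subClassOf, as below; OWL Horst has many more rules, and the paper's block partitioning is omitted.

# Fixpoint-style rule reasoning kept in cluster memory with Spark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "horst-sketch")
# (subclass, superclass) pairs, kept in cluster memory via cache().
sub = sc.parallelize([("A", "B"), ("B", "C"), ("C", "D")]).cache()

while True:
    # join on the shared middle term: (x,y) + (y,z) -> (x,z)
    derived = (sub.map(lambda t: (t[1], t[0]))
                  .join(sub)
                  .map(lambda kv: (kv[1][0], kv[1][1])))
    updated = sub.union(derived).distinct().cache()
    if updated.count() == sub.count():   # fixpoint: nothing new inferred
        break
    sub = updated

print(sorted(sub.collect()))
sc.stop()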

Scalable RDFS Reasoning using Logic Programming Approach in a Single Machine

Batselem Jagvaral, Jemin Kim, Wan Gon Lee, Young Tack Park

http://doi.org/

As the Web of Data increasingly produces large RDFS datasets, building scalable reasoning engines over large triple sets becomes essential. Many studies have used expensive distributed frameworks, such as Hadoop, to reason over large RDFS triple sets. In many cases, however, we only need to handle millions of triples; in such cases it is unnecessary to deploy an expensive distributed system, because a logic-programming-based reasoner on a single machine can deliver reasoning performance similar to that of a distributed reasoner using Hadoop. In this paper, we propose a scalable RDFS reasoner that uses logic-programming methods on a single machine, and we compare our empirical results with those of distributed systems. We show that our reasoner performs comparably to an expensive distributed reasoner for up to 200 million RDFS triples. In addition, we designed a metadata structure that decomposes the ontology triples into separate sectors: instead of loading all triples into a single model, we select an appropriate subset of triples for each reasoning rule. Because unification makes it easy to handle the conjunctive queries needed for RDFS schema reasoning, we designed and implemented the RDFS axioms using logic-programming unification and an efficient conjunctive-query handling mechanism. The throughput of our approach reached 166K triples/sec over LUBM1500 (200 million triples), which is comparable to the 185K triples/sec of WebPIE, a distributed reasoner based on Hadoop and MapReduce. These results show that a distributed system is unnecessary for up to 200 million triples, at which scale a single-machine logic-programming reasoner is comparable to an expensive Hadoop-based distributed reasoner.
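
A plain-Python approximation of the single-machine fixpoint loop follows, encoding just two RDFS rules (rdfs9 and rdfs11) over an in-memory triple set; the paper's approach uses logic-programming unification over the full axiom set, which this sketch only gestures at.

# Forward-chaining RDFS reasoning as a fixpoint over an in-memory set.
RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

triples = {
    ("alice", RDF_TYPE, "GradStudent"),
    ("GradStudent", SUBCLASS, "Student"),
    ("Student", SUBCLASS, "Person"),
}

def apply_rules(kb):
    derived = set()
    for s, p, o in kb:
        if p == SUBCLASS:
            # rdfs11: subClassOf is transitive
            derived |= {(s, SUBCLASS, o2) for s2, p2, o2 in kb
                        if p2 == SUBCLASS and s2 == o}
            # rdfs9: instances inherit superclasses
            derived |= {(x, RDF_TYPE, o) for x, p2, c in kb
                        if p2 == RDF_TYPE and c == s}
    return derived

while True:                      # iterate until no new triples appear
    new = apply_rules(triples) - triples
    if not new:
        break
    triples |= new

print(sorted(t for t in triples if t[1] == RDF_TYPE))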

