Design of a Large-scale Task Dispatching & Processing System based on Hadoop

Jik-Soo Kim, Nguyen Cao, Seoyoung Kim, Soonwook Hwang

http://doi.org/

This paper presents MOHA (Many-Task Computing on Hadoop), a framework that aims to apply Many-Task Computing (MTC) technologies, originally developed for high-performance processing of many tasks, to the existing Big Data processing platform Hadoop. We present the basic concepts, motivation, preliminary results of a PoC based on a distributed message queue, and future research directions of MOHA. MTC applications may have relatively low I/O requirements per task. However, a very large number of tasks must be processed efficiently, with potentially heavy inter-task communication based on files. Therefore, MTC applications can exhibit a different pattern of data-intensive workloads from existing Hadoop applications, which are typically based on relatively large data block sizes. Through an effective convergence of MTC and Big Data technologies, the MOHA framework can support large-scale scientific applications alongside the Hadoop ecosystem, which is evolving into a multi-application platform.

A Comparative Analysis of Recursive Query Algorithm Implementations based on High Performance Distributed In-Memory Big Data Processing Platforms

Minseo Kang, Jaesung Kim, Jaegil Lee

http://doi.org/

Recursive query algorithms are used in many social network services, e.g., for reachability queries in social networks. Recently, the size of social network data has increased as social network services have evolved. As a result, it is almost impossible to run a recursive query algorithm on a single machine. In this paper, we implement a recursive query on two popular in-memory distributed platforms, Spark and Twister, to solve this problem. We evaluate the performance of the two implementations using 50 machines on Amazon EC2 and two real-world data sets: LiveJournal and ClueWeb. The results show that the recursive query algorithm performs better on Spark for the LiveJournal data set, which has a relatively high average degree but fewer vertices. However, the recursive query on Twister outperforms Spark for the ClueWeb data set, which has a relatively low average degree but many vertices.
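
As an illustration of the kind of reachability query mentioned in the abstract above, the following is a minimal single-machine sketch in Python. It is not the authors' Spark or Twister implementation; the function name `reachable` and the edge-list input format are assumptions made for this example. Each loop iteration plays the role of one recursive step: the current frontier is joined with the edge relation and only newly discovered vertices are kept.

```python
from collections import defaultdict


def reachable(edges, source):
    """Return every vertex reachable from `source` via one or more edges.

    `edges` is an iterable of (src, dst) pairs (an assumed input format).
    """
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)

    visited = set()
    frontier = {source}
    while frontier:
        # One recursive step: expand the frontier along outgoing edges.
        next_frontier = set()
        for vertex in frontier:
            next_frontier |= adjacency.get(vertex, set())
        # Keep only vertices that have not been discovered before.
        next_frontier -= visited
        visited |= next_frontier
        frontier = next_frontier
    return visited
```

On a distributed platform the frontier expansion would typically be expressed as a join between the frontier and the edge data set, repeated until no new vertices appear.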

An Efficient Cleaning Scheme for File Defragmentation on Log-Structured File System

Jonggyu Park, Dong Hyun Kang, Euiseong Seo, Young Ik Eom

http://doi.org/

When many processes issue write operations alternately on a Log-structured File System (LFS), the created files can become fragmented at the file system layer even though LFS allocates new blocks sequentially for each process. Unfortunately, this file fragmentation degrades read performance because it increases the number of block I/Os. Additionally, read-ahead operations, which increase the amount of data requested at a time, exacerbate the performance degradation. In this paper, we suggest a new cleaning method for LFS that minimizes file fragmentation. During the cleaning process of LFS, our method sorts valid data blocks by inode number before copying them to a new segment. This sorting relocates fragmented blocks contiguously. In our experiments, the cleaning method eliminates 60% of the file fragmentation present before cleaning. Consequently, it improves sequential read throughput by 21% when read-ahead is applied.
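
The sorting step described in the abstract above can be sketched as follows. This is a simplified illustration in Python, not the authors' file system code: the `Block` record, the use of the file offset as a tie-breaker within an inode, and the `count_runs` fragmentation measure are all assumptions introduced for this example.

```python
from typing import NamedTuple


class Block(NamedTuple):
    inode: int    # inode number of the file that owns the block
    offset: int   # block index within the file (assumed to be known)
    valid: bool   # False if the block has been invalidated


def clean_segment(blocks):
    """Return the valid blocks in the order they would be copied out.

    Sorting by inode number groups blocks of the same file together;
    the offset tie-breaker is an assumption of this sketch.
    """
    valid = [b for b in blocks if b.valid]
    return sorted(valid, key=lambda b: (b.inode, b.offset))


def count_runs(blocks):
    """Count contiguous runs of blocks that belong to the same inode."""
    runs = 0
    previous_inode = None
    for block in blocks:
        if block.inode != previous_inode:
            runs += 1
            previous_inode = block.inode
    return runs
```

With two files written alternately, the valid blocks of a victim segment interleave; after `clean_segment` each file occupies a single run, which is the effect the cleaning method aims for.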

Automatic Correction of Errors in Annotated Corpus Using Kernel Ripple-Down Rules

Tae-Ho Park, Jeong-Won Cha

http://doi.org/

Annotated corpora are important for understanding natural language with machine learning methods. In this paper, we propose a new method to automate the reduction of errors in annotated corpora. We use Ripple-Down Rules (RDR) to reduce errors and a kernel to extend RDR for NLP. We applied our system to errors in Korean Wikipedia and blog corpora to find the types of errors in the annotated corpora. Experimental results from various views of the Korean Wikipedia and blog corpora are reported to evaluate the effectiveness and efficiency of the proposed approach. The proposed approach can be used to reduce errors in large corpora.

Comparing Initiating and Responding Joint Attention as a Social Learning Mechanism : A Study Using Human-Avatar Head/Hand Interaction

Mingyu Kim, So-Yeon Kim, Kwanguk Kim

http://doi.org/

Joint Attention (JA) is known to play a key role in human social learning. However, the relative impact of different interaction types has yet to be rigorously examined because existing methodologies are limited in their ability to simulate human-to-human interaction. In the present study, we designed a new JA paradigm that emulates human-avatar interaction using virtual reality technologies, and tested the paradigm in two experiments with healthy adults. Our results indicated that the initiating JA (IJA) condition was more effective than the responding JA (RJA) condition for social learning in both head and hand interactions. Moreover, the hand interaction involved better information processing than the head interaction. The implications of the results, the validity of the new paradigm, and the limitations of this study are discussed.

Improving The Performance of Triple Generation Based on Distant Supervision By Using Semantic Similarity

Hee-Geun Yoon, Su Jeong Choi, Seong-Bae Park

http://doi.org/

Existing pattern-based triple generation systems built on distant supervision can be flawed by the assumption underlying distant supervision. To resolve the flaws arising from this excessive assumption, previous studies have commonly used statistical information to measure the confidence of patterns. In this study, we proposed a more accurate confidence measure based on the semantic similarity between patterns and properties. An unsupervised learning method, word embedding, and WordNet-based similarity measures were adopted for learning the meaning of words and measuring semantic similarity. To resolve the language mismatch between patterns and properties, we adopted CCA for aligning bilingual word embedding models and a translation-based approach for the WordNet-based measure. The results of our experiments indicated that the accuracy of triples filtered by the semantic similarity-based confidence measure was 16% higher than that of the statistics-based approach. These results suggest that the semantic similarity-based confidence measure is more effective than the statistics-based approach for generating high-quality triples.

Hazard Identification and Testcase Design Method based on Use Case and HAZOP

Sungryong Do, Hyuksoo Han

http://doi.org/

As the number of electric and electronic control systems in vehicles has sharply increased, safety accidents have emerged as an important issue. Therefore, to ensure vehicle safety, engineers are required to identify hazards using techniques such as PHA and HAZOP in the early phase of development and to implement safety mechanisms that prevent them. HAZOP has been widely used as a systematic, guideword-based technique. However, HAZOP identifies malfunctions from the top-level functionality provided by the system, so it cannot sufficiently identify hazards that arise during system operation. This restricts testcase design, because the safety requirements are derived from only some of the hazards. This research aimed to provide a hazard identification method that combines HAZOP with Use case descriptions, which define the operating procedure of the system, and a testcase design method based on safety requirements. To demonstrate the effectiveness of this study, we conducted a case study on a Smart Key Control System in vehicles and compared the results with hazard identification results based on HAZOP alone. The result of this study could potentially reduce development cost and increase system quality by adequately identifying hazards and safety requirements and designing the related testcases.

Design and Evaluation of Information Broker Architecture for Network-Centric Operational Environment

Jejun Park, Dongsu Kang

http://doi.org/

Information superiority through effective networking is a core element that accelerates command decisions for mission completion. Our military wants to acquire effective information sharing capabilities with a Network-Centric Operational Environment (NCOE) for Network-Centric Warfare (NCW). In this paper, we propose an information broker for overcoming current limits and maximizing the future expandability and potential of information sharing capabilities. The information broker, which is an intermediate layer between users and information providers, provides functions for mediating and managing information and for ensuring the security of the system. We evaluated the consistency of the proposed architecture and its implementation of the operational architecture design concept using existing design frameworks.

Korean Named Entity Recognition and Classification using Word Embedding Features

Yunsu Choi, Jeongwon Cha

http://doi.org/

Named Entity Recognition and Classification (NERC) is the task of recognizing and classifying named entities such as a person's name, location, and organization. Various studies have been carried out on Korean NERC, but they have some problems, for example lacking some features compared with English NERC. In this paper, we propose a method that uses word embedding as features for Korean NERC. We generate word vectors using a Continuous Bag-of-Words (CBOW) model from a POS-tagged corpus, and word cluster symbols using a K-means algorithm on the word vectors. We use the word vectors and word cluster symbols as word embedding features in Conditional Random Fields (CRFs). In the experiments, performance improved over the baseline system by 1.17%, 0.61%, and 1.19% for the TV, Sports, and IT domains, respectively. By showing better performance than other NERC systems, we demonstrate the effectiveness and efficiency of the proposed method.

Multiview Data Clustering by using Adaptive Spectral Co-clustering

Jeong-Woo Son, Junekey Jeon, Sang-Yun Lee, Sun-Joong Kim

http://doi.org/

In this paper, we introduce adaptive spectral co-clustering, a spectral clustering method for multiview data, especially data with more than three views. In adaptive spectral co-clustering, performance is improved by sharing information from diverse views. For efficient information sharing, a co-training approach is adopted. In the co-training step, a set of parameters is estimated to make all views in the data maximally independent, and information is then shared with respect to the estimated parameters. This co-training step increases the efficiency of information sharing compared with ordinary feature concatenation and co-training methods that assume independence among views. Adaptive spectral co-clustering was evaluated with a synthetic dataset and a multilingual document dataset. The experimental results demonstrated the efficiency of adaptive spectral co-clustering in terms of the performance at every iteration and the similarity matrix generated with information sharing.

Effective Integer Promotion Bug Detection Technique for Embedded Software

Yunho Kim, Taejin Kim, Moonzoo Kim, Ho-jung Lee, Hoon Jang, Mingyu Park

http://doi.org/

To improve runtime performance, C compilers for the 8-bit MCUs used in washing machines and refrigerators often do not follow the C standard. Developers who are unaware of the differences between standard-conforming C compilers and C compilers for 8-bit MCUs can introduce bugs that do not appear in a standard C environment but do appear in embedded systems using 8-bit MCUs. It is difficult for bug detectors that assume a standard C environment to detect such bugs. In this paper, we introduce integer promotion bugs, which are caused by the integer promotion rules of C compilers for 8-bit MCUs differing from those of the C standard, and propose 5 bug patterns in which integer promotion bugs occur. We have developed an integer promotion bug detection tool and applied it to washing machine control software developed by LG Electronics. The tool successfully detected 27 integer promotion bugs in the washing machine control software.
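
The kind of discrepancy described in the abstract above can be illustrated with a small simulation. The abstract does not list the five bug patterns or the exact behavior of any particular compiler, so the following Python sketch is a hypothetical example only: it assumes a compiler that evaluates the sum of two unsigned 8-bit operands in 8 bits, whereas the C standard promotes both operands to `int` before adding. The function names are invented for this illustration.

```python
def add_with_promotion(a, b):
    """Standard C behavior: unsigned 8-bit operands are promoted to int,
    so the sum of two values in 0..255 cannot wrap around."""
    return a + b


def add_without_promotion(a, b):
    """Assumed non-standard behavior: the addition is carried out in
    8 bits, so the result wraps modulo 256."""
    return (a + b) & 0xFF


def exceeds_limit(add, a, b, limit=255):
    """Models a check such as `if (a + b > limit)` under a given addition rule."""
    return add(a, b) > limit
```

With `a = 200` and `b = 100`, the promoted sum is 300 and the check succeeds, while the 8-bit sum wraps to 44 and the same check silently fails. A detector that assumes standard promotion would not flag this code.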

Improving Join Performance for SPARQL Query Processing in the Clouds

Gyu-Jin Choi, Yun-Hee Son, Kyu-Chul Lee

http://doi.org/

With the recent rapid growth of LOD (Linked Open Data), existing methods based on a single machine have limited performance. Existing solutions use a distributed framework such as MapReduce to improve performance. However, processing SPARQL queries on the MapReduce framework involves multiple MapReduce jobs, which incur additional costs. In addition, unnecessary data is processed. In this study, we propose a method that reduces the number of MapReduce jobs during SPARQL query processing, together with bitmap-based join indexes that minimize the cost of processing unnecessary data.

Smart Fog : Advanced Fog Server-centric Things Abstraction Framework for Multi-service IoT System

Gyeonghwan Hong, Eunsoo Park, Sihoon Choi, Dongkun Shin

http://doi.org/

Recently, several things abstraction frameworks have been proposed in order to implement multi-service Internet of Things (IoT) systems, in which various IoT services share the thing devices. Distributed things abstraction has an IoT service duplication problem, which increases the power consumption of mobile devices and the network traffic. On the other hand, cloud server-centric things abstraction cannot support real-time interactions because of long network delays. Fog server-centric things abstraction is limited by insufficient IoT interfaces. In this paper, we propose Smart Fog, a fog server-centric things abstraction framework that resolves the problems of the existing things abstraction frameworks. Smart Fog consists of software modules that operate the Smart Gateway and three interfaces. Smart Fog is implemented on the IoTivity framework and the OIC standard. We construct a smart home prototype on an Odroid-XU3 embedded board using Smart Fog and evaluate its network performance and energy efficiency. The experimental results indicate that Smart Fog achieves a network latency short enough for real-time interaction. The results also show that, compared to distributed things abstraction, the proposed framework reduces network traffic by 74% and mobile device power consumption by 21%.

An Effective Concept Drift Detection Method on Streaming Data Using Probability Estimates

Young-In Kim, Cheong Hee Park

http://doi.org/

In streaming data analysis, detecting concept drift accurately is important for maintaining the performance of a classification model. Error rates are usually used for concept drift detection. However, when prediction results are described with only binary values of 0 or 1, useful information about the behavior pattern of a classifier can be lost. In this paper, we propose an effective concept drift detection method that describes the performance pattern of a classifier using probability estimates for class prediction and detects significant changes in classifier behavior. Experimental results on synthetic and real streaming data show the efficiency of the proposed method for detecting the occurrence of concept drift.
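
To make concrete why probability estimates can carry more information than binary error values, the following Python sketch compares the two on a toy stream. It is not the detection statistic proposed in the paper, which the abstract does not specify. The window-based mean comparison, the parameter names `window` and `threshold`, and the rule that a prediction counts as an error when the true-class probability is at most 0.5 are all assumptions of this example.

```python
from statistics import mean


def error_rate(true_class_probs):
    """Binary view: a prediction counts as an error (1) when the probability
    assigned to the true class is at most 0.5 (assumed two-class setting)."""
    return mean([0 if p > 0.5 else 1 for p in true_class_probs])


def detect_drift(true_class_probs, window=5, threshold=0.2):
    """Flag drift when the mean true-class probability in the most recent
    window falls more than `threshold` below that of the first window."""
    if len(true_class_probs) < 2 * window:
        return False  # not enough observations to compare two windows
    reference = true_class_probs[:window]
    recent = true_class_probs[-window:]
    return mean(reference) - mean(recent) > threshold
```

For a stream whose true-class probabilities fall from 0.9 to 0.6, every prediction is still correct, so the error rate stays at 0, yet `detect_drift` reports a change in classifier behavior.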


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr