Search : [ author: Dongkun Shin ] (9)

Octave-YOLO: Direct Multi-scale Feature Fusion for Object Detection

Sangjune Shin, Dongkun Shin

http://doi.org/10.5626/JOK.2024.51.9.792

In object detection research, multiscale feature fusion—combining feature maps of different scales to detect objects of varying sizes—has become a critical focus. Network structures like Feature Pyramid Networks (FPNs) and Path Aggregation Networks (PANets) have been developed to address this challenge. PANet, an enhancement of FPN, integrates both top-down and bottom-up pathways, leading to significant improvements in object detection performance. However, during multiscale feature fusion, PANet’s upscaling and downscaling processes can result in the loss of crucial low- or high-level information from the original feature maps. In this paper, we introduce the Octave C2f module, which employs octave convolution to seamlessly fuse feature maps of different sizes without the need for additional processing. This innovative approach enhances accuracy while reducing computational complexity. Experimental results on the PASCAL VOC and MS COCO datasets demonstrate improved accuracy, reduced computational effort, and a decrease in parameter count compared to the default YOLOv8 model.
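The cross-scale exchange that octave convolution performs can be sketched in a few lines. This is a minimal numpy illustration, not the actual Octave C2f module: `octave_fuse` and its scalar weights are hypothetical stand-ins for the learned convolutions, showing only how a full-resolution and a half-resolution map exchange information via pooling and upsampling rather than via a separate resizing pass.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling (halves the spatial resolution)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # nearest-neighbor 2x upsampling (doubles the spatial resolution)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def octave_fuse(x_high, x_low, w_hh=1.0, w_ll=1.0, w_hl=1.0, w_lh=1.0):
    """Octave-style fusion: each output gets a same-scale path plus a
    cross-scale path (pool for high->low, upsample for low->high).
    The w_* scalars stand in for learned convolution weights."""
    y_high = w_hh * x_high + w_lh * upsample2(x_low)
    y_low = w_ll * x_low + w_hl * avg_pool2(x_high)
    return y_high, y_low

x_high = np.arange(16, dtype=float).reshape(4, 4)  # full-resolution features
x_low = np.ones((2, 2))                            # half-resolution features
y_high, y_low = octave_fuse(x_high, x_low)
```

Note that both outputs keep their native resolutions, so no extra up/downscaling of the fused result is needed before the next stage.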

Accelerating DNN Models via Hierarchical N:M Sparsity

Seungmin Yu, Hayun Lee, Dongkun Shin

http://doi.org/10.5626/JOK.2024.51.7.583

N:M sparsity pruning is an effective approach for compressing deep neural networks by leveraging NVIDIA’s Sparse Tensor Core technology. Despite its effectiveness, this technique is constrained by hardware limitations: it supports only fixed compression ratios, increases access to unnecessary input data, and does not adequately address the imbalanced distribution of essential parameters. This paper proposes Hierarchical N:M (HiNM) sparsity, which applies vector sparsity prior to N:M sparsity to support various levels of sparsity. We also introduce a novel permutation technique tailored to HiNM sparsity, named 2-axis channel permutation (2CP). Experimental results showed that HiNM sparsity achieves a compression ratio twice that of traditional N:M sparsity while reducing latency by an average of 37%.
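The base N:M step can be sketched as magnitude pruning over fixed-size blocks. This is only the standard N:M stage (e.g. the 2:4 pattern Sparse Tensor Cores accelerate); in HiNM, a vector-sparsity stage would run before it, which is not shown here.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Magnitude-based N:M pruning: in every contiguous block of M weights
    along the last axis, keep the N largest-magnitude entries and zero
    the rest."""
    flat = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each block
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([[0.1, -0.9, 0.5, 0.05, 2.0, -0.2, 0.3, 1.1]])
out = nm_prune(w)   # each 4-block keeps exactly its 2 largest magnitudes
```

The fixed ratio n/m is the hardware constraint the abstract refers to: every block must keep exactly N survivors, regardless of where the important parameters actually fall.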

Optimizing Computation of Tensor-Train Decomposed Embedding Layer

Seungmin Yu, Hayun Lee, Dongkun Shin

http://doi.org/10.5626/JOK.2023.50.9.729

Personalized recommendation systems are ubiquitous in daily life. However, the embedding tables used by deep-learning-based recommendation models require a huge amount of memory, taking up most of the resources of industrial AI data centers. One promising solution is Tensor-Train (TT) decomposition, a compression technique for deep neural networks. In this study, we analyze unnecessary computations in TT-Gather and Reduce (TT-GnR), the embedding-layer operation under TT decomposition. To eliminate them, we define a computational unit called a group, which binds several item vectors together, and propose the Group-Reduced TT-Gather and Reduce (GRT-GnR) operation, which computes over groups instead of individual item vectors. Since GRT-GnR is computed over groups, its cost varies depending on how the item vectors are grouped. Experimental results showed that GRT-GnR reduces latency by 41% compared to the conventional TT-GnR operation.
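A baseline TT-GnR lookup can be sketched with two TT cores. This is an illustrative two-core sketch with made-up dimensions, showing only the baseline operation the paper optimizes: each row is rebuilt from small core slices instead of being stored, and a "reduce" sums several gathered rows as an embedding-bag lookup would. The paper's GRT-GnR, which shares partial products within a group, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, D1, D2, R = 4, 5, 3, 2, 2        # table: I1*I2 = 20 rows, D1*D2 = 6 cols
G1 = rng.standard_normal((I1, D1, R))    # first TT core
G2 = rng.standard_normal((I2, R, D2))    # second TT core

def tt_gather(idx):
    """Reconstruct one embedding row from the TT cores (TT-Gather):
    the row index is split into one sub-index per core."""
    i1, i2 = divmod(idx, I2)
    return np.einsum('dr,re->de', G1[i1], G2[i2]).reshape(-1)

def tt_gather_reduce(indices):
    """TT-GnR: gather several item vectors and sum (reduce) them."""
    return sum(tt_gather(i) for i in indices)

# Dense table the cores implicitly represent, for checking the lookup
full = np.einsum('adr,bre->abde', G1, G2).reshape(I1 * I2, D1 * D2)
```

The memory saving comes from storing only the cores (here 24 + 20 values) instead of the 120-value dense table; the cost is the per-lookup multiplication, which GRT-GnR amortizes across a group.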

Code Generation and Data Layout Transformation Techniques for Processing-in-Memory

Hayun Lee, Gyungmo Kim, Dongkun Shin

http://doi.org/10.5626/JOK.2023.50.8.639

Processing-in-Memory (PIM) capitalizes on the internal parallelism and bandwidth of memory systems, thereby achieving superior performance to CPUs or GPUs on memory-intensive operations. Although many PIM architectures have been proposed, compiler support for PIM remains under-studied. To generate efficient program code for PIM devices, a PIM compiler must optimize operation schedules and data layouts. Additionally, register reuse in the PIM processing units must be maximized to reduce data-movement traffic between the host and PIM devices. We propose a PIM compiler that supports various PIM architectures; it achieves up to a 2.49x performance improvement on GEMV operations through register-reuse optimization.

Host-Level I/O Scheduler for Achieving Performance Isolation with Open-Channel SSDs

Sooyun Lee, Kyuhwa Han, Dongkun Shin

http://doi.org/10.5626/JOK.2020.47.2.119

As Solid State Drives (SSDs) provide higher I/O performance and lower energy consumption than Hard Disk Drives (HDDs), their adoption is widening in areas such as data centers and cloud computing, where multiple users share resources. Accordingly, research effort on ensuring Quality of Service (QoS) in shared-resource environments has grown. The previously proposed Workload-Aware Budget Compensation (WA-BC) scheduler aims to ensure QoS among multiple Virtual Machines (VMs) sharing an NVMe SSD. However, the WA-BC scheduler has a weakness: it misuses multi-stream SSDs to identify workload characteristics. In this paper, we propose a new host-level I/O scheduler that addresses this shortcoming of the WA-BC scheduler and aims to eliminate performance interference between users sharing an Open-Channel SSD. The proposed scheduler identifies workload characteristics without allocating separate SSD streams, by observing the sequentiality of I/O requests. Although the proposed scheduler resides in the host, it can reflect the state of the device internals by exploiting the characteristics of Open-Channel SSDs. We show that using workload characteristics to identify the users who contribute most to garbage collection, a source of I/O interference within SSDs, and penalizing them achieves performance isolation among users sharing storage resources.
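A sequentiality metric of the kind described can be sketched in a few lines. This is an illustrative stand-in, not the paper's scheduler: it scores a request stream by the fraction of requests whose start address immediately follows the end of the previous one, which a host-level scheduler could use to classify a user's workload as sequential or random (random-heavy workloads tend to induce more garbage collection).

```python
def sequentiality(requests):
    """Fraction of adjacent request pairs that are back-to-back.
    Each request is a (start_lba, length) pair."""
    seq = 0
    for prev, cur in zip(requests, requests[1:]):
        if cur[0] == prev[0] + prev[1]:   # starts where the last one ended
            seq += 1
    return seq / max(len(requests) - 1, 1)

# Three sequential 8-block requests, then a jump, then one more sequential pair
stream = [(0, 8), (8, 8), (16, 8), (100, 4), (104, 4)]
score = sequentiality(stream)
```

A scheduler could maintain this score per user over a sliding window and penalize the I/O budget of users whose streams are mostly random.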

Performance and Energy Comparison of Different BLAS and Neural Network Libraries for Efficient Deep Learning Inference on ARM-based IoT Devices

Hayun Lee, Dongkun Shin

http://doi.org/10.5626/JOK.2019.46.3.219

Deep learning for IoT devices is generally offloaded to cloud computing. However, this approach suffers from limitations such as connection instability, energy consumption for communication, and security vulnerabilities. To solve these problems, recent work has attempted to perform deep learning within IoT devices themselves, mainly proposing lightweight deep learning models or compression techniques for IoT devices, but it lacks analysis of their effects on actual devices. Since each IoT device has a different configuration of processing units and supported libraries, the various execution environments of each device must be analyzed to perform deep learning optimally. In this study, the performance and energy of IoT devices with various hardware configurations were measured and analyzed according to the applied deep learning model, library, and compression technique. We established that choosing the appropriate libraries improves speed and energy efficiency by up to 13.3x and 48.5x, respectively.

Smart Fog: Advanced Fog Server-centric Things Abstraction Framework for Multi-service IoT System

Gyeonghwan Hong, Eunsoo Park, Sihoon Choi, Dongkun Shin

http://doi.org/

Recently, several things abstraction frameworks have been proposed to implement multi-service Internet of Things (IoT) systems, in which various IoT services share the thing devices. Distributed things abstraction suffers from an IoT service duplication problem, which aggravates mobile-device power consumption and network traffic. Cloud server-centric things abstraction, on the other hand, cannot support real-time interaction due to long network delays, and existing fog server-centric things abstraction is limited by insufficient IoT interfaces. In this paper, we propose Smart Fog, a fog server-centric things abstraction framework that resolves these problems of existing frameworks. Smart Fog consists of software modules that operate the Smart Gateway, together with three interfaces, and is implemented on the IoTivity framework following the OIC standard. Using Smart Fog, we construct a smart-home prototype on an Odroid-XU3 embedded board and evaluate its network performance and energy efficiency. The experimental results indicate that Smart Fog achieves network latency short enough for real-time interaction, and reduces network traffic by 74% and mobile-device power consumption by 21% compared to distributed things abstraction.

Storage I/O Subsystem for Guaranteeing Atomic Write in Database Systems

Kyuhwa Han, Dongkun Shin, Yongserk Kim

http://doi.org/

The atomic write technique is a good solution to the overhead of the double write buffer in database systems. It requires modified I/O subsystems (i.e., the file system and I/O scheduler) and a special SSD that guarantees the atomicity of write requests. In this paper, we propose a writing-unit-aligned block allocation technique for the EXT4 file system and a request-merge prevention technique for the CFQ scheduler. We also propose an atomic-write-supporting SSD that stores atomicity information in the spare area of each flash memory page. We evaluate the performance of the proposed atomic write scheme in MariaDB using the tpcc-mysql and SysBench benchmarks. The experimental results show that the proposed atomic write technique improves performance by 1.4–1.5x compared to the double write buffer technique.
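The atomicity guarantee such an SSD provides can be sketched as recovery logic. This is a hypothetical illustration, not the paper's design: the `txn`, `seq`, and `total` field names are invented, standing in for whatever atomicity metadata the SSD stores in each page's spare area. After a crash, a multi-page write is applied only if all of its pages survived; torn transactions are discarded.

```python
from collections import defaultdict

def recover(pages):
    """Crash-recovery sketch for atomic multi-page writes. Each surviving
    flash page carries (txn, seq, total) metadata in its spare area; a
    transaction is committed only if all `total` of its pages are present,
    otherwise every page of it is discarded -- making the write atomic."""
    groups = defaultdict(list)
    for p in pages:
        groups[p['txn']].append(p)
    committed = []
    for plist in groups.values():
        if len(plist) == plist[0]['total']:      # complete transaction only
            committed += sorted(plist, key=lambda p: p['seq'])
    return committed

flash = [
    {'txn': 1, 'seq': 0, 'total': 2, 'data': 'a'},
    {'txn': 1, 'seq': 1, 'total': 2, 'data': 'b'},
    {'txn': 2, 'seq': 0, 'total': 2, 'data': 'c'},  # torn: second page lost
]
survivors = recover(flash)
```

Because all-or-nothing recovery is guaranteed by the device, the database no longer needs to write each page twice (once to the double write buffer, once in place).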

Partial Garbage Collection Technique for Improving Write Performance of Log-Structured File Systems

Hyunho Gwak, Dongkun Shin

http://doi.org/

Recently, flash storage devices have become popular. Log-structured file systems (LFS) are well suited to flash storage since they issue only sequential writes to the device, providing high write performance. However, an LFS must perform garbage collection (GC) to reclaim obsolete space. A slack space recycling (SSR) technique was recently proposed to reduce the GC overhead, but since SSR generates random writes, write performance can suffer when the device's random write performance is significantly lower than its sequential write performance. This paper proposes a partial garbage collection technique that copies only a part of the valid blocks in a victim segment in order to enlarge the contiguous invalid space available to SSR. Our experiments show that the partial GC technique significantly improves write performance on an SD card.
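The core idea can be sketched on a toy segment of valid ('V') and invalid ('I') blocks. This is an illustrative model, not the paper's algorithm: it relocates valid blocks one at a time, always the one bordering the longest invalid run, and stops as soon as that run is large enough for SSR, rather than cleaning the whole victim segment as full GC would.

```python
def longest_run(seg):
    """Return (start, length) of the longest contiguous run of invalid blocks."""
    best, cur = (0, 0), None
    for i, b in enumerate(list(seg) + ['V']):    # sentinel closes a trailing run
        if b == 'I':
            cur = i if cur is None else cur
        elif cur is not None:
            if i - cur > best[1]:
                best = (cur, i - cur)
            cur = None
    return best

def partial_gc(seg, target):
    """Copy out valid blocks (marking them 'I') one at a time, always the
    block bordering the longest invalid run, until that run reaches
    `target` blocks of contiguous free space for SSR to recycle."""
    seg, copied = list(seg), 0
    while True:
        start, length = longest_run(seg)
        if length >= target:
            return seg, copied
        for i in (start + length, start - 1):    # neighbors of the run
            if 0 <= i < len(seg) and seg[i] == 'V':
                seg[i] = 'I'                     # block relocated elsewhere
                copied += 1
                break
        else:
            return seg, copied                   # no valid block left to move

seg, copied = partial_gc(list("VIIVVIV"), target=4)
```

In this toy segment, only two of the four valid blocks need to be copied to open a four-block contiguous run, whereas full GC would have copied all four.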


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr