Search : [ keyword: computer vision ] (10)

A Survey of Advantages of Self-Supervised Learning Models in Visual Recognition Tasks

Euihyun Yoon, Hyunjong Lee, Donggeon Kim, Joochan Park, Jinkyu Kim, Jaekoo Lee

http://doi.org/10.5626/JOK.2024.51.7.609

Recently, the field of supervised artificial intelligence (AI) has been advancing rapidly. However, supervised learning relies on labeled datasets, and obtaining those labels can be costly. To address this issue, self-supervised learning, which can learn general features of images without labels, is being researched. In this paper, various self-supervised learning models were classified by their learning methods and backbone networks, and their strengths, weaknesses, and performances were compared and analyzed. Image classification tasks were used for the performance comparison, and to compare transfer learning performance, fine-grained prediction tasks were also analyzed. As a result, models that use only positive pairs achieved higher performance than models that use both positive and negative pairs by minimizing noise. Furthermore, for fine-grained prediction, methods that mask images during learning or use multi-stage models achieved higher performance by additionally learning local information.
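As a toy illustration of the two loss families compared in this survey (not code from any surveyed paper), a positive-pair-only objective in the style of BYOL/SimSiam and an InfoNCE-style contrastive objective with negatives can be sketched as:

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def positive_pair_loss(z1, z2):
    # BYOL/SimSiam-style: pull two augmented views of the same image
    # together; no negative pairs are needed
    return 1.0 - cosine(z1, z2)

def contrastive_loss(z, z_pos, negatives, tau=0.5):
    # InfoNCE-style: the positive similarity competes against a set of
    # negatives via a softmax with temperature tau
    sims = [cosine(z, z_pos)] + [cosine(z, n) for n in negatives]
    logits = np.array(sims) / tau
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))
```

The survey's finding is that the first family avoids the noise introduced by sampling negatives, which is why it performed better in the comparison.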

Improvement of Background Inpainting using Binary Masking of a Generated Image

Jihoon Lee, Chan Ho Bae, Seunghun Lee, Myung-Seok Choi, Ryong Lee, Sangtae Ahn

http://doi.org/10.5626/JOK.2024.51.6.537

Recently, image generation technology has been advancing rapidly in the field of deep learning. One of the most effective ways to generate images is to condition them on text prompts, and models that use this technique perform remarkably well. However, it is difficult to change specific parts of an image naturally using text prompts alone, a well-known limitation of conventional image generation models. Thus, in this study, we developed a background inpainting technique that extracts text for each region of an image and uses it to change the background seamlessly while preserving the objects in the image. In particular, the background transformation inpainting technique developed in this study can transform not only a single image but also multiple images rapidly. Therefore, the proposed text-prompt-based image style transfer can be used in fields with limited training data, and the technique can enhance model performance through image augmentation.
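The core binary-masking step can be sketched as a simple composite (an illustrative NumPy sketch, not the authors' implementation): pixels where the mask is 1 keep the original object, and the rest come from the generated background.

```python
import numpy as np

def composite_with_mask(original, generated_bg, mask):
    # mask: (H, W) binary array, 1 where the object must be preserved.
    # Broadcast the mask over the RGB channels and blend the two images.
    mask3 = mask[..., None].astype(float)
    return mask3 * original + (1.0 - mask3) * generated_bg
```

With a soft (non-binary) mask the same formula gives a feathered transition at the object boundary.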

A Proposal for Lightweight Human Action Recognition Model with Video Frame Selection for Residential Area

Sohyeon Kim, Ji-Hyeong Han

http://doi.org/10.5626/JOK.2023.50.12.1111

Closed-circuit televisions (CCTVs) in residential areas need human action recognition (HAR) to predict accidents and other critical problems. A HAR model must be not only accurate but also light and fast to be applicable in the real world. Therefore, in this paper, a cross-modal PoseC3D model with a frame selection method is proposed. The proposed cross-modal PoseC3D model integrates multi-modal inputs (i.e., RGB images and human skeleton data) and trains them in a single model, making it lighter and faster than previous works such as the two-pathway PoseC3D. Moreover, we apply a frame selection method that uses only meaningful frames, chosen by the differences between frames, instead of every frame of a video. The AI Hub open dataset was used to verify the performance of the proposed method. The experimental results showed that the proposed method achieves similar or better performance while being much lighter and faster than previous works.
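The frame selection idea — keep only frames that differ meaningfully from the previously kept frame — can be sketched as follows (an illustrative sketch; the paper's exact selection criterion may differ):

```python
import numpy as np

def select_frames(frames, threshold=10.0):
    # Keep a frame only when its mean absolute pixel difference from the
    # last kept frame exceeds the threshold, skipping static frames.
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float)
                      - frames[kept[-1]].astype(float)).mean()
        if diff > threshold:
            kept.append(i)
    return kept
```

On mostly static CCTV footage this discards the bulk of the frames, which is where the speed-up comes from.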

Exploring Neural Network Models for Road Classification in Personal Mobility Assistants: A Comparative Study on Accuracy and Computational Efficiency

Gwanghee Lee, Sangjun Moon, Kyoungson Jhang

http://doi.org/10.5626/JOK.2023.50.12.1083

With the increasing use of personal mobility devices, the frequency of traffic accidents has also risen, with most accidents resulting from collisions with cars or pedestrians. Notably, compliance with traffic rules on the roads is low. Auxiliary systems that recognize roads and provide information about them could help reduce the number of accidents. Since road images have distinct material characteristics, models studied in the field of image classification are well suited to this task. In this study, we compared the performance of various road image classification models with parameter counts ranging from 2 million to 30 million, enabling selection of an appropriate model for each situation. The majority of the models achieved an accuracy of over 95%, and most surpassed 99% in top-2 accuracy. Among the models, MobileNet v2 had the fewest parameters while still performing excellently, and EfficientNet showed stable accuracy, surpassing 90% across all classes.
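Top-2 accuracy, the metric reported above, counts a prediction as correct when the true class is among the two highest-scoring classes; a minimal sketch:

```python
import numpy as np

def topk_accuracy(logits, labels, k=2):
    # Fraction of samples whose true class appears among the k classes
    # with the highest scores.
    topk = np.argsort(logits, axis=1)[:, -k:]
    hits = [label in row for row, label in zip(topk, labels)]
    return sum(hits) / len(labels)
```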

A Survey on Methods for Image Description

Subin Ok, Daeho Lee

http://doi.org/10.5626/JOK.2023.50.3.210

Image description, which has been receiving much attention with the development of deep learning, combines computer vision methods that identify the contents of images with natural language processing methods that produce descriptive sentences. Image description techniques are used in many applications, including services for visually impaired people. In this paper, we summarize image description methods in three categories: template-based methods, visual/semantic similarity search-based methods, and deep learning-based methods, and compare their performances. Through this comparison, we aim to provide useful information on the basic architectures, advantages, limitations, and performances of the models. We survey the deep learning-based methods in particular detail because their performance is significantly better than that of the other methods. Through this process, we aim to organize the overall landscape of image description techniques. For the performance of each study, we compare METEOR and BLEU scores on the commonly used Flickr30K and MS COCO datasets; when these results are not provided, we examine the test images and the sentences generated for them.

Video Object Detection Network by Estimation of Center and Movement of The Object by Stacking Continuous Images

Hayoung Son, Yujin Lee, Kaewon Choi

http://doi.org/10.5626/JOK.2022.49.6.416

In environments such as spacious ports, which are difficult to monitor at once, various obstacles such as large containers and logistics machinery are present. We studied object detection methods to track very small pedestrian and port-vehicle objects. Since the model must learn small objects with unclear shapes, we trained a model based on CenterNet, an anchor-free network, and stacked several consecutive images to supplement the information on very small objects. In addition, the lack of datasets caused by this special environment was addressed through data augmentation: using multiple datasets together, randomly selecting multiple still images, and processing them into a continuous sequence, thereby preventing overfitting.
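Stacking consecutive frames so the detector sees temporal context can be sketched as a simple concatenation along a new axis (illustrative only; the model's exact input format may differ):

```python
import numpy as np

def stack_frames(frames):
    # Concatenate T consecutive (H, W) frames into a single (H, W, T)
    # input, so motion across frames supplements the appearance of very
    # small objects.
    return np.stack(frames, axis=-1)
```

For the augmentation described above, the same function can be fed T randomly chosen still images to synthesize a pseudo-sequence.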

Data Augmentation for Image based Parking Space Classification Deep Model

Hojin Yoo, Kyungkoo Jun

http://doi.org/10.5626/JOK.2022.49.2.126

A parking occupancy determination system using an ultrasonic sensor or a camera is mainly used in indoor parking lots. In outdoor parking lots, however, the introduction of these systems is limited by high installation cost and accuracy problems. In addition, the application of deep learning is restricted because representative training data is difficult to obtain under diverse lighting conditions, camera positions, and features. In this paper, we analyzed the effect of augmentation techniques on the performance of a deep model for parking status classification in such a data-shortage situation. To this end, parking area images were classified by situation, and four augmentation techniques were applied to the training of ResNet, EfficientNet, and MobileNet. In the performance evaluation, accuracy improved by up to 5.2%, 8.67%, and 15.44%p for the mixup, stopper, and rescaling methods, respectively. On the other hand, for center crop, which had been reported to improve performance in other studies, accuracy decreased by an average of 4.86%p.
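Of the augmentations compared, mixup is the most standard; a minimal sketch (illustrative, not the paper's code) blends two images and their one-hot labels with a Beta-distributed coefficient:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Draw lambda ~ Beta(alpha, alpha) and form convex combinations of
    # both the images and their one-hot labels.
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Because the blended label stays a valid probability distribution, the mixed sample can be trained on with the usual cross-entropy loss.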

Boosting Image Caption Generation with Parts of Speech

Philgoo Kang, Yubin Lim, Hyoungjoo Kim

http://doi.org/10.5626/JOK.2021.48.3.317

With smart devices and AI increasingly integrated into our daily lives, the ability to generate image captions is becoming more important in fields such as guidance for visually impaired individuals and human-computer interaction. In this paper, we propose a novel approach that uses parts of speech (POS), such as nouns and verbs extracted from the image, to enhance image caption generation. The proposed model exploits multiple CNN encoders, each specifically trained to identify features related to a POS, and feeds them into an LSTM decoder to generate image captions. We conducted experiments on both the Flickr30k and MS-COCO datasets using several text metrics, along with human surveys, to validate the practical effectiveness of the proposed model.

3D Object-grabbing Hand Tracking based on Depth Reconstruction and Prior Knowledge of Grasp

Woojin Cho, Gabyong Park, Woontack Woo

http://doi.org/10.5626/JOK.2019.46.7.673

We propose a real-time 3D object-grabbing hand tracking system based on prior knowledge of grasping an object. Tracking a hand interacting with an object is more difficult than tracking an isolated hand, since it must handle occlusion by the object. Most previous studies rely on insufficient data that lacks examples of occluded hands, and they overlook the fact that the presence of an object can itself constrain the pose of the hand. In the present work, we focused on sequences of a hand grabbing an object by utilizing prior knowledge about the grasp situation. Consequently, the missing depth data of the hand occluded by the object was reconstructed with plausible depth values, and a reinitialization process was conducted based on plausible human grasp poses. The effectiveness of the proposed process was verified with a model-based tracker using particle swarm optimization. Quantitative and qualitative experiments demonstrate that the proposed processes can effectively improve the performance of a model-based tracker for the object-grabbing hand.

Automatic Generation of HTML Code Based on Web Page Sketch

Bada Kim, Sangmin Park, Taeyeon Won, Junyung Heo

http://doi.org/10.5626/JOK.2019.46.1.9

Various studies have been conducted to automatically encode GUI designs in web application development. Past studies focused on object region detection using computer vision and on object detection based on deep learning, but they were limited by incorrect or missed detection of objects. In the present work, the two technologies were applied together to reduce these limitations: computer vision is used for layout detection, and deep learning is used for GUI object detection. The detected layouts and GUI objects were then converted into HTML code. Consequently, the accuracy and recall of GUI object detection were 91% and 86%, respectively, and conversion into HTML code was possible.


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr