Digital Library: Search Results
A Survey of Advantages of Self-Supervised Learning Models in Visual Recognition Tasks
Euihyun Yoon, Hyunjong Lee, Donggeon Kim, Joochan Park, Jinkyu Kim, Jaekoo Lee
http://doi.org/10.5626/JOK.2024.51.7.609
Recently, the field of supervised artificial intelligence (AI) has been advancing rapidly. However, supervised learning relies on labeled datasets, and obtaining these labels can be costly. To address this issue, self-supervised learning, which learns general image features without labels, is being actively researched. In this paper, various self-supervised learning models were classified by learning method and backbone network, and their strengths, weaknesses, and performance were compared and analyzed. Image classification tasks were used for the performance comparison, and fine-grained prediction tasks were additionally analyzed to compare transfer-learning performance. As a result, models that use only positive pairs achieved higher performance than models that use both positive and negative pairs, by minimizing noise. Furthermore, for fine-grained prediction, methods that mask images during learning or use multi-stage models achieved higher performance by additionally learning local information.
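The contrast between the two families of methods above can be made concrete with their loss functions. Below is a minimal PyTorch sketch, assuming (N, D) embedding batches from two augmented views; the specific models surveyed in the paper may differ in detail:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss with positive AND negative pairs (SimCLR-style).
    Other samples in the batch act as negatives, which can inject noise."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (N, N) similarities
    labels = torch.arange(z1.size(0), device=z1.device) # diagonal = positives
    return F.cross_entropy(logits, labels)

def positive_only_loss(p1, z2):
    """Positive-pair-only loss (BYOL/SimSiam-style).
    p1: predictor output for view 1; z2: stop-gradient target for view 2.
    No negatives are used, avoiding noise from false negatives."""
    return 2 - 2 * F.cosine_similarity(p1, z2.detach(), dim=1).mean()
```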
A Proposal for Lightweight Human Action Recognition Model with Video Frame Selection for Residential Area
http://doi.org/10.5626/JOK.2023.50.12.1111
Closed-circuit televisions (CCTVs) in residential areas need human action recognition (HAR) to anticipate accidents and other critical incidents. A HAR model must be not only accurate but also light and fast to be applied in the real world. Therefore, in this paper, a cross-modal PoseC3D model with a frame selection method is proposed. The proposed cross-modal PoseC3D model integrates multi-modal inputs (i.e., RGB images and human skeleton data) and trains them in a single model, so it is lighter and faster than previous works such as the two-pathway PoseC3D. Moreover, we apply a frame selection method that uses only the meaningful frames, chosen by the differences between frames, instead of every frame of a video. The AI Hub open dataset was used to verify the performance of the proposed method. The experimental results showed that the proposed method achieves similar or better performance while being much lighter and faster than previous works.
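One simple way to realize difference-based frame selection is to score each frame by how much it changes from its predecessor and keep the top k. This is a sketch of the general idea, not the paper's exact procedure; the function name and the mean-absolute-difference score are assumptions:

```python
import numpy as np

def select_key_frames(frames, k=16):
    """Keep the k frames that differ most from their predecessor,
    as a proxy for 'meaningful' motion.

    frames: (T, H, W, C) uint8 video clip.
    Returns indices of the selected frames in temporal order.
    """
    gray = frames.astype(np.float32).mean(axis=-1)           # (T, H, W)
    diffs = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))  # (T-1,) change scores
    scores = np.concatenate(([diffs[0]], diffs))             # give frame 0 a score too
    return np.sort(np.argsort(scores)[-k:])                  # largest changes, in order
```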
A Survey on Methods for Image Description
http://doi.org/10.5626/JOK.2023.50.3.210
Image description, which has been receiving much attention with the development of deep learning, combines computer vision methods that identify the contents of images with natural language processing methods that produce descriptive sentences. Image description techniques are used in many applications, including services for visually impaired people. In this paper, we summarize image description methods in three categories, namely template-based methods, visual/semantic similarity search-based methods, and deep learning-based methods, and compare their performance. Through this comparison, we aim to provide useful information by presenting the basic architectures, advantages, limitations, and performance of the models. We survey the deep learning-based methods in particular detail because their performance is significantly better than that of the other methods. Through this process, we aim to organize the overall landscape of image description techniques. For each study, we compare METEOR and BLEU scores on the commonly used Flickr30k and MS COCO datasets; when these results are not reported, we examine the test images and the sentences generated for them.
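For reference, BLEU scores of the kind compared in the survey can be computed with NLTK. A minimal sketch with made-up captions; the surveyed papers' exact evaluation pipelines may differ:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions and a generated caption for one image.
references = [
    "a man rides a horse on the beach".split(),
    "a person riding a horse along the shore".split(),
]
candidate = "a man riding a horse on the shore".split()

# BLEU-4 with smoothing, as commonly reported on Flickr30k / MS COCO.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```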
Video Object Detection Network by Estimation of Center and Movement of the Object by Stacking Continuous Images
Hayoung Son, Yujin Lee, Kaewon Choi
http://doi.org/10.5626/JOK.2022.49.6.416
In an environment such as a spacious port, which is difficult to monitor all at once, various obstacles such as large containers and logistics machinery are present. We studied object detection methods to track very small pedestrian and port-vehicle objects. Since the model must learn small objects with unclear shapes, we trained a model based on CenterNet, an anchor-free network, and stacked several consecutive images as input to supplement the information on very small objects. In addition, the lack of datasets caused by this special environment was addressed by a data augmentation scheme that combines multiple datasets, randomly selects several still images, and processes them into a continuous image sequence, thereby preventing overfitting.
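A common way to feed consecutive images to a single-image detector is to stack them along the channel axis. This is a sketch of that generic idea under assumed shapes; the paper's exact stacking scheme may differ:

```python
import numpy as np

def stack_consecutive_frames(frames, t=3):
    """Stack t consecutive frames along the channel axis so a
    single-image detector (e.g., CenterNet) sees short-term motion.

    frames: (T, H, W, 3) clip; returns (T - t + 1, H, W, 3 * t).
    """
    windows = [frames[i:i + t] for i in range(len(frames) - t + 1)]
    return np.stack([np.concatenate(list(w), axis=-1) for w in windows])
```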
Data Augmentation for Image based Parking Space Classification Deep Model
http://doi.org/10.5626/JOK.2022.49.2.126
Parking occupancy detection systems using ultrasonic sensors or cameras are mainly used in indoor parking lots. In outdoor parking lots, however, the introduction of such systems is limited by high installation costs and accuracy problems. In addition, the application of deep learning is restricted because representative training data are hard to obtain under diverse lighting conditions, camera positions, and scene features. In this paper, we analyzed the effect of augmentation techniques on the performance of a deep parking-status classification model in such a data-scarce setting. To this end, the parking area images were classified by situation, and four augmentation techniques were applied to the training of ResNet, EfficientNet, and MobileNet. In the performance evaluation, accuracy improved by up to 5.2%p, 8.67%p, and 15.44%p for the mixup, stopper, and rescaling methods, respectively. On the other hand, center crop, which had been reported to improve performance in other studies, decreased accuracy by an average of 4.86%p.
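Of the techniques named, mixup has a standard formulation (Zhang et al., 2018): a convex combination of two training samples and their labels. A minimal sketch; the paper's exact settings (e.g., alpha) are not given here and are assumed:

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.2):
    """Classic mixup augmentation.

    x: (N, C, H, W) image batch; y: (N, num_classes) one-hot labels.
    Returns blended images and correspondingly blended soft labels.
    """
    lam = np.random.beta(alpha, alpha)       # mixing coefficient
    idx = torch.randperm(x.size(0))          # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix
```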
Boosting Image Caption Generation with Parts of Speech
Philgoo Kang, Yubin Lim, Hyoungjoo Kim
http://doi.org/10.5626/JOK.2021.48.3.317
With the integration of smart devices and AI into our daily lives, the ability to generate image captions is becoming increasingly important in fields such as guidance for visually impaired individuals and human-computer interaction. In this paper, we propose a novel approach that uses parts of speech (POS), such as nouns and verbs extracted from an image, to enhance image caption generation. The proposed model exploits multiple CNN encoders, each specifically trained to identify features related to one POS, and feeds their outputs into an LSTM decoder that generates the caption. We conducted experiments on both the Flickr30k and MS COCO datasets, using several text metrics and an additional human survey to validate the practical effectiveness of the proposed model.
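A sketch of the encoder-decoder wiring as described, in PyTorch. The layer sizes, toy encoders, and the fusion into the LSTM's initial hidden state are assumptions for illustration, not the paper's specification:

```python
import torch
import torch.nn as nn

class POSCaptioner(nn.Module):
    """Separate CNN encoders for noun- and verb-related features whose
    fused outputs initialize an LSTM caption decoder (illustrative only)."""

    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512):
        super().__init__()
        def encoder():
            return nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, feat_dim))
        self.noun_encoder = encoder()   # stands in for a POS-specific CNN
        self.verb_encoder = encoder()
        self.fuse = nn.Linear(2 * feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, tokens):
        # Fuse POS-specific features into the decoder's initial hidden state.
        feats = torch.cat([self.noun_encoder(image),
                           self.verb_encoder(image)], dim=1)
        h0 = torch.tanh(self.fuse(feats)).unsqueeze(0)   # (1, N, H)
        c0 = torch.zeros_like(h0)
        output, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(output)                          # (N, T, vocab)
```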
3D Object-grabbing Hand Tracking based on Depth Reconstruction and Prior Knowledge of Grasp
Woojin Cho, Gabyong Park, Woontack Woo
http://doi.org/10.5626/JOK.2019.46.7.673
We propose a real-time 3D object-grabbing hand tracking system based on prior knowledge of grasping an object. Tracking a hand interacting with an object is harder than tracking an isolated hand because occlusion by the object must be considered. Most previous studies rely on insufficient data that lacks observations of the occluded hand, and overlook the fact that the presence of an object can itself constrain the hand pose. In the present work, we focused on sequences of a hand grabbing an object by utilizing prior knowledge about the grasp situation. The missing depth data of the hand occluded by the object were reconstructed with plausible depth values, and a reinitialization process was conducted based on plausible human grasp poses. The effectiveness of the proposed process was verified on a model-based tracker driven by particle swarm optimization. Quantitative and qualitative experiments demonstrate that the proposed processes effectively improve the performance of the model-based tracker for the object-grabbing hand.
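For context, model-based trackers of this kind typically fit pose parameters by minimizing an image-discrepancy objective with particle swarm optimization (PSO). A generic PSO loop is sketched below; the inertia and acceleration constants are textbook values, not the paper's settings:

```python
import numpy as np

def pso_minimize(objective, dim, n_particles=32, iters=100, bounds=(-1.0, 1.0)):
    """Generic PSO: each particle is a candidate pose vector; the swarm
    tracks per-particle bests (pbest) and a global best (gbest)."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))  # candidate poses
    v = np.zeros_like(x)                               # velocities
    pbest = x.copy()
    pbest_val = np.array([objective(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
        v = 0.72 * v + 1.49 * r1 * (pbest - x) + 1.49 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest
```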
Automatic Generation of HTML Code Based on Web Page Sketch
Bada Kim, Sangmin Park, Taeyeon Won, Junyung Heo
http://doi.org/10.5626/JOK.2019.46.1.9
Various studies have been conducted to automatically turn GUI designs into code in web application development. Previous studies focused on object region detection using computer vision or on object detection based on deep learning, and suffered from incorrect or missed detections. In the present work, the two technologies were applied together to reduce these limitations: computer vision is used for layout detection, and deep learning is used for GUI object detection. The detected layouts and GUI objects were then converted into HTML code. Consequently, the accuracy and recall of GUI object detection were 91% and 86%, respectively, and conversion into HTML code was possible.
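The final conversion step can be pictured as mapping detected object classes and positions onto HTML elements. A hypothetical converter is sketched below; the tag mapping, input format, and left-to-right ordering rule are all assumptions, not the paper's implementation:

```python
# Map detected GUI object classes to stand-in HTML snippets (assumed).
TAG_MAP = {"button": "<button>Button</button>",
           "text": "<p>Text</p>",
           "image": '<img src="placeholder.png" alt="image">',
           "input": '<input type="text">'}

def to_html(layout_rows):
    """layout_rows: list of detected layout rows, each a list of
    (class_name, x) tuples from layout and GUI object detection."""
    html = ["<div class='page'>"]
    for row in layout_rows:
        html.append("  <div class='row'>")
        for cls, _x in sorted(row, key=lambda o: o[1]):  # left-to-right order
            html.append(f"    {TAG_MAP.get(cls, '<div></div>')}")
        html.append("  </div>")
    html.append("</div>")
    return "\n".join(html)

print(to_html([[("text", 0), ("image", 120)], [("button", 40)]]))
```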