Digital Library [ Search Result ]
Enhancing Molecular Understanding in LLMs through Multimodal Graph-SMILES Representations
http://doi.org/10.5626/JOK.2025.52.5.379
Recent advancements in large language models (LLMs) have shown remarkable performance across various tasks, with increasing focus on multimodal research. Notably, BLIP-2 enhances performance by efficiently aligning images and text using a Q-Former, aided by an image encoder pre-trained on multimodal data. Inspired by this, the MolCA model extends BLIP-2 to the molecular domain to improve performance. However, the graph encoder in MolCA is pre-trained on unimodal data and must be updated during model training, which is a limitation. Therefore, this paper replaces it with a graph encoder pre-trained on multimodal data that is kept frozen during training. Experimental results showed that using the graph encoder pre-trained on multimodal data generally enhanced performance. Additionally, unlike the graph encoder pre-trained on unimodal data, which performed better when updated, the graph encoder pre-trained on multimodal data achieved superior results across all metrics when frozen.
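A minimal PyTorch sketch of the frozen-encoder setup described above, assuming hypothetical GraphEncoder and QFormerBlock classes and toy dimensions (an illustration of the idea, not the MolCA code): the multimodal graph encoder's parameters are frozen, and only the Q-Former-style module receives gradient updates.

```python
# Sketch only: frozen multimodal graph encoder + trainable Q-Former-style module.
# GraphEncoder and QFormerBlock are hypothetical stand-ins with toy dimensions.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Stand-in for a graph encoder pre-trained on multimodal data."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(9, dim)  # 9 toy atom features per node

    def forward(self, node_feats):
        return self.proj(node_feats)   # (num_nodes, dim) node embeddings

class QFormerBlock(nn.Module):
    """Learnable queries attend over the frozen graph-node embeddings."""
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, graph_emb):
        q = self.queries.unsqueeze(0)   # (1, num_queries, dim)
        kv = graph_emb.unsqueeze(0)     # (1, num_nodes, dim)
        out, _ = self.attn(q, kv, kv)
        return out.squeeze(0)           # (num_queries, dim) molecule tokens

encoder = GraphEncoder()
for p in encoder.parameters():          # freeze the multimodal encoder
    p.requires_grad = False
encoder.eval()

qformer = QFormerBlock()
optimizer = torch.optim.AdamW(qformer.parameters(), lr=1e-4)  # only Q-Former is updated

node_feats = torch.randn(30, 9)         # toy molecule with 30 atoms
with torch.no_grad():
    graph_emb = encoder(node_feats)
mol_tokens = qformer(graph_emb)         # tokens that could be fed to an LLM
```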
An Experimental Study on the Text Generation Capability for Chart Image Descriptions in Korean SLLM
http://doi.org/10.5626/JOK.2025.52.2.132
This study explores the capability of Small Large Language Models (SLLMs) to automatically generate and interpret information from chart images. To achieve this goal, we built an instruction dataset for SLLM training by extracting text data from chart images and adding descriptive information. We conducted instruction tuning on a Korean SLLM and evaluated its ability to generate information from chart images. The experimental results demonstrated that the SLLM fine-tuned with the constructed instruction dataset was capable of generating descriptive text comparable to OpenAI's GPT-4o-mini API. This study suggests that, in the future, Korean SLLMs may be effectively used for generating descriptive text and providing information across a broader range of visual data.
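A brief sketch of how one instruction-tuning sample might be assembled from text extracted from a chart image; the field names, chart values, and Korean target description below are illustrative assumptions, not the paper's actual dataset format.

```python
# Sketch only: one illustrative instruction-tuning sample built from chart text.
import json

chart_text = {
    "title": "Monthly active users",
    "x_axis": ["Jan", "Feb", "Mar"],
    "y_axis": "Users (thousands)",
    "series": {"Service A": [120, 135, 150]},
}

sample = {
    "instruction": "Describe the trend shown in the following chart data in Korean.",
    "input": json.dumps(chart_text, ensure_ascii=False),
    # English gloss of the target: "Service A's monthly active users rose
    # steadily from 120k in January to 150k in March."
    "output": "Service A의 월간 활성 사용자는 1월 12만 명에서 3월 15만 명으로 꾸준히 증가했다.",
}

print(json.dumps(sample, ensure_ascii=False, indent=2))
```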
Single-Modal Pedestrian Detection Leveraging Multimodal Knowledge for Blackout Situations
http://doi.org/10.5626/JOK.2024.51.1.86
Multispectral pedestrian detection using both visible and thermal data is an actively researched topic in the field of computer vision. However, most existing studies have only considered scenarios in which the camera operates without problems, leading to a significant decline in performance when a camera blackout happens. Recognizing the importance of addressing the camera blackout challenge in multispectral pedestrian detection, this paper investigates models that remain robust even during camera blackouts. The model proposed in this study utilizes a Feature Tracing Method during the training phase to transfer knowledge from multiple modalities to single-modal pedestrian detection. Even if the camera experiences a blackout and only one modality is input, the model predicts and operates as if it were using multiple modalities. Through this approach, pedestrian detection performance in blackout situations is improved.
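The sketch below is one plausible reading of the feature-level knowledge transfer described above, not the paper's exact Feature Tracing Method: a frozen multimodal teacher provides target features that a single-modal student learns to reproduce, so the student can still operate when one modality blacks out. All module shapes are toy values.

```python
# Sketch only: single-modal student traces the features of a frozen multimodal teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Conv2d(6, 64, 3, padding=1)   # toy multimodal backbone (RGB + thermal)
student = nn.Conv2d(3, 64, 3, padding=1)   # toy single-modal backbone (thermal only)
for p in teacher.parameters():
    p.requires_grad = False                # teacher stays fixed

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

rgb = torch.randn(2, 3, 64, 64)
thermal = torch.randn(2, 3, 64, 64)

with torch.no_grad():
    teacher_feat = teacher(torch.cat([rgb, thermal], dim=1))
student_feat = student(thermal)            # blackout case: only thermal is available

loss = F.mse_loss(student_feat, teacher_feat)   # feature-tracing objective
loss.backward()
optimizer.step()
```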
TwinAMFNet: Twin Attention-based Multi-modal Fusion Network for 3D Semantic Segmentation
Jaegeun Yoon, Jiyeon Jeon, Kwangho Song
http://doi.org/10.5626/JOK.2023.50.9.784
Recently, with the increase in the number of accidents caused by misrecognition in autonomous driving, interest in 3D semantic segmentation based on sensor fusion using multi-modal sensors has grown. Accordingly, this study introduces TwinAMFNet, a novel 3D semantic segmentation neural network based on sensor fusion of RGB cameras and LiDAR. The proposed network comprises a twin neural network that processes RGB images and point-cloud images projected onto a 2D coordinate plane, together with an attention-based fusion module that fuses features at each stage of the encoder and decoder. The proposed method improves the classification of extended objects and boundaries. As a result, the proposed network achieved approximately 68% mIoU in 3D semantic segmentation, an improvement of approximately 4.5% over results reported in existing studies.
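A compact sketch of an attention-based fusion module in the spirit of the description above (not the authors' implementation; the channel counts and gating design are assumptions): a learned gate weighs the RGB and projected point-cloud feature maps before passing the fused result on to the decoder.

```python
# Sketch only: channel-attention gate fusing RGB and LiDAR-projection features.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, lidar_feat):
        a = self.gate(torch.cat([rgb_feat, lidar_feat], dim=1))  # (B, C, 1, 1)
        return a * rgb_feat + (1 - a) * lidar_feat               # weighted fusion

fusion = AttentionFusion(channels=64)
rgb_feat = torch.randn(1, 64, 32, 32)    # RGB branch feature map
lidar_feat = torch.randn(1, 64, 32, 32)  # projected point-cloud branch feature map
fused = fusion(rgb_feat, lidar_feat)     # same shape, passed on to the next stage
```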
Real-time Multimodal Audio-to-Tactile Conversion System for Playing or Watching Mobile Shooting Games
Minjae Mun, Gyeore Yun, Chaeyong Park, Seungmoon Choi
http://doi.org/10.5626/JOK.2023.50.3.228
This study presents a real-time multimodal audio-to-tactile conversion system for improving user experiences when playing or watching first-person shooting games on a mobile device. The system detects in real time whether sounds from the mobile device are appropriate for haptic feedback, and provides both vibrotactile feedback, which is commonly used in conventional haptic systems, and impact effects consisting of short, strong forces. To this end, we confirmed the suitability of impact haptic feedback compared with vibrotactile feedback for shooting games. We implemented two types of impulsive sound detectors using psychoacoustic features and a support vector machine, and found that our detectors outperformed the one from a previous study. Lastly, we conducted a user study to evaluate our system. The results showed that our system could significantly improve user experiences.
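A minimal sketch of an impulsive-sound detector along these lines; the simple spectral features below stand in for the paper's psychoacoustic features, and the toy training data and labels are purely illustrative.

```python
# Sketch only: toy impulsive-sound detector using spectral features and an SVM.
import numpy as np
from sklearn.svm import SVC

def frame_features(frame, sr=44100):
    spectrum = np.abs(np.fft.rfft(frame))
    energy = float(np.sum(frame ** 2))                       # frame energy
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))
    return [energy, centroid]

rng = np.random.default_rng(0)
# Toy training data: "impulsive" frames are loud noise bursts, others are quiet.
impulsive = [rng.normal(0, 1.0, 1024) for _ in range(50)]
ambient = [rng.normal(0, 0.05, 1024) for _ in range(50)]
X = np.array([frame_features(f) for f in impulsive + ambient])
y = np.array([1] * 50 + [0] * 50)

detector = SVC(kernel="rbf").fit(X, y)
new_frame = rng.normal(0, 1.0, 1024)
print("impulsive" if detector.predict([frame_features(new_frame)])[0] else "ambient")
```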
Movie Summarization Based on Emotion Dynamics and Multimodal Information
Myungji Lee, Hongseok Kwon, WonKee Lee, Jong-Hyeok Lee
http://doi.org/10.5626/JOK.2022.49.9.735
Movie summarization is the task of summarizing a full-length movie by creating a short video summary containing its most informative scenes. This paper proposes an automatic movie summarization model that comprehensively considers three main elements of a movie: characters, plot, and video information. To accurately identify major events in the movie plot, we propose a Transformer-based architecture that uses the movie script's dialogue information and the main characters' emotion dynamics as training features, and then combines the script and video information. Experiments show that the proposed method helps increase the accuracy of identifying major events in movies and consequently improves the quality of movie summaries.
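A small illustrative sketch, not the paper's model: per-scene dialogue embeddings are concatenated with emotion-dynamics features and scored by a Transformer encoder so that the highest-scoring scenes can be selected for the summary. The dimensions and the scoring head are assumptions.

```python
# Sketch only: Transformer-based scene scorer over dialogue + emotion-dynamics features.
import torch
import torch.nn as nn

class SceneScorer(nn.Module):
    def __init__(self, dialogue_dim=128, emotion_dim=8, d_model=64):
        super().__init__()
        self.proj = nn.Linear(dialogue_dim + emotion_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, dialogue_emb, emotion_feats):
        x = self.proj(torch.cat([dialogue_emb, emotion_feats], dim=-1))
        return self.head(self.encoder(x)).squeeze(-1)   # one importance score per scene

scorer = SceneScorer()
dialogue_emb = torch.randn(1, 40, 128)   # 40 scenes, pre-computed dialogue embeddings
emotion_feats = torch.randn(1, 40, 8)    # per-scene emotion-dynamics features
scores = scorer(dialogue_emb, emotion_feats)
summary_scenes = scores.topk(5, dim=-1).indices   # pick the 5 most important scenes
```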
Multimodal Haptic Rendering for Interactive VR Sports Applications
Minjae Mun, Seungjae Oh, Chaeyong Park, Seungmoon Choi
http://doi.org/10.5626/JOK.2022.49.2.97
This study explores how to deliver realistic haptic sensations for virtual collision events in virtual reality (VR). For this purpose, we implemented a multifaceted haptic device that produces both vibration and impact, and designed a haptic rendering method combining the simulated interactions of a physics engine with the collision data of real objects. We also designed a virtual simulation of three sports activities, billiards, ping-pong, and tennis, in which a user could interact with virtual objects having different material properties. We performed a user study to evaluate the subjective quality of the haptic feedback under three rendering conditions, vibration, impact, and a multimodal condition combining both, and compared it to real haptic sensations. The results suggested that each rendering condition had different perceptual characteristics and that adding a haptic modality can broaden the dynamic range of virtual collisions.
SMERT: Single-stream Multimodal BERT for Sentiment Analysis and Emotion Detection
Kyeonghun Kim, Jinuk Park, Jieun Lee, Sanghyun Park
http://doi.org/10.5626/JOK.2021.48.10.1122
Sentiment Analysis is the task of analyzing subjective opinions or propensities, and Emotion Detection is the task of finding emotions such as ‘happy’ or ‘sad’ in text data. Multimodal data refers to data in which image and voice modalities appear alongside text. Prior research used RNN or cross-transformer models; however, RNN models suffer from long-term dependency problems, and cross-transformer models cannot capture the attributes of individual modalities, leading to worse results. To solve these problems, we propose SMERT, a single-stream transformer that runs on a single network and obtains a joint representation for Sentiment Analysis and Emotion Detection. In addition, we adapt BERT pre-training tasks for multimodal data. We verify the superiority of SMERT through comparative experiments on combinations of modalities using the CMU-MOSEI dataset and various evaluation metrics.
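A minimal sketch of the single-stream idea (dimensions, module names, and the pooling and head design are assumptions, not SMERT's actual architecture): tokens from text, vision, and audio are projected to a shared space, concatenated into one sequence, and processed by a single Transformer encoder that yields a joint representation for both tasks.

```python
# Sketch only: single-stream multimodal Transformer with illustrative dimensions.
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, text_dim=300, vis_dim=35, aud_dim=74, d_model=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.modality_emb = nn.Embedding(3, d_model)        # marks each token's modality
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.sentiment_head = nn.Linear(d_model, 1)          # sentiment regression head
        self.emotion_head = nn.Linear(d_model, 6)            # 6 emotion labels

    def forward(self, text, vision, audio):
        tokens = torch.cat([
            self.text_proj(text) + self.modality_emb.weight[0],
            self.vis_proj(vision) + self.modality_emb.weight[1],
            self.aud_proj(audio) + self.modality_emb.weight[2],
        ], dim=1)                                            # one single-stream sequence
        joint = self.encoder(tokens).mean(dim=1)             # joint representation
        return self.sentiment_head(joint), self.emotion_head(joint)

model = SingleStreamEncoder()
sentiment, emotions = model(torch.randn(2, 20, 300),   # text token embeddings
                            torch.randn(2, 20, 35),    # visual features
                            torch.randn(2, 20, 74))    # acoustic features
```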
English-to-Korean Machine Translation using Image Information
Jangseong Bae, Hyunsun Hwang, Changki Lee
http://doi.org/10.5626/JOK.2019.46.7.690
Machine translation automatically converts text in one language into another language. Conventional machine translation uses only text, which is a disadvantage in that various information related to the input text cannot be utilized. In recent years, multimodal machine translation models have emerged that use images related to the input text as additional input. Following this trend, in this paper image information is added at decoding time and used for English-to-Korean translation. In addition, we propose a model with a decoding gate that adjusts the balance between textual and image information during decoding. Our experimental results show that the proposed method performed better than the non-gated model.
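A minimal sketch of what such a decoding gate could look like, assuming a hypothetical DecodingGate module and illustrative dimensions rather than the authors' implementation: a learned sigmoid gate decides how much image context to mix into the decoder state at each decoding step.

```python
# Sketch only: gated mixture of decoder state and image feature at decoding time.
import torch
import torch.nn as nn

class DecodingGate(nn.Module):
    def __init__(self, hidden_dim=512, image_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, decoder_state, image_feat):
        img = self.img_proj(image_feat)                              # project image feature
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, img], dim=-1)))
        return g * decoder_state + (1 - g) * img                     # gated mixture

gate = DecodingGate()
decoder_state = torch.randn(1, 512)    # decoder hidden state at the current step
image_feat = torch.randn(1, 2048)      # global CNN feature of the paired image
mixed = gate(decoder_state, image_feat)
```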
Effect Scene Detection using Multimodal Deep Learning Models
Jeongseon Lim, Mikyung Han, Hyunjin Yoon
http://doi.org/10.5626/JOK.2018.45.12.1250
A conventional movie can be converted into a 4D movie by identifying effect scenes. To automate this process, in this paper we propose a multimodal deep learning model that detects effect scenes using both visual and audio features of a movie. We classified effect/non-effect scenes using an audio-based Convolutional Recurrent Neural Network (CRNN) model and a video-based Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) model, and also implemented feature-level fusion. In addition, based on our observation that effects typically occur during non-dialog scenes, we further detected non-dialog scenes using an audio-based Convolutional Neural Network (CNN) model. Subsequently, the prediction scores of the audio-visual effect scene classification and audio-based non-dialog classification models were combined. Finally, we detected sequences of effect scenes across the entire movie using the prediction scores of the input windows. Experiments using real-world 4D movies demonstrate that the proposed multimodal deep learning model outperforms unimodal models in terms of effect scene detection accuracy.
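A short sketch of the score-level fusion step; the fusion weight, threshold, and example scores below are illustrative assumptions, not values from the paper.

```python
# Sketch only: combining per-window effect-scene and non-dialog scores.
import numpy as np

effect_scores = np.array([0.2, 0.7, 0.9, 0.8, 0.3])      # audio-visual effect classifier
non_dialog_scores = np.array([0.1, 0.8, 0.9, 0.9, 0.4])  # audio-based non-dialog classifier

alpha = 0.6                                   # assumed fusion weight
combined = alpha * effect_scores + (1 - alpha) * non_dialog_scores
effect_windows = combined > 0.5               # threshold per input window
print(effect_windows)                         # contiguous True runs form effect-scene sequences
```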