Search results for author: Byoung-Tak Zhang (8 articles)

CraftGround: A Flexible Reinforcement Learning Environment Based on the Latest Minecraft

Hyeonseo Yang, Minsu Lee, Byoung-Tak Zhang

http://doi.org/10.5626/JOK.2025.52.3.189

This paper presents CraftGround, an innovative reinforcement learning environment based on the latest version of Minecraft (1.21). CraftGround provides flexible experimental setups and supports reinforcement learning in complex 3D environments, offering a variety of observational data, including visual information, audio cues, biome-specific contexts, and in-game statistics. Our experiments evaluated several agents, such as VPT (Video PreTraining), PPO, RecurrentPPO, and DQN, across various tasks, including tree chopping, evading hostile monsters, and fishing. The results indicated that VPT performed exceptionally well due to its pretraining, achieving higher performance and efficiency in structured tasks. In contrast, online learning algorithms like PPO and RecurrentPPO demonstrated a greater ability to adapt to environmental changes, showing improvement over time. These findings highlight CraftGround's potential to advance research on adaptive agent behaviors in dynamic 3D simulations.
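The multimodal observations described above can be pictured as a standard gym-style interaction loop. The following is a minimal sketch; `StubEnv`, its observation keys, and the reward scheme are illustrative assumptions, not CraftGround's actual API.

```python
# Minimal sketch of a gym-style agent-environment loop over an environment
# that, like CraftGround, returns several observation modalities per step.
# StubEnv, its observation keys, and the reward scheme are illustrative
# assumptions, not CraftGround's actual API.

class StubEnv:
    """Toy stand-in: dict observations (visual, audio, biome, stats)."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == "chop" else 0.0  # e.g. tree-chopping task
        done = self.t >= self.horizon
        return self._obs(), reward, done

    def _obs(self):
        return {
            "rgb": [[0] * 4] * 4,      # visual frame (placeholder pixels)
            "audio": [0.0] * 8,        # audio cue buffer
            "biome": "forest",         # biome-specific context
            "stats": {"health": 20},   # in-game statistics
        }

def run_episode(env, policy):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

ret = run_episode(StubEnv(), policy=lambda obs: "chop")  # always-chop policy
print(ret)  # 10.0
```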

Efficient Compositional Translation Embedding for Visual Relationship Detection

Yu-Jung Heo, Eun-Sol Kim, Woo Suk Choi, Kyoung-Woon On, Byoung-Tak Zhang

http://doi.org/10.5626/JOK.2022.49.7.544

Scene graphs are widely used to express high-order visual relationships between objects present in an image. To generate scene graphs automatically, we propose an algorithm that detects visual relationships between objects and predicts each relationship as a predicate. Inspired by the well-known knowledge graph embedding method TransR, we present the CompTransR algorithm, which i) defines latent relational subspaces that reflect the compositional structure of visual relationships and ii) encodes predicate representations by applying transitive constraints between the object representations in each subspace. Our proposed model not only reduces computational complexity but also outperforms previous state-of-the-art methods on predicate detection tasks across three benchmark datasets: VRD, VG200, and VrR-VG. We also show that a scene graph can be applied to the image-caption retrieval task, one of the high-level visual reasoning tasks, and that the scene graphs generated by our model improve retrieval performance.
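The TransR idea that CompTransR builds on can be sketched in a few lines: each relation r has a projection matrix M_r mapping entity vectors into a relation-specific subspace, and a triple (h, r, t) is scored by the translation residual ||M_r h + r − M_r t|| (lower means more plausible). Dimensions and vectors below are toy examples, not the paper's learned parameters.

```python
import numpy as np

# TransR-style scoring sketch: project head and tail entities into the
# relation subspace via M_r, then measure how well r translates h to t.
def transr_score(h, t, r_vec, M_r):
    h_r = M_r @ h              # project head entity into r's subspace
    t_r = M_r @ t              # project tail entity into r's subspace
    return np.linalg.norm(h_r + r_vec - t_r)   # lower = more plausible

rng = np.random.default_rng(0)
d_entity, d_rel = 4, 3
M_r = rng.normal(size=(d_rel, d_entity))
h = rng.normal(size=d_entity)
r_vec = rng.normal(size=d_rel)

# A tail that satisfies the translation exactly (score ~ 0) vs. a random one.
t_good = np.linalg.lstsq(M_r, M_r @ h + r_vec, rcond=None)[0]
t_bad = rng.normal(size=d_entity)
print(transr_score(h, t_good, r_vec, M_r) < transr_score(h, t_bad, r_vec, M_r))  # True
```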

Analyzing and Solving GuessWhat?!

Sang-Woo Lee, Cheolho Han, Yujung Heo, Wooyoung Kang, Jaehyun Jun, Byoung-Tak Zhang

http://doi.org/10.5626/JOK.2018.45.1.30

GuessWhat?! is a game between two machine players, a questioner and an answerer: the answerer is assigned a hidden object in an image, the questioner asks yes-no-N/A questions about it, and the questioner then chooses the correct object. GuessWhat?! has received much attention in the field of deep learning and artificial intelligence as a testbed for cutting-edge research on the interplay of computer vision and dialogue systems. In this study, we discuss the objective function and characteristics of the GuessWhat?! game. In addition, we propose a simple solver for GuessWhat?! based on a rule-based algorithm. Although a human needs four or five questions on average to solve this problem, the proposed method outperforms state-of-the-art deep learning methods using only two questions, and exceeds human performance using five questions.
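The intuition behind an efficient rule-based solver can be sketched as follows: any yes/no question that splits the remaining candidate objects in half identifies the target among N objects in about log2(N) questions. The "is it in this half?" question policy below is an illustrative assumption, not the paper's actual rule set.

```python
# Halving sketch: each yes/no question discards half of the candidates,
# so N candidates need about ceil(log2(N)) questions.
def solve(candidates, target):
    """Halve the candidate list with yes/no questions until one remains."""
    questions = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        questions += 1                  # ask: "is the target in `left`?"
        candidates = left if target in left else candidates[mid:]
    return candidates[0], questions

objects = list(range(8))                # e.g. 8 candidate objects in an image
guess, q = solve(objects, target=5)
print(guess, q)   # 5 3  (ceil(log2(8)) = 3 questions)
```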

Question Answering Optimization via Temporal Representation and Data Augmentation of Dynamic Memory Networks

Dong-Sig Han, Chung-Yeon Lee, Byoung-Tak Zhang

http://doi.org/

The research area for solving question answering (QA) problems using artificial intelligence models is in a methodological transition period, and one such architecture, the dynamic memory network (DMN), is drawing attention for two key attributes: an attention mechanism defined by neural network operations and a modular architecture imitating human cognitive processes during QA. In this paper, we increased the accuracy of inferred answers by adapting an automatic data augmentation method to compensate for the limited amount of training data and by improving the model's temporal representation. The experimental results showed that on the 1K-bAbI tasks, the modified DMN achieves 89.21% accuracy and passes twelve tasks, which is 13.58 percentage points higher, with four more tasks passed, than a reference implementation of DMN. Additionally, the DMN's word embedding vectors form strong clusters after training. Moreover, the number of episodic passes and the number of supporting facts show a direct correlation, which affects performance significantly.

Active Vision from Image-Text Multimodal System Learning

Jin-Hwa Kim, Byoung-Tak Zhang

http://doi.org/

In image classification, recent CNNs compete with human performance. However, there are limitations in more general recognition. Herein we deal with indoor images that contain too much information to be processed directly and that require information reduction before recognition. To reduce the amount of data processing, variational inference or variational Bayesian methods are typically suggested for object detection. However, these methods suffer from the difficulty of marginalizing over the given space. In this study, we propose an image-text integrated recognition system using active vision based on Spatial Transformer Networks. The system attempts to efficiently sample a partial region of a given image conditioned on given linguistic information. Our experimental results demonstrate a significant improvement over traditional approaches. We also discuss the results of a qualitative analysis of the sampled images, the model's characteristics, and its limitations.

Event Cognition-based Daily Activity Prediction Using Wearable Sensors

Chung-Yeon Lee, Dong Hyun Kwak, Beom-Jin Lee, Byoung-Tak Zhang

http://doi.org/

Learning from human behaviors in the real world is essential for human-aware intelligent systems such as smart assistants and autonomous robots. Most research focuses on correlations between sensory patterns and a label for each activity. However, human activity is a combination of several event contexts and is a narrative story in and of itself. We propose a novel approach to human activity prediction based on event cognition. Egocentric multi-sensor data are collected from an individual's daily life by using a wearable device and a smartphone. Event contexts about location, scene, and activities are then recognized, and finally the user's daily activities are predicted by a decision rule based on the event contexts. The proposed method was evaluated on wearable sensor data collected in the real world over two weeks from two participants. Experimental results showed improved recognition accuracies when using the proposed method compared to results obtained directly from sensory features.
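The final prediction step described above can be pictured as a lookup over recognized event contexts. The rules below are made-up examples for illustration, not the paper's actual rule set or context vocabulary.

```python
# Illustrative decision rule: map a tuple of recognized event contexts
# (location, scene, low-level activity) to a predicted daily activity.
RULES = {
    ("office", "desk", "sitting"): "working",
    ("home", "kitchen", "standing"): "cooking",
    ("outdoor", "street", "walking"): "commuting",
}

def predict(location, scene, activity):
    """Return the daily activity matching the contexts, else 'unknown'."""
    return RULES.get((location, scene, activity), "unknown")

print(predict("office", "desk", "sitting"))   # working
print(predict("home", "sofa", "sitting"))     # unknown
```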

Character-based Subtitle Generation by Learning of Multimodal Concept Hierarchy from Cartoon Videos

Kyung-Min Kim, Jung-Woo Ha, Beom-Jin Lee, Byoung-Tak Zhang

http://doi.org/

Previous multimodal learning methods focus on problem-solving aspects, such as image and video search and tagging, rather than on knowledge acquisition via content modeling. In this paper, we propose the Multimodal Concept Hierarchy (MuCH), a content modeling method built on a cartoon video dataset, together with a character-based subtitle generation method using the learned model. The MuCH model has a multimodal hypernetwork layer, in which the patterns of words and image patches are represented, and a concept layer, in which each concept variable is represented by a probability distribution over the words and image patches. The model can learn the characteristics of the characters as concepts from the video subtitles and scene images by using a Bayesian learning method, and can also generate character-based subtitles from the learned model when text queries are provided. As an experiment, the MuCH model learned concepts from 'Pororo' cartoon videos totaling 268 minutes in length and generated character-based subtitles. Finally, we compare the results with those of other multimodal learning models. The experimental results indicate that, given the same text query, our model generates more accurate and more character-specific subtitles than the other models.

Locally Linear Embedding for Face Recognition with Simultaneous Diagonalization

Eun-Sol Kim, Yung-Kyun Noh, Byoung-Tak Zhang

http://doi.org/

Locally linear embedding (LLE) [1] is a manifold learning algorithm that preserves inner product values between high-dimensional data points when embedding them into a low-dimensional space. LLE embeds data points on the same subspace close together in the low-dimensional space, because those data points have significant inner product values. On the other hand, if data points are located orthogonal to each other, they are embedded separately in the low-dimensional space, even if they are in close proximity in the high-dimensional space. Meanwhile, it is well known that facial images of the same person under varying illumination lie in a low-dimensional linear subspace [2]. In this study, we suggest an improved LLE method for the face recognition problem. The method exploits the characteristic of LLE that data points are embedded entirely separately when they are located orthogonal to each other. To accomplish this, the subspaces spanned by each class are forced to lie orthogonal to one another, using the simultaneous diagonalization (SD) technique. Experimental results show that the suggested method dramatically improves both the embedding results and the classification performance.
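The simultaneous diagonalization step mentioned above can be sketched numerically: given two symmetric positive definite matrices A and B, find a transform W with W.T A W = I and W.T B W diagonal, by whitening A and then eigendecomposing B in the whitened coordinates. The toy matrices below are illustrative assumptions, not face-image scatter matrices from the paper.

```python
import numpy as np

# Simultaneous diagonalization (SD) sketch: whiten A, then rotate with the
# eigenvectors of B in the whitened space (rotation keeps A's part equal to I).
def simultaneous_diagonalize(A, B):
    evals, evecs = np.linalg.eigh(A)
    P = evecs @ np.diag(evals ** -0.5)   # whitening: P.T @ A @ P = I
    _, U = np.linalg.eigh(P.T @ B @ P)   # orthogonal rotation
    return P @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(6, 3))
A = X.T @ X + np.eye(3)     # symmetric positive definite toy matrices
B = Y.T @ Y + np.eye(3)

W = simultaneous_diagonalize(A, B)
WA = W.T @ A @ W
WB = W.T @ B @ W
print(np.allclose(WA, np.eye(3)))              # True: A becomes identity
print(np.allclose(WB, np.diag(np.diag(WB))))   # True: B becomes diagonal
```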


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr