Digital Library: Search Results
C3DSG: A 3D Scene Graph Generation Model Using Point Clouds of Indoor Environment
http://doi.org/10.5626/JOK.2023.50.9.758
To design an effective deep neural network model that generates 3D scene graphs from point clouds, three challenging issues need to be resolved: 1) how to extract effective geometric features from point clouds, 2) which non-geometric features to use complementarily for recognizing 3D spatial relationships between two objects, and 3) which spatial reasoning mechanism to apply. To address these issues, we propose a novel deep neural network model for generating 3D scene graphs from point clouds of indoor environments. The proposed model uses both geometric features of the 3D point cloud, extracted with a Point Transformer, and various non-geometric features, such as linguistic features and relative comparison features, that can help predict the 3D spatial relationship between objects. In addition, the proposed model uses a new NE-GAT graph neural network module that applies attention to both object nodes and the edges connecting them to effectively derive spatial context between objects. Through a variety of experiments on the 3DSSG benchmark dataset, the effectiveness and superiority of the proposed model were demonstrated.
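The abstract describes NE-GAT only at a high level. As a hedged illustration, the following sketch shows one attention step that scores each relation edge from its endpoint node features and its own edge features; the class name, dimensions, and aggregation rule are my assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a graph attention step that scores
# each edge from its endpoint node features AND its own edge features,
# loosely mirroring the node/edge attention idea described for NE-GAT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEdgeAttentionLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, out_dim)
        self.edge_proj = nn.Linear(edge_dim, out_dim)
        # attention score from [source node, target node, edge] features
        self.att = nn.Linear(3 * out_dim, 1)

    def forward(self, x, edge_index, edge_attr):
        # x: (N, node_dim), edge_index: (2, E), edge_attr: (E, edge_dim)
        src, dst = edge_index
        h = self.node_proj(x)                  # (N, out_dim)
        e = self.edge_proj(edge_attr)          # (E, out_dim)
        score = self.att(torch.cat([h[src], h[dst], e], dim=-1)).squeeze(-1)
        alpha = torch.zeros_like(score)
        # softmax over the incoming edges of each target node
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(score[mask], dim=0)
        # aggregate messages built from source-node and edge features
        out = torch.zeros_like(h)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * (h[src] + e))
        return F.relu(out)

# toy usage: 3 objects, 2 directed relations
x = torch.randn(3, 16)
edge_index = torch.tensor([[0, 1], [2, 2]])
edge_attr = torch.randn(2, 8)
print(NodeEdgeAttentionLayer(16, 8, 32)(x, edge_index, edge_attr).shape)
```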
Visual Commonsense Reasoning with Vision-Language Co-embedding and Knowledge Graph Embedding
http://doi.org/10.5626/JOK.2020.47.10.985
In this paper, we propose a novel model for Visual Commonsense Reasoning (VCR). The proposed model co-embeds the multi-modal input data using a pre-trained vision-language model to effectively cope with the problem of visual grounding, which requires mutual alignment between an image, a natural language question, and the corresponding answer list. In addition, the proposed model extracts the common conceptual knowledge necessary for visual commonsense reasoning from ConceptNet, an open knowledge base, and then embeds it using a Graph Convolutional Network (GCN). We introduce the design details of the proposed model, VLKG_VCR, and verify its performance through various experiments using an enhanced VCR benchmark dataset.
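For readers unfamiliar with how retrieved ConceptNet concepts can be embedded with a GCN, here is a minimal sketch of a single Kipf-Welling-style graph convolution over a tiny, invented concept subgraph; it is not the VLKG_VCR code, and all node names and dimensions are assumptions.

```python
# Hedged sketch: one graph convolution over a ConceptNet-like subgraph,
# illustrating how retrieved commonsense concepts might be embedded before
# being fused with the vision-language features. All values are illustrative.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # symmetric normalization: D^-1/2 (A + I) D^-1/2
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm @ x))

# toy subgraph: person -- umbrella -- rain (edges are illustrative)
concepts = ["person", "umbrella", "rain"]
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
x = torch.randn(len(concepts), 300)      # e.g. pretrained word embeddings
gcn = SimpleGCNLayer(300, 128)
print(gcn(x, adj).shape)                 # torch.Size([3, 128])
```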
Open Domain Question Answering using Knowledge Graph
http://doi.org/10.5626/JOK.2020.47.9.853
In this paper, we propose a novel knowledge graph inference model called KGNet for answering open-domain complex questions. The model addresses the problem of knowledge base incompleteness by integrating two different types of knowledge resources, a knowledge base and a corpus, into a single knowledge graph. Moreover, to derive answers to complex multi-hop questions effectively, the model adopts a new knowledge embedding and reasoning module based on a Graph Neural Network (GNN). We demonstrate the effectiveness and performance of the proposed model through various experiments on two large question answering benchmark datasets, WebQuestionsSP and MetaQA.
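The abstract does not detail the GNN reasoning module, so the following is only an intuition-level sketch: relevance is propagated from a question's topic entity over knowledge-graph edges for a fixed number of hops. The entities, triples, and scoring rule are invented for illustration and are not KGNet itself.

```python
# Illustrative multi-hop scoring by propagating relevance from the question's
# topic entity along knowledge-graph edges, the basic intuition behind
# GNN-style reasoning over a unified KB+corpus graph.
import numpy as np

entities = ["Inception", "C. Nolan", "Interstellar", "E. Page"]
# directed triples (head, relation, tail), illustrative only
triples = [(0, "directed_by", 1), (2, "directed_by", 1), (0, "stars", 3)]

def multi_hop_scores(seed_idx, hops):
    adj = np.zeros((len(entities), len(entities)))
    for h, _, t in triples:            # treat edges as bidirectional
        adj[h, t] = adj[t, h] = 1.0
    score = np.zeros(len(entities))
    score[seed_idx] = 1.0
    for _ in range(hops):              # one hop = one round of propagation
        score = score + adj @ score
    return score / score.max()

# "Which films were directed by the director of Inception?" ~ a 2-hop question
print(dict(zip(entities, multi_hop_scores(seed_idx=0, hops=2).round(2))))
```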
Neural Module Network Learning for Visual Dialog
http://doi.org/10.5626/JOK.2019.46.12.1304
In this paper, we propose a novel neural module network (NMN) model for visual dialog. Visual dialog currently poses several challenges. The first is visual grounding, which concerns how to associate the entities mentioned in the natural language question with the visual objects in the given image. The other is visual co-reference resolution, which involves determining which words, typically noun phrases and pronouns, co-refer to the same visual object in a given image. To address these issues, we suggest a new visual dialog model using both question-customized neural module networks and a reference pool. The proposed model includes not only a new Compare module to answer questions that require comparing properties of two visual objects, but also a novel Find module improved with a dual attention mechanism, and a Refer module that resolves visual co-references with the reference pool. To evaluate the performance of the proposed model, we conduct various experiments on two large benchmark datasets, VisDial v0.9 and VisDial v1.0. The results show that the proposed model outperforms the state-of-the-art models for visual dialog.
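To make the module-network idea concrete, the sketch below composes toy Find, Refer, and Compare modules in the spirit of the abstract; their internals (a single linear scorer, a dictionary reference pool) are simplifications I am assuming, not the paper's architecture.

```python
# Hedged sketch of the neural-module-network idea only: a question is mapped
# to a small program of modules such as Find -> Refer -> Compare, each
# operating on region features and an attention map.
import torch
import torch.nn as nn

class Find(nn.Module):
    """Attend to image regions relevant to a text query vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, query):          # (R, D), (D,)
        return torch.softmax(self.score(regions * query).squeeze(-1), dim=0)

class Refer(nn.Module):
    """Resolve a pronoun by reusing an attention map stored in a reference pool."""
    def forward(self, reference_pool, key):
        return reference_pool[key]

class Compare(nn.Module):
    """Compare two attended region summaries and score an answer."""
    def __init__(self, dim):
        super().__init__()
        self.cls = nn.Linear(2 * dim, 1)

    def forward(self, regions, att_a, att_b):
        a = att_a @ regions                      # weighted sums: (D,)
        b = att_b @ regions
        return torch.sigmoid(self.cls(torch.cat([a, b])))

regions = torch.randn(36, 256)                   # e.g. 36 detected regions
q_cat, q_dog = torch.randn(256), torch.randn(256)
pool = {"the cat": Find(256)(regions, q_cat)}    # stored from an earlier turn
att_dog = Find(256)(regions, q_dog)
att_cat = Refer()(pool, "the cat")
print(Compare(256)(regions, att_cat, att_dog))   # e.g. "is the cat bigger?"
```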
Learning Semantic Features for Dense Video Captioning
http://doi.org/10.5626/JOK.2019.46.8.753
In this paper, we propose a new deep neural network model for dense video captioning. Dense video captioning is an emerging task that aims at both localizing and describing all events in a video. Unlike many existing models, which use only visual features extracted from the given video through a convolutional neural network (CNN), our proposed model makes additional use of high-level semantic features that describe important event components such as actions, people, objects, and backgrounds. The proposed model localizes temporal regions of events by using an LSTM, a recurrent neural network (RNN). Furthermore, our model adopts an attention mechanism for caption generation to selectively focus on input features depending on their importance. By conducting experiments on ActivityNet Captions, a large-scale benchmark dataset for dense video captioning, we demonstrate the high performance and superiority of our model.
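As a hedged illustration of attention-based caption generation over temporally localized features, the sketch below implements one LSTM decoding step that soft-attends over a set of per-segment features; the dimensions and module layout are assumptions rather than the paper's exact design.

```python
# Illustrative sketch only: one decoding step of an LSTM captioner that
# soft-attends over per-segment features, which may mix visual and
# high-level semantic cues.
import torch
import torch.nn as nn

class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim, hid_dim, vocab_size):
        super().__init__()
        self.att = nn.Linear(feat_dim + hid_dim, 1)
        self.lstm = nn.LSTMCell(feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, h, c):
        # feats: (T, feat_dim) per-segment features; (h, c): LSTM state
        scores = self.att(torch.cat([feats, h.expand(feats.size(0), -1)], -1))
        alpha = torch.softmax(scores.squeeze(-1), dim=0)   # (T,)
        context = alpha @ feats                            # (feat_dim,)
        h, c = self.lstm(context.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
        return self.out(h).squeeze(0), h.squeeze(0), c.squeeze(0)

step = AttentiveDecoderStep(feat_dim=512, hid_dim=256, vocab_size=10000)
feats = torch.randn(20, 512)                 # 20 temporal segments (toy)
h = c = torch.zeros(256)
logits, h, c = step(feats, h, c)             # next-word distribution
print(logits.shape)                          # torch.Size([10000])
```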
Visual Scene Understanding with Contexts
http://doi.org/10.5626/JOK.2018.45.12.1279
In this paper, as a visual scene understanding problem, we address the problem of generating corresponding scene graphs and image captions from input images. While a scene graph is a formal knowledge representation expressing in-image objects and their relationships, an image caption is a natural language sentence describing the scene captured in the given image. To address the problem effectively, we propose a novel deep neural network model, CSUN (Context-based Scene Understanding Network), which generates the two representations in a complementary way by exchanging useful contexts between them. The proposed model consists of three layers, object detection, relationship detection, and caption generation, each of which makes use of the proper context to accomplish its own task. To evaluate the performance of the proposed model, we conduct various experiments on a large-scale benchmark dataset, Visual Genome. Through these experiments, we demonstrate that our model, by using useful contexts, achieves significant improvements in accuracy over state-of-the-art models.
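The following is a highly simplified dataflow sketch of the context-exchange idea: contextualized object features feed relationship (scene-graph) prediction, and both feed a toy caption scorer. Component internals are placeholders I am assuming, not the CSUN implementation.

```python
# Simplified dataflow sketch: object context -> relationship detection,
# and both object and pair contexts -> caption generation.
import torch
import torch.nn as nn

class ContextPipeline(nn.Module):
    def __init__(self, dim=256, num_predicates=51, vocab=10000):
        super().__init__()
        self.obj_ctx = nn.GRU(dim, dim, batch_first=True)   # object-level context
        self.rel_head = nn.Linear(2 * dim, num_predicates)  # scene-graph predicates
        self.cap_head = nn.Linear(3 * dim, vocab)            # toy next-word scorer

    def forward(self, region_feats):
        # region_feats: (1, N, dim) features of N detected objects
        obj_ctx, _ = self.obj_ctx(region_feats)
        n = obj_ctx.size(1)
        pairs = torch.cat([obj_ctx.repeat_interleave(n, dim=1),
                           obj_ctx.repeat(1, n, 1)], dim=-1)   # all object pairs
        rel_logits = self.rel_head(pairs)                      # scene-graph edges
        # caption generation reuses both object and relationship contexts
        cap_ctx = torch.cat([obj_ctx.mean(dim=1), pairs.mean(dim=1)], dim=-1)
        return rel_logits, self.cap_head(cap_ctx)

model = ContextPipeline()
rel_logits, word_logits = model(torch.randn(1, 5, 256))   # 5 toy objects
print(rel_logits.shape, word_logits.shape)
```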
Activity Detection in Untrimmed Videos with Semantic Features and Temporal Region Proposals
http://doi.org/10.5626/JOK.2018.45.7.678
In this paper, we propose a deep neural network model that effectively detects human activities in untrimmed videos. While temporal visual features extracted over several successive image frames in a video help to recognize a dynamic activity itself, spatial visual features extracted from each frame help to find the objects associated with the activity. To detect activities precisely in a video, therefore, both temporal and spatial visual features should be considered together. In addition to these visual features, semantic features describing video contents in high-level concepts may also help to improve video activity detection. To localize activity regions accurately, as well as to classify activities correctly in an untrimmed video, a mechanism for temporal region proposal is also required. The activity detection model proposed in this work learns both visual and semantic features of the given video with deep convolutional neural networks. Moreover, by using recurrent neural networks, the model effectively proposes temporal activity regions and classifies activities in the video. Experiments with large-scale benchmark datasets such as ActivityNet and THUMOS showed the high performance of our activity detection model.
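A rough sketch of the propose-then-classify pattern described above: a recurrent layer scores per-frame actionness to form temporal proposals, and each proposed span is classified from its pooled features. The thresholding scheme and dimensions are assumptions, not the paper's exact mechanism.

```python
# Rough sketch under stated assumptions (not the paper's network): per-frame
# "actionness" scores propose temporal regions; each span is then classified.
import torch
import torch.nn as nn

class ProposeAndClassify(nn.Module):
    def __init__(self, feat_dim=512, hid=256, num_classes=200):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hid, batch_first=True)
        self.actionness = nn.Linear(hid, 1)
        self.classifier = nn.Linear(hid, num_classes)

    def forward(self, frame_feats, threshold=0.5):
        # frame_feats: (1, T, feat_dim) per-frame CNN (+semantic) features
        h, _ = self.rnn(frame_feats)
        score = torch.sigmoid(self.actionness(h)).squeeze(-1).squeeze(0)  # (T,)
        proposals, start = [], None
        for t, s in enumerate(score.tolist()):      # group contiguous high scores
            if s >= threshold and start is None:
                start = t
            elif s < threshold and start is not None:
                proposals.append((start, t)); start = None
        if start is not None:
            proposals.append((start, len(score)))
        return [(b, e, self.classifier(h[0, b:e].mean(0)).argmax().item())
                for b, e in proposals]               # (start, end, class_id)

model = ProposeAndClassify()
print(model(torch.randn(1, 30, 512)))                # toy 30-frame clip
```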
Direction Relation Representation and Reasoning for Indoor Service Robots
Seokjun Lee, Jonghoon Kim, Incheol Kim
http://doi.org/10.5626/JOK.2018.45.3.211
In this paper, we propose a robot-centered direction relation representation and the relevant reasoning methods for indoor service robots. Many conventional works on qualitative spatial reasoning rely only on position information when deciding the relative direction relation of a target object. Such reasoning methods may infer an incorrect direction relation of the target object relative to the robot, since they do not take into account the heading direction of the robot itself as the base object. In this paper, we present a robot-centered direction relation representation and the corresponding reasoning methods. When deciding the directional relationship of target objects relative to the robot in an indoor environment, the proposed methods make use of the robot's orientation information as well as its position information. The robot-centered reasoning methods are implemented by extending the existing cone-based, matrix-based, and hybrid methods, which utilize only the position information of the two objects. In various experiments with both a physical Turtlebot and a simulated one, the proposed representation and reasoning methods demonstrated high performance and applicability.
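The core idea lends itself to a short worked example. The function below is my own simplified variant, not the authors' implementation: it computes an 8-sector, cone-based direction relation after rotating the target's bearing into the robot's heading frame, so the same positions yield different relations for different headings.

```python
# Minimal sketch of the core idea: a cone-based direction relation that
# accounts for the robot's heading, so "left of the robot" depends on
# where the robot is facing.
import math

SECTORS = ["front", "front-left", "left", "back-left",
           "back", "back-right", "right", "front-right"]

def robot_centered_direction(robot_xy, robot_heading_deg, target_xy):
    """Return an 8-sector direction of the target relative to the robot's heading."""
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))          # world-frame bearing
    relative = (bearing - robot_heading_deg) % 360.0    # rotate into robot frame
    return SECTORS[int(((relative + 22.5) % 360.0) // 45.0)]

# Same positions, different headings -> different relations, which is exactly
# what position-only reasoning cannot capture.
print(robot_centered_direction((0, 0), 0,  (1, 1)))   # front-left
print(robot_centered_direction((0, 0), 90, (1, 1)))   # front-right
```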
Ontology-Based Dynamic Context Management and Spatio-Temporal Reasoning for Intelligent Service Robots
Jonghoon Kim, Seokjun Lee, Dongha Kim, Incheol Kim
One of the most important capabilities for autonomous service robots working in living environments is to recognize and understand the correct context in a dynamically changing environment. To generate high-level context knowledge for decision-making from multiple sensory data streams, many technical problems, such as multi-modal sensory data fusion, uncertainty handling, symbolic knowledge grounding, time dependency, dynamics, and time-constrained spatio-temporal reasoning, should be solved. Considering these problems, this paper proposes an effective dynamic context management and spatio-temporal reasoning method for intelligent service robots. To guarantee efficient context management and reasoning, our algorithm is designed to generate low-level context knowledge reactively for every incoming sensory or perception datum, while postponing high-level context knowledge generation until it is demanded by the decision-making module. When high-level context knowledge is demanded, it is derived through backward spatio-temporal reasoning. In experiments with a Turtlebot using a Kinect visual sensor, the dynamic context management and spatio-temporal reasoning system based on the proposed method showed high performance.
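The reactive low-level / demand-driven high-level split can be sketched as follows; the knowledge representation and the single "approaching" rule are invented for illustration and are not the paper's ontology-based formalism.

```python
# Conceptual sketch only: low-level context facts are asserted eagerly per
# perception event, while a high-level fact is derived backward only when queried.
import time

class ContextManager:
    def __init__(self):
        self.low_level = []           # (timestamp, subject, predicate, value)

    def on_perception(self, subject, predicate, value):
        # reactive step: just record the time-stamped low-level fact
        self.low_level.append((time.time(), subject, predicate, value))

    def query_high_level(self, subject, predicate, window_sec=5.0):
        # demand-driven step: derive the high-level fact by backward reasoning
        # over recent low-level facts (toy rule: "approaching" if the distance
        # to the subject decreased within the time window)
        now = time.time()
        dists = [v for t, s, p, v in self.low_level
                 if s == subject and p == "distance" and now - t <= window_sec]
        if predicate == "approaching" and len(dists) >= 2:
            return dists[-1] < dists[0]
        return None

ctx = ContextManager()
for d in (3.0, 2.4, 1.7):             # simulated range readings to a person
    ctx.on_perception("person_1", "distance", d)
print(ctx.query_high_level("person_1", "approaching"))   # True
```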
Design and Implementation of a Hybrid Spatial Reasoning Algorithm
In order to answer questions successfully on behalf of a human contestant in DeepQA environments such as 'Jeopardy!', the American quiz show, a computer needs the capability of fast temporal and spatial reasoning over a large-scale commonsense knowledge base. In this paper, we present a hybrid spatial reasoning algorithm for handling directional and topological relations. Our algorithm not only improves query processing time by reducing unnecessary reasoning calculations, but also effectively deals with changes to the spatial knowledge base, as it takes a hybrid approach that combines forward and backward reasoning. Through experiments performed on a sample spatial knowledge base with a reasoner implementing our algorithm, we demonstrated its high performance.
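A toy sketch of the hybrid idea follows: directional facts can be forward-chained into a materialized closure whenever the base changes, while a query can also be answered on demand by backward chaining. The relation set and rules are simplified assumptions, not the paper's rule base.

```python
# Toy sketch of combining forward and backward reasoning over a tiny
# directional relation base (a single transitive relation, "west_of").
def forward_closure(facts):
    """Materialize the transitive closure of 'west_of' facts."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d)); changed = True
    return closure

def backward_query(facts, a, b, seen=None):
    """Answer 'is a west_of b?' by depth-first backward chaining."""
    seen = seen or set()
    if (a, b) in facts:
        return True
    return any(x == a and y not in seen and
               backward_query(facts, y, b, seen | {y}) for x, y in facts)

base = {("kitchen", "hall"), ("hall", "bedroom")}
closure = forward_closure(base)                    # recomputed on KB change
print(("kitchen", "bedroom") in closure)           # fast lookup: True
print(backward_query(base, "kitchen", "bedroom"))  # on-demand: True
```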