Active Vision from Image-Text Multimodal System Learning 


Vol. 43, No. 7, pp. 795-800, Jul. 2016



  Abstract

In image classification, recent CNNs rival human performance. Their success, however, does not extend readily to more general recognition. Here we deal with indoor images, which contain too much information to process directly and therefore require information reduction before recognition. To reduce the amount of data to process, variational inference or variational Bayesian methods are typically suggested for object detection, but these methods suffer from the difficulty of marginalizing over the given space. In this study, we propose an image-text integrated recognition system that uses active vision based on Spatial Transformer Networks. The system learns to efficiently sample a partial region of a given image conditioned on the accompanying language information. Our experimental results demonstrate a significant improvement over traditional approaches. We also present a qualitative analysis of the sampled images, the model's characteristics, and its limitations.
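The sampling mechanism the abstract refers to comes from Spatial Transformer Networks, which crop and resample a sub-region of an image through a differentiable affine grid. A minimal single-channel NumPy sketch of that grid-sampling step (the function name and the simplifications are ours, not from the paper):

```python
import numpy as np

def affine_sample(image, theta, out_h, out_w):
    """Sample a sub-region of a 2-D `image` with a 2x3 affine matrix
    `theta`, using bilinear interpolation over coordinates normalized
    to [-1, 1] (the convention of Spatial Transformer Networks)."""
    h, w = image.shape
    # Regular grid over the output, in normalized target coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (out_h, out_w, 3)
    src = grid @ theta.T                                  # (out_h, out_w, 2)
    # Map normalized source coordinates back to pixel indices.
    px = (src[..., 0] + 1) * (w - 1) / 2
    py = (src[..., 1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(px).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, h - 2)
    dx, dy = px - x0, py - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = image[y0, x0] * (1 - dx) + image[y0, x0 + 1] * dx
    bot = image[y0 + 1, x0] * (1 - dx) + image[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy

# The identity transform reproduces the input; a scale s < 1 zooms into
# a crop centred at the translation offsets (in normalized coordinates).
img = np.arange(16, dtype=float).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
assert np.allclose(affine_sample(img, identity, 4, 4), img)
```

In the full system, a small network would predict `theta` from the image and text features, so gradients from the recognition loss can flow back through the sampler into the region-selection policy.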



  Cite this article

[IEEE Style]

J. Kim and B. Zhang, "Active Vision from Image-Text Multimodal System Learning," Journal of KIISE, JOK, vol. 43, no. 7, pp. 795-800, 2016. DOI: .


[ACM Style]

Jin-Hwa Kim and Byoung-Tak Zhang. 2016. Active Vision from Image-Text Multimodal System Learning. Journal of KIISE, JOK, 43, 7, (2016), 795-800. DOI: .


[KCI Style]

Jin-Hwa Kim and Byoung-Tak Zhang, "Active Vision from Image-Text Multimodal System Learning," Journal of KIISE, vol. 43, no. 7, pp. 795-800, 2016. DOI: .









Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr