TY - JOUR
T1 - Resolving Ambiguity in Visual Question Answering through an Iterative Clarifying QA-based Framework
AU - Sung, Yu-Jeong
AU - Park, Gyu-Min
AU - Park, Seong-Bae
JO - Journal of KIISE, JOK
PY - 2025
DA - 2025/1/14
DO - 10.5626/JOK.2025.52.9.778
KW - visual question answering(VQA)
KW - ambiguous objects
KW - clarifying question generation
KW - multi-turn reasoning
KW - multimodal
AB - This paper presents a three-stage framework to tackle the problem of ambiguous objects in Visual Question Answering (VQA), where the object referred to in a question is unclear due to multiple candidates in the image. The framework includes: (1) detecting whether the question is ambiguous, (2) generating clarification questions when ambiguity is detected, and (3) utilizing the Q&A history to perform the final VQA. Clarification questions are generated directly by the model, leveraging visual features without any additional training. The model iteratively refines its questions by incorporating the history of previous question-answer pairs. Experiments using the LLaVA v1.6 model demonstrate that the proposed framework enhances accuracy by 6.7% and semantic accuracy by 5.6% compared to the baseline. Moreover, the integration of ambiguity detection and an early stopping strategy reduces the inefficiencies associated with multi-turn interactions, resulting in a 44% decrease in execution time. This study offers a practical solution to the ambiguous objects problem by enabling real-time clarification without the need for additional training, ultimately leading to improved VQA accuracy.