TY  - JOUR
T1  - Resolving Ambiguity in Visual Question Answering through an Iterative Clarifying QA-based Framework
AU  - Sung, Yu-Jeong
AU  - Park, Gyu-Min
AU  - Park, Seong-Bae
JO  - Journal of KIISE, JOK
PY  - 2025
DA  - 2025/1/14
DO  - 10.5626/JOK.2025.52.9.778
KW  - visual question answering (VQA)
KW  - ambiguous objects
KW  - clarifying question generation
KW  - multi-turn reasoning
KW  - multimodal
AB  - This paper presents a three-stage framework for the problem of ambiguous objects in Visual Question Answering (VQA), in which the object a question refers to is unclear because the image contains multiple candidates. The framework (1) detects whether the question is ambiguous, (2) generates clarification questions when ambiguity is detected, and (3) uses the Q&A history to perform the final VQA. The model generates clarification questions directly, leveraging visual features without any additional training, and iteratively refines them by incorporating the history of previous question-answer pairs. Experiments using the LLaVA v1.6 model demonstrate that the proposed framework improves accuracy by 6.7% and semantic accuracy by 5.6% compared to the baseline. Moreover, combining ambiguity detection with an early stopping strategy reduces the inefficiency of multi-turn interaction, resulting in a 44% decrease in execution time. This study offers a practical solution to the ambiguous objects problem by enabling real-time clarification without additional training, ultimately leading to improved VQA accuracy.