TY - JOUR
T1 - Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering
AU - Lin, Ying-Jia
AU - Tseng, Ching-Shan
AU - Kao, Hung-Yu
N1 - Publisher Copyright:
© 2024 Institute of Information Science. All rights reserved.
PY - 2024/5
Y1 - 2024/5
N2 - Recent studies leveraging object detection as the preliminary step for Visual Question Answering (VQA) ignore the relationships between different objects in an image that are relevant to the textual question. In addition, previous VQA models work as black-box functions, making it difficult to explain why a model produces a particular answer for a given input. To address these issues, we propose a new model structure that strengthens the representations of different objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then create relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps our model produce reasonable explanations. Simultaneously, the generated explanations are fused with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method performs better on the VQA task than several baselines. In addition, incorporating the explanation generator allows the model to provide reasonable explanations along with the predicted answers.
AB - Recent studies leveraging object detection as the preliminary step for Visual Question Answering (VQA) ignore the relationships between different objects in an image that are relevant to the textual question. In addition, previous VQA models work as black-box functions, making it difficult to explain why a model produces a particular answer for a given input. To address these issues, we propose a new model structure that strengthens the representations of different objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then create relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps our model produce reasonable explanations. Simultaneously, the generated explanations are fused with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method performs better on the VQA task than several baselines. In addition, incorporating the explanation generator allows the model to provide reasonable explanations along with the predicted answers.
UR - http://www.scopus.com/inward/record.url?scp=85192676438&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192676438&partnerID=8YFLogxK
U2 - 10.6688/JISE.202405_40(3).0014
DO - 10.6688/JISE.202405_40(3).0014
M3 - Article
AN - SCOPUS:85192676438
SN - 1016-2364
VL - 40
SP - 649
EP - 659
JO - Journal of Information Science and Engineering
JF - Journal of Information Science and Engineering
IS - 3
ER -