TY - GEN
T1 - GViG
T2 - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
AU - Li, Yi-Ting
AU - Lin, Ying-Jia
AU - Yeh, Chia-Jen
AU - Lin, Chun-Yi
AU - Kao, Hung-Yu
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning it with Visual Grounding (VG) tasks. Existing VG approaches, when applied to GVQA, often necessitate external data or larger models to achieve satisfactory results, leading to high computational demands. We approach this task as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX 3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available at https://github.com/IKMLab/GViG.git
AB - The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning it with Visual Grounding (VG) tasks. Existing VG approaches, when applied to GVQA, often necessitate external data or larger models to achieve satisfactory results, leading to high computational demands. We approach this task as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX 3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available at https://github.com/IKMLab/GViG.git
UR - https://www.scopus.com/pages/publications/85192847917
U2 - 10.1007/978-981-97-2266-2_7
DO - 10.1007/978-981-97-2266-2_7
M3 - Conference contribution
AN - SCOPUS:85192847917
SN - 9789819722655
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 83
EP - 94
BT - Advances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Taipei, Taiwan, May 7–10, 2024, Proceedings
A2 - Yang, De-Nian
A2 - Xie, Xing
A2 - Tseng, Vincent S.
A2 - Pei, Jian
A2 - Huang, Jen-Wei
A2 - Lin, Jerry Chun-Wei
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 7 May 2024 through 10 May 2024
ER -