
GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering

  • Yi Ting Li
  • Ying Jia Lin
  • Chia Jen Yeh
  • Chun Yi Lin
  • Hung Yu Kao

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning with Visual Grounding (VG) tasks. Existing VG approaches, when applied to GVQA, often require external data or larger models to reach satisfactory results, leading to high computational demands. We approach this as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX 3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available here: https://github.com/IKMLab/GViG.git
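
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of Intersection over Union for two axis-aligned bounding boxes. It is not taken from the authors' repository. The `(left, top, right, bottom)` box layout and the function name `iou` are assumptions made for illustration, and the reported score of 75.658 is presumably an average over the test set expressed as a percentage, which the abstract does not state explicitly.

```python
from typing import Tuple

# Assumed box layout (not specified in the abstract): (left, top, right, bottom).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, ground_truth))
```

In a GVQA setting, `predicted` would be the box a model outputs for an image-question pair and `ground_truth` the annotated box. A value of 1.0 means the boxes coincide and 0.0 means they do not overlap.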

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Taipei, Taiwan, May 7–10, 2024, Proceedings
Editors: De-Nian Yang, Xing Xie, Vincent S. Tseng, Jian Pei, Jen-Wei Huang, Jerry Chun-Wei Lin
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 83-94
Number of pages: 12
ISBN (Print): 9789819722655
Publication status: Published - 2024
Event: 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024 - Taipei, Taiwan
Duration: 2024 May 7 to 2024 May 10

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14650 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
Country/Territory: Taiwan
City: Taipei
Period: 24-05-07 to 24-05-10

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
