GDVT: geometrically aware dual transformer encoding visual and textual features for image captioning

Yu Ling Chang, Hao Shang Ma, Shiou Chi Li, Bijay Prasad Jaysawal, Jen Wei Huang

Research output: Article › peer-review

Abstract

When describing pictures from the point of view of human observers, the tendency is to prioritize eye-catching objects, link them to corresponding labels, and then integrate the results with background information (i.e., nearby objects or locations) to provide context. Most caption generation schemes consider the visual information of objects while ignoring the corresponding labels, the setting, and/or the spatial relationship between object and setting. This fails to exploit much of the useful information the image could otherwise provide. In the current study, we develop a model that uses object tags to supplement the limited information carried by visual object features and that establishes relationships between objects and background features based on relative and absolute coordinate information. In addition, we propose an attention architecture to account for all of these features when generating an image description. The effectiveness of the proposed geometrically aware dual transformer encoding visual and textual features (GDVT) is demonstrated in experimental settings with and without pre-training. The contribution of the added features is verified in the case studies.
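The abstract mentions two concrete mechanisms: supplementing region visual features with their detected tags, and injecting relative geometric (coordinate) information into attention. The PyTorch sketch below illustrates these two ideas only; it is not the authors' GDVT implementation, and all module names, dimensions, and the exact form of the geometric bias are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): region features fused with tag
# embeddings, plus a per-head attention bias computed from relative box geometry.
import torch
import torch.nn as nn


def relative_geometry(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise relative geometry of regions given as (x, y, w, h) boxes, shape (B, N, 4)."""
    x, y, w, h = boxes.unbind(-1)                                      # each (B, N)
    dx = (x[:, :, None] - x[:, None, :]) / (w[:, :, None] + 1e-6)
    dy = (y[:, :, None] - y[:, None, :]) / (h[:, :, None] + 1e-6)
    dw = torch.log(w[:, None, :] / (w[:, :, None] + 1e-6) + 1e-6)
    dh = torch.log(h[:, None, :] / (h[:, :, None] + 1e-6) + 1e-6)
    return torch.stack([dx, dy, dw, dh], dim=-1)                       # (B, N, N, 4)


class GeometryAwareEncoderLayer(nn.Module):
    """Self-attention over region features with an additive bias from relative geometry."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.geo_bias = nn.Linear(4, n_heads)                          # one geometric bias per head
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        b, n, _ = feats.shape
        bias = self.geo_bias(relative_geometry(boxes))                 # (B, N, N, H)
        bias = bias.permute(0, 3, 1, 2).reshape(b * self.n_heads, n, n)
        out, _ = self.attn(feats, feats, feats, attn_mask=bias)        # additive float mask
        feats = self.norm1(feats + out)
        return self.norm2(feats + self.ffn(feats))


class TagAugmentedEncoder(nn.Module):
    """Fuse region visual features with embeddings of their detected tags, then
    apply geometry-aware self-attention layers."""

    def __init__(self, vis_dim: int = 2048, vocab_size: int = 10000,
                 d_model: int = 512, n_layers: int = 3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.tag_embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([GeometryAwareEncoderLayer(d_model)
                                     for _ in range(n_layers)])

    def forward(self, vis_feats, tag_ids, boxes):
        x = self.vis_proj(vis_feats) + self.tag_embed(tag_ids)         # tags supplement visual features
        for layer in self.layers:
            x = layer(x, boxes)
        return x                                                       # memory for a caption decoder


if __name__ == "__main__":
    # Example: 2 images, 36 detected regions each.
    enc = TagAugmentedEncoder()
    vis = torch.randn(2, 36, 2048)                 # region visual features
    tags = torch.randint(0, 10000, (2, 36))        # detected object tags (word ids)
    boxes = torch.rand(2, 36, 4) + 0.01            # (x, y, w, h) with positive w, h
    print(enc(vis, tags, boxes).shape)             # torch.Size([2, 36, 512])
```

The encoder output would serve as the memory for a standard transformer caption decoder; how GDVT splits the encoding into its "dual" visual and textual streams is not specified in this abstract, so the sketch keeps a single fused stream.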

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Modeling and Simulation
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Applied Mathematics

