TY - JOUR
T1 - GDVT: geometrically aware dual transformer encoding visual and textual features for image captioning
AU - Chang, Yu Ling
AU - Ma, Hao Shang
AU - Li, Shiou Chi
AU - Jaysawal, Bijay Prasad
AU - Huang, Jen Wei
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025
Y1 - 2025
N2 - When describing pictures from the point of view of human observers, the tendency is to prioritize eye-catching objects, link them to corresponding labels, and then integrate the results with background information (i.e., nearby objects or locations) to provide context. Most caption generation schemes consider the visual information of objects while ignoring the corresponding labels, the setting, and/or the spatial relationship between the object and setting. This fails to exploit most of the useful information that the image might otherwise provide. In the current study, we develop a model that uses object tags to supplement insufficient visual object features and establishes relationships between objects and background features based on relative and absolute coordinate information. In addition, we propose an attention architecture to account for all of the features in generating an image description. The effectiveness of the proposed geometrically aware dual transformer encoding visual and textual features (GDVT) is demonstrated in experimental settings with and without pre-training. The effectiveness of our added features is verified in the case studies.
AB - When describing pictures from the point of view of human observers, the tendency is to prioritize eye-catching objects, link them to corresponding labels, and then integrate the results with background information (i.e., nearby objects or locations) to provide context. Most caption generation schemes consider the visual information of objects while ignoring the corresponding labels, the setting, and/or the spatial relationship between the object and setting. This fails to exploit most of the useful information that the image might otherwise provide. In the current study, we develop a model that uses object tags to supplement insufficient visual object features and establishes relationships between objects and background features based on relative and absolute coordinate information. In addition, we propose an attention architecture to account for all of the features in generating an image description. The effectiveness of the proposed geometrically aware dual transformer encoding visual and textual features (GDVT) is demonstrated in experimental settings with and without pre-training. The effectiveness of our added features is verified in the case studies.
UR - https://www.scopus.com/pages/publications/105010843873
UR - https://www.scopus.com/inward/citedby.url?scp=105010843873&partnerID=8YFLogxK
U2 - 10.1007/s41060-025-00836-6
DO - 10.1007/s41060-025-00836-6
M3 - Article
AN - SCOPUS:105010843873
SN - 2364-415X
JO - International Journal of Data Science and Analytics
JF - International Journal of Data Science and Analytics
ER -