TY - GEN
T1 - Geometrically-Aware Dual Transformer Encoding Visual and Textual Features for Image Captioning
AU - Chang, Yu Ling
AU - Ma, Hao Shang
AU - Li, Shiou Chi
AU - Huang, Jen Wei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - When describing pictures from the point of view of human observers, the tendency is to prioritize eye-catching objects, link them to corresponding labels, and then integrate the results with background information (i.e., nearby objects or locations) to provide context. Most caption generation schemes consider the visual information of objects while ignoring the corresponding labels, the setting, and/or the spatial relationship between the object and setting. This fails to exploit most of the useful information that the image might otherwise provide. In the current study, we developed a model that adds the object’s tags to supplement the insufficient information in visual object features, and established relationships between objects and background features based on relative and absolute coordinate information. We also proposed an attention architecture to account for all of the features in generating an image description. The effectiveness of the proposed Geometrically-Aware Dual Transformer Encoding Visual and Textual Features (GDVT) is demonstrated in experimental settings with and without pre-training.
AB - When describing pictures from the point of view of human observers, the tendency is to prioritize eye-catching objects, link them to corresponding labels, and then integrate the results with background information (i.e., nearby objects or locations) to provide context. Most caption generation schemes consider the visual information of objects while ignoring the corresponding labels, the setting, and/or the spatial relationship between the object and setting. This fails to exploit most of the useful information that the image might otherwise provide. In the current study, we developed a model that adds the object’s tags to supplement the insufficient information in visual object features, and established relationships between objects and background features based on relative and absolute coordinate information. We also proposed an attention architecture to account for all of the features in generating an image description. The effectiveness of the proposed Geometrically-Aware Dual Transformer Encoding Visual and Textual Features (GDVT) is demonstrated in experimental settings with and without pre-training.
UR - http://www.scopus.com/inward/record.url?scp=85192840255&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192840255&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-2262-4_2
DO - 10.1007/978-981-97-2262-4_2
M3 - Conference contribution
AN - SCOPUS:85192840255
SN - 9789819722648
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 15
EP - 27
BT - Advances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Proceedings
A2 - Yang, De-Nian
A2 - Xie, Xing
A2 - Tseng, Vincent S.
A2 - Pei, Jian
A2 - Huang, Jen-Wei
A2 - Lin, Jerry Chun-Wei
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
Y2 - 7 May 2024 through 10 May 2024
ER -