Unsupervised alignment of news video and text using visual patterns and textual concepts

Jun Bin Yeh, Chung-Hsien Wu, Sheng Xiong Chang

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

A brief preview of a news video can be generated by semantically aligning the textual sentences of the anchor report with the visual field shots. Since accurately detecting objects in a visual shot is difficult and a textual term may correspond to several synonyms, aligning an anchor sentence with a video shot remains challenging. In this study, the temporal relation among the frames in a visual shot is characterized by a visual language model, and this language model-based temporal relation is then applied to sentence-based alignment. The bag-of-words representations of the main objects in the key frames of a visual shot are first mapped to visual patterns trained from a news video database. The textual terms in each report sentence are then mapped to textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model 1. In the evaluation, the visual pattern language model yields an alignment score of 0.77, exceeding the 0.66 achieved by the DTW method. Across different news categories, visual pattern discovery and textual concept discovery improve the alignment performance in most categories.
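
To make the final alignment step concrete, the following Python snippet is a minimal sketch of IBM Model 1 expectation-maximization. It is an illustration under stated assumptions, not the paper's implementation: the toy corpus, the concept tokens (e.g., "fire"), and the visual pattern labels (e.g., "vp_flame") are hypothetical placeholders for HowNet concepts and discovered visual patterns, and the visual language model is not reproduced. The sketch only shows how Model 1 re-estimates the translation probabilities t(pattern | concept) from co-occurring sentence/shot pairs.

from collections import defaultdict

def train_ibm_model1(corpus, n_iter=10):
    """Train IBM Model 1 on (concept_tokens, pattern_tokens) pairs.
    Returns translation probabilities t[(concept, pattern)]."""
    patterns = {p for _, ps in corpus for p in ps}
    # Uniform initialization of t(pattern | concept).
    t = defaultdict(lambda: 1.0 / len(patterns))
    for _ in range(n_iter):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected concept counts
        for cs, ps in corpus:
            for p in ps:
                # E-step: distribute p's probability mass over the
                # concepts in this sentence (no NULL token, for brevity).
                z = sum(t[(c, p)] for c in cs)
                for c in cs:
                    delta = t[(c, p)] / z
                    count[(c, p)] += delta
                    total[c] += delta
        # M-step: re-normalize so that t(. | c) sums to 1 for each c.
        for (c, p), n in count.items():
            t[(c, p)] = n / total[c]
    return t

# Toy corpus: textual concepts of a report sentence paired with the
# visual patterns of its candidate shot (all labels are hypothetical).
corpus = [
    (["fire", "building"], ["vp_flame", "vp_facade"]),
    (["fire", "truck"], ["vp_flame", "vp_vehicle"]),
]
t = train_ibm_model1(corpus)
print(round(t[("fire", "vp_flame")], 3))  # mass concentrates on vp_flame

After a few iterations, co-occurrence statistics concentrate the probability mass, and taking each pattern's highest-probability concept yields an unsupervised concept-to-pattern alignment of the kind the abstract describes.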

Original language: English
Article number: 5657260
Pages (from-to): 206-215
Number of pages: 10
Journal: IEEE Transactions on Multimedia
Volume: 13
Issue number: 2
DOIs: 10.1109/TMM.2010.2095412
Publication status: Published - 2011 Apr 1

Fingerprint

  • Anchors
  • Visual languages

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

Cite this

@article{df371bf908bd481a894b3f87eedd80df,
title = "Unsupervised alignment of news video and text using visual patterns and textual concepts",
abstract = "A brief preview of a news video can be generated by semantically aligning the textual sentences of the anchor report with the visual field shots. Since accurately detecting objects in a visual shot is difficult and a textual term may correspond to several synonyms, aligning an anchor sentence with a video shot remains challenging. In this study, the temporal relation among the frames in a visual shot is characterized by a visual language model, and this language model-based temporal relation is then applied to sentence-based alignment. The bag-of-words representations of the main objects in the key frames of a visual shot are first mapped to visual patterns trained from a news video database. The textual terms in each report sentence are then mapped to textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model 1. In the evaluation, the visual pattern language model yields an alignment score of 0.77, exceeding the 0.66 achieved by the DTW method. Across different news categories, visual pattern discovery and textual concept discovery improve the alignment performance in most categories.",
author = "Yeh, {Jun Bin} and Chung-Hsien Wu and Chang, {Sheng Xiong}",
year = "2011",
month = "4",
day = "1",
doi = "10.1109/TMM.2010.2095412",
language = "English",
volume = "13",
pages = "206--215",
journal = "IEEE Transactions on Multimedia",
issn = "1520-9210",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",
}

Unsupervised alignment of news video and text using visual patterns and textual concepts. / Yeh, Jun Bin; Wu, Chung-Hsien; Chang, Sheng Xiong.

In: IEEE Transactions on Multimedia, Vol. 13, No. 2, 5657260, 01.04.2011, p. 206-215.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Unsupervised alignment of news video and text using visual patterns and textual concepts

AU - Yeh, Jun Bin

AU - Wu, Chung-Hsien

AU - Chang, Sheng Xiong

PY - 2011/4/1

Y1 - 2011/4/1

N2 - A brief preview of a news video can be generated by semantically aligning the textual sentences of the anchor report with the visual field shots. Since accurately detecting objects in a visual shot is difficult and a textual term may correspond to several synonyms, aligning an anchor sentence with a video shot remains challenging. In this study, the temporal relation among the frames in a visual shot is characterized by a visual language model, and this language model-based temporal relation is then applied to sentence-based alignment. The bag-of-words representations of the main objects in the key frames of a visual shot are first mapped to visual patterns trained from a news video database. The textual terms in each report sentence are then mapped to textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model 1. In the evaluation, the visual pattern language model yields an alignment score of 0.77, exceeding the 0.66 achieved by the DTW method. Across different news categories, visual pattern discovery and textual concept discovery improve the alignment performance in most categories.

AB - A brief preview of a news video can be generated by semantically aligning the textual sentences of the anchor report with the visual field shots. Since accurately detecting objects in a visual shot is difficult and a textual term may correspond to several synonyms, aligning an anchor sentence with a video shot remains challenging. In this study, the temporal relation among the frames in a visual shot is characterized by a visual language model, and this language model-based temporal relation is then applied to sentence-based alignment. The bag-of-words representations of the main objects in the key frames of a visual shot are first mapped to visual patterns trained from a news video database. The textual terms in each report sentence are then mapped to textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model 1. In the evaluation, the visual pattern language model yields an alignment score of 0.77, exceeding the 0.66 achieved by the DTW method. Across different news categories, visual pattern discovery and textual concept discovery improve the alignment performance in most categories.

UR - http://www.scopus.com/inward/record.url?scp=79952925500&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952925500&partnerID=8YFLogxK

U2 - 10.1109/TMM.2010.2095412

DO - 10.1109/TMM.2010.2095412

M3 - Article

AN - SCOPUS:79952925500

VL - 13

SP - 206

EP - 215

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

SN - 1520-9210

IS - 2

M1 - 5657260

ER -