WISDOM: Web Intrapage Informative Structure Mining based on Document Object Model

Hung Yu Kao, Jan Ming Ho, Ming Syan Chen

Research output: Contribution to journal (Article, peer-reviewed)

39 Citations (Scopus)


To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that the intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
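As a rough illustration of the idea described in the abstract, the sketch below scores DOM subtrees with an entropy measure and searches top-down for candidate informative blocks. It is not the paper's algorithm: the abstract does not give WISDOM's formulas, so the `Node` class, the normalized term entropy, the averaging in `block_score`, and the `threshold` value are all assumptions made for this example, and the merging step is left out.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy DOM node (hypothetical): a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def terms(self) -> list[str]:
        """Return all lowercase terms found in this subtree."""
        words = self.text.lower().split()
        for child in self.children:
            words.extend(child.terms())
        return words


def term_entropy(pages: list[Node]) -> dict[str, float]:
    """Normalized Shannon entropy of each term's distribution over pages.

    Assumption for this sketch: a term spread evenly over all pages
    (entropy near 1) is likely site-wide redundant content, while a term
    concentrated in few pages (entropy near 0) is likely informative.
    """
    counts = [Counter(page.terms()) for page in pages]
    vocab = set().union(*counts)
    norm = math.log(len(pages)) if len(pages) > 1 else 1.0
    entropy = {}
    for term in vocab:
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = -sum((f / total) * math.log(f / total) for f in freqs if f)
        entropy[term] = h / norm
    return entropy


def block_score(node: Node, entropy: dict[str, float]) -> float:
    """Average term entropy of a subtree; empty subtrees count as redundant."""
    terms = node.terms()
    if not terms:
        return 1.0
    return sum(entropy.get(t, 1.0) for t in terms) / len(terms)


def find_candidates(node: Node, entropy: dict[str, float],
                    threshold: float = 0.3) -> list[Node]:
    """Top-down search: keep a subtree whose score is low enough,
    otherwise descend into its children."""
    if block_score(node, entropy) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(find_candidates(child, entropy, threshold))
    return found
```

Given a few pages from the same site that share a navigation block but differ in article text, calling `term_entropy` on the pages and then `find_candidates` on one page should return the article subtree and skip the shared navigation. The threshold would need tuning on real data.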

Original language: English
Pages (from-to): 614-627
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Issue number: 5
Publication status: Published - May 2005

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

