Similarity retrieval of web documents considering both text and style

Chao Chun Chen, Yu Chi Chung, Cheng Chieh Chien, Chiang Lee

Research output: Contribution to journal › Article › peer-review

Abstract

As a tremendous number of web pages are added to the Internet every day, the World Wide Web (WWW) has become the most fertile database for retrieving information. However, a frustrating problem when using the WWW is encountering a dead link, which returns a so-called HTTP 404 error message to indicate that the desired home page is missing. If the desired home page has simply moved to a new location, there should be a way to rediscover the new site and return the information to the user. A 404 Error RecoveRing (abbreviated as 404 Err) Server is under development to bring dead links back to life. The main idea of our design is to compare the old home page, saved in the 404 Err Server, with all the other home pages on the web to find the most similar ones and recommend them to the user. The kernel technique of such a system is therefore similarity retrieval of web documents. Previous research on similarity retrieval has mainly considered only the text of a web document and ignored its design style (i.e., the layout) in the similarity comparison. Yet style can be extremely valuable for distinguishing more finely among documents whose text is similar. This paper presents the comparison technique of our 404 Err Server, which considers text information as well as style information. Experiments are conducted to demonstrate the feasibility and efficiency of the proposed method.
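The abstract does not specify how text and style are scored, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that text similarity can be approximated by cosine similarity over word counts, that style can be approximated by cosine similarity over HTML tag counts, and that the two are blended with a fixed weight. The names `page_similarity`, `rank_candidates`, and `text_weight` are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects word counts (text) and opening-tag counts (a crude style proxy)."""

    def __init__(self):
        super().__init__()
        self.words = Counter()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

    def handle_data(self, data):
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def page_similarity(old_html, new_html, text_weight=0.7):
    """Blend text and style similarity; the 0.7 weight is an arbitrary assumption."""
    old, new = _PageParser(), _PageParser()
    old.feed(old_html)
    new.feed(new_html)
    text_sim = cosine(old.words, new.words)
    style_sim = cosine(old.tags, new.tags)
    return text_weight * text_sim + (1 - text_weight) * style_sim


def rank_candidates(old_html, candidates, text_weight=0.7):
    """Rank candidate pages (url -> html) by similarity to the saved old page."""
    scored = [
        (url, page_similarity(old_html, html, text_weight))
        for url, html in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, two candidates with identical text but different markup receive different scores, which is the kind of finer differentiation the abstract attributes to style information. A real system would need a richer layout representation than raw tag counts.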

Original language: English
Pages (from-to): 620-629
Number of pages: 10
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3007
Publication status: Published - 2004

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
