Similarity retrieval of web documents considering both text and style

Chao Chun Chen, Yu Chi Chung, Cheng Chieh Chien, Chiang Lee

Research output: Article, peer-reviewed


As a tremendous number of web pages are added to the Internet every day, the World Wide Web (WWW) has become the most fertile database for retrieving information. However, an annoying problem when using the WWW is encountering a dead link, which returns a so-called HTTP 404 error message to indicate that the desired home page is missing. If the desired home page has simply moved to a new location, there should be a way to rediscover the new site and return the information to the user. A 404 Error RecoveRing (abbreviated as 404 Err) Server is under development to bring dead links back to life. The main idea of our design is to compare the old home page, saved in the 404 Err Server, with all the other home pages on the web, find the most similar ones, and recommend them to the user. The kernel technique of such a system is therefore similarity retrieval of web documents. Previous research on similarity retrieval has mainly considered only the text of a web document and ignored its design style (i.e., its layout) in the similarity comparison. Yet style can be extremely valuable for making finer distinctions among documents whose text is similar. This paper presents the comparison technique of our 404 Err Server, which considers style information as well as text information. Experiments are conducted to demonstrate the feasibility and efficiency of the proposed method.
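The idea of scoring candidate pages on both text and style can be illustrated with a minimal sketch. The abstract does not specify the paper's actual measures, so everything here is an assumption made for illustration: text similarity is taken as cosine similarity over word counts, style similarity as cosine similarity over HTML tag counts, and the two are blended with a hypothetical `text_weight` parameter. The function name `page_similarity` is likewise invented.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects the words (text part) and start tags (a crude style proxy)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def _cosine(a, b):
    """Cosine similarity of two Counters; 0.0 if either one is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def page_similarity(html_a, html_b, text_weight=0.5):
    """Blend text and style similarity (assumed measures, not the paper's)."""
    parsed = []
    for html in (html_a, html_b):
        parser = _PageParser()
        parser.feed(html)
        parser.close()
        parsed.append(parser)
    text_sim = _cosine(Counter(parsed[0].words), Counter(parsed[1].words))
    style_sim = _cosine(Counter(parsed[0].tags), Counter(parsed[1].tags))
    return text_weight * text_sim + (1 - text_weight) * style_sim


old = "<html><body><h1>Lab News</h1><p>welcome to the lab</p></body></html>"
restyled = "<table><tr><td>Lab News</td><td>welcome to the lab</td></tr></table>"
print(page_similarity(old, old))       # same text, same layout
print(page_similarity(old, restyled))  # same text, different layout
```

In this sketch, two pages with identical wording but different markup score lower than an exact copy, which is the kind of finer differentiation among text-wise similar documents that the abstract attributes to style information.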

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

