TY - JOUR
T1 - Similarity retrieval of web documents considering both text and style
AU - Chen, Chao Chun
AU - Chung, Yu Chi
AU - Chien, Cheng Chieh
AU - Lee, Chiang
PY - 2004
Y1 - 2004
N2 - As a tremendous number of web pages are added to the Internet every day, the World Wide Web (WWW) has become the most fertile database for retrieving information. However, an annoying problem in using the WWW is encountering a dead link, which returns a so-called HTTP 404 error message to the user to indicate that the desired home page is missing. If the desired home page has simply moved to a new location, then there should be a way to rediscover where the new site is and return the information to the user. A 404 Error RecoveRing (abbreviated as 404 Err) Server is under development to bring dead links back to life. The main idea of our design is to compare the old home page, saved in the 404 Err Server, with all the other home pages on the web to find the most similar ones and recommend them to the user. Therefore, the kernel technique of such a system is based on the concept of similarity retrieval of web documents. Past research on similarity retrieval has mainly considered only the text part of a web document, ignoring the design style (i.e., the layout) in the similarity comparison. But the style of a design could also be extremely valuable in making finer distinctions among documents that are similar in text. This paper presents the comparison technique of our 404 Err Server, which considers text information as well as style information. Experiments are conducted to demonstrate the feasibility and efficiency of the proposed method.
AB - As a tremendous number of web pages are added to the Internet every day, the World Wide Web (WWW) has become the most fertile database for retrieving information. However, an annoying problem in using the WWW is encountering a dead link, which returns a so-called HTTP 404 error message to the user to indicate that the desired home page is missing. If the desired home page has simply moved to a new location, then there should be a way to rediscover where the new site is and return the information to the user. A 404 Error RecoveRing (abbreviated as 404 Err) Server is under development to bring dead links back to life. The main idea of our design is to compare the old home page, saved in the 404 Err Server, with all the other home pages on the web to find the most similar ones and recommend them to the user. Therefore, the kernel technique of such a system is based on the concept of similarity retrieval of web documents. Past research on similarity retrieval has mainly considered only the text part of a web document, ignoring the design style (i.e., the layout) in the similarity comparison. But the style of a design could also be extremely valuable in making finer distinctions among documents that are similar in text. This paper presents the comparison technique of our 404 Err Server, which considers text information as well as style information. Experiments are conducted to demonstrate the feasibility and efficiency of the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=35048867363&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35048867363&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-24655-8_67
DO - 10.1007/978-3-540-24655-8_67
M3 - Article
AN - SCOPUS:35048867363
SN - 0302-9743
VL - 3007
SP - 620
EP - 629
JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -