TY - GEN
T1 - The mining and extraction of primary informative blocks and data objects from systematic Web pages
AU - Tseng, Yi Feng
AU - Kao, Hung Yu
PY - 2007
Y1 - 2007
AB - With the rapid development of the Internet, the Web has become an enormous database containing abundant information. Most Web pages present their content as a list of objects, such as search engine results or product listings on shopping Web sites, and these objects form the primary information of each page. In this paper, we focus on mining this primary information and the object groups that constitute it. The system is divided into three major phases: (1) By transforming each Web page into a corresponding tree structure, our system can visit all regions of the page efficiently and detect the informative parts. (2) We design and quantify several novel features according to the characteristics of the regions of a Web page. (3) We propose a weighting model that calculates the degree of importance of each region, and then extract the primary information of the Web page. The experimental results show that our system can be applied to a large number of Web pages with different themes and styles to find the correct primary information and the list of corresponding objects.
UR - http://www.scopus.com/inward/record.url?scp=42549131779&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=42549131779&partnerID=8YFLogxK
U2 - 10.1109/WI.2006.167
DO - 10.1109/WI.2006.167
M3 - Conference contribution
AN - SCOPUS:42549131779
SN - 0769527477
SN - 9780769527475
T3 - Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06
SP - 370
EP - 373
BT - Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), WI'06
T2 - 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI'06
Y2 - 18 December 2006 through 22 December 2006
ER -