TY - GEN
T1 - Using category-based adherence to cluster market-basket data
AU - Yun, Ching Huang
AU - Chuang, Kun Ta
AU - Chen, Ming Syan
PY - 2002/12/1
Y1 - 2002/12/1
N2 - In this paper, we devise an efficient algorithm for clustering market-basket data. Unlike traditional data, market-basket data are characterized by high dimensionality, sparsity, and massive outliers. Because they do not explicitly consider the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing only with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for representing the results of clustering techniques for market-basket data. In view of these features, we devise a novel measurement, called the category-based adherence, and use it to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest large node in the taxonomy tree, where a large node is an item (i.e., leaf) or a category (i.e., internal) node whose occurrence count exceeds a given threshold. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. With this measurement, we develop an efficient clustering algorithm, called algorithm CBA (standing for Category-Based Adherence), for market-basket data with the objective of minimizing the category-based adherence. A validation model based on Information Gain (IG) is also devised to assess the quality of clustering for market-basket data. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm CBA significantly outperforms prior works in both execution efficiency and clustering quality for market-basket data.
UR - http://www.scopus.com/inward/record.url?scp=13444303743&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=13444303743&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:13444303743
SN - 0769517544
SN - 9780769517544
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 546
EP - 553
BT - Proceedings - 2002 IEEE International Conference on Data Mining, ICDM 2002
T2 - 2nd IEEE International Conference on Data Mining, ICDM '02
Y2 - 9 December 2002 through 12 December 2002
ER -