TY - JOUR
T1 - A learning method for the class imbalance problem with medical data sets
AU - Li, Der Chiang
AU - Liu, Chiao Wen
AU - Hu, Susan C.
PY - 2010/5
Y1 - 2010/5
N2 - In medical data sets, data are predominantly composed of "normal" samples with only a small percentage of "abnormal" ones, leading to the so-called class imbalance problem. In class imbalance problems, feeding all the data into the classifier to build the learning model usually biases learning toward the majority class. To deal with this, this paper uses a strategy that over-samples the minority class and under-samples the majority class to balance the data sets. For the majority class, this paper builds a Gaussian-type fuzzy membership function and applies an α-cut to reduce the data size; for the minority class, it uses the mega-trend diffusion membership function to generate virtual samples. Furthermore, after balancing the class sizes, this paper extends the data attributes into a higher-dimensional space using classification-related information to enhance classification accuracy. Two medical data sets, Pima Indians' diabetes and BUPA liver disorders, are employed to illustrate the approach. The results indicate that the proposed method achieves better classification performance than SVM, the C4.5 decision tree, and the methods of two other studies.
AB - In medical data sets, data are predominantly composed of "normal" samples with only a small percentage of "abnormal" ones, leading to the so-called class imbalance problem. In class imbalance problems, feeding all the data into the classifier to build the learning model usually biases learning toward the majority class. To deal with this, this paper uses a strategy that over-samples the minority class and under-samples the majority class to balance the data sets. For the majority class, this paper builds a Gaussian-type fuzzy membership function and applies an α-cut to reduce the data size; for the minority class, it uses the mega-trend diffusion membership function to generate virtual samples. Furthermore, after balancing the class sizes, this paper extends the data attributes into a higher-dimensional space using classification-related information to enhance classification accuracy. Two medical data sets, Pima Indians' diabetes and BUPA liver disorders, are employed to illustrate the approach. The results indicate that the proposed method achieves better classification performance than SVM, the C4.5 decision tree, and the methods of two other studies.
UR - http://www.scopus.com/inward/record.url?scp=77952554315&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952554315&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2010.03.005
DO - 10.1016/j.compbiomed.2010.03.005
M3 - Article
C2 - 20347072
AN - SCOPUS:77952554315
SN - 0010-4825
VL - 40
SP - 509
EP - 518
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
IS - 5
ER -