The data complexity index to construct an efficient cross-validation method

Der-Chiang Li, Yao Hwei Fang, Y. M. Frank Fang

Research output: Contribution to journal (Article)

9 Citations (Scopus)

Abstract

Cross-validation is a widely used model evaluation method in data mining applications. However, determining appropriate parameter values, such as the training data size and the number of experiment runs, usually takes considerable effort before a valid evaluation can be carried out. This study develops an efficient cross-validation method, called Complexity-based Efficient (CBE) cross-validation, for binary classification problems. CBE cross-validation establishes a complexity index, called the CBE index, by exploring the geometric structure and noise of the data. The CBE index is used to calculate the optimal training data size and the number of experiment runs, reducing model evaluation time for computationally expensive classification data sets. One simulated and three real data sets are used to validate the proposed method, which is compared with repeated random sub-sampling validation and K-fold cross-validation. The results show that the three methods have similar validation performance, but CBE cross-validation requires less training time than the other two methods.
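The two baseline methods named in the abstract can be illustrated with a minimal Python sketch. It does not implement the CBE index, which the abstract does not define. The synthetic two-class data, the nearest-centroid classifier, and every function and parameter name below are assumptions made for illustration only. The sketch shows where the two tuning parameters mentioned in the abstract (training data size and number of runs) enter each procedure.

```python
import random
import time


def make_data(n, noise, rng):
    """Hypothetical binary data: two 2-D Gaussian blobs, `noise` = fraction of flipped labels."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        center = 2.0 if label == 1 else -2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        if rng.random() < noise:
            label = 1 - label  # inject label noise
        data.append((point, label))
    return data


def fit_centroids(train):
    """Stand-in classifier (an assumption): store the mean point of each class."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items() if s[2] > 0}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid matches their label."""
    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    hits = 0
    for point, label in test:
        predicted = min(centroids, key=lambda c: sq_dist(point, centroids[c]))
        hits += predicted == label
    return hits / len(test)


def k_fold(data, k, rng):
    """K-fold cross-validation: k runs, each sample is held out exactly once."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [data[j] for j in idx if j not in held_out]
        test = [data[j] for j in fold]
        scores.append(accuracy(fit_centroids(train), test))
    return scores


def random_subsampling(data, train_size, runs, rng):
    """Repeated random sub-sampling: `runs` independent random train/test splits."""
    idx = list(range(len(data)))
    scores = []
    for _ in range(runs):
        rng.shuffle(idx)
        train = [data[j] for j in idx[:train_size]]
        test = [data[j] for j in idx[train_size:]]
        scores.append(accuracy(fit_centroids(train), test))
    return scores


def timed(fn, *args):
    """Return (mean score, elapsed seconds) for one validation procedure."""
    start = time.perf_counter()
    scores = fn(*args)
    return sum(scores) / len(scores), time.perf_counter() - start


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(2000, 0.05, rng)
    print("10-fold CV (mean accuracy, seconds):", timed(k_fold, data, 10, rng))
    print("sub-sampling (mean accuracy, seconds):", timed(random_subsampling, data, 500, 5, rng))
```

In K-fold cross-validation the training size is fixed by `k`, whereas repeated random sub-sampling exposes both `train_size` and `runs` directly. According to the abstract, these are the two quantities that the CBE index is used to choose.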

Original language: English
Pages (from-to): 93-102
Number of pages: 10
Journal: Decision Support Systems
Volume: 50
Issue number: 1
DOIs: 10.1016/j.dss.2010.07.005
Publication status: Published - 2010 Dec 1

Fingerprint

Sampling
Data mining
Experiments
Validation Studies
Noise
Cross-validation
Datasets
Subsampling
Model Evaluation
Experiment
Fold

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Information Systems
  • Developmental and Educational Psychology
  • Arts and Humanities (miscellaneous)
  • Information Systems and Management

Cite this

Li, Der-Chiang ; Fang, Yao Hwei ; Fang, Y. M. Frank. / The data complexity index to construct an efficient cross-validation method. In: Decision Support Systems. 2010 ; Vol. 50, No. 1. pp. 93-102.
@article{8dd2e126df5e4b41b7b6d507c7270ac5,
title = "The data complexity index to construct an efficient cross-validation method",
author = "Der-Chiang Li and Fang, {Yao Hwei} and Fang, {Y. M. Frank}",
year = "2010",
month = "12",
day = "1",
doi = "10.1016/j.dss.2010.07.005",
language = "English",
volume = "50",
pages = "93--102",
journal = "Decision Support Systems",
issn = "0167-9236",
publisher = "Elsevier",
number = "1",

}

TY - JOUR

T1 - The data complexity index to construct an efficient cross-validation method

AU - Li, Der-Chiang

AU - Fang, Yao Hwei

AU - Fang, Y. M. Frank

PY - 2010/12/1

Y1 - 2010/12/1

UR - http://www.scopus.com/inward/record.url?scp=78049484286&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78049484286&partnerID=8YFLogxK

U2 - 10.1016/j.dss.2010.07.005

DO - 10.1016/j.dss.2010.07.005

M3 - Article

VL - 50

SP - 93

EP - 102

JO - Decision Support Systems

JF - Decision Support Systems

SN - 0167-9236

IS - 1

ER -