Using virtual samples to improve learning performance for small datasets with multimodal distributions

Der Chiang Li, Liang Sian Lin, Chien Chih Chen, Wei Hao Yu

Research output: Contribution to journal › Article

Abstract

A small dataset, one containing very few samples (at most thirty, as defined in traditional normal-distribution statistics), often makes it difficult for learning algorithms to make precise predictions. In past studies, many virtual sample generation (VSG) approaches have proved effective in overcoming this issue by adding virtual samples to training sets. However, some of these methods create samples from estimated sample distributions that they treat as unimodal, without considering that small data may actually follow multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise (DBSCAN) to cluster the small data and applies the AICc (the corrected version of the Akaike information criterion for small datasets) to assess the clustering results as an essential data pre-processing step. When the AICc indicates that the clusters adequately represent the dispersion of the small dataset, the sample distribution of each cluster is estimated with the maximal p value (MPV) method so as to represent a multimodal distribution; otherwise, the data are inferred to follow a unimodal distribution. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created, with a mechanism to evaluate suitable sample sizes. In the experiments, one real and two public datasets are examined, and bagging (bootstrap aggregating) is employed to build the models, which are support vector regressions with three kernel functions: linear, polynomial, and radial basis. The results show that the forecasting accuracies of MMPV are significantly better than those of MPV, a VSG method based on fuzzy C-means, and REAL (using the original training sets), according to most of the paired t test results.
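The pipeline the abstract describes can be sketched in a few lines: cluster a small dataset with DBSCAN, estimate a per-cluster distribution, draw virtual samples from each cluster, and train a bagged SVR on the augmented set. This is a minimal illustration, not the authors' implementation: a per-cluster Gaussian fit stands in for the paper's MPV estimate, the AICc check is omitted, and all parameter values (eps, the virtual-sample budget, the SVR settings) are illustrative assumptions.

```python
# Sketch of the VSG pipeline from the abstract: DBSCAN clustering, per-cluster
# distribution estimation, virtual sample creation, bagged SVR training.
# A Gaussian fit replaces the paper's MPV method purely as a placeholder.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# A toy small dataset (< 30 samples) with a bimodal input distribution.
x = np.concatenate([rng.normal(-2.0, 0.3, 12), rng.normal(2.0, 0.3, 12)])
y = np.sin(x) + rng.normal(0.0, 0.05, x.size)
X = x.reshape(-1, 1)

# Step 1: cluster the small data (DBSCAN needs no preset cluster count).
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X)

# Step 2: per cluster, fit a simple distribution and create virtual samples.
vx, vy = [], []
for k in set(labels) - {-1}:          # skip DBSCAN noise points
    cx, cy = x[labels == k], y[labels == k]
    n_virtual = 3 * cx.size           # assumed virtual-sample budget
    sx = rng.normal(cx.mean(), cx.std(ddof=1), n_virtual)
    # Label each virtual input with its nearest real neighbour's target
    # (a simple heuristic, not the paper's mechanism).
    sy = cy[np.abs(cx[None, :] - sx[:, None]).argmin(axis=1)]
    vx.append(sx)
    vy.append(sy)

X_aug = np.concatenate([x, *vx]).reshape(-1, 1)
y_aug = np.concatenate([y, *vy])

# Step 3: bagging over SVR base learners (one of the paper's three kernels).
model = BaggingRegressor(SVR(kernel="rbf", C=10.0),
                         n_estimators=10, random_state=0)
model.fit(X_aug, y_aug)
```

In practice the per-cluster distribution estimate (here a Gaussian) is the component the paper replaces with MPV, and the AICc would decide between the clustered (multimodal) and pooled (unimodal) estimates before any virtual samples are drawn.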

Original language: English
Pages (from-to): 11883-11900
Number of pages: 18
Journal: Soft Computing
Volume: 23
Issue number: 22
DOI: 10.1007/s00500-018-03744-z
Publication status: Published - 2019 Nov 1

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Geometry and Topology

Cite this

@article{5f7a0690b193467ba12950efb6a48d39,
title = "Using virtual samples to improve learning performance for small datasets with multimodal distributions",
abstract = "A small dataset that contains very few samples, a maximum of thirty as defined in traditional normal distribution statistics, often makes it difficult for learning algorithms to make precise predictions. In past studies, many virtual sample generation (VSG) approaches have been shown to be effective in overcoming this issue by adding virtual samples to training sets, with some methods creating samples based on their estimated sample distributions and directly treating the distributions as unimodal without considering that small data may actually present multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise to cluster small data and applies the AICc (the corrected version of the Akaike information criterion for small datasets) to assess clustering results as an essential procedure in data pre-processing. Once the AICc shows that the clusters are appropriate to present the data dispersion of small datasets, each of their sample distributions is estimated by using the maximal p value (MPV) method to present multimodal distributions; otherwise, all of the data is inferred as having unimodal distributions. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created with a mechanism to evaluate suitable sample sizes. In the experiments, one real and two public datasets are examined, and the bagging (bootstrap aggregating) procedure is employed to build the models, where the models are support vector regressions with three kernel functions: linear, polynomial, and radial basis. The results show that the forecasting accuracies of the MMPV are significantly better than those of MPV, a VSG method developed based on fuzzy C-means, and REAL (using original training sets), based on most of the statistical results of the paired t test.",
author = "Li, {Der Chiang} and Lin, {Liang Sian} and Chen, {Chien Chih} and Yu, {Wei Hao}",
year = "2019",
month = "11",
day = "1",
doi = "10.1007/s00500-018-03744-z",
language = "English",
volume = "23",
pages = "11883--11900",
journal = "Soft Computing",
issn = "1432-7643",
publisher = "Springer Verlag",
number = "22",
}

Using virtual samples to improve learning performance for small datasets with multimodal distributions. / Li, Der Chiang; Lin, Liang Sian; Chen, Chien Chih; Yu, Wei Hao.

In: Soft Computing, Vol. 23, No. 22, 01.11.2019, p. 11883-11900.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Using virtual samples to improve learning performance for small datasets with multimodal distributions

AU - Li, Der Chiang

AU - Lin, Liang Sian

AU - Chen, Chien Chih

AU - Yu, Wei Hao

PY - 2019/11/1

Y1 - 2019/11/1

N2 - A small dataset that contains very few samples, a maximum of thirty as defined in traditional normal distribution statistics, often makes it difficult for learning algorithms to make precise predictions. In past studies, many virtual sample generation (VSG) approaches have been shown to be effective in overcoming this issue by adding virtual samples to training sets, with some methods creating samples based on their estimated sample distributions and directly treating the distributions as unimodal without considering that small data may actually present multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise to cluster small data and applies the AICc (the corrected version of the Akaike information criterion for small datasets) to assess clustering results as an essential procedure in data pre-processing. Once the AICc shows that the clusters are appropriate to present the data dispersion of small datasets, each of their sample distributions is estimated by using the maximal p value (MPV) method to present multimodal distributions; otherwise, all of the data is inferred as having unimodal distributions. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created with a mechanism to evaluate suitable sample sizes. In the experiments, one real and two public datasets are examined, and the bagging (bootstrap aggregating) procedure is employed to build the models, where the models are support vector regressions with three kernel functions: linear, polynomial, and radial basis. The results show that the forecasting accuracies of the MMPV are significantly better than those of MPV, a VSG method developed based on fuzzy C-means, and REAL (using original training sets), based on most of the statistical results of the paired t test.

AB - A small dataset that contains very few samples, a maximum of thirty as defined in traditional normal distribution statistics, often makes it difficult for learning algorithms to make precise predictions. In past studies, many virtual sample generation (VSG) approaches have been shown to be effective in overcoming this issue by adding virtual samples to training sets, with some methods creating samples based on their estimated sample distributions and directly treating the distributions as unimodal without considering that small data may actually present multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise to cluster small data and applies the AICc (the corrected version of the Akaike information criterion for small datasets) to assess clustering results as an essential procedure in data pre-processing. Once the AICc shows that the clusters are appropriate to present the data dispersion of small datasets, each of their sample distributions is estimated by using the maximal p value (MPV) method to present multimodal distributions; otherwise, all of the data is inferred as having unimodal distributions. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created with a mechanism to evaluate suitable sample sizes. In the experiments, one real and two public datasets are examined, and the bagging (bootstrap aggregating) procedure is employed to build the models, where the models are support vector regressions with three kernel functions: linear, polynomial, and radial basis. The results show that the forecasting accuracies of the MMPV are significantly better than those of MPV, a VSG method developed based on fuzzy C-means, and REAL (using original training sets), based on most of the statistical results of the paired t test.

UR - http://www.scopus.com/inward/record.url?scp=85059700750&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059700750&partnerID=8YFLogxK

U2 - 10.1007/s00500-018-03744-z

DO - 10.1007/s00500-018-03744-z

M3 - Article

AN - SCOPUS:85059700750

VL - 23

SP - 11883

EP - 11900

JO - Soft Computing

JF - Soft Computing

SN - 1432-7643

IS - 22

ER -