Active learning with simultaneous subject and variable selections

Yuan-chin Ivan Chang, Ray-Bing Chen

Research output: Article

Abstract

Training data are essential for learning classification models. Therefore, when only a limited number of labeled subjects are available as training samples while a considerable amount of unlabeled data already exists, it is desirable to enlarge the training set by labeling more subjects in order to improve classification models. When labeling unlabeled subjects is costly in time and capital, it is crucial to know how many labeled subjects are necessary to train a satisfactory classification model. Although active learning methods can gradually recruit new unlabeled subjects and disclose their label information to enlarge the training set, the literature offers little discussion of the required training sample size. Hence, this paper studies when and how to appropriately stop an active learning procedure. Since active learning procedures recruit subjects sequentially, it is natural to adopt ideas from sequential analysis to dynamically and adaptively determine the training sample size. In this study, we propose a stopping criterion for a linear model-based active learning procedure such that, when stopped, the learning process asymptotically achieves its best possible empirical performance in terms of the area under the receiver operating characteristic (ROC) curve. Other statistical properties of the proposed procedure, including estimation consistency and variable selection, are also studied. Numerical results on both synthesized data and a real example are reported.
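
As an illustration of the general idea in the abstract, the following is a minimal Python sketch of pool-based active learning with a data-driven stopping rule. It is not the authors' procedure: the logistic-regression learner, the uncertainty-sampling query rule, the held-out validation set, and the "stop once the AUC gain stalls" criterion are all illustrative assumptions standing in for the paper's linear model-based learner and sequential-analysis stopping criterion.

```python
# Illustrative only -- not the authors' exact procedure. The learner,
# query rule, validation set, and stopping rule below are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def active_learn(X_pool, y_pool, X_val, y_val,
                 n_init=20, batch=5, tol=1e-3, patience=3, seed=0):
    rng = np.random.default_rng(seed)
    # Initial random training set (assumed to contain both classes).
    labeled = list(rng.choice(len(X_pool), size=n_init, replace=False))
    unlabeled = sorted(set(range(len(X_pool))) - set(labeled))
    auc_history, stalled = [], 0
    while True:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_pool[labeled], y_pool[labeled])
        auc_history.append(
            roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
        # Stopping rule: halt once the round-to-round AUC improvement
        # stays below `tol` for `patience` consecutive rounds.
        if len(auc_history) > 1 and auc_history[-1] - auc_history[-2] < tol:
            stalled += 1
        else:
            stalled = 0
        if stalled >= patience or not unlabeled:
            return model, auc_history
        # Query rule: recruit the `batch` unlabeled subjects whose
        # predicted class probabilities are closest to 0.5.
        probs = model.predict_proba(X_pool[unlabeled])[:, 1]
        picked = [unlabeled[i]
                  for i in np.argsort(np.abs(probs - 0.5))[:batch]]
        labeled.extend(picked)
        unlabeled = [i for i in unlabeled if i not in set(picked)]
```

Note that in the paper's setting the stopping decision is driven by sequential analysis of the accumulated labeled data and comes with asymptotic guarantees on the achieved AUC; the sketch above only mimics the overall control flow of recruit, refit, and test-for-stopping.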

Original language: English
Pages (from-to): 495-505
Number of pages: 11
Journal: Neurocomputing
Volume: 329
DOIs: 10.1016/j.neucom.2018.11.036
Publication status: Published - 2019 Feb 15

Fingerprint

Problem-Based Learning
Patient Selection
Labels
Learning
Sample Size
Labeling
ROC Curve
Linear Models
Economics

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence

Cite this

@article{cf382c9ac73b4ed2bb6bb15624650580,
title = "Active learning with simultaneous subject and variable selections",
abstract = "Training data are essential for learning classification models. Therefore, if only a limited number of labeled subjects are available for use as training samples, whereas a considerable amount of unlabeled data already exists, then it is always desirable enlarging the training set by labeling more subjects in order to ameliorate classification models. When it is costly in time and capital to label unlabeled subjects, it is crucial to know how many labeled subjects are necessary for training a satisfactory classification model. Although, active learning methods can gradually recruit new unlabeled subjects and disclose their label information to enlarge the size of the training set, there is a lack of discussion about the size of training samples in the literature. Hence, when/how to appropriately stop an active learning procedure is studied in this paper. Since the sequential subject recruiting strategy is used in active learning procedures, it is natural to adopt the idea of sequential analysis to dynamically and adaptively determine the training sample size for learning. In this study, we propose a stopping criterion for a linear model-based active learning procedure, such that this learning process will asymptotically achieve its best possible empirical performance, in terms of the area under receiver the operating characteristic curve (ROC), when the procedure is stopped. Other statistical properties of the proposed procedure, including estimation consistency and variable selection, are also studied. The numerical results using both synthesized and a real example are reported.",
author = "{Ivan Chang}, {Yuan chin} and Ray-Bing Chen",
year = "2019",
month = "2",
day = "15",
doi = "10.1016/j.neucom.2018.11.036",
language = "English",
volume = "329",
pages = "495--505",
journal = "Neurocomputing",
issn = "0925-2312",
publisher = "Elsevier",

}

Active learning with simultaneous subject and variable selections. / Ivan Chang, Yuan chin; Chen, Ray-Bing.

In: Neurocomputing, Vol. 329, 15.02.2019, p. 495-505.

Research output: Article

TY - JOUR

T1 - Active learning with simultaneous subject and variable selections

AU - Ivan Chang, Yuan chin

AU - Chen, Ray-Bing

PY - 2019/2/15

Y1 - 2019/2/15

AB - Training data are essential for learning classification models. Therefore, when only a limited number of labeled subjects are available as training samples while a considerable amount of unlabeled data already exists, it is desirable to enlarge the training set by labeling more subjects in order to improve classification models. When labeling unlabeled subjects is costly in time and capital, it is crucial to know how many labeled subjects are necessary to train a satisfactory classification model. Although active learning methods can gradually recruit new unlabeled subjects and disclose their label information to enlarge the training set, the literature offers little discussion of the required training sample size. Hence, this paper studies when and how to appropriately stop an active learning procedure. Since active learning procedures recruit subjects sequentially, it is natural to adopt ideas from sequential analysis to dynamically and adaptively determine the training sample size. In this study, we propose a stopping criterion for a linear model-based active learning procedure such that, when stopped, the learning process asymptotically achieves its best possible empirical performance in terms of the area under the receiver operating characteristic (ROC) curve. Other statistical properties of the proposed procedure, including estimation consistency and variable selection, are also studied. Numerical results on both synthesized data and a real example are reported.

UR - http://www.scopus.com/inward/record.url?scp=85056722957&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85056722957&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2018.11.036

DO - 10.1016/j.neucom.2018.11.036

M3 - Article

AN - SCOPUS:85056722957

VL - 329

SP - 495

EP - 505

JO - Neurocomputing

JF - Neurocomputing

SN - 0925-2312

ER -