TY - JOUR
T1 - Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates
AU - Cia-Mina, Alvaro
AU - Lopez-Fidalgo, Jesus
AU - Wong, Weng Kee
N1 - Publisher Copyright: © 2025 The Authors.
PY - 2025
Y1 - 2025
AB - Huge data sets are now widely available, and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion, called J-optimality, that builds upon a popular optimal selection criterion, which minimizes the Random-X prediction error, by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsampling selection problem to that of finding an optimal approximate design under a convex criterion, for which analytical tools for finding and studying such designs are already available. Consequently, the J-optimal subsampling method comes with theoretical results and theory-based algorithms for finding J-optimal subsamples. Simulation results and real data analysis show that our proposed methods outperform current subsampling methods and that the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.
UR - https://www.scopus.com/pages/publications/105000467342
U2 - 10.1109/TBDATA.2025.3552343
DO - 10.1109/TBDATA.2025.3552343
M3 - Article
AN - SCOPUS:105000467342
SN - 2332-7790
VL - 11
SP - 2601
EP - 2614
JO - IEEE Transactions on Big Data
JF - IEEE Transactions on Big Data
IS - 5
ER -