Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates

Research output: Article, peer-reviewed

Abstract

Huge data sets are now widely available, and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion, called J-optimality, that builds upon a popular optimal selection criterion, which minimizes the Random-X prediction error, by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that the subsample selection problem can be related to that of finding an optimal approximate design under a convex criterion, a setting in which analytical tools for finding and studying such designs are already available. Consequently, the J-optimal subsampling method comes with theoretical results and theory-based algorithms for finding J-optimal subsamples. Simulation results and real data analysis show that our proposed methods outperform current subsampling methods, and that the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.
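The abstract does not define J-optimality, so the following is only a minimal sketch of the general idea behind a Random-X prediction criterion, under the assumption of an ordinary least squares linear model. Under that assumption, the prediction variance of a model fitted on a subsample, averaged over the covariate distribution, is proportional to trace(M_sub^{-1} M_full), where M_sub is the subsample information matrix and M_full is the second-moment matrix of the covariates. The function name `random_x_criterion` and the simulated data are hypothetical and are not taken from the paper.

```python
import numpy as np

def random_x_criterion(X_sub, X_full):
    """Illustrative criterion (an assumption, not the paper's J-optimality).

    Returns trace(M_sub^{-1} M_full), which is proportional (by sigma^2 / k)
    to the OLS prediction variance of a fit on X_sub, averaged over the
    empirical covariate distribution of X_full. Smaller is better.
    """
    k = X_sub.shape[0]
    n = X_full.shape[0]
    M_sub = X_sub.T @ X_sub / k      # normalized information matrix of the subsample
    M_full = X_full.T @ X_full / n   # empirical second-moment matrix of the covariates
    # Solve M_sub @ Z = M_full instead of forming the inverse explicitly.
    return float(np.trace(np.linalg.solve(M_sub, M_full)))

# Simulated covariates (hypothetical data): intercept plus heavy-tailed features.
rng = np.random.default_rng(0)
n, p, k = 5000, 4, 200
X_full = np.column_stack([np.ones(n), rng.standard_t(df=5, size=(n, p - 1))])

# Baseline: a uniformly random subsample of size k.
idx = rng.choice(n, size=k, replace=False)
value_random = random_x_criterion(X_full[idx], X_full)

# Sanity check: using the full data as the "subsample" gives exactly p.
value_full = random_x_criterion(X_full, X_full)
print(value_random, value_full)
```

A subsampling method in this spirit would search for the k rows that make such a criterion small, rather than drawing them at random; the uniform draw above serves only as a baseline for comparison.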

Original language: English
Pages (from - to): 2601-2614
Number of pages: 14
Journal: IEEE Transactions on Big Data
Volume: 11
Issue number: 5
Publication status: Published - 2025

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Information Systems and Management
