Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates

Research output: Contribution to journal › Article › peer-review

Abstract

Huge data sets are now widely available, and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion, called J-optimality, that builds upon a popular optimal selection criterion, which minimizes the Random-X prediction error, by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsample selection problem to that of finding an optimal approximate design under a convex criterion, for which analytical tools for finding and studying such designs are already available. Consequently, the J-optimal subsampling method comes with theoretical results and theory-based algorithms for finding J-optimal subsamples. Simulation results and real data analysis show that our proposed methods outperform current subsampling methods, and that the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.

Original language: English
Pages (from-to): 2601-2614
Number of pages: 14
Journal: IEEE Transactions on Big Data
Volume: 11
Issue number: 5
Publication status: Published - 2025

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Information Systems and Management
