TY - JOUR
T1 - Sample size calculation based on generalized linear models for differential expression analysis in RNA-seq data
AU - Li, Chung I.
AU - Shyr, Yu
N1 - Publisher Copyright: © 2016 Walter de Gruyter GmbH, Berlin/Boston.
PY - 2016/12/1
Y1 - 2016/12/1
AB - As RNA-seq rapidly develops and costs continually decrease, the quantity and frequency of samples being sequenced will grow exponentially. With proteomic investigations becoming more multivariate and quantitative, determining a study's optimal sample size is now a vital step in experimental design. Current methods for calculating a study's required sample size are mostly based on the hypothesis testing framework, which assumes each gene count can be modeled through Poisson or negative binomial distributions; however, these methods are limited when it comes to accommodating covariates. To address this limitation, we propose an estimating procedure based on the generalized linear model. This easy-to-use method constructs a representative exemplary dataset and estimates the conditional power, all without requiring complicated mathematical approximations or formulas. Even more attractive, the downstream analysis can be performed with current R/Bioconductor packages. To demonstrate the practicability and efficiency of this method, we apply it to three real-world studies, and introduce our on-line calculator developed to determine the optimal sample size for an RNA-seq study.
UR - http://www.scopus.com/inward/record.url?scp=84999264563&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84999264563&partnerID=8YFLogxK
U2 - 10.1515/sagmb-2016-0008
DO - 10.1515/sagmb-2016-0008
M3 - Article
C2 - 27866174
AN - SCOPUS:84999264563
SN - 1544-6115
VL - 15
SP - 491
EP - 505
JO - Statistical Applications in Genetics and Molecular Biology
JF - Statistical Applications in Genetics and Molecular Biology
IS - 6
ER -