TY - JOUR
T1 - Gene expression analysis of combined RNA-seq experiments using a receiver operating characteristic calibrated procedure
AU - Jeng, Shuen Lin
AU - Chi, Yung Chan
AU - Ma, Mi Chia
AU - Chan, Shi Huang
AU - Sun, H. Sunny
N1 - Publisher Copyright: © 2021
PY - 2021/8
Y1 - 2021/8
AB - Because of rapid advancements in sequencing technology, the experimental platforms of RNA-seq are updated frequently. It is quite common to combine data sets from several experimental platforms for analysis in order to increase the sample size and achieve more powerful tests for detecting differential gene expression. Data sets combined from different experimental platforms have a complex data distribution, which causes a major problem in statistical modeling as well as in multiple testing. Although many studies have addressed this problem by modeling the batch effects, there are no general and robust data-driven procedures for RNA-seq analysis. In this paper we propose a new robust procedure that combines the use of popular methods (packages) with a data-driven simulation (DDS). We construct the average receiver operating characteristic curve through the DDS to provide calibrated levels of significance for multiple testing. Instead of further modifying the adjusted p-values, we calibrate the levels of significance for each specific method and mean effect model. The procedure is demonstrated with several popular RNA-seq analysis methods (edgeR, DESeq2, limma+voom). The proposed procedure relaxes the stringent assumptions of data distributions for RNA-seq analysis methods and is illustrated using colorectal cancer studies from The Cancer Genome Atlas database.
UR - http://www.scopus.com/inward/record.url?scp=85106394925&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85106394925&partnerID=8YFLogxK
U2 - 10.1016/j.compbiolchem.2021.107515
DO - 10.1016/j.compbiolchem.2021.107515
M3 - Article
C2 - 34044204
AN - SCOPUS:85106394925
SN - 1476-9271
VL - 93
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
M1 - 107515
ER -