TY - JOUR
T1 - DupChecker
T2 - A Bioconductor package for checking high-throughput genomic data redundancy in meta-analysis
AU - Sheng, Quanhu
AU - Shyr, Yu
AU - Chen, Xi
N1 - Funding Information:
This research was supported by NIH grants as follows: CA158472 (to QS and XC).
Publisher Copyright:
© 2014 Sheng et al.; licensee BioMed Central Ltd.
PY - 2014/9/30
Y1 - 2014/9/30
N2 - Background: Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, duplication of samples is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, or model over-fitting in the subsequent data analysis. Results: We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example demonstrates the usage and output of the package. Conclusions: Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination can make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step.
AB - Background: Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, duplication of samples is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, or model over-fitting in the subsequent data analysis. Results: We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example demonstrates the usage and output of the package. Conclusions: Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination can make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step.
UR - http://www.scopus.com/inward/record.url?scp=84908059445&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84908059445&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-15-323
DO - 10.1186/1471-2105-15-323
M3 - Article
C2 - 25267467
AN - SCOPUS:84908059445
SN - 1471-2105
VL - 15
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 323
ER -