EAGLE: Explicit Alternative Genome Likelihood Evaluator

Tony Kuo, Martin C. Frith, Jun Sese, Paul Horton

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Background: Reliable detection of genome variations, especially insertions and deletions (indels), from single-sample DNA sequencing data remains challenging, partly because of the inherent uncertainty involved in aligning sequencing reads to the reference genome. In practice, a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. It would therefore be desirable to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. to prioritize validation experiments, have had limited options.

Results: Here we present EAGLE, a method for evaluating the degree to which sequencing data support a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual's genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. In contrast to methods that rely heavily on one particular alignment of the reads to the reference genome, EAGLE readily accounts for uncertainties that may arise from multi-mapping or local misalignment, and it uses the entire length of each read. We compared the scores assigned by several well-known variant callers with those assigned by EAGLE for the task of ranking true variants among putative variants, on both simulated data and benchmarks based on real genome sequencing. For indels, EAGLE obtained marked improvement on simulated data and a whole genome sequencing benchmark, and modest but statistically significant improvement on an exome sequencing benchmark.

Conclusions: EAGLE ranked true variants higher than did the scores reported by the callers, and it can be used to improve specificity in variant calling. EAGLE is freely available at https://github.com/tony-kuo/eagle.
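The idea of scoring a candidate variant by comparing the probability of the reads under explicit genome hypotheses can be illustrated with a small sketch. This is not EAGLE's implementation; it is a toy model under stated assumptions: reads are placed without gaps, every start position is equally likely a priori, base errors are independent with probabilities derived from Phred qualities, and an erroneous base is equally likely to be any of the other three nucleotides. The function names `read_log_likelihood`, `apply_variant`, and `log_likelihood_ratio` are hypothetical and chosen only for this illustration.

```python
import math


def read_log_likelihood(read, quals, genome):
    """log P(read | genome), summed over all ungapped placements.

    Assumption: uniform prior over start positions, so no single
    alignment of the read is trusted.
    """
    n = len(read)
    starts = len(genome) - n + 1
    if starts <= 0:
        raise ValueError("genome hypothesis is shorter than the read")
    per_start = []
    for s in range(starts):
        logp = 0.0
        for base, q, ref in zip(read, quals, genome[s:s + n]):
            # Phred quality -> error probability, capped so logs stay finite.
            err = min(10 ** (-q / 10), 0.75)
            logp += math.log(1 - err) if base == ref else math.log(err / 3)
        per_start.append(logp)
    # log-sum-exp over placements, then apply the uniform prior.
    m = max(per_start)
    return m + math.log(sum(math.exp(x - m) for x in per_start)) - math.log(starts)


def apply_variant(ref, pos, ref_allele, alt_allele):
    """Build the alternative genome hypothesis (0-based pos, VCF-like alleles)."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def log_likelihood_ratio(reads, ref, pos, ref_allele, alt_allele):
    """Sum over reads of log P(read | alt genome) - log P(read | ref genome)."""
    alt = apply_variant(ref, pos, ref_allele, alt_allele)
    return sum(
        read_log_likelihood(r, q, alt) - read_log_likelihood(r, q, ref)
        for r, q in reads
    )


# Toy data (made up for illustration): a 2-base deletion, "AGG" -> "A".
REF = "GATTACAGGCTTAACCGT"
read_with_deletion = ("TACACTTAAC", [30] * 10)
read_from_reference = ("TACAGGCTTA", [30] * 10)

llr_alt = log_likelihood_ratio([read_with_deletion], REF, 6, "AGG", "A")
llr_ref = log_likelihood_ratio([read_from_reference], REF, 6, "AGG", "A")
```

In this toy setting, a read that spans the deletion yields a positive log-likelihood ratio (the data favor the alternative hypothesis), while a read drawn from the unmodified reference yields a negative one. Ranking candidate variants by such a ratio is the general kind of evaluation the abstract describes; the actual model used by EAGLE may differ in every detail.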

Original language: English
Article number: 28
Journal: BMC Medical Genomics
Volume: 11
DOI: 10.1186/s12920-018-0342-1
Publication status: Published - 2018 Apr 20

Fingerprint

Genome
Benchmarking
Uncertainty
Exome
Eagles
DNA Sequence Analysis

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)

Cite this

Kuo, Tony; Frith, Martin C.; Sese, Jun; Horton, Paul. / EAGLE: Explicit Alternative Genome Likelihood Evaluator. In: BMC Medical Genomics. 2018; Vol. 11.
@article{3ecdf7a63f9d4e47aa2a8e1e76a47d50,
title = "EAGLE: Explicit Alternative Genome Likelihood Evaluator",
author = "Tony Kuo and Frith, {Martin C.} and Jun Sese and Paul Horton",
year = "2018",
month = "4",
day = "20",
doi = "10.1186/s12920-018-0342-1",
language = "English",
volume = "11",
journal = "BMC Medical Genomics",
issn = "1755-8794",
publisher = "BioMed Central",

}

EAGLE: Explicit Alternative Genome Likelihood Evaluator. / Kuo, Tony; Frith, Martin C.; Sese, Jun; Horton, Paul.

In: BMC Medical Genomics, Vol. 11, 28, 20.04.2018.

TY - JOUR

T1 - EAGLE

T2 - Explicit Alternative Genome Likelihood Evaluator

AU - Kuo, Tony

AU - Frith, Martin C.

AU - Sese, Jun

AU - Horton, Paul

PY - 2018/4/20

Y1 - 2018/4/20

UR - http://www.scopus.com/inward/record.url?scp=85045839955&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045839955&partnerID=8YFLogxK

U2 - 10.1186/s12920-018-0342-1

DO - 10.1186/s12920-018-0342-1

M3 - Article

C2 - 29697369

AN - SCOPUS:85045839955

VL - 11

JO - BMC Medical Genomics

JF - BMC Medical Genomics

SN - 1755-8794

M1 - 28

ER -