Naive Bayesian classifiers with multinomial models and noninformative generalized Dirichlet priors for rRNA taxonomy assignment

  • 劉 冠良

Student thesis: Doctoral Thesis


The introduction of next-generation sequencing (NGS) has created a major revolution in biological ecology. Direct sequencing of hypervariable regions from rRNA genes can provide rapid and inexpensive analysis of ecological communities. To gain a deep understanding of these data, the Ribosomal Database Project developed the 'RDP Classifier', which uses 8-mer nucleotide frequencies with Bayes' theorem to assign taxonomic affiliation. This classifier is computationally efficient and works well with massive numbers of short sequences. However, the binary model employed in the RDP classifier does not consider repeated 8-mers within each reference sequence. Previous studies have pointed out that a multinomial model usually yields better performance than a binary model. In this research, we present naïve Bayesian classifiers with multinomial models that take repeated 8-mers into account when classifying rRNA sequences. The results were compared with those obtained from the binomial RDP classifier on 250-bp, 400-bp, 800-bp, and full-length reads to demonstrate that the multinomial approach can generally achieve higher predictive accuracy. The number of instances for a specific class value in an rRNA sequence set can be fewer than ten. In such a case, allowing different confidence levels on the features in a noninformative prior has the potential to improve the performance of the naïve Bayesian classifier. This study further develops a method to determine the best noninformative generalized Dirichlet priors for a naïve Bayesian classifier with multinomial models. The experimental results demonstrate that it can outperform the RDP classifier at all ranks, and also suggest that the number of groups has a positive impact on the performance of the multinomial naïve Bayesian classifier.
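The multinomial idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`kmer_counts`, `train`, `classify`), the uniform class prior, and the symmetric pseudo-count `alpha` (a plain Dirichlet prior, standing in for the generalized Dirichlet priors the thesis studies) are all assumptions made for illustration. The point it shows is that repeated k-mers contribute their full counts, where a binary model would record presence only.

```python
import math
from collections import Counter


def kmer_counts(seq, k=8):
    """Count every overlapping k-mer, keeping repeats (multinomial features)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(references, k=8, alpha=1.0):
    """Fit per-taxon k-mer log-probabilities.

    references: iterable of (taxon, sequence) pairs.
    alpha: symmetric pseudo-count (an assumed stand-in for the thesis priors).
    """
    totals = {}
    for taxon, seq in references:
        totals.setdefault(taxon, Counter()).update(kmer_counts(seq, k))
    vocab = {kmer for counts in totals.values() for kmer in counts}
    model = {}
    for taxon, counts in totals.items():
        # One extra smoothing slot is reserved for k-mers never seen in training.
        denom = sum(counts.values()) + alpha * (len(vocab) + 1)
        log_p = {kmer: math.log((counts[kmer] + alpha) / denom) for kmer in vocab}
        model[taxon] = (log_p, math.log(alpha / denom))
    return model


def classify(model, query, k=8):
    """Return the taxon with the highest multinomial log-likelihood.

    A uniform class prior is assumed, so it drops out of the comparison.
    """
    query_counts = kmer_counts(query, k)

    def score(taxon):
        log_p, unseen = model[taxon]
        # Each k-mer is weighted by how many times it occurs in the query.
        return sum(n * log_p.get(kmer, unseen) for kmer, n in query_counts.items())

    return max(model, key=score)


# Toy usage with made-up sequences and k=3 to keep the example small.
toy_refs = [("GenusA", "ACGACGACGACG"), ("GenusB", "TTGCATTGCATTGCA")]
toy_model = train(toy_refs, k=3)
print(classify(toy_model, "ACGACGA", k=3))
```

Replacing the single `alpha` with feature-specific prior parameters is where a generalized Dirichlet prior would enter; that step is not sketched here.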
Date of Award: 2015 Feb 2
Original language: English
Supervisor: Tzu-Tsung Wong
