Generalized Dirichlet priors for Naïve Bayesian classifiers with multinomial models in document classification

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

The generalized Dirichlet distribution has been shown to be a more appropriate prior than the Dirichlet distribution for naïve Bayesian classifiers. When the dimension of a generalized Dirichlet random vector is large, however, calculating the expected value of a random variable can be computationally expensive. In document classification, the dimension of the prior equals the number of distinct words, which generally exceeds ten thousand, so generalized Dirichlet priors can be impractical from the viewpoint of computational efficiency. In this paper, some properties of the generalized Dirichlet distribution are established to accelerate the calculation of the expected values of random variables. Those properties are then used to construct noninformative generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models. Experimental results on two document sets show that generalized Dirichlet priors can achieve significantly higher prediction accuracy while preserving the computational efficiency of naïve Bayesian classifiers.
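As a rough illustration of why expected values under a generalized Dirichlet prior need not be expensive, the sketch below computes all component means in a single pass. It relies on the standard Connor-Mosimann construction, in which E[X_j] = alpha_j / (alpha_j + beta_j) times the product of beta_m / (alpha_m + beta_m) over m < j. This is not taken from the paper and is not claimed to be the properties the paper establishes; the function names `generalized_dirichlet_means` and `dirichlet_as_generalized` are hypothetical and chosen only for this example.

```python
def generalized_dirichlet_means(alpha, beta):
    """Means E[X_1..X_k] of a generalized Dirichlet vector (illustrative sketch).

    Assumes the Connor-Mosimann parameterization with positive alpha_j, beta_j:
    E[X_j] = alpha_j / (alpha_j + beta_j) * prod_{m<j} beta_m / (alpha_m + beta_m).
    A running product keeps the total cost linear in the dimension k.
    """
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same length")
    means = []
    remaining = 1.0  # product of beta_m / (alpha_m + beta_m) for all m < j
    for a, b in zip(alpha, beta):
        means.append(remaining * a / (a + b))
        remaining *= b / (a + b)
    return means


def dirichlet_as_generalized(alpha_full):
    """Express a Dirichlet(alpha_1..alpha_{k+1}) as generalized Dirichlet parameters.

    The Dirichlet is the special case beta_j = alpha_{j+1} + ... + alpha_{k+1}.
    """
    alpha = list(alpha_full[:-1])
    beta = []
    tail = sum(alpha_full)
    for a in alpha:
        tail -= a  # sum of the parameters that come after the current one
        beta.append(tail)
    return alpha, beta


if __name__ == "__main__":
    # Sanity check: the Dirichlet special case should give alpha_j / sum(alpha).
    alpha, beta = dirichlet_as_generalized([1.0, 2.0, 3.0, 4.0])
    print(generalized_dirichlet_means(alpha, beta))
```

The Dirichlet special case is used here only as a sanity check, because its means are known in closed form; for a vocabulary of more than ten thousand words the same single loop applies unchanged.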

Original language: English
Pages (from-to): 123-144
Number of pages: 22
Journal: Data Mining and Knowledge Discovery
Volume: 28
Issue number: 1
DOI: 10.1007/s10618-012-0296-4
Publication status: Published - 2014 Jan 1

Fingerprint

Classifiers
Computational efficiency
Random variables

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

@article{b11c2a5e4e13478ebb9b5a2588c3a02c,
title = "Generalized Dirichlet priors for Na{\"i}ve Bayesian classifiers with multinomial models in document classification",
abstract = "The generalized Dirichlet distribution has been shown to be a more appropriate prior than the Dirichlet distribution for na{\"i}ve Bayesian classifiers. When the dimension of a generalized Dirichlet random vector is large, the computational effort for calculating the expected value of a random variable can be high. In document classification, the number of distinct words that is the dimension of a prior for na{\"i}ve Bayesian classifiers is generally more than ten thousand. Generalized Dirichlet priors can therefore be inapplicable for document classification from the viewpoint of computational efficiency. In this paper, some properties of the generalized Dirichlet distribution are established to accelerate the calculation of the expected values of random variables. Those properties are then used to construct noninformative generalized Dirichlet priors for na{\"i}ve Bayesian classifiers with multinomial models. Our experimental results on two document sets show that generalized Dirichlet priors can achieve a significantly higher prediction accuracy and that the computational efficiency of na{\"i}ve Bayesian classifiers is preserved.",
author = "Tzu-Tsung Wong",
year = "2014",
month = "1",
day = "1",
doi = "10.1007/s10618-012-0296-4",
language = "English",
volume = "28",
pages = "123--144",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "1",
}

TY - JOUR

T1 - Generalized Dirichlet priors for Naïve Bayesian classifiers with multinomial models in document classification

AU - Wong, Tzu-Tsung

PY - 2014/1/1

Y1 - 2014/1/1

N2 - The generalized Dirichlet distribution has been shown to be a more appropriate prior than the Dirichlet distribution for naïve Bayesian classifiers. When the dimension of a generalized Dirichlet random vector is large, the computational effort for calculating the expected value of a random variable can be high. In document classification, the number of distinct words that is the dimension of a prior for naïve Bayesian classifiers is generally more than ten thousand. Generalized Dirichlet priors can therefore be inapplicable for document classification from the viewpoint of computational efficiency. In this paper, some properties of the generalized Dirichlet distribution are established to accelerate the calculation of the expected values of random variables. Those properties are then used to construct noninformative generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models. Our experimental results on two document sets show that generalized Dirichlet priors can achieve a significantly higher prediction accuracy and that the computational efficiency of naïve Bayesian classifiers is preserved.

UR - http://www.scopus.com/inward/record.url?scp=84891881106&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84891881106&partnerID=8YFLogxK

U2 - 10.1007/s10618-012-0296-4

DO - 10.1007/s10618-012-0296-4

M3 - Article

VL - 28

SP - 123

EP - 144

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 1

ER -