Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology

Yu Sheng Lai, Chung Hsien Wu

Research output: Contribution to journal › Article

32 Citations (Scopus)

Abstract

In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. For term selection, a discriminative measure is proposed and combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus called MSDN, collected from an online news website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system using a simple classifier achieved 95.31% accuracy. With a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection, including the traditional term-word by kNN (88.52%), sleeping-experts (82.22%), sparse phrase by four-word sleeping-experts (86.34%), and Boolean combinations of words by RIPPER (87.54%). A proposed purification process effectively reduces the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown word-based approach. In addition, more than 80% of automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from training data is related to the classification accuracy in outside testing.

Original language: English
Pages (from-to): 34-64
Number of pages: 31
Journal: ACM Transactions on Asian Language Information Processing
Volume: 1
Issue number: 1
Publication status: Published - 2002 Mar 1

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
