Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology

Yu Sheng Lai, Chung Hsien Wu

Research output: Article

32 Citations (Scopus)

Abstract

In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. A discriminative measure is also proposed for term selection and is combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus called MSDN, which was collected from an online news website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system using a simple classifier achieved 95.31% accuracy. When using a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection, including the traditional term-word by kNN (88.52%); sleeping-experts (82.22%); sparse phrase by four-word sleeping-experts (86.34%); and Boolean combinations of words by RIPPER (87.54%). The proposed purification process effectively reduces the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown word-based approach. In addition, more than 80% of the automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from training data is related to the classification accuracy in outside testing.

Original language: English
Pages (from-to): 34-64
Number of pages: 31
Journal: ACM Transactions on Asian Language Information Processing
Volume: 1
Issue number: 1
Publication status: Published - March 1, 2002

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
