Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology

Yu Sheng Lai, Chung Hsien Wu

Research output: Article

32 Citations (Scopus)


In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. For term selection, a discriminative measure is proposed and combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus called MSDN, collected from an online news website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system achieved 95.31% accuracy using a simple classifier. With a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection: the traditional term-word approach with kNN (88.52%), sleeping-experts (82.22%), sparse phrases with four-word sleeping-experts (86.34%), and Boolean combinations of words with RIPPER (87.54%). The proposed purification process effectively reduces the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown-word-based approach. In addition, more than 80% of the automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from the training data is related to the classification accuracy in outside testing.
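As a rough illustration of the kNN classification step mentioned in the abstract, the sketch below classifies a document represented as a bag of extracted terms. It is a generic cosine-similarity kNN, not the authors' system: the PLU-based likelihood ratio and the discriminative measure are not defined in the abstract and are not reproduced here. The function names, the toy terms, and the category labels are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(doc_terms, training, k=3):
    """Label a document by similarity-weighted vote of its k nearest neighbours.

    training is a list of (terms, label) pairs; terms is a list of strings.
    """
    query = Counter(doc_terms)
    scored = sorted(
        ((cosine(query, Counter(terms)), label) for terms, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]

# Hypothetical toy training set; real input would be the extracted terms.
TRAINING = [
    (["stock", "market", "shares"], "finance"),
    (["bank", "interest", "market"], "finance"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
]

print(knn_classify(["market", "shares", "bank"], TRAINING, k=3))
```

In a setup like the one the abstract describes, the term lists would presumably come from the term extraction and selection stages, so a smaller and cleaner vocabulary directly shrinks the vectors compared here.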

Pages (from - to): 34-64
Journal: ACM Transactions on Asian Language Information Processing
Publication status: Published - 1 March 2002

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
