In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. On the other hand, a discriminative measure is proposed for term selection and is combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus, called MSDN. The MSDN corpus is collected from an online news Website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system using a simple classifier achieved 95.31% accuracy. When using a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection, including the traditional term-word by kNN (88.52%); sleeping-experts (82.22%); sparse phrase by four-word sleeping-experts (86.34%); and Boolean combinations of words by RIPPER (87.54%). A proposed purification process can effectively reduce the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown word-based approach. In addition, more than 80% of automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from training data is relative to the classification accuracy in outside testing.
|Number of pages||31|
|Journal||ACM Transactions on Asian Language Information Processing|
|Publication status||Published - 2002 Mar 1|
All Science Journal Classification (ASJC) codes
- Computer Science(all)