A similarity measure for text classification and clustering

Yung Shen Lin, Jung Yi Jiang, Shie Jue Lee

Research output: Article

151 Citations (Scopus)

Abstract

Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. For the first case, the similarity increases as the difference between the two involved feature values decreases. Furthermore, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.
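
The three cases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formula, so the Gaussian-shaped scaling (one reading of "normally scaled"), the parameters `sigma` and `penalty`, the averaging over counted features, and the final rescaling to [0, 1] are all assumptions made here for illustration.

```python
import math

def feature_contribution(x, y, sigma=1.0, penalty=1.0):
    """Return (contribution, counted) for one feature of two documents.

    sigma and penalty are hypothetical parameters, not values from the paper.
    """
    if x != 0 and y != 0:
        # Case a: feature in both documents. The contribution grows as the
        # difference shrinks, scaled here by a Gaussian-shaped function
        # (an assumed reading of "normally scaled").
        return 0.5 * (1.0 + math.exp(-((x - y) / sigma) ** 2)), 1
    if x == 0 and y == 0:
        # Case c: feature in neither document. No contribution, not counted.
        return 0.0, 0
    # Case b: feature in only one document. A fixed value is contributed;
    # this sketch assumes it is a negative penalty.
    return -penalty, 1

def similarity(d1, d2, sigma=1.0, penalty=1.0):
    """Average the per-feature contributions and rescale to [0, 1]."""
    total, counted = 0.0, 0
    for x, y in zip(d1, d2):
        contribution, n = feature_contribution(x, y, sigma, penalty)
        total += contribution
        counted += n
    if counted == 0:
        # Both documents are empty; returning 0.0 is an arbitrary choice.
        return 0.0
    raw = total / counted  # lies in [-penalty, 1]
    return (raw + penalty) / (1.0 + penalty)

print(similarity([1, 0, 2], [1, 0, 2]))  # identical documents
print(similarity([1, 0], [0, 1]))        # no shared features
```

Under these assumptions, identical documents score 1.0 and documents with no shared features score 0.0; features absent from both documents affect neither the numerator nor the count.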

Original language: English
Article number: 6420834
Pages (from-to): 1575-1590
Number of pages: 16
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 26
Issue number: 7
Publication status: Published - July 2014

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
