A similarity measure for text classification and clustering

Yung Shen Lin, Jung Yi Jiang, Shie Jue Lee

Research output: Contribution to journal › Article

131 Citations (Scopus)

Abstract

Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. For the first case, the similarity increases as the difference between the two involved feature values decreases. Furthermore, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.
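
The three cases in the abstract can be sketched in Python as below. The abstract gives no formulas, so the Gaussian form used for the "normally scaled" difference, the per-feature spread `sigma`, the fixed penalty `lam`, and the final rescaling to [0, 1] are assumptions made for illustration, and the function names are hypothetical. Feature values are assumed to be non-negative weights such as term frequencies.

```python
import math


def feature_contribution(x, y, sigma=1.0, lam=1.0):
    """Return (contribution, counted) for one feature of two documents.

    Assumed forms, not taken from the abstract: a Gaussian of the scaled
    difference for case a, and a fixed penalty -lam for case b.
    """
    if x > 0 and y > 0:
        # Case a: feature present in both documents. The contribution
        # grows as the difference between the two values shrinks.
        return 0.5 * (1.0 + math.exp(-(((x - y) / sigma) ** 2))), 1
    if x == 0 and y == 0:
        # Case c: feature absent from both documents. No contribution.
        return 0.0, 0
    # Case b: feature present in only one document. Fixed contribution.
    return -lam, 1


def document_similarity(d1, d2, sigmas=None, lam=1.0):
    """Combine per-feature contributions into one score in [0, 1]."""
    if len(d1) != len(d2):
        raise ValueError("documents must have the same number of features")
    if sigmas is None:
        sigmas = [1.0] * len(d1)
    total, counted = 0.0, 0
    for x, y, s in zip(d1, d2, sigmas):
        c, n = feature_contribution(x, y, s, lam)
        total += c
        counted += n
    if counted == 0:
        # Convention chosen here: two empty documents score 0.
        return 0.0
    f = total / counted            # lies in [-lam, 1]
    return (f + lam) / (1.0 + lam)  # rescaled to [0, 1]


if __name__ == "__main__":
    print(document_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))
    print(document_similarity([1.0, 0.0], [0.0, 1.0]))
```

Under these assumptions, identical documents score 1, documents with no shared features score 0, and a feature absent from both documents leaves the score unchanged.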

Original language: English
Article number: 6420834
Pages (from-to): 1575-1590
Number of pages: 16
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 26
Issue number: 7
DOI: 10.1109/TKDE.2013.19
Publication status: Published - 2014 Jul

Fingerprint

Text processing
Gages

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

@article{24d113bd8397456a9e374cf86adc12be,
title = "A similarity measure for text classification and clustering",
author = "Lin, {Yung Shen} and Jiang, {Jung Yi} and Lee, {Shie Jue}",
year = "2014",
month = "7",
doi = "10.1109/TKDE.2013.19",
language = "English",
volume = "26",
pages = "1575--1590",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "7",

}

A similarity measure for text classification and clustering. / Lin, Yung Shen; Jiang, Jung Yi; Lee, Shie Jue.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 7, 6420834, 07.2014, p. 1575-1590.

TY - JOUR

T1 - A similarity measure for text classification and clustering

AU - Lin, Yung Shen

AU - Jiang, Jung Yi

AU - Lee, Shie Jue

PY - 2014/7

Y1 - 2014/7

AB - Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) The feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in none of the documents. For the first case, the similarity increases as the difference between the two involved feature values decreases. Furthermore, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.

UR - http://www.scopus.com/inward/record.url?scp=84904411707&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84904411707&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2013.19

DO - 10.1109/TKDE.2013.19

M3 - Article

AN - SCOPUS:84904411707

VL - 26

SP - 1575

EP - 1590

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 7

M1 - 6420834

ER -