Cross-lingual document representation and semantic similarity measure: A fuzzy set and rough set based approach

Hsun Hui Huang, Yau Hwang Kuo

Research output: Article, peer-reviewed

34 Citations (Scopus)

Abstract

As cross-lingual information retrieval is attracting increasing attention, tools that measure cross-lingual semantic similarity between documents are becoming desirable. In this paper, two aspects of cross-lingual semantic document similarity measures are investigated: one is document representation, and the other is the formulation of similarity measures. Fuzzy set and rough set theories are applied to capture the inherently fuzzy relationships among concepts expressed by natural languages. Our approach first develops a language-independent sense-level document representation based on the fuzzy set model to reduce the barrier between different languages and further explores the fuzzy-rough hybrid approach to obtain a more robust macrosense-level document representation through the partitioning of the integrated sense association network of the document collection into macrosenses. Then, Tversky's notion of similarity and the F1 measure from information retrieval are adopted to formulate, respectively, two document similarity measures with fuzzy set operations on the two proposed document representations. The effectiveness of our approach is demonstrated by its success rate in identifying the English translations of their corresponding Chinese documents in a collection of Chinese-English parallel documents. Moreover, the proposed approach can be easily extended to process documents in other languages. It is believed that the proposed representations, along with the similarity measures, will enable more effective text mining processes.
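
The abstract does not give the formulas, so the following is only a minimal sketch of how a Tversky-style ratio measure and an F1-style measure could be computed over fuzzy sets. It is not the authors' formulation. The assumptions are: a document is a mapping from hypothetical sense identifiers to membership grades in [0, 1], intersection is the pointwise minimum, difference is the bounded difference, and cardinality is the sum of grades.

```python
from typing import Dict

# Assumption: a document is a fuzzy set mapping a (hypothetical) sense id
# to a membership grade in [0, 1].
FuzzySet = Dict[str, float]


def cardinality(a: FuzzySet) -> float:
    """Sigma-count: the sum of all membership grades."""
    return sum(a.values())


def intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Pointwise minimum over the senses present in both sets."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def difference(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Bounded difference: how much of each grade in a is not covered by b."""
    return {k: max(0.0, v - b.get(k, 0.0)) for k, v in a.items()}


def tversky(a: FuzzySet, b: FuzzySet, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky ratio model: common / (common + alpha*|a-b| + beta*|b-a|)."""
    common = cardinality(intersection(a, b))
    denom = (common
             + alpha * cardinality(difference(a, b))
             + beta * cardinality(difference(b, a)))
    return common / denom if denom else 0.0


def f1_similarity(a: FuzzySet, b: FuzzySet) -> float:
    """Harmonic mean of 'precision' (overlap / |a|) and 'recall' (overlap / |b|)."""
    common = cardinality(intersection(a, b))
    size_a, size_b = cardinality(a), cardinality(b)
    if common == 0 or size_a == 0 or size_b == 0:
        return 0.0
    precision = common / size_a
    recall = common / size_b
    return 2 * precision * recall / (precision + recall)


# Hypothetical sense ids and grades, for illustration only.
doc_zh = {"s1": 1.0, "s2": 0.5}
doc_en = {"s1": 0.5, "s3": 0.5}
print(tversky(doc_zh, doc_en), f1_similarity(doc_zh, doc_en))
```

Under these particular operator choices, the Tversky measure with alpha = beta = 0.5 coincides with the F1-style measure; other weights make it asymmetric, which may be useful when one document is treated as the reference.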

Original language: English
Article number: 5549886
Pages (from-to): 1098-1111
Number of pages: 14
Journal: IEEE Transactions on Fuzzy Systems
Volume: 18
Issue number: 6
Publication status: Published - 1 Dec 2010

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Computational Theory and Mathematics
  • Artificial Intelligence
  • Applied Mathematics
