The implementation of the Min-Hashing algorithm in Mahout

Hongya Wang, Xisong Wu, Shan Chang, Lih Chyun Shu

Research output: Conference contribution

Abstract

Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with a probabilistic guarantee. In this paper, we focused on the MapReduce implementation of Min-Hashing in Mahout, an open-source project for distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the actual and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant source code, we identified two fatal conceptual mistakes in the implementation. We then rewrote the core part of the problematic Min-Hashing implementation in Mahout, following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments on several real datasets. The experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.
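The abstract refers to Min-Hashing and the standard LSH algorithm without describing them. As a rough illustration only, the textbook scheme can be sketched as below. This is not the authors' code and not Mahout's API: the function names, the linear hash family, the modulus, and the band/row parameters are all assumptions chosen for the example, and set elements are assumed to be non-negative integers smaller than the modulus.

```python
import random

# Assumed modulus for the illustrative hash family (Mersenne prime 2^31 - 1).
PRIME = 2_147_483_647


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """One signature entry per hash function: the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures, bands, rows):
    """Textbook LSH banding: cut each signature into `bands` chunks of `rows`
    values; sets that agree on every value of some chunk share a bucket."""
    buckets = {}
    for key, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets.setdefault((band, chunk), []).append(key)
    return buckets


if __name__ == "__main__":
    params = make_hash_params(num_hashes=20, seed=42)
    docs = {
        "a": {1, 2, 3, 4, 5},
        "b": {1, 2, 3, 4, 5},      # identical to "a"
        "c": {100, 200, 300},      # disjoint from "a"
    }
    sigs = {name: minhash_signature(s, params) for name, s in docs.items()}
    print(estimate_jaccard(sigs["a"], sigs["b"]))
    print(estimate_jaccard(sigs["a"], sigs["c"]))
    groups = lsh_buckets(sigs, bands=5, rows=4)
    print([members for members in groups.values() if len(members) > 1])
```

In this sketch, two sets land in the same bucket only if their signatures agree on every row of at least one band, so more rows per band makes the grouping stricter and more bands makes it more permissive.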

Original language: English
Title of host publication: Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
Editors: Amir Hussain, Mirjana Ivanovic
Pages: 1161-1166
Number of pages: 6
Publication status: Published - 2015
Event: 4th International Conference on Electronics, Communications and Networks, CECNet2014 - Beijing, China
Duration: 12 Dec 2014 to 15 Dec 2014

Publication series

Name: Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
2

Other

Other: 4th International Conference on Electronics, Communications and Networks, CECNet2014
Country/Territory: China
City: Beijing
Period: 2014-12-12 to 2014-12-15

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Electrical and Electronic Engineering
