TY - GEN
T1 - The implementation of the Min-Hashing algorithm in Mahout
AU - Wang, Hongya
AU - Wu, Xisong
AU - Chang, Shan
AU - Shu, Lih Chyun
N1 - Publisher Copyright: © 2015 Taylor & Francis Group, London.
PY - 2015
Y1 - 2015
AB - Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with a probabilistic guarantee. In this paper, we focused on MapReduce-based Min-Hashing implementations in Mahout, an open-source project for distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the actual and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant source code, we identified two fatal conceptual mistakes in the implementation. We then rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments on several real datasets. The experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.
UR - http://www.scopus.com/inward/record.url?scp=84960122327&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84960122327&partnerID=8YFLogxK
U2 - 10.1201/b18592-208
DO - 10.1201/b18592-208
M3 - Conference contribution
AN - SCOPUS:84960122327
SN - 9781138028302
T3 - Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
SP - 1161
EP - 1166
BT - Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
A2 - Hussain, Amir
A2 - Ivanovic, Mirjana
T2 - 4th International Conference on Electronics, Communications and Networks, CECNet2014
Y2 - 12 December 2014 through 15 December 2014
ER -