The implementation of the Min-Hashing algorithm in Mahout

Hongya Wang, Xisong Wu, Shan Chang, Lih-Chyun Shu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with probability guarantee. In this paper, we focused on Min-Hashing implementations using MapReduce in Mahout, which is an open-source project of distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the real and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant sourcecode, we identified two fatal conceptual mistakes in the implementation. Then, we rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments with several real datasets. Experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.

Original languageEnglish
Title of host publicationElectronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
EditorsAmir Hussain, Mirjana Ivanovic
Pages1161-1166
Number of pages6
Publication statusPublished - 2015 Jan 1
Event4th International Conference on Electronics, Communications and Networks, CECNet2014 - Beijing, China
Duration: 2014 Dec 122014 Dec 15

Publication series

NameElectronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
Volume2

Other

Other4th International Conference on Electronics, Communications and Networks, CECNet2014
CountryChina
CityBeijing
Period14-12-1214-12-15

Fingerprint

Clustering algorithms
Learning algorithms
Learning systems
Websites
Experiments

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Electrical and Electronic Engineering

Cite this

Wang, H., Wu, X., Chang, S., & Shu, L-C. (2015). The implementation of the Min-Hashing algorithm in Mahout. In A. Hussain, & M. Ivanovic (Eds.), Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014 (pp. 1161-1166). (Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014; Vol. 2).
Wang, Hongya ; Wu, Xisong ; Chang, Shan ; Shu, Lih-Chyun. / The implementation of the Min-Hashing algorithm in Mahout. Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014. editor / Amir Hussain ; Mirjana Ivanovic. 2015. pp. 1161-1166 (Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014).
@inproceedings{cfdf6b7cf3524629a179957b47da9ae6,
title = "The implementation of the Min-Hashing algorithm in Mahout",
abstract = "Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with probability guarantee. In this paper, we focused on Min-Hashing implementations using MapReduce in Mahout, which is an open-source project of distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the real and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant sourcecode, we identified two fatal conceptual mistakes in the implementation. Then, we rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments with several real datasets. Experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.",
author = "Hongya Wang and Xisong Wu and Shan Chang and Lih-Chyun Shu",
year = "2015",
month = "1",
day = "1",
language = "English",
isbn = "9781138028302",
series = "Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014",
pages = "1161--1166",
editor = "Amir Hussain and Mirjana Ivanovic",
booktitle = "Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014",

}

Wang, H, Wu, X, Chang, S & Shu, L-C 2015, The implementation of the Min-Hashing algorithm in Mahout. in A Hussain & M Ivanovic (eds), Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014. Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014, vol. 2, pp. 1161-1166, 4th International Conference on Electronics, Communications and Networks, CECNet2014, Beijing, China, 14-12-12.

The implementation of the Min-Hashing algorithm in Mahout. / Wang, Hongya; Wu, Xisong; Chang, Shan; Shu, Lih-Chyun.

Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014. ed. / Amir Hussain; Mirjana Ivanovic. 2015. p. 1161-1166 (Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014; Vol. 2).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - The implementation of the Min-Hashing algorithm in Mahout

AU - Wang, Hongya

AU - Wu, Xisong

AU - Chang, Shan

AU - Shu, Lih-Chyun

PY - 2015/1/1

Y1 - 2015/1/1

N2 - Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with probability guarantee. In this paper, we focused on Min-Hashing implementations using MapReduce in Mahout, which is an open-source project of distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the real and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant sourcecode, we identified two fatal conceptual mistakes in the implementation. Then, we rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments with several real datasets. Experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.

AB - Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with probability guarantee. In this paper, we focused on Min-Hashing implementations using MapReduce in Mahout, which is an open-source project of distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the real and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant sourcecode, we identified two fatal conceptual mistakes in the implementation. Then, we rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments with several real datasets. Experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.

UR - http://www.scopus.com/inward/record.url?scp=84960122327&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84960122327&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781138028302

T3 - Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014

SP - 1161

EP - 1166

BT - Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014

A2 - Hussain, Amir

A2 - Ivanovic, Mirjana

ER -

Wang H, Wu X, Chang S, Shu L-C. The implementation of the Min-Hashing algorithm in Mahout. In Hussain A, Ivanovic M, editors, Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014. 2015. p. 1161-1166. (Electronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014).