The implementation of the Min-Hashing algorithm in Mahout

Hongya Wang, Xisong Wu, Shan Chang, Lih Chyun Shu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Min-Hashing was originally proposed as an efficient clustering algorithm that groups similar web pages into the same cluster with probability guarantee. In this paper, we focused on Min-Hashing implementations using MapReduce in Mahout, which is an open-source project of distributed and scalable machine learning algorithms. In particular, we observed a significant deviation between the real and expected performance of the minhash clustering package in Mahout. After careful examination of the relevant sourcecode, we identified two fatal conceptual mistakes in the implementation. Then, we rewrote the core part of the problematic Min-Hashing implementation in Mahout following the standard LSH algorithm. To validate the soundness of the revised version, we conducted extensive experiments with several real datasets. Experimental results confirmed the validity of our implementation, which could be integrated as a standard package in future versions of Mahout.

Original languageEnglish
Title of host publicationElectronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
EditorsAmir Hussain, Mirjana Ivanovic
Pages1161-1166
Number of pages6
DOIs
Publication statusPublished - 2015
Event4th International Conference on Electronics, Communications and Networks, CECNet2014 - Beijing, China
Duration: 2014 Dec 122014 Dec 15

Publication series

NameElectronics, Communications and Networks IV - Proceedings of the 4th International Conference on Electronics, Communications and Networks, CECNet2014
Volume2

Other

Other4th International Conference on Electronics, Communications and Networks, CECNet2014
Country/TerritoryChina
CityBeijing
Period14-12-1214-12-15

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'The implementation of the Min-Hashing algorithm in Mahout'. Together they form a unique fingerprint.

Cite this