TY - GEN
T1 - Federated MapReduce to transparently run applications on multicluster environment
AU - Wang, Chun Yu
AU - Tai, Tzu Li
AU - Shu, Jui-Shing
AU - Chang, Jyh-Biau
AU - Shieh, Ce-Kuen
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/9/22
Y1 - 2014/9/22
AB - In the Cloud era, data is generated everywhere. How to efficiently analyze such 'Big Data', characterized by large volume, fast generation, and variety, is a critical issue. MapReduce is a simplified distributed parallel data processing model that has been widely applied in areas such as web indexing, clustering, and classification. However, sensitive data such as network logs or mail, distributed among independent organizations, must be kept private and cannot be aggregated for centralized analysis. We propose Federated MapReduce (Fed-MR), a framework for analyzing geographically distributed data among independent organizations while avoiding data movement. In contrast to previous work, Fed-MR retains the simplicity of MapReduce programming to provide a transparent way to run original MapReduce jobs across multiple clusters without any extra programming burden. Fed-MR also integrates multiple clusters in different locations to form hierarchical Top-Region relationships. Experiments show that, compared to a single cluster with the same number of worker nodes, computation time increased by an average of only 30% for WordCount and 10% for Grep. Fed-MR therefore incurs reasonable performance overhead for analyzing data across Internet-connected clusters, and requires no additional Global Reduce function, unlike traditional hierarchical MapReduce frameworks.
UR - http://www.scopus.com/inward/record.url?scp=84923924566&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84923924566&partnerID=8YFLogxK
U2 - 10.1109/BigData.Congress.2014.50
DO - 10.1109/BigData.Congress.2014.50
M3 - Conference contribution
AN - SCOPUS:84923924566
T3 - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014
SP - 296
EP - 303
BT - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014
A2 - Chen, Peter
A2 - Jain, Hemant
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE International Congress on Big Data, BigData Congress 2014
Y2 - 27 June 2014 through 2 July 2014
ER -