IDP: An Innovative Data Placement Strategy for Hadoop in Heterogeneous Environments

  • 黃 虹橋

Student thesis: Master's Thesis


Cloud computing is a form of parallel distributed computing that is becoming increasingly popular. MapReduce is a widely used model in cloud computing and an important programming model for large-scale data-parallel applications. Hadoop is an open-source implementation of the MapReduce model that is commonly used for data-intensive applications such as data mining and web indexing. The current Hadoop implementation assumes that every node in a cluster has equivalent computing capability and that tasks are data-local. However, these assumptions of homogeneity and data locality often do not hold in private clusters and virtualized data centers, which can add overhead and degrade MapReduce performance. In this paper, we propose a data placement strategy to address the imbalanced workload problem on DataNodes. Based on the computing capability of each node in a heterogeneous Hadoop cluster, the proposed strategy balances the data stored on the DataNodes so that the cost of data transfer time is substantially reduced. As a result, overall Hadoop performance improves. Experimental results demonstrate that the proposed data placement strategy considerably decreases execution time and thus improves Hadoop performance in a heterogeneous cluster.
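The idea of placing data according to each node's computing capability can be illustrated with a minimal sketch. This is not the thesis's algorithm, only a simple proportional split under stated assumptions: the function name `allocate_blocks`, the node names, and the numeric capability scores are all hypothetical, and leftover blocks are assigned by a largest-remainder rule chosen here for illustration.

```python
from typing import Dict


def allocate_blocks(capabilities: Dict[str, float], total_blocks: int) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to their capability.

    capabilities maps a (hypothetical) node name to a relative computing
    capability score; a higher score means the node receives more blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if any(c < 0 for c in capabilities.values()):
        raise ValueError("capabilities must be non-negative")
    total_capability = sum(capabilities.values())
    if total_capability <= 0:
        raise ValueError("at least one node needs a positive capability")

    # Ideal (fractional) share of blocks for each node.
    exact = {
        node: total_blocks * cap / total_capability
        for node, cap in capabilities.items()
    }
    # Start from the rounded-down share.
    allocation = {node: int(share) for node, share in exact.items()}

    # Hand out the blocks lost to rounding, largest fractional part first.
    leftover = total_blocks - sum(allocation.values())
    by_remainder = sorted(
        capabilities, key=lambda n: exact[n] - allocation[n], reverse=True
    )
    for node in by_remainder[:leftover]:
        allocation[node] += 1
    return allocation
```

For example, with hypothetical capability scores of 1, 2, and 3 and 12 blocks to place, the split is 2, 4, and 6 blocks, so a node that processes data three times as fast holds three times as much local data.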
Date of Award: 2014 Aug 25
Original language: English
Supervisor: Sun-Yuan Hsieh
