posted on 2017-02-06, 05:31 authored by Nidhi Tiwari
The increase in number of users and digital content has led
to an increase in the size and energy consumption of data centers. An efficient
use of energy is essential to address the concerns of cost and sustainability.
Many data centers contain MapReduce clusters of hundreds or thousands of
machines to efficiently process infrequent batch and interactive Big Data
workloads. Nevertheless, a large number of these machines are underutilized for
long periods. This makes the MapReduce clusters energy inefficient. In this thesis
we focus on improving the energy efficiency of MapReduce clusters to reduce the
energy consumption of data centers. MapReduce frameworks automate the execution of
data-parallel tasks on a distributed cluster of commodity nodes, as well as the replication of data and
tasks for reliability. While such a design provides high scalability, fault
tolerance and an easy programming interface, it poses several challenges to the
use of common resource consolidation methods for improving energy efficiency.
For instance, consolidating the workload onto fewer nodes has a negative impact on the performance and
availability of the system. Popular dynamic voltage and frequency scaling mechanisms,
which consider only CPU utilization, may not be optimal for IO-intensive
MapReduce data processing systems. Likewise, tuning of the MapReduce configuration
parameters for energy efficiency is not simple because the number of
configuration parameters is large; a parameter can have conflicting impacts on
performance and energy; and the parameters are not necessarily orthogonal, that
is, changing the value of one parameter can alter the effect
of other parameters. In this thesis, we use statistical and empirical methods to
address the challenges of configuring the parameters to improve the energy efficiency
of MapReduce systems without impacting their performance, fault-tolerance and
scalability. We first characterize the energy efficiency of MapReduce workloads
with respect to the built-in CPU-governors to determine the most effective
power settings. Next, we use factorial design of experiments to study the
effects of configuration parameters on performance and energy consumption, with
a view to efficiently identifying the most influential ones. We then perform a
detailed performance and energy characterization for the critical parameters
and derive respective empirical models using the linear regression technique.
We analyze the energy and performance models of a variety of MapReduce
workloads to understand the relative impact of CPU-frequency and other critical
parameters. We further present a MapReduce Configurator, which employs the
performance and energy models, to tune the critical parameters for energy
efficiency. We perform the characterizations and evaluations on multiple real
clusters, each consisting of a MapReduce platform (e.g. Hadoop-1 and YARN)
deployed on specific hardware (e.g. nodes with Intel Pentium G-2020 and Intel E5-2450
processors), with benchmark applications ranging from micro-level benchmarks
(e.g. wordcount and sort) to macro-level machine learning applications (e.g. K-means
and PageRank). With the use of the MapReduce Configurator, we achieve approximately 20-100%
improvement in energy efficiency of typical MapReduce workloads in two
architecturally different clusters. We demonstrate that tuning of just the
CPU-frequency setting improves the energy efficiency of machine learning
workloads by an average of 25% over the default CPU-governor setting. Through
extensive empirical evaluations, we establish the generality and effectiveness
of our MapReduce Configurator and models. We also observe that the use of an
energy-aware configuration, determined using the MapReduce Configurator, reduces
the energy consumption of MapReduce clusters without impacting their
performance. This helps in reducing the operational costs of data centers.
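
As a rough illustration of the workflow summarized above (factorial screening, regression-based models, then model-driven tuning), the following Python sketch fits runtime and energy models on a two-level full factorial design and picks the lowest-energy setting that still meets a runtime bound. It is not the thesis's Configurator. The factor names `cpu_freq` and `map_slots`, the functions `fit_coded_model`, `predict` and `choose_config`, and all measurements are hypothetical, and the numbers are synthetic. The abstract does not say whether the experiments were full or fractional factorial, so a full 2^2 design is assumed here for simplicity.

```python
from itertools import product

# Synthetic (made-up) measurements for a 2^2 full factorial design.
# Key: coded levels (cpu_freq, map_slots), where -1 = low and +1 = high.
# Value: (runtime in seconds, energy in kilojoules).
RUNS = {
    (-1, -1): (400.0, 60.0),
    (+1, -1): (300.0, 66.0),
    (-1, +1): (280.0, 50.0),
    (+1, +1): (220.0, 58.0),
}

# Model terms: intercept, two main effects, and their interaction.
TERMS = {
    "intercept": lambda x1, x2: 1,
    "cpu_freq": lambda x1, x2: x1,
    "map_slots": lambda x1, x2: x2,
    "cpu_freq*map_slots": lambda x1, x2: x1 * x2,
}


def fit_coded_model(runs, response_index):
    """Least-squares fit of a linear model on a coded two-level design.

    The design columns are orthogonal, so each least-squares coefficient
    reduces to the contrast of that term divided by the number of runs.
    """
    n = len(runs)
    return {
        name: sum(term(*levels) * responses[response_index]
                  for levels, responses in runs.items()) / n
        for name, term in TERMS.items()
    }


def predict(model, x1, x2):
    """Evaluate a fitted model at coded levels (x1, x2)."""
    return sum(coef * TERMS[name](x1, x2) for name, coef in model.items())


def choose_config(runtime_model, energy_model, max_runtime_s):
    """Return the candidate with the lowest predicted energy whose
    predicted runtime does not exceed max_runtime_s, or None."""
    feasible = [levels for levels in product((-1, 1), repeat=2)
                if predict(runtime_model, *levels) <= max_runtime_s]
    if not feasible:
        return None
    return min(feasible, key=lambda levels: predict(energy_model, *levels))


runtime_model = fit_coded_model(RUNS, 0)
energy_model = fit_coded_model(RUNS, 1)
best = choose_config(runtime_model, energy_model, max_runtime_s=300.0)
print(best)
```

The size of each coefficient indicates how strongly that factor (or interaction) influences the response, which is the sense in which a factorial design helps single out the most influential parameters. A non-zero interaction coefficient corresponds to the non-orthogonality mentioned above: the effect of one parameter depends on the level of the other.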
History
Campus location: Australia
Principal supervisor: Umesh Bellur
Additional supervisor 1: Maria Indrawan-Santiago
Additional supervisor 2: Santonu Sarkar
Year of Award: 2016
Department, School or Centre: Information Technology (Monash University Caulfield)
Additional Institution or Organisation: Indian Institute of Technology Bombay, India (IITB)