
Lab Validation Report NetApp Open Solution for Hadoop

Open Source Data Analytics with Enterprise-class Storage Services

By Brian Garrett, VP, ESG Lab, & Julie Lockner, Sr Analyst & VP, Data Management

May 2012 © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.


Contents

Introduction
  Background
  NetApp Open Solution for Hadoop

ESG Lab Validation
  Getting Started
  Performance and Scalability
  Efficiency
  Recoverability

ESG Lab Validation Highlights

Issues to Consider

The Bigger Truth

Appendix

All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources that The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.

ESG Lab Reports

The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable features and functions of products, show how they can be used to solve real customer problems, and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments. This ESG Lab report was sponsored by NetApp.


Introduction

This ESG Lab report presents the results of hands-on testing of the NetApp Open Solution for Hadoop, a highly reliable, ready-to-deploy, scalable storage solution for enterprise Hadoop deployments.

Background

Driven by unrelenting data volume growth, the need for real-time data processing and analytics, and the increasing complexity and variety of data sources, ESG expects broad adoption of MapReduce data processing and analytics frameworks over the next two to five years. These frameworks require new approaches for storing, integrating, and processing "big data." ESG defines big data as any data set that exceeds the boundaries and sizes of traditional IT processing; big data sets can range from tens to hundreds of terabytes in size.

Data analytics is a top priority for forward-looking IT organizations. In fact, a recent ESG survey indicates that more than half (54%) of enterprise organizations (i.e., those with 1,000 or more employees) consider data analytics to be a top-five IT priority, and 38% plan to deploy a new data analytics solution in the next 12-18 months. A growing number of IT organizations are using the open source Apache Hadoop MapReduce framework as a foundation for their big data analytics initiatives. As shown in Figure 1, more than 50% of organizations polled by ESG are using Hadoop, planning to deploy Hadoop in the next 12 months, or considering Hadoop.1

Figure 1. Plans to Implement a MapReduce Framework such as Apache Hadoop

Source: Enterprise Strategy Group, 2011.

As with any exciting and emerging technology, big data analytics also has its challenges. Management is an issue because the platforms are expensive and require new server and storage purchases, integration with existing data sets and processes, training in new technologies, an analytics toolset, and people with the expertise to deal with it all. When IT managers were asked about their data analytics challenges, 47% named data integration complexity, 34% cited a lack of the skills necessary to properly manage large data sets and derive value from them, 29% said data set sizes are limiting their ability to perform analytics, and 28% cited difficulty in completing analytics within a reasonable period of time.

1 Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011.

[Figure 1 data: "What are your organization's plans to implement a MapReduce framework (e.g., Apache Hadoop) to address data analytics challenges?" (Percent of respondents, N=270) Already using: 8%; plan to implement within 12 months: 13%; no plans to implement at this time but interested: 35%; no plans to implement: 33%; don't know: 11%.]


Looking beyond the high-level organizational challenges associated with a big data analytics initiative, the Hadoop framework adds technology and implementation issues that need to be considered. The common reference architecture for a Hadoop cluster leverages commodity server nodes with internal hard drives; for conventional data centers with mature ITIL processes, this introduces two challenges. First, data protection is, by default, handled in the Hadoop software layer; every time a file is written to the Hadoop Distributed File System (HDFS), two additional copies are written in case of a disk drive or data node failure. This not only impacts data ingest and throughput performance, but also reduces disk capacity utilization. Second, high availability is limited by an existing single point of failure in the Hadoop metadata repository (the name node). This single point of failure will eventually be addressed by the Hadoop community, but, in the meantime, analytics downtime due to a name node failure is a key concern. As shown in Figure 2, a majority of ESG survey respondents (55%) indicate that three hours or less of data analytics platform downtime would result in a significant revenue loss or other adverse business impact.2

Figure 2. Data Analytics Downtime Tolerance

Source: Enterprise Strategy Group, 2011.

NetApp, in collaboration with leading Hadoop distribution vendors, is working to develop reference architectures, best practices, and solutions that address these challenges while maximizing the speed, efficiency, and availability of open source Hadoop deployments.

2 Source: ESG Survey, The Convergence of Big Data Processing, Hadoop, and Integrated Infrastructure, December 2011.

[Figure 2 data: "Please indicate the amount of downtime your organization's data analytics platforms can tolerate before your organization experiences significant revenue loss or other adverse business impact." (Percent of respondents, N=399) None: 6%; less than 1 hour: 21%; 1 hour to 3 hours: 26%; 4 hours to 10 hours: 18%; 11 hours to 24 hours: 10%; 1 day to 3 days: 10%; more than 3 days: 4%; don't know: 6%.]


NetApp Open Solution for Hadoop

Hadoop is a significant emerging open source technology for solving business problems around large volumes of mostly unstructured data that cannot be analyzed with traditional database tools. The NetApp Open Solution for Hadoop combines the power of the Hadoop framework with the flexible storage, professional support, and services of NetApp and its partners to deliver higher Hadoop cluster availability and efficiency. Based on a reference architecture, it focuses on scaling Hadoop from its departmental origins to an enterprise infrastructure with independent compute and storage scaling, faster cluster ingest, and faster job completion under failure conditions.

The NetApp Open Solution for Hadoop extends the value of the open source Hadoop framework with enterprise-class storage and services. As shown in Figure 3, NetApp FAS2040 and E2660 storage replace the traditional internal (DAS) hard drives within a Hadoop cluster. Compute and storage resources are decoupled with SAS-attached NetApp E2660 arrays, and the recoverability of a failed Hadoop name node is improved with an NFS-attached FAS2040. The storage components are completely transparent to the Hadoop distribution and require no modification to the native, underlying Hadoop platform. Note that while the FAS2040 was used for this testing configuration, any other product in the FAS storage family can also be used.

Figure 3. NetApp Open Solution for Hadoop

The NetApp Open Solution for Hadoop includes:

NetApp E2660s with hardware RAID and hot-swappable disks increase efficiency, performance, scalability, availability, serviceability, and manageability compared to a traditional Hadoop deployment with internal hard drives and replication at the application layer. With data protected by hardware RAID, higher storage utilization rates can be achieved by reducing the default Hadoop replication count.

A NetApp FAS2040 with shared NFS-attached capacity accelerates recovery after a primary name node failure, compared to a traditional Hadoop deployment with internal hard drives.

A high-speed 10 Gbps Ethernet network and 6 Gbps SAS direct-attached E2660s with network-free hardware RAID increase the performance, scalability, and efficiency of the Hadoop infrastructure.

High-capacity E2660 disk arrays and a building block design that decouples the compute and storage layers provide near-linear scalability that is ideally suited for big data analytics applications with extreme compute and storage capacity requirements.

A field-tested solution comprising an open source Apache Hadoop distribution and enterprise-class NetApp storage, with professional design services and support, reduces risk and accelerates deployment.


ESG Lab Validation

ESG Lab performed hands-on evaluation and testing of the solution at a NetApp facility in Research Triangle Park, North Carolina. Testing was designed to demonstrate that the NetApp Open Solution for Hadoop can perform and scale linearly as data volumes and load increase, can recover from single and dual drive failures with no disruption to a running Hadoop job, and can quickly recover from a name node failure. The performance and scalability benefits of using network-free hardware RAID and a lower Hadoop replication count were evaluated as well. Testing was performed using open source software, workload generators, and monitoring tools.

Getting Started

A Hadoop cluster with one name node, one secondary name node, one job tracker node, and up to 24 data nodes was used during ESG Lab testing. Rack-mounted servers with quad-core Intel Xeon processors and 48GB of RAM were connected to six NetApp E2660s, with the name node and secondary name node connected to a single NetApp FAS2040. Each NetApp E2660 was filled with 60 2TB 7200 RPM NL-SAS drives for a total raw capacity of 720TB. A building block approach was used, with groups of four data nodes sharing an E2660 through 6 Gbps SAS connections. A 10 Gbps Ethernet network was used for the cluster interconnect, with 1 Gbps Ethernet NFS connections from the name and job tracker nodes to the FAS2040. The Cloudera Distribution for Hadoop was installed over the Red Hat Enterprise Linux operating system on each of the nodes in the cluster.3
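For readers assembling a similar test bed, the number of live data nodes and the usable HDFS capacity can be confirmed from the command line before any benchmarks are run. The commands below are a minimal sketch rather than part of the documented ESG Lab procedure; they assume HDFS is already formatted and running and are executed as the HDFS superuser.

    # Report configured capacity, DFS used/remaining, and the number of live data nodes
    hadoop dfsadmin -report

    # Check overall HDFS health (missing, corrupt, or under-replicated blocks)
    hadoop fsck /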

Figure 4. The ESG Lab Test Bed

3 Configuration details are listed in the Appendix.


Performance and Scalability

Hadoop uses a shared-nothing programming paradigm and a massively parallel clustered architecture to meet the extreme compute and capacity requirements of big data analytics applications. Aiming to exceed the performance and scalability of traditional database architectures, Hadoop brings the compute power to the data. The name node and job tracker handle distribution and orchestration while the data nodes do all of the analytical processing work.

HDFS is a distributed network file system used by nodes in a Hadoop cluster. Software mirroring is the default data protection scheme within the HDFS file system. For every block of data written into the HDFS file system, an additional two copies are written to other nodes for a total of three copies. This is referred to as a replication count of three, and is the default for most Hadoop implementations that rely on internal hard drive capacity. This software data mirroring increases the processing load on data nodes and the utilization of the shared network between nodes. To put this into perspective, consider what happens when a 2TB data set is loaded into a Hadoop cluster with a default replication count of three: in this example, 2TB of application data results in 6TB of raw data being processed and moved over the network.
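For illustration, the replication factor of data already stored in HDFS can be inspected and changed from the Hadoop command line. This is a hedged sketch, not a step from the ESG Lab test plan, and the /data/ingest path is hypothetical.

    # Show the replication factor (%r) recorded for one file in an existing data set
    hadoop fs -stat "%r" /data/ingest/part-00000

    # Recursively lower the replication factor of a directory tree from 3 to 2
    hadoop fs -setrep -R 2 /data/ingest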

A NetApp E2660 with hardware RAID reduces the processing and network overhead associated with software mirroring, which increases the performance and scalability of a Hadoop cluster. With up to 15 high-capacity, high-performance disk drives (2TB, 7.2K RPM NL-SAS) available to each data node, the performance of a Hadoop cluster is magnified compared to a traditional Hadoop cluster with internal SATA drives. A right-sized building block approach provides near-linear scalability as compute and storage capacity are added to a cluster.

ESG Lab Testing

ESG Lab performed a series of tests to measure the performance and scalability of a 24-data-node NetApp Open Solution for Hadoop. Note that the cluster actually contains 27 nodes: 24 data nodes, one name node, one secondary name node, and one job tracker node. The TeraGen utility, included in the Hadoop open source distribution, was used to simulate the loading of a large analytic data set. Testing was performed with cluster sizes of 8, 16, and 24 data nodes and a Hadoop replication count of two. Testing began with the creation of a 1TB data set on an 8-data-node cluster. The test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The results are presented in Figure 5 and Table 1.
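TeraGen writes rows of 100 bytes each, so a 1TB data set corresponds to 10 billion rows. A command along the following lines would drive the 8-data-node run; the jar path reflects a typical CDH3 installation and the HDFS output directory is hypothetical, so treat this as a sketch rather than the exact command ESG Lab used.

    # Generate a 1TB data set (10,000,000,000 x 100-byte rows) in HDFS
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        10000000000 /benchmarks/teragen-1tb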

Figure 5. Data Loading Performance Analysis


Table 1. Performance Scalability Test Results: Data Loading with the TeraGen Utility

Data nodes 8 16 24

NetApp E2660 arrays 2 4 6

NetApp E2660 drives 120 240 360

Usable capacity (TB) 180 360 720

Hadoop data set size (TB) 1 2 3

Job completion time (hh:mm:ss) 00:10:06 00:09:52 00:10:18

Aggregate throughput (MB/sec) 1,574 3,222 4,630

What the Numbers Mean

The NetApp Solution for Hadoop was designed to scale performance in near-linear fashion as data nodes and E2660 disk arrays are added to the cluster. This modular building block approach can also be used to provide consistent levels of performance as a data set grows.

The job completion time for each of the TeraGen runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays were increased linearly.

In this example, the solution scaled up to 24 data nodes and six E2660 arrays with a total of 360 drives and 720TB of usable disk capacity.

As the number of data nodes increased and the volume of data generated increased linearly, the completion time remained flat, at approximately 10 minutes (+/- 3%). This demonstrates the linear performance scalability of the NetApp Solution for Hadoop.

A job completion time of ten minutes for the creation of a 3TB data set indicates that the 24-node NetApp solution sustained a high aggregate throughput rate of 4.630 GB/sec.

An aggregate data creation rate of 4.630 GB/sec can be used to create 16.7TB of data per hour.

Performance testing continued with a similar series of tests designed to measure the scalability of the solution when processing long running data analytics jobs. The open source TeraSort utility included in the Hadoop distribution was used during this phase of testing. Using the data created with TeraGen, TeraSort was tested with cluster sizes of 8, 16, and 24 data nodes, a map count of seven, and a reducer count of five per data node. Testing began with a sort of the 1TB data set on an eight-data-node cluster. The test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The elapsed job run time was recorded after each test. Each test began with a freshly created TeraGen data source. The results are presented in Table 2 and Figure 6.
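A TeraSort invocation consistent with the configuration in the Appendix (120 reduce tasks for the 24-data-node run, per Table 9) would look roughly like the sketch below. The jar path and HDFS directories are assumptions, not values taken from the report.

    # Sort the 3TB TeraGen output; reduce task count per the 24-data-node configuration
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort \
        -Dmapred.reduce.tasks=120 \
        /benchmarks/teragen-3tb /benchmarks/terasort-3tb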

Table 2. Performance Scalability Test Results: Data Analytics with the TeraSort Utility

Data nodes 8 16 24

Hadoop data set size (TB) 1 2 3

Job completion time (hh:mm:ss) 00:29:19 00:30:19 00:30:21

Aggregate throughput (MB/sec) 542 1,049 1,571


Figure 6. Data Analytics Performance Analysis

What the Numbers Mean

The job completion time for each of the TeraSort runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays increased linearly.

As the number of data nodes grew and the volume of data generated increased linearly, job completion time remained flat at approximately 30 minutes (+/- 2%).

As shown in Figure 6, aggregate analytics throughput scaled linearly as data nodes and E2660 arrays were added to the cluster.

Why This Matters

A growing number of organizations are deploying big data analytics platforms to improve the efficiency and profitability of their businesses. ESG research indicates that data analytics and managing data growth are among the top five IT priorities in more than 50% of organizations. When asked about their data analytics challenges, 29% said data set sizes are limiting their ability to perform analytics, and 28% reported difficulty in completing analytics within a reasonable period of time.

The NetApp Open Solution for Hadoop combines the compute scalability of a shared Hadoop cluster with the storage efficiency and scalability of network-free hardware RAID. Because the solution was designed to have the Hadoop data replication setting lower than the default and because it standardizes on a 10GbE network, there is less chance of having a network bottleneck compared to a traditional Hadoop deployment as data volumes grow.

ESG Lab confirmed that NetApp has created a big data analytics solution with near-linear performance scalability that dwarfs the capabilities of traditional databases and disk arrays—testing with a 24-node cluster and a 3TB data set scaled up to 4.63 GB/sec of aggregate load throughput and 1.57 GB/sec of aggregate analytics throughput.


Efficiency

The NetApp Open Solution for Hadoop improves capacity and performance efficiency compared to a traditional Hadoop deployment. With protection from disk failures provided by NetApp E2660s with hardware RAID, the Hadoop default replication setting of three can be reduced to two. NetApp E2660s with network-free hardware RAID-5 (6+1) and a Hadoop replication count of two increase storage capacity utilization by 22%, compared to a Hadoop cluster with internal drives and a default replication count of three. Network-free hardware RAID also increases the performance and scalability of the cluster due to a reduction in the amount of mirrored data flowing over the network.

ESG Lab Testing

The TeraGen tests were repeated with the default Hadoop replication count of three as the size of the cluster was increased from eight to 24 data nodes. The elapsed job times were compared with those collected earlier with a replication count of two. The results are summarized in Figure 7 and Table 3.
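The report sets dfs.replication cluster-wide in hdfs-site.xml (see Table 8 in the Appendix), which is the authoritative approach. As a hedged illustration only, a per-job override on the TeraGen command line is one hypothetical way to run the two-versus-three comparison; whether the override is honored for the generated output files depends on the Hadoop version, and this is not necessarily how ESG Lab switched between the settings.

    # Data load with the NetApp-enabled replication count of two (3TB = 30 billion rows)
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        -Ddfs.replication=2 30000000000 /benchmarks/teragen-rep2

    # The same load with the Hadoop default replication count of three
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        -Ddfs.replication=3 30000000000 /benchmarks/teragen-rep3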

Figure 7. Increasing Hadoop Cluster Efficiency with the “NetApp Effect”

Table 3. Performance Efficiency Test Results: Data Loading with TeraGen

Data nodes 8 16 24

Hadoop data set size (TB) 1 2 3

Replication count 2: Job completion time (hh:mm:ss) 00:10:06 00:09:52 00:10:18

Replication count 2: Aggregate throughput (MB/sec) 1,573 3,221 4,629

Replication count 3: Job completion time (hh:mm:ss) 00:15:32 00:16:11 00:16:44

Replication count 3: Aggregate throughput (MB/sec) 1,023 1,964 2,849


What the Numbers Mean

As the number of data nodes grew and the volume of data generated increased linearly, the job completion time remained flat at approximately ten minutes (+/- 2%) with a NetApp-enabled replication count of two.

Job completion time increased by 50% or more with a Hadoop default replication count of three due to the extra processing and network overhead associated with triple mirroring.

The increase in cluster efficiency (the NetApp effect) not only reduced job completion times, but also increased aggregate throughput.

As shown in Figure 7, the NetApp effect was magnified as the size of the cluster and the amount of network traffic increased. Note how the “Replication 2 with NetApp” line (green, circles) increases linearly compared to the “Replication 3” line (red, triangles). Also note how the gap between the two increases as the cluster grows due to the increase in network traffic.

The NetApp effect resulted in a peak aggregate throughput improvement of 62.5% during the 24-node test (4.629 vs. 2.849 GB/sec).

Why This Matters

Data growth shows no signs of abating. As data accumulates, there is a corresponding burden on IT to maintain acceptable levels of performance, whether that is measured by the speed with which an application responds, the ability to aggregate and deliver data, or the ultimate business value of information. Management teams are recognizing that their growing data stores bring massive, and largely untapped, potential to improve business intelligence. At the same time, they also recognize the challenges that big data poses to existing analytics tools and processes, as well as the impact data growth is having on the bottom line in the form of increased requirements for storage capacity and compute power. It is for these reasons that IT managers are struggling to meet the conflicting goals of keeping up with explosive data growth and lowering the cost of delivering data analytics services.

The default replication count for Hadoop is three. This is strongly recommended for data protection with Hadoop configurations with internal disk drives. Replication is also needed for cluster self-healing. “Self-healing” is used to describe Hadoop’s ability to ensure job completion in the event of task failure. It does this by reassigning failed tasks to other nodes in the cluster. This is made possible by the replication of blocks throughout the cluster.

With the NetApp Open Solution for Hadoop, replication is not required for data protection since data is protected with hardware RAID. As a result, a replication count of two is sufficient for self-healing. Hadoop MapReduce jobs that write data to the HDFS, such as data ingest, benefit from the lower replication count: they generally run faster and require less storage space than a Hadoop cluster with internal disk storage and a replication count of three.

During ESG Lab testing with a 24-node cluster, the NetApp effect reduced disk capacity requirements by 22% as it increased aggregate data load performance by 62%. In other words, organizations can manage more data at a lower cost with NetApp.


Recoverability

When a name node fails, a Hadoop administrator needs to recover the metadata and restart the Hadoop cluster using a standby, secondary name node.

In a Hadoop server cluster with internal storage, when a disk drive fails, the entire data node is "blacklisted" and is no longer available to execute tasks. This can result in degraded performance and the need for a Hadoop administrator to take the data node offline, service and replace the failed component, and then redeploy it. This process can take several hours to complete. The name node single point of failure is being addressed by the open source Hadoop community, but a fix was not yet generally available when this report was published.

NetApp Open Solution for Hadoop increases the availability and recoverability of a Hadoop cluster in three significant ways:

1. Recovery from a name node failure is accelerated dramatically by using an NFS-attached FAS2040 instead of internal storage on the primary and secondary name nodes. If and when a name node failure occurs, a quick recovery from the NFS-attached FAS2040 can restore analytics services in minutes instead of hours.

2. NetApp E2660s with hardware RAID provide transparent recovery from hard drive failures. The data node is not blacklisted, and any job tasks that were running continue uninterrupted.

3. The NetApp E2660 management console (SANtricity) provides a centralized GUI for monitoring and managing drive failures. This reduces the complexity associated with manually recovering from drive failures in a Hadoop cluster with internal drives.

ESG Lab Testing

A variety of errors were tested with a 24-data-node Hadoop cluster running a 3TB TeraSort job. As shown in Figure 8, errors were injected to validate that jobs continue to run after data node and E2660 hard drive failures, and that the cluster can be quickly recovered after a name node failure. A dual drive failure was also tested to simulate and measure job recovery time after an internal hard drive failure in a traditional Hadoop cluster.

Figure 8. ESG Lab Error Injection Testing


Disk Drive Failure

To simulate a disk drive failure, a drive was taken offline while a Hadoop TeraSort job was running.4 The Hadoop job tracker web interface was used to confirm that the job completed successfully. The NetApp E2660 SANtricity management console was used to identify which drive had failed and monitor automatic recovery from a hot spare. A SANtricity management console screenshot taken shortly after the drive had failed is shown in Figure 9.
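The same job status that the job tracker web interface reports is also available from the Hadoop command line, which is convenient when scripting failure-injection tests. The sketch below is illustrative; the job ID shown is hypothetical.

    # List running jobs and their IDs
    hadoop job -list

    # Check map/reduce completion and failure counters for a specific job
    hadoop job -status job_201205140830_0012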

Figure 9. Transparent Recovery from a Hard Drive Failure with E2660 Hardware RAID

Another TeraSort job was started. While it was running, a lab manager physically replaced the failed hard drive. The TeraSort job completed without error, as expected. Another TeraSort job was started and a dual drive error was introduced to simulate and measure the job completion time after a traditional Hadoop hard drive failure in a data node.5 As shown in Table 4, the TeraSort job took slightly longer (5.7% longer) to complete during the single drive failure with the hardware RAID recovery of the NetApp E2660. The simulated internal drive failure took more than twice as long (236.2%) as the data node was blacklisted and job tasks were restarted on surviving nodes.

Table 4. Drive Failure Recovery Results

Test Scenario | Job Completion Time (hh:mm:ss) | Throughput (MB/sec) | Delta (vs. Healthy Cluster)

Healthy cluster | 00:30:21 | 1,821 | N/A

NetApp E2660 drive failure | 00:32:06 | 1,486 | -5.7%

Internal data node drive failure | 01:12:13 | 660 | -237.9%

4 Drive failures were introduced when the Hadoop job tracker indicated that the TeraSort job was 80% complete.

5 In a Hadoop cluster using internal disk drives, a local file system is created on each disk. If a disk fails, that file system fails. A local disk failure was simulated during ESG Lab testing by failing two disk drives in the same RAID 5 volume group. All data on that file system was lost and all tasks running on that file system failed. The job tracker detected this and reassigned failed tasks to other nodes where copies of the lost blocks exist. With the NetApp solution, a single disk drive failure has very little impact on running tasks, and all data in the local file system using that LUN remains available as the RAID reconstruct begins. With direct attached disks, if a single disk fails, a file system fails as described above.


The screenshot in Figure 10 shows the Hadoop job tracker status after the successful completion of the TeraSort job following the simulated internal hard drive failure. Note how the non-zero failed/killed counts indicate the number of map and reduce tasks that were restarted on surviving nodes (439 and 5, respectively).

Figure 10. Jobs Completion after a Simulated Internal Hard Drive Failure

The screenshot in Figure 11 summarizes the status of the Hadoop Distributed File System (HDFS) after the data node with a simulated internal hard drive failure was blacklisted. These errors did not occur with the E2660 drive failure, as the Hadoop job ran uninterrupted.

Figure 11. Hadoop Self-healing in Action: Cluster Summary after a Simulated Internal Drive Failure


Name Node Failure

The Hadoop name node server was halted while a TeraSort job was running, with the goal of demonstrating how an NFS-attached NetApp FAS2040 can be used to recover quickly from the single point of failure that exists when a name node goes offline in a Hadoop cluster. As shown in Figure 12, the job failed as expected after 13 minutes and 23 seconds. After the job failed, the procedure outlined in the NetApp Open Solution for Hadoop Solutions Guide6 was used to copy the name node metadata to the secondary name node and start the name node daemon on the secondary name node server.

Figure 12. Job Failure after a Name Node Failure: NetApp FAS2040 Recovery Begins

Five minutes after getting started with the recovery process, the Hadoop cluster was up and running. An fsck of the HDFS file system indicated that the cluster was healthy and a restarted TeraSort job completed without error.
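The exact recovery steps are documented in the NetApp Solutions Guide referenced below (TR-3969). As a rough illustration of the flow only, and assuming the FAS2040 export holding the mirrored metadata (the /mnt/fsimage_bkp directory from Table 8) is mounted on the secondary name node and the CDH3 init script names are unchanged, the recovery amounts to something like the following sketch.

    # On the secondary name node: copy the mirrored metadata from the NFS mount
    # into the local dfs.name.dir used by the name node daemon (paths per Table 8)
    cp -a /mnt/fsimage_bkp/. /local/hdfs/namedir/

    # Start the name node daemon on this host (CDH3 service name assumed)
    service hadoop-0.20-namenode start

    # Confirm file system health before restarting the failed TeraSort job
    hadoop fsck /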

Why This Matters

A majority of respondents to a recent ESG survey indicated that three hours or less of data analytics platform downtime would result in significant revenue loss or other adverse business impact. The single point of HDFS failure in the open source Hadoop distribution that was generally available as of this writing can lead to three or more hours of data analytics platform unavailability.

ESG Lab has confirmed that the NetApp Open Solution for Hadoop reduces name node recovery time from hours to minutes (five minutes during ESG Lab testing). NetApp E2660s with hardware RAID dramatically improved recoverability after simulated hard drive failures. The complexity and performance impact of a blacklisted data node were avoided: the 3TB TeraSort analytics job with NetApp completed more than twice as quickly as it did with a simulated internal hard drive failure.

6 http://media.netapp.com/documents/tr-3969.pdf


ESG Lab Validation Highlights

The capacity and performance of the NetApp solution scaled linearly when data nodes and NetApp E2660 storage arrays were added to a Hadoop cluster.

ESG Lab tested up to 24 data nodes and six NetApp E2660 arrays with 720TB of usable disk capacity.

Load performance testing with the TeraGen utility delivered linear performance scalability; a 24-node cluster sustained a high aggregate load throughput rate of 4.630 GB/sec.

Big data analytics performance testing with the TeraSort utility yielded linear performance scalability as data nodes and E2660 arrays were added.

Network-free hardware RAID and a lower Hadoop replication count reduced network overhead, which increased the aggregate performance of the cluster. A peak aggregate throughput improvement of 62.5% was recorded during the 24-node test (4.629 vs. 2.849 GB/sec).

A MapReduce job running during a simulated internal drive failure took more than twice as long (225%) to complete as a job running during the failure of a hardware RAID-protected E2660 drive.

An NFS-attached NetApp FAS2040 holding the name node metadata was used to recover from a primary name node failure in five minutes, compared to multiple hours in a traditional configuration.

Issues to Consider

While the results demonstrate that the NetApp Open Solution for Hadoop is ideally suited to meet the extreme compute and storage performance needs of big data analytics loads and long-running queries, applications with many small files, multiple writers, or many users with low response time requirements may be better served by traditional relational databases and storage solutions.

The single point of failure issue in the Hadoop distribution used during this ESG Lab Validation is being fixed in the open source community, but the fix was not yet available and therefore was not tested as part of ESG Lab's assessment of the NetApp Open Solution for Hadoop. Even so, future releases of Hadoop that resolve the name node failure problem are still expected to rely on NFS shared storage as a functional requirement. NetApp, with its FAS family, is an industry leader in NFS shared storage.

The test results presented in this report are based on benchmarks deployed in a controlled environment. Due to the many variables in each production data center environment, capacity planning and testing in your own environment are recommended.

A growing number of best practices, tuning guidelines, and proof points are available for reference when planning, deploying, and tuning a NetApp Open Solution for Hadoop. To learn more, visit: http://www.netapp.com/hadoop.


The Bigger Truth

Whether measured by increased revenues, market share gains, reduced costs, or scientific breakthroughs, data analytics has always played a key role in the ability to harness value from electronically-stored information. What has changed recently is that, as more business processes have become automated, information that was once stored in separate online and offline repositories and formats is now readily available for amalgamation and analysis to increase business insight and enhance decision support. Business executives are asking more of their data and are expecting faster and more impactful answers. The result is an ever-increasing priority on data analytics activities and, subsequently, more pressure on existing business analyst and IT teams to deliver.

Hadoop is a powerful open source framework for data analytics. It is an emerging and fast-growing solution that is considered one of the most impactful technology innovations since HTML. While ESG research indicates that a small number of organizations are using Hadoop at this time, interest and plans for adoption over the next 12-18 months are high (48%).

For those new to Hadoop, there is a steep learning curve. Very few enterprise applications are built to run on massively parallel clusters, so there is much to learn. The NetApp Open Solution for Hadoop is a tested and proven reference architecture, backed by enterprise-class storage, that reduces the risk and time associated with Hadoop adoption.

NetApp has embraced the open source Hadoop model and is working with major Hadoop distributors to support open source Hadoop software running on industry standard servers. Instead of promoting the use of a proprietary clustered file system, NetApp has embraced the use of the open source Hadoop Distributed File System (HDFS). Instead of promoting the use of SAN or NAS attached storage, NetApp has embraced the use of direct-attached storage. Using SAS direct-connected NetApp E2660 arrays with hardware RAID protection, the NetApp solution improves performance, scalability, and availability compared to typical internal hard drive Hadoop deployments. Thanks to an NFS-attached NetApp FAS2040 for shared access to metadata, recovery from a Hadoop name node failure is reduced from hours to minutes.

With up to 5 GB/sec of aggregate TeraGen load performance on a 24-node cluster, ESG Lab has confirmed that the NetApp Solution for Hadoop provides excellent near-linear performance scalability that dwarfs the capabilities of traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the efficiency and performance of the cluster by 66% compared to a traditional Hadoop deployment with triple mirroring. The value of transparent RAID recovery was obvious after drive failures were simulated: the performance impact on a long running sort job was less than 6% compared to more than 200% for a simulated internal drive failure that blacklisted a Hadoop data node.

If you’re looking to accelerate the delivery of insight to your business with an enterprise-class big data analytics infrastructure, ESG Lab recommends a close look at the NetApp Open Solution for Hadoop—it reduces risk with a storage solution that delivers reliability, fast deployment, and scalability of open source Hadoop for the enterprise.


Appendix

The configuration of the test bed that was used during the ESG Lab Validation is summarized in Table 5.

Table 5. Configuration Summary

Servers

HDFS data nodes: 24 servers, each with a quad-core Intel Xeon CPU and 48GB RAM
HDFS name node: 1 server with a quad-core Intel Xeon CPU and 48GB RAM
HDFS secondary name node: 1 server with a quad-core Intel Xeon CPU and 48GB RAM
HDFS job tracker: 1 server with a quad-core Intel Xeon CPU and 48GB RAM

Network

10 GbE host connect: One 10 GbE connection for all data nodes, the name node, the secondary name node, and the job tracker
10 GbE switched fabric: Cisco Nexus 5010, 10 GigE, jumbo frames (MTU=9000)

Storage

HDFS data node storage: 6 NetApp E2660 arrays, 6Gb SAS host connect, 6+1 RAID-5, 2TB near-line SAS 7.2K RPM drives, 360 drives total, version 47.77.19.99
HDFS name node storage: 1 NetApp FAS2040, 1GbE NAS host connect, 6 disks, 1TB each, 7.2K RPM, Data ONTAP 8.0.2 7-mode
Operating system boot drives: Local 1TB 7.2K RPM SATA drive in each node

Software

Operating system: Red Hat Enterprise Linux version 5, update 6 (RHEL 5.6)
Analytics platform: Cloudera Hadoop (CDH3u2)

HDFS Configuration Changes vs. Cloudera V3U2 Distribution

Local file system: XFS
Map/reduce tasks per data node: 7/5

Table 6 lists the differences between Hadoop core-site.xml defaults and the settings used during ESG Lab testing.

Table 6. Hadoop core-site Settings

Option Name | Purpose | Actual / Default

fs.default.name | Name of the default file system, specified as a URI (IP address or hostname of the name node along with the port to be used). | hdfs://10.61.189.64:8020/ [Default: file:///]

webinterface.private.actions | Enables or disables certain management functions within the Hadoop web user interface, including the ability to kill jobs and modify job priorities. | true / false

fs.inmemory.size.mb | Memory in MB to be used for merging map outputs during the reduce phase. | 200 / 100

io.file.buffer.size | Size in bytes of the read/write buffer. | 262144 / 4096

topology.script.file.name | Script used to resolve the slave node's name or IP address to a rack ID; used to invoke Hadoop rack awareness. The default value is null and results in all slaves being given a rack ID of "/default-rack." | /etc/hadoop/conf/topology_script [Default: null]

topology.script.number.args | Sets the maximum acceptable number of arguments to be sent to the topology script at one time. | 1 / 100

hadoop.tmp.dir | Hadoop temporary directory storage. | /home/hdfs/tmp [Default: /tmp/hadoop-${user.name}]


Table 7 lists the differences between Linux sysctl.conf defaults and the settings used during ESG Lab testing.

Table 7. Linux sysctl.conf Settings

Parameter | Description | Actual / Default

net.ipv4.ip_forward | Controls IP packet forwarding. | 0 / 0

net.ipv4.conf.default.rp_filter | Controls source route verification. | 1 / 0

net.ipv4.conf.default.accept_source_route | Do not accept source routing. | 0 / 1

kernel.sysrq | Controls the system request debugging functionality of the kernel. | 0 / 1

kernel.core_uses_pid | Controls whether core dumps append the PID to the core filename; useful for debugging multithreaded applications. | 1 / 0

kernel.msgmnb | Controls the maximum size of a message queue, in bytes. | 65536 / 16384

kernel.msgmax | Controls the maximum size of a single message, in bytes. | 65536 / 8192

kernel.shmmax | Controls the maximum shared memory segment size, in bytes. | 68719476736 / 33554432

kernel.shmall | Controls the maximum total amount of shared memory, in pages. | 4294967296 / 2097512

net.core.rmem_default | Sets the default OS receive buffer size. | 262144 / 129024

net.core.rmem_max | Sets the maximum OS receive buffer size. | 16777216 / 131071

net.core.wmem_default | Sets the default OS send buffer size. | 262144 / 129024

net.core.wmem_max | Sets the maximum OS send buffer size. | 16777216 / 131071

net.core.somaxconn | Maximum number of sockets the kernel will serve at one time; set on the name node, secondary name node, and job tracker. | 1000 / 128

fs.file-max | Sets the total number of file descriptors. | 6815744 / 4847448

net.ipv4.tcp_timestamps | Disables TCP timestamps if set to "0". | 0 / 1

net.ipv4.tcp_sack | Enables selective ACK for TCP. | 1 / 1

net.ipv4.tcp_window_scaling | Enables TCP window scaling. | 1 / 1

kernel.shmmni | Sets the maximum number of shared memory segments. | 4096 / 4096

kernel.sem | Sets the maximum number and size of semaphore sets that can be allocated. | 250 32000 100 128 / 250 32000 32 128

fs.aio-max-nr | Sets the maximum number of concurrent asynchronous I/O requests. | 1048576 / 65536

net.ipv4.tcp_rmem | Sets the min, default, and max receive window size. | 4096 262144 16777216 / 4096 87380 4194304

net.ipv4.tcp_wmem | Sets the min, default, and max transmit window size. | 4096 262144 16777216 / 4096 87380 4194304

net.ipv4.tcp_syncookies | Disables TCP syncookies if set to "0". | 0 / 0

sunrpc.tcp_slot_table_entries | Sets the maximum number of in-flight RPC requests between a client and a server; set on the name node and secondary name node to improve NFS performance. | 128 / 16

vm.dirty_background_ratio | Maximum percentage of active system memory that can be used for dirty pages before dirty pages are flushed to storage. | 1 / 10
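For completeness, settings like those in Table 7 are typically placed in /etc/sysctl.conf on RHEL 5 and applied without a reboot; the commands below are a minimal sketch and are not taken from the report.

    # Re-read /etc/sysctl.conf and apply the tuned values
    sysctl -p

    # Spot-check an individual parameter
    sysctl net.core.somaxconn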


Table 8 lists the differences between Hadoop hdfs-site.xml defaults and the settings used during ESG Lab testing.

Table 8. HDFS Site Settings

Option Name | Purpose | Actual / Default

dfs.name.dir | Path on the local file system where the name node stores the namespace and transaction logs persistently. If this is a comma-delimited list of directories (as used in this configuration), the name table is replicated in all of the directories for redundancy. Note: /mnt/fsimage_bkp is a location on NFS-mounted NetApp FAS storage where name node metadata is mirrored and protected, a key feature of NetApp's Hadoop solution. | /local/hdfs/namedir,/mnt/fsimage_bkp [Default: ${hadoop.tmp.dir}/dfs/name]

dfs.hosts | Specifies a list of machines authorized to join the Hadoop cluster as data nodes. | /etc/hadoop-0.20/conf/dfs_hosts [Default: null]

dfs.data.dir | Directory paths on the data node local file systems where HDFS data blocks are stored. | /disk1/data,/disk2/data [Default: ${hadoop.tmp.dir}/dfs/data]

fs.checkpoint.dir | Directory path where checkpoint images are stored (used by the secondary name node). | /home/hdfs/namesecondary1 [Default: ${hadoop.tmp.dir}/dfs/namesecondary]

dfs.replication | HDFS block replication count. The Hadoop default is 3; the NetApp Hadoop solution uses a replication setting of 2. | 2 / 3

dfs.block.size | HDFS data storage block size in bytes. | 134217728 (128MB) / 67108864

dfs.namenode.handler.count | Number of server threads for the name node. | 128 / 10

dfs.datanode.handler.count | Number of server threads for the data node. | 64 / 3

dfs.max-repl-streams | Maximum number of replications a data node is allowed to handle at one time. | 8 / 2

dfs.datanode.max.xcievers | Maximum number of files a data node will serve at one time. | 4096 / 256


Table 9 lists the differences between mapred-site.xml defaults and the settings used during ESG Lab testing.

Table 9. mapred-site Settings

Option Name | Purpose | Actual / Default

mapred.job.tracker | Job tracker address as a URL (job tracker IP address or hostname with port number). | 10.61.189.66:9001 [Default: local]

mapred.local.dir | Comma-separated list of local file system paths where temporary MapReduce data is written. | /disk1/mapred/local,/disk2/mapred/local [Default: ${hadoop.tmp.dir}/mapred/local]

mapred.hosts | Specifies the file containing the list of nodes allowed to join the Hadoop cluster as task trackers. | /etc/hadoop-0.20/conf/mapred.hosts [Default: null]

mapred.system.dir | Path in HDFS where the MapReduce framework stores control files. | /mapred/system [Default: ${hadoop.tmp.dir}/mapred/system]

mapred.reduce.tasks.speculative.execution | Enables the job tracker to detect slow-running reduce tasks, assign them to run in parallel on other nodes, use the first available results, and then kill the slower-running reduce tasks. | false / true

mapred.map.tasks.speculative.execution | Enables the job tracker to detect slow-running map tasks, assign them to run in parallel on other nodes, use the first available results, and then kill the slower-running map tasks. | false / true

mapred.tasktracker.reduce.tasks.maximum | Maximum number of reduce tasks that can run simultaneously on a single task tracker node. | 5 / 2

mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks that can run simultaneously on a single task tracker node. | 7 / 2

mapred.child.java.opts | Java options passed to the task tracker child processes (in this case, 1GB of heap memory for each individual JVM). | -Xmx1024m / -Xmx200m

io.sort.mb | Total amount of buffer memory allocated to each merge stream while sorting files on the mapper, in MB. | 340 / 100

mapred.jobtracker.taskScheduler | Job tracker task scheduler to use (in this case, the FairScheduler). | org.apache.hadoop.mapred.FairScheduler [Default: org.apache.hadoop.mapred.JobQueueTaskScheduler]

io.sort.factor | Number of streams to merge at once while sorting files. | 100 / 10

mapred.output.compress | Enables/disables MapReduce output file compression. | false / false

mapred.compress.map.output | Enables/disables map output compression. | false / false

mapred.output.compression.type | Sets the output compression type. | block / record

mapred.reduce.slowstart.completed.maps | Fraction of the map tasks that should be complete before reducers are scheduled for the MapReduce job. | 0.05 / 0.05

mapred.reduce.tasks | Total number of reduce tasks available for the entire cluster. | 40 for 8 data nodes, 80 for 16 data nodes, 120 for 24 data nodes [Default: 1]

mapred.map.tasks | Total number of map tasks available for the entire cluster. | 56 for 8 data nodes, 112 for 16 data nodes, 168 for 24 data nodes [Default: 2]

mapred.reduce.parallel.copies | Number of parallel threads used by reduce tasks to fetch outputs from map tasks. | 64 / 5

mapred.inmem.merge.threshold | Number of map outputs in the reduce task tracker's memory at which map data is merged and spilled to disk. | 0 / 1000

mapred.job.reduce.input.buffer.percent | Percent usage of the map outputs buffer at which the map output data is merged and spilled to disk. | 1 / 0

mapred.job.tracker.handler.count | Number of job tracker server threads for handling RPCs from the task trackers. | 128 / 10

tasktracker.http.threads | Number of task tracker worker threads for fetching intermediate map outputs for reducers. | 60 / 40

mapred.job.reuse.jvm.num.tasks | Maximum number of tasks that can be run in a single JVM for a job; a value of "-1" sets the number to "unlimited." | -1 / 1

mapred.jobtracker.restart.recover | Enables job recovery after restart. | true / false
