Date posted: 11-Nov-2014
Category: Technology
Uploaded by: huguk
Why was it so important to us to open the MapReduce framework
12/11/2013
Syncsort Confidential and Proprietary - do not copy or distribute
Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
For which results?
Syncsort
• 50% of all mainframes run Syncsort
• 1,500 Mainframe Customers: Most used & trusted 3rd party mainframe software
• Speed leader for ETL & Sort
• A history of innovation
• 25+ Issued & Pending Patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated approach to Hadoop ETL
For 40 years we have been helping companies solve their big data issues…even before they knew the name Big Data!
Our customers are achieving the impossible, every day!
Integrating Big Data… Smarter!
Key Partners
Smart Contributions to Improve Hadoop
Plugin Shipping on CDH 4.2 and later
Augmenting Critical Batch Processing Capabilities
JIRA   Description
4807   Allow MapOutputBuffer to be pluggable
4808   Allow Reduce-side merge to be pluggable
4809   Make classes required for 2454 public
4812   Create reduce input merger plug-in
4842   Shuffle race can hang reducer
2461   HDFS file name globbing in libhdfs
4482   Backport of 2454 to MapReduce 1 & 1.2
Opening the MapReduce Framework
Mapper → Output Sorter → Shuffle → Input Sorter → Reducer
Callouts on the diagram:
• Here to perform functional logic on our engine (shown at two stages)
• Here and here to replace MapReduce native sort (the Output Sorter and the Input Sorter)
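To make the idea of a replaceable sort concrete, here is a toy, single-process Python model of the five stages named on the slide. It is not Hadoop code and not the DMX-h plug-in: every function name below (`run_job`, `native_output_sort`, `plugin_input_merge`, and so on) and the word-count logic are hypothetical, chosen only to show that the mapper and reducer stay the same while the two sorter stages are swapped.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

KEY = itemgetter(0)

def native_output_sort(pairs):
    """Stand-in for the built-in map-side sort."""
    return sorted(pairs, key=KEY)

def native_input_merge(runs):
    """Stand-in for the built-in reduce-side merge of sorted runs."""
    return list(heapq.merge(*runs, key=KEY))

def run_job(splits, mapper, reducer,
            output_sorter=native_output_sort,
            input_sorter=native_input_merge,
            num_reducers=2):
    # Mapper + Output Sorter: each split plays the role of one map task.
    map_outputs = []
    for split in splits:
        pairs = [kv for record in split for kv in mapper(record)]
        map_outputs.append(output_sorter(pairs))
    # Shuffle: route every pair to a reducer by a deterministic key hash.
    runs = [[[] for _ in splits] for _ in range(num_reducers)]
    for task, pairs in enumerate(map_outputs):
        for key, value in pairs:
            part = zlib.crc32(key.encode()) % num_reducers
            runs[part][task].append((key, value))
    # Input Sorter + Reducer: merge the runs, then reduce each key group.
    results = {}
    for part_runs in runs:
        merged = input_sorter(part_runs)
        for key, group in groupby(merged, key=KEY):
            results[key] = reducer(key, [v for _, v in group])
    return results

def word_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(key, values):
    return sum(values)

# A hypothetical plug-in pair: different code, same contract.
def plugin_output_sort(pairs):
    ordered = list(pairs)
    ordered.sort(key=KEY)
    return ordered

def plugin_input_merge(runs):
    return sorted((kv for run in runs for kv in run), key=KEY)

splits = [["big data big iron"], ["big data hadoop"]]
native = run_job(splits, word_mapper, count_reducer)
plugged = run_job(splits, word_mapper, count_reducer,
                  output_sorter=plugin_output_sort,
                  input_sorter=plugin_input_merge)
print(native == plugged, native)
```

The only contract the framework needs from a sorter in this sketch is "pairs in, key-ordered pairs out", which is why an alternative implementation can be dropped in without touching the functional logic.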
Syncsort: Just integrating data … faster
Sort Join Aggregate Copy Merge
+
A simple DI engine easy to deploy, operate, and administer
ETL-like development GUI
Auto-tuning
Best patented algorithms
Fast, fast, faster than any other
The more data the better
From Data to Big Data
[Timeline figure. Decades: 60s, 70s, 80s, 90s, 2000s, 2010s, Next? Eras: Mainframe, PC, Internet Revolution, Mobile & Social Media Revolution. Cost ($$$) drivers: Volume, Variety, Velocity (Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time).]
Smart Architecture
Hadoop Cluster
Hadoop Integration… for Real (No Code Generation. No Compiling. No Bolts. No Nuts!)
• Runs natively within MapReduce
• Small footprint installs on every node
• Open source contributions extend capabilities of MapReduce
  – Pluggable sort
  – Expanded use cases (e.g. “No sort” option)
  – Vertical scalability
  – Design flexibility (MapMapReduceReduce)
Unleash Hadoop’s Potential
No need to worry about this…
Because Mainframe Is Big Data Too!
Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe
Connect
• Read files directly from mainframe
• No software required on mainframe
• Already installed on 50% of mainframes
Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required
Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
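For readers unfamiliar with the "Translate" step, the following Python sketch shows what decoding EBCDIC text and a packed-decimal (COMP-3) field involves. It is an illustration only, not how DMX-h does it. The record layout (a 6-byte name followed by a 3-byte amount with two decimal places), the choice of code page `cp037`, and the function names are all assumptions made for this example; real mainframe files may use other code pages and layouts.

```python
from decimal import Decimal

def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two digits per byte, last nibble is the sign."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()
    if any(n > 9 for n in nibbles):
        raise ValueError("invalid digit nibble in packed decimal")
    value = int("".join(map(str, nibbles)) or "0")
    if sign_nibble in (0x0B, 0x0D):  # 0xB and 0xD mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)

def parse_record(record: bytes) -> dict:
    """Parse the hypothetical 9-byte layout: name (EBCDIC, 6 bytes) + amount (COMP-3, 3 bytes)."""
    name = record[:6].decode("cp037").rstrip()
    amount = unpack_comp3(record[6:9], scale=2)
    return {"name": name, "amount": amount}

# "HADOOP" in EBCDIC (cp037) followed by packed decimal 12345 with a positive sign.
sample = bytes.fromhex("c8c1c4d6d6d7") + bytes.fromhex("12345c")
parsed = parse_record(sample)
print(parsed)
```

A "multi-format" file would need a dispatch step that picks a layout per record type; that is left out here to keep the sketch short.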
Syncsort DMX-h + Cloudera Manager
Installation
Management
Monitoring
Support Integration
API
CDH Cluster + ISV software
Cloudera Manager
Syncsort DMX-h
CDH Nodes: DMX-h on every CDH node
Test cases
Sort Acceleration
– Terasort
  • Run terasort with DMX-h and without DMX-h in various configurations to compare performance.
ETL
– Use DMX-h to perform several different ETL jobs and compare against equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
  • File Change Data Capture (CDC)
  • Web Log Aggregation
File CDC
Pig, Java
149 Lines of Code
70 Lines of Code
DMX-h
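The deck does not show the File CDC job itself. As a reminder of what file change data capture means in general, here is a minimal Python sketch that compares an old and a new snapshot of a keyed file and emits inserts, updates, and deletes. The function name `file_cdc`, the CSV layout, and the sample rows are hypothetical; this is neither the Pig, the Java, nor the DMX-h version that was benchmarked.

```python
import csv
import io

def file_cdc(old_rows, new_rows, key):
    """Return (action, row) pairs describing how new_rows differs from old_rows."""
    old = {key(row): row for row in old_rows}
    new = {key(row): row for row in new_rows}
    changes = []
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            changes.append(("INSERT", new[k]))
        elif k not in new:
            changes.append(("DELETE", old[k]))
        elif old[k] != new[k]:
            changes.append(("UPDATE", new[k]))
    return changes

def read_rows(text):
    return [tuple(row) for row in csv.reader(io.StringIO(text))]

yesterday = read_rows("1,alice,paris\n2,bob,london\n3,carol,rome\n")
today = read_rows("1,alice,paris\n2,bob,leeds\n4,dave,oslo\n")

changes = file_cdc(yesterday, today, key=lambda row: row[0])
for action, row in changes:
    print(action, row)
```

On a cluster the same comparison is usually expressed as a join of the two snapshots on the record key, which is why the sort and merge stages dominate its cost.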
Web Log Aggregation
Pig, Java
DMX-h
94 Lines of Code
48 Lines of Code
Cluster Configuration – DMX-h Ran on 763 Nodes!
Cluster Specs:
– 763 node cluster
  • 1 node – job tracker
  • 1 node – name node
  • 1 node – secondary name node
  • 760 data and task nodes
Hadoop cluster configuration changes (from defaults):
– 128 MB HDFS Block size (file.blocksize)
– 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
– Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)
Cluster Node Specs:
– 12 cores - Dual Intel Westmere (Hex-core) CPUs, 2.93 GHz, 12 MB Cache
– 48 GB DDR3 RDIMM Memory
– 12 x 2 TB 3.5” drives, Seagate 7200 rpm
– Disk 0 + Disk 1 are RAID1 (mirrored) for OS
  • 100 MB/Sec write
  • 115 MB/Sec read
– 10 single disk JBOD
– Mellanox ConnectX®-3 VPI NIC (Supported data rates 40GbE; 10GbE)
– RHEL 6.1 64-bit
– Java 1.6 (jdk.x86_64-2000:1.6.0_29-fcs)
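The figures on this slide imply a few cluster-wide totals that are easy to derive. The Python sketch below does the arithmetic; it uses only numbers stated on the slide, plus two assumptions that the deck does not state: one map task per 128 MB block, and 1 GB = 1,024 MB. The function and constant names are made up for the example.

```python
import math

DATA_NODES = 760            # data and task nodes
MAP_SLOTS_PER_NODE = 22     # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 4   # mapred.tasktracker.reduce.tasks.maximum
MAP_HEAP_GB = 1.5           # per map task JVM
REDUCE_HEAP_GB = 4.0        # per reduce task JVM
BLOCK_MB = 128              # HDFS block size

def cluster_capacity():
    """Cluster-wide task slots and the per-node heap if every slot is busy."""
    return {
        "map_slots": DATA_NODES * MAP_SLOTS_PER_NODE,
        "reduce_slots": DATA_NODES * REDUCE_SLOTS_PER_NODE,
        "heap_per_node_gb": (MAP_SLOTS_PER_NODE * MAP_HEAP_GB
                             + REDUCE_SLOTS_PER_NODE * REDUCE_HEAP_GB),
    }

def input_splits(data_gb, block_mb=BLOCK_MB):
    """Number of map tasks, assuming one split per HDFS block."""
    return math.ceil(data_gb * 1024 / block_mb)

def map_waves(data_gb):
    """How many rounds of map tasks are needed to cover all splits."""
    return math.ceil(input_splits(data_gb) / cluster_capacity()["map_slots"])

capacity = cluster_capacity()
print(capacity)
print(input_splits(1024), map_waves(1024))
print(input_splits(102400), map_waves(102400))
```

Under these assumptions a 1,024 GB input fits in a single wave of map tasks, while the 102,400 GB run needs about 49 waves. Note also that the configured heap sizes add up to 49 GB per node if every slot is occupied at once, slightly more than the 48 GB of physical memory listed.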
Sort Acceleration - Terasort
Use Case | ETL or Sort Acceleration | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
TERASORT | Sort Acceleration | Native | 512 | 0:01:47 | 0:01:45 | 2% | 12,863 | 12,873 | 0% | 114,297 | 62,491 | 45% | 6.5 | 6.6
TERASORT | Sort Acceleration | Native | 1,024 | 0:02:29 | 0:01:11 | 52% | 14,512 | 14,522 | 0% | 194,896 | 98,972 | 49% | 9.3 | 19.4
TERASORT | Sort Acceleration | Native | 1,536 | 0:04:02 | 0:01:23 | 66% | 14,684 | 14,694 | 0% | 287,055 | 143,759 | 50% | 8.6 | 25.0
TERASORT | Sort Acceleration | Native | 4,096 | 0:03:31 | 0:02:29 | 29% | 31,520 | 31,549 | 0% | 927,379 | 380,442 | 59% | 26.2 | 37.0
TERASORT | Sort Acceleration | Native | 10,242 | 0:08:51 | 0:05:14 | 41% | 47,935 | 47,951 | 0% | 2,835,927 | 1,460,101 | 49% | 26.4 | 44.6
TERASORT | Sort Acceleration | Native | 20,484 | 0:14:55 | 0:12:28 | 16% | 106,153 | 105,239 | 1% | 6,112,296 | 3,696,727 | 40% | 31.0 | 37.4
TERASORT | Sort Acceleration | Native | 102,400 | 1:12:12 | 0:51:59 | 28% | 387,262 | 387,211 | 0% | 30,436,624 | 16,589,332 | 45% | 32.3 | 44.9
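The deck does not say how the improvement and throughput columns were computed. The Python sketch below shows formulas that appear to match the Terasort figures: improvement as the relative reduction against the native run, and throughput as data size divided by elapsed time and by the 760 data nodes (taking 1 GB = 1,024 MB). These formulas are inferred, not confirmed by the source. The percentages round to the reported values for the rows checked here, while the throughput lands within about 0.1 MB/Sec/Node of the first two rows.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def improvement_pct(baseline: float, candidate: float) -> int:
    """Relative reduction of candidate versus baseline, as a whole percentage."""
    return round((baseline - candidate) / baseline * 100)

def mb_per_sec_per_node(data_gb: float, hms: str, nodes: int = 760) -> float:
    """Assumed throughput formula: MB processed per second per data node."""
    return data_gb * 1024 / to_seconds(hms) / nodes

# Values copied from the 1,024 GB Terasort row.
elapsed = improvement_pct(to_seconds("0:02:29"), to_seconds("0:01:11"))
cpu = improvement_pct(194_896, 98_972)
native_rate = mb_per_sec_per_node(1024, "0:02:29")
dmx_rate = mb_per_sec_per_node(1024, "0:01:11")
print(elapsed, cpu, round(native_rate, 1), round(dmx_rate, 1))
```

Because the elapsed times are reported to the second, short runs such as the 512 GB case are sensitive to rounding, so small differences between recomputed and reported throughput are to be expected.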
File CDC
Use Case | ETL or Sort Acceleration | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
FileCDC | ETL | Pig | 148 | 0:05:31 | 0:01:33 | 72% | 79,876 | 79,559 | 0% | 79,876 | 79,559 | 0% | 0.6 | 2.2
FileCDC | ETL | Pig | 450 | 0:05:11 | 0:01:58 | 62% | 243,834 | 182,869 | 25% | 243,834 | 182,869 | 25% | 1.9 | 5.3
FileCDC | ETL | Pig | 1,515 | 0:07:49 | 0:03:44 | 52% | 845,263 | 557,226 | 34% | 845,263 | 557,226 | 34% | 4.4 | 9.4
Web Log Aggregation
Use Case | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
WebLogAggregation - Split Size & fixes | Pig | 2,067 | 0:01:12 | 0:00:58 | 19% | 13,499 | 7,813 | 42% | 145,972 | 56,496 | 61% | 40.1 | 49.8
WebLogAggregation - Split Size & fixes | Pig | 4,135 | 0:01:42 | 0:01:23 | 19% | 18,003 | 15,579 | 13% | 300,627 | 152,390 | 49% | 56.1 | 69.6
WebLogAggregation - Split Size & fixes | Pig | 10,240 | 0:05:16 | 0:02:04 | 61% | 40,773 | 39,091 | 4% | 807,473 | 335,537 | 58% | 45.3 | 115.4
WebLogAggregation - Split Size & fixes | Pig | 20,480 | 0:07:54 | 0:06:58 | 12% | 78,654 | 78,128 | 1% | 1,339,453 | 568,107 | 58% | 60.4 | 68.4
A Smarter Approach…
…and Quite Possibly The Only Approach!
Test Drive DMX-h: Bridge the Gap Between Big Iron & Big Data!
• Self-contained image
• Use case accelerators for mainframe, Hadoop and more!
Running on CDH
www.syncsort.com/try