Post on 17-Jul-2015
Profiling Hadoop Applications
Basant Verma
Agenda
• Profiling General Background
• Available Options
• Profile using Free and Open Source tools
• Profile using YourKit
• Other troubleshooting tools
What does Profiling Provide?
• Profiling runtime / CPU usage:
– what lines of code the program is spending the most time in
– what call/invocation paths were used to get to these lines (naturally represented as tree structures)
• Profiling memory usage:
– what kinds of objects are sitting on the heap
– where were they allocated
– who is pointing to them now
– memory leaks
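The memory questions above (what is on the heap, where it was allocated, who points to it) are easiest to see on a deliberately leaky program. The following is only an illustrative sketch; the class LeakyCache and its methods are hypothetical names, not something from Hadoop or any profiler.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical leak: a static list keeps every buffer reachable, so the GC cannot reclaim them. */
final class LeakyCache {
    // "Who is pointing to them now": this static field is the reference chain
    // a heap analysis tool would report for the retained byte[] objects.
    private static final List<byte[]> RETAINED = new ArrayList<>();

    /** "Where were they allocated": each call allocates a buffer and never releases it. */
    static void handleRequest(int payloadBytes) {
        RETAINED.add(new byte[payloadBytes]);
    }

    /** Number of buffers still reachable through the static list. */
    static int retainedCount() {
        return RETAINED.size();
    }

    /** Total payload bytes still reachable through the static list. */
    static long retainedBytes() {
        long total = 0;
        for (byte[] buffer : RETAINED) {
            total += buffer.length;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            handleRequest(1024);
        }
        System.out.println("retained buffers=" + retainedCount() + ", bytes=" + retainedBytes());
    }
}
```

In a heap dump of this program, byte[] would dominate the "kinds of objects" view, and following the references back from any of them would lead to the static RETAINED field.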
Profiler Types and Components
• Components needed for profiling
– Profiling Agent
• Collects profiled data (samples, traces, exceptions etc.)
– Analysis Tool
• Provides an interface for analyzing profiled data and helps the user identify potential problems
• Types of Profilers
– insertion
– sampling
– instrumenting
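The sampling type can be illustrated with a small in-process sketch. This is not how hprof works internally (hprof is a JVMTI agent); it only polls Thread.getAllStackTraces() at a fixed interval and counts the method on top of each thread's stack. The class TinySampler and its methods are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a sampling profiler: counts the top stack frame of each thread. */
final class TinySampler {
    private final Map<String, Integer> hits = new HashMap<>();

    /** Takes one sample: records the top frame of every live thread except the caller. */
    void sampleOnce() {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if (e.getKey() == Thread.currentThread() || stack.length == 0) {
                continue; // skip the sampler itself and threads without Java frames
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(top, 1, Integer::sum);
        }
    }

    /** Samples up to count times, sleeping intervalMs between samples. */
    Map<String, Integer> run(int count, long intervalMs) {
        for (int i = 0; i < count; i++) {
            sampleOnce();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7; // busy work for the sampler to observe
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();

        Map<String, Integer> result = new TinySampler().run(20, 10);
        worker.interrupt();
        result.forEach((method, n) -> System.out.println(n + "\t" + method));
    }
}
```

Methods that appear in many samples are where threads spend their time. An instrumenting profiler would instead modify the code to record every method entry and exit, which is more exact but adds more overhead.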
Available Options
• Sun JDK Tools
– hprof: Profiler (uses JVMTI)
– jmap: Provides a memory map (dump) of the heap
– jhat: Analyzes a memory dump
– jstack: Provides a thread dump
– jvisualvm: GUI-based profile data analyzer
• Open Source
– VisualVM (same as jvisualvm but downloaded as an independent app)
• Uses HPROF internally for profiling. Provides GUI for analysis of heap dumps and profiler outputs
– NetBeans Profiler
• Similar to VisualVM but integrated into the IDE
– Eclipse MAT (Memory Analyzer Tool)
• Can load .hprof files
• Commercial
– YourKit
– JProfiler
USING HPROF
Official hprof Documentation
usage: java -Xrunhprof:[help]|[<option>=<value>, ...]
Option Name and Value  Description                     Default
---------------------  -----------                     -------
heap=dump|sites|all    heap profiling                  all
cpu=samples|times|old  CPU usage                       off
monitor=y|n            monitor contention              n
format=a|b             text(txt) or binary output      a
file=<file>            write data to file              java.hprof[.txt]
depth=<size>           stack trace depth               4
interval=<ms>          sample interval in ms           10
cutoff=<value>         output cutoff point             0.0001
lineno=y|n             line number in traces?          y
thread=y|n             thread in traces?               n
doe=y|n                dump on exit?                   y
msa=y|n                Solaris micro state accounting  n
force=y|n              force output to <file>          y
verbose=y|n            print messages about dumps      y
http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html
Sample hprof usage
• To measure CPU usage, try the following:
java -Xrunhprof:cpu=samples,depth=6,heap=dump <main-class>
• Settings:
– Takes samples of CPU execution
– Records call traces that include the last 6 levels on the stack
– Dumps the heap map (bigger file size but helps in finding problems)
• Creates the file java.hprof.txt in the current directory
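To try these options you need something CPU-bound to run. The class below is a minimal sketch of such a workload; HotLoop and sumOfSquares are hypothetical names for this example only.

```java
/** Hypothetical CPU-bound workload to try the hprof options on. */
final class HotLoop {
    /** Deliberately naive loop so that this method dominates the CPU samples. */
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        long checksum = 0;
        for (int round = 0; round < 2_000; round++) {
            // XOR keeps the result live without risking arithmetic overflow
            checksum ^= sumOfSquares(1_000_000 + round);
        }
        System.out.println("checksum=" + checksum);
    }
}
```

After compiling with javac HotLoop.java, running java -Xrunhprof:cpu=samples,depth=6 HotLoop on a JDK that still ships hprof (JDK 8 or earlier; the agent was removed in JDK 9) should list HotLoop.sumOfSquares near the top of the CPU samples section of java.hprof.txt.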
HPROF with Hadoop
• Hadoop uses hprof as the default profiler
• Profiling related parameters
Purpose                             JobConf API                Command line Parameter
-------                             -----------                ----------------------
Enable profiling                    setProfileEnabled(true)    mapred.task.profile=true
Additional parameters for profiler  setProfileParams(…)        mapred.task.profile.params
Range of tasks to profile           setProfileTaskRange(…)     mapred.task.profile.maps
                                                               mapred.task.profile.reduces
Example
• Using Java API
jobConf.setProfileEnabled(true);
jobConf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites" +
",depth=4,thread=y,file=%s");
// first argument true: range applies to map tasks
jobConf.setProfileTaskRange(true, "0-2");
// first argument false: range applies to reduce tasks
jobConf.setProfileTaskRange(false, "0-1");
• Using Command line parameters
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
-Dmapred.task.profile=true \
-Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,file=%s \
-Dmapred.task.profile.maps=0-2 \
-Dmapred.task.profile.reduces=0-1 \
input output
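The profiler parameter string is easy to break with a stray space or quote, so it can help to assemble it programmatically. The helper below is a sketch that depends only on the JDK; HprofParams is a hypothetical class, not part of the Hadoop API, and its output is meant to be passed to setProfileParams(…) or mapred.task.profile.params.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that assembles the value for mapred.task.profile.params. */
final class HprofParams {
    // LinkedHashMap keeps the options in the order they were added
    private final Map<String, String> options = new LinkedHashMap<>();

    HprofParams set(String name, String value) {
        options.put(name, value);
        return this;
    }

    /** Joins the options as comma-separated name=value pairs with no whitespace. */
    String build() {
        StringJoiner joined = new StringJoiner(",", "-agentlib:hprof=", "");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joined.add(e.getKey() + "=" + e.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String params = new HprofParams()
                .set("cpu", "samples")
                .set("heap", "sites")
                .set("depth", "4")
                .set("thread", "y")
                .set("file", "%s") // kept literally, as in the example above
                .build();
        System.out.println(params);
    }
}
```

The %s is left as a literal placeholder, exactly as in the deck's example.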
Collecting Profiler Output
• Hadoop JobClient automatically downloads profile logs from all the profiled tasks
– If output format type is not specified, hprof creates profile output in text format (format=a)
• Profiler outputs are also available via History WebUI
• You can also download profile output using curl
curl -o attempt_201305161037_0004_m_000000_0.hprof \
"http://17.115.13.191:50060/tasklog?plaintext=true&attemptid=attempt_201305161037_0004_m_000000_0&filter=profile"
Task User Log
Analyze Profiler output
• You can use VisualVM, NetBeans profiler or YourKit for analyzing the profiling data.
– These tools support only the binary format of hprof output (i.e. option format=b)
• Example
– Run profiler with Hadoop job
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
-Dmapred.task.profile=true \
-Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,format=b,file=%s \
input output
– Load profiler output using VisualVM menu option
Analyze Profile Output in VisualVM
Object Query Language
• VisualVM and jhat support a special query language (OQL) to query the Java heap.
– Example: Select all Strings with length 1K or more
select s from java.lang.String s where s.count >= 1024
• More information about OQL is available at http://visualvm.java.net/oqlhelp.html
Analyze Profile Output in Eclipse MAT
Profiling Pig Jobs
• Use Hadoop command line parameters
pig -Dmapred.task.profile=true \
-Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,thread=y,verbose=n \
-Dmapred.task.profile.maps=0-2 \
-Dmapred.task.profile.reduces=0-0 \
mypigscript.pig
• More information about Pig job profiling is available on the Pig Wiki
– https://cwiki.apache.org/PIG/howtoprofile.html
Profiling Hive Queries
• Set appropriate Hadoop parameters before submitting the queries
hive> set mapred.task.profile=true;
hive> set mapred.task.profile.params=-agentlib:hprof=heap=dump,format=b,file=%s;
hive> set mapred.task.profile.maps=0-2;
hive> set mapred.task.profile.reduces=0-0;
hive>
hive> <hive query>
USING YOURKIT
YourKit Profiler - Summary
• Commercial Java Profiling Tool
– Free trial and Open Source licenses are available
• Used by many Open Source projects including Hadoop, Pig, Hive etc.
• Features
– On-Demand Profiling
– CPU, Memory and Concurrency profiling methods
– Has IDE integration (Eclipse, NetBeans, IntelliJ)
– Above all, has relatively low performance overhead
Using YourKit Profiler
• You will need to install YourKit profiler (just the profiler lib) onto each TaskTracker
• Tell Hadoop to use a different profiler
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
-Dmapred.task.profile=true \
-Dmapred.task.profile.params=-agentpath:<yourkit_path>/libyjpagent.jnilib=dir=/tmp/yourkit_snapshot,sampling,disablej2ee \
-Dmapred.task.profile.maps=0-2 \
-Dmapred.task.profile.reduces=0-1 \
input output
• Theoretically, you can also use DistributedCache to make binaries available on TaskTracker machines
– Though, I did not have success with this
Small Glitch
• Hadoop JobClient.waitForCompletion(…) will throw an error since profile logs are not available in the default directory.
• However, the job will continue to run successfully.
• To avoid this, you can instead use the mapred.child.java.opts option to specify the profiling parameters
YourKit to Analyze Jobs
• Can analyze profile output from both YourKit Profiler and hprof/jmap.
OTHER TOOLS
Using other Tools
• JDK Tool ‘jmap’
– Can be used for capturing a heap map of a running Java process, which can later be analyzed inside VisualVM or YourKit
• $ jmap -dump:live,format=b,file=xyz.hprof <jvm-pid>
• Don’t run jmap with the -histo:live option on the JobTracker (JT) or NameNode (NN)
– A Java process can also be instructed to generate an hprof dump of the heap map in case of OutOfMemoryError
• -XX:+HeapDumpOnOutOfMemoryError
• JDK Tool ‘jhat’
– Can read a heap dump in hprof format and provides a lightweight web interface to analyze profiler output
Other Tools (Cont…)
• Hadoop Vaidya (Simple Diagnostic Tool)
– Identifies common performance problems related to Hadoop jobs (unbalanced partitioning, granularity of tasks, combiners etc.)
– Works only at the level of the Hadoop job (does not understand the specifics of Hive/Pig)
Other Recommendation
• If possible try running Hadoop (MR/Pig/Hive) in local mode using LocalJobRunner
– LocalJobRunner runs the entire MapReduce job in a single JVM
– It simplifies profiling and log collection
– Can also be used for attaching a debugger from an IDE
Resources
• Troubleshooting Java application
– http://www.oracle.com/technetwork/java/javase/toc-135973.html
• Profile Hadoop Job (Chapter 5 - “Hadoop: The Definitive Guide”)
– http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/tuning-a-job/id3545664
• Profiling Pig Job
– https://cwiki.apache.org/PIG/howtoprofile.html
• ‘hprof’ Official Documentation
– http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html
• YourKit Profiler
– http://www.yourkit.com