
Post on 25-Jun-2020


transcript

Copyright © 2015 Splunk Inc.

Raanan Dagan, Praveen Burgu

Hunk Performance and Troubleshooting Best Practice

Disclaimer


During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make.

In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Who are you?


•  Raanan Dagan – Sr. SE, Big Data specialist
•  Praveen Burgu – Sr. Software Engineer

Do not distribute

Agenda

–  Performance
   •  10 ways to optimize Hunk search performance: MR Jobs, Timestamp Extraction, Caching
–  Troubleshoot
   •  Inspect search job issues: MR Jobs, Performance, Timestamp


Hunk Performance


Hunk Performance Main Points
1. Run MR Jobs
2. HDFS Storage
3. VIX with Timestamp / indexes.conf
4. File Format
5. Compression types / File size
6. Event breaking / Props.conf
7. Report Acceleration
8. Hardware
9. Search Head Clustering
10. Other Flags (Threads, Splits)


#1: Make Sure You Use MR Jobs


Not MR Jobs – Just Splunk
•  index=xyz

Not MR Jobs – Just Splunk
•  index=xyz | stats count, using Verbose Mode

Yes, this will run MR Jobs
•  index=xyz | stats count, using Smart Mode

This allows you to use the power of Hadoop MR jobs' parallel processing

#2: HDFS Storage

This is BAD
•  /data/root/dir/...

This is GOOD
•  /data/root/dir/2014/10/01/...
•  /data/root/dir/2014/10/02/...

This is BETTER
•  /data/root/dir/2014/10/01/app=apache/...
•  /data/root/dir/2014/10/01/app=mysql/...

This allows you to bring a subset of the data from HDFS based on time extraction
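The benefit of the partitioned layouts above can be sketched in a few lines of Python. This is an illustration of the pruning idea, not Hunk's actual implementation: with dates (and an `app=` partition) in the path, a time-bounded search only needs to list the matching directories.

```python
from datetime import date, timedelta

def candidate_dirs(root, earliest, latest, app=None):
    """List only the date (and app) partitions a search needs to read.

    Illustrative sketch: with a /root/YYYY/MM/DD[/app=...] layout, a
    time-bounded search can skip every directory outside the range.
    """
    dirs, day = [], earliest
    while day <= latest:
        path = f"{root}/{day.year:04d}/{day.month:02d}/{day.day:02d}"
        if app:
            path += f"/app={app}"  # the "BETTER" layout also prunes by source
        dirs.append(path)
        day += timedelta(days=1)
    return dirs

# A two-day search touches two directories instead of the whole dataset
print(candidate_dirs("/data/root/dir", date(2014, 10, 1), date(2014, 10, 2), app="apache"))
```

With the flat "BAD" layout there is nothing in the path to prune on, so every file must be scanned regardless of the time range.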


#3: VIX with Timestamp / indexes.conf

HDFS = /user/splunk/data/20141123/14/SFServer/myfile.gz

[hadoop]
vix.provider = MyHadoopProvider
vix.input.1.path = /user/splunk/data/*/*/${server}/...
vix.input.1.accept = \.gz$
vix.input.1.et.regex = .*?/data/(\d+)/(\d+)/.*?\.gz
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.offset = 0
vix.input.1.lt.regex = .*?/data/(\d+)/(\d+)/.*?\.gz
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600

Time extraction will enable you to use the Time Picker in the Hunk UI to bring back a subset of the data
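The et/lt mechanism above can be sketched in Python: the regex's captured groups are concatenated and parsed with the format pattern, and the offsets widen the window. This is an illustrative model of the behavior, not Hunk's code (Java's `yyyyMMddHH` is written `%Y%m%d%H` in Python).

```python
import re
from datetime import datetime, timedelta

ET_REGEX = r".*?/data/(\d+)/(\d+)/.*?\.gz"
ET_FORMAT = "%Y%m%d%H"  # Java's yyyyMMddHH, in Python strptime form

def extract_time_range(path, lt_offset_seconds=3600):
    """Return the (earliest, latest) window derived from a file path."""
    m = re.match(ET_REGEX, path)
    if m is None:
        return None  # path does not match -> no time info to prune on
    stamp = "".join(m.groups())  # "20141123" + "14" -> "2014112314"
    earliest = datetime.strptime(stamp, ET_FORMAT)
    latest = earliest + timedelta(seconds=lt_offset_seconds)  # lt.offset = 3600
    return earliest, latest

et, lt = extract_time_range("/user/splunk/data/20141123/14/SFServer/myfile.gz")
print(et, lt)  # an hour-wide window covering the file
```

Because the window is computed from the path alone, Hunk can decide whether a file is relevant to the Time Picker range before opening it.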

#4: File Format

•  Don't add multiple sources into one file

•  Use a self-describing format for the data whenever possible; e.g. JSON, Avro, CSV, Parquet, ORC, RC, etc.

•  If using a log file, look at the list of Splunk known source types (e.g. sourcetype=access_combined): http://docs.splunk.com/Documentation/Splunk/latest/Data/Listofpretrainedsourcetypes

•  Look at the Splunk App Store for 600 other options to break the events / fields: http://apps.splunk.com

Hunk will benefit if the file has some structure; otherwise we will need to use regexes to extract fields

#5: Compression type / File size

This is BAD (large, non-splittable)
•  500 MB GZ file

This is BAD (too many MR jobs)
•  10,000 x 1 KB files

This is GOOD (large, splittable)
•  500 MB LZO or Snappy file

This is GOOD (non-splittable, but 1 MR task per file)
•  127 MB or 63 MB GZ files

To avoid too many MR jobs, or running out of memory, make sure to use the correct compression type or file size
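The trade-off above can be made concrete with a rough split-count model. This sketch assumes a 128 MB HDFS block size and the simple rule "non-splittable file = one mapper"; real split planning has more inputs, so treat the numbers as illustrative.

```python
import math

HDFS_BLOCK = 128 * 1024 * 1024  # assumed 128 MB block size

def mr_splits(file_size_bytes, splittable):
    """Rough number of map tasks one file generates (illustrative only)."""
    if not splittable:
        return 1  # a gzip file must be streamed whole by a single mapper
    return max(1, math.ceil(file_size_bytes / HDFS_BLOCK))

# 500 MB gzip: one mapper crawls through the whole file (slow)
print(mr_splits(500 * 1024 * 1024, splittable=False))  # 1
# 500 MB LZO/Snappy: ~4 mappers work in parallel
print(mr_splits(500 * 1024 * 1024, splittable=True))   # 4
# 10,000 tiny files: one mapper each -> 10,000 task launches of pure overhead
print(sum(mr_splits(1024, splittable=False) for _ in range(10000)))  # 10000
```

This is why 127 MB or 63 MB gzip files are acceptable: each stays under one block's worth of work, so "one mapper per file" is no longer a bottleneck.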

#6: Index-time pipeline processing
http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Performancebestpractices

[Bar chart: search time in seconds per props.conf configuration – roughly a 4X speedup]
Default: 190 s
MLA: 179 s
MLA + LM: 105 s
MLA + LM + TP: 105 s
MLA + LM + TF: 51 s
MLA + LM + TF + TP: 51 s
MLA + LM + TF + AP: 44 s

MLA: MAX_TIMESTAMP_LOOKAHEAD = 30
TP: TIME_PREFIX = ^
TF: TIME_FORMAT = %a, %d %b %Y %H:%M:%S %Z
LM: SHOULD_LINEMERGE = false
AP: ANNOTATE_PUNCT = false
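Combined in a single props.conf stanza, the fastest configuration from the chart above would look like the following sketch (the sourcetype name is a hypothetical placeholder):

```ini
# props.conf -- hypothetical sourcetype combining all chart settings
[my_custom_logs]
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_PREFIX = ^
TIME_FORMAT = %a, %d %b %Y %H:%M:%S %Z
SHOULD_LINEMERGE = false
ANNOTATE_PUNCT = false
```

Each setting removes work from the index-time pipeline: a bounded timestamp lookahead and an explicit format avoid timestamp guessing, disabling line merging skips event re-assembly, and disabling punctuation annotation skips a per-event extraction.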

#7: Report Acceleration

Report acceleration will improve performance – it brings data from the cache

NOTE: vix.env.HADOOP_HEAPSIZE = 1024 or above


Splunk and Hadoop – Caching options


#8: Hardware

Good hardware with multiple cores can be very beneficial when serving hundreds of end users

Dedicated search head
•  Intel 64-bit chip architecture
•  4 CPUs, 4 cores per CPU, at least 2 GHz per core
•  12 GB RAM
•  2 x 300 GB, 10,000 RPM SAS hard disks, configured in RAID 1
•  Standard 1Gb Ethernet NIC, optional 2nd NIC for a management network
•  Standard 64-bit Linux

Data nodes = the splunkd indexer is installed, by default, on each data node in the '/tmp/splunk' directory. You just need to make sure you have about 40 MB or more of space in that directory

#9: Search Head Clustering

Add many concurrent users

1. No single point of failure = dynamic captain
2. "One configuration" across SHs = automatic config replication
3. Horizontal scaling = ability to add / remove SH nodes on a running cluster

#10: Other Optimization Flags

Hunk / Hadoop Client

Number of jobs:
•  vix.splunk.search.mr.threads – number of threads to use when reading MR results from HDFS
•  vix.splunk.search.mr.maxsplits – maximum number of splits in an MR job (defaults to 10000)

Number of copies to each data node:
•  vix.splunk.setup.bundle.setup.timelimit – time limit in ms for setting up the bundle on a TaskTracker
•  vix.splunk.setup.bundle.replication – set a custom replication factor for the bundle on HDFS
•  vix.splunk.setup.package.replication – set a custom replication factor for the Splunk package on HDFS

VIX overrides:
•  vix.input.[N].recordreader – list of record readers to use when processing this input; these are tried before those at the provider level. For example, ImageRecordReader, PCapRecordReader, ZipRecordReader, EncryptionRecordReader
•  vix.input.[N].splitter – for example, ParquetSplitGenerator
•  vix.input.[N].required.fields – for example, in smart mode always extract the Timestamp field
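As a sketch, the flags above live in the provider stanza of indexes.conf. The provider name and values below are illustrative placeholders, not recommendations:

```ini
# indexes.conf -- hypothetical provider stanza; tune values for your cluster
[provider:MyHadoopProvider]
# Number of jobs
vix.splunk.search.mr.threads = 10
vix.splunk.search.mr.maxsplits = 10000
# Copies of the bundle/package pushed to data nodes
vix.splunk.setup.bundle.setup.timelimit = 60000
vix.splunk.setup.bundle.replication = 3
vix.splunk.setup.package.replication = 3
```

Raising the bundle/package replication factor spreads the Splunk payload across more data nodes, which speeds up task setup on large clusters at the cost of HDFS space.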

Hunk Troubleshooting


Troubleshooting Main Points

1. Hunk UI shows errors
2. search.log to debug Hunk / Hadoop client issues
3. Hadoop logs to debug Hadoop server issues
4. Job -> Inspect Job to debug many performance issues


Each log line in the file that involves Hunk ERP operations is annotated with ERP.<provider>… and contains links for the spawned MR job(s). You may need to follow these links to troubleshoot MR issues. To enable more detailed logging and monitoring modes, edit the following parameters in the provider settings:

By setting it to 1, search.log will contain DEBUG-level logging events.

By default, Hunk searches run in mixed mode. To disable it, set the value to 0.

By default, Hunk makes a best effort to prune unnecessary columns/fields to improve search performance. For debugging, you can turn this off and have the ERP return all columns to Hunk, which then does the filtering and final processing at the search head.
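As a sketch, these three settings map onto provider-stanza parameters like the following. The exact setting names here are assumptions, so verify them against your version's indexes.conf.spec:

```ini
# indexes.conf -- assumed setting names; confirm against indexes.conf.spec
[provider:MyHadoopProvider]
vix.splunk.search.debug = 1          # assumed name: DEBUG-level events in search.log
vix.splunk.search.mixedmode = 0      # assumed name: disables mixed-mode streaming
vix.splunk.search.column.filter = 0  # assumed name: ERP returns all columns
```

Remember to revert these after debugging; DEBUG logging and unfiltered columns both slow searches down.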


Troubleshooting – Enable Debugging


Example #1: No MapReduce Job in Hadoop

To check whether a MapReduce job is working, you can append a reporting command to the search job.


Troubleshooting – No MapReduce Job

Find search.log

In this example, a search returns some results but seems to be stuck after the initial streaming results. The fact that it has returned some results indicates that Hunk can access the data in HDFS.

If you encounter issues while building your reports, search.log is the place to look. You can access the file via the job inspector.



If you encounter an error while running a basic search, you can find complete search job details in the job inspector.

In search.log – Pinpoint the error

Hunk log lines are denoted with ERP. followed by a provider name. In this example, a job was submitted and Hunk is contacting the ResourceManager (YARN).

However, it looks like Hunk cannot connect to the ResourceManager.


The error will be displayed in the UI and in search.log

Eventually the repeated attempts fail and the ERP throws an exception.

The error message is shown on the partial results page, indicating that the MapReduce job was unable to start. You suspect that the ResourceManager node may be down, so you contact the Hadoop administrator.


Troubleshoot Hadoop Server Issues

A Hadoop administrator checks the ResourceManager and finds that the node is running and no job from Hunk has been queued. With that information, you can narrow down the issue to a network connection or a Hunk configuration error.

In this example, the culprit was a misconfigured address for the ResourceManager. After fixing the address, the job was able to complete successfully. For more examples of error messages, check: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/TroubleshootHunk


Example #2: Real World – Bad Performance

No MapReduce job = not a good start
stream.bytes = Splunk generated the results

Yes, MapReduce job = better
report.bytes = Hadoop generated the results; MR.SPLK = leveraging Hadoop

Examine HDFS storage
hadoop.dirs / files.listed = how many directories Splunk needs to scan

VIX with timestamp on the files = not great
Scanned 8,760 files and filtered out 8,688 = only 72 files used for the search. The recommendation is to build the timestamp on directories.

Non-splittable very large file = bad
1 MR job for a very large file is not ideal

Splittable very large file = good
Multiple jobs mean we leverage Hadoop's parallel processing

Report acceleration = great
cache.bytes = results from the HDFS cache (no need for MR)

Summary


Summary – Performance
1. Run MR Jobs
2. HDFS Storage
3. VIX with Timestamp / indexes.conf
4. File Format
5. Compression types / File size
6. Event breaking / Props.conf
7. Report Acceleration
8. Hardware
9. Search Head Clustering
10. Many Other Flags (Threads, Splits)



Summary – Troubleshooting

1. Hunk UI shows errors
2. search.log to debug Hunk / Hadoop client issues
3. Hadoop logs to debug Hadoop server issues
4. Job -> Inspect Job to debug many performance issues


THANK YOU

Common Issues We See

Issue: Performance
Clue: Job takes a long time
Potential solution: Most likely the customer is not running MR jobs. Change to index=xyz | stats count by xyz + smart mode

Issue: Memory
Clue: No error! The job is just hanging...
Potential solution: Lower vix.mapred.job.map.memory.mb to 1024, OR increase the memory on the Hadoop side

Issue: Heartbeat
Clue: In search.log you will see "operation took longer than the heartbeat interval"
Potential solution: vix.splunk.heartbeat = 0

Issue: Timestamp / field extraction in smart mode
Clue: Events are not showing correctly
Potential solution: vix.input.[N].required.fields = Timestamp, or props.conf

Issue: Hive jars missing or Hive issues
Clue: In search.log you will see Exception in thread "main" java.lang.NoSuchFieldError
Potential solution: Add the jars to vix.env.HUNK_THIRDPARTY_JARS, or look in Splunk Answers for Hive

Issue: Data nodes' /tmp directory will not install splunkd
Clue: In Hadoop logs (not Splunk logs) you will see permission issues or issues writing to /tmp/splunk
Potential solution: Change vix.splunk.home.hdfs, or fix the permissions / size
