
© Cloudera, Inc. All rights reserved.

The Impala Cookbook
From Cloudera’s Impala Team
Updated Jan. 2017

* Currently an Apache Incubator project


Topic Outline

• Part 1 – The Basics
  • Physical and Schema Design
  • Memory Usage in Impala

• Part 2 – The Practical Issues
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking Impala
  • Multi-tenancy Best Practices
  • Query Tuning Basics

• Part 3 – Outside Impala
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet






Physical and Schema Design - Outline

• Schema design best practices
  • Data types
  • Partition design
  • Complex types
  • Common questions

• Physical design
  • File format – when to use what?
  • Block size (option)


Physical and Schema Design - Data Types

• Use numeric types (not strings)
  • Avoid string types when possible
  • Strings => higher memory consumption, more storage, slower to compute (80% slower compared to numeric)
  • E.g. for a date string “20161031” or a Unix time “1479459272”, switch to bigint

• Decimal vs Float/Double
  • Decimal is easier to reason about
  • Currently can't use Decimal as a partition key or in UDFs

• Use String only for:
  • HBase row key – string is the recommended type!
  • Timestamp – OK to use string, but consider using numeric as well!

• Prefer String over Char/Varchar (except for SAS)


Partition Design: 3 Simple Rules

1. Identify access patterns from the use case or existing SQL.
2. Estimate the total number of partitions (better be < 100k!).
3. (Optional) If needed, reduce the number of partitions.


Partition Design – Identify Access Patterns

• The table columns that are commonly used in the WHERE clause are candidates for partition keys.
  • Date is almost always a common access pattern and the most common partition key.

• Can have MULTIPLE PARTITION KEYS! Examples:
  • select col_1 from store_sales where sold_date between '2014-01-31' and '2016-02-23';
  • select count(revenue) from store_sales where store_group_id in (1,5,10);
  • Partition keys => sold_date, store_group_id


Partition Design – Estimate the # partitions

• Estimate the number of distinct values (NDV) for each partition key (for the required storage duration). Example:
  • If partitioned by date and need to store for 1 year, then NDV for the date partition key is 365.
  • num store_group will grow over time, but it will never exceed 52 (one for each state).

• Total number of partitions = NDV for part key 1 * NDV for part key 2 * … * NDV for partition key N. Example:
  • Total number of partitions = 365 (for date part) * 52 (for store_group) ~= 19k

• Make sure # partitions <= 100k!
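The estimate above is a plain product of per-key NDVs. A minimal Python sketch of that arithmetic, using the figures from this slide; the function name and the constant are illustrative only and are not part of Impala:

```python
from math import prod

# Rule-of-thumb ceiling quoted on the slide.
MAX_PARTITIONS = 100_000

def estimate_partitions(ndv_per_key):
    """Total partitions = product of the NDV of every partition key."""
    return prod(ndv_per_key.values())

# One year of daily partitions, and at most 52 store groups.
ndvs = {"sold_date": 365, "store_group_id": 52}
total = estimate_partitions(ndvs)
within_limit = total <= MAX_PARTITIONS
print(total, within_limit)
```

Adding a third key multiplies the total again, which is why a seemingly small extra key can push a table past the 100k guideline.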


Partition Design – Too Many Partitions?

• Remove some “unimportant” partition keys.
  • If a partition key isn’t as routinely used or it doesn’t impact the SLA, remove it!

• Create partition “buckets” and specify the bucket id in downstream select queries.
  • Use month rather than date.
  • Create an artificial store_group to group individual stores.
  • Technique: use prefix or hash
    • Hash(store_id) % store_group size => hash it to store_group
    • Substring(store_id,0,2) => use the first 2 digits as artificial store_group
  • e.g. select … from tbl where store_id = 5 and store_group = store_id % 50; // Assume 50 store groups
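The bucketing techniques above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names are hypothetical, and `zlib.crc32` stands in for whatever hash the ETL job actually uses. Whichever function is chosen, the loader and the downstream queries must compute the bucket the same way.

```python
import zlib

def store_group_by_mod(store_id, num_groups=50):
    """Numeric ids: bucket = store_id % number of groups (as in the slide's example)."""
    return store_id % num_groups

def store_group_by_hash(store_id, num_groups=50):
    """Any id: hash it, then take the remainder.

    crc32 is used here only because it is stable across runs,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(str(store_id).encode("utf-8")) % num_groups

def store_group_by_prefix(store_id, width=2):
    """Use the first `width` characters of the id as the artificial group."""
    return str(store_id)[:width]

print(store_group_by_mod(5), store_group_by_prefix("98765"))
```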


Nested Types: Known Issues

• The maximum size of a single row fed into a join build cannot exceed 8MB (buffered-tuple-stream block size). For complex types this means:
  • Size of all materialized collections (post projection and filtering) cannot exceed 8MB for certain plan shapes (expected to be rare).

• IMPALA-2603: Incorrect plan is generated for inline view referencing complex types.
  • Inline view has several relative table refs that refer to different ancestor query blocks within the same nesting hierarchy.


Schema Design – Common Issues

• Number of columns - 2k max
  • Not a hard limit; Impala and Parquet can handle even more, but…
  • It slows down Hive Metastore metadata update and retrieval
  • It leads to big column stats metadata, especially for incremental stats

• Timestamp/Date
  • Use timestamp for date;
  • Date as partition column: use string or int (20150413 as an integer!)

• BLOB/CLOB – use string
• String size - no definitive upper bound but 1MB seems OK
  • Larger-sized string can crash Impala!
  • Use it sparingly - the whole 1MB string will be shipped everywhere


Physical Design – File Format

• Parquet/Snappy
  • The long-term storage format
  • Always good for reading!
  • Write is very slow (reportedly 10x slower than Avro).

• Snappy vs Gzip
  • Snappy is usually a better tradeoff between compression ratio and CPU.
  • But, run your own benchmark to confirm!

• For write-once-read-once tmp ETL tables, consider text instead because:
  • Writing is faster.
  • Impala can perform the write.


Physical Design – Block Size

• Number of blocks defines the degree of parallelism:
  • True for both MapReduce and Impala
  • Each block is processed by a single CPU core
  • To leverage all CPU cores across the cluster, num blocks >= num cores

• Larger block size:
  • Better compression and IO throughput, but fewer blocks, which could reduce parallelism

• Smaller block size:
  • More parallelism, but could reduce IO throughput
  • Can cause metadata bloat and create bottlenecks on HDFS NameNode (NN RPC overhead 40K-50K/s), Hive Metastore, Impala catalog service


Physical Design – Block Size

• For Apache Parquet, ~256MB is good and no need to go above 1GB.
  • Don’t go below 64MB except when you need more parallelism!
• (Advanced) If you really want to confirm the block size, use the following equation:
  • Block Size <= p * t * c / s
    • p – disk scan rate at 100MB/sec
    • t – desired response time of the query in sec
    • c – concurrency
    • s - % of columns selected

• Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency
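As a rough illustration, the block-size inequality from this slide can be evaluated directly. The sketch below assumes that s is expressed as a fraction between 0 and 1 and that the result is in MB; the function and parameter names are made up for this example.

```python
def max_block_size_mb(scan_rate_mb_per_sec, response_time_sec,
                      concurrency, fraction_columns_selected):
    """Upper bound from the slide: Block Size <= p * t * c / s."""
    if not 0 < fraction_columns_selected <= 1:
        raise ValueError("fraction_columns_selected must be in (0, 1]")
    return (scan_rate_mb_per_sec * response_time_sec * concurrency
            / fraction_columns_selected)

# p = 100 MB/sec, t = 2 sec, c = 1, s = 50% of the columns.
bound = max_block_size_mb(100, 2, 1, 0.5)
print(bound)
```

The bound is only a sanity check; the slide's simpler guidance (about 256MB for Parquet, not below 64MB, not above 1GB) still applies.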






Memory Usage – The Basics

• Memory is used by:
  • Hash join – RHS tables after decompression, filtering and projection
  • Group by – proportional to the # groups
  • Parquet writer buffer – 256MB per partition
  • IO buffer (shared across queries)

• Memory held and reused by later queries
  • Impala releases memory from time to time in 1.4 and later


Catalog memory usage

• Metadata cache heap memory usage can be calculated by
  • num of tables * 5KB + num of partitions * 2KB + num of files * 750B + num of file blocks * 300B + sum(incremental col stats per table)
• Incremental stats
  • For each table, num columns * num partitions * 400B
• Catalog-Update topic size should be no more than 2GB
  • http://<statestore host>:25010/topics
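The catalog formula above is easy to turn into a small calculator. The following Python sketch uses only the per-object sizes quoted on this slide; the function names are illustrative, and reading "KB" as 1024 bytes is an assumption.

```python
KB = 1024  # assumption: the slide's "KB" is read as 1024 bytes

def incremental_stats_bytes(num_columns, num_partitions):
    """Per-table incremental stats: num columns * num partitions * 400B."""
    return num_columns * num_partitions * 400

def catalog_metadata_bytes(num_tables, num_partitions, num_files,
                           num_blocks, incremental_stats=0):
    """Metadata cache heap estimate using the per-object sizes from the slide."""
    return (num_tables * 5 * KB
            + num_partitions * 2 * KB
            + num_files * 750
            + num_blocks * 300
            + incremental_stats)

# Hypothetical cluster: 1,000 tables, 100k partitions, 1M files, 2M blocks.
estimate = catalog_metadata_bytes(1_000, 100_000, 1_000_000, 2_000_000)
print(round(estimate / 1e9, 2), "GB")
```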


Memory Usage – Estimating Memory Usage

• Use Explain Plan
  • Requires statistics! Memory estimate without stats is meaningless.
  • Reports per-host memory requirement for this cluster size.
  • Re-run if you’ve re-sized the cluster!


Memory Usage – Estimating Memory Usage (Cont’d)

• EXPLAIN’s memory estimate issues
  • Can be way off – much higher or much lower.
  • group by’s estimate can be particularly off – when there’s a large number of group by columns.
  • Mem estimate = NDV of group by column 1 * NDV of group by column 2 * … NDV of group by column n

• Ignore EXPLAIN’s estimate if it’s too high!
• Do your own estimate for group by
  • GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num node
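A minimal Python sketch of the do-it-yourself GROUP BY estimate above. The function name and the sample numbers are illustrative; the number of groups and the row size are rough guesses that you supply.

```python
def group_by_mem_bytes(num_groups, row_size_bytes, num_nodes):
    """Slide formula: (groups * row size) + (groups * row size) / num nodes."""
    total = num_groups * row_size_bytes
    return total + total / num_nodes

# Hypothetical query: 1M groups, 100-byte rows, 10-node cluster.
estimate = group_by_mem_bytes(1_000_000, 100, 10)
print(estimate / 1e6, "MB")
```

Because the first term does not shrink with the node count, adding nodes lowers this estimate only slightly.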


Memory Usage – Finding Actual Memory Usage

• Search for “Per Node Peak Memory Usage” in the profile.
  • This is accurate. Use it for production capacity planning.


Memory Usage – Finding Actual Memory Usage (Cont’d)

• For complex queries, how do I know which part of my query is using too much memory or causing an Out-Of-Memory error (i.e. hitting the mem-limit)?
  • Look at the “Peak Mem” in the ExecSummary from the query profile!


Memory Usage – Hitting Mem-limit

• Top causes (in order) of hitting mem-limit even when running a single query:

1. Lack of statistics
2. Lots of joins within a single query
3. Big-table joining big-table
4. Gigantic group by


Memory Usage – Hitting Mem-limit (Cont’d)

• Lack of stats
  • Wrong join order, wrong join strategy, wrong insert strategy
  • Explain Plan tells you that!

• Fix: Compute Stats <table>
  • For a huge table where Compute Stats takes too long to finish, you can manually set table/column stats


Memory Usage – Hitting Mem-limit (Cont’d)

• Lots of joins within a single query
  • select … from fact, dim1, dim2, dim3, … dimN where …
  • Each dim table can fit in memory, but not all of them together
  • As of CDH 5.4/Impala 2.2,
    • Impala might choose the wrong plan – BROADCAST
    • Impala sometimes requires 256MB as the minimal requirement per join!

• FIX 1: use shuffle hint
  • select … from fact join [shuffle] dim1 on … join [shuffle] dim2 …

• FIX 2: pre-join the dim tables (if possible). Fewer joins => better perf!


Memory Usage - Hitting Mem-limit (Cont’d)

• Big-table join big-table
  • Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
  • CDH 5.4/Impala 2.0 and later will do this (via disk-based join).
  • (Advanced) For a simple query, you can try this advanced workaround – per-partition join
    • Requires the partition key be part of the join key
      select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
      union all
      select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)
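Writing the per-partition workaround by hand gets tedious when there are many partition keys. The Python sketch below generates a query of the same shape as the slide's example; the helper name is hypothetical, the select list is passed in by the caller because the slide elides it, and the table names are simply the ones used on the slide.

```python
def per_partition_join_sql(select_list, part_keys, batch_size):
    """Build one 'union all' branch per batch of partition keys."""
    branches = []
    for i in range(0, len(part_keys), batch_size):
        keys = ", ".join(str(k) for k in part_keys[i:i + batch_size])
        branches.append(
            f"select {select_list} from BigTbl_A a join BigTbl_B b "
            f"where a.part_key = b.part_key and a.part_key in ({keys})"
        )
    return "\nunion all\n".join(branches)

# Two branches of three partitions each, as in the slide.
sql = per_partition_join_sql("a.id, b.val", [1, 2, 3, 4, 5, 6], 3)
print(sql)
```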


Memory Usage – Disk-based Join/Agg

• Disk-based join/agg should be your last resort to deal with hitting mem-limit.
• Rely on disk-based join/agg only when there is a single join/agg operator in the query. For example:
  • Good: select a.*, b.* from a, b where a.id = b.id
  • Good: select a.id, a.timestamp, count(*) from a group by a.id, a.timestamp
  • OK: select large_tbl.id, count(*) from large_tbl join tiny_tbl on (id) group by id
  • Bad: select t1.id, count(*) from large_tbl_1 t1, large_tbl_2 t2 where t1.id = t2.id group by t1.id
  • Bad: select a.*, b.*, c.* from a, b, c where a.id = b.id and b.col1 = c.col2;

• Always set the per-query mem-limit (2GB min) when using disk-based join/agg!


Memory Usage - Additional Notes

• Use explain plan for estimate; use profile for accurate measurement
• Data skew can cause uneven memory/CPU usage
• Review previous common issues on out-of-memory
• Even with disk-based join in Impala 2.0 and later, you'll want to review the Query Tuning steps to speed up queries and use memory more efficiently.







Hardware Recommendations

• 128GB (assigned to Impala) or more for best price/performance
• Spindles vs SSD
  • Spindles are more cost effective
  • Most workload is CPU bound; SSD won’t make a difference at all

• 10Gb network


Cluster Sizing – Objective and Keys

• Objective:
  • The recommended cluster size should run the desired workload within a given SLA throughout the projected lifespan of the cluster.

• Keys:
  • Workload - defines the functional requirement
  • SLA – defines the performance requirement
  • Projected lifespan – how will things change over time?


Cluster Sizing - SLA

• Query Throughput
  • How many queries should this cluster process per second?
  • This is the more meaningful measurement of “computing power”

• Query Response Time
  • How fast do you need the query to run?
  • Typically, single query response time isn’t too meaningful because there are always multiple queries running concurrently!
  • Some use cases require very fast response time, e.g. powering a web UI.

• Will more people be running queries over time? This means higher query throughput!


Cluster Sizing - Workload

• From the workload, you’ll want to know:
  • How much memory do you need?
  • How much processing power do you need? (i.e. how complex is the workload?)
  • How much IO bandwidth do you need?

• The bigger the cluster, the more total memory, CPU, and disk-IO bandwidth you have.
  • But usually, the network bandwidth is fixed.


Cluster Sizing – Memory Requirements

• How much memory do you need?
  • Any huge group by expression?
    • Per Node Mem >= number of distinct group * row size + (number of distinct group * row size) / num node
    • Number of distinct group: hard to guess; just get a rough ballpark.
    • Row size: # columns involved in the query * column width
    • Column width for int 4 byte, bigint 8 byte, etc. For string columns, take some rough guess.
    • Increasing the cluster size won’t help much to reduce the per-node mem requirement.


Cluster Sizing – Connection Requirements

• The larger the cluster, the more intra-cluster connections are needed between impalads. With high concurrency and/or very complex queries, this could cause a connection storm and query failures.
  • Impala caches established connections and re-uses them. If the workload is executed again, the existing connection pool will be leveraged to satisfy each connection request.

• On a Kerberos secured cluster, each impalad needs to authenticate with every other impalad. Requires: N*N KDC tickets. This could overwhelm a single KDC server.
  • If the cluster has more than 200 nodes, consider using more KDC servers to balance load.


Cluster Sizing – Workload Complexity (Cont’d)

• (Advanced) If you’re ready to dive deep into workload analysis…
  • Typically, you can assume the following processing rate (3 columns, Parquet format with Snappy):
    • Scan node 8~10m rows per sec per core
    • Join node (single join predicate) ~10m rows per sec per core
    • Agg node (single agg) ~5m rows per sec per core
    • Sort node - ~17MB per sec per core
    • Parquet writer 1~5MB per sec per core

• From a query in the workload, based on the # join/agg you can estimate # rows passing through the operator node, then estimate the effect of partition pruning.
  • Using the above processing rate, you can estimate how much CPU time it takes to process a query.
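The CPU-time estimate described above amounts to dividing the rows that flow through each operator by that operator's processing rate. A minimal Python sketch follows, with these assumptions stated up front: the scan rate is read as rows per second per core (the slide omits "per sec" for the scan node), the low end of each quoted range is used, and the operator names and row counts are invented for the example.

```python
# Rows per second per core, taken from the low end of the slide's ranges.
ROWS_PER_SEC_PER_CORE = {
    "scan": 8_000_000,
    "join": 10_000_000,
    "agg": 5_000_000,
}

def cpu_seconds(rows_per_operator):
    """Sum of rows / rate over every (operator, rows) pair in the plan."""
    return sum(rows / ROWS_PER_SEC_PER_CORE[op] for op, rows in rows_per_operator)

# Hypothetical plan: scan 800M rows, join 100M, aggregate 50M.
plan = [("scan", 800_000_000), ("join", 100_000_000), ("agg", 50_000_000)]
total_cpu = cpu_seconds(plan)

# Idealized wall-clock time if the work spreads evenly over every core.
total_cores = 10 * 16  # e.g. 10 nodes with 16 cores each
print(total_cpu, total_cpu / total_cores)
```

Real queries rarely parallelize perfectly, so treat the wall-clock figure as a lower bound.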


Cluster Sizing – Summary

• Cluster sizing depends on SLA and workload. You need to know both!
• Memory requirement for doing big join/agg in memory:
  • Total Cluster mem >= the 2nd largest big table in the join after decompression, filtering, and projection
  • Per Node Mem >= number of distinct group * row size + (number of distinct group * row size) / num node







Benchmarking – Why Run One?

• Understand how Impala performs, how it scales, how it compares to the current system
  • Measure query response time as well as query throughput

• Understand how Impala utilizes resources
  • Measure CPU, disk, network and memory


Benchmarking Impala – Preparing the Workload

• Should be relevant to (or satisfy) the business requirement.
  • Don’t run select * from big_tbl limit 10 – it’s meaningless.

• Should not be dictated by the query form.
  • You should be prepared to change the query/schema to deliver a meaningful benchmark.
  • Tune the schema/query!

• Stay close to the query that you’re going to run in production.
  • If the result has to be written to disk, then write to disk and DO NOT send results back to the client.
  • Don’t stream all the data to the client (i.e. data extraction).

• Use a fast client: You’re benchmarking the server, not the client, so don’t make the client a bottleneck.


Benchmarking – Avoiding Traps

• It’s easier to start with a smaller dataset and simpler query. Trying to run a complex query on a huge dataset on a small cluster is not effective.
• A dataset that’s too small can’t utilize the whole cluster. Have at least one block per disk.
• Disable Admission Control when you’re running a benchmark!


Benchmarking – Preparing the Hardware

• Should be as similar to go-live hardware as possible
• Recommended: at least 10 nodes with 128GB each
• CAUTION: If the cluster is too small (e.g. 3 nodes), it’s very hard to see the effect of scalability and identify potential bottlenecks


Benchmarking – Measuring Single-Query Response Time

• Use impala-shell (simple, easy to use) with the -B option. This disables pretty formatting so the client won’t be the bottleneck.
  • impala-shell -B -q "<your query>"

• To simulate the effect of buffer cache, run it a few times to warm the buffer cache before measuring the result.
• To simulate the effect without buffer cache, clear the buffer cache by running: echo 1 > /proc/sys/vm/drop_caches


Benchmarking – Measuring Query Throughput

• Benchmark using JDBC (or use JMeter! Make sure you set NativeQuery=1)
  • Input parameters: list of hosts to connect to, workload queries, duration of the benchmark, and number of concurrent connections.
  • Each connection will pick a host to connect to and keep running a query for the specified duration.
  • Report QPS.
  • Just keep increasing the number of connections until QPS doesn’t increase – that will be the QPS of the system.
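The throughput procedure above (run queries on N connections for a fixed duration, report QPS, then raise N until QPS flattens) can be sketched as a small harness. This Python version is a simplified stand-in for a JDBC or JMeter setup: `run_query` is a hypothetical callable that you would implement with your own client, and the 5% "no longer improving" threshold is an assumption, not a figure from the slides.

```python
import threading
import time

def measure_qps(run_query, num_connections, duration_sec):
    """Run `run_query` in a loop on each of `num_connections` threads; return QPS."""
    counts = [0] * num_connections
    deadline = time.monotonic() + duration_sec

    def worker(i):
        while time.monotonic() < deadline:
            run_query()
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_connections)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / duration_sec

def find_peak_qps(qps_at, connection_steps, min_gain=0.05):
    """Raise the connection count until QPS stops improving by `min_gain`."""
    best_conns, best_qps = None, 0.0
    for n in connection_steps:
        qps = qps_at(n)
        if best_conns is not None and qps < best_qps * (1 + min_gain):
            break  # throughput has flattened out
        best_conns, best_qps = n, qps
    return best_conns, best_qps
```

In a real run, `qps_at` would be something like `lambda n: measure_qps(my_query_fn, n, 300)`, where `my_query_fn` picks a host and executes one workload query.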


Benchmarking – General Performance Notes

• Gather query profiles and system utilization
  • You can get all these from CM or /var/log/impalad/profiles!

• Performance vs Hive - Impala will ALWAYS be faster after proper tuning. If not, something is wrong with the benchmark.
  • Impala vs Hive-on-Tez/Spark SQL benchmark: https://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
  • See the open TPC-DS toolkit to run your own! https://github.com/cloudera/impala-tpcds-kit







Multi-tenancy Best Practices – Admission Control

• Recommended: Statically partition a fixed amount of memory for Impala and use Admission Control
  • See the “Memory Usage” section for determining memory usage!

• The best public resource is this new Impala RM “how to” guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_howto_rm.html


Multi-tenancy Best Practices – Preventing a “Runaway” Query

• “Runaway” query = a user accidentally submitted a “wrong” query that consumes a significant amount of memory
• Limit the amount of memory used by an individual query using the per-query mem-limit:
  • Set it from Impala shell (on a per-query basis): set mem_limit=<per query limit>
  • Set a default per-query mem limit: -default_query_options='mem_limit=<per query limit>'
  • On Impala 2.6 and later, you can set the default per-query mem limit at pool level


Multi-tenancy Best Practices – Admission Control

• How to approach Admission Control config:
  • Workload dependent – memory bound or not?
  • “Memory bound” means that you’ve used up the memory allocated to Impala before hitting the limit on CPU, disk, network.

• Memory bound – use mem-limit
• Non-memory bound – use number of concurrent queries


Multi-tenancy Best Practices – Admission Control (Cont’d)

• Goals of memory limit setting:
  • Prevents each group of users from overcommitting system memory
  • Prevents query from hitting mem-limit
  • (Secondary) Simulates priority by allocating more memory to important group



• Step 1: Identify sample workload from each user “group”, such as HR, Analyst etc.
• Step 2*: For each query in the workload, identify the accurate memory usage by running the query. This is the memory requirement for the query.
• Step 3: Minimal memory requirement for each group = max(mem requirement from the query set) * 1.2 (safety factor). Also, set the per-query mem-limit with this number.
• Step 4: You can divide the memory based on % too, but each group should have at least the min mem derived from Step 3.
• NOTE: sum(mem assigned to all groups) can be greater than total memory available. This is OK.
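Step 3 is simple arithmetic once the peak memory of each sample query is known. A minimal Python sketch, with made-up group names and peak values; only the 1.2 safety factor comes from the steps above.

```python
SAFETY_FACTOR = 1.2  # from Step 3

def pool_min_mem_gb(peak_mem_gb_per_query):
    """Minimal pool memory = largest observed per-query peak * safety factor."""
    return max(peak_mem_gb_per_query) * SAFETY_FACTOR

# Hypothetical per-query peaks (GB) measured from query profiles, by group.
observed = {
    "HR": [2, 5, 10],
    "Analyst": [8, 20, 15],
}
pool_limits = {group: pool_min_mem_gb(peaks) for group, peaks in observed.items()}
print(pool_limits)
```

The same number doubles as the per-query mem-limit for the group. As the NOTE says, the pool limits may add up to more than the memory actually available.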



• More on step 2
  • If the memory estimate from the explain plan is inaccurate:
    • FIX: Use per-query limit to override it, but that will require you to submit the query through the shell.
    • FIX: Adjust the pool mem-limit accordingly; if it’s over the estimate, give it a higher mem-limit and vice versa.



• Limiting the number of concurrent queries
  • Goal:
    • Avoid oversubscription to CPU, disk, network because this can lead to longer response time (without improving throughput).

  • Works best with homogeneous workload
  • With heterogeneous workload, you can still apply the same approach, but the result won’t be as optimal.







Query Tuning Basics - Overview

• Given that your query runs, know how to make it faster and consume fewer resources.
  • Always Compute Stats.
  • Examine the logic of the query.
  • Validate it with Explain Plan.
  • Runtime Filter

• Runtime Analysis
  • Use Query Profile to identify bottlenecks and skew
  • Codegen


Query Tuning Basics – More on Compute Stats

• Compute Stats is very CPU-intensive, but on Impala 1.4 and later it is much faster than in previous versions.
  • Speed reference: ~40M cells per sec per node + HMS update time (1 sec per 100 partitions). ~7 million cells per sec per node if codegen is disabled (e.g. if the table contains timestamp or char columns, which don’t support codegen yet)
  • Total number of cells of a table = num rows * num cols
• Only need to recompute stats with significant changes of data (30% or more)
• Compute Stats on tables, not views
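The speed reference above gives a back-of-the-envelope runtime for Compute Stats. In the Python sketch below, the rates are the ones quoted on this slide; dividing the work evenly across nodes is an assumption, and the function name and sample table are illustrative.

```python
def compute_stats_seconds(num_rows, num_cols, num_partitions, num_nodes,
                          codegen=True):
    """Estimate Compute Stats time: cell-scan time plus HMS update time."""
    cells_per_sec_per_node = 40_000_000 if codegen else 7_000_000
    cells = num_rows * num_cols
    scan_time = cells / (cells_per_sec_per_node * num_nodes)
    hms_time = num_partitions / 100  # ~1 sec per 100 partitions
    return scan_time + hms_time

# Hypothetical table: 1B rows, 40 columns, 1,000 partitions, 10 nodes.
estimate = compute_stats_seconds(1_000_000_000, 40, 1_000, 10)
print(estimate, "seconds")
```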


Query Tuning Basics – Incremental Stats Maintenance

• Compute Stats is slow and through 2.0, Impala does not support Incremental Stats
• Column Stats (number of distinct values, min, max) can be updated by Compute Stats infrequently (when 30% or more data has changed)
• When adding a new partition, run a count(*) query on the partition and update the partition row count stats manually via ALTER TABLE
• You can set column stats manually via ALTER TABLE as well


Query Tuning Basics – Incremental Stats Maintenance

• Impala 2.1 or later supports “Compute Incremental Stats”, but use it only if the following conditions are met:
  • For all the tables that are using incremental stats, Σ(num columns * num partitions) < 650000.
  • The size of the cluster is less than 50 nodes.
  • For each table, num columns * num partitions * 400B < 200MB
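The three conditions above can be checked mechanically. A small Python sketch, assuming "200MB" means 200 * 1024 * 1024 bytes and representing each table as a (num columns, num partitions) pair; both choices are illustrative rather than prescribed by the slides.

```python
MAX_TOTAL_CELLS = 650_000
MAX_NODES = 50
MAX_TABLE_STATS_BYTES = 200 * 1024 * 1024  # assumption: MB read as 2**20 bytes

def can_use_incremental_stats(tables, num_nodes):
    """tables: list of (num_columns, num_partitions) using incremental stats."""
    total_cells = sum(cols * parts for cols, parts in tables)
    per_table_ok = all(cols * parts * 400 < MAX_TABLE_STATS_BYTES
                       for cols, parts in tables)
    return (total_cells < MAX_TOTAL_CELLS
            and num_nodes < MAX_NODES
            and per_table_ok)

print(can_use_incremental_stats([(100, 1000), (50, 2000)], num_nodes=20))
```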


Query Tuning Basics – Examining Query Logic

• Sometimes, the query can have redundant joins, distinct, group by, order by (very common during migration). Remove them!
• For common join patterns, consider pre-joining the tables.
• Use the proper join. Example:

  select fact.col, max(dim2.col)
  from fact, dim1, dim2
  where fact.key = dim1.key and fact.key2 = dim2.key
  group by fact.col

  • The join on dim1 should be a semi-join!


Query Tuning Basics – Validating Explain Plan

• Key points:
  • Validate join order and join strategy.
  • Validate partition pruning or HBase row key lookup.

• Even with stats, at times the join order/strategy might go wrong (particularly with layers of views). Can force join order by using the STRAIGHT_JOIN hint


Query Tuning Basics – Validating Join Order & Strategy

• Join Order
  • RHS is smaller than LHS

• Join Strategy - BROADCAST
  • RHS must fit in memory!


Query Tuning Basics – Common Join Performance Issues

• Hash Table
  • Ideally, the join key should be evenly distributed; only a few rows share the same join key from the RHS.
  • Is it a true foreign key join or more like a range join?

• Wrong join order – RHS is bigger than LHS table from the plan
• Too many LHS rows

[Figure: hash chain]


Query Tuning Basics – Finding Bottlenecks

• Use ExecSummary from Query Profile to identify bottlenecks


Query Tuning Basics – Finding Skew

• Use ExecSummary from Query Profile to identify skew
  • Max Time is significantly more than Avg Time => Skew!
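Once the ExecSummary numbers have been copied out of a profile, the "Max Time much larger than Avg Time" check is easy to automate. The Python sketch below takes hand-entered (operator, avg time, max time) tuples; it does not parse a real profile, and the 2x threshold is an arbitrary assumption since the slides only say "significantly more".

```python
def find_skewed_operators(exec_summary, ratio_threshold=2.0):
    """Return operators whose max time is >= ratio_threshold * avg time.

    exec_summary: list of (operator, avg_time_sec, max_time_sec) tuples.
    """
    return [op for op, avg, mx in exec_summary
            if avg > 0 and mx / avg >= ratio_threshold]

# Hypothetical values transcribed from an ExecSummary.
summary = [
    ("00:SCAN HDFS", 1.0, 1.1),
    ("03:HASH JOIN", 2.0, 9.0),
]
print(find_skewed_operators(summary))
```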


Query Tuning Basics – Finding Skew (Cont’d)

• In addition to profile, always measure CPU, memory, disk IO and network IO across the cluster.
  • An uneven distribution of the load means skew!

• Cloudera Manager’s charts can do that but only report at a minute interval
  • Use Colmux if your workload is short.


Query Tuning Basics – Typical Speed

• In a typical query, we observed the following processing rate:
  • scan node 8~10m rows per sec per core
  • join node ~10m rows per sec per core
  • agg node ~5m rows per sec per core
  • sort node ~17MB per sec per core
  • Row materialization in coord should be tiny
  • Parquet writer 1~5MB per sec per core

• If your processing rate is much lower than that, it’s worth a deeper look


Query Tuning Basics – Improving Scan Node Performance

• HDFS scan time – check how much data is read; always do as little disk read as possible; review partition strategy.
• Column materialization time – only select necessary columns! Materializing 1k columns is a lot of work.
• Complex predicates – string, regex are costly; avoid them.
• Order of predicates – Impala will try to order predicates by selectivity if stats are available. If stats are not available, you should order predicates by selectivity in your query.


Query Tuning Basics – Aggregate Performance Tuning

• Needed when many rows are going into the aggregate
• Reduce expression complexity in group by

• Complex UDA
  • No codegen for UDA yet. Try to rewrite the UDA as a built-in aggregate

• (Usually, not a big issue)


Query Tuning Basics – Exchange Performance Issues

• Too much data across network:
  • Check the query on data size reduction.
  • Check join order and join strategy; wrong order/strategy can have a serious effect on network!
  • For agg, check the number of groups – affects memory too!
  • Remove unused columns.

• Keep in mind that network is at most 10Gb.
• Cross-rack network slowness
• Query profile is usually not useful. Use CM or other system monitoring tools.


Query Tuning Basics – Storage Skew

• HDFS Block Placement Skew
  • HDFS balances disk usage evenly across the whole cluster only. An individual table (or partition)’s data could be clustered in a handful of nodes.
  • If this happens, you’ll see that some nodes are busier (disk read and usually CPU as well) than the others.
  • This is inherent to HDFS: more pronounced if query data volume is tiny when compared to the total storage capacity.
  • By running a mixed workload that accesses data of a bigger set of tables, this type of HDFS block placement skew usually evens out.


Query Tuning Basics – Data Skew

• Partitioned Join Node Performance Skew
  • Join key data skew.
  • Each join key is re-shuffled and processed by one node.
  • If a single join key value accounts for a huge chunk of the data, then the node that processes that join key will become the bottleneck.

• Possible workaround: use broadcast join, but it uses more memory


Query Tuning Basics – Expression Evaluation

• Impala evaluates expressions lazily (i.e. only when the value is needed)
• Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block.
  • Expensive expressions inside the inline views might be executed multiple times

• Example

  select concat(col1, col1, col1)
  from (select regexp_extract(ori_col, '…..') as col1 from tbl) x

  • The regexp_extract will be evaluated 3 times by the coordinator!


Query Tuning Basics – Expression Evaluation (Cont’d)

• To avoid redundant expression evaluation, we need to materialize the value.
  • These operators materialize the expression: Order by, Union [all], Group by
• Example:

  select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl) x order by col1;

  select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl union all select null from one_row_tbl where false) x;

• Takeaway: Watch out for expensive expressions inside an (inline) view!
• Takeaway: Due to lazy evaluation, it can affect all the exec nodes as well as the coordinator!


Query Tuning Basics – Client side Performance Issues

• Avoid large data extract.
  • It’s usually not a good idea to dump lots of data out using JDBC/ODBC.

• For impala-shell, use the -B option to fetch lots of data.

[Figure: query timeline showing total query time and idle time, when the client isn’t fetching]


Query Tuning Basics – Excessive Planning Time?

• Usually due to metadata loading when running the query the first time after restart (or after invalidate metadata)
  • Try running the query again. The time should be short.

[Figure: query timeline highlighting planning time]


Query Tuning Basics – Codegen

• In CDH 5.9 and before, codegen has a minimum cost (~500ms). For a simple short query (<2s), most of the time could be spent on codegen preparation and optimization. Disabling codegen could give more throughput.
  • This cost is reduced to 10-20% in CDH 5.10+

• You can set a query option to disable codegen
  • SET DISABLE_CODEGEN=1

• Some data types are not supported (CHAR, TIMESTAMP)


Query Tuning Basics – Runtime filter and dynamic partition pruning (CDH 5.7/Impala 2.5 or higher)

• Improve query performance for very selective equi-joins by reducing IO, hash join, and network traffic.
  • If the join is on a partition column, Impala can use runtime filters to dynamically prune partitions on the probe side and skip whole files. Not only can it improve performance, it can also reduce query resource usage (both memory and CPU)
  • If the join is on non-partition columns, Impala can generate bloom filters. This can greatly reduce the scan output of the probe side and the intermediate result.


Check how selective the join is

Large hash join output (~2.8B), while the output of the upstream one is much smaller (32M, reduced to only ~1%). This indicates that a runtime filter can be helpful.


How it works

[Figure: query plan annotated with steps 1 to 4]

RF002 <- d.d_date_sk
Filter coming from Exch Node 20, applied to Scan Node 02; reduces scan output from 2.8B to 32M, and all the upstream hash joins as well.

RF001 <- (d_month_seq)
Filter coming from Scan Node 05, applied to Scan Node 03; reduces scan output from 73049 to 31.


Why doesn’t it work sometimes?

The filter doesn’t take effect because the scan was short and finished before the filter arrived.

HDFS_SCAN_NODE (id=3):
  Filter 1 (1.00 MB):
    - Rows processed: 16.38K (16383)
    - Rows rejected: 0 (0)
    - Rows total: 16.38K (16384)


How to tune it?

• Increase RUNTIME_FILTER_WAIT_TIME_MS to 5000ms to let Scan Node 03 wait longer for the filter.

HDFS_SCAN_NODE (id=3)
  Filter 1 (1.00 MB)
    - InactiveTotalTime: 0
    - Rows processed: 73049
    - Rows rejected: 73018
    - Rows total: 73049
    - TotalTime: 0

• If the cluster is relatively busy, consider increasing the wait time too so that complicated queries do not miss opportunities for optimization.


Which file format is supported

• Partition filter
  • All file formats as long as it’s a partitioned table and there are mappings for partition columns. E.g.
  • select count(*) from t1, t2 where t1.partition_col = t2.col1 and t2.col2 = "1234"

• Row filter
  • Only applied for Parquet format

• Parquet can take most advantage of runtime filters.


Runtime filter Gotcha

• Runtime filters require extra memory on each host and extra work on the coordinator. Example: on a 100-node cluster, one 16MB filter
  • Memory cost: 16MB * 100 = 1.6GB for the query
  • Network cost (only for GLOBAL mode): 16MB * 100 * 2 = 3.2GB network traffic on the coordinator, and CPU time to merge and publish filters.
  • More filters, higher cost.
  • Negative impact: Coordinator becomes the bottleneck

• Resolution: reduce the # of filters by setting MAX_NUM_RUNTIME_FILTERS to a lower number, or disable row filters by setting DISABLE_ROW_RUNTIME_FILTERING=1.
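The cost figures in the example above follow from two multiplications. A minimal Python sketch that reproduces them and lets you vary the number of filters; the function name is illustrative, and scaling linearly with the filter count is an assumption based on the "more filters, higher cost" note.

```python
def runtime_filter_cost_mb(filter_size_mb, num_nodes, num_filters=1,
                           global_mode=True):
    """Return (memory_mb, coordinator_network_mb) for a query's runtime filters."""
    memory = filter_size_mb * num_nodes * num_filters
    # In GLOBAL mode each filter travels to the coordinator and back out.
    network = 2 * memory if global_mode else 0
    return memory, network

# The slide's example: one 16MB filter on a 100-node cluster.
memory_mb, network_mb = runtime_filter_cost_mb(16, 100)
print(memory_mb / 1000, "GB memory,", network_mb / 1000, "GB network")
```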







Interaction with Sentry, Hive, and Parquet

• Setup: Cloudera Manager 5.x does a good job; verify the dependency parents, such as Hive Metastore, HDFS.
  • Stability in HMS might affect Impala; check HMS health.

• File-based and Apache Sentry security
  • Even with Sentry, Impala needs to read/write all dir/files. No impersonation.


Summary

• Approach cluster sizing systematically - SLA and workload
• Benchmark running technique and measurement – QPS
• Use Admission Control for multi-tenancy
• Tune your queries - identify and attack bottlenecks
• Cloudera Manager 5.0+ provides a tool for verifying whether many best practices have been followed


Other Resources

• Impala User Guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala.html
• Impala website (roadmap, repository, books, more) & Twitter: https://impala.apache.org, @ApacheImpala
• Community resources:
  • Mailing list: [email protected]
  • Forum: http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/bd-p/Impala


Thank you
https://impala.apache.org
@ApacheImpala