© Cloudera, Inc. All rights reserved.
The Impala Cookbook
From Cloudera’s Impala Team
Updated Jan. 2017
*Currently an Apache Incubator project
Topic Outline
• Part 1 - The Basics
  • Physical and Schema Design
  • Memory Usage in Impala
• Part 2 - The Practical Issues
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking Impala
  • Multi-tenancy Best Practices
  • Query Tuning Basics
• Part 3 - Outside Impala
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet
Physical and Schema Design - Outline
• Schema design best practices
  • Data types
  • Partition design
  • Complex types
  • Common questions
• Physical design
  • File format - when to use what?
  • Block size (option)
Physical and Schema Design - Data Types
• Use numeric types (not strings)
  • Avoid string types when possible
  • Strings => higher memory consumption, more storage, slower to compute (80% slower compared to numeric)
  • E.g. for a date string “20161031” or Unix time “1479459272”, switch to bigint
• Decimal vs Float/Double
  • Decimal is easier to reason about
  • Currently can't use Decimal as a partition key or in UDFs
• Use String only for:
  • HBase row key - string is the recommended type!
  • Timestamp - OK to use string, but consider using numeric as well!
• Prefer String over Char/Varchar (except for SAS)
Partition Design: 3 Simple Rules
1. Identify access patterns from the use case or existing SQL.
2. Estimate the total number of partitions (better be < 100k!).
3. (Optional) If needed, reduce the number of partitions.
Partition Design - Identify Access Patterns
• The table columns that are commonly used in the WHERE clause are candidates for partition keys.
  • Date is almost always a common access pattern and the most common partition key.
• Can have MULTIPLE PARTITION KEYS! Examples:
  • Select col_1 from store_sales where sold_date between '2014-01-31' and '2016-02-23';
  • Select count(revenue) from store_sales where store_group_id in (1,5,10);
  • Partition keys => sold_date, store_group_id
Partition Design - Estimate the # Partitions
• Estimate the number of distinct values (NDV) for each partition key (for the required storage duration). Example:
  • If partitioned by date and need to store for 1 year, then NDV for the date partition key is 365.
  • num store_group will grow over time, but it will never exceed 52 (one for each state).
• Total number of partitions = NDV for part key 1 * NDV for part key 2 * … * NDV for partition key N. Example:
  • Total number of partitions = 365 (for date part) * 52 (for store_group) ~= 19k
• Make sure # partitions <= 100k!
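The estimate above is a plain product of per-key NDVs. A minimal Python sketch using the slide's own numbers follows; the function and variable names are illustrative and not part of Impala.

```python
from math import prod

MAX_PARTITIONS = 100_000  # guideline from the slide


def estimate_partitions(ndv_per_key):
    """Multiply the NDV of every partition key to get the worst-case partition count."""
    return prod(ndv_per_key.values())


# 365 days of retention and at most 52 store groups, as in the example.
ndv = {"sold_date": 365, "store_group_id": 52}
total = estimate_partitions(ndv)
print(total, total <= MAX_PARTITIONS)
```

Adding a third partition key multiplies the total again, which is why the 100k guideline is easy to exceed.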
Partition Design - Too Many Partitions?
• Remove some “unimportant” partition keys.
  • If a partition key isn’t as routinely used or it doesn’t impact the SLA, remove it!
• Create partition “buckets” and specify the bucket id in downstream select queries.
  • Use month rather than date.
  • Create an artificial store_group to group individual stores.
  • Technique: use prefix or hash
    • Hash(store_id) % store_group size => hash it to store_group
    • Substring(store_id, 1, 2) => use the first 2 digits as artificial store_group (string positions in Impala start at 1)
    • e.g. select … from tbl where store_id = 5 and store_group = store_id % 50; -- Assume 50 store groups
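The two bucketing techniques can be sketched in Python as below. This is only an illustration of the arithmetic, assuming 50 store groups as in the example; the helper names are hypothetical.

```python
NUM_STORE_GROUPS = 50  # assumed bucket count, as in the slide's example


def store_group_by_hash(store_id: int) -> int:
    """Hash technique: map a numeric store_id to one of NUM_STORE_GROUPS buckets."""
    return store_id % NUM_STORE_GROUPS


def store_group_by_prefix(store_id: str) -> str:
    """Prefix technique: use the first two characters of the id as the bucket."""
    return store_id[:2]


print(store_group_by_hash(5), store_group_by_hash(123), store_group_by_prefix("48213"))
```

Whichever technique is used, the query must repeat the same expression in its WHERE clause so that partition pruning can apply.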
Nested Types: Known Issues
• The maximum size of a single row fed into a join build cannot exceed 8MB (buffered-tuple-stream block size). For complex types this means:
  • Size of all materialized collections (post projection and filtering) cannot exceed 8MB for certain plan shapes (expected to be rare).
• IMPALA-2603: Incorrect plan is generated for inline view referencing complex types.
  • Inline view has several relative table refs that refer to different ancestor query blocks within the same nesting hierarchy.
Schema Design - Common Issues
• Number of columns - 2k max
  • Not a hard limit; Impala and Parquet can handle even more, but…
  • It slows down Hive Metastore metadata update and retrieval
  • It leads to big column stats metadata, especially for incremental stats
• Timestamp/Date
  • Use timestamp for date;
  • Date as partition column: use string or int (20150413 as an integer!)
• BLOB/CLOB - use string
• String size - no definitive upper bound but 1MB seems OK
  • Larger-sized strings can crash Impala!
  • Use it sparingly - the whole 1MB string will be shipped everywhere
Physical Design - File Format
• Parquet/Snappy
  • The long-term storage format
  • Always good for reading!
  • Write is very slow (reportedly 10x slower than Avro).
• Snappy vs Gzip
  • Snappy is usually a better tradeoff between compression ratio and CPU.
  • But, run your own benchmark to confirm!
• For write-once-read-once tmp ETL tables, consider text instead because:
  • Writing is faster.
  • Impala can perform the write.
Physical Design - Block Size
• Number of blocks defines the degree of parallelism:
  • True for both MapReduce and Impala
  • Each block is processed by a single CPU core
  • To leverage all CPU cores across the cluster, num blocks >= num cores
• Larger block size:
  • Better compression and IO throughput, but fewer blocks, could reduce parallelism
• Smaller block size:
  • More parallelism, but could reduce IO throughput
  • Can cause metadata bloat and create bottlenecks on HDFS NameNode (NN RPC overhead 40K-50K/s), Hive Metastore, Impala catalog service
Physical Design - Block Size
• For Apache Parquet, ~256MB is good and no need to go above 1GB.
  • Don’t go below 64MB except when you need more parallelism!
• (Advanced) If you really want to confirm the block size, use the following equation:
  • Block Size <= p * t * c / s
    • p - disk scan rate at 100MB/sec
    • t - desired response time of the query in sec
    • c - concurrency
    • s - % of column selected
• Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency
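A worked instance of the block-size equation, as a hedged Python sketch. The input values are hypothetical, and treating s as a fraction between 0 and 1 is an assumption, since the slide only describes it as "% of column selected".

```python
def max_block_size_mb(p_mb_per_sec, t_sec, c, s):
    """Upper bound on block size from the slide: p * t * c / s.

    p_mb_per_sec: disk scan rate (the slide assumes 100 MB/sec)
    t_sec:        desired query response time in seconds
    c:            concurrency
    s:            share of columns selected, as a fraction (assumption)
    """
    return p_mb_per_sec * t_sec * c / s


# Hypothetical inputs: 100 MB/s scan rate, 2 s target, concurrency 1, half the columns read.
bound = max_block_size_mb(100, 2, 1, 0.5)
print(bound)
```

The result is only a ceiling; the ~256MB guideline above still applies when the bound comes out larger.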
Memory Usage - The Basics
• Memory is used by:
  • Hash join - RHS tables after decompression, filtering and projection
  • Group by - proportional to the # groups
  • Parquet writer buffer - 256MB per partition
  • IO buffer (shared across queries)
• Memory held and reused by later queries
  • Impala releases memory from time to time in 1.4 and later
Catalog Memory Usage
• Metadata cache heap memory usage can be calculated by
  • num of tables * 5KB + num of partitions * 2KB + num of files * 750B + num of file blocks * 300B + sum(incremental col stats per table)
  • Incremental stats
    • For each table, num columns * num partitions * 400B
• Catalog-Update topic size should be no more than 2GB
  • http://<statestorehost>:25010/topics
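The catalog heap formula can be turned into a small calculator. The sketch below is illustrative only: it assumes 1KB = 1024 bytes, and the cluster sizes in the example are made up.

```python
KB = 1024  # assumption: the slide's "KB" is treated as 1024 bytes


def incremental_stats_bytes(num_columns, num_partitions):
    """Per-table incremental stats size: num columns * num partitions * 400B."""
    return num_columns * num_partitions * 400


def catalog_heap_bytes(num_tables, num_partitions, num_files, num_blocks,
                       incremental_stats=0):
    """Metadata cache heap estimate, term by term as listed on the slide."""
    return (num_tables * 5 * KB
            + num_partitions * 2 * KB
            + num_files * 750
            + num_blocks * 300
            + incremental_stats)


# Hypothetical catalog: 1,000 tables, 100,000 partitions, 1M files, 2M blocks.
estimate = catalog_heap_bytes(1_000, 100_000, 1_000_000, 2_000_000)
print(round(estimate / 1e9, 2), "GB")
```

In this example the file and block terms dominate, which matches the earlier advice to compact tables and avoid small files.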
Memory Usage - Estimating Memory Usage
• Use Explain Plan
  • Requires statistics! Memory estimate without stats is meaningless.
  • Reports per-host memory requirement for this cluster size.
  • Re-run if you’ve re-sized the cluster!
Memory Usage - Estimating Memory Usage (Cont’d)
• EXPLAIN’s memory estimate issues
  • Can be way off - much higher or much lower.
  • group by’s estimate can be particularly off - when there’s a large number of group by columns.
  • Mem estimate = NDV of group by column 1 * NDV of group by column 2 * … NDV of group by column n
• Ignore EXPLAIN’s estimate if it’s too high!
• Do your own estimate for group by
  • GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num node
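The manual GROUP BY estimate is simple enough to compute by hand; a minimal Python sketch follows. The group count, row size, and node count in the example are hypothetical.

```python
def group_by_mem_bytes(num_groups, row_size_bytes, num_nodes):
    """GROUP BY memory per the slide: groups * row size, plus that same amount / num nodes."""
    total = num_groups * row_size_bytes
    return total + total / num_nodes


# Hypothetical query: 100M groups, 16-byte rows (e.g. two BIGINT columns), 10 nodes.
estimate = group_by_mem_bytes(100_000_000, 16, 10)
print(round(estimate / 1e9, 2), "GB")
```

Note that the first term does not shrink as nodes are added, so a larger cluster reduces this estimate only slightly.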
Memory Usage - Finding Actual Memory Usage
• Search for “Per Node Peak Memory Usage” in the profile.
  • This is accurate. Use it for production capacity planning.
Memory Usage - Finding Actual Memory Usage (Cont’d)
• For complex queries, how do I know which part of my query is using too much memory or causing an Out-Of-Memory error (i.e. hitting the mem-limit)?
  • Look at the “Peak Mem” in the ExecSummary from the query profile!
Memory Usage - Hitting Mem-limit
• Top causes (in order) of hitting mem-limit even when running a single query:
1. Lack of statistics
2. Lots of joins within a single query
3. Big-table joining big-table
4. Gigantic group by
Memory Usage - Hitting Mem-limit (Cont’d)
• Lack of stats
  • Wrong join order, wrong join strategy, wrong insert strategy
  • Explain Plan tells you that!
• Fix: Compute Stats <table>
  • For a huge table where Compute Stats takes too long to finish, you can manually set table/column stats
Memory Usage - Hitting Mem-limit (Cont’d)
• Lots of joins within a single query
  • select … from fact, dim1, dim2, dim3, … dimN where …
  • Each dim table can fit in memory, but not all of them together
  • As of CDH 5.4 / Impala 2.2,
    • Impala might choose the wrong plan - BROADCAST
    • Impala sometimes requires 256MB as the minimal requirement per join!
• FIX 1: use shuffle hint
  • Select … from fact join [shuffle] dim1 on … join [shuffle] dim2 …
• FIX 2: pre-join the dim tables (if possible). Fewer joins => better perf!
Memory Usage - Hitting Mem-limit (Cont’d)
• Big-table join big-table
  • Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
  • CDH 5.4 / Impala 2.0 and later will do this (via disk-based join).
  • (Advanced) For a simple query, you can try this advanced workaround - per-partition join
    • Requires the partition key be part of the join key
Select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
union all
Select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)
Memory Usage - Disk-based Join/Agg
• Disk-based join/agg should be your last resort to deal with hitting mem-limit.
• Rely on disk-based join/agg if there is only one join/agg operator in the query. For example:
  • Good: select a.*, b.* from a, b where a.id = b.id
  • Good: select a.id, a.timestamp, count(*) from a group by a.id, a.timestamp
  • OK: select large_tbl.id, count(*) from large_tbl join tiny_tbl on (id) group by id
  • Bad: select t1.id, count(*) from large_tbl_1 t1, large_tbl_2 t2 where t1.id = t2.id group by t1.id
  • Bad: select a.*, b.*, c.* from a, b, c where a.id = b.id and b.col1 = c.col2;
• Always set the per-query mem-limit (2GB min) when using disk-based join/agg!
Memory Usage - Additional Notes
• Use explain plan for estimate; use profile for accurate measurement
• Data skew can cause uneven memory/CPU usage
• Review previous common issues on out-of-memory
• Even with disk-based join in Impala 2.0 and later, you'll want to review the Query Tuning steps to speed up queries and use memory more efficiently.
Hardware Recommendations
• 128GB (assigned to Impala) or more for best price/performance
• Spindles vs SSD
  • Spindles are more cost effective
  • Most workloads are CPU bound; SSD won’t make a difference at all
• 10Gb network
Cluster Sizing - Objective and Keys
• Objective:
  • The recommended cluster size should run the desired workload within a given SLA throughout the projected lifespan of the cluster.
• Keys:
  • Workload - defines the functional requirement
  • SLA - defines the performance requirement
  • Projected lifespan - how will things change over time?
Cluster Sizing - SLA
• Query Throughput
  • How many queries should this cluster process per second?
  • This is the more meaningful measurement of “computing power”
• Query Response Time
  • How fast do you need the query to run?
  • Typically, single query response time isn’t too meaningful because there are always multiple queries running concurrently!
  • Some use cases require very fast response time, e.g. powering a web UI.
• Will more people be running queries over time? This means higher query throughput!
Cluster Sizing - Workload
• From the workload, you’ll want to know:
  • How much memory do you need?
  • How much processing power do you need? (i.e. how complex is the workload?)
  • How much IO bandwidth do you need?
• The bigger the cluster, the more total memory, CPU, and disk-IO bandwidth you have.
  • But usually, the network bandwidth is fixed.
Cluster Sizing - Memory Requirements
• How much memory do you need?
  • Any huge group by expression?
    • Per Node Mem >= number of distinct groups * row size + (number of distinct groups * row size) / num node
    • Number of distinct groups: hard to guess; just get a rough ballpark.
    • Row size: # columns involved in the query * column width
    • Column width for int is 4 bytes, bigint 8 bytes, etc. For string columns, take some rough guess.
    • Increasing the cluster size won’t help much to reduce the per-node mem requirement.
Cluster Sizing - Connection Requirements
• The larger the cluster, the more intra-connections needed between impalads. With high concurrency and/or very complex queries, this could cause a connection storm and query failures.
  • Impala caches established connections and re-uses them. If the workload is executed again, the existing connection pool will be leveraged to satisfy each connection request.
• On a Kerberos secured cluster, each impalad needs to authenticate with every other impalad. Requires: N * N KDC tickets. This could overwhelm a single KDC server.
  • If the cluster has more than 200 nodes, consider using more KDC servers to balance load.
Cluster Sizing - Workload Complexity (Cont’d)
• (Advanced) If you’re ready to dive deep into workload analysis…
  • Typically, you can assume the following processing rate (3 columns, parquet format with snappy):
    • Scan node 8~10m rows per sec per core
    • Join node (single join predicate) ~10m rows per sec per core
    • Agg node (single agg) ~5m rows per sec per core
    • Sort node ~17MB per sec per core
    • Parquet writer 1~5MB per sec per core
• From a query in the workload, based on the # join/agg you can estimate # rows passing through the operator node then estimate the effect of partition pruning.
  • Using the above processing rate, you can estimate how much CPU time it took to process a query.
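The CPU-time estimate described above can be sketched as follows. The per-core rates come from the slide (using the lower bound for scan); the row counts and the cluster shape are hypothetical, and perfect parallelism across cores is an optimistic assumption.

```python
# Rows per second per core, taken from the slide (lower bound used for scan).
ROWS_PER_SEC_PER_CORE = {"scan": 8_000_000, "join": 10_000_000, "agg": 5_000_000}


def cpu_seconds(rows_per_operator):
    """Sum core-seconds over (operator, rows) pairs using the reference rates."""
    return sum(rows / ROWS_PER_SEC_PER_CORE[op] for op, rows in rows_per_operator)


# Hypothetical plan: 800M rows scanned, 100M rows joined, 50M rows aggregated.
plan = [("scan", 800_000_000), ("join", 100_000_000), ("agg", 50_000_000)]
core_seconds = cpu_seconds(plan)

# Best-case wall-clock time on a hypothetical 10-node, 12-core-per-node cluster.
best_case_wall_clock = core_seconds / (10 * 12)
print(core_seconds, best_case_wall_clock)
```

Dividing the target throughput by the inverse of this figure gives a first guess at how many cores the workload needs.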
Cluster Sizing - Summary
• Cluster sizing depends on SLA and workload. You need to know both!
• Memory requirement for doing big join/agg in memory:
  • Total Cluster mem >= the 2nd largest big table in the join after decompression, filtering, and projection
  • Per Node Mem >= number of distinct groups * row size + (number of distinct groups * row size) / num node
Benchmarking - Why Run One?
• Understand how Impala performs, how it scales, how it compares to the current system
  • Measure query response time as well as query throughput
• Understand how Impala utilizes resources
  • Measure CPU, disk, network and memory
Benchmarking Impala - Preparing the Workload
• Should be relevant to (or satisfy) the business requirement.
  • Don’t run select * from big_tbl limit 10 - it’s meaningless.
• Should not be dictated by the query form.
  • You should be prepared to change the query/schema to deliver a meaningful benchmark.
  • Tune the schema/query!
• Stay close to the query that you’re going to run in production.
  • If the result has to be written to disk, then write to disk and DO NOT send results back to the client.
  • Don’t stream all the data to the client (i.e. data extraction).
• Use a fast client: You’re benchmarking the server, not the client, so don’t make the client a bottleneck.
Benchmarking - Avoiding Traps
• It’s easier to start with a smaller dataset and simpler query. Trying to run a complex query on a huge dataset on a small cluster is not effective.
• A dataset that’s too small can’t utilize the whole cluster. Have at least one block per disk.
• Disable Admission Control when you’re doing a benchmark!
Benchmarking - Preparing the Hardware
• Should be as similar to go-live hardware as possible
• Recommended: at least 10 nodes with 128GB each
• CAUTION: If the cluster is too small (e.g. 3 nodes), it’s very hard to see the effect of scalability and identify potential bottlenecks
Benchmarking - Measuring Single-Query Response Time
• Use impala-shell (simple, easy to use) with the -B option. This disables pretty formatting so the client won’t be the bottleneck.
  • impala-shell -B -q "<your query>"
• To simulate the effect of buffer cache, run it a few times to warm the buffer cache before measuring the result.
• To simulate the effect without buffer cache, clear the buffer cache by running: echo 1 > /proc/sys/vm/drop_caches
Benchmarking - Measuring Query Throughput
• Benchmark using JDBC (or use JMeter! Make sure you set NativeQuery=1)
  • Input parameters: list of hosts to connect to, workload queries, duration of the benchmark, and number of concurrent connections.
  • Each connection will pick a host to connect to and keep running a query for the specified duration.
  • Report QPS.
  • Just keep increasing the number of connections until QPS doesn’t increase - that will be the QPS of the system.
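The last step, finding where QPS stops increasing, can be automated once the per-concurrency results are collected. The Python sketch below is one possible way to do it: the 5% minimum-gain threshold and the measured numbers are made up for illustration, and the benchmark driver itself is out of scope.

```python
def saturation_point(qps_by_connections, min_gain=0.05):
    """Return (connections, qps) at the last step that still improved QPS by min_gain."""
    best_conns, best_qps = None, 0.0
    for conns in sorted(qps_by_connections):
        qps = qps_by_connections[conns]
        # Stop at the first concurrency level that fails to beat the best so far.
        if best_conns is not None and qps < best_qps * (1 + min_gain):
            break
        best_conns, best_qps = conns, qps
    return best_conns, best_qps


# Hypothetical benchmark results: number of connections -> measured QPS.
measured = {1: 4.0, 2: 7.8, 4: 14.9, 8: 21.0, 16: 21.3, 32: 20.8}
print(saturation_point(measured))
```

Here QPS flattens after 8 connections, so roughly 21 QPS would be reported as the throughput of the system.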
Benchmarking - General Performance Notes
• Gather query profiles and system utilization
  • You can get all these from CM or /var/log/impalad/profiles!
• Performance vs Hive - Impala will ALWAYS be faster after proper tuning. If not, something is wrong with the benchmark.
  • Impala vs Hive-on-Tez / Spark SQL benchmark: https://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
  • See the open TPC-DS toolkit to run your own! https://github.com/cloudera/impala-tpcds-kit
Multi-tenancy Best Practices - Admission Control
• Recommended: Statically partition a fixed amount of memory for Impala and use Admission Control
  • See the “Memory Usage” section for determining memory usage!
• The best public resource is this new Impala RM ‘how to’ guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_howto_rm.html
Multi-tenancy Best Practices - Preventing a “Runaway” Query
• “Runaway” query = a user accidentally submitted a “wrong” query that consumes a significant amount of memory
• Limit the amount of memory used by an individual query using the per-query mem-limit:
  • Set it from Impala shell (on a per-query basis): set mem_limit=<per query limit>
  • Set a default per-query mem limit: -default_query_options='mem_limit=<per query limit>'
• On Impala 2.6 and later, you can set the default per-query mem limit at pool level
Multi-tenancy Best Practices - Admission Control
• How to approach Admission Control config:
  • Workload dependent - memory bound or not?
  • “Memory bound” means that you’ve used up the memory allocated to Impala before hitting the limit on CPU, disk, network.
• Memory bound - use mem-limit
• Non-memory bound - use number of concurrent queries
Multi-tenancy Best Practices - Admission Control (Cont’d)
• Goals of memory limit setting:
  • Prevents each group of users from overcommitting system memory
  • Prevents queries from hitting mem-limit
  • (Secondary) Simulates priority by allocating more memory to important groups
Multi-tenancy Best Practices - Admission Control (Cont’d)
• Step 1: Identify a sample workload from each user “group”, such as HR, Analyst, etc.
• Step 2*: For each query in the workload, identify the accurate memory usage by running the query. This is the memory requirement for the query.
• Step 3: Minimal memory requirement for each group = max(mem requirement from the query set) * 1.2 (safety factor). Also, set the per-query mem-limit with this number.
• Step 4: You can divide the memory based on % too, but each group should have at least the min mem derived from Step 3.
• NOTE: sum(mem assigned to all groups) can be greater than total memory available. This is OK.
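Step 3 can be expressed as a few lines of Python. This is a sketch only: the group names and the measured peak memory values (in GB) are hypothetical, and in practice they would come from the query profiles gathered in Step 2.

```python
SAFETY_FACTOR = 1.2  # from Step 3 on the slide


def min_group_mem_gb(peak_mem_gb_per_query):
    """Minimal memory for a group: the largest measured query peak times the safety factor."""
    return max(peak_mem_gb_per_query) * SAFETY_FACTOR


# Hypothetical measured peak memory (GB) of each sample query, per user group.
workloads = {"hr": [2.0, 5.0, 3.5], "analyst": [10.0, 7.5]}
limits = {group: min_group_mem_gb(peaks) for group, peaks in workloads.items()}
print(limits)
```

The same number is used twice: as the floor for the group's pool and as its per-query mem-limit.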
Multi-tenancy Best Practices - Admission Control (Cont’d)
• More on step 2
  • If the memory estimate from the explain plan is inaccurate:
    • FIX: Use a per-query limit to override it, but that will require you to submit the query through the shell.
    • FIX: Adjust the pool mem-limit accordingly; if it’s over the estimate, give it a higher mem-limit and vice versa.
Multi-tenancy Best Practices - Admission Control (Cont’d)
• Limiting the number of concurrent queries
  • Goal:
    • Avoid oversubscription to CPU, disk, network because this can lead to longer response time (without improving throughput).
  • Works best with a homogeneous workload
  • With a heterogeneous workload, you can still apply the same approach, but the result won’t be as optimal.
Query Tuning Basics - Overview
• Given that your query runs, know how to make it faster and consume fewer resources.
  • Always Compute Stats.
  • Examine the logic of the query.
  • Validate it with Explain Plan.
  • Runtime Filter
• Runtime Analysis
  • Use Query Profile to identify bottlenecks and skew
  • Codegen
Query Tuning Basics - More on Compute Stats
• Compute Stats is very CPU-intensive, but on Impala 1.4 and later is much faster than previous versions.
  • Speed reference: ~40M cells per sec per node + HMS update time (1 sec per 100 partitions). ~7 million cells per sec per node if codegen is disabled (if the table contains timestamp or char columns, which don’t support codegen yet)
  • Total number of cells of a table = num rows * num cols
• Only need to recompute stats with significant changes of data (30% or more)
• Compute Stats on tables, not views
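The speed reference allows a rough estimate of how long Compute Stats will run. The Python sketch below applies the slide's rates; the table dimensions are hypothetical, and linear scaling with the node count is an assumption implied by the "per node" rate.

```python
def compute_stats_seconds(num_rows, num_cols, num_partitions, num_nodes, codegen=True):
    """Estimated Compute Stats duration using the slide's speed reference."""
    # ~40M cells/sec/node with codegen, ~7M cells/sec/node without.
    cells_per_sec_per_node = 40_000_000 if codegen else 7_000_000
    scan_time = num_rows * num_cols / (cells_per_sec_per_node * num_nodes)
    hms_time = num_partitions / 100  # 1 sec per 100 partitions
    return scan_time + hms_time


# Hypothetical table: 10B rows, 40 columns, 1,000 partitions, on a 10-node cluster.
print(compute_stats_seconds(10_000_000_000, 40, 1_000, 10))
```

For this example the scan takes about 1,000 seconds and the HMS update about 10, so the scan dominates.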
Query Tuning Basics - Incremental Stats Maintenance
• Compute Stats is slow and, through 2.0, Impala does not support Incremental Stats
• Column stats (number of distinct values, min, max) can be updated by Compute Stats infrequently (when 30% or more data has changed)
• When adding a new partition, run a count(*) query on the partition and update the partition row count stats manually via ALTER TABLE
• You can set column stats manually via ALTER TABLE as well
Query Tuning Basics - Incremental Stats Maintenance
• Impala 2.1 or later supports “Compute Incremental Stats”, but use it only if the following conditions are met:
  • For all the tables that are using incremental stats, Σ(num columns * num partitions) < 650000.
  • The size of the cluster is less than 50 nodes.
  • For each table, num columns * num partitions * 400B < 200MB
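The three conditions can be checked mechanically. The Python sketch below is illustrative: the table shapes are hypothetical and 200MB is taken as 200 * 1024 * 1024 bytes, which is an assumption.

```python
MAX_TOTAL_CELLS = 650_000
MAX_NODES = 50
MAX_TABLE_STATS_BYTES = 200 * 1024 * 1024  # assumption: MB = 1024 * 1024 bytes


def can_use_incremental_stats(tables, num_nodes):
    """tables: list of (num_columns, num_partitions) for tables using incremental stats."""
    total_cells = sum(cols * parts for cols, parts in tables)
    per_table_ok = all(cols * parts * 400 < MAX_TABLE_STATS_BYTES for cols, parts in tables)
    return total_cells < MAX_TOTAL_CELLS and num_nodes < MAX_NODES and per_table_ok


# Hypothetical schema: two partitioned tables on a 20-node cluster.
print(can_use_incremental_stats([(100, 1_000), (50, 2_000)], num_nodes=20))
```

If any of the checks fails, the manual row-count update described on the previous slide remains the alternative.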
Query Tuning Basics - Examining Query Logic
• Sometimes, the query can have redundant joins, distinct, group by, order by (very common during migration). Remove them!
• For common join patterns, consider pre-joining the tables.
• Use the proper join. Example:
select fact.col, max(dim2.col)
from fact, dim1, dim2
where fact.key = dim1.key and fact.key2 = dim2.key
group by fact.col
• The join on dim1 should be a semi-join, because no column from dim1 is selected!
Query Tuning Basics - Validating Explain Plan
• Key points:
  • Validate join order and join strategy.
  • Validate partition pruning or HBase row key lookup.
• Even with stats, at times the join order/strategy might go wrong (particularly with layers of views). You can force the join order by using the STRAIGHT_JOIN hint
Query Tuning Basics - Validating Join Order & Strategy
• Join Order
  • RHS is smaller than LHS
• Join Strategy - BROADCAST
  • RHS must fit in memory!
Query Tuning Basics - Common Join Performance Issues
• Hash Table
  • Ideally, the join key should be evenly distributed; only a few rows share the same join key from the RHS.
  • Is it a true foreign key join or more like a range join?
• Wrong join order - RHS is bigger than LHS table from the plan
• Too many LHS rows
[Figure: hash chain]
Query Tuning Basics - Finding Bottlenecks
• Use ExecSummary from Query Profile to identify bottlenecks
Query Tuning Basics - Finding Skew
• Use ExecSummary from Query Profile to identify skew
  • Max Time is significantly more than Avg Time => Skew!
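The Max Time versus Avg Time comparison can be scripted once the ExecSummary rows are extracted. In the Python sketch below, the operator names and timings are hypothetical, and the 2x ratio is an arbitrary threshold standing in for "significantly more".

```python
def skewed_operators(exec_summary, ratio=2.0):
    """Return operators whose max time is at least `ratio` times their average time."""
    return [op for op, avg_ms, max_ms in exec_summary
            if avg_ms > 0 and max_ms / avg_ms >= ratio]


# Hypothetical ExecSummary rows: (operator, Avg Time in ms, Max Time in ms).
summary = [
    ("00:SCAN HDFS", 1200.0, 1350.0),
    ("03:HASH JOIN", 800.0, 4100.0),
    ("05:AGGREGATE", 300.0, 320.0),
]
print(skewed_operators(summary))
```

Only the hash join is flagged here, which would point toward join key skew rather than storage skew.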
Query Tuning Basics - Finding Skew (Cont’d)
• In addition to the profile, always measure CPU, memory, disk IO and network IO across the cluster.
  • An uneven distribution of the load means skew!
• Cloudera Manager’s charts can do that but only report at a minute interval
  • Use Colmux if your workload is short.
Query Tuning Basics - Typical Speed
• In a typical query, we observed the following processing rates:
  • scan node 8~10m rows per sec per core
  • join node ~10m rows per sec per core
  • agg node ~5m rows per sec per core
  • sort node ~17MB per sec per core
  • Row materialization in coord should be tiny
  • Parquet writer 1~5MB per sec per core
• If your processing rate is much lower than that, it’s worth a deeper look
Query Tuning Basics - Improving Scan Node Performance
• HDFS scan time - check how much data is read; always do as little disk read as possible; review partition strategy.
• Column materialization time - only select necessary columns! Materializing 1k cols is a lot of work.
• Complex predicates - string, regex are costly; avoid them.
• Order of predicates - Impala will try to order predicates by selectivity if stats are available. If stats are not available, you should order predicates by selectivity in your query.
Query Tuning Basics - Aggregate Performance Tuning
• Needed when many rows are going into the aggregate
• Reduce expression complexity in group by
• Complex UDA
  • No codegen for UDA yet. Try to rewrite the UDA as a built-in aggregate
• (Usually, not a big issue)
Query Tuning Basics - Exchange Performance Issues
• Too much data across network:
  • Check the query on data size reduction.
  • Check join order and join strategy; wrong order/strategy can have a serious effect on network!
  • For agg, check the number of groups - affects memory too!
  • Remove unused columns.
• Keep in mind that network is at most 10Gb.
• Cross-rack network slowness
• Query profile is usually not useful. Use CM or other system monitoring tools.
Query Tuning Basics - Storage Skew
• HDFS Block Placement Skew
  • HDFS balances disk usage evenly across the whole cluster only. An individual table (or partition)’s data could be clustered in a handful of nodes.
  • If this happens, you’ll see that some nodes are busier (disk read and usually CPU as well) than the others.
  • This is inherent to HDFS: more pronounced if query data volume is tiny when compared to the total storage capacity.
  • By running a mixed workload that accesses data of a bigger set of tables, this type of HDFS block placement skew usually evens out.
Query Tuning Basics - Data Skew
• Partitioned Join Node Performance Skew
  • Join key data skew.
  • Each join key is re-shuffled and processed by one node.
  • If a single join key value accounts for a huge chunk of the data, then the node that processes that join key will become the bottleneck.
• Possible workaround: use broadcast join, but it uses more memory
Query Tuning Basics - Expression Evaluation
• Impala evaluates expressions lazily (i.e. only when the value is needed)
• Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block.
• Expensive expressions inside the inline views might be executed multiple times
• Example
select concat(col1, col1, col1)
from (select regexp_extract(ori_col, '…..') as col1 from tbl) x
• The regexp_extract will be evaluated 3 times by the coordinator!
Query Tuning Basics - Expression Evaluation (Cont’d)
• To avoid redundant expression evaluation, we need to materialize the value.
• These operators materialize the expression: Order by, Union [all], Group by
• Example:
select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl) x order by col1;
select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl union all select null from one_row_tbl where false) x;
• Takeaway: Watch out for expensive expressions inside an (inline) view!
• Takeaway: Due to lazy evaluation, it can affect all the exec nodes as well as the coordinator!
Query Tuning Basics - Client-side Performance Issues
• Avoid large data extracts.
  • It’s usually not a good idea to dump lots of data out using JDBC/ODBC.
• For impala-shell, use the -B option to fetch lots of data.
[Figure: query timeline showing total query time and idle time, when the client isn’t fetching]
Query Tuning Basics - Excessive Planning Time?
• Usually due to metadata loading when running the query the first time after restart (or after invalidate metadata)
  • Try running the query again. The time should be short.
[Figure: query timeline highlighting planning time]
Query Tuning Basics - Codegen
• In CDH 5.9 and before, codegen has a minimum cost (~500ms). For a simple short query (<2s), most of the time could be spent on codegen preparation and optimization. Disabling codegen could give more throughput.
  • This cost is reduced to 10-20% in CDH 5.10+
• You can set a query option to disable codegen
  • SET DISABLE_CODEGEN=1
• Some data types are not supported (CHAR, TIMESTAMP)
Query Tuning Basics - Runtime Filter and Dynamic Partition Pruning (CDH 5.7 / Impala 2.5 or higher)
• Improve query performance for very selective equi-joins by reducing IO, hash join, and network traffic.
• If the join is on a partition column, Impala can use runtime filters to dynamically prune partitions on the probe side and skip whole files. Not only can it improve performance, it can also reduce query resource usage (both memory and CPU).
• If the join is on non-partition columns, Impala can generate bloom filters. This can greatly reduce the scan output of the probe side and the intermediate results.
Check how selective the join is
The hash join input is large (~2.8B rows), while the output of the upstream join is much smaller (32M rows, reduced to only ~1%). This indicates that a runtime filter can be helpful.
How it works
[Figure: annotated query plan with steps 1-4]
• RF002 <- d.d_date_sk: filter coming from Exch Node 20, applied to Scan Node 02. Reduces scan output from 2.8B to 32M, and all the upstream hash joins as well.
• RF001 <- (d_month_seq): filter coming from Scan Node 05, applied to Scan Node 03. Reduces scan output from 73049 to 31.
Why doesn’t it work sometimes?
The filter doesn’t take effect because the scan was short and finished before the filter arrived.
HDFS_SCAN_NODE (id=3):
  Filter 1 (1.00 MB):
    - Rows processed: 16.38K (16383)
    - Rows rejected: 0 (0)
    - Rows total: 16.38K (16384)
How to tune it?
• Increase RUNTIME_FILTER_WAIT_TIME_MS to 5000ms to let Scan Node 03 wait longer for the filter.
HDFS_SCAN_NODE (id=3)
  Filter 1 (1.00 MB)
    - InactiveTotalTime: 0
    - Rows processed: 73049
    - Rows rejected: 73018
    - Rows total: 73049
    - TotalTime: 0
• If the cluster is relatively busy, consider increasing the wait time too so that complicated queries do not miss opportunities for optimization.
Which file formats are supported?
• Partition filter
  • All file formats, as long as it’s a partitioned table and there are mappings for partition columns. E.g.
  • Select count(*) from t1, t2 where t1.partition_col = t2.col1 and t2.col2 = "1234"
• Row filter
  • Only applied for parquet format
• Parquet can take most advantage of runtime filters.
Runtime Filter Gotcha
• Runtime filters require extra memory on each host and extra work on the coordinator. Example: on a 100-node cluster, one 16MB filter
  • Memory cost: 16MB * 100 = 1.6GB for the query
  • Network cost (only for GLOBAL mode): 16MB * 100 * 2 = 3.2GB network traffic on the coordinator, and CPU time to merge and publish filters.
  • More filters, higher cost.
  • Negative impact: the coordinator becomes the bottleneck
• Resolution: reduce the # of filters by setting MAX_NUM_RUNTIME_FILTERS to a lower number, or disable row filters by setting DISABLE_ROW_RUNTIME_FILTERING=1.
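The cost arithmetic in the example generalizes to any filter size, node count, and filter count. A minimal Python sketch follows; the function name is illustrative, and multiplying by the number of filters is an assumption based on the "more filters, higher cost" bullet.

```python
def runtime_filter_cost_mb(filter_mb, num_nodes, num_filters=1, global_mode=True):
    """Return (memory_mb, coordinator_network_mb) following the slide's arithmetic."""
    memory_mb = filter_mb * num_nodes * num_filters
    # In GLOBAL mode the coordinator receives every filter and publishes it back out.
    network_mb = memory_mb * 2 if global_mode else 0
    return memory_mb, network_mb


# The slide's example: one 16MB filter on a 100-node cluster.
print(runtime_filter_cost_mb(16, 100))
```

With ten such filters the coordinator would handle ten times the traffic, which is the situation MAX_NUM_RUNTIME_FILTERS is meant to cap.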
Interaction with Sentry, Hive, and Parquet
• Setup: Cloudera Manager 5.x does a good job; verify the dependency parents, such as Hive Metastore, HDFS.
• Stability in HMS might affect Impala; check HMS health.
• File-based and Apache Sentry security
  • Even with Sentry, Impala needs to read/write all dir/files. No impersonation.
Summary
• Approach cluster sizing systematically - SLA and workload
• Benchmark running technique and measurement - QPS
• Use Admission Control for multi-tenancy
• Tune your queries - identify and attack bottlenecks
• Cloudera Manager 5.0+ provides a tool for verifying whether many best practices have been followed
Other Resources
• Impala User Guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala.html
• Impala website (roadmap, repository, books, more) & Twitter: https://impala.apache.org, @ApacheImpala
• Community resources:
  • Mailing list: [email protected]
  • Forum: http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/bd-p/Impala
Thank you
https://impala.apache.org
@ApacheImpala