SplunkLive Melbourne Scaling and best practice for Splunk on premise and in the cloud

Date post: 12-Apr-2017
Upload: gabrielle-knowles
Splunk scaling & best practice Nico van der Walt Client Architect, Splunk Copyright © 2016 Splunk Inc.
Transcript
Page 1: SplunkLive Melbourne Scaling and best practice for Splunk on premise and in the cloud

Splunk scaling & best practice

Nico van der Walt, Client Architect, Splunk

Copyright © 2016 Splunk Inc.

Page 2

AGENDA

• Introduction
• 3 Tier Approach
• Forwarding Architecture
• Indexing Architecture
• Search Architecture
• Sizing Recap
• Sizing Examples
• Monitoring
• Q&A

Page 3

3 Tier Approach

Page 4

Sizing Considerations

Vital Info
• Amount of incoming data
• Amount of indexed (stored) data
• Number of concurrent users
• Number of saved searches
• Types of searches
• Specific Splunk apps

http://docs.splunk.com/Documentation/Splunk/latest/Installation/Performancechecklist

Page 5

Splunk 3 Tier Architecture

Enterprise-class Scale, Resilience and Interoperability

Send data from thousands of servers using any combination of Splunk forwarders

Auto load-balanced forwarding to Splunk Indexers

Offload search load to Splunk Search Heads

Page 6

Reference Hardware

All instances x64, CPU > 2 GHz per core

Role          Core Splunk*                       Enterprise Security (ES)†
Indexer       12 CPU cores                       16 CPU cores
              12 GB of RAM                       32 GB of RAM
              800 IOPS/indexer                   800 IOPS/indexer
              RAID 1+0                           RAID 1+0
              Data ingest: 150-250 GB/day        Data ingest: 100 GB/day
Search Head   16 CPU cores                       16 CPU cores
              12 GB of RAM                       32 GB of RAM
              2 x 300 GB 10k rpm SAS in RAID 1   2 x 300 GB 10k rpm SAS in RAID 1

* http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware
† http://docs.splunk.com/Documentation/ES/latest/Install/DeploymentPlanning

Page 7

Required Reading

Distributed Deployment Manual
• http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Distributedoverview

Highlights
• Reference hardware specs
• How searches affect performance
  • Dense/Rare/Sparse
• App considerations
• Summary table

Page 8

Forwarding Architecture

Page 9

Forwarding Tier

Design Factors
Syslog Collectors (HA)
DB Connect Inputs
• E.g. McAfee ePO data
TA Inputs
• E.g. Check Point
Assorted Inputs
• Microsoft AD logs
• Microsoft Exchange Server
• Microsoft SharePoint logs
• Log4j, Linux, IIS
• …

Page 10

Syslog Collectors

• Best practice to use dedicated syslog servers
• Syslog-NG/rsyslog recommended
• Syslog can write events to dedicated log files, allowing for easy sourcetype classification on inputs
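One way to picture the "dedicated log files" point: if the syslog server writes each technology into its own directory, a forwarder can assign the sourcetype per directory. The sketch below only prints what such an inputs.conf could look like. The directory layout, the directory names and the sourcetype names are assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical layout written by the syslog server (an assumption):
#   /var/log/remote/<technology>/<sending-host>/<file>.log
# Directory and sourcetype names below are illustrative only.
SYSLOG_DIRS = {
    "cisco_asa": "cisco:asa",
    "linux": "syslog",
}


def build_inputs_conf(base_dir="/var/log/remote"):
    """Return inputs.conf text with one monitor stanza per syslog directory."""
    stanzas = []
    for directory, sourcetype in SYSLOG_DIRS.items():
        stanzas.append(
            f"[monitor://{base_dir}/{directory}/*/*.log]\n"
            f"sourcetype = {sourcetype}\n"
            # With the layout above, path segment 5 is <sending-host>.
            "host_segment = 5\n"
        )
    return "\n".join(stanzas)


print(build_inputs_conf())
```

Because the sourcetype is decided by the directory, no parsing rules are needed on the forwarder to tell the data sources apart.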

Page 11

Syslog Collectors

Using a Load Balancer/VIP with Linux Heartbeat to provide failover for the syslog listener
Syslog-NG PE client-side failover

[Diagram labels: Router (Physical); Load Balancer (x2); Syslog 514/tcp & 514/udp; Syslog-NG Server (x2)]

Page 12

Forwarder for TAs

TA-McAfee requires DB Connect to pull endpoint events
TA-Checkpoint uses the LEA Client to retrieve Firewall log events
Not an HA design, but could be hosted on a VM to standby or failover

[Diagram labels: Heavy Forwarder, Linux; TA-McAfee (DB Connect); TA-Checkpoint; ePO Database; Checkpoint Server; Firewall]

Page 13

Deployment Server

• Deployment Server to manage Linux and Windows forwarders
• Not an HA design, but could be hosted on a VM to standby or failover

[Diagram labels: Deployment Server; Splunk Forwarders to get apps from splkds.internal.door2door.com:8089]

Page 14

Forwarding Tier

[Diagram labels: Forwarders, Linux; Forwarders, Windows; Windows SharePoint Server; Windows AD Server; Deployment Server; Heavy Forwarder, Linux; TA-McAfee (DB Connect); TA-Checkpoint; ePO Database; Checkpoint Server; Firewall; Router (Physical); Load Balancer (x2); Syslog 514/tcp & 514/udp; Syslog-NG Server (x2); Indexers]

Splunk AutoLB to splkidx.internal.door2door.com:9997
Splunk Forwarders to get apps from splkds.internal.door2door.com:8089

Page 15

Forwarding Tier Design Best Practices

Use a Syslog Server for Syslog data
Be careful with Intermediate forwarders
• They can introduce bottlenecks
• Reduce the distribution of events across Indexers
May need to increase UF thruput setting for high velocity sources
• [thruput]
• maxKBps
AutoLB will spread over all available indexers, but don't assume evenly!
• Enable forceTimebasedAutoLB for UF
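The settings named on this slide live in two forwarder files: outputs.conf (forceTimebasedAutoLB, autoLBFrequency) and limits.conf ([thruput] maxKBps). The sketch below renders both as text so the stanza layout is visible. The output group name and the numeric values are assumptions chosen for illustration; the indexer address is the one shown in the Forwarding Tier diagram.

```python
import configparser
import io


def render_conf(stanzas):
    """Render {stanza: {setting: value}} as Splunk-style .conf text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep setting names case-sensitive
    for stanza, settings in stanzas.items():
        parser[stanza] = settings
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# "primary_indexers" is a hypothetical group name; 10 is an example value.
outputs_conf = render_conf({
    "tcpout": {"defaultGroup": "primary_indexers"},
    "tcpout:primary_indexers": {
        "server": "splkidx.internal.door2door.com:9997",
        "forceTimebasedAutoLB": "true",
        "autoLBFrequency": "10",
    },
})

# 1024 is an illustrative value only; the slide does not give a number.
limits_conf = render_conf({"thruput": {"maxKBps": "1024"}})

print(outputs_conf)
print(limits_conf)
```

Setting `optionxform = str` matters here: by default Python's configparser lowercases option names, which would mangle names such as forceTimebasedAutoLB.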

Page 16

Data Distribution Imbalance

Even data distribution is crucial in parallel computing
Ways to improve data distribution:
• Enable parallel pipelines on heavy forwarders (in server.conf)
• Route directly from Universal forwarders where possible
• Make the following changes to forwarders' outputs.conf:
  • forceTimebasedAutoLB = true
  • autoLBFrequency = x

Examine saved search time windows. The example below has many searches over a 5 minute window, and some searches over a 1 minute window. autoLBFrequency times the number of indexers should divide evenly into 5 minutes, or into 1 minute if possible.

| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server

6 Indexers; autoLBFrequency=30: Uneven distribution of workload over 5 minute periods. Unpredictable workload variation.

6 Indexers; autoLBFrequency=15: Better distribution over 5 minutes. autoLBFrequency=10 would be even better as there are 6 indexers.
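The arithmetic behind the autoLBFrequency advice can be checked with a few lines of Python. This is a simplified model, not a description of how AutoLB actually picks targets: it assumes a forwarder switches indexer every autoLBFrequency seconds and visits each indexer once per cycle.

```python
def rotation_seconds(indexers, auto_lb_frequency):
    """Seconds for one forwarder to visit every indexer once (simplified model)."""
    return indexers * auto_lb_frequency


def fits_window(indexers, auto_lb_frequency, window_seconds):
    """True when a whole number of rotations fits into the search window."""
    return window_seconds % rotation_seconds(indexers, auto_lb_frequency) == 0


# The three settings discussed on the slide, for 6 indexers.
for freq in (30, 15, 10):
    print(
        freq,
        rotation_seconds(6, freq),
        fits_window(6, freq, 300),  # 5 minute searches
        fits_window(6, freq, 60),   # 1 minute searches
    )
```

Under this model only autoLBFrequency=10 gives a 60 second rotation that fits both the 5 minute and the 1 minute windows, which matches the slide's remark that 10 would be better still.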

Page 17

Data Imbalance - Troubleshoot

Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day)

Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to next indexer
  • Check Distributed Indexing Performance in the DMC for queue filling - typical sign of disk performance issues
• Indexer affinity - the forwarder gets stuck to one indexer because EOF is never met
  • forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/

Page 18

How Many Deployment Servers?

Rule of thumb says: 1 per 10k clients @ 10 – 15 min polling period
Adjust polling period to increase total clients supported
Small deployments can share the same instance as other management instances (License Master, Cluster Master, etc.)
Low requirement for disk performance (good candidate for virtualization)
Or use something other than deployment server
• puppet, SCCM, cfengine, chef…
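As a rough illustration of the rule of thumb, the sketch below estimates how many deployment servers a given client count needs. It assumes capacity scales linearly with the polling period, which is a simplification; the slide only says that a longer polling period supports more clients. The client counts are made-up examples.

```python
import math

CLIENTS_PER_SERVER = 10_000   # rule of thumb from the slide
REFERENCE_POLL_MINUTES = 10   # lower end of the 10-15 minute range


def deployment_servers_needed(clients, poll_minutes=REFERENCE_POLL_MINUTES):
    """Estimate deployment servers, assuming linear scaling with polling period."""
    capacity = CLIENTS_PER_SERVER * poll_minutes / REFERENCE_POLL_MINUTES
    return max(1, math.ceil(clients / capacity))


print(deployment_servers_needed(25_000))                   # 10 minute polling
print(deployment_servers_needed(25_000, poll_minutes=30))  # 30 minute polling
```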

Page 19

Indexing Architecture

Page 20

Indexing Tier

Design Factors
Peak ingest volume
High Availability – Indexer Replication
10% Disk Space Contingency
Data retention

Cluster Sizing Calculator
http://splunk-sizing.appspot.com

Page 21

How Many Indexers?

Rule of thumb says: 1 indexer per 150 - 250 GB/day
80 – 100 GB with Enterprise Security

Leave room for:
• Daily peaks

Need more indexers for:
• Heavy reporting
• More users
• Slower disks, slower CPUs, fewer CPUs
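The rule of thumb translates into a simple division. The sketch below applies it to a hypothetical 1 TB/day deployment; the `headroom` parameter is an assumed way to model "leave room for daily peaks" and is not something the slide quantifies.

```python
import math


def indexers_needed(daily_gb, gb_per_indexer_per_day, headroom=0.0):
    """Indexers for a daily volume; headroom is a fractional allowance for peaks."""
    return max(1, math.ceil(daily_gb * (1 + headroom) / gb_per_indexer_per_day))


daily_gb = 1000  # hypothetical 1 TB/day

print(indexers_needed(daily_gb, 250))  # core Splunk, optimistic end of the range
print(indexers_needed(daily_gb, 150))  # core Splunk, conservative end
print(indexers_needed(daily_gb, 100))  # Enterprise Security, upper end
print(indexers_needed(daily_gb, 80))   # Enterprise Security, conservative end
print(indexers_needed(daily_gb, 250, headroom=0.25))  # with 25% peak allowance
```

The spread between the optimistic and conservative ends is large, which is why the slide lists the factors (reporting load, users, hardware) that push a deployment towards more indexers.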

Page 22

Storage Calculations

RAID Configuration
• Amount of raw disk
• Fault tolerance
• Available IOPS

Filesystem Overhead
• inodes consume space

Wiggle room
• Additional replicated buckets when a node fails
• Unbalanced replicated buckets

Splunk internal logs, Summary Indexes, Report Acceleration, Accelerated Data Models
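To make the "amount of raw disk" point concrete, the sketch below computes usable capacity for common RAID levels and then subtracts the 10% disk space contingency listed under the Indexing Tier design factors. The disk counts and sizes are made-up examples, and filesystem overhead is not modelled.

```python
def usable_gb(disks, disk_gb, raid_level):
    """Usable capacity before filesystem overhead for common RAID levels."""
    if raid_level == "1+0":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
        return disks // 2 * disk_gb   # half the disks hold mirror copies
    if raid_level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb  # one disk's worth of parity
    if raid_level == "6":
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return (disks - 2) * disk_gb  # two disks' worth of parity
    raise ValueError(f"unsupported RAID level: {raid_level}")


def after_contingency(capacity_gb, contingency=0.10):
    """Capacity left for data once a free-space contingency is reserved."""
    return capacity_gb * (1 - contingency)


raw = usable_gb(8, 600, "1+0")       # hypothetical 8 x 600 GB disks
print(raw, after_contingency(raw))
```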

Page 23

Storage Types

Local vs Direct Attached vs SAN vs NAS
SSD/Flash vs Spinning Disk
• SSDs offer much higher IOPS with far lower latency
• Significant performance increases with Sparse Searches

Page 24

Index Replication (aka Index Clustering)

What is it?
• Data is replicated to 1 or more indexers based on indexes
• Splunk Cluster Master controlled

Basics
• Master Node (manages indexing and searching location)
• Horizontal Scaling

HA vs DR
• HA - Data is made available on 1 or more indexers in one location
• DR – Multisite clustering. All data exists in multiple locations

Page 25

Benefits of Clustering

• Data redundancy
• Data availability
• Indexer resiliency
• Simpler management of indexers
• Simpler setup of distributed search
• Multi-site clustering allows site-specific search to reduce WAN traffic

Page 26

Index Clustering Sizing

Replication factor
• Determines the number of rebuildable copies of data to maintain

Search factor
• Determines the number of searchable copies of the data

Data Retention equation based on syslog data
• Total disk usage across cluster in GB = (RepFactor * 0.15 + SearchFactor * 0.35) * DatasetSizeGB

Increase in I/O, CPU, and disk requirement
• Means daily indexing volume per server will be lower

Search factor increases disk usage by ~30% (raw data + tsidx)
Replication factor increases disk usage by ~10% (only raw data)
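The data retention equation can be evaluated directly. The sketch below plugs in a hypothetical dataset (100 GB/day kept for 90 days, replication factor 3, search factor 2); those inputs are assumptions, while the 0.15 and 0.35 coefficients come from the slide, which bases them on syslog data.

```python
def cluster_disk_gb(dataset_gb, rep_factor, search_factor):
    """Total disk usage across the cluster, using the slide's equation."""
    return (rep_factor * 0.15 + search_factor * 0.35) * dataset_gb


# Hypothetical inputs: 100 GB/day retained for 90 days, RF=3, SF=2.
dataset_gb = 100 * 90
total = cluster_disk_gb(dataset_gb, rep_factor=3, search_factor=2)
print(round(total))
```

With these inputs the multiplier is 3 * 0.15 + 2 * 0.35 = 1.15, so the cluster stores slightly more than the original dataset size even though each individual copy is compressed.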

Page 27

Cluster Master Server

• Indexer Apps are deployed via CM
• Not an HA design, but could be hosted on a VM to standby or failover

Page 28

Indexing Tier

[Diagram label: Master Cluster Node]

Page 29

Search Architecture

Page 30

Search Tier

Design Factors
• High Availability
• Search Head Clustering
• # users
• # concurrent searches
• Forward all data to indexers

Page 31

Search Head Clustering

What is it?
• Group search heads into a cluster as a single entity
• Provides HA at the Search Head layer
• Search Head Captain controlled
• RAFT protocol to pick captain

Basics
• A captain gets elected dynamically (pre 6.3) or can be defined manually (6.3)
• Knowledge objects and search artifacts are replicated
• Search workload distribution
• Replication using local storage, NOT over NFS

Page 32

SHC & Deployer

• Search Head Cluster Apps need to be installed by the Deployer
• A minimum of 3 Search Heads is required for a SHC
• No Exchange, no dbx with SHC
• ES will still require a separate Search Head or dedicated SHC
• Use LDAP/AD/SSO for user Authentication
• Load Balancer configured for sticky sessions

Page 33

Search Tier

[Diagram labels: Search Head (x3); Load Balancer; Deployer; License Server]

Page 34

How Many Search Heads?

Rule of thumb says: 1 per 20 – 40 concurrent queries
Limit is concurrent queries
Search Query normally uses up to 1 CPU core
• 6.3 Parallelization can leverage more
Don't add search heads; add indexers: indexers do most work
• Unless you need HA/Search Clustering
Scale vertically if infrastructure allows it. Add CPU, add memory.
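A small sketch of the search head rule of thumb follows. The `clustered` flag applies the minimum of 3 members stated on the SHC & Deployer slide; the query counts are made-up examples, and the model deliberately ignores the advice to add indexers first.

```python
import math

SHC_MINIMUM_MEMBERS = 3  # from the SHC & Deployer slide


def search_heads_needed(concurrent_queries, queries_per_head=20, clustered=False):
    """Search heads for a concurrent query load (rule of thumb: 20-40 per head)."""
    heads = max(1, math.ceil(concurrent_queries / queries_per_head))
    return max(heads, SHC_MINIMUM_MEMBERS) if clustered else heads


def cores_for_searches(concurrent_queries, cores_per_query=1):
    """CPU cores in use if each running search takes up to one core."""
    return concurrent_queries * cores_per_query


print(search_heads_needed(60))                       # conservative: 20 per head
print(search_heads_needed(60, queries_per_head=40))  # optimistic: 40 per head
print(search_heads_needed(30, queries_per_head=40, clustered=True))
print(cores_for_searches(30))
```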

Page 35

Sizing Examples

Page 36

Real World Examples
Cisco Unified Computing System (UCS)

• Search Head:
  • UCS C220 M4
  • 24 cores
• Indexer:
  • UCS C240 M4
  • 24 cores

Page 37

Cisco Validated Design (CVD) for UCS
267 page Reference Manual for deploying 1 TB/day on UCS
Validated and Benchmarked by Cisco and Splunk

Page 38

Distributed Deployment – Common Components

Search-Head: 3 x Cisco UCS C220-M4 Rack Servers, each with:
▫ CPU: 2 x E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 x 600 GB 15K SFF SAS drives (RAID 1)

Admin/Master Nodes: 2 x Cisco UCS C220-M4 Rack Servers, each with:
▫ 2 x E5-2620 v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 x 600 GB 15K SFF SAS drives (RAID 1)

Network Fabric: 2 x Cisco UCS 6248UP 48-Port Fabric Interconnects

Page 39

Distributed Deployment – Retention vs. Performance

Distributed Deployment with High Capacity
Indexer: 16 x C240-M4 rack servers, each with:
▫ CPU: 2 x E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 24 x 1.2 TB 10K SAS in RAID 10
▫ 2 x 120 GB SSD in RAID 1 for OS
Retention Capability: >1 TB/day w/ 1 year+ retention
Indexing Capacity: 4 TB/day
Indexing Capacity w/ Replication: 2 TB/day
Raw Index Capacity: 236 TB
Expected Data Capacity: at 2:1 compression: 472 TB
Key Use-Cases:
▫ Enterprises requiring larger data retention
Servers Count: 21 (37 RU)
Scalability:
▫ Additional Search-Head(s)
▫ 1 to 16 additional Indexers (refer to High Capacity Indexer configuration)

Distributed Deployment with High Performance
Indexer: 16 x C220-M4 rack servers, each with:
▫ CPU: 2 x E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 6 x 800 GB SSD-EP in RAID 5
▫ 2 x 600 GB 10K SFF SAS HDD w/ RAID 1 for OS
Retention Capability: >1.25 TB/day w/ 90 day retention
Indexing Capacity: 8 TB/day
Indexing Capacity w/ Replication: 4 TB/day
Raw Index Capacity: 64 TB
Expected Data Capacity: at 2:1 compression: 128 TB
Key Use-Cases:
▫ Ability to support large number of concurrent users that require faster response time
Servers Count: 21 (21 RU)
Scalability:
▫ Additional Search-Head(s)
▫ 1 to 16 additional Indexers (refer to High Performance Indexer configuration)
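The retention figures quoted for the two configurations can be sanity-checked by dividing the expected data capacity by the daily volume. This is a back-of-the-envelope sketch only: it takes the capacity and volume numbers from the slide at face value and does not model replication copies or other overheads.

```python
def retention_days(expected_data_tb, daily_tb):
    """Days of data that fit, given expected data capacity and daily volume."""
    return expected_data_tb / daily_tb


high_capacity = retention_days(472, 1.0)       # 472 TB at 1 TB/day
high_performance = retention_days(128, 1.25)   # 128 TB at 1.25 TB/day

print(high_capacity, high_capacity >= 365)
print(high_performance, high_performance >= 90)
```

Both results are in line with the slide's claims of "1 year+" and "90 day" retention respectively.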

Page 40

Cloud Deployments

Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security – SSL Tunnel
• Zones
• Hybrid deployments

VMware
http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf

AWS
https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf

Azure
http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf

Page 41

Real World Examples
Amazon Web Services EC2

• Search Head:
  • c4.4xlarge + EBS storage
  • c4.8xlarge + EBS storage
• Indexer:
  • c4.4xlarge + EBS storage
  • c4.8xlarge + EBS storage
  • d2.4xlarge (IR)

Page 42

Splunk Cloud Overview

Page 43

What We Built

Full Featured
Enterprise Ready
Easy

Page 44

FULL FEATURE SET OF SPLUNK ENTERPRISE

Page 45

ACCESS TO APPS

Page 46

Built for 100% Uptime

High availability across Indexers & Search Heads
Multiple AWS availability zones
Dedicated Cloud environments
- Secure
- 10x Bursting
Splunk Cloud fully monitored using Splunk Enterprise

Page 47

What You Do
Forward data
Search
Monitor
Get value fast

What We Do
Hardware setup
Storage
Scaling
Monitoring

Page 48

Hybrid Search

Single Pane of Glass Visibility

[Diagram labels: Search Head(s) and Indexer(s), shown twice; On Premises, Private Cloud, Public Cloud]

Page 49

Sizing Recap

Page 50

Top 5 Things To Consider

• Indexer Storage requirements
• Minimum buy-in for a SHC is 3
• Use VMs for CM/LS/DS/Deployer if possible
• Consider a dedicated SH for management
  • Distributed Management Console
  • Splunk Health Check Overview App
  • Search Activity App
• When in doubt – add another Indexer

Page 51

More Is Better?

CPUs
• 8, 12, 16, 24, 32, etc.
• Pipelines - New 6.3 feature for parallelization!
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
  • Historical batch – return the data without worrying about time order (… | stats count)
• Indexers still need to do the heavy lifting (search exists on indexer AND search head)

Memory
• Good for search heads and indexers (16+ GB)
  • Benefits from extra RAM used by OS for caching

Disks
• Faster is better - 10k – 15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets
• SSDs can also provide benefit for rare term searches and many concurrent jobs

Page 52

Putting It All Together

Page 53

Monitoring

Page 54

Monitoring Tools

So what's out there and what's the difference?

Distributed Management Console (DMC) – Built in and only available on v6.2+
• http://docs.splunk.com/Documentation/Splunk/latest/Admin/ConfiguretheMonitoringConsole
• Splunk supported and focuses on all facets of the deployment

Fire Brigade
• https://splunkbase.splunk.com/app/1632/
• Detailed look at index/bucket activity and capacity

SoS (Splunk on Splunk)
• https://splunkbase.splunk.com/app/748/
• Legacy Splunk troubleshooting tool

Splunk Health Overview
• https://splunkbase.splunk.com/app/1919/
• Combination of views found to be helpful in the field

Note: Deployment monitor app is deprecated – try to stay away from it
Many of these app functionalities are being rolled into the DMC

Page 55

How are things, overall?

High level environment status – quick view of what's up/down/not reporting:
• Forwarder health - finding forwarders that we haven't seen for a while
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly underprovisioned ones

Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component

Track resource usage:
• Say hello to _introspection (Splunk 6.1+)
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework

Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview

Page 56

Environment Overview

How to use the tools available to check overall health…

What are we reporting on?
• _internal
• _introspection
• metadata and using tstats http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Tstats
• REST endpoints
  • | rest /services/server/info
  • | rest /services/server/roles
  • | rest /services/server/status/resource-usage
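The same endpoints that the `| rest` search command queries can also be reached over HTTPS on an instance's management port. The sketch below only builds the URLs; it does not send requests, and real calls would need credentials. The host name is a placeholder, and 8089 is assumed as the management port because it is the port the forwarders use to reach the deployment server in the earlier diagram.

```python
from urllib.parse import urlencode, urlunsplit

ENDPOINTS = (
    "/services/server/info",
    "/services/server/roles",
    "/services/server/status/resource-usage",
)


def rest_url(host, endpoint, port=8089):
    """Build an HTTPS URL for a splunkd REST endpoint, asking for JSON output."""
    query = urlencode({"output_mode": "json"})
    return urlunsplit(("https", f"{host}:{port}", endpoint, query, ""))


# splunk.example.com is a placeholder host, not taken from the slides.
for endpoint in ENDPOINTS:
    print(rest_url("splunk.example.com", endpoint))
```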

Page 57

Q&A

