Page 1

Copyright © 2014 Splunk Inc.

Kate Lawrence-Gupta, Principal Engineer, Comcast
Joe Cramasta, Sr. Engineer, Comcast

Taming the Beast: Managing Splunk for X1 @ Comcast

Copyright © 2014 Comcast

Page 2

Disclaimer

During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Page 3

Lineup

• Who are Kate and Joe?
• Outline how Splunk operations have grown over the past 2 years to support the X1 platform's national customer ramp-up
• Touch on some of the operational best practices we’ve used to accommodate a large installation in terms of both volume and high search load
• Deep dive into 2 operational problems and the technical solutions we designed and deployed

Page 4

Kate

• Principal Engineer responsible for multiple Splunk installations at Comcast
• Manages a dedicated team providing Splunk as an operational service to hundreds of internal customers, including developers, execs and other operational support teams
• 13 years of experience in operations, monitoring, and systems administration
• Splunk user since ~2006, initially for Nagios reporting
  – Revolution Award winner at Splunk .conf2013!

Page 5

Joe

• Senior Engineer @ Comcast Cable
• Has been with Comcast for 7 years working on multiple projects
• Kate's my boss :)
• Started using Splunk in 2009 to get visibility into a Microsoft Hosted Exchange environment

Page 6

Comcast Overview

• Global media and technology company consisting of Comcast Cable and NBCUniversal
• Comcast Cable: Nation’s largest video, high-speed internet and phone provider under the XFINITY brand
• NBCUniversal: One of the world’s leading media and entertainment companies
• Company facts:
  – $64 billion revenue (2013 financial report)
  – 125,000+ employees

Page 7

Then…

• Splunk was initially deployed for statistical analysis of application logs, mainly as a development solution
• Volume was low and performance slow
• Supported roughly 50 internal users
• 1 staffer dedicated about 75% of the time
• Ran Splunk on virtualized indexers and search heads with NFS-backed storage

Page 8

Now…

• Splunk is deployed globally across ALL services and is considered “critical path” for both monitoring and development
• Indexed volume has jumped by a factor of 12
• Supporting 250+ users and dozens of automations
• 5 staffers dedicated 100% of the time
• Splunk runs on dedicated hardware and storage across multiple datacenters
• 99.99% uptime and less than 5 seconds of indexing latency

Page 9

Best Practices

• Use source control to enable tracking, roll-back plans and change management for all of our deployments
• Standardize rules around using Splunk by setting limits and quotas
• Normalize alerts for easy organization and tracking
• Use a central search head for “management” of your Splunk environment
  – Peer this search head to your other search heads and your indexers
  – Track and report on your license usage and alert on “hosts gone wild”
  – Measure your installation: set Key Performance Indicators (KPIs) and track your growth and capacity

Page 10

Big Wins

Organized forwarders across production
Increased Uptime to 99.99%
Support from Management
Got Creative!

Made Splunk a consistently reliable method to get data quickly and efficiently

“The only way to track all logs is with Splunk”

Encouraged internal departments to add data into Splunk through:
• Training
• Brown-bag sessions
• Evangelism
• Begging

We’ve encountered some unique problems due to the sheer size and complexity of our installation, which forced us to come up with our own solutions

Page 11

Problem of Scale:

Joe and I have chosen 2 to deep dive into…

Page 12

Problem #1 – Collecting New Logs

• No timestamp/line-breaking consistency in the logs that are created
• Almost every time we need to collect a new log file, an indexer configuration change is needed to accommodate the timestamp/line-breaking settings
• This results in us having to restart the Splunk service, which impacts search and alerting
• This downtime requires change control approval and can only be performed once a week, on Saturdays

Page 13

Normally…

inputs.conf on forwarder

    [monitor:///opt/eddie/log.txt]
    sourcetype = eddie

    [monitor:///opt/clark/log.txt]
    sourcetype = clark

    [monitor:///opt/rusty/log.txt]
    sourcetype = rusty

props.conf on indexer

    [source::/opt/eddie/log.txt]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false

    [source::/opt/clark/log.txt]
    TIME_PREFIX = ^\[
    TIME_FORMAT = %F %H:%M:%S,%3N
    LINE_BREAKER = ([\r\n]+)\[
    SHOULD_LINEMERGE = false

    [source::/opt/rusty/log.txt]
    SHOULD_LINEMERGE = false
    TIME_FORMAT = %a %b %d %H:%M:%S
    TIME_PREFIX = ^

Page 14

Wouldn’t it be nice….

What we noticed is that many of the new props.conf stanzas we create are for log formats that already have a stanza, just one that matches some other source file

What if we could leverage a previously existing props.conf stanza to handle the timestamp/line-breaking recognition?

We still need to be able to define unique sourcetypes for all of our logs, and we can’t modify the source (where the logs came from)

Page 15

The Solution: Imposter

First we needed to look at all of our current props.conf stanzas and de-dupe them, leaving us with a list of stanzas that each handle parsing a unique log format.

With this list we then changed all of the stanza names to match on a wildcarded sourcetype name.

Before:

    [source::/opt/eddie/log.txt]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false

    [source::/opt/clark/log.txt]
    TIME_PREFIX = ^\[
    TIME_FORMAT = %F %H:%M:%S,%3N
    LINE_BREAKER = ([\r\n]+)\[
    SHOULD_LINEMERGE = false

    [source::/opt/rusty/log.txt]
    SHOULD_LINEMERGE = false
    TIME_FORMAT = %a %b %d %H:%M:%S
    TIME_PREFIX = ^

After:

    [(?::){0}*-STANZA_1]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_2]
    TIME_PREFIX = ^\[
    TIME_FORMAT = %F %H:%M:%S,%3N
    LINE_BREAKER = ([\r\n]+)\[
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_3]
    SHOULD_LINEMERGE = false
    TIME_FORMAT = %a %b %d %H:%M:%S
    TIME_PREFIX = ^

Page 16

The Solution: Imposter

Now all we needed to do was append to each sourcetype name the stanza name that matches its log format.

props.conf

    [(?::){0}*-STANZA_1]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_2]
    TIME_PREFIX = ^\[
    TIME_FORMAT = %F %H:%M:%S,%3N
    LINE_BREAKER = ([\r\n]+)\[
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_3]
    SHOULD_LINEMERGE = false
    TIME_FORMAT = %a %b %d %H:%M:%S
    TIME_PREFIX = ^

inputs.conf

    [monitor:///opt/eddie/log.txt]
    sourcetype = eddie-STANZA_3

    [monitor:///opt/clark/log.txt]
    sourcetype = clark-STANZA_1

    [monitor:///opt/rusty/log.txt]
    sourcetype = rusty-STANZA_2
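The suffix convention can be illustrated outside Splunk. What follows is a minimal Python sketch, not Splunk's actual matching code: it assumes the `(?::){0}` prefix simply lets the rest of the stanza name behave as a wildcard pattern, and it mimics that behaviour with `fnmatch`. The `PARSING_STANZAS` table, the `pick_stanza` helper and the `newapp` sourcetype are hypothetical names introduced for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical table mirroring the three shared stanzas on the slide.
# Keys are the wildcard part of each stanza name (after "(?::){0}").
PARSING_STANZAS = {
    "*-STANZA_1": {
        "DATETIME_CONFIG": "CURRENT",
        "SHOULD_LINEMERGE": "false",
    },
    "*-STANZA_2": {
        "TIME_PREFIX": r"^\[",
        "TIME_FORMAT": "%F %H:%M:%S,%3N",
        "LINE_BREAKER": r"([\r\n]+)\[",
        "SHOULD_LINEMERGE": "false",
    },
    "*-STANZA_3": {
        "SHOULD_LINEMERGE": "false",
        "TIME_FORMAT": "%a %b %d %H:%M:%S",
        "TIME_PREFIX": "^",
    },
}


def pick_stanza(sourcetype):
    """Return (pattern, settings) for the first stanza whose wildcard
    pattern matches the sourcetype, or None if nothing matches."""
    for pattern, settings in PARSING_STANZAS.items():
        if fnmatchcase(sourcetype, pattern):
            return pattern, settings
    return None
```

Under these assumptions, `pick_stanza("newapp-STANZA_2")` selects the second stanza's settings, while a sourcetype with no suffix, such as `"newapp"`, matches nothing. Onboarding a log in a known format then only requires choosing the right suffix on the forwarder.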


Page 17

The Solution: Imposter

Tomorrow, if Eddie, Clark, and Rusty all decide they are going to start using some other log format, all we need to do is change the stanza name appended to the sourcetype on the forwarder

props.conf

    [(?::){0}*-STANZA_1]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_2]
    TIME_PREFIX = ^\[
    TIME_FORMAT = %F %H:%M:%S,%3N
    LINE_BREAKER = ([\r\n]+)\[
    SHOULD_LINEMERGE = false

    [(?::){0}*-STANZA_3]
    SHOULD_LINEMERGE = false
    TIME_FORMAT = %a %b %d %H:%M:%S
    TIME_PREFIX = ^

inputs.conf

    [monitor:///opt/eddie/log.txt]
    sourcetype = eddie-STANZA_3

    [monitor:///opt/clark/log.txt]
    sourcetype = clark-STANZA_1

    [monitor:///opt/rusty/log.txt]
    sourcetype = rusty-STANZA_2

Page 18

But Now My Sourcetypes Are Named Funny :(

With a little transform magic we are able to strip off the assigned stanza name that we added to the sourcetype field, leaving us with the original sourcetype name as if no funny business ever happened.

props.conf

    [(?::){0}*-STANZA_1]
    DATETIME_CONFIG = CURRENT
    SHOULD_LINEMERGE = false
    TRANSFORMS-changeSourceType = stripitSourcetype

transforms.conf

    [stripitSourcetype]
    SOURCE_KEY = MetaData:Sourcetype
    REGEX = (.*)-STANZA_\w+
    FORMAT = $1
    DEST_KEY = MetaData:Sourcetype
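The effect of the `REGEX` and `FORMAT` pair can be checked with a few lines of Python. This is only a sketch of the string rewrite, not of Splunk's transform pipeline, and `strip_stanza_suffix` is a hypothetical helper name. If the `MetaData:Sourcetype` value carries a `sourcetype::` prefix, as the transforms.conf documentation describes for that key, the greedy `(.*)` captures the prefix together with the name, so `$1` writes it back intact.

```python
import re

# Same pattern as the REGEX line on the slide. Splunk applies it as an
# unanchored search, which re.search mirrors here.
STANZA_SUFFIX = re.compile(r"(.*)-STANZA_\w+")


def strip_stanza_suffix(value):
    """Return the text captured before "-STANZA_<name>" (FORMAT = $1),
    or the value unchanged when the pattern does not match."""
    match = STANZA_SUFFIX.search(value)
    return match.group(1) if match else value
```

With this helper, `strip_stanza_suffix("eddie-STANZA_3")` gives `"eddie"`, and a value without the suffix passes through untouched, which is how an event whose sourcetype was never tagged would behave.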


Page 19

Problem #2 – Distributed Deployments

Problem:

• One of our Splunk implementations required deploying Splunk into data centers where our Operations group wouldn’t manage the forwarders, network infrastructure or Access Control Lists
• Traffic would be restricted to within the datacenter in many cases
• This left us with the scenario where forwarders and indexers would only be able to communicate with a local deployment server

Page 20

Features Needed

Key features needed:

• A way to keep local deployment servers in sync with a master copy
• Could be integrated easily with the previous source-control process (git)
• Regions could be managed together and separately based on different change windows
• Would minimize human error and manual copying of files
• Would have reporting and logging
• Have a cool code-name… Jenga!

Page 21

Overview

• First, we store all of our config files in a Git repo
• Then these configs sync down to the Deployment Servers every 10 minutes
• Through a web UI we can:
  – View all the serverclasses
  – Check that a region/deployment server has been updated to match the Git production copy
  – Reload 1 or multiple classes at a time

Page 22

Jenga… Let’s Go a Little Deeper

• A Git repo stores the files approved for production in a master “golden” branch
• Jenga Agent
  – Pulls down the production config files from the master branch with a Git fetch
  – Rsyncs them to the default deployment directory
  – Updates and reloads the serverclass.conf
  – Updates a text file on the Deployment Server with the latest hash from the production Git repo
  – Publishes that hash through a custom REST API endpoint that can be read by the Jenga UI
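The agent's steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the Jenga source: the remote name `origin`, the `git reset --hard` used to move the working tree to the fetched branch, the `rsync` options, and every path and function name are assumptions. `splunk reload deploy-server` is the standard Splunk CLI reload command, but the slides do not say how Jenga triggers the reload or how it edits serverclass.conf, and the REST publishing step is left out.

```python
import subprocess


def build_sync_plan(repo_dir, deploy_dir, branch="master"):
    """Return the commands for one sync pass, in order.

    Assumed, not taken from the slides: the remote is "origin", the
    working tree follows the fetched branch via "git reset --hard",
    and the "splunk" binary is on PATH.
    """
    return [
        ["git", "-C", repo_dir, "fetch", "origin", branch],
        ["git", "-C", repo_dir, "reset", "--hard", f"origin/{branch}"],
        # Trailing slashes: copy the directory's contents, not the directory.
        ["rsync", "-a", "--exclude=.git", f"{repo_dir}/", f"{deploy_dir}/"],
        ["splunk", "reload", "deploy-server"],
    ]


def current_hash(repo_dir):
    """Return the commit hash currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def run_sync(repo_dir, deploy_dir, hash_file):
    """Run one sync pass, then record the deployed hash in hash_file."""
    for command in build_sync_plan(repo_dir, deploy_dir):
        subprocess.run(command, check=True)
    deployed = current_hash(repo_dir)
    with open(hash_file, "w") as handle:
        handle.write(deployed + "\n")
    return deployed
```

Keeping the command list separate from its execution makes it possible to review or log a pass before it touches the deployment directory. A scheduler such as cron could call `run_sync` at the 10-minute interval the deck mentions.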

 


Page 23

Jenga UI

• The UI lists all the classes available per region by looping through the REST API on the Deployment Server and publishing them as options to the user
• Calls the Deployment Server custom REST API endpoint for the latest known Local hash
• Calls the production Git repo for the Current production hash
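The comparison the UI performs between the Local and Current hashes amounts to a string equality check per region. The sketch below assumes the hashes have already been fetched; how the UI calls the REST endpoint or the Git repo is not shown in the deck, and the function and region names are hypothetical.

```python
def regions_ready_for_reload(local_hashes, production_hash):
    """Split regions into those whose deployment server already holds
    the production revision and those still waiting for a sync.

    local_hashes maps a region name to the hash its deployment server
    last published; production_hash is the head of the production branch.
    """
    ready, stale = [], []
    for region, local_hash in sorted(local_hashes.items()):
        if local_hash.strip() == production_hash.strip():
            ready.append(region)
        else:
            stale.append(region)
    return ready, stale


# Made-up region names and hashes, purely for illustration.
ready, stale = regions_ready_for_reload(
    {"east": "a1b2c3", "west": "9f8e7d"}, "a1b2c3"
)
```

Here `ready` is `["east"]` and `stale` is `["west"]`: only the east region would be offered for reload until the west deployment server's next sync catches up.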


Page 24

Jenga workflow

• Joe
  – Modifies an inputs.conf that is already deployed
  – Checks the changes into a new Git branch
  – Pushes that branch to our Git repo
• I review the changes and, if they are OK, merge the branch into the master branch
• A new hash is created for this revision and is published to the Jenga UI as the CURRENT version of production
• Joe, with an approved window, goes to the Jenga UI
  – Chooses the region(s) he wants
  – Reloads the class(es) once he sees that the Local and Current hashes are in sync

Page 25

Q&A

Contact info

Kate_Lawrence-Gupta (splunk.com)
Email – Kate_Lawrence-[email protected]

Cramasta (splunk.com)
Email – [email protected]

Page 26

Special Offer: Try Splunk MINT Express for Free!

Splunk MINT offers a fast path to mobile intelligence. How fast? Find out with a 6-month trial*

• Register for your free trial: http://mint.splunk.com/conf2014offer
• Download the Splunk MINT SDKs
• Add the Splunk MINT line of SDK code and publish**
• Start getting digital intelligence at your fingertips!

*Offer valid for .conf2014 attendees and coworkers of attendees only.
**Trial allows monitoring of up to 750,000 monthly active users (MAUs).

Page 27

THANK YOU

