Post on 12-Apr-2017
lost art of troubleshooting
Leon Fayer
http://fayerplay.com
@papa_fire
{me}
20+ years breaking & fixing
dev, architect, [DevOps]
vp @ OmniTI
fix other people's
@papa_fire
why troubleshooting?
cloud ruined everything
it really did
1997: most reliable way to fix Windows problems
when in doubt - reboot
2017: DevOps mantra for managing cloud-based systems
destroy and rebuild
old McDonald had a farm
old McDonald lost a farm
due to mad cow disease
troubleshooting - a form of problem solving
problem solving - ability to fix things that you know nothing about
why is problem solving important?
because systems are complex
because of Murphy's law
because someone is always watching
{disclaimer}
wishful thinking
reality
where to begin?
replicate
isolate
fix?
what's the problem?
it's broken!
understanding
understand problem
we can't support 100s req/min
we need to scale better!
improve performance
performance problem
perceived problem
actual problem
understand business
I don't give a **** if the datacenter is on fire as long as I am still making money
what does it mean to you?
sales
content
ad revenue
every technical decision powers a business need
understand impact
is there a lesser of two evils?
sometimes breaking = fixing
80% now > 100% tomorrow
incremental improvements
anatomy of a problem
[figure labels: norm, problem, acceptable, norm; fix, fix, fix, fix]
what have we learned?
understanding of what's important, cause and effect, largest impact, acceptable risk
what not to do
don't assume
I didn't build it
it's not documented
it passed the tests
works in dev
everything looks right
don't feed your ego
solve the problem
ask for help
tools
logging, monitoring, profiling
logging: actionable, concise, parsable
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:02] Initializing args
[2017-02-01 18:57:02] Loading reservation_form_data
[2017-02-01 18:57:03] Reservation Form Data loaded successfully
[2017-02-01 18:57:03] Appending campaign info
[2017-02-01 18:57:03] Reservation name: [Some very very very long name]
[2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 18:57:03] Appending marketing info
[2017-02-01 18:57:03] Have a non-sku source_code
[2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
[2017-02-01 18:57:03] Appending match rule = Match Rule
[2017-02-01 18:57:03] Appending user info
[2017-02-01 18:57:03] Appending order info
[2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
[2017-02-01 18:57:03] Determining actual cost table
[2017-02-01 18:57:03] Appending comment notes
[2017-02-01 18:57:03] Appending abandoned flag
[2017-02-01 18:57:03] API GET data: {
    token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
    contact.address1_city = New York
    contact.address1_country = USA
    contact.address1_line1 = 123 test lane
    contact.address1_line2 =
    contact.address1_postalcode = 12345
    contact.address1_stateorprovince =
    contact.emailaddress1 = joe.smith@gmail.com
    contact.firstname = joe
    contact.lastname = smith
    contact.mobilephone = 1234567890
    ...
}
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
useful information
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:03] API GET data:
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
information I need
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
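A rough sketch of how the full log could be boiled down to the "information I need" view, assuming every entry follows the `[YYYY-MM-DD HH:MM:SS] message` layout seen in the sample. The sample lines are copied from the log on the slide; the names `LINE_RE`, `KEEP_PREFIXES`, `parse`, and `needed` are made up for this illustration and are not from the talk.

```python
import re
from datetime import datetime

# A few entries copied from the log on the slide.
LOG = """\
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
"""

# Assumed layout: bracketed timestamp, one space, free-form message.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<msg>.*)$")

# Hypothetical filter: which job was running, and what went wrong.
KEEP_PREFIXES = ("Processing UUID", "ERROR")


def parse(text):
    """Return (timestamp, message) pairs for every line that matches the layout."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            entries.append((ts, match["msg"]))
    return entries


def needed(entries):
    """Keep only the entries whose message starts with one of KEEP_PREFIXES."""
    return [(ts, msg) for ts, msg in entries if msg.startswith(KEEP_PREFIXES)]


important = needed(parse(LOG))
elapsed = (important[-1][0] - important[0][0]).total_seconds()

for ts, msg in important:
    print(ts, msg)
print("seconds between start of processing and the error:", elapsed)
```

Because each entry starts with a bracketed timestamp, a single regular expression is enough to parse it, which is roughly what "parsable" buys you. The two surviving lines are 421 seconds apart, in line with the 420-second post the log itself reports.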
monitoring: all inclusive, business-first, correlatable
what's the problem?
it's broken!
revenue
user performance
database load
decline rate
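The monitoring slides put revenue next to user performance, database load, and decline rate. One way to make such metrics "correlatable" is sketched here with a plain Pearson correlation. Every number in it is invented purely for illustration and none comes from the talk; reading "user performance" as page response time in milliseconds is also an assumption.

```python
from math import sqrt

# Hypothetical per-minute samples, invented for illustration only.
revenue = [100, 98, 101, 60, 55, 58]

technical = {
    # Assumed meaning: page response time in milliseconds.
    "user performance": [200, 210, 205, 900, 950, 920],
    "database load": [0.40, 0.50, 0.45, 0.42, 0.48, 0.44],
    "decline rate": [0.02, 0.02, 0.03, 0.02, 0.03, 0.02],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Start from the business metric and ask which technical metric moves with it.
correlations = {name: pearson(revenue, series) for name, series in technical.items()}
suspect = max(correlations, key=lambda name: abs(correlations[name]))

for name, r in correlations.items():
    print(f"{name:18s} r = {r:+.2f}")
print("look here first:", suspect)
```

In this made-up data set the revenue drop lines up with the jump in response time and with nothing else, so that metric is the first place to look. A strong correlation only points at a candidate; it does not prove the cause.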
profiling
when you have the what
but still have no idea why
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* request enters Apache: remember the URI and reset the per-thread timers */
::ap_process_request:process-request-entry
/zonename == "www4"/
{
    self->uri = copyinstr(arg1);
    self->runtime = 0;
    self->waittime = 0;
    self->oncpu = timestamp;
}

/* thread leaves the CPU: add the slice it just spent running */
sched:::off-cpu
/self->uri != 0/
{
    self->runtime += timestamp - self->oncpu;
    self->offcpu = timestamp;
}

/* thread is back on the CPU: add the time it spent waiting */
sched:::on-cpu
/self->uri != 0/
{
    self->oncpu = timestamp;
    self->waittime += timestamp - self->offcpu;
}

/* request done: aggregate per URI */
::ap_process_request:process-request-return
/self->uri != 0/
{
    @duration[self->uri] = sum(self->runtime);
    @waiting[self->uri] = sum(self->waittime);
    @count[self->uri] = count();
}

/* every five minutes: print the top entries, then reset the aggregations */
:::tick-5min
{
    printf("\n%Y\n", walltimestamp);
    printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
    trunc(@duration, 10); printa(@duration); trunc(@duration);
    printf("\n\nNUMBER OF HITS\n");
    trunc(@count, 10); printa(@count); trunc(@count);
    printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
    trunc(@waiting, 10); printa(@waiting); trunc(@waiting);
}

TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL
  /directory                       6850049
  /api/map_search                  7341249
  /api/mobile/get_all_items        7980925
  /m/                              9124747
  /m/directory                     9175345
  /api/mobile/get_profile         11729556
  /api/holiday_feed               12603853
  /api/mobile/get_all_widgets     15043481
  /api/get_item/60693             19773404
  /m/events/all                   26165132
  /api/all_items                  27362330
  /api/mobile/get_all_events     368584344
/api/mobile/get_all_events 368584344
/api/get_item/60693 19773404
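A small sketch of how the off-CPU totals from the profiler output could be ranked to find the outlier automatically. The figures are copied from the output on the slide and are in nanoseconds, since the script sums DTrace `timestamp` deltas. The helper names `parse_totals` and `rank` are hypothetical.

```python
# Off-CPU totals copied from the profiler output on the slide (nanoseconds).
OFF_CPU_OUTPUT = """\
/directory 6850049
/api/map_search 7341249
/api/mobile/get_all_items 7980925
/m/ 9124747
/m/directory 9175345
/api/mobile/get_profile 11729556
/api/holiday_feed 12603853
/api/mobile/get_all_widgets 15043481
/api/get_item/60693 19773404
/m/events/all 26165132
/api/all_items 27362330
/api/mobile/get_all_events 368584344
"""


def parse_totals(text):
    """Turn 'url value' lines into a {url: nanoseconds} mapping."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            totals[parts[0]] = int(parts[1])
    return totals


def rank(totals):
    """Sort URLs from most to least off-CPU time."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rank(parse_totals(OFF_CPU_OUTPUT))
worst_url, worst_ns = ranked[0]
runner_up_url, runner_up_ns = ranked[1]
ratio = worst_ns / runner_up_ns

print(f"{worst_url} waited {ratio:.1f}x longer than {runner_up_url}")
```

On these numbers `/api/mobile/get_all_events` spends over 13 times longer off-CPU than the runner-up, `/api/all_items`. That tells you where to dig next, not yet why the requests are waiting.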
down the rabbit hole
troubleshooting is
required skill, educational, iterative, frustrating, rewarding
credits
https://www.track5media.com/wp-content/uploads/2016/06/workers-gathered-around-comuputer-screen.jpg
http://more-sky.com/data/out/10/IMG_379964.jpg
https://ruwix.com/pics/trolls/9-rubix-cube-neversolved.jpg
http://blog.cartif.com/wp-content/uploads/2016/02/evolucion.png
https://cdn-images-1.medium.com/max/2000/1*t-yZUIXuaXo97yiqYtpC5A.jpeg
http://www.6speedonline.com/forums/attachment.php?attachmentid=286232&stc=1&d=1380726388
http://www.wallpapers.faketrix.com/content/animal/feathered/page-2/1024/Ostrich-non-flying-winged-animals.jpg
http://oldmanyellsat.cloud/oldman.jpg
http://cdn.wccftech.com/wp-content/uploads/2016/05/4195797-windows-7-alternate-blue.jpg
https://www.poweradmin.com/blog/wp-content/uploads/2015/10/amazon-aws.png
https://supportingcmu.org/image/Herd.png
http://www.publicdomainpictures.net/pictures/30000/velka/green-fields-1351063140pg3.jpg
https://hurtigruten.global.ssl.fastly.net/assets/48dee2/globalassets/photos/voyages/explorer-voyages/2017-18/ms-fram-antarctica/the-frozen-land-of-the-penguins/2500x1250_r739816dominicbarrington.jpg?width=1600&height=800&transform=DownFill
https://www.thegeneralistit.com/wp-content/uploads/2015/11/dreamstime_xxl_38819851-Business-woman-eliminate-problem-and-find-solution.jpg
http://paperzip.co.uk/wp-content/uploads/2016/01/word-of-the-day-newspaper.jpg
http://vignette3.wikia.nocookie.net/starwars/images/7/72/DeathStar1-SWE.png/revision/latest?cb=20150121020639
https://lcarsgfx.files.wordpress.com/2014/10/prometheus1.png
https://cdn.meme.am/cache/instances/folder699/400x/65194699.jpg
http://blog.weespring.com/wp-content/uploads/2014/06/baby-safety-manual-5.jpg
https://4.bp.blogspot.com/-2fGfDw-sohs/V9_CAwCcnaI/AAAAAAAACos/zrARBywD2qAZOphkQMC7WZGdV3vMY5nTACLcB/s1600/Stop%2Bwhining.jpg
https://ih0.redbubble.net/image.14163956.5143/raf,750x1000,075,t,black_white.u4.jpg
http://www.inspireddad.org/wp-content/uploads/uploads/2013/02/ducttape_0930a8_3926013.jpg
https://katieleigh.files.wordpress.com/2014/10/img_0683.jpg
http://pre02.deviantart.net/020c/th/pre/i/2016/094/8/0/down_the_rabbit_hole_by_irenhorrors-d7hgsr3.jpg
http://i1-linux.softpedia-static.com/screenshots/Valgrind_1.png
http://i.imgur.com/m6Rkbdx.gif
questions?