Date post: | 09-Jan-2017 |
Category: | Technology |
Upload: | priyanka-aash |
Applied Machine Learning | Wolff | Wallace | Zhao
APPLIED MACHINE LEARNING FOR DATA
EXFIL AND OTHER FUN TOPICS
Matt Wolff, Chief Data Scientist
Brian Wallace, Security Researcher
Xuan Zhao, Data Scientist
GOALS OF THIS TALK: APPLIED MACHINE LEARNING
• Identify suitable problems for ML approaches
• Demonstrate by example how to apply ML
• Help jumpstart additional research in the Security + ML space
WHO WE ARE / CYLANCE
• Endpoint security company built around the capabilities of artificial intelligence
• Protecting millions of enterprise endpoints
• Founded in 2012, $177MM raised
• Booth #1124
Matt Wolff, Chief Data Scientist
Brian Wallace, Senior Security Researcher
Xuan Zhao, Data Scientist
TALK OVERVIEW
• Machine Learning Introduction
• NMAP Clustering
  • Feature Spaces
  • Distances
  • Clustering
• Botnet Panel Identification
  • Classification
  • Feature Reduction
• Obfuscating Data with Markov Chains
MACHINE LEARNING OVERVIEW
• Machine learning techniques are data driven
• Available data should be able to solve the problem in a meaningful way
• Approaches exist for dealing with raw data, as well as labeled or annotated data
MACHINE LEARNING OVERVIEW
• Given some data, different types of machine learning can be applied
• Clustering is useful for finding similarity across a dataset to uncover trends or other insight
• With labeled data, classification can be useful to build predictive models
MACHINE LEARNING OVERVIEW
• Often, raw data has to be transformed in some way to be used by machine learning algorithms
• Typical process is to extract features from data, and turn those features into vectors
• Vectors are then fed into ML algorithms for training or other purposes
MACHINE LEARNING OVERVIEW
• Recommended resources
  • Scikit-learn.org
  • Python
• Source code for all tools in this talk available on the Cylance public git repo
  • https://github.com/CylanceSPEAR
• Should be able to pull talk source and start modifying as needed to suit data driven problems in your own organization or research group.
TOOL – NMAP CLUSTERING
• NMAP is a popular port scanning tool
• Produces large amount of data per IP address
• Scans over large number of IPs can be difficult to make sense of
• NMAP Clustering is a tool which clusters (groups) IPs based on their open ports, services, etc
FEATURES
• Features are informative, discriminative information that can describe a sample/observation/phenomenon/etc.
• Feature extraction is pivotal to the machine learning pipeline
• Our features are based on NMAP output
• Each port is a feature, each service on each port is a feature, each version of each service on each port is a feature, etc
• Script output included in features (including website titles, public keys, etc)
VECTORS
• Numerical representation of a sample (IP in NMAP case)
• Array of values which represent all features from one sample
• Vectors can be thought of as points in high dimensional space
• Each feature is a dimension, the value of the feature in the vector is the position in that dimension
• If we have only two features, it is really easy to visualize
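As a rough illustration of the FEATURES and VECTORS slides, the sketch below turns scan results into 0/1 vectors. The `hosts` records, their field layout, and the feature-name strings are hypothetical stand-ins for parsed NMAP output; this is not the code from the Cylance tool.

```python
# Hypothetical parsed scan results: IP -> list of (port, service, version).
hosts = {
    "10.0.0.1": [(22, "ssh", "OpenSSH 7.2"), (80, "http", "nginx 1.10")],
    "10.0.0.2": [(80, "http", "nginx 1.10")],
    "10.0.0.3": [(22, "ssh", "OpenSSH 6.6"), (3306, "mysql", "5.5")],
}

def extract_features(ports):
    """Each port, each service on a port, and each version is its own feature."""
    features = set()
    for port, service, version in ports:
        features.add(f"port:{port}")
        features.add(f"port:{port}/service:{service}")
        features.add(f"port:{port}/service:{service}/version:{version}")
    return features

def vectorize(hosts):
    """Return the ordered feature names and one 0/1 vector per IP."""
    per_host = {ip: extract_features(ports) for ip, ports in hosts.items()}
    feature_names = sorted(set().union(*per_host.values()))
    vectors = {
        ip: [1 if name in feats else 0 for name in feature_names]
        for ip, feats in per_host.items()
    }
    return feature_names, vectors

feature_names, vectors = vectorize(hosts)
for ip, vector in vectors.items():
    print(ip, vector)
```

Every distinct feature becomes one dimension, so two IPs with the same open ports and services end up at the same point in that space.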
VECTORS – 3D
File        | Filename length | File size (kB) | Size of headers (kB)
calc.exe    | 8               | 918.528        | 63
notepad.exe | 11              | 193.024        | 45
malware.exe | 11              | 193.024        | 10
DISTANCES
• Distance: describes the discrepancy between two points
• Physical distance between two points: Pythagorean theorem: a² + b² = c²
[Figure: right triangle with legs a and b and hypotenuse c]
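The same relation extends beyond two dimensions. As a worked form, using the standard Euclidean distance (the symbols x, y, and n are assumed names for two feature vectors and their number of features, not notation from the slides):

```latex
c = \sqrt{a^2 + b^2}
\qquad\Longrightarrow\qquad
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

With n = 2 the sum has two terms and reduces to the triangle case.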
File        | Filename length | File size (kB) | Size of headers (kB)
calc.exe    | 8               | 918.528        | 63
notepad.exe | 11              | 193.024        | 45
malware.exe | 11              | 193.024        | 10
[Figure: the samples plotted in 2D and in 3D]
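To make the 2D versus 3D comparison concrete, here is a small sketch that measures Euclidean distance between the three files in the table. Which two features the 2D plot used is not stated, so taking the first two columns is an assumption.

```python
import math

# Rows from the table: (filename length, file size in kB, size of headers in kB).
samples = {
    "calc.exe": (8, 918.528, 63),
    "notepad.exe": (11, 193.024, 45),
    "malware.exe": (11, 193.024, 10),
}

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_2d(name_a, name_b):
    # Assumption: the 2D view uses only the first two features.
    return euclidean(samples[name_a][:2], samples[name_b][:2])

def distance_3d(name_a, name_b):
    return euclidean(samples[name_a], samples[name_b])

# notepad.exe and malware.exe share their first two features,
# so only the third dimension tells them apart.
print(distance_2d("notepad.exe", "malware.exe"))
print(distance_3d("notepad.exe", "malware.exe"))
```

Under that assumption the two files sit on the same point in 2D and only separate once the header-size dimension is added.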
DISTANCES
Multiple distance metrics – as long as an operation satisfies certain mathematical criteria, it can be used as a distance metric
[Figure: Manhattan and Euclidean paths between Point 1 and Point 2]
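The slide does not list the criteria. The usual ones for a metric are that a point has zero distance to itself, distance is symmetric, and the triangle inequality holds. The sketch below compares Manhattan and Euclidean distance and spot-checks those properties on three made-up points.

```python
import math

def manhattan(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative points, not taken from the slides.
point_1, point_2, point_3 = (0, 0), (3, 4), (6, 0)

def check_metric(metric):
    """Spot-check the metric properties on the three example points."""
    identity = metric(point_1, point_1) == 0
    symmetry = metric(point_1, point_2) == metric(point_2, point_1)
    triangle = metric(point_1, point_3) <= (
        metric(point_1, point_2) + metric(point_2, point_3)
    )
    return identity and symmetry and triangle

for metric in (manhattan, euclidean):
    print(metric.__name__, metric(point_1, point_2), check_metric(metric))
```

A spot check on three points is only a sanity test, not a proof that a function is a metric.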
CLUSTERING
• With a way to measure distance, we can group items by how close they are, aka clustering
• Clusters are distinct groups of samples (IPs) which have been grouped together
• Clustering is generally unsupervised learning
• Different algorithms with different configurations group these samples in different ways
k-Means
• Clustering algorithm
• User supplies k (desired number of clusters)
• All samples are assigned to random clusters
• Center of each cluster is computed by taking mean (average) of all samples in cluster
• Samples are then assigned to the cluster whose center they are closest to
• Centers are recomputed, algorithm loops until no samples change clusters
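The steps on this slide can be sketched in plain Python as follows. The `seed` and `max_iter` parameters and the handling of an empty cluster are assumptions added to keep the sketch deterministic and safe; for real work, scikit-learn's `KMeans` is the more practical choice.

```python
import math
import random

def mean_point(points):
    """Component-wise average of a non-empty list of vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def k_means(samples, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: assign every sample to a random cluster.
    assignments = [rng.randrange(k) for _ in samples]
    centers = []
    for _ in range(max_iter):
        # Step 2: each center is the mean of the samples in that cluster.
        centers = []
        for cluster in range(k):
            members = [s for s, a in zip(samples, assignments) if a == cluster]
            # Assumption: an empty cluster is re-seeded from a random sample.
            centers.append(mean_point(members) if members else rng.choice(samples))
        # Step 3: move each sample to the cluster with the closest center.
        updated = [
            min(range(k), key=lambda c: math.dist(s, centers[c]))
            for s in samples
        ]
        # Step 4: stop once no sample changes cluster.
        if updated == assignments:
            break
        assignments = updated
    return assignments, centers

# Hypothetical one-feature vectors forming two obvious groups.
samples = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
assignments, centers = k_means(samples, k=2)
print(assignments, centers)
```

Cluster numbers are arbitrary labels; only which samples share a label is meaningful.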
NMAP CLUSTERING – MANUAL/AUTOMATIC
• Manual allows you to supply your own clustering parameters
• Automatic tries many different methods with theoretically found optimal parameters and picks what it determines to be the best
• Demo with manual strategy
• Demo with automatic strategy
NMAP CLUSTERING - INTERACTIVE
• Incorporate the user's decision into the clustering process.
• The clustering result will be customized according to the user's preference in this way
• Process (will show with a demo):
  1. User decides whether the cluster needs to be split or not
  2. If yes, then split using divisive clustering
  3. If no, finalize this cluster
  4. Recursively split until the user is satisfied with all the clusters
[Figure: tree of split decisions; Y = split, N = no split, each N branch ends in a finalized cluster]
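The interactive process can be sketched as a recursive function that asks a callback whether to split. Here `wants_split` stands in for the user's decision, and `bisect` (cut the widest feature at its mean) is a simple placeholder for whatever divisive clustering method the tool actually uses; both names are hypothetical.

```python
def bisect(cluster):
    """Placeholder divisive step: cut the widest feature at its mean."""
    dims = range(len(cluster[0]))
    widest = max(
        dims,
        key=lambda d: max(p[d] for p in cluster) - min(p[d] for p in cluster),
    )
    cut = sum(p[widest] for p in cluster) / len(cluster)
    left = [p for p in cluster if p[widest] <= cut]
    right = [p for p in cluster if p[widest] > cut]
    return left, right

def interactive_cluster(cluster, wants_split):
    """Recursively split until wants_split says every cluster is final."""
    if len(cluster) < 2 or not wants_split(cluster):
        return [cluster]  # N: finalize this cluster
    left, right = bisect(cluster)
    if not left or not right:
        return [cluster]  # identical points cannot be split further
    # Y: split, then ask again about each half.
    return interactive_cluster(left, wants_split) + interactive_cluster(right, wants_split)

# Simulated user: keep splitting while a cluster spans more than 5 units.
def spread_too_large(cluster):
    values = [p[0] for p in cluster]
    return max(values) - min(values) > 5

points = [(0,), (1,), (10,), (11,), (50,)]
print(interactive_cluster(points, spread_too_large))
```

In an interactive tool the callback would show the cluster to the user and wait for a yes or no answer.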
TOOL – ID PANEL
• Botnet panels (command and control websites) can be difficult to identify
• Need previous knowledge of the botnet panels
• Often modified to avoid detection or for vanity
• Many are based off others, so distinguishing can be difficult
• We can train a model to identify if we are looking at a bot panel and which one it is, with a small number of requests
• Minimizing the number of requests to classify improves stealth and rate of classification
CLASSIFICATION
• This is a classification problem
• Classification answers “Is it what we are looking for?”
• Classification is generally supervised learning
• Supervised learning requires training samples to have labels
• Classification methods range from simple to highly complex
ID PANEL FEATURES
• Botnet panels are similar to normal websites
• Contain various file types, often edited
• HTTP response codes + content comparison
• Encoding content as features is difficult
• ssDeep provides a continuous value by comparing content
COLLECTING DATA
• Collection of known botnet panels
• Request all known paths for all known botnet panel types
• Store HTTP status code and ssDeep of content
• Collection of sites that are not botnet panels needed as well
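As a sketch of how stored responses might become a feature vector, the example below pairs each path's HTTP status code with a content similarity score. The paths, reference contents, and recorded responses are invented for illustration, and `difflib.SequenceMatcher` is used as a stand-in for an ssDeep comparison (which also yields a 0 to 100 score); the real tool's feature layout may differ.

```python
from difflib import SequenceMatcher

# Hypothetical reference content for paths a known panel type serves.
known_paths = {
    "/login.php": "<html><title>Panel login</title><form>user pass</form></html>",
    "/css/style.css": "body { background: #000; color: #0f0; }",
}

def similarity(a, b):
    """Stand-in for an ssDeep comparison: a 0-100 similarity score."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def feature_vector(responses):
    """responses maps path -> (HTTP status, body) recorded for one site."""
    vector = []
    for path in sorted(known_paths):
        # A path that was never fetched contributes (0, 0).
        status, body = responses.get(path, (0, ""))
        vector.append(status)
        vector.append(similarity(known_paths[path], body))
    return vector

# Recorded responses (no network access in this sketch).
panel_site = {path: (200, body) for path, body in known_paths.items()}
normal_site = {
    "/login.php": (404, "<html><h1>Not Found</h1></html>"),
    "/css/style.css": (200, "body { margin: 0; font-family: sans-serif; }"),
}

print(feature_vector(panel_site))
print(feature_vector(normal_site))
```

Vectors built this way for both panels and non-panels, together with their labels, form the training set for a classifier.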
DECISION TREES
• Decision trees are simple classifiers
• Splits the dataset one feature at a time until the decision is confident
• Results in a tree of queries where the results produce a decision
• Simple to train
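A minimal decision tree trainer, written from scratch to show the "split one feature at a time" idea. The Gini impurity criterion and the toy rows (a status code and a similarity score per sample) are assumptions chosen for illustration, not the method or data used by ID Panel.

```python
from collections import Counter

def gini(labels):
    """Impurity of a label set: 0 means all labels agree."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def train(rows, labels):
    """Build a tree: a leaf is a label, a node is (feature, threshold, left, right)."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return (feature, threshold, train(*zip(*left)), train(*zip(*right)))

def predict(tree, row):
    """Follow the queries down the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree

# Toy training data: (HTTP status, similarity score) per sample.
rows = [(200, 95), (200, 90), (404, 0), (200, 10), (404, 0), (403, 0)]
labels = ["panel", "panel", "not_panel", "not_panel", "not_panel", "not_panel"]
tree = train(rows, labels)
print(tree, predict(tree, (200, 88)))
```

On this toy data a single query on the similarity score is enough, which is why the resulting tree has only one split.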
ENSEMBLE OF DECISION TREES
• A single decision tree may be over focused on training data
• Can alleviate this problem by building multiple Decision Trees for each label
• Combining the results allows each Decision Tree to vote
• Partial answers may still be of interest to the user
• Ensembles can obtain better predictive performance
[Figure: Pony DT1, Pony DT2, Pony DT3, Dexter DT1, Dexter DT2, Dexter DT3, Zeus DT1, Zeus DT2, Zeus DT3 voting; result: It’s Pony]
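The voting scheme in the figure can be sketched as below. Each "tree" is reduced to a one-line rule and the feature names (`status_a`, `similarity_a`, and so on) are placeholders; real trees would be trained, and none of these rules describe actual Pony, Dexter, or Zeus panels.

```python
from collections import Counter

# Stand-in "trees": each takes a feature dict and votes True or False for its label.
trees_by_label = {
    "Pony": [
        lambda s: s["status_a"] == 200,
        lambda s: s["similarity_a"] > 80,
        lambda s: s["status_b"] == 200,
    ],
    "Dexter": [
        lambda s: s["status_a"] == 404,
        lambda s: s["similarity_b"] > 80,
        lambda s: s["status_b"] == 200,
    ],
    "Zeus": [
        lambda s: s["status_a"] == 403,
        lambda s: s["similarity_a"] < 20,
        lambda s: s["similarity_b"] > 50,
    ],
}

def classify(sample):
    """Count yes-votes per label and return the winner plus the full tally."""
    votes = Counter()
    for label, trees in trees_by_label.items():
        votes[label] = sum(1 for tree in trees if tree(sample))
    winner, _ = votes.most_common(1)[0]
    return winner, votes

sample = {"status_a": 200, "similarity_a": 92, "status_b": 200, "similarity_b": 15}
winner, votes = classify(sample)
print(winner, dict(votes))
```

Returning the whole tally rather than only the winner keeps the partial answers available, for example a label that received one vote out of three.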
ID PANEL DEMO – COMMAND LINE
• Quick way to check if a website directory contains a botnet panel
• Easy to batch searches of multiple websites/directories
• Easy to grep results
ID PANEL DEMO – CHROME EXTENSION
• Every website directory visited is tested
• Results are stored in browser
• ssDeep ported from C to JavaScript
  • https://github.com/kripken/emscripten
• Extension available in Chrome extension store (free, of course)
TOOL – MARKOV OBFUSCATE
• Data exfiltration from a network often requires avoiding an outbound firewall
• Deep packet inspection looks to block anything undesirable
• Easy to encrypt data, but it's also easy to drop information that can’t be read
• We can make our data look like something else entirely
MARKOV CHAIN
• Simple machine learning method for characterizing sequence data
• Learns the transition pattern from one state to another based on how likely a state comes after another state in the training data
• We can create sequences with transition patterns that are learned from the data it was trained on
Training data: AAB, BAB, ABA
Transition matrix (rows: current state, columns: next state):
  | A        | B
A | 1 (25%)  | 3 (75%)
B | 2 (100%) | 0 (0%)
Sequence probabilities, starting from A:
P(ABB) = 0
P(ABA) = 0.75
P(AAB) = 0.1875
P(AAA) = 0.0625
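The numbers on this slide can be reproduced with a few lines of Python. This sketch reads the training data as the three sequences AAB, BAB, and ABA, an assumption that matches the counts in the transition matrix, and treats each sequence probability as conditional on its first state.

```python
from collections import Counter, defaultdict

training_data = ["AAB", "BAB", "ABA"]

def train(sequences):
    """Count how often each state is followed by each other state."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts

def transition_probability(counts, current, following):
    total = sum(counts[current].values())
    return counts[current][following] / total if total else 0.0

def sequence_probability(counts, sequence):
    """Product of transition probabilities, given the first state."""
    probability = 1.0
    for current, following in zip(sequence, sequence[1:]):
        probability *= transition_probability(counts, current, following)
    return probability

counts = train(training_data)
for sequence in ("ABB", "ABA", "AAB", "AAA"):
    print(sequence, sequence_probability(counts, sequence))
```

B is never followed by B in the training data, so any sequence containing "BB" gets probability zero.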
POPULAR USE CASES OF MARKOV CHAINS
Weather prediction (rows: current state, columns: next state):
      | Sunny  | Rainy
Sunny | 0.9999 | 0.0001
Rainy | 0.9    | 0.1
Recommendation
[Figure: recommendation state graph with transition probabilities 0.9, 0.95, 0.1, 0.05, 0.03, 0.13]
ENCODING DATA WITH A MARKOV CHAIN
• Given a transition matrix, we can sort items by how likely they are to follow our current item
• If we choose the 5th most likely item, we can identify it’s the 5th most likely with a model trained on the same data
• This encodes the number 5 in the transition from our first item to our second item
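A toy version of this idea: rank each word's successors by frequency, encode a number by picking the successor at that rank, and decode by looking the rank up again with the same model. The training text, the zero-based ranks, and the alphabetical tie-breaking are assumptions of this sketch; it also only handles values smaller than the number of successors, which the real tool must deal with in some other way.

```python
from collections import Counter, defaultdict

def build_model(words):
    """Map each word to its successors, most likely first."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Ties are broken alphabetically so both sides get the same ranking.
    return {
        word: [w for w, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))]
        for word, c in counts.items()
    }

def encode(model, start, values):
    """Emit a word sequence whose transitions carry the given ranks."""
    words, current = [start], start
    for value in values:
        successors = model.get(current, [])
        if value >= len(successors):
            raise ValueError(f"cannot encode {value} after {current!r}")
        current = successors[value]
        words.append(current)
    return words

def decode(model, words):
    """Recover the ranks by looking up each observed transition."""
    return [model[a].index(b) for a, b in zip(words, words[1:])]

# Made-up training text; sender and receiver must train on the same data.
text = ("the cat sat on the mat the cat ate the fish the dog sat on "
        "the cat the dog ate the mat").split()
model = build_model(text)

secret = [2, 0, 3, 0, 1, 1]
cover = encode(model, "the", secret)
print(" ".join(cover))
print(decode(model, cover))
```

The output reads like text from the training source, while the choice of each next word carries the hidden values.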
MARKOV OBFUSCATE - ENCODING
• Train our model with a book
• Observing transitions from word to word
• Generate data based on transition probabilities
• Demo
MARKOV OBFUSCATE - WRAPPING
• Simple to transfer our data through a pipeline that looks like normal HTTP traffic
• Looks like a user posting to their blog
• Demo
MARKOV OBFUSCATE – HAVING FUN
• Train our models on Taylor Swift lyrics
• Train a Markov Model based on Taylor Swift songs
• Play the generated lyrics through Festival with tones/beats learned from songs
• First live “Tylance Swift” concert, demo
WRAPPING UP
• Any problem where there is a significant amount of data generated could benefit from a machine learning approach
• Lots of great online resources to help anyone get started
• Having labeled or annotated data makes more ML approaches viable compared to unlabeled data
QUESTIONS?
• Email: [email protected]
• Stop by booth #1124
• Career opportunities: https://www.cylance.com/cylance-careers