
Applied Machine learning for data exfiltration and other fun topics

Date post: 09-Jan-2017

Applied Machine Learning | Wolff | Wallace | Zhao

APPLIED MACHINE LEARNING FOR DATA EXFIL AND OTHER FUN TOPICS

Matt Wolff, Chief Data Scientist
Brian Wallace, Security Researcher
Xuan Zhao, Data Scientist

GOALS OF THIS TALK: APPLIED MACHINE LEARNING

• Identify suitable problems for ML approaches
• Demonstrate by example how to apply ML
• Help jumpstart additional research in the Security + ML space

WHO WE ARE / CYLANCE

• Endpoint security company built around the capabilities of artificial intelligence
• Protecting millions of enterprise endpoints
• Founded in 2012, $177mm raised
• Booth #1124

Matt Wolff, Chief Data Scientist
Brian Wallace, Senior Security Researcher
Xuan Zhao, Data Scientist

TALK OVERVIEW

• Machine Learning Introduction
• NMAP Clustering
  • Feature Spaces
  • Distances
  • Clustering
• Botnet Panel Identification
  • Classification
  • Feature Reduction
• Obfuscating Data with Markov Chains

MACHINE LEARNING OVERVIEW

• Machine learning techniques are data driven
• Available data should be able to solve the problem in a meaningful way
• Approaches exist for dealing with raw data, as well as labeled or annotated data

MACHINE LEARNING OVERVIEW

• Given some data, different types of machine learning can be applied
• Clustering is useful for finding similarity across a data set to uncover trends or other insight
• With labeled data, classification can be useful to build predictive models

MACHINE LEARNING OVERVIEW

• Often, raw data has to be transformed in some way to be used by machine learning algorithms
• Typical process is to extract features from data, and turn those features into vectors
• Vectors are then fed into ML algorithms for training or other purposes

MACHINE LEARNING OVERVIEW

• Recommended resources
  • Scikit-learn.org
  • Python
• Source code for all tools in this talk available on the Cylance public git repo
  • https://github.com/CylanceSPEAR
• Should be able to pull talk source and start modifying as needed to suit data driven problems in your own organization or research group.

TOOL – NMAP CLUSTERING

• NMAP is a popular port scanning tool
• Produces a large amount of data per IP address
• Scans over a large number of IPs can be difficult to make sense of
• NMAP Clustering is a tool which clusters (groups) IPs based on their open ports, services, etc

FEATURES

• Features are informative, discriminative information that can describe a sample/observation/phenomenon/etc.
• Feature extraction is pivotal to the machine learning pipeline
• Our features are based on NMAP output
• Each port is a feature, each service on each port is a feature, each version of each service on each port is a feature, etc
• Script output included in features (including website titles, public keys, etc)

VECTORS

• Numerical representation of a sample (IP in NMAP case)
• Array of values which represent all features from one sample
• Vectors can be thought of as points in high dimensional space
• Each feature is a dimension, the value of the feature in the vector is the position in that dimension
• If we have only two features, it is really easy to visualize
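
A minimal sketch of turning per-IP features into vectors, assuming the features have already been pulled out of the NMAP output as strings. The feature string format (`port:22:ssh`), the IP addresses, and the helper name `vectorize` are made up for illustration and are not taken from the NMAP Clustering source.

```python
# Hypothetical per-IP features, as if extracted from NMAP output.
# The "port:<n>:<service>" naming is an assumption for this sketch.
hosts = {
    "192.0.2.10": ["port:22", "port:22:ssh", "port:80", "port:80:http"],
    "192.0.2.11": ["port:22", "port:22:ssh"],
    "192.0.2.12": ["port:80", "port:80:http", "port:80:http:title=Login"],
}

# Every distinct feature seen in the scan becomes one dimension.
vocabulary = sorted({f for feats in hosts.values() for f in feats})


def vectorize(features, vocabulary):
    """Return a 0/1 vector: 1 where the sample has the feature."""
    present = set(features)
    return [1 if f in present else 0 for f in vocabulary]


vectors = {ip: vectorize(feats, vocabulary) for ip, feats in hosts.items()}

for ip, vec in vectors.items():
    print(ip, vec)
```

Each position in a vector corresponds to one entry in `vocabulary`, so two IPs with the same open ports and services end up at the same point in the feature space.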

VECTORS – 2D

[Figure: 2D plot]

VECTORS – 3D

File          Filename Length   File size (kB)   Size of headers (kB)
calc.exe      8                 918.528          63
notepad.exe   11                193.024          45
malware.exe   11                193.024          10

DISTANCES

• Distance: describes the discrepancy between two points
• Physical distance between two points:

Pythagorean theorem: $a^2 + b^2 = c^2$

[Figure: right triangle with legs a, b and hypotenuse c]

[Figure: the three files from the table above plotted in 2D and 3D]

DISTANCES

Multiple Distance Metrics – As long as an operation satisfies certain mathematical criteria, it can be used as a distance metric

[Figure: Manhattan and Euclidean paths between Point 1 and Point 2]
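
The two metrics named on this slide can be sketched in a few lines. The vectors reuse the three files from the earlier table (filename length, file size in kB, size of headers in kB). The function names are illustrative.

```python
import math

# (filename length, file size in kB, size of headers in kB) from the table
files = {
    "calc.exe": (8, 918.528, 63),
    "notepad.exe": (11, 193.024, 45),
    "malware.exe": (11, 193.024, 10),
}


def euclidean(p, q):
    """Straight-line distance: the Pythagorean theorem in n dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of per-dimension differences, like walking a city grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


for name in ("calc.exe", "malware.exe"):
    print(name,
          euclidean(files["notepad.exe"], files[name]),
          manhattan(files["notepad.exe"], files[name]))
```

notepad.exe and malware.exe differ only in header size, so both metrics give the same value for that pair. The distance to calc.exe is dominated by file size, which is why features are often rescaled before clustering.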

CLUSTERING

• With a way to measure distance, we can group items by how close they are, aka clustering
• Clusters are distinct groups of samples (IP) which have been grouped together
• Clustering is generally unsupervised learning
• Different algorithms with different configurations group these samples in different ways

k-Means

• Clustering algorithm
• User supplies k (desired number of clusters)
• All samples are assigned to random clusters
• Center of each cluster is computed by taking the mean (average) of all samples in the cluster
• Samples are then assigned to the cluster whose center they are closest to
• Centers are recomputed, algorithm loops until no samples change clusters
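
The steps above can be sketched in plain Python. This is an illustration of the loop described on the slide, not the code used by the NMAP Clustering tool. The sample points, the `seed` and `max_iter` parameters, and the balanced random start (so that no cluster begins empty) are assumptions of this sketch. In practice, scikit-learn's `KMeans` is the usual choice.

```python
import math
import random


def mean(points):
    """Component-wise average of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(samples, k, seed=0, max_iter=100):
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)
    # Random initial assignment; round-robin then shuffle keeps every
    # cluster non-empty at the start (a choice made for this sketch).
    labels = [i % k for i in range(len(samples))]
    rng.shuffle(labels)
    centers = [None] * k
    for _ in range(max_iter):
        # Center of each cluster = mean of its current members.
        for c in range(k):
            members = [s for s, l in zip(samples, labels) if l == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = mean(members)
        # Reassign every sample to the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(s, centers[c]))
            for s in samples
        ]
        if new_labels == labels:  # no sample changed cluster: done
            break
        labels = new_labels
    return labels, centers


samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(samples, k=2)
print(labels)
print(centers)
```

Which cluster ends up as 0 and which as 1 depends on the random start. Only the grouping is meaningful.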

[Figure: k-Means]

NMAP CLUSTERING – MANUAL/AUTOMATIC

• Manual allows you to supply your own clustering parameters
• Automatic tries many different methods with theoretically found optimal parameters and picks what it determines to be the best
• Demo with manual strategy
• Demo with automatic strategy

NMAP CLUSTERING - INTERACTIVE

• Incorporate the user's decision into the clustering process.
• The clustering result will be customized according to the customer's preference in this way
• Process (will show with a demo):
  1. User decides whether the cluster needs to be split or not
  2. If yes, then split using divisive clustering
  3. If no, finalize this cluster
  4. Recursively split until users are satisfied with all the clusters

[Figure: tree of split decisions; Y – Split, N – No-Split; each N branch ends in a Finalized cluster]
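
A rough sketch of the recursive process above, with the user's answer modeled as a callback. The `wants_split` callback, the farthest-pair splitting rule in `split_in_two`, and the sample points are assumptions for illustration. The talk does not say which divisive method the tool uses.

```python
import math


def split_in_two(points):
    """Divisive step: seed with two far-apart points, assign to the nearer."""
    a = points[0]
    b = max(points, key=lambda p: math.dist(a, p))
    a = max(points, key=lambda p: math.dist(b, p))
    left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
    return left, right


def interactive_cluster(points, wants_split):
    """Recursively split clusters until wants_split says no for each one."""
    if len(points) < 2 or not wants_split(points):
        return [points]  # N: finalize this cluster
    left, right = split_in_two(points)
    if not left or not right:  # all points identical, nothing to split
        return [points]
    # Y: split, then ask again for each half
    return (interactive_cluster(left, wants_split)
            + interactive_cluster(right, wants_split))


def too_spread_out(cluster):
    """Stand-in for the user: split while any two points are > 5 apart."""
    return max(math.dist(p, q) for p in cluster for q in cluster) > 5


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters = interactive_cluster(points, too_spread_out)
print(clusters)
```

In an interactive tool, `wants_split` would show the cluster to the user and return their Y/N answer instead of applying a fixed rule.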

TOOL – IDPANEL

• Botnet panels (command and control websites) can be difficult to identify
  • Need previous knowledge of the botnet panels
  • Often modified to avoid detection or for vanity
  • Many are based off others, so distinguishing can be difficult
• We can train a model to identify if we are looking at a bot panel and which one it is, with a small number of requests
• Minimizing the number of requests to classify improves stealth and rate of classification

CLASSIFICATION

• This is a classification problem
• Classification answers “Is it what we are looking for?”
• Classification is generally supervised learning
• Supervised learning requires training samples to have labels
• Classification methods range from simple to highly complex

IDPANEL FEATURES

• Botnet panels are similar to normal websites
• Contain various file types, often edited
• HTTP response codes + content comparison
• Encoding content as features is difficult
• ssDeep provides a continuous value by comparing content

COLLECTING DATA

• Collection of known botnet panels
• Request all known paths for all known botnet panel types
• Store HTTP status code and ssDeep of content
• Collection of sites that are not botnet panels needed as well
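
A sketch of how one site's responses could become a feature vector: for each known path, the HTTP status code plus a 0-100 content similarity score. ssDeep is not in the Python standard library, so `difflib.SequenceMatcher` is used here as a stand-in for the ssDeep comparison. The paths, page bodies, and the choice to treat an unrequested path as a 404 are all made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 similarity score (stand-in for an ssDeep comparison)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def panel_features(responses, reference):
    """Build [status, similarity, status, similarity, ...] over known paths.

    responses: path -> (HTTP status, body) observed on the site under test
    reference: path -> body seen on a known panel of this type
    """
    vector = []
    for path in sorted(reference):
        # Assumption: a path with no recorded response counts as a 404.
        status, body = responses.get(path, (404, ""))
        vector.append(status)
        vector.append(similarity(body, reference[path]))
    return vector


# Hypothetical reference content for one panel type
reference = {
    "/admin/index.php": "<html>admin panel</html>",
    "/gate.php": "<html>login gate</html>",
}
# Hypothetical responses from a site being checked
responses = {"/gate.php": (200, "<html>login gate</html>")}

features = panel_features(responses, reference)
print(features)
```

Vectors like this, collected from both known panels and ordinary sites, are the labeled training data the classifier needs.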

DECISION TREES

• Decision trees are simple classifiers
• Splits the data set one feature at a time until decision is confident
• Results in a tree of queries where the results produce a decision
• Simple to train
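
To make "splits the data set one feature at a time" concrete, here is a small decision tree trainer in plain Python. It uses Gini impurity to choose splits, which is one common criterion and an assumption here. The talk does not say which criterion IDPanel uses. The training rows (status code, similarity score) and labels are invented.

```python
from collections import Counter


def gini(labels):
    """Impurity of a set of labels: 0.0 means all labels agree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the (feature, threshold) that gives the purest two halves."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def train_tree(rows, labels):
    """Return a leaf label, or (feature, threshold, left_tree, right_tree)."""
    if len(set(labels)) == 1:
        return labels[0]  # decision is confident
    split = best_split(rows, labels)
    if split is None:  # identical rows, mixed labels: majority vote
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t, train_tree(*zip(*left)), train_tree(*zip(*right)))


def predict(tree, row):
    """Walk the tree of queries until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Invented samples: (HTTP status, content similarity 0-100)
rows = [(200, 95), (200, 90), (200, 10), (404, 0), (403, 0)]
labels = ["panel", "panel", "not panel", "not panel", "not panel"]

tree = train_tree(rows, labels)
print(tree)
print(predict(tree, (200, 88)))
```

On this toy data a single query on the similarity feature is enough. Real data usually needs several levels of splits.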

ENSEMBLE OF DECISION TREES

• A single decision tree may be over-focused on training data
• Can alleviate this problem by building multiple Decision Trees for each label
• Combining the results allows each Decision Tree to vote
• Partial answers may still be of interest to the user
• Ensembles can obtain better predictive performance

[Figure: Pony DT1, Pony DT2, Pony DT3, Dexter DT1, Dexter DT2, Dexter DT3, Zeus DT1, Zeus DT2 and Zeus DT3 vote; result: It's Pony]
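
The voting step can be sketched as follows, assuming each label (Pony, Dexter, Zeus) has three trees that each answer yes or no. The vote values, the `threshold` parameter, and the function name are hypothetical.

```python
def ensemble_decision(votes, threshold=0.5):
    """Combine per-label tree votes.

    votes: label -> list of booleans, one per decision tree for that label.
    Returns (winning label or None, fraction of yes votes per label).
    """
    scores = {label: sum(v) / len(v) for label, v in votes.items()}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return (label if score > threshold else None), scores


# Hypothetical answers from DT1..DT3 of each panel type
votes = {
    "Pony": [True, True, True],
    "Dexter": [False, True, False],
    "Zeus": [False, False, False],
}

decision, scores = ensemble_decision(votes)
print(decision, scores)
```

Returning the per-label scores alongside the decision keeps the partial answers available, for example when one Dexter tree fires on a panel derived from Dexter code.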

IDPANEL DEMO – COMMAND LINE

• Quick way to check if a website directory contains a botnet panel
• Easy to batch searches across multiple websites/directories
• Easy to grep results

IDPANEL DEMO – CHROME EXTENSION

• Every website directory visited is tested
• Results are stored in browser
• ssDeep ported from C to JavaScript
  • https://github.com/kripken/emscripten
• Extension available in Chrome extension store (free, of course)

TOOL – MARKOV OBFUSCATE

• Data exfiltration from a network often requires avoiding an outbound firewall
• Deep packet inspection looks to block anything undesirable
• Easy to encrypt data, but it's also easy to drop information that can't be read
• We can make our data look like something else entirely

MARKOV CHAIN

• Simple machine learning method for characterizing sequence data
• Learns the transition pattern from one state to another based on how likely a state comes after another state in the training data
• We can create sequences with transition patterns that are learned from the data it was trained on

Training Data: AAB BAB ABA

Transition Matrix (row: current state, column: next state)

     A          B
A    1 (25%)    3 (75%)
B    2 (100%)   0 (0%)

[Figure: tree of sequences starting from A, with the probability of each path]

P(AAA) = 0.0625
P(AAB) = 0.1875
P(ABA) = 0.75
P(ABB) = 0
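
The numbers on this slide can be reproduced with a short sketch. One assumption is needed: the training string AABBABABA is read as three separate sequences (AAB, BAB, ABA), because that reading yields exactly the counts in the transition matrix. The function names are illustrative.

```python
from collections import Counter, defaultdict


def transition_counts(sequences):
    """Count how often each state is followed by each other state."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return counts


def transition_probs(counts):
    """Normalize each row of counts into probabilities."""
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


def sequence_probability(probs, sequence):
    """Probability of the transitions in sequence, given its first state."""
    p = 1.0
    for cur, nxt in zip(sequence, sequence[1:]):
        p *= probs.get(cur, {}).get(nxt, 0.0)
    return p


# Assumed split of the slide's training data into three sequences
training = ["AAB", "BAB", "ABA"]

counts = transition_counts(training)
probs = transition_probs(counts)

for seq in ("AAA", "AAB", "ABA", "ABB"):
    print(seq, sequence_probability(probs, seq))
```

B is never followed by B in the training data, so any sequence containing BB gets probability 0.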

POPULAR USE CASES OF MARKOV CHAINS

Weather Prediction

         Sunny    Rainy
Sunny    0.9999   0.0001
Rainy    0.9      0.1

Recommendation

[Figure: state diagram with transition probabilities 0.9, 0.95, 0.1, 0.05, 0.03, 0.13]

ENCODING DATA WITH A MARKOV CHAIN

• Given a transition matrix, we can sort items by how likely they are to follow our current item
• If we choose the 5th most likely item, we can identify it's the 5th most likely with a model trained on the same data
• This encodes the number 5 in the transition from our first item to our second item
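
A toy sketch of this idea: numbers are hidden in the choice of "nth most likely next word", and anyone holding a model trained on the same text can read them back. Ranks are zero-based here. The tiny corpus, the alphabetical tie-breaking rule, and all function names are assumptions of this sketch and are not taken from the Markov Obfuscate source.

```python
from collections import Counter, defaultdict


def train_model(tokens):
    """Count word-to-word transitions in the training text."""
    model = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        model[cur][nxt] += 1
    return model


def ranked_successors(model, state):
    """Successors of state, most likely first; ties broken alphabetically
    so the encoder and decoder always agree on the order."""
    return sorted(model[state], key=lambda t: (-model[state][t], t))


def encode(model, start, numbers):
    """Emit one word per number: the number picks the successor's rank."""
    out = [start]
    for n in numbers:
        options = ranked_successors(model, out[-1])
        if not 0 <= n < len(options):
            raise ValueError(f"cannot encode {n} after {out[-1]!r}")
        out.append(options[n])
    return out


def decode(model, tokens):
    """Recover each number as the rank of the observed transition."""
    return [
        ranked_successors(model, cur).index(nxt)
        for cur, nxt in zip(tokens, tokens[1:])
    ]


# Toy corpus in which every word can be followed by all three words
corpus = "tea tea jam tea pie jam jam pie pie tea jam tea jam".split()
model = train_model(corpus)

secret = [0, 2, 1, 0]
encoded = encode(model, "tea", secret)
print(" ".join(encoded))
print(decode(model, encoded))
```

With a real book as training text, states have very different numbers of successors, so a practical encoder has to vary how much data each transition carries. This sketch simply rejects numbers that do not fit.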

MARKOV OBFUSCATE - ENCODING

• Train our model with a book
• Observing transitions from word to word
• Generate data based on transition probabilities
• Demo

MARKOV OBFUSCATE - WRAPPING

• Simple to transfer our data through a pipeline that looks like normal HTTP traffic
• Looks like a user posting to their blog
• Demo

MARKOV OBFUSCATE – HAVING FUN

• Train our models on Taylor Swift lyrics
• Train a Markov Model based on Taylor Swift songs
• Play the generated lyrics through festival with tones/beats learned from songs
• First live “Tylance Swift” concert, demo

WRAPPING UP

• Any problem where there is a significant amount of data generated could benefit from a machine learning approach
• Lots of great online resources to help anyone get started
• Having labeled or annotated data makes more ML approaches viable compared to unlabeled data

QUESTIONS?

• Email: [email protected]
• Stop by booth #1124
• Career opportunities: https://www.cylance.com/cylance-careers

