Sample Selection for Testing
Copyright © 2016 Anti-Malware Testing Standards Organization, Inc. All rights reserved. No part of this document may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written consent of the publisher.
Notice and Disclaimer of Liability Concerning the Use of AMTSO Documents

This document is published with the understanding that AMTSO members are supplying this information for general educational purposes only. No professional engineering or any other professional services or advice is being offered hereby. Therefore, you must use your own skill and judgment when reviewing this document and not solely rely on the information provided herein.

AMTSO believes that the information in this document is accurate as of the date of publication, although it has not verified its accuracy or determined if there are any errors. Further, such information is subject to change without notice and AMTSO is under no obligation to provide any updates or corrections.

You understand and agree that this document is provided to you exclusively on an as-is basis without any representations or warranties of any kind, whether express, implied or statutory. Without limiting the foregoing, AMTSO expressly disclaims all warranties of merchantability, non-infringement, continuous operation, completeness, quality, accuracy and fitness for a particular purpose.

In no event shall AMTSO be liable for any damages or losses of any kind (including, without limitation, any lost profits, lost data or business interruption) arising directly or indirectly out of any use of this document including, without limitation, any direct, indirect, special, incidental, consequential, exemplary and punitive damages, regardless of whether any person or entity was advised of the possibility of such damages.

This document is protected by AMTSO's intellectual property rights and may be additionally protected by the intellectual property rights of others.
Introduction
The classification and appropriate, well-founded selection of samples for testing is necessary in order to make a test reliable, unbiased, relevant and meaningful. Following these practices properly lessens the risk of producing tests and test results of doubtful validity, and makes the conclusions based upon them less likely to be misleading.

In any test, sample selection is important. In general, the quality of the samples used is more important than the quantity, but a reasonable minimum quantity of samples is necessary.
Sample selection can be broken down into the following processes:

• Collecting
• Validation
• Classification

Collecting of samples is the process of gathering/selecting files, URLs, or other objects to be used as test cases.

Validation of samples is the process of making sure that the file or object to be used functions properly in the defined testing environment.

Classification (or verification) is the process of properly categorizing the files or objects into their correct category set, which can be as simple as a good, bad or "gray" set, or as complex as worms, trojans, rootkits, adware, "potentially unwanted", or other more detailed categories. It may also include sub-categories within the good, bad or gray set, as described later.

By following these processes, and the best practices associated with each, any tester will have a good foundation for conducting a test. Collect the pieces; validate that they work; and verify them for accurate categorization.
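The collect/validate/classify sequence above can be sketched as a minimal pipeline. This is an illustration only: the `is_functional` and `categorize` callables are hypothetical stand-ins for whatever validation and classification procedures the tester has documented.

```python
# Minimal sketch of the collect -> validate -> classify pipeline.
# is_functional and categorize are placeholders for the tester's own
# documented validation and classification procedures.

def build_test_set(candidates, is_functional, categorize):
    """Validate collected candidates, then group the survivors by category."""
    test_set = {}
    for sample in candidates:
        if not is_functional(sample):   # validation: drop non-working objects
            continue
        category = categorize(sample)   # classification: good / bad / gray / ...
        test_set.setdefault(category, []).append(sample)
    return test_set

# Illustrative use with trivial stand-in predicates:
candidates = ["sample_a", "sample_b", "broken_c"]
result = build_test_set(
    candidates,
    is_functional=lambda s: not s.startswith("broken"),
    categorize=lambda s: "bad",
)
print(result)  # {'bad': ['sample_a', 'sample_b']}
```

The point of the sketch is the ordering: validation happens before classification, and samples that fail validation never reach the test set at all.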
Collecting
The source of the samples used in a test often dictates the success or failure of that test, and is often one of the very first questions a tester needs to ask. Focusing on a single, specific source may be acceptable as long as it was the specific purpose of the test and as long as it has been properly defined as the test objective. However, it can also result in the narrowing of the test coverage, which might limit the audience targeted by the review. For example, a consumer-oriented source of samples might not be of interest or relevant to a corporate audience, and vice versa. The sourcing of samples should be aligned with and appropriate to the test purpose, but coverage of a wider range of sources often appeals to a correspondingly wider range of audiences and is recommended in principle.

Samples can be categorized from two different points of view:
1. How were the samples collected? Examples of this type of source category include: honeypots, passive crawlers, active crawlers, ISPs, etc.
2. Where have the samples been collected from? Examples of this type of source category are: URI, intranet, email, file sharing, social networks, peer-to-peer, etc.

It is important to take both categories of sample source into consideration when the test is designed, conducted and tailored for the audience. Due to the high volumes, regional distribution factors, and the intended classification of existing samples, description of the collection methodology becomes a key factor in determining the likelihood and degree of bias in a given test.
The ideal source of samples offers real-world, prevalent, fresh, diverse samples collected independently of security software providers. It is important that testers actively collect samples and create their own sources/collections, so that the samples are as independent and neutral as possible. Obtaining samples from independent sources is also discussed in AMTSO's Issues Involved in the 'Creation' of Samples for Testing document at www.amtso.org.
Validation can be a problem, based on the resources typically available to testers, especially when using independent sources. However, at this point we reiterate that validation and verification of test samples by scanning with multiple products does not in itself offer reliable, accurate, vendor-neutral validation or verification.
If using samples drawn from the feeds of various AV companies, the selection must be done in a balanced way so that bias is not introduced even before the test is actually conducted. This is the area in which metadata sharing may be useful. Various attributes can be taken into account, such as malware (geo)prevalence, age, family name and so on. The list of significant attributes to share is under discussion within the IEEE ICSG working group.
Lastly, the freshness of collected samples is also important, since it affects how relevant a test set is to the real-life threat landscape. For example, a trojan discovered 5 years ago may still be as potent as trojans found today, but the likelihood of seeing such a 5-year-old threat might be low compared to threats just found today. In the case of short-lived threats, a one-day-old URL might already be obsolete.
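The point that different object types go stale at different rates can be made concrete with a small freshness filter. The age thresholds below are arbitrary illustrations chosen for this sketch, not AMTSO recommendations; a tester would set them according to the documented test objective.

```python
from datetime import datetime, timedelta

# Illustrative freshness filter: different object types tolerate different ages.
# These thresholds are arbitrary examples, not AMTSO recommendations.
MAX_AGE = {
    "url": timedelta(days=1),      # short-lived web threats go stale fast
    "binary": timedelta(days=30),  # file samples stay relevant longer
}

def is_fresh(sample_type, first_seen, now):
    """Return True if the sample is still within its freshness window."""
    return (now - first_seen) <= MAX_AGE[sample_type]

now = datetime(2016, 6, 1)
print(is_fresh("url", datetime(2016, 5, 31), now))     # True: one day old
print(is_fresh("url", datetime(2016, 5, 25), now))     # False: a week-old URL is stale
print(is_fresh("binary", datetime(2016, 5, 25), now))  # True: a week-old binary is fine
```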
Table 1 in the Appendix provides guidance for testers on the sources they might use and the pros and cons of each source, based on AMTSO good practice guidelines for sample collection.

This assessment is not meant to identify the best method of collection, but merely to indicate the amount of post-collection effort a tester needs to put into building up appropriate and representative test sets. For example, when collecting from non-security industry sources, the independence and diversity gained is balanced by the increased post-collection effort needed to validate and verify the collected objects. When collecting from commercial sources, although the collected objects may be fresh and validated, diversity and independence may suffer.

There is no single, ideal way to collect samples for tests. A tester needs to balance the factors mentioned here in order to build a good set of samples that can increase the quality of the test.
Validation
The sample validation process essentially consists of a series of tests to make sure that the sample is functional. There are several ways to validate samples: hand-checking, usage of automated tools (auto-replicating systems, sandboxing) or using various specialized tools to check file geometry, integrity or functionality (not applicable to all sample types). Best practices show that validation is most valuable when it is based on sample functionality and performed in the same environment as the test will use. In this way, the validation can also be done during or after the test. However, the validation procedure still needs to be documented.
Merely scanning the samples using various products and accepting or rejecting according to the detection results cannot be considered an acceptable method of validation, for several reasons:

• Vendors do not only use exact detections, so it is not guaranteed that a sample detected as malicious is really an intact or working sample.
• Occasionally detections are created for samples that are known not to be valid, working objects.
• Detections created for working samples may also detect non-working samples inadvertently (i.e. not on purpose), depending on the detection algorithm.
• Using AV products for sample verification can add a huge bias to the test, especially if the same products are going to be tested on those "verified" samples.
• AV products are known to occasionally follow each other's misclassifications (a.k.a. cascaded false positives).
• Using cloud scanners (for details refer to AMTSO's document Best Practices for Testing In-the-Cloud Security Products) should be avoided before the test is performed, since it may affect the test results by leaking information about the test set in advance.
• Using external multi-scanner services has all the problems listed above, and more: for example, it adds the risk of leaking the test set and losing control over the product settings.
AMTSO has already published acceptable validation methods, and testers are advised to read the AMTSO document Best Practices for Validation of Samples for suggestions on how samples can be validated.
Classification
The classification process involves the categorization of the collected and validated sample set. This usually involves grouping samples as good (non-malicious), bad (malicious), or gray (whether the object is malicious depends on the intent of the author/distributor and the understanding of the target user – for example, whether the presentation of the object is unequivocally misleading), or as any other categories that are defined and which are intended to be included in the test. The intended categories need to be clearly stated in the documented test objectives.

Classification as good, bad or gray can be further broken down into sub-categories depending on the tests that need to be performed. For example, malware can be broken down into trojans, worms, viruses, and so on, while clean files can be broken down based on prevalence or criticality. Although presented here as a separate step, classification may be performed at the time of validation. In this case,
the behavior of a file or object is observed and noted while checking whether it is working or not. Classification procedures also need to be documented and consistent.

The tester has to define the characteristics, parameters and boundaries of what is considered to be good, bad, gray, or any other category. These definitions or definition references need to be documented. This is especially the case if they do not align with the generally accepted definitions (if they exist at all) for the mentioned categories. Lastly, the classification and/or categorization must be relevant for the purpose of the test.
Below are some practices used in verifying a sample's behavior, together with the questions a tester has to assess to decide whether each method is an option:
a. Reverse Engineering Verification of each sample.
   i. Would it be prohibited for testers to apply reverse engineering? At this point it is necessary to establish whether, for example, reverse engineering is prohibited by law.
   ii. Is it practical from a time/cost/resource perspective?
b. Using Analysis Tools
   i. Commercial Tools
      1. Are some of the tools prohibitively expensive?
      2. Does the tool provide the necessary functionality?
      3. Some malware detects commercial tools. Does this lessen their usefulness and eventually lead towards misclassification?
      4. Are the functionalities and/or limitations of the commercial tool known?
   ii. Open Source Tools
      1. Some malware detects open source tools. Does this lessen their usefulness and eventually lead towards misclassification?
      2. Does the open source tool provide the necessary functionality?
      3. Are the functionalities and/or limitations of the open source tool known?
      4. Has the open source tool been modified for the test? Some open source tools require the publication of the modifications.
   iii. Internally Developed Tools
      1. How much disclosure should be provided when using internally developed tools?
      2. Should it be explained why such a tool was developed?
c. Using Multiple Scanners (should not be used alone)
   i. How many scanners have to concur for verification to be relevant?
   ii. How does the choice of scanners used for verification affect the test? What measures have been taken to avoid bias in favour of any of the tested vendors? Will this information be disclosed?
   iii. Does the detection name affect the classification of samples? What if the classifications/names change over time? What about generic detections and multiple classifications?
   iv. How reliable are the scanner results?
d. Using a Clean Collection
   i. How was the clean collection collected/validated/classified?
   ii. How broadly has the clean collection been selected? Are commercial, shareware, and/or freeware applications included?
Other factors that testers should consider in the verification process are:
Freshness
An important aspect of any anti-threat or anti-theft technology is proactive protection. This is best evaluated using fresh and currently relevant threats. Thus the age of samples and/or the age of their sources (in the case of URLs or domains as test objects) needs to be taken into consideration. Sample selection and categorization is a significant issue in all test methodologies, and to fully test the responsiveness of real-time systems, samples should normally be as 'fresh' as possible. Best practice would be to validate in advance; however, an acceptable compromise might be to achieve maximum freshness by testing solutions against all available samples and performing sample validation and/or classification later. In this case only success or failure against proven-valid samples should be taken into consideration when reporting results.
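The "validate later" compromise above can be sketched in a few lines: the product is run against everything that was collected, but only samples subsequently proven valid count toward the reported result. The data structures here are hypothetical illustrations.

```python
# Sketch of "validate later" scoring: the product is tested against all
# collected samples for maximum freshness, but only samples later proven
# valid count toward the reported detection rate.

def detection_rate(results, proven_valid):
    """results: {sample_id: detected (bool)}; proven_valid: set of sample ids."""
    scored = {s: d for s, d in results.items() if s in proven_valid}
    if not scored:
        return None  # nothing validated yet: no reportable result
    return sum(scored.values()) / len(scored)

results = {"s1": True, "s2": False, "s3": True, "s4": False}
proven_valid = {"s1", "s2", "s3"}            # s4 failed post-test validation
print(detection_rate(results, proven_valid))  # 2/3: s4 is simply excluded
```

Note that the miss on `s4` neither helps nor hurts the product, because the sample was never proven to be a working threat.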
Prevalence
While making sure the samples in a test set are diverse and comprise a sufficiently large variety of files (either malicious or clean), it may – depending on the test scope – also be very important to take into account their prevalence. This is just as valid for malware as it is for clean files – testers should avoid specific, low-prevalence problematic software (grayware) that is known to be likely to trigger false positives or disputed detections because of its nature. For both innocent and malicious samples, it should be taken into consideration that prevalence may depend on the source and method of collection.

For example, if samples are sourced from only one geographical region, it is to be expected that these will be prevalent within the area in which they were collected, but that does not necessarily reflect prevalence worldwide.

Prevalence metadata (for example the model developed by IEEE ICSG working group members) could become a valuable source for determining sample prevalence. Vendors are encouraged to share metadata, and testers are encouraged to use multiple sources in order to reduce the risk of bias.
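One way prevalence metadata can feed into scoring is to weight each sample by how often it is seen in the wild, so that a missed high-prevalence sample costs more than a missed rarity. The weights below are invented for illustration; in practice they would come from shared metadata such as the IEEE ICSG-style feeds mentioned above.

```python
# Hypothetical prevalence-weighted detection rate: a missed common family
# should hurt more than a missed rarity. The weights are illustrative only;
# real prevalence figures would come from shared metadata feeds.

def weighted_detection_rate(samples):
    """samples: list of (detected (bool), prevalence_weight) pairs."""
    total = sum(w for _, w in samples)
    caught = sum(w for d, w in samples if d)
    return caught / total

samples = [
    (True, 1000),   # common family, detected
    (False, 10),    # rare sample, missed
    (True, 90),
]
print(weighted_detection_rate(samples))  # 1090/1100, roughly 0.991
```

An unweighted rate over the same three samples would be 2/3; the weighting reflects that the one miss is on a sample users are unlikely to encounter.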
Diversity
Copyright©2016Anti-MalwareTestingStandardsOrganization,Inc.Allrightsreserved.Nopartofthisdocumentmaybereproducedinanyform,inanelectronicretrievalsystemorotherwise,withouttheprior
writtenconsentofthepublisher.
8
Diversity in this sense refers to both the variety of malware families tested and the underlying behavior of the malware. A sample set is diverse when it reflects the real-world distribution of the samples relevant for the testing purpose. In particular, resource-intensive tests (like dynamic or cleaning tests) are frequently carried out on a smaller sample set than large-scale static tests, so it is all the more important that the sample set is diverse.

It is not best practice to include a large number of samples to reach some desired quantity if they are not diverse. Such an approach to "padding" the number of samples does not necessarily add any value; in fact, the larger the sample set, the less thoroughly the samples are usually verified.

Diversity might also be limited in the special cases of testing detection capabilities regarding polymorphic viruses, server-side polymorphic malware, and so on.

Although diversity may lower the minimum quantity of statistically relevant samples in a test set, the higher the number, the higher the test accuracy should be, as long as the set is reasonably diverse and the quality of the samples is maintained.
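One simple guard against the "padding" problem described above is to cap the number of samples drawn from any single family, so near-duplicates cannot inflate the set. The cap of 5 is an arbitrary illustration, and family labels are assumed to come from the tester's own classification step.

```python
from collections import Counter

# Sketch: keep a sample set diverse by capping samples per malware family,
# rather than padding the set with near-duplicates from one family.
# The cap value is an arbitrary illustration.

def cap_per_family(samples, max_per_family=5):
    """samples: list of (sample_id, family) pairs; returns kept sample ids."""
    kept, counts = [], Counter()
    for sample_id, family in samples:
        if counts[family] < max_per_family:
            kept.append(sample_id)
            counts[family] += 1
    return kept

samples = [(f"s{i}", "family_a") for i in range(8)] + [("t1", "family_b")]
print(cap_per_family(samples))  # ['s0', 's1', 's2', 's3', 's4', 't1']
```

A real test would also weigh behavioral diversity, not just family labels, but the cap illustrates the principle that quantity from one family adds little value.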
Reasonable Number of Samples
As briefly touched upon in the previous paragraph, the number of samples used is very important in order to make testing statistically meaningful. In reality, the number of samples tested is strongly dependent on the validation method and/or difficulties in conducting the test, since the resources required vary with each test, even among tests from a specific tester. Tens of samples can hardly be considered statistically relevant for any test unless they represent a high proportion of a very small total population. The sample size and choice of samples should be statistically adequate to support the conclusions of the test. Where practical, the tester should quote the margin of error, or at the very least explain the limitations of the test results imposed by the methodology. One of the deciding factors is the statistical validity of the sample set.
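The margin of error mentioned above can be estimated with the standard normal approximation for a proportion. This is a sketch under the simplifying assumption that samples are independent draws from the population of interest, which, as this document cautions, rarely holds strictly for malware test sets.

```python
import math

# Normal-approximation margin of error for a measured detection rate at
# 95% confidence (z = 1.96). Assumes independent samples: a simplification,
# since real test sets are rarely random draws from the threat population.

def margin_of_error(detected, total, z=1.96):
    p = detected / total
    return z * math.sqrt(p * (1 - p) / total)

# With only 20 samples the interval around a 90% rate is wide...
print(round(margin_of_error(18, 20), 3))      # 0.131, i.e. roughly +/-13 points
# ...with 2000 samples at the same rate it tightens considerably.
print(round(margin_of_error(1800, 2000), 3))  # 0.013
```

This illustrates why "tens of samples" rarely supports strong conclusions: at that scale, differences of ten percentage points between products can fall inside the noise.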
Geo-location Issues
One of the important issues that testers need to take into account is the market coverage of the products tested. For example, detecting a legitimate Chinese toolbar may be a non-critical false positive in Germany, and not cause any problems there, but nevertheless be a false alarm; this same program, however, would have a critical impact on products active in China. This applies mainly to legitimate software and grayware, where different acceptance rates for these applications can be observed in various regions around the globe.

With malicious samples geo-location is less important, since such programs are malicious regardless of world region and platform.

The tester might be strongly influenced by the region in which they reside, and needs to be careful not to draw conclusions that are too generalized for the data, or that go beyond the geographical scope of the test samples/scenarios.
Reputation
Reputation information delivered by vendors is often assumed to be the same in all cases, but in reality this is frequently not so. Depending on the business philosophy of each vendor, the reputation of specific grayware may be classified very differently by different products. This aspect should be taken into consideration when classifying the sample, when evaluating results, and when configuring the software under test: certain examples of grayware may not be detected by a particular product by default.
Meeting the Objective of the Test
While gathering samples for the test it is necessary to focus on the purpose of the test and select the samples accordingly. Different testing scenarios require different types of samples. Considerations that need to be taken into account for specific types of test include:
• Static: no data files; in the case of SFX-bundled files it should be considered that the unpacking support of each product might be different and would thus influence the results. These files may not be a problem in dynamic tests/whole-product tests, where the security solution can intercept the malicious content while it is being unpacked.
• Polymorphic malware detection tests: in this case the diversity of samples can be limited, and the number of samples belonging to the same family or variant rather high.
• Potentially unsafe/unwanted applications/adware/spyware tests: this is very subjective, being dependent on the opinions of the vendor and customer, and differs from country to country as well; the testing of these categories of software is very sensitive and controversial and can be influenced by the tester's own opinions. Verifying that samples really fall into this category is extremely resource-intensive and the results could be brought into question. Detailed discussion can be found in the Considerations for Anti-Spyware Product Testing document by the Anti-Spyware Coalition (ASC).
• Dynamic/behavior/whole product: It needs to be confirmed that the samples used in the test exhibit malicious activity in such a way that a product being tested has an opportunity to block this malicious activity. The samples used must exhibit malicious behavior in ways that are reflected in the test methodology. Otherwise there is a risk that products will be penalized for failing to detect or block malware that they would have caught in the real world.
• This also means that the selection of samples cannot be made by removing samples which are detected by one detection method of the product in order to test another detection method of the same product. This approach would be in conflict with the product's architecture and design, and would not reflect real-world protection.
• Exploit prevention: The tester must make sure that the samples actually get to execute the exploitation code – so the system has to be vulnerable, the exploits must match the platform, and the environment must be correctly configured to reflect real-world attack and defense.
• URL blocking/web attack prevention: the validity of URL samples is very transient: a tester must ensure that the samples are valid at the time the URL is used as a test case. Because of the dynamic nature of threats in this form (geo, TTL, server-side polymorphism, platform-specific, browser-specific, time-specific, "served only once"), special care and consideration should be applied.
• In-the-cloud tests: it has to be taken into account that the detection of the sample can be altered by the test itself; for example, the result might depend on the actual filename/path/attributes.
• Clean-set tests/FP tests: To provide a balanced test of user experience, tests need to include looking for false positives by testing against clean applications. These applications should cover the set of common operations that users undertake on their machines, e.g. installing applications, updating applications, running applications, applying operating system patches, and installing and using browser plugins. Installed applications should be run to ensure that they function correctly. The reputation of benign samples can be taken into account, and the clean sets should represent the real-world situation as much as possible (further information can be found in AMTSO's False Positive Testing Guidelines).
• Cleaning tests: Given the high resource requirements of these tests, testers are not able to test against many samples, so sample prevalence is a critical factor, along with diversity and criticality.
• Unpacking tests/SFX tests: in this type of test it is usually difficult to collect a significant and diverse test set from the field. Testers should refer to AMTSO's Issues Involved in the "Creation" of Samples for Testing document for advice on acceptable practices when artificially generated samples are added to the test set.
• Performance tests: Should generally be performed on clean files rather than on malicious ones, but this depends on the methodology (for further details refer to AMTSO's Performance Testing Guidelines document).
• Targeted attacks: the target environment and the attack scenario have to be reconstructed properly, which is usually extremely difficult.
The classification process does entail a great deal of effort and thought; however, this is a prerequisite for sound testing.
Appendix
| Sources | Security Vendor | Security Industry Research/Projects | Commercial Sources | Non-security industry sources | Tester Collected |
| --- | --- | --- | --- | --- | --- |
| Examples | AV company | Security working groups | Sample feeds provided for a fee | ISPs, universities | Honeypots, crawlers |
| Validation is performed | Should not be relied on | Should not be relied on | Should not be relied on | Should not be relied on | Should not be relied on |
| Freshness | May or may not be | Likely | Likely | Likely | Likely |
| Prevalence | May or may not be available | Likely | Unlikely | Unlikely | Unlikely |
| Diversity | Likely | Unlikely | Unlikely | Unlikely | Unlikely |
| Independence (not biased in favour of one or more vendors) | Highly unlikely | Unlikely | Unlikely | Likely | Likely |

Table 1: Characteristics of Different Sample Sources
This document was adopted by AMTSO on February 24, 2012.