Large-Scale Identification of Malicious Singleton Files

Bo Li
Vanderbilt University
[email protected]

Kevin Roundy
Symantec Research Labs
KevinRoundy@symantec.com

Chris Gates
Symantec Research Labs
ChrisGates@symantec.com

Yevgeniy Vorobeychik
Vanderbilt University
[email protected]

ABSTRACT
We study a dataset of billions of program binary files that appeared on 100 million computers over the course of 12 months, discovering that 94% of these files were present on a single machine. Though malware polymorphism is one cause for the large number of singleton files, additional factors also contribute to polymorphism, given that the ratio of benign to malicious singleton files is 80:1. The huge number of benign singletons makes it challenging to reliably identify the minority of malicious singletons. We present a large-scale study of the properties, characteristics, and distribution of benign and malicious singleton files. We leverage the insights from this study to build a classifier based purely on static features to identify 92% of the remaining malicious singletons at a 1.4% false positive rate, despite heavy use of obfuscation and packing techniques by most malicious singleton files, which we make no attempt to de-obfuscate. Finally, we demonstrate robustness of our classifier to important classes of automated evasion attacks.

CCS Concepts
• Security and privacy → Software security engineering;

Keywords
Singleton files; malware detection; robust classifier

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CODASPY'17, March 22-24, 2017, Scottsdale, AZ, USA
© 2017 ACM. ISBN 978-1-4503-4523-1/17/03 ... $15.00
DOI: http://dx.doi.org/10.1145/3029806.3029815

1. INTRODUCTION
Despite continual evolution in the attacks used by malicious actors, labeling software files as benign or malicious remains a key computer security task, with nearly 1 million malicious files being detected per day [?]. Some of the most reliable techniques label files by combining the context provided by multiple instances of the file. For example, Polonium judges a file based on the hygiene of the machines on which it appears [?], while Aesop labels a file by inferring its software-package relationships to known good or bad files, based on file co-occurrence data [?]. These detection technologies are unable to protect customers from early instances of a file because they require the context from multiple instances to label malware reliably, only protecting customers from later instances of the file. Thus, the hardest instance of a malware file to label is its first, and regrettably, the first instance is also the last in most cases, as most malware samples appear on a single machine. In 2015 around 89% of all program binary files (such as executable files with .EXE and .DLL extensions on Windows computers) reported through Norton's Community Watch program existed on only one machine, a rate that has increased from 81% since 2012. To make matters worse, real-time protection must label files that have been seen only once even though they may eventually appear on many other machines, putting the effective percentage of unique files at any given time at 94%.

We present the first large-scale study of singleton files and identify novel techniques to label such files as benign or malicious based on their contents and context. We define a singleton file as any file that appears on exactly one machine. We consider two files to be distinct when a cryptographic hash taken over their contents (such as SHA-256) yields a different result, meaning that two files that differ by a single bit are considered distinct even though they may be functionally equivalent.

Because malware is often polymorphic, many malicious files are among these singletons. However, singleton executable files do not trend towards being malicious; in fact the opposite is true: the ratio of benign to malicious singleton files is 80 to 1, resulting in a heavily skewed dataset. This ratio gives low-prevalence malware a large set of files to hide amongst, and it makes effective classification models difficult to train, as most machine learning models require relatively balanced data sets for effective training. We study the root causes behind the large numbers of benign singleton files in Section ?? and study malicious singletons in Section ??. We study the properties of machines that are prolific sources of benign singleton files in Section ??. We filter obviously benign singletons by profiling the prominent categories of benign singleton files that appear on such systems (Section ??). We present the full machine learning pipeline and the features we use to classify these samples in Section ??. We present experimental results in Section ??.

Since the phenomenon of malicious singleton files was largely driven by the arms race between security vendors and malicious adversaries in the first place, it is important to analyze the robustness of our model against evasion attacks, and we do so in Section ??. We model the interactions between an adversary and our malware detection system as a Stackelberg game [?] and simulate evasion attacks on real singleton files to demonstrate that our proposed pipeline performs robustly against attacker interference.

In summary, we make the following contributions:

1. We provide the first detailed discussion of the role that benign polymorphism plays in making singleton file classification a challenging problem.

2. We identify root causes of benign polymorphism and leverage these to develop a method for filtering the most "obvious" benign files prior to applying malware classification methods.

3. We develop an algorithm that classifies 92% of malicious singletons as such, at a 1.4% false positive rate. We do so purely on the basis of static file properties, despite extensive obfuscation in most malware files, which we make no attempt to reverse.

4. We explore the adversarial robustness of multiple classification models to an important class of automated evasion/mimicry attacks, demonstrating the robustness of a performant set of features derived from static file properties.

2. SINGLETON FILES IN THE WILD
To address the paucity of information about singleton files, we study their causes, distribution patterns, and internal structure. We describe the predominant reasons for which software creators produce benign and malicious singleton files. For benign singletons, we identify the software packages that are the strongest predictors of the presence of benign singleton files on a machine. For malicious software, many singletons are produced from a much smaller base of malware families. Thus, to better understand the nature of the polymorphism that is present in practice across a large body of singleton malware, we study the static properties of malicious singleton files across all malware families and within individual families.

2.1 Dataset Description
In the interest of performing a reproducible study, we perform the following study over data that is voluntarily contributed by members of the Norton Community Watch program, which consists of metadata about the binary files on their computers and an identifier of the machines on which they appear. Symantec shares a representative portion of this anonymized dataset with external researchers through its WINE program [?]. We use an extended window of time from 2012 through 2015 to generate high-level statistics about singleton data, and refer to this dataset as D0. We also use an 8-month window of data from 2014 for a more in-depth analysis of the properties of singleton files and the machines on which they appear; we call this dataset D1. A portion of the files in D1 is labeled with high-confidence benign or malicious labels. We form dataset D2 by selecting a subset of the previous data that consists of labeled singleton files, and for which the file itself is available, allowing us to extract additional static features from the files that we describe in Section ??. The ground-truth labels are generated by manual inspection and other high-confidence evidence. This dataset comprises 200,000 malicious and 16 million benign singleton files, and is the basis of the experimental evaluation of Section ??.

2.2 Benign Singleton Files
The abundance of benign singletons may be surprising given that there are no obvious benefits to distributing legitimate software as singleton files. Of course, some software is rare simply because it attracts few users, as in the case of software on a build machine that performs daily regression tests. However, there are also less obvious, but no less significant reasons behind the large numbers of singleton files, including the following:

1. The .NET Framework seeks to enable localized performance optimizations by distributing software in Microsoft Intermediate Language code so it can be compiled into native executable code by the .NET framework on the machine where it will execute, in a way that is specific to the machine's architecture. This is evident in practice, as .NET produces executables that are unique in most cases. Its widespread use makes it the largest driver of benign singleton files in our data.

2. Many classes of binary rewriting tools take a program binary file as input, producing a modified version as output, typically to insert additional functionality. For instance, tools such as Themida and Armadillo add resistance to tampering and reverse engineering, frequently to protect intellectual property and preserve revenue streams, as in the example of freemium games that require payment to unlock in-game features and virtual currency. Other examples of binary rewriting tools include the RunAsAdmin tool referenced in Table ??, which modifies executables so that administrative privileges are required to run them.

3. In many cases, software embeds product serial numbers or license keys in its files, resulting in a different hash-based identifier for otherwise identical files.

4. Singleton files can be generated by software that produces executable files in scenarios where other file formats are more typically used. For instance, Microsoft's Active Server Pages framework generates at least one DLL for every ASP webpage that references .NET code. Another example is ActiveCode's Building Information Modeling software that creates project files as executables rather than as data files. It is not uncommon for these frameworks to generate thousands of singleton binaries on a single machine.

5. Interrupted or partial downloads can result in files that appear to be singletons, even though they are really prefixes of a larger, more complete file. If the entire file is available for inspection, this can be checked, but our dataset includes metadata for many files that have not been physically collected.

In Figure ?? we show the most common substrings used in benign singleton filenames as extracted from dataset D1, many of which hint at the above factors. In particular, the most-observed filename pattern is "app-web-", which is seen in DLL files supporting web pages created by ASP Web Applications. These files are often singletons because they are compiled from .NET code.

Figure 1: Percent of singleton files containing a specific substring.

Using a subset of the data from dataset D0, we demonstrate in Figure ??(a) that singleton files are not uniformly distributed across systems. The figure shows the number of machines that possess specific counts of singleton and non-singleton files. Figure ??(b) is another way to view the same data, showing that almost 40% of machines have few or no singleton files and more than 94% of the systems have fewer than 100 singletons. Thus, the majority of singleton files come from the heavy tail of the distribution, representing relatively few systems. Note that this data is from a specific period in time, and so machines with low numbers of non-singleton files indicate machines that experienced minimal changes/updates during the period when data was collected.

Figure 2: (a) Number of machines with a specific number of singleton/non-singleton files, (b) percent of machines that report more than X singleton and non-singleton files.

To help us work towards a solution that could identify benign singletons as such, we seek to better understand the machines on which they are most likely to exist. To identify software packages that could be responsible for the creation of singletons, we turn to the clustering approach proposed by Tamersoy et al. [?], which identifies software packages indirectly by clustering software files that are nearly always installed together on a machine, or not at all (see Section ?? for more details). Henceforth, we refer to these clusters as software packages. Once files are so clustered, we proceed by identifying the software packages that are most indicative of the presence or absence of singletons on a machine. Let S denote a specific software package (cluster). We identified a set of 10 million machines from D1, each of which contains at least 10 benign singleton files, which we denote by H (for HasSingletons). Likewise, we identified 10 million machines from D1 with no singleton files, which we denote by N, for NoSingletons. We identify the predictiveness of each software package S by counting its number of occurrences in each of H and N, and use these counts to compute the odds ratio of a machine containing singletons given S: OR(S) = H(S)/N(S), where H(S) and N(S) denote the numbers of machines in H and N, respectively, on which S appears. Intuitively, the higher OR(S) is for a particular software package S, the more likely it is that this (benign) package generates many singletons. An OR(S) ratio that is close to 1 is indicative of a software package that is equally likely to appear on machines that do and do not contain singletons, and therefore probably does not generate singletons itself. On the other hand, an OR(S) that is significantly lower than 1 indicates that machines on which S is installed are tightly controlled or special-use systems unlikely to contain singleton files.
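As a concrete illustration, the odds-ratio computation above could be sketched in Python roughly as follows; the machine-to-package mapping, the identifier names, and the add-one smoothing are our own assumptions rather than the authors' implementation.

```python
from collections import Counter

def odds_ratios(package_installs, has_singletons, no_singletons):
    """Sketch of OR(S) = H(S) / N(S) for each software package S.

    package_installs: dict mapping machine_id -> set of package (cluster) ids
    has_singletons:   set of machine ids with many benign singletons (H)
    no_singletons:    set of machine ids with no singletons (N)
    """
    h_counts, n_counts = Counter(), Counter()
    for machine, packages in package_installs.items():
        if machine in has_singletons:
            h_counts.update(packages)
        elif machine in no_singletons:
            n_counts.update(packages)
    # Add-one smoothing (our choice) keeps the ratio finite when a package
    # never appears in one of the two machine groups.
    return {p: (h_counts[p] + 1) / (n_counts[p] + 1)
            for p in set(h_counts) | set(n_counts)}
```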

Table ?? shows software packages that are strong predictors for the presence (or absence) of benign singletons on a machine. Software packages that correlate with increased numbers of singletons include compiler-related tools (Visual Studio, SoapSuds, SmartClient), tools that wrap or modify executables (RunAsAdmin, App-V), and software packages that include numerous signed singletons (Google Talk Plugin). Interestingly, there are also many software packages that correlate strongly with an absence of singletons on the system. These are indicative of tightly controlled or minimalist special-purpose systems.

Our ability to identify software packages that lead to the presence/absence of many benign singleton files is a critical step towards developing a method for classifying malicious vs. benign singletons. In particular, as described in Section ??, it enables us to prune a large fraction of files as benign before applying machine learning methods, dramatically reducing the false positive rate.

2.3 Malicious Singleton Files
Malware files skew heavily towards low-prevalence files, and towards singleton files in particular. Using D0 we can see that this trend has increased in recent years: 75% of known malware files were singletons in 2012, and the rate increased to 86% by 2015. There are readily apparent reasons why malware files skew towards low-prevalence files, including the following:

1. Avoiding signature-based detection: Users typically want to prevent malware from running on their systems, and blocking a single high-prevalence file is much easier than blocking large numbers of distinct yet functionally equivalent files. Polymorphism is a widespread technique for producing many functionally equivalent program binaries, which aims to reduce the effectiveness of traditional Anti-Virus signatures over portions of the file.

2. Resistance to reverse engineering and tampering: Many malware authors pack, obfuscate or encrypt their binaries, often with the assistance of third-party tools that are inexpensive or free. Polymorphism is often a welcome byproduct of these techniques, though it is not necessarily the primary objective.

3. Malware attribution resistance: The ease with which malware authors can create many functionally equivalent malware files makes the problem of attributing a malicious file to its author much harder than it would be if the same file was used in all instances. For the same reason, polymorphism makes it difficult for security researchers to assess a malware family's reach. Modularity also allows for specific components to be used as needed, without unnecessarily exposing the binary to detection.

Have singleton : Control set OR   Representative Filename          Software Name
13770:1                           Appvux.dll                       Microsoft App-V
11792:1                           Soapsuds.ni.exe                  SoapSuds Tool for XML Web Services
110501:2                          Blpsmarthost.exe                 SmartClient
36515:2                           gtpo3d host.dll                  Google Talk Plugin
13868:1                           Runasadmin.exe                   Microsoft RunAsAdmin Tool
8511:1                            Microsoft.office.tools.ni.dll    Visual Studio
...                               ...                              ...
1:1702                            Policy.exe                       ???
1:4392                            vdiagentmonitor.exe              Citrix VDI-in-a-Box

Table 1: Software packages that are most predictive of the presence/absence of benign singleton files. For succinctness, we represent each software package by its most prevalent filename.

Despite the widespread availability and use of tools that can inexpensively apply polymorphism and obfuscation to malware binaries, the security industry has developed effective techniques to counter them. Much of the polymorphism seen in malware binaries is superficially applied by post-compilation binary obfuscation tools that "pack" the original contents of the malware file (by compressing or encrypting the code), and add layers of additional obfuscation-related code [?]. There are some obfuscation tools that are far more complex than this, but most of them are used almost exclusively by either malicious or by benign software authors. Techniques used by the anti-virus industry to combat these obfuscations are discussed at the end of this section.

To provide additional insight into the nature of malware polymorphism, we study the use of polymorphism by 800 malware families that were observed in the wild in our D1 dataset. Overall, we found that 31% of these families are distributed exclusively as singletons, accounting for over 80% of all singleton malware files, while 60% of families rely exclusively on non-singletons. There is a subtle distinction here: by volume, the 60% of families account for many detections since they are higher prevalence, while the singleton-only families account for a lower percentage of all detections even though they comprise more distinct files, since each such file occurs on only a single system.

To identify malware families that exhibit a high degree of polymorphism, we extracted about 200 static features from the files belonging to each malware family. Our features include most fields in the Portable Executable file header of Windows executable files (such as file size, number of sections, etc.), as well as entropy statistics taken from individual binary sections, and information about dynamically linked external libraries and functions that are listed in the file's Import Table. For each malware family, we calculate a variability score as the average variance of our static features for the files belonging to that family. The families with the highest variability scores are:

• Adware.Bookedspace
• Backdoor.Pcclient
• Spyware.EmailSpy
• Trojan.Usuge!gen3
• W32.Neshuta
• W32.Pilleuz
• W32.Svich
• W32.Tu1ik

These malware families vary greatly in form, function, and scale, though they do share properties that help account for their high variance. In particular, all of these families are modular, infecting machines with multiple functionally different files that are of similar prevalence and have dramatically different characteristics. In all cases, there is at least an order of magnitude difference in file size between the largest and smallest binary. Furthermore, all samples apply binary packing techniques sporadically rather than in all instances.

Backdoor.Pcclient is a Remote Access Trojan and the lowest-prevalence family that has high variance in the static features. Polymorphism is not evident in this family; its elevated variance is a reflection of a modular design, multiple releases of some of those modules, and large differences from one module to another. By contrast, W32.Pilleuz is a very prevalent worm family, but its Visual Basic executables achieve high variance through extensive obfuscation and highly variable file sizes, which add to the worm's modularity and occasional use of packing. W32.Neshuta is particularly interesting in that it infects all .exe and .com files on the machines that it compromises, resulting in many detected unique executables of differing sizes, in addition to its own modular and polymorphic code.

API Purpose                 API Function
Anti-Analysis               IsDebuggerPresent
                            GetCommandLineW
                            GetCurrentProcessId
                            GetTickCount
                            Sleep
                            TerminateProcess
Unpack Malware Payload      GetProcAddress
                            GetModuleHandleW
                            GetModuleFileNameW
Load/Modify Library Code    CreateFileMappingA
                            CreateFileMappingW
                            MapViewOfFile
                            SetFilePointer
                            LockResource
Propagation                 GetTempPathW
                            CopyFileA
                            CreateFileW
                            WriteFile

Table 2: Categories of Windows API functions that are disproportionately used by malware.

The Windows API functions imported by malware files provide interesting insights into their behavior, and are useful as static features, because they are reasonably adversarially resistant. Though malware authors can easily add imports for API functions that they do not need, removing APIs is significantly harder, as these may be needed to compromise the system (e.g., CreateRemoteThread). The only inexpensive way in which a malware file can hide its use of API functions from static analysis is to use a binary packing tool so that its Import Table is not unpacked until runtime, when it is used to dynamically link to Windows API functions. However, this technique completely alters the file's static profile and introduces the static fingerprint of the obfuscation tool, offering an indication that the file is probably malicious. In addition, as discussed at the end of this section, these obfuscations can be reversed by anti-virus vendors.

Table ?? lists the API functions that are most disproportionately used by malware, categorized by the purpose for which malware authors typically use them. Many of these APIs support analysis resistance, either by detecting an analysis environment, hiding behavior from analysis, or actively resisting analysis. Most other APIs that are indicative of malware have to do with linking or loading additional code at runtime, typically because the malware payload is packed, but also for more nefarious purposes, such as malicious code injection and propagation.

Anti-Virus Industry Response to Obfuscation: The anti-virus industry has sought to adapt to malware's widespread use of obfuscation tools by applying static and dynamic techniques to largely reverse the packing process in a way that preserves many of the benefits of static analysis. In particular, these techniques allow malicious code to be extracted, along with the contents of the Import Address Table, which contains the addresses of functions imported from external dynamically linked libraries. Unpacking techniques include the "X-Ray" technique, which may be used to crack weak encryption schemes or recognize the use of a particular compression algorithm and reverse its effects [?]. Most unpacking techniques, however, have a dynamic component and can be broadly classified into emulators and secure sandboxes. Emulators do not allow malicious files to execute natively on the machine or to execute real system calls or Windows API calls, but provide a good approximation of a native environment nonetheless. They are frequently deployed on client machines so that any suspicious file can be emulated long enough to allow unpacking to occur, after which the program's malicious payload can be extracted from memory and the de-obfuscated code can be recovered and analyzed. Offline analysis of suspicious program binaries typically uses a near-native instrumented environment where the malware program can be executed and its dynamically unpacked malicious payload can be extracted [?]. Though there are more elaborate obfuscation schemes that can make executable files difficult to unpack with the aforementioned techniques, these are either not widely deployed (e.g., because they are custom-built for the malware family) or are used predominantly by benign or malicious software, but not both. Thus, effective benign vs. malicious determinations can be made even in these cases, because the obfuscation toolkits leave a recognizable fingerprint.

Though the effectiveness of the above de-obfuscation techniques is open to debate, in our methodology for this paper we make the deliberate choice to use no de-obfuscation techniques at all in our attempts to classify singleton files. We demonstrate that malware classification based purely on static features can be successful, even in the face of extensive polymorphism, by good and bad files alike. The success we achieve demonstrates that the obfuscation techniques that are widely used by malware are themselves recognizable, and appreciably different from the kinds of polymorphism that are common in benign files. We expect that the classification accuracy of our methodology would improve when applied to files that have been de-obfuscated, given that other researchers have found this to be the case [?].

3. LEARNING TO IDENTIFY MALICIOUS SINGLETONS
Most prior efforts for identifying malicious files have either relied on the context in which multiple instances of the file appear (e.g., the Polonium [?] and Aesop [?] systems) or have relied exclusively on static or dynamic features derived from the file itself (e.g., MutantX-S [?]). The context that is available for a singleton file is necessarily limited, making the aforementioned context-dependent techniques not applicable. Making matters worse is the fact that the ratio of benign to malicious singleton files is nearly 80:1, which has the effect of multiplying the false positive rate of a malware detector by a factor of 80, and presents a significant class imbalance problem that makes effective classifier training difficult.

To address the lack of context for singleton files and the preponderance of benign singleton files, we leverage insights gleaned from our empirical observations about singleton files in the wild. In particular, as discussed in Section ??, a handful of software packages generate the lion's share of benign singletons, while other packages correlate with their absence. Furthermore, the toolchains that generate benign singletons in large numbers imbue them with distinctive static properties that make them easy to label with high confidence. We use these insights to develop a pipeline that filters benign singleton files with high confidence, yielding a more balanced dataset of suspicious files.

Figure 3: Pipeline of the singleton classification system.

Figure ?? presents a diagram of our pipeline. We take as input a pair (f, m), where f is a file and m is the machine on which it resides. The first step of the pipeline, which we call machine profiling, determines whether m is likely to host many benign singleton files. The second step is file profiling, in which we label obviously benign files, primarily from many-singleton machines, by determining that they closely match the benign files that are common on such systems. The final step, classification, uses a supervised classification algorithm (we explore the use of Support Vector Machines [?] and Recurrent Neural Networks [?]) to render a final judgment on the remaining files. We proceed by describing each of our pipeline's components in detail.

3.1 Machine profiling
Machine profiling operationalizes the following insight gleaned from our empirical observations: since the distribution of benign singletons is highly non-uniform, singleton classification will benefit from identifying machines that are likely to host many benign singletons. As discussed in Section ??, the software packages present on a machine are highly predictive of the presence or absence of benign singletons.


The first challenge we face is that of automatically identifying software packages from telemetry about installations of individual program binary files. In mapping individual files to software packages, we wish to proceed in an automated way that is inclusive of rare software that is not available for public download. Our approach adopts the clustering portion of the Aesop system described by Tamersoy et al. [?], in which they leverage a dataset consisting of tuples of file and machine identifiers, each of which indicates the presence of file f on machine m. Specifically, let F be a set of (high-prevalence) files (in the training data). For each file f ∈ F, let M(f) be the set of machines on which f appears. As Aesop did, we use locality sensitive hashing [?] to efficiently and approximately group files whose M(f) sets display low Jaccard distance to one another. The Jaccard distance between two sets X and Y is defined as J(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|, and we define the distance between two files f and f′ in terms of Jaccard distance as d(f, f′) = J(M(f), M(f′)). We tune locality sensitive hashing to cluster files with high probability when the Jaccard distance between the files is less than 0.2, and to cluster them very rarely otherwise. We obtain a collection of clusters C, such that each cluster C ∈ C serves as an approximation of a software package, since C represents a collection of files that are usually installed together on a machine or not at all.
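One way this clustering step could be approximated in practice is with MinHash-based locality sensitive hashing; the sketch below uses the open-source datasketch library as a stand-in for the authors' implementation, with the 0.8 similarity threshold mirroring the 0.2 Jaccard-distance cutoff described above.

```python
from datasketch import MinHash, MinHashLSH

def cluster_files_by_machines(file_to_machines, num_perm=128, threshold=0.8):
    """Group files whose machine sets M(f) have Jaccard similarity >= threshold,
    i.e., Jaccard distance <= 0.2 (an approximation of the Aesop-style clustering)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    signatures = {}
    for f, machines in file_to_machines.items():
        mh = MinHash(num_perm=num_perm)
        for machine_id in machines:
            mh.update(str(machine_id).encode("utf-8"))
        lsh.insert(f, mh)
        signatures[f] = mh
    # Greedily turn LSH candidate neighborhoods into disjoint clusters.
    clusters, assigned = [], set()
    for f, mh in signatures.items():
        if f in assigned:
            continue
        group = {g for g in lsh.query(mh) if g not in assigned} | {f}
        assigned |= group
        clusters.append(group)
    return clusters
```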

We proceed by identifying the approximate software packages that are the best predictors for the presence of singleton files. We formulate this task as a machine learning problem. We define a feature vector for each machine m that encodes the set of software packages that exist on m. Specifically, given n clusters (software packages), we create a corresponding binary feature vector s_m of length n, where the j-th entry of s_m is 1 iff cluster j is present on machine m. Next, we append a label l_m to our feature vector such that we have {s_m, l_m} for each machine, with feature vectors s_m corresponding to machines and labels l_m ∈ {H, N} representing whether the associated machine has benign singletons (label H) or has no singletons (label N). With this dataset in hand, we are able to train a simple, interpretable classifier to predict l_m to good effect. Had we used individual files as predictors, we would have to choose a machine learning algorithm that behaves well in the presence of strongly correlated features, but software package identification dramatically reduces feature correlation. Thus, we select Naive Bayes as our classifier g(s), which performs well and gives us significant insight into the software packages that are the best indicators of the presence or absence of benign singleton files, as reported in Table ??. Our classifier takes as input a feature vector s that represents the software packages on a given machine, and outputs a prediction as to whether or not the machine has benign singletons. To achieve a balanced dataset, we randomly selected 2,000,000 uninfected machines, half of which contain singletons and half of which do not.
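A minimal sketch of this machine-profiling step, using scikit-learn's Bernoulli Naive Bayes as a stand-in for the classifier g(s); the 0/1 label encoding and variable names are our own choices.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def train_machine_profiler(machine_packages, labels, n_clusters):
    """machine_packages: one set of package (cluster) ids per machine.
    labels: 1 if the machine has benign singletons (H), 0 if it has none (N)."""
    X = np.zeros((len(machine_packages), n_clusters), dtype=np.int8)
    for i, packages in enumerate(machine_packages):
        X[i, list(packages)] = 1        # binary indicator vector s_m
    clf = BernoulliNB()
    clf.fit(X, labels)
    return clf
```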

3.2 File profiling
Given a classifier g(m) that determines whether a machine m is expected to host many benign singletons, the next step in our pipeline, file profiling, uses this information to identify files that can be confidently labeled benign. The result is both a more balanced dataset that makes our pipeline's classifier easier to train, as well as a high-confidence labeling technique that reduces the classifier's false positive rate. The main intuition behind our proposed file profiling method is that benign singleton files bear the marks of the specific benign software packages that generate them. Of course, different software generates singletons with dramatically different file structures and file-naming conventions. Consequently, we seek to identify prototypical benign singletons by clustering them based on their static properties, and filter benign files that closely match these prototypes. Since the information we have about the software installed on any given machine is typically incomplete, we filter benign files that closely match benign-file prototypes on all machines, but require much closer matches on machines where benign singletons are not expected. This point is operationalized below through the use of a less aggressive filtering threshold for machines m labeled as N (no benign singletons) than for machines labeled H (having benign singletons).

The full path, filename, and size of singleton files are the primary static attributes that we use in our file profiling study. We had little choice in this case because large collections of labeled benign singleton files that security companies share with external researchers are extremely hard to come by, and are limited in the telemetry they provide. In the interest of conducting a reproducible experiment, we limit ourselves to the metadata attributes provided for files in dataset D1 (see Section ??) that Symantec shares with external researchers through its WINE program [?]. Though D1 gives us a representative dataset of singleton files, it also limits us to a small collection of metadata attributes about files, of which the path, filename, and size are the most useful attributes. In modest defense of the use of filename and path as a feature, though it is true that a malicious adversary can trivially modify the malware's filename (and the path, to a lesser extent), the malware author would frequently have to do so at the cost of losing the social engineering benefit of choosing a filename that entices the user to install the malware.

Due to the feature limitations of the file profiling step, we proceed by developing techniques to maximize the discriminative value of the path and filename. We seek to leverage the observation that a handful of root causes create a significant majority of benign singletons, and these origins are often strongly evident in the filename and path of benign singletons. Although malware files display significantly more diversity in their choice of filenames, these filenames typically bear the marks of social engineering, and their paths are frequently reflective of the vector by which they managed to install themselves on the machine, or are demonstrative of attempts to hide from scrutiny. Accordingly, we engineer features from filenames and paths to capture the naming conventions used by benign singletons. Given a file f, we divide its filename into words using chunking techniques. Specifically, we identify separate words within each file name that are demarcated by whitespace or punctuation, separate words based on CamelCase capitalization transitions, and so on. Subsequently, we represent the filename and path components in a "bag of words" feature representation that is physically represented as a binary vector, where the existence of a word in the filename or path corresponds to a 1 in the associated feature, and a 0 indicates that the word is not a part of its name. In addition, we capture the relative frequencies of the words that appear in filenames by measuring the term frequency (TF) of each word. Term frequency is then used as part of the weighted Jaccard distance measure used to cluster files, as described below. More formally, let T ⊆ R^n represent the feature space of the singleton files, with n the number of features. Each singleton file f can be represented by a feature vector t, which is the element-wise product of a binary bag-of-words vector w and the normalized term frequency vector q corresponding to each word, t = w · q, where t_j is the j-th feature value. Note that we exclude words that appear extremely frequently, such as exe, dll, and setup, as stop words, to prevent the feature vector t from becoming dominated by them. For any two files f1 and f2, the weighted Jaccard distance between them is then calculated as

J(f1, f2) = 1 − [ Σ_k min(f1_k, f2_k) ] / [ Σ_k max(f1_k, f2_k) ].
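A short Python sketch of this weighted Jaccard distance, assuming each file's TF-weighted bag of words is stored as a word-to-weight dictionary:

```python
def weighted_jaccard_distance(t1, t2):
    """Weighted Jaccard distance between two TF-weighted bag-of-words vectors,
    represented as {word: weight} dicts."""
    words = set(t1) | set(t2)
    num = sum(min(t1.get(w, 0.0), t2.get(w, 0.0)) for w in words)
    den = sum(max(t1.get(w, 0.0), t2.get(w, 0.0)) for w in words)
    return 1.0 if den == 0 else 1.0 - num / den
```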

We use the weighted Jaccard distance to cluster benign singleton files in the training data using the scalable NN-Descent algorithm [?] implemented on Spark [?], which efficiently approximates K-Nearest Neighbors and produces clusters C of highly similar files. (Note that this clustering of files is entirely distinct from the clustering used in machine profiling, where non-singleton files are clustered based on the machines on which they appear.) We gain further efficiency and efficacy by choosing a bag-of-words representation over edit distance when making filename comparisons. This approach also has the benefit of producing an understandable model that identifies the most frequent filename patterns present in benign singleton files, such as those highlighted in Figure ??.

The final step in the file profiling process is to use the clusters derived above to filter benign files that align closely with the profile of benign singletons. To this end, for each benign singleton cluster c ∈ C, we compute the cluster mean c̄ = (1/|c|) Σ_{t_j ∈ c} t_j. For a given file f, we then find the cluster c∗ whose mean c̄ is least distant from f, where distance is again measured by the weighted Jaccard distance J(c̄, f). Then, if file f resides on a machine m that is expected to have singletons (that is, g(m) = H as defined in Section ??), we filter it as benign iff J(c∗, f) ≤ θH; otherwise, it is filtered iff J(c∗, f) ≤ θN, where θH and θN are the corresponding filtering thresholds.
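The filtering rule might look roughly as follows, assuming dense feature vectors and the training-time thresholds quoted in the next paragraph; this is an illustrative sketch, not the deployed code.

```python
import numpy as np

def filter_benign(file_vec, cluster_means, machine_label, theta_h=0.3, theta_n=0.1):
    """Return True if the file is filtered as 'obviously benign'.

    file_vec and cluster_means are dense TF-weighted vectors; machine_label is
    'H' or 'N' from the machine-profiling classifier g."""
    def wjd(a, b):  # weighted Jaccard distance
        den = np.maximum(a, b).sum()
        return 1.0 if den == 0 else 1.0 - np.minimum(a, b).sum() / den

    nearest = min(wjd(file_vec, c) for c in cluster_means)
    threshold = theta_h if machine_label == "H" else theta_n
    return nearest <= threshold
```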

We select different θ values for the training and final versions of our pipeline. For training, our primary goal is to reduce the 80:1 benign-to-malicious class imbalance ratio so that we can train an effective classifier, whereas for testing, our goal is to achieve a high true positive rate while minimizing false positives. For purposes of creating a balanced training set, we select θN = 0.1 and θH = 0.3, which filters 91.8% of benign singletons, resulting in a more manageable 9:1 class imbalance ratio, at the cost of 7% of malware samples being thrown out of our training set. However, this does not affect the performance of our model adversely, since during testing we can be less aggressive with the thresholds and pass more files to the classifier. In practice, we found that values around θN = 0.07 and θH = 0.13 result in the best performance over the test data.

3.3 Malicious singleton detection
Having filtered out a large portion of predicted benign file instances, we are left with a residual data set of benign and malicious files that we classify using supervised-learning techniques. Though the filtering of benign files by the previous stages of our pipeline provides better class balance, we found that significant improvements in classification accuracy result when the residual data set is augmented by including 3 benign files that we sample randomly from each cluster c ∈ C generated in the file profiling step. Doing so improves the classifier by adding additional benign files that are representative of the overall population of benign singleton files. We trained multiple classification algorithms with different strengths to determine which would be most effective at singleton classification.

Feature engineering is also key to the performance of our classifiers. Whereas machine and file profiling were designed for a backend system where a global view of the distribution of benign and malicious singleton files is available, here we design a classifier that we can deploy on client machines, based entirely on the static features of the file. Hence, we assume direct access to the files themselves and can build rich feature sets over the files, so long as they are not expensive to compute. This is in contrast to the telemetry used for machine and file profiling, for which network bandwidth constraints and privacy concerns limited the telemetry that could be collected. As mentioned in Section ??, we make no attempt to reverse the effects of obfuscation attempts employed by malware, finding that the use of the obfuscation techniques themselves provides strong discriminative power that helps us to disambiguate between benign and malicious singletons.

Features.
The features used by our learning algorithms to classify singleton program binary files fall into four categories.

1. The first category of features corresponds to features of the file name and path. For these we use the same file name and path bag-of-words feature representation as in the file profiling step of Section ??. To reduce the number of features included in our model, we applied chi-squared feature selection to choose the most discriminative features [?].

2. The second category of features is derived from the header information of the executable file. We include all fields in the headers that are common to most Windows executable files and that exhibit some variability (some header fields never change). These header fields include the MS-DOS Stub, Signature, the COFF File Header, and the Optional Header (which is optional but nearly always present) [?].

3. We derive features from the Section Table found in the file's header, which describes each section of the binary, and also compute the entropy of each of the file's sections as features.

4. Our final category of features is derived from the external libraries that are dynamically linked to the program binary file. To determine which libraries the file links to, we create a feature for each of the most popular Windows library files (primarily Windows API libraries) that represents the number of functions imported from the library. We also create binary features for the individual functions in common Windows libraries that are most commonly used by malware. These take a value of 1 when the function is imported and 0 otherwise.

In all, category 1’s bag of words features for filename andpath consist of 300 features, while category 2,3, and 4 fea-tures together comprise close to 1000 features.

Classification.
We apply two learning models, a Recurrent Neural Network (RNN) [?] and a Support Vector Machine with a radial basis function as its kernel [?], and compare their performance and ability to withstand adversarial manipulation in Section ??. The RNN model is particularly suited for textual data, so we train it solely using file name and path information as features. Given the sequential properties of the file name text, RNNs aim to make use of the dependency relationship among characters to classify malicious vs. benign singletons. The goal of the character-level learning model is to predict the next character in a sequence and thereby classify the entire sequence based on the character distribution. Here, given a training sequence of characters (a_1, a_2, ..., a_m), the RNN model uses the sequence of its output vectors (o_1, o_2, ..., o_m) to obtain a sequence of distributions P(a_{k+1} | a_{≤k}) = softmax(o_k), where the softmax distribution is defined by P(softmax(o_k) = j) = exp(o_k(j)) / Σ_l exp(o_k(l)). The learning model's objective is to maximize the total log likelihood of the training sequence, which implies that the RNN learns a probability distribution over the character sequences used in a full path + filename.
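The training objective can be made concrete with a short NumPy sketch that turns the RNN's output vectors into softmax distributions and sums the log probabilities of the observed next characters; the array shapes and names here are our own assumptions.

```python
import numpy as np

def sequence_log_likelihood(outputs, next_char_ids):
    """Log likelihood of a character sequence given RNN output vectors.

    outputs: array of shape (m, vocab_size) holding the o_k vectors.
    next_char_ids: integer ids of the characters each o_k should predict."""
    logits = outputs - outputs.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(log_probs[np.arange(len(next_char_ids)), next_char_ids].sum())
```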

For the SVM model, we apply the text chunking technique described in Section ??, and use the bag-of-words representation as described above, concatenated with static and API-based features, where relevant. While numerous other classification algorithms could be used here, our purpose in exploring RNN and SVM specifically is to contrast an approach specifically designed for text data (making use of filename and path information exclusively) with a general-purpose learning algorithm that is known to perform well in malware classification settings [?].
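A minimal scikit-learn sketch of such an RBF-kernel SVM over the concatenated feature matrix; the scaling step, class weighting, and other hyperparameters are our own assumptions rather than the paper's settings.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """X_train: bag-of-words name/path features concatenated with static and
    API-import features; y_train: 1 for malicious, 0 for benign."""
    model = make_pipeline(
        StandardScaler(with_mean=False),   # keeps sparse inputs sparse
        SVC(kernel="rbf", class_weight="balanced"),
    )
    model.fit(X_train, y_train)
    return model
```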

Putting Everything Together.
The high-level algorithm for the entire training pipeline is shown in Algorithm ??.

Algorithm 1 Train({Str, Ztr, Mtr, Ytr}):
1: g = machineProfiling({Str, Ztr, Mtr, Ytr})
2: (D, θH, θN, C) = fileProfiling({Str, Ztr, Mtr, Ytr}, g)
3: h = learnClassifier(D)
4: return g, h, θH, θN, C

The input to this algorithm is a collection of tuples {si, zi, mi, yi} ∈ {S, Z, M, Y} describing file instances on machines, which are partitioned into training (tr) and testing (te) sets for the pipeline. Each file instance is represented by si, the 256-bit digest of a SHA-2 hash over its contents, and by the size zi of the file in bytes. The machine is represented by a unique machine identifier mi, and each instance of the file receives a label yi, which designates a file as benign, malicious, or unknown. Machine profiling processes the file-instance data to identify singleton files (those for which only one instance exists) from more prevalent software that it groups into packages and uses to predict the presence or absence of singletons. The end result of training the pipeline includes the two classifiers: g classifies machines into H (has benign singletons) and N (no benign singletons), while h classifies files as malicious or benign, trained on the selected representative data D. Additional by-products include the clusters of benign files C and the thresholds θH and θN that determine how aggressively files projected to be benign are filtered before the classifier h is applied.

Our test-time inputs include a set of singleton files that we withheld from training, along with our trained model parameters; the procedure simply returns whether to label each file as benign or malicious. The specifics of the associated testing process, which makes use of the outputs of our training pipeline, are given in Algorithm ??.

Algorithm 2 Predict({Ste, Mte}, g, h, θH, θN, C):
1: l = g(Mte) // label the machine as H or N
2: c∗ = arg min_{c∈C} J(Ste, c) // find closest cluster center to Ste
3: if J(Ste, c∗) ≤ θl then
4:   return B // "benign" if Ste is close to a benign cluster center
5: end if
6: return h(Ste) // otherwise, apply the classifier
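For illustration, Algorithm ?? might be glued together in Python roughly as follows, with g and h standing for the trained machine- and file-level classifiers from the training pipeline; this is a sketch of the control flow, not the authors' implementation.

```python
import numpy as np

def predict_singleton(file_vec, machine_vec, g, h, theta_h, theta_n, cluster_means):
    """Label one singleton file as 'benign' or 'malicious' (sketch of Algorithm 2)."""
    def wjd(a, b):  # weighted Jaccard distance, as defined above
        den = np.maximum(a, b).sum()
        return 1.0 if den == 0 else 1.0 - np.minimum(a, b).sum() / den

    label = "H" if g.predict([machine_vec])[0] == 1 else "N"      # step 1
    nearest = min(wjd(file_vec, c) for c in cluster_means)         # step 2
    if nearest <= (theta_h if label == "H" else theta_n):          # steps 3-5
        return "benign"
    return "malicious" if h.predict([file_vec])[0] == 1 else "benign"  # step 6
```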

4. EXPERIMENTAL EVALUATION
We conduct experiments on a large real-world dataset, dataset D2 as described in Section ??, to evaluate the proposed pipeline as well as analyze the robustness of the learning system. As mentioned above, in implementing and deploying such a system in practice, we face a series of tradeoffs. The first is how much information about each file we should be collecting. On the one hand, more information will likely improve learning performance. On the other hand, collecting and analyzing data at such scale can become extremely expensive, both financially and computationally. Moreover, collection of detailed data about files on end-user machines can become a substantial privacy issue. For all of these reasons, very little information is traditionally collected about files on end-user systems, largely consisting of file name and an anonymized path, as well as file hashes and the machines they reside on. For a subset of files, deeper information is available, including static features as well as API calls, as discussed above. However, these involve a significant cost: for example, extracting API calls requires static analysis. Our experiments are therefore designed to assess how much value these additional features have in classification, and whether or not it is truly worthwhile to be collecting them at the scale necessary for practical deployment. Since ours is the first work to deal with the singleton malware detection problem, we compare our proposed method with standard machine learning algorithms in various settings. Our evaluation applies Machine Profiling (MP), File Profiling (FP), an RNN based only on file name features, an SVM based on file name features, an SVM based on both file name and static features (SVMS), and an SVM based on file name, static, and API function features (SVMSF).

4.1 Baseline Evaluation
Our first efficacy study demonstrates the benefit provided by our machine learning pipeline as compared to two natural baselines. Our first baseline applies machine and file profiling, ranking all examples based on their similarity to benign files, and identifying the samples that are furthest from benign cluster centers as malicious. Our second baseline is our best-performing classifier trained over our entire feature set (SVMSF), but trained without the benefit of an initial machine/file profiling step, which reduces the ratio of benign to malicious files from an 80:1 ratio to a 9:1 ratio. This baseline is similar to prior work in malware classification based on static features [?]. As seen in Figure ??, our full pipeline demonstrates clear improvement over the two baselines, with a significantly higher AUC score. The spot on the curve with the maximal F0.5 score achieves a 92.1% true positive rate at a 1.4% false positive rate, a dramatic improvement over applying FP or SVMSF on its own. Different locations on the ROC curve are achieved by selecting increasing values for θN and θH. The maximal F0.5 score is achieved with θH = 0.13 and θN = 0.07.
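For reference, the F0.5 score used to select this operating point weights precision more heavily than recall (beta < 1) and can be computed with scikit-learn; the labels below are toy data, not results from the paper.

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # toy ground truth (1 = malicious)
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]   # toy predictions
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.625: precision 2/3, recall 1/2
```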

Though uninformed downsampling of benign files may reasonably be suggested as an alternative means to reduce the class imbalance and achieve better classification results with SVMSF, our attempts to do so resulted in classifiers that perform worse than the SVMSF classifier of Figure ??. The reason is likely that downsampling decimates small clusters of benign files, resulting in a model that represents benign singletons only by their most massively populated clusters. Our pipeline can be thought of as providing an informed downsampling of benign files that reduces massively populated clusters of benign files to a few prototypes, allowing the SVM to train a model that represents the full gamut of benign singletons, with the additional benefit of doing so over a more balanced dataset.
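As a rough illustration of this informed-downsampling effect (a simplified sketch rather than the exact procedure used in our pipeline), large benign clusters can be collapsed to a handful of prototype points while small clusters are kept intact; the max_prototypes parameter is an assumption.

```python
# Simplified sketch of prototype-based reduction of benign clusters (parameters assumed).
import numpy as np
from sklearn.cluster import KMeans

def prototype_downsample(X_benign, cluster_ids, max_prototypes=10):
    """X_benign: dense feature matrix of benign files; cluster_ids: file-profiling cluster labels."""
    prototypes = []
    for cid in np.unique(cluster_ids):
        members = X_benign[cluster_ids == cid]
        if len(members) <= max_prototypes:
            prototypes.append(members)                      # small clusters survive intact
        else:
            km = KMeans(n_clusters=max_prototypes, n_init=10).fit(members)
            prototypes.append(km.cluster_centers_)          # big clusters become a few prototypes
    return np.vstack(prototypes)
```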

Figure 4: (a) ROC-curve comparison of the pipeline performance with the two baselines: no machine/file profiling, and only machine/file profiling. (b) Comparisons for models with different features, without an attacker.

4.2 Evaluating Performance of the Classification Step

To assess the relative importance of the three classes of features (text, static, and API) used by our model, we analyze the relative performance of just the last classification step of four models on the dataset produced by MP and FP filtering: 1) RNN (using text features only), 2) SVM (using text features only), 3) SVM with both text and static features, and 4) SVM with text, static, and API features.

To highlight the performance differences between these classifiers, we evaluate them over a test set of singletons from which obviously benign singletons have been pre-filtered by file profiling (for this reason, this figure does not reflect the overall performance of our pipeline as reported in Figure ??). Our first observation is that the RNN outperforms the SVM when only textual features are used, which is not surprising, given that RNNs are particularly well suited to text data. Second, our model's performance drops when training over the filename and anonymized path plus static features, which demonstrates the high discriminative value of the filename and anonymized path relative to features derived from header information in the executable. However, these static features do offer value when we account for the potential for adversarial manipulation, as discussed in Section ??. Third, the value of features based on imported API functions is evident in the performance of the SVMSF model compared to all other models, particularly when we choose a threshold that limits the false positive rate, as security vendors are prone to do: the precision and recall scores that produce a maximal F0.5 score for SVMSF are 83% recall at a 1% false positive rate, compared to 76% recall at a 5% false positive rate for the RNN, which is this model's closest competitor on an Area Under the Curve (AUC) basis. Note that the performance of the full pipeline is better than either of these classifiers alone (see Figure ??), because many of the benign files that cause the false positives are labeled correctly by the machine and file profiling steps. Finally, our adversarial evaluation of these classifiers (Section ??) offers additional justification for incorporating static and imported-function-based features into our model.

We evaluated the run-time required to train each step of our pipeline, including Machine Profiling (MP), File Profiling (FP), and the selected classifier, which is one of the following: RNN, SVM (based only on file name), SVMS, or SVMSF. The run-time of each step, when performed on a single powerful machine, is illustrated in Figure ??. Training Machine Profiling and File Profiling is fairly expensive; however, these two steps can be done offline and updated incrementally as new data arrives. Training the SVM classifiers is inexpensive, whereas training the RNN takes on the order of three hours with GPU acceleration. Though we do not believe that this is a cause for concern, the inferior performance of the RNN compared to SVMSF makes it less appealing for inclusion in the final version of our pipeline. We do not include test-time performance evaluation, since the cost to test a single file is negligible for all stages of the pipeline.

Figure 5: Comparisons of the runtime of different components within the pipeline.

4.3 Adversarial Evaluation

Though the evaluation of our classifiers presented in Figure ?? (b) is fairly typical for a malware classification tool, it is not necessarily indicative of the long-term performance of a classifier once it has been massively deployed in the wild. In particular, what is missing is an evaluation of the ability of our classifier to withstand the inevitable attempts of malware authors to respond to its deployment by modifying their malicious singleton files to mimic benign file patterns in order to evade detection. Whereas researchers have traditionally discussed an algorithm's robustness to evasion based on subjective arguments about the strength or weakness of individual features, the now well-developed body of research on adversarial machine learning provides more rigorous methods for evaluating the adversarial robustness of a machine learning method [?, ?], and provides guidelines for developing more adversarially robust learning techniques [?, ?].

We proceed by evaluating our model's adversarial robustness. Evaluating the adversarial resistance of a classifier presupposes a given classifier h that outputs, for a given feature vector x, a label h(x) ∈ {−1, +1}, where in our case −1 represents a benign prediction and +1 represents a malicious prediction. Given h, the adversary is modeled as aiming to minimize the cost of evasion,

    x∗ = arg min_{x′ : h(x′) = −1} c(x, x′),

where c(x, x′) is the cost of using a malicious instance x′ in place of x to evade h (by ensuring that h(x′) = −1, that is, that the malicious file will be classified as benign). The optimal evasion is represented by x∗. Because this model always results in a successful evasion, no matter its cost, we follow a more realistic model presented by Li and Vorobeychik [?], in which evasion occurs only when its cost is within a fixed adversarial budget B, i.e., c(x, x∗) ≤ B. We focus mainly on binary features here and prioritize, for modification, the features whose values most clearly distinguish malicious from benign files, concentrating the adversary's budget on the features that are most useful to modify under the assumption that the adversary knows how to mimic benign software. In effect, we assume that the adversary will evade detection only if the gains from doing so outweigh the costs. The budget represents the percentage of the total number of features that the attacker is able to modify. A natural measure of the evasion cost c(x, x′) is the weighted l1 distance between x and x′:

    c(x, x′) = Σ_i a_i ‖x_i − x′_i‖.

The choice of weights can be difficult to determine in a principled way, although some features will clearly be easier for an adversary to modify than others. We use a_i = 1 for all features i below as a starting point. As we will see, this already provides substantial evidence that a classifier using solely filename-based features is extremely exploitable by an adversary, without even accounting for the fact that such features are also easier for malware authors to modify than, say, the functions they import from the Windows API and other libraries.
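The attacker model above can be sketched as a simple greedy procedure: flip binary features in priority order, at unit cost each, until either the file evades the classifier or the budget (a fraction of the total feature count) is exhausted. The sketch below is an illustration under the stated assumptions (unit weights a_i = 1, binary features), not the exact attack implementation used in our experiments.

```python
# Greedy budget-constrained evasion under the weighted l1 cost with a_i = 1 (illustrative).
import numpy as np

def evade(x, h, feature_priority, budget=0.05):
    """x: binary feature vector of a malicious file; h: classifier returning +1/-1;
    feature_priority: feature indices ordered by how strongly they indicate benign files;
    budget: fraction of features the adversary may modify."""
    x_adv = x.copy()
    max_flips = int(budget * len(x))
    flips = 0
    for i in feature_priority:
        if h(x_adv) == -1:          # already classified benign: evasion succeeded
            break
        if flips >= max_flips:      # budget exhausted: evasion fails
            break
        x_adv[i] = 1 - x_adv[i]     # each flip costs 1 under a_i = 1
        flips += 1
    return x_adv if h(x_adv) == -1 else x   # revert if the budget was insufficient
```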

Figure 6: Comparisons for models with an attacker budget of (a) 5% and (b) 10%.

We now compare the same classifier and feature combinations presented in Section ??, but evaluate these classifiers under evasion attacks, as shown in Figures ?? (a) and (b) with budgets B = 5% and B = 10%, respectively. These figures highlight a significant trend: whereas the RNN's performance was previously rather close to that of the SVM with filename, static, and imported function features, the former displays poor adversarial resistance, while the latter is far more robust. The RNN's AUC drops to 0.857 under pressure from the weaker attacker, and to 0.78 when pressured by the stronger one, whereas the AUC of the SVM with the largest feature set drops only to 0.92 under the smaller adversarial budget, and to 0.88 with the larger one. The SVM based only on filename features performs even worse than the RNN. Interestingly, while adding static features (but not imported function features) to the SVM degrades its adversary-free performance, this classifier performs considerably better than the RNN and the filename-only SVM in the presence of an adversary.
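In outline, the adversarial comparison proceeds as follows: each malicious test file is replaced by its best evasion attempt within the budget, and the ROC AUC is recomputed over the attacked test set. The sketch below reuses the hypothetical evade() helper above, and model_score is an assumed real-valued maliciousness score (e.g., an SVM decision value).

```python
# Sketch of AUC evaluation under evasion attacks (reuses the illustrative evade() above).
from sklearn.metrics import roc_auc_score

def auc_under_attack(model_score, X_test, y_test, feature_priority, budget):
    """model_score(x) -> real-valued score; y_test uses +1 (malicious) / -1 (benign)."""
    h = lambda x: 1 if model_score(x) >= 0 else -1
    attacked = [evade(x, h, feature_priority, budget) if y == 1 else x
                for x, y in zip(X_test, y_test)]          # only malware adapts
    return roc_auc_score(y_test, [model_score(x) for x in attacked])
```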

In summary, our experimental results point consistently to the use of a support vector machine with features derived from the filename, path, static properties of the file, and imported functions as the model that performs best, even against an active adversary. Thus, the best version of our overall pipeline leverages this support vector machine as its classifier, achieving the overall performance results shown in Figure ??.

5. RELATED WORK

The problem of detecting malicious files has been studied extensively. Perdisci et al. have dealt with the static detection of malware files [?] and with malware clustering using HTTP features [?]. Other malware detection systems have also been proposed [?, ?, ?]. Particularly relevant is work designed to deal with low-prevalence malware. This prior art includes work designed to reverse the effect of packing-based obfuscation tools by either statically decompressing or decrypting the malicious payload [?], or simply executing the program until it has unrolled its malicious payload into main memory [?]. At that point, traditional anti-virus signatures may be applied [?], and clustering may serve to identify new malicious samples based on their similarity to known malicious samples [?, ?]. By contrast, we make no effort to undo obfuscation attempts, which are frequently evidence of malicious intent. Whereas these researchers have focused on the causes behind low-prevalence malware, we augment this work by providing the first detailed study of benign singleton files.

The importance of an adversarially robust approach to malicious singleton detection is evident, given that the high volume of singleton malware is largely the byproduct of adaptations to anti-virus technology [?, ?, ?]. Researchers have formalized the notion of evasion attacks on classifiers through game-theoretic modeling and analysis [?, ?]. In one of the earliest such efforts, Dalvi et al. [?] play out the first two steps of best-response dynamics in this game. However, there has been a disconnect between the learner-attacker game models and real-world dataset validation in this prior work. We bridge this gap by considering a very general adversarial learning framework and evaluating it on a real, large-scale dataset.

6. CONCLUSIONS

We analyzed a large dataset to extract insights about the properties and distribution of singleton program binary files and their relationships to non-singleton software. We leverage the context in which singletons appear to filter benign files from our dataset, allowing us to train a model over a more balanced set of positive and negative examples. We build a classifier and feature set over the static contents of the file to effectively label benign and malicious singletons in a way that is adversarially robust. Together, these components of our pipeline classify singletons much more effectively than either a context-based or a content-based approach can do on its own.


7. REFERENCES
[1] Bruckner, M., and Scheffer, T. Nash equilibria of static prediction games. In Advances in Neural Information Processing Systems (2009), pp. 171–179.
[2] Bruckner, M., and Scheffer, T. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), ACM, pp. 547–555.
[3] Chang, C.-C., and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27.
[4] Chau, D. H., Nachenberg, C., Wilhelm, J., Wright, A., and Faloutsos, C. Polonium: Tera-scale graph mining and inference for malware detection. In SIAM International Conference on Data Mining (2011), vol. 2.
[5] Christodorescu, M., Jha, S., Seshia, S. A., Song, D., and Bryant, R. E. Semantics-aware malware detection. In Security and Privacy, 2005 IEEE Symposium on (2005), IEEE, pp. 32–46.
[6] Microsoft Corporation. Microsoft portable executable and common object file format specification. Revision 6.0.
[7] Dahl, G. E., Stokes, J. W., Deng, L., and Yu, D. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 3422–3426.
[8] Dalvi, N., Domingos, P., Sanghai, S., Verma, D., et al. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004), ACM, pp. 99–108.
[9] Dong, W., Moses, C., and Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web (2011), ACM, pp. 577–586.
[10] Dumitras, T., and Shou, D. Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE). In Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS) (Salzburg, Austria, 2011).
[11] Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB) (Edinburgh, Scotland, UK, 1999).
[12] Guo, F., Ferrie, P., and Chiueh, T. A study of the packer problem and its solutions. In Symposium on Recent Advances in Intrusion Detection (RAID) (Cambridge, MA, 2008), Springer Berlin / Heidelberg.
[13] Hu, X., Shin, K. G., Bhatkar, S., and Griffin, K. MutantX-S: Scalable malware clustering based on static features. In 2013 USENIX Annual Technical Conference (USENIX ATC 13) (San Jose, CA, 2013), USENIX, pp. 187–198.
[14] Jang, J., Brumley, D., and Venkataraman, S. BitShred: Feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security (2011), ACM, pp. 309–320.
[15] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X.-Y., and Wang, X. Effective and efficient malware detection at the end host. In USENIX Security Symposium (2009), pp. 351–366.
[16] Kolter, J. Z., and Maloof, M. A. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (2006), 2721–2744.
[17] Li, B., and Vorobeychik, Y. Feature cross-substitution in adversarial classification. In Advances in Neural Information Processing Systems (2014), pp. 2087–2095.
[18] Li, B., and Vorobeychik, Y. Scalable optimization of randomized operational decisions in adversarial classification settings. In Proc. International Conference on Artificial Intelligence and Statistics (2015).
[19] Liu, H., and Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 388–391.
[20] Lowd, D., and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (2005), ACM, pp. 641–647.
[21] Lukoševičius, M., and Jaeger, H. Reservoir computing approaches to recurrent neural network training. Computer Science Review 3, 3 (2009), 127–149.
[22] Martignoni, L., Christodorescu, M., and Jha, S. OmniUnpack: Fast, generic, and safe unpacking of malware. In Annual Computer Security Applications Conference (ACSAC) (Miami Beach, FL, 2007).
[23] Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., and Khudanpur, S. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Prague, Czech Republic, 2011).
[24] Parameswaran, M., Rui, H., and Sayin, S. A game theoretic model and empirical analysis of spammer strategies. In Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (2010), vol. 7.
[25] Perdisci, R., Ariu, D., and Giacinto, G. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks 57, 2 (2013), 487–500.
[26] Perdisci, R., Lanzi, A., and Lee, W. McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables. In Computer Security Applications Conference, 2008. ACSAC 2008. Annual (2008), IEEE, pp. 301–310.
[27] Perriot, F., and Ferrie, P. Principles and practise of x-raying. In Virus Bulletin Conference (Chicago, IL, 2004).
[28] Roundy, K. A., and Miller, B. P. Binary-code obfuscations in prevalent packer tools. ACM Computing Surveys (CSUR) 46, 1 (2013).
[29] Security Response Group. Internet security threat report, 2015.


[30] Suykens, J. A., and Vandewalle, J. Least squares support vector machine classifiers. Neural Processing Letters 9, 3 (1999), 293–300.
[31] Tamersoy, A., Roundy, K., and Chau, D. H. Guilt by association: Large scale malware detection by mining file-relation graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), ACM, pp. 1524–1533.
[32] Vorobeychik, Y., and Li, B. Optimal randomized classification in adversarial settings. In International Joint Conference on Autonomous Agents and Multiagent Systems (2014), pp. 485–492.
[33] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (2010), vol. 10, p. 10.
[34] Zhang, J., and Zhang, Q. Stackelberg game for utility-based cooperative cognitive radio networks. In Proceedings of the Tenth ACM International Symposium on Mobile Ad Hoc Networking and Computing (2009), ACM, pp. 23–32.

