
BitAV: Fast Anti-Malware by Distributed Blockchain Consensus and Feedforward Scanning

Charles Noyes, [email protected]

Synopsis: In the age of information, of the Internet, the protection of our most vital infrastructure becomes increasingly important. Moore's law continues to prove accurate, with the number of transistors on standard integrated circuits doubling about every two years, but virus scanning applications have not innovated on the same level and development has stagnated. Thus, the attack surfaces become larger and the targets more lucrative, while the defensive mechanisms are failing to improve at a comparable rate.

I present the design and implementation of a novel anti-malware environment called BitAV. BitAV allows for the decentralization of the update and maintenance mechanisms of the software, traditionally performed by a central host, and uses a staggered scanning mechanism in order to improve performance. The peer-to-peer network maintenance mechanism lowered the average update propagation speed by 500% and is far less susceptible to targeted denial-of-service attacks. The feedforward scanning mechanism significantly improved end-to-end performance of the malware matching system, by an average factor of 14×, by decomposing the file matching process into efficient queries that operate in verifiably constant (O(1)) time.

I. INTRODUCTION

As the use of the Internet, and other massively networked systems like it, becomes increasingly widespread, the ease with which viruses proliferate grows with it. The result is the need for technologies designed to block these viruses, generically called malware (malicious software), at all major network stops, but especially at the terminal end-user point. The throughput of most end users' network connections, and thus the amount of potential data consumption, is greatly increasing as well. While network-based detection systems have reached speeds of over 1 Gb/s, the speed of actual virus scanning and malware preventive systems has not kept pace.

The amount of new malware released onto the public Internet is exploding [1]. As most anti-virus software currently filters suspect files through string matching against pseudo-unique identifiers, each new malware sample, and variant sub-sample, requires its own signature [2]. Thus the size of an anti-virus's signature set S is related to γ, the number of all known malware samples: S ∝ γ. Note that 'signature' is used interchangeably with 'identifier'; both mean the unique value that results from inputting the candidate malware sample into a cryptographic hash function (a function that maps an arbitrary input to a fixed-length output with uniform distribution).

Because virtually all anti-malware programs devote most of their resources to the matching of these signatures S to some arbitrary input stream, usually with an exact matching algorithm, the two main factors that determine the effectiveness of a solution are the ratio of detected to undetected inputs (possibly taking into account the rate of false positives, although this is only a problem in solutions that utilize regular-expression-based multi-pattern algorithms) and the scalability of the signature set. The second factor matters because some seemingly highly efficient implementations are really only efficient in-memory; once they must rely heavily on constant (very expensive) disk accesses, they degrade user experience elsewhere in the system.

Obviously, all of these possible pattern matching schemes rely on having a known pattern set that acts as the corpus that inputs are matched against. In this case, it is the known malware identifiers. Thus, any anti-virus solution that aims to protect its users from future malware types and variants must have update mechanisms in place which are able to update the known pattern set. The apparent solution is just to have a centralized update server, but this is sub-optimal, especially for open-source efforts, because of the cost associated with it and the fact that it acts as an obvious and openly-facing target for malicious attackers.

An ideal anti-malware system would be wholly efficient and extremely fast, but the two are generally at odds with one another, and thus trade-offs must be made in the search for an acceptable middle ground. The aim of this project is to find that middle ground.

arXiv:1601.01405v1 [cs.CR] 7 Jan 2016

Objective evaluation shows that our solution, BitAV, is an effective architecture that is more optimal than any currently available commercial or researched/published solution. Specifically, we show:

• Fast scanning speed with less memory usage: By layering a cache-efficient bloom filter on top of the more costly bloomier filter, BitAV manages to increase end-to-end throughput of the average-case input by 14×, and requires less memory to do so than traditional algorithms.

• Scalability: BitAV can handle large numbers of signatures with ease, and further space-efficiency improvements in order-and-match construction within our data structures will further improve scalability.

• Decentralized updates and maintenance: The community of users using and maintaining BitAV is provided a trustable, timeless conduit by which to work together that is not dependent on any centralized authority or schema other than cryptographic verification. This is accomplished through the use of a novel blockchain variant.

• Easily implementable on all types of devices: BitAV should work on any architecture, provided that it has enough RAM and disk space to store the identifiers and load them into memory. The low memory and disk space usage contributes to this.

A. Virus Scanning Techniques Overview

Signature Matching: checking if a file is a known virus, or contains bytecode known to be malicious, by searching for the hash of the identifier.

Heuristic Analysis: testing for polymorphism by executing the virus and searching for known malicious identifiers in-memory.

Behavioral Analysis: checking if a file contains a completely unknown virus by running the file in an emulated environment; the downside is the large overhead of the emulation.

While this paper explores novel methods to apply the first type of scanning mechanism, signature matching, the others should not be discounted. In the future, when computational resources are so vast as to render their cost moot, they will likely be the best choice. They are not currently because of the associated reduction in end-to-end speed. Thus, this paper really focuses on the creation of a bridge between the currently used, and increasingly outdated, method and the likely future of anti-malware. The debate around this transitional period [3], [4] is ongoing, but the consensus view seems to be that, for the time being, signature matching is the best option.

There are two main approaches to signature matching: exact and rolling. Exact signatures are of whole files, whereas rolling signatures 'roll' over some sections n of the file F, such that n = F/k, where k is the size of each section. Rolling signatures are useful because if even one bit is flipped in a malware sample that codes for an identifier, the corresponding exact signature will change dramatically (as a result of the avalanche, or 'waterfall,' effect of hash function computation), while the rolling signatures will remain mostly the same (perhaps only one section will be divergent) [5]. Thus, exact signatures are more precise, whereas rolling signatures are more likely to detect slightly modified variants.
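The contrast between the two signature types can be sketched in a few lines of Python. This is only an illustration under assumed parameters: SHA-256 stands in for whatever hash function a real deployment would use, and the sample data, section size `k`, and function names are hypothetical.

```python
import hashlib

def exact_signature(data: bytes) -> str:
    """Hash of the whole file."""
    return hashlib.sha256(data).hexdigest()

def rolling_signatures(data: bytes, k: int) -> list[str]:
    """Hash each consecutive k-byte section of the file."""
    return [hashlib.sha256(data[i:i + k]).hexdigest()
            for i in range(0, len(data), k)]

# Hypothetical 1024-byte stand-in for a malware sample.
sample = bytes(range(256)) * 4

# A variant that differs from the sample by a single flipped bit.
variant = bytearray(sample)
variant[500] ^= 0x01
variant = bytes(variant)

k = 128  # assumed section size, giving 8 sections
original_sections = rolling_signatures(sample, k)
variant_sections = rolling_signatures(variant, k)
unchanged = sum(a == b for a, b in zip(original_sections, variant_sections))

exact_differs = exact_signature(sample) != exact_signature(variant)
```

Here the exact signatures of the two inputs no longer match, while only the one section containing the flipped bit produces a different rolling signature.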

II. METHODS/DESIGN

This paper centers around the design and implementation of 'BitAV,' an anti-malware system that uses novel techniques to propagate malware identifiers along a network of users in conjunction with an extremely efficient pattern-matching scheme of my own design to create the optimal anti-malware solution. The scanning mechanism utilizes a bloom filter [6] and one of its derivative data structures, the bloomier filter [7], to create a structure that allows for constant-time key-value queries, without the high probability of false positives that comes with probabilistic data stores.

This is not the first use of bloom filters to speed up pattern matching ([8], [9]), nor the first to use tiered look-up systems based on stratified bloom filter layering ([10], [11]). This is, however, the first implementation that takes advantage of cheap hash functions, feedforward logical flow, and cache-resident (or, in the case of systems that run a dedicated processor, texture memory-resident and massively parallelized) architecture. This module is a more final refinement of previous works, namely [5] and [10].

[Fig. 1: Currency Blockchain [15]. A chain of three transactions; each contains the next owner's public key, a hash, and the previous owner's signature, with Sign steps using an owner's private key and Verify steps using an owner's public key.]

The networking module acts as a way for users across a decentralized [12] network to both receive and transmit information trustlessly, and works to improve the reliability and efficiency of the update network. It does this using a blockchain architecture that allows for distributed anonymous consensus among peers, with the 'vote' (an abstracted representation of influence) weighting being a result of computational power expended. Thus, in order to overpower the rest of the network, a user (or a group of malicious actors) would need > 50% of the network's power to gain control [13]. This solves the Byzantine Generals' Problem nicely, as the expenditure of computational power scales directly with the amount of capital required to obtain it (either through hardware or, more often, electricity costs) [14]. This network model is very similar to, and is actually derived from, the core Bitcoin protocol, laid out by the pseudonymous Satoshi Nakamoto in [15].

A. Blockchain Architecture Overview

Originally created by the visionary Satoshi Nakamoto, Bitcoin has revolutionized the business of digital currencies [16]. Bitcoin is, however, only one of the innumerable potential applications of the blockchain (illustrated in Fig. 1). It has the potential to completely decentralize data storage, reputation systems, even democratic voting. These are all done through the creation of self-executing digital contracts that are backed by intelligent assets (cryptocurrency 'coins'). Because there is monetary value associated with these contracts, there is an incentive to make sure that they are 'correct,' and this (coupled with the proof-of-work [15] system present in these applications) allows for the coordination of networks that control valuable information over an anonymous network (in this case the information is the transactions that determine currency ownership). Prior to the invention of these mechanisms, it was simply not possible to coordinate large numbers of individual activities into a cohesive network without a centralized governing body to watch over and verify the proceedings [17]. This problem of coordination is a well-known problem in the field of distributed networking, originally outlined in the '80s [18] and more recently encapsulated by the umbrella 'Byzantine Generals Problem' [19], alluded to earlier.

[Fig. 2: Orphan Chain Competition]

The Generals problem questions how individual computer systems can come to a consensus without a method of omniscient verification (which a central body would provide), in such a way that the network is resilient to attacks by bad actors¹. It posits that three divisions of the Byzantine army are camped outside an enemy city in hopes of conquering it; an independent commander directs each division and, in order to be successful, all three must attack at the same time [19]. The generals can only communicate through an unreliable messenger, which may be influenced by a traitor in the group who is actively trying to derail the generals' efforts.

A blockchain solves this problem by forcing transparency among the groups using it, and uses cryptographic measures to allow for independent verification of transmitted information by all groups in the network. In order to send a new message (or 'mine' a new block), significant computational power must be expended; this makes it both expensive and cumbersome for a bad actor to coordinate an attack against the network. Blockchain protocols thus ensure that transactions (included in each block, and the truly 'valuable' part of the system) are valid and secure, as long as > 50% of the network is non-malicious [14], [16], [20]. The possibility of multiple competing chains being extended at once, illustrated in Fig. 2, is addressed by having each individual select the longest chain they know of; as long as the proof-of-work computations are accurate, the longest chain is verifiably the most difficult to compute, and therefore the least likely to have been compromised or be a false chain.

¹ "The Byzantine Generals Problem seems deceptively simple. Its difficulty is indicated by the surprising fact that if the generals can send only oral messages, then no solution will work unless more than two-thirds of the generals are loyal. In particular, with only three generals, no solution can work in the presence of a single traitor." [19]

[Fig. 3: BitAV Blockchain. Block 1 is shown with a header (Version, Prev. Hash linking to the Genesis Block, Time, Merkle Root) and a Block 1 Transactions section divided into New Signatures and Invalidated Signatures.]

B. BitAV Blockchain

The BitAV blockchain is similar to the core Bitcoin implementation's, in that it uses cryptographic proofs to extend and verify the chain, but it radically differs in the way that the transaction field works. Traditionally, the transaction ('tx') field houses a forward-flowing narrative of all known state changes for the currency it is recording [21]. BitAV replaces the tx field with two information stores: the identifier and invalidation fields. The identifier field allows for the addition of novel malware identifiers; this does not mean 'new' signatures, just those not currently present on the chain in some other block. This is illustrated in Fig. 3.

Each field in the block header (essentially the metadata section) is needed to ensure verifiability and consensus. First, the version number is needed to prevent errors resulting from hard forks (updates that would break compatibility with previous versions). The time is the approximate creation time of the block and must be within the calculated acceptable range for each new block; BitAV uses the timestamp both to recalculate difficulty (for use in the proof-of-work verification, 'mining' in Bitcoin's terminology) and in some checks that work to ensure chronological sanity and canonicity. The previous hash field is what really makes the blockchain a 'chain,' as it necessitates the inclusion of the most recent block's hash in the next block to be created; because of the difficulty of reverse-engineering the hash function used (SHA-256, in this case), the longer the chain, the more difficult it is to recreate a verifiable blockchain. For reference, the odds of recreating the current Bitcoin chain are approximately $\frac{1}{(2^{256})^{n}}$, where $n$ is the current blockchain length. Note that this is the worst-case figure when attempting to recreate the exact hashes of all current blocks using falsified transactions; individual blocks would reduce it to $\frac{1}{(2^{256})^{n-k}}$, where $k$ is the depth of the recreation attempt from the 'top' of the chain. Additionally, this assumes SHA-256 is used as the hash function; should another hash function of digest length $d$ be used, the approximate odds of finding a collision are $\frac{1}{2^{d}}$. Finally, not pictured is the 'nonce' that is included in the header so that the hash of said header can be quickly modified by changing the nonce. Users extend the chain by finding a nonce that results in the hash of the block's header being a lower value than the calculated difficulty level [15].
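The nonce search described above can be sketched as follows. This is a minimal illustration, not BitAV's actual block format: the header layout, field widths, the single round of SHA-256, and the toy `TARGET` value are all assumptions chosen so the loop finishes quickly.

```python
import hashlib
import struct

def header_hash(version: int, prev_hash: bytes, merkle_root: bytes,
                timestamp: int, nonce: int) -> int:
    """SHA-256 of a packed header, read as a big-endian integer.

    The layout (4-byte version, two 32-byte hashes, 4-byte time,
    8-byte nonce) is hypothetical.
    """
    header = struct.pack(">I32s32sIQ", version, prev_hash, merkle_root,
                         timestamp, nonce)
    return int.from_bytes(hashlib.sha256(header).digest(), "big")

def mine(version: int, prev_hash: bytes, merkle_root: bytes,
         timestamp: int, target: int) -> int:
    """Increment the nonce until the header hash falls below the target."""
    nonce = 0
    while header_hash(version, prev_hash, merkle_root, timestamp, nonce) >= target:
        nonce += 1
    return nonce

# Toy difficulty: roughly one hash in 4096 falls below this target.
TARGET = 1 << 244

prev_hash = hashlib.sha256(b"previous block header").digest()
merkle_root = hashlib.sha256(b"identifier and invalidation fields").digest()
nonce = mine(1, prev_hash, merkle_root, 1452124800, TARGET)
```

Any peer can check the result with a single call to `header_hash`, which is what makes the work cheap to verify but expensive to produce.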

This architecture can actually be used for any network whose aim is to share inherently valuable information across a network of users. The only stipulation is that the information have some unique characteristics. In this case, BitAV peers working to extend the chain can leverage open-source databases to check whether a submitted identifier is known to be good or bad; in either case, it allows for some level of pre-screening. There is still the possibility of attack by the submission of identifiers that are currently unknown to be good or bad, or more generally the submission of universally unknown identifiers for some malicious purpose.

[Fig. 4: Cryptographic Signing of Data [22]. Signing: the data is passed through a hash function; the hash is encrypted using the signer's private key to form the signature; the signature and certificate are attached to the data, giving digitally signed data.]

This vector of attack is mitigated, almost in its entirety, by a novel voting scheme that we call 'minority transaction consensus.' This is perhaps the most innovative feature of our network's design, and it is what drives the 'invalidation' field of our blockchain. To fully appreciate the significance of this scheme, one must understand why it would never work on a currency network such as Bitcoin (we theorize this is the reason that no other organization has thought to implement such a protocol). When dealing with immutable currencies, the ability to just 'rewind' transactions and return currency to an entity that had appeared to have spent it is not allowable, as it opens the door for far too much fraud and would actually turn Bitcoin into an even more abusable version of credit card chargebacks (the preclusion of which Bitcoin touts as one of its greatest strengths). In our network, however, each signature is not actually valuable to an individual, but to the network as a whole in a more probabilistic sense. Allowing individuals to reclaim currency that they have appeared to have spent, and that has been accepted into the network as canon and shown to all peers to have been validated, is harmful on a case-by-case level, whereas the possibility of 0.5% of all malware identifier invalidations being malicious would only marginally affect the network as a whole.

Given the newness of the blockchain architecture, and the nature of these problems being less scientific and more game-theory oriented, we encourage all interested parties to read up more fully on these concepts in [13]–[15], and especially [17], which is far more approachable than many of the more highly specialized studies.

Each new identifier submission holds the general format of:

[Identifier] [Pubkey] [Signature]

The signature is the cryptographically signed hash of the identifier, and serves to verify that the submission is not a forgery. Verification can take place by using the included public key to decrypt the signature, hashing the decrypted value, and then checking the hash of the identifier against the digest of the decrypted value's hash [23], [24]. It is a very simple and well-established procedure, and the signature step is illustrated in Fig. 4.
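To make the submission format concrete, the sketch below signs and verifies an identifier with textbook RSA over tiny primes. It is a toy under loud assumptions: the key sizes are hopelessly insecure, the hash-then-sign flow is the standard simplification rather than necessarily the exact procedure described above, and the identifier string and function names are hypothetical.

```python
import hashlib

# Textbook RSA with tiny primes: insecure, for illustration only.
P, Q = 61, 53
N = P * Q          # public modulus
E = 17             # public exponent
D = 2753           # private exponent; (E * D) % ((P - 1) * (Q - 1)) == 1

def digest(identifier: str) -> int:
    """SHA-256 of the identifier, reduced into the toy modulus."""
    raw = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(raw, "big") % N

def sign(identifier: str) -> int:
    """Submitter side: apply the private key to the identifier's hash."""
    return pow(digest(identifier), D, N)

def verify(identifier: str, pubkey: tuple[int, int], signature: int) -> bool:
    """Peer side: undo the signature with the public key and compare hashes."""
    n, e = pubkey
    return pow(signature, e, n) == digest(identifier)

# A submission in the [Identifier] [Pubkey] [Signature] format.
identifier = "example-identifier"   # hypothetical placeholder value
submission = (identifier, (N, E), sign(identifier))

is_valid = verify(*submission)
is_forged_accepted = verify(identifier, (N, E), (submission[2] + 1) % N)
```

A production system would use a vetted signature scheme and library instead; the point here is only that anyone holding the public key can check a submission without learning the private key.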

Because we are able to ensure that all of the submissions under a specific key-pair are, in fact, generated by using that key-pair to sign the submission, we can 'track' a user through the network by searching for all the occurrences of their public key in the 'new identifiers' transaction field of each block on the chain. Knowing this, we posited that a user's relative 'trust' could be evaluated by counting these occurrences and using the number of times they have altruistically added value to the network (in the form of new identifier submissions) as the weighting factor in deciding their trust.
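Counting a key's submissions across the chain is straightforward to sketch. The block layout below (a dict with a `new_identifiers` list of identifier, pubkey, signature triples) and all names and values are hypothetical stand-ins, and the one-point-per-submission weighting is the simplest reading of the scheme.

```python
from collections import Counter

def trust_scores(chain: list[dict]) -> Counter:
    """Count, per public key, the identifiers it submitted on the chain."""
    scores = Counter()
    for block in chain:
        for _identifier, pubkey, _signature in block["new_identifiers"]:
            scores[pubkey] += 1
    return scores

# Hypothetical two-block chain with placeholder identifiers and keys.
chain = [
    {"new_identifiers": [("id_a", "key_alice", "sig_1"),
                         ("id_b", "key_bob", "sig_2")]},
    {"new_identifiers": [("id_c", "key_alice", "sig_3")]},
]

scores = trust_scores(chain)
```

Because `Counter` returns 0 for missing keys, a public key that has never submitted anything simply has no trust.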

When users trying to extend the chain are actively broadcasting their status as current miners, they receive both requests to add new identifiers and 'votes' from users attempting to invalidate identifiers they believe are invalid. Invalidations are only possible for signatures added within the last 10 blocks (an arbitrary limit that can easily be adjusted once large-scale testing is done), so that the backbone of the blockchain remains unchangeably canonical. A nice side effect of using new blocks to change the state of identifiers in older blocks is that the data within those older blocks is never changed (and thus the hash remains constant), so the prev. hash field is not compromised.

The actual calculation done by a miner in determining if an invalidation is allowable is simply:

$$\sum \text{Trust of Invalidators} - \sum \text{Trust of Submitter} \tag{1}$$

[Fig. 5: BitAV Trust Determination Scheme. Block 1 Transactions: the submission of ID w0ie0er… under Pubkey pNy3Gik increases the trust of pNy3Gik. Block 2 Transactions: the invalidation of ID w0ie0er… by Pubkeys nV2d3km,.. decreases the trust of pNy3Gik and proportionally decreases the trust of nV2d3km,..]

If the result of this is > 0, the invalidation goes through. There is no punishment for a failed invalidation vote because it would bloat the chain. If the invalidation succeeds, the submitter loses trust equal to the result of the subtraction, and the invalidators lose trust proportionally to their own trust level. For example, if an invalidation with 10 voters carries a penalty of 10 trust in total, and the voters have equal initial trust, they will all lose one 'trust.' However, if one has one more trust initially, they will lose more than the rest. While it may not seem intuitively 'fair' to give penalties for invalidations to the voters, it is necessary to ensure that a bad actor who has accrued lots of trust does not take over the network by invalidating the submissions of any opposers. Note that the trust points here are simplified for the example and are abstracted from the real implementation; nevertheless, the core ideas hold true.
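One possible reading of Eq. (1) and the penalty rule is sketched below. The assumptions are significant and should be treated as such: the total penalty shared by the invalidators is taken to equal the result of Eq. (1), it is split in proportion to each voter's trust, trust is allowed to go negative, and the function and key names are hypothetical.

```python
def resolve_invalidation(trust: dict[str, float], submitter: str,
                         invalidators: list[str]) -> bool:
    """Apply Eq. (1); on success update trust in place and return True.

    Assumes each invalidator appears once in the list.
    """
    total = sum(trust[v] for v in invalidators)
    result = total - trust[submitter]          # Eq. (1)
    if result <= 0:
        return False                           # failed vote: no penalty
    trust[submitter] -= result
    for v in invalidators:
        # Each voter carries a share proportional to their own trust.
        trust[v] -= result * trust[v] / total
    return True

# Ten voters with equal trust challenge one submitter.
trust = {"submitter": 5.0}
voters = [f"voter_{i}" for i in range(10)]
trust.update({v: 1.5 for v in voters})

accepted = resolve_invalidation(trust, "submitter", voters)

# A weaker coalition fails and nothing changes.
other = {"submitter": 5.0, "voter_x": 2.0}
rejected = resolve_invalidation(other, "submitter", ["voter_x"])
```

In the first case the result of Eq. (1) is 10, so each of the ten equally trusted voters gives up one point, mirroring the example in the text.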

BitAV exploits the knowledge that most input files are not malicious by using probabilistic data structures with exactly precise no-match accuracy to greatly speed up the process of file scanning. It first constructs a counting bloom filter [6] (a probabilistic data structure that uses bit vectors to efficiently map data [25]), illustrated in Fig. 6, from the set of known identifiers S. We start with an integer vector (as we want to allow for deletions [26]) of size m buckets, all of which are set to 0 during creation. k hash functions are applied to each signature α ∈ S, resulting in hash digests $h_1(\alpha), h_2(h_1(\alpha)+\alpha), \ldots, h_k(h_{k-1}(\alpha)+\alpha)$. Note the use of the result of the first hash digest in the second, and the second in the third, etc., as this allows us to use just one hash function while still being able to efficiently modulate k. To ensure the digests $h_1(\alpha), \ldots, h_k(h_{k-1}(\alpha)+\alpha)$ are within the bounds of $1, \ldots, m$, the result of each digest is reduced modulo m.

[Fig. 6: Bloom Filter [9]. An 18-bit vector (0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0) storing the set {x, y, z}, queried with an element w.]
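A counting bloom filter with this chained hashing can be sketched as below. The choices here are assumptions for illustration: SHA-256 stands in for the cheap hash function the design calls for, the `+` in the digest chain is interpreted as byte concatenation, buckets are indexed from 0 rather than 1, and the class and method names are hypothetical.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.buckets = [0] * m       # integer buckets so deletions are possible

    def _indices(self, item: bytes) -> list[int]:
        """h1(a), h2(h1(a) + a), ..., each reduced modulo m."""
        indices, prev = [], b""
        for _ in range(self.k):
            # Feed the previous digest back in, so one hash function yields k.
            prev = hashlib.sha256(prev + item).digest()
            indices.append(int.from_bytes(prev, "big") % self.m)
        return indices

    def add(self, item: bytes) -> None:
        for i in self._indices(item):
            self.buckets[i] += 1

    def remove(self, item: bytes) -> None:
        """Only valid for items that were previously added."""
        for i in self._indices(item):
            self.buckets[i] -= 1

    def __contains__(self, item: bytes) -> bool:
        return all(self.buckets[i] > 0 for i in self._indices(item))

bloom = CountingBloomFilter(m=1024, k=4)
bloom.add(b"known-malware-digest")
present = b"known-malware-digest" in bloom
bloom.remove(b"known-malware-digest")
absent_after_removal = b"known-malware-digest" not in bloom
```

Changing `k` only changes how many times the loop runs, which is the practical benefit of chaining a single hash function.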

C. BitAV Scanning Mechanism

Because of the possibility of hash collisions, in which two inputs result in the same output digest, we must consider the possibility of false positives. Given that after inserting n keys into our table of size m, the probability of a specific bit being 0 is

$$\left(1 - \frac{1}{m}\right)^{kn}, \tag{2}$$

the probability of a false positive p is

$$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}. \tag{3}$$

Finally, we can derive that, for a given m and n, the value of k that minimizes the false-positive probability p is:

$$k = \frac{m}{n} \ln 2. \tag{4}$$
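Equations (2) through (4) are easy to evaluate numerically. The sketch below uses hypothetical values of m and n purely to show how the quantities relate; the function names are not from the paper.

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> tuple[float, float]:
    """Return the two sides of Eq. (3): the closed form and its approximation."""
    closed_form = (1 - (1 - 1 / m) ** (k * n)) ** k
    approximation = (1 - math.exp(-k * n / m)) ** k
    return closed_form, approximation

def optimal_k(m: int, n: int) -> float:
    """Eq. (4): the number of hash functions that minimizes p."""
    return (m / n) * math.log(2)

# Hypothetical sizing: 10,000 buckets holding 1,000 identifiers.
m, n = 10_000, 1_000
k_best = round(optimal_k(m, n))

p_best, p_best_approx = false_positive_rate(m, n, k_best)
p_too_few, _ = false_positive_rate(m, n, 2)
p_too_many, _ = false_positive_rate(m, n, 20)
```

With these values the optimum rounds to k = 7, and both a much smaller and a much larger k give a higher false-positive probability.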

After construction of this bloom filter, we are left with a basic probabilistic data structure that can perform lookups in constant time while residing in the level 2 (L2) CPU cache. This filter will never produce a false-negative result, assuming proper implementation, but each lookup has probability p of being falsely positive.

In a bloom filter's worst-case scenario, the entire known set of data must be looked up in the filter to confirm an uncertain match from the hashing operation; this scenario can occur when a filter is too small for the number of elements it contains (and thus the false-positive rate would be extremely high). Hash collisions that arise even before the reduction modulo the table size can also occur, but they are far less likely.

Bloomier filters [7] solve this problem well in this case, as they allow for key-value lookups using a vector structure similar to those in bloom filters. The query runs in constant time and the space requirement is only O(nτ), where τ is the size of the bucket used in the order-and-match finder operation [11], [27].

$$\text{Info} = \bigoplus_{i=1}^{k} \text{knownPatterns}(\text{Hash}_i(\text{Suspect})) \tag{5}$$

Equation 5 denotes the method by which information is returned from the bloomier structure. An n-ary XOR operation (symbolized by the $\bigoplus$ operator) is performed on the hash digests, 1 to k, of the suspect string(s). The digests $h_1(\alpha), \ldots, h_k(\alpha)$ are used as indices in the bloomier index table (for example, the first position would be at $h_1(\alpha)$). The data in the index table at all of these positions, collectively, is XORed with the total index table, returning the information.

Because of the ease and efficiency with which Eq. 5 performs simple key-value lookups, and the usage of index table XORing, many parallels can be drawn between our bloomier filters and structures like IBLTs [28], KBFs [29], and other bloomier-like data stores. While it is possible that one of these similar data structures outperforms a bloomier filter, our model can be easily adapted should that prove to be the case. As of now, no clear consensus exists on the matter.

The integration of these two filtering methods naturally leads to a system in which information flows downward, stopping when it hits an impassable junction, but sometimes slipping through the cracks in the form of false positives. I call this mechanism a feedforward bloom-bloomier filter (FBBF) because of the preclusion of a feedback loop developing and the necessitation of downward flow in the data stream. The only modifications come from the input disturbances caused by the networking module.

The full FBBF mechanism is illustrated in Fig. 7. The general format of the scanning algorithm is derived from [10], which uses a two-tiered bloom filtering mechanism. Unlike in our filter, their last step was a full pattern matching against a subset of the larger signature set (that was still fairly substantial). Our mechanism doesn't require this step, as it operates in constant time throughout.

[Fig. 7: Feedforward Bloom-Bloomier Filter. Diagram labels: Suspect Files ∑F; Target File F1 (Filter Hit); Target File F2 (Filter Miss); XOR Operation; Signatures for Exact Matching.]

BitAV-Hash(∑) → δ takes the set of signatures and outputs an integer vector δ that contains the hash digests (signatures) of all known malware.

BitAV-Screen(δ, F) → (λ, Fsuspect) constructs a feedforward filtering mechanism from δ and a bloomier filter. Each file f ∈ F is scanned using φ. In the tuple (λ, Fsuspect), Fsuspect ⊆ F is the list of files matched by δ, and λ is a string that records the indices of the signatures actually matched in Fsuspect.

BitAV-HitScan(λ, ∑) → ∑′ takes λ and outputs the set of signatures ∑′ ⊆ ∑ that were matched during BitAV-Screen by querying our bloomier filter.

BitAV-HitMatch(∑′, Fsuspect) → Fmalware takes in the set of signatures ∑′ and the set of files Fsuspect, and outputs the set of files Fmalware ⊆ Fsuspect matching ∑′ ⊕ Fsuspect.
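The four stages above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the author's code: the `BloomFilter` class, the double-hashing index scheme, the filter size, and the helper `scan` are hypothetical; a plain dictionary stands in for the bloomier filter; λ is simplified to the list of digests that passed the first stage; and whole-file MD5 digests are used as signatures. The last stage XORs the file digest with the stored digest and treats a zero result as a match.

```python
import hashlib

MASK64 = (1 << 64) - 1


def digest(data):
    """MD5 digest of data as an integer (hash choice is an assumption here)."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


class BloomFilter:
    """Hypothetical first-stage filter; one byte per bit for readability."""

    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _indices(self, d):
        # Derive k indices from the two 64-bit halves of the digest.
        h1, h2 = d >> 64, d & MASK64
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, d):
        for i in self._indices(d):
            self.bits[i] = 1

    def __contains__(self, d):
        return all(self.bits[i] for i in self._indices(d))


def bitav_hash(signatures):
    """BitAV-Hash: signature set -> integer vector delta of digests."""
    return [digest(s) for s in signatures]


def bitav_screen(delta, files):
    """BitAV-Screen: keep only files whose digest passes the bloom filter."""
    bloom = BloomFilter()
    for d in delta:
        bloom.add(d)
    suspects = {}
    for name, data in files.items():
        d = digest(data)
        if d in bloom:
            suspects[name] = d
    lam = sorted(set(suspects.values()))  # simplified stand-in for lambda
    return lam, suspects


def bitav_hitscan(lam, store):
    """BitAV-HitScan: fetch the stored signature for each hit (0 if absent)."""
    return {d: store.get(d, 0) for d in lam}


def bitav_hitmatch(sigma_prime, suspects):
    """BitAV-HitMatch: a file is malware when digest XOR stored digest is 0."""
    return [name for name, d in suspects.items() if d ^ sigma_prime[d] == 0]


def scan(files, signatures):
    delta = bitav_hash(signatures)
    store = {d: d for d in delta}  # dict stand-in for the bloomier filter
    lam, suspects = bitav_screen(delta, files)
    sigma_prime = bitav_hitscan(lam, store)
    return bitav_hitmatch(sigma_prime, suspects)


# Hypothetical inputs: one file matches a known signature, one is clean.
flagged = scan({"a.bin": b"evil-1", "b.txt": b"hello"}, [b"evil-1", b"evil-2"])
```

Data only moves forward through the stages, and each stage does a fixed amount of work per file once the digest is available, which mirrors the feedforward structure described above.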

Note that the entire operation, including the final string matching, runs in constant time. The final 'exact matching' stage does not actually utilize an exact matching algorithm, as they run in linear time and would reduce the worst-case run to O(n), but actually just XORs the hash of the suspect file with the bloomier output. If the result is not 0 (as the XOR of two exactly similar objects is always 0), then the result was a false positive. Otherwise, the file is recognized as malicious and deleted.

[Fig. 8: Throughput Graph. Title: End-to-End Speed vs. Industry Solutions. Vertical axis: Throughput (MB/s), 0 to 400. Series: Best, Average, Worst. Products compared: BitAV (CUDA), BitAV (CPU), Microsoft, BitDefender, Sophos, Avast, McAfee, Panda, Avira.]

III. RESULTS

My anti-virus implementation was, on average, 1,400% faster at the process of scanning files than the mean of the industry solutions tested against my implementation. This includes test data sets that were largely cache-misses or not already in the set of known malware for BitAV. I considered my average speed to be the mean of the 'average case' end-to-end speed tests (70% clean files, 30% malware binaries), with all of the end-to-end throughput test results represented in Fig. 8. Eliminating those cases brings the average speed increase to 8,500%, twice the average performance gain of [5] without optimizations for cache residency.

One of the main improvements on my original design was the parallelization of the signature generation for cases in which bulk file buffers were passed to BitAV. This was done through a simple CUDA MD5 implementation that was significantly faster at hashing large buffers than CPU-driven mechanisms. It was not, however, faster when the input stream was inconsistent (which would translate to on-access in the real world). Disk I/O speeds are increasing fast enough that within the next decade they will most likely not be the limiting factor for these kinds of pattern matching schemes.

[Fig. 9: Social Network Analysis of Real-World Simulation]

Measurements of propagation time were much more difficult to obtain than scanning speed. To measure it, a network of servers (mostly AWS micro instances or virtual machines in accessible datacenters) was constructed to run my software. I set up a honeypot server on my own remote machine, and piped all of the garnered binaries into a test blockchain (which was shared with all of the other servers). I measured the difference between VirusTotal's first sample seen history and the time it took me to identify the sample (all of which I had to manually check to make sure they were, in fact, malicious); this yielded BitAV's data. The data for the industry solutions was gathered by periodically checking VirusTotal's report API to monitor the detection status. Most of my results are clustered into tiers because VirusTotal's database of provider results for each sample is only updated periodically. Once again BitAV ran significantly faster than the tested industry solutions, 500% on average, and our results were even more statistically significant.
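The delay measurement described above reduces to a timestamp difference per sample, averaged over all samples. The sketch below assumes each sample is a pair of (first-seen time, detection time); the function name `mean_delay_seconds` and the timestamps are invented for illustration and are not measurements from the experiment.

```python
from datetime import datetime
from statistics import mean


def mean_delay_seconds(samples):
    """Average of (detected - first_seen), in seconds, over all samples."""
    delays = [(detected - first_seen).total_seconds()
              for first_seen, detected in samples]
    return mean(delays)


# Hypothetical timestamps: two samples detected 10 and 20 minutes after
# they were first seen.
samples = [
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 1, 1, 0, 10)),
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 1, 1, 0, 20)),
]
average = mean_delay_seconds(samples)  # 900.0 seconds
```

Running the same calculation on each vendor's detection times gives the per-solution averages that the comparison rests on.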


The last measurement I took was of the connections made between test network servers. The relationships are represented in a social network analysis graph in Fig. 9 (note that the OpenOrd ranking algorithm was used to arrange the nodes). The graph clearly shows that the nodes naturally clustered into a P2P-structured network. There are quite a few advantages to a network's hosting and maintenance being decentralized, namely the preclusion of denial-of-service attacks (assuming the protocol is not inherently vulnerable).

IV. CONCLUSION

My findings showed that the implementation of the proposed design exceeded expectations in all areas of performance. When fully optimized for cache residency and with manually tuned bytecode, our anti-virus was able to scan at a speed of over 350 MB/s, making it a viable solution for network based scanning. Yes, it is very possible that an entire network could be secured through deep packet inspection by way of BitAV's revolutionary scanning mechanisms. The propagation speed tests similarly showed BitAV's clear preeminence over 'industry standard' solutions.

As more devices are brought online, it becomes increasingly important to make sure that all networks are able to be secured with relative ease, and no other solution provides as much modularity as one that is hosted by a community of its own users. Adaptation is limited only by the enthusiasm of the network's users to contribute to development, and previous open-source software development efforts in similar areas have shown this to be the least likely limiting factor. Furthermore, with the advent of malware that is developed and spread by a bad actor with influence over the institutions built around cyber defense (e.g. NSA's REGIN and SIGINT malware campaigns, whose identifiers have still not been added to a majority of the commercially available solutions), it is advantageous to users to know that there is no bias associated with the addition of new malware identifiers.

To be clear, my blockchain-based consensus scheme is the only one in existence that works for unvalidatable data across an anonymous network. The potential applications of this architecture are innumerable. I believe that once atomic binding of currency networks to informatory networks becomes a reality (most likely before the year is out), my networking architecture will have the potential to disrupt every industry based around valuable, time-critical data.

V. ACKNOWLEDGEMENTS

I would like to thank Prof. Chi So, information security department at the Viterbi School of Engineering at the University of Southern California, for introducing me to the field. In addition I would like to thank Pei Cao, Ozgun Erdogan, Sungmin Cho, Prof. David Brumley, and Sang Kil Cha for providing useful discussion and source code for their implementations of Hash-AV and SplitScreen, respectively. Finally, VirusTotal provided access to their private research API to gather data for the propagation speed test and malware samples for the scanning speed tests.

REFERENCES

[1] M. Garetto, W. Gong, and D. Towsley, “Modeling malware spreading dynamics,” IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428), vol. 3, 2003.

[2] T. Chen and J. Robert, “The evolution of viruses and worms,” Statistical Methods in Computer, pp. 1–16, 2004.

[3] W. Grossman, “Does antivirus have a future?” The Guardian, vol. 20, 2007.

[4] M. Fitzgerald, “The future of antivirus.”

[5] O. Erdogan and P. Cao, “Hash-AV: Fast virus signature scanning by cache-resident filters,” GLOBECOM - IEEE Global Telecommunications Conference, vol. 3, pp. 1767–1772, 2005.

[6] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” pp. 422–426, 1970.

[7] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The Bloomier filter: an efficient data structure for static support lookup tables,” Proceedings of the fifteenth . . . , 2004.

[8] K. Huang, D. Zhang, and Z. Qin, “Accelerating the bit-split string matching algorithm using Bloom filters,” Computer Communications, vol. 33, no. 15, pp. 1785–1794, 2010.

[9] D. Eppstein and M. T. Goodrich, “Straggler Identification in Round-Trip Data Streams via Newton’s Identities and Invertible Bloom Filters,” Knowledge and Data Engineering, . . . , pp. 1–18, Apr. 2007.

[10] S. K. Cha, I. Moraru, J. Jang, J. Truelove, D. Brumley, and D. G. Andersen, “SplitScreen: Enabling efficient, distributed malware detection,” Journal of Communications and Networks, vol. 13, pp. 187–200, 2011.

[11] N. Tuan, B. Hieu, and T. Thinh, “High performance pattern matching using bloom-bloomier filter,” . . . and Information Technology ( . . . , pp. 0–4, 2010.

[12] J. F. Buford, H. Yu, and E. K. Lua, P2P Networking and Applications, 2009.

[13] F. Reid and M. Harrigan, “An Analysis of Anonymity in the Bitcoin System,” Jul. 2011.


[14] J. Kroll, I. Davey, and E. Felten, “The economics of Bitcoin mining, or Bitcoin in the presence of adversaries,” Proceedings of WEIS, no. Weis, pp. 1–21, 2013.

[15] S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,” Consulted, pp. 1–9, 2008.

[16] K. Okupski, “Bitcoin Developer Reference,” 2014.

[17] A. Wright and P. De Filippi, “Decentralized blockchain technology and the rise of lex cryptographia,” Available at SSRN 2580664, 2015.

[18] B. Preneel and M. Lowry, “Analysis and Design of Cryptographic Hash Functions,” Doct Dissertation KULeuven, 2003.

[19] L. Lamport, R. Shostak, and M. Pease, “The byzantine generals problem,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 4, no. 3, pp. 382–401, 1982.

[20] G. Fox, “Deanonymisation of clients in Bitcoin P2P Network,” Computing in Science & Engineering, pp. 15–29, 2010.

[21] V. Buterin, “A next-generation smart contract and decentralized application platform,” Ethereum, pp. 1–36, 2014.

[22] Acdx, “Digital Signature diagram,” 2008.

[23] D. Johnson, A. Menezes, and S. Vanstone, “The Elliptic Curve Digital Signature Algorithm (ECDSA),” International Journal of Information Security, vol. 1, pp. 36–63, 2001.

[24] A. Juels, M. Luby, and R. Ostrovsky, “Security of blind digital signatures,” Advances in Cryptology - CRYPTO’97, pp. 150–164, 1997.

[25] E. Karpilovsky, “Bloom Filters,” in COS598E, 2005, p. 106.

[26] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An improved construction for counting bloom filters,” in Algorithms–ESA 2006. Springer, 2006, pp. 684–695.

[27] D. Charles and K. Chellapilla, “Bloomier filters: A second look,” in Algorithms-ESA 2008. Springer, 2008, pp. 259–270.

[28] M. T. Goodrich and M. Mitzenmacher, “Invertible bloom lookup tables,” in 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, Sep. 2011, pp. 792–799.

[29] S. Xiong, Y. Yao, Q. Cao, and T. He, “kBF: A Bloom Filter for key-value storage with an application on approximate state machines,” INFOCOM, 2014 Proceedings . . . , pp. 1150–1158, Apr. 2014.

