Linux Provenance Modules: Trustworthy Whole-System Provenance for the Linux Kernel

Adam Bates¹, Dave Tian¹, Kevin R.B. Butler¹, and Thomas Moyer²

¹University of Florida, ²MIT Lincoln Laboratory

Abstract

In a provenance-aware system, mechanisms gather and report metadata that describes the history of each object being processed on the system. This allows users to track, and understand, how a piece of data came to exist in its current state. However, while past work has demonstrated the usefulness of provenance for myriad applications, less attention has been given to securing provenance-aware systems. Provenance itself is a ripe attack vector, and the authenticity and integrity of provenance must be guaranteed before it can be put to use.

We present Linux Provenance Modules (LPM), the first general framework for the development of provenance-aware systems. We demonstrate that LPM creates a trusted provenance-aware execution environment, collecting complete whole-system provenance while imposing as little as 2.7% performance overhead on normal system operation. LPM introduces new mechanisms for secure provenance layering and authenticated communication between provenance-aware hosts, and also interoperates with existing mechanisms to provide strong security assurances. To demonstrate the potential uses of LPM, we design a Provenance-Based Data Loss Prevention (PB-DLP) system. We implement PB-DLP as a file transfer application that blocks the transmission of files derived from sensitive ancestors while imposing just tens of milliseconds overhead. LPM is the first step towards widespread deployment of trustworthy provenance-aware applications.

1 Introduction

A provenance-aware system automatically gathers and reports metadata that describes the history of each object being processed on the system. This allows users to track, and understand, how a piece of data came to exist in its current state. The application of provenance is of enormous interest in a variety of disparate communities including scientific data processing, databases, software development, and storage [56, 66]. Provenance has also been demonstrated to be of great value to security by identifying malicious activity in data centers [11, 34, 70, 82, 83], improving Mandatory Access Control (MAC) labels [58, 59, 60], and assuring regulatory compliance [9].

The Lincoln Laboratory portion of this work was sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Unfortunately, most provenance collection mechanisms in the literature exist as fully-trusted user space applications [35, 34, 53, 70]. Even kernel-based provenance mechanisms [56, 61] and sketches for trusted provenance architectures [52, 55] fall short of providing a provenance-aware system for malicious environments. The problem of whether or not to trust provenance is further exacerbated in distributed environments, or in layered provenance systems, due to the lack of a mechanism to verify the authenticity and integrity of provenance collected from different sources.

In this work, we present Linux Provenance Modules (LPM), the first generalized framework for secure provenance collection on the Linux operating system. Modules capture whole-system provenance, a detailed record of processes, IPC mechanisms, network activity, and even the kernel itself; this capture is invisible to the applications for which provenance is being collected. LPM introduces a gateway that permits the upgrading of low integrity workflow provenance from user space. LPM also facilitates secure distributed provenance through an authenticated, tamper-evident channel for the transmission of provenance metadata between hosts. LPM interoperates with existing security mechanisms to establish a hardware-based root of trust to protect system integrity.

Achieving the goal of trustworthy whole-system provenance, we demonstrate the power of our approach by presenting a scheme for Provenance-Based Data Loss Prevention (PB-DLP). PB-DLP allows administrators to reason about the propagation of sensitive data and control its further dissemination through an expressive policy system, offering dramatically stronger assurances than existing enterprise solutions, while imposing just milliseconds of overhead on file transmission. To our knowledge, this work is the first to apply provenance to DLP.

[Figure 1 graph: a "Malicious Binary" process node with a WasControlledBy edge to the agent root; Used edges to /etc/ld.so.cache:0, /lib/libc-2.12.so:0, /etc/rc.local:0, /bin/ps:0, /var/spool/cron/root:0, /etc/passwd:0, and /etc/shadow:0; and WasGeneratedBy edges from /etc/rc.local:1, /bin/ps:1, /var/spool/cron/root:1, /etc/passwd:1, and /etc/shadow:1.]

Figure 1: A provenance graph showing the attack footprint of a malicious binary. Edges encode relationships that flow backwards into the history of system execution, and writing to an object creates a second node with an incremented version number. Here, we see that the binary has rewritten /etc/rc.local, likely in an attempt to gain persistence after a system reboot.

Our contributions can thus be summarized as follows:

• Introduce Linux Provenance Modules (LPM). LPM facilitates secure provenance collection at the kernel layer, supports attested disclosure at the application layer, provides an authenticated channel for network transmission, and is compatible with the W3C Provenance (PROV) Model [74]. In evaluation, we demonstrate that provenance collection imposes as little as 2.7% performance overhead.

• Demonstrate secure deployment. Leveraging LPM and existing security mechanisms, we create a trusted provenance-aware execution environment for Linux. We port and extend the Hi-Fi system [61], provide a second module that interoperates with the SPADE system [36], and describe how LPM is being used to create provenance-informed MAC policies [68]. We show that, in realistic malicious environments, ours is the first proposed system to offer secure provenance collection.

• Introduce Provenance-Based Data Loss Prevention (PB-DLP). We present a new paradigm for the prevention of data leakage that searches object provenance to identify and prevent the spread of sensitive data. PB-DLP is impervious to attempts to launder data through intermediary files and IPC. We implement PB-DLP as a file transfer application, and demonstrate its ability to query object ancestries in just tens of milliseconds.

The rest of this paper is structured as follows. In Section 2, we present background on provenance, and explain how it compares to past efforts in the area of information flow security. In Section 3 we present the design, implementation, and deployment of Linux Provenance Modules. We analyze the security of such a deployment in Section 4. Section 5 presents the design and implementation of an exemplar LPM application that employs a provenance-based approach to data loss prevention. LPM system performance is evaluated in Section 6. Other aspects of LPM are discussed in Section 7, and in Section 9 we conclude.

2 Background

Data provenance, sometimes called lineage, describes the actions taken on a data object from its creation up to the present. Provenance can be used to answer a variety of historical questions about the data it describes. Such questions include, but are not limited to, “What processes and datasets were used to generate this data?” and “In what environment was the data produced?” Conversely, provenance can also answer questions about the successors of a piece of data, such as “What objects on the system were derived from this object?” Although potential applications for such information are nearly limitless, past proposals have conceptualized provenance in different ways, indicating that a one-size-fits-all solution to provenance collection is unlikely to meet the needs of all of these audiences.

The commonly accepted representation for data provenance is a directed acyclic graph (DAG). In this work, we use the W3C PROV-DM specification [74] because it is pervasive and facilitates the exchange of provenance between deployments. An example PROV-DM graph of a malicious binary is shown in Figure 1. This graph describes an attack in which a binary running with root privilege reads several sensitive system files, then edits those files in an attempt to gain persistent access to the host. Edges encode relationships between nodes, pointing backwards into the history of system execution. Writing to an object triggers the creation of a second object node with an incremented version number. This particular provenance graph could serve as a valuable forensics tool, allowing system administrators to better understand the nature of a network intrusion.
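The versioning rule in this graph model can be illustrated with a minimal user-space sketch in C. It is a toy under stated assumptions, not LPM code: the names prov_graph, record_read, and record_write are hypothetical, storage is a pair of fixed-size arrays, and object names are borrowed pointers that the caller must keep alive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OBJECTS 16
#define MAX_EDGES   64

/* Two of the PROV-DM relationships that appear in Figure 1. */
enum rel { USED, WAS_GENERATED_BY };

/* An edge links an activity (process) and one version of an object. */
struct edge {
    enum rel type;
    int pid;      /* the activity */
    int obj;      /* index into prov_graph.objects */
    int version;  /* object version the edge refers to */
};

/* Hypothetical toy graph: object names plus their current versions. */
struct prov_graph {
    const char *objects[MAX_OBJECTS];
    int version[MAX_OBJECTS];
    int n_objects;
    struct edge edges[MAX_EDGES];
    int n_edges;
};

/* Look up an object, creating it at version 0 when first seen. */
static int get_object(struct prov_graph *g, const char *name)
{
    assert(g != NULL && name != NULL);
    for (int i = 0; i < g->n_objects; i++)
        if (strcmp(g->objects[i], name) == 0)
            return i;
    if (g->n_objects == MAX_OBJECTS)
        return -1;
    g->objects[g->n_objects] = name;
    g->version[g->n_objects] = 0;
    return g->n_objects++;
}

static int add_edge(struct prov_graph *g, enum rel type, int pid,
                    int obj, int version)
{
    if (g->n_edges == MAX_EDGES)
        return -1;
    g->edges[g->n_edges++] = (struct edge){ type, pid, obj, version };
    return 0;
}

/* Read: the activity Used the object's current version.
 * Returns the version that was read, or -1 on failure. */
int record_read(struct prov_graph *g, int pid, const char *name)
{
    int obj = get_object(g, name);
    if (obj < 0 || add_edge(g, USED, pid, obj, g->version[obj]) < 0)
        return -1;
    return g->version[obj];
}

/* Write: a new object node with an incremented version
 * WasGeneratedBy the activity. Returns the new version, or -1. */
int record_write(struct prov_graph *g, int pid, const char *name)
{
    int obj = get_object(g, name);
    if (obj < 0 ||
        add_edge(g, WAS_GENERATED_BY, pid, obj, g->version[obj] + 1) < 0)
        return -1;
    return ++g->version[obj];
}
```

Because every write yields a fresh (object, version) node, a process that reads and then rewrites the same file produces edges to two distinct nodes rather than a cycle, which keeps the graph acyclic.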

2.1 Linux Security Modules

The Linux Security Modules (LSM) framework provides a set of standardized authorization hooks for implementing flexible mandatory access control in the Linux kernel [76]. LSM provides complete mediation of operations on key kernel data types by ensuring that an authorization step occurs prior to permitting these operations to execute. LSM is designed on the principle of generality; rather than offering a one-size-fits-all solution to access control, one of a number of different security modules provides the logic for the authorization hooks. The correctness of the authorization hooks’ placement throughout the kernel has been verified through both static and dynamic analysis [25, 43, 81]. In this work, we argue that the principled design of Linux Security Modules also serves as a strong foundation for the collection of whole-system provenance.

2.2 Trusted Computing

The secure deployment of our proposed system relies on the Trusted Platform Module (TPM)¹, a small cryptographic processor² available in many commodity systems today. The TPM provides a hardware-based root of trust for storing keys and measurements, and generating attestations, or evidence of a set of collected measurements, representing the current state of the system.

The measurements are SHA-1 hashes that are stored in Platform Configuration Registers (PCRs). These PCRs have a limited interface, only supporting the ability to add a new measurement, via the extend operation, or to reset them to their default value. When a new measurement is added to the PCR, the current value of the PCR and the new measurement are concatenated, and this value is hashed. The result of the hash is stored in the PCR, forming a hash chain of measurements. This ensures that an adversary cannot modify an already recorded measurement. This only considers the “normal” PCRs, and not the PCRs that support the dynamic root of trust for measurement (DRTM), where the PCRs start with a value of -1, and are set to 0 only after executing a set of CPU instructions that establish a secure execution environment³. For more information on the utility of this function, see [54].
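The extend operation can be sketched as follows. This is a minimal illustration, assuming a 64-bit register and using 64-bit FNV-1a as a stand-in hash so the example stays short; a real TPM PCR uses SHA-1 over 20-byte values, and FNV-1a offers no cryptographic security. The names toy_hash, pcr_extend, and pcr_replay are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in hash (64-bit FNV-1a). NOT cryptographically secure;
 * it replaces SHA-1 here purely to keep the sketch self-contained. */
static uint64_t toy_hash(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* extend: PCR_new = H(PCR_old || measurement) */
uint64_t pcr_extend(uint64_t pcr, uint64_t measurement)
{
    uint8_t buf[16];
    memcpy(buf, &pcr, 8);
    memcpy(buf + 8, &measurement, 8);
    return toy_hash(buf, sizeof buf);
}

/* Replay a measurement list from the reset value (0 in this toy)
 * to obtain the register value a verifier would expect to see. */
uint64_t pcr_replay(const uint64_t *measurements, size_t n)
{
    assert(measurements != NULL || n == 0);
    uint64_t pcr = 0;
    for (size_t i = 0; i < n; i++)
        pcr = pcr_extend(pcr, measurements[i]);
    return pcr;
}
```

Since each register value depends on every earlier measurement, a verifier that replays the reported measurement list and compares the result with the register value can detect a list that has been altered, provided the underlying hash is collision resistant.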

1 See http://www.trustedcomputinggroup.org/developers/trusted_platform_module

2 The TPM is not designed as a cryptographic co-processor that accelerates cryptographic operations. Instead, it provides a basis for establishing trust in a system.

3 See https://www.kernel.org/doc/Documentation/intel_txt.txt

In addition to storing measurements, the TPM provides a reporting mechanism that can be used to prove the integrity of the system (at least the parts that have been measured). This is done via the quote operation, which can also be called an attestation. The TPM quote is a digitally signed “proof” of the set of measurements recorded in the Platform Configuration Registers. A client wishing to validate the set of measurements can connect to the system, and request a quote. As part of the request, the client provides the list of PCRs he is interested in validating, and a nonce. The TPM takes these, reads the values in the PCRs, and signs the nonce and PCR values. This is then sent to the client for validation.

One use of this quote mechanism is to validate the set of software binaries that are loaded on the system, as is done with the Linux Integrity Measurement Architecture [65]. Another use is to validate the integrity of critical boot files, and provide a trusted, or secure, boot of the system. Several TPM-aware bootloaders exist, including TrustedGRUB⁴, OSLO [46], and Intel tboot⁵. All three support a trusted boot mechanism, where the system measures the boot-time files (the kernel and initial RAM disk, and if present the hypervisor), and stores the measurements in the TPM for later verification. Secure boot, as supported by Intel tboot, takes this one step further, and stores a policy in the TPM’s NVRAM. When the system boots, tboot will compare the set of measurements to the stored policy, and only if the measurements match will the system boot. The execution of the bootloader relies on the DRTM secure execution environment to protect the validation of measurements.

An additional feature of the TPM is the ability to seal data to specific measurements. What this means is that the data is encrypted using a key that can only be accessed if the PCR values are set to specific values, i.e., the measurements match some set of known values. One use for this is to seal an encryption key for a hard drive to the PCR values that represent the kernel and initial RAM disk image for the approved OS. This ensures that only the approved OS can access the sealed data on the drive, providing a secure-boot mechanism for systems that are TPM-aware but lack support for Intel’s tboot.

4 See http://sf.net/projects/trustedgrub

5 See http://sf.net/projects/tboot

2.3 Data Loss Prevention

Data Loss Prevention (DLP) is enterprise software that seeks to minimize the loss and exfiltration of sensitive data and intellectual property by monitoring and controlling information flows in large, complex organizations [5]. As encryption can be used to protect data at rest from unauthorized access, the true DLP challenge involves preventing leakage at the hands of authorized users, both malicious and well-meaning agents. This latter group is a surprisingly big problem in the fight to control an organization’s intellectual property; a 2013 study conducted by the Ponemon Institute found that over half of companies’ employees admitted to emailing intellectual property to their personal email accounts, with 41 percent admitting to doing so on a weekly basis [8]. Employees’ reasons for doing so are at times completely innocent, such as wishing to be able to work from home on a personal device.

Enterprise DLP software is therefore comprised of a variety of security mechanisms, including system monitoring tools and access control mechanisms [1, 2, 3, 5, 7].

System monitoring involves discovering where sensitive data is stored and monitoring its legitimate flow through a system. Access control involves preventing those flows from becoming illegitimate, as specified by the data loss policy, by blocking data from flowing past controlled boundaries.

DLP systems are proprietary and are marketed so as to abstract away the complex details of their operation, so we cannot offer a complete explanation of their core features. However, some of the mechanisms in such systems are known. Many DLP products use a regular expression-based approach to identify sensitive data, operating similarly to a general-purpose version of Cornell’s Spider⁶. For example, in PCI compliance⁷, DLP might attempt to identify credit card numbers in outbound emails by searching for 16-digit numbers that pass a Mod-10 validation check [51]. Of course, other 16-digit numbers that are not actually credit cards, perhaps a purchase order number, will occasionally pass the validation step, inevitably leading to false positives. This approach is also unable to differentiate between legitimate sensitive data and training data, the widespread distribution of which is common in enterprises with large development teams.
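The Mod-10 (Luhn) check mentioned above can be sketched in C as follows. The function name looks_like_card_number is hypothetical, and the sketch covers only the checksum step; how a commercial DLP product locates candidate digit strings is not something the text above specifies.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if s is exactly 16 decimal digits that pass the
 * Luhn (Mod-10) check. */
bool looks_like_card_number(const char *s)
{
    assert(s != NULL);
    if (strlen(s) != 16)
        return false;

    int sum = 0;
    for (int i = 0; i < 16; i++) {
        if (!isdigit((unsigned char)s[i]))
            return false;
        int d = s[i] - '0';
        /* Counting from the rightmost digit, every second digit is
         * doubled. In a 16-digit string these are the even indexes
         * counted from the left. */
        if (i % 2 == 0) {
            d *= 2;
            if (d > 9)
                d -= 9;   /* same as summing the two digits of d */
        }
        sum += d;
    }
    return sum % 10 == 0;
}
```

The string "1234567812345670" is not a payment card number, yet it passes this check, which illustrates the false positives described above: the checksum admits one in ten of all 16-digit strings.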

6 See http://www2.cit.cornell.edu/security/tools

7 See https://www.pcisecuritystandards.org

Other DLP systems use a label-based approach to identify sensitive data, tagging document metadata with security labels. The Titus system accomplishes this by having company employees manually annotate the documents that they create [6]; plug-ins for applications (e.g., Microsoft Office and Outlook) then prevent the document from being transmitted to or opened by other employees that lack the necessary clearance. Because both the classification and enforcement mechanisms are application-specific, as opposed to system wide, sensitive information can quickly become unlabeled by passing through one of many unmonitored channels. These existing DLP approaches are therefore difficult to configure and prone to failure, primarily because they cannot provide complete observation of a system’s information flows. Thus, DLP systems have historically offered marginal utility at great price.

3 Linux Provenance Modules

To serve as the foundation for secure provenance-aware systems, we present Linux Provenance Modules (LPM). LPM is a generic framework that creates a provenance collection layer in the kernel. We present a working definition for the provenance our system will collect in §3.1. In §3.2 we consider the capabilities and aims of a provenance-aware adversary, and identify security and design goals in §3.3. The LPM design is presented in §3.4, and in §3.5 we demonstrate its secure deployment.

3.1 Defining Whole-System Provenance

In the design of LPM, we adopted a model for whole-system provenance⁸ that is broad enough to accommodate the needs of a variety of existing provenance projects. To arrive at a definition, we inspect four past proposals that collected broadly scoped provenance: SPADE [36], LineageFS [66], PASS [56], and Hi-Fi [61]. SPADE provenance is structured around primitive operations of system activities with data inputs and outputs. It instruments file and process system calls, and associates each call to a process ID (PID), user identifier, and network address. LineageFS uses a similar definition, associating process IDs with the file descriptors that the process reads and writes. PASS associates a process’s output with references to all input files and the command line and process environment of the process; it also appends out-of-band knowledge such as OS and hardware descriptions, and random number generator seeds, if provided. In each of these systems, networking and IPC activity is primarily reflected in the provenance record through manipulation of the underlying file descriptors. Hi-Fi takes an even broader approach to provenance, treating non-persistent objects such as memory, IPC, and network packets as principal objects.

We observe that, in all instances, provenance-aware systems are exclusively concerned with operations on controlled data types, which are identified by Zhang et al. as files, inodes, superblocks, socket buffers, IPC messages, IPC message queues, semaphores, and shared memory [81]. Because controlled data types represent a superset of the objects tracked by system layer provenance mechanisms, we define whole-system provenance as a complete description of agents (users, groups) controlling activities (processes) interacting with controlled data types during system execution.

8 This term is coined in [61], but its definition is implicit and the correctness of the system is not proven. We explicitly define the requirements of a collection mechanism for whole-system provenance in this work.

3.2 Threat Model & Assumptions

We consider an adversary that has gained remote access to a provenance-aware host or network. Once inside the system, the attacker may attempt to remove provenance records, insert spurious information into those records, or find gaps in the provenance monitor’s ability to record information flows. A network attacker may also attempt to forge or strip provenance from data in transit. Because captured provenance can be put to use in other applications, the adversary’s goal may even be to target the provenance monitor itself. The implications and methods of such an attack are domain-specific. For example:

• Scientific Computing: An adversary may wish to manipulate provenance in order to commit fraud, or to inject uncertainty into records to trigger a “Climategate”-like controversy [63].

• Access Control: When used to mediate access decisions [12, 58, 59, 60], an attacker could tamper with provenance in order to gain unauthorized access, or to perform a denial-of-service attack on other users by artificially escalating the security level of data objects.

• Networks: Provenance metadata can also be associated with packets in order to better understand network events in distributed systems [11, 82, 83]. Coordinating multiple compromised hosts, an attacker may attempt to send unauthenticated messages to avoid provenance generation and to perform data exfiltration.

We define a provenance trusted computing base (TCB) to be the kernel mechanisms, provenance recorder, and storage back-ends responsible for the collection and management of provenance. Provenance-aware applications are not considered part of the TCB. We make the following assumption with regard to the TCB. In Linux, kernel modules have unrestricted access to kernel memory, meaning that there is no mechanism for protecting LPM from the rest of the kernel. The kernel code is therefore trusted; we assume that the stock kernel will not seek to tamper with the TCB. However, we do consider the possibility that the kernel could be compromised after installation through its interactions with user space applications. To facilitate host attestation in distributed environments, we also assume access to a Public Key Infrastructure (PKI) for provenance-aware hosts to publish their public signing keys.

3.3 System Goals

We set out to provide the following security assurances in the design of our system-layer provenance collection mechanism. McDaniel et al. liken the needs of a secure provenance monitor [55] to the reference monitor guarantees laid out by Anderson [10]: complete mediation, tamperproofness, and verifiability. We define these guarantees as follows:

G1 Complete. Complete mediation for provenance has been discussed elsewhere in the literature in terms of assuring completeness [39]: that the provenance record be gapless in its description of system activity. To facilitate this, LPM must be able to observe all information flows that pass through controlled data types.

G2 Tamperproof. As many provenance use cases involve enhancing system security, LPM will be an adversarial target. The TCB must therefore be impervious to disabling or manipulation by processes in user space.

G3 Verifiable. The functionality of LPM must be verifiably correct. Additionally, local and remote users should be able to attest whether the host with which they are communicating is running the secured provenance-aware kernel.

Through surveying past work in provenance-aware systems, we identify the following additional goals to support whole-system provenance:

G4 Authenticated Channel. In distributed environments, provenance-aware systems must provide a means of assuring authenticity and integrity of provenance as it is communicated over open networks [12, 55, 61, 82]. While we do not seek to provide a complete distributed provenance solution in LPM, we do wish to provide the required building blocks within the host for such a system to exist. LPM must therefore be able to monitor every network message that is sent or received by the host, and reliably explain these messages to other provenance-aware hosts in the network.

G5 Attested Disclosure. Layered provenance, where additional metadata is disclosed from higher operational layers, is a desirable feature in provenance-aware systems, as applications are able to report workflow semantics that are invisible to the operating system [57]. LPM must provide a gateway for upgrading low integrity user space disclosures before logging them in the high integrity provenance record. This is consistent with the Clark-Wilson Integrity model for upgrading or discarding low integrity inputs [21].

In order to bootstrap trust in our system, we have implemented LPM as a parallel framework to Linux Security Modules (LSM) [75, 76]. Building on these results, we show in Section 4 that this approach allows LPM to inherit the formal assurances that have been verified for the LSM architecture.

[Figure 2 diagram. Kernel space: Prov. Hooks, NF Hooks, Prov. Module, Relay Buffer, IMA, TPM. User space: Provenance Recorder, Prov. Aware Applications, and the storage back-ends Neo4j, SNAP, GZip, and SQL. Labeled flows: System Provenance, Workflow Provenance, Integrity Measurements.]

Figure 2: LPM places a set of provenance hooks around the kernel. A provenance module registers to control these hooks, and also registers Netfilter hooks. The module transmits via a relay to a recorder in user space, which uses one of several storage back-ends. The recorder is also responsible for evaluating the integrity of workflow provenance prior to storing it.

3.4 Design & Implementation

An overview of the LPM architecture is shown in Figure 2. The LPM patch places a set of provenance hooks around the kernel; a provenance module then registers to control these hooks, and also registers several Netfilter hooks; the module then observes system events and transmits information via a relay buffer to a provenance recorder in user space that interfaces with a datastore. The recorder also accepts disclosed provenance from applications after verifying their correctness using the Integrity Measurement Architecture (IMA) [65].

In designing LPM, we first considered using an experimental patch to the LSM framework that allows “stacking” of LSM modules⁹. However, at this time, no standard exists for handling cases in which modules make conflicting decisions, creating the potential for unpredictable behavior. We also felt that dedicated provenance hooks were necessary; by collecting provenance after LSM authorization routines, we ensure that the provenance history is an accurate description of authorized system events. If provenance collection occurred during authorization, as would be the case with stacked LSMs, it would not be possible to provide this property.

9 See https://lwn.net/Articles/518345/

[Figure 3 diagram. A Text Editor in user space issues the open system call. In kernel space the call passes through Look Up Inode, Error Checks, DAC Checks, LSM Hook, LPM Hook, and Access Inode. The LSM hook asks the LSM module “Authorized?” (examine context; does request pass policy? grant or deny). The LPM hook asks the LPM module “Prov collected?” (examine context; collect provenance; if successful, grant).]

Figure 3: Hook Architecture for the open system call. Provenance is collected after DAC and LSM checks, ensuring that it accurately reflects system activity. LPM will only deny the operation if it fails to generate provenance for the event.

Hooks    Count  Purpose
BPRM     5      Observe program execution operations.
Cred     10     Manage credentials.
Dentry   3      Observe dentry operations.
File     20     Observe file operations.
Inode    30     Observe inode operations.
IPC      10     Observe System V IPC Message Queues.
Netlink  2      Observe Netlink messages.
SB       19     Observe superblock operations.
SEM      5      Observe System V Semaphores.
SHM      5      Observe System V Shared Memory Segments.
Socket   35     Observe socket and network operations.
Task     24     Observe task operations.
Unix     2      Observe Unix Domain Networking.
VM       3      Observe virtual memory use.
Kernel   5      Observe other misc. kernel access.

Table 1: Summary of Provenance Hooks. LPM places a total of 178 unique hooks around the kernel.

3.4.1 Provenance Hooks

The LPM patch introduces a set of hook functions in the Linux kernel. These hooks behave similarly to the LSM framework’s security hooks in that they facilitate modularity, and default to taking no action unless a module is enabled. Each provenance hook is placed directly beneath a corresponding security hook. The return value of the security hook is checked prior to calling the provenance hook, thus assuring that the requested activity has been authorized prior to provenance capture; we consider the implications of this design in Section 4. A workflow for the hook architecture is depicted in Figure 3. The LPM patch places over 170 provenance hooks, one for each of the LSM authorization hooks. In addition to the hooks that correspond to existing security hooks, we also support Pohly et al.’s Hi-Fi [61] hook that is necessary to preserve Lamport timestamps on network messages [50].

     1  int vfs_readdir(struct file *file, filldir_t filler, void *buf)
        {
     2      struct inode *inode = file->f_path.dentry->d_inode;
     3      int res = -ENOTDIR;
     4      if (!file->f_op || !file->f_op->readdir)
     5          goto out;
     6
     7      res = security_file_permission(file, MAY_READ);
     8      if (res)
     9          goto out;
    10
    11      res = provenance_file_permission(file,
                    file->f_provenance, MAY_READ);
    12      if (res)
    13          goto out;
    14
    15      res = mutex_lock_killable(&inode->i_mutex);
    16      if (res)
    17          goto out;
    18
    19      res = -ENOENT;
    20      if (!IS_DEADDIR(inode)) {
    21          res = file->f_op->readdir(file, buf, filler);
    22          file_accessed(file);
    23      }
    24      mutex_unlock(&inode->i_mutex);
    25  out:
    26      return res;
    27  }
    28
    29  EXPORT_SYMBOL(vfs_readdir);

Figure 4: Example provenance hook from the vfs_readdir function in fs/readdir.c. LPM inserts Lines 11-13.

A complete list of provenance hooks is included in Table 1. An example hook placement is shown in Figure 4. The vfs_readdir function attempts to read a file’s directory, placing it in the buf pointer. LPM introduces Lines 11-13 of the function. Immediately after security_file_permission affirms that the subject has permission to take this action (Lines 7-9), provenance_file_permission is called so that LPM can record the event. The return value of the provenance hook is checked before the operation is permitted. If the hook returns an error code such as ENOMEM, indicating that the attempt to allocate memory for the provenance record failed, vfs_readdir terminates without permitting the operation, thus ensuring that provenance is gapless (Goal G1).
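The ordering of the two hooks can be mimicked in user space with function pointers. The sketch below is an illustration only and does not reproduce kernel interfaces: the type hook_fn, the registration functions, mediated_access, and the three example modules are all hypothetical names introduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook signature: 0 allows, a negative errno stops. */
typedef int (*hook_fn)(const char *object, int mask);

/* With no module registered, a hook defaults to taking no action. */
static hook_fn security_hook;    /* stands in for an LSM hook */
static hook_fn provenance_hook;  /* stands in for an LPM hook */

void register_security_module(hook_fn fn)   { security_hook = fn; }
void register_provenance_module(hook_fn fn) { provenance_hook = fn; }

static int call_hook(hook_fn fn, const char *object, int mask)
{
    return fn ? fn(object, mask) : 0;
}

/* Authorize first, then collect provenance, then perform the
 * operation. *performed is set to 1 only if both hooks succeed. */
int mediated_access(const char *object, int mask, int *performed)
{
    assert(performed != NULL);
    *performed = 0;

    int res = call_hook(security_hook, object, mask);
    if (res)
        return res;      /* denied: no provenance is recorded */

    res = call_hook(provenance_hook, object, mask);
    if (res)
        return res;      /* no record, so no operation */

    *performed = 1;
    return 0;
}

/* Example modules used to exercise the chain. */
int records_logged;

int lsm_deny(const char *object, int mask)
{
    (void)object; (void)mask;
    return -EACCES;
}

int lpm_record(const char *object, int mask)
{
    (void)object; (void)mask;
    records_logged++;
    return 0;
}

int lpm_out_of_memory(const char *object, int mask)
{
    (void)object; (void)mask;
    return -ENOMEM;
}
```

Two properties follow from the ordering. A request denied by the security hook never reaches the provenance hook, so the record describes only authorized events; and a provenance failure aborts the operation, so no event proceeds without a record.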

3.4.2 Netfilter Hooks

LPM uses Netfilter hooks to implement a cryptographic message commitment protocol. In Hi-Fi, provenance-aware hosts communicated by embedding a provenance sequence number in the IP options field [62] of each outbound packet [61]. This approach allowed Hi-Fi to communicate as normal with hosts that were not provenance-aware, but unfortunately was not secure against a network adversary. In LPM, provenance sequence numbers are replaced with Digital Signature Algorithm (DSA) signatures, which are space-efficient enough to embed in the IP Options field. We have implemented full DSA support in the Linux kernel by creating signing routines to use with the existing DSA verification function. DSA signing and verification occur in the Netfilter inet_local_out and inet_local_in hooks. In inet_local_out, LPM signs over the immutable fields of the IP header, as well as the IP payload (Figure 5). In inet_local_in, LPM checks for the presence of a signature, then verifies the signature against a configurable list of public keys. If signature verification fails, the packet is dropped before it reaches the recipient application, thus ensuring that there are no breaks in the continuity of the provenance log. The key store for provenance-aware hosts is obtained from a PKI and transmitted to the kernel during the boot process by writing to securityfs. LPM registers the Netfilter hooks with the highest priority levels, such that signing occurs just before transmission (i.e., after all other IPTables operations), and signature verification occurs just after the packet enters the interface (i.e., before all other IPTables operations).

[Figure 5 diagram. A Web Browser in user space sends data through the kernel path TCP Send Packet, IP Send Packet, Update IP Checksum, Netfilter Hook Iterate, and on to the Network Card. During hook iteration, the IPTables hook asks IPTables “Ok with you?” (examine routing table; does request pass policy? grant or deny), and the LPM hook asks the LPM module “Prov embedded?” (sign IP header and payload; embed in IP Options; update IP checksum).]

Figure 5: Packet transmission in LPM’s message commitment protocol. Netfilter hooks are used to sign over the IP packet and then embed the signature into the IP Options field.
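The text does not enumerate which header fields are treated as immutable. As a sketch only, the following assumes the convention of the IPsec Authentication Header (RFC 4302), which treats the IPv4 DSCP/ECN byte, flags, fragment offset, TTL, and header checksum as mutable in transit and zeroes them before computing its integrity value. The function name ipv4_signable_header is hypothetical, and the sketch ignores options, including the change to the header length and total length fields that embedding a signature would cause.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPV4_MIN_HDR 20

/* Copy a 20-byte IPv4 header and zero the fields that routers may
 * rewrite, so that sender and receiver hash identical bytes.
 * Assumption: the mutable-field list follows IPsec AH (RFC 4302). */
void ipv4_signable_header(const uint8_t in[IPV4_MIN_HDR],
                          uint8_t out[IPV4_MIN_HDR])
{
    assert(in != NULL && out != NULL);
    memcpy(out, in, IPV4_MIN_HDR);
    out[1]  = 0;                /* DSCP and ECN (formerly TOS) */
    out[6]  = 0; out[7]  = 0;   /* flags and fragment offset */
    out[8]  = 0;                /* time to live */
    out[10] = 0; out[11] = 0;   /* header checksum */
}
```

A signer would compute the signature over this normalized header followed by the payload, and a verifier would normalize the received header the same way before checking it. A decremented TTL then leaves the signed bytes unchanged, while a rewritten source address does not.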

3.4.3 Provenance Modules

Here, we introduce two of our own provenance modules (Provmon, SPADE), as well as briefly mention the work of our peers (UPTEMPO):

• Provmon. Provmon is an extended port of the Hi-Fi security module [61]. The original Hi-Fi code base was 1,566 lines of code, requiring 723 lines to be modified in the transition. Our extensions introduced 728 additional lines of code. The process of porting did


PROV-DM | Graph Representation | LPM Hook | Comment
--------|----------------------|----------|--------
Activity | Process | credfork | Create a new Object node on each fork.
 | | bprm_check_provenance | Append process execution info to this fork’s Object node.
Agent | User | task_fix_setuid | Get/Create one global Agent node representing User (UID).
 | | If no task_fix_setuid | Get the Agent node from Parent credential.
Object (Entity) | Inode:Version, or | d_instantiate | Create a new Object node for Inode at version zero.
 | | file_permission (mask=W,A) | Create a new Object node for Inode at incremented version.
 | IP Address:Port, etc. | socket_{send/recv}msg (proto=UNIX) | Get/Create an Object node for each socket.
 | | socket_{send/recv}msg (proto=INET) | Get/Create an Object node for each remote (IP,Port) tuple.
 | | shm_shmat | Get/Create an Object node for each shared memory address.
WasGeneratedBy | WasGeneratedBy edge between Object and Activity | file_permission (mask=W,A), socket_sendmsg | Relation between an Activity and an Output Object.
Used | Used edge between Object and Activity | file_permission (mask=R), file_mmap, socket_recvmsg | Relation between an Activity and an Input Object.
WasControlledBy | WasControlledBy edge between Activity and Agent | credfork | Relation between an Activity and an Agent.

Table 2: A mapping between PROV-DM’s core concepts (types and relationships), the LPM kernel events that trigger their creation, and the resulting graph representation.

not affect the module’s functionality, although we have subsequently extended the Hi-Fi protocol to capture additional lineage information:

– File Versioning. The original Hi-Fi protocol did not track version information for files, leading to uncertainty as to the exact contents of a file at the time it was read. Accurately recovering this information in user space was not possible due to race conditions between kernel events. Because versioning is necessary to break cycles in provenance graphs [56], we have added a version field to the provenance context for inodes, which is incremented each time the inode is written.

– Network Context. Hi-Fi omitted remote host address information for network events, reasoning that source information could be forged by a dishonest agent in the network. These human-interpretable data points were replaced with randomly generated identifiers that were associated with each packet. We found, however, that these identifiers could not be interpreted without remote address information, and incorporated the recording of remote IP addresses and ports into Provmon.

• SPADE. The SPADE system is an increasingly popular option for provenance auditing, but collecting provenance in user space limits SPADE’s expressiveness and creates the potential for incomplete provenance. To address this limitation, we have created a mechanism that reports LPM provenance into SPADE’s Domain-Specific Language pipe [36]. This permits the collection of whole-system provenance while simultaneously leveraging SPADE’s existing storage, remote query, and visualization utilities.

• Using Provenance to Expedite MAC Policies (UPTEMPO). Using LPM as a collection mechanism, Moyer et al. investigate provenance analysis as a means of administering Mandatory Access Control (MAC) policies [68]. UPTEMPO first observes system execution in a sterile environment, aggregating LPM provenance in a centralized data store. It then recovers the implicit information flow policy by mining the provenance store to generate a MAC policy for the distributed system, decreasing both administrator effort and the potential for misconfiguration.
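The file versioning extension described for Provmon above can be illustrated with a small Python model. This is a sketch of the idea only, not Provmon's kernel data structures; the class `VersionedStore` and its method names are hypothetical. It shows why a process that reads and then rewrites the same file does not introduce a cycle: the read and the write refer to different versioned object nodes.

```python
from collections import defaultdict


class VersionedStore:
    """Hypothetical model of per-inode version counters."""

    def __init__(self):
        self.version = defaultdict(int)  # inode -> current version, starting at zero
        self.edges = []                  # (source node, relation, target node)

    def read(self, proc, inode):
        # A read uses whichever version exists at the time of the read.
        self.edges.append((proc, "Used", (inode, self.version[inode])))

    def write(self, proc, inode):
        # A write creates a fresh object node at the incremented version.
        self.version[inode] += 1
        self.edges.append(((inode, self.version[inode]), "WasGeneratedBy", proc))


store = VersionedStore()
store.read("sort", "f")   # sort uses f at version 0
store.write("sort", "f")  # sort generates f at version 1
```

Without the version field, both edges would attach to a single node for `f`, and the graph would contain the cycle f, sort, f.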

3.4.4 Provenance Recorders

LPM provides modular support for different storage back-ends through provenance recorders. To prevent an infinite provenance loop, recorders are flagged as provenance-opaque [61] using the security.provenance extended attribute, which is checked by LPM before creating a new event. Each recorder was designed to be as agnostic to the active LPM as possible, making it easy to adapt to new modules. We currently provide 4 provenance recorders:

Gzip incurs low storage overheads and fast insertion speeds. On our test bed, we observed this recorder processing up to 400,000 events per second from the Provmon provenance stream. However, because the provenance is not stored in an easily queried form, this back-end is best suited for environments where queries are an offline process.

PostgreSQL provides limited support for online querying, but only moderate insertion and query speeds due to the known difficulties of representing provenance in relational databases [42]. Under similar conditions as the Gzip recorder, a representative insertion speed for this recorder was 2,200 events per second.

Neo4j provides full support for the W3C PROV-DM model [74]. Although Neo4j is a popular graph database, we were unable to find a configuration that was efficient enough to keep up with Provmon’s provenance stream. Under similar conditions, a representative insertion speed for this recorder was just 28 events per second.

SNAP: To create graph storage that was efficient enough for LPM, we used the SNAP graphing library¹⁰ to design

¹⁰ See http://snap.stanford.edu


    1 [20a9] bprm_check_provenance 54558 "sort a > b"
    2 [20a9] task_fix_setuid uid 500 gid 500
    3 [20a9] file_permission R 7374:librt-2.12.so
    4 [20a9] file_permission R 33675:libc-2.12.so
    5 [20a9] file_permission R 33374:libpthread-2.12.so
    6 [20a9] file_permission R 16520304:a
    7 [20a9] file_permission W 16520302:b

(a) LPM Provenance Stream

[Graph: the activity "sort a > b" is connected by Used edges to librt-2.12.so, libc-2.12.so, libpthread-2.12.so, and a; by a WasControlledBy edge to the agent 500; and by a WasGeneratedBy edge to b.]

(b) PROV-DM Graph Representation

Figure 6: The LPM provenance stream of events can be transformed into standardized graph models. Here, a sort command resulting in kernel output 6a can be mapped to PROV-DM graph 6b. This is necessary to facilitate efficient querying, and also to permit the exchange of provenance between deployments.

a recorder that maintains an in-memory graph database that is fully compliant with the W3C PROV-DM Model [74]. We have observed insertion speeds of over 150,000 events per second using the SNAP recorder, and highly efficient querying as well. This recorder is further evaluated in Section 6.
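To make the trade-off of the simplest recorder concrete, the following Python sketch models a Gzip-style recorder: insertion is a cheap append to a compressed stream, while querying requires decompressing and scanning the whole log offline. The class `GzipRecorder` and the function `replay` are hypothetical names, and the sketch makes no claim about how LPM's actual recorder is implemented.

```python
import gzip
import io


class GzipRecorder:
    """Appends provenance events, one per line, to a gzip-compressed stream."""

    def __init__(self, fileobj):
        self._out = gzip.GzipFile(fileobj=fileobj, mode="wb")
        self.count = 0

    def record(self, event: str) -> None:
        # Insertion is a sequential append; no index is maintained.
        self._out.write(event.encode() + b"\n")
        self.count += 1

    def close(self) -> None:
        self._out.close()


def replay(raw: bytes) -> list:
    """Offline query path: decompress the whole log and return its events."""
    return gzip.decompress(raw).decode().splitlines()


buffer = io.BytesIO()
recorder = GzipRecorder(buffer)
recorder.record("[20a9] file_permission R 16520304:a")
recorder.record("[20a9] file_permission W 16520302:b")
recorder.close()
events = replay(buffer.getvalue())
```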

3.4.5 Graph Representation

A full summary of how LPM creates a PROV-DM-compliant graph is shown in Table 2. The whole-system provenance graph is made up of subgraphs where Activities, Agents, and Object nodes interact. The finest granularity of these subgraphs is a system fork. Each new system fork is immediately associated with an Activity, Agent, and WasControlledBy relation, as described above. The input (Used) and output (WasGeneratedBy) objects of an activity link together subgraphs. Using these relations, PROV-DM’s WasDerivedFrom, WasInformedBy, WasAttributedTo, and WasAssociatedWith relations are implicitly represented [74]. ActedOnBehalfOf can also be used to encode when the EUID is set.

LPM generates provenance as a stream of kernel events which need to be processed into a graph representation before they can be efficiently queried. Figure 6 shows an example of this transformation for the command sort a > b. Line 1 of the LPM stream creates the sort activity node, and creates a WasControlledBy relation between the new activity node and the parent fork’s agent node; line 2 updates the WasControlledBy relation so that it directs to the 500 agent node; lines 3-6 get or create object nodes with Used relations; and line 7 creates the b object and WasGeneratedBy relation.
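The transformation just described can be sketched in a few lines of Python. The stream text is copied from Figure 6a; the parser, the function name `build_graph`, the edge-tuple layout, and the placeholder `parent-agent` node are assumptions made for illustration and do not reflect the format handled by LPM's recorders.

```python
import shlex

STREAM = """\
1 [20a9] bprm_check_provenance 54558 "sort a > b"
2 [20a9] task_fix_setuid uid 500 gid 500
3 [20a9] file_permission R 7374:librt-2.12.so
4 [20a9] file_permission R 33675:libc-2.12.so
5 [20a9] file_permission R 33374:libpthread-2.12.so
6 [20a9] file_permission R 16520304:a
7 [20a9] file_permission W 16520302:b
"""


def build_graph(stream: str, parent_agent: str = "parent-agent") -> list:
    label = {}  # fork id -> activity label
    agent = {}  # fork id -> controlling agent
    edges = []  # (source, relation, target)
    for line in stream.strip().splitlines():
        _lineno, fork, hook, *args = shlex.split(line)
        if hook == "bprm_check_provenance":
            label[fork] = args[1]
            agent[fork] = parent_agent  # inherited until a setuid event is seen
        elif hook == "task_fix_setuid":
            agent[fork] = args[1]       # "uid 500 gid 500" -> "500"
        elif hook == "file_permission":
            mask, obj = args
            if mask == "R":
                edges.append((fork, "Used", obj))
            else:                       # any write-like mask is treated as output
                edges.append((obj, "WasGeneratedBy", fork))
    edges += [(f, "WasControlledBy", a) for f, a in agent.items()]
    # Replace fork ids with human-readable activity labels.
    return [(label.get(s, s), r, label.get(t, t)) for s, r, t in edges]


graph = build_graph(STREAM)
```

The result contains four Used edges, one WasGeneratedBy edge, and a single WasControlledBy edge that points to agent 500 rather than to the parent's agent, matching the walkthrough of lines 1 through 7.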

3.4.6 Workflow Provenance

To support layered provenance while preserving our security goals, we require a means of evaluating the integrity of user space provenance disclosures. To accomplish this, we extend the LPM Provenance Recorder to use the Linux Integrity Measurement Architecture (IMA) [44, 65]. IMA computes a cryptographic hash of each binary before execution, extends the measurement into a TPM Platform Configuration Register (PCR), and stores the measurement in kernel memory. This set of measurements can be used by the Recorder to make a decision about the integrity of a Provenance-Aware Application (PAA) prior to accepting the disclosed provenance. When a PAA wishes to disclose provenance, it opens a new UNIX domain socket to send the provenance data to the Provenance Recorder. The Recorder uses its own UNIX domain socket to recover the process’s pid, then uses the /proc filesystem to find the full path of the binary, then uses this information to look up the PAA in the IMA measurement list. The disclosed provenance is recorded only if the measurement of the PAA matches a known-good cryptographic hash.

[Figure 7 graph: the activity "mogrify -format jpg *.png" has Used edges to a.png and b.png and WasGeneratedBy edges to a.jpg and b.jpg; WasDerivedFrom edges link a.jpg to a.png and b.jpg to b.png.]

Figure 7: A provenance graph of image conversion. Here, workflow provenance (WasDerivedFrom) encodes a relationship that more accurately identifies the output files’ dependencies compared to only using kernel layer observations (Used, WasGeneratedBy).

As a demonstration of this functionality, we created a provenance-aware version of the popular ImageMagick utility.¹¹ ImageMagick contains a batch conversion tool for image reformatting, mogrify. Shown in Figure 7, mogrify reads and writes multiple files during execution, leading to an overtainting problem: at the kernel layer, LPM is forced to conservatively assume that all outputs were derived from all inputs, creating false dependencies in the provenance record. To address this, we extended the Provmon protocol to support a new message, provmsg_imagemagick_convert, which links an input file directly to its output file. When the recorder receives this message, it first checks the list of IMA measurements to confirm that ImageMagick is in a good state. If successful, it then annotates the existing provenance graph, connecting the appropriate input and output objects with WasDerivedFrom relationships. Our instrumentation of ImageMagick demonstrates that LPM supports layered provenance at no additional cost over other provenance-aware systems [36, 56], and does so in a manner that provides assurance of the integrity of the provenance log.

¹¹ See http://www.imagemagick.org
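The disclosure check described in this section (pid, then binary path, then IMA measurement, then known-good hash) can be sketched as follows. Everything here is a simplified stand-in: the dictionaries simulate the /proc lookup and the IMA measurement list, SHA-256 is an arbitrary choice of hash, and the class `Recorder` and its `disclose` method are hypothetical names rather than LPM's interfaces.

```python
import hashlib


def measure(binary: bytes) -> str:
    """Stand-in for an IMA measurement: a hash over the binary's contents."""
    return hashlib.sha256(binary).hexdigest()


class Recorder:
    def __init__(self, known_good: dict, ima_list: dict, proc_exe: dict):
        self.known_good = known_good  # binary path -> expected hash
        self.ima_list = ima_list      # binary path -> hash measured at load time
        self.proc_exe = proc_exe      # pid -> binary path (simulates /proc)
        self.edges = []               # accepted workflow provenance

    def disclose(self, pid: int, src: str, dst: str) -> bool:
        """Accept a 'dst was derived from src' disclosure only from a trusted binary."""
        path = self.proc_exe.get(pid)
        measured = self.ima_list.get(path)
        if measured is None or measured != self.known_good.get(path):
            return False  # unknown or modified application: reject the disclosure
        self.edges.append((dst, "WasDerivedFrom", src))
        return True


good = b"mogrify binary contents"
recorder = Recorder(
    known_good={"/usr/bin/mogrify": measure(good)},
    ima_list={"/usr/bin/mogrify": measure(good), "/tmp/evil": measure(b"evil")},
    proc_exe={100: "/usr/bin/mogrify", 200: "/tmp/evil"},
)
accepted = recorder.disclose(100, "a.png", "a.jpg")
rejected = recorder.disclose(200, "b.png", "a.jpg")
```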

3.5 Deployment

We now demonstrate how we used LPM in the deployment of a secure provenance-aware system.

3.5.1 Platform Integrity

We configured LPM to run on a physical machine with a Trusted Platform Module (TPM). The TPM provides a root of trust that allows for a measured boot of the system. The TPM also provides the basis for remote attestations to prove that LPM was in a known hardware and software configuration. The BIOS’s core root of trust for measurement (CRTM) bootstraps a series of code measurements prior to the execution of each platform component. Once booted, the kernel then measures the code for user space components (e.g., provenance recorder) before launching them, through the use of the Linux Integrity Measurement Architecture (IMA) [65]. The result is then extended into TPM PCRs, which forms a verifiable chain of trust that shows the integrity of the system via a digital signature over the measurements. A remote verifier can use this chain to determine the current state of the system using TPM attestation.
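The chain of trust built by extending measurements into a PCR can be illustrated with a short Python sketch. It assumes a TPM 1.2-style SHA-1 PCR that starts at all zeros and is updated as PCR_new = SHA1(PCR_old || measurement); the component names are placeholders, and the TPM version used in our deployment is not implied by this example.

```python
import hashlib


def extend(pcr: bytes, component: bytes) -> bytes:
    """Extend a PCR with the measurement (hash) of a component."""
    measurement = hashlib.sha1(component).digest()
    return hashlib.sha1(pcr + measurement).digest()


def measured_boot(components: list) -> bytes:
    pcr = bytes(20)  # assumption: the PCR is zeroed at platform reset
    for component in components:
        pcr = extend(pcr, component)  # each stage is measured before it runs
    return pcr


expected = measured_boot([b"bios", b"bootloader", b"kernel", b"recorder"])
tampered = measured_boot([b"bios", b"bootloader", b"evil-kernel", b"recorder"])
reordered = measured_boot([b"bootloader", b"bios", b"kernel", b"recorder"])
```

Because each value depends on all earlier ones, a verifier that knows the expected final value can detect both a modified component and a changed boot order.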

We configured the system with Intel’s Trusted Boot (tboot),⁵ which provides a secure boot mechanism, preventing the system from booting into an environment where critical components (e.g., the BIOS, boot loader, and the kernel) have been modified. Intel tboot relies on the Intel TXT instructions that provide a secure execution environment.³ For virtual environments, similar functionality can be provided on Xen via TPM sealing and the virtual TPM (vTPM),¹² which is bound to the physical

¹² See http://wiki.xenproject.org/wiki/Virtual_Trusted_Platform_Module_(vTPM)

    policy_module(uprovd, 1.0.0)

    ########################################
    #
    # Declarations
    #

    type uprovd_t;
    type uprovd_exec_t;
    init_daemon_domain(uprovd_t, uprovd_exec_t)

    permissive uprovd_t;

    type uprovd_log_t;
    logging_log_file(uprovd_log_t)

    type uprovd_rw_t;
    files_type(uprovd_rw_t)

    ########################################
    #
    # uprovd local policy
    #
    allow uprovd_t self:process { signal };

    allow uprovd_t self:fifo_file rw_fifo_file_perms;
    allow uprovd_t self:unix_stream_socket create_stream_socket_perms;

    manage_dirs_pattern(uprovd_t, uprovd_log_t, uprovd_log_t)
    manage_files_pattern(uprovd_t, uprovd_log_t, uprovd_log_t)
    logging_log_filetrans(uprovd_t, uprovd_log_t, { dir file })

    manage_dirs_pattern(uprovd_t, uprovd_rw_t, uprovd_rw_t)
    manage_files_pattern(uprovd_t, uprovd_rw_t, uprovd_rw_t)

    domain_use_interactive_fds(uprovd_t)

    files_read_etc_files(uprovd_t)

    miscfiles_read_localization(uprovd_t)

Figure 8: SELinux Policy Module for Provmon: Type Enforcement for Provmon’s user space daemon.

TPM of the host system. Additionally, we compiled support for IMA into the provenance-aware kernel, which is necessary in order for the LPM Recorder to be able to measure the integrity of provenance-aware applications.

3.5.2 Runtime Integrity

After booting into the provenance-aware kernel, the runtime integrity of the TCB (defined in §3.2) must also be assured. To protect the runtime integrity of the kernel, we deploy a Mandatory Access Control (MAC) policy, as implemented by Linux Security Modules. On our prototype deployments, we enabled SELinux’s MLS policy, the security of which was formally modeled by Hicks et al. [41]. Refining the SELinux policy to prevent Access Vector Cache (AVC) denials on LPM components required minimal effort; the only denial we encountered was when using the PostgreSQL recorder, which was


    /sys/kernel/debug/provenance0 -- gen_context(system_u:object_r:uprovd_rw_t,s0)

    /usr/bin/uprovd -- gen_context(system_u:object_r:uprovd_exec_t,s0)

    /var/log(/.*)? gen_context(system_u:object_r:uprovd_log_t,s0)

Figure 9: SELinux Policy Module for Provmon: File Contexts for Provmon’s user space daemon.

quickly remedied with the audit2allow tool. Preserving the integrity of LPM’s user space components, such as the provenance recorder, was as simple as creating a new policy module. We created a policy module to protect the LPM recorder and storage back-end using the sepolicy utility. Uncompiled, the policy module was only 135 lines. Excerpts from the policy are shown in Figures 8 and 9.

4 Security

In this section, we demonstrate that our system meets all of the required security goals for trustworthy whole-system provenance. In this analysis, we consider an LPM deployment on a physical machine that was enabled with the Provmon module and has been configured to the conditions described in Section 3.5.

Complete (G1). We defined whole-system provenance as a complete description of agents (users, groups) controlling activities (processes) interacting with controlled data types during system execution (§3.1). LPM attempts to track these system objects through the placement of provenance hooks (§3.4.1), which directly follow each LSM authorization hook. The LSM’s complete mediation property has been formally verified [25, 81]; in other words, there is an authorization hook prior to every security-sensitive operation. Because every interaction with a controlled data type is considered security-sensitive, we know that a provenance hook resides on all control paths to the provenance-sensitive operations. LPM is therefore capable of collecting complete provenance on the host.

It is important to note that, as a consequence of placing provenance hooks beneath authorization hooks, LPM is unable to record failed access attempts. However, inserting the provenance layer beneath the security layer ensures accuracy of the provenance record. Moreover, failed authorizations are a different kind of metadata than provenance because they do not describe processed data; this information is better handled at the security layer, e.g., by the SELinux Access Vector Cache (AVC) Log.

Tamperproof (G2). The runtime integrity of the LPM trusted computing base is assured via the SELinux MLS policy, and we have written a policy module that protects the LPM user space components (§3.5.2). Therefore, the only way to disable LPM would be to reboot the system into a different kernel; this action can be disallowed through secure boot techniques,⁵ and is detectable by remote hosts via TPM attestation (§3.5.1).

Verifiable (G3). While we have not conducted an independent formal verification of LPM, our argument for its correctness is as follows. A provenance hook follows each LSM authorization hook in the kernel. The correctness of LSM hook placement has been verified through both static and dynamic analysis techniques [25, 32, 43]. Because an authorization hook exists on the path of every sensitive operation to controlled data types, and LPM introduces a provenance hook behind each authorization hook, LPM inherits LSM’s formal assurance of complete mediation over controlled data types. That is, LPM can observe every sensitive operation on controlled data types in the kernel, satisfying our definition of whole-system provenance.

In the same way that LSM and SELinux have been independently verified [41, 45, 72], the logic of individual provenance modules must also be modeled and verified before a formal guarantee about the collected provenance can be made. A formal model for the Provmon system is not provided in this work. However, because Provmon is significantly less complex than the SELinux security module, we believe that formal modeling is possible. Our Provmon implementation is 2,294 lines of code, while SELinux is 17,581 lines of code.

Authenticated Channel (G4). Through use of Netfilter hooks [71], LPM embeds a DSA signature in every outbound network packet. Signing occurs immediately prior to transmission, and verification occurs immediately after reception, making it impossible for an adversary-controlled application running in user space to interfere. For both transmission and reception, the signature is invisible to user space. Signatures are removed from the packets before delivery, and LPM feigns ignorance that the options field has been set if get_options is called. Hence, LPM can enforce that all applications participate in the commitment protocol.

Prior to implementing our own message commitment protocol in the kernel, we investigated a variety of existing secure protocols. The integrity and authenticity of provenance identifiers could also be protected via IPsec [47], SSL tunneling [4], or other forms of encapsulation [11, 82]. We elected to move forward with our approach because 1) it ensures the monitoring of all processes and network events, including non-IP packets, 2) it does not change the number of packets sent or received, ensuring that our provenance mechanism is minimally invasive to the rest of the Linux network stack, and 3) it preserves compatibility with non-LPM hosts. Alternatives to DSA signing include HMAC [13], which offers better performance but requires pairwise keying and sacrifices the non-repudiation property; BLS, which approaches the theoretical maximum security parameter per byte of signature [16]; and online/offline signature schemes [19, 30, 33, 69].

Authenticated Disclosures (G5). We make use of IMA to protect the channel between LPM and provenance-aware applications wishing to disclose provenance. IMA is able to prove to the provenance recorder that the application was unmodified at the time it was loaded into memory, at which point the recorder can accept the provenance disclosure into the official record. If the application is known to be correct (e.g., through formal verification), this is sufficient to establish the runtime integrity of the application. However, if the application is compromised after execution, this approach is unable to protect against provenance forgery.

A separate consideration for all of the above security properties is Denial of Service (DoS) attacks. DoS attacks on LPM do not break its security properties. If an attacker launches a resource exhaustion attack in order to prevent provenance from being collected, all kernel operations will be disallowed and the host will cease to function. If a network attacker drops packets, or strips them of their provenance identifiers, the packets will not be delivered to the recipient application. Therefore, the provenance record remains an accurate reflection of system events, regardless of DoS attacks.

5 LPM Application: Provenance-Based Data Loss Prevention

To further demonstrate the power of LPM, we now introduce a set of mechanisms for Provenance-Based Data Loss Prevention (PB-DLP) that offer dramatically simplified administration and improved enforcement over existing DLP systems. Data Loss Prevention (DLP), which is also called Data Leakage Protection, is enterprise software that seeks to minimize the loss and exfiltration of sensitive data by monitoring and controlling information flows in large, complex organizations [5]. In addition to the desire to control intellectual property, another motivator for DLP systems is demonstrating regulatory compliance for personally-identifiable information (PII),¹³ as well as directives such as PCI,⁷ HIPAA,¹⁴ SOX,¹⁵ or E.U. Data Protection.¹⁶ It is therefore important for a DLP system to be able to exhaustively explain

¹³ See NIST SP 800-122
¹⁴ See http://www.hhs.gov/ocr/privacy
¹⁵ Short for the Sarbanes-Oxley Act, U.S. Public Law No. 107-204
¹⁶ See EU Directive 95/46/EC

Algorithm 1 Summarizes a’s propagation through the system.

    Require: a is an entity
    procedure REPORT(a)
        Locations = [ ]                      ▷ Assigns an empty list.
        for each s in a, FindSuccessors(a) do
            if s.type is File then
                Locations.Add(⟨s.disk, s.directory⟩)
            else if s.type is Network Packet then
                Locations.Add(⟨s.remote_ip, s.port⟩)
            end if
        end for
        return Locations
    end procedure

which pieces of data are sensitive, where that data has propagated to within the organization, and where it is (and is not) permitted to flow.

A provenance-based approach is a novel and effective means of handling data loss prevention; to our knowledge, we are the first in the literature to do so. The advantage of our approach when compared to existing systems is that LPM-based provenance-aware systems already perform system-wide capture of information flows between kernel objects. Data loss prevention in such a system therefore becomes a matter of preventing all derivations of a sensitive source entity, e.g., a Payment Card Industry (PCI) database, from being written to a monitored destination entity (e.g., a network interface).

We begin by defining a policy format for PB-DLP. Individual rules take the form

$\langle Srcs = [src_1, src_2, \ldots, src_n],\ dst \rangle$

where Srcs is a list of entities representing persistent data objects, and dst is a single entity representing either a persistent data object, such as a file or interface, or an abstract entity, such as a remote host. The goal for PB-DLP is as follows: an entity $e_1$ with ancestors $A$ is written to entity $e_2$ if and only if $A \not\supseteq Srcs$ for all rules in the ruleset where $e_2 = dst$. The reason that sources are expressed as sets is that, at times, the union of information is more sensitive than its individual components. For example, sharing a person’s last name or birthdate may be permissible, while sharing the last name and birthdate together is restricted as PII.¹³

Below, we define the functions that realize this goal. First, we define two provenance-based functions as the basis for a DLP monitoring phase, which allows administrators to learn more about the propagation of sensitive data on their systems. Then, we define mechanisms for a DLP enforcement phase.

5.1 Monitoring Phase

The goal of monitoring is to allow administrators to reason about how sensitive data is stored and put to use on their systems. The end product of the monitor phase is a set of rules (a policy) that restrict the permissible flows


Algorithm 2 Mediates request to write e to d given Rules.

    Require: e, d are entities
    Require: Rules is a PB-DLP policy
    procedure PROVWRITE(e, d, Rules)
        for each rule in Rules do
            if d = rule.dst then
                A = FindAncestors(e)
                NumSrcs = length(rule.Srcs)
                for each src in rule.Srcs do
                    if src in A then
                        NumSrcs--
                    end if
                end for
                if NumSrcs = 0 then          ▷ A ⊇ Srcs, deny.
                    return PB-DLP_DENY
                end if
            end if
        end for
        return PB-DLP_PERMIT                 ▷ A ⊉ Srcs, permit.
    end procedure

for sensitive data sources. Monitoring is an ongoing process in DLP, where administrators attempt to iteratively improve protection against data leakage. The first step is to identify the data that needs protection. Identifying the source of such information is often quite simple; for example, a database of PCI or PII data. Reliably finding data objects that were derived from this source, however, is extraordinarily complicated with existing solutions but becomes simple with LPM. To begin, we define a helper function for system monitoring:

1. FindSuccessors(Entity): This function performs a provenance graph traversal to obtain the list of data objects derived from Entity.

FindSuccessors can then be used as the basis for a function that summarizes the spread of sensitive data:

2. Report(Entity): List the locations that a target object and its successors have propagated. This function is defined in Algorithm 1.

The information provided by Report is similar to the data found in the Symantec DLP Dashboard [5], and could be used as the backbone of a PB-DLP user interface. Administrators can use this information to write a PB-DLP policy or revise an existing one.
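As a concrete reading of FindSuccessors and Report, the following Python sketch runs Algorithm 1 over a toy provenance graph. The dictionary-based graph encoding, the metadata keys, and the sample entities are assumptions made for illustration; our implementation queries the SNAP recorder instead.

```python
from collections import deque


def find_successors(derived: dict, entity: str) -> set:
    """Breadth-first traversal returning every object derived from entity."""
    seen, queue = set(), deque([entity])
    while queue:
        for nxt in derived.get(queue.popleft(), ()):
            if nxt != entity and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def report(derived: dict, info: dict, entity: str) -> list:
    """Algorithm 1: list the locations of entity and all of its successors."""
    locations = []
    for s in [entity, *sorted(find_successors(derived, entity))]:
        meta = info[s]
        if meta["type"] == "file":
            locations.append((meta["disk"], meta["directory"]))
        elif meta["type"] == "packet":
            locations.append((meta["remote_ip"], meta["port"]))
    return locations


# Hypothetical graph: maps each entity to the entities derived from it.
derived = {"db": ["report.csv", "pkt1"], "report.csv": ["backup.csv"]}
info = {
    "db": {"type": "file", "disk": "sda", "directory": "/srv"},
    "report.csv": {"type": "file", "disk": "sda", "directory": "/home/bob"},
    "backup.csv": {"type": "file", "disk": "sdb", "directory": "/backup"},
    "pkt1": {"type": "packet", "remote_ip": "203.0.113.5", "port": 443},
}
locations = report(derived, info, "db")
```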

5.2 Enforcement Phase

Possessing a PB-DLP policy, the goal of the enforcement phase is to prevent entities that were derived from sensitive sources from being written to restricted locations. To do so, we need to inspect the object’s provenance to discover the entities from which it was derived. We define the following helper function:

3. FindAncestors(Entity): This function performs a provenance graph traversal to obtain the list of data objects used in the creation of Entity.

FindAncestors can then be used as the basis for a function that prevents the spread of sensitive data:

4. ProvWrite(Entity, Destination, Rules): Write the target entity to the destination if and only if doing so is valid under the provided rule set, as defined in Algorithm 2.
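The enforcement decision of Algorithm 2 can be sketched in Python as follows. The sketch replaces the NumSrcs counter with an equivalent set-inclusion test; the graph encoding, the rule tuples, and the destination name `eth0` are hypothetical, and the entity names follow the PII example used later in Figure 10.

```python
DENY, PERMIT = "PB-DLP_DENY", "PB-DLP_PERMIT"


def find_ancestors(parents: dict, entity: str) -> set:
    """Depth-first traversal returning every object used to create entity."""
    seen, stack = set(), [entity]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def prov_write(entity: str, dst: str, rules: list, parents: dict) -> str:
    """Algorithm 2: deny when some rule for dst has all its sources among the ancestors."""
    for srcs, rule_dst in rules:
        if rule_dst == dst and set(srcs) <= find_ancestors(parents, entity):
            return DENY
    return PERMIT


# Hypothetical graph: maps each entity to the entities it was derived from.
parents = {
    "PII_Data": ["Birth_Dates", "SSNs"],
    "PII_Data.gz": ["PII_Data"],
}
rules = [(["Birth_Dates", "SSNs"], "eth0")]

fused = prov_write("PII_Data", "eth0", rules, parents)           # data fusion
transformed = prov_write("PII_Data.gz", "eth0", rules, parents)  # compressed copy
single = prov_write("Birth_Dates", "eth0", rules, parents)       # one identifier only
```

Because the check walks the provenance graph rather than inspecting content, the compressed copy is denied just like the fused file, while a single identifier on its own is permitted.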

5.3 File Transfer Application

In many enterprise networks that are isolated from the Internet via firewalls and proxies, it is desirable to share files with external users. File transfer services are one way to achieve this, and provide a single entry/exit point to the enterprise network where files being transferred can be examined before being released.¹⁷ In the case of incoming files, scans can check for known malware, and in some cases, check for other types of malicious behavior from unknown malware.

We implemented PB-DLP as a file transfer application for provenance-aware systems using LPM’s Provmon module. The application interfaced with LPM’s SNAP recorder using a custom API. Before permitting a file to be transmitted to a remote host, the application ran a query that traversed WasDerivedFrom edges to return a list of the file’s ancestors, permitting the transfer only if the file was not derived from a restricted source. PB-DLP allows internal users to share data, while ensuring that sensitive data is not exfiltrated in the process.

Because provenance graphs continue to grow indefinitely over time, in practice the bottleneck of this application is the speed of provenance querying. We evaluate the performance of PB-DLP queries in Section 6.3.

5.4 PB-DLP Analysis

Below, we select two open source systems that approximate label-based and regular expression (regex) based DLP solutions, and compare their benefits to those of PB-DLP.

5.4.1 Label-Based DLP

The SELinux MLS policy [38] provides information flow security through a label-based approach, and could be used to approximate a DLP solution without relying on commercial products. Proprietary label-based DLP systems rely on manual annotations provided by users, requiring them to provide correct labeling based on their knowledge of data content. Using SELinux as an exemplar labeling system is therefore an extremely conservative approach to analysis.

¹⁷ Two examples of vendors that provide this capability are FireEye (http://www.fireeye.com) and Accellion (http://www.accellion.com/)


[Figure 10 graph: entities Birth_Dates:0, SSNs:0, Training_Data:0, PII_Data:0, and PII_Data.gz:0; activities "join Birth_Dates SSNs > PII_Data" and "gzip PII_Data"; edges labeled Used and WasGeneratedBy; DLP decision conditions numbered 1 to 4.]

Figure 10: A provenance graph of PII data objects that are first fused and then transformed. The numbers mark DLP decision conditions. Objects marked by green circles should not be restricted, while red octagons should be restricted. Label-Based DLP correctly handles data resembling PII (1,2) and data transformations (4), but struggles with data fusions (3). Regex-Based DLP correctly identifies data fusions (3), but is prone to incorrect handling of data resembling PII (1) and fails to identify data transformations (4). PB-DLP correctly handles all conditions.

[Figure 11 lattice: nodes A,{α,β}; A,{α}; A,{β}; B,{α,β}; B,{α}; B,{β}; together with the lattice’s top and bottom elements.]

Figure 11: Example MLS lattice with two classification levels and two compartments. The lattice is a pictorial representation of the access control policy that is enforced.

Within an MLS system, each subject and object is assigned a classification level and a set of categories, or compartments. Consider an example system with classification levels {A, B}, with A dominating B, and compartments {α, β}. We can model our policy as a lattice, where each node in the lattice is a tuple of the form ⟨level, {compartments}⟩. Once the policy is defined, it is possible to enforce the simple and *-properties. If a user has access to data with classification level A and compartment α, he cannot read anything in compartment {β} (no read-up). Furthermore, when data is accessed in A,{α}, the user cannot write anything to B,{α} (no write-down). Figure 11 shows an example policy in lattice form.
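The two properties can be checked mechanically with a dominance relation over the lattice. The following Python sketch encodes the example levels and compartments above; the function names and the tuple encoding of labels are illustrative and are unrelated to SELinux's policy language.

```python
LEVELS = {"B": 0, "A": 1}  # A dominates B


def dominates(x, y) -> bool:
    """x dominates y if its level is at least y's and its compartments include y's."""
    (level_x, comps_x), (level_y, comps_y) = x, y
    return LEVELS[level_x] >= LEVELS[level_y] and comps_x >= comps_y


def can_read(subject, obj) -> bool:
    # Simple security property (no read-up): the subject must dominate the object.
    return dominates(subject, obj)


def can_write(subject, obj) -> bool:
    # *-property (no write-down): the object must dominate the subject.
    return dominates(obj, subject)


user = ("A", frozenset({"alpha"}))
read_beta = can_read(user, ("A", frozenset({"beta"})))              # False: no read-up
write_down = can_write(user, ("B", frozenset({"alpha"})))           # False: no write-down
read_lower = can_read(user, ("B", frozenset({"alpha"})))            # True
write_up = can_write(user, ("A", frozenset({"alpha", "beta"})))     # True
```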

In order to use SELinux’s MLS enforcement as a DLP solution, the administrator configures the policy to enforce the constraint that no data of specified types can be sent over the network. However, this is difficult in practice. Consider an example system that processes PII. The users of the system may need to access information, such as last names, and send these to the payroll department to ensure that each employee receives a paycheck. Separately, the user may need to send a list of birthdays to another user in the department to coordinate birthday celebrations for each month. Either of these activities is acceptable (Figure 10, Decision Condition 2). However, it is common practice for organizations to have stricter sharing policies for data that contains multiple forms of PII, so while either of these identifiers could be transmitted in isolation, the two pieces of information combined could not be shared (Figure 10, Decision Condition 3).

The MLS policy cannot easily handle this type of data fusion. In order to provide support for correctly labeling fused data, an administrator would need to define the power set of all compartments within the MLS policy. In the example above, the administrator would define the following compartments: {}, {α}, {β}, {α,β}. In the default configuration SELinux supports 256 unique categories, meaning an SELinux DLP policy could only support eight types of data. Furthermore, the MLS policy does not support defining multiple categories within a single sensitivity level.¹⁸ This implies that the MLS policy cannot support having a security level for A,{α} and for A,{α,β}. Instead, the most restrictive labeling must be defined to protect the data on the system. In contrast, PB-DLP can support an arbitrary number of data fusions.
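The limit of eight data types follows from the size of the power set: n data types require 2^n categories, and 2^8 = 256. A short Python sketch makes the arithmetic explicit; the function names are illustrative.

```python
from itertools import combinations


def fusion_labels(data_types: list) -> list:
    """Enumerate every subset of data types; each subset needs its own category."""
    return [
        frozenset(combo)
        for size in range(len(data_types) + 1)
        for combo in combinations(data_types, size)
    ]


def supported_data_types(num_categories: int) -> int:
    """Largest n such that 2**n <= num_categories."""
    return num_categories.bit_length() - 1


labels = fusion_labels(["alpha", "beta"])  # {}, {alpha}, {beta}, {alpha, beta}
limit = supported_data_types(256)
```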

5.4.2 Regex-Based DLP

The majority of DLP software relies on pattern matching techniques to identify sensitive data. While enterprise solutions offer greater sophistication and customizability, their fundamental approach resembles that of Cornell’s Spider 6, a forensics tool for identifying sensitive personal data (e.g., credit card or social security numbers). Because it is open source, we make use of Spider as an exemplar application for regex-based DLP.

18 See the definition of level statements at http://selinuxproject.org/page/MLSStatements

Test Type            Vanilla   LPM             Provmon
Process tests, times in µseconds (smaller is better)
null call            0.14      0.14 (0%)       0.14 (0%)
null I/O             0.21      0.21 (0%)       0.32 (52%)
stat                 1.57      1.6 (2%)        2.8 (78%)
open/close file      2.75      2.42 (-12%)     3.91 (42%)
signal install       0.25      0.25 (0%)       0.25 (0%)
signal handle        1.37      1.29 (-6%)      1.39 (1%)
fork process         380       396 (4%)        401 (6%)
exec process         873       879 (1%)        911 (4%)
shell process        2990      3000 (0%)       3113 (4%)
File and memory latencies in µseconds (smaller is better)
file create (0k)     11.5      11.2 (-3%)      15.8 (37%)
file delete (0k)     8.51      8.12 (-5%)      11.8 (39%)
file create (10k)    23.4      21.6 (-8%)      28.8 (23%)
file delete (10k)    12.5      12 (-4%)        14.7 (18%)
mmap latency         1062      1053 (-1%)      1120 (5%)
protect fault        0.32      0.3 (-6%)       0.346 (8%)
page fault           0.016     0.016 (0%)      0.016 (0%)
100 fd select        1.53      1.53 (0%)       1.53 (0%)

Table 3: LMBench measurements for LPM kernels. All times are in microseconds. Percent overhead for modified configurations is shown in parentheses.

Regex approaches are prone to false positives. Spider is pre-configured with a set of regular expressions for identifying potential PII, e.g., (\d{3}-\d{2}-\d{4}) identifies a social security number. However, it is common practice for developers to generate and distribute training datasets to aid in software testing (Figure 10, Decision Condition 1). Spider is oblivious to information flows, instead searching for content that bears structural similarity to PII, and therefore would be unable to distinguish between true PII and training data. PB-DLP tracks the propagation of data from its source onwards, and could trivially differentiate between true PII and training sets.

Regex approaches are also prone to false negatives. Even after the most trivial data transformations, PII and PCI data is no longer identifiable to the Spider system (Figure 10, Decision Condition 4), permitting its exfiltration. To demonstrate, we generated a file full of random valid social security numbers that Spider was able to identify. We then ran gzip on the file and stored it in a second file. Spider was unable to identify the second file, but PB-DLP correctly identified both files as PII since the gzip output was derived from a sensitive input.
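The gzip experiment can be approximated in a few lines. This sketch does not use Spider or LPM: the pattern is the one quoted above, the input is synthetic SSN-formatted text rather than valid social security numbers, and the `derived_from` table is a hypothetical stand-in for the ancestry that a provenance recorder would supply.

```python
import gzip
import re

# The social security number pattern quoted in the text, applied to bytes.
SSN_PATTERN = re.compile(rb"\d{3}-\d{2}-\d{4}")


def regex_flags(content: bytes) -> bool:
    """Content-based check: does anything look like an SSN?"""
    return SSN_PATTERN.search(content) is not None


def derived_from_sensitive(name, derived_from, sensitive):
    """Provenance-based check: is the object, or any ancestor, marked sensitive?"""
    stack, seen = [name], set()
    while stack:
        current = stack.pop()
        if current in sensitive:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from.get(current, ()))
    return False


# Synthetic, deterministic SSN-formatted records (illustrative only).
records = b"\n".join(
    b"%03d-%02d-%04d" % (100 + i, 10 + i % 90, 1000 + i) for i in range(200))
compressed = gzip.compress(records)

# Hypothetical ancestry: the archive was written by gzip reading ssn.txt.
derived_from = {"ssn.txt.gz": ["ssn.txt"]}
sensitive = {"ssn.txt"}

print("regex, plain file:     ", regex_flags(records))
print("regex, gzip output:    ", regex_flags(compressed))
print("provenance, gzip output:",
      derived_from_sensitive("ssn.txt.gz", derived_from, sensitive))
```

The content check depends on the byte representation, so any transformation defeats it, whereas the ancestry check is unaffected by how the derived file is encoded.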

6 Evaluation

We now evaluate the performance of LPM. Our benchmarks were run on a bare metal server machine with 12 GB memory and 2 Intel Xeon quad core CPUs. The Red Hat 2.6.32 kernel was compiled and installed under 3 different configurations: all provenance disabled (Vanilla), LPM scaffolding installed but without an enabled module (LPM), and LPM installed with the Provmon module enabled (Provmon).

Test             Vanilla   Provmon   Overhead
Kernel Compile   598 sec   614 sec   2.7%
Postmark         25 sec    27 sec    7.5%
Blast            376 sec   390 sec   4.8%

Table 4: Benchmarking Results. Our provenance module imposed just 2.7% overhead on kernel compilation.

6.1 Collection Performance

We used LMBench to microbenchmark LPM’s impact on system calls as well as file and memory latencies. Table 3 shows the averaged results over 10 trials for each kernel, with a percent overhead calculation against Vanilla. For most measures, the performance differences between LPM and Vanilla are negligible. Comparing Vanilla to Provmon, there are several measures in which overhead is noteworthy: stat, open/close, file creation and deletion. Each of these benchmarks involves LMBench manipulating a temporary file that resides in /usr/tmp/lat_fs/. Because an absolute path is provided, before each system call occurs LMBench first traverses the path to the file, resulting in the creation of 3 different provenance events in Provmon’s inode_permission hook, each of which is transmitted to user space via the kernel relay. While the overheads seem large for these operations, the logging of these three events only imposes approximately 1.5 microseconds per traversal. Moreover, the overhead for opening and closing is significantly higher than the overhead for reads and writes (Null I/O); thus, the higher open/close costs are likely to be amortized over the course of regular system use. A provenance module could further reduce this overhead by maintaining state about past events within the kernel, then blocking the creation of redundant provenance records.
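For reference, the percent overhead figures can be reproduced from the raw latencies. The sketch below assumes the usual definition, (measured - baseline) / baseline, and uses three Vanilla/Provmon pairs copied from Table 3; the function name is hypothetical.

```python
def percent_overhead(baseline, measured):
    """Relative slowdown of a modified kernel against the Vanilla baseline."""
    return (measured - baseline) / baseline * 100.0


# (Vanilla, Provmon) latencies in microseconds, copied from Table 3.
rows = {
    "null I/O": (0.21, 0.32),
    "stat": (1.57, 2.8),
    "open/close file": (2.75, 3.91),
}

for name, (vanilla, provmon) in rows.items():
    added = provmon - vanilla
    print(f"{name}: +{added:.2f} us ({percent_overhead(vanilla, provmon):.0f}%)")
```

Printing the absolute difference next to the percentage makes the point of the paragraph visible: large relative overheads here correspond to roughly a microsecond of added latency per call.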

To gain a more practical sense of the costs of LPM, we also performed multiple benchmark tests that represented realistic system workloads. Each trial was repeated 5 times to ensure consistency. The results are summarized in Table 4. For the kernel compile test, we recompiled the kernel source (in a fixed configuration) while booted into each of the kernels. Each compilation occurred on 16 threads. The LPM scaffolding (without an enabled module) is not included, because in both tests it imposed less than 1% overhead. In spite of seemingly high overheads for file I/O, Provmon imposes just 2.7% overhead on kernel compilation, or 16 seconds. The Postmark test simulates the operation of an email server. It was configured to run 15,000 transactions with file sizes ranging from 4 KB to 1 MB in 10 subdirectories, with up to 1,500 simultaneous transactions. The Provmon module imposed just 7.5% overhead on this task. To estimate LPM’s overhead for scientific applications, we ran the BLAST benchmarks (footnote 19), which simulate typical biological sequence workloads obtained from analysis of hundreds of thousands of jobs from the National Institutes of Health.

Figure 12: Growth of provenance storage overheads during kernel compilation. [Plot: storage cost (GBytes) against time (minutes) for the raw provenance stream, the GZip recorder, and the SNAP recorder.]

Figure 13: Performance of ancestry queries for objects created during kernel compilation. [Plot: cumulative density against response time (milliseconds).]

Figure 14: LPM network overhead can be reduced with batch signature schemes. [Plot: Iperf throughput (Mbps) for Vanilla, LPM, Provmon, and Batch Sig.]

For kernel compile and Postmark, Provmon outperforms the PASS system, which exacted 15.6% and 11.5% overheads on kernel compilation and Postmark, respectively [56]. Provmon introduces comparable kernel compilation overhead to Hi-Fi (2.8%) [61]. It is difficult to compare our Blast results to SPADE and PASS, as both used a custom workload instead of a publicly available benchmark. SPADE reports an 11.5% overhead on a large workload [36], while PASS reports just a 0.7% overhead. Taken as a whole, though, LPM collection either meets or exceeds the performance of past systems, while providing additional security assurances.

6.2 Storage Overhead

A major challenge to automated provenance collection is the storage overhead incurred. We plotted the growth of provenance storage using different recorders during the kernel compilation benchmark, shown in Figure 12. LPM generated 3.7 GB of raw provenance. This required only 450 MB of storage with the Gzip recorder, but provenance cannot be efficiently queried in this format. The SNAP recorder builds an in-memory provenance graph. We approximated the storage overhead through polling the virtual memory consumed by the recorder process in the /proc filesystem. The SNAP graph required 1.6 GB storage; the reduction from the raw provenance stream is due to the fact that redundant events did not lead to the creation of new graph components. In contrast, the PASS system generates 1.3 GB of storage overhead during kernel compilation, but PASS does not monitor shared memory and memory mapping activity. LPM’s storage overheads are thus comparable to other provenance-aware systems.
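One way to poll a process’s virtual memory through /proc is to read the VmSize field of /proc/<pid>/status. The sketch below is an assumption about how such polling could be done, not the measurement script used for Figure 12; the function names are hypothetical.

```python
import os


def parse_vm_size_kb(status_text):
    """Extract the VmSize value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmSize:"):
            return int(line.split()[1])
    return None  # kernel threads, for example, have no VmSize line


def read_vm_size_kb(pid):
    """Read the current virtual memory size of a process; Linux only."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_size_kb(status_file.read())


# Example: sample this interpreter's own footprint where /proc is available.
if os.path.exists("/proc/self/status"):
    print("VmSize (kB):", read_vm_size_kb("self"))
```

Calling `read_vm_size_kb` with the recorder’s PID at fixed intervals would yield a storage-versus-time series of the kind plotted in Figure 12. VmSize counts mapped address space, so it is an upper-bound approximation of the graph’s size.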

19 See http://fiehnlab.ucdavis.edu/staff/kind/Collector/Benchmark/Blast_Benchmark

6.3 Query Performance (PB-DLP)

We evaluated query performance using our exemplar PB-DLP application and LPM’s SNAP recorder. The provenance graph was populated using the routine from the kernel compile benchmark. This yielded a raw provenance stream of 3.7 GB, which was translated by the recorder into a graph of 6,513,398 nodes and 6,754,059 edges. We were then able to use the graph to issue ancestry queries, in which the subgraphs were traversed to find the data object ancestors of a given file. Because we did not want ephemeral objects with limited ancestries to skew our results, we only considered the results of objects with more than 50 ancestors.
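An ancestry query of this kind amounts to a graph traversal over "was derived from" edges. The sketch below uses a plain dictionary and breadth-first search rather than the SNAP library; the node names and the `parents` mapping are hypothetical.

```python
from collections import deque


def ancestors(parents, node):
    """Return every object reachable from `node` by following derivation edges."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:  # visit each ancestor once, even with cycles
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical provenance graph: each key was derived from the listed nodes.
parents = {
    "report.gz": ["proc:gzip"],
    "proc:gzip": ["report.txt", "/bin/gzip"],
    "report.txt": ["proc:editor"],
    "proc:editor": ["payroll.db"],
}

found = ancestors(parents, "report.gz")
print(sorted(found))
# A PB-DLP style decision: block the transfer if any ancestor is sensitive.
print("block transfer:", bool(found & {"payroll.db"}))
```

Each node and edge is visited at most once, so the cost of a query grows with the size of the object’s ancestry rather than with the size of the whole graph.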

In the worst case, which was a node that had 17,696 ancestors, the query returned in just 21 milliseconds. Effectively, we were able to query object ancestries at line speed for network activity. We are confident that this approach can scale to large databases through precomputation of expensive operations at data ingest, making it a promising strategy for provenance-aware distributed systems; however, we note that these results are highly dependent on the size of the graph. Our test graph, while large, would inevitably be dwarfed by the size of the provenance on long-lived systems. Fortunately, there are also a variety of techniques for reducing these costs. Bates et al. show that the results from provenance graph traversals can be extensively cached when using a fixed security policy, which would allow querying to amortize to a small constant cost [12]. LPM could also be extended to support provenance deduplication [78, 77] and policy-based provenance pruning [17], both of which would further improve performance by reducing the size of provenance graphs.

6.4 Message Commitment Protocol

Under each kernel configuration, we performed iperf benchmarking to discover LPM’s impact on TCP throughput. iperf was launched in both client and server mode over localhost. The client was launched with 16 threads, two per CPU core. Our results can be found in Figure 14. The Vanilla kernel reached 4490 Mbps. While the LPM framework imposed negligible cost compared to the vanilla kernel (4480 Mbps), Provmon’s DSA-based message commitment protocol reduced throughput by an order of magnitude (482 Mbps). Through use of printk instrumentation, we found that the average overhead per packet signature was 1.2 ms.

This result is not surprising when compared to IPsec performance. IPsec’s Authentication Header (AH) mode uses an HMAC-based approach to provide similar guarantees as our protocol. AH has been shown to reduce throughput by as much as half [20]. An HMAC approach is a viable alternative to establish integrity and data origin authenticity and would also fit into the options field, but would require the negotiation of IPsec security associations. Our message commitment protocol has the benefit of being fully interoperable with other hosts, and does not require a negotiation phase before communication occurs. Another option for increasing throughput would be to employ CPU instruction extensions [37] and a security co-processor [24] to accelerate the speed of DSA. Yet another approach to reducing our impact on network performance would be to employ a batch signature scheme [15]. We tested this by transmitting a signature over every 10 packets during TCP sessions, and found that throughput increased by 3.3 times to approximately 1600 Mbps. Because this overhead may not be suitable for some environments, Provmon can be configured to use Hi-Fi identifiers [61], which are vulnerable to network attack but impose negligible overhead. LPM’s impact on network performance is specific to the particular module, and can be tailored to meet the needs of the system.
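The batching idea can be sketched as follows: rather than signing each packet, a running digest is kept over a batch and signed once. This is an illustration only, not Provmon’s protocol. The Python standard library has no DSA implementation, so HMAC-SHA256 stands in for the DSA signing step; the key, the function names, and the packet framing are all assumptions.

```python
import hashlib
import hmac

BATCH_SIZE = 10                 # one signature per 10 packets, as in the experiment
DEMO_KEY = b"hypothetical-key"  # placeholder; a real deployment would use a DSA key


def sign(digest: bytes) -> bytes:
    """Stand-in for the expensive DSA signature over a batch digest."""
    return hmac.new(DEMO_KEY, digest, hashlib.sha256).digest()


def batch_sign(packets, batch_size=BATCH_SIZE):
    """Return (packet, signature or None) pairs, signing once per batch."""
    out = []
    running = hashlib.sha256()
    pending = 0
    for index, packet in enumerate(packets):
        # Length-prefix each packet so batch boundaries cannot be shifted.
        running.update(len(packet).to_bytes(4, "big") + packet)
        pending += 1
        is_last = index == len(packets) - 1
        if pending == batch_size or is_last:
            out.append((packet, sign(running.digest())))
            running = hashlib.sha256()
            pending = 0
        else:
            out.append((packet, None))
    return out


packets = [b"payload-%d" % i for i in range(25)]
signed = batch_sign(packets)
print("signatures computed:", sum(1 for _, sig in signed if sig is not None))
```

The trade-off is that a receiver cannot authenticate any packet in a batch until the signed digest for that batch arrives.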

7 Discussion

In the same way that LPM is not able to capture workflow layer provenance without the assistance of instrumented applications, LPM is also vulnerable to workflow layer side channels that could be used to launder information from one process to another. The most obvious example of such a side channel is the copy/paste buffer in windowing applications like Xorg. This is a known problem for kernel layer security mechanisms, one that has been addressed by the Trusted Solaris project [14], Trusted X [28, 29], the SELinux-aware X window system [48], SecureView (footnote 20), and General Dynamics’ TVE (footnote 21). Similar approaches could be used to create provenance-aware desktop environments. Similarly, LPM is unable to capture side channel flows, such as timing channels or L2 cache measurements [64], a limitation that it shares with most any other security solution [22].

20 See http://www.ainfosec.com/secureview
21 See http://gdc4s.com/tve.html

LPM does not address the matter of provenance confidentiality; this is an important challenge that is explored elsewhere in the literature [18, 59]. LPM’s recorders provide interfaces that can be used to introduce an access control layer onto the provenance store.

Although we have not presented a secure distributed provenance-aware system in this work, LPM provides the foundation for the creation of such a system. In the presented modules, provenance is stored locally by the host and retrieved on an as-needed basis from other hosts. This raises availability concerns as hosts inevitably begin to fail. Availability could be improved with minimal performance and storage overheads through Gehani et al.’s approach of duplicating provenance at k neighbors with a limited graph depth d [34, 35].

8 Related Work

While myriad provenance-aware systems have been proposed in the literature, the majority disclose provenance within an application [39, 53, 82] or workflow [31, 73]. It is difficult or impossible to obtain complete provenance in this manner. This is because system events that occur outside of the application, but still affect its execution, will not appear in the provenance record.

Automatic provenance-aware systems, the alternative to disclosed systems, collect provenance transparently within the operating system. Gehani et al.’s SPADE is a multi-platform system for eScience and grid computing audiences, with an emphasis on low latency and availability in distributed environments [36]. SPADE’s provenance reporters make use of familiar application layer utilities to generate provenance, such as polling ps for process information and lsof for network information. This gives rise to the possibility of incomplete provenance due to race conditions. The PASS project collects the provenance of system calls at the virtual filesystem (VFS) layer. PASSv1 provides base functions for provenance collection that observe processes’ file I/O activity [56]. Because these basic functions are manually placed around the kernel, there is no clear way to extend PASSv1 to support additional collection hooks; we address this limitation in the modular design of LPM. PASSv2 introduces a Disclosed Provenance API for tighter integration between provenance collected at different layers of abstraction, e.g., at the application layer [57]. PASSv2 assumes that disclosing processes are benign, while LPM provides a secure disclosure mechanism for attesting the correctness of provenance-aware applications. Both SPADE and PASS are designed for benign environments, and make no attempt to protect their collection mechanisms from an adversary.


System             Implemented?  Security Model?  Whole-System Provenance?  Adversary on Host?                    Adversary in Network?
SPADE [36]         Yes           No               No                        Vulnerable (Disable Monitor)          Vulnerable (Prov. Forgery)
PASSv2 [57]        Yes           No               No                        Vulnerable (User Space Components)    Vulnerable (Prov. Forgery)
SProv [39]         Yes           Yes              No                        Vulnerable (Disable Monitor)          N/A
SNooPy [82]        Yes           Yes              No                        Vulnerable to all attacks, but with probabilistic detection.
Lyle, Martin [52]  No            Yes              No                        Secure. (But does not collect whole-system provenance.)
Hi-Fi [61]         Yes           Yes              Yes                       Vulnerable (User Space Components)    Vulnerable (Prov. Forgery)
LPM                Yes           Yes              Yes                       Secure for whole-system provenance collection.

Table 5: A summary of security considerations in existing provenance-aware systems. The “Security Model?” field marks whether the system is designed to work in the presence of some adversary. “Whole-System Provenance?” denotes whether the system collects provenance for all events including kernel activities. The identified vulnerabilities are according to our security model.

Previous work has considered the security of provenance under relaxed threat models relative to LPM’s. The resulting vulnerabilities are surveyed in Table 5. In SProv, Hasan et al. introduce provenance chains, cryptographic constructs that prevent the insertion or deletion of provenance inside of a series of events [39]. SProv effectively demonstrates the authentication properties of this primitive, but is not intended to serve as a secure provenance-aware system; attackers can still append false records to the end of the chain, delete the whole chain, or disable the library altogether. Zhou et al. consider provenance corruption an inevitability, and show that provenance can detect some malicious hosts in distributed environments provided that a critical mass of correct hosts still exists [82]. They later strengthen these assurances through use of provenance-aware software-defined networking [11]. These systems consider only network events, and are unable to speak to the internal state of hosts. Lyle and Martin sketch the design for a secure provenance monitor based on trusted computing [52]. However, they conceptualize provenance as a TPM-aided proof of code execution, overlooking interprocess communication and other system activity that could inform execution results, and therefore offer information that is too coarse-grained to meet the needs of some applications. Moreover, to the best of our knowledge their system is unimplemented.
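The chaining property, and its limits, can be illustrated with a simplified hash chain. This is not SProv’s construction: SProv chains signed checksums, whereas this sketch uses unkeyed SHA-256 links purely to show why edits inside the chain are detectable while an appended record is not. All names are hypothetical.

```python
import hashlib

GENESIS = b"\x00" * 32  # fixed starting value for the first link


def link(previous: bytes, record: bytes) -> bytes:
    """Checksum of a record bound to the checksum of its predecessor."""
    return hashlib.sha256(previous + record).digest()


def build_chain(records):
    chain, previous = [], GENESIS
    for record in records:
        previous = link(previous, record)
        chain.append((record, previous))
    return chain


def verify_chain(chain) -> bool:
    previous = GENESIS
    for record, checksum in chain:
        if link(previous, record) != checksum:
            return False
        previous = checksum
    return True


chain = build_chain([b"create report.txt", b"edit report.txt", b"gzip report.txt"])

deleted = chain[:1] + chain[2:]                  # drop a record from the middle
inserted = chain[:1] + [(b"forged", chain[0][1])] + chain[1:]
appended = chain + [(b"forged", link(chain[-1][1], b"forged"))]

print("intact:  ", verify_chain(chain))
print("deleted: ", verify_chain(deleted))
print("inserted:", verify_chain(inserted))
print("appended:", verify_chain(appended))
```

The appended record verifies because nothing after it commits to the true end of the chain, which mirrors the weakness described above.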

The most promising model to date for secure provenance collection is Pohly et al.’s Hi-Fi system [61]. Hi-Fi is a Linux Security Module (LSM) that collects whole-system provenance that details the actions of processes, IPC mechanisms, and even the kernel itself (which does not exclusively use system calls). Hi-Fi attempts to provide a provenance reference monitor [55], but remains vulnerable to the provenance-aware adversary that we describe in Section 3.2. Enabling Hi-Fi blocks the installation of other LSMs, such as SELinux or Tomoyo, or requires a third party patch to permit module stacking. This blocks the installation of MAC policy on the host, preventing runtime integrity assurances. Hi-Fi is also vulnerable to adversaries in the network, who can strip the provenance identifiers from packets in transit, resulting in irrecoverable provenance. Unlike LPM, Hi-Fi does not attempt to provide layered provenance services, and therefore does not consider the integrity and authenticity of provenance-aware applications.

Provenance vs. Information Flow

Provenance is a form of information flow monitoring that is related but distinct from past areas of study. Many projects have aimed to support Information Flow Control (IFC) through instrumenting applications, operating systems, and programming languages. FlowwolF is a web browser that provides labeling support for distributed mandatory access control at the application layer [40]. Decentralized IFC systems like Asbestos [26], HiStar [80], and Flume [49] make use of a decentralized labeling scheme that allows processes to compartmentalize data that is protected by the operating system. This facilitates the use of a lattice-based model for information flow [23]. DStar [79] extends this approach to support distributed environments. In each case, the primary thrust of these works is to provide system-layer support to applications wishing to isolate user data, thus reducing the consequences of a compromise. IFC systems thus require a priori knowledge of desired flow properties, and are unable to answer questions such as “How did data object x come to have label a?” Provenance provides a means of reasoning about flows and answering these questions.

Another area of information flow study is dynamic taint analysis, in which the goal is to track the propagation of select pieces of data across a system. Automated instrumentation for taint tracking has been developed for x86 binaries [67] and smartphones [27], and dynamic taint analysis in sandbox environments has also been used to secure off-the-shelf applications [84]. Provenance can offer a more complete explanation as to how an object became tainted. It is also more flexible: taint tracking relies on an immutable policy that requires that data be tagged at runtime, while a provenance-based approach can obtain a result after execution by “replaying” the provenance graph [82], permitting different taints to be considered without re-executing. Provenance is thus a distinct form of reasoning about information flow.
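The replay idea can be sketched with a recorded list of flows that is scanned once per taint policy. The event log and object names below are hypothetical, and the sketch assumes the log is ordered in time, as a provenance recorder would produce it.

```python
def replay_taint(events, tainted_sources):
    """Propagate taint through recorded (source, destination) flows, in order."""
    tainted = set(tainted_sources)
    for source, destination in events:
        if source in tainted:
            tainted.add(destination)
    return tainted


# Hypothetical provenance log, captured once during execution.
events = [
    ("payroll.db", "proc:report"),
    ("holiday.txt", "proc:report"),
    ("proc:report", "summary.txt"),
    ("holiday.txt", "proc:mail"),
    ("proc:mail", "socket:25"),
]

# Two different taint policies evaluated over the same log, with no re-execution.
print(sorted(replay_taint(events, {"payroll.db"})))
print(sorted(replay_taint(events, {"holiday.txt"})))
```

A runtime taint tracker would have had to choose one of these policies before the workload ran; here the choice is deferred until after the provenance has been collected.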

9 Conclusion

The Linux Provenance Module Framework is an exciting step forward for both provenance- and security-conscious communities. We have demonstrated that LPM can be used to create a trusted provenance-aware execution environment. A key feature of LPM is its ability to leverage Linux’s existing security features to provide strong provenance collection assurances. Our system imposes as little as 2.7% performance overhead on normal system operation, and can respond to queries about data object ancestry in tens of milliseconds. We have used LPM as the foundation of a provenance-based data loss prevention (PB-DLP) system that can scan file transmissions to detect the presence of sensitive ancestors in just tenths of a second.

Acknowledgements

We would like to thank Rob Cunningham, Alin Dobra, Will Enck, Jun Li, Al Malony, Patrick McDaniel, Daniela Oliveira, Nabil Schear, Micah Sherr, and Patrick Traynor for their valuable comments and insight, as well as Devin Pohly for his sustained assistance in working with Hi-Fi, and Mugdha Kumar for her help developing LPM SPADE support. This work was supported in part by the US National Science Foundation under grant numbers CNS-1118046, CNS-1254198, and CNS-1445983.

Availability

The LPM code base, including all user space utilities and patches for both Red Hat and the mainline Linux kernels, is available at http://linuxprovenance.org.

References

[1] CDW Data Loss Prevention. http://www.cdw.com/content/solutions/data-loss-prevention.aspx.

[2] Data Leakage Protection: Assess Risk and Safeguard Valuable Information. http://www.cisco.com/c/en/us/solutions/enterprise-networks/data-loss-prevention/index.html.

[3] Data Leakage Protection: Assess Risk and Safeguard Valuable Information. http://www.mcafee.com/us/products/total-protection-for-data-loss-prevention.aspx.

[4] Kernel SSL Proxy (KSSL). http://docs.oracle.com/cd/E23823_01/html/816-5175/kssl-5.html.

[5] Symantec Data Loss Prevention Customer Brochure. http://www.symantec.com/data-loss-prevention.

[6] Titus. http://www.titus.com.

[7] What Makes Bit9 + Carbon Black Unique? https://www.bit9.com/why-bit9/unique-from-competitors/.

[8] What’s Yours is Mine: How Employees are Putting Your Intellectual Property at Risk. https://www4.symantec.com/mktginfo/whitepaper/WP_WhatsYoursIsMine-HowEmployeesarePuttingYourIntellectualPropertyatRisk_dai211501_cta69167.pdf.

[9] R. Aldeco-Pérez and L. Moreau. Provenance-based Auditing of Private Data Use. In Proceedings of the 2008 International Conference on Visions of Computer Science, VoCS’08, Sept. 2008.

[10] J. P. Anderson. Computer Security Technology Planning Study. Technical Report ESD-TR-73-51, Air Force Electronic Systems Division, 1972.

[11] A. Bates, K. Butler, A. Haeberlen, M. Sherr, and W. Zhou. Let SDN Be Your Eyes: Secure Forensics in Data Center Networks. In NDSS Workshop on Security of Emerging Network Technologies, SENT, Feb. 2014.

[12] A. Bates, B. Mood, M. Valafar, and K. Butler. Towards Secure Provenance-based Access Control in Cloud Environments. In Proceedings of the 3rd ACM Conference on Data and Application Security and Privacy, CODASPY ’13, pages 277–284, New York, NY, USA, 2013. ACM.

[13] M. Bellare, R. Canetti, and H. Krawczyk. Keyed Hash Functions and Message Authentication. In Proceedings of Crypto’96, volume 1109 of LNCS, pages 1–15, 1996.

[14] M. Bellis, S. Lofthouse, H. Griffin, and D. Kucukreisoglu. Trusted Solaris 8 4/01 Security Target. 2003.

[15] A. Bittau, D. Boneh, M. Hamburg, M. Handley, D. Mazieres, and Q. Slack. Cryptographic protection of TCP Streams (tcpcrypt). https://tools.ietf.org/html/draft-bittau-tcp-crypt-01.

[16] D. Boneh, B. Lynn, and H. Shacham. Short Signatures from the Weil Pairing. In C. Boyd, editor, Advances in Cryptology – ASIACRYPT 2001, volume 2248 of Lecture Notes in Computer Science, pages 514–532. Springer Berlin Heidelberg, 2001.

[17] U. Braun, S. L. Garfinkel, D. A. Holland, K.-K. Muniswamy-Reddy, and M. I. Seltzer. Issues in Automatic Provenance Collection. In International Provenance and Annotation Workshop, pages 171–183, 2006.

[18] U. Braun and A. Shinnar. A Security Model for Provenance. Technical Report TR-04-06, Harvard University Computer Science Group, 2006.

[19] D. Catalano, M. Di Raimondo, D. Fiore, and R. Gennaro. Off-line/On-line Signatures: Theoretical Aspects and Experimental Results. In PKC’08: Proceedings of the Practice and theory in public key cryptography, 11th international conference on Public key cryptography, pages 101–120, Berlin, Heidelberg, 2008. Springer-Verlag.

[20] S. Chaitanya, K. Butler, A. Sivasubramaniam, P. McDaniel, and M. Vilayannur. Design, Implementation and Evaluation of Security in iSCSI-based Network Storage Systems. In Proceedings of the Second ACM Workshop on Storage Security and Survivability, StorageSS ’06, pages 17–28, New York, NY, USA, 2006. ACM.

[21] D. D. Clark and D. R. Wilson. A Comparison of Commercial and Military Computer Security Policies. In IEEE S&P, Oakland, CA, USA, Apr. 1987.


[22] D. Cock, Q. Ge, T. Murray, and G. Heiser. The Last Mile: AnEmpirical Study of Some Timing Channels on seL4. In ACMConference on Computer and Communications Security, pages570–581, Scottsdale, AZ, USA, nov 2014.

[23] D. E. Denning. A Lattice Model of Secure Information Flow.Commun. ACM, 19(5):236–243, May 1976.

[24] J. Dyer, M. Lindemann, R. Perez, R. Sailer, L. van Doorn, andS. Smith. Building the IBM 4758 Secure Coprocessor. Computer,34(10):57–66, Oct 2001.

[25] A. Edwards, T. Jaeger, and X. Zhang. Runtime Verification ofAuthorization Hook Placement for the Linux Security ModulesFramework. In Proceedings of the 9th ACM Conference on Com-puter and Communications Security, CCS’02, 2002.

[26] P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey,D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris.Labels and Event Processes in the Asbestos Operating System.SIGOPS Oper. Syst. Rev., 39(5):17–30, Oct. 2005.

[27] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel,and A. N. Sheth. TaintDroid: An Information-flow Tracking Sys-tem for Realtime Privacy Monitoring on Smartphones. In Pro-ceedings of the 9th USENIX Symposium on Operating SystemsDesign and Implementation, OSDI’10, Oct. 2010.

[28] J. Epstein and J. Picciotto. Trusting X: Issues in Building TrustedX Window Systems -or- WhatâAZs Not Trusted About X. In Pro-ceedings of the 14th Annual National Computer Security Confer-ence, 1991.

[29] J. Epstein and M. Shugerman. A Trusted X Window SystemServer for Trusted Mach. In USENIX MACH Symposium, pages141–156, 1990.

[30] S. Even, O. Goldreich, and S. Micali. On-line/off-line DigitalSignatures. In Proceedings on Advances in cryptology, CRYPTO’89, pages 263–275, New York, NY, USA, 1989. Springer-VerlagNew York, Inc.

[31] I. T. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao. Chimera: AVir-tual Data System for Representing, Querying, and AutomatingData Derivation. In Proceedings of the 14th Conference on Sci-entific and Statistical Database Management, SSDBM’02, July2002.

[32] V. Ganapathy, T. Jaeger, and S. Jha. Automatic placement ofauthorization hooks in the linux security modules framework. InProceedings of the 12th ACM Conference on Computer and Com-munications Security, CCS ’05, pages 330–339, New York, NY,USA, 2005. ACM.

[33] C.-z. Gao and Z.-a. Yao. A Further Improved Online/Offline Sig-nature Scheme. Fundam. Inf., 91:523–532, August 2009.

[34] A. Gehani, B. Baig, S. Mahmood, D. Tariq, and F. Zaffar. Fine-grained Tracking of Grid Infections. In Proceedings of the11th IEEE/ACM International Conference on Grid Computing,GRID’10, Oct 2010.

[35] A. Gehani and U. Lindqvist. Bonsai: Balanced Lineage Authen-tication. In Proceedings of the 23rd Annual Computer SecurityApplications Conference, ACSAC’07, Dec 2007.

[36] A. Gehani and D. Tariq. SPADE: Support for Provenance Audit-ing in Distributed Environments. In Proceedings of the 13th In-ternational Middleware Conference, Middleware ’12, Dec 2012.

[37] S. Gueron and V. Krasnov. Speed Up Big-Number MultiplicationUsing Single Instruction Multiple Data (SIMD) Architectures,June 7 2012. US Patent App. 13/491,141.

[38] C. Hanson. SELinux and MLS: Putting The Pieces Together. InIn Proceedings of the 2nd Annual SELinux Symposium, 2006.

[39] R. Hasan, R. Sion, and M. Winslett. The Case of the Fake Pi-casso: Preventing History Forgery with Secure Provenance. InProceedings of the 7th USENIX Conference on File and StorageTechnologies, FAST’09, San Francisco, CA, USA, Feb. 2009.

[40] B. Hicks, S. Rueda, D. King, T. Moyer, J. Schiffman, Y. Sreeni-vasan, P. McDaniel, and T. Jaeger. An Architecture for Enforc-ing End-to-end Access Control over Web Applications. In Pro-ceedings of the 15th ACM Symposium on Access Control Modelsand Technologies, SACMAT ’10, pages 163–172, New York, NY,USA, 2010. ACM.

[41] B. Hicks, S. Rueda, L. St.Clair, T. Jaeger, and P. McDaniel.A Logical Specification and Analysis for SELinux MLS Policy.ACM Trans. Inf. Syst. Secur., 13(3):26:1–26:31, July 2010.

[42] D. A. Holland, U. Bruan, D. Maclean, K.-K. Muniswamy-Reddy,and M. I. Seltzer. Choosing a Data Model and Query Languagefor Provenance. In 2nd International Provenance and AnnotationWorkshop, IPAW’08, June 2008.

[43] T. Jaeger, A. Edwards, and X. Zhang. Consistency Analysis ofAuthorization Hook Placement in the Linux Security ModulesFramework. ACM Trans. Inf. Syst. Secur., 7(2):175–205, May2004.

[44] T. Jaeger, R. Sailer, and U. Shankar. PRIMA: Policy-reducedIntegrity Measurement Architecture. In Proceedings of the 11thACM Symposium on Access Control Models and Technologies,SACMAT ’06, pages 19–28, New York, NY, USA, 2006. ACM.

[45] T. Jaeger, R. Sailer, and X. Zhang. Analyzing Integrity Protectionin the SELinux Example Policy. In Proceedings of the 12th con-ference on USENIX Security Symposium - Volume 12, SSYM’03,pages 5–5, Berkeley, CA, USA, 2003. USENIX Association.

[46] B. Kauer. OSLO: Improving the Security of Trusted Comput-ing. In Proceedings of 16th USENIX Security Symposium onUSENIX Security Symposium, SS’07, pages 16:1–16:9, Berkeley,CA, USA, 2007. USENIX Association.

[47] S. Kent and R. Atkinson. RFC 2406: IP Encapsulating SecurityPayload (ESP). 1998.

[48] D. Kilpatrick, W. Salamon, and C. Vance. Securing the X Win-dow System with SELinux. Technical report, Jan. 2003.

[49] M. Krohn, A. Yip, M. Brodsky, N. Cliffer, M. F. Kaashoek,E. Kohler, and R. Morris. Information Flow Control for Stan-dard OS Abstractions. SIGOPS Oper. Syst. Rev., 41(6):321–334,Oct. 2007.

[50] L. Lamport. Time, Clocks, and the Ordering of Events in a Dis-tributed System. Commun. ACM, 21(7):558–565, July 1978.

[51] H. Luhn. Computer for Verifying Numbers, Aug. 23 1960. USPatent 2,950,048.

[52] J. Lyle and A. Martin. Trusted Computing and Provenance: Bet-ter Together. In 2nd Workshop on the Theory and Practice ofProvenance, TaPP’10, Feb. 2010.

[53] P. Macko and M. Seltzer. A General-purpose Provenance Li-brary. In 4th Workshop on the Theory and Practice of Prove-nance, TaPP’12, June 2012.

[54] J. M. McCune, B. J. Parno, A. Perrig, M. K. Reiter, and H. Isozaki. Flicker: An Execution Infrastructure for TCB Minimization. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, Eurosys ’08, pages 315–328, New York, NY, USA, 2008. ACM.

[55] P. McDaniel, K. Butler, S. McLaughlin, R. Sion, E. Zadok, and M. Winslett. Towards a Secure and Efficient System for End-to-End Provenance. In Proceedings of the 2nd conference on Theory and practice of provenance, San Jose, CA, USA, Feb. 2010. USENIX Association.

[56] K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-Aware Storage Systems. In Proceedings of the 2006 USENIX Annual Technical Conference, 2006.

[57] K.-K. Muniswamy-Reddy, U. Braun, D. A. Holland, P. Macko, D. Maclean, D. Margo, M. Seltzer, and R. Smogor. Layering in Provenance Systems. In Proceedings of the 2009 Conference on USENIX Annual Technical Conference, ATC’09, June 2009.

[58] D. Nguyen, J. Park, and R. Sandhu. Dependency Path Patterns As the Foundation of Access Control in Provenance-aware Systems. In Proceedings of the 4th USENIX Conference on Theory and Practice of Provenance, TaPP’12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.

[59] Q. Ni, S. Xu, E. Bertino, R. Sandhu, and W. Han. An Access Control Language for a General Provenance Model. In Secure Data Management, Aug. 2009.

[60] J. Park, D. Nguyen, and R. Sandhu. A Provenance-Based Access Control Model. In Proceedings of the 10th Annual International Conference on Privacy, Security and Trust (PST), pages 137–144, 2012.

[61] D. Pohly, S. McLaughlin, P. McDaniel, and K. Butler. Hi-Fi: Collecting High-Fidelity Whole-System Provenance. In Proceedings of the 2012 Annual Computer Security Applications Conference, ACSAC ’12, Orlando, FL, USA, 2012.

[62] J. Postel. RFC 791: Internet Protocol. 1981.

[63] A. C. Revkin. Hacked E-mail is New Fodder for Climate Dispute. New York Times, 20, 2009.

[64] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS’09), pages 199–212, Chicago, IL, USA, Oct. 2009. ACM.

[65] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn. Design and Implementation of a TCG-based Integrity Measurement Architecture. In SSYM’04: Proceedings of the 13th conference on USENIX Security Symposium, pages 16–16, Berkeley, CA, USA, 2004. USENIX Association.

[66] C. Sar and P. Cao. Lineage File System. http://crypto.stanford.edu/~cao/lineage.html.

[67] P. Saxena, R. Sekar, and V. Puranik. Efficient Fine-grained Binary Instrumentation with Applications to Taint-tracking. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’08, pages 74–83, New York, NY, USA, 2008. ACM.

[68] J. Seibert, G. Baah, J. Diewald, and R. Cunningham. Using Provenance To Expedite MAC Policies (UPTEMPO) (Previously Known as IPDAM). Technical Report USTC-PM-015, MIT Lincoln Laboratory, October 2014.

[69] A. Shamir and Y. Tauman. Improved Online/Offline Signature Schemes. In J. Kilian, editor, Advances in Cryptology — CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science, pages 355–367. Springer Berlin / Heidelberg, 2001.

[70] D. Tariq, B. Baig, A. Gehani, S. Mahmood, R. Tahir, A. Aqil, and F. Zaffar. Identifying the Provenance of Correlated Anomalies. In Proceedings of the 2011 ACM Symposium on Applied Computing, SAC ’11, Mar. 2011.

[71] The Netfilter Core Team. The Netfilter Project: Packet Mangling for Linux 2.4. http://www.netfilter.org/, 1999.

[72] H. Vijayakumar, G. Jakka, S. Rueda, J. Schiffman, and T. Jaeger. Integrity Walls: Finding Attack Surfaces from Mandatory Access Control Policies. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’12, pages 75–76, New York, NY, USA, 2012. ACM.

[73] J. Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Technical Report 2004-40, Stanford InfoLab, Aug. 2004.

[74] World Wide Web Consortium. PROV-Overview: An Overview of the PROV Family of Documents. http://www.w3.org/TR/prov-overview/, 2013.

[75] C. Wright, C. Cowan, J. Morris, S. Smalley, and G. Kroah-Hartman. Linux Security Module Framework. In Ottawa Linux Symposium, page 604, 2002.

[76] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman. Linux Security Modules: General Security Support for the Linux Kernel. In Proceedings of the 11th USENIX Security Symposium, pages 17–31, Berkeley, CA, USA, 2002. USENIX Association.

[77] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, Y. Li, and D. D. E. Long. Evaluation of a Hybrid Approach for Efficient Provenance Storage. Trans. Storage, 9(4):14:1–14:29, Nov. 2013.

[78] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, D. Feng, and Z. Tan. Compressing Provenance Graphs, June 2011.

[79] N. Zeldovich, S. Boyd-Wickizer, and D. Mazières. Securing Distributed Systems with Information Flow Control. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI’08, pages 293–308, Berkeley, CA, USA, 2008. USENIX Association.

[80] N. B. Zeldovich, S. Boyd-Wickizer, E. Kohler, and D. Mazières. Making Information Flow Explicit in HiStar. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, OSDI’06, Nov. 2006.

[81] X. Zhang, A. Edwards, and T. Jaeger. Using CQUAL for Static Analysis of Authorization Hook Placement. In Proceedings of the 11th USENIX Security Symposium, 2002.

[82] W. Zhou, Q. Fei, A. Narayan, A. Haeberlen, B. T. Loo, and M. Sherr. Secure Network Provenance. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP’11, Oct. 2011.

[83] W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient Querying and Maintenance of Network Provenance at Internet-Scale. In ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2010.

[84] D. Y. Zhu, J. Jung, D. Song, T. Kohno, and D. Wetherall. TaintEraser: Protecting Sensitive Data Leaks Using Application-level Taint Tracking. SIGOPS Oper. Syst. Rev., 45(1):142–154, Feb. 2011.
