Risk Aware Query Replacement Approach for Secure Databases Performance Management

1545-5971 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2014.2306675, IEEE Transactions on Dependable and Secure Computing


Ousmane Amadou Dia

Computer Science and Engineering Dept.

University of South Carolina - Columbia

[email protected]

Csilla Farkas

Computer Science and Engineering Dept.

University of South Carolina - Columbia

[email protected]

Abstract – Large amounts of data and the increased demand to extract, analyze, and derive knowledge from data are nowadays impairing the performance of enterprise mission-critical systems such as databases. For databases, the challenging problem is to manage complex and sometimes non-optimized queries executed on enormous datasets stored across several tables. This generally results in increased query response time and loss of employee productivity. In this paper, we investigate the problem of enterprise computing resource availability. Our goal is to minimize performance degradation arising from resource-intensive queries. We propose a risk-aware approach that decouples the process of analyzing the resource requirements of SQL queries from their execution. We leverage XACML to control users' requests and to monitor database loads. This allows us to match the resources available in a database system to the computing resource needs of queries. A query can therefore run in a database if it does not severely impact the performance of the database. Otherwise, we propose to the requester a replacement query, denoted what-if-query. Such a query returns results that are similar to the results of the requester's query, is secure, and provides acceptable answers when it executes without compromising the performance of the database.

Keywords: XACML, risk adaptive access control, databases, SQL, cost estimation, information retrieval.

1 Introduction

As enterprises grow, so does their need to extend their workforce in order to meet certain demands. The volume and complexity of business data also grow. A contrasting observation, however, is that enterprises do not generally invest much in acquiring an IT infrastructure that best fits their needs [3]. When the workforce or the data increases, it oftentimes induces more computing resource usage which, if not well managed, can rapidly affect IT system performance. Poor processing power or downtime can have severe consequences: low transaction volume or click-through rate, dissatisfied customers, loss of employee productivity, and more. With the decline in hardware cost and the advent of cloud computing, new opportunities arise for enterprises to expand their IT infrastructure dynamically, at a low cost and on demand. However, system performance does not always scale with system size. Rather, it is highly dependent on the mixture of queries and data that constitute the workload of the system. In addition, many enterprises are reluctant to embrace cloud computing because of integration, interoperability, regulatory compliance or corporate governance, and security or privacy issues [1].

The real challenge for many enterprises today therefore lies in their ability to meet their rising computing needs while reducing their operational costs on hardware. The key to ensuring this is to define effective measures that guarantee a wise usage of computing resources. By wise usage, we mean that the need to access information in a system, as compelling as it can be in view of an enterprise's mission priorities, must be balanced with the overall performance of the system to limit certain risks. In this paper, we define risk as the likelihood that an IT system or a component becomes unavailable due to intensive demands, together with the ensuing consequences should the risk occur. From a security point of view, we are interested in defining an access control framework that guarantees continuous availability of enterprise IT systems. We focus on databases and propose an approach to control their performance while ensuring the need-to-know of the users [30].

In recent years, significant research has been devoted to developing access control models to enforce enterprise-specific security needs. Most of these models [26, 8, 25, 13, 22], however, deal with preventing unauthorized access to sensitive data or denial of service attacks. Others [24, 23] focus on quantifying risks in security-sensitive domains like healthcare or on tackling insider abuses [20, 14, 4, 7]. Few proposals approach the problem of availability from the perspective of computing resource usage monitoring and performance management. Ensuring the availability of an IT resource from that viewpoint is, however, as important as preserving its confidentiality or guaranteeing its integrity. As an example, databases that have high security and privacy requirements also handle large volumes of data and accesses. While most database management systems are valued for their robustness, the IT instances on which databases run are memory, CPU, or disk bound, which in many cases results in poor processing performance when the databases are under intensive demand.

An approach that most database administrators use to reduce workloads and thereby increase system performance is to abort or pause queries. Such an approach, however, is not efficient for several reasons. First, allowing queries to execute without assessing in advance their impact on the systems they run on exposes those systems to risks of inaccessibility due to potential overload. Second, aborting queries that are running wastes computing resources without satisfying the information needs of the requesters. We argue, however, that with good estimates of the resource requirements of each query and of the workload of the system processing the query, such inconveniences would not occur. Our intuition is that by evaluating these metrics beforehand and taking appropriate actions, we can minimize the impact of resource-intensive queries on these systems. For that, we must know the level of tolerance of the systems we deal with. Following [21], we use the concept of risk band to periodically get an idea of the workload of a given database system and to predict how much of its resources an incoming request will consume. More intuitively, the performance of each database system is monitored based on n risk layers determined by the enterprise owner of the system and numbered from 1 to n, the higher indices indicating greater system availability or performance (e.g., for n = 4, 4: highly-available, 3: available, 2: tolerable, and 1: critical, the latter referring to excessive workload or resource exhaustion). At any point in time, each system is in one of these states; our goal, however, is to prevent a system from reaching the critical state.

Contributions - In this paper, we propose a risk-adaptive framework for managing human users' requests for information on databases. Essentially, our framework allows a database to process a query if doing so does not drastically degrade its performance. Otherwise, our framework proposes to the requester a replacement query which, when executed, provides acceptable answers without preventing the processing of other queries. Besides being similar to the requester's query, this replacement query is also safe, i.e., it causes no performance issues, and secure, i.e., access to the answerset of the query is authorized to the requester. Compared to the existing query analysis techniques (e.g., query rewriting)¹, the novelty of our framework is that it decouples the process of analyzing the resource requirements of a query from its execution. Thereby, the execution of the query can be postponed in case it is considered to cause overhead. To find replacement queries, we adopt techniques used in the information retrieval field. Two situations may arise with this replacement. A requester may choose to wait until there is less resource contention, i.e., enough computing power in the system for his request to be processed and output results, or allow the proposed query to proceed. The choice will depend on the expected amount of time the requester's query takes to execute, a time we refer to as the wait-time, and on how similar the two queries are. In both cases, we provide these estimates to facilitate the decision process. We assume the users know what they want and are able to formulate queries that express their needs. Finally, we have also carried out an experimental evaluation of our framework. The results support the efficiency of our framework.

The remainder of the paper is organized as follows. Section 2 describes the problem we want to address. Sections 3, 4, and 5 cover the core components of our risk adaptive model. In Section 6, we present our experimental results. In Section 7, we review existing risk aware access control frameworks and compare them with our model. We conclude in Section 8 and outline some future work.

2 Preliminaries

2.1 Overview of XACML policies

At the top of an XACML policy [31] is a policy set, a container comprising policies or policy sets and a policy combining algorithm. A policy is a combination of one or more rules. It has a Target, a set of Rules, a rule combining algorithm, and Obligations. The target contains three components (Subject, Action, Resource) that identify the requests to which a policy is applicable. A rule is the most basic unit of an XACML policy that evaluates access requests. It consists of a Target (with a structure similar to the target of a policy), an Effect which is either permit or deny, and Conditions denoting the constraints that must hold for a request to be permitted or denied by the rule as specified in its Effect. Obligations are operations that must be executed in conjunction with a permit or a deny decision. A policy (policy set) may contain multiple rules (policies), each of which may evaluate to a different access decision. A rule (policy) combining algorithm is used as a procedure for resolving conflicts among rules (policies) upon a request. It combines the individual evaluation results of rules (policies) into one final decision.
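The way a rule combining algorithm resolves conflicts can be illustrated with a minimal Python sketch of first-applicable evaluation. The `Rule` class, its fields, and the dictionary-based request are hypothetical simplifications introduced for illustration only; they are not taken from the XACML schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical, simplified stand-in for an XACML rule."""
    subjects: set
    actions: set
    resources: set
    effect: str  # "Permit" or "Deny"
    # Condition over the request; defaults to "always holds".
    condition: Callable[[dict], bool] = lambda request: True

    def evaluate(self, request: dict) -> Optional[str]:
        """Return the rule's effect, or None when the rule does not apply."""
        in_target = (
            request["subject"] in self.subjects
            and request["action"] in self.actions
            and request["resource"] in self.resources
        )
        if in_target and self.condition(request):
            return self.effect
        return None


def first_applicable(rules, request):
    """Return the decision of the first rule that applies to the request."""
    for rule in rules:
        decision = rule.evaluate(request)
        if decision is not None:
            return decision
    return "NotApplicable"


# Illustrative rules: the order matters under first-applicable.
RULES = [
    Rule({"Alice"}, {"Select"}, {"Product"}, "Permit"),
    Rule({"Alice", "Bob"}, {"Select"}, {"Product", "Catalog"}, "Deny"),
]
```

Because the first matching rule wins, reordering `RULES` may change the final decision for the same request.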

Figure 1: Architecture of the Resource Monitor Module. (1) A user submits a query. (2) PEP translates the query to an equivalent XACML request. It finds the tables and columns the user can access from GMC and (3) sends them to TREM. (4) TREM retrieves from the PAP the policies applicable to the request. It also extracts some environment attributes (stored in PIP) for use in the Performance Prediction Module (PPM). (5) PPM determines the execution plans of the input query. Of these plans, PPM selects the optimal one and sends its cost and the wait-time of the CPU of the system (one of the servers in the figure) processing the query to TREM. (6) The Query Adjustment Module finds a replacement query (what-if-query) for the input query if it causes performance issues and sends it to TREM, which then evaluates it and sends it to PEP.

2.2 Framework Presentation

As the workload of a database system changes, so do the computing resources available on that system. Detecting whether or not a query will create performance issues on a system requires an estimate of the current workload of the system and of the amount of computing resources the query needs. Current relational database management systems (RDBMS) do so only when the submitted query is parsed by the SQL engine and ready to execute. Ideally, this information would be more useful if available before the database processes the query. The access control of the database system could then use this information as a security constraint to restrict the execution of a query should it affect the performance of the database. We provide a framework to overcome this limitation.

¹ See the related work section for a more in-depth comparison.

The Resource Monitor Module (RMM) depicted in Figure 1 is designed for that purpose. RMM minimizes risks of performance degradation of databases by detecting high resource consuming queries and preventing their execution at times when the databases cannot handle them. RMM uses a security policy expressed in XACML. This policy specifies which human users are authorized to perform which actions on which databases or their components (tables or columns). RMM is also responsible for evaluating database systems' workloads and incoming queries' resource requirements, and for adjusting users' information needs to the available computing resources.

RMM receives its inputs from the Policy Enforcement Point (PEP). Our PEP extends the standard XACML PEP. It processes users' SQL queries by transforming object names (aliases) of tables or columns into internal names, and normalizing predicates into canonical format. These processed SQL queries are sent to PPM, the RMM's Performance Prediction Module. PPM first determines the queries' execution plans. Then, it evaluates the queries' computing resource needs by selecting the most cost effective execution plan (see Section 4). Our PEP also translates queries submitted by users into XACML requests that are evaluated by the Target and Rule Evaluator Module (TREM). Finally, it provides to RMM the tables and columns that the users submitting queries can access, and additional metadata stored in the Policy Information Point (PIP) for controlling the users' access to the stored data.

The Global Metadata Catalog (GMC) is a repository of relational tables. It consists of all_tables, which stores the number of tuples of each relation and the number of blocks that contain tuples of each relation of a database; all_ind_col, which contains descriptions of all indexes defined on each relation of each database; all_risk_bands, which provides information about database systems' risk bands; all_tab_col, which provides information about relations such as integrity constraints and column names; all_tab_val, which stores the occurrence counts of attribute-value pairs, where an attribute corresponds to a column name of a table; and all_db_queries, which stores the queries executed by a database and their costs. The schema of each table is presented in the Appendix section.

2.3 Problem Formalization

We want to ensure that granting users access to information in databases induces minimal risks. We define risk as the likelihood of drastic performance degradation or sudden workload increase in a database system to the point that it prevents other queries from executing. We consider the performance of each system to be monitored based on risk layers or bands determined by the enterprise owner of the system. One practical approach to determine these risk layers is to put a database system under load and stress tests, compute the transaction count per second (tps), and record the CPU usage based on the disk I/O.

Definition 2.3.1. - Risk Bands. We define the risk bands (in unit time) of a database system as $n$ intervals of the form $([\alpha_j, \beta_j[)_{1 \le j \le n}$ where $\alpha_j < \alpha_i$, $\beta_j < \beta_i$, and $\alpha_i = \beta_{i+1}$, $\forall\, 1 \le i < j \le n$. The risk bands with the higher indices indicate greater system availability. For each $1 \le j \le n$, $[\alpha_j, \beta_j[$ defines a range of values where each value refers to the amount of time the CPU of a database system may wait for resources to be read from or written to disk.

If $\tau(\cdot)$ denotes the amount of time the CPU of a database system remains idle waiting for blocks containing tuples to be retrieved from disk, then at any point in time $t$, there exists $1 \le k \le n$ such that $\alpha_k \le \tau(t) < \beta_k$.²

Example 2.1. - Risk Bands. We give in Fig. 2 an example of a system with $n$ risk bands $[\alpha_n, \beta_n[, \dots, [\alpha_1, \beta_1[$. The performance of the system is measured based on these $n$ risk bands. When $\tau(t)$, which represents the amount of time the CPU of the database system waits for blocks to be read from disk, falls within a risk band $[\alpha_i, \beta_i[$ with $i \ge l$, the performance is considered good. If the execution of an incoming query would make $\tau(t)$ fall in $[\alpha_i, \beta_i[$ with $i < l$, then the execution of that query is postponed to a later time when there are enough resources, or a replacement query is executed instead.

Figure 2: Example of a database system risk bands.
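The band lookup described in Definition 2.3.1 and Example 2.1 can be sketched in Python as follows. The band boundaries (in milliseconds), the function names, and the choice of `l` are hypothetical values chosen for illustration; only the ordering constraints come from the definition.

```python
import math

# Hypothetical bands for n = 4, stored so that BANDS[j - 1] = (alpha_j, beta_j).
# They satisfy alpha_j < alpha_i for i < j and alpha_i = beta_{i+1}.
# Band 1 holds the longest wait times; band 4 the shortest.
BANDS = [
    (300.0, math.inf),  # band 1
    (200.0, 300.0),     # band 2
    (100.0, 200.0),     # band 3
    (0.0, 100.0),       # band 4
]


def band_index(bands, wait_time):
    """Return k such that alpha_k <= wait_time < beta_k."""
    for k, (alpha, beta) in enumerate(bands, start=1):
        if alpha <= wait_time < beta:
            return k
    raise ValueError("wait time falls outside every risk band")


def performance_is_good(bands, wait_time, l):
    """Performance is considered good when tau(t) lies in a band i >= l."""
    return band_index(bands, wait_time) >= l
```

With `l = 2`, for instance, a wait time of 250 ms is still acceptable, while 350 ms falls in band 1 and would trigger postponement or replacement.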

Problem Formalization – Let $q_i$ be an input query, and $Q_{log}$ the log of queries of the database processing $q_i$. At a time $t$, $\exists\, [\alpha_k, \beta_k[$, $1 \le k \le n$, such that $\alpha_k \le \tau(t) < \beta_k$. Let us assume there exists $l$, $1 \le l < n$, such that for any $l' < l$, the performance of the database system is considered critical. If $cost(q_i) + \tau(t) \ge \alpha_{l'}$ and there exists a query $q_{j_0} \in Q_{log}$ such that $q_{j_0} = \arg\min_{q_j \in \mathcal{X}(q_i)} (cost(q_j) + \tau(t))$, where $\mathcal{X}(q_i) = \{q_j \in Q_{log} \mid \varepsilon \le Sim(q_i, q_j) \le 1\}$ and $\varepsilon$ is the similarity threshold ($0 \le \varepsilon \le 1$), then we replace $q_i$ by $q_{j_0}$; otherwise we postpone the execution of $q_i$ to a time $t' > t$ where $cost(q_i) + \tau(t') < \alpha_{l'}$. The minimum similarity threshold $\varepsilon$ is used to prune the number of queries found similar to the query $q_i$.
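The decision rule in the formalization above can be sketched in Python as follows. The function `decide`, the `Decision` tuple, and the callables `cost` and `sim` are hypothetical names introduced for this sketch; the sketch follows the formalization literally and does not add any further check on the cost of the replacement query.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    action: str            # "execute", "replace", or "postpone"
    query: Optional[str]   # query to run now, if any


def decide(q, query_log, cost, sim, tau_t, alpha_crit, eps):
    """Apply the rule: execute q, replace it, or postpone it.

    cost(q)    -> estimated cost of a query
    sim(q, qj) -> similarity score in [0, 1]
    tau_t      -> current CPU wait time tau(t)
    alpha_crit -> threshold alpha_{l'} from the formalization
    eps        -> minimum similarity threshold
    """
    if cost(q) + tau_t < alpha_crit:
        return Decision("execute", q)
    # X(q): logged queries whose similarity to q is at least eps.
    candidates = [qj for qj in query_log if eps <= sim(q, qj) <= 1.0]
    if candidates:
        # tau_t is the same for every candidate, so this is the cheapest one.
        best = min(candidates, key=lambda qj: cost(qj) + tau_t)
        return Decision("replace", best)
    return Decision("postpone", None)


# Illustrative data only.
COSTS = {"q": 80, "a": 10, "b": 30, "c": 5}
SIMS = {"a": 0.9, "b": 0.95, "c": 0.2}


def example_cost(query):
    return COSTS[query]


def example_sim(query, other):
    return SIMS[other]
```

In this toy setting, query `c` is the cheapest logged query but is pruned by the similarity threshold, so `a` is proposed instead.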

3 Target and Rule Evaluator Module

The target and rule evaluator module (TREM) is analogous to the XACML Policy Decision Point (PDP). The role of TREM is to evaluate human users' requests based upon the users' privileges and the availability of computing resources to process their requests. The aim is to handle users' requests with minimum performance degradation.

² If $\tau(t) \to \infty$, then $[\alpha_k, \beta_k[ \to [\alpha_k, \infty[$.

In relational databases, users' requests are generally formulated as select from where SQL queries. The resource that is requested and the type of query that is executed define what is known in access control modeling as a permission. In this paper, we extend this definition. We refer to a permission as an access right that a human user is granted or denied on a given object of a system based upon the security requirements of the system owner and the capability of that system to handle (resource-wise) the access without affecting its performance. We want (1) security enforcement to ensure that only legitimate users can access the information they are authorized for, and (2) performance monitoring to measure the impact of an access request on the performance of a system.

Current RDBMS provide some intrinsic security mechanisms designed to control unauthorized accesses and data tampering. They come integrated with security tools such as reference monitors that leverage native access control lists (ACLs) or matrices (ACMs) to manage users' requests for information. ACLs or ACMs provide fine-grained security enforcement capabilities. However, they do not scale well on systems with large numbers of subjects or protection objects. Moreover, they lack flexibility because permissions over protection objects have to be explicitly encoded within files. This, given the context of this work, is a limitation for enforcing characteristic 2 of our definition of a permission. Thus, to ensure both characteristics 1 and 2, we use XACML [31]. XACML can express both ACL and ACM security policies. Moreover, it provides means to enforce performance requirements as access constraints. To support compatibility of our model with existing DBMSs, we show how to translate ACL and ACM specifications into XACML access control policies.

3.1 From ACLs and ACMs to XACML

From the ACL and ACM rules of databases, TREM defines high level XACML policies against which users' queries are evaluated. We use XACML because it is more flexible than ACLs and ACMs, particularly for enforcing enterprise policies where different users may share the same types of permissions (e.g., select from where) on similar resources (e.g., tables, table columns), and most importantly because it can encode performance requirements as environment attributes. This is crucial in this work for adjusting users' information needs to database systems' performance. To illustrate the translation of a DAC specification into an XACML policy, we give the following example. For simplicity, we only consider the translation of an ACM to an XACML policy; the mapping of an ACL to an XACML policy follows intuitively.

User/Group | Product | Catalog | Employee: ID
Alice | Select | None | None
Bob | None | Select | Select
Guest | None | None | None
Manager | Select | Select | Select

Table 1: An example of ACM.

Example 3.1. - Mapping an ACM to an XACML Policy. In Table 1, “Employee: ID” refers to the column named ID of the relational table Employee. We translate this ACM into a parsed XACML policy by using the First-applicable combining algorithm [9, 29, 28]. By applying this approach, we can map the access control lists or matrices of any database system to their equivalent XACML policies, which we group in the RMM policy administration point (PAP). For every XACML rule we create, we add the condition $(cost(q) + \tau(t) \le \alpha_{l'})$. In the policy shown below, for instance, we want to ensure that the cost of executing a query q submitted by Alice, Bob, or any user in the groups manager or guest will not cause performance issues.

Target | Subject: Alice, Bob, role = manager, role = guest; Action: Select; Resource: Employee: ID, Product, Catalog
Effect | Permit
Condition | $cost(q) + \tau(t) \le \alpha_{l'}$
Rule Combining Algorithm | First-applicable

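q : “Select $A_1, \dots, A_n$ From $R$ Where $C_1$ AND ... AND $C_m$” where $\forall\, C_i$, $1 \le i \le m$, either $C_i = (A_i \;\theta\; v_i)$, with $\theta$ a comparison operator, or $C_i = (A_i \text{ IN } Q_i)$, where $Q_i$ is a range $[lb, ub]$ or a set of values.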
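One possible way to derive rules from the ACM of Table 1 is sketched below in Python. The dictionary layout, the function `acm_to_rules`, and the choice of emitting one Permit rule per granted cell are assumptions made for this sketch; the policy shown above groups subjects and resources into a single target instead. The condition is kept as a plain string because its evaluation depends on the cost and wait-time estimates discussed later.

```python
# Table 1 encoded as a nested dictionary (None means no access).
ACM = {
    "Alice":   {"Product": "Select", "Catalog": None,     "Employee:ID": None},
    "Bob":     {"Product": None,     "Catalog": "Select", "Employee:ID": "Select"},
    "Guest":   {"Product": None,     "Catalog": None,     "Employee:ID": None},
    "Manager": {"Product": "Select", "Catalog": "Select", "Employee:ID": "Select"},
}

# Performance condition attached to every generated rule.
PERFORMANCE_CONDITION = "cost(q) + tau(t) <= alpha_l'"


def acm_to_rules(acm):
    """Emit one Permit rule per (subject, resource) cell that grants an action."""
    rules = []
    for subject, row in acm.items():
        for resource, action in row.items():
            if action is None:
                continue  # no permission: no rule is generated
            rules.append({
                "subject": subject,
                "action": action,
                "resource": resource,
                "effect": "Permit",
                "condition": PERFORMANCE_CONDITION,
            })
    return rules


RULES = acm_to_rules(ACM)
```

Under this encoding, a subject with no granted cell (Guest in Table 1) produces no rule and would therefore not be permitted by any of them.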

3.2 From SQL Queries to XACML Requests

To show how our PEP translates queries into XACML requests that TREM evaluates, let us consider the query q given above. An XACML request consists essentially of four components: the subject, resource, action, and environment attributes. The subject attributes define the user credentials. Although in SQL users' credentials do not appear in the queries users submit, such information, together with the permissions granted to the users, can be extracted after the users are authenticated or if they belong to default groups like customers or guests. The resource attributes define the requested resources. The action attributes define the actions the user intends to perform on the resources. The environment attributes, oftentimes optional, are system-related attributes such as time and date. In our case, we consider the wait-time of a database system CPU as an environment attribute, and users' credentials to be stored in PIP (see Figure 1). The query q becomes:

Subject | “User Stored Credentials.”
Action | Select
Resource | $O = \pi_{A_1,\dots,A_n}(W)$ where $W = \sigma_C(R)$ with $C = (c_1 \text{ AND } \dots \text{ AND } c_m)$

In the table above, $R = R_1 \times R_2 \times \dots \times R_n$, $R = \bigcup_{i=1}^{n} R_i$, or $R = \bigcap_{i=1}^{n} R_i$, a set difference of relations, or a combination of these cases with $attr(R_i) = attr(R_j)$, $\forall\, 1 \le i, j \le n$.

4 Performance Prediction Module

In this section, we present the Performance Prediction Module (PPM) of our model. With this module, a query is allowed to execute only if the database processing it has enough computing resources to meet the query's resource requirements, and if doing so will not degrade the system's performance and impede the execution of ongoing queries. For this, PPM first evaluates the database workload, then estimates the resource requirements of the query, and finally predicts the query's impact on the database behavior. Compared to the existing database query analysis techniques, PPM's distinctive feature is that it decouples the process of analyzing the resource requirements of a query from its execution. Thereby, the execution of the query can be postponed if it will cause overhead.

To estimate the resource requirements of a submitted query, and thereby get an idea of its impact on a database's behavior, PPM relies on disk I/O. In this work, we focus on the data retrieval aspect of the queries. With disk I/O, the number of block transfers from disk is used as the metric to evaluate system performance. The higher this number is, the longer the CPU of a given database system has to wait for blocks to be read from or written to the disk. This, consequently, results in longer processing times for the queries submitted by users on this system. Many tracing tools can be used for monitoring disk access. OS tools like iostat and top on Linux systems, or perfmon and dynamic management views on Microsoft systems, for instance, provide an easy way to estimate the disk I/O of a DBMS without introducing overhead in the database system.

Before PPM estimates the resource needs of a query, it transforms the query into one or several semantically equivalent relational algebra expressions. Because a query may have multiple relational algebra expressions, different strategies for estimating the cost of the query may exist. Of these, only the strategy with the most cost effective (ideally the least costly) evaluation plan is chosen before PPM sends it to the database processing the query. We use this approach because in mainstream databases, the resource requirements of a query are estimated based on the optimal execution plan of the query.

4.1 From SQL To Relational Algebra

Let us consider the query q defined in Section 3.1. To translate q into a relational algebra expression, we divide it into select (S), relation(s) (R), and where (W) components. The relation(s) component is defined as per below.

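$$R = \begin{cases} R_1 & \text{if } k = 1 \\ R_1 \times R_2 \times \dots \times R_k & \text{if } 1 < k \le n \end{cases}$$

or as $R = \bigcup_{i=1}^{n} R_i$, or $R = \bigcap_{i=1}^{n} R_i$, a set difference of relations, or a combination of these cases with $attr(R_i) = attr(R_j)$, $\forall\, 1 \le i, j \le n$.

The where component W is defined as per below.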

$$W = \begin{cases} R & \text{if } L(q) = \emptyset \\ \sigma_C(R) & \text{if } C = c_1 \text{ AND } \dots \text{ AND } c_m,\ (m \ge 1) \end{cases}$$

$C$ denotes the selection condition of $q$ where each $c_i$, $1 \le i \le m$, can be a literal, a conjunction or disjunction of literals, or a join predicate. In the following, we will use instead the set of literals $L(q) = \{l_i, (l_i = \neg b_i) : b_i \in cond(q)\}$ with $cond(q) = \{\neg b_1, \dots, \neg b_k\}$, $k \le m$, obtained by putting $C$ in conjunctive normal form (CNF).

The select component S is defined as:

$$S = \begin{cases} W & \text{if } attr(R) = \{A_1, \dots, A_n\} \\ \pi_{A_1,\dots,A_n}(W) & \text{otherwise} \end{cases}$$

$\{A_1, \dots, A_n\}$ denotes the set of projection attributes of the query $q$, a set we refer to in the following as $attr(q)$. Once a query is parsed and S, R, and W are obtained, it is easy to find all the relational algebra expressions of the query because of the commutativity and associativity of $\sigma, \pi, \bowtie, \cup, \cap, -$. For more on these properties, we refer the reader to the formalization given by Cao et al. in [2].

Example 4.1. - Equivalent Transformation. Let us consider the query q given below. With this query, we have $R = \text{EMPLOYEES} \times \text{SALARY}$, $W = \sigma_C(R)$, where $C = (\text{EMPLOYEES.Title} = \text{SALARY.Title} \wedge \text{Salary} \ge 5000)$, and $S = \pi_{A_1,A_2}(W)$ where $A_1 = \text{EMPLOYEES.Name}$ and $A_2 = \text{SALARY.Salary}$. By using, for example, the commutativity of $\wedge$ ($\sigma_{F_1 \wedge F_2}(r) \equiv \sigma_{F_1}(\sigma_{F_2}(r)) \equiv \sigma_{F_2}(\sigma_{F_1}(r))$), we obtain two other transformations equivalent to W, which generate two expressions equivalent to S.

q: “Select e.Name, s.Salary From EMPLOYEES e, SALARY s Where e.Title = s.Title AND e.Salary ≥ 5000”

In the remainder, we use $EP(q) = \{p_1, \dots, p_l\}$, where $p_i = \langle R_i, W_i, S_i \rangle$ and $R_i, W_i, S_i$ are the relation, where, and select components, to refer to the set of all relational algebra expressions of a query $q$. $\forall\, i \ne j$, $p_i \equiv p_j$, i.e., $p_i.R_i \equiv p_j.R_j$, $p_i.W_i \equiv p_j.W_j$, and $p_i.S_i \equiv p_j.S_j$.
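The equivalence used in Example 4.1 can be checked on a toy instance with the Python sketch below. The sample rows, the helper names `cross`, `select`, and `project`, and the assumption that the Salary attribute belongs to the SALARY relation are all illustrative choices, not data from the paper.

```python
from itertools import product

# Hypothetical sample relations.
EMPLOYEES = [{"Name": "Ann", "Title": "Engineer"}, {"Name": "Raj", "Title": "Clerk"}]
SALARY = [{"Title": "Engineer", "Salary": 7000}, {"Title": "Clerk", "Salary": 3000}]


def cross(r, s, r_name, s_name):
    """Cartesian product R x S with attributes prefixed by the relation name."""
    return [
        {**{f"{r_name}.{k}": v for k, v in t.items()},
         **{f"{s_name}.{k}": v for k, v in u.items()}}
        for t, u in product(r, s)
    ]


def select(pred, rel):
    """Selection sigma_pred(rel)."""
    return [t for t in rel if pred(t)]


def project(attrs, rel):
    """Projection pi_attrs(rel), returned as a list of tuples."""
    return [tuple(t[a] for a in attrs) for t in rel]


def f1(t):
    return t["EMPLOYEES.Title"] == t["SALARY.Title"]


def f2(t):
    return t["SALARY.Salary"] >= 5000


R = cross(EMPLOYEES, SALARY, "EMPLOYEES", "SALARY")

# Three equivalent forms of the where component W.
W_CONJ = select(lambda t: f1(t) and f2(t), R)   # sigma_{F1 and F2}(R)
W_F1_F2 = select(f1, select(f2, R))             # sigma_{F1}(sigma_{F2}(R))
W_F2_F1 = select(f2, select(f1, R))             # sigma_{F2}(sigma_{F1}(R))

S = project(["EMPLOYEES.Name", "SALARY.Salary"], W_CONJ)
```

The three where components contain the same tuples, although the intermediate results they materialize differ in size, which is why the plans may differ in cost.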

To estimate the resource needs of a query, we use the information below (extracted from GMC) to evaluate the cost of its execution plans. If $p_i \in EP(q)$ has the lowest cost, we say $p_i$ is the most cost effective (optimal) execution plan and define $cost(p_i)$ as the resource needs of the query.

• $Q_{log}$: the set of queries already processed in a given database. The resource needs of these queries are known (stored in GMC). We assume, however, that their costs are periodically updated as tuples are inserted or deleted.

• $N_R$: the number of tuples in a relation $R$.

• $B_R$: the number of blocks that contain tuples of $R$.

• $B(c_k, R)$: the occurrence count of a pair $\langle A_k, v_k \rangle$ where $c_k = (A_k = v_k)$, $A_k$ being an attribute (column name) of the relation $R$ and $v_k \in V(A_k, R)$, the set of values $A_k$ takes in $R$. For instance, if there are 5 tuples in $R$ where an attribute $A$ takes a value $v_k$, then $B(c_k, R) = 5$.

• $HT_I$: the number of levels in an index $I$ (B+-tree) of $R$.

• $F_R$: the number of tuples of $R$ that fit into one block.

• $S(A, R)$: the average number of tuples of $R$ that satisfy an equality constraint (predicate) on $A$.

4.2 SQL Queries Cost estimation

To estimate the resource needs of a query, we evaluate the satisfiability of each of the cases presented below and consider the cost associated with the satisfied case as the execution cost of the query. These cases are derived from Molina et al.'s approach [11]. We define satisfies(L(q), p, C∗) as a function that returns true if, for an execution plan p, L(q) satisfies one of these cases, and false otherwise.

C1: If there exists $c_k = (A_k = v_k) \in L(q)$ where $v_k$ is a constant or a tuple value, and $A_k$ is a key attribute on which a relation $R$ in $q$ is ordered, use the binary search algorithm (S1) to locate the records and check whether the records satisfy the remaining predicates in $L(q)$: $cost(S1) = \lceil \log_2(B_R) \rceil + \lceil S(A_k, R)/F_R \rceil$.

C2: If there exists $c_k = (A_k \;\theta\; v_k) \in L(q)$ where $v_k$ is a constant or a tuple value, $\theta \in \{<, \le, =, >, \ge\}$, and $A_k$ is a key attribute with a primary index $I$ (or a hash key) on which a relation in $q$ is ordered, use $I$ (or the hash key) to locate the records and check whether the records satisfy the remaining predicates ($\forall\, c_j \in L(q)$, $j \ne k$) (S2): $cost(S2) = HT_I + \lceil S(A_k, R)/F_R \rceil$.

C3: If for all $c_k = (A_k = v_k) \in L(q)$, $v_k$ is a constant or a tuple value, $A_k$ is a non-key attribute, and there exists no index, use the linear search to locate all the records satisfying $L(q)$ (S3): $cost(S3) = B_R$.

C4: If there exists $c_k = (A_k \;\theta\; v_k) \in L(q)$ where $v_k$ is a constant or a tuple value, $\theta \in \{<, \le, >, \ge\}$, and $A_k$ is a non-key attribute but with a clustering index $I$, use the clustering index to locate all the records satisfying $L(q)$ (S4): $cost(S4) = HT_I + \lceil S(A_k, R)/F_R \rceil$.

C5: If there is $c_k = (A_k \text{ IN } Q_k) \in L(q)$ where $Q_k$ is a set of values, and $A_k$ is a non-key attribute, use B+-tree search to locate the records satisfying $c_k$, then check whether the records satisfy the remaining predicates (S5): $cost(S5) = |Q_k| \cdot (HT_I + \lceil S(A_k, R)/F_R \rceil)$.
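The cost formulas of cases C1 to C5 can be evaluated numerically with the Python sketch below. The `RelationStats` class, the function names, and the sample statistics are hypothetical and serve only to make the formulas concrete; the selectivity $S(A_k, R)$ is passed in as a parameter.

```python
import math
from dataclasses import dataclass


@dataclass
class RelationStats:
    """Hypothetical container for the catalog statistics of one relation."""
    n_blocks: int          # B_R
    tuples_per_block: int  # F_R
    index_height: int      # HT_I


def matching_blocks(stats, selectivity):
    """ceil(S(A_k, R) / F_R): blocks holding the matching tuples."""
    return math.ceil(selectivity / stats.tuples_per_block)


def cost_s1(stats, selectivity):
    """C1: binary search on an ordered key attribute."""
    return math.ceil(math.log2(stats.n_blocks)) + matching_blocks(stats, selectivity)


def cost_s2(stats, selectivity):
    """C2: primary index (or hash key) lookup."""
    return stats.index_height + matching_blocks(stats, selectivity)


def cost_s3(stats):
    """C3: linear scan of every block."""
    return stats.n_blocks


def cost_s4(stats, selectivity):
    """C4: clustering index lookup (same formula as C2)."""
    return stats.index_height + matching_blocks(stats, selectivity)


def cost_s5(stats, selectivity, n_values):
    """C5: one B+-tree search per value in the IN set Q_k."""
    return n_values * (stats.index_height + matching_blocks(stats, selectivity))


# Illustrative statistics: 1024 blocks, 10 tuples per block, 3 index levels.
STATS = RelationStats(n_blocks=1024, tuples_per_block=10, index_height=3)
```

With these sample values and a selectivity of 100 tuples, the index-based strategies need far fewer block transfers than the linear scan, which is the kind of gap the plan selection exploits.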

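Algorithm 1: Selection of a SQL Query Optimal Execution Plan.

Input: q, EP(q) = {p1, ..., pl}.
Output: execution plan p ∈ EP(q).

maintain an empty array plan of length |EP(q)|;
foreach pi ∈ EP(q) do
    cost(pi) ← +∞;
    if (satisfies(L(q), pi, C1) = true) then
        v ← cost(S1);
        cost(pi) ← min(cost(pi), v);
    if (satisfies(L(q), pi, C2) = true) then
        v ← cost(S2);
        cost(pi) ← min(cost(pi), v);
    if (satisfies(L(q), pi, C3) = true) then
        v ← cost(S3);
        cost(pi) ← min(cost(pi), v);
    if (satisfies(L(q), pi, C4) = true) then
        v ← cost(S4);
        cost(pi) ← min(cost(pi), v);
    if (satisfies(L(q), pi, C5) = true) then
        v ← cost(S5);
        cost(pi) ← min(cost(pi), v);
    plan[pi] ← cost(pi);
p ← arg min over pi ∈ EP(q) of plan[pi];
return {p, cost(p)}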

In C1, C2, and C5, $S(A_k, R) = N_R/|V(A_k, R)|$. In C4, $S(A_k, R) = N_R/B(c_k, R)$. The term $\lceil S(A_k, R)/F_R \rceil$ denotes the number of blocks that contain tuples of $R$ satisfying the condition $c_k$ [11]. Algorithm 1, which our PPM (see Figure 1) implements, checks these cases to find optimal execution plans for submitted queries.
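A minimal Python sketch of the plan selection loop of Algorithm 1 is given below. The callables `satisfies` and `case_cost`, the plan identifiers, and the toy cost table are assumptions introduced for illustration; how the cases are actually tested against a plan is left abstract.

```python
import math

CASES = ("C1", "C2", "C3", "C4", "C5")


def select_optimal_plan(literals, plans, satisfies, case_cost):
    """Return (plan, cost) for the cheapest plan in `plans`.

    satisfies(literals, plan, case) -> True when the plan meets the case
    case_cost(plan, case)           -> cost of the strategy tied to the case
    Raises ValueError when `plans` is empty.
    """
    plan_cost = {}
    for plan in plans:
        cost = math.inf
        for case in CASES:
            if satisfies(literals, plan, case):
                cost = min(cost, case_cost(plan, case))
        plan_cost[plan] = cost
    best = min(plan_cost, key=plan_cost.get)
    return best, plan_cost[best]


# Toy setting: which cases each hypothetical plan satisfies, and at what cost.
TOY_COSTS = {
    ("p1", "C3"): 1024,
    ("p2", "C2"): 13,
    ("p2", "C3"): 1024,
}


def toy_satisfies(literals, plan, case):
    return (plan, case) in TOY_COSTS


def toy_case_cost(plan, case):
    return TOY_COSTS[(plan, case)]
```

A plan that satisfies none of the cases keeps an infinite cost and is therefore never preferred over a plan with a finite estimate.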

Based on the PPM estimate of a query's resource requirements ($cost(p)$), TREM evaluates the risk incurred by the system processing the query based on the system's most recent behavior. If allowing the query to execute would cause performance issues to the system, then TREM prevents its execution and the Query Adjustment Module (QAM) looks for a replacement query. Next, we explain how QAM finds that query.

5 Query Adjustment Module

The role of QAM is to provide to a requester the information he needs while minimizing the risk of performance degradation. QAM does so in two steps. First, it determines the degree of similarity between a requester's input query and the queries previously executed in the database processing the requester's query. Then, it selects from these queries the query that is the most similar to the requester's input query but with lower resource needs.

We assume each database maintains the logs of the queries it has processed, and we centralize these logs in GMC. We pair each of these queries, denoted as what-if-query, with the input query and compute the similarity score of each pair. This score reflects how similar two paired queries are. A similarity score ranges between 0 and 1. The higher (i.e., closer to 1) the similarity score between a what-if-query and an input query is, the more similar the queries are; and the lower the computing resources the what-if-query requires, the more likely it is to be proposed by QAM as a replacement for the requester's query.

When QAM proposes a what-if-query to a requester, italso provides the similarity score between this query andthe requester’s query, and a wait-time. The wait-time isan estimation of the time the requester has to wait beforeenough computing resources are available for his originalquery to execute. One of the following situations may hap-pen. First, the requester may decide to wait until there areless resource contentions in the database if he considers thesimilarity score low. The second possibility is when therequester allows the what-if-query to execute. This casearises if the requester judges the wait-time too long andthe what-if-query reflective enough of his needs. Below, weexplain what the concept “Query Similarity” means froma typical relational database perspective.

Definition 5.0.1. - Query Similarity. Let q and q′ be two sql queries, and attr(q) and attr(q′) their sets of column names. Let R = {t1, ..., tn} and R′ = {t′1, ..., t′k} be the answersets (sets of tuples) of q and q′ after they execute. Let A = attr(q) ∩ attr(q′), O = {t ∈ R : ∀ a ∈ A, t[a] = t′[a], ∀ t′ ∈ R′} and O′ = {t′ ∈ R′ : ∀ a ∈ A, t′[a] = t[a], ∀ t ∈ R}. If O ≠ ∅ or O′ ≠ ∅, then we say q and q′ overlap over A. The closer A is to attr(q) or attr(q′), the more similar q and q′ are. More formally, considering for instance the Levenshtein distance metric d:

• If d(A, attr(q)) → 0 then O → R′, or

• If d(A, attr(q′)) → 0 then O′ → R

where for any two non-empty sets of strings A and B, we define the Levenshtein distance between A and B as d(A, B) = inf_{a∈A} d(a, B) = inf_{b∈B} d(A, b) = min_{a∈A, b∈B} d(a, b); d(a, b) is defined below, with i and j being indices over the strings a and b.

1. “Select Name, Salary, Seniority From EMPLOYEES Where Seniority ≤ 5years”

2. “Select SSN, Name, Salary, DeptId From EMPLOYEES Where Salary ≤ 70K”




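\[
d(i, j) =
\begin{cases}
0 & \text{if } i = j = 0 \\
i & \text{if } j = 0,\ i > 0 \\
j & \text{if } i = 0,\ j > 0 \\
\min
\begin{cases}
d(i-1, j) + 1 \\
d(i, j-1) + 1 \\
d(i-1, j-1) & \text{if } a_i = b_j \\
d(i-1, j-1) + 1 & \text{if } a_i \neq b_j
\end{cases}
& \text{otherwise}
\end{cases}
\]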
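The recurrence for d(i, j) can be evaluated row by row. The following is a minimal sketch of that computation, together with the set-level distance d(A, B) taken as the minimum over all pairs of strings; the function names are illustrative and not part of the original framework.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance d(|a|, |b|) computed with the recurrence for d(i, j)."""
    prev = list(range(len(b) + 1))               # d(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # d(i, 0) = i
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1,         # d(i-1, j) + 1
                            curr[j - 1] + 1,     # d(i, j-1) + 1
                            substitution))       # d(i-1, j-1) (+1 if a_i != b_j)
        prev = curr
    return prev[-1]

def set_distance(A: set, B: set) -> int:
    """d(A, B) = min over a in A, b in B of d(a, b); both sets must be non-empty."""
    return min(levenshtein(a, b) for a in A for b in B)

# Attribute sets of two hypothetical queries that share the column "Salary".
shared_distance = set_distance({"Name", "Salary"}, {"Salary", "DeptId"})
```

Since the two attribute sets share the string "Salary", the set-level distance between them is zero.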

#     Name      Salary   Seniority
1     Alice     20K      1 year
2     Bob       30K      2 years
...   ...       ...      ...
149   John      40K      3 years
150   Jessica   70K      4 years

#     SSN    Name      Salary   DeptID
1     0100   Alice     20K      0010
2     1010   Bob       30K      0023
...   ...    ...       ...      ...
149   2020   John      40K      0103
150   2332   Jessica   70K      0020
...   ...    ...       ...      ...
180   2662   Talulah   65K      0020

Example 5.1. Let us consider the two sql queries above. The two tables above show the resultsets of the queries. The shaded regions correspond to the tuple attributes (Name, Salary) having similar values in all tuples of both queries. We say that the two queries overlap over the set of attributes {Name, Salary}. The more elements we have in this set, the more similar the two queries will be.
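The overlap in Example 5.1 can be checked mechanically once both answersets are known. The sketch below is one possible reading of Definition 5.0.1: it treats a tuple as overlapping when at least one tuple of the other answerset agrees with it on every shared attribute, which is an interpretation rather than the authors' stated procedure. The row data is copied from the two tables; the function name is hypothetical.

```python
def overlap(rows_q, rows_q2):
    """Return (A, O, O'): shared attributes and the tuples that agree on them."""
    if not rows_q or not rows_q2:
        return [], [], []
    shared = sorted(set(rows_q[0]) & set(rows_q2[0]))   # A = attr(q) ∩ attr(q')
    if not shared:
        return [], [], []

    def key(row):
        return tuple(row[a] for a in shared)

    common = {key(r) for r in rows_q} & {key(r) for r in rows_q2}
    return (shared,
            [r for r in rows_q if key(r) in common],
            [r for r in rows_q2 if key(r) in common])

q1_rows = [
    {"Name": "Alice", "Salary": "20K", "Seniority": "1 year"},
    {"Name": "Jessica", "Salary": "70K", "Seniority": "4 years"},
]
q2_rows = [
    {"SSN": "0100", "Name": "Alice", "Salary": "20K", "DeptID": "0010"},
    {"SSN": "2332", "Name": "Jessica", "Salary": "70K", "DeptID": "0020"},
    {"SSN": "2662", "Name": "Talulah", "Salary": "65K", "DeptID": "0020"},
]
attrs, o1, o2 = overlap(q1_rows, q2_rows)
```

Here the shared attribute set is {Name, Salary}, and the tuple for Talulah is left out because no tuple of the first answerset matches it.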

Because in this work a submitted query executes only once we have ensured that it does not impact the performance of the database processing it, we cannot know its answerset in advance. To overcome this issue, input queries and what-if-queries are transformed to documents. Each document contains a bag of predicates of the form “A = v”, where A is an attribute of the corresponding query and v a value that A takes in the relation in which it is defined. The similarity between two queries is then estimated as the similarity between their corresponding documents.

5.1 Estimating Similarity Scores

Conceptually, the problem we address here is that of taking a user input query whose execution is considered costly because of its computing resource needs and mapping it to a similar query with lower resource requirements.

Given a vocabulary of m words, many approaches (e.g., [32, 27, 12, 15]) translate a query answerset to an m-dimensional vector where the ith component is the term frequency (TF) of the ith vocabulary word in the answerset. TF measures the importance of a word in the vocabulary to a given answerset. IDF (inverse document frequency) is also used to suggest that commonly occurring words convey less information about users’ needs than rarely occurring words, and thus should be weighted less. In this section, we use tf-idf, a technique that combines TF and IDF to assign weights to words of collections.

Since we translate each query into a document of words, a query too has a vector representation. To represent a query q as a vector, we create an array B. Each component k of B is a set of atomic predicates ck = (Ak = vk), vk ∈ V(Ak, R), R being a relation in q where Ak is defined. If Ak ∈ attr(q), we maintain as many atomic predicates “Ak = v”, v ∈ V(Ak, R), in the kth entry of B as we have distinct values for Ak. However, if (Ak θ vk) ∈ L(q), θ ∈ {<, ≤, =, ≠, >, ≥}, we maintain in the kth entry of B only atomic predicates “Ak = u”, with u ∈ V(Ak, R) and u θ vk (see below). Each predicate ck in B has an occurrence count F(ck) = size({q ∈ Qlog : ck ∈ L(q) ∨ Ak ∈ attr(q)}).

SQL predicates - SQL queries may have multiple types of predicates. This makes their representation as vectors non-trivial. Below, we explain how to represent queries with different predicates as vectors.

• Case 1. One variable inequality constraint. If ck = (Ak θ c), with c a constant and θ ∈ {<, ≤, >, ≥, ≠}, we set the kth entry of B to {(Ak = u) : u ∈ V(Ak, R), u θ c}.

• Case 2. Real valued linear constraint: ∑_{k=1}^{n} xk Ak θ u, where the Ak are numeric attributes, u is a constant, the xk are real values, and θ ∈ {<, ≤, >, ≥, =, ≠}. If a predicate ck ∈ L(q) is of this form, we use the simplex algorithm [5] to solve the satisfiability of ck under the constraints {(Ak = vik), vik ∈ V(Ak, R), ∀ 1 ≤ k ≤ n}. The result is a set Q of one variable equality or inequality constraints. Every inequality constraint in Q is further transformed to an equality constraint by using case 1.

• Case 3. Regular expression constraints. Ak ∈ L(r) or Ak ∉ L(r), where Ak is an attribute, and L(r) is the language generated by regular expression r. This is equivalent to predicates of the type “Ak IN Qk” or “¬(Ak IN Qk)”, where Qk is a set of values for categorical attributes or a range [lb, ub] for numeric attributes. If there exists ck ∈ L(q) with ck = (Ak IN Qk), we create a set of new atomic predicates where each predicate is of the form (Ak = u), u ∈ Qk ∧ u ∈ V(Ak, R), R being the relation where Ak is defined. If ck = ¬(Ak IN Qk), then we do the same except that the newly created predicates are of the form (Ak = u), u ∈ V(Ak, R) \ Qk.
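Cases 1 and 3 both reduce a predicate to a set of equality predicates over the values the attribute actually takes. A minimal sketch of that expansion is given below; the helper names, the operator spellings, and the sample domains are assumptions, and Case 2 is left out because it requires a linear programming solver.

```python
import operator

# Comparison operators allowed in a one-variable constraint (Case 1).
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}

def expand_comparison(attr, op, const, domain):
    """Case 1: rewrite (attr op const) as equalities over the values in V(attr, R)."""
    return {(attr, u) for u in domain if OPS[op](u, const)}

def expand_in(attr, values, domain, negated=False):
    """Case 3: rewrite (attr IN values), or its negation, as equalities."""
    if negated:
        return {(attr, u) for u in domain if u not in values}
    return {(attr, u) for u in domain if u in values}

# Hypothetical domains V(year, R) and V(conference, R).
years = {2008, 2009, 2010, 2011}
conferences = {"VLDB", "ACSAC"}
early = expand_comparison("year", "<=", 2009, years)
listed = expand_in("conference", {"VLDB", "SIGMOD"}, conferences)
unlisted = expand_in("conference", {"VLDB", "SIGMOD"}, conferences, negated=True)
```

Values that appear in the predicate but not in the relation, such as "SIGMOD" here, produce no atomic predicate, which matches the requirement u ∈ V(Ak, R).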

Select co author, title, subject, conference, year From Publications Where year = 2010 AND authorID = “2XDF”

Example 5.2. - Query to Document Mapping. Table 2 shows the document corresponding to the query given above. Each keyword in this document has an occurrence count. For instance, for the attribute Conference, we see “ACSAC” with an occurrence count of 5, suggesting there are 5 queries over Publications which, when transformed to documents, bind Conference to “ACSAC”. We also have the same occurrence count for “VLDB” and, for Co-author, for the value “Zhang”.

When computing the similarity score of two queries, q and q′, we evaluate the similarity score between components of B and B′, the respective arrays of q and q′. For that, we use the tf-idf weighting scheme. If ck = (Ak θ vk), ck ∈ B, is an atomic predicate, we have tf-idf(ck) = TF(ck) ∗ IDF(ck) with IDF(ck) = log(|Qlog|/(1 + F(ck))) and TF(ck) = F(ck)/(max{F(ci) : ci ∈ B}). For any pair of atomic predicates (ck, c′k), let S(ck, c′k) be defined as:

\[
S(c_k, c'_k) =
\begin{cases}
\text{tf-idf}(c_k) & \text{if } c_k = c'_k \\
0 & \text{otherwise}
\end{cases}
\]

We refer to the quantities S(ck, c′k) as similarity coefficients. The similarity score between q and q′ is thus (1):

\[
Sim(q, q') = \frac{1}{n} \sum_{k=1}^{n} S(L_k, L'_k),
\quad \text{where } S(L_k, L'_k) = \max_{c_{ik} \in L_k,\; c'_{jk} \in L'_k} S(c_{ik}, c'_{jk})
\text{ and } n = \max(size(B), size(B')).
\tag{1}
\]

Attribute    Value: Occurrence count
co-author    Zhang: 5, Zhu: 7, Yong: 4, ...
title        data-mining: 3, access control: 5, ...
subject      integration: 5, security: 2, ...
conference   ACSAC: 5, VLDB: 5, ...
year         2010: 5
authorID     2XDF: 5

Table 2: Query to document mapping.
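To make equation (1) concrete, the sketch below scores two query documents, each stored as a mapping from an attribute to its set of (attribute, value) predicates. The function names, the use of the natural logarithm, and the occurrence counts and log size in the example are assumptions chosen for illustration; the source does not state the logarithm base.

```python
import math

def tf_idf(pred, counts, max_count, log_size):
    """tf-idf(c_k) = TF(c_k) * IDF(c_k), with TF and IDF as defined in the text."""
    tf = counts[pred] / max_count
    idf = math.log(log_size / (1 + counts[pred]))
    return tf * idf

def similarity(B, B2, counts, log_size):
    """Sim(q, q') = (1/n) * sum_k S(L_k, L'_k), with n = max(size(B), size(B'))."""
    n = max(len(B), len(B2))
    if n == 0:
        return 0.0
    # Largest occurrence count among the predicates of B (the TF denominator).
    max_count = max((counts[c] for preds in B.values() for c in preds), default=1)
    total = 0.0
    for attr, preds in B.items():
        others = B2.get(attr, set())
        # S(c, c') is tf-idf(c) when the two predicates are equal, 0 otherwise.
        pairs = [tf_idf(c, counts, max_count, log_size) if c == c2 else 0.0
                 for c in preds for c2 in others]
        if pairs:
            total += max(pairs)
    return total / n

# Hypothetical occurrence counts F(c_k) over a log of 10 queries.
counts = {("year", 2010): 5, ("conference", "ACSAC"): 5, ("conference", "VLDB"): 5}
B = {"year": {("year", 2010)},
     "conference": {("conference", "ACSAC"), ("conference", "VLDB")}}
B2 = {"year": {("year", 2010)}}
score = similarity(B, B2, counts, log_size=10)
```

Only the year entry contributes in this example, since the second document has no predicate on conference; the sum is then divided by n = 2.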

5.2 Algorithms and Theorems

We provide below the functions we use in Algorithm 2 for dynamically adjusting users’ needs for information stored in some systems to the available resources.

• Let result(q) be a function that returns, for any query q, the answerset of q.

• We denote by req(q) = {(A1, v1), (A2, v2), ..., (An, vn)} an XACML request corresponding to the translation of a declarative sql query q to XACML.

• Let isPermit(req(q)) be a boolean function that returns true if the decision rendered by TREM for the evaluation of an XACML request req(q) is Permit, and false if it is Deny or Not Applicable.

• We denote by (Q, ≤cost) a totally ordered set sorted in non-decreasing order of the cost of the queries q ∈ Q.

• Let pos(q) denote the position of a query q in a queue E of queries that must be executed by the database processing q. Each qe ∈ E submitted before q has its cost already estimated and is waiting for resources.

Although the similarity score between the what-if-query QAM proposes and a requester’s query may be high (i.e., close to 1), we would like to remind the reader that the output of the what-if-query, if it executes, may not exactly match the requester’s needs. However, because human users often make decisions or derive the information they need based on partial (incomplete) data, we consider this approach to be a good solution towards minimizing risks of severe performance degradation of a database system, and thereby guaranteeing its continuous availability.

Algorithm 2: Dynamic Adjustment of Users’ need-to-know to Systems Resources Availability.

Input : ε, [αl′, βl′[, τ(t), q, Qlog, (cost(q) = cost(p))ᵃ, req(q) ≡ {(A1, v1), (A2, v2), ..., (An, vn)}.
Output: tuples or wait-time.

 1  if (isPermit(req(q)) = true) AND ((cost(q) + τ(t)) ≤ αl′ < βl′) then
 2      return result(q);
 3  else if (isPermit(req(q)) = false) then
 4      return deny;
 5  else if ((cost(q) + τ(t)) ≥ αl′) then
 6      Q1 ← {qi ∈ Qlog : cost(qi) ≤ cost(q)};
 7      create array B following the approach above;
 8      foreach qi ∈ Q1 do
 9          create array Bi following the approach above;
10          Sim(B, Bi) ← (1/n) ∑_{k=1}^{n} S(Lk, L′k);
11          Sim(q, qi) ← Sim(B, Bi);
12          if Sim(q, qi) ≥ ε then
13              Q2 ← Q2 ∪ {qi};
14      if Q2 ≠ ∅ then
15          H ← (Q2, ≤cost);
16          while Q2 ≠ ∅ do
17              q″ ← dequeue(Q2);
18              if (isPermit(req(q″)) = false) then
19                  H ← H − {q″};
20          wait-time ← 0;
21          if H = ∅ then
22              while pos(qe) ≤ pos(q), qe ∈ E do
23                  wait-time ← wait-time + cost(qe);
24              return wait-time;
25          i ← 1; q′ ← H[i];
26          while i ≤ size(H) do
27              if (cost(H[i]) + τ(t) ≤ αl′) AND (Sim(q, H[i]) ≥ Sim(q, q′)) then
28                  q′ ← H[i];
29              i ← i + 1;
30          return result(q′);
31      else
32          execute the instructions at lines 21 - 24

ᵃ cost(p) is estimated in Algorithm 1.
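The control flow of Algorithm 2 can be sketched in a few lines. This is an illustrative approximation, not the authors' implementation: the function name and parameters are hypothetical, `load` stands in for τ(t), `alpha` for the lower bound αl′, `is_permit` for the TREM decision, and `queue_costs` for the costs of the queries queued at or before q. Unlike line 25, the sketch returns a wait-time when no candidate fits within the available resources.

```python
def adjust(q, cost, load, alpha, epsilon, log, sim, is_permit, queue_costs):
    """Sketch of Algorithm 2; returns ("result", query), ("deny", None) or ("wait", t)."""
    if not is_permit(q):                        # lines 3-4
        return ("deny", None)
    if cost[q] + load <= alpha:                 # lines 1-2: q fits in the risk band
        return ("result", q)
    # Lines 6-19: cheaper, similar enough, authorized logged queries, sorted by cost.
    H = sorted((qi for qi in log
                if cost[qi] <= cost[q] and sim(q, qi) >= epsilon and is_permit(qi)),
               key=lambda qi: cost[qi])
    best = None
    for qi in H:                                # lines 25-29
        fits = cost[qi] + load <= alpha
        if fits and (best is None or sim(q, qi) >= sim(q, best)):
            best = qi
    if best is None:                            # lines 21-24: nothing can run now
        return ("wait", sum(queue_costs))
    return ("result", best)                     # line 30

# Hypothetical costs, similarity scores, and permissions.
cost = {"q": 100, "a": 20, "b": 40, "c": 10}
scores = {"a": 0.8, "b": 0.9, "c": 0.3}
sim = lambda q, qi: scores[qi]
permit = lambda query: True
outcome = adjust("q", cost, 50, 100, 0.5, ["a", "b", "c"], sim, permit, [30, 15])
```

In this example the input query exceeds the bound, query c is dropped for being below the threshold ε, and b is proposed because it is the most similar candidate that still fits.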

Theorem 5.1. Algorithm 2 is secure and safe, i.e., if the query of a requester or the proposed what-if-query (q′) is allowed to execute, the requester can access the answerset of the query that executed, and the processing of that query will not cause performance issues.

Proof. - Theorem 5.1.

Security – Assume to the contrary that Algorithm 2 is not secure, i.e., q or a what-if-query q′ will execute and render an answerset to a non-authorized requester u. If q executes, then the condition (isPermit(req(q), TREM) = true) AND ((cost(q) + τ(t)) ≤ αl′ < βl′) in line 1 is true, which entails that u is authorized to access the answerset of q. However, if q does not execute, for q′ to execute the conditions (isPermit(req(q), TREM) = true) and (cost(q) + τ(t) ≥ αl′) must hold. The execution of q′ entails that (Q1 ≠ ∅) (see line 6), (Q2 ≠ ∅) (see line 13), (Sim(q, q′) = max_{q′′∈H} Sim(q, q′′)) (see lines 25-29), and (H ≠ ∅) (see line 25). Since lines 18-19 remove from H every query for which isPermit returns false, this means u is authorized to access the answerset of q′ when q′ renders tuples, thus contradicting our initial assumption.

Safety – Assume to the contrary that Algorithm 2 is not safe, i.e., the execution of q or a what-if-query q′ will cause performance degradation. Then, either (Q1 = ∅) or (Q1 ≠ ∅). (1.1). If (Q1 = ∅), then the execution of q is postponed by wait-time, provided that TREM has evaluated positively (Permit) the XACML request req(q) corresponding to the translation of q. After wait-time units of time, more resources will be available because all the queries {qe ∈ E : pos(qe) ≤ pos(q)} will have already executed. Thus, q will not cause performance issues. (1.2). If (Q1 ≠ ∅), then either (Q2 = ∅) or (Q2 ≠ ∅). (1.2.1). If (Q2 = ∅), given that q is an authorized query, it will be allowed to execute after wait-time units of time (see line 23). (1.2.2). If however (Q2 ≠ ∅), then either (H = ∅) or (H ≠ ∅). If (H = ∅), the same reasoning holds as in (1.2.1). If however (H ≠ ∅), there must exist a query q′ such that (cost(H[i]) + τ(t) ≤ αl′) because (q′ ∈ Q2), (cost(q′) ≤ cost(q)), and (cost(q) + τ(t) ≥ αl′). As in line 30, q′ will then execute. Thus, in all cases, we ensure the safe execution of q or a what-if-query q′, which contradicts our initial assumption. Hence, Algorithm 2 is safe.

Theorem 5.2. The what-if-query (q′) proposed by Algorithm 2 is the optimal authorized replacement query for the query (q) submitted by the user u; that is, the what-if-query and the input query have a similarity score at least equal to the defined threshold (ε) and the what-if-query is safe.

Proof. - Theorem 5.2. Assume to the contrary that q′ is not the optimal authorized replacement query. Then, there must exist a query qi ∈ H, qi ≠ q′, more similar to q, i.e., Sim(q, qi) > Sim(q, q′), and such that cost(qi) + τ(t) < αl′. If such a qi exists then, instead of q′ being executed in line 30, qi will be. This contradicts our initial assumption. Thus, q′ is the most similar authorized replacement query for the query (q) submitted by the user u.

6 Experimental Results

In this section, we first present our experiment settings. Due to the page limit, we provide only an evaluation of our similarity measurement approach based on this data.

6.1 Experiment Settings

We downloaded some open source database dumps. A sample of around 7000 database tuples and 72 SQL queries was drawn from these dumps. In the following, we denote our sample of queries by Q. We then created complete database schemas from these dumps in MySQL. We defined a lookup table and populated it with values obtained by querying the databases generated before. For each query in our query-set, we generated an expanded document representation, as described in section 4.2. We used our lookup table to transform each query to a document of keywords where each keyword is of the form “column name=value”. For each keyword, we counted its occurrences in the set of all the documents we created and stored the value in another table.

6.2 Evaluation of Similarity Measurement

To evaluate our similarity measurement approach, we have implemented an application that computes, by means of the TF-IDF weighting technique, the weight of a keyword given a set of vocabularies. The security and performance evaluation can be performed after the candidate (i.e., similar) queries are identified. Here, the vocabularies are the documents we generated previously, and the keywords are the elements composing them. We applied equation (1) and grouped the queries into buckets based on their similarity scores, each bucket consisting of a set of queries similar to a given query we refer to as the primary query of the bucket. In each bucket, we select the query sharing the highest similarity with the primary query of the bucket. These queries are the most similar queries when considering the similarity scores of other queries of the same bucket with the primary query of the bucket.

Method I (M1) – For a better understanding, if n denotes the size of our sample of SQL queries, we create n groups G1, ..., Gn, each group being indexed by a SQL query qi, 1 ≤ i ≤ n, and of the form Gi = {qj ∈ Q : qi ≠ qj}. Because of the symmetric property of the similarity score, S(qi, qj) = S(qj, qi), each time we compute the similarity score of qi and qj, S(qi, qj), we deduce the similarity score of qj and qi, S(qj, qi). Then, for each group Gi, 1 ≤ i ≤ n, we consider the query q satisfying the condition max_{qik, qij ∈ Gi, qij ≠ qik} S(qik, qij) as the query that is the most similar to the primary query of Gi, qi. If we let

\[
f|_{M1} : Q \to \mathbb{R}^{+}, \qquad q_i \mapsto f|_{M1}(q_i) = \max_{q_j \in Q \setminus \{q_i\}} S(q_i, q_j)
\]

cases may however exist where at least two queries in {qj ∈ Q \ {qi} : S(qi, qj) = f|M1(qi)} have the same similarity score with qi.
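The grouping in M1 amounts to computing, for each primary query qi, the best score f|M1(qi) over the other queries, while reusing each symmetric score once. A small sketch under these assumptions follows; the function name and the sample scores are hypothetical, and at least two queries are assumed.

```python
def most_similar(queries, score):
    """For each q_i, return (f(q_i), tied candidates) over Q minus {q_i}."""
    cache = {}

    def s(a, b):
        key = frozenset((a, b))           # S(q_i, q_j) = S(q_j, q_i): compute once
        if key not in cache:
            cache[key] = score(a, b)
        return cache[key]

    result = {}
    for qi in queries:
        others = [qj for qj in queries if qj != qi]
        best = max(s(qi, qj) for qj in others)
        result[qi] = (best, [qj for qj in others if s(qi, qj) == best])
    return result

# Hypothetical symmetric similarity scores for three queries.
pair_scores = {frozenset(("q1", "q2")): 0.9,
               frozenset(("q1", "q3")): 0.4,
               frozenset(("q2", "q3")): 0.9}
ranking = most_similar(["q1", "q2", "q3"], lambda a, b: pair_scores[frozenset((a, b))])
```

Query q2 illustrates the tie case mentioned above: both q1 and q3 reach its maximum score, so two candidates are returned.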

Method II (M2) – To assess the performance of our similarity method, we run our sample of queries against the databases that contain the relational tables these queries target. The answerset of the primary query of each of the buckets we defined before is compared with the resultsets of the queries composing the bucket. We applied TF-IDF and used the same approach as before to compare the answersets of two queries. The fundamental difference between M1 and M2 is that in M2, the queries are run first, and the similarity scores are evaluated based on the outputs of the queries rather than on documents. TF-IDF has proven effective in comparing results of search engine queries as well as tuples, as witnessed by the number of research proposals using it and the performance they have obtained [32, 27, 12, 6]. Interestingly, as shown in section 6.3 below, most of the queries have a candidate query (based this time on their answersets, i.e., M2) equivalent to the candidate queries we have computed with our approach. As in M1, if we let

\[
f|_{M2} : Q \to \mathbb{R}^{+}, \qquad q_i \mapsto f|_{M2}(q_i) = \max_{q_j \in Q \setminus \{q_i\}} S(q_i, q_j)
\]

cases may exist where the size of {qj ∈ Q \ {qi} : S(qi, qj) = f|M2(qi)} is greater than 1; that is, there exist at least two queries in Q that have the same similarity score with qi. In such cases, if we can find a candidate query output by both methods, then we select that query; otherwise we choose one at random. In the following, for each query qi ∈ Q, 1 ≤ i ≤ n, we refer to the quantities f|M1(qi) and f|M2(qi) respectively as y1,i and y2,i.

6.3 Error Rate Estimation

To compare the results we obtained from M1 and M2, we consider the following error rate estimation techniques.

6.3.1 Root Mean Square Error (RMSE)

The RMSE is used in statistics to measure the differences between the values predicted by a model or an estimator and the values actually observed. To construct the RMS error, we have to determine the residuals. Residuals are here the squared differences between the similarity scores evaluated with M1 and those computed with M2. Note that in this experiment, we do not consider the similarity scores calculated using M2 as the expected similarity scores our method (M1) should predict. Rather, we are interested in knowing whether the candidate queries we come up with match those output by M2 and how disproportionate these similarity scores are. The RMSE estimate is defined as follows:

\[
\varepsilon|_{M1,M2} = \sqrt{\frac{\sum_{k=1}^{n} (y_{1,k} - y_{2,k})^{2}}{n}}
\]

where y1,k denotes the similarity score output by M1 for the query qk, 1 ≤ k ≤ n, and y2,k the similarity score output by M2 for the same query. Using this formula, we then estimate the precision of our model with respect to M2.
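The RMSE estimate translates directly into code. The sketch below assumes the scores of both methods are given as two equally long lists ordered by query; the function name and the sample scores are illustrative and are not the experimental data.

```python
import math

def rmse(y1, y2):
    """Root mean square error between the M1 scores y1 and the M2 scores y2."""
    if len(y1) != len(y2) or not y1:
        raise ValueError("both score lists must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)) / len(y1))

# Hypothetical similarity scores for four queries.
m1_scores = [0.80, 0.60, 0.50, 0.90]
m2_scores = [0.82, 0.60, 0.47, 0.90]
error = rmse(m1_scores, m2_scores)
```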

6.3.2 RMSE Precision Rate Formulation

In information retrieval, precision is defined as the percentage of correct positive predictions, i.e., the ratio of true positives of a prediction model over the size of the dataset of the model. In this section, we use this definition to formulate our RMSE precision rate.

\[
g|_{M1} : Q \to Q, \qquad q_i \mapsto g|_{M1}(q_i) = \arg\max_{q_j \in Q \setminus \{q_i\}} f|_{M1}(q_i)
\]

\[
g|_{M2} : Q \to Q, \qquad q_i \mapsto g|_{M2}(q_i) = \arg\max_{q_j \in Q \setminus \{q_i\}} f|_{M2}(q_i)
\]

Let the above functions g|M1 and g|M2 be two functions that output, for a given query qi ∈ Q, the most similar query qj ∈ Q, j ≠ i, when the methods M1 and M2 respectively are applied. We define the precision of our model with respect to M2 as follows:

\[
P = \frac{size(\mathcal{P})}{n}, \quad \text{where } \mathcal{P} = \{ q_k \in Q : g|_{M1}(q_k) = g|_{M2}(q_k),\ |y_{1,k} - y_{2,k}| \le 2\,\varepsilon|_{M1,M2} \}
\]

This formula can be interpreted as the proportion of queries for which M1 and M2 output the same candidate query with the difference of their similarity scores bounded by twice the value of the root mean square error estimate ε|M1,M2. We apply the RMSE to minimize prediction errors of both methods. With this formula, we obtain a precision rate P ≈ 78%, the RMSE ε|M1,M2 being equal to 0.0142. The RMSE value is calculated using all queries in our sample data. The graph in Figure 3 gives an idea of the distribution of the scores of the queries for which M1 and M2 output the same candidate queries.
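The precision rate P can be computed by counting the queries on which both methods agree within the 2ε band. The following sketch assumes the candidates and scores of M1 and M2 are given as parallel lists; all names and sample values are hypothetical and do not reproduce the reported 78%.

```python
def precision(cand1, cand2, y1, y2, eps):
    """Share of queries where M1 and M2 pick the same candidate and |y1 - y2| <= 2*eps."""
    n = len(cand1)
    if n == 0:
        raise ValueError("at least one query is required")
    hits = sum(1 for c1, c2, a, b in zip(cand1, cand2, y1, y2)
               if c1 == c2 and abs(a - b) <= 2 * eps)
    return hits / n

# Hypothetical candidates and scores for four primary queries.
m1_candidates = ["q2", "q3", "q1", "q2"]
m2_candidates = ["q2", "q1", "q1", "q2"]
m1_scores = [0.80, 0.60, 0.50, 0.90]
m2_scores = [0.81, 0.60, 0.50, 0.70]
rate = precision(m1_candidates, m2_candidates, m1_scores, m2_scores, eps=0.02)
```

In this example the second query fails because the candidates differ, and the fourth fails because the scores are further apart than 2ε, so half of the queries count toward P.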

Figure 3: Cumulative Distribution Function.

Given that many research proposals [32, 27, 12, 6] using TF-IDF have proven its relevance in assessing similarity between queries, we consider that our approach provides good similarity measures between the original queries and the replacement queries. We strongly believe however that by incorporating database functional dependency constraints such as foreign keys, or through an effective mapping and matching of the database table schemas, we could noticeably increase the accuracy of our model.

7 Related Work

We divide this related work into three sections. The first section is related to risk assessment methodologies, the second to database systems workloads and resource usage estimation, and the third to query similarity.

Risk Assessment Methodologies and Our Framework - Risk Adaptable Access Controls (RAdAc), in which access decisions are based on changing risks and needs, have been under investigation over the last few years [18, 30, 24, 8, 22]. Most of these models can be categorized into two groups: one that focuses on dynamic adjustment of permissions in security sensitive domains like healthcare, and another that addresses risks arising from insider abuses.

Wang et al. [24], for instance, propose a model for quantifying patient privacy risks in health information systems. Cheng et al. [21] introduce methods to assess the risk associated with information access, and give a case study for implementing a multilevel security access control model (Fuzzy MLS). Ni et al. [22, 23] further build on the Fuzzy MLS example by introducing risk estimations and fuzzy inferences for risk-based access control models.

Models like [20, 14, 4, 16] address the problem of insider attacks. An insider attack occurs when a person legitimately authorized to perform certain actions in an organization decides to abuse the trust and harm the organization. To tackle such threats, most of these models require defining risk thresholds to limit the activation of certain roles or permissions by certain users. Baracaldo et al. [20], for instance, propose to integrate RBAC with risk and trust to react to suspicious changes in users’ behavior. In their model, users are associated with trust scores and permissions with risk thresholds. In reality, these values are difficult to estimate as they are based on inference of security breaches. Crampton et al. formulate in [14] a set of requirements to counter insider threats. However, we believe that supplementing XACML policies with risk constraints is easier than using declarative programming to support richer types of policy decisions. Medjri et al. [16] assign confidence levels to roles and clearance levels to users. Using these values, they then evaluate the risk associated with letting a user activate a role. The level of trust of users is however static, and the paper does not present any experiment either.

Our risk aware access control framework differs however from these models. First, we define risk as the likelihood that a database system becomes unavailable due to intensive demands, together with the ensuing consequences should the risk occur. Risks occur in our case because of a lack of understanding of query optimization techniques by users requesting data, or because little about database systems performance metrics is easily interpretable by common users. Moreover, we do not set risk values to limit role or permission activation. We instead evaluate them based on queries’ resource requirements, available resources, and risk bands defined for the systems. Finally, rather than preventing users from accessing some information, as in [20, 14, 4, 16, 17], when granting the access is deemed risky, we provide the users with replacement queries that have minimal impact on systems performance.

More related to our work are the MITRE report [19] and Savith et al. [30]. MITRE proposes three guiding principles that we followed in our work: measuring risks, establishing acceptable levels of risks, and ensuring that information is accessible at acceptable risk levels. Savith et al. [30] give descriptive definitions of the core concepts of RAdAc models. Two important concepts we adopted in formalizing our framework are operational need and situational factors. In this paper, operational need is expressed as the reason for a user request, i.e., the need of users to query databases. The situational factors refer to the conditions under which an access decision is made, i.e., the performance of the databases prior to the execution of the SQL queries submitted by users. Savith et al.’s model does not however propose enforcement architectures and implementations. Our framework, in contrast, provides an implementation of these concepts and extends their formalism by quantifying risks associated with users’ queries and by providing replacement queries for risky ones.

Database Systems Workload Estimation - In recent years, different techniques ranging from the use of benchmark simulators to analytical methods have been proposed to evaluate database systems workloads [10, 3]. Most of these models, though, require thorough knowledge of the inner workings of the systems being modeled. In other words, many of these models are highly dependent on the systems whose workloads they model. Thus, any change in the algorithms or parameters, OS settings, or hardware of these systems requires modifying the models [3]. In our paper, we used database profiling tools instead. Database profiling tools trace and collect log data related to the execution of queries as well as the behavior of the databases processing the queries. With these tools, we do not need to fully understand the hardware-software designs of databases, and we can easily derive the amount of computing resources database queries consume.

Query Similarity - To find replacement queries for risky ones, we used information retrieval techniques. Many proposals ([32, 27, 12, 15]) on query similarity have applied the same approach. Nambiar et al. [32] propose a domain independent approach for answering imprecise queries in a database. To compare queries, they used the Jaccard similarity metric over the answersets of the queries. Given that in our model a query is allowed to execute only if it is found not risky for the system that processes it, we cannot know its answerset in advance; hence we cannot apply their approach to our problem. Agrawal et al. [27] propose an automated ranking of database query results based on TF-IDF. They transform queries to bags of keywords before computing their similarity scores. Bordino et al. [12] introduce a new method based on the query flow graph, a graph-based representation of a query log, for estimating similarities between queries. Guo et al. [15] propose an intent-aware query similarity technique. Their approach is interesting in that it incorporates search intents when computing similarity scores. However, it is more applicable to search engine or web queries than to sql queries.

More related to our replacement query mechanism is query rewriting. Query rewriting is an approach whereby information relevant to an input query is captured and used to answer the query in a more optimized way. Mainstream databases use different methods to rewrite queries. Many of these methods rely on materialized views. Essentially, they check the existence of the same joins between some materialized views and input queries and ensure that the materialized views have sufficient data to answer the queries. The issues with these methods are manifold: 1) as long as a replacement query has a lower cost, it will execute regardless of the performance issues it may cause; 2) query rewriting is more effective when used against data that is not constantly changing, or when pre-computed results can be joined to answer queries, which is not always the case; and 3) query rewriting cost analysis is tightly coupled with query execution. In fact, with query rewriting, a replacement query executes as soon as it is found. Our approach, however, proposes to decouple these two, thereby allowing the execution of the query to be postponed if it is deemed costly. Moreover, our model allows a replacement query to execute only when it ensures the query causes minimal performance impact.

8 Conclusion

We investigated in this paper the problem of enterprise policy management from the perspective of computing resource availability. Our aim has been to minimize performance issues associated with highly resource-consuming sql queries. The approach we proposed decouples the process of analyzing the resource requirements of a sql query from its execution, thereby allowing a database system’s performance to be adjusted to the query’s resource needs. To this end, we used XACML to prevent the execution of highly resource-consuming queries when a database system cannot handle them, and we applied information retrieval techniques to propose replacement queries (what-if-queries) to users submitting such queries. The what-if-queries are secure, safe, and present a high degree of similarity (depending on a threshold) to the queries submitted by the users.

We plan to extend our work along several directions. The first extension is to consider more complex sql queries such as nested queries. We will also include search intent when computing query similarity scores by leveraging the groups or roles to which the users submitting queries belong. As catalog information needs to be refreshed periodically and is thus not always accurate, we will instead use advanced statistical machine learning methods for predicting query resource consumption and database systems workload. Finally, we will design more advanced algorithms that account for database buffer size and contents so that we can predict more accurately the amount of disk I/O a sql query really requires.

References

[1] Enterprise features. http://enterprisefeatures.com/2011/12/biggest-factors-slowing-the-adoption-of-cloud-computing-in-2012.

[2] A. Badia B. Cao. Sql query optimization throughnested relational algebra. ACM Transactions onDatabase Systems (TDOS), August 2007.

[3] S. H. Balakrishnan C. Curino, Evan P. C. Jones.Workload-aware database monitoring and consolida-tion. SIGMOD, Athens, Greece, June 12-16 2011.

[4] K. Levitt C. Y. Chung, M. Gertz. Demids: A misusedetection system for database systems. In Proceedingsof the Integrity and Internal Control in InformationSystem, 1999.

[5] G. B. Dantzig. Linear programming and extensions.Princeton University Press, 1963.

[6] M. Donald. Generalized inverse document frequency.CIKM, October 26-30 2008.

[7] X. Li E. Bertino E. Celikel, M. Kantarcioglu. A riskmanagement approach to rbac. Risk and DecisionAnalysis, November 2009.

[8] Bhavani Thuraisingham Dilsad Cavus Ebru Celikel,Murat Kantarcioglu. A model for risk adaptive accesscontrol in rbac employed distributed environments.Technical report, December 2006.

[9] E. V. Herreweghen G. Karjoth, A. Schade. Imple-menting acl-based policies in xacml. Annual Com-puter Security Applications Conference (ACSAC),2008.

[10] A. S. Ganapathi. Predicting and optimizing systemutilization and performance via statistical machinelearning. Technical report, December 17 2009.

[11] J. Widom H. G. Molina, J. Ullman. Database sys-tems: The complete book (2nd edition). Pearson Hall,2009.

[12] C. Castillo A. Gionis I. Bordino, D. Donato. Querysimilarity by projecting the query-flow graph. SI-GIR’10, July 19-23 2010.

[13] E. Bertino J. Byun and N. Li. Purpose based accesscontrol of complex data for privacy protection. SAC-MAT, pages 102–110, 2005.

[14] M. Huth J. Crampton. Towards an access-controlframework for countering insider threats. InsiderThreats in Cyber Security, Advances in InformationSecurity, Volume 49. Springer, 2010.

[15] G. Xu X. Zhu J. Guo, X. Cheng. Intent-aware query similarity. CIKM’11, October 2011.

[16] M. Mejri L. Logrippo J. Ma, K. Adi. Risk analysis in access control systems. Eighth Annual International Conference on Privacy Security and Trust (PST), August 2010.

[17] J. Crampton L. Chen. Risk-aware role-based access control. In 7th International Workshop on Security and Trust Management, 2011.

[18] Robert McGraw. Risk adaptive access control (radac). privilege management workshop. NIST, 2009.

[19] MITRE. Horizontal integration: Broader access models for realizing information dominance. JSR-04-32, December 2004. http://www.fas.org/irp/agency/dod/jason/classpol.pdf (Accessed 26-05-2012).

[20] J. Joshi N. Baracaldo. A trust-and-risk aware rbac framework: Tackling insider threat. SACMAT, June 2012.

[21] Claudia Keser Paul A. Karger Grant M. Wagner Angela Schuett Reninger Pau-Chen Cheng, Pankaj Rohatgi. Fuzzy multi-level security: An experiment on quantified risk-adaptive access control. IEEE Symposium on Security and Privacy (SP’07), 2007.

[22] J. Lobo Q. Ni, E. Bertino. Risk-based access control systems built on fuzzy inferences. ASIACCS, April 13-16 2010.

[23] J. Lobo C. Brodie Q. Ni, E. Bertino. Privacy-aware role-based access control. ACM Transactions on Information and Security Volume 13 Issue 3, July 2010.

[24] H. Jin Q. Wang. Quantified risk-adaptive access control for patient privacy protection in health information systems. ASIACCS’11.

[25] E. Bertino A. Ghafoor J. Joshi R. Bhatti, B. Shafiq. X-gtrbac admin: A decentralized administration model for enterprise-wide access control. ACM Transactions on Information Security 8, pages 388–423, 2005.

[26] F. D. Ferraiolo R. Sandhu and D. R. Kuhn. The nist model for role-based access control: Towards a unified standard. Information Technology Lab, NIST, September 2000.

[27] G. Das A. Gionis S. Agrawal, S. Chaudhuri. Automated ranking of database query results. Proceedings of the 2003 CIDR Conference.

[28] C. A. Gunter H. Okhravi S. Jahid, I. Hoque. Myabdac: Compiling xacml policies for attribute-based database access control. CODASPY’11, San Antonio, Texas, USA, February 21-23 2011.

[29] H. Okhravi C. A. Gunter S. Jahid, I. Hoque. Enhancing database access control with xacml policy. ACM CCS 2009, Poster, 2009.

[30] Venkata Bhamidipati Savith Kandala, Ravi Sandhu. An attribute based framework for risk-adaptive access control models. ARES’11 Proceedings of the 2011 Sixth International Conference on Availability, Reliability and Security, pages 236–241, 2011.

[31] OASIS Standard. extensible access control markup language (xacml) version 2.0. Technical report, February 2005. http://docs.oasis-open.org/xacml/2.0 (Accessed 26-05-2012).

[32] S. Kambhampti U. Nambiar. Answering imprecise database queries: A novel approach. WIDM, New Orleans, Louisiana, USA, 2003.

Appendix

Column        Description
DB NAME       The database containing the table.
TABLE NAME    Name of the table.
NUM BLOCKS    Number of blocks containing tuples of the table.
NUM ROWS      Number of rows in the table.

Table 3: all tables

Column        Description
INDEX NAME    Name associated with the index.
TABLE NAME    Name of the table with the index.
COL NAME      Column associated with the index.
COL POSIT     Position of the column in the index definition.

Table 4: all ind col

Column        Description
TABLE NAME    Name of the table.
CONS TYPE     Name associated with the constraint definition.
COL NAME      Column associated with the constraint.

Table 5: all tab col

Column        Description
DB SYST       Database system for which the risk bands are defined.
AVAI RISK     Risk band for which the system is considered available.
TOLE RISK     Risk band for which the system state is considered tolerable.
CRIT RISK     Risk band for which the system state is considered critical.

Table 6: all risk bands
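
As an illustration of how one row of Table 6 might be used, the sketch below maps a measured risk value to one of the three system states. The encoding is an assumption, not something the table specifies: each band is represented here by its inclusive upper bound, and the names RiskBands and system_state are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskBands:
    """One row of the all risk bands table.

    Assumption: each band is stored as its inclusive upper bound, with
    avai_risk <= tole_risk <= crit_risk. The actual encoding may differ.
    """
    db_syst: str
    avai_risk: float
    tole_risk: float
    crit_risk: float

    def __post_init__(self):
        if not (0 <= self.avai_risk <= self.tole_risk <= self.crit_risk):
            raise ValueError("risk bands must be non-negative and non-decreasing")


def system_state(bands: RiskBands, risk: float) -> str:
    """Map a measured risk value to the state named in Table 6."""
    if risk < 0:
        raise ValueError("risk must be non-negative")
    if risk <= bands.avai_risk:
        return "available"
    if risk <= bands.tole_risk:
        return "tolerable"
    # Anything above the tolerable band is treated as critical, including
    # values beyond crit_risk (a conservative, hypothetical choice).
    return "critical"


bands = RiskBands(db_syst="demo_dbms", avai_risk=0.4, tole_risk=0.7, crit_risk=1.0)
print(system_state(bands, 0.55))  # tolerable
```

Under this encoding, a query whose estimated load would push the system out of the available band is a natural candidate for replacement; the thresholds shown are placeholders.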

Column        Description
TABLE NAME    Name of the table.
COL NAME      A column of the table.
VALUE         Value taken by the column.
OCC COUNT     Occurrence count of the pair (column, value) in the table.

Table 7: all tab values
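
The occurrence counts of Table 7, combined with NUM ROWS from Table 3, are enough to estimate the selectivity of an equality predicate as OCC COUNT / NUM ROWS. The sketch below shows this standard estimate on an in-memory SQLite catalog. The underscored identifiers, the demo data, and the helper names build_demo_catalog and estimate_selectivity are assumptions made for illustration only.

```python
import sqlite3


def build_demo_catalog() -> sqlite3.Connection:
    """Create a tiny in-memory catalog with made-up statistics."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE all_tables (
            db_name TEXT, table_name TEXT, num_blocks INTEGER, num_rows INTEGER
        );
        CREATE TABLE all_tab_values (
            table_name TEXT, col_name TEXT, value TEXT, occ_count INTEGER
        );
        INSERT INTO all_tables VALUES ('sales', 'orders', 50, 1000);
        INSERT INTO all_tab_values VALUES ('orders', 'status', 'open', 250);
        INSERT INTO all_tab_values VALUES ('orders', 'status', 'closed', 750);
        """
    )
    return conn


def estimate_selectivity(conn, table, column, value) -> float:
    """Estimate the fraction of rows of `table` satisfying column = value."""
    row = conn.execute(
        """
        SELECT v.occ_count, t.num_rows
        FROM all_tab_values AS v
        JOIN all_tables AS t ON t.table_name = v.table_name
        WHERE v.table_name = ? AND v.col_name = ? AND v.value = ?
        """,
        (table, column, value),
    ).fetchone()
    # Assumption: a value absent from the catalog matches no rows.
    if row is None or row[1] == 0:
        return 0.0
    occ_count, num_rows = row
    return occ_count / num_rows


conn = build_demo_catalog()
print(estimate_selectivity(conn, "orders", "status", "open"))  # 0.25
```

Multiplying the selectivity by NUM ROWS gives the expected result size (250 rows in the demo data). As with any catalog-based estimate, it is only as accurate as the most recent refresh of the statistics.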

Column        Description
DB NAME       Name of database.
DB SYST       DB system hosting the database.
QUERY         Query executed in the database.
COST          Execution cost of the query.

Table 8: all db queries

Ousmane Amadou Dia defended his Ph.D. in Computer Science at the University of South Carolina in the Department of Computer Science and Engineering. His research focused on access control modeling and policy management. Dr. Dia did his undergraduate studies in Applied Mathematics and finished his Master of Science at the “Universite Catholique de Louvain” in Belgium, where he was involved in the European SIMILAR network of Excellence Project focusing on multimodal interfaces for Mammograms Computer Assisted Diagnosis. Ousmane Amadou Dia was awarded the Fulbright Scholarship for his Ph.D. and now works as a research scientist at Amazon.

Csilla Farkas is an Associate Professor in the Department of Computer Science and Engineering and Director of the Information Security Laboratory at the University of South Carolina (USC). Dr. Farkas received her Ph.D. from George Mason University, Virginia. She led the efforts to develop a nationally recognized information assurance graduate program and to achieve the designation of USC as a National Center of Academic Excellence in Information Assurance Education. Dr. Farkas’ research interests include information security, data inference problems, economic and legal analysis of cyber crime, and security for Web-based applications, such as Service Oriented Architecture and policy issues in cloud computing. She is a recipient of a National Science Foundation Career award. The topic of her award is “Semantic Web: Interoperation vs. Security – A New Paradigm of Confidentiality Threats.” Dr. Farkas actively publishes and participates in peer-reviewed international conferences and journals.
