
1874 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 12, NO. 8, AUGUST 2017

Privacy-Preserving Smart Semantic Search Based on Conceptual Graphs Over Encrypted Outsourced Data

Zhangjie Fu, Member, IEEE, Fengxiao Huang, Kui Ren, Fellow, IEEE, Jian Weng, and Cong Wang, Member, IEEE

Abstract— Searchable encryption is an important research area in cloud computing. However, most existing efficient and reliable ciphertext search schemes are based on keywords or shallow semantic parsing, which are not smart enough to capture users' search intentions. In this paper, we therefore propose a content-aware search scheme that makes semantic search smarter. First, we introduce conceptual graphs (CGs) as a knowledge representation tool. We then present two schemes based on CGs, PRSCG and PRSCG-TF, for different scenarios. To enable numerical calculation, we transform the original CGs into a modified linear form and map them to numerical vectors. Second, we employ multi-keyword ranked search over encrypted cloud data as the basis, against two threat models, and develop PRSCG and PRSCG-TF to solve the problem of privacy-preserving smart semantic search based on CGs. Finally, we evaluate our schemes on a real-world data set, the CNN data set, and analyze their privacy and efficiency in detail. The experimental results show that our proposed schemes are efficient.

Index Terms— Searchable encryption, cloud computing, smart semantic search, conceptual graphs.

I. INTRODUCTION

NOWADAYS, a large number of data owners choose to store their data in the cloud, which helps them attain on-demand, high-quality applications and services while reducing the cost of data management and storage facilities. Due to the scalability and high efficiency of cloud servers, public data access becomes much more scalable, low-cost and stable, especially for small enterprises. However, data owners are concerned about the privacy

Manuscript received November 19, 2016; revised January 16, 2017; accepted March 16, 2017. Date of publication April 12, 2017; date of current version May 8, 2017. This work was supported in part by the National Science Foundation of China under Grant 61232016, Grant 61373133, Grant U1536206, and Grant U1405254, in part by the PAPD Fund, and in part by the Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology and Qing Lan Project. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Giuseppe Persiano. (Corresponding author: Zhangjie Fu.)

Z. Fu and F. Huang are with the School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China (e-mail: [email protected]; [email protected]).

K. Ren is with the Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, NY 14260-2500 USA (e-mail: [email protected]).

J. Weng is with the School of Information Technology, Jinan University, Guangzhou 510632, China (e-mail: [email protected]).

C. Wang is with the Department of Computer Science, City University of Hong Kong 999077, Hong Kong (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2017.2692728

of their data, and existing schemes prefer to use data encryption to solve the problem of information leakage. How to realize an efficient searchable encryption scheme is therefore a challenging and essential problem.

Many recent schemes are keyword-based, supporting single-keyword or multi-keyword search. These schemes allow data users to retrieve files of interest and return related documents in encrypted form. However, due to the inherent limitations of keywords as document eigenvectors, the returned results are often imprecise and fail to satisfy users' intentions; keywords as a document feature carry relatively little semantic information. Some existing schemes attempt to exploit the relationships among keywords to expand the retrieval results. However, when keywords are extracted from documents, the relationships among them are not taken into account, which limits these schemes. Exploring a new knowledge representation that carries more semantic information than keywords, and using it to realize searchable encryption, is therefore a challenging and essential task.

To solve this problem, we introduce the Conceptual Graph (CG) as a knowledge representation tool in this paper. A CG is a structure for knowledge representation based on first-order logic. CGs are natural, simple and fine-grained semantic representations for depicting texts. A CG is a finite, connected, bipartite graph [1]; we give a detailed description in Section 3. However, it is difficult to perform matching on CGs in encrypted form. One representative scheme [17] attempts to solve this problem on plaintext, but its process of calculating similarity scores relies on the server and an external knowledge base. It is unlikely to be realized in encrypted scenarios, because the cloud server should learn nothing about the concrete content of a retrieval. Reference [18] proposes a scheme in the encrypted setting, but it performs CG homeomorphism before encrypting; the scheme therefore cannot operate on the encrypted data and does not realize searchable encryption in the true sense. Although our previous study [26] is able to perform search on CGs, it is an initial and intuitive scheme that is costly and inefficient.

In this paper, we propose two practical schemes to solve this challenging problem: CG matching in encrypted form. As a knowledge representation, the CG is a mature manifestation of semantics, and since its introduction it has been widely applied in many scenarios.

1556-6013 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


That is why we choose the CG among the various forms of knowledge representation. To enable numerical calculation, we transform the original CG into its linear form. However, the CG's linear form has drawbacks that affect the effective utilization of data, so we make some modifications to the initial form; we introduce the modified linear form in detail in Section 3. When extracting CGs from the original documents, we have two options. One is to transform all sentences in a document into CGs, namely PRSCG-TF. The other is to select the most important sentence and transform it into a CG, namely PRSCG. In PRSCG-TF, we perform segmentation on the CGs and obtain their linear forms. We can view every part of the linear form of a CG as a whole; that is, we can separate a CG into several units and regard them as "keywords" carrying enough semantic information. We count the TF values of these specific parts and store them in a file, rank them in descending order of TF value, and select k "keywords" as representatives of the original document. Finally, we generate a dictionary and construct a numerical vector to replace the document according to the vector space model. In PRSCG, we use a state-of-the-art text summarization technique to generate a summary of each document. We then perform segmentation on the CGs and obtain their linear forms. Finally, we generate a dictionary and construct a binary vector to replace the original document. In both schemes, we pre-process sentences to eliminate redundant information; we employ Tregex as a tool for simplifying sentences to increase the efficiency of our schemes. We also employ MRSE as the basic encryption framework and realize the encrypted schemes PRSCG and PRSCG-TF. We summarize our contributions briefly as follows:

1) We use conceptual graphs as a knowledge representation to substitute for traditional keywords and solve the problem of privacy-preserving smart semantic search based on conceptual graphs over encrypted outsourced data. Compared with [26], our solution is more secure and efficient.

2) We propose a modified linear form of conceptual graphs that makes quantitative calculation on conceptual graphs possible. In a sense, this facilitates fuzzy retrieval on conceptual graphs at the semantic level.

3) We present two practical schemes, approaching from different aspects, to solve the problem of privacy-preserving smart semantic search based on conceptual graphs over encrypted outsourced data. Both are secure and efficient, but each has its own focus.

We organize this paper in the following manner. In Section 2, we review recent research on encrypted search schemes. Section 3 describes the system model, design goals and related technologies. In Section 4, we present our schemes in detail. In Section 5, we analyze the security of our schemes. Sections 6 and 7 cover the performance analysis and the conclusion.

II. RELATED WORK

With the development of searchable encryption, many existing schemes provide richer retrieval functions based on text search.

References [5]–[8] mainly discuss single-keyword search in encrypted form. Song et al. [5] were the first to put forward a symmetric searchable encryption scheme. To search over encrypted documents with a sequential scan, the scheme employs a two-layered encryption structure. It is the first practical scheme to define the problem of searching on encrypted data, and it had a positive influence on later research. But its weakness is also distinct: the scheme only accepts output of a fixed length suited to its two-layer encryption method, and it fails on variable queries as well as compressed data. References [6] and [7] improve the security definition and the search efficiency. An effective searchable symmetric encryption scheme is proposed in [8] to realize ranked keyword search; it uses an inverted index to store keywords and their corresponding files. However, all the schemes mentioned above support only single-keyword search.
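As a plaintext illustration of the inverted-index idea used in [8] (the encrypted scheme stores this structure in protected form; the sample documents below are made up):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "cloud data search", 2: "encrypted cloud storage"}
index = build_inverted_index(docs)
# index["cloud"] == {1, 2}; index["search"] == {1}
```

Ranked search then only needs to score the documents listed under the queried keywords, rather than scanning the whole collection.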

References [9]–[12] mainly focus on multi-keyword search in encrypted form. In particular, [9] is the first to solve the problem of privacy-preserving multi-keyword ranked search over encrypted data in cloud computing against two threat models; the scheme is called MRSE. It employs the vector space model and the secure inner product to realize efficient search. Reference [10] generates its search index with term frequency and the vector space model and uses cosine similarity to compare the source and the query, which helps achieve more accurate search results. Reference [11] provides an additional approach to returning ranked results through the frequency of keyword access. Reference [12] introduces parallel computing to increase the effectiveness of multi-keyword search.
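The vector-space-model comparison used in [10] can be sketched in plaintext; the vocabulary and texts here are illustrative, not from the paper:

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = ["cloud", "search", "data", "privacy"]
doc = tf_vector("cloud data search cloud", vocab)
query = tf_vector("cloud search", vocab)
sim = cosine(doc, query)  # ≈ 0.866
```

In the encrypted setting, the same inner-product computation is performed on encrypted vectors so the server learns the score but not the underlying terms.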

Based on [9], many extended schemes [13], [14], [21], [22], [25], [30], [31], [33] have been proposed. Reference [13] presents a secure multi-keyword ranked search scheme employing the VSM and the widely used TF-IDF model, which also supports dynamic update operations such as deletion and insertion of documents. Reference [14] maps all semantically close words, or different variants of a word, to the same stem using the Porter algorithm. References [25] and [35]–[39] propose schemes for secure outsourced search over encrypted data. Reference [21] solves the problem of personalized multi-keyword ranked search over encrypted data by incorporating a user interest model. Reference [22] proposes an innovative semantic search scheme based on the concept hierarchy and the semantic relationships between concepts in the encrypted data sets. References [32] and [40]–[50] are also important studies of security problems in the cloud. However, the keyword still carries little semantic information.

Because keywords carry little semantic information, they support only restricted semantic search. In our previous study [26], we proposed an initial and intuitive scheme to solve the problem of semantic search on encrypted cloud data based on conceptual graphs. However, [26] is less efficient than keyword search. In this paper, we attempt to make encrypted search based on CGs as fast as keyword search.


Fig. 1. The architecture of smart search over encrypted cloud data.

III. PROBLEM FORMULATION

A. The System and Threat Models

We summarize our system model, shown in Fig. 1, which includes three entities: the data owner, the data users and the cloud server.

1) Data Owner: The data owner owns n data files F = {F1, F2, . . . , Fn} and encrypts the source documents before they are outsourced to the cloud server. He must also guarantee that these documents can be searched effectively. In this paper, the data owner encrypts the document set and generates searchable indexes before outsourcing the data to the cloud server. In addition, pre-processing work such as the construction of CGs, the transformation of CGs into vectors and the update operation on documents should be handled ahead of time. The data owner should also securely distribute the key information for trapdoor generation and provide authorization to authorized data users.

2) Data Users: Data users must obtain a warrant from the data owner to access the documents. A data user submits a simple sentence to generate a trapdoor and retrieves from the cloud server the documents that meet his requirements.

3) Cloud Server: The cloud server receives the storage request from the data owner and stores the encrypted documents and searchable indexes. When a data user sends a trapdoor to the cloud server, the server computes relevance scores and returns the top-k related documents to the data user. The cloud server is also responsible for executing commands to update documents and searchable indexes.

In this paper, we adopt the threat model proposed in [8]. We assume the cloud server is "honest-but-curious", as in most previous work [8]–[14], [21], [22], [25], [26], [30]: the cloud server complies with the designated protocol honestly and correctly, but it is also curious to infer and analyze data, whether documents or indexes. On this basis, [9] proposes two threat models as follows:

4) Known Ciphertext Model: In this model, we assume that the cloud server only knows the encrypted data set and the searchable index outsourced by the data owner.

5) Known Background Model: Compared with the known ciphertext model, in the known background model the cloud server can acquire more information, which may include the correlations between given search requests (trapdoors) and statistical information related to the data set. As an instance of a possible attack in this case, the cloud server could use known trapdoor information combined with document/keyword frequency [23] to deduce or identify certain keywords in the query.

B. Design Goal

To achieve efficient utilization of the outsourced encrypted data for semantic ranked search under the threat models above, our design must meet the following security and performance requirements.

• Semantic search based on conceptual graphs. The schemes should allow similarity computation between the query CG and the source CGs and provide ranked results that most effectively match users' interests.

• Privacy preservation. To protect users' privacy from the cloud server, the schemes must satisfy the following privacy requirements.

1) Data privacy. We should protect the documents from the cloud server while keeping them available to users. Traditional symmetric cryptography is a good option: we can encrypt the documents with a symmetric key before outsourcing.

2) Index privacy. We must guarantee that the cloud server cannot infer the relationship between the documents and the "keywords" from the index.

3) "Keyword" privacy. In this paper, we view every part of a CG as a "keyword", and these "keywords" are related to each other. It is therefore important to protect a user's query CG (sentence), and we should generate secure trapdoors to avoid leaking information about the query.

4) Trapdoor unlinkability. When searching the source documents, users inherently handle trapdoors and indexes that are exposed to the cloud server. To protect privacy, we should randomize the trapdoors and make them non-deterministic, so that identical queries correspond to different trapdoors and the cloud cannot infer the relationship between them.

C. Notations and Preliminaries

• F – the plaintext document collection, denoted as a collection of n documents F = {F1, F2, . . . , Fn}. Each document Fi in the collection can be considered as a CG.

• n – the total number of documents in the data set F.

• C – the encrypted data set stored in the cloud server, denoted as C = {C1, C2, . . . , Cn}.

• W – the dictionary; we view every part of a CG as a "keyword", and the dictionary includes m "keywords", denoted as W = {W1, W2, . . . , Wm}.

• m – the total number of "keywords" in W.

• I – the searchable index, which consists of the vectors corresponding to the source CGs, denoted as I = {I1, I2, . . . , In}.


• D – the document vector corresponding to the source document, based on its CG.

• T – the trapdoor, which consists of the vector corresponding to the query CG.

1) Text Summarization: Today, making computers correctly understand people's intentions is an urgent problem, and solving it would also save costs in many fields. Text summarization (TS), as a part of comprehending people's intentions, is of remarkable significance. We simplify the definition and describe TS briefly as mining the theme of a document. TS techniques mainly comprise two approaches: extractive and abstractive. In this paper, considering accuracy, we prefer the extractive method. Its main idea is to compute a score for every sentence and choose the sentences with the highest scores as the main idea of the document. Extractive summarization guarantees that the selected sentences actually appear in the document, which is why we consider it to have high precision. The review [2] surveys the achievements of single-document summarization and compares the sentence scoring methods of the last ten years. We adopt the most effective of those methods to process our source documents.

Here we briefly describe the algorithm. The modified TF/IDF algorithm can be used to compute the scores of sentences in the preprocessing stage. It draws an analogy between a sentence and a document, and between a document and the document set. The TF/IDF score of a word w can be calculated as follows:

TF/IDF(w) = DN × (log(1 + tf) / log(df))   (1)

where DN is the number of documents, tf is the term frequency and df is the document frequency. Once we obtain all sentences' TF/IDF scores, we sort these values and select the sentence with the highest score.
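A minimal sketch of Eq. (1) applied to sentence scoring. Summing the per-word scores over a sentence, and skipping words with df ≤ 1 (where log(df) vanishes), are our assumptions; the sample sentences are illustrative:

```python
import math

def sentence_score(sentence, doc_sentences):
    """Score a sentence with the modified TF/IDF of Eq. (1), treating
    each sentence as a 'document' and the document as the 'set'."""
    DN = len(doc_sentences)
    words = sentence.lower().split()
    score = 0.0
    for w in set(words):
        tf = words.count(w)  # term frequency within this sentence
        df = sum(1 for s in doc_sentences if w in s.lower().split())
        if df > 1:  # log(df) is zero (or undefined) for df <= 1
            score += DN * (math.log(1 + tf) / math.log(df))
    return score

sentences = [
    "cloud computing enables searchable encryption",
    "searchable encryption protects cloud data",
    "graphs represent knowledge",
]
# The topic sentence is the one with the highest score.
best = max(sentences, key=lambda s: sentence_score(s, sentences))
```

Words that appear in only one sentence contribute nothing under this guard, so the highest-scoring sentence is the one sharing the most vocabulary with the rest of the document.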

2) Tregex: Tregex is a powerful syntactic tree search language for identifying syntactic elements, which is important in many areas of natural language processing (NLP), including the sentence simplification we focus on in this paper. Tregex can be used to distinguish the various relations between tree nodes, and it is more powerful than regular expressions for sentence classification because it can capture more of the syntactic ways in which people write a certain type of sentence with cue phrases [3]. After identifying the key nodes with Tregex, we use the Stanford Parser's API to manipulate the trees, which supports operations such as inserting and deleting children and changing labels on tree nodes [4].

3) Conceptual Graph: The conceptual graph as a knowledge representation model was proposed by Sowa [15]. It is defined as a graph representation for logic, based on the semantic networks of artificial intelligence (AI) and on existential graphs [16]. There are usually two kinds of nodes: concepts (rectangles) and conceptual relations (ovals) (Fig. 2). A concept is connected to another concept by a conceptual relation, and each conceptual relation must be connected to some concepts. We call a CG's conceptual relations semantic roles; there are approximately 30 relations, including 6 tenses. In this paper, we choose approximately 24 relations, disregarding tenses. Person and City in the CG of Fig. 2 are concept types and represent the categories of the concepts. They can be null, marked as "*".

Fig. 2. CG display form for "John is going to Boston by bus."

Fig. 3. CG display in its linear form for "John is going to Boston by bus."

Fig. 4. CG display in modified linear form for "John is going to Boston by bus."

4) Linear Form: Graph matching is more complex than tree matching, and a conceptual graph, being a graph, is inconvenient for direct use. Sowa therefore proposed the linear form of the CG, which solves the problem of representing the multi-dimensional relationships in a CG. It selects the node with the maximum number of edges as the root of the tree; then, according to the original CG, it forms the subtrees. The CG of Fig. 2 can be divided into three subtrees (Fig. 3), which simplify the primitive form for ease of understanding. In this paper, we adopt this representation with some modification: we replace concept values with concept types, leaving out the concept value where both exist at the same time (Fig. 4). Through this modification, we can simplify retrieval.
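For the CG of Fig. 2, the modified linear form can be sketched as a set of triples; the tuple layout and the string-joining convention below are illustrative assumptions, not the paper's exact encoding:

```python
# Triples from the CG of "John is going to Boston by bus" (Fig. 2),
# after the modification that keeps concept types in place of values.
cg_triples = [
    ("Go", "Agent", "Person"),
    ("Go", "Dest", "City"),
    ("Go", "Inst", "Bus"),
]

# Each triple is treated as one composite "keyword" for indexing
# (hyphen-joining is a hypothetical serialization).
keywords = ["-".join(t) for t in cg_triples]
```

Because the triples share the root concept "Go", the tree structure of the linear form is recoverable from the flat list.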

IV. THE PROPOSED SCHEMES

In this section, we first define our framework based on MRSE and then give a detailed description of the index construction, which is the foundation of our schemes. We then analyze the effects of the different index constructions in PRSCG and PRSCG-TF. Finally, we construct our schemes PRSCG (PRSCG-1 and PRSCG-2) and PRSCG-TF (PRSCG-TF-1 and PRSCG-TF-2) against the two threat models.

From the brief introduction to CGs in Section 3, we can see that the CG is an excellent form of knowledge representation: it considers the specific content of query sentences and discovers the hierarchy and the combinations within sentential semantic structures. However, it makes retrieval a challenging task. Suppose we searched on the CG itself without any transformation; we would have to take the complex contents of the CG into consideration, including its structure and concept values, on ciphertext. Once concept values are involved, we have no option but to introduce external knowledge to help the cloud server understand the query, which would violate the original intention of reducing information leakage and guaranteeing security.

A. Basic Framework

In this section, we give a rough description without specifying concrete operations. PRSCG and PRSCG-TF comprise four algorithms as follows:

Setup: In this algorithm, the data owner generates a symmetric key.

BuildIndex (F, SK): In this algorithm, the data owner generates an encrypted searchable index based on the CG set. The data owner then outsources the encrypted data and the index to the cloud server; here the encrypted data includes the CGs, the CGs' linear forms and the source documents.

Trapdoor (Q, SK): In this algorithm, the data user generates a trapdoor from the input sentence, which is also encrypted before being sent to the cloud server.

Query (T, k, I): In this final algorithm, the cloud server receives the request, i.e., the trapdoor; it executes the similarity calculation and returns the top-k results.

However, as with MRSE, our paper does not include search control or access control.

B. Document Vectors Constructions-PRSCG

In this section, we give an overall description of index construction, from original documents to byte vectors, in PRSCG. In the pre-processing stage, we adopt the efficient "sentence scoring" measure from text summarization and use Tregex to acquire the most important and simplified topic sentence for every document. We construct our CGs from these simplified sentences [19]. In this way, we obtain a CG for each original document. To solve the problem of CG search in the encryption domain, we transform each CG into its linear form, with some modification, before constructing the index. This guarantees that we can perform numerical calculation on CGs. Although we have described the linear form in detail in Section 3, we briefly state its advantages from some other aspects in the following. Considering that "semantic roles", "concept types" and "concept values" are the typical parts of a CG, we take each semantic role together with the concept types and concept values on both sides as a whole, striving to preserve as much information of the original CG as possible. Thus, we can divide CGs into separate entities that can be viewed as triples with rich semantic information. We view these entities as "keywords". Moreover, we simplify these entities by combining concept types with concept values for the sake of effective and efficient retrieval, which also helps realize fuzzy search at the semantic level to some extent.

Once we obtain these specific "keywords", we can generate a binary vector for every document, as in traditional keyword search schemes with the vector space model. If the CG contains the "keyword", we set the corresponding binary bit to 1; otherwise, it is 0.

We illustrate this process with Figs. 2, 3 and 4 as an example. We acquire a CG from the source document, as shown in Fig. 2. Then we divide the CG into sub-parts that can be seen as triples, as shown in Figs. 3 and 4. The triples from Fig. 2 are (Go, Agent, Person), (Go, Dest, City) and (Go, Inst, Bus); each is regarded as a whole "keyword" that identifies the document contents.
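As a minimal sketch of this construction (the four-entry dictionary below is a toy assumption; the paper builds the dictionary from the whole CG set), the binary vector of a document can be produced as follows:

```python
# Illustrative sketch of the PRSCG binary document vector: each CG is divided
# into (concept, role, concept) triples that act as "keywords", and bit i of
# the vector is 1 iff the document's CG contains dictionary entry i.

def build_binary_vector(cg_triples, dictionary):
    """Return the 0/1 vector of a document whose CG yields cg_triples."""
    triple_set = set(cg_triples)
    return [1 if kw in triple_set else 0 for kw in dictionary]

# Toy dictionary of triple-"keywords" collected over the whole CG set.
dictionary = [
    ("Go", "Agent", "Person"),
    ("Go", "Dest", "City"),
    ("Go", "Inst", "Bus"),
    ("Eat", "Agent", "Person"),
]

# The three triples extracted from the CG of Fig. 2.
doc_triples = [("Go", "Agent", "Person"), ("Go", "Dest", "City"),
               ("Go", "Inst", "Bus")]
vector = build_binary_vector(doc_triples, dictionary)  # -> [1, 1, 1, 0]
```

The document covers the first three dictionary triples but not the fourth, so the resulting vector is (1, 1, 1, 0).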

C. Document Vectors Constructions-PRSCG-TF

In this section, we give an overall description of index construction, from original documents to numerical vectors, in PRSCG-TF, which differs slightly from PRSCG. In the pre-processing stage, we use a CG to represent every sentence in the original documents, which means one document will be divided into many sentences, i.e., CGs. We replace these CGs with their modified linear forms. Through a series of transformations, we employ weighted TF/IDF combined with the vector space model to construct our index. We count the TF and IDF values of these "keywords", where TF (term frequency) is the number of times a given "keyword" occurs in one file (to measure the importance of the "keyword" in that file) and IDF (inverse document frequency) is obtained by dividing the number of files in the whole document collection by the number of files containing the "keyword" (to measure the overall importance of the "keyword" in the whole collection). We rank the TF values of the "keywords" in the original documents and select a variable number k of "keywords" to represent each document. We use the vector space model to embed the TF values at the corresponding positions and generate a numerical vector for each original document.

For example, suppose we select 2 "keywords", the former and the latter in Fig. 4; we then place their TF values at the corresponding locations in the numerical vector as follows: (0.32, 0.12, 0, 0, ......, 0).
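The selection-and-embedding step can be sketched as below. The dictionary, the occurrence counts, and the normalization of TF by total count are illustrative assumptions, not the paper's exact recipe:

```python
# Hypothetical sketch of the PRSCG-TF document vector: the TF values of the
# k most frequent triple-"keywords" are embedded at their dictionary
# positions; all other positions stay 0.

def build_tf_vector(keyword_counts, dictionary, k):
    """Keep the k most frequent keywords; place their (normalized) TF values
    at the matching dictionary slots."""
    top_k = sorted(keyword_counts, key=keyword_counts.get, reverse=True)[:k]
    total = sum(keyword_counts.values())
    vec = [0.0] * len(dictionary)
    for kw in top_k:
        vec[dictionary.index(kw)] = keyword_counts[kw] / total
    return vec

dictionary = ["kw_a", "kw_b", "kw_c", "kw_d"]
counts = {"kw_a": 8, "kw_b": 3, "kw_c": 1}   # occurrences in one document
vector = build_tf_vector(counts, dictionary, k=2)
# kw_a and kw_b survive; kw_c is dropped, kw_d never occurred.
```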

D. Differences in Index Construction Between PRSCG and PRSCG-TF

In this section, we focus on the differences in index construction between PRSCG and PRSCG-TF. In PRSCG, in the pre-processing stage, we employ state-of-the-art methods in text summarization, so our scheme is affected by the summarization results; the precision of the scheme is limited if we cannot extract a suitable summary. However, when we store the CG in its linear form and construct the indexes, we can retain the original relationships among "keywords" in the byte vectors. Our index vector for one document contains a complete CG, which improves the precision of search. In PRSCG-TF, in the pre-processing stage, we view each sentence in a document as an entity and transform it into a CG. We employ the TF/IDF weighting rule to deal with the specific "keywords" after the division operation, so the "keywords" extracted from documents have no relationship with each other and may be meaningless in isolation. However, this scheme highlights the more important "keywords" through their weights. Moreover, the dictionary of PRSCG will be a little smaller than that of PRSCG-TF for the same number of documents.

We can conclude that the PRSCG scheme keeps the order of the original "keywords", which is more semantic for documents that have a clear theme, because such documents need more centralized "keywords". However, when a document expresses an ambiguous idea, more centralized "keywords" will omit related documents because of missing information. In a word, it is better to consider the contents of documents when data users perform search.

E. PRSCG-1 Scheme

We employ MRSE [9] as our basic framework and propose our PRSCG-1 scheme to achieve a security guarantee in the known ciphertext model. The concrete algorithms are presented as follows.

Following MRSE, we extend the dimensions of the source and query vectors with different methods to make it more difficult for the cloud server to learn the relationship between received trapdoors.

Setup: The data owner randomly generates the secret key SK, which includes one (n + 2)-bit vector S and two (n + 2) × (n + 2) invertible matrices {M1, M2}; that is, M1, M2 ∈ R^(m×m) and S ∈ {0, 1}^m, where m is fixed as n + 2.

BuildIndex (F, SK): The data owner splits the vector D, acquired from the CG, into two random vectors {D1, D2} for the index I. Before this, D is extended to (n + 2) dimensions as (D, hi, 1), where hi is a random number. The splitting follows exactly these principles: according to the vector S, if S[i] = 0, then D1[i] and D2[i] are set equal to D[i]; otherwise D1[i] and D2[i] are set as random numbers with D1[i] + D2[i] = D[i]. At last, the encrypted subindex Ii = {M1^T D1, M2^T D2} is built for each encrypted document Ci.

Trapdoor (Q, SK): After processing the query, we obtain one vector Q whose structure is the same as D. The dimension-extension and splitting procedures applied to Q differ slightly from those for D, as follows: first, Q is changed into (rQ, r, t), where r and t are random numbers and r cannot be 0. Then Q is divided into two vectors {Q1, Q2}: if S[i] = 0, Q1[i] and Q2[i] are set as random numbers with Q1[i] + Q2[i] = Q[i]; otherwise Q1[i] and Q2[i] are set equal to Q[i], which is the opposite of the rule for D. Finally, the trapdoor is T = {M1^{-1} Q1, M2^{-1} Q2}.

Query (T, k, I): According to formula (2), the relevance score is computed by the cloud server. The data users can provide k to control the number of returned results. Without loss of generality, we assume the random number r > 0.

Ii · T = {M1^T D1, M2^T D2} · {M1^{-1} Q1, M2^{-1} Q2} = r(D · Q + hi) + t   (2)

The final relevance score is mixed with random numbers, which prevents the cloud server from obtaining the scores directly. To some extent, this enhances security without loss of effectiveness.
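The splitting and matrix-encryption steps above can be sketched with toy values. The unit-triangular matrices below are hand-picked so that their inverses are exact; a real deployment would use random invertible matrices over much larger dimensions, so this is only a correctness sketch of the secure inner product, not the scheme itself:

```python
import random

# Toy sketch of the MRSE-style secure inner product used by PRSCG-1:
# index = {M1^T D1, M2^T D2}, trapdoor = {M1^-1 Q1, M2^-1 Q2}, and the
# server's score collapses to r(D . Q + h) + t without revealing D or Q.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

n = 2                                  # number of "keywords"
D = [1.0, 0.0]                         # plaintext data vector
Q = [1.0, 1.0]                         # plaintext query vector
h, r, t = 0.7, 3.0, 0.4                # random numbers (r > 0)

D_ext = D + [h, 1.0]                   # (D, h, 1), dimension n + 2
Q_ext = [r * q for q in Q] + [r, t]    # (rQ, r, t)

S = [0, 1, 0, 1]                       # secret (n + 2)-bit splitting vector
D1, D2, Q1, Q2 = [], [], [], []
for i in range(n + 2):
    if S[i] == 0:                      # index copied, query split
        D1.append(D_ext[i]); D2.append(D_ext[i])
        s = random.random()
        Q1.append(s); Q2.append(Q_ext[i] - s)
    else:                              # index split, query copied
        s = random.random()
        D1.append(s); D2.append(D_ext[i] - s)
        Q1.append(Q_ext[i]); Q2.append(Q_ext[i])

# Unit upper-triangular matrices with exact hand-written inverses.
M1  = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
M1i = [[1, -1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
M2  = [[1, 0, 0, 0], [0, 1, 0, 2], [0, 0, 1, 0], [0, 0, 0, 1]]
M2i = [[1, 0, 0, 0], [0, 1, 0, -2], [0, 0, 1, 0], [0, 0, 0, 1]]

M1T = [[M1[j][i] for j in range(4)] for i in range(4)]   # transpose
M2T = [[M2[j][i] for j in range(4)] for i in range(4)]

index    = (matvec(M1T, D1), matvec(M2T, D2))            # {M1^T D1, M2^T D2}
trapdoor = (matvec(M1i, Q1), matvec(M2i, Q2))            # {M1^-1 Q1, M2^-1 Q2}

score = dot(index[0], trapdoor[0]) + dot(index[1], trapdoor[1])
expected = r * (dot(D, Q) + h) + t
```

Since (M1^T D1) · (M1^{-1} Q1) = D1 · Q1 and likewise for M2, the sum collapses to D_ext · Q_ext = r(D · Q + h) + t, matching formula (2) regardless of the random splits.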

F. PRSCG-TF-1 Scheme

In PRSCG-1, because MRSE's ranking principle is "coordinate matching", we treat each "keyword" in the byte vectors as equally important. In PRSCG-TF-1, however, we replace "0" or "1" with TF values in the numerical vectors to highlight the importance of each "keyword". We ignore the internal relations among "keywords", which is different from PRSCG-1. We employ the TF/IDF weighting rule to indicate the importance of "keywords". Without loss of generality, we employ an example formula for the relevance score calculation that is widely applied in the literature [24], [34].

Score(Fi, Q) = (1 / |Fi|) · Σ (1 + ln tf) · ln(1 + m / df)   (3)

Here, tf represents the TF value of a certain "keyword" in one document, df represents the number of files containing a certain "keyword", m represents the total number of files in the document set, and |Fi| represents the Euclidean length of file Fi, computed as

|Fi| = sqrt( Σ_{j=1..n} (1 + ln tf)^2 )   (4)

functioning as the normalization factor.

In PRSCG-TF-1, the data users pad the corresponding locations of the query vector with the "keywords'" IDF values, which also differs from PRSCG-1. In the Query algorithm, the similarity score is calculated as follows:

Ii · T = r(Di · Q + hi) + t
       = r( Σ ((1 + ln tf) / |Fi|) · ln(1 + m / df) + hi ) + t
       = r(Score(Fi, Q) + hi) + t   (5)

We use the sum of TF × IDF values, i.e., the inner product of the subindex and the trapdoor, to measure how well a source document suits the query.
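Formula (3) and its normalization factor (4) can be transcribed directly; the term statistics below are toy values chosen only to exercise the computation:

```python
import math

# Direct transcription of the relevance formula (3) used by PRSCG-TF-1,
# with the normalization factor |Fi| from formula (4).

def euclidean_length(tf_counts):
    """|Fi| = sqrt(sum over keywords of (1 + ln tf)^2), formula (4)."""
    return math.sqrt(sum((1 + math.log(tf)) ** 2 for tf in tf_counts.values()))

def relevance(tf_counts, query_kws, df, m, doc_len):
    """Score(Fi, Q) = (1/|Fi|) * sum of (1 + ln tf) * ln(1 + m/df)."""
    score = 0.0
    for kw in query_kws:
        tf = tf_counts.get(kw, 0)
        if tf > 0:
            score += (1 + math.log(tf)) * math.log(1 + m / df[kw])
    return score / doc_len

tf_counts = {"kw_a": 4, "kw_b": 1}   # TF of each keyword in one document
df = {"kw_a": 10, "kw_b": 2}         # how many files contain each keyword
m = 1000                             # total number of files
doc_len = euclidean_length(tf_counts)
score = relevance(tf_counts, ["kw_a", "kw_b"], df, m, doc_len)
```

A rarer keyword (smaller df) contributes a larger ln(1 + m/df) factor, so it lifts the score more than a common one with the same TF.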

G. PRSCG-2 Scheme

If the cloud server learns more information about the outsourced data set, including background information such as the correlation between two given trapdoors, it may deduce specific information about the CGs. Following MRSE, we propose an enhanced scheme, PRSCG-2, against the known background model.

We extend the dimensions of the vectors from (n + 2) to (n + U + 1), where U is the number of dummy "keywords" inserted to further randomize the data and query vectors. The detailed PRSCG-2 scheme, based on the above operation, is presented as follows.

Setup: In this algorithm, the data owner still randomly generates the secret key SK as in PRSCG; the only difference lies in the extended dimensions of the vectors. The SK is still denoted as {S, M1, M2}, with M1, M2 ∈ R^(m×m) and S ∈ {0, 1}^m, where m is fixed as n + U + 1 and U is the number of dummy keywords.

BuildIndex (F, SK): The (n + j + 1)-th entry of the data vector D is set to a random number r^(j), where j ∈ [1, U], and the data owner applies the splitting procedure of PRSCG-1 to the extended vectors.

Trapdoor (Q, SK): The data user processes the query as above, randomly selects V out of the U dummy keywords, and sets the corresponding bits to 1.

Query (T, k, I): In this final algorithm, the cloud server computes the relevance score between the data vector D and the query Q. The result is equal to r(D · Q + Σ r^(v)) + t, where the sum is taken over the V selected dummy keywords.

With the above operation, it is harder for the cloud server to analyze useful information from indexes and trapdoors. The relationship between U and V, including their values, has been analyzed in [9]. We select appropriate U and V for the tradeoff between efficiency and privacy based on [9].
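The dummy-keyword perturbation can be sketched in isolation (dimensions and dummy magnitudes below are illustrative, and the secret-key splitting step of PRSCG-1 is omitted so the effect of the dummies is visible):

```python
import random

# Sketch of the PRSCG-2 dummy-keyword extension: U dummy slots with random
# values r^(j) are appended to each data vector, and the query sets V of
# them to 1, so every score is perturbed by a per-query sum of V random
# dummy values before the secret-key encryption is applied.

random.seed(7)                       # for reproducibility of the sketch
n, U, V = 3, 5, 2

D = [1.0, 0.0, 1.0]                  # plaintext data vector
Q = [1.0, 1.0, 0.0]                  # plaintext query vector

dummies = [random.uniform(-0.05, 0.05) for _ in range(U)]    # r^(j)
D_ext = D + dummies + [1.0]          # extend to n + U + 1 dimensions

chosen = random.sample(range(U), V)  # V of the U dummy slots set to 1
Q_ext = Q + [1.0 if j in chosen else 0.0 for j in range(U)] + [0.0]

score = sum(d * q for d, q in zip(D_ext, Q_ext))
noise = sum(dummies[j] for j in chosen)
# score equals D . Q plus the sum of the V selected dummy values.
```

Because a different subset of dummies is selected per trapdoor, repeated queries for the same sentence yield slightly different scores, which is what defeats the scale analysis attack discussed in [9].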

H. PRSCG-TF-2 Scheme

As mentioned in [9], PRSCG-TF-1 is still threatened by the scale analysis attack. We apply MRSE-2 to PRSCG-TF-1 to obtain PRSCG-TF-2, just as PRSCG-2 is derived from PRSCG-1. According to [9], involving more dummy keywords in the data vector improves the privacy of the scheme and can resist the scale analysis attack, so we omit the detailed description of PRSCG-TF-2 in this section.

V. SECURITY ANALYSIS

In this section, we present the security analysis of our schemes, which are based on MRSE. As is known, MRSE employs the secure inner product, adapted from the secure kNN algorithm [20]. That means that if the secure kNN algorithm is secure enough to prove MRSE secure, then we can conclude that our PRSCG and PRSCG-TF are secure. Based on this deduction, in the following paragraphs we demonstrate that our schemes are secure against the known ciphertext model according to [7], [25], [27], [28]. As for the known background model, we accept MRSE's analysis of the scale analysis attack and directly adopt its conclusion.

Definition 1 (Size Pattern): There exists a plaintext document set denoted as D that includes n documents. We define the size pattern of a query Q of length m as the collection {n, |Q1|, ..., |Qm|}, where |Q| represents the length of query Q.

Definition 2 (Access Pattern): There exists a plaintext document set denoted as D, from which a set of indexes named I is generated. We define the access pattern of a query Q, based on the size pattern, as the collection {I(Q1), ..., I(Qm)}, where I(Q) represents the set of indexes related to Q.

Definition 3 (Search Pattern): There exists a plaintext document set denoted as D that contains n documents. We define the search pattern of a query Q, based on the size pattern, as an n × q matrix M; the value of each entry in the matrix indicates whether the corresponding query in the size pattern exists in the corresponding document.

Definition 4 (Known Ciphertext Model Secure): Let Π = (Setup, BuildIndex, Trapdoor, Query) denote our schemes PRSCG and PRSCG-TF. The known-ciphertext-model security experiment SE proceeds as follows:

1) The adversary submits two document sets of the same size to a challenger.

2) The challenger runs Setup and generates the secret key {M, S}.

3) The challenger generates a random index by filling a random number "0" or "1" at a random position of the index vector. Then the challenger encrypts the random index and sends it to the adversary.

4) The adversary follows step 3 and generates a random index.

5) If the outputs of step 3 and step 4 are the same, the experiment result is assigned to 1.

If the probability of the above security experiment can be bounded by 0.5 + r for all probabilistic polynomial-time adversaries, where r is a negligible quantity, we can claim that our schemes PRSCG and PRSCG-TF are secure under the known ciphertext model.

Pr(SE = 1) ≤ 0.5 + r (6)

Proof: According to the information sources used by the adversary to differentiate authenticity, we can conclude the following formula:

Pr(SE = 1) = 0.5 + D(M, S) + D(I) + D(ED)   (7)

where D(M, S) represents the adversary's advantage in differentiating M and S from two random matrices and two random strings. D(I) and D(ED) are defined analogously to D(M, S), representing the index and the encrypted documents respectively.

As for D(M, S), the probability that the adversary can differentiate M and S from two random matrices and two random strings is low enough to be overlooked, because the matrix and key entries are selected randomly and uniformly. Generally, it is impossible for the adversary to distinguish authenticity.

The indexes in our schemes are encrypted by MRSE, which is based on the secure kNN algorithm. In [20], the secure kNN algorithm was proved secure under the known plaintext attack. Accordingly, we can draw the following conclusion: the probability that the adversary can infer the true indexes is low enough to be ignored.

The remaining question is to verify that D(ED) is a small probability that can be ignored. Existing encryption algorithms are semantically secure, which means our encrypted documents are secure under the known ciphertext model. So D(ED) is in theory small enough to be overlooked.

From the above analysis, we can view D(M, S) + D(I) + D(ED) as a negligible probability that can be ignored.


Accordingly, formula (6) holds, which proves that our schemes are secure under the known ciphertext model.

VI. PERFORMANCE ANALYSIS

In this section, we test the overall performance of our schemes PRSCG and PRSCG-TF on a real, publicly available data set, the CNN dataset [29] (www.cnn.com), which forms the outsourced dataset; the system is implemented in the C# language. The whole experiment runs on a Windows 7 server with a Core 2 CPU at 2.93 GHz. We choose about 1,000 files from the CNN dataset as the outsourced dataset. Our performance test includes efficiency and precision.

A. Precision

In this section, we analyze the precision of our schemes. In PRSCG, we first store all the information of a query in a CG. Owing to the features of CGs, a CG can retain not only the specific values of a query but also the relationships between those values. In theory, this semantic information is enough to differentiate queries, so the data user can find what he really wants. As for the CG's linear form, it is a simplified form with little information loss and has no effect on the CG's ability of semantic representation. Compared with simple keyword search, by storing additional information with little overhead, we can attain higher precision. In PRSCG-TF, although index construction differs slightly from PRSCG, both schemes essentially employ CGs as the representation tool, so they should have approximately the same precision. As shown in Section 4, our scheme inserts dummy "keywords" into each vector, which reduces precision. Thus, we introduce the balance strategy of precision and privacy from [9] into our scheme. Following the analysis in [9], we introduce its balance parameter to satisfy data users who have different requirements on precision and rank privacy.

B. Efficiency

In this section, we analyze the performance of our schemes in terms of time cost, based on experiments and theory. We mainly present the time cost of index construction and query.

1) Index Construction: For ease of searching over the conceptual graphs, we transfer these graphs into vectors, including generating a searchable subindex for each CG extracted from the data set. The process of index construction in PRSCG can be summarized as follows:

• extracting CGs from the document collection F;
• generating data vectors based on the CG set;
• encrypting the data vectors by MRSE.

In PRSCG-TF, the process is as follows:

• transferring each sentence into a CG and selecting the important parts of the CG as "keywords" for every document in the document collection F;
• generating data vectors based on the CG set;
• encrypting the data vectors by MRSE.

TABLE I

INDEX CONSTRUCTION OVERHEAD FOR APPROXIMATELY 1000 FILES

In PRSCG, extracting CGs can be viewed as pre-processing work, which is costly. However, it is a one-time operation whose cost is associated with the number of documents, which is acceptable for the system. Specifically, for each CG we generate a numerical vector whose dimensionality is decided by the number of specific parts, i.e., "keywords", in the CG set. The total time cost is also related to the number of documents. As presented in Section 4, PRSCG-1 includes the splitting process and two multiplications of an (n + 2) × (n + 2) matrix with an (n + 2)-dimensional data vector; these operations are all linked to the number of specific keywords in the CG set. So PRSCG-1's time complexity is O(mn^2) in theory. PRSCG-2 is similar to PRSCG-1, and its time complexity is O(m(n + U)^2), which is higher. In PRSCG-TF, except for the first step, the remaining steps are the same as in PRSCG. We omit the analysis of these steps and focus on the first step, which is costly but a one-time operation; its time cost is decided by the number of sentences in the whole document set.

Table 1 shows the result of a comparison with [26]. From the table, we can see that our schemes' time cost per file is larger than that of our previous study [26]. That is because, in our previous study, the dimension of the index vector is fixed at 24, obviously shorter than in PRSCG. In PRSCG and PRSCG-TF, the source data vectors have over a thousand dimensions (in PRSCG-TF we select part of the "keyword" set as our final set according to importance); apart from the same pre-processing of dealing with CGs, the extraction of data vectors is costly. Although the former has a lower time cost, its construction is quite complex and not suitable for practical use. Moreover, the index construction is executed only once across the whole process, so its time cost is acceptable. In Table 1, although based on the same data set, the actual numbers of files the server deals with differ between PRSCG and PRSCG-TF because their ways of index construction are distinct. In PRSCG, every document only needs to be handled once, no matter how many sentences it contains; in PRSCG-TF, however, the cost is decided by the number of sentences in every document, which must be dealt with one by one. Considering this difference, we present the time cost per CG (sentence) in Table 1. Especially for PRSCG-TF, the scheme needs to deal with several times as many unprocessed data vectors as PRSCG, but when encrypting with MRSE the two process the same number of data vectors, which leads to an overt rise in PRSCG-TF's time cost, while the difference between PRSCG-TF-1 and PRSCG-TF-2 is not obvious.

Fig. 5. The time cost for generating trapdoors.

Fig. 6. Time cost of query.

Fig. 5 shows the time cost for generating trapdoors for different sizes of the keyword dictionary extracted from the CG set. The difference in time cost between PRSCG-1 and PRSCG-2, and between PRSCG-TF-1 and PRSCG-TF-2, respectively, is caused by extending to higher dimensions.

2) Query: The data user sends a trapdoor to the cloud server, and the cloud server checks whether the source contains the linked specific parts of the CG. That means that through these "keywords", the cloud server achieves the same efficiency as keyword search but at a different semantic level.

When executing a query, the cloud server needs to compute the relevance score for the whole data set; this costs O(mn) in PRSCG-1 and PRSCG-TF-1, and O(m(n + U)) in PRSCG-2 and PRSCG-TF-2, respectively. Finally, the cloud server returns the top-k encrypted documents to authorized users.

The search efficiency of the proposed schemes is measured experimentally on a server, and we compare the results with the previous study. Fig. 6 shows the comparison of our scheme PRSCG with [26]. There is a distinct drop in time cost, which testifies that our schemes are more effective.

Fig. 7 shows the time cost for top-k retrieval, which further confirms the efficiency of our schemes compared with [26].

C. Comparing With Previous Study

In our previous study [26], we proposed a fundamental scheme for performing encrypted search on CGs. In this paper, we propose two schemes that improve on it in efficiency and security. In [26], we use three vectors of 24 dimensions to represent the CG corresponding to a document, and the scheme needs to execute the search step by step sequentially. More specifically, the third vector is filled with the specific values of concepts, which increases the overhead when matching vectors. In this paper, however, we ensure that every document corresponds to only one numerical vector, constructed under the VSM (vector space model). Specially, we select the modified linear forms of CGs to construct our dictionary of "keywords", which enables the use of the VSM. When searching over the source vectors, the server only needs numerical vectors to compute similarity scores, which greatly reduces overhead.

Fig. 7. The time cost for top-k retrieval.

Reference [26] only meets the security requirement of SSE. In this paper, our schemes are secure against the known background model, which is stronger than [26]; the security analysis is given in Section 5. We can conclude that our schemes are more secure and efficient than our previous study.

VII. CONCLUSION

In this paper, compared with the previous study, we propose two more secure and efficient schemes to solve the problem of privacy-preserving smart semantic search based on conceptual graphs over encrypted outsourced data. Among the various semantic representation tools, we select conceptual graphs as our semantic carrier because of their excellent expressiveness and extensibility. To improve the accuracy of retrieval, we use Tregex to simplify the key sentence and make it more generalizable. We creatively transfer the CG into its linear form with some modification, which makes quantitative calculation on CGs and fuzzy retrieval at the semantic level possible. We use different methods to generate indexes and construct two different schemes, each with an enhanced variant, against two threat models by introducing the framework of MRSE. We implement our schemes on a real data set to prove their effectiveness and efficiency.

In future work, we will explore the possibility of semantic search over encrypted cloud data with natural language processing technology.

REFERENCES

[1] S. Miranda-Jiménez, A. Gelbukh, and G. Sidorov, “Summarizing con-ceptual graphs for automatic summarization task,” in Conceptual Struc-tures for STEM Research and Education. Berlin, Germany: Springer,2013, pp. 245–253.

DLK
Highlight
DLK
Highlight
DLK
Highlight
Page 10: 1874 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND …1croreprojects.com/basepapers/2017/Privacy-Preserving... · 2017-09-22 · technology of multi-keyword ranked search over encrypted

FU et al.: PRIVACY-PRESERVING SMART SEMANTIC SEARCH BASED ON CGs OVER ENCRYPTED OUTSOURCED DATA 1883

[2] R. Ferreira, L. de S. Cabral, and R. D. Lins, “Assessing sentence scoringtechniques for extractive text summarization,” Expert Syst. Appl., vol. 40,no. 14, pp. 5755–5764, 2013.

[3] M. Liu, R. A. Calvo, A. Aditomo, and L. A. Pizzato, “UsingWikipedia and conceptual graph structures to generate questions foracademic writing support,” IEEE Trans. Learn. Technol., vol. 5, no. 3,pp. 251–263, Sep. 2012.

[4] M. Heilman and N. A. Smith, “Extracting simplified statements forfactual question generation,” in Proc. QG 3rd Workshop QuestionGenerat., 2010, pp. 11–20.

[5] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searcheson encrypted data,” in Proc. IEEE Symp. Secur. Privacy, May 2000,pp. 44–55.

[6] Y.-C. Chang and M. Mitzenmacher, “Privacy preserving keywordsearches on remote encrypted data,” in Proc. ACNS, 2005, pp. 391–421.

[7] R. Curtmola, J. A. Garay, S. Kamara, and R. Ostrovsky, “Searchablesymmetric encryption: Improved definitions and efficient constructions,”in Proc. ACM CCS, 2006, pp. 79–88.

[8] C. Wang, N. Cao, and J. Li, “Secure ranked keyword search overencrypted cloud data,” in Proc. IEEE 30th Int. Conf. Distrib. Comput.Syst. (ICDCS), Jun. 2010, pp. 253–262.

[9] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preservingmulti-keyword ranked search over encrypted cloud data,” IEEE Trans.Parallel Distrib. Syst., vol. 25, no. 1, pp. 222–233, Jan. 2014.

[10] W. Sun, B. Wang, and N. Cao, “Privacy-preserving multi-keyword textsearch in the cloud supporting similarity-based ranking,” in Proc. 8thACM SIGSAC Symp. Inf., Comput. Commun. Secur., 2013, pp. 71–82.

[11] R. Li, Z. Xu, and W. Kang, “Efficient multi-keyword ranked query overencrypted data in cloud computing,” Future Generat. Comput. Syst.,vol. 30, pp. 179–190, Jan. 2014.

[12] Z. Fu, X. Sun, and Q. Liu, “Achieving efficient cloud search services:Multi-keyword ranked search over encrypted cloud data supportingparallel computing,” IEICE Trans. Commun., vols. E98–B, no. 1,pp. 190–200, 2015.

[13] Z. Xia, X. Wang, X. Sun, and Q. Wang, “A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data,” IEEE Trans.Parallel Distrib. Syst., vol. 27, no. 2, pp. 340–352, Jan. 2016.

[14] Z. Fu, J. Shu, and X. Sun, “Semantic keyword search based on trieover encrypted cloud data,” in Proc. 2nd Int. Workshop Secur. CloudComput., 2014, pp. 59–62.

[15] J. F. Sowa, Conceptual Structures: Information Processing in Mind andMachine. Reading, MA, USA: Addison-Wesley, 1983.

[16] J. F. Sowa, “Conceptual graphs,” in Handbook of Knowledge Represen-tation. Amsterdam, The Netherlands: Elsevier, 2008, pp. 213–237.

[17] J. Zhong, H. Zhu, and J. Li, Conceptual Graph Matching for Seman-tic Search. Conceptual Structures: Integration and Interfaces. Berlin,Germany: Springer, 2002, pp. 92–106.

[18] G. S. Poh, M. S. Mohamad, and M. R. Zaba, “Structured encryption forconceptual graphs,” in Advances in Information and Computer Security.Berlin, Germany: Springer, 2012, pp. 105–122.

[19] S. Hensman and J. Dunnion, “Automatically building conceptual graphsusing VerbNet and WordNet,” in Proc. Int. Symp. Inf. Commun. Technol.,Dublin, Republic of Ireland, 2004, pp. 115–120.

[20] W. K. Wong, D. W. Cheung, and B. Kao, “Secure KNN computation onencrypted databases,” in Proc. ACM SIGMOD Int. Conf. Manage. Data,2009, pp. 139–152.

[21] Z. Fu, K. Ren, J. Shu, X. Sun, and F. Huang, “Enabling personalizedsearch over encrypted outsourced data with efficiency improvement,”IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 9, pp. 2546–2559,Sep. 2016.

[22] Z. Fu, X. Sun, S. Ji, and G. Xie, “Towards efficient content-aware searchover encrypted outsourced data in cloud,” in Proc. 35th Annu. IEEE Int.Conf. Comput. Commun. (IEEE INFOCOM), San Francisco, CA, USA,Apr. 2016, pp. 1–9, doi: 10.1109/INFOCOM.2016.7524606.

[23] S. Zerr, E. Demidova, D. Olmedilla, W. Nejdl, M. Winslett, andS. Mitra, “Zerber: r-Confidential indexing for distributed documents,”in Proc. 11th Int. Conf. Extending Database Technol. (EDBT), 2008,pp. 287–298.

[24] J. Zobel and A. Moffat, “Exploring the similarity space,” ACM SIGIRForum, vol. 32, no. 1, pp. 18–34, 1998.

[25] C. Chen et al., “An efficient privacy-preserving ranked keyword searchmethod,” IEEE Trans. Parallel Distrib. Syst., vol. 27, no, 4, pp. 951–963,Apr. 2016.

[26] Z. Fu, F. Huang, X. Sun, A. V. Vasilakos, and C. Yang,“Enabling semantic search based on conceptual graphs over encryptedoutsourced data,” IEEE Trans. Serv. Comput., to be published.doi: 10.1109/TSC.2016.2622697.

[27] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchablesymmetric encryption,” in Proc. ACM Conf. Comput. Commun. Secur.,2012, pp. 965–976.

[28] M. Chase and S. Kamara, Structured Encryption and Controlled Dis-closure. Berlin, Germany: Springer, 2010, pp. 577–594.

[29] R. D. Lins and S. J. Simske, “A multi-tool scheme for summarizingtextual documents,” in Proc. 11st IADIS Int. Conf. WWW/INTERNET,2012, pp. 1–8.

[30] Z. Fu, X. Wu, C. Guan, X. Sun, and K. Ren, “Towards efficient multi-keyword fuzzy search over encrypted outsourced data with accuracyimprovement,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 12,pp. 2706–2716, Dec. 2016, doi: 10.1109/TIFS.2016.2596138.

[31] Z. Xia, X. Wang, L. Zhang, Z. Qin, X. Sun, and K. Ren, “A privacy-preserving and copy-deterrence content-based image retrieval scheme incloud computing,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 11,pp. 2594–2608, Nov. 2016, doi: 10.1109/TIFS.2016.2590944.

[32] Y. Ren, J. Shen, J. Wang, J. Han, and S. Lee, “Mutual verifiable provabledata auditing in public cloud storage,” J. Internet Technol., vol. 16, no. 2,pp. 317–323, 2015.

[33] B. Gu and V. S. Sheng, “A robust regularization path algorithm forν-support vector classification,” IEEE Trans. Neural Netw. Lear. Syst.,to be published. doi: 10.1109/TNNLS.2016.2527796.

[34] G. Sankareeswari, S. Selvi, and R. Vidhyalakshmi, “Enabling secure out-sourced cloud data,” in Proc. IEEE Int. Conf. Inf. Commun. EmbeddedSyst. (ICICES), Feb. 2014, pp. 1–6.

[35] J. Wang, X. Chen, X. Huang, I. You, and Y. Xiang, “Verifiable auditing for outsourced database in cloud computing,” IEEE Trans. Comput., vol. 64, no. 11, pp. 3293–3303, Nov. 2015.

[36] F. Cheng, Q. Wang, and Q. Zhang, “Highly efficient indexing for privacy-preserving multi-keyword query over encrypted cloud data,” in Proc. Int. Conf. Web-Age Inf. Manage., 2014, pp. 348–359.

[37] M. Li, S. Yu, and N. Cao, “Authorized private keyword search over encrypted data in cloud computing,” in Proc. IEEE 31st Int. Conf. Distrib. Comput. Syst. (ICDCS), Jun. 2011, pp. 383–392.

[38] R. Li, A. X. Liu, and A. L. Wang, “Fast range query processing with strong privacy protection for cloud computing,” Proc. VLDB Endowment, vol. 7, no. 14, pp. 1953–1964, 2014.

[39] X. Chen, J. Li, J. Ma, Q. Tang, and W. Lou, “New algorithms for secure outsourcing of modular exponentiations,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 9, pp. 2386–2396, Sep. 2014.

[40] C. Wang, N. Cao, and K. Ren, “Enabling secure and efficient ranked keyword search over outsourced cloud data,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug. 2012.

[41] J. Li, J. Li, X. Chen, C. Jia, and W. Lou, “Identity-based encryption with outsourced revocation in cloud computing,” IEEE Trans. Comput., vol. 64, no. 2, pp. 425–437, Feb. 2015.

[42] Q. Wang, C. Wang, K. Ren, W. Lou, and J. Li, “Enabling public auditability and data dynamics for storage security in cloud computing,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 5, pp. 847–859, May 2011.

[43] W. Zhang, Y. Lin, and S. Xiao, “Privacy preserving ranked multi-keyword search for multiple data owners in cloud computing,” IEEE Trans. Comput., vol. 65, no. 5, pp. 1566–1577, May 2016.

[44] X. Yi, E. Bertino, and J. Vaidya, “Private searching on streaming data based on keyword frequency,” IEEE Trans. Depend. Sec. Comput., vol. 11, no. 2, pp. 155–167, Mar./Apr. 2014.

[45] Z. Fu, J. Shu, J. Wang, Y. Liu, and S. Lee, “Privacy-preserving smart similarity search based on Simhash over encrypted data in cloud computing,” J. Internet Technol., vol. 16, no. 3, pp. 453–460, 2015.

[46] Q. Liu, W. Cai, J. Shen, Z. Fu, X. Liu, and N. Linge, “A speculative approach to spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud environment,” Secur. Commun. Netw., vol. 9, no. 17, pp. 4002–4012, 2016.

[47] Z. Fu, X. Sun, N. Linge, and L. Zhou, “Achieving effective cloud search services: Multi-keyword ranked search over encrypted cloud data supporting synonym query,” IEEE Trans. Consum. Electron., vol. 60, no. 1, pp. 164–172, Feb. 2014.

[48] Y. Kong, M. Zhang, and D. Ye, “A belief propagation-based method for task allocation in open and dynamic cloud environments,” Knowl.-Based Syst., vol. 115, pp. 123–132, Jan. 2017.

[49] Q. Tian and S. Chen, “Cross-heterogeneous-database age estimation through correlation representation learning,” Neurocomputing, vol. 238, pp. 286–295, May 2017.

[50] Z. Wu, B. Liang, and L. You, “High dimension space projection-based biometric encryption for fingerprint with fuzzy minutia,” Soft Comput., vol. 20, no. 12, pp. 4907–4918, 2016.


1884 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 12, NO. 8, AUGUST 2017

Zhangjie Fu (M’13) received the Ph.D. degree in computer science from the College of Computer, Hunan University, China, in 2012. He is currently an Associate Professor with the School of Computer and Software, Nanjing University of Information Science and Technology, China. He was a Visiting Scholar of Computer Science and Engineering with The State University of New York at Buffalo from 2015 to 2016. His research interests include cloud security, outsourcing security, digital forensics, and network and information security. His research has been supported by NSFC, PAPD, and GYHY. He is a member of the ACM.

Fengxiao Huang received the B.E. degree in software engineering from the Nanjing University of Information Science and Technology, Nanjing, China, in 2014, where he is currently pursuing the M.S. degree in computer science and technology with the Department of Computer and Software. His research interests include cloud security and information security.

Kui Ren (F’16) received the Ph.D. degree in electrical and computer engineering from the Worcester Polytechnic Institute in 2007. He was an Assistant/Associate Professor with the Electrical and Computer Engineering Department, Illinois Institute of Technology. He is currently a Professor with the Computer Science and Engineering Department, The State University of New York at Buffalo. His research interests include security and privacy in cloud computing, wireless security, and smart grid security. His research is supported by NSF, DOE, AFRL, and Amazon. He was a recipient of an NSF Faculty Early Career Development (CAREER) Award in 2011 and a co-recipient of the Best Paper Award of the IEEE ICNP 2011. He serves on the editorial boards of IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, IEEE TRANSACTIONS ON SMART GRID, IEEE WIRELESS COMMUNICATIONS, IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, and the Journal of Communications and Networks.

Jian Weng received the B.S. and M.S. degrees in computer science and engineering from the South China University of Technology, in 2000 and 2004, respectively, and the Ph.D. degree in computer science and engineering from Shanghai Jiao Tong University, in 2008. From 2008 to 2010, he held a post-doctoral position with the School of Information Systems, Singapore Management University. He is currently a Professor and the Dean with the School of Information Technology, Jinan University. He has authored over 50 papers in cryptography conferences and journals, such as CRYPTO, Eurocrypt, TCC, and Asiacrypt. He has served as PC co-chair or PC member for over 10 international conferences, such as ISPEC 2011, RFIDsec 2013 Asia, ISC 2011, and IWSEC 2012.

Cong Wang (M’12) received the B.E. and M.E. degrees from Wuhan University in 2004 and 2007, respectively, and the Ph.D. degree from the Illinois Institute of Technology in 2012, all in electrical and computer engineering. He was with the Palo Alto Research Center in 2011. He is currently an Assistant Professor with the Computer Science Department, City University of Hong Kong. His research interests are in the areas of cloud computing security, with current focus on secure data outsourcing and secure computation outsourcing in public cloud. He is a member of the ACM.
