Elysium PRO Titles with Abstracts 2017-18
The k-nearest neighbors (k-NN) query is a fundamental primitive in spatial and multimedia databases. It has
extensive applications in location-based services, classification, clustering, and so on. With the promise of
confidentiality and privacy, massive data are increasingly outsourced to the cloud in encrypted form to enjoy
the advantages of cloud computing (e.g., reduced storage and query processing costs). Recently, many schemes
have been proposed to support k-NN queries on encrypted cloud data. However, prior works have all assumed
that the query users (QUs) are fully trusted and know the key of the data owner (DO), which is used to encrypt
and decrypt outsourced data. This assumption is unrealistic in many situations, since many users are neither
trusted nor in possession of the key. In this paper, we propose a novel scheme for secure k-NN query on encrypted
cloud data with multiple keys, in which the DO and each QU hold their own distinct keys and do not share
them with each other; meanwhile, the DO encrypts and decrypts outsourced data using his own key. Our
scheme is built on a distributed two-trapdoor public-key cryptosystem (DT-PKC) and a set of secure two-party
computation protocols, which not only preserve data confidentiality and query privacy but also
support an offline data owner. Our extensive theoretical and experimental evaluations demonstrate the
effectiveness of our scheme in terms of security and performance.
ETPL BD -
001
Secure k-NN Query on Encrypted Cloud Data with Multiple Keys
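As a plaintext point of reference (the function name and toy data below are illustrative, not taken from the paper), the k-NN primitive that the encrypted scheme protects can be sketched in a few lines:

```python
import heapq

def knn_query(points, q, k):
    """Return the k points nearest to query q by squared Euclidean distance."""
    dist2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, q))
    return heapq.nsmallest(k, points, key=dist2)

pts = [(0, 0), (1, 1), (3, 4), (5, 5), (1, 0)]
print(knn_query(pts, (1, 1), 2))  # [(1, 1), (1, 0)]
```

The paper's contribution is performing this computation over ciphertexts under multiple keys, which this plaintext sketch deliberately omits.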
Advanced communications and data processing technologies bring great benefits to the smart grid. However,
cyber-security threats also extend from the information system to the smart grid. Existing security works for
the smart grid focus on traditional protection and detection methods. However, many threats occur within a very short
time and are overlooked by existing security components. These threats usually have a huge impact on the smart grid and
disturb its normal operation. Moreover, it is too late to take action against such threats once they are
detected, and the damage can be difficult to repair. To address this issue, this paper proposes a security situational
awareness mechanism based on the analysis of big data in the smart grid. A fuzzy-cluster-based analytical method,
game theory, and reinforcement learning are integrated seamlessly to perform the security situational analysis
for the smart grid. The simulation and experimental results show the advantages of our scheme in terms of high
efficiency and low error rate for security situational awareness.
ETPL BD -
002
Big Data Analysis based Security Situational Awareness for Smart Grid
Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing
in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems
are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class
of problems, e.g., search, clustering, log analysis, different types of join operations, matrix
multiplication, pattern matching, and social network analysis. However, all these popular systems have
a major drawback: they are restricted to locally distributed computation, which prevents them from
implementing geographically distributed data processing. The increasing amount of geographically distributed
massive data is pushing industry and academia to rethink current big-data processing systems.
Novel frameworks, going beyond the state-of-the-art architectures and technologies of current
systems, are expected to process geographically distributed data at its locations without
moving entire raw data sets to a single location. In this paper, we investigate and discuss the challenges and
requirements in designing geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream processing (Spark-based
systems), and SQL-style processing geo-distributed frameworks, models, and algorithms, along with their
overhead issues.
ETPL BD -
003
A Survey on Geographically Distributed Big-Data Processing using MapReduce
As urban populations grow, cities face many challenges related to transportation, resource consumption, and the
environment. Ride sharing has been proposed as an effective approach to reduce traffic congestion, gasoline
consumption, and pollution. However, despite great promise, researchers and policy makers lack adequate tools
to assess the tradeoffs and benefits of various ride-sharing strategies. In this paper, we propose a real-time, data-
driven simulation framework that supports the efficient analysis of taxi ride sharing. By modeling taxis and trips
as distinct entities, our framework is able to simulate a rich set of realistic scenarios. At the same time, by
providing a comprehensive set of parameters, we are able to study the taxi ride-sharing problem from different
angles, considering different stakeholders’ interests and constraints. To address the computational complexity
of the model, we describe a new optimization algorithm that is linear in the number of trips and makes use of an
efficient indexing scheme, which combined with parallelization, makes our approach scalable. We evaluate our
framework through a study that uses data about 360 million trips taken by 13,000 taxis in New York City during
2011 and 2012. We describe the findings of the study which demonstrate that our framework can provide insights
into strategies for implementing city-wide ride-sharing solutions. We also carry out a detailed performance
analysis which shows the efficiency of our approach.
ETPL BD -
004
STaRS: Simulating Taxi Ride Sharing at Scale
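One building block of such a simulator, grouping trips whose pickup points fall into the same cell of a spatial grid, might be sketched as follows (a toy stand-in for the paper's indexing scheme; names and data are illustrative):

```python
from collections import defaultdict

def grid_pairs(trips, cell=0.5):
    """Bucket trip pickup points into grid cells; trips sharing a cell
    are candidate ride-share matches (toy stand-in for a spatial index)."""
    buckets = defaultdict(list)
    for tid, (x, y) in trips.items():
        buckets[(int(x // cell), int(y // cell))].append(tid)
    return [ids for ids in buckets.values() if len(ids) > 1]

trips = {"a": (0.1, 0.1), "b": (0.2, 0.3), "c": (5.0, 5.0)}
print(grid_pairs(trips))  # [['a', 'b']]
```

A real matcher would also check time windows and detour constraints, which the paper's optimization algorithm handles in linear time over trips.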
Cloud storage, as one of the most important services of cloud computing, helps cloud users break the bottleneck
of restricted resources and expand their storage without upgrading their devices. To guarantee the
security and privacy of cloud users, data are always outsourced in encrypted form. However, encrypted data
can waste cloud storage and complicate data sharing among authorized users. We still face
challenges in encrypted data storage and management with deduplication. Traditional deduplication schemes
focus on specific application scenarios, in which deduplication is completely controlled by either
data owners or cloud servers. They cannot flexibly satisfy the varied demands of data owners according to the level
of data sensitivity. In this paper, we propose a heterogeneous data storage management scheme, which flexibly
offers both deduplication management and access control at the same time across multiple Cloud Service
Providers (CSPs). We evaluate its performance with security analysis, comparison, and implementation. The
results show its security, effectiveness, and efficiency towards potential practical usage.
ETPL BD -
005
Heterogeneous Data Storage Management with Deduplication in Cloud Computing
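The deduplication idea the scheme manages can be illustrated with a minimal content-addressed store (a plaintext sketch; the paper's scheme adds encryption and access control across CSPs):

```python
import hashlib

def dedup_store(blobs):
    """Store each distinct blob once, keyed by its SHA-256 digest.
    Toy content-addressing sketch: identical data hashes to the same key,
    so duplicates occupy no extra space."""
    store = {}
    for blob in blobs:
        store.setdefault(hashlib.sha256(blob).hexdigest(), blob)
    return store

store = dedup_store([b"report.pdf", b"photo.jpg", b"report.pdf"])
print(len(store))  # 2 -- the duplicate copy is not stored twice
```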
Increasingly popular big data applications bring invaluable information, but also challenges to
industry and academia. Cloud computing, with its seemingly unlimited resources, appears to be the way out.
However, this panacea cannot play its role unless cloud infrastructure resources are allocated carefully.
In this paper, we present a multi-objective optimization algorithm to trade off the performance, availability, and
cost of big data applications running on the cloud. After analyzing and modeling the interlaced relations among
these objectives, we design and implement our approach in an experimental environment. Finally, three sets of
experiments show that our approach runs about 20% faster than traditional optimization approaches and
achieves about 15% higher performance than other heuristic algorithms, while saving 4% to 20% in cost.
ETPL BD -
006
Cloud Infrastructure Resource Allocation for Big Data Applications
Privacy has become a considerable issue as big data applications grow dramatically in cloud
computing. The implementation of these emerging technologies has improved or changed
service models and improved application performance in various respects. However, the remarkable
growth in data volume has also created many challenges in practice. The execution time of data
encryption is one of the serious issues during data processing and transmission. Many current applications
abandon data encryption to reach an acceptable performance level, despite the accompanying privacy concerns.
In this paper, we concentrate on privacy and propose a novel data encryption approach called the Dynamic
Data Encryption Strategy (D2ES). Our proposed approach aims to selectively encrypt data, using privacy
classification methods under timing constraints. This approach is designed to maximize the scope of privacy
protection through a selective encryption strategy within the required execution time. The performance
of D2ES has been evaluated in our experiments, which provide proof of the privacy enhancement.
ETPL BD -
007
Privacy-Preserving Data Encryption Strategy for Big Data in Mobile Cloud
Computing
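The core idea of selective encryption under a timing constraint can be sketched as a greedy budgeted selection (an illustrative simplification, not the actual D2ES algorithm; the item names, weights, and costs below are assumed):

```python
def select_for_encryption(items, budget):
    """Greedy sketch of selective encryption under a time budget.
    items = [(name, privacy_weight, encrypt_cost)]; pick items with the
    highest privacy-per-cost ratio until the time budget is spent."""
    chosen, spent = [], 0.0
    for name, w, c in sorted(items, key=lambda t: t[1] / t[2], reverse=True):
        if spent + c <= budget:
            chosen.append(name)
            spent += c
    return chosen

items = [("ssn", 10, 1.0), ("logs", 2, 2.0), ("email", 5, 1.0)]
print(select_for_encryption(items, 2.0))  # ['ssn', 'email']
```

The sketch encrypts the most sensitive data first and leaves low-value data plaintext when time runs out, which is the trade-off the abstract describes.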
With the fast-growing demands of big data, we need to manage and store big data in the cloud. Since
the cloud is not fully trusted and can be accessed by any user, data in the cloud may face threats. In this
paper, we propose a secure authentication protocol for cloud big data with a hierarchical attribute authorization
structure. Our proposed protocol resorts to a tree-based signature to significantly improve the security of
attribute authorization. To satisfy big data requirements, we extend the proposed authentication protocol to
support multiple levels in the hierarchical attribute authorization structure. Security analysis shows that our
protocol can resist the forgery attack and the replay attack. In addition, our protocol preserves the entities' privacy.
Compared with previous studies, we show that our protocol has lower computational and
communication overhead.
ETPL BD -
008
Secure Authentication in Cloud Big Data with Hierarchical Attribute Authorization
Structure
With the growing amount of data, the demand for big data storage significantly increases. Through the cloud
center, data providers can conveniently share data stored in the center with others. However, one practically
important problem in big data storage is privacy. During the sharing process, data is encrypted to be confidential
and anonymous. Such operations can protect privacy from being leaked. To satisfy practical conditions,
data transmission with multiple receivers is also considered. Furthermore, this paper proposes the notion of pre-
authentication for the first time, i.e., only users whose attributes have already been authenticated can access the
re-encrypted data. The pre-authentication mechanism combines the advantages of the proxy conditional
re-encryption multi-sharing mechanism with the attribute-based authentication technique, thus achieving
attribute authentication before re-encryption and ensuring the security of the attributes and data. Moreover,
this paper finally proves that the system is secure and that the proposed pre-authentication mechanism can
significantly enhance the system security level.
ETPL BD -
009
A Pre-Authentication Approach to Proxy Re-encryption in Big Data Context
Privacy preservation is one of the greatest concerns in big data. As one of its extensive applications,
privacy-preserving data publication (PPDP) has become an important research field. One of the fundamental
challenges in PPDP is the trade-off between privacy and utility of a single, independent data set.
However, recent research has shown that the advanced privacy mechanism, i.e., differential privacy, is
vulnerable when multiple data sets are correlated. In this case, the trade-off problem between privacy and utility
evolves into a game problem, in which the payoff of each player depends on his and his neighbors' privacy
parameters. In this paper, we first present the definition of correlated differential privacy to evaluate the real
privacy level of a single data set influenced by other data sets. Then, we construct a game model of multiple
players, in which each player publishes a data set sanitized by differential privacy. Next, we analyze the existence
and uniqueness of the pure Nash equilibrium. We use the notion of the price of anarchy to evaluate the efficiency
of the pure Nash equilibrium. Finally, we show the correctness of our game analysis via simulation experiments.
ETPL BD -
010
Game Theory Based Correlated Privacy Preserving Analysis in Big Data
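The differential-privacy mechanism each player applies can be illustrated with the standard Laplace mechanism (a generic textbook sketch, not the paper's correlated variant; the function names and query values are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_release(true_count, sensitivity, epsilon, seed=0):
    """Epsilon-differentially-private release of a count query:
    add Laplace noise with scale = sensitivity / epsilon."""
    rng = random.Random(seed)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

print(dp_release(100, 1.0, 0.5))  # a noisy count near 100
```

In the game the paper studies, each player chooses its own epsilon, and correlation between data sets makes the effective privacy of one player depend on its neighbors' choices.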
As both the scale of mobile networks and the population of mobile users keep increasing, applications of
mobile social big data have emerged in which mobile social users can use their devices to exchange and
share content with each other. Security resources are needed to protect mobile social big data during
delivery. However, because security resources are limited, how to allocate them becomes a new
challenge. Therefore, in this paper we propose a joint matching-coalitional game based security-aware resource
allocation scheme to deliver mobile social big data. In the proposed scheme, first, a coalition game model is
introduced for base stations (BSs) to form groups that provide both wireless and security resources, improving
resource efficiency and profits. Second, a matching-theory-based model is employed to
determine the selection process between communities and the coalitions of BSs, so that mobile social users can
form communities and select the optimal coalition to obtain security resources. Third, a joint matching-coalition
algorithm is presented to obtain the stable security-aware resource allocation. Finally, simulation experiments
show that the proposed scheme outperforms other existing schemes.
ETPL BD -
011
Security-Aware Resource Allocation for Mobile Social Big Data: A Matching-
Coalitional Game Solution
Traffic speed is a key indicator for the efficiency of an urban transportation system. Accurate modeling of the
spatiotemporally varying traffic speed thus plays a crucial role in urban planning and development. This paper
addresses the problem of efficient fine-grained traffic speed prediction using big traffic data obtained from static
sensors. Gaussian processes (GPs) have been previously used to model various traffic phenomena, including
flow and speed. However, GPs do not scale with big traffic data due to their cubic time complexity. In this work,
we address their efficiency issues by proposing local GPs to learn from and make predictions for correlated
subsets of data. The main idea is to quickly group speed variables in both spatial and temporal dimensions into
a finite number of clusters, so that future and unobserved traffic speed queries can be heuristically mapped to
one of such clusters. A local GP corresponding to that cluster can then be trained on the fly to make predictions
in real-time. We call this method localization. We use non-negative matrix factorization for localization and
propose simple heuristics for cluster mapping. We additionally leverage the expressiveness of GP kernel
functions to model road network topology and incorporate side information. Extensive experiments using real-
world traffic data collected in the two U.S. cities of Pittsburgh and Washington, D.C., show that our proposed
local GPs significantly improve both runtime performances and prediction accuracies compared to the baseline
global and local GPs.
ETPL BD -
012
Local Gaussian Processes for Efficient Fine-Grained Traffic Speed Prediction
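The localization step, mapping a query to a cluster and answering with that cluster's local model, can be sketched as follows (here a cluster's mean speed stands in for a trained local GP; the data and names are illustrative):

```python
def nearest_cluster(centroids, q):
    """Map a query point to the index of its closest centroid
    (the 'localization' step)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], q)))

def local_predict(clusters, centroids, q):
    """Predict with the model local to q's cluster; here the local 'model'
    is just the cluster's mean speed, standing in for a local GP trained
    on that cluster's subset of data."""
    speeds = clusters[nearest_cluster(centroids, q)]
    return sum(speeds) / len(speeds)

centroids = [(0.0, 0.0), (10.0, 10.0)]          # cluster centers (space-time)
clusters = [[30.0, 32.0, 28.0], [55.0, 60.0]]   # observed speeds per cluster
print(local_predict(clusters, centroids, (1.0, 0.5)))  # 30.0
```

Training only on the matched cluster's subset is what lets the paper sidestep the cubic cost of a single global GP.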
MapReduce effectively partitions and distributes computation workloads to a cluster of servers, facilitating
today's big data processing. Given the massive data to be dispatched, and the intermediate results to be collected
and aggregated, there have been significant studies on data locality that seek to co-locate computation with
data so as to reduce cross-server traffic in MapReduce. They generally assume that the input data have little
dependency on each other, which, however, is not necessarily true for many real-world applications, and
we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged by such data
dependency. In this paper, we present DALM (Dependency-Aware Locality for MapReduce) for processing
real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a
data-locality framework, organically synthesizing the key components of data reorganization, replication, and
placement. Besides the algorithmic design within the framework, we have also closely examined the deployment
challenges, particularly in public virtualized cloud environments, and have implemented DALM on Hadoop
1.2.1 with Giraph 1.0.0. Its performance has been evaluated through both simulations and real-world
experiments, and compared with that of state-of-the-art solutions.
ETPL BD -
013
Dependency-Aware Data Locality for MapReduce
Cloud computing has promoted the success of big data applications such as medical data analyses. With the
abundant resources provisioned by cloud platforms, the QoS (quality of service) of services that process big data
can be boosted significantly. However, due to unstable networks or fake advertisement, the QoS published by
service providers is not always trusted. Therefore, it becomes a necessity to evaluate service quality in a
trustable way, based on the services' historical QoS records. However, evaluation efficiency would be low
and could not meet users' quick-response requirements if all the records of a service were recruited for quality
evaluation. Moreover, it may lead to a 'lagging effect' or low evaluation accuracy if all the records are treated
equally, as the invocation contexts of different records are not exactly the same. In view of these challenges, a
novel approach named Partial-HR (Partial Historical Records-based service evaluation approach) is put forward
in this paper. In Partial-HR, each historical QoS record is weighted based on its service invocation context.
Afterwards, only the important partial records are employed for quality evaluation. Finally, a group of
experiments is deployed to validate the feasibility of our proposal, in terms of evaluation accuracy and efficiency.
ETPL BD -
014
A Context-aware Service Evaluation Approach over Big Data for Cloud Applications
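The Partial-HR idea, weighting records by context similarity and evaluating quality from only the top-weighted ones, might be sketched as follows (an assumed simplification; the similarity function, record format, and data are all illustrative):

```python
def partial_hr_score(records, ctx, similarity, top=2):
    """Sketch of a partial-records evaluation: rank (context, qos) records
    by similarity of their invocation context to the current context ctx,
    then average the QoS of only the top-ranked records."""
    ranked = sorted(records, key=lambda r: similarity(r[0], ctx), reverse=True)
    chosen = ranked[:top]
    return sum(q for _, q in chosen) / len(chosen)

# Toy contexts are scalars; similarity = negative distance (an assumption).
records = [(1.0, 0.9), (1.1, 0.8), (9.0, 0.1)]
print(round(partial_hr_score(records, 1.0, lambda a, b: -abs(a - b)), 2))  # 0.85
```

Dropping the dissimilar-context record (9.0, 0.1) is what avoids the 'lagging effect' the abstract mentions.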
To address the computing challenges of 'big data', a number of data-intensive computing frameworks (e.g.,
MapReduce, Dryad, Storm, and Spark) have emerged and become popular. YARN is a de facto resource
management platform that enables these frameworks to run together in a shared system. However, we observe
that, in a cloud computing environment, the fair resource allocation policy implemented in YARN is not suitable
because its memoryless resource allocation leads to violations of a number of good properties in
shared computing systems. This paper attempts to address these problems for YARN. Both single-level and
hierarchical resource allocations are considered. For single-level resource allocation, we propose a novel fair
resource allocation mechanism called Long-Term Resource Fairness (LTRF). For
hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF) by extending
LTRF. We show that both LTRF and H-LTRF can address the fairness problems of the current resource allocation
policy and are thus suitable for cloud computing. Finally, we have developed LTYARN by implementing LTRF
and H-LTRF in YARN, and our experiments show that it achieves better resource fairness than the existing fair
schedulers of YARN.
ETPL BD -
015
Fair Resource Allocation for Data-Intensive Computing in the Cloud
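Long-term fairness can be illustrated with a toy allocator that favors the user with the smallest cumulative share, so that a user underserved in past rounds catches up (this is our sketch of the general idea, not the paper's LTRF mechanism; all values are illustrative):

```python
def ltrf_round(cumulative, demands, capacity):
    """One allocation round of a long-term-fairness sketch: repeatedly give
    one resource unit to the user with the smallest cumulative-plus-current
    share who still has unmet demand."""
    alloc = {u: 0 for u in demands}
    for _ in range(capacity):
        live = [u for u in demands if alloc[u] < demands[u]]
        if not live:
            break
        u = min(live, key=lambda x: cumulative[x] + alloc[x])
        alloc[u] += 1
    return alloc

# User "b" received nothing historically, so it gets this whole round.
print(ltrf_round({"a": 5, "b": 0}, {"a": 2, "b": 4}, 4))  # {'a': 0, 'b': 4}
```

A memoryless fair scheduler would split this round evenly regardless of history; accounting for cumulative shares is the distinction the abstract draws.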
This paper introduces RankMap, a platform-aware end-to-end framework for efficient execution of a broad class
of iterative learning algorithms for massive and dense data sets. Our framework exploits the structure of the data to
scalably factorize it into an ensemble of lower-rank subspaces. The factorization creates sparse low-dimensional
representations of the data, a property which is leveraged to devise effective mapping and scheduling of iterative
learning algorithms on the distributed computing machines. We provide two APIs, one matrix-based and one
graph-based, which facilitate automated adoption of the framework for performing several contemporary
learning applications. To demonstrate the utility of RankMap, we solve sparse recovery and power iteration
problems on various real-world data sets with up to 1.8 billion nonzeros. Our evaluations are performed on
Amazon EC2 and IBM iDataPlex servers using up to 244 cores. The results demonstrate up to two orders of
magnitude improvements in memory usage, execution speed, and bandwidth compared with the best reported
prior work, while achieving the same level of learning accuracy.
ETPL BD -
016
RankMap: A Framework for Distributed Learning From Dense Data Sets
The growth of mobile cloud computing (MCC) is challenged by the need to adapt to the resources and
environment that are available to mobile clients while addressing the dynamic changes in network bandwidth.
Big data can be handled via MCC. In this paper, we propose a model of computation partitioning for stateful
data in the dynamic environment that will improve performance. First, we constructed a model of stateful data
streaming and investigated the method of computation partitioning in a dynamic environment. We developed a
definition of the direction and calculation of the partitioning scheme, covering single-frame data flow, task
scheduling, and execution efficiency. We also defined the problem of a multi-frame data flow calculation
segmentation decision that is optimized for dynamic conditions and provided an analysis. Second, we proposed
a computation partitioning method for single frame data flow. We determined the data parameters of the
application model, the computation partitioning scheme, and the task and work order data stream model. We
followed the scheduling method to provide the optimal calculation for data frame execution time after
computation partitioning and the best computation partitioning method. Third, we explored a calculation
segmentation method for single frame data flow based on multi-frame data using multi-frame data optimization
adjustment and prediction of future changes in network bandwidth. We were able to demonstrate that the
calculation method for multi-frame data in a changing network bandwidth environment is more efficient than
the calculation method with the limitation of calculations for single frame data. Finally, our research verified
the effectiveness of single frame data in the application of the data stream and analyzed the performance of the
method to optimize the adjustment of multi-frame data. We used a mobile cloud computing platform prototype
system for face recognition to verify the effectiveness of the method.
ETPL BD -
017
Computation partitioning for mobile cloud computing in big data environment
Attribute-based encryption (ABE) has been widely used in cloud computing where a data provider outsources
his/her encrypted data to a cloud service provider, and can share the data with users possessing specific
credentials (or attributes). However, the standard ABE system does not support secure deduplication, which is
crucial for eliminating duplicate copies of identical data in order to save storage space and network bandwidth.
In this paper, we present an attribute-based storage system with secure deduplication in a hybrid cloud setting,
where a private cloud is responsible for duplicate detection and a public cloud manages the storage. Compared
with the prior data deduplication systems, our system has two advantages. Firstly, it can be used to confidentially
share data with users by specifying access policies rather than sharing decryption keys. Secondly, it achieves
the standard notion of semantic security for data confidentiality while existing systems only achieve it by
defining a weaker security notion. In addition, we put forth a methodology to modify a ciphertext over one
access policy into ciphertexts of the same plaintext but under other access policies without revealing the
underlying plaintext.
ETPL BD -
018
Attribute-Based Storage Supporting Secure Deduplication of Encrypted Data in Cloud
Due to the complexity and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most effective
approaches for big data storage and access. Nevertheless, verifying the access legitimacy of a user and securely
updating a ciphertext in the cloud based on a new access policy designated by the data owner are two critical
challenges to make cloud-based big data storage practical and effective. Traditional approaches either
completely ignore the issue of access policy update or delegate the update to a third party authority; but in
practice, access policy update is important for enhancing security and dealing with the dynamism caused by user
join and leave activities. In this paper, we propose a secure and verifiable access control scheme based on the
NTRU cryptosystem for big data storage in clouds. We first propose a new NTRU decryption algorithm to
overcome the decryption failures of the original NTRU, and then detail our scheme and analyze its correctness,
security strengths, and computational efficiency. Our scheme allows the cloud server to efficiently update the
ciphertext when a new access policy is specified by the data owner, who is also able to validate the update to
counter cheating behaviors of the cloud. It also enables (i) the data owner and eligible users to effectively
verify the legitimacy of a user for accessing the data, and (ii) a user to validate the information provided by other
users for correct plaintext recovery. Rigorous analysis indicates that our scheme can prevent eligible users from
cheating and resist various attacks such as the collusion attack.
ETPL BD -
019
A Secure and Verifiable Access Control Scheme for Big Data Storage in Clouds
With the rapidly increasing popularity of economic activities, a large amount of economic data is being collected.
Although such data offers great opportunities for economic analysis, its low quality, high dimensionality, and
huge volume pose great challenges to the efficient analysis of economic big data. Existing methods have
primarily analyzed economic data from the perspective of econometrics, which involves limited indicators and
demands the prior knowledge of economists. When embracing large varieties of economic factors, these methods
tend to yield unsatisfactory performance. To address these challenges, this paper presents a new framework for
the efficient analysis of high-dimensional economic big data based on innovative distributed feature selection.
Specifically, the framework combines economic feature selection and econometric model
construction to reveal hidden patterns of economic development. The functionality rests on three pillars: (i)
novel data pre-processing techniques to prepare high-quality economic data, (ii) an innovative distributed feature
identification solution to locate important and representative economic indicators from multidimensional data
sets, and (iii) new econometric models to capture the hidden patterns of economic development. The
experimental results on economic data collected in Dalian, China, demonstrate that our proposed framework
and methods have superior performance in analyzing enormous economic data.
ETPL BD -
020
Distributed Feature Selection for Efficient Economic Big Data Analysis
To leverage big data for enhanced strategic insight, process optimization, and informed decision-making, we
need an efficient access control mechanism to ensure end-to-end security of such information assets.
Signcryption is one of several promising techniques to simultaneously achieve big data
confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke
users from a large-scale system efficiently. In this paper, we put forward the first identity-based (ID-based)
signcryption scheme with efficient revocation as well as the ability to outsource unsigncryption, enabling secure
big data communications between data collectors and data analytical system(s). Our scheme is designed to
achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while
providing scalable revocation functionality such that the overhead demanded of the private key generator (PKG)
in the key-update phase increases only logarithmically with the cardinality of users. Although in our scheme
the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not
affect the security of the proposed scheme. We then prove the security of our scheme, as well as demonstrating
its utility using simulations.
ETPL BD -
021
Revocable Identity-Based Access Control for Big Data with Verifiable Outsourced
Computing
With the popularity of wearable devices, along with the development of cloud and cloudlet technology, there
is an increasing need to provide better medical care. The processing chain of medical data mainly includes
data collection, data storage, and data sharing. Traditional healthcare systems often require the delivery of
medical data to the cloud, which involves users' sensitive information and incurs communication energy
costs. Practically, medical data sharing is a critical and challenging issue. Thus, in this paper, we build
a novel healthcare system by utilizing the flexibility of the cloudlet. The functions of the cloudlet include privacy
protection, data sharing, and intrusion detection. In the data collection stage, we first utilize the Number Theory
Research Unit (NTRU) method to encrypt the user's body data collected by wearable devices. Those data are
transmitted to a nearby cloudlet in an energy-efficient fashion. Second, we present a new trust model to help
users select trustable partners who want to share stored data in the cloudlet. The trust model also helps similar
patients communicate with each other about their diseases. Third, we divide users' medical data stored in the
remote cloud of the hospital into three parts, and give each proper protection. Finally, to protect the
healthcare system from malicious attacks, we develop a novel collaborative intrusion detection system (IDS)
method based on the cloudlet mesh, which can effectively prevent attacks on the remote healthcare big data cloud.
Our experiments demonstrate the effectiveness of the proposed scheme.
ETPL BD -
022
Privacy Protection and Intrusion Avoidance for Cloudlet-based Medical Data Sharing
Resource constrained sensing devices are being used widely to build and deploy self-organizing wireless sensor
networks for a variety of critical applications such as smart cities, smart health, precision agriculture and
industrial control systems. Many such devices sense the deployed environment and generate a variety of data
and send them to the server for analysis as data streams. A Data Stream Manager (DSM) at the server collects
the data streams (often called big data) to perform real time analysis and decision-making for these critical
applications. A malicious adversary may access or tamper with the data in transit. One of the challenging tasks
in such applications is to assure the trustworthiness of the collected data so that any decisions are made on the
processing of correct data. Assuring high data trustworthiness requires that the system satisfies two key security
properties: confidentiality and integrity. To ensure the confidentiality of collected data, we need to prevent
sensitive information from reaching the wrong people while ensuring that the right people can access it. Sensed
data are always associated with different sensitivity levels based on the sensitivity of emerging applications or
the sensed data types or the sensing devices. For example, a temperature reading in a precision agriculture application
may not be as sensitive as monitored data in smart health. Providing multilevel data confidentiality along with
data integrity for big sensing data streams in the context of near real time analytics is a challenging problem. In
this paper, we propose a Selective Encryption (SEEN) method to secure big sensing data streams that satisfies
the desired multiple levels of confidentiality and data integrity. Our method is based on two key concepts:
common shared keys that are initialized and updated by DSM without requiring retransmission, and a seamless
key refreshment process without interrupting the data stream encryption/decryption. Theoretical analyses and
experimental results of our SEEN method show that it can significantly improve the efficiency and buffer usage
at DSM without compromising the confidentiality and integrity of the data streams.
ETPL BD -
023
SEEN: A Selective Encryption Method to Ensure Confidentiality for Big Sensing Data
Streams
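The core selective-protection idea above, encrypt only data at or above a sensitivity threshold while integrity-protecting everything, can be sketched as follows. This is an illustrative toy, not the authors' SEEN scheme: the hash-based keystream, the per-level keys, and the threshold value are all assumptions made for the example.

```python
import hashlib, hmac

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Derive a pseudo-random keystream by hashing key||nonce||counter
    # (a toy stream cipher standing in for a real one).
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def protect(record: bytes, level: int, keys: dict, nonce: bytes, threshold: int = 2):
    # Encrypt only records at or above the sensitivity threshold; always MAC them
    # so integrity holds for every level.
    if level >= threshold:
        ks = keystream(keys[level], nonce, len(record))
        body = bytes(a ^ b for a, b in zip(record, ks))
    else:
        body = record
    tag = hmac.new(keys["mac"], nonce + body, hashlib.sha256).digest()
    return body, tag

keys = {1: b"k-low", 2: b"k-med", 3: b"k-high", "mac": b"k-mac"}
body, tag = protect(b"heart-rate:72", level=3, keys=keys, nonce=b"n0")
# Decryption is the same XOR with the same keystream:
ks = keystream(keys[3], b"n0", len(body))
assert bytes(a ^ b for a, b in zip(body, ks)) == b"heart-rate:72"
```

Low-sensitivity records skip the cipher entirely, which is where the buffer and efficiency gains of selective encryption come from.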
Biomedical research often involves studying patient data that contain personal information. Inappropriate use of
these data might lead to leakage of sensitive information, which can put patient privacy at risk. The problem of
preserving patient privacy has received increasing attention in the era of big data. Many privacy methods have
been developed to protect against various attack models. This paper reviews relevant topics in the context of
biomedical research. We discuss privacy preserving technologies related to (1) record linkage, (2) synthetic data
generation, and (3) genomic data privacy. We also discuss the ethical implications of big data privacy in
biomedicine and present challenges in future research directions for improving data privacy in biomedical
research.
ETPL BD -
024
Big Data Privacy in Biomedical Research
The next generation wireless networks are expected to operate in a fully automated fashion to meet the burgeoning
capacity demand and to serve users with superior quality of experience. Mobile wireless networks can leverage
spatio-temporal information about user and network condition to embed the system with end-to-end visibility
and intelligence. Big data analytics has emerged as a promising approach to unearth meaningful insights and to
build artificially intelligent models with the assistance of machine learning tools. Utilizing the aforementioned tools
and techniques, this paper contributes in two ways. First, we utilize mobile network data (big data) – call detail
record (CDR) – to analyze anomalous behavior of a mobile wireless network. For anomaly detection purposes,
we use unsupervised clustering techniques, namely k-means clustering and hierarchical clustering. We compare
the detected anomalies with ground truth information to verify their correctness. From the comparative analysis,
we observe that when the network experiences abruptly high (unusual) traffic demand at any location and time,
it identifies that as an anomaly. This helps in identifying regions of interest (RoI) in the network for special
actions such as resource allocation and fault avoidance solutions. Second, we train a neural-network based prediction
model with anomalous and anomaly-free data to highlight the effect of anomalies in data while training/building
intelligent models. In this phase, we transform our anomalous data to anomaly-free and we observe that the error
in prediction while training the model with anomaly-free data has largely decreased as compared to the case
when the model was trained with anomalous data.
ETPL BD -
025
Big Data Analytics for User Activity Analysis and User Anomaly Detection in Mobile
Wireless Network
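The clustering-based anomaly detection described above can be illustrated with a small pure-Python sketch: run k-means on per-hour traffic counts and flag members of unusually small clusters, since an abrupt spike ends up isolated. The small-cluster rule and all parameter values here are illustrative assumptions, not the paper's ground-truth comparison.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain 1-D k-means (Lloyd's algorithm) on scalar traffic counts.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def anomalies(points, centers, min_frac=0.2):
    # Assign each point to its nearest center and flag members of clusters
    # holding less than min_frac of the data: abrupt, unusual demand gets
    # isolated into a tiny cluster of its own.
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
    out = []
    for c in clusters:
        if c and len(c) < min_frac * len(points):
            out.extend(c)
    return out

# Hourly call volumes with one abrupt spike:
traffic = [100, 110, 95, 105, 98, 102, 900, 101, 99, 104]
centers = kmeans(traffic, k=2)
```

With two centers, the nine ordinary readings converge to one cluster and the spike of 900 to its own, so only the spike is flagged.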
In this paper, we consider the problem of mutual privacy-protection in social participatory sensing in which
individuals contribute their private information to build a (virtual) community. Particularly, we propose a mutual
privacy preserving k-means clustering scheme that neither discloses individuals' private information nor leaks
the community’s characteristic data (clusters). Our scheme contains two privacy-preserving algorithms called
at each iteration of the k-means clustering. The first one is employed by each participant to find the nearest
cluster while the cluster centers are kept secret from the participants; and the second one computes the cluster
centers without leaking any cluster center information to the participants while preventing each participant from
figuring out other members in the same cluster. An extensive performance analysis is carried out to show that
our approach is effective for k-means clustering, can resist collusion attacks, and can provide mutual privacy
protection even when the data analyst colludes with all except one participant.
ETPL BD -
026
Mutual Privacy Preserving k-Means Clustering in Social Participatory Sensing
Today, one of the main challenges of big data research is the processing of big time-series data. Moreover, time
data analysis is of considerable importance, because previous trends are useful for predicting the future. Due to
the considerable delay when the volume of the data increases, the presence of redundancy, and the innate lack
of time-series structures, the traditional relational data model does not seem adequately capable of analyzing
time data. Moreover, many traditional data structures do not support time operators, which results in an
inefficient access to time data. Therefore, relational database management systems have difficulty in dealing
with big data—it may require massively parallel software that runs on many servers. This has led us to
implement Chronos Software, an in-memory background-based time database for key-value pairs; this software
was implemented using C++ language. An independent design has been suggested through appropriately using
temporal algorithms, parallelism algorithms, and methods of data storage in RAM. Our results indicate that the
employment of RAM for storing the data, and of the Timeline Index algorithm for getting access to the time
background of the keys in Chronos translate into an increase of about 40%-90% in the efficiency as compared
to other databases like MySQL and MongoDB.
ETPL BD -
027
A New Method for Time-Series Big Data Effective Storage
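The temporal key-value access pattern behind such a database can be sketched in a few lines: keep a sorted timeline of writes per key so "as-of" queries resolve in O(log n) via binary search. This is a minimal in-memory sketch of the Timeline-Index idea, not the actual Chronos software (which is in C++ and far more elaborate).

```python
import bisect

class Chronos:
    # Minimal in-memory temporal key-value store: each key holds a timeline
    # of (timestamp, value) writes kept sorted by timestamp.
    def __init__(self):
        self.times = {}   # key -> sorted list of timestamps
        self.values = {}  # key -> values aligned with self.times

    def put(self, key, timestamp, value):
        ts = self.times.setdefault(key, [])
        vs = self.values.setdefault(key, [])
        i = bisect.bisect_right(ts, timestamp)
        ts.insert(i, timestamp)
        vs.insert(i, value)

    def as_of(self, key, timestamp):
        # Latest value written at or before `timestamp`, or None if absent.
        ts = self.times.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self.values[key][i - 1] if i else None

db = Chronos()
db.put("temp", 1, 20)
db.put("temp", 5, 25)
```

An as-of query at time 3 returns the value written at time 1, since that is the latest write not after the query point.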
Secret sharing schemes have been commonly applied in distributed storage for big data. They offer a method for
protecting outsourced data against data leakage and for securing key management systems. The secret is
distributed among a group of participants where each participant holds a share of the secret. The secret can be
reconstructed only when a sufficient number of shares are combined. Although many secret sharing schemes
have been proposed, they are still inefficient in terms of share size, communication cost and storage cost; and
also lack robustness in terms of exact-share repair. In this paper, for the first time, we propose a new secret
sharing scheme based on Slepian-Wolf coding. Our scheme can achieve an optimal share size utilizing the
simple binning idea of the coding. It also enhances the exact-share repair feature whereby the shares remain
consistent even if they are corrupted. We show, through experiments, how our scheme can significantly reduce
the communication and storage cost while still being able to support direct share repair leveraging lightweight
exclusive-OR (XOR) operation for fast computation.
ETPL BD -
028
Optimizing Share Size in Efficient and Robust Secret Sharing Scheme for Big Data
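The lightweight XOR computation the scheme leverages can be illustrated with the simplest possible construction, n-of-n XOR sharing, where all shares are needed and reconstruction is a single XOR pass. This is a deliberately simplified stand-in, not the paper's Slepian-Wolf-coding-based threshold scheme.

```python
import os

def split(secret: bytes, n: int):
    # n-of-n XOR sharing: n-1 uniformly random shares, plus one final share
    # chosen so that the XOR of all n shares equals the secret.
    shares = [os.urandom(len(secret)) for _ in range(n - 1)]
    last = bytes(secret)
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    return shares + [last]

def combine(shares):
    # Reconstruction is just XOR-folding every share together.
    out = bytes(len(shares[0]))
    for s in shares:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

shares = split(b"master-key", 4)
assert combine(shares) == b"master-key"
```

Any proper subset of shares is information-theoretically independent of the secret, which is why each share on its own reveals nothing.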
In literature, the task of learning a support vector machine for large datasets has been performed by splitting the
dataset into manageable sized “partitions” and training a sequential support vector machine on each of these
partitions separately to obtain local support vectors. However, this process invariably leads to the loss in
classification accuracy as global support vectors may not have been chosen as local support vectors in their
respective partitions. We hypothesize that retaining the original distribution of the dataset in each of the
partitions can help solve this issue. Hence, we present DiP-SVM, a distribution preserving kernel support vector
machine where the first and second order statistics of the entire dataset are retained in each of the partitions.
This helps in obtaining local decision boundaries which are in agreement with the global decision boundary,
thereby reducing the chance of missing important global support vectors. We show that DiP-SVM achieves a
minimal loss in classification accuracy among other distributed support vector machine techniques on several
benchmark datasets. We further demonstrate that our approach reduces communication overhead between
partitions leading to faster execution on large datasets and making it suitable for implementation in cloud
environments.
ETPL BD -
029
DiP-SVM: Distribution Preserving Kernel Support Vector Machine for Big Data
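The distribution-preserving partitioning idea can be sketched with a simple heuristic: sort the data and deal it round-robin, so every partition spans the full range and roughly keeps the global first- and second-order statistics. The paper's actual construction is more general; this sketch only illustrates why such partitions agree better with the global picture than contiguous splits do.

```python
def dip_partition(points, k):
    # Sort by value and deal round-robin: each partition then samples the
    # whole range, so its mean/variance approximate the global ones
    # (contrast with a contiguous split, whose partition means diverge).
    parts = [[] for _ in range(k)]
    for i, p in enumerate(sorted(points)):
        parts[i % k].append(p)
    return parts

data = list(range(100))                      # global mean is 49.5
parts = dip_partition(data, 4)
means = [sum(p) / len(p) for p in parts]     # each partition mean is close to 49.5
```

A naive contiguous split of the same data would yield partition means of 12, 37, 62, and 87, far from the global mean, which is exactly how local support vectors drift from global ones.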
Increasingly popular big data applications bring invaluable information, but also challenges to the
industrial community and academia. Cloud computing with unlimited resources seems to be the way out.
However, this panacea cannot play its role if we do not arrange fine-grained allocation of cloud infrastructure
resources. In this paper, we present a multi-objective optimization algorithm to trade off the performance,
availability, and cost of big data applications running on the cloud. After analyzing and modeling the interlaced
relations among these objectives, we design and implement our approach in an experimental environment. Finally, three sets of
experiments show that our approach can run about 20% faster than traditional optimization approaches, and can
achieve about 15% higher performance than other heuristic algorithms, while saving 4% to 20% cost.
ETPL BD -
030
Cloud Infrastructure Resource Allocation for Big Data Applications
Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification.
The approaches proposed so far for FDT learning, however, have generally neglected time and space
requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce
programming model for generating both binary and multi-way FDTs from big data. The scheme relies on a novel
distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy
information entropy. The fuzzy partitions are therefore used as input to the FDT learning algorithm, which
employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT
learning scheme on the Apache Spark framework. We have used ten real-world publicly available big datasets
for evaluating the behavior of the scheme along three dimensions: i) performance in terms of classification
accuracy, model complexity and execution time, ii) scalability varying the number of computing units and iii)
ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme
turns out to be suitable for managing big datasets even with modest commodity hardware support. Finally, we
have used the distributed decision tree learning algorithm implemented in the MLlib library and the
Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative
analysis.
ETPL BD -
031
On Distributed Fuzzy Decision Trees for Big Data
The k-nearest neighbors (k-NN) query is a fundamental primitive in spatial and multimedia databases. It has
extensive applications in location-based services, classification, clustering, and so on. With the promise of
confidentiality and privacy, massive data are increasingly outsourced to the cloud in encrypted form to enjoy
the advantages of cloud computing (e.g., reduced storage and query processing costs). Recently, many schemes
have been proposed to support k-NN queries on encrypted cloud data. However, prior works have all assumed
that the query users (QUs) are fully trusted and know the key of the data owner (DO), which is used to encrypt
and decrypt outsourced data. These assumptions are unrealistic in many situations, since many users are neither
trusted nor in possession of the key. In this paper, we propose a novel scheme for secure k-NN query on encrypted
cloud data with multiple keys, in which the DO and each QU hold their own different keys and do not share
them with each other; meanwhile, the DO encrypts and decrypts outsourced data using his own key. Our
scheme is constructed from a distributed two-trapdoor public-key cryptosystem (DT-PKC) and a set of protocols
of secure two-party computation, which not only preserves the data confidentiality and query privacy but also
supports the offline data owner. Our extensive theoretical and experimental evaluations demonstrate the
effectiveness of our scheme in terms of security and performance.
ETPL BD -
032
Secure k-NN Query on Encrypted Cloud Data with Multiple Keys
Today, cloud storage has become one of the critical services, because users can easily modify and share data with
others in the cloud. However, the integrity of shared cloud data is vulnerable to inevitable hardware faults, software
failures or human errors. To ensure the integrity of the shared data, some schemes have been designed to allow
public verifiers (i.e., third party auditors) to efficiently audit data integrity without retrieving the entire users’
data from cloud. Unfortunately, public auditing on the integrity of shared data may reveal data owners’ sensitive
information to the third party auditor. In this paper, we propose a new privacy-aware public auditing mechanism
for shared cloud data by constructing a homomorphic verifiable group signature. Unlike the existing solutions,
our scheme requires at least t group managers to recover a trace key cooperatively, which eliminates the abuse
of single-authority power and provides nonframeability. Moreover, our scheme ensures that group users can
trace data changes through a designated binary tree; and can recover the latest correct data block when the current
data block is damaged. In addition, the formal security analysis and experimental results indicate that our scheme
is provably secure and efficient.
ETPL BD -
033
NPP: A New Privacy-Aware Public Auditing Scheme for Cloud Data Sharing with Group
Users
As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern
applications involve heavy batch processing jobs over large volumes of data and at the same time require
efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of
these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need
to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on
investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an
interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and
transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It
is tailored to modern workflows found in machine learning and data exploration use cases, which often involve
iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing
window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata
that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB
achieves very good performance for a wide range of ad-hoc queries compared to alternatives.
ETPL BD -
034
DiNoDB: an Interactive-speed Query Engine for Ad-hoc Queries on Temporary Data
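The metadata-piggybacking idea above can be sketched concretely: while the batch phase writes data blocks, record cheap per-block min/max statistics, and let interactive queries use them to skip blocks that cannot match. This is an illustrative sketch of the general technique (often called zone maps), not DiNoDB's actual metadata format.

```python
def write_blocks(rows, block_size):
    # Batch phase: store rows in fixed-size blocks and piggyback per-block
    # (min, max) metadata as the blocks are written.
    blocks, meta = [], []
    for i in range(0, len(rows), block_size):
        b = rows[i:i + block_size]
        blocks.append(b)
        meta.append((min(b), max(b)))
    return blocks, meta

def query(blocks, meta, lo, hi):
    # Interactive phase: skip any block whose [min, max] cannot overlap
    # the query range [lo, hi]; scan only the rest.
    scanned = 0
    out = []
    for b, (mn, mx) in zip(blocks, meta):
        if mx < lo or mn > hi:
            continue
        scanned += 1
        out.extend(r for r in b if lo <= r <= hi)
    return out, scanned

blocks, meta = write_blocks(list(range(1000)), 100)
hits, scanned = query(blocks, meta, 250, 260)
```

For this selective range, only one of the ten blocks is actually scanned; the metadata cost was amortized into the batch pass that produced the data anyway.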
In the environment of cloud computing, the data produced by massive users form a data stream and need to be
protected by encryption to maintain confidentiality. Traditional serial encryption algorithms perform poorly
and consume excessive energy because they do not consider the properties of streams. Therefore, we propose a
velocity-aware parallel encryption algorithm with low energy consumption (LECPAES) for streams in cloud
computing. The algorithm parallelizes Advanced Encryption Standard (AES) based on heterogeneous many-
core architecture, adopts a sliding window to stabilize burst flows, senses the velocity of streams using the
thresholds of the window computed by frequency ratios, and dynamically scales the frequency of Graphics
Processing Units (GPUs) to lower energy consumption. The experiments for streams at different velocities
and the comparisons with other related algorithms show that the algorithm can reduce energy consumption, but
only slightly increases retransmission rate and slightly decreases throughput. Therefore, LECPAES is an
excellent algorithm for fast and energy-saving stream encryption.
ETPL BD -
035
Velocity-Aware Parallel Encryption Algorithm with Low Energy Consumption for Streams
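The velocity-sensing control loop described above can be sketched as a sliding window over recent arrival counts whose average rate is compared against thresholds to pick a frequency level. The window size, threshold values, and three-level policy here are illustrative assumptions, not the paper's computed thresholds.

```python
from collections import deque

class VelocityScaler:
    # Sliding window of recent per-tick arrival counts; the window's average
    # rate is compared to low/high thresholds to choose a frequency level
    # (standing in for dynamic GPU frequency scaling).
    def __init__(self, window=5, low=10.0, high=50.0):
        self.counts = deque(maxlen=window)
        self.low, self.high = low, high

    def observe(self, items_this_tick):
        self.counts.append(items_this_tick)
        rate = sum(self.counts) / len(self.counts)
        if rate < self.low:
            return "low-freq"
        if rate > self.high:
            return "max-freq"
        return "mid-freq"
```

The window smooths burst flows: a single spike barely moves the average, while a sustained burst pushes the rate past the high threshold and raises the frequency.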
Many data owners are required to release their data in a variety of real-world applications, since it is of
vital importance to discover the valuable information hidden behind the data. However, existing re-identification
attacks on the AOL and ADULTS datasets have shown that publishing such data directly may pose tremendous
threats to individual privacy. Thus, it is urgent to resolve all kinds of re-identification risks by recommending
effective de-identification policies that guarantee both the privacy and utility of the data. De-identification policies
are one of the models that can be used to achieve such requirements; however, the number of de-identification
policies is exponentially large due to the broad domain of quasi-identifier attributes. To better control the trade-off
between data utility and data privacy, skyline computation can be used to select such policies, but efficient
skyline processing over a large number of policies remains challenging. In this paper, we propose a parallel
algorithm called SKY-FILTER-MR, based on MapReduce, which overcomes this challenge by computing
skylines over large-scale de-identification policies represented as bit-strings. To further improve
performance, a novel approximate skyline computation scheme is proposed to prune unqualified policies using
an approximate domination relationship. With approximate skylines, the filtering power in the policy-space
generation stage is greatly strengthened, effectively decreasing the cost of skyline computation over alternative
policies. Extensive experiments on both real-life and synthetic datasets demonstrate that our proposed
SKY-FILTER-MR algorithm substantially outperforms the baseline approach, running up to four times faster
in the optimal case, which indicates good scalability over large policy sets.
ETPL BD -
036
Efficient Recommendation of De-identification Policies using MapReduce
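The skyline primitive at the heart of the approach is easy to state: among candidate policies scored on several objectives, keep only those not dominated by any other. A minimal sequential sketch (the paper's contribution is doing this at scale with MapReduce; the scores below are made up for illustration):

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better
    # in at least one (here both privacy and utility are to be maximized).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(policies):
    # Keep only the non-dominated policies (naive O(n^2) scan).
    return [p for p in policies
            if not any(dominates(q, p) for q in policies if q != p)]

# (privacy, utility) scores of candidate de-identification policies:
policies = [(9, 2), (7, 5), (4, 8), (3, 3), (6, 5), (2, 9)]
best = skyline(policies)
```

Policies like (3, 3) and (6, 5) drop out because (7, 5) is at least as good on both axes; what remains is exactly the privacy-utility trade-off frontier a data owner would choose from.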
The sudden and spontaneous occurrence of epileptic seizures can impose a significant burden on patients with
epilepsy. If seizure onset can be prospectively predicted, it could greatly improve the life of patients with
epilepsy and also open new therapeutic avenues for epilepsy treatment. However, discovering effective
predictive patterns from massive brainwave signals is still a challenging problem. The prediction of epileptic
seizures is still in its early stage. Most existing studies actually investigated the predictability of seizures offline
instead of a truly prospective online prediction, and also the high inter-individual variability was not fully
considered in prediction. In this study, we propose a novel adaptive pattern learning framework with a new
online feature extraction approach to achieve personalized online prospective seizure prediction. In particular, a
two-level online feature extraction approach is applied to monitor intracranial electroencephalogram (EEG)
signals and construct a pattern library incrementally. Three prediction rules were developed and evaluated based
on the continuously-updated patient-specific pattern library for each patient, including the adaptive probabilistic
prediction (APP), adaptive linear-discriminant-analysis-based prediction (ALP), and adaptive naive-Bayes-
based prediction (ANBP). The proposed online pattern learning and prediction system achieved impressive
prediction results for 10 patients with epilepsy using long-term EEG recordings. The best testing prediction
accuracies averaged over the 10 patients were 79%, 78%, and 82% for the APP, ALP, and ANBP prediction
schemes, respectively.
ETPL BD -
037
An Adaptive Pattern Learning Framework to Personalize Online Seizure Prediction
Processing big data in the cloud is on the increase. An important issue for efficient execution of big data processing
jobs on a cloud platform is selecting the best fitting virtual machine (VM) configuration(s) among the miscellany
of choices that cloud providers offer. Wise selection of VM configurations can lead to better performance, cost
and energy consumption. Therefore, it is crucial to explore the available configurations and opt for the best ones
that well suit each MapReduce application. Profiling the given application on all the configurations is costly,
time and energy consuming. An alternative is to run the application on a subset of configurations (sample
configurations) and estimate its performance on other configurations based on the obtained values by sample
configurations. We show that the choice of these sample configurations highly affects accuracy of later
estimations. Our Smart Configuration Selection (SCS) scheme chooses better representatives from among all
configurations by once-off analysis of given performance figures of the benchmarks so as to increase the
accuracy of estimations of missing values, and consequently, to more accurately choose the configuration
providing the highest performance. The results show that the SCS choice of sample configurations is very close
to the best choice, and can reduce estimation error to 11.58% from the original 19.72% of random configuration
selection. More importantly, using SCS estimations in a makespan minimization algorithm improves the
execution time by up to 36.03% compared with random sample selection.
ETPL BD -
038
Faster MapReduce Computation on Clouds through Better Performance Estimation
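The estimation step, predicting an application's runtime on unprofiled configurations from a few sample runs, can be sketched with a simple heuristic: scale from the sample configuration whose benchmark profile is closest to the target's. This is an illustrative stand-in, not the SCS algorithm itself; the configuration names and timings are invented.

```python
def estimate(app_times, bench, target):
    # app_times: the application's measured runtimes on sample configs only.
    # bench: benchmark runtimes, known in advance for every config.
    # Pick the sample whose benchmark time is nearest the target's, then
    # scale its measured runtime by the benchmark ratio.
    nearest = min(app_times, key=lambda c: abs(bench[c] - bench[target]))
    return app_times[nearest] * bench[target] / bench[nearest]

bench = {"small": 40.0, "medium": 20.0, "large": 10.0, "xlarge": 6.0}
app_times = {"small": 120.0, "large": 31.0}   # profiled on two sample configs
est = estimate(app_times, bench, "medium")
```

The choice of which configurations to profile clearly matters here: samples whose benchmark profiles bracket the rest of the space give better scaled estimates, which is the intuition SCS builds on.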
In this paper, we consider the problem of mutual privacy-protection in social participatory sensing in which
individuals contribute their private information to build a (virtual) community. Particularly, we propose a mutual
privacy preserving k-means clustering scheme that neither discloses individuals' private information nor leaks
the community’s characteristic data (clusters). Our scheme contains two privacy-preserving algorithms called
at each iteration of the k-means clustering. The first one is employed by each participant to find the nearest
cluster while the cluster centers are kept secret from the participants; and the second one computes the cluster
centers without leaking any cluster center information to the participants while preventing each participant from
figuring out other members in the same cluster. An extensive performance analysis is carried out to show that
our approach is effective for k-means clustering, can resist collusion attacks, and can provide mutual privacy
protection even when the data analyst colludes with all except one participant.
ETPL BD -
039
Mutual Privacy Preserving k-Means Clustering in Social Participatory Sensing
The proliferation of private clouds that are often underutilized and the tremendous computational potential of
these clouds when combined have recently brought forth the idea of volunteer cloud computing (VCC), a
computing model where cloud owners contribute underutilized computing and/or storage resources on their
clouds to support the execution of applications of other members in the community. This model is particularly
suitable to solve big data scientific problems. Scientists in data-intensive scientific fields increasingly recognize
that sharing volunteered resources from several clouds is a cost-effective alternative to solve many complex,
data- and/or compute-intensive science problems. Despite the promise of the idea of VCC, it still remains at the
vision stage at best. Challenges include the heterogeneity and autonomy of member clouds, access control and
security, complex inter-cloud virtual machine scheduling, etc. In this paper, we present CloudFinder, a system
that supports the efficient execution of big data workloads on volunteered federated clouds (VFCs). Our
evaluation of the system indicates that VFCs are a promising cost-effective approach to enable big data science.
ETPL BD -
040
CloudFinder: A System for Processing Big Data Workloads on Volunteered
Federated Clouds
To leverage big data for enhanced strategic insight, process optimization, and informed
decision-making, we need an efficient access control mechanism for ensuring the end-to-end security of such
information assets. Signcryption is one of several promising techniques to simultaneously achieve big data
confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke
users from a large-scale system efficiently. We put forward, in this paper, the first identity-based (ID-based)
signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure
big data communications between data collectors and data analytical system(s). Our scheme is designed to
achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while
providing scalable revocation functionality such that the overhead demanded by the private key generator (PKG)
in the key-update phase only increases logarithmically with the cardinality of users. Although in our scheme
the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not
affect the security of the proposed scheme. We then prove the security of our scheme and demonstrate
its utility using simulations.
ETPL BD -
042
Revocable Identity-Based Access Control for Big Data with Verifiable Outsourced
Computing
Big sensing data is prevalent in both industry and scientific research applications where the data is generated
with high volume and velocity. Cloud computing provides a promising platform for big sensing data processing
and storage as it provides a flexible stack of massive computing, storage, and software services in a scalable
manner. Current big sensing data processing on the cloud has adopted some data compression techniques.
However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack
sufficient efficiency and scalability for data processing. Based on specific on-Cloud data compression
requirements, we propose a novel scalable data compression approach based on calculating similarity among
the partitioned data chunks. Instead of compressing basic data units, the compression will be conducted over
partitioned data chunks. To restore original data sets, some restoration functions and predictions will be
designed. MapReduce is used for algorithm implementation to achieve extra scalability on Cloud. With real
world meteorological big sensing data experiments on the U-Cloud platform, we demonstrate that the proposed
scalable compression approach based on data chunk similarity can significantly improve data compression
efficiency with affordable data accuracy loss.
ETPL BD -
041
A Scalable Data Chunk Similarity Based Compression Approach for Efficient Big
Sensing Data Processing on Cloud
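The chunk-similarity compression idea can be sketched as follows: partition the stream into chunks and, when a chunk is similar to one already stored, keep only a back-reference. Restoration then reproduces each referenced chunk, losing at most the similarity tolerance per value (the "affordable accuracy loss"). The pointwise similarity test and tolerance are illustrative choices, not the paper's similarity model.

```python
def similar(a, b, tol=1.0):
    # Two numeric chunks are "similar" if they agree pointwise within tol.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

def compress(stream, chunk_size, tol=1.0):
    # Store each chunk either literally, or as a reference to an earlier
    # stored chunk it resembles.
    chunks = [stream[i:i + chunk_size] for i in range(0, len(stream), chunk_size)]
    stored, out = [], []
    for c in chunks:
        for j, s in enumerate(stored):
            if similar(c, s, tol):
                out.append(("ref", j))
                break
        else:
            out.append(("lit", len(stored)))
            stored.append(c)
    return stored, out

def restore(stored, out):
    # Rebuild the stream; referenced chunks come back as their stored twin.
    result = []
    for kind, j in out:
        result.extend(stored[j])
    return result

readings = [20.0, 20.1, 20.0, 20.1, 25.0, 25.2, 20.05, 20.0]
stored, out = compress(readings, chunk_size=2, tol=0.5)
```

Here four chunks compress to two stored chunks plus two references; the restored stream deviates from the original by no more than the tolerance.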
Intervals have become prominent in data management as they are the main data structure to represent a number
of key data types such as temporal or genomic data. Yet, there exists no solution to compactly store and
efficiently query big interval data. In this paper we introduce CINTIA—the Checkpoint INTerval Index Array—
an efficient data structure to store and query interval data, which achieves high memory locality and outperforms
state-of-the-art solutions. We also propose a low-latency big data system that implements CINTIA on top of a
popular distributed file system and efficiently manages large interval data on clusters of commodity machines.
Our system can easily be scaled-out and was designed to accommodate large delays between the various
components of a distributed infrastructure. We experimentally evaluate the performance of our approach on
several datasets and show that it outperforms current solutions by several orders of magnitude in distributed
settings.
ETPL BD -
043
Managing Big Interval Data with CINTIA: the Checkpoint INTerval Array
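The checkpointing idea behind such an interval index can be sketched in miniature: sort interval endpoint events, snapshot the set of active intervals every few events, and answer a stabbing query by jumping to the last checkpoint before the query time and replaying only the remaining events. This is a much simplified sketch of the checkpoint concept, not CINTIA's actual array layout; it assumes half-open [start, end) intervals.

```python
import bisect

def build(intervals, every=4):
    # Sort endpoint events; after every `every` events, snapshot the set of
    # currently active interval ids as a checkpoint.
    events = sorted([(s, "start", i) for i, (s, e) in enumerate(intervals)] +
                    [(e, "end", i) for i, (s, e) in enumerate(intervals)])
    checkpoints = [(float("-inf"), frozenset(), 0)]
    active = set()
    for n, (t, kind, i) in enumerate(events, 1):
        if kind == "start":
            active.add(i)
        else:
            active.discard(i)
        if n % every == 0:
            checkpoints.append((t, frozenset(active), n))
    return events, checkpoints

def stab(events, checkpoints, t):
    # Intervals containing time t: binary-search the last checkpoint at or
    # before t, then replay only the few events between it and t.
    times = [c[0] for c in checkpoints]
    _, snap, n = checkpoints[bisect.bisect_right(times, t) - 1]
    active = set(snap)
    for et, kind, i in events[n:]:
        if et > t:
            break
        if kind == "start":
            active.add(i)
        else:
            active.discard(i)
    return active

intervals = [(1, 5), (2, 8), (6, 9), (3, 4), (7, 10), (0, 2)]
events, checkpoints = build(intervals, every=4)
```

Each query replays at most `every` events past a checkpoint, so query cost stays bounded while the checkpoints remain compact and memory-local.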
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of
developing big data programs and applications. However, the jobs in these frameworks are roughly defined and
packaged as executable jars without any functionality being exposed or described. This means that deployed
jobs are not natively composable and reusable for subsequent development. It also hampers the ability
to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the
Hierarchically Distributed Data Matrix (HDM) which is a functional, strongly-typed data representation for
writing composable big data applications. Along with HDM, a runtime framework is provided to support the
execution, integration and management of HDM applications on distributed infrastructures. Based on the
functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of
executing HDM jobs. The experimental results show that our optimizations can achieve improvements of between
10% and 40% in job completion time for different types of applications when compared with the current
state of the art, Apache Spark.
ETPL BD -
044
HDM: A Composable Framework for Big Data Processing
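One concrete payoff of a functional data representation is that the data-flow graph can be rewritten before execution, for example fusing consecutive map stages so the data is traversed once. A minimal sketch of that rewrite (illustrative only; HDM's planner operates on typed dependency graphs, not Python lists):

```python
def fuse(ops):
    # Fuse consecutive ("map", f) stages into a single composed map,
    # one of the rewrites a functional data-flow representation enables.
    out = []
    for kind, f in ops:
        if kind == "map" and out and out[-1][0] == "map":
            g = out[-1][1]
            out[-1] = ("map", lambda x, g=g, f=f: f(g(x)))
        else:
            out.append((kind, f))
    return out

def run(ops, data):
    # Naive interpreter: one pass over the data per remaining stage.
    for kind, f in ops:
        if kind == "map":
            data = [f(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if f(x)]
    return data

pipeline = [("map", lambda x: x + 1), ("map", lambda x: x * 2),
            ("filter", lambda x: x > 4), ("map", lambda x: x - 1)]
fused = fuse(pipeline)
```

The fused plan has one stage fewer but produces identical results, which is exactly the kind of semantics-preserving optimization opaque executable jars cannot receive.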
In this paper, we propose a general model to address the overfitting problem in online similarity learning for big
data, which is generally caused by two kinds of redundancy: 1) feature redundancy, that is, there exist
redundant (irrelevant) features in the training data; and 2) rank redundancy, that is, non-redundant (or relevant)
features lie in a low-rank space. To overcome these, our model is designed to obtain a simple and robust metric
matrix through detecting the redundant rows and columns in the metric matrix and constraining the remaining
matrix to a low-rank space. To reduce feature redundancy, we employ the group sparsity regularization, i.e., the
ℓ2,1 norm, to encourage a sparse feature set. To address rank redundancy, we adopt the low-rank regularization,
the max norm, instead of calculating the SVD as in traditional models using the nuclear norm. Therefore, our
model can not only generate a low-rank metric matrix to avoid overfitting but also achieve feature selection
simultaneously. For model optimization, an online algorithm based on the stochastic proximal method is derived
to solve this problem efficiently with a complexity of O(d²). To validate the effectiveness and efficiency of
our algorithms, we apply our model to online scene categorization and synthesized data and conduct experiments
on various benchmark datasets with comparisons to several state-of-the-art methods. Our model is as efficient
as the fastest online similarity learning model OASIS, while performing generally as well as the accurate model
OMLLR. Moreover, our model can exclude irrelevant/redundant feature dimensions simultaneously.
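The ℓ2,1 regularizer mentioned above has a closed-form proximal operator, which is what makes a stochastic proximal update cheap. A minimal NumPy sketch (not the authors' implementation) of the row-wise group soft-thresholding step:

```python
import numpy as np

def prox_l21(M, lam):
    """Proximal operator of lam * ||M||_{2,1}: shrink each row of the metric
    matrix by its l2 norm; rows with norm <= lam are zeroed entirely,
    which is how whole redundant feature dimensions get excluded."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return M * scale

M = np.array([[3.0, 4.0],    # row norm 5.0  -> shrunk, kept
              [0.1, 0.1]])   # row norm ~0.14 -> zeroed out
print(prox_l21(M, 1.0))      # [[2.4, 3.2], [0.0, 0.0]]
```

In an online setting, each iteration would take a stochastic gradient step on the similarity loss and then apply `prox_l21` (and the max-norm projection) to the updated matrix.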
ETPL BD -
045
Online Similarity Learning for Big Data with Overfitting
In the last decade Digital Forensics has experienced several issues when dealing with network evidence.
Collecting network evidence is difficult due to its volatility. In fact, such information may change over time,
may be stored on a server outside the jurisdiction, or may be geographically far from the crime scene. On the other hand, the
explosion of Cloud Computing as an implementation of the Software as a Service (SaaS) paradigm is
pushing users toward remote data repositories such as Dropbox, Amazon Cloud Drive, Apple iCloud, Google
Drive, and Microsoft OneDrive. In this paper, a novel methodology for the collection of network
evidence is proposed. In particular, it focuses on the collection of information from online services, such as web pages,
chats, documents, photos and videos. The methodology is suitable for both expert and non-expert analysts as it
“drives” the user through the whole acquisition process. During the acquisition, the information received from
the remote source is automatically collected. It includes not only network packets, but also any information
produced by the client upon its interpretation (such as video and audio output). A trusted-third-party, acting as
a digital notary, is introduced in order to certify both the acquired evidence (i.e., the information obtained from
the remote service) and the acquisition process (i.e., all the activities performed by the analysts to retrieve it). A
proof-of-concept prototype, called LINEA, has been implemented to perform an experimental evaluation of the
methodology.
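The certification role of the trusted third party could, for instance, hash each acquired artifact and chain the digests so that later tampering with any item is detectable. The following is a hypothetical sketch only (`certify` and its record layout are invented names, not LINEA's actual implementation):

```python
import hashlib
import json
import time

def certify(artifacts: dict) -> dict:
    """Hypothetical notary record: per-artifact SHA-256 digests plus a
    running hash chain over them, so altering or reordering any acquired
    item after the fact would break the chain."""
    record, chain = [], b""
    for name, blob in sorted(artifacts.items()):
        digest = hashlib.sha256(blob).hexdigest()
        chain = hashlib.sha256(chain + digest.encode()).digest()
        record.append({"artifact": name, "sha256": digest})
    return {"items": record, "chain": chain.hex(), "timestamp": time.time()}

rec = certify({"page.html": b"<html>...</html>", "chat.log": b"hello"})
print(json.dumps(rec["items"], indent=2))
```

A real notary would additionally sign the record and anchor the timestamp, so that both the evidence and the acquisition process are certified, as the methodology requires.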
ETPL BD -
046
A Novel Methodology to Acquire Live Big Data Evidence from the Cloud
Lifestyles are a valuable model for understanding individuals’ physical and mental lives, comparing social
groups, and making recommendations for improving people's lives. In this paper, we examine and compare
lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a
large-scale, low-cost alternative to traditional survey methods. We use the Greater New York City area as a
representative for large cities, and the Greater Rochester area as a representative for smaller cities in the United
States. We employed matrix factor analysis as an unsupervised method to extract salient mobility and work-rest
patterns for a large population of users within each metropolitan area. We discovered interesting human behavior
patterns at both a larger scale and a finer granularity than is present in previous literature, some of which allow
us to quantitatively compare the behaviors of individuals living in big cities to those living in small cities. We
believe that our social media-based approach to lifestyle analysis represents a powerful tool for social computing
in the big data age.
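Matrix factor analysis of this kind can be sketched with non-negative matrix factorization on a toy user-by-hour activity matrix (an illustrative stand-in for the paper's actual method and data): each factor row becomes one salient daily rhythm, and each user's loadings say how strongly they follow it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy activity matrix: 50 users x 24 hours of posting counts.
V = rng.poisson(lam=2.0, size=(50, 24)).astype(float)

def nmf(V, k=3, iters=200, eps=1e-9):
    """Multiplicative-update NMF: V ~= W @ H, with rows of H as latent
    daily activity patterns and W as per-user loadings on them."""
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        # Standard Lee-Seung updates; both factors stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V)
print(H.shape)  # each of the 3 rows is one extracted daily rhythm
```

Comparing the extracted `H` patterns between two metropolitan areas is then a direct, quantitative way to contrast lifestyles, in the spirit of the study above.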
ETPL BD -
047
Tales of Two Cities: Using Social Media to Understand Idiosyncratic Lifestyles in
Distinctive Metropolitan Areas
In Big Data era, applications are generating orders of magnitude more data in both volume and quantity. While
many systems emerge to address such data explosion, the fact that these data’s descriptors, i.e., metadata, are
also “big” is often overlooked. The conventional approach to address the big metadata issue is to disperse
metadata across multiple machines. However, it is extremely difficult to preserve both load balance and data
locality in this approach. To this end, in this work we propose hierarchical indirection layers for indexing the
underlying distributed metadata. By doing this, data locality is achieved efficiently through the indirection while load
balance is preserved. Three key challenges exist in this approach, however: first, how to achieve high resilience;
second, how to ensure flexible granularity; and third, how to limit performance overhead. To address the above
challenges, we design Dindex, a distributed indexing service for metadata. Dindex incorporates a hierarchy of
coarse-grained aggregation and horizontal key-coalition. Theoretical analysis shows that the overhead of
building Dindex is compensated for by only two or three queries. Dindex has been implemented on a lightweight
distributed key-value store and integrated into a fully-fledged distributed filesystem. Experiments demonstrated
that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.
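The indirection idea can be illustrated with a toy sketch (invented names, not Dindex's actual design): metadata entries stay hash-partitioned for load balance, while a coarse-grained layer records which shards hold a directory's children, so a directory scan contacts only those shards instead of broadcasting to every server.

```python
import hashlib

SERVERS = 4

def shard(key: str) -> int:
    """Hash partitioning balances load but scatters a directory's entries
    across servers, destroying locality for directory-level queries."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SERVERS

class IndirectionIndex:
    """Coarse-grained indirection layer: one aggregate entry per directory
    recording the set of shards holding its children."""
    def __init__(self):
        self.dir_to_shards = {}

    def insert(self, path: str):
        parent = path.rsplit("/", 1)[0] or "/"
        self.dir_to_shards.setdefault(parent, set()).add(shard(path))

    def shards_for(self, directory: str) -> set:
        # A scan of `directory` only needs to contact these shards.
        return self.dir_to_shards.get(directory, set())

idx = IndirectionIndex()
for p in ["/a/x", "/a/y", "/b/z"]:
    idx.insert(p)
print(idx.shards_for("/a"))  # a subset of the 4 shards, not all of them
```

The aggregate entry is small and cheap to maintain on insert, which matches the abstract's claim that the indexing overhead is amortized after only a few queries.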
ETPL BD -
048
Toward Efficient and Flexible Metadata Indexing of Big Data System
Image categorisation is an active yet challenging research topic in computer vision, which is to classify the
images according to their semantic content. Recently, fine-grained object categorisation has attracted wide
attention and remains difficult due to feature inconsistency caused by smaller inter-class and larger intra-class
variation as well as widely varying poses. Most existing frameworks focus on exploiting a more
discriminative imagery representation or developing a more robust classification framework to mitigate these
issues. Attention has recently been paid to discovering the dependency across fine-grained class labels
based on Convolutional Neural Networks. Encouraged by the success of semantic label embedding to discover
the fine-grained class labels’ correlation, this paper exploits the misalignment between visual feature space and
semantic label embedding space and incorporates it as a privileged information into a cost-sensitive learning
framework. Because it captures both the variation of the imagery feature representation and the label correlation
in the semantic label embedding space, such a visual-semantic misalignment can be employed to reflect the
importance of instances, which is more informative than conventional cost sensitivities. Experimental results
demonstrate the effectiveness of the proposed framework on public fine-grained benchmarks, achieving
performance superior to the state of the art.
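One plausible reading of the misalignment-as-cost idea can be sketched as follows (an illustrative toy, not the paper's exact formulation; `P` is an assumed learned visual-to-semantic projection and all data are synthetic): instances whose projected visual features land far from their label's embedding get higher cost weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d_vis, d_sem, n_cls, n = 8, 4, 10, 6
X = rng.normal(size=(n, d_vis))       # visual features (synthetic)
E = rng.normal(size=(n_cls, d_sem))   # semantic label embeddings (synthetic)
y = rng.integers(0, n_cls, size=n)    # class labels
P = rng.normal(size=(d_vis, d_sem))   # assumed learned projection

# Visual-semantic misalignment: distance between each projected feature
# and its own label's embedding; larger misalignment -> harder instance
# -> larger weight in a cost-sensitive loss.
mis = np.linalg.norm(X @ P - E[y], axis=1)
costs = mis / mis.sum()               # normalized per-instance cost weights
print(costs.round(3))
```

These weights would then multiply the per-instance loss terms of the classifier, which is the general shape of a cost-sensitive learning framework using misalignment as privileged information.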
ETPL BD -
049
Learning to Classify Fine-Grained Categories with Privileged Visual-Semantic
Misalignment