Elysium PRO Titles with Abstracts 2017-18
The k-nearest neighbors (k-NN) query is a fundamental primitive in spatial and multimedia databases. It has
extensive applications in location-based services, classification, clustering, and so on. With the promise of
confidentiality and privacy, massive data are increasingly outsourced to the cloud in encrypted form to enjoy
the advantages of cloud computing (e.g., reduced storage and query processing costs). Recently, many schemes
have been proposed to support k-NN queries on encrypted cloud data. However, prior works have all assumed
that the query users (QUs) are fully trusted and know the key of the data owner (DO), which is used to encrypt
and decrypt outsourced data. This assumption is unrealistic in many situations, since many users are neither
trusted nor in possession of the key. In this paper, we propose a novel scheme for secure k-NN query on encrypted
cloud data with multiple keys, in which the DO and each QU hold their own distinct keys and do not share
them with each other; meanwhile, the DO encrypts and decrypts outsourced data using his own key. Our
scheme is built on a distributed two-trapdoor public-key cryptosystem (DT-PKC) and a set of secure two-party
computation protocols, which not only preserve data confidentiality and query privacy but also
support an offline data owner. Our extensive theoretical and experimental evaluations demonstrate the
effectiveness of our scheme in terms of security and performance.
ETPL BD -
001
Secure k-NN Query on Encrypted Cloud Data with Multiple Keys
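As a plaintext point of reference (the function name and toy data below are illustrative, not taken from the paper), the k-NN primitive that the encrypted scheme protects can be sketched in a few lines:

```python
import heapq

def knn_query(points, q, k):
    """Return the k points nearest to query q by squared Euclidean distance."""
    dist2 = lambda p: sum((a - b) ** 2 for a, b in zip(p, q))
    return heapq.nsmallest(k, points, key=dist2)

pts = [(0, 0), (1, 1), (3, 4), (5, 5), (1, 0)]
print(knn_query(pts, (1, 1), 2))  # [(1, 1), (1, 0)]
```

The paper's contribution is performing this computation over ciphertexts under multiple keys, which this plaintext sketch deliberately omits.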
Advanced communications and data processing technologies bring great benefits to the smart grid. However,
cyber-security threats also extend from the information system to the smart grid. Existing security works for
the smart grid focus on traditional protection and detection methods. However, many threats occur within a very short
time and are overlooked by existing security components. These threats usually have a huge impact on the smart grid and
disturb its normal operation. Moreover, it is too late to take action against such threats once they are
detected, and the damage can be difficult to repair. To address this issue, this paper proposes a security situational
awareness mechanism based on the analysis of big data in the smart grid. A fuzzy-cluster-based analytical method,
game theory, and reinforcement learning are integrated seamlessly to perform the security situational analysis
for the smart grid. The simulation and experimental results show the advantages of our scheme in terms of high
efficiency and low error rate for security situational awareness.
ETPL BD -
002
Big Data Analysis based Security Situational Awareness for Smart Grid
Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing
in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems
are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class
of problems, e.g., search, clustering, log analysis, different types of join operations, matrix
multiplication, pattern matching, and social network analysis. However, all these popular systems have
a major drawback: they are restricted to locally distributed computation, which prevents them from
implementing geographically distributed data processing. The increasing amount of geographically distributed
massive data is pushing industry and academia to rethink current big-data processing systems.
Novel frameworks, going beyond the state-of-the-art architectures and technologies of current
systems, are expected to process geographically distributed data at its locations without
moving entire raw data sets to a single location. In this paper, we investigate and discuss the challenges and
requirements in designing geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream processing (Spark-based
systems), and SQL-style processing geo-distributed frameworks, models, and algorithms, along with their
overhead issues.
ETPL BD -
003
A Survey on Geographically Distributed Big-Data Processing using MapReduce
As urban populations grow, cities face many challenges related to transportation, resource consumption, and the
environment. Ride sharing has been proposed as an effective approach to reduce traffic congestion, gasoline
consumption, and pollution. However, despite great promise, researchers and policy makers lack adequate tools
to assess the tradeoffs and benefits of various ride-sharing strategies. In this paper, we propose a real-time, data-
driven simulation framework that supports the efficient analysis of taxi ride sharing. By modeling taxis and trips
as distinct entities, our framework is able to simulate a rich set of realistic scenarios. At the same time, by
providing a comprehensive set of parameters, we are able to study the taxi ride-sharing problem from different
angles, considering different stakeholders’ interests and constraints. To address the computational complexity
of the model, we describe a new optimization algorithm that is linear in the number of trips and makes use of an
efficient indexing scheme, which combined with parallelization, makes our approach scalable. We evaluate our
framework through a study that uses data about 360 million trips taken by 13,000 taxis in New York City during
2011 and 2012. We describe the findings of the study which demonstrate that our framework can provide insights
into strategies for implementing city-wide ride-sharing solutions. We also carry out a detailed performance
analysis which shows the efficiency of our approach.
ETPL BD -
004
STaRS: Simulating Taxi Ride Sharing at Scale
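One building block of such a simulator, grouping trips whose pickup points fall into the same cell of a spatial grid, might be sketched as follows (a toy stand-in for the paper's indexing scheme; names and data are illustrative):

```python
from collections import defaultdict

def grid_pairs(trips, cell=0.5):
    """Bucket trip pickup points into grid cells; trips sharing a cell
    are candidate ride-share matches (toy stand-in for a spatial index)."""
    buckets = defaultdict(list)
    for tid, (x, y) in trips.items():
        buckets[(int(x // cell), int(y // cell))].append(tid)
    return [ids for ids in buckets.values() if len(ids) > 1]

trips = {"a": (0.1, 0.1), "b": (0.2, 0.3), "c": (5.0, 5.0)}
print(grid_pairs(trips))  # [['a', 'b']]
```

A real matcher would also check time windows and detour constraints, which the paper's optimization algorithm handles in linear time over trips.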
Cloud storage, as one of the most important services of cloud computing, helps cloud users break the bottleneck
of restricted resources and expand their storage without upgrading their devices. To guarantee the
security and privacy of cloud users, data are always outsourced in encrypted form. However, encrypted data
can waste cloud storage and complicate data sharing among authorized users. We still face
challenges in encrypted data storage and management with deduplication. Traditional deduplication schemes
focus on specific application scenarios, in which deduplication is completely controlled by either
data owners or cloud servers. They cannot flexibly satisfy the varied demands of data owners according to the level
of data sensitivity. In this paper, we propose a heterogeneous data storage management scheme, which flexibly
offers both deduplication management and access control at the same time across multiple Cloud Service
Providers (CSPs). We evaluate its performance with security analysis, comparison, and implementation. The
results show its security, effectiveness, and efficiency towards potential practical usage.
ETPL BD -
005
Heterogeneous Data Storage Management with Deduplication in Cloud Computing
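The deduplication idea the scheme manages can be illustrated with a minimal content-addressed store (a plaintext sketch; the paper's scheme adds encryption and access control across CSPs):

```python
import hashlib

def dedup_store(blobs):
    """Store each distinct blob once, keyed by its SHA-256 digest.
    Toy content-addressing sketch: identical data hashes to the same key,
    so duplicates occupy no extra space."""
    store = {}
    for blob in blobs:
        store.setdefault(hashlib.sha256(blob).hexdigest(), blob)
    return store

store = dedup_store([b"report.pdf", b"photo.jpg", b"report.pdf"])
print(len(store))  # 2 -- the duplicate copy is not stored twice
```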
Increasingly popular big data applications bring invaluable information, but also challenges to
industry and academia. Cloud computing, with its seemingly unlimited resources, appears to be the way out.
However, this panacea cannot play its role unless cloud infrastructure resources are allocated carefully.
In this paper, we present a multi-objective optimization algorithm to trade off the performance, availability, and
cost of big data applications running on the cloud. After analyzing and modeling the interlaced relations among
these objectives, we design and implement our approach in an experimental environment. Finally, three sets of
experiments show that our approach runs about 20% faster than traditional optimization approaches and
achieves about 15% higher performance than other heuristic algorithms, while saving 4% to 20% in cost.
ETPL BD -
006
Cloud Infrastructure Resource Allocation for Big Data Applications
Privacy has become a considerable issue as big data applications grow dramatically in cloud
computing. The implementation of these emerging technologies has improved or changed
service models and improved application performance in various respects. However, the remarkable
growth in data volume has also created many challenges in practice. The execution time of data
encryption is one of the serious issues during data processing and transmission. Many current applications
abandon data encryption to reach an acceptable performance level, despite the accompanying privacy concerns.
In this paper, we concentrate on privacy and propose a novel data encryption approach called the Dynamic
Data Encryption Strategy (D2ES). Our proposed approach aims to selectively encrypt data, using privacy
classification methods under timing constraints. This approach is designed to maximize the scope of privacy
protection through a selective encryption strategy within the required execution time. The performance
of D2ES has been evaluated in our experiments, which provide proof of the privacy enhancement.
ETPL BD -
007
Privacy-Preserving Data Encryption Strategy for Big Data in Mobile Cloud
Computing
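The core idea of selective encryption under a timing constraint can be sketched as a greedy budgeted selection (an illustrative simplification, not the actual D2ES algorithm; the item names, weights, and costs below are assumed):

```python
def select_for_encryption(items, budget):
    """Greedy sketch of selective encryption under a time budget.
    items = [(name, privacy_weight, encrypt_cost)]; pick items with the
    highest privacy-per-cost ratio until the time budget is spent."""
    chosen, spent = [], 0.0
    for name, w, c in sorted(items, key=lambda t: t[1] / t[2], reverse=True):
        if spent + c <= budget:
            chosen.append(name)
            spent += c
    return chosen

items = [("ssn", 10, 1.0), ("logs", 2, 2.0), ("email", 5, 1.0)]
print(select_for_encryption(items, 2.0))  # ['ssn', 'email']
```

The sketch encrypts the most sensitive data first and leaves low-value data plaintext when time runs out, which is the trade-off the abstract describes.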
With the fast-growing demands of big data, we need to manage and store big data in the cloud. Since
the cloud is not fully trusted and can be accessed by any user, data in the cloud may face threats. In this
paper, we propose a secure authentication protocol for cloud big data with a hierarchical attribute authorization
structure. Our proposed protocol resorts to a tree-based signature to significantly improve the security of
attribute authorization. To satisfy big data requirements, we extend the proposed authentication protocol to
support multiple levels in the hierarchical attribute authorization structure. Security analysis shows that our
protocol can resist the forgery attack and the replay attack. In addition, our protocol preserves the entities' privacy.
Compared with previous studies, we show that our protocol has lower computational and
communication overhead.
ETPL BD -
008
Secure Authentication in Cloud Big Data with Hierarchical Attribute Authorization
Structure
With the growing amount of data, the demand for big data storage significantly increases. Through the cloud
center, data providers can conveniently share data stored in the center with others. However, one practically
important problem in big data storage is privacy. During the sharing process, data is encrypted to be confidential
and anonymous. Such operations can protect privacy from being leaked. To satisfy practical conditions,
data transmission with multiple receivers is also considered. Furthermore, this paper proposes the notion of pre-
authentication for the first time, i.e., only users whose attributes have already been authenticated can access the
re-encrypted data. The pre-authentication mechanism combines the advantages of the proxy conditional
re-encryption multi-sharing mechanism with the attribute-based authentication technique, thus achieving
attribute authentication before re-encryption and ensuring the security of the attributes and data. Moreover,
this paper finally proves that the system is secure and that the proposed pre-authentication mechanism can
significantly enhance the system security level.
ETPL BD -
009
A Pre-Authentication Approach to Proxy Re-encryption in Big Data Context
Privacy preservation is one of the greatest concerns in big data. As one of its extensive applications,
privacy-preserving data publication (PPDP) has become an important research field. One of the fundamental
challenges in PPDP is the trade-off between privacy and utility of a single, independent data set.
However, recent research has shown that the advanced privacy mechanism, i.e., differential privacy, is
vulnerable when multiple data sets are correlated. In this case, the trade-off problem between privacy and utility
evolves into a game problem, in which the payoff of each player depends on his and his neighbors' privacy
parameters. In this paper, we first present the definition of correlated differential privacy to evaluate the real
privacy level of a single data set influenced by other data sets. Then, we construct a game model of multiple
players, in which each player publishes a data set sanitized by differential privacy. Next, we analyze the existence
and uniqueness of the pure Nash equilibrium. We use the notion of the price of anarchy to evaluate the efficiency
of the pure Nash equilibrium. Finally, we show the correctness of our game analysis via simulation experiments.
ETPL BD -
010
Game Theory Based Correlated Privacy Preserving Analysis in Big Data
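The differential-privacy mechanism each player applies can be illustrated with the standard Laplace mechanism (a generic textbook sketch, not the paper's correlated variant; the function names and query values are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_release(true_count, sensitivity, epsilon, seed=0):
    """Epsilon-differentially-private release of a count query:
    add Laplace noise with scale = sensitivity / epsilon."""
    rng = random.Random(seed)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

print(dp_release(100, 1.0, 0.5))  # a noisy count near 100
```

In the game the paper studies, each player chooses its own epsilon, and correlation between data sets makes the effective privacy of one player depend on its neighbors' choices.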
As both the scale of mobile networks and the population of mobile users keep increasing, applications of
mobile social big data have emerged in which mobile social users can use their devices to exchange and
share content with each other. Security resources are needed to protect mobile social big data during
delivery. However, because security resources are limited, how to allocate them becomes a new
challenge. Therefore, in this paper we propose a joint matching-coalitional game based security-aware resource
allocation scheme to deliver mobile social big data. In the proposed scheme, first, a coalition game model is
introduced for base stations (BSs) to form groups that provide both wireless and security resources, improving
resource efficiency and profits. Second, a matching-theory-based model is employed to
determine the selection process between communities and the coalitions of BSs, so that mobile social users can
form communities and select the optimal coalition to obtain security resources. Third, a joint matching-coalition
algorithm is presented to obtain the stable security-aware resource allocation. Finally, simulation experiments
show that the proposed scheme outperforms other existing schemes.
ETPL BD -
011
Security-Aware Resource Allocation for Mobile Social Big Data: A Matching-
Coalitional Game Solution
Traffic speed is a key indicator for the efficiency of an urban transportation system. Accurate modeling of the
spatiotemporally varying traffic speed thus plays a crucial role in urban planning and development. This paper
addresses the problem of efficient fine-grained traffic speed prediction using big traffic data obtained from static
sensors. Gaussian processes (GPs) have been previously used to model various traffic phenomena, including
flow and speed. However, GPs do not scale with big traffic data due to their cubic time complexity. In this work,
we address their efficiency issues by proposing local GPs to learn from and make predictions for correlated
subsets of data. The main idea is to quickly group speed variables in both spatial and temporal dimensions into
a finite number of clusters, so that future and unobserved traffic speed queries can be heuristically mapped to
one of such clusters. A local GP corresponding to that cluster can then be trained on the fly to make predictions
in real-time. We call this method localization. We use non-negative matrix factorization for localization and
propose simple heuristics for cluster mapping. We additionally leverage the expressiveness of GP kernel
functions to model road network topology and incorporate side information. Extensive experiments using real-
world traffic data collected in the two U.S. cities of Pittsburgh and Washington, D.C., show that our proposed
local GPs significantly improve both runtime performances and prediction accuracies compared to the baseline
global and local GPs.
ETPL BD -
012
Local Gaussian Processes for Efficient Fine-Grained Traffic Speed Prediction
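The localization step, mapping a query to a cluster and answering with that cluster's local model, can be sketched as follows (here a cluster's mean speed stands in for a trained local GP; the data and names are illustrative):

```python
def nearest_cluster(centroids, q):
    """Map a query point to the index of its closest centroid
    (the 'localization' step)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], q)))

def local_predict(clusters, centroids, q):
    """Predict with the model local to q's cluster; here the local 'model'
    is just the cluster's mean speed, standing in for a local GP trained
    on that cluster's subset of data."""
    speeds = clusters[nearest_cluster(centroids, q)]
    return sum(speeds) / len(speeds)

centroids = [(0.0, 0.0), (10.0, 10.0)]          # cluster centers (space-time)
clusters = [[30.0, 32.0, 28.0], [55.0, 60.0]]   # observed speeds per cluster
print(local_predict(clusters, centroids, (1.0, 0.5)))  # 30.0
```

Training only on the matched cluster's subset is what lets the paper sidestep the cubic cost of a single global GP.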
MapReduce effectively partitions and distributes computation workloads to a cluster of servers, facilitating
today's big data processing. Given the massive data to be dispatched, and the intermediate results to be collected
and aggregated, there have been significant studies on data locality that seek to co-locate computation with
data so as to reduce cross-server traffic in MapReduce. They generally assume that the input data have little
dependency on each other, which, however, is not necessarily true for many real-world applications, and
we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged by such data
dependency. In this paper, we present DALM (Dependency-Aware Locality for MapReduce) for processing
real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a
data-locality framework, organically synthesizing the key components of data reorganization, replication, and
placement. Besides the algorithmic design within the framework, we have also closely examined the deployment
challenges, particularly in public virtualized cloud environments, and have implemented DALM on Hadoop
1.2.1 with Giraph 1.0.0. Its performance has been evaluated through both simulations and real-world
experiments, and compared with that of state-of-the-art solutions.
ETPL BD -
013
Dependency-Aware Data Locality for MapReduce
Cloud computing has promoted the success of big data applications such as medical data analyses. With the
abundant resources provisioned by cloud platforms, the QoS (quality of service) of services that process big data
can be boosted significantly. However, due to unstable networks or fake advertisement, the QoS published by
service providers is not always trusted. Therefore, it becomes a necessity to evaluate service quality in a
trustable way, based on the services' historical QoS records. However, evaluation efficiency would be low
and could not meet users' quick-response requirements if all the records of a service were recruited for quality
evaluation. Moreover, it may lead to a 'lagging effect' or low evaluation accuracy if all the records are treated
equally, as the invocation contexts of different records are not exactly the same. In view of these challenges, a
novel approach named Partial-HR (Partial Historical Records-based service evaluation approach) is put forward
in this paper. In Partial-HR, each historical QoS record is weighted based on its service invocation context.
Afterwards, only the important partial records are employed for quality evaluation. Finally, a group of
experiments is deployed to validate the feasibility of our proposal, in terms of evaluation accuracy and efficiency.
ETPL BD -
014
A Context-aware Service Evaluation Approach over Big Data for Cloud Applications
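The Partial-HR idea, weighting records by context similarity and evaluating quality from only the top-weighted ones, might be sketched as follows (an assumed simplification; the similarity function, record format, and data are all illustrative):

```python
def partial_hr_score(records, ctx, similarity, top=2):
    """Sketch of a partial-records evaluation: rank (context, qos) records
    by similarity of their invocation context to the current context ctx,
    then average the QoS of only the top-ranked records."""
    ranked = sorted(records, key=lambda r: similarity(r[0], ctx), reverse=True)
    chosen = ranked[:top]
    return sum(q for _, q in chosen) / len(chosen)

# Toy contexts are scalars; similarity = negative distance (an assumption).
records = [(1.0, 0.9), (1.1, 0.8), (9.0, 0.1)]
print(round(partial_hr_score(records, 1.0, lambda a, b: -abs(a - b)), 2))  # 0.85
```

Dropping the dissimilar-context record (9.0, 0.1) is what avoids the 'lagging effect' the abstract mentions.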
To address the computing challenges of 'big data', a number of data-intensive computing frameworks (e.g.,
MapReduce, Dryad, Storm, and Spark) have emerged and become popular. YARN is a de facto resource
management platform that enables these frameworks to run together in a shared system. However, we observe
that, in a cloud computing environment, the fair resource allocation policy implemented in YARN is not suitable
because its memoryless resource allocation leads to violations of a number of good properties in
shared computing systems. This paper attempts to address these problems for YARN. Both single-level and
hierarchical resource allocations are considered. For single-level resource allocation, we propose a novel fair
resource allocation mechanism called Long-Term Resource Fairness (LTRF). For
hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF) by extending
LTRF. We show that both LTRF and H-LTRF can address the fairness problems of the current resource allocation
policy and are thus suitable for cloud computing. Finally, we have developed LTYARN by implementing LTRF
and H-LTRF in YARN, and our experiments show that it achieves better resource fairness than the existing fair
schedulers of YARN.
ETPL BD -
015
Fair Resource Allocation for Data-Intensive Computing in the Cloud
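Long-term fairness can be illustrated with a toy allocator that favors the user with the smallest cumulative share, so that a user underserved in past rounds catches up (this is our sketch of the general idea, not the paper's LTRF mechanism; all values are illustrative):

```python
def ltrf_round(cumulative, demands, capacity):
    """One allocation round of a long-term-fairness sketch: repeatedly give
    one resource unit to the user with the smallest cumulative-plus-current
    share who still has unmet demand."""
    alloc = {u: 0 for u in demands}
    for _ in range(capacity):
        live = [u for u in demands if alloc[u] < demands[u]]
        if not live:
            break
        u = min(live, key=lambda x: cumulative[x] + alloc[x])
        alloc[u] += 1
    return alloc

# User "b" received nothing historically, so it gets this whole round.
print(ltrf_round({"a": 5, "b": 0}, {"a": 2, "b": 4}, 4))  # {'a': 0, 'b': 4}
```

A memoryless fair scheduler would split this round evenly regardless of history; accounting for cumulative shares is the distinction the abstract draws.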
This paper introduces RankMap, a platform-aware end-to-end framework for efficient execution of a broad class
of iterative learning algorithms for massive and dense data sets. Our framework exploits the structure of the data to
scalably factorize it into an ensemble of lower-rank subspaces. The factorization creates sparse low-dimensional
representations of the data, a property which is leveraged to devise effective mapping and scheduling of iterative
learning algorithms on the distributed computing machines. We provide two APIs, one matrix-based and one
graph-based, which facilitate automated adoption of the framework for performing several contemporary
learning applications. To demonstrate the utility of RankMap, we solve sparse recovery and power iteration
problems on various real-world data sets with up to 1.8 billion nonzeros. Our evaluations are performed on
Amazon EC2 and IBM iDataPlex servers using up to 244 cores. The results demonstrate up to two orders of
magnitude improvements in memory usage, execution speed, and bandwidth compared with the best reported
prior work, while achieving the same level of learning accuracy.
ETPL BD -
016
RankMap: A Framework for Distributed Learning From Dense Data Sets
The growth of mobile cloud computing (MCC) is challenged by the need to adapt to the resources and
environment that are available to mobile clients while addressing the dynamic changes in network bandwidth.
Big data can be handled via MCC. In this paper, we propose a model of computation partitioning for stateful
data in the dynamic environment that will improve performance. First, we constructed a model of stateful data
streaming and investigated the method of computation partitioning in a dynamic environment. We developed a
definition of the direction and calculation of the partitioning scheme, covering single-frame data flow, task
scheduling, and execution efficiency. We also defined the problem of a multi-frame data flow calculation
segmentation decision that is optimized for dynamic conditions and provided an analysis. Second, we proposed
a computation partitioning method for single frame data flow. We determined the data parameters of the
application model, the computation partitioning scheme, and the task and work order data stream model. We
followed the scheduling method to provide the optimal calculation for data frame execution time after
computation partitioning and the best computation partitioning method. Third, we explored a calculation
segmentation method for single frame data flow based on multi-frame data using multi-frame data optimization
adjustment and prediction of future changes in network bandwidth. We were able to demonstrate that the
calculation method for multi-frame data in a changing network bandwidth environment is more efficient than
the calculation method with the limitation of calculations for single frame data. Finally, our research verified
the effectiveness of single frame data in the application of the data stream and analyzed the performance of the
method to optimize the adjustment of multi-frame data. We used a mobile cloud computing platform prototype
system for face recognition to verify the effectiveness of the method.
ETPL BD -
017
Computation partitioning for mobile cloud computing in big data environment
Attribute-based encryption (ABE) has been widely used in cloud computing where a data provider outsources
his/her encrypted data to a cloud service provider, and can share the data with users possessing specific
credentials (or attributes). However, the standard ABE system does not support secure deduplication, which is
crucial for eliminating duplicate copies of identical data in order to save storage space and network bandwidth.
In this paper, we present an attribute-based storage system with secure deduplication in a hybrid cloud setting,
where a private cloud is responsible for duplicate detection and a public cloud manages the storage. Compared
with the prior data deduplication systems, our system has two advantages. Firstly, it can be used to confidentially
share data with users by specifying access policies rather than sharing decryption keys. Secondly, it achieves
the standard notion of semantic security for data confidentiality while existing systems only achieve it by
defining a weaker security notion. In addition, we put forth a methodology to modify a ciphertext over one
access policy into ciphertexts of the same plaintext but under other access policies without revealing the
underlying plaintext.
ETPL BD -
018
Attribute-Based Storage Supporting Secure Deduplication of Encrypted Data in Cloud
Due to the complexity and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most effective
approaches for big data storage and access. Nevertheless, verifying the access legitimacy of a user and securely
updating a ciphertext in the cloud based on a new access policy designated by the data owner are two critical
challenges to make cloud-based big data storage practical and effective. Traditional approaches either
completely ignore the issue of access policy update or delegate the update to a third party authority; but in
practice, access policy update is important for enhancing security and dealing with the dynamism caused by user
join and leave activities. In this paper, we propose a secure and verifiable access control scheme based on the
NTRU cryptosystem for big data storage in clouds. We first propose a new NTRU decryption algorithm to
overcome the decryption failures of the original NTRU, and then detail our scheme and analyze its correctness,
security strengths, and computational efficiency. Our scheme allows the cloud server to efficiently update the
ciphertext when a new access policy is specified by the data owner, who is also able to validate the update to
counter cheating behaviors of the cloud. It also enables (i) the data owner and eligible users to effectively
verify the legitimacy of a user for accessing the data, and (ii) a user to validate the information provided by other
users for correct plaintext recovery. Rigorous analysis indicates that our scheme can prevent eligible users from
cheating and resist various attacks such as the collusion attack.
ETPL BD -
019
A Secure and Verifiable Access Control Scheme for Big Data Storage in Clouds
With the rapidly increasing popularity of economic activities, a large amount of economic data is being collected.
Although such data offers great opportunities for economic analysis, its low quality, high dimensionality, and
huge volume pose great challenges to the efficient analysis of economic big data. Existing methods have
primarily analyzed economic data from the perspective of econometrics, which involves limited indicators and
demands the prior knowledge of economists. When embracing large varieties of economic factors, these methods
tend to yield unsatisfactory performance. To address these challenges, this paper presents a new framework for
the efficient analysis of high-dimensional economic big data based on innovative distributed feature selection.
Specifically, the framework combines economic feature selection and econometric model
construction to reveal hidden patterns of economic development. The functionality rests on three pillars: (i)
novel data pre-processing techniques to prepare high-quality economic data, (ii) an innovative distributed feature
identification solution to locate important and representative economic indicators from multidimensional data
sets, and (iii) new econometric models to capture the hidden patterns of economic development. The
experimental results on economic data collected in Dalian, China, demonstrate that our proposed framework
and methods have superior performance in analyzing enormous economic data.
ETPL BD -
020
Distributed Feature Selection for Efficient Economic Big Data Analysis
To leverage big data for enhanced strategic insight, process optimization, and informed decision-making, we
need an efficient access control mechanism to ensure end-to-end security of such information assets.
Signcryption is one of several promising techniques to simultaneously achieve big data
confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke
users from a large-scale system efficiently. In this paper, we put forward the first identity-based (ID-based)
signcryption scheme with efficient revocation as well as the ability to outsource unsigncryption, enabling secure
big data communications between data collectors and data analytical system(s). Our scheme is designed to
achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while
providing scalable revocation functionality such that the overhead demanded of the private key generator (PKG)
in the key-update phase increases only logarithmically with the cardinality of users. Although in our scheme
the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not
affect the security of the proposed scheme. We then prove the security of our scheme, as well as demonstrating
its utility using simulations.
ETPL BD -
021
Revocable Identity-Based Access Control for Big Data with Verifiable Outsourced
Computing
With the popularity of wearable devices, along with the development of cloud and cloudlet technology, there
is an increasing need to provide better medical care. The processing chain of medical data mainly includes
data collection, data storage, and data sharing. Traditional healthcare systems often require the delivery of
medical data to the cloud, which involves users' sensitive information and incurs communication energy
costs. Practically, medical data sharing is a critical and challenging issue. Thus, in this paper, we build
a novel healthcare system by utilizing the flexibility of the cloudlet. The functions of the cloudlet include privacy
protection, data sharing, and intrusion detection. In the data collection stage, we first utilize the Number Theory
Research Unit (NTRU) method to encrypt the user's body data collected by wearable devices. Those data are
transmitted to a nearby cloudlet in an energy-efficient fashion. Second, we present a new trust model to help
users select trustable partners who want to share stored data in the cloudlet. The trust model also helps similar
patients communicate with each other about their diseases. Third, we divide users' medical data stored in the
remote cloud of the hospital into three parts, and give each proper protection. Finally, to protect the
healthcare system from malicious attacks, we develop a novel collaborative intrusion detection system (IDS)
method based on the cloudlet mesh, which can effectively prevent attacks on the remote healthcare big data cloud.
Our experiments demonstrate the effectiveness of the proposed scheme.
ETPL BD -
022
Privacy Protection and Intrusion Avoidance for Cloudlet-based Medical Data Sharing
Resource constrained sensing devices are being used widely to build and deploy self-organizing wireless sensor
networks for a variety of critical applications such as smart cities, smart health, precision agriculture and
industrial control systems. Many such devices sense the deployed environment and generate a variety of data
and send them to the server for analysis as data streams. A Data Stream Manager (DSM) at the server collects
the data streams (often called big data) to perform real time analysis and decision-making for these critical
applications. A malicious adversary may access or tamper with the data in transit. One of the challenging tasks
in such applications is to assure the trustworthiness of the collected data so that any decisions are made on the
processing of correct data. Assuring high data trustworthiness requires that the system satisfies two key security
properties: confidentiality and integrity. To ensure the confidentiality of collected data, we need to prevent
sensitive information from reaching the wrong people while ensuring that the right people can access it. Sensed
data are always associated with different sensitivity levels based on the sensitivity of emerging applications or
the sensed data types or the sensing devices. For example, a temperature reading in a precision agriculture application
may not be as sensitive as monitored data in smart health. Providing multilevel data confidentiality along with
data integrity for big sensing data streams in the context of near real time analytics is a challenging problem. In
this paper, we propose a Selective Encryption (SEEN) method to secure big sensing data streams that satisfies
the desired multiple levels of confidentiality and data integrity. Our method is based on two key concepts:
common shared keys that are initialized and updated by DSM without requiring retransmission, and a seamless
key refreshment process without interrupting the data stream encryption/decryption. Theoretical analyses and
experimental results of our SEEN method show that it can significantly improve the efficiency and buffer usage
at DSM without compromising the confidentiality and integrity of the data streams.
ETPL BD -
023
SEEN: A Selective Encryption Method to Ensure Confidentiality for Big Sensing Data
Streams
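The core selective-protection idea above, encrypt only data at or above a sensitivity threshold while integrity-protecting everything, can be sketched as follows. This is an illustrative toy, not the authors' SEEN scheme: the hash-based keystream, the per-level keys, and the threshold value are all assumptions made for the example.

```python
import hashlib, hmac

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Derive a pseudo-random keystream by hashing key||nonce||counter
    # (a toy stream cipher standing in for a real one).
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def protect(record: bytes, level: int, keys: dict, nonce: bytes, threshold: int = 2):
    # Encrypt only records at or above the sensitivity threshold; always MAC them
    # so integrity holds for every level.
    if level >= threshold:
        ks = keystream(keys[level], nonce, len(record))
        body = bytes(a ^ b for a, b in zip(record, ks))
    else:
        body = record
    tag = hmac.new(keys["mac"], nonce + body, hashlib.sha256).digest()
    return body, tag

keys = {1: b"k-low", 2: b"k-med", 3: b"k-high", "mac": b"k-mac"}
body, tag = protect(b"heart-rate:72", level=3, keys=keys, nonce=b"n0")
# Decryption is the same XOR with the same keystream:
ks = keystream(keys[3], b"n0", len(body))
assert bytes(a ^ b for a, b in zip(body, ks)) == b"heart-rate:72"
```

Low-sensitivity records skip the cipher entirely, which is where the buffer and efficiency gains of selective encryption come from.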
Biomedical research often involves studying patient data that contain personal information. Inappropriate use of
these data might lead to leakage of sensitive information, which can put patient privacy at risk. The problem of
preserving patient privacy has received increasing attention in the era of big data. Many privacy methods have
been developed to protect against various attack models. This paper reviews relevant topics in the context of
biomedical research. We discuss privacy preserving technologies related to (1) record linkage, (2) synthetic data
generation, and (3) genomic data privacy. We also discuss the ethical implications of big data privacy in
biomedicine and present challenges in future research directions for improving data privacy in biomedical
research.
ETPL BD -
024
Big Data Privacy in Biomedical Research
The next generation wireless networks are expected to operate in a fully automated fashion to meet the burgeoning
capacity demand and to serve users with superior quality of experience. Mobile wireless networks can leverage
spatio-temporal information about user and network condition to embed the system with end-to-end visibility
and intelligence. Big data analytics has emerged as a promising approach to unearth meaningful insights and to
build artificially intelligent models with the assistance of machine learning tools. Utilizing the aforementioned tools
and techniques, this paper contributes in two ways. First, we utilize mobile network data (big data) – call detail
record (CDR) – to analyze anomalous behavior of a mobile wireless network. For anomaly detection purposes,
we use unsupervised clustering techniques, namely k-means clustering and hierarchical clustering. We compare
the detected anomalies with ground truth information to verify their correctness. From the comparative analysis,
we observe that when the network experiences abruptly high (unusual) traffic demand at any location and time,
it identifies that as an anomaly. This helps in identifying regions of interest (RoI) in the network for special
actions such as resource allocation and fault avoidance solutions. Second, we train a neural-network based prediction
model with anomalous and anomaly-free data to highlight the effect of anomalies in data while training/building
intelligent models. In this phase, we transform our anomalous data to anomaly-free and we observe that the error
in prediction while training the model with anomaly-free data has largely decreased as compared to the case
when the model was trained with anomalous data.
ETPL BD -
025
Big Data Analytics for User Activity Analysis and User Anomaly Detection in Mobile
Wireless Network
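The clustering-based anomaly detection described above can be illustrated with a small pure-Python sketch: run k-means on per-hour traffic counts and flag members of unusually small clusters, since an abrupt spike ends up isolated. The small-cluster rule and all parameter values here are illustrative assumptions, not the paper's ground-truth comparison.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain 1-D k-means (Lloyd's algorithm) on scalar traffic counts.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def anomalies(points, centers, min_frac=0.2):
    # Assign each point to its nearest center and flag members of clusters
    # holding less than min_frac of the data: abrupt, unusual demand gets
    # isolated into a tiny cluster of its own.
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
    out = []
    for c in clusters:
        if c and len(c) < min_frac * len(points):
            out.extend(c)
    return out

# Hourly call volumes with one abrupt spike:
traffic = [100, 110, 95, 105, 98, 102, 900, 101, 99, 104]
centers = kmeans(traffic, k=2)
```

With two centers, the nine ordinary readings converge to one cluster and the spike of 900 to its own, so only the spike is flagged.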
In this paper, we consider the problem of mutual privacy-protection in social participatory sensing in which
individuals contribute their private information to build a (virtual) community. Particularly, we propose a mutual
privacy preserving k-means clustering scheme that neither discloses individuals' private information nor leaks
the community’s characteristic data (clusters). Our scheme contains two privacy-preserving algorithms called
at each iteration of the k-means clustering. The first one is employed by each participant to find the nearest
cluster while the cluster centers are kept secret from the participants; and the second one computes the cluster
centers without leaking any cluster center information to the participants while preventing each participant from
figuring out other members in the same cluster. An extensive performance analysis is carried out to show that
our approach is effective for k-means clustering, can resist collusion attacks, and can provide mutual privacy
protection even when the data analyst colludes with all except one participant.
ETPL BD -
026
Mutual Privacy Preserving k-Means Clustering in Social Participatory Sensing
Today, one of the main challenges of big data research is the processing of big time-series data. Moreover, time
data analysis is of considerable importance, because previous trends are useful for predicting the future. Due to
the considerable delay when the volume of the data increases, the presence of redundancy, and the innate lack
of time-series structures, the traditional relational data model does not seem adequately capable of analyzing
time data. Moreover, many traditional data structures do not support time operators, which results in an
inefficient access to time data. Therefore, relational database management systems have difficulty in dealing
with big data—it may require massively parallel software that runs on many servers. This has led us to
implement Chronos Software, an in-memory background-based time database for key-value pairs; this software
was implemented using C++ language. An independent design has been suggested through appropriately using
temporal algorithms, parallelism algorithms, and methods of data storage in RAM. Our results indicate that the
employment of RAM for storing the data, and of the Timeline Index algorithm for getting access to the time
background of the keys in Chronos translate into an increase of about 40%-90% in the efficiency as compared
to other databases like MySQL and MongoDB.
ETPL BD -
027
A New Method for Time-Series Big Data Effective Storage
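The temporal key-value access pattern behind such a database can be sketched in a few lines: keep a sorted timeline of writes per key so "as-of" queries resolve in O(log n) via binary search. This is a minimal in-memory sketch of the Timeline-Index idea, not the actual Chronos software (which is in C++ and far more elaborate).

```python
import bisect

class Chronos:
    # Minimal in-memory temporal key-value store: each key holds a timeline
    # of (timestamp, value) writes kept sorted by timestamp.
    def __init__(self):
        self.times = {}   # key -> sorted list of timestamps
        self.values = {}  # key -> values aligned with self.times

    def put(self, key, timestamp, value):
        ts = self.times.setdefault(key, [])
        vs = self.values.setdefault(key, [])
        i = bisect.bisect_right(ts, timestamp)
        ts.insert(i, timestamp)
        vs.insert(i, value)

    def as_of(self, key, timestamp):
        # Latest value written at or before `timestamp`, or None if absent.
        ts = self.times.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self.values[key][i - 1] if i else None

db = Chronos()
db.put("temp", 1, 20)
db.put("temp", 5, 25)
```

An as-of query at time 3 returns the value written at time 1, since that is the latest write not after the query point.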
Secret sharing schemes have been commonly applied in distributed storage for big data. They offer a method for
protecting outsourced data against data leakage and for securing key management systems. The secret is
distributed among a group of participants where each participant holds a share of the secret. The secret can be
reconstructed only when a sufficient number of shares are combined. Although many secret sharing schemes
have been proposed, they are still inefficient in terms of share size, communication cost and storage cost; and
also lack robustness in terms of exact-share repair. In this paper, for the first time, we propose a new secret
sharing scheme based on Slepian-Wolf coding. Our scheme can achieve an optimal share size utilizing the
simple binning idea of the coding. It also enhances the exact-share repair feature whereby the shares remain
consistent even if they are corrupted. We show, through experiments, how our scheme can significantly reduce
the communication and storage cost while still being able to support direct share repair leveraging lightweight
exclusive-OR (XOR) operation for fast computation.
ETPL BD -
028
Optimizing Share Size in Efficient and Robust Secret Sharing Scheme for Big Data
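The lightweight XOR computation the scheme leverages can be illustrated with the simplest possible construction, n-of-n XOR sharing, where all shares are needed and reconstruction is a single XOR pass. This is a deliberately simplified stand-in, not the paper's Slepian-Wolf-coding-based threshold scheme.

```python
import os

def split(secret: bytes, n: int):
    # n-of-n XOR sharing: n-1 uniformly random shares, plus one final share
    # chosen so that the XOR of all n shares equals the secret.
    shares = [os.urandom(len(secret)) for _ in range(n - 1)]
    last = bytes(secret)
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    return shares + [last]

def combine(shares):
    # Reconstruction is just XOR-folding every share together.
    out = bytes(len(shares[0]))
    for s in shares:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

shares = split(b"master-key", 4)
assert combine(shares) == b"master-key"
```

Any proper subset of shares is information-theoretically independent of the secret, which is why each share on its own reveals nothing.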
In literature, the task of learning a support vector machine for large datasets has been performed by splitting the
dataset into manageable sized “partitions” and training a sequential support vector machine on each of these
partitions separately to obtain local support vectors. However, this process invariably leads to the loss in
classification accuracy as global support vectors may not have been chosen as local support vectors in their
respective partitions. We hypothesize that retaining the original distribution of the dataset in each of the
partitions can help solve this issue. Hence, we present DiP-SVM, a distribution preserving kernel support vector
machine where the first and second order statistics of the entire dataset are retained in each of the partitions.
This helps in obtaining local decision boundaries which are in agreement with the global decision boundary,
thereby reducing the chance of missing important global support vectors. We show that DiP-SVM achieves a
minimal loss in classification accuracy among other distributed support vector machine techniques on several
benchmark datasets. We further demonstrate that our approach reduces communication overhead between
partitions leading to faster execution on large datasets and making it suitable for implementation in cloud
environments.
ETPL BD -
029
DiP-SVM: Distribution Preserving Kernel Support Vector Machine for Big Data
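The distribution-preserving partitioning idea can be sketched with a simple heuristic: sort the data and deal it round-robin, so every partition spans the full range and roughly keeps the global first- and second-order statistics. The paper's actual construction is more general; this sketch only illustrates why such partitions agree better with the global picture than contiguous splits do.

```python
def dip_partition(points, k):
    # Sort by value and deal round-robin: each partition then samples the
    # whole range, so its mean/variance approximate the global ones
    # (contrast with a contiguous split, whose partition means diverge).
    parts = [[] for _ in range(k)]
    for i, p in enumerate(sorted(points)):
        parts[i % k].append(p)
    return parts

data = list(range(100))                      # global mean is 49.5
parts = dip_partition(data, 4)
means = [sum(p) / len(p) for p in parts]     # each partition mean is close to 49.5
```

A naive contiguous split of the same data would yield partition means of 12, 37, 62, and 87, far from the global mean, which is exactly how local support vectors drift from global ones.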
Increasingly popular big data applications bring invaluable information, but also challenges to the
industrial community and academia. Cloud computing with unlimited resources seems to be the way out.
However, this panacea cannot play its role if we do not arrange fine-grained allocation of cloud infrastructure
resources. In this paper, we present a multi-objective optimization algorithm to trade off the performance,
availability, and cost of big data applications running on the cloud. After analyzing and modeling the interlaced
relations among these objectives, we design and implement our approach in an experimental environment. Finally, three sets of
experiments show that our approach can run about 20% faster than traditional optimization approaches, and can
achieve about 15% higher performance than other heuristic algorithms, while saving 4% to 20% cost.
ETPL BD -
030
Cloud Infrastructure Resource Allocation for Big Data Applications
Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification.
The approaches proposed so far for FDT learning, however, have generally neglected time and space
requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce
programming model for generating both binary and multi-way FDTs from big data. The scheme relies on a novel
distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy
information entropy. The fuzzy partitions are therefore used as input to the FDT learning algorithm, which
employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT
learning scheme on the Apache Spark framework. We have used ten real-world publicly available big datasets
for evaluating the behavior of the scheme along three dimensions: i) performance in terms of classification
accuracy, model complexity and execution time, ii) scalability varying the number of computing units and iii)
ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme
turns out to be suitable for managing big datasets even with modest commodity hardware support. Finally, we
have used the distributed decision tree learning algorithm implemented in the MLlib library and the
Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative
analysis.
ETPL BD -
031
On Distributed Fuzzy Decision Trees for Big Data
The k-nearest neighbors (k-NN) query is a fundamental primitive in spatial and multimedia databases. It has
extensive applications in location-based services, classification, clustering, and so on. With the promise of
confidentiality and privacy, massive data are increasingly outsourced to the cloud in encrypted form to enjoy
the advantages of cloud computing (e.g., reduced storage and query processing costs). Recently, many schemes
have been proposed to support k-NN queries on encrypted cloud data. However, prior works have all assumed
that the query users (QUs) are fully trusted and know the key of the data owner (DO), which is used to encrypt
and decrypt outsourced data. These assumptions are unrealistic in many situations, since many users are neither
trusted nor in possession of the key. In this paper, we propose a novel scheme for secure k-NN query on encrypted
cloud data with multiple keys, in which the DO and each QU hold their own different keys and do not share
them with each other; meanwhile, the DO encrypts and decrypts outsourced data using his own key. Our
scheme is constructed from a distributed two-trapdoor public-key cryptosystem (DT-PKC) and a set of protocols
of secure two-party computation, which not only preserves the data confidentiality and query privacy but also
supports the offline data owner. Our extensive theoretical and experimental evaluations demonstrate the
effectiveness of our scheme in terms of security and performance.
ETPL BD -
032
Secure k-NN Query on Encrypted Cloud Data with Multiple Keys
Today, cloud storage has become one of the critical services, because users can easily modify and share data with
others in the cloud. However, the integrity of shared cloud data is vulnerable to inevitable hardware faults, software
failures or human errors. To ensure the integrity of the shared data, some schemes have been designed to allow
public verifiers (i.e., third party auditors) to efficiently audit data integrity without retrieving the entire users’
data from cloud. Unfortunately, public auditing on the integrity of shared data may reveal data owners’ sensitive
information to the third party auditor. In this paper, we propose a new privacy-aware public auditing mechanism
for shared cloud data by constructing a homomorphic verifiable group signature. Unlike the existing solutions,
our scheme requires at least t group managers to recover a trace key cooperatively, which eliminates the abuse
of single-authority power and provides nonframeability. Moreover, our scheme ensures that group users can
trace data changes through a designated binary tree; and can recover the latest correct data block when the current
data block is damaged. In addition, the formal security analysis and experimental results indicate that our scheme
is provably secure and efficient.
ETPL BD -
033
NPP: A New Privacy-Aware Public Auditing Scheme for Cloud Data Sharing with Group
Users
As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern
applications involve heavy batch processing jobs over large volumes of data and at the same time require
efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of
these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need
to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on
investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an
interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and
transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It
is tailored to modern workflows found in machine learning and data exploration use cases, which often involve
iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing
window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata
that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB
achieves very good performance for a wide range of ad-hoc queries compared to alternatives.
ETPL BD -
034
DiNoDB: an Interactive-speed Query Engine for Ad-hoc Queries on Temporary Data
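The metadata-piggybacking idea above can be sketched concretely: while the batch phase writes data blocks, record cheap per-block min/max statistics, and let interactive queries use them to skip blocks that cannot match. This is an illustrative sketch of the general technique (often called zone maps), not DiNoDB's actual metadata format.

```python
def write_blocks(rows, block_size):
    # Batch phase: store rows in fixed-size blocks and piggyback per-block
    # (min, max) metadata as the blocks are written.
    blocks, meta = [], []
    for i in range(0, len(rows), block_size):
        b = rows[i:i + block_size]
        blocks.append(b)
        meta.append((min(b), max(b)))
    return blocks, meta

def query(blocks, meta, lo, hi):
    # Interactive phase: skip any block whose [min, max] cannot overlap
    # the query range [lo, hi]; scan only the rest.
    scanned = 0
    out = []
    for b, (mn, mx) in zip(blocks, meta):
        if mx < lo or mn > hi:
            continue
        scanned += 1
        out.extend(r for r in b if lo <= r <= hi)
    return out, scanned

blocks, meta = write_blocks(list(range(1000)), 100)
hits, scanned = query(blocks, meta, 250, 260)
```

For this selective range, only one of the ten blocks is actually scanned; the metadata cost was amortized into the batch pass that produced the data anyway.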
In the environment of cloud computing, the data produced by massive users form a data stream and need to be
protected by encryption to maintain confidentiality. Traditional serial encryption algorithms perform poorly
and consume excessive energy because they do not consider the properties of streams. Therefore, we propose a
velocity-aware parallel encryption algorithm with low energy consumption (LECPAES) for streams in cloud
computing. The algorithm parallelizes Advanced Encryption Standard (AES) based on heterogeneous many-
core architecture, adopts a sliding window to stabilize burst flows, senses the velocity of streams using the
thresholds of the window computed by frequency ratios, and dynamically scales the frequency of Graphics
Processing Units (GPUs) to lower energy consumption. The experiments for streams at different velocities
and the comparisons with other related algorithms show that the algorithm can reduce energy consumption, but
only slightly increases retransmission rate and slightly decreases throughput. Therefore, LECPAES is an
excellent algorithm for fast and energy-saving stream encryption.
ETPL BD -
035
Velocity-Aware Parallel Encryption Algorithm with Low Energy Consumption for Streams
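The velocity-sensing control loop described above can be sketched as a sliding window over recent arrival counts whose average rate is compared against thresholds to pick a frequency level. The window size, threshold values, and three-level policy here are illustrative assumptions, not the paper's computed thresholds.

```python
from collections import deque

class VelocityScaler:
    # Sliding window of recent per-tick arrival counts; the window's average
    # rate is compared to low/high thresholds to choose a frequency level
    # (standing in for dynamic GPU frequency scaling).
    def __init__(self, window=5, low=10.0, high=50.0):
        self.counts = deque(maxlen=window)
        self.low, self.high = low, high

    def observe(self, items_this_tick):
        self.counts.append(items_this_tick)
        rate = sum(self.counts) / len(self.counts)
        if rate < self.low:
            return "low-freq"
        if rate > self.high:
            return "max-freq"
        return "mid-freq"
```

The window smooths burst flows: a single spike barely moves the average, while a sustained burst pushes the rate past the high threshold and raises the frequency.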
Many data owners are required to release their data in a variety of real-world applications, since it is of
vital importance to discover the valuable information hidden behind the data. However, existing re-identification
attacks on the AOL and ADULTS datasets have shown that publishing such data directly may pose tremendous
threats to individual privacy. Thus, it is urgent to resolve all kinds of re-identification risks by recommending
effective de-identification policies that guarantee both the privacy and utility of the data. De-identification policies
are one of the models that can be used to achieve such requirements; however, the number of de-identification
policies is exponentially large due to the broad domain of quasi-identifier attributes. To better control the trade-off
between data utility and data privacy, skyline computation can be used to select such policies, but efficient
skyline processing over a large number of policies remains challenging. In this paper, we propose a parallel
algorithm called SKY-FILTER-MR, based on MapReduce, which overcomes this challenge by computing
skylines over large-scale de-identification policies represented as bit-strings. To further improve
performance, a novel approximate skyline computation scheme is proposed to prune unqualified policies using
an approximate domination relationship. With approximate skylines, the filtering power in the policy-space
generation stage is greatly strengthened, effectively decreasing the cost of skyline computation over alternative
policies. Extensive experiments on both real-life and synthetic datasets demonstrate that our proposed
SKY-FILTER-MR algorithm substantially outperforms the baseline approach, running up to four times faster
in the optimal case, which indicates good scalability over large policy sets.
ETPL BD -
036
Efficient Recommendation of De-identification Policies using MapReduce
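The skyline primitive at the heart of the approach is easy to state: among candidate policies scored on several objectives, keep only those not dominated by any other. A minimal sequential sketch (the paper's contribution is doing this at scale with MapReduce; the scores below are made up for illustration):

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better
    # in at least one (here both privacy and utility are to be maximized).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(policies):
    # Keep only the non-dominated policies (naive O(n^2) scan).
    return [p for p in policies
            if not any(dominates(q, p) for q in policies if q != p)]

# (privacy, utility) scores of candidate de-identification policies:
policies = [(9, 2), (7, 5), (4, 8), (3, 3), (6, 5), (2, 9)]
best = skyline(policies)
```

Policies like (3, 3) and (6, 5) drop out because (7, 5) is at least as good on both axes; what remains is exactly the privacy-utility trade-off frontier a data owner would choose from.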
The sudden and spontaneous occurrence of epileptic seizures can impose a significant burden on patients with
epilepsy. If seizure onset can be prospectively predicted, it could greatly improve the life of patients with
epilepsy and also open new therapeutic avenues for epilepsy treatment. However, discovering effective
predictive patterns from massive brainwave signals is still a challenging problem. The prediction of epileptic
seizures is still in its early stage. Most existing studies actually investigated the predictability of seizures offline
instead of a truly prospective online prediction, and also the high inter-individual variability was not fully
considered in prediction. In this study, we propose a novel adaptive pattern learning framework with a new
online feature extraction approach to achieve personalized online prospective seizure prediction. In particular, a
two-level online feature extraction approach is applied to monitor intracranial electroencephalogram (EEG)
signals and construct a pattern library incrementally. Three prediction rules were developed and evaluated based
on the continuously-updated patient-specific pattern library for each patient, including the adaptive probabilistic
prediction (APP), adaptive linear-discriminant-analysis-based prediction (ALP), and adaptive naive-Bayes-
based prediction (ANBP). The proposed online pattern learning and prediction system achieved impressive
prediction results for 10 patients with epilepsy using long-term EEG recordings. The best testing prediction
accuracies averaged over the 10 patients were 79%, 78%, and 82% for the APP, ALP, and ANBP prediction
schemes, respectively.
ETPL BD -
037
An Adaptive Pattern Learning Framework to Personalize Online Seizure Prediction
Processing big data in the cloud is on the increase. An important issue for efficient execution of big data processing
jobs on a cloud platform is selecting the best fitting virtual machine (VM) configuration(s) among the miscellany
of choices that cloud providers offer. Wise selection of VM configurations can lead to better performance, cost
and energy consumption. Therefore, it is crucial to explore the available configurations and opt for the best ones
that well suit each MapReduce application. Profiling the given application on all the configurations is costly,
time and energy consuming. An alternative is to run the application on a subset of configurations (sample
configurations) and estimate its performance on other configurations based on the obtained values by sample
configurations. We show that the choice of these sample configurations highly affects accuracy of later
estimations. Our Smart Configuration Selection (SCS) scheme chooses better representatives from among all
configurations by once-off analysis of given performance figures of the benchmarks so as to increase the
accuracy of estimations of missing values, and consequently, to more accurately choose the configuration
providing the highest performance. The results show that the SCS choice of sample configurations is very close
to the best choice, and can reduce estimation error to 11.58% from the original 19.72% of random configuration
selection. More importantly, using SCS estimations in a makespan minimization algorithm improves the
execution time by up to 36.03% compared with random sample selection.
ETPL BD -
038
Faster MapReduce Computation on Clouds through Better Performance Estimation
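The estimation step, predicting an application's runtime on unprofiled configurations from a few sample runs, can be sketched with a simple heuristic: scale from the sample configuration whose benchmark profile is closest to the target's. This is an illustrative stand-in, not the SCS algorithm itself; the configuration names and timings are invented.

```python
def estimate(app_times, bench, target):
    # app_times: the application's measured runtimes on sample configs only.
    # bench: benchmark runtimes, known in advance for every config.
    # Pick the sample whose benchmark time is nearest the target's, then
    # scale its measured runtime by the benchmark ratio.
    nearest = min(app_times, key=lambda c: abs(bench[c] - bench[target]))
    return app_times[nearest] * bench[target] / bench[nearest]

bench = {"small": 40.0, "medium": 20.0, "large": 10.0, "xlarge": 6.0}
app_times = {"small": 120.0, "large": 31.0}   # profiled on two sample configs
est = estimate(app_times, bench, "medium")
```

The choice of which configurations to profile clearly matters here: samples whose benchmark profiles bracket the rest of the space give better scaled estimates, which is the intuition SCS builds on.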
In this paper, we consider the problem of mutual privacy-protection in social participatory sensing in which
individuals contribute their private information to build a (virtual) community. Particularly, we propose a mutual
privacy preserving k-means clustering scheme that neither discloses individuals' private information nor leaks
the community’s characteristic data (clusters). Our scheme contains two privacy-preserving algorithms called
at each iteration of the k-means clustering. The first one is employed by each participant to find the nearest
cluster while the cluster centers are kept secret from the participants; and the second one computes the cluster
centers without leaking any cluster center information to the participants while preventing each participant from
figuring out other members in the same cluster. An extensive performance analysis is carried out to show that
our approach is effective for k-means clustering, can resist collusion attacks, and can provide mutual privacy
protection even when the data analyst colludes with all except one participant.
ETPL BD -
039
Mutual Privacy Preserving k-Means Clustering in Social Participatory Sensing
The proliferation of private clouds that are often underutilized and the tremendous computational potential of
these clouds when combined have recently brought forth the idea of volunteer cloud computing (VCC), a
computing model where cloud owners contribute underutilized computing and/or storage resources on their
clouds to support the execution of applications of other members in the community. This model is particularly
suitable to solve big data scientific problems. Scientists in data-intensive scientific fields increasingly recognize
that sharing volunteered resources from several clouds is a cost-effective alternative to solve many complex,
data- and/or compute-intensive science problems. Despite the promise of the idea of VCC, it still remains at the
vision stage at best. Challenges include the heterogeneity and autonomy of member clouds, access control and
security, complex inter-cloud virtual machine scheduling, etc. In this paper, we present CloudFinder, a system
that supports the efficient execution of big data workloads on volunteered federated clouds (VFCs). Our
evaluation of the system indicates that VFCs are a promising cost-effective approach to enable big data science.
ETPL BD -
040
CloudFinder: A System for Processing Big Data Workloads on Volunteered
Federated Clouds
To leverage big data for enhanced strategic insight, process optimization, and informed
decision-making, we need an efficient access control mechanism for ensuring the end-to-end security of such
information assets. Signcryption is one of several promising techniques to simultaneously achieve big data
confidentiality and authenticity. However, signcryption suffers from the limitation of not being able to revoke
users from a large-scale system efficiently. We put forward, in this paper, the first identity-based (ID-based)
signcryption scheme with efficient revocation as well as the feature to outsource unsigncryption to enable secure
big data communications between data collectors and data analytical system(s). Our scheme is designed to
achieve end-to-end confidentiality, authentication, non-repudiation, and integrity simultaneously, while
providing scalable revocation functionality such that the overhead demanded by the private key generator (PKG)
in the key-update phase only increases logarithmically with the cardinality of users. Although in our scheme
the majority of the unsigncryption tasks are outsourced to an untrusted cloud server, this approach does not
affect the security of the proposed scheme. We then prove the security of our scheme and demonstrate
its utility using simulations.
ETPL BD -
042
Revocable Identity-Based Access Control for Big Data with Verifiable Outsourced
Computing
Big sensing data is prevalent in both industry and scientific research applications where the data is generated
with high volume and velocity. Cloud computing provides a promising platform for big sensing data processing
and storage as it provides a flexible stack of massive computing, storage, and software services in a scalable
manner. Current big sensing data processing on the cloud has adopted some data compression techniques.
However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack
sufficient efficiency and scalability for data processing. Based on specific on-Cloud data compression
requirements, we propose a novel scalable data compression approach based on calculating similarity among
the partitioned data chunks. Instead of compressing basic data units, the compression will be conducted over
partitioned data chunks. To restore original data sets, some restoration functions and predictions will be
designed. MapReduce is used for algorithm implementation to achieve extra scalability on Cloud. With real
world meteorological big sensing data experiments on the U-Cloud platform, we demonstrate that the proposed
scalable compression approach based on data chunk similarity can significantly improve data compression
efficiency with affordable data accuracy loss.
ETPL BD -
041
A Scalable Data Chunk Similarity Based Compression Approach for Efficient Big
Sensing Data Processing on Cloud
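The chunk-similarity compression idea can be sketched as follows: partition the stream into chunks and, when a chunk is similar to one already stored, keep only a back-reference. Restoration then reproduces each referenced chunk, losing at most the similarity tolerance per value (the "affordable accuracy loss"). The pointwise similarity test and tolerance are illustrative choices, not the paper's similarity model.

```python
def similar(a, b, tol=1.0):
    # Two numeric chunks are "similar" if they agree pointwise within tol.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

def compress(stream, chunk_size, tol=1.0):
    # Store each chunk either literally, or as a reference to an earlier
    # stored chunk it resembles.
    chunks = [stream[i:i + chunk_size] for i in range(0, len(stream), chunk_size)]
    stored, out = [], []
    for c in chunks:
        for j, s in enumerate(stored):
            if similar(c, s, tol):
                out.append(("ref", j))
                break
        else:
            out.append(("lit", len(stored)))
            stored.append(c)
    return stored, out

def restore(stored, out):
    # Rebuild the stream; referenced chunks come back as their stored twin.
    result = []
    for kind, j in out:
        result.extend(stored[j])
    return result

readings = [20.0, 20.1, 20.0, 20.1, 25.0, 25.2, 20.05, 20.0]
stored, out = compress(readings, chunk_size=2, tol=0.5)
```

Here four chunks compress to two stored chunks plus two references; the restored stream deviates from the original by no more than the tolerance.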
Intervals have become prominent in data management as they are the main data structure to represent a number
of key data types such as temporal or genomic data. Yet, there exists no solution to compactly store and
efficiently query big interval data. In this paper we introduce CINTIA—the Checkpoint INTerval Index Array—
an efficient data structure to store and query interval data, which achieves high memory locality and outperforms
state-of-the-art solutions. We also propose a low-latency big data system that implements CINTIA on top of a
popular distributed file system and efficiently manages large interval data on clusters of commodity machines.
Our system can easily be scaled-out and was designed to accommodate large delays between the various
components of a distributed infrastructure. We experimentally evaluate the performance of our approach on
several datasets and show that it outperforms current solutions by several orders of magnitude in distributed
settings.
ETPL BD -
043
Managing Big Interval Data with CINTIA: the Checkpoint INTerval Array
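The checkpointing idea behind such an interval index can be sketched in miniature: sort interval endpoint events, snapshot the set of active intervals every few events, and answer a stabbing query by jumping to the last checkpoint before the query time and replaying only the remaining events. This is a much simplified sketch of the checkpoint concept, not CINTIA's actual array layout; it assumes half-open [start, end) intervals.

```python
import bisect

def build(intervals, every=4):
    # Sort endpoint events; after every `every` events, snapshot the set of
    # currently active interval ids as a checkpoint.
    events = sorted([(s, "start", i) for i, (s, e) in enumerate(intervals)] +
                    [(e, "end", i) for i, (s, e) in enumerate(intervals)])
    checkpoints = [(float("-inf"), frozenset(), 0)]
    active = set()
    for n, (t, kind, i) in enumerate(events, 1):
        if kind == "start":
            active.add(i)
        else:
            active.discard(i)
        if n % every == 0:
            checkpoints.append((t, frozenset(active), n))
    return events, checkpoints

def stab(events, checkpoints, t):
    # Intervals containing time t: binary-search the last checkpoint at or
    # before t, then replay only the few events between it and t.
    times = [c[0] for c in checkpoints]
    _, snap, n = checkpoints[bisect.bisect_right(times, t) - 1]
    active = set(snap)
    for et, kind, i in events[n:]:
        if et > t:
            break
        if kind == "start":
            active.add(i)
        else:
            active.discard(i)
    return active

intervals = [(1, 5), (2, 8), (6, 9), (3, 4), (7, 10), (0, 2)]
events, checkpoints = build(intervals, every=4)
```

Each query replays at most `every` events past a checkpoint, so query cost stays bounded while the checkpoints remain compact and memory-local.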
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of
developing big data programs and applications. However, the jobs in these frameworks are roughly defined and
packaged as executable jars without any functionality being exposed or described. This means that deployed
jobs are not natively composable and reusable for subsequent development. It also hampers the ability
to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the
Hierarchically Distributed Data Matrix (HDM) which is a functional, strongly-typed data representation for
writing composable big data applications. Along with HDM, a runtime framework is provided to support the
execution, integration and management of HDM applications on distributed infrastructures. Based on the
functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of
executing HDM jobs. The experimental results show that our optimizations can achieve improvements of between
10% and 40% in job completion time for different types of applications when compared with the current
state of the art, Apache Spark.
ETPL BD -
044
HDM: A Composable Framework for Big Data Processing
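One concrete payoff of a functional data representation is that the data-flow graph can be rewritten before execution, for example fusing consecutive map stages so the data is traversed once. A minimal sketch of that rewrite (illustrative only; HDM's planner operates on typed dependency graphs, not Python lists):

```python
def fuse(ops):
    # Fuse consecutive ("map", f) stages into a single composed map,
    # one of the rewrites a functional data-flow representation enables.
    out = []
    for kind, f in ops:
        if kind == "map" and out and out[-1][0] == "map":
            g = out[-1][1]
            out[-1] = ("map", lambda x, g=g, f=f: f(g(x)))
        else:
            out.append((kind, f))
    return out

def run(ops, data):
    # Naive interpreter: one pass over the data per remaining stage.
    for kind, f in ops:
        if kind == "map":
            data = [f(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if f(x)]
    return data

pipeline = [("map", lambda x: x + 1), ("map", lambda x: x * 2),
            ("filter", lambda x: x > 4), ("map", lambda x: x - 1)]
fused = fuse(pipeline)
```

The fused plan has one stage fewer but produces identical results, which is exactly the kind of semantics-preserving optimization opaque executable jars cannot receive.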
In this paper, we propose a general model to address the overfitting problem in online similarity learning for big
data, which is generally caused by two kinds of redundancy: 1) feature redundancy, that is, there exist
redundant (irrelevant) features in the training data; and 2) rank redundancy, that is, non-redundant (or relevant)
features lie in a low-rank space. To overcome these, our model is designed to obtain a simple and robust metric
matrix through detecting the redundant rows and columns in the metric matrix and constraining the remaining
matrix to a low-rank space. To reduce feature redundancy, we employ the group sparsity regularization, i.e., the
ℓ2,1 norm, to encourage a sparse feature set. To address rank redundancy, we adopt the low-rank regularization,
the max norm, instead of calculating the SVD as in traditional models using the nuclear norm. Therefore, our
model can not only generate a low-rank metric matrix to avoid overfitting but also achieve feature selection
simultaneously. For model optimization, an online algorithm based on the stochastic proximal method is derived
to solve this problem efficiently with a complexity of O(d²). To validate the effectiveness and efficiency of
our algorithms, we apply our model to online scene categorization and synthesized data and conduct experiments
on various benchmark datasets with comparisons to several state-of-the-art methods. Our model is as efficient
as the fastest online similarity learning model OASIS, while performing generally as well as the accurate model
OMLLR. Moreover, our model can exclude irrelevant/redundant feature dimensions simultaneously.
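The ℓ2,1 regularizer mentioned above has a closed-form proximal operator, which is what makes a stochastic proximal update cheap. A minimal NumPy sketch (not the authors' implementation) of the row-wise group soft-thresholding step:

```python
import numpy as np

def prox_l21(M, lam):
    """Proximal operator of lam * ||M||_{2,1}: shrink each row of the metric
    matrix by its l2 norm; rows with norm <= lam are zeroed entirely,
    which is how whole redundant feature dimensions get excluded."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return M * scale

M = np.array([[3.0, 4.0],    # row norm 5.0  -> shrunk, kept
              [0.1, 0.1]])   # row norm ~0.14 -> zeroed out
print(prox_l21(M, 1.0))      # [[2.4, 3.2], [0.0, 0.0]]
```

In an online setting, each iteration would take a stochastic gradient step on the similarity loss and then apply `prox_l21` (and the max-norm projection) to the updated matrix.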
ETPL BD -
045
Online Similarity Learning for Big Data with Overfitting
In the last decade Digital Forensics has experienced several issues when dealing with network evidence.
Collecting network evidence is difficult due to its volatility. In fact, such information may change over time,
may be stored on a server outside the jurisdiction, or may be geographically far from the crime scene. On the other hand, the
explosion of Cloud Computing as an implementation of the Software as a Service (SaaS) paradigm is
pushing users toward remote data repositories such as Dropbox, Amazon Cloud Drive, Apple iCloud, Google
Drive, and Microsoft OneDrive. In this paper, a novel methodology for the collection of network
evidence is proposed. In particular, it focuses on the collection of information from online services, such as web pages,
chats, documents, photos and videos. The methodology is suitable for both expert and non-expert analysts as it
“drives” the user through the whole acquisition process. During the acquisition, the information received from
the remote source is automatically collected. It includes not only network packets, but also any information
produced by the client upon its interpretation (such as video and audio output). A trusted-third-party, acting as
a digital notary, is introduced in order to certify both the acquired evidence (i.e., the information obtained from
the remote service) and the acquisition process (i.e., all the activities performed by the analysts to retrieve it). A
proof-of-concept prototype, called LINEA, has been implemented to perform an experimental evaluation of the
methodology.
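The certification role of the trusted third party could, for instance, hash each acquired artifact and chain the digests so that later tampering with any item is detectable. The following is a hypothetical sketch only (`certify` and its record layout are invented names, not LINEA's actual implementation):

```python
import hashlib
import json
import time

def certify(artifacts: dict) -> dict:
    """Hypothetical notary record: per-artifact SHA-256 digests plus a
    running hash chain over them, so altering or reordering any acquired
    item after the fact would break the chain."""
    record, chain = [], b""
    for name, blob in sorted(artifacts.items()):
        digest = hashlib.sha256(blob).hexdigest()
        chain = hashlib.sha256(chain + digest.encode()).digest()
        record.append({"artifact": name, "sha256": digest})
    return {"items": record, "chain": chain.hex(), "timestamp": time.time()}

rec = certify({"page.html": b"<html>...</html>", "chat.log": b"hello"})
print(json.dumps(rec["items"], indent=2))
```

A real notary would additionally sign the record and anchor the timestamp, so that both the evidence and the acquisition process are certified, as the methodology requires.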
ETPL BD -
046
A Novel Methodology to Acquire Live Big Data Evidence from the Cloud
Lifestyles are a valuable model for understanding individuals’ physical and mental lives, comparing social
groups, and making recommendations for improving people's lives. In this paper, we examine and compare
lifestyle behaviors of people living in cities of different sizes, utilizing freely available social media data as a
large-scale, low-cost alternative to traditional survey methods. We use the Greater New York City area as a
representative for large cities, and the Greater Rochester area as a representative for smaller cities in the United
States. We employed matrix factor analysis as an unsupervised method to extract salient mobility and work-rest
patterns for a large population of users within each metropolitan area. We discovered interesting human behavior
patterns at both a larger scale and a finer granularity than is present in previous literature, some of which allow
us to quantitatively compare the behaviors of individuals living in big cities to those living in small cities. We
believe that our social media-based approach to lifestyle analysis represents a powerful tool for social computing
in the big data age.
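Matrix factor analysis of this kind can be sketched with non-negative matrix factorization on a toy user-by-hour activity matrix (an illustrative stand-in for the paper's actual method and data): each factor row becomes one salient daily rhythm, and each user's loadings say how strongly they follow it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy activity matrix: 50 users x 24 hours of posting counts.
V = rng.poisson(lam=2.0, size=(50, 24)).astype(float)

def nmf(V, k=3, iters=200, eps=1e-9):
    """Multiplicative-update NMF: V ~= W @ H, with rows of H as latent
    daily activity patterns and W as per-user loadings on them."""
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        # Standard Lee-Seung updates; both factors stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V)
print(H.shape)  # each of the 3 rows is one extracted daily rhythm
```

Comparing the extracted `H` patterns between two metropolitan areas is then a direct, quantitative way to contrast lifestyles, in the spirit of the study above.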
ETPL BD -
047
Tales of Two Cities: Using Social Media to Understand Idiosyncratic Lifestyles in
Distinctive Metropolitan Areas
In Big Data era, applications are generating orders of magnitude more data in both volume and quantity. While
many systems emerge to address such data explosion, the fact that these data’s descriptors, i.e., metadata, are
also “big” is often overlooked. The conventional approach to address the big metadata issue is to disperse
metadata across multiple machines. However, it is extremely difficult to preserve both load balance and data
locality in this approach. To this end, in this work we propose hierarchical indirection layers for indexing the
underlying distributed metadata. By doing this, data locality is achieved efficiently through the indirection while load
balance is preserved. Three key challenges exist in this approach, however: first, how to achieve high resilience;
second, how to ensure flexible granularity; and third, how to limit performance overhead. To address the above
challenges, we design Dindex, a distributed indexing service for metadata. Dindex incorporates a hierarchy of
coarse-grained aggregation and horizontal key-coalition. Theoretical analysis shows that the overhead of
building Dindex is compensated for by only two or three queries. Dindex has been implemented on a lightweight
distributed key-value store and integrated into a fully-fledged distributed filesystem. Experiments demonstrated
that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.
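The indirection idea can be illustrated with a toy sketch (invented names, not Dindex's actual design): metadata entries stay hash-partitioned for load balance, while a coarse-grained layer records which shards hold a directory's children, so a directory scan contacts only those shards instead of broadcasting to every server.

```python
import hashlib

SERVERS = 4

def shard(key: str) -> int:
    """Hash partitioning balances load but scatters a directory's entries
    across servers, destroying locality for directory-level queries."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SERVERS

class IndirectionIndex:
    """Coarse-grained indirection layer: one aggregate entry per directory
    recording the set of shards holding its children."""
    def __init__(self):
        self.dir_to_shards = {}

    def insert(self, path: str):
        parent = path.rsplit("/", 1)[0] or "/"
        self.dir_to_shards.setdefault(parent, set()).add(shard(path))

    def shards_for(self, directory: str) -> set:
        # A scan of `directory` only needs to contact these shards.
        return self.dir_to_shards.get(directory, set())

idx = IndirectionIndex()
for p in ["/a/x", "/a/y", "/b/z"]:
    idx.insert(p)
print(idx.shards_for("/a"))  # a subset of the 4 shards, not all of them
```

The aggregate entry is small and cheap to maintain on insert, which matches the abstract's claim that the indexing overhead is amortized after only a few queries.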
ETPL BD -
048
Toward Efficient and Flexible Metadata Indexing of Big Data System
Image categorisation is an active yet challenging research topic in computer vision, which is to classify the
images according to their semantic content. Recently, fine-grained object categorisation has attracted wide
attention and remains difficult due to feature inconsistency caused by smaller inter-class and larger intra-class
variation as well as widely varying poses. Most existing frameworks focus on exploiting a more
discriminative imagery representation or developing a more robust classification framework to mitigate these
issues. Attention has recently been paid to discovering the dependency across fine-grained class labels
based on Convolutional Neural Networks. Encouraged by the success of semantic label embedding to discover
the fine-grained class labels’ correlation, this paper exploits the misalignment between visual feature space and
semantic label embedding space and incorporates it as a privileged information into a cost-sensitive learning
framework. Because it captures both the variation of the imagery feature representation and the label correlation
in the semantic label embedding space, such a visual-semantic misalignment can be employed to reflect the
importance of instances, which is more informative than conventional cost sensitivities. Experimental results
demonstrate the effectiveness of the proposed framework on public fine-grained benchmarks, achieving
performance superior to the state of the art.
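One plausible reading of the misalignment-as-cost idea can be sketched as follows (an illustrative toy, not the paper's exact formulation; `P` is an assumed learned visual-to-semantic projection and all data are synthetic): instances whose projected visual features land far from their label's embedding get higher cost weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d_vis, d_sem, n_cls, n = 8, 4, 10, 6
X = rng.normal(size=(n, d_vis))       # visual features (synthetic)
E = rng.normal(size=(n_cls, d_sem))   # semantic label embeddings (synthetic)
y = rng.integers(0, n_cls, size=n)    # class labels
P = rng.normal(size=(d_vis, d_sem))   # assumed learned projection

# Visual-semantic misalignment: distance between each projected feature
# and its own label's embedding; larger misalignment -> harder instance
# -> larger weight in a cost-sensitive loss.
mis = np.linalg.norm(X @ P - E[y], axis=1)
costs = mis / mis.sum()               # normalized per-instance cost weights
print(costs.round(3))
```

These weights would then multiply the per-instance loss terms of the classifier, which is the general shape of a cost-sensitive learning framework using misalignment as privileged information.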
ETPL BD -
049
Learning to Classify Fine-Grained Categories with Privileged Visual-Semantic
Misalignment