Services Transactions on Big Data (ISSN 2326-442X) Vol. 3, No. 1, 2016


STBD Editorial Board

Editors-in-Chief
Ernesto Damiani, Università degli Studi di Milano, Italy
Paul Hofmann, Saffron Technology Inc., USA

Associate Editors
Shiyong Lu, Wayne State University, USA
Min Luo, Huawei, China

Editorial Board
Alfredo Cuzzocrea, University of Calabria, Italy
Althea Liang, HP Lab, Singapore
Andy Twigg, Oxford University, UK
Badari Narayana Thyamagondlu Nagarajasharman, University of New South Wales, Australia
Bin Wu, Beijing University of Posts and Telecommunications, China
Budak Arpinar, University of Georgia, USA
Byron Choi, Hong Kong Baptist University, Hong Kong
Carolyn McGregor, University of Ontario Institute of Technology, Canada
Chiara Braghin, Università degli Studi di Milano, Italy
Chi-Hung Chi, CSIRO, Australia
Claudio Ardagna, Università degli Studi di Milano, Italy
Du Li, Ericsson, USA
Gabriele Ruffatti, Engineering Ingegneria Informatica S.p.A., Italy
Gong Zhang, Oracle, USA
Hamid Motahari, University of New South Wales, Australia
Hesham Hallal, University of Tabuk, Saudi Arabia
Huan Chen, Kingdee Research, China
Jian Yin, Sun Yat-Sen University, China
Jia Zhang, Carnegie Mellon University - Silicon Valley, USA
Marcello Leida, EBTIC, UAE
Maryam Panahizar, Kno.e.sis, USA
Ming-Chien Shan, SAP Research, USA
Omar Hasan, INSA Lyon, France
Paul Rosen, University of Utah, USA
Peng Han, Chongqing Research Center for Information and Automation Technology (CIAT), China
Piero Fraternali, Politecnico di Milano, Italy
Rafael Accorsi, University of Freiburg, Germany
Rajdeep Bhomwik, Cisco Systems, Inc., USA
Raymond Wong, University of New South Wales, Australia
Shailesh Kumar, Google India, India
Stelvio Cimato, Università degli Studi di Milano, Italy
Srividya Kona, Arizona State University, USA
Suren Byna, Lawrence Berkeley National Laboratory, USA
Valerio Bellandi, Università degli Studi di Milano, Italy
Weining Qian, East China Normal University, China
Yongqiang He, Facebook, USA
Zhanhuai Li, Northwestern Polytechnical University, China
Zhe Shan, Manhattan College, USA
Zhixiong Chen, Mercy College, USA


Services Transactions on Big Data

2016, Vol. 3, No. 1

Table of Contents

iii. Editor-in-Chief Preface
Ernesto Damiani, Università degli Studi di Milano, Italy
Paul Hofmann, Saffron Technology Inc., USA

vi. Call for Articles: STBD special issue of application oriented innovations

Research Articles

1   An Experimental Investigation of Mobile Network Traffic Prediction Accuracy
Ali Yadavar Nikravesh, Samuel A. Ajila, Chung-Horng Lung, Department of Systems and Computer Engineering, Carleton University, Canada
Wayne Ding, LTE System, Business Unit Radio, Ericsson, Canada

17  An Investigation of Mobile Network Traffic Data and Apache Hadoop Performance
Man Si, Chung-Horng Lung, Samuel Ajila, Department of Systems and Computer Engineering, Carleton University, Canada
Wayne Ding, RAN System, Business Unit Radio, Ericsson, Canada

32  On Developing the RaaS
Chen-Cheng Ye, School of Marine Science and Technology, Northwestern Polytechnical University, China
Huan Chen, Liang-Jie Zhang, Xin-Nan Li, Hong Liang, Kingdee Research, Kingdee International Software Group Company Limited, China, and National Engineering Research Center for Supporting Software of Enterprise Internet Services, China

44  Evaluations of Big Data Processing
Duygu Sinanc Terzi, Seref Sagiroglu, Department of Computer Engineering, Gazi University, Turkey
Umut Demirezen, STM Defense Technologies Engineering and Trade Inc., Turkey


Editor-in-Chief Preface

Ernesto Damiani, Università degli Studi di Milano, Italy
Paul Hofmann, Saffron Technology, USA

Welcome to Services Transactions on Big Data (STBD). Big Data is a broad term for data sets so large, complex, or incomplete that current data technologies cannot process or handle them. From the technology foundation perspective, Big Data covers the science and technology needed to bridge the gap between data services and business and R&D. All topics regarding the study and management of data align with the theme of STBD. Specifically, we focus on:

• Big Data Models and Algorithms (Foundational Models for Big Data; Algorithms and Programming Techniques for Big Data Processing; Big Data Analytics and Metrics; Representation Formats for Multimedia Big Data)

• Big Data Architectures (Cloud Computing Techniques for Big Data; Big Data as a Service; Big Data Open Platforms; Big Data in Mobile and Pervasive Computing)

• Big Data Management (Big Data Persistence and Preservation; Big Data Quality and Provenance Control; Management Issues of Social Network enabled Big Data)

• Big Data Protection, Integrity and Privacy (Models and Languages for Big Data Protection; Privacy Preserving Big Data Analytics; Big Data Encryption)

• Security Applications of Big Data (Anomaly Detection in Very Large Scale Systems; Collaborative Threat Detection using Big Data Analytics)

• Big Data Search and Mining (Algorithms and Systems for Big Data Search; Distributed and Peer-to-peer Search; Machine Learning based on Big Data; Visualization Analytics for Big Data)

• Big Data for Enterprise, Government and Society (Big Data Economics; Real-life Case Studies of Value Creation through Big Data Analytics; Big Data for Business Model Innovation; Big Data Toolkits; Big Data in Business Performance Management; SME-centric Big Data Analytics; Big Data for Vertical Industries, including Government, Healthcare, and Environment; Scientific Applications of Big Data; Large-scale Social Media and Recommendation Systems; Experiences with Big Data Project Deployments; Big Data in Enterprise Management Models and Practices; Big Data in Government Management Models and Practices; Big Data in Smart Planet Solutions; Big Data for Enterprise Transformation)

Services Transactions on Big Data (STBD) is designed to be an important platform for disseminating high-quality research on the above topics in a timely manner and to provide an ongoing platform for continuous discussion of research published in this journal. To ensure quality, STBD only considers expanded versions of papers presented at high-quality


conferences and key survey articles, such as BigData Congress, ICWS, SCC, MS, CLOUD, and SERVICES. This issue collects four papers presenting the latest research on the application and overview of big data and other relevant technologies in practical scenarios.

The first article is titled “An Experimental Investigation of Mobile Network Traffic Prediction Accuracy”. The authors applied time-series analysis to identify the significant factors that drive network traffic and investigated the accuracy of three machine learning techniques – Multi-Layer Perceptron (MLP), Multi-Layer Perceptron with Weight Decay (MLPWD), and Support Vector Machines (SVM) – in predicting the components of commercial trial mobile network traffic.

The second article is titled “An Investigation of Mobile Network Traffic Data and Apache Hadoop Performance”. The authors presented an application of data analytics, focusing on processing and analyzing two datasets from a commercial trial mobile network, and gave a detailed description of how Apache Hadoop and the Mahout machine learning library were used to process and analyze the datasets.

The third article is titled “On Developing the RaaS”. The authors proposed a comprehensive solution for Ranking as a Service (RaaS). RaaS uses a combination weighting method that overcomes the defects of purely subjective and purely objective weighting methods, and the article presents a complete case of a ranking service for understanding the internals of APIs.

The fourth article is titled “Evaluations of Big Data Processing”. The authors presented a roadmap of the chronological development of batch, real-time, and hybrid technologies, along with their advantages and disadvantages, summarized an overview of big data concepts, and reviewed techniques, technologies, tools, and platforms for big data.

We would like to thank the authors for their efforts in delivering these four quality articles. We would also like to thank the reviewers, as well as the Program Committee of the IEEE BigData Congress, for their help with the review process.

About the Publication Lead

Liang-Jie (LJ) Zhang is Senior Vice President, Chief Scientist, and Director of Research at Kingdee International Software Group Company Limited, and a director of The Open Group. Prior to joining Kingdee, he was a Research Staff Member and Program Manager of Application Architectures and Realization at the IBM Thomas J. Watson Research Center, as well as the Chief Architect of Industrial Standards at IBM Software Group. Dr. Zhang has published more than 140 technical papers in journals, book chapters, and conference proceedings. He has 40 granted patents and more than 20 pending patent applications. Dr. Zhang received his Ph.D. in Pattern Recognition and Intelligent Control from Tsinghua University in 1996. He chaired the IEEE Computer Society's Technical Committee on Services Computing from 2003 to 2011. He also chaired the Services Computing Professional Interest Community at IBM Research from 2004 to 2006. Dr. Zhang has served as the Editor-in-Chief of the International Journal of Web Services Research since 2003 and was the founding Editor-in-Chief of IEEE Transactions on Services Computing. He was elected an IEEE Fellow in 2011, and in the same year won the Technical Achievement Award "for pioneering contributions to Application Design Techniques in Services Computing" from the IEEE Computer Society. Dr. Zhang also chaired the 2013 IEEE International Congress on Big Data and the 2009 IEEE International Conference on Cloud Computing (CLOUD 2009).

About the Editor-in-Chief

Ernesto Damiani is a full professor at the Università degli Studi di Milano and the Head of the PhD program in computer science. Ernesto's areas of interest include cloud-based service and process analysis, processing of semi- and unstructured information, and knowledge representation and sharing. Ernesto has published several books and about 300 papers and international patents. He leads or has led a number of international research projects: he is the Principal Investigator of the ASSERT4SOA project (STREP) on the certification of SOA, and leads the activity of the SESAR research unit within the SecureSCM (STREP), ARISTOTELE (IP), PRACTICE (IP), ASSERT4SOA (STREP), and CUMULUS (STREP) projects funded by the EC in the 7th Framework Programme. Ernesto has been an Associate Editor of the IEEE Transactions on Services Computing since its inception. Ernesto is also Editor-in-Chief of the International Journal of Knowledge and Learning (Inderscience) and of the International Journal of Web Technology and Engineering (IJWTE). He has served and is serving in all capacities on many congress, conference, and workshop committees. He is a Senior Member of the IEEE and an ACM Distinguished Scientist.

Paul Hofmann is Chief Technology Officer of Saffron Technology, Inc., responsible for Saffron's technology direction and product management. Before joining Saffron in 2012, Paul was Vice President of Research at SAP Labs in Palo Alto and also worked for the SAP Corporate Venturing Group. Paul joined SAP in 2001 as Director for Business Development EMEA at SAP AG, where he created the Value Based Selling program. Paul was a visiting scientist at the Civil and Environmental Engineering Department at MIT, Cambridge, MA, in 2009. Prior to joining SAP, Paul was Senior Plant Manager at BASF's Global Catalysts Business Unit in Ludwigshafen, Germany. Paul has been entrenched in research as a Senior Scientist and Assistant Professor at outstanding European and American universities (Northwestern University, U.S.; Technical University Munich and Darmstadt, Germany), with a Ph.D. and research and teaching in Nonlinear Quantum Dynamics and Chaos Theory. He is an expert in computational chemistry and computer graphics and has authored numerous publications and books, including a book on SCM and environmental information systems as well as performance management and productivity of supply chains.

Call for Articles


STBD special issue of application oriented innovations

Big Data is a dynamic discipline. It has become a valuable resource and mechanism for practitioners and researchers to explore the value of data sets in all kinds of business scenarios and scientific work. From the industry perspective, IBM, SAP, Oracle, Google, Microsoft, Yahoo, and other leading software and internet service companies have launched their own innovation initiatives around big data.

Services Transactions on Big Data (STBD) covers topics related to advancements in the state-of-the-art standards and practices of Big Data, as well as emerging research topics that are going to define the future of Big Data, including strategy planning, business architecture, application architecture, data architecture, technology architecture, design, development, deployment, operational practices, analytics, optimization, security, and privacy.

STBD is now launching a special issue that focuses on application-oriented innovations. Papers should generally present results from real-world development, deployment, and experiences delivering Big Data solutions. They should also provide information such as lessons learned or general advice gained from Big Data experience. Other appropriate sections are general background on the solutions, an overview of the solutions, and directions for future innovation and improvement of Big Data. Authors should submit papers (12 pages minimum, 24 pages maximum) related to the following practical topics:

1. Architecture practice of Big Data
2. Big Data management practice
3. Emerging algorithms from real-world scenarios
4. Security applications of Big Data
5. Big Data search and mining practice
6. Enterprise-level Big Data tooling
7. Innovative ideas from TED talks

Please note that this special issue mainly considers papers from real-world practices. In addition, STBD only considers extended versions of papers published in reputable related conferences.

Sponsored by the Services Society, published STBD papers will be made accessible for ease of citation and knowledge sharing, in addition to paper copies. All published STBD papers will be promoted and recommended to potential authors of future editions of related reputable conferences such as IEEE BigData Congress, ICWS, SCC, CLOUD, SERVICES, and MS. If you have any questions or queries about STBD, please send email to IJBD AT ServicesSociety.org.


AN EXPERIMENTAL INVESTIGATION OF MOBILE NETWORK TRAFFIC PREDICTION ACCURACY

Ali Yadavar Nikravesh, Samuel A. Ajila, Chung-Horng Lung
Department of Systems and Computer Engineering, Carleton University
{alinikravesh, ajila, chlung}@sce.carleton.ca

Wayne Ding
LTE System, Business Unit Radio, Ericsson
[email protected]

Abstract: The growth in the number of mobile subscriptions has led to a substantial increase in mobile network bandwidth demand. Mobile network operators need to provide enough resources to meet the huge network demand and provide a satisfactory level of Quality-of-Service (QoS) to their users. At the same time, to reduce cost, operators need an efficient network plan that helps them provide cost-effective services with a high degree of QoS. To devise such a plan, operators should have in-depth insight into the characteristics of the network traffic. This paper applies the time-series analysis technique to decompose the traffic of a commercial trial mobile network into components and to identify the significant factors that drive the traffic of the network. The analysis results are further used to enhance the accuracy of predicting the mobile traffic. In addition, this paper investigates the accuracy of machine learning techniques – Multi-Layer Perceptron (MLP), Multi-Layer Perceptron with Weight Decay (MLPWD), and Support Vector Machines (SVM) – in predicting the components of the commercial trial mobile network traffic. The experimental results show that using different prediction models for different network traffic components increases the overall prediction accuracy by up to 17%. The experimental results can help network operators predict future resource demands more accurately and facilitate provisioning and placement of mobile network resources for effective resource management.

Keywords: Mobile Network, Traffic Analysis, Prediction, Multi-Layer Perceptron, Multi-Layer Perceptron with Weight Decay, Support Vector Machine

__________________________________________________________________________________________________________________

1. INTRODUCTION

In recent years, mobile data traffic has increased rapidly [1]. Analysis reports show that mobile data traffic grows by 60 percent year-on-year [2]. The growth in mobile data traffic is due to the rising number of smartphone subscriptions, as well as the increasing data consumption per subscriber [2]. The ubiquity of smartphones and the increasing amount of data generated by mobile phone users give rise to enormous datasets which can be used to characterize and understand user mobility, communication, and interaction patterns [3].

The increase in mobile network traffic forces mobile network operators to deal with a network resource management issue. Deciding the right amount of network resources is a nontrivial task and may lead to either under-provisioning or over-provisioning. An under-provisioning condition occurs when the provisioned network resources are not adequate to serve the workload of the network, which may cause user dissatisfaction. On the other hand, an over-provisioning condition is the result of provisioning an excessive amount of network resources, which wastes valuable network resources, such as spectrum. To prevent under-provisioning and over-provisioning, mobile network operators need to gain insight into the factors that affect network traffic. Knowing the characteristics of the network traffic helps operators devise an effective resource provisioning plan and accommodate future traffic demands more efficiently.

This paper investigates the network resource provisioning issue. Network resource provisioning is similar to cloud resource provisioning, and intensive research has been conducted on cloud resource management and provisioning. The objective of this research is to apply the methodology and techniques used in predicting cloud resources to the prediction of network resource usage, using a real-life dataset from a commercial trial mobile network.

A comparison of three machine learning algorithms (i.e., Support Vector Machine, Multi-Layer Perceptron, and Multi-Layer Perceptron with Weight Decay) for predicting the future traffic of a mobile network is presented in our previous paper [4]. According to [4], the dimensionality of the traffic data can affect the prediction accuracy of the regression models. Based on our results, Support Vector Machine (SVM) outperforms Multi-Layer Perceptron with Weight Decay (MLPWD) and Multi-Layer Perceptron (MLP) in predicting multidimensional real-life traffic data, while MLPWD has better accuracy in predicting unidimensional data. In addition, the experimental results in [4] indicate that using multidimensional traffic datasets significantly increases the prediction accuracy of the MLP, MLPWD, and SVM algorithms. However, since none of the future values of the network traffic attributes is known a priori, it is not feasible to use a multidimensional dataset to predict future network traffic. Therefore, the goal of this paper is to improve the accuracy of the regression models in predicting unidimensional mobile network traffic datasets.

This paper enhances our previous unidimensional prediction results by further analyzing the individual factors that affect mobile network traffic. It uses the time-series analysis technique to decompose the mobile network traffic into trend, seasonal, and remainder components. The trend component shows the long-term direction of the traffic and indicates whether the network traffic is increasing or decreasing. The seasonal component represents the repetitive and predictable movement around the trend line of the network traffic. Knowing the trend and the seasonality of the traffic helps network operators accommodate future traffic more efficiently.
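To make the decomposition concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of an additive decomposition of a synthetic traffic series. It assumes a known period (e.g., 24 samples for daily seasonality in hourly data) and uses a simple centered moving average for the trend; a production analysis would typically rely on a library routine such as STL.

```python
import numpy as np

def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + remainder.

    The trend is a moving average over one full period; the seasonal
    component is the mean detrended value at each position in the period.
    """
    series = np.asarray(series, dtype=float)
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")       # smooth out seasonality
    detrended = series - trend
    pattern = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(pattern, len(series) // period + 1)[: len(series)]
    remainder = series - trend - seasonal                  # what neither part explains
    return trend, seasonal, remainder

# Synthetic "traffic": a rising trend plus a daily cycle (period = 24).
t = np.arange(96)
traffic = 0.5 * t + 10 * np.sin(2 * np.pi * t / 24)
trend, seasonal, remainder = decompose(traffic, 24)
```

By construction the three components sum back to the original series; the relative size of the remainder indicates how much of the traffic the trend and seasonality fail to explain.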

Moreover, this paper investigates the accuracy of the MLP, MLPWD, and SVM algorithms in predicting the future values of each of the network traffic components (i.e., trend, seasonal, and remainder). This makes it possible to improve the overall prediction accuracy by using the most accurate prediction model for each of the traffic components. According to [5], SVM and Artificial Neural Networks (ANN) are effective algorithms for predicting future system characteristics. Therefore, in this paper we compare SVM and two variations of the ANN algorithm (i.e., MLP and MLPWD) to verify their accuracy in predicting the future behavior of the mobile network traffic's components. The main difference between the MLP and MLPWD algorithms is the risk minimization principle that is used to create their regression models. Therefore, comparing MLP with MLPWD shows the impact of the risk minimization principle on the prediction accuracy of artificial neural networks. Furthermore, comparing MLPWD and MLP with SVM helps evaluate the ANN approach against support vector machines and highlights the capability of each prediction model to predict the components of the mobile network traffic. The contributions of this paper are:

• Decomposing a real-life mobile network traffic dataset into components and providing insight into the factors that are likely to affect future network traffic.

• Comparing the accuracy of the MLP, MLPWD, and SVM algorithms to predict future behavior of individual components of the mobile network traffic.

• Analyzing the impact of the sliding window size on the prediction accuracy of MLP, MLPWD, and SVM algorithms.
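The sliding-window setup referenced in the last contribution can be sketched as follows (an illustrative helper of our own, not code from the paper): each training input is a window of consecutive past observations, and the target is the observation that immediately follows, so the window size directly controls how much history the regression model sees.

```python
import numpy as np

def sliding_windows(series, window):
    """Build (X, y) pairs for one-step-ahead prediction.

    Row i of X holds observations [i, i + window); y[i] is observation
    i + window, the value a regression model should predict from that row.
    """
    series = np.asarray(series, dtype=float)
    X = np.array([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

X, y = sliding_windows([1, 2, 3, 4, 5, 6], window=3)
# X[0] is [1, 2, 3] and y[0] is 4: three past samples predict the next one.
```

Re-running the same experiment with different `window` values is how the impact of the sliding window size on prediction accuracy can be measured.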

The remainder of the paper is organized as follows: Section 2 discusses the background and related work. This is followed by the presentation of the experiments and analysis of the results in Section 3. Conclusions and possible directions for future research are discussed in Section 4.

2. BACKGROUND AND RELATED WORK

This section briefly introduces the fundamental concepts used in the paper. Sub-section 2.1 describes the machine learning concepts and the prediction algorithms used in the experiment. Sub-section 2.2 presents an overview of mobile network resource provisioning approaches.

2.1 MACHINE LEARNING

Machine learning is the study of algorithms which can learn complex relationships or patterns from empirical data and make accurate decisions [6]. Machine learning includes a broad range of techniques such as data pre-processing, feature selection, classification, regression, association rules, and visualization. In big data analytics, machine learning techniques can help extract insightful information from enormous datasets and identify hidden relationships.

Vapnik indicates that machine learning corresponds to the problem of function approximation [7]. Based on this definition, the objective of machine learning regression is to find the best available approximation to a given time-series. To choose the best approximation, the loss between the actual values of the time-series and the response provided by the learning machine should be measured. The expected value of the loss, given by the risk functional, is:

R(w) = \int L(y, f(x, w)) \, dP(x, y) \qquad (1)

where f(x, w) is the response provided by the machine learning algorithm, given input x and function parameters w; y is the actual value of the time-series; and L(y, f(x, w)) is the loss function. The problem in minimizing the risk functional is that the joint probability distribution P(x, y) = P(y|x)P(x) is unknown and the only available information is contained in the training set. In other words, we only have the supervisor's response for the training set, and there is no access to the supervisor's response for the testing data set.

Since in regression problems the actual future values of the time-series are unknown (i.e., P(y|x) is unknown), the loss function cannot be calculated. To solve this problem, Vapnik [7] proposes the induction principle of replacing the risk functional R(w) by the empirical risk functional:

E(w) = \frac{1}{l} \sum_{i=1}^{l} L\big(y_i, f(x_i, w)\big) \qquad (2)

where l is the size of the training dataset. The induction principle of empirical risk minimization (ERM) suggests that, in the presence of specific conditions, the learning machine that minimizes the empirical risk over the training dataset (i.e., E(w)) is the learning machine that minimizes the risk functional R(w). Therefore, the function with the minimum empirical risk is the best approximation to the time-series.
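As a concrete reading of equation (2), the sketch below (our own, with squared loss standing in for L) computes the empirical risk of a set of predictions over a training set; ERM simply prefers the model whose average is lowest.

```python
def empirical_risk(y_true, y_pred, loss=lambda y, f: (y - f) ** 2):
    """Empirical risk E(w): the average loss over the l training examples,
    per equation (2); squared loss is used here as an example of L."""
    assert len(y_true) == len(y_pred)
    return sum(loss(y, f) for y, f in zip(y_true, y_pred)) / len(y_true)

# Two candidate models over the same training series: ERM prefers
# the one with the lower average loss.
risk_a = empirical_risk([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # one large error
risk_b = empirical_risk([1.0, 2.0, 3.0], [1.1, 2.1, 3.1])  # small errors everywhere
```

Note that this average is computed only over the training set; as the next paragraph explains, minimizing it too aggressively invites over-fitting.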

Vapnik also proves that, in the presence of specific conditions, ERM can lose its precision due to the over-fitting problem [8]. To prevent over-fitting, the structural risk minimization (SRM) principle is proposed; it describes a general model of complexity control and provides a trade-off between hypothesis space complexity (i.e., VC-dimension) and the quality of fitting the training data.

In our problem domain, mobile network traffic represents the time-series that is to be predicted. This paper aims to find the most accurate regression model that predicts mobile network traffic’s future behavior. To this end, we compare three well-known machine learning regression models to investigate their precision in predicting network traffic. The machine learning regression models investigated in this paper are: MLP, MLPWD, and SVM. Each of the three algorithms is highlighted as follows:

2.1.1 MULTI-LAYER PERCEPTRON (MLP)

MLP is a feed-forward ANN that maps input data to appropriate outputs. An MLP is a network of simple neurons called perceptrons. A perceptron computes a single output from multiple real-valued inputs by forming a linear combination of the inputs using its weights and passing the result through a nonlinear activation function. The mathematical representation of the perceptron output is:

y = \varphi\Big(\sum_{i=1}^{n} w_i x_i + b\Big) = \varphi(W^T X + b) \qquad (3)

where W denotes the vector of weights, X is the vector of inputs, b is the bias, and φ is the activation function.

MLP networks are typically used in supervised learning, so there are training and testing datasets that are used to train and evaluate the model, respectively. Training an MLP refers to adapting all the weights and biases to their optimal values to minimize the following equation:

E = \frac{1}{l} \sum_{i=1}^{l} (T_i - Y_i)^2 \qquad (4)

where T_i denotes the predicted value, Y_i is the actual value, and l is the training set size. Equation (4) is a simplified version of equation (2) and represents the ERM. In other words, MLP uses the ERM approach to create its regression model.
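Equation (3) can be read directly as code. The sketch below (our own illustration, with an assumed tanh activation; the function names and the tiny example network are hypothetical) computes a single perceptron's output and chains layers into a small feed-forward pass:

```python
import numpy as np

def perceptron(x, w, b, phi=np.tanh):
    """One perceptron, per equation (3): y = phi(w . x + b)."""
    return phi(np.dot(w, x) + b)

def mlp_forward(x, layers):
    """Feed-forward pass: apply each (W, b, phi) layer in turn."""
    for W, b, phi in layers:
        x = phi(W @ x + b)
    return x

# With an identity activation the perceptron is just a weighted sum plus bias.
y = perceptron(np.array([1.0, 2.0]), np.array([3.0, 4.0]), 1.0, phi=lambda z: z)
# y == 3*1 + 4*2 + 1 == 12
```

Training then searches for the weights W and biases b of every layer that minimize equation (4) over the training set.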

2.1.2 MULTI-LAYER PERCEPTRON WITH WEIGHT DECAY (MLPWD)

The MLPWD uses the SRM approach to create its prediction model. In addition to the empirical risk, SRM describes a general model of capacity (or complexity) control and provides a trade-off between the complexity (i.e., VC-dimension) of the prediction model and the quality of fitting the training data. The general principle of SRM can be implemented in many different ways. According to [9], the first step in implementing SRM is to choose a class of functions with a hierarchy of nested subsets in order of increasing complexity. The authors in [7] suggest three approaches to build a nested set of the functions implemented by neural networks:

• Create the nested set of the functions by the architecture of the neural network. This approach creates the nested set by increasing the number of the neurons in the hidden layer of the neural network.

• Create the nested set of the functions by the learning procedure. In this approach the architecture of the neural network (i.e., the number of neurons and layers) is fixed. The nested set is created by changing the risk minimization equation (i.e., the learning procedure).

• Create the nested set of the functions by preprocessing the input of the neural network. In this approach the architecture and the learning procedure of the neural network are fixed. The nested set is created by changing the representation of the input of the neural network.

The second proposed structure (i.e., given by the learning procedure) uses "weight decay" to create a hierarchy of nested functions. This structure considers a set of functions S = {f(x, w), w ∈ W} implemented by a neural network with a fixed architecture, where the parameters {w} are the weights of the neural network. A nested structure is introduced through S_p = {f(x, w) : ||w|| ≤ C_p} with C_1 < C_2 < … < C_n, where C_i is a constant value that defines the ceiling of the norm of the neural network weights. For a convex loss function, the minimization of the empirical risk within the element S_p of the structure is achieved through the minimization of:

E(w, γ_p) = (1/l) Σ_{i=1}^{l} L(y_i, f(x_i, w)) + γ_p ||w||²    (5)

The nested structure can be created by appropriately choosing Lagrange multipliers γ_1 > γ_2 > … > γ_n.

Training MLP with weight decay means that during the training phase, each updated weight is multiplied by a factor slightly less than 1 to prevent the weights from growing too large. The risk minimization equation for MLPWD is:

E = (1/l) Σ_{i=1}^{l} (T_i − Y_i)² + (λ/2) Σ_j w_j²    (6)

where l, T_i, and Y_i are identical to those used in equation (4), w represents the weights in the neural network, and λ is the penalty coefficient of the sum of squares of the weights.
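As a hedged illustration (the numeric values below are invented, not taken from the paper's configuration), a short numpy sketch shows why the L2 penalty term in equation (6) is equivalent to multiplying each weight by a factor slightly less than 1 at every gradient step:

```python
import numpy as np

# One gradient-descent step on E = data_loss + (lambda/2) * sum(w^2).
# The penalty's gradient is lambda * w, so including it is the same as
# shrinking every weight by the factor (1 - rho * lam) before applying
# the data-loss gradient ("weight decay").
rho, lam = 0.3, 0.01                      # learning rate and penalty coefficient (illustrative)
w = np.array([1.0, -2.0, 0.5])            # hypothetical current weights
data_grad = np.array([0.1, -0.2, 0.05])   # hypothetical gradient of the data-loss term

# Weight-decay form: shrink, then step.
w_decayed = (1 - rho * lam) * w - rho * data_grad

# Full-gradient form: step on the complete objective E.
w_full = w - rho * (data_grad + lam * w)

assert np.allclose(w_decayed, w_full)     # the two updates coincide
```

The equivalence holds term by term: (1 − ρλ)w − ρg = w − ρ(g + λw).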

The authors in [10] have shown that the conventional weight decay technique can be considered a simplified version of structural risk minimization in neural networks. Therefore, in this paper we use the MLPWD algorithm to investigate the accuracy of neural networks using SRM in predicting mobile network traffic.

2.1.3 SUPPORT VECTOR MACHINE (SVM)

SVM is a learning algorithm used for binary classification. The basic idea is to find a hyper-plane that perfectly separates the multidimensional data into two classes. Because input data are often not linearly separable, SVM introduces the notion of a "kernel induced feature space", which casts the data into a higher dimensional space where the data are separable [11]. The key insight used in SVM is that the higher dimensional space does not need to be dealt with directly. In addition, similar to MLPWD, SVM uses SRM to create a regression model. Although SVM was originally designed for binary classification, it has also been extended to solve regression tasks, in which case it is termed Support Vector Regression (SVR). In this paper we use SVM and SVR interchangeably.
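As a hedged sketch (the paper uses WEKA's SMOreg; scikit-learn's SVR is assumed here as a stand-in, and the toy series is synthetic), an RBF-kernel support vector regressor can be fitted to a periodic signal as follows:

```python
import numpy as np
from sklearn.svm import SVR

# Toy periodic series standing in for network traffic (the real data are not public).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(t).ravel() + 0.1 * rng.standard_normal(200)

# The RBF kernel implicitly casts the inputs into a higher-dimensional feature
# space; C=1.0 mirrors the complexity parameter listed later in Table 4.
model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(t, y)
mae = np.mean(np.abs(model.predict(t) - y))
```

Only the support vectors (a subset of the training points) determine the fitted function, which is what keeps the model's complexity under control.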

2.2 NETWORK RESOURCE PROVISIONING APPROACHES

Different researchers have performed network traffic analysis and prediction. The objective of network traffic analysis is to gain insight into the types of network packets and the data flowing through a network. Network traffic analysis involves network data preprocessing, analysis (i.e., data mining), and evaluation.

Network traffic prediction (the scope of this paper) is useful for congestion control, admission control, and network bandwidth allocation [13]. The authors in [12] have categorized network traffic prediction techniques under three broad categories: linear time series models, nonlinear time series models, and hybrid models.

The linear time series models used to predict network traffic include Auto Regressive (AR), Moving Average (MA), and Autoregressive Moving Average (ARMA) techniques. Moving average generally generates poor results for time-series analysis [4]; therefore, it is usually applied only to remove noise from the time-series. In addition, the results of [14] show that the performance of auto-regression highly depends on the monitoring interval length, the size of the history window, and the size of the adaptation window. ARMA is a combination of the moving average and auto-regression algorithms and has been widely used for network traffic prediction [15][16][17].
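As a hedged illustration of the linear models above (a sketch on synthetic data, not the cited ARMA implementations), an AR(p) predictor can be fitted by ordinary least squares in a few lines of numpy:

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model: y_t ≈ c + a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    lags = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(y) - p), lags])   # [1, y_{t-1}, ..., y_{t-p}]
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef                                        # [c, a_1, ..., a_p]

def forecast_ar(y, coef, steps):
    """Iterated one-step-ahead forecasts from the end of the series."""
    p = len(coef) - 1
    history = list(y)
    for _ in range(steps):
        recent = history[-1:-p - 1:-1]                 # [y_{t-1}, ..., y_{t-p}]
        history.append(coef[0] + float(np.dot(coef[1:], recent)))
    return np.array(history[-steps:])

# A sinusoid with a daily period satisfies an exact AR(2) recurrence,
# so the fit recovers it and the forecast continues the pattern.
t = np.arange(200)
y = 10 + 3 * np.sin(2 * np.pi * t / 24)
pred = forecast_ar(y, fit_ar(y, p=2), steps=24)
```

On real traffic, the residual (noise) term that ARMA adds on top of this pure AR form is what the MA part models.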

Linear time series models are not accurate in environments with complex network traffic behaviors [12]. Therefore, researchers have used nonlinear time series models to forecast complex network traffic. ANN is the most popular nonlinear model used in existing research works to predict network traffic data [18][19][20]. ANN has different variations; two popular variations, MLP and MLPWD, are introduced in Section 2.1.

Hybrid model techniques are a combination of linear and nonlinear models [21][22]. The authors of [13] have compared the ARMA (i.e., linear), ANN (i.e., nonlinear), and FARIMA (i.e., hybrid) models and conclude that ANN outperforms the other models.

To the best of our knowledge, no research work has been published that looks at the prediction of a commercial network using a real-life dataset. However, there are research works analyzing and trying to understand network data. The work by Tang et al. [23] analyzes South China city network data and develops a Traffic Analysis System (TAS). The work by Esteves et al. [24] examines a twelve-week trace of a city building block's local area wireless network. Further work by Esteves et al. [25] analyzes the performance of the k-means and fuzzy k-means algorithms in the context of a Wikipedia dataset. In addition, provisioning of mobile network resources is mostly static; hence, the network cannot adapt well to traffic changes. In the absence of dynamic adaptation, the anticipated worst-case scenario is often considered, which mostly results in over-provisioning and, hence, a waste of resources.

3. EXPERIMENTS AND RESULTS

This section presents the experiments that decompose real-life mobile network traffic into its components and analyze the behavior of each of the traffic components. In addition, the experiment presented in Section 3.6 compares the accuracy of the MLP, MLPWD, and SVM algorithms in predicting the individual components of the network traffic.

3.1 DATA PREPARATION AND CLEANING

Experiments have been carried out using a real-life dataset from a commercial trial mobile network. The initial network traffic dataset was composed of 1,012,959 rows and 27 columns (features), each row representing the aggregated traffic of a cell (or a base station) in the network. Data were collected every hour between January 25, 2015 and January 31, 2015 from 5840 unique wireless network cells. To prepare the data for the experiment, we reduced the number of rows by selecting the data of one of the network cells. The cell with the maximum number of data points was chosen to be investigated in this research. This resulted in a new dataset with 175 rows (i.e., the selected network cell has 175 network traffic data points). Moreover, removing duplicated rows reduced the dataset size to 168 rows.
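The row-reduction steps above can be sketched with pandas. This is a hedged illustration: the column names (`cell_id`, `x12`) and toy values are hypothetical stand-ins for the real 27-column dataset, which is not public.

```python
import pandas as pd

# Toy stand-in for the hourly per-cell traffic table.
df = pd.DataFrame({
    "cell_id": [1, 1, 1, 2, 2, 1],
    "x12":     [5, 7, 7, 3, 4, 9],
})

# 1) Keep only the cell with the most data points.
busiest = df["cell_id"].value_counts().idxmax()
cell_df = df[df["cell_id"] == busiest]

# 2) Drop exact duplicate rows, as done to go from 175 to 168 rows.
cell_df = cell_df.drop_duplicates()
```

In this toy table cell 1 has four rows, one of which is a duplicate, so three rows survive.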

Parameter name  | Value | Description
generateRanking | True  | Whether or not to generate a ranking
numToSelect     | -1    | The number of attributes to retain; the default value (-1) indicates that all attributes are retained
startSet        | Null  | A set of attributes to ignore
threshold       | -1.79 | The threshold by which attributes can be discarded; the default value (-1.79) results in no attributes being discarded

Table 1. CorrelationAttributeEval Configuration Parameters

We also reduced the data dimension (i.e., the number of columns) by using the WEKA attribute selection tool. The attribute selection process in WEKA is separated into two parts [26]:

• Attribute evaluator: a method by which attribute subsets are assessed

• Search method: a method by which the space of possible subsets is searched.

In this research, the CorrelationAttributeEval algorithm and the Ranker algorithm are used as the attribute evaluator and the search method, respectively. Table 1 presents the CorrelationAttributeEval configuration parameters and their values.

Table 2 shows the attribute names, descriptions, and correlations of the data used for the analysis. Please note that the real attribute names are replaced by codes for information disclosure reasons.

According to Table 2, the last three attributes, X25, X26, and X27, are not correlated to the rest of the attributes (i.e., the correlation value is 0 for all three); hence, they are not useful for constructing the machine learning models. Therefore, these three attributes are discarded from the dataset, which results in a new dataset with 24 dimensions (attributes).
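The ranking-and-pruning step can be approximated outside WEKA as well. As a hedged sketch (synthetic data, hypothetical column names), WEKA's CorrelationAttributeEval ranks each attribute by its Pearson correlation with the class, and zero-correlation attributes are then discarded:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 168
target = pd.Series(rng.normal(size=n))          # stand-in for the class attribute
df = pd.DataFrame({
    "x1":  target * 0.8 + rng.normal(size=n),   # strongly correlated attribute
    "x2":  target * 0.5 + rng.normal(size=n),   # weakly correlated attribute
    "x27": np.ones(n),                          # constant column (e.g., a cell ID)
})

# Rank attributes by |Pearson correlation| with the class, then discard
# attributes whose correlation is zero (NaN for constant columns -> 0).
ranking = (df.apply(lambda col: col.corr(target))
             .abs().fillna(0.0).sort_values(ascending=False))
kept = ranking[ranking > 0].index.tolist()
```

Here `x27` plays the role of the zero-correlation attributes X25–X27 and is dropped.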

Machine learning algorithms predict the future value of a time-series dataset by discovering the relations between the features of the historical data and using the discovered relations for prediction. After the initial data preparation step, the dataset has 168 network traffic data points, and each data point has 24 dimensions (i.e., attributes). To perform prediction, at least one of the attributes should be selected as the target class (i.e., the attribute that is being predicted) and the rest of the attributes should be used to predict the target class. However, since none of the future values of the attributes is known a priori, it is not feasible to use them to predict the future value of the target class. Therefore, in this research one of the attributes is selected as the target class and the rest of the attributes are removed from the dataset.

The result is a new dataset with 168 data points, each having only one attribute (a unidimensional dataset). Since all 24 features in the initial dataset follow the same periodic pattern, we selected feature X12 as the target class. Feature X12 is the number of active users connected to the network cell and represents the workload of the cell. The reasons for selecting X12 as the target class are:

• Feature X12 represents the workload of the cell during a period and is a crucial parameter for the network operators.

• Most of the features in the dataset have a strong correlation with feature X12.

The experimental results in our previous paper [4] indicate that the regression models are not very accurate in predicting unidimensional network traffic datasets. The goal of this research is to increase the accuracy of the regression models in predicting unidimensional network traffic.

Attribute name | Description                                          | Correlation
X1             | PDCP signaling radio bearers volume at downlink (DL) | 0.0678
X2             | Total UEs scheduling time at uplink (UL)             | 0.0667
X3             | Sum of radio resource control connections            | 0.0667
X4             | Max radio resource control connection                | 0.0667
X5             | PDCP signaling radio bearers volume at UL            | 0.0666
X6             | PDCP latency at DL                                   | 0.066
X7             | The aggregated scheduling time per cell at UL        | 0.0658
X8             | Total UEs scheduling time at DL                      | 0.0657
X9             | PDCP data radio bearers volume at DL                 | 0.0656
X10            | The aggregated scheduling time per cell at DL        | 0.0655
X11            | Total no. of packets for latency measurement at DL   | 0.0655
X12            | Total no. of active user equipment at DL             | 0.0653
X13            | PDCP packets received at UL                          | 0.0653
X14            | PDCP packets received at DL                          | 0.0652
X15            | Aggregated transport time over UEs at DL             | 0.0648
X16            | Active user equipment                                | 0.0641
X17            | Min PDCP data radio bearers bit rate at UL           | 0.0639
X18            | PDCP DRB vol. at DL                                  | 0.0637
X19            | Aggregated transport time over UEs at UL             | 0.0629
X20            | Max PDCP data radio bearers bit rate at UL           | 0.0613
X21            | Max PDCP data radio bearers bit rate at DL           | 0.0601
X22            | Min PDCP data radio bearers bit rate at DL           | 0.0553
X23            | PDCP data radio bearers volume at UL                 | 0.0537
X24            | Aggregated transport time over UEs at UL             | 0.042
X25            | Number of objects in measurement                     | 0
X26            | Sample of radio resource control connections         | 0
X27            | Network cell ID                                      | 0

Table 2. Correlation Attributes Ranking


Figure 1. The network traffic time-series

3.2 TIME-SERIES DECOMPOSITION

Figure 1 shows the time-series composed of the values of feature X12 between January 25th, 2015 and January 31st, 2015. This section describes the decomposition of the time-series into the components that affect its behavior. Any given time-series y_t typically consists of the following four components [27]:

• A seasonal component: a seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year or day of the week). Seasonality is always of a fixed and known period.

• A trend component: is the long term pattern of a time series, which can be positive or negative depending on whether the time series exhibits an increasing long term pattern or a decreasing long term pattern.

• A cyclical component: exists when the time-series exhibits rises and falls that are not of a fixed period. The duration of these fluctuations is usually at least 2 years. The time-series analyzed in this paper covers a period of seven days and does not include a cyclical component; therefore, the cyclical component is neglected in this paper.

• A remainder component: includes anything else after removing the trend, seasonal, and cyclical components from the time-series.

There are two approaches to mathematically model a time-series: the additive model (cf. equation (7)) and the multiplicative model (cf. equation (8)).

y_t = S_t + T_t + E_t    (7)

y_t = S_t × T_t × E_t    (8)

where t is the time, S_t is the seasonal component, T_t is the trend component, and E_t is the remainder component.

The additive model is most appropriate if the magnitude of the seasonal fluctuations or the variation around the trend line does not vary with the level of the time-series. When the variation in the seasonal pattern or the variation around the trend appears to be proportional to the level of the time-series, then a multiplicative model is more appropriate. Because the magnitude of the seasonal fluctuations of the time-series in Figure 1 does not change, the additive model is used in this paper to model the time-series.

There are different methods for the additive decomposition of a time-series, such as classical decomposition, X-12-ARIMA decomposition, and Seasonal and Trend decomposition using Loess (STL) [27]. The classical decomposition does not provide trend values for the first few and the last few observations. In addition, the classical decomposition is not robust when there are occasional unusual observations in the time-series. Furthermore, the X-12-ARIMA method only decomposes quarterly and monthly time-series [27]; because the time-series of Figure 1 represents a weekly pattern, the X-12-ARIMA method cannot be used in this research. The STL method is therefore used in this paper to decompose the time-series. Figure 2, Figure 3 and Figure 4, respectively, show the trend, seasonal, and remainder components of the mobile network traffic.
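For illustration, the idea behind additive decomposition can be shown with the classical method, a simpler relative of STL (hedged: this is not the method the paper actually uses, and the series below is synthetic): the trend is estimated by a centered moving average, the seasonal component by averaging the detrended series at each phase of the period, and the remainder is what is left.

```python
import numpy as np

def decompose_additive(y, period):
    """Classical additive decomposition: y = trend + seasonal + remainder.
    (Illustrative only; the paper uses STL, which is more robust.)"""
    n, k = len(y), period // 2
    trend = np.full(n, np.nan)           # undefined at the edges, as in the text
    for t in range(k, n - k):
        if period % 2 == 0:
            # 2 x m centered moving average for an even period
            trend[t] = (y[t - k:t + k].mean() + y[t - k + 1:t + k + 1].mean()) / 2
        else:
            trend[t] = y[t - k:t + k + 1].mean()
    detrended = y - trend
    # Average the detrended values at each phase, centred so the effects sum to ~0.
    seasonal = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    seasonal -= seasonal.mean()
    seasonal_full = np.tile(seasonal, n // period + 1)[:n]
    remainder = y - trend - seasonal_full
    return trend, seasonal_full, remainder

# One week of hourly data with a linear trend and a daily seasonal pattern,
# mimicking the shape (not the values) of the traffic in Figure 1.
t = np.arange(168)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 24)
trend, seasonal, remainder = decompose_additive(y, period=24)
```

On this noiseless synthetic series the remainder is essentially zero at the interior points, which is exactly the additive identity of equation (7).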

The grey bars on the right of Figure 2, Figure 3 and Figure 4 show the relative scales of the figures. The grey bars in all of the figures represent the same length, but because the figures are on different scales, the bars vary in size: the larger the grey bar, the smaller the scale of the diagram. For instance, the grey bar in Figure 2 (i.e., the seasonal component) is larger than the grey bar in Figure 3 (i.e., the trend component), which indicates that the seasonal component has a smaller scale compared to the trend component. In other words, if the figures were shrunk until their bars became the same size, all of the figures would be on the same scale.

The seasonal component shows a periodic pattern which is repeated twice a day. The seasonal component can be easily predicted with a machine learning prediction model.

Figure 2. The seasonal component of the traffic

Figure 3. The trend component of the traffic

Figure 4. The remainder component of the traffic

However, since the variations in the seasonal component are small (i.e., the grey bar in Figure 2 is large), increasing the prediction accuracy of this component cannot significantly increase the overall traffic prediction accuracy.

Similar to the seasonal component, the trend component of the network traffic follows a periodic, repeatable pattern (cf. Figure 3). According to the trend component, the traffic of the network cell increases between 4:00 AM and 6:00 PM and decreases afterwards. In addition, the peak traffic increases during the weekends (i.e., the peaks of the trend curve are on Sunday the 25th and Saturday the 31st). Therefore, the network operators need to consider more resources during the peak hours and over the weekends.

Unlike the seasonal and the trend components, the remainder component does not follow a periodic pattern and includes unpredictable fluctuations (cf. Figure 4). In addition, the variation of the remainder component is larger compared to the seasonal and trend components. Therefore, increasing the accuracy of the remainder component's predictions can significantly increase the overall traffic prediction accuracy. To increase the overall accuracy of the traffic predictions, the following sections evaluate the accuracy of three machine learning algorithms (i.e., MLP, MLPWD, and SVM) in predicting the future values of the components of the network traffic.

3.3 TRAINING AND TESTING OF MLP, MLPWD, AND SVM

The network traffic data used represent the hourly performance of a network cell. Since the dataset includes 168 data points, the experiment duration is 168 hours. In our previous work [28] we demonstrated that the optimum training duration for the ANN and the SVM algorithms is 60% of the experiment duration. Therefore, we consider the first 100 data samples (i.e., 60%) of the mobile network traffic dataset as the training set and the remaining 68 data samples as the testing set.

Because the datasets have only one feature, the sliding window technique is used to train and test the machine learning prediction algorithms. The sliding window technique uses the last k samples of a given feature to predict the future value of that feature. For example, to predict the value of b_{k+1}, the sliding window technique uses the values [b_1, b_2, …, b_k]. Similarly, to predict b_{k+2}, the sliding window technique updates the historical window by adding the actual value of b_{k+1} and removing the oldest value from the window (i.e., the new sliding window is [b_2, b_3, …, b_{k+1}]).
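The windowing described above can be sketched in a few lines of numpy (a generic illustration on a toy series, not the paper's exact preprocessing code):

```python
import numpy as np

def sliding_windows(series, k):
    """Turn a unidimensional series into (X, y) pairs for supervised learning:
    each row of X is [b_t, ..., b_{t+k-1}] and the target y is b_{t+k}."""
    X = np.array([series[i:i + k] for i in range(len(series) - k)])
    y = series[k:]
    return X, y

b = np.arange(1, 11)              # toy series b_1..b_10
X, y = sliding_windows(b, k=3)    # X[0] is [1 2 3] and its target y[0] is 4
```

Each prediction step then slides the window forward by one sample, exactly as in the [b_2, …, b_{k+1}] example above.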

Parameter name       | MLP value                  | MLPWD value
Learning Rate (ρ)    | 0.3                        | 0.3
Momentum             | 0.2                        | 0.2
Validation Threshold | 20                         | 20
Hidden Layers        | 1                          | 1
Hidden Neurons       | (attributes + classes) / 2 | (attributes + classes) / 2
Decay                | False                      | True

Table 3. MLP and MLPWD configuration parameters

To reduce the effect of the over-fitting problem, the cross-validation technique is used in the training phase. In this experiment, 10 runs of 10-fold cross-validation are used to minimize the over-fitting effect. Readers are encouraged to see [29] for more details about the cross-validation technique. Table 3 presents the configuration of the MLP and MLPWD algorithms in this experiment. The configuration of SVM is shown in Table 4.

Parameter name           | Value
C (complexity parameter) | 1.0
Kernel                   | RBF Kernel
regOptimizer             | RegSMOImproved

Table 4. SVM Configuration Parameters
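The 10 runs of 10-fold cross-validation used in the training phase can be sketched as follows (hedged: scikit-learn's Ridge regressor stands in for the WEKA learners here, and the data are synthetic; Ridge's alpha is an L2 weight penalty analogous to the MLPWD decay term):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression problem standing in for the traffic dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 10 repeats of 10-fold cross-validation: every sample is held out once per
# repeat, which gives 100 fold scores and a low-variance accuracy estimate.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
mean_mae = -scores.mean()
```

Averaging over the 100 folds is what dampens the over-fitting effect the text refers to.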

3.4 EVALUATION METRICS

The accuracy of the experimental results can be evaluated based on different metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), PRED(25), and R² Prediction Accuracy [30]. Among these metrics, PRED(25) only considers the percentage of observations whose prediction accuracy falls within 25% of the actual value. On the other hand, R² Prediction Accuracy is a measure of goodness-of-fit, whose value falls within the range [0, 1] and which is commonly applied to linear regression models [30]. Due to the limitations of PRED(25) and R² Prediction Accuracy, the MAE and RMSE metrics are used in this paper to measure the accuracy of the prediction algorithms. The formal definitions of these metrics are [30]:

MAE = (1/n) Σ_{i=1}^{n} |Y_pi − Y_i|    (9)

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Y_pi − Y_i)² )    (10)

where Y_pi is the predicted output and Y_i is the actual output for the i-th observation, and n is the number of observations for which the prediction is made.
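The two metrics translate directly into numpy (the toy vectors below are illustrative):

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean Absolute Error, equation (9)."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_pred, y_true):
    """Root Mean Square Error, equation (10)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred = np.array([ 9.0, 12.0, 14.0, 13.0])
# Errors are [-1, 0, 3, 0]: MAE is 1.0, while RMSE is sqrt(2.5) ≈ 1.58
# because the single large error of 3 dominates the squared term.
```

This small example also previews the point made below: the gap between RMSE and MAE grows with the variance of the individual errors.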

The MAE metric is a popular metric in statistics, especially in the prediction accuracy evaluation. The RMSE represents the sample standard deviation of the differences between the predicted values and the observed values. A smaller MAE and RMSE value indicates a more accurate prediction scheme.

The MAE metric is a linear score which assumes all of the individual errors are weighted equally. Moreover, the RMSE is most useful when large errors are particularly undesirable [31]. Since large errors can significantly reduce the network QoS, the network operators prefer to use a regression model which generates a greater number of small errors rather than a regression model that generates a fewer number of large errors. As a result, in the mobile network resource provisioning domain, the RMSE factor is more important than the MAE factor. However, considering both metrics provides a comprehensive analysis of the accuracy of the prediction models. The greater the difference between RMSE and MAE, the greater the variance in the individual errors in the sample.

3.5 HARDWARE CONFIGURATION

Hardware configuration influences the performance (i.e., the time required to create a regression model) of the prediction algorithms. Therefore, to eliminate the impact of the hardware configuration on the prediction results, the same hardware is used to create the MLP, MLPWD, and SVM regression models. Table 5 shows the hardware configuration used in the experiment.

Hardware  | Capacity
Memory    | 8 Gigabytes
Processor | Intel Core i5
Storage   | 2 Terabytes HDD

Table 5. Hardware configuration

3.6 EXPERIMENT AND RESULTS

The main objective of the experiment is to compare the accuracy of the MLP, MLPWD, and SVM algorithms in predicting the future values of the traffic components (i.e., the trend, seasonal, and remainder components). In Section 3.2 the STL method was used to decompose the network traffic into three components, producing three datasets, each representing one of the traffic components. As a result, in this experiment there are three datasets (each including the historical values of one of the components), and the objective is to find the most accurate prediction model for forecasting each of them.

Since there is only one feature in the datasets, the sliding window technique is used to train and test the prediction models. Choosing the right size for the sliding window is not a trivial task: smaller window sizes usually do not reflect the correlation between the data points thoroughly, while a larger window size increases the chance of overfitting. Therefore, this experiment also uses different window sizes to measure the effect of the sliding window size on the prediction accuracy of the MLP, MLPWD, and SVM algorithms.

Table 6 shows the prediction accuracy of the MLP, MLPWD, and SVM algorithms to forecast the future value of the seasonal component of the network traffic. The metric values (i.e., MAE and RMSE values) are plotted in Figure 5 and Figure 6, as well.

The results show that increasing the window size increases the accuracy of the regression models. In addition, when the window size is greater than 8 time slots, the MLP algorithm can predict the exact future value of the seasonal component (cf. Table 6). Because MLP uses the empirical risk minimization principle, it can increase the complexity of the regression model until the model becomes fully tailored to the training dataset. Since the seasonal component's behavior in the testing and the training datasets is identical, fitting the model to the training dataset increases the model's accuracy on the testing dataset as well. Therefore, to predict the future seasonal values of the network traffic, it is better to use the MLP regression model.

The resulting metric values for the algorithms predicting the trend component are shown in Table 7, Figure 7 and Figure 8. In the training and the testing phases, the MLP and the SVM algorithms have similar accuracy results, which improve as the window size increases. However, increasing the window size beyond 4 time slots neither improves nor degrades the prediction accuracies of the MLP and SVM algorithms. On the other hand, increasing the window size beyond 4 time slots has a negative impact on MLPWD's accuracy.

In the training phase, the SVM algorithm has slightly better accuracy than MLP in forecasting the trend component. However, MLP has a smaller RMSE value than SVM, which indicates that the predictions of the MLP algorithm include fewer large errors. Therefore, similar to the seasonal component, it is better to use the MLP algorithm to predict the trend values of the network traffic.

Unlike the seasonal and the trend components, the remainder component does not follow a periodic pattern. The complex nature of the remainder component requires the prediction models to increase their VC-dimension to predict the time-series, and increasing the VC-dimension may cause the overfitting problem. The prediction accuracies of the MLP, MLPWD, and SVM algorithms are shown in Table 8, Figure 9 and Figure 10.

Phase    | Window size | MAE (MLP / MLPWD / SVM) | RMSE (MLP / MLPWD / SVM)
Training | 2           | 3.93 / 3.51 / 3.79      | 5.37 / 5.07 / 5.39
Training | 3           | 3.67 / 3.18 / 3.01      | 5.04 / 4.61 / 4.58
Training | 4           | 0.68 / 2.45 / 3.42      | 0.84 / 3.04 / 4.96
Training | 5           | 0.16 / 2.28 / 2.80      | 0.23 / 2.92 / 4.08
Training | 6           | 0.005 / 2.21 / 2.54     | 0.025 / 2.76 / 3.89
Training | 7           | 0.005 / 2.34 / 2.63     | 0.009 / 2.70 / 3.96
Training | 8           | 0.002 / 2.22 / 2.81     | 0.003 / 2.55 / 4.42
Training | 9           | 0.001 / 1.17 / 1.20     | 0.003 / 1.49 / 2.06
Training | 10          | 0 / 1.02 / 1.11         | 0 / 1.38 / 2.06
Testing  | 2           | 5.92 / 3.56 / 3.55      | 7.24 / 4.843 / 4.981
Testing  | 3           | 3.86 / 3.03 / 2.91      | 4.87 / 4.328 / 4.392
Testing  | 4           | 0.41 / 2.08 / 2.98      | 0.567 / 2.633 / 4.572
Testing  | 5           | 0.15 / 2.23 / 2.55      | 0.2 / 2.908 / 3.760
Testing  | 6           | 0.03 / 2.71 / 2.51      | 0.031 / 3.222 / 3.904
Testing  | 7           | 0.002 / 2.20 / 2.28     | 0.003 / 2.578 / 3.449
Testing  | 8           | 0.002 / 2.02 / 2.05     | 0.002 / 2.307 / 3.577
Testing  | 9           | 0 / 0.93 / 0.83         | 0 / 1.236 / 1.626
Testing  | 10          | 0 / 1.02 / 0.77         | 0 / 1.321 / 1.506

Table 6. MAE and RMSE values (seasonal component)

Figure 5. MAE and RMSE values in the training phase (seasonal component)

Figure 6. MAE and RMSE values in the testing phase (seasonal component)


Phase    | Window size | MAE (MLP / MLPWD / SVM) | RMSE (MLP / MLPWD / SVM)
Training | 2           | 3.079 / 2.879 / 2.867   | 3.680 / 3.399 / 3.333
Training | 3           | 1.036 / 3.15 / 0.863    | 1.291 / 3.686 / 1.044
Training | 4           | 0.758 / 1.325 / 0.474   | 0.998 / 1.750 / 0.665
Training | 5           | 0.709 / 0.918 / 0.575   | 0.930 / 1.222 / 0.801
Training | 6           | 0.499 / 1.018 / 0.628   | 0.721 / 1.353 / 0.896
Training | 7           | 0.887 / 1.219 / 0.615   | 1.116 / 1.624 / 0.874
Training | 8           | 0.693 / 1.317 / 0.591   | 0.950 / 1.764 / 0.863
Training | 9           | 0.608 / 1.476 / 0.624   | 0.840 / 2.037 / 0.908
Training | 10          | 0.64 / 1.533 / 0.844    | 0.838 / 2.058 / 1.211
Testing  | 2           | 2.914 / 3.152 / 2.517   | 3.369 / 3.658 / 3.074
Testing  | 3           | 1.002 / 3.279 / 1.067   | 1.246 / 3.784 / 1.253
Testing  | 4           | 0.972 / 1.216 / 0.830   | 1.294 / 1.620 / 1.176
Testing  | 5           | 0.881 / 1.295 / 0.993   | 1.148 / 1.845 / 1.443
Testing  | 6           | 0.838 / 1.463 / 1.066   | 1.084 / 2.136 / 1.564
Testing  | 7           | 0.640 / 1.991 / 1.062   | 0.939 / 2.830 / 1.540
Testing  | 8           | 0.783 / 2.077 / 1.065   | 1.051 / 3.042 / 1.543
Testing  | 9           | 0.653 / 2.211 / 1.100   | 0.867 / 3.290 / 1.621
Testing  | 10          | 0.692 / 2.566 / 0.596   | 1.043 / 3.821 / 0.868

Table 7. MAE and RMSE values (trend component)

Figure 7. MAE and RMSE values in the training phase (trend component)

Figure 8. MAE and RMSE values in the testing phase (trend component)


Phase    | Window size | MAE (MLP / MLPWD / SVM) | RMSE (MLP / MLPWD / SVM)
Training | 2           | 9.650 / 7.644 / 7.221   | 12.69 / 10.69 / 10.44
Training | 3           | 9.553 / 7.802 / 7.434   | 12.24 / 10.827 / 10.89
Training | 4           | 8.448 / 8.004 / 7.668   | 11.54 / 11.070 / 11.04
Training | 5           | 10.10 / 8.053 / 7.957   | 12.59 / 11.178 / 11.11
Training | 6           | 9.347 / 7.755 / 7.798   | 13.14 / 10.664 / 10.90
Training | 7           | 9.529 / 7.865 / 7.493   | 12.30 / 10.900 / 10.62
Training | 8           | 10.92 / 7.607 / 7.476   | 14.85 / 10.550 / 10.74
Training | 9           | 10.59 / 7.552 / 7.709   | 16.38 / 10.178 / 10.75
Training | 10          | 9.724 / 7.330 / 7.521   | 12.68 / 10.124 / 10.71
Testing  | 2           | 9.694 / 9.396 / 8.038   | 11.68 / 11.851 / 10.92
Testing  | 3           | 8.270 / 8.64 / 8.242    | 10.74 / 11.167 / 11.20
Testing  | 4           | 8.989 / 9.385 / 8.360   | 12.20 / 11.937 / 11.32
Testing  | 5           | 9.506 / 8.637 / 8.403   | 13.26 / 11.063 / 11.16
Testing  | 6           | 10.92 / 8.908 / 8.348   | 14.62 / 11.391 / 11.08
Testing  | 7           | 11.59 / 8.601 / 8.052   | 15.53 / 10.938 / 10.90
Testing  | 8           | 10.31 / 8.193 / 8.099   | 13.32 / 10.105 / 10.61
Testing  | 9           | 10.90 / 8.318 / 8.014   | 15.42 / 10.36 / 10.84
Testing  | 10          | 11.71 / 8.102 / 7.767   | 15.12 / 10.186 / 10.67

Table 8. MAE and RMSE values (remainder component)

Figure 9. MAE and RMSE values in the training phase (remainder component)

Figure 10. MAE and RMSE values in the testing phase (remainder component)

The accuracy of the MLP algorithm decreases as the window size increases. This is because MLP uses the empirical risk minimization principle and becomes overfitted to the training dataset. Because the MLPWD and SVM algorithms use structural risk minimization, their accuracies do not change much as the window size increases.

According to the testing results (cf. Figure 10), the SVM and MLPWD algorithms have similar accuracy results; however, SVM is slightly more accurate. Therefore, it is better to use the SVM algorithm to predict the remainder component.

The experimental results suggest using the MLP algorithm to predict the trend and the seasonal components and the SVM algorithm to forecast the remainder component. In addition, increasing the window size helps to predict the seasonal and the trend components more accurately, but has no substantial impact on predicting the remainder component.

3.7 EVALUATION

The experimental results in our previous work [28] showed that the MLPWD algorithm is more accurate than the MLP and SVM algorithms in predicting the unidimensional mobile traffic. The results in Section 3.6 suggest an ensemble of the prediction algorithms, in which the different components of the traffic are predicted separately and the prediction results are combined to improve the accuracy. In the ensemble approach, the STL algorithm is used to divide the network traffic into its components. Then the MLP algorithm is used to predict the seasonal and the trend components of the traffic, and the SVM algorithm is used to forecast the remainder component. Finally, the prediction results of the components are aggregated to create the traffic predictions.
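The ensemble pipeline can be sketched as follows. This is a hedged outline only: the two stand-in predictors below (repeat-last-period and repeat-last-value) replace the paper's MLP and SVM models, and the component series are synthetic.

```python
import numpy as np

def ensemble_forecast(seasonal, trend, remainder,
                      predict_periodic, predict_irregular, steps):
    """Ensemble scheme of Section 3.7: forecast each STL component with the
    model that suits it, then sum the forecasts (additive model, eq. (7))."""
    s_hat = predict_periodic(seasonal, steps)     # MLP in the paper
    t_hat = predict_periodic(trend, steps)        # MLP in the paper
    r_hat = predict_irregular(remainder, steps)   # SVM in the paper
    return s_hat + t_hat + r_hat

# Stand-in predictors (NOT the paper's models):
def repeat_last_period(series, steps, period=24):
    return np.tile(series[-period:], steps // period + 1)[:steps]

def repeat_last_value(series, steps):
    return np.full(steps, series[-1])

# Synthetic components mimicking one week of hourly traffic.
t = np.arange(168)
seasonal = 2 * np.sin(2 * np.pi * t / 24)
trend = 0.05 * t
remainder = 0.01 * np.ones(168)
pred = ensemble_forecast(seasonal, trend, remainder,
                         repeat_last_period, repeat_last_value, steps=24)
```

Summing the three per-component forecasts is what reconstructs the overall traffic prediction under the additive model.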

Figure 11. MLPWD and ensemble approach MAE values

This section compares the accuracy of the MLPWD algorithm with the accuracy of the ensemble approach. Figure 11 and Figure 12 show the comparison between the MAE and the RMSE values of the ensemble approach and the MLPWD algorithm in forecasting the future traffic of the mobile network.

Figure 12. MLPWD and ensemble approach RMSE values

The results show that the ensemble approach improves the accuracy of forecasting the unidimensional datasets. According to Figure 11 and Figure 12, the ensemble approach improves the MAE and the RMSE metrics by 13.16% and 17.1%, respectively.
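For reference, the two accuracy metrics behind Figures 11 and 12 can be computed as follows; the actual/predicted values here are illustrative only, not taken from the experiments.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

def rmse(actual, predicted):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

actual = [10.0, 12.0, 9.0, 14.0]
predicted = [11.0, 11.0, 10.0, 12.0]
print(mae(actual, predicted))   # 1.25
print(rmse(actual, predicted))  # ≈ 1.3229
```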

4. CONCLUSIONS AND FUTURE WORK

The goal of this paper is to improve the prediction accuracy of the existing machine learning algorithms in forecasting unidimensional mobile network traffic datasets. To this end, this paper investigates the components of the mobile network traffic (i.e., the seasonal, trend, and remainder components) and analyzes the impact of each component on the future behavior of the network traffic. In addition, this paper compares the accuracy of the MLP, MLPWD, and SVM algorithms in predicting the future behavior of the mobile network traffic components. According to our experimental results, SVM outperforms MLPWD and MLP in predicting the remainder component of the network traffic, while MLP has better accuracy in predicting the seasonal and the trend components. The experimental results also show that using an ensemble of the MLP and SVM algorithms improves the prediction accuracy of the regression models by up to 17%.

This paper uses a commercial trial mobile network to carry out the experiments. The dataset includes the traffic of the network cells for one week. Although the dataset shows the seasonality and the trend of the traffic, it is not large enough to show the monthly or the yearly behavior of the traffic. Using a larger dataset in the experiments can help to identify


the behavior of the traffic more accurately and provide the network operators with more information about the long-term predictions of the traffic. In addition, it is important to find the dominant feature in the feature set, which can then be used to predict the future network traffic.

5. ACKNOWLEDGEMENT

The authors of this paper would like to thank Ericsson, Canada for providing various contributions and technical support. In addition, this research is partly sponsored by an NSERC (Natural Sciences and Engineering Research Council of Canada) Engage grant and an NSERC Discovery grant.

6. REFERENCES

[1] Chen, M., Mao, S., and Liu, Y. (2014), Big data: A survey, Mobile Networks and Applications Journal, vol. 19, no. 2, pp. 171–209.

[2] Ericsson Mobile Traffic Report (2016), [Online], Available: https://www.ericsson.com/mobility-report/mobile-traffic.

[3] Laurila, J. K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., and Miettinen, M. (2012), The mobile data challenge: Big data for mobile computing research, Proceedings of the Workshop on the Nokia Mobile Data Challenge, in Conjunction with the 10th International Conference on Pervasive Computing, pp. 1–8.

[4] Nikravesh, A., Ajila, S.A., and Lung, C-H., (2016), Using MLP, MLPWD, and SVM to Forecast Mobile Network Traffic, Proceedings of IEEE International Congress on Big Data.

[5] Bankole, A.A., and Ajila, S.A., (2013), Cloud Client Prediction Models for Cloud Resource Provisioning in a Multitier Web Application Environment, IEEE 7th International Symposium. Service System Engineering, pp. 156–161.

[6] Wang, S., and Summers, R. M., (2012), Machine learning and radiology, Medical Image Analysis Journal, vol. 16, no. 5, pp. 933–951.

[7] Vapnik, V., (1992), Principles of risk minimization for learning theory, Advances in Neural Information Processing Systems, pp. 831–838.

[8] Vapnik, V., and Chervonenkis, A. Y., (2013), Necessary and sufficient conditions for the uniform convergence of means to their expectations, pp. 7–13.

[9] Sewell, A., (2008), VC-Dimension, Department of Computer Science University College London.

[10] Yeh, I., Tseng, P.-Y., Huang, K.-C., and Kuo, Y.-H., (2012), Minimum Risk Neural Networks and Weight Decay Technique, Emerging Intelligence Computing Journal, pp. 10–16.

[11] Smola, J., and Schölkopf, B., (2004), A tutorial on support vector regression, Statistics and Computing Journal, vol. 14, pp. 199–222.

[12] Joshi, M., and Hadi, T. H., (2015), A Review of Network Traffic Analysis and Prediction Techniques, Cornell University Library Archive, p. 23.

[13] Feng, H., and Shu, Y., (2005) Study on network traffic prediction techniques, Proceedings of International Conference on Wireless Communications, Networking and Mobile Computing, pp. 995–998.

[14] Ghanbari, H., Simmons, B., Litoiu, M., and Iszlai, G., (2011), Exploring Alternative Approaches to Implement an Elasticity Policy, IEEE 4th International Conference on Cloud Computing, pp. 716–723.

[15] Hoong, N.K., Hoong, P.K., Tan, I., Muthuvelu, N., and Seng, L.C., (2011), Impact of Utilizing Forecasted Network Traffic for Data Transfers, Proceedings of 13th International Conference on Advanced Communications, pp. 1199–1204.

[16] KuanHoong, P., Tan, I., and YikKeong, C., (2012), Gnutella Network Traffic Measurements, International Journal of Computing Networks Communication, vol. 4, no. 4.

[17] Yu, Y., Song, M., Ren, Z., and Song, J., (2011), Network Traffic Analysis and Prediction Based on APM, Proceedings of 6th International Conference on Pervasive Computing and Applications, pp. 275–280.

[18] Park, D.-C., and Woo, D.-M., (2009), Prediction of Network Traffic Using Dynamic Bilinear Recurrent Neural Network, Proceedings of 5th International Conference on Natural Computation, pp. 419–423.

[19] Junsong, E., Jiukun, W., Maohua, Z., and Junjie, W., (2009), Prediction of internet traffic based on Elman neural network, Chinese Control Decision Conference, pp. 1248–1252.

[20] Zhao, H., (2009), Multi-scale analysis and prediction of network traffic, Proceedings of the IEEE International Conference on Performance, Computing and Communications, pp. 388–393.

[21] Burney, S.M., and Raza, A., (2007), Monte Carlo simulation and prediction of Internet load using conditional mean and conditional variance model, Proceedings of the 9th Islamic Countries Conference on Statistical Sciences.

[22] Zhou, B., He, D., and Sun, Z., (2006), Traffic predictability based on ARIMA/GARCH model, Proceedings of 2nd Conference on Next Generation Internet Design and Engineering, pp. 200–207.

[23] Tang, D., and Baker, M., (2000), Analysis of a local-area wireless network, Proceedings of 6th Annual International Conference on Mobile Computing Network, pp. 1–10.

[24] Rong, C., and Esteves, R.M., (2011), Using Mahout for clustering Wikipedia’s latest articles: A comparison between k-means and fuzzy c-means in the cloud, Proceedings of 3rd IEEE International Conference on Cloud Computing Technology and Science, pp. 565–569.

[25] Esteves, R.M., Pais, R., and Rong, C., (2011), K-means clustering in the cloud - A Mahout test, IEEE Workshop of International Conference on Advanced Information Networking and Applications.

[26] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I., (2009), The WEKA data mining software, ACM SIGKDD Explorer Newsletter, vol. 11, no. 1, p. 10.


[27] Hyndman, R., and Athanasopoulos, G., (2013), Forecasting: principles and practice, 1st edition, OTexts publisher.

[28] Nikravesh, A., Ajila, S.A., and Lung, C.-H., (2014), Measuring Prediction Sensitivity of a Cloud Auto-scaling System, Proceedings of 38th IEEE Annual International Computers, Software and Applications Conference Workshop, pp. 690–695.

[29] Hastie, T., Tibshirani, R., and Freidman, J., (2009), The Elements of Statistical Learning, 2nd edition, Springer publisher.

[30] Witten, I., Frank, E., and Hall, M., (2011), Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Morgan Kaufmann Publisher.

[31] Chai, T., and Draxler, R., (2014), Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature, Geoscientific Model Development Journal, vol. 7, no. 1, pp. 1247–1250.

Authors

Ali Yadavar Nikravesh received the B.S. and M.S. degrees in Software Engineering from Shahid Beheshti University, Iran, and the Ph.D. degree in Computer Engineering from Carleton University, Canada. In September 2016 he joined Microsoft Corporation, where he is now a Software Engineer. His research interests include cloud computing, machine learning algorithms, and time-series prediction.

Samuel A. Ajila is currently an associate professor of engineering at the Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada. He received the B.Sc. (Hons.) degree in Computer Science from the University of Ibadan, Ibadan, Nigeria and the Ph.D. degree in Computer Engineering, specializing in Software Engineering and Knowledge-based Systems, from LORIA, Université Henri Poincaré – Nancy I, Nancy, France. His research interests are in the fields of Software Engineering, Cloud Computing, Big Data Analytics, and Technology Management.

Chung-Horng Lung received the B.S. degree in Computer Science and Engineering from Chung-Yuan Christian University, Taiwan, and the M.S. and Ph.D. degrees in Computer Science and Engineering from Arizona State University. He was with Nortel Networks from 1995 to 2001. In September 2001, he joined the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada, where he is now a Professor. His research interests include Software Engineering, Cloud Computing, and Communication Networks.

Wayne Ding is a senior specialist at Ericsson, Canada. He is involved in 4G and 5G wireless system R&D. His work focuses on mobile big data and wireless network KPIs. He received the B.S. and M.S. degrees from Dalhousie University, Canada.


AN INVESTIGATION OF MOBILE NETWORK TRAFFIC DATA AND APACHE HADOOP PERFORMANCE

Man Si, Chung-Horng Lung, Samuel Ajila

Department of Systems and Computer Engineering, Carleton University
Ottawa, Ontario, Canada, K1S 5B6
[email protected], {chlung, ajila}@sce.carleton.ca

Wayne Ding
RAN System, Business Unit Radio, Ericsson, Canada
Ottawa, Ontario, Canada, K2K 2V6
[email protected]

Abstract
Since the emergence of mobile networks, the number of mobile subscriptions has continued to increase year after year. To efficiently assign mobile network resources such as spectrum (which is expensive), the network operator needs to critically process and analyze information and develop statistics about each base station and the traffic that passes through it. This paper presents an application of data analytics by focusing on processing and analyzing two datasets from a commercial trial mobile network. A detailed description that uses Apache Hadoop and the Mahout machine learning library to process and analyze the datasets is presented. The analysis provides insights into the resource usage of network devices. This information is of great importance to network operators for efficient and effective management of resources and for supporting a high quality of user experience. Furthermore, an investigation has been conducted that evaluates the impact of executing the Mahout clustering algorithms with various system and workload parameters on a Hadoop cluster. The results demonstrate the value of performance data analysis. Specifically, the execution time can be significantly reduced using data pre-processing, some machine learning techniques, and Hadoop. The investigation provides useful information for the network operators for future real-time data analytics.

Keywords: Mobile network traffic; clustering; Principal Component Analysis (PCA); real-time data analytics; Hadoop; Mahout

_____________________________________________________________________________________________

1. INTRODUCTION

With the emergence of mobile networks, more and more users are connected to the network and have access to the Internet through their mobile devices (Ericsson, 2015). Each connected user occupies the limited radio resources of a mobile network (e.g. spectrum) for a certain period of time. Since the radio resources are very expensive, mobile network operators spend a great deal of time and money to determine how to efficiently allocate these resources. More specifically, the mobile network operators aim to maximize profits and ensure that the

users are satisfied with the quality of service (QoS). To support resource allocation, the mobile network operators collect traffic statistics (e.g., throughput and average number of connections/cell) from base stations. This information can be analyzed to determine how the resources are used and to find traffic patterns in the data. However, the base stations collect a wide variety of information over a long period, and hence the amount of data collected becomes vast. Analyzing such a large amount of data becomes a challenge, and this problem can be solved efficiently by using Big Data analytics approaches and techniques.


Hadoop (Apache, 2015) and its machine learning library Mahout (Hortonworks, 2015) (discussed further in Section 2.1) are well-known techniques for analyzing big data. This research focuses on exploring and comparing the popular clustering algorithms of Mahout to analyze real-life base station data from a commercial trial mobile network. In addition, Hadoop parameters, such as the number of Map tasks and the number of Reduce tasks, need to be tuned to achieve optimal performance when running machine learning techniques, e.g., K-means clustering, on a Hadoop cluster. The experiments are performed on a Hadoop cluster composed of virtual machines within a server.

There are two main motivating factors for this research work:

• Growth of mobile subscriptions brings a big challenge for mobile network operators to provide a robust and scalable wireless network. However, a mobile network operator has limited wireless resources. Thus, they need to devise efficient and effective techniques to allocate these limited wireless resources.

• System data collected by base stations consists of valuable information that can reveal the current state and quality of the system. The amount of data that needs to be analyzed to find meaningful information and useful patterns can be substantial. However, the traditional data analysis techniques that use a single machine cannot efficiently process the large volume of data.

To address these two problems, this paper presents an approach to processing and analyzing mobile network data for more than 5000 base stations. The approach consists of multiple steps, including identification of the relationship between features, using Principal Component Analysis (PCA) (Fodor, 2002) for feature reduction, and application of clustering algorithms for categorizing data records.
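As an illustration of the PCA feature-reduction step mentioned above, the sketch below keeps just enough principal components to retain 90% of the variance. The synthetic records stand in for the base-station data (which is not public), and the 90% threshold is a hypothetical choice, not the paper's setting.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic stand-in for the records: 24 observed features that are
# linear mixtures of only 5 underlying factors, plus a little noise,
# so most of the 24 dimensions are redundant.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 24))
records = latent @ mixing + 0.1 * rng.normal(size=(1000, 24))

# Keep the smallest number of principal components that together
# explain at least 90% of the variance.
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(records)

print(records.shape, "->", reduced.shape)
print(pca.explained_variance_ratio_.sum())  # >= 0.90 by construction
```

Because the 24 features are mixtures of 5 factors, PCA collapses the data to only a handful of components while preserving almost all of the variance, which is exactly the execution-time benefit the paper exploits.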

Therefore, this paper presents an empirical investigation on applying data analytics techniques. The main contributions of this paper include the following. First, the paper applies the Mahout K-means and fuzzy K-means algorithms to the experimental datasets to group together data records that have similar attributes. Second, experiments have been performed using PCA to reduce execution time through feature reduction, while still maintaining an accurate clustering solution. Third, a detailed investigation is presented of a feature critical to mobile network operators: the Average RRC (Radio Resource Control) Connections/Cell, which specifies the average number of users connected to the base station. The analysis reveals how the Average RRC Connections/Cell changes throughout a 24-hour period and how the base station cells are affected. The study provides useful information for the network operator and for future research on data analytics, including real-time data analytics.

The rest of the paper is organized as follows. Section 2 describes the background information and related work. Section 3 discusses the proposed data analysis procedure. Section 4 presents and discusses the experimental results. Lastly, Section 5 concludes the paper and outlines directions for future work.

2. BACKGROUND AND RELATED WORK

2.1 HADOOP AND MAHOUT

Apache Hadoop (Apache, 2015) is a software platform used for analyzing and processing large datasets. It is an open-source software framework implemented in Java that allows for the distributed processing of large datasets using a cluster of computers. Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and the MapReduce software framework. HDFS is a distributed file system that runs on a cluster of commodity hardware and is designed to store very large files (White, 2010). MapReduce (Dean & Ghemawat, 2004), on the other hand, is a programming model and technique for distributing a job across multiple nodes. It works by separating the processing into two phases, the Map phase and the Reduce phase. The Map phase, through its Mapper function, takes the original input data and produces intermediate data that serves as the input for the Reduce phase. In the Reduce phase, the Reducer function accepts the intermediate data and merges together those intermediate values that share the same key.
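The two phases can be illustrated with a minimal in-memory word-count sketch, the canonical MapReduce example. This is a single-process illustration of the programming model, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def mapper(record):
    """Map phase: emit (key, value) pairs for one input record."""
    for word in record.split():
        yield word, 1

def reducer(key, values):
    """Reduce phase: merge all intermediate values sharing a key."""
    return key, sum(values)

def map_reduce(records):
    # Shuffle step: group the intermediate pairs by key, as Hadoop
    # does between the Map and Reduce phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = map_reduce(["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

In Hadoop, the mapper and reducer run on different nodes over HDFS blocks; the in-memory grouping above is what the framework's shuffle-and-sort stage performs at scale.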

Mahout (Hortonworks, 2015) is a machine learning library built on top of the Hadoop platform and uses the MapReduce paradigm. Mahout contains a collection of algorithms to solve recommendation, classification and clustering problems. Mahout contains the implementation of various clustering algorithms, including K-means, fuzzy K-means and canopy clustering.

Page 26: STBD Editorial Boardhipore.com/stbd/2016/STBD-Vol3-No1-2016-Full.pdf · 2019-09-08 · Services Transactions on Big Data (ISSN 2326-442X) Vol. 3, No. 1, 2016 i STBD Editorial Board

Services Transactions on Big Data (ISSN 2326-442X) Vol. 3, No. 1, 2016 19

2.2 CLUSTERING ANALYSIS AND K-MEANS

Clustering analysis (Tan et al., 2006) refers to the process of organizing items into groups based on their similarity (Owen et al., 2012). The generated clusters consist of sets of items that are similar to one another within the same group but dissimilar from the items belonging to other groups. Clustering analysis is regarded as a form of classification, since both techniques are used to divide items into groups. However, clustering analysis categorizes unlabeled items based only on the data and is thus referred to as unsupervised classification. In contrast, classification techniques, which assign class labels to unlabeled items based on a labeled model, are regarded as supervised classification.

One of the advantages of clustering analysis is that it can help provide meaningful and useful insights into the data. Some applications use clustering analysis to achieve a better understanding of their datasets. For example, clustering is used in software evolution to help reduce legacy properties in code; in image segmentation to divide a digital image into distinct regions; in recommender systems to recommend new items based on a user’s tastes; and in business to summarize datasets.

One of the most well-known and widely used clustering algorithms is K-means (Tan et al., 2006). First, the user chooses K initial centroids, where K is the number of clusters; in K-means, a centroid is defined as the mean of a group of points. In step 2, each point is assigned to the cluster with the nearest centroid, where "nearest" is quantified by a proximity measure such as Euclidean distance, Manhattan distance, or cosine similarity (Owen et al., 2012). Thirdly, the centroids are updated as the mean of the points assigned to each cluster. Steps 2 and 3 are then repeated until the result converges, which means that the centroid values do not change in subsequent iterations.
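These steps can be sketched directly in Python. The initialization here (taking the first K points) is a simplification of choosing K initial centroids, and the sample points are illustrative:

```python
import numpy as np

def kmeans(points, k, iters=100):
    """Plain K-means with Euclidean distance, following the steps above.
    Step 1 (simplified): the first K points serve as initial centroids."""
    centroids = points[:k].astype(float).copy()
    for _ in range(iters):
        # Step 2: assign every point to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Repeat steps 2-3 until convergence: centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points; K-means should recover them.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = kmeans(pts, k=2)
print(labels)  # [0 0 0 1 1 1]
```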

2.3 RELATED WORK

This section presents related work on analyzing mobile network traffic data collected from a trial network. With the popularity of mobile devices, more and more users are constantly accessing the Internet using mobile devices (e.g., phones and tablets). To improve the performance of wireless networks and support the development of 5G networks, network operators need to monitor and analyze the network traffic.

Yang et al. (2014) analyzed mobile Internet traffic data collected from a capital city in southern China and developed a Hadoop-based system, the Traffic Analysis System (TAS). Their analysis shows the basic makeup of current mobile devices and users’ interests across the available network providers. Tang and Baker (2000) examined a twelve-week trace of a building-wide local-area wireless network. They analyzed overall user behavior, such as how many users are active at a time and how much users move between access points. The analysis can help determine how wireless hardware and software should be optimized to handle the traffic generated in this specific local-area network.

Esteves and Rong (2011) compared the K-means and fuzzy K-means clustering algorithms in Mahout by clustering Wikipedia’s latest articles. Based on their research, the authors commented that Mahout is a promising tool for performing clustering, but the preprocessing tools need to be further developed. Esteves et al. (2011) studied the performance of Mahout using a large dataset comprising tcpdump (a common packet analyzer) data collected from a US Air Force local area network (LAN). They tested the scalability of Mahout with regard to the data size and also compared the performance of processing the dataset on a single node versus multiple nodes.

Comparison with Related Work. The data used for this research was statistical traffic data from a commercial mobile network. Unlike the data analyzed in (Yang et al., 2014) and (Tang & Baker, 2000), which do not have base station information, the experimental datasets in this paper include records from more than 5000 base stations. Analyzing this large volume of multidimensional data from different base stations is a challenge.

Esteves and Rong (2011) and Esteves et al. (2011) tested the Mahout clustering methods on Amazon EC2. In this research, we performed additional experiments using a network traffic dataset to compare the performance of Mahout’s K-means and fuzzy K-means clustering algorithms (see Section 4.3). Furthermore, this research also varies the number of Map and Reduce tasks in a Hadoop job, as well as the number of cores assigned to slave nodes, which were not investigated by Esteves & Rong (2011) or Esteves et al. (2011).


3. METHODOLOGY AND APPROACH

Section 3.1 describes the base station datasets that are analyzed in this research. Following that, in Section 3.2, the proposed data processing and analysis procedure is presented.

3.1 DESCRIPTION OF DATASETS

Two datasets have been collected from a commercial trial mobile network.

Description of Dataset 1 (2015 dataset). The first dataset, referred to as Dataset 1 (236 MB), is anonymized mobile network traffic data collected over a one-week period from January 25, 2015 to January 31, 2015. The data was collected from 5840 unique cells in a region. Each row is a record which contains the unique Cell ID and the values of the 24 features (e.g., downlink/uplink throughput and average number of connected users) collected every hour for one week. Therefore, each unique cell should have 168 records (24 hours per day × 7 days per week). However, some cells have missing records (i.e., data was not collected for certain time slots). To fill in a missing value so that further analysis (e.g., clustering) can be performed, the average value of the column (time slot) in which the missing value is located is used in its place.
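The missing-record handling can be sketched as follows, with a tiny hypothetical array standing in for the cell records (rows are cells, columns are time slots, NaN marks a missing record):

```python
import numpy as np

# Toy stand-in for the dataset: 3 cells x 3 hourly time slots.
records = np.array([[1.0, 2.0, np.nan],
                    [3.0, np.nan, 6.0],
                    [5.0, 4.0, 9.0]])

# Replace each missing value with the average of its column (time slot),
# ignoring the missing entries when computing the averages.
col_means = np.nanmean(records, axis=0)
rows, cols = np.where(np.isnan(records))
records[rows, cols] = col_means[cols]

print(records)  # missing entries become 7.5 and 3.0 (the column averages)
```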

Table 1 presents the 24 features used in the 2015 dataset including the unit used for each feature. The following list outlines the abbreviations used in Table 1:

• UE: User Equipment
• DL: Downlink
• UL: Uplink
• PDCP: Packet Data Convergence Protocol
• RRC: Radio Resource Control
• DRB: Data Radio Bearer
• SRB: Signal Radio Bearer

An example of a UE is a mobile phone. PDCP is one of the user plane protocols used in LTE mobile networks. This protocol sends and receives packets between User Equipment and the eNodeB (abbreviation for Evolved Node B), which is the hardware that communicates directly with UEs in a mobile phone network. There are two kinds of PDCP bearers, which are data carriers in LTE systems: the SRB and the DRB. The SRB is used for carrying signals and the DRB is used to carry User Plane content on the Air Interface.

The Average RRC Connections is calculated by dividing the total number of RRC Connections by the number of samples.

Feature ID | Feature Name         | Description
1          | ActiveUeDlSum        | Number of Active UE at DL
2          | ActiveUeUlSum        | Number of Active UE at UL
3          | PdcpBitrateDlDrbMax  | Max PDCP DRB Bit Rate in Mbps at DL
4          | PdcpBitrateDlDrbMin  | Min PDCP DRB Bit Rate in Mbps at DL
5          | PdcpBitrateUlDrbMax  | Max PDCP DRB Bit Rate in Mbps at UL
6          | PdcpBitrateUlDrbMin  | Min PDCP DRB Bit Rate in Mbps at UL
7          | PdcpLatPktTransDl    | PDCP Latency for Packet Transmission at DL
8          | PdcpLatTimeDl        | PDCP Latency Time
9          | PdcpPktReceivedDl    | Number of PDCP Packets Received at DL
10         | PdcpPktReceivedUl    | Number of PDCP Packets Received at UL
11         | PdcpVolDlDrb         | PDCP DRB Volume at DL
12         | PdcpVolDlDrbLastTTI  | PDCP DRB Volume at DL for the Last TTI
13         | PdcpVolDlSrb         | PDCP SRB Volume at DL
14         | PdcpVolUlDrb         | PDCP DRB Volume at UL
15         | PdcpVolUlSrb         | PDCP SRB Volume at UL
16         | AvgRrcConn/Cell      | Number of Average RRC Connections/Cell
17         | RrcConnMax           | Number of Peak Connected Users/Cell
18         | SchedActivityCellDl  | Scheduling Time in Seconds per Cell at DL
19         | SchedActivityCellUl  | Scheduling Time in Seconds per Cell at UL
20         | SchedActivityUeDl    | Scheduling Time in Seconds per UE at DL
21         | SchedActivityUeUl    | Scheduling Time in Seconds per UE at UL
22         | UeThpTimeDl          | UE Throughput Time at DL
23         | UeThpTimeUl          | UE Throughput Time at UL
24         | UeThpVolUl           | UE Throughput Volume at UL

Table 1. Feature description for Dataset 1

Description of Dataset 2 (2013 dataset). The second dataset, referred to as Dataset 2 (24 MB), is also anonymized mobile network traffic data. The dataset has a format similar to that of Dataset 1. The records were collected from 7592 unique cells in a region. The data for these unique cells were collected every 15 minutes in the morning from 6:00 to 7:00 and in the afternoon from 14:00 to 15:00 on November 18, 2013. At each time slot, data for 17 features, including the average number of connected users, are recorded. The description of the 17 features in the 2013 dataset is presented in Table 2.

The following list outlines the abbreviations used in Table 2:


• UE: User Equipment
• TTI: Transmission Time Interval
• DL: Downlink
• UL: Uplink
• RRC: Radio Resource Control
• DRB: Data Radio Bearer
• SRB: Signal Radio Bearer

TTI is a parameter used in digital communication networks that refers to the duration of a transmission on the radio link. Occupancy denotes the percentage of the TTI that has been occupied for data transmission. Note that the other terms, which are also used in the 2015 dataset (see Table 1), are explained in the section named “Description of Dataset 1”.

Feature ID | Feature Name                  | Description
1          | ActiveUeDlMax/TTI/Cell        | Number of Maximum Active UE at DL/TTI/Cell
2          | ActiveUeUlMax/TTI/Cell        | Number of Maximum Active UE at UL/TTI/Cell
3          | RrcConnMax/Cell               | Number of Maximum RRC Connection/Cell
4          | RrcConnAvg/Cell               | Number of Average RRC Connection/Cell
5          | ActUeDl/TTI (incl. Idle time) | Number of Active UE at DL/TTI (including idle time)
6          | ActUeUl/TTI (incl. Idle time) | Number of Active UE at UL/TTI (including idle time)
7          | ActUeDl/TTI (SE/TTI)          | Number of Active UE at DL/TTI (Schedule Entity/TTI)
8          | ActUeUl/TTI (SE/TTI)          | Number of Active UE at UL/TTI (Schedule Entity/TTI)
9          | DlDrbCellTput                 | DL DRB Cell Throughput in Mbps
10         | UlDrbCellTput                 | UL DRB Cell Throughput in Mbps
11         | DlDrbUeTput                   | DL DRB UE Throughput in Mbps
12         | UlDrbUeTput                   | UL DRB UE Throughput in Mbps
13         | DlTtiOccupancy                | DL TTI Occupancy
14         | UlTtiOccupancy                | UL TTI Occupancy
15         | TotalAvgTput (Vol/ROP)        | Total Average Throughput Volume/Recording Output
16         | AvgServiceLength(s)/CU        | Average Service Length/Connected User
17         | AvgSessionLength(s)/CU        | Average Session Length/Connected User

Table 2. Feature description of Dataset 2

3.2 DATA ANALYSIS PROCEDURE

This paper applies a sequence of steps to analyze the mobile network traffic datasets. The procedure shown in Algorithm 1 outlines the key steps.

Data Preprocessing. The features in the experimental dataset have different units. Before applying any analysis method (e.g., checking the correlation of different features or clustering the dataset), we first applied a standardization method (Kuhn, 2015) to eliminate the effect of the different parameter units. Without a common scale, the different parameter units would cause an unfair comparison when data mining techniques, for example clustering algorithms, are applied. A standardized score is calculated using the following formula in R, a data analysis tool (R Core Team, 2015):

Z = (x − μ) / σ     (1)

where Z is the standardized score, x is the value of an element, μ is the mean value of all elements, and σ is the standard deviation. After standardization, we get a dataset with standardized data where each feature has the same weight.
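The same standardization can be sketched in Python (note this sketch uses the population standard deviation, whereas R's scale() defaults to the sample standard deviation, so exact values can differ slightly):

```python
import numpy as np

def standardize(column):
    """Z-score standardization: Z = (x - mean) / std."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# Hypothetical feature column with its own unit (e.g., throughput in Mbps).
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z.mean(), z.std())  # ≈ 0.0 and ≈ 1.0 after standardization
```

After this step, every feature column has mean 0 and unit spread, so no feature dominates a distance computation merely because of its unit.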

1: Preprocess data (standardization)
2: Find feature correlation (using R package)
3: If analyzing with all features then
4:     Apply K-means and fuzzy K-means clustering methods
5: Else if analyzing with reduced feature dimension then
6:     Apply PCA
7:     Apply K-means clustering method
8: Else if individual feature analysis then
9:     Select one feature at a time
10:    Convert the data format (including handling missing time slots)
11:    Apply K-means clustering method
12: End if
13: Evaluate and analyze results

Algorithm 1. Data analysis procedure

Identification of Feature Correlation. In order to have a better understanding of the relationship between different features, the next step is to calculate the correlation between each pair of features by using the cor() function from the ‘stats’ package of R (R Core Team, 2015).
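Although the paper uses R's cor(), the same pairwise Pearson correlation matrix can be sketched with NumPy; the three feature columns below are synthetic stand-ins for the dataset's features, with the second built to correlate strongly with the first:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)                       # independent feature
f2 = 0.9 * f1 + 0.1 * rng.normal(size=200)      # strongly correlated with f1
f3 = rng.normal(size=200)                       # independent noise

# Pairwise Pearson correlations, the same quantity R's cor() reports.
corr = np.corrcoef([f1, f2, f3])
print(np.round(corr, 2))
```

A matrix entry near ±1 flags a redundant feature pair (a candidate for removal or for PCA), while entries near 0 indicate largely independent features.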

Applying Clustering Methods. Clustering is used to organize items into groups from a large set of items based on their similarity. The steps of applying Mahout K-means and fuzzy K-means clustering to the dataset are shown in Algorithm 2.


Retrieval of Clustering Results. The final results of a clustering job are stored in files located in HDFS. To retrieve the clustering results, two commands are used: seqdumper and clusterdump.

1: Prepare data file and convert the data format to Mahout Vector
2: Convert Mahout Vector to Hadoop SequenceFile
3: Randomly select K centroids
4: Copy the SequenceFile to HDFS
5: Set Hadoop configuration parameters: NameNode IP
6: Set K-means clustering parameters: threshold
7: Run Mahout K-means clustering
8: Use ClusterDumper/SequenceFileDumper to get the clustering result

Algorithm 2. Steps of applying Mahout K-means clustering

Evaluation of Clustering Quality. Two common metrics provided by the Mahout library to evaluate clustering quality are: (1) the scaled average inter-cluster distance (which can be regarded as separation) and (2) the scaled average intra-cluster distance (cohesion). A large scaled average inter-cluster distance represents good clustering quality, since good clusters usually do not have centroids that are close to each other. The scaled average intra-cluster distance measures the distance between members within a cluster; a small value shows that the points within the cluster are close to one another (i.e., a cohesive cluster), which is desired. Some research has used one or both of these metrics separately to evaluate clustering quality (e.g., Esteves & Rong, 2011). In this paper, we propose a new metric, called the validity metric, that combines the two evaluation metrics, as shown in Eq. (2):

Validity = (scaled average inter-cluster distance) / (scaled average intra-cluster distance)    (2)

A large validity value is desirable, since a large inter-cluster distance and a small intra-cluster distance indicate a good quality clustering result.
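The validity computation can be sketched as below. This is a simplified, unscaled version (Mahout reports scaled distances), and the sample points, labels, and centroids are hypothetical:

```python
import numpy as np

def validity(points, labels, centroids):
    """Validity = avg. inter-centroid distance / avg. intra-cluster distance.
    Larger is better: well-separated (large numerator) and cohesive
    (small denominator) clusters."""
    k = len(centroids)
    # Separation: average pairwise distance between cluster centroids.
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                     for i in range(k) for j in range(i + 1, k)])
    # Cohesion: average distance of each point to its own centroid.
    intra = np.mean([np.linalg.norm(p - centroids[l])
                     for p, l in zip(points, labels)])
    return inter / intra

# Two tight, well-separated clusters give a large validity value.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = [0, 0, 1, 1]
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
v = validity(points, labels, centroids)
```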

Dimension Reduction Using Principal Component Analysis (PCA). Dimension reduction is the process of reducing the number of features (or dimensions) in a dataset. With dimension reduction, the original data is mapped from a high-dimensional space to a lower-dimensional space to simplify the analysis of the dataset. This can remove redundant information and noise in the original dataset that can impact the analysis results. PCA is a well-known linear dimension reduction technique (Fodor, 2002). It reduces the dimension of a dataset by finding a few principal components (PCs) with large variances. PCs are orthogonal and are generated by a series of linear combinations of the variables.
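A minimal PCA sketch via the singular value decomposition of the centered data is shown below (the random input matrix is hypothetical and merely mirrors Dataset 1's 24 features):

```python
import numpy as np

def pca(data, n_components):
    """Project centered data onto its top principal components.
    The rows of Vt are the orthogonal principal directions, ordered
    by decreasing variance (squared singular values)."""
    centered = data - data.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T

# Hypothetical data: 100 observations with 24 features, reduced to 4 PCs.
X = np.random.default_rng(0).normal(size=(100, 24))
reduced = pca(X, 4)
```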

Feature-by-Feature Analysis. This allows us to see the distribution of the base stations with respect to just one feature. It can be useful because mobile network service providers are sometimes interested in one specific feature and want to see the behavior of the base stations with respect to that feature. Furthermore, feature-by-feature analysis can provide a more detailed result compared to using the values of all features in the same experiment.

To perform feature-by-feature analysis, we develop an algorithm to convert the data into an appropriate format. The goal is to transpose the records on the start time; that is, the start time of each traffic trace becomes a feature of the record. By analyzing the converted dataset, we can then observe how one feature changes at different points in time.
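The conversion can be sketched as a pivot over (station, hour, value) tuples. The record format and the zero-filling of missing time slots are assumptions made for illustration:

```python
import numpy as np

def pivot_feature(records, n_hours=24):
    """records: (station_id, hour, value) tuples for ONE selected feature.
    Returns {station_id: 24-element vector}, one row per base station,
    with hours as the new features; missing time slots are filled with 0.0
    (an assumed policy for handling gaps in the traces)."""
    table = {}
    for station, hour, value in records:
        row = table.setdefault(station, np.zeros(n_hours))
        row[hour] = value
    return table

# Hypothetical traces: cell-1 has values at 0:00 and 23:00, cell-2 at 5:00.
records = [("cell-1", 0, 12.0), ("cell-1", 23, 30.0), ("cell-2", 5, 7.5)]
vectors = pivot_feature(records)
```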

4. EXPERIMENTATION AND ANALYSIS

4.1 ANALYSIS OF THE CORRELATION OF FEATURES

To analyze the correlation of features, a correlation matrix is generated using the R data analysis tool (R Core Team, 2015).

Correlation Matrix for Dataset 1. The correlation matrix for Dataset 1 is displayed in Figure 1. Blue indicates positive correlation, and red indicates negative correlation. The darker the color (or the larger the circle size and the value), the stronger the correlation. The result shows that many of the features in Dataset 1 are strongly correlated (e.g., X9, X10, and X11).

Correlation Matrix for Dataset 2. The correlation matrix for Dataset 2 is illustrated in Figure 2. The results show that features X3 (Maximum RRC Connection/Cell) and X4 (Average RRC Connection/Cell), as well as features X1 (Active UE at DL/TTI/Cell) and X5 (Active UE UL/TTI), have a strong positive correlation. Conversely, some feature pairs (e.g., X1 and X11, and X11 and X3) are negatively correlated. X11 represents the downlink UE throughput, and the negative relationship means that as the number of active UE (or RRC Connections) increases, downlink UE throughput decreases. There are also some feature pairs that are loosely correlated or have almost no correlation (e.g., X7 and X8), and thus have a small absolute correlation value. X7 and X8 represent the number of active users at downlink and uplink, respectively. The fact that X7 and X8 are not strongly correlated indicates that the downlink and uplink channels are independent in terms of user numbers.

Figure 1. Correlation matrix for features of Dataset 1

Figure 2. Correlation matrix for features of Dataset 2.

4.2 EXPERIMENT SETUP AND CONFIGURATION OF HADOOP

The Hadoop cluster is deployed on a local server with the following characteristics: 16 AMD Opteron 4386 processors (3.1 GHz with 8 cores), 32 GB of memory, and 1770 GB of storage, running the 64-bit CentOS 6.0 (Linux) operating system.

Five virtual machines (VMs) are deployed on the server; the specification of each VM is outlined in Table 3. All VMs have the same virtual disk size of 100.73 GB, and each runs the Ubuntu 14.04 (64-bit) operating system.

VM ID  Type                                    Number of Virtual CPUs  RAM (GB)
1      Master Node (NameNode + JobTracker)     8                       16
2      Slave Node A (DataNode + TaskTracker)   1                       8
3      Slave Node B (DataNode + TaskTracker)   2                       8
4      Slave Node C (DataNode + TaskTracker)   4                       8
5      Slave Node D (DataNode + TaskTracker)   8                       8

Table 3. VM specifications inside host server

The Hadoop daemons are executed on these VMs, and thus each VM is considered a node in the Hadoop cluster. VM1 operates as the Master Node of the Hadoop cluster and therefore executes the NameNode and JobTracker. The other VMs, VM2 to VM5, are considered Slave Nodes and run the DataNode and TaskTracker Hadoop daemons. Note that the Hadoop cluster used in the experiments comprises one Master Node and one Slave Node (which can be VM2, VM3, VM4, or VM5). Each of the Slave Node VMs is configured with n/2 map slots and n/2 reduce slots, which determine the number of map tasks and reduce tasks the Slave Node can execute in parallel at a given point in time, where n is the total number of cores the VM has. For example, VM5 is configured with 4 map and 4 reduce slots because it has 8 cores.

4.3 CHOOSING APPROPRIATE K AND FUZZY VALUES

This section presents and analyzes the clustering results of Dataset 1 and Dataset 2. The K-means (Tan et al., 2006) and fuzzy K-means (Tan et al., 2006) techniques are used for clustering the experimental datasets in this paper. Note that the inputs to the clustering techniques are the preprocessed versions of Dataset 1 and Dataset 2; the standardization technique used to preprocess the datasets is detailed in Section 3.2. We vary the K value for the K-means technique to determine the K value that achieves the best validity value (refer to Eq. (2)). Furthermore, additional experiments (which use the best K value identified in the previous experiment) are conducted to find the value of the fuzziness parameter for the fuzzy K-means technique that generates the best validity value.

The Hadoop cluster used in this section is composed of the Master Node and Slave Node D (with 8 CPUs, as shown in Table 3). The convergence threshold of the K-means clustering algorithm is set to 0.01, which means that the difference between the last two iterations has to be less than 1% for the algorithm to stop. In addition, the maximum number of iterations is set to 150, so even if the algorithm does not converge based on the threshold, it will still stop after 150 iterations. The maximum is set to 150 so that the experiments can complete in a reasonable amount of time if the algorithm does not converge. These configuration values are the same as those used in (Esteves et al., 2011).
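The stopping rule described above (converge when centroid movement falls below the threshold, or stop after the iteration cap) can be sketched in a simplified, single-machine form; Mahout distributes the same logic over MapReduce, and the sample points are hypothetical:

```python
import numpy as np

def kmeans(points, k, threshold=0.01, max_iter=150, seed=0):
    """Lloyd's K-means with a Mahout-style stopping rule: stop once no
    centroid moves more than `threshold`, or after `max_iter` iterations."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        converged = np.linalg.norm(new - centroids, axis=1).max() < threshold
        centroids = new
        if converged:
            break
    return centroids, labels

# Two obviously separated groups of points (hypothetical data).
points = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = kmeans(points, k=2)
```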

In the experiments conducted, the K value varies from 3 to 10 (see Table 4). For each experiment, the following metrics are collected or calculated: the scaled average inter-cluster distance (denoted as inter-cluster density), the scaled average intra-cluster distance (denoted as avg. intra-cluster density), the validity value (calculated using Eq. (2)), the number of iterations, and the execution time of each iteration. Since the initial centroids are selected randomly, each experiment is run three times and the average value of each metric is calculated. Running the experiments three times was adequate, because less than 5% variation was observed in the final results and all three runs converged to a similar solution.

K-means Results of Dataset 1. This section applies the K-means clustering method to Dataset 1. As shown in Table 4, as the K value (number of clusters) increases, the inter-cluster density generally tends to decrease. The largest inter-cluster density, which means that the clusters have good separation, is achieved when K=4. The value of intra-cluster density decreases as K increases, except when K increases from 8 to 9; a smaller value for intra-cluster density indicates higher cohesion. The largest validity value (defined in Eq. (2)) is reached when K=4, so K=4 is the most appropriate K value for Dataset 1. As shown in Table 4, the K-means algorithm requires 35 iterations to converge when K=4. Lastly, it is also observed that the execution time per iteration increases with K. This is expected because K-means requires calculating the distance between each data point and each cluster center; a higher value of K therefore means that more calculations need to be performed, causing a higher execution time.

Fuzzy K-means Results of Dataset 1. The experiments in this section apply fuzzy K-means to Dataset 1 to find the fuzziness parameter that obtains the highest validity value. It can be observed from Table 4 that K=4 achieves the highest validity value when K-means is used to cluster Dataset 1. Therefore, K=4 is chosen for the fuzzy K-means algorithm used in this section, which aims to find the value of the fuzziness parameter that achieves the highest validity value. The fuzziness parameter varies from 1.0 to 1.9. The experimental results of applying fuzzy K-means clustering to Dataset 1 are displayed in Table 5. The largest validity value is generated when fuzziness=1.0. Note that using fuzzy K-means with fuzziness=1.0 is identical to using normal K-means clustering. This result indicates that K-means clustering with K=4 best fits Dataset 1, meaning that the data points in Dataset 1 can be divided into four well-defined clusters. By comparing the execution times for fuzzy K-means in Table 5 and K-means in Table 4, it is observed that the execution time per iteration for fuzzy K-means is longer. This is reasonable, because the complexity of the fuzzy K-means algorithm is higher than that of the K-means algorithm.

K   Inter-CD  Intra-CD  Validity  N    Execution Time(s)/Iteration
3   0.498     0.701     0.710     34   46.22
4   0.501     0.673     0.744     35   46.78
5   0.440     0.663     0.664     60   48.73
6   0.458     0.651     0.704     64   49.98
7   0.396     0.644     0.615     96   51.69
8   0.372     0.641     0.580     122  53.34
9   0.350     0.642     0.545     135  54.68
10  0.325     0.632     0.514     131  55.95

K = no. of clusters, CD = cluster density, N = no. of iterations

Table 4. K-means results of Dataset 1

F    Inter-CD  Intra-CD  Validity  N   Execution Time(s)/Iteration
1.0  0.501     0.673     0.744     35  67.05
1.1  0.485     0.671     0.723     83  66.70
1.2  0.469     0.668     0.702     69  66.51
1.3  0.448     0.668     0.671     45  66.69
1.4  0.431     0.660     0.653     34  66.95
1.5  0.414     0.653     0.634     27  66.71
1.6  0.394     0.656     0.601     21  66.45
1.7  0.380     0.658     0.578     17  66.49
1.8  0.382     0.684     0.558     14  66.69
1.9  0.394     0.614     0.642     45  66.79

F = fuzziness, CD = cluster density, N = no. of iterations

Table 5. Fuzzy K-means results of Dataset 1

4.4 APPLYING PRINCIPAL COMPONENT ANALYSIS (PCA)

This section discusses the performance and accuracy of applying the PCA (Fodor, 2002) technique before clustering. Dataset 1 has 24 features, and most of these features are strongly correlated. This implies that redundant features exist in the dataset; thus, processing the dataset with all the features, which requires a substantial amount of time for very large datasets, is unnecessary.

After applying PCA, the dataset loses some information, depending on the number of principal components (PCs) that are retained. The cumulative proportion of variance (see Eq. (3)) is a way to measure this loss:

V = (λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λn)    (3)

where λi is the variance of the i-th PC, k is the number of retained PCs, and n indicates the total number of PCs. To determine the number of PCs to be used, we set the threshold of V to 0.85 (Kerdprasop, 2005), and PCs 1 to 4 are retained as meaningful variables for further analysis.

To evaluate the performance of PCA as applied to our experimental datasets, an accuracy function (see Eq. (4)) is proposed. We evaluated 8 K values (from 3 to 10, inclusive; cf. Table 4) for K-means clustering and selected K=4, as it generates the highest validity value (see Eq. (2)) compared to the other values. Hence, K-means clustering with K=4 is performed on Dataset 1 both with and without applying PCA before clustering. In addition, our experimental results for K-means are better than those for fuzzy K-means, since the validity value for K-means with K=4 is higher than any of the results for fuzzy K-means (except when the fuzziness parameter equals 1.0, which is identical to normal K-means clustering). Hence, the rest of the paper focuses only on K-means.

Accuracy = |A∩B| / |A|    (4)

where A and B denote the clustering results without and with PCA, respectively.

Table 6 displays the results of two clustering experiments: A and B. Experiment A represents the clustering result using all 24 features of the original Dataset 1, and experiment B represents the clustering result of Dataset 1 using the top 4 principal components (PCs) obtained from PCA. Since K=4 is chosen for both experiments, both A and B generate 4 clusters. The percentage of overlapping data records in A and B is 99.45%, which confirms the selection of K=4. The numbers displayed in the second, third, and fourth columns of Table 6 are the numbers of data records in each cluster. For example, there are a total of 978108 records in experiment A, 978108 records in B, and 972738 records in A∩B (the intersection of A and B). A∩B shows the number of data records that are grouped in the same cluster in both experiment A and experiment B. In cluster 1, there are 4644 records in experiment A, 4662 records in B, and 4527 records in A∩B, which means that 4527 data records from cluster 1 in experiment A also appear in cluster 1 in experiment B.

Cluster ID  A       B       A∩B     A∩B/A
1           4644    4662    4527    97.48%
2           84566   85371   84052   99.39%
3           301373  302629  299442  99.36%
4           587525  585446  584717  99.52%
Total       978108  978108  972738  99.45%

Table 6. Comparison of clustering results without using PCA (A) and with using PCA (B)

As shown in Table 6, the number of correctly assigned records after applying PCA is 972738. What is “correct” is evaluated by comparing the results of B with the results of A, as experiment A is based on all 24 features (i.e. A∩B/A). Thus, the accuracy of applying PCA for feature reduction is 972738/978108 = 99.45%.
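The accuracy computation of Eq. (4) can be sketched as follows. Since K-means may number its clusters differently in the two runs, each cluster in A is matched to its best-overlapping cluster in B before counting agreements; this matching step is an assumption made for illustration, and the label lists are hypothetical:

```python
from collections import Counter

def pca_accuracy(labels_a, labels_b):
    """Fraction of records assigned to the 'same' cluster in both runs
    (A-cap-B / A), after matching each cluster in A with its
    best-overlapping cluster in B."""
    pair_counts = Counter(zip(labels_a, labels_b))
    matched = {}
    for (a, b), n in pair_counts.items():
        # Keep, for each A-cluster, the B-cluster it overlaps most.
        if n > matched.get(a, (None, 0))[1]:
            matched[a] = (b, n)
    agree = sum(n for _, n in matched.values())
    return agree / len(labels_a)

# Hypothetical assignments: 4 of 5 records land in matching clusters.
acc = pca_accuracy([1, 1, 2, 2, 2], [1, 1, 2, 2, 1])
```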

By using PCA, the number of features in Dataset 1 was reduced from 24 to 4, which reduced the input data size from 236 MB to 38 MB. This in turn reduced the total execution time of a K-means clustering job from 1637.30 s to 1046.08 s, a 36.11% reduction. Extrapolating these figures (i.e., data size and execution time) to bigger datasets (GBs or TBs) indicates a substantial gain in memory space and time. The savings in computation time can also be critical for real-time data analytics that could rapidly provide useful information to the network operator for potentially better resource management.


4.5 ANALYSIS OF IMPACT OF AVG. RRC CONNECTIONS PER DAY

This section discusses the results of analyzing a single feature for a whole day in order to have a better understanding of how the feature value changes during the day and how the base station cells are affected by the feature. The feature that is chosen for analysis is “Feature 16” of Dataset 1 which is the average number of radio resource control (RRC) connections per cell. The reasons for choosing Feature 16 are as follows:

• Many features in Dataset 1 have a strong correlation with Feature 16 based on correlation analysis. Therefore, analyzing Feature 16 is representative of analyzing the other features of Dataset 1.

• Average RRC Connections/Cell is a crucial parameter for mobile network operators, since it can influence user experience and resource utilization.

The first step of this experiment is to select Feature 16 from Dataset 1 and convert it to generate a dataset in which each row has one unique base station ID and the values of Feature 16 collected every hour, from the first hour (0:00) to the end of the day (23:00). K-means clustering is then applied to the modified dataset. The K-means clustering result of Feature 16 with K=4 is displayed in Table 7; K=4 is selected due to its high validity value in comparison to other K values (cf. Table 4 in Section 4.3). The number of unique cells assigned to each cluster is shown in Table 7. Each cluster has a center vector that is calculated by averaging the values of all points in the cluster. By plotting and analyzing the 4 center vectors, each having 24 dimensions (one for each hour of the day), we can gain a better understanding of how the base station cells in different groups are affected by Feature 16, i.e., the Average RRC Connections/Cell, a dominant feature for resource usage.

Cluster ID       1   2    3     4
Number of Cells  71  604  1935  3230

Table 7. K-means clustering result of Feature 16 from Dataset 1 for K = 4

Figure 3 shows the feature values of the 4 cluster centers at different times of the day. As seen in Figure 3, a large number of cells have a small average number of connected users (RRC Connections) (e.g., Cluster 4), and only 71 of the 5840 cells have a large average number of connected users (see Cluster 1). As time changes from morning to night, the average number of connected users fluctuates. For Cluster 1, the value varies considerably throughout the day: it reaches its minimum of 45.93 at 5:00 in the morning and its peak of 194 at 15:00. This is reasonable, since mobile network users are more active in the early afternoon than in the early morning. A valuable result obtained from the experiment is that the cells in this cluster may need to be monitored closely, as the number of users can be high, which may impact resource management, system performance, and user experience. Identifying those cells is useful for the operator in network planning and management.

Figure 3. Average number of RRC Connections/Cell at different times of a day

The values for Cluster 2 and Cluster 3 do not vary as much as those of Cluster 1, but they still follow a similar trend. Cluster 4 has the fewest connected users both in the morning and in the afternoon. This indicates that the cells in Cluster 4 either serve only a small number of UEs or their UEs are not highly active.

In addition, mobile network operators and users are concerned with QoS, which can be evaluated using throughput. Since Dataset 1 does not have features related to throughput, the following experiment uses Dataset 2, which has several throughput-related metrics. To understand the impact of RRC Connections/Cell on throughput, we select two throughput-related features from Dataset 2 and analyze the relationship between them. These two features are Downlink UE Throughput (Feature 11) and Uplink UE Throughput (Feature 12). To analyze the relationship of these features with respect to RRC Connections/Cell (Feature 4 in Dataset 2), we proceed as follows. There are 60129 records in total, for different base stations at different times. We group every 500 records, calculate the average value of each of these three features, and plot the result as depicted in Figure 4 (for RRC Connections/Cell and UE throughput results).

Figure 4. Relationship between RRC Connections/Cell and UE Throughput

Figure 4 has two vertical axes: the left axis is for RRC Connections/Cell and the right axis is for the Downlink and Uplink Throughputs. Figure 4 shows that as the number of RRC Connections increases, both the downlink and uplink data radio bearer (DRB) UE throughputs decrease. The uplink DRB UE throughput drops quickly when the number of average connected users starts to increase, while the downlink DRB UE throughput decreases more evenly and slowly. This figure demonstrates that a large number of average connected users can lead to lower user equipment throughput (both downlink and uplink), and thus QoS can be affected. For example, Netflix, the popular video streaming service, recommends at least 5 Mbps of downlink throughput for streaming HD (High Definition) quality video (Netflix, 2015). From this analysis, we observe that the average connected users/cell has a strong negative correlation with UE throughput. Therefore, a large number of average connected users/cell may degrade QoS because of the lower UE throughput. Conversely, a small number of average connected users/cell may not be desirable either, because it can lead to lower resource utilization if the service provider has a large number of base stations.

The results of the experiment can help the network operator effectively allocate and manage resources and improve QoS for users.

Analysis of Impact of RRC Connections for a Weekday. To compare the distribution of base station cells between a weekend day (Sunday, Jan. 25, 2015) and a weekday, an analysis of a weekday (Monday, Jan. 26, 2015) is performed. The K-means clustering technique is applied to this data, and the number of clusters (K value) is chosen to be 4, since it best represents this dataset as described in Section 4.3. Four clusters are generated, and the number of base station cells assigned to each cluster is shown in Table 8. For example, the first cluster includes 176 base station cells.

Cluster ID       1    2    3     4
Number of Cells  176  853  2027  2802

Table 8. The number of cells for each cluster

Figure 5. Average number of RRC Connections/Cell at different times of a weekday (Monday, Jan. 26, 2015)

Figure 5 shows the value of average RRC Connections at different time slots of a weekday (Monday, Jan. 26, 2015). By comparing Figure 3 and Figure 5, it is observed that the overall trend is similar. However, cluster 1 of the weekday result (cf. Figure 5) has a slightly lower average value compared with cluster 1 of the weekend result (cf. Figure 3). In Figure 5, cluster 1 reaches a minimum of 32 at 5:00 and a maximum of 150 at 14:00, compared to Figure 3, where cluster 1 reaches its minimum of 45.93 at 5:00 and its peak of 194 at 15:00. This is because cluster 1 of the weekday result has more cells, and some of these cells have lower values of Avg. RRC Connections/Cell than cluster 1 of the weekend result. Data for other weekdays were analyzed, and the results follow the same trend as that of Monday shown in Figure 5. Note that the cells assigned to the first cluster on Monday (Jan. 26, 2015) and Tuesday (Jan. 27, 2015) have 93 cells in common, but only 3 cells are common between Sunday (Jan. 25, 2015) and Monday (Jan. 26, 2015).

4.6 PERFORMANCE EVALUATION OF HADOOP

This section evaluates the performance of executing the K-means clustering technique on a Hadoop cluster with a larger amount of data. Note that the experiments are not meant to achieve the optimal performance. Rather, this is a set of preliminary experiments to demonstrate the effect of various parameters and the potential benefits of using Hadoop for performance analysis and potential improvement.

The default values for the Hadoop cluster parameters used for all experiments in this section are as follows:

• Clustering method: K-means;

• K = 4;

• ConvergeThreshold = 0.01;

• DistanceMeasurement=EuclideanDistance; and

• MaximumIterations = 150.

The experimental results using Hadoop provide useful performance guidelines for future data analysis. The information becomes crucial if real-time traffic data analysis needs to be performed, as the execution speed is important for real-time data analysis. The following highlights some of our experimental results.

Evaluation of Parallel Execution. Experiments in this section are executed on the Hadoop cluster composed of the Master Node (VM1) and Slave Node D (VM5), with five experimental datasets. The first is Dataset 1, which comprises one week of data and is 236 MB. The other four datasets contain two to five weeks of data, with sizes ranging from 472 MB to 1180 MB (a 236 MB increment per week). Besides varying the size of the experimental dataset, the numbers of Map and Reduce tasks running in parallel are also varied to investigate how the Hadoop cluster performance is affected.

Figure 6 illustrates how changing the number of Map tasks affects the execution time of the Map stage. Overall, the performance improves as the number of Map tasks increases from 1 up to 8, but deteriorates at larger values (i.e., 16 and 32 Map tasks). This means that when the data size is small (236 MB), using a very large number of Map tasks (e.g., 16 or 32) for parallel processing does not generate better performance, due to the overhead associated with creating Map tasks and managing the splits. For the example datasets, the optimal degree of parallelism is therefore 8 Map tasks. Using 8 Map tasks as a baseline, as the size of the input dataset increases, performance improves because of increased parallelism. The reduction can be significant for larger data sizes; e.g., the execution time is reduced from over 120 seconds to about 30 seconds for data size 5 (1180 MB), as shown in Figure 6.

Figure 6. Effect on the execution time of the Map Stage when varying the number of Map tasks.

Figure 7 shows the impact of the number of Reduce tasks on system performance when using data size 5 (1180 MB). The number of Reduce tasks is varied from 1 to 32, while the number of Map tasks is fixed at 32, since the number of Map tasks should not be less than the number of Reduce tasks (Leskovec et al., 2014). Note that the experiments using the smaller datasets show a similar performance trend. As shown in Figure 7, increasing the number of Reduce tasks decreases system performance. This is because the time taken by the Reduce stage keeps increasing, which in turn also increases the turnaround time. The Reduce stage is composed of shuffling, sorting, and Reduce processing, and the execution time of the shuffling and sorting operations is observed to increase with the number of Reduce tasks. Thus, for the datasets experimented with, using one Reduce task achieves the best performance, and increasing the number of Reduce tasks is not effective.

Figure 7: Effect on system performance when varying the number of reduce tasks.

Comparison of Performance for Different Numbers of CPUs. This section compares the performance of a Hadoop job executing on a slave node with different numbers of cores. The Hadoop cluster used is composed of the Master Node (VM1) and Slave Nodes A, B, C, and D, with 1, 2, 4, and 8 CPUs, respectively. The size of the dataset used for this experiment is 1180 MB. The number of Map tasks is set to 8 and the number of Reduce tasks to 1, since these are the optimal values based on the results shown in Figures 6 and 7.

The execution times of the Map stage and the Reduce stage, as well as the turnaround time, are depicted in Figure 8. Increasing the number of CPUs decreases the execution time of the Map stage, and slightly that of the Reduce stage, due to the increased parallel execution using more cores. Note that the turnaround time is typically more than the sum of the average execution times required by the Map stage and Reduce stage, because of the execution time required by the other phases of the MapReduce job, including the shuffle and sort phases. In the one case where the opposite is true (i.e., when the number of CPUs is 8), this can be attributed to Hadoop starting some of the Reduce tasks before all the Map tasks finish executing. The total execution time is reduced from ~188 seconds to ~37 seconds when going from 1 CPU to 8 CPUs for our 1180 MB dataset.

Figure 8. Effect of varying the number of CPUs.

5. CONCLUSIONS AND FUTURE WORK

This paper focused on processing and analyzing mobile network traffic data. One of the main objectives was to identify important information from traffic data patterns that may provide useful insight for the network operator. We presented a detailed approach to processing and analyzing datasets collected from real-life mobile networks. The approach used the clustering algorithms (K-means and fuzzy K-means) and other data mining algorithms of the Mahout machine learning library to analyze the datasets. Furthermore, this research also investigated how the performance of the Hadoop cluster is affected when executing the Mahout clustering algorithms using different numbers of Map and Reduce tasks and different workload parameters. A summary of our experiences and key findings from analyzing the datasets is outlined below.

• The analysis of the Average RRC (radio resource control) Connections/Cell (a critical feature for mobile networks) over a 24-hour period showed that a majority of base stations had an acceptable number of connected users. However, 71 base stations had a high value of Average RRC Connections/Cell. The utilization and performance of those 71 base stations may be affected by the high demand, and hence they need to be monitored closely for better resource management and future network planning.

• The results showed that applying PCA before clustering could reduce the execution time (e.g., for Dataset 1 the execution time decreased from 1637s to 1046s) while still maintaining the accuracy of the clustering solution (99.45%). The study is useful for large datasets and for real-time traffic data analysis where execution time becomes more important.

The advantage of using PCA is obvious; however, in some cases it is not practical to apply dimension reduction because the original features themselves need to be analyzed. For example, when we analyzed how the Average RRC Connections/Cell changes over a 24-hour period, the original values were needed rather than the transformed data generated by the PCA projection.
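As an aside, the PCA-then-cluster pipeline discussed in this section can be sketched in a few lines. This is an illustrative NumPy version on synthetic data, not the Mahout/Hadoop implementation evaluated in the paper; the synthetic matrix merely stands in for a traffic-feature dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a traffic-feature matrix: 200 samples, 10 correlated features.
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

# PCA via eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
k = 2
Z = Xc @ vecs[:, :k]                   # reduced representation
retained = vals[:k].sum() / vals.sum() # fraction of variance kept

# Plain k-means on the reduced data.
def kmeans(data, n_clusters, iters=50):
    centers = data[:n_clusters].copy()
    for _ in range(iters):
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            pts = data[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels

labels = kmeans(Z, 3)
print(Z.shape, float(retained))
```

Because the synthetic data is essentially rank two, almost all variance survives the projection, which mirrors the paper's observation that PCA can cut execution time while preserving clustering accuracy.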

With regard to the performance evaluation of the Hadoop cluster when executing K-means clustering, it was observed that increasing the number of Map tasks can noticeably reduce the execution time (8 Map tasks worked best for our datasets). However, generating too many Map tasks for a given data size can lead to poor performance due to task-management overhead. Furthermore, as expected, increasing the number of processing cores improves performance, which shows that Hadoop and Mahout are scalable. Based on our preliminary experimentation, performance can be significantly improved even for a reasonably large data size (1180 MB) by using multiple Map tasks and/or CPUs; a great deal of research on Hadoop performance has been reported in the literature. This approach, together with a reduced number of features obtained through PCA, can considerably shorten the total execution time. As a result, it can facilitate real-time big data analytics, which can provide useful information in a timely manner for the network operator to manage resources effectively.

Using other machine learning algorithms, such as the classification algorithms in the Mahout library, to analyze the datasets is an interesting direction for future research. In addition, developing a tool that automates the data analysis approach presented in this paper, from collecting and processing to analyzing and displaying the results, can be investigated.

6. ACKNOWLEDGMENT

The authors of this paper would like to thank Ericsson, Canada for providing various contributions and technical support. In addition, this research is partly sponsored by NSERC Engage and NSERC Discovery grants.

7. REFERENCES

Apache. Welcome to Apache Hadoop. Retrieved June 25, 2015, from https://hadoop.apache.org

Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation, 137-150.

Ericsson. (June 2015). Ericsson Mobility Report: On the Pulse of the Networked Society. Retrieved Feb. 8 2017, from https://www.ericsson.com/res/docs/ 2015/ericsson-mobility-report-june-2015.pdf

Esteves, R.M. & Rong, C. (2011). Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud. Proceedings of the International Conference on Cloud Computing Technology and Science, 565-569.

Esteves, R.M., Pais, R., & Rong, C. (2011). K-means Clustering in the Cloud – A Mahout Test. Proceedings of IEEE Workshops of Advanced Information Networking and Applications, 514-519.

Fodor, I. K. (2002). A Survey of Dimension Reduction Techniques. (Lawrence Livermore National Laboratory Technical Report UCRL-ID-148494). Retrieved Feb. 8 2017 from https://computation.llnl.gov/casc/sapphire/pubs/148494.pdf

Hortonworks. Apache Mahout. Retrieved Feb. 8 2017, from http://hortonworks.com/hadoop/mahout/

Kerdprasop, N. (2005). Multiple principal component analyses and projective clustering. Proc. of IEEE International Workshop and Expert Systems Applications.

Kuhn, M. Variable Selection Using the Caret Package. Retrieved Feb. 8 2017, from http://cran.r-project.org/web/packages/caret/caret.pdf.


Leskovec, J., Rajaraman A., & Ullman J. (2014). Mining of Massive Datasets. Stanford, CA: Stanford University.

Netflix. Internet Connection Speed Recommendations. Retrieved Dec. 5, 2015, from https://help.netflix.com/en/node/306

Owen, S., Anil, R., Dunning, T., & Friedman, E. (2012). Mahout in Action. Manning Publications Co. ISBN 978-1-9351-8268-9.

R Core Team. R: A language and Environment for Statistical Computing. Retrieved Dec. 5, 2015, from http://www.r-project.org/

R Core Team. The R Stats Package. Retrieved Feb. 8, 2017, from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html

Tan, P-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.

Tang, D. & Baker, M. (2000). Analysis of a Local-Area Wireless Network. Proceedings of the 6th International Conference on Mobile computing and networking, 1-10.

White, T. (2010). Hadoop: The Definitive Guide. Yahoo Press.

Yang, J., He, H., & Qiao, Y. (2014). Network Traffic Analysis based on Hadoop. Proceedings of the International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems, 1-5.

Authors

Man Si is currently a software developer at CENX Inc. located in Ottawa, ON, Canada. In 2015, she received her M.A.Sc. in Electrical and Computer Engineering from Carleton University (Ottawa, ON, Canada), and in 2013, she graduated from Dalian Maritime University with a B.Eng. degree in Electronic and Information Engineering. Her research interests include data analysis and distributed systems.

Chung-Horng Lung received the B.S. degree in Computer Science and Engineering from Chung-Yuan Christian University, Taiwan, and the M.S. and Ph.D. degrees in Computer Science and Engineering from Arizona State University. He was with Nortel Networks from 1995 to 2001. In September 2001, he joined the Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada, where he is now a Professor. His research interests include Software Engineering, Cloud Computing, and Communication Networks.

Samuel Ajila is currently an associate professor of engineering at the Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada. He received the B.Sc. (Hons.) degree in Computer Science from the University of Ibadan, Ibadan, Nigeria, and the Ph.D. degree in Computer Engineering specializing in Software Engineering and Knowledge-based Systems from LORIA, Université Henri Poincaré – Nancy I, Nancy, France. His research interests are in the fields of Software Engineering, Cloud Computing, Big Data Analytics, and Technology Management.

Wayne Ding is a senior specialist at Ericsson, Canada. He is involved in 4G and 5G wireless system R&D. His work focuses on mobile big data and wireless network KPIs. He received the BS and MS degrees from Dalhousie University, Canada.


ON DEVELOPING THE RAAS Chen-Cheng Ye, Huan Chen, Liang-Jie Zhang, Xin-Nan Li, Hong Liang

School of Marine Science and Technology, Northwestern Polytechnical University, China Kingdee Research, Kingdee International Software Group Company Limited, China

National Engineering Research Center for Supporting Software of Enterprise Internet Services, China

[email protected]

Abstract

Choice is a pervasive feature of social life that profoundly affects us. Ranking results can be used as a reference to help people make the right choice, but there are two problems. One problem is that, most of the time, service providers offer people fixed ranking results rather than the ranking methods themselves as a reference for making a choice. For example, the TIMES World University Rankings can be used as a reference when choosing a college. However, among the numerous factors that affect object ranking, people have their own understanding of the effect of each factor. Taking mobile-phone selection as a practical case, some people think the performance of a mobile phone is more important, while others hold the view that its appearance is more attractive. What's more, many ranking methods have been proposed, such as the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and expert marking. Using only one kind of ranking method for object ranking may lead to overly objective or overly subjective ranking results. Although various ranking algorithms have been studied, very little is known about the detailed development and deployment of ranking services.

This paper proposes a comprehensive solution of Ranking as a Service (RaaS), with manifold contributions. Firstly, we use a combination weighting method in RaaS, which can overcome the defects of purely subjective or purely objective weighting methods. Secondly, we develop ranking service APIs that bring convenience to people when making choices. Thirdly, the ranking service provides ranking results for people according to their own understanding of the effect of each factor on object ranking. Fourthly, this paper is arguably the first to propose Ranking as a Service. Finally, we evaluate and analyze the proposed strategies and technologies based on the experimental results.

Keywords: Ranking as a Service; Combination Weighting Method; TOPSIS; Comprehensive Evaluation Method

__________________________________________________________________________________________________________________

1. INTRODUCTION

Life is always full of choices. For example, customers are faced with the choice of buying goods, companies are faced with the choice of recruiting and selecting new staff, and parents are faced with the choice of choosing a school for their children. Different ranking results of various comprehensive evaluation algorithms are provided to people as a reference to help them make the right choice; for example, the ranking of the top 100 apps provided by the App Store is often used as a reference when people want to try new applications. However, few comprehensive evaluation algorithms are presented to the user as a service. Comprehensive evaluation methods such as the fuzzy comprehensive evaluation method, the artificial neural network evaluation method, the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS) [1], grey relational analysis [2], and the expert scoring evaluation method [3] are widely used. TOPSIS has been successfully applied to the areas of human resources management [11], transportation [12], product design [13], manufacturing [14], water management [15], quality control [16], and location analysis [17]. The grey comprehensive evaluation method and the expert scoring evaluation method are also used in many areas, such as helping manufacturing


companies deal with the problem of assessment and selection of suppliers [19], evaluating land suitability [20], evaluating and selecting complex hardware and software systems [21], forecasting project risk [22] and evaluating water quality [24].

The most important step in evaluating objects with multiple indicators is to determine the weight of each indicator. At present, the methods to determine indicator weights can be divided into two main categories: subjective weighting methods and objective weighting methods. The subjective weighting methods include the Analytical Hierarchy Process (AHP) [4], the precedence diagram method [5], the Delphi method [6], and the TACTIC method [7]. The objective weighting methods include Principal Component Analysis (PCA) [8], the entropy weight method [9], and the variation coefficient method [10].

The goal of this paper is threefold. Firstly, we introduce some subjective and objective weighting methods to help users understand the advantages of the combination weighting method. Secondly, we propose a formalized ranking model based on the combination weighting method. Finally, we go through a complete case of the ranking service to help users deeply understand the internals of the APIs and their concrete usage.

This paper is organized as follows: Section II introduces several typical weighting methods and discusses the advantages and disadvantages of subjective and objective weighting methods. Section III combines subjective with objective weighting methods to obtain the combination weighting method and proposes new comprehensive evaluation methods based on it. Section IV presents the results of the new comprehensive evaluation methods under several scenarios and analyzes their real-time performance in different scenarios. Section V concludes the paper and points out some potential future work.

2. BACKGROUND

In this section, we focus on subjective and objective weighting methods. Typical methods of each kind will be introduced, and the advantages and disadvantages of subjective and objective weighting methods will be discussed.

2.1 SUBJECTIVE WEIGHTING METHODS

The subjective weighting methods ascertain the weight allocation from the attribute importance assigned by the decision makers (experts). The original data are obtained from experts according to their experience and subjective judgment. Several commonly used subjective weighting methods are AHP, the precedence diagram method, the Delphi method, and the TACTIC method. AHP and the precedence diagram method are briefly introduced below.

The precedence diagram method was put forward by Muti. Its basic idea is that the weights of indicators are decided by experts through pairwise comparison between indicators. AHP was proposed by Thomas L. Saaty in the 1970s. It is widely used for solving complicated multi-objective decision-making problems. The first and most important step when using AHP is to establish a hierarchy structure on the basis of the decision-making target, the decision-making objects, and the multiple indicators affecting the objects. Once the hierarchy has been constructed, the indicators that have an impact on the object are pairwise compared against the goal for importance by users.
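As a rough sketch of the AHP weighting step described above, the following Python fragment derives indicator weights from a pairwise-comparison matrix via the principal eigenvector (power iteration) and checks consistency. The 3x3 matrix values are hypothetical; the random-index constant 0.58 is Saaty's standard value for n = 3.

```python
# Hypothetical 3-indicator pairwise-comparison matrix (Saaty 1-9 scale):
# indicator 1 is 3x as important as 2 and 5x as important as 3.
A = [[1.0, 3.0, 5.0],
     [1/3, 1.0, 2.0],
     [1/5, 1/2, 1.0]]
n = len(A)

# Principal eigenvector via power iteration gives the weight vector.
w = [1.0 / n] * n
for _ in range(100):
    w = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    s = sum(w)
    w = [x / s for x in w]

# Consistency check: lambda_max, consistency index CI, ratio CR (RI = 0.58 for n = 3).
lam = sum(sum(A[i][j] * w[j] for j in range(n)) / w[i] for i in range(n)) / n
CI = (lam - n) / (n - 1)
CR = CI / 0.58
print([round(x, 3) for x in w], round(CR, 3))
```

A CR below 0.1 is conventionally taken to mean the judgment matrix is acceptably consistent, which corresponds to the consistency check in Figure II.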

2.2 OBJECTIVE WEIGHTING METHODS

The objective weighting methods determine the weights mainly from the relationships within the raw data. Commonly used objective weighting methods include PCA, the entropy weight method, the variation coefficient method, and the mean square error method; PCA, the entropy weight method, and the variation coefficient method are briefly introduced below. PCA was invented in 1901 by Karl Pearson; it linearly compresses multidimensional data into lower dimensions with minimal loss of information. The variation coefficient method obtains the weights of all indicators by taking advantage of all the information included in each indicator. Its basic idea is that an indicator whose values vary more across objects can better reflect the differences among the objects to be evaluated, and should therefore be endowed with a larger weight. The concept of entropy as a measure of information was introduced by Claude Shannon in 1948. The entropy weight method is an objective weighting method: according to the degree of variation among the evaluation indicators, information entropy is used to calculate the entropy weight of each indicator, and the indicator weights are then modified by the entropy weights to obtain more objective weights.

2.3 ADVANTAGES AND DISADVANTAGES OF SUBJECTIVE AND OBJECTIVE WEIGHTING METHODS


The advantages and disadvantages of subjective and objective weighting methods are summarized in Table 1.

3. RANKING SERVICES INSIGHTS

To overcome the shortcomings of determining indicator weights either purely subjectively or purely objectively, a method combining the subjective and objective weighting methods is adopted when evaluating the objects. In this section, the advantages of the combination weighting approach are first described, then two combination weighting methods and the comprehensive evaluation algorithms are introduced, and finally the implementation of the new comprehensive evaluation method based on combination weighting is described in detail.

3.1 ADVANTAGES OF COMBINATION WEIGHTING APPROACH

Compared with the subjective or the objective weight-deciding method alone, a method combining the two, i.e., the combination weighting approach, will be applied to evaluate the objects in this section. The benefits of using the combination weighting method are obvious: it not only takes decision makers' preferences about the indicators into consideration, but also decreases the subjectivity of the weight decision. Therefore, the decisions or evaluation results made by the combination weighting approach are more truthful and reliable.

3.2 COMBINATION WEIGHTING METHOD

1. Multiplicative synthesis

The combined weight is obtained through multiplicative synthesis [25] in two steps: the weights of each indicator obtained by the various weighting methods are first multiplied together, and normalization is then carried out to obtain the combined weight. The calculation formula is as follows:

$$\theta_j = \frac{\prod_{k=1}^{q} w_j(k)}{\sum_{j=1}^{n} \prod_{k=1}^{q} w_j(k)}$$

2. Weighted linear combination

The combined weight through the weighted linear combination [26] is the weighted average of the weights of each indicator obtained by the various weighting methods. The calculation formula is as follows:

$$\theta_j = \sum_{i=1}^{k} b_i w_{ij}$$

In the two formulas above, $\theta_j$ is the combination weight of the jth indicator, $b_i$ is the weight coefficient of the ith weighting method, and $w_{ij}$ is the weight of the jth indicator according to the ith method.
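A minimal sketch of the two synthesis formulas above, with hypothetical subjective and objective weight vectors for four indicators:

```python
# Hypothetical weight vectors for 4 indicators: one subjective (e.g. from AHP)
# and one objective (e.g. from the entropy weight method).
subjective = [0.40, 0.30, 0.20, 0.10]
objective  = [0.25, 0.25, 0.30, 0.20]

def multiplicative(weight_vectors):
    # theta_j = prod_k w_j(k) / sum_j prod_k w_j(k)
    prods = [1.0] * len(weight_vectors[0])
    for w in weight_vectors:
        prods = [p * wj for p, wj in zip(prods, w)]
    s = sum(prods)
    return [p / s for p in prods]

def linear(weight_vectors, coeffs):
    # theta_j = sum_i b_i * w_ij, with the b_i summing to 1
    n = len(weight_vectors[0])
    return [sum(b * w[j] for b, w in zip(coeffs, weight_vectors))
            for j in range(n)]

theta_mult = multiplicative([subjective, objective])
theta_lin  = linear([subjective, objective], [0.5, 0.5])
print([round(t, 3) for t in theta_mult], [round(t, 3) for t in theta_lin])
```

Both syntheses return a normalized weight vector; the multiplicative form rewards indicators that both methods rank highly, while the linear form simply averages them.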

3.3 COMPREHENSIVE EVALUATION ALGORITHM

Two typical comprehensive evaluation algorithms will be briefly introduced.

The grey system theory is a system science theory originated by Deng Julong. Gray Relational Analysis is a quantitative method of grey system theory which describes the degree of correlation between objects and factors with a relational value.

Table 1. Comparison between Subjective and Objective Weighting Methods

| | Subjective weighting method | Objective weighting method |
|---|---|---|
| Origin | Early | A bit later |
| As a service | Increases the burden on the user | Does not burden the user |
| Theoretical support | Strict mathematical theory of support | Strict mathematical theory of support |
| Complexity | Less complicated | Complicated |
| Evaluation results | Reflect the importance of different attributes to the decision makers | Do not reflect the importance of different attributes to the decision makers |
| Applicability | Limited | More general |


The core idea behind TOPSIS is to rank objects according to their geometric distances to the positive ideal solution and the negative ideal solution. The positive ideal solution consists of the best value of each indicator among the objects, and the negative ideal solution consists of the worst value of each indicator. The closer an object is to the positive ideal solution, the higher its ranking.

3.4 ESTABLISHMENT OF COMPREHENSIVE EVALUATION MODEL

The procedure for using the new comprehensive evaluation method is summarized in Figure I. AHP and the three typical objective weighting methods mentioned in Section II are combined using multiplicative synthesis to obtain three combination weighting methods. Six new comprehensive evaluation methods can then be obtained by combining TOPSIS or Gray Relational Analysis with the three combination weighting methods. The evaluation process is complicated, so the six methods will not all be described in detail. As an example, the new comprehensive evaluation method combining TOPSIS with the combination weighting method based on AHP and the entropy weight method is introduced in the following.

1. Form the original data matrix:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

where $x_{mn}$ is the nth indicator value of the mth object.

2. Use the entropy weight method to determine the objective weight of each indicator $w_{1j}$.

The steps for using the entropy weight method to calculate the objective weight of each indicator $w_{1j}$ are as follows:

Step 1: Carry out dimensionless normalization of the original data matrix. For the bigger-the-better indicators:

$$v_{ij} = \frac{x_{ij} - \min_i(x_{ij})}{\max_i(x_{ij}) - \min_i(x_{ij})}$$

For the smaller-the-better indicators:

$$v_{ij} = \frac{\max_i(x_{ij}) - x_{ij}}{\max_i(x_{ij}) - \min_i(x_{ij})}$$

[Figure I. Process of ranking by the new comprehensive evaluation method: form the original data matrix; use one objective weighting method to determine the objective weight of each indicator; use a subjective weighting method to determine the subjective weight of each indicator; adopt a linear weighting method or multiplicative synthesis to assemble the subjective and objective weights of each indicator into its combination weight; give the evaluation results through the comprehensive evaluation method based on the combination weight.]

Step 2: Calculate the characteristic proportion of the jth indicator of the ith evaluated object, recorded as $p_{ij}$:

$$p_{ij} = \frac{v_{ij}}{\sum_{i=1}^{m} v_{ij}}$$

Step 3: Calculate the entropy of the jth indicator $e_j$:

$$e_j = -\frac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln p_{ij}$$


Step 4: Calculate the difference coefficient of the jth indicator $d_j$:

$$d_j = 1 - e_j$$

Step 5: Determine the entropy weight of each indicator:

$$w_{1j} = \frac{d_j}{\sum_{j=1}^{n} d_j}$$

The entropy weight $w_{1j}$ of each indicator is also its objective weight.
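Steps 1-5 of the entropy weight method can be sketched as follows. The 4x3 data matrix is hypothetical, and all indicators are treated as bigger-the-better:

```python
import math

# Hypothetical data: 4 objects x 3 benefit ("bigger is better") indicators.
X = [[7.0, 180.0, 0.60],
     [9.0, 120.0, 0.90],
     [6.0, 150.0, 0.75],
     [8.0, 200.0, 0.50]]
m, n = len(X), len(X[0])

# Step 1: min-max normalization (benefit direction).
cols = list(zip(*X))
V = [[(X[i][j] - min(cols[j])) / (max(cols[j]) - min(cols[j]))
      for j in range(n)] for i in range(m)]

# Step 2: characteristic proportions p_ij = v_ij / sum_i v_ij.
col_sums = [sum(V[i][j] for i in range(m)) for j in range(n)]
P = [[V[i][j] / col_sums[j] for j in range(n)] for i in range(m)]

# Step 3: entropy e_j = -(1/ln m) * sum_i p_ij ln p_ij  (0 * ln 0 taken as 0).
E = [-sum(p * math.log(p) for p in (P[i][j] for i in range(m)) if p > 0)
     / math.log(m) for j in range(n)]

# Steps 4-5: difference coefficients d_j = 1 - e_j, weights w_1j = d_j / sum d.
D = [1 - e for e in E]
W = [d / sum(D) for d in D]
print([round(w, 3) for w in W])
```

Indicators whose normalized values are spread more unevenly across the objects get lower entropy and hence larger weight, matching the intuition stated in Section 2.2.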

3. Use AHP to determine the subjective weight of each indicator $w_{2j}$. The process of using AHP to obtain $w_{2j}$ is shown in Figure II.

4. Adopt multiplicative synthesis to assemble the subjective and objective weights of each indicator and get its combination weight $w_j$.

5. Give the evaluation results through TOPSIS based on the combination weight $w_j$.

The TOPSIS process to obtain the evaluation results is carried out as follows:

Step 1: Obtain the normalized matrix $\mathbf{S} = (s_{ij})_{m \times n}$ using vector normalization:

$$s_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{m} x_{ij}^2}}, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n.$$

Step 2: Calculate the weighted normalized decision matrix $\mathbf{U} = (u_{ij})_{m \times n}$ on the basis of the combination weights:

$$u_{ij} = w_j s_{ij}, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n,$$

where $\sum_{j=1}^{n} w_j = 1$.

Step 3: Determine the worst object $A_w$ and the best object $A_b$:

$$A_w = \{\max_i(u_{ij}) \mid j \in O^-\} \cup \{\min_i(u_{ij}) \mid j \in O^+\} \equiv \{u_{wj} \mid j = 1, 2, \ldots, n\}$$

$$A_b = \{\min_i(u_{ij}) \mid j \in O^-\} \cup \{\max_i(u_{ij}) \mid j \in O^+\} \equiv \{u_{bj} \mid j = 1, 2, \ldots, n\}$$

where $O^+$ is the set of indices $j$ associated with indicators having a positive impact on the object, and $O^-$ is the set of indices $j$ associated with indicators having a negative impact on the object.

[Figure II. Process of using AHP: establish the hierarchical structure model; form the judgment matrix according to the pairwise comparisons among indicators; calculate the maximum eigenvalue and corresponding eigenvector of the judgment matrix; check the consistency of the judgment matrix; when the judgment matrix satisfies the consistency check, the eigenvector corresponding to the maximum eigenvalue is the weight vector of the indicators.]

Step 4: Calculate the distance $d_{iw}$ between the ith object and the worst object $A_w$:

$$d_{iw} = \sqrt{\sum_{j=1}^{n} (u_{ij} - u_{wj})^2}$$


and the distance $d_{ib}$ between the ith object and the best object $A_b$:

$$d_{ib} = \sqrt{\sum_{j=1}^{n} (u_{ij} - u_{bj})^2}$$

Step 5: Calculate the proximity $c_{iw}$ between each object and the best object:

$$c_{iw} = \frac{d_{iw}}{d_{iw} + d_{ib}}, \quad 0 \le c_{iw} \le 1, \; i = 1, 2, \ldots, m.$$

$c_{iw} = 1$ when the object coincides with the best object, and $c_{iw} = 0$ when it coincides with the worst object.

Step 6: Rank the objects according to $c_{iw}$ $(i = 1, 2, \ldots, m)$.

Another example is the new comprehensive evaluation method combining Gray Relational Analysis with the combination weighting method based on AHP and the variation coefficient method.

1. Form the original data matrix:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

where $x_{mn}$ is the nth indicator value of the mth object.

2. Use the variation coefficient method to determine the objective weight of each indicator $w_{1j}$.

The steps for using the variation coefficient method to calculate the objective weight of each indicator $w_{1j}$ are as follows:

Step 1: Calculate the mean and standard deviation of each column (i.e., each indicator) in the matrix; the mean and standard deviation of the jth indicator are recorded as $\bar{x}_j$ and $\delta_j$.

Step 2: Calculate the variation coefficient of the jth indicator, recorded as $v_j$, from its mean and standard deviation:

$$v_j = \frac{\delta_j}{\bar{x}_j}$$

Step 3: Normalize the variation coefficients to determine the variation coefficient weight of each indicator $w_{1j}$:

$$w_{1j} = \frac{v_j}{\sum_{j=1}^{n} v_j}$$

The variation coefficient weight $w_{1j}$ of each indicator is also its objective weight.
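Steps 1-3 of the variation coefficient method, sketched over a hypothetical 4x3 data matrix (the population standard deviation is assumed):

```python
import math

# Hypothetical data: 4 objects x 3 indicators.
X = [[7.0, 180.0, 0.60],
     [9.0, 120.0, 0.90],
     [6.0, 150.0, 0.75],
     [8.0, 200.0, 0.50]]
m, n = len(X), len(X[0])

# Steps 1-2: per-indicator mean, standard deviation, and variation coefficient.
means = [sum(X[i][j] for i in range(m)) / m for j in range(n)]
stds = [math.sqrt(sum((X[i][j] - means[j]) ** 2 for i in range(m)) / m)
        for j in range(n)]
v = [stds[j] / means[j] for j in range(n)]

# Step 3: normalize the variation coefficients into weights.
W = [vj / sum(v) for vj in v]
print([round(x, 3) for x in W])
```

Indicators with larger relative spread across the objects receive larger weights, consistent with the method's motivation in Section 2.2.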

3. Use AHP to determine the subjective weight of each indicator $w_{2j}$. The process of using AHP to obtain $w_{2j}$ is shown in Figure II.

4. Adopt multiplicative synthesis to assemble the subjective and objective weights of each indicator and get its combination weight $w_j$.

5. Give the evaluation results through Gray Relational Analysis based on the combination weight $w_j$.

The Gray Relational Analysis process to obtain the evaluation results is carried out as follows:

Step 1: Calculate the mean of each column in the matrix to form the reference sequence, recorded as $\mathbf{Y}$:

$$Y_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}$$

Step 2: Make all rows in the matrix dimensionless with respect to the reference sequence; the ith normalized row is denoted $\mathbf{X}_i$:

$$X_i(j) = \frac{x_{ij}}{Y_j}$$

Step 3: Calculate the correlation coefficient between $X_i(j)$ and $Y_j$, recorded as $\xi_i(j)$. For convenience, a new variable $\Delta_i(j) = |X_i(j) - Y_j|$ is introduced; thus

$$\xi_i(j) = \frac{\min_i \min_j \Delta_i(j) + \rho \max_i \max_j \Delta_i(j)}{\Delta_i(j) + \rho \max_i \max_j \Delta_i(j)}$$

where $\rho$ is a nonnegative distinguishing coefficient whose value normally lies between 0 and 1. The smaller the distinguishing coefficient, the better the resolving power; when $\rho \le 0.5463$, the resolving power is best. In this paper, $\rho = 0.5$.

Step 4: Calculate the degree of correlation between $\mathbf{X}_i$ and $\mathbf{Y}$ based on the combination weight of each indicator, denoted $r_i$:

$$r_i = \sum_{j=1}^{n} w_j \xi_i(j)$$

Step 5: Rank the objects according to $r_i$ $(i = 1, 2, \ldots, m)$.
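Steps 1-5 of the Gray Relational Analysis procedure can be sketched as follows. The data matrix and combination weights are hypothetical, and after the dimensionless step the reference sequence is taken to be all ones (each row having been divided by the column means):

```python
# Grey relational analysis against the column-mean reference sequence,
# with rho = 0.5 as in the text; data and weights are hypothetical.
X = [[7.0, 180.0, 0.60],
     [9.0, 120.0, 0.90],
     [6.0, 150.0, 0.75],
     [8.0, 200.0, 0.50]]
w = [0.5, 0.3, 0.2]
rho = 0.5
m, n = len(X), len(X[0])

# Step 1: reference sequence Y_j = column means.
Y = [sum(X[i][j] for i in range(m)) / m for j in range(n)]

# Step 2: dimensionless rows X_i(j) = x_ij / Y_j; the reference becomes all ones.
Xn = [[X[i][j] / Y[j] for j in range(n)] for i in range(m)]

# Step 3: correlation coefficients with Delta_i(j) = |X_i(j) - 1|.
delta = [[abs(Xn[i][j] - 1.0) for j in range(n)] for i in range(m)]
dmin = min(min(row) for row in delta)
dmax = max(max(row) for row in delta)
xi = [[(dmin + rho * dmax) / (delta[i][j] + rho * dmax)
       for j in range(n)] for i in range(m)]

# Steps 4-5: weighted degree of correlation r_i, then rank (largest first).
r = [sum(w[j] * xi[i][j] for j in range(n)) for i in range(m)]
ranking = sorted(range(m), key=lambda i: r[i], reverse=True)
print([round(x, 3) for x in r], ranking)
```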


3.5 INNOVATIONS ON RAAS

The RaaS we provide is innovative in three ways. Firstly, it can deal with objects with missing data. Secondly, it can provide more evaluation results as a reference, compared with the fixed evaluation results that other service providers offer for making a choice. Thirdly, our ranking service is more people-oriented: it provides ranking results according to the importance each user assigns to the factors affecting the ranked objects. This is reasonable because most users consider some factors more important than others when making a choice.

3.6 IMPLEMENTATION DETAILS

The structure of the ranking service is shown in Figure III. Data processing in the ranking service is implemented in the R language, and the service is built on the SpringMVC framework.

[Figure III. Architecture of RaaS: the web service side (browser) exchanges requests and responses with the server, whose components are the Controller, JSP views, the Model, and RJRI.]

4. CASE STUDY

In this section, the evaluation results of the new comprehensive evaluation methods based on combination weighting will be analyzed under several scenarios. What's more, the real-time performance of the new comprehensive evaluation methods will be compared across different scenarios.

As shown in Table II, the new comprehensive evaluation methods based on combination weighting are recorded as Method I to Method VI.

Table II. Function of the New Comprehensive Evaluation Methods

| New comprehensive evaluation method | Function |
|---|---|
| Method I | TOPSIS, AHP and PCA |
| Method II | TOPSIS, AHP and entropy weight method |
| Method III | TOPSIS, AHP and variation coefficient method |
| Method IV | grey relational analysis, AHP and PCA |
| Method V | grey relational analysis, AHP and entropy weight method |
| Method VI | grey relational analysis, AHP and variation coefficient method |

4.1 EVALUATION RESULTS OF UNIVERSITIES

1. Part of the world university ranking results by TIMES is shown in Table III.

It can be seen from Table III that five factors influence the university rankings. In ranking the universities, Teaching, Research, and Citations are considered equally important, and they are far more important than the other two factors. Part of the world university ranking results by the new comprehensive evaluation methods based on combination weighting is shown in Table IV.

To better analyze the ranking results of Method I to Method VI, the mean square error between the TIMES ranking and the ranking of each method is calculated. As shown in Table V, the university ranking produced by the combination of TOPSIS, AHP and PCA is closest to the TIMES ranking.
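As an illustration of this comparison metric, the mean square error between two ranking columns can be computed directly; the example below uses the TIMES and Method I positions from the ten-university excerpt in Table IV (the paper's Table V may cover a larger list).

```python
# Mean square error between the TIMES ranking and the Method I ranking
# (positions taken from the Table IV excerpt).
times    = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
method_1 = [8, 3, 3, 1, 4, 5, 6, 7, 10, 11]

mse = sum((a - b) ** 2 for a, b in zip(times, method_1)) / len(times)
print(mse)
```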

Comparing Table III with Table IV: Table III lists the specific indicator values of each university, from which it is hard to form an intuitive impression of a university. In contrast, Table IV shows the university rankings produced by RaaS based on people's understanding of the factors influencing university rankings, which gives readers a holistic view of each university.


Services Transactions on Big Data (ISSN 2326-442X) Vol. 3, No. 1, 2016 39

Rank  University Name                                Teaching  International Outlook  Research  Citations  Industry Income  Overall Score
1     California Institute of Technology             95.6      64                     97.6      99.8       97.8             95.2
2     University of Oxford                           86.5      94.4                   98.9      98.8       73.1             94.2
3     Stanford University                            92.5      76.3                   96.2      99.9       63.3             93.9
4     University of Cambridge                        88.2      91.5                   96.7      97         55               92.8
5     Massachusetts Institute of Technology          89.4      84                     88.6      99.7       95.4             92
6     Harvard University                             83.6      77.2                   99        99.8       45.2             91.6
7     Princeton University                           85.1      78.5                   91.9      99.3       52.1             90.1
8     Imperial College London                        83.3      96                     88.5      96.7       53.7             89.1
9     Swiss Federal Institute of Technology Zurich   77        97.9                   95        91.1       80               88.3
10    University of Chicago                          85.7      65                     88.9      99.2       36.6             87.9

Table III. Ranking Results of Universities by TIMES

University Name                                TIMES  Method I  Method II  Method III  Method IV  Method V  Method VI
California Institute of Technology             1      8         1          1           1          1         1
University of Oxford                           2      3         2          3           3          2         3
Stanford University                            3      3         4          4           2          3         2
University of Cambridge                        4      1         5          5           4          4         4
Massachusetts Institute of Technology          5      4         3          2           6          6         6
Harvard University                             6      5         9          10          5          5         5
Princeton University                           7      6         8          7           7          8         7
Imperial College London                        8      7         10         8           8          9         9
Swiss Federal Institute of Technology Zurich   9      10        6          6           10         7         8
University of Chicago                          10     11        15         13          9          11        10

Table IV. Ranking Results of Universities by New Ranking Methods

Mean square error  Method I  Method II  Method III  Method IV  Method V  Method VI
TIMES              9.74      19.75      18.2        18.92      17.44     15.7

Table V. Analysis of College Ranking Results


City Name   Exhibition Number  Exhibition Area  Professional Exhibition Number  Professional Exhibition Indoor Area  Number of Exhibition Management Institutions
Shanghai    798                1200.8           13                              44.4                                 3
Guangzhou   480                831              6                               53.48                                2
Beijing     418                552.1            9                               44.79                                1
Chongqing   581                500.4            5                               35.6                                 2
Nanjing     347                370              4                               19.5                                 3
Shenzhen    86                 259.77           1                               10.5                                 2
Chengdu     169                300.9            6                               56.21                                0
Hangzhou    223                227.6            7                               37.4                                 14
Shenyang    242                217.58           5                               17.96                                4
Zhengzhou   192                191.4            3                               24.7                                 0

City Name   UFI Member Units  UFI Certification Program  Exhibition Number in TOP100  Exhibition Number in TOP3
Shanghai    22                20                         24                           74
Guangzhou   9                 8                          25                           63
Beijing     26                17                         6                            43
Chongqing   1                 0                          5                            10
Nanjing     1                 1                          0                            4
Shenzhen    11                11                         9                            17
Chengdu     1                 0                          5                            19
Hangzhou    1                 0                          1                            1
Shenyang    1                 0                          4                            4
Zhengzhou   2                 0                          0                            1

Table VI. Exhibition and Convention City Ranking Results of China

4.2 REAL-TIME ANALYSIS UNDER SEVERAL SCENARIOS

Part of the exhibition and convention city ranking results of China are shown in Table VI:

As seen in Table VI, there are 9 factors affecting the competitiveness of the meeting and exhibition industry in Chinese cities. 100 cities are analyzed, and the real-time performance of the new comprehensive evaluation methods is compared under this scenario. The run times of the methods are shown in Figure IV and Figure V.

Figure IV. Run time among PCA, entropy weight method and variation coefficient method


It can be seen in Figure IV that the run time of Method IV is much longer than those of Method V and Method VI. This result is reasonable, since the eigenvalues and eigenvectors of the data matrix must be computed when PCA is used to calculate objective weights.
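The cost difference can be made concrete by timing an eigendecomposition-based PCA weighting against the entropy weight method on random data (a rough sketch: the formulas are standard textbook versions, not necessarily the authors' R implementation, and the data are synthetic):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 9))   # 100 cities x 9 indicators, as in this scenario

def pca_weights(X):
    # The eigendecomposition of the correlation matrix is the expensive step.
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    share = eigvals / eigvals.sum()        # variance share of each component
    contrib = (eigvecs ** 2) @ share       # each indicator's contribution across components
    return contrib / contrib.sum()

def entropy_weights(X):
    p = X / X.sum(axis=0)                  # column-wise proportions
    e = -(p * np.log(p + 1e-12)).sum(axis=0) / np.log(len(X))   # normalized entropy
    d = 1.0 - e                            # degree of divergence per indicator
    return d / d.sum()

t0 = time.perf_counter()
w_pca = pca_weights(X)
t1 = time.perf_counter()
w_ent = entropy_weights(X)
t2 = time.perf_counter()
print(f"PCA weighting: {t1 - t0:.6f}s, entropy weighting: {t2 - t1:.6f}s")
```

On small matrices the absolute difference is tiny, but the eigendecomposition's extra cost grows with the number of indicators.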

Figure V. Run time between TOPSIS and grey correlation analysis method

As seen in Figure V, the run time of the grey correlation analysis based methods is much longer than that of the TOPSIS-based methods. The result is reasonable, as ranking objects with grey correlation analysis is more complicated than ranking them with TOPSIS.

Figure VI. Views of an HTML5 Page from Various Types of Mobile Phones

4.3 INSTRUCTIONS ON MOBILE PHONE MARKET SHARE RANKING

RaaS provides a reference for users when they make choices. The APIs of RaaS require users to provide data and select the type of ranking algorithm, and then return the ranking results. However, the data provided by users usually contain only a small number of data fields. To make the ranking results more convincing, more comprehensive data should be supplemented; the supplemented data are used to build additional data fields. Data supplementing can be realized in two ways: one is to collect data through a crawler, and the other is to draw on a big data pool, such as Kingdee's existing cloud-based big data analytics services [29]. Taking mobile phone market share ranking as an example, the type of mobile phone used by visitors to an HTML5 [30] webpage can be viewed as a reference for the market share and popularity of various mobile phones. Distributing HTML5 pages in this way is a service provided by KActivity [32], a light application product of WeChat [31]. Data on users' mobile phone types are shown in Figure VI. In Figure VI, the total number of views of a single HTML5 page from various types of mobile phones over the last 7 days is 313,832, which is a large number for data supplementing.

5. CONCLUSION

This paper proposes Ranking as a Service (RaaS) and presents a comprehensive solution for it, covering the development, deployment and evaluation stages. Several typical weighting methods are first introduced, and the advantages and disadvantages of subjective and objective weighting methods are discussed. Compared with purely subjective or objective weighting, combination weighting is considered more reasonable. This paper divides RaaS into four steps based on combination weighting, and then uses one of the new comprehensive evaluation methods as an example to explain the usage of RaaS. Finally, to better leverage the ranking service, this paper works through a concrete college ranking example. The real-time performance of RaaS is also analysed under different scenarios.

6. ACKNOWLEDGEMENT

This work is supported by the central grant funded Cloud Computing demonstration project of China, R&D and industrialization of the SME management cloud (Project No. [2011]2448), hosted by Kingdee Software (China) Company Ltd., under the direction of the National Development and Reform Committee of China; the construction fund of the National Engineering Research Center for Supporting Software of Enterprise Internet Services (Project No. 2012FU125Q09); the Guangdong province project, under grant No. 2015B010131008; the fund for leading talents of Guangdong Province [2012]342; and the Shenzhen high-tech project, under grant No. FWY-CX20140310010238.

7. REFERENCES

[1] TOPSIS. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/TOPSIS
[2] Grey relational analysis. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Grey_relational_analysis
[3] Gao, Z., McCalley, J., & Meeker, W. (2009). A transformer health assessment ranking method: Use of model based scoring expert system. North American Power Symposium (NAPS).
[4] Analytic hierarchy process. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Analytic_hierarchy_process
[5] Precedence diagram method. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Precedence_diagram_method
[6] Delphi method. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Delphi_method
[7] Tactic (method). (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Tactic_(method)
[8] Principal component analysis. (2015). Retrieved Oct 1, 2015, from https://en.wikipedia.org/wiki/Principal_component_analysis
[9] Zhang, Z., Liu, P., & Guan, Z. (2007). The evaluation study of human resources based on entropy weight and grey relating TOPSIS method. International Conference on Wireless Communications, Networking and Mobile Computing (WiCom 2007) (pp. 4423-4426). IEEE.
[10] Chunqing, W., Xuedong, L., & Zhaoxia, G. (2014). Evaluation of equipment renewal based on combination weighting method. IEEE International Conference on Industrial Engineering and Engineering Management (IEEM 2014) (pp. 1394-1398). IEEE.
[11] Chen, M.F., & Tzeng, G.H. (2004). Combining grey relation and TOPSIS concepts for selecting an expatriate host country. Mathematical and Computer Modelling, 40, 1473-1490.
[12] Janic, M. (2003). Multicriteria evaluation of high-speed rail, transrapid maglev, and air passenger transport in Europe. Transportation Planning and Technology, 26(6), 491-512.
[13] Kwong, C.K., & Tam, S.M. (2002). Case-based reasoning approach to concurrent design of low power transformers. Journal of Materials Processing Technology, 128, 136-141.
[14] Milani, A.S., Shanian, A., & Madoliat, R. (2005). The effect of normalization norms in multiple attribute decision making models: A case study in gear material selection. Structural and Multidisciplinary Optimization, 29(4), 312-318.
[15] Srdjevic, B., Medeiros, Y.D.P., & Faria, A.S. (2004). An objective multi-criteria evaluation of water management scenarios. Water Resources Management, 18, 35-54.
[16] Yang, T., & Chou, P. (2005). Solving a multiresponse simulation-optimization problem with discrete variables using a multi-attribute decision-making method. Mathematics and Computers in Simulation, 68, 9-21.
[17] Yoon, K., & Hwang, C.L. (1985). Manufacturing plant location analysis by multiple attribute decision making: Part I - single-plant strategy. International Journal of Production Research, 23, 345-359.
[18] Slagle, J., & Wick, M.R. (1988). A method for evaluating candidate expert system applications. AI Magazine, 9(4), 44.
[19] Kwong, C.K., Ip, W.H., & Chan, J.W.K. (2002). Combining scoring method and fuzzy expert systems approach to supplier assessment: A case study. Integrated Manufacturing Systems, 13(7), 512-519.
[20] Kalogirou, S. (2002). Expert systems and GIS: An application of land suitability evaluation. Computers, Environment and Urban Systems, 26(2), 89-112.
[21] Dujmovic, J. (1996). A method for evaluation and selection of complex hardware and software systems. CMG 96 Proceedings.
[22] Guo, P., & Shi, P. (2005). Research on fuzzy-grey comprehensive evaluation method of project risk. Journal of Xi'an University of Technology, 21(1), 106-109.
[23] Xianguang, G. (1998). Application of improved entropy method in evaluation of economic result. Systems Engineering Theory & Practice, 12.
[24] Meng, X., & Hu, H. (2009). Application of set pair analysis model based on entropy weight to comprehensive evaluation of water quality. Journal of Hydraulic Engineering, 3, 002.
[25] Schmoldt, D.L., Kangas, J., Mendoza, G.A., et al. (2001). The Analytic Hierarchy Process in Natural Resource and Environmental Decision Making. Springer Netherlands.
[26] Weighted Linear Combination. (2015). Retrieved Oct 1, 2015, from http://wiki.gis.com/wiki/index.php/Weighted_Linear_Combination


[27] Times University Ranking 2016. (2015). Retrieved Oct 1, 2015, from https://www.timeshighereducation.com/world-university-rankings/2016/world-ranking#!/page/0/length/25
[28] Exhibition and Convention City Ranking Results of China. (2013). Retrieved Oct 1, 2015, from http://tieba.baidu.com/p/3553849110
[29] Kingdee Cloud. (2015). Retrieved Oct 1, 2015, from http://cloud.kingdee.com/
[30] HTML5. (2015). Retrieved Oct 1, 2015, from https://www.w3.org/
[31] WeChat. (2015). Retrieved Oct 1, 2015, from https://wx.qq.com
[32] KActivity. (2015). Retrieved Oct 1, 2015, from https://www.xingdongliu.com/h5/

Authors

Chencheng Ye received the B.S. and M.S. degrees from Northwestern Polytechnical University, Xi'an, China, in 2014 and 2016, respectively. He is currently working towards the Ph.D. degree in communication and information engineering at the School of Marine Science and Technology, Northwestern Polytechnical University.

Huan Chen is a researcher at Kingdee Research. His research interests are big data architecture and cloud storage.

Liang-Jie Zhang is a computer scientist, a former Research Staff Member at IBM Thomas J. Watson Research Center, Senior Vice President, Chief Scientist and Director of Research at Kingdee International Software Group Company Limited, and a director of The Open Group. He is the founding Editor-in-Chief of IEEE Transactions on Services Computing. He was elected a Fellow of the IEEE in 2011, and won the IEEE Technical Achievement Award "for pioneering contributions to Application Design Techniques in Services Computing".

Xinnan Li is a researcher at Kingdee Research. Her research interest is big data.

Liang Hong received the M.S. degree and the Ph.D. degree in signal and information processing from Northwestern Polytechnical University (NPU), Xi'an, China, in 1995 and 2004, respectively. Currently, she is a Professor at NPU. Her main research interests include signal detection, parameter estimation and adaptive signal processing.


EVALUATIONS OF BIG DATA PROCESSING Duygu Sinanc Terzi1 , Umut Demirezen2 , and Seref Sagiroglu1

1Department of Computer Engineering Gazi University, Ankara, Turkey 2STM Defense Technologies Engineering and Trade Inc., Ankara, Turkey

[email protected] , [email protected] , [email protected]

Abstract: The big data phenomenon refers to large, heterogeneous and complex data sets that pose many challenges in storing, preparing, analyzing and visualizing, as well as to the techniques and technologies for making better decisions and services. Big data analytics uncovers hidden patterns, unknown or unpredicted relations and secret correlations. This can help companies and organizations form new ideas, gain richer and deeper insights, broaden their horizons, gain advantages over their competitors, and so on. To make big data analytics easy and efficient, many big data techniques and technologies have been developed. In this article, the chronological development of batch, real-time and hybrid technologies, together with their advantages and disadvantages, is reviewed, and a number of criticisms of the available processing techniques and technologies are presented. This paper is intended as a roadmap for researchers who work on big data analytics.

Keywords: big data, processing, technique, technology, tools, evaluations

__________________________________________________________________________________________________________________

1. INTRODUCTION

The size of data ranges from gigabytes to zettabytes and beyond. According to a study of Fortune 1000 companies, a 10% increase in data yields $65.7 million in extra income (McCafferty, 2014). Big data flows too fast for traditional processing and requires many new techniques, technologies and approaches to handle the difficulties it brings. Big data is generated from online and offline processes: logs, transactions, click streams, emails, social network interactions, videos, audios, images, posts, books, photos, search queries, health records, science data, sensors, and mobile phones together with their applications and traffic. These data are stored in databases or clouds, and their size continues to grow massively. As a result, it becomes difficult to capture, store, share, analyze and visualize the data with typical tools. Big data combines techniques and technologies that help experts, managers, directors, investors, companies and institutions gain deeper insights into their information assets and abstract new ideas, approaches, values and perceptions from the analyzed data (Dumbill, 2012). To enable efficient decision making, organizations need effective processes to turn high volumes of fast-moving and diverse data into meaningful outcomes. The big data market is worth 10.2 billion dollars now and is expected to reach 53.4 billion dollars by 2017 (McCafferty, 2014). Organizations and institutions can benefit from big data analysis for their future developments, investments, decisions, challenges and directions, with descriptive, predictive and prescriptive analytics such as decision support systems, personalized systems, user behavior analysis, market analysis, location-based services, social analysis, healthcare systems and scientific research.

To clarify and express the features of big data, the five Vs of volume, variety, velocity, veracity and value (Dumbill, 2012), (Demchenko, Grosso, De Laat, & Membrey, 2013), (M. Chen, Mao, & Liu, 2014) are frequently used to explain or understand its nature (Fig. 1). Volume is the size of data produced or generated; it is huge and might be measured in terabytes, petabytes, exabytes or more. Volume is what chiefly distinguishes big data from other data. Variety covers the different forms and the complexity of big data and imposes new requirements on analysts, technologies and tools; big data comes from a variety of sources in three types: structured, semi-structured and unstructured. Velocity is important not only for big data but also for all processes: the speed of generating or processing big data is crucial for further steps to meet demands and requirements. Veracity deals with the consistency and trustworthiness of big data. Recent statistics have shown that 1 in 3 decision makers do not trust the information gathered from big data because of its inaccuracy (Center, 2012). Accordingly, collected or analyzed big data should come from a trusted origin, be protected from unauthorized access, and be in a normal format, even if this is hard to achieve. Value is the most important feature of big data and provides outputs for the demands of business requirements. Accessing and analyzing big data is very important, but it is useless if no value is derived from the process. Value can take different forms, such as statistical reports, a trend that was previously invisible, cost-saving resolutions, detected improvements, or new thoughts for better solutions or achievements.

Figure 1. 5 V’s of big data

Working with big data is a complex process with conceptual and technical challenges, which is why a high number of different approaches exist. In this paper, big data concepts are summarized and the techniques, technologies, tools and platforms for big data are reviewed in Sections 1 and 2. In Section 3, the chronological development of big data processing is reviewed according to the technologies covered. Finally, discussion and conclusions are outlined in Section 4.

2. TECHNIQUES AND TECHNOLOGIES FOR BIG DATA

Big data is a way of understanding not only the nature of data but also the relationships among data. Identifying characteristics of the data is helpful in defining its patterns. Key characteristics for big data are grouped into ten classes (Hashem et al., 2015), (Mysore & Jain, 2013), (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015) (Fig. 2).

To enable efficient decision-making, organizations need effective processes to turn high volumes of fast-moving and diverse data into meaningful outcomes. Big data analytics helps boost the digital economy and provides opportunities by supporting or replacing decision-making processes with automated algorithms. In addition, it helps reduce costs and predict the behaviors or habits of groups, teams, supporters or adversaries from enough features of the available data. Data management of big data involves the processes and supporting technologies to acquire, store and prepare data, while analytics refers to the techniques used in analyzing and extracting intelligence from big data (Gandomi & Haider, 2015).

The techniques for big data analytics draw on multiple disciplines, including mathematics, statistics, data mining, pattern recognition, machine learning, signal processing, simulation, natural language processing, time series analysis, social network analysis, crowdsourcing, optimization methods, and visualization approaches (M. Chen et al., 2014). Big data analytics needs new techniques to process huge amounts of data in an efficient and timely manner and to obtain better decisions and values.

The technologies for big data processing have chronologically evolved through batch processing, real-time processing and hybrid computation as big data itself has evolved (Casado & Younas, 2015). Batch processing is a solution to the volume issue, real-time processing deals with the velocity issue, and hybrid computation is suitable for both. The techniques and technologies developed in this context are summarized in Table 1 as platforms, databases and tools (Wang & Chen, 2013), (Cattell, 2011), (Bajpeyee, Sinha, & Kumar, 2015). The table should help companies, institutions and practitioners understand big data analysis and develop new ideas, deep insights, perceptions and knowledge from it.


Figure 2. Big data classification

PLATFORM TYPE   TOOLS
LOCAL           Hadoop, Spark, MapR, Cloudera, Hortonworks, InfoSphere, IBM BigInsights, Asterix
CLOUD           AWS EMR, Google Compute Engine, Microsoft Azure, Pure System, LexisNexis HPCC Systems

DATABASE TYPE       TOOLS
SQL                 Greenplum, Aster Data, Vertica, SpliceMachine
NoSQL (Column)      HBase, HadoopDB, Cassandra, Hypertable, BigTable, PNUTS, Cloudera, MonetDB, Accumulo, BangDB
NoSQL (Key-value)   Redis, Flare, Sclaris, MemcacheDB, Hypertable, Valdemort, Hibari, Riak, BerkeleyDB, DynamoDB, Tokyo Cabinet, HamsterDB
NoSQL (Document)    SimpleDB, RavenDB, ArangoDB, MongoDB, Terrastore, CouchDB, Solr, Apache Jackrabbit, BaseX, OrientDB, FatDB, DjonDB
NoSQL (Graph)       Neo4J, InfoGrid, Infinite Graph, OpenLink, FlockDB, Meronymy, AllegroGraph, WhiteDB, TITAN, Trinity
IN-MEMORY           SAP HANA

TOOL FUNCTION                   TOOLS
DATA PROCESSING                 MapReduce, Dryad, YARN, Storm, S4, BigQuery, Pig, Impala, Hive, Flink, Spark, Samza, Heron
DATA WAREHOUSE                  Hive, HadoopDB, Hadapt
DATA AGGREGATION & TRANSFER     Sqoop, Flume, Chukwa, Kafka, ActiveMQ
SEARCH                          Lucene, Solr, ElasticSearch
QUERY LANGUAGE                  Pig Latin, HiveQL, DryadLINQ, MRQL, SCOPE, ECL, Impala
STATISTICS & MACHINE LEARNING   Mahout, Weka, R, SAS, SPSS, Python, Pig, RapidMiner, Orange, BigML, Skytree, SAMOA, Spark MLlib, H2O
BUSINESS INTELLIGENCE           Talend, Jaspersoft, Pentaho, KNIME
VISUALIZATION                   Google Charts, Fusion Charts, Tableau Software, QlikView

[Figure 2 content: big data classified into ten classes with their typical values. Data Type: transactional, historical, master, meta. Data Format: structured, semi-structured, unstructured. Data Frequency: on demand, real time, time series. Data Source: web and social media, Internet of Things (IoT), internal data sources, via data providers. Data Store: graph, key-value, column oriented, document oriented. Processing Purpose: interactive, real time, batched, mix. Data Consumer: industry, academia, government, research centers. Data Usage: human, business process, enterprise applications, data repositories. Analysis Type: predictive, analytical, modelling, reporting. Processing Method: high performance computing, distributed, parallel, cluster, grid.]


SOCIAL MEDIA Radian6, Clarabridge

Table 1. Big data tools in different perspectives

3. BIG DATA PROCESSING

3.1 BATCH PROCESSING

Big data batch processing started with the Google File System, a distributed file system, and the MapReduce programming framework for distributed computing (Casado & Younas, 2015). MapReduce splits a complex problem into sub-problems handled by Map and Reduce steps: the sub-problems of a complex big data problem are solved in parallel and their results are then combined into the solution of the original problem.
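The split/solve/combine idea can be illustrated with a toy in-process word count written in the MapReduce style (an illustration of the programming model only, not the Hadoop or GFS API):

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit (key, value) pairs for one input split.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Shuffle: group values by key, then Reduce: combine each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data batch", "batch processing of big data"]
# Each split could be mapped on a different node; results are merged by Reduce.
mapped = [pair for chunk in splits for pair in map_phase(chunk)]
counts = reduce_phase(mapped)   # {'big': 2, 'data': 2, 'batch': 2, 'processing': 1, 'of': 1}
```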

Apache Hadoop is a well-known big data platform consisting of the Hadoop kernel, MapReduce and HDFS (Hadoop Distributed File System), alongside a number of related projects including Cassandra, Hive, HBase, Mahout, Pig and so on (C. P. Chen & Zhang, 2014). The framework targets distributed storage and processing of big data sets on clusters (P. Almeida, 2015). Microsoft Dryad is another programming model for implementing scalable parallel and distributed programs. Dryad executes operations on the vertices of a dataflow graph in clusters and uses channels for data transmission. Dryad is not only more complex and powerful than MapReduce and the relational algebra, but, unlike MapReduce, it also supports any amount of input and output data (M. Chen et al., 2014). HPCC (High Performance Computing Cluster) Systems is a distributed, data-intensive, open source computing platform that provides big data workflow management services. Unlike Hadoop, HPCC's data model is defined by the user. Solutions to complex problems can be stated easily on the basis of the high-level ECL (Enterprise Control Language). HPCC ensures that ECL is executed in the minimum elapsed time and that nodes are processed in parallel. Furthermore, the HPCC platform does not require third-party tools like GreenPlum, Oozie, Cassandra, an RDBMS, etc. ("Why HPCC Systems is a superior alternative to Hadoop,").

3.2 REAL-TIME PROCESSING

Big data applications based on write-once, analyze-many data management architectures are unable to scale for real-time data operations. Some years after MapReduce came into use, big data analytic applications shifted to the stream processing paradigm (Tatbul, 2010). Hadoop-based programming models and frameworks cannot offer the combination of latency and throughput required by real-time applications in areas such as real-time analytics, the Internet of Things, fraud detection, system monitoring and cybersecurity. The stream processing programming model mainly depends on the freshness of data in motion. Processing data as it travels from its source to its destination, as soon as it is generated, is challenging but effective: eliminating the latency in gaining value from data has outstanding advantages, and the data is analyzed to obtain results at once. The data travels from source to destination as continuous or discrete streams, and this time the velocity property of big data has to be handled.

Using stream processing techniques for big data requires special frameworks to analyze and obtain value from the data. Because a data stream is fast and has gigantic volume, only a small part of the stream can be stored in bounded memory. In contrast to the batch processing model, where data is first stored and indexed and then processed by queries, stream processing handles the inbound data while it is in motion, as it streams through its target. Stream processing also connects to external data sources, allowing applications to integrate selected data into the application flow, or to update a destination system with processed information.

Despite not supporting stream processing natively, MapReduce can partially handle streams using the micro-batching technique. The idea is to process the stream as a sequence of small data chunks: the incoming stream is grouped into chunks of data that are sent to a batch processing system at short intervals. Some MapReduce implementations, especially real-time ones like Spark Streaming (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010), support this technique. However, it is not adequate for the demands of a low-latency stream processing system, and the MapReduce model itself is not suitable for stream processing.
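Micro-batching can be sketched as grouping an unbounded stream into fixed-size chunks and handing each chunk to a batch job, similar in spirit to Spark Streaming's discretized streams (a simplified stand-in, not Spark's API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (unbounded) iterator of events into fixed-size chunks."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def batch_job(chunk):
    # Stand-in for a MapReduce-style job run on one micro-batch.
    return sum(chunk)

events = range(10)   # pretend this is an incoming stream of numeric events
results = [batch_job(b) for b in micro_batches(events, batch_size=4)]
# Chunks [0..3], [4..7], [8, 9] -> partial results [6, 22, 17]
```

Latency is bounded below by the batch interval, which is why this approach cannot match a true record-at-a-time stream processor.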

The stream processing paradigm is used for real-time applications, generally at the second or even millisecond level. Typical open source stream processing frameworks are Samza (Feng, Zhuang, Pan, & Ramachandra, 2015), Storm (Toshniwal et al., 2014), S4 (Neumeyer, Robbins, Nair, & Kesari, 2010), and Flink (Renner, Thamsen, & Kao, 2015). All of them are low-latency, distributed, scalable and fault-tolerant, and provide simple APIs that abstract the complexity of the underlying implementations. Their main approach is to process data streams through parallel tasks distributed across the machines of a computing cluster, with fail-over capabilities.

There are three general categories of delivery guarantees for stream processing: at-most-once, at-least-once and exactly-once. At-most-once delivery means that each message handed to the next processing unit in a topology is delivered zero or one time, so messages may be lost during delivery. At-least-once delivery means that multiple delivery attempts may be made for each message handed to a processing unit, such that at least one attempt succeeds; messages may be delivered multiple times and duplicates may occur, but no message is lost. Exactly-once delivery means that each message is delivered to the receiving unit exactly once, which prevents both message loss and duplication.
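The practical difference between at-least-once and exactly-once can be illustrated by replaying a delivery log in which one acknowledgement was lost, so the sender redelivered (a toy model; real frameworks achieve these guarantees with acknowledgements, transactions or checkpoints):

```python
# An unreliable channel redelivers message "b" once because its ack was lost.
deliveries = [("a", "pay-a"), ("b", "pay-b"), ("b", "pay-b"), ("c", "pay-c")]

# At-least-once: every delivery is processed, so "b"'s effect is applied twice.
at_least_once = {}
for msg_id, payload in deliveries:
    at_least_once[msg_id] = at_least_once.get(msg_id, 0) + 1

# Effectively exactly-once: deduplicate by message id before applying the effect.
seen, exactly_once = set(), {}
for msg_id, payload in deliveries:
    if msg_id in seen:
        continue                 # duplicate redelivery, skip it
    seen.add(msg_id)
    exactly_once[msg_id] = exactly_once.get(msg_id, 0) + 1

# at_least_once == {'a': 1, 'b': 2, 'c': 1}; exactly_once counts each id once
```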

Another important concern for stream processing frameworks is state management. Frameworks use different strategies to store state: Spark Streaming writes state information to external storage, while Samza uses an embedded key-value store. In Apache Storm, state management must be handled either separately at the application level or through Storm's higher-level abstraction, Trident. Flink's state management is based on consistent global snapshots inspired by the Chandy-Lamport algorithm (Chandy & Lamport, 1985); it provides low runtime overhead and stateful exactly-once semantics. Because every application has latency requirements, a stream processing framework has to be chosen carefully for the application domain. Storm and Samza offer sub-second latency with at-least-once delivery semantics, while Spark Streaming offers second-level latency with exactly-once delivery semantics, depending on the batch size. Flink supports sub-second latency with exactly-once delivery semantics and checkpoint-based fault tolerance. If large-scale state management is the priority, Samza may be the better fit. Storm can perform micro-batch processing through its Trident abstraction; in that case the framework provides medium-level latency with exactly-once delivery semantics.
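The embedded-state approach can be illustrated with a hypothetical sketch (the class and method names are ours, not any framework's API): a stateful counter keeps its state in an in-memory key-value store and takes periodic snapshots, loosely in the spirit of checkpoint-based recovery; updates made after the last snapshot are lost on failure and must be replayed from the input log.

```python
# Hypothetical sketch: a stateful word-count operator with an embedded
# key-value store and snapshot-based recovery. Not Samza or Flink code.
import copy

class StatefulCounter:
    def __init__(self):
        self.state = {}          # embedded key-value store (in memory)
        self.checkpoint = {}     # last consistent snapshot

    def process(self, word):
        self.state[word] = self.state.get(word, 0) + 1

    def snapshot(self):
        self.checkpoint = copy.deepcopy(self.state)   # persist a snapshot

    def recover(self):
        self.state = copy.deepcopy(self.checkpoint)   # roll back after failure

op = StatefulCounter()
for w in ["a", "b", "a"]:
    op.process(w)
op.snapshot()                  # state: {"a": 2, "b": 1}
op.process("c")                # un-checkpointed update...
op.recover()                   # ...discarded on recovery, replayed from the log
print(op.state)                # {'a': 2, 'b': 1}
```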

Scalable Message Oriented Middleware (MOM) plays a very important role in developing distributed and stream processing applications. Integrating different data sources and databases is critical for successful stream processing, and MOM helps build scalable distributed stream processing applications across multiple platforms, gathering data from different sources into a seamless whole. Various commercial and open source MOMs are available, each with its own advantages and disadvantages stemming from its architecture and programming model. In its simplest form, MOM delivers messages from a sender to a receiving target using queue-based techniques: a sender application that needs to deliver a message puts it in a queue, and the MOM system takes the message from that queue and forwards it to the particular target queue.
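The queue-based pattern described above reduces to a few lines (a hypothetical in-memory sketch with invented names; real MOMs add persistence, acknowledgments, and distribution across nodes):

```python
# Minimal in-memory illustration of queue-based MOM: the sender enqueues,
# the middleware routes the message to the target queue, the receiver dequeues.
from collections import deque

class Broker:
    def __init__(self):
        self.queues = {}

    def queue(self, name):
        return self.queues.setdefault(name, deque())

    def send(self, target, msg):
        self.queue(target).append(msg)     # put message on the target queue

    def receive(self, target):
        q = self.queue(target)
        return q.popleft() if q else None  # pull next message, FIFO order

broker = Broker()
broker.send("orders", {"id": 1, "item": "sensor"})
broker.send("orders", {"id": 2, "item": "gateway"})
print(broker.receive("orders"))  # {'id': 1, 'item': 'sensor'}
```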

One of the best-known messaging systems is Apache Kafka (Kreps, Narkhede, & Rao, 2011). Kafka is a distributed publish-subscribe messaging system that is fast, scalable, and distributed by design, and it provides a partitioned, replicated commit log service. In Kafka, a stream of messages of a particular type is defined by a topic. A producer client publishes messages to a topic, and the published messages are stored on a cluster of servers called brokers. A consumer subscribes to one or more topics and consumes the subscribed messages by pulling data from the brokers. Kafka is very much a general-purpose system: many producers and consumers can share multiple topics.
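The topic model can be sketched in miniature (a hypothetical illustration of the semantics, not the Kafka client API): messages are appended to a per-topic log, and each consumer pulls from its own offset, so multiple consumers read the same data independently and the log is not consumed away.

```python
# Hypothetical sketch of Kafka's topic model: an append-only log per topic,
# and pull-based consumers that each track their own offset.
class TopicLog:
    def __init__(self):
        self.topics = {}                     # topic -> append-only message log

    def publish(self, topic, msg):           # producer side
        self.topics.setdefault(topic, []).append(msg)

class Consumer:
    def __init__(self, log, topic):
        self.log, self.topic, self.offset = log, topic, 0

    def poll(self):                          # pull everything past our offset
        msgs = self.log.topics.get(self.topic, [])[self.offset:]
        self.offset += len(msgs)
        return msgs

log = TopicLog()
log.publish("clicks", "c1")
log.publish("clicks", "c2")

a, b = Consumer(log, "clicks"), Consumer(log, "clicks")
print(a.poll())   # ['c1', 'c2'] -- each consumer reads the full log
print(b.poll())   # ['c1', 'c2'] -- independently of the other
log.publish("clicks", "c3")
print(a.poll())   # ['c3'] -- only messages past its own offset
```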

In contrast, Flume (Han & Ahn, 2014) is a special-purpose framework designed to send data to HDFS and HBase (C. P. Chen & Zhang, 2014), with various optimizations for HDFS. Flume can process data in motion using its interceptors, which can be very useful for ETL operations or filtering; Kafka requires an external stream processing system to perform this type of work. Unlike Kafka, Flume does not support event replication. Consequently, even when the reliable file channel is used, if a Flume agent on a node goes down, the events in its channel cannot be accessed until the node is fully recovered.

Services Transactions on Big Data (ISSN 2326-442X) Vol. 3, No. 1, 2016 49

Flume and Kafka can also be used together very well. When data must be streamed from Kafka into Hadoop, using a Flume agent with a Kafka consumer to read the data has some advantages. For instance, Flume's integration with HDFS and HBase is natural, and by simply adding an interceptor, some stream processing can easily be done during delivery as well. For this reason, the best practice for this type of work is to use Kafka when the data will be consumed by multiple sinks and Flume when the data is destined for Hadoop. Flume has many built-in sources and sinks that can be used in various architectures and designs, whereas Kafka has a considerably smaller producer and consumer ecosystem.

There are other Message Oriented Middlewares as well, and the selection for an application depends on its specific requirements. Simple Queue Service (SQS) is a message-queue-as-a-service offering from Amazon Web Services (Yoon, Gavrilovska, Schwan, & Donahue, 2012). It supports only simple but useful messaging operations, far lighter than the complexity of, e.g., AMQP (Advanced Message Queuing Protocol). SQS provides at-least-once delivery and guarantees that after a successful send operation the message is replicated to multiple nodes; it performs well and requires no setup. RabbitMQ is one of the leading open-source messaging systems. Developed in Erlang and very popular for messaging, it implements AMQP and supports both message persistence and replication with partitioning. If high persistence is required, RabbitMQ guarantees that sent messages are replicated across the cluster and to disk. Apache ActiveMQ is one of the most popular message brokers, widely used for its good performance and wide protocol support. HornetQ, developed by JBoss as part of JBoss AS, is a multi-protocol, embeddable, high-performance, clustered, asynchronous messaging system that implements JMS. It supports over-the-network replication using live-backup pairs and offers great performance with a rich messaging interface and many routing options.

3.3 HYBRID PROCESSING

Many big data applications include both batch and real-time operations. This need can be addressed with hybrid solutions. Hybrid computation in big data started with the introduction of the Lambda Architecture (LA) (Casado & Younas, 2015). The LA makes it possible to optimize costs by identifying which parts of the data require batch processing and which require real-time processing. In addition, the architecture allows various computation scripts to be executed on partitioned datasets (Kiran, Murphy, Monga, Dugan, & Baveja, 2015).

Basically, an LA comprises three distinct layers for processing both data in motion (DiM) and data at rest (DaR) at the same time (Marz & Warren, 2015). Each layer of the LA is dedicated to a certain task for processing a different type of data; the architecture combines the processed results from these layers and serves the merged data sets for querying. The speed layer is mainly responsible for processing streaming data (DiM) and is very vulnerable to delayed and recurring data. The batch layer processes the offline data (DaR) and corrects the errors that sometimes occur in data arriving at the speed layer. The serving layer is in charge of ingesting data from the batch and speed layers, indexing it, and combining the result data sets for queries from applications. This layer must import both streaming data in real time as it arrives and batch data of huge size; because of this special requirement, the technologies usable for this layer are currently limited in number, though not few. It must be emphasized that LAs are eventually consistent systems for data processing applications and can be used to deal with the constraints of the CAP theorem (Twardowski & Ryzko, 2014).
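The three layers can be made concrete with a small sketch (hypothetical function and variable names; a real LA runs these computations on Hadoop/Storm-scale systems): the batch layer recomputes a view over the full master dataset, the speed layer aggregates only the recent events not yet absorbed by a batch run, and the serving layer merges the two views at query time.

```python
# Hypothetical sketch of the three LA layers: a batch view over data at rest,
# a speed view over data in motion, and a serving layer that merges both.
def batch_view(events):
    """Batch layer: recompute an aggregate over the immutable master dataset."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(recent_events):
    """Speed layer: aggregate only events not yet covered by the batch run."""
    return batch_view(recent_events)         # same logic, much smaller input

def serve(batch, speed, user):
    """Serving layer: merge batch and real-time views at query time."""
    return batch.get(user, 0) + speed.get(user, 0)

master = [("alice", 10), ("bob", 5), ("alice", 7)]   # data at rest
recent = [("alice", 3)]                              # data in motion

bv, sv = batch_view(master), speed_view(recent)
print(serve(bv, sv, "alice"))   # 20
```

When the next batch run absorbs the recent events, the speed view is discarded and rebuilt, which is why the LA is an eventually consistent design.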

Figure 3. Big data classification

A conceptual view of the LA is shown in Fig. 3. Incoming data from the data bus is sent to both the speed and batch layers, and the views generated by these layers are hosted on the serving layer. Different technologies can be used in all three layers to form an LA; following the polyglot persistence paradigm, each technology is chosen for its special capability to process data. Big data technologies are indispensable for IoT devices, and the LA can be used for IoT-based smart home applications (Villari, Celesti, Fazio, & Puliafito, 2014): data from different IoT sensors can be collected and processed both in real time and offline. In one such three-layered architecture, Apache Storm was used for real-time processing, and MongoDB was used for both the batch and serving layers.

It is very common for Apache Storm to be used for the speed layer of an LA and MongoDB for the batch and serving layers. Using two different technologies for the speed and batch layers results in developing two different pieces of processing software, and maintaining at least two different code bases for LA applications is not easy in the big data domain: debugging and deploying different software on large hardware clusters requires extra effort, attention, knowledge, and work, and can at times be a painful job. One approach to overcoming this problem is to use the same data processing technology for the different layers (Demirezen, Küçükayan, & Yılmaz, 2015). It has been shown that combining the batch and serving layers and exploiting MongoDB's high-speed real-time data ingestion capabilities helps accomplish this, but it is not enough: the same data processing technology also has to be selected for the speed and batch layers. A multi-agent big data processing approach was implemented to build a collaborative-filtering recommendation engine using an LA (Twardowski & Ryzko, 2014); Apache Spark/Spark Streaming, Apache Hadoop YARN, and Apache Cassandra were used for the real-time, batch, and serving layers, respectively, and agent-based serving, batch, and speed layers were implemented to build both real-time and batch views and to query the aggregated data. Apache Hadoop and Storm are mature technologies for implementing an LA alongside other technologies.

A different approach to using the speed and batch layers of an LA has also been implemented (Kroß, Brunnert, Prehofer, Runkler, & Krcmar, 2015). When time constraints are not on the order of minutes, running the speed layer continuously is not required; using stream processing only when the batch processing time exceeds the system's response-time requirement is a method to utilize cluster resources efficiently. Running the speed layer at the right time requires predicting the finishing time of the batch layer's processing: performance models of the software system are used to predict performance metrics of the system and cluster resources, and the speed layer is launched accordingly. This is an application-specific approach and has to be investigated at design time.

A data bus is formed to ingest a high volume of real-time data for the Lambda Architecture. One of the most widely used frameworks for the data bus is Apache Kafka, which is mature and well suited to this purpose.

It supports high-throughput data transfer and is a scalable, fault-tolerant framework for data bus operations. For the speed layer, Apache Samza, Apache Storm, and Apache Spark (Streaming) are very good choices. Apache Hadoop and Apache Spark are commonly used for batch layer operations. Apache Cassandra, Redis, Apache HBase, and MongoDB might be used as the speed layer database; these databases support not only high-speed real-time data ingestion but also random reads and writes. MongoDB, CouchbaseDB, SploutSQL, and VoldemortDB can be used as a batch layer database; these can import bulk data to form and serve batch views. In general, NoSQL databases rather than relational databases are used in an LA: their scalability and advanced data ingestion capabilities are the main reasons they are chosen for the serving layer.

Programming in distributed frameworks may be complex, and debugging may be even harder; in an LA this has to be done twice, once each for the batch and speed layers. The most important disadvantage of the Lambda Architecture is that it is often impractical for developers to write the same algorithm twice in different frameworks: the same business logic is needed in both layers, which requires implementing the same algorithm for each, and maintaining and debugging such code can be a very challenging process in distributed computing. Using Apache Spark and Spark Streaming together allows the same code to be reused for batch and online processing and allows data streams to be joined against historical data. In the Lambda Architecture, Spark Streaming and Spark can thus be used to develop the speed layer and batch layer applications. However, one problem remains: the serving layer has to be integrated with both layers and provide data ingestion for both, and the speed and batch layers require different ingestion capabilities and operations, so the serving layer has to be designed around these challenges. The serving layer is generally built from several different database technologies in an LA, but querying multiple databases, merging the results, and sending the response is very hard work, especially in big data analytics applications. Instead of this approach, the serving layer can be built on a single database technology that supports both sets of requirements.
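The code-reuse argument can be illustrated schematically (a hypothetical sketch, with plain Python lists standing in for RDDs and DStream micro-batches): one shared transformation function is applied unchanged by the batch job over the full dataset and by the streaming job over successive micro-batches, yielding identical results.

```python
# Hypothetical sketch of code reuse across batch and speed layers:
# a single shared transformation used by both execution modes.
def business_logic(records):
    """The one shared algorithm: drop negative readings, square the rest."""
    return [r * r for r in records if r >= 0]

def run_batch(dataset):
    """Batch layer: apply the logic to the whole dataset at once."""
    return business_logic(dataset)

def run_streaming(micro_batches):
    """Speed layer: apply the same logic to each incoming micro-batch."""
    out = []
    for batch in micro_batches:        # e.g. successive micro-batches of a stream
        out.extend(business_logic(batch))
    return out

data = [3, -1, 2, 5]
assert run_batch(data) == run_streaming([[3, -1], [2, 5]])
print(run_batch(data))   # [9, 4, 25]
```

Writing `business_logic` once is exactly what the Spark/Spark Streaming combination enables, instead of re-implementing it separately for, say, Hadoop and Storm.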

4. CONCLUSIONS

Big data approaches provide new challenges and opportunities to users, customers, and researchers, provided that data and sources are available. Although existing big data systems provide new solutions, they are still complex and demand substantial system resources, tools, techniques, and technologies. For this reason, it is necessary to develop cheaper, better, and faster solutions.

Big data solutions come in three forms: software-only, as an appliance, and cloud-based (Dumbill, 2012). These solutions are chosen according to the application, the requirements, and the availability of data. A large number of big data products combine a Hadoop environment with infrastructure and analysis capabilities, while others are developed for a specific framework or topic. Big data infrastructures and techniques should trigger the development of novel tools and advanced algorithms with the help of distributed systems, granular computing, parallel processing, cloud computing, bio-inspired systems, hybrid systems, and quantum computing technologies (C. P. Chen & Zhang, 2014). For example, cloud computing serves big data by providing a flexible and effective infrastructure. Storage and management of heterogeneous data are handled via distributed file systems and NoSQL databases. Dividing big problems into smaller pieces can make them easier and faster to solve, so granular computing and parallel processing are good choices. Simulating the intelligence or social behaviors of living creatures connects the development of machine learning and artificial intelligence with bio-inspired systems. In addition to software innovations, hardware innovations in processor and storage technology and in network architecture have played a major role (Kambatla, Kollias, Kumar, & Grama, 2014).

To achieve more successful big data management and better application results, it is necessary to select appropriate programming models, tools, and technologies (Lei, Jiang, Wu, Du, & Zhu, 2015). Even after the technical infrastructure is provided and qualified, big data experts are needed to select the appropriate data management model and analysis process, to organize data priorities, and to suggest creative ideas on big data problems for scientific development or capital investment (M. Chen et al., 2014), (Rajan et al., 2013). For qualified people to achieve professional results, training and learning opportunities must be supplied, for example by providing large public-domain datasets for use in research and development. Opening new courses and programs at universities might also help increase the number of experts able to handle these problems easily and effectively.

It is also expected that big data will not only provide opportunities for improving operational efficiency, informing better strategic targets, providing better customer services, identifying and producing new tools, products, and services, and distinguishing customers and users, but also help prevent threats and privacy violations and provide better security. It is generally accepted that traditional protection techniques are not suitable for big data security and privacy. Moreover, open source and other new big data technologies may harbor unknown drawbacks if they are not well understood. For this reason, the confidentiality, integrity, and availability of information and of the computing architecture must be examined from every angle in big data analysis. The development of big data systems and applications has eroded individual control over the collection and use of personally identifiable information, for example by revealing new and secret facts about people or by adding value to organizations using data collected from unaware individuals. As indicated in (Wong, Fu, Wang, Yu, & Pei, 2011), anonymization techniques such as k-anonymity, l-diversity, and t-closeness may be solutions to prevent such situations. Therefore, laws and regulations with clearly defined boundaries must be enforced against unauthorized access, data sharing, misuse, and reproduction of personal information.
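For illustration, the k-anonymity property cited above can be checked in a few lines (a hypothetical sketch; the function name and sample records are ours): a table is k-anonymous when every combination of quasi-identifier values appears in at least k rows, so no individual can be singled out by those attributes alone.

```python
# Hypothetical sketch: check k-anonymity of a table over quasi-identifier
# columns -- every quasi-identifier combination must occur at least k times.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

records = [
    {"age": "30-40", "zip": "341**", "disease": "flu"},
    {"age": "30-40", "zip": "341**", "disease": "cold"},
    {"age": "20-30", "zip": "068**", "disease": "flu"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))  # False: one group has size 1
```

Generalizing the third record's age and zip into an existing group, or suppressing it, would restore 2-anonymity; l-diversity and t-closeness impose further conditions on the sensitive attribute within each group.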

These evaluations show that big data is a real challenge not only for official organizations but also for companies, universities, and research centers whose data will profoundly influence their future developments, plans, decisions, actions, and ambitions. Even though many tools, techniques, and technologies are available in the literature, it can be concluded that there are still many points to be considered, discussed, improved, developed, and analyzed about big data and its technology. It is hoped that this article helps readers understand big data and its ecosystem better, and supports the development of better solutions not only for today but also for the future.

5. REFERENCES

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3-15.


Bajpeyee, R., Sinha, S. P., & Kumar, V. (2015). Big Data: A brief investigation on NoSQL databases. International Journal of Innovations & Advancement in Computer Science, 4(1), 28-35.

Casado, R., & Younas, M. (2015). Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience, 27(8), 2078-2091.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. Acm Sigmod Record, 39(4), 12-27.

Intel IT Center. (2012). Peer Research: Big Data Analytics. Intel's IT Manager Survey on How Organizations Are Using Big Data.

Chandy, K. M., & Lamport, L. (1985). Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS), 3(1), 63-75.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.

Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. Paper presented at the Collaboration Technologies and Systems (CTS), 2013 International Conference on.

Demirezen, M. U., Küçükayan, Y. G., & Yılmaz, D. B. (2015). Developing big data realtime stream and batch data processing platform. Paper presented at the Ulusal Savunma Uygulamaları Modelleme ve Simülasyon Konferansı, Ankara.

Dumbill, E. (2012). Big Data Now: Current Perspectives: O’Reilly Radar Team, O’Reilly Media, USA.

Feng, T., Zhuang, Z., Pan, Y., & Ramachandra, H. (2015). A memory capacity model for high performing data-filtering applications in Samza framework. Paper presented at the Big Data (Big Data), 2015 IEEE International Conference on.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

Han, U., & Ahn, J. (2014). Dynamic load balancing method for apache flume log processing. Advanced Science and Technology Letters, 79, 83-86.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561-2573.

Kiran, M., Murphy, P., Monga, I., Dugan, J., & Baveja, S. S. (2015). Lambda architecture for cost-effective batch and speed big data processing. Paper presented at the Big Data (Big Data), 2015 IEEE International Conference on.

Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: A distributed messaging system for log processing. Paper presented at the Proceedings of the NetDB.

Kroß, J., Brunnert, A., Prehofer, C., Runkler, T. A., & Krcmar, H. (2015). Stream processing on demand for lambda architectures. Paper presented at the European Workshop on Performance Engineering.

Lei, J., Jiang, T., Wu, K., Du, H., & Zhu, L. (2015). Robust local outlier detection with statistical parameter for big data. Computer Systems Science and Engineering, 30(5), 411-419.

Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable realtime data systems: Manning Publications Co.

McCafferty, D. (2014). Surprising Statistics About Big Data. February 18, 2014. http://www.baselinemag.com/analytics-big-data/slideshows/surprising-statistics-about-big-data.html

Mysore, S. D., & Jain, S. (2013). Big Data Architecture and Patterns, Part 1: Introduction to Big Data Classification and Architecture. IBM Corp.

Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed stream computing platform. Paper presented at the Data Mining Workshops (ICDMW), 2010 IEEE International Conference on.

Almeida, P., & Bernardino, J. (2015). A comprehensive overview of open source big data platforms and frameworks. International Journal of Big Data (IJBD), 2(3), 15-33.


Rajan, S., van Ginkel, W., Sundaresan, N., Bardhan, A., Chen, Y., Fuchs, A., Manadhata, P. (2013). Expanded top ten big data security and privacy challenges. Cloud Security Alliance, available at https://cloudsecurityalliance.org/research/big-data/.

Renner, T., Thamsen, L., & Kao, O. (2015). Network-aware resource management for scalable data analytics frameworks. Paper presented at the Big Data (Big Data), 2015 IEEE International Conference on.

Tatbul, N. (2010). Streaming data integration: Challenges and opportunities. Paper presented at the Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on.

Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J. M., Kulkarni, S., Donham, J. (2014). Storm@twitter. Paper presented at the Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.

Twardowski, B., & Ryzko, D. (2014). Multi-agent architecture for real-time big data processing. Paper presented at the Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on.

Villari, M., Celesti, A., Fazio, M., & Puliafito, A. (2014). Alljoyn lambda: An architecture for the management of smart environments in iot. Paper presented at the Smart Computing Workshops (SMARTCOMP Workshops), 2014 International Conference on.

Wang, E., & Chen, G. (2013). An Overview of Big Data Mining: Methods and Tools. Paper presented at the International Symposium on Signal Processing, Biomedical Engineering and Informatics, China.

Why HPCC Systems is a superior alternative to Hadoop. Retrieved 2017, from https://hpccsystems.com/why-hpcc-systems/hpcc-hadoop-comparison/superior-to-hadoop

Wong, R. C.-W., Fu, A. W.-C., Wang, K., Yu, P. S., & Pei, J. (2011). Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data (TKDD), 5(3), 16.

Yoon, H., Gavrilovska, A., Schwan, K., & Donahue, J. (2012). Interactive use of cloud services: Amazon SQS and S3. Paper presented at the Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on.

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. HotCloud, 10(10-10), 95.

Authors

Duygu SINANC is a research assistant at the Gazi University Graduate School of Natural and Applied Sciences. She received her M.Sc. degree from the Gazi University Department of Computer Engineering and continues her Ph.D. in the same department. Her research interests are big data analytics, data mining, and information security.

Umut DEMIREZEN is the Cyber Security and Big Data Research and Development Group Manager at STM Defense Technologies Engineering and Trade Inc. He received his Ph.D. degree from the Gazi University Department of Electrical and Electronics Engineering. His research interests are data science, big data, and machine learning.

Seref SAGIROGLU is a professor in the Department of Computer Engineering at Gazi University. His research interests are intelligent system identification, recognition, modeling, and control; artificial intelligence; heuristic algorithms; industrial robots; analysis and design of smart antennas; information systems and applications; software engineering; information and computer security; biometrics, electronic signatures, and public-key infrastructure; and malware and spyware. He has published over 50 papers in international journals indexed by SCI, over 50 papers in national journals, and over 100 contributions to national and international conferences, with close to 100 further presentations at national symposia and workshops. He holds three patents, has published 5 books and edited four more, has carried out many national and international projects, has organized many conferences, and continues his academic studies.
